To get predictions from KumoRFM, you first create a graph locally, which is used for sampling operations. For each prediction task, KumoRFM extracts a context object from the graph. The context object is the input to KumoRFM and is passed to the GPU cluster for inference. The workflow is: load your data, create tables from it, connect the tables into a graph, and pass the graph to rfm to initialize the model.

1. Loading Data

KumoRFM works with pandas.DataFrame objects.
You can load data into memory from various sources: local files, cloud data warehouses, REST APIs, etc.
There is no hard limit on data size, but all DataFrames must fit into memory for processing.
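Since everything has to fit into memory, it can be worth measuring a DataFrame's footprint before building tables on it. The following is a minimal sketch using plain pandas; the toy DataFrame and its column names are illustrative assumptions, not part of KumoRFM.

```python
import pandas as pd

# Toy frame standing in for a real table (hypothetical columns).
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
})

# deep=True also counts the bytes held by string (object) columns,
# which the default shallow estimate leaves out.
bytes_used = int(df.memory_usage(deep=True).sum())
print(f"{bytes_used / 1e6:.6f} MB")
```

Summing this number over all DataFrames you plan to load gives a rough lower bound on the memory the process will need.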

Some examples:

import pandas as pd
import kumoai.experimental.rfm as rfm

# -----------------------------
# Load data from a CSV file
# -----------------------------
df = pd.read_csv("data.csv")

# -----------------------------
# Load data from Snowflake
# -----------------------------
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# Define your Snowflake connection
snowflake_engine = create_engine(URL(
    account="your_account",
    user="your_username",
    password="your_password",
    database="YOUR_DATABASE",
    schema="PUBLIC",
    warehouse="YOUR_WAREHOUSE",
    role="YOUR_ROLE"
))

# Query table into Pandas DataFrame
query = "SELECT * FROM CUSTOMERS LIMIT 1000"
snowflake_df = pd.read_sql(query, snowflake_engine)

# -----------------------------
# Load data from BigQuery
# -----------------------------
from google.cloud import bigquery
# Define your BigQuery client
client = bigquery.Client(project="your_project_id")

# Query table into Pandas DataFrame
bq_query = "SELECT * FROM `your_project.your_dataset.transactions` LIMIT 1000"
bq_df = client.query(bq_query).to_dataframe()


# -----------------------------
# Load data from Amazon S3
# -----------------------------
import boto3
# Define your boto3 client
s3 = boto3.client(
    "s3",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    region_name="YOUR_REGION"
)

# Read a CSV file directly from S3 into Pandas
bucket_name = "your-bucket-name"
key = "path/to/your/data.csv"
obj = s3.get_object(Bucket=bucket_name, Key=key)
s3_df = pd.read_csv(obj["Body"])

2. Creating LocalTable

Once the data is loaded, create LocalTable objects on top of the DataFrames. A LocalTable is a lightweight wrapper around a DataFrame that adds the metadata KumoRFM needs. It defines four critical properties:
  1. stype (Semantic Type):
    • A stype will determine how the column will be encoded downstream.
    • Correctly setting each column’s stype is critical for model performance. For instance, if you want to perform missing value imputation, the semantic type will determine whether it is treated as a regression task (stype="numerical") or a classification task (stype="categorical").
    Type | Explanation | Example
    ---- | ----------- | -------
    "numerical" | Numerical values (e.g., price, age) | 25, 3.14, -10
    "categorical" | Discrete categories with limited cardinality | Color: "red", "blue", "green" (one cell may only have one category)
    "multicategorical" | Multiple categories in a single cell | "ActionDramaComedy", "ActionThriller"
    "ID" | An identifier, e.g., primary keys or foreign keys | user_id: 123, product_id: PRD-8729453
    "text" | Natural language text | Descriptions
    "timestamp" | Specific point in time | "2025-07-11", "2023-02-12 09:47:58"
    "sequence" | Custom embeddings or sequential data | [0.25, -0.75, 0.50, ...]
  2. primary_key:
    • The primary key is a unique identifier of each row in a table.
    • If there are duplicated primary keys, the system will only keep the first one.
    • A primary key can be used to link tables through a primary key to foreign key relationship.
      • In the users table: user_id is the primary key.
      • In the orders table: order_id is the primary key, and user_id is a foreign key that points back to the users table.
      • These tables can be linked via user_id (see example code below on how to link).
      • A primary key does not need to link to other tables. For example, in the orders table, the primary key (order_id) is not used for linking, but it still serves its main purpose—to uniquely identify each row in the table.
    • primary_key can only be assigned to columns holding integers, floating point values or strings.
    • Each table can have at most one primary_key column. It serves two purposes: (1) when creating a graph, it is the reference point that other tables link to; (2) when making predictions, it identifies the entity to generate predictions for. For instance, if you want to predict user outcomes, you need a table with user_id as the primary key.
  3. time_column:
    • Indicates the timestamp column that records when the event occurred.
    • Time column values must be parseable by pandas.to_datetime.
    • Each table can have at most one time_column.
  4. end_time_column:
    • Indicates the timestamp column that records when the event should be dropped from consideration (e.g., when a user becomes inactive).
    • End time column values must be parseable by pandas.to_datetime.
    • Each table can have at most one end_time_column.
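The primary key and time column requirements above can be checked in pandas before a table is created. This is a rough sketch on a toy orders DataFrame; the data and column names are assumptions made for illustration, and the checks are ordinary pandas calls rather than anything KumoRFM runs for you.

```python
import pandas as pd

# Toy orders table (hypothetical data): order_id 11 appears twice,
# and one order_date cannot be parsed.
orders_df = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "user_id": [1, 1, 2, 3],
    "order_date": ["2025-07-11", "2025-07-12", "2025-07-12", "not a date"],
})

# Primary key: rows whose key repeats an earlier row.
duplicate_rows = orders_df[orders_df.duplicated(subset="order_id", keep="first")]
print(f"{len(duplicate_rows)} duplicated primary key value(s)")

# Keep the first occurrence explicitly, so the choice is visible in your own code.
orders_df = orders_df.drop_duplicates(subset="order_id", keep="first")

# Time column: errors="coerce" turns unparseable values into NaT instead of raising.
# Values that were missing to begin with also show up as NaT here.
parsed = pd.to_datetime(orders_df["order_date"], errors="coerce")
unparseable = orders_df.loc[parsed.isna(), "order_date"]
print(f"{len(unparseable)} unparseable timestamp(s): {unparseable.tolist()}")
```

Dropping duplicates yourself mirrors the keep-the-first behavior described above while making it obvious which rows were discarded.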
# Create a table explicitly:
table = rfm.LocalTable(
    df=df,
    name="my_table",
    primary_key="id_column",
    time_column="time_column",
    end_time_column=None
)

# ... or create a table by inferring its metadata ...
table = rfm.LocalTable(df, name="my_table").infer_metadata()
KumoRFM infers most metadata correctly, but you should still inspect the inferred metadata and adjust it where needed to ensure correctness downstream:
# Verify metadata:
table.print_metadata()

# Change the semantic type of a column ("column_name" is a placeholder):
table["column_name"].stype = "text"

# Set primary key:
table.primary_key = "id_column"

# Set time column:
table.time_column = "time_column"

# Set end time column:
table.end_time_column = "end_time_column"
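Before overriding an inferred stype, it can help to look at each column's dtype and cardinality. The sketch below is a plain pandas summary on a toy DataFrame with hypothetical columns; it is a manual sanity check under those assumptions, not KumoRFM's inference logic.

```python
import pandas as pd

# Toy table (hypothetical data).
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [25, 31, 47, 52],
    "color": ["red", "blue", "red", "green"],
})

# One row per column: storage dtype, number of distinct values,
# and the share of rows that carry a distinct value.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_unique": df.nunique(),
    "unique_ratio": df.nunique() / len(df),
})
print(summary)
```

As a rule of thumb, a column whose unique_ratio is 1.0 is a candidate for "ID", and a string column with few distinct values is a candidate for "categorical". Note that age also has a ratio of 1.0 in this toy data, so the numbers only suggest candidates and the final choice still depends on what the column means.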

3. Connecting Tables to Form a Graph

After creating your tables, the next step is to link them into a LocalGraph.
A good guiding principle is to start simple: begin with just the minimal set of tables needed to support the prediction task you care about. Focus on the core entities and relationships essential to prediction.
For example, suppose your goal is to predict a user’s future orders (how much they’d purchase). At a minimum, your graph only needs two tables:
  1. users: representing each user
  2. orders: representing the orders placed by those users
This minimal setup forms a usable graph for prediction. From there, you can gradually add complexity. For instance, you might later introduce an items table so that KumoRFM can take item information into account.

Example: Building a Users–Orders Graph

# Step 1: select the tables
graph = rfm.LocalGraph(tables=[users, orders])

# Step 2: link the tables
# The orders table (src_table) has a user_id column (fkey), which serves as a
# foreign key to the primary key of the users table (dst_table).
graph.link(src_table="orders", fkey="user_id", dst_table="users")
You can verify that graph connectivity is set up by visualizing the graph …
# Requires graphviz to be installed
graph.visualize()
… or by printing all necessary information:
graph.print_metadata()
graph.print_links()
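Printing the links confirms that a link exists, but not that the foreign key values actually match primary keys in the destination table. A minimal pandas sketch of that check follows; the toy users and orders data are assumptions for illustration.

```python
import pandas as pd

# Toy tables (hypothetical data): user_id 9 has no matching row in users.
users_df = pd.DataFrame({"user_id": [1, 2, 3]})
orders_df = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "user_id": [1, 1, 2, 9],
})

# Foreign key values in orders that do not appear as a primary key in users.
is_orphan = ~orders_df["user_id"].isin(users_df["user_id"])
orphans = orders_df[is_orphan]
print(f"{len(orphans)} of {len(orders_df)} orders reference an unknown user_id")
print(orphans)
```

A large share of orphaned rows usually points to a mismatched key column or to differing key formats between the two tables.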
You can update and modify links as needed:
# Remove link:
graph.unlink(src_table="orders", fkey="user_id", dst_table="users")

# Re-add link:
graph.link(src_table="orders", fkey="user_id", dst_table="users")
You can extend this to more complex schemas, for example by adding an items table.
# Add the items table to the graph
graph.add_table(items)
# Link items to orders
graph.link(src_table="orders", fkey="item_id", dst_table="items")
You can also combine the two steps and let rfm infer the connections.
graph = rfm.LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df,
    'items': items_df,
}, infer_metadata=True)
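When links are inferred, it is useful to know in advance which ones you expect, so you can compare them with the output of graph.print_links(). The sketch below lists candidate links by matching each table's primary key name against the columns of the other tables. This name-matching rule, the toy data, and the primary_keys mapping are all assumptions for illustration; it is not how KumoRFM infers links.

```python
import pandas as pd

# Toy tables (hypothetical data).
tables = {
    "users": pd.DataFrame({"user_id": [1, 2, 3]}),
    "items": pd.DataFrame({"item_id": [7, 8]}),
    "orders": pd.DataFrame({
        "order_id": [100, 101, 102],
        "user_id": [1, 1, 2],
        "item_id": [7, 8, 7],
    }),
}

# Hypothetical mapping from table name to the primary key you expect.
primary_keys = {"users": "user_id", "items": "item_id", "orders": "order_id"}

# A link is expected wherever another table has a column named like a primary key.
expected_links = [
    (src_name, pkey, dst_name)
    for dst_name, pkey in primary_keys.items()
    for src_name, src_df in tables.items()
    if src_name != dst_name and pkey in src_df.columns
]
for src_table, fkey, dst_table in expected_links:
    print(f"{src_table}.{fkey} -> {dst_table}")
```

If the inferred graph is missing one of these links, or contains one you did not expect, graph.link and graph.unlink can be used to correct it.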

4. Initializing the Model

You are now ready to plug your graph into KumoRFM to make predictions!
This is a one-time setup—once it’s in place, you can generate a variety of predictions from it and power many business use cases.
model = rfm.KumoRFM(graph)
The next step is to start making predictions!