Anatomy of a Training Set
A training set consists of a single label paired with one or more features. Below is an example of registering a training set named “fraud_training” with the variant “quickstart.” It comprises the “fraudulent/quickstart” label and a single feature “avg_transactions/quickstart.”
```python
@ff.entity
class User:
    avg_transactions = ff.Feature(
        average_user_transaction[["CustomerID", "TransactionAmount"]],
        variant="quickstart",
        type=ff.Float32,
        inference_store=redis,
        timestamp_column="timestamp",
    )
    fraudulent = ff.Label(
        transactions[["CustomerID", "IsFraud"]],
        variant="quickstart",
        type=ff.Bool,
        timestamp_column="timestamp",
    )

ff.register_training_set(
    "fraud_training",
    "quickstart",
    label=("fraudulent", "quickstart"),
    features=[("avg_transactions", "quickstart")],
)

client.apply()

# The training set's feature values will be point-in-time correct.
ts = client.training_set("fraud_training", "quickstart").dataframe()
```
How a Label Gets Joined with Features into a Training Set
Training sets are constructed by joining features to labels on their entity key. For each label row, and for each feature, the feature row sharing the label's entity key is selected. To make the training set point-in-time correct, the feature value is taken from the row whose timestamp is closest to, but still earlier than, the label's timestamp. This guarantees that each feature value reflects what was known at the time the label was recorded, avoiding data leakage from the future.
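The join described above can be sketched in plain Python. This is an illustrative sketch only, not Featureform's actual implementation; the row layouts and the `point_in_time_join` helper are assumptions made for the example.

```python
# Hypothetical sketch of a point-in-time join (not Featureform's real code).
# Feature rows and label rows are tuples of (entity_key, timestamp, value).

def point_in_time_join(labels, features):
    """For each label row, pick the feature value whose timestamp is the
    latest one strictly before the label's timestamp, per entity key."""
    rows = []
    for entity, label_ts, label_value in labels:
        candidates = [
            (ts, value)
            for feat_entity, ts, value in features
            if feat_entity == entity and ts < label_ts
        ]
        # max() picks the candidate with the greatest timestamp.
        feature_value = max(candidates)[1] if candidates else None
        rows.append((entity, feature_value, label_value))
    return rows

features = [
    ("C1", 1, 10.0),  # older value, superseded before the label time
    ("C1", 3, 20.0),  # latest value before the label timestamp
    ("C1", 9, 99.0),  # written after the label; must be excluded
]
labels = [("C1", 5, True)]

print(point_in_time_join(labels, features))  # [('C1', 20.0, True)]
```

Note that the row stamped after the label (timestamp 9) is excluded even though it is the most recent overall; using it would leak information from the future into training.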
Working with Training Sets
Every training set can be retrieved directly as a DataFrame. This is convenient for small datasets, or when working with DataFrame-based training systems such as Databricks.
```python
ts = client.training_set("fraud_training", "quickstart").dataframe()
```
Using Mini Batches
For training sets that are too large to fit in memory, or when distributed DataFrames are not an option, you can iterate through them in mini-batches.
```python
import featureform as ff

client = ff.Client(host)

# Run through a shuffled dataset 5 times in batches of 64.
dataset = client.training_set(name, variant).repeat(5).shuffle(1000).batch(64)
for feature_batch, label_batch in dataset:
    # Train your model on each mini-batch here.
    ...
```
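To make the `repeat`/`shuffle`/`batch` pipeline concrete, here is a minimal sketch in plain Python generators. It is an assumption-laden illustration, not Featureform's implementation: the shuffle uses a simple fixed-size buffer so the full dataset never has to be held in memory at once.

```python
import itertools
import random

def repeat(rows, n):
    # Yield the dataset n times in sequence.
    for _ in range(n):
        yield from rows

def shuffle(rows, buffer_size, seed=0):
    # Approximate shuffle: fill a fixed-size buffer, shuffle it, emit it.
    # Larger buffers shuffle better but use more memory.
    rng = random.Random(seed)
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= buffer_size:
            rng.shuffle(buffer)
            yield from buffer
            buffer.clear()
    rng.shuffle(buffer)
    yield from buffer

def batch(rows, size):
    # Group the stream into lists of at most `size` rows.
    it = iter(rows)
    while chunk := list(itertools.islice(it, size)):
        yield chunk

data = list(range(10))
batches = list(batch(shuffle(repeat(data, 2), buffer_size=5), size=4))
print([len(b) for b in batches])  # [4, 4, 4, 4, 4]
```

Because everything is a generator, only one buffer and one batch are materialized at a time, which is what makes mini-batch iteration viable for datasets larger than memory.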