Models require training, a process that typically involves feeding in a set of features with known labels. During training, the model makes inferences based on these features, and the labels are used to adjust the model’s weights.
A training set consists of a single label paired with one or more features. Below is an example of registering a training set named “fraud_training” with the variant “quickstart.” It comprises the “fraudulent/quickstart” label and a single feature “avg_transactions/quickstart.”
@ff.entity
class User:
    avg_transactions = ff.Feature(
        average_user_transaction[["CustomerID", "TransactionAmount"]],
        variant="quickstart",
        type=ff.Float32,
        inference_store=redis,
        timestamp_column="timestamp",
    )
    fraudulent = ff.Label(
        transactions[["CustomerID", "IsFraud"]],
        variant="quickstart",
        type=ff.Bool,
        timestamp_column="timestamp",
    )

ff.register_training_set(
    "fraud_training",
    "quickstart",
    label=("fraudulent", "quickstart"),
    features=[("avg_transactions", "quickstart")],
)

client.apply()

# The training set's feature values will be point-in-time correct.
ts = client.training_set("fraud_training", "quickstart").dataframe()
How a Label Gets Joined with Features into a Training Set
Training sets are constructed by joining features and labels on their entity key. For each label row, and for each feature, the row with the matching entity key is selected. To make the training set point-in-time correct, the feature value is taken from the matching row whose timestamp is closest to, but earlier than, the label's timestamp. This ensures that each feature value reflects what was known at the time the label was recorded.
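The same rule can be sketched as an as-of join in pandas. This is only an illustration of the join described above, using made-up sample data; it is not Featureform's internal implementation.

import pandas as pd

# Illustrative feature and label tables keyed by CustomerID; the values
# below are made up for this sketch.
features = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C2"],
    "avg_transactions": [10.0, 12.5, 7.0],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-10", "2023-01-05"]),
})
labels = pd.DataFrame({
    "CustomerID": ["C1", "C2"],
    "IsFraud": [False, True],
    "timestamp": pd.to_datetime(["2023-01-12", "2023-01-06"]),
})

# For each label row, pick the feature row with the same entity key whose
# timestamp is closest to, but strictly earlier than, the label's timestamp.
training_set = pd.merge_asof(
    labels.sort_values("timestamp"),
    features.sort_values("timestamp"),
    on="timestamp",
    by="CustomerID",
    direction="backward",
    allow_exact_matches=False,
)
print(training_set[["CustomerID", "avg_transactions", "IsFraud"]])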
Every training set can be retrieved directly as a DataFrame. This approach works well for small datasets or when training on DataFrame-based systems such as Databricks, and it keeps handling the training set simple and flexible.
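For example, a small training set can be pulled down and fed straight into a DataFrame-based workflow. The scikit-learn call and the column names below are assumptions for illustration; inspect df.columns to see how your features and label are actually named.

from sklearn.linear_model import LogisticRegression

# Fetch the full training set as a DataFrame (reasonable for small data),
# reusing the client from the registration example above.
df = client.training_set("fraud_training", "quickstart").dataframe()

# Hypothetical column names -- adjust to match your training set.
X = df[["avg_transactions"]]
y = df["label"]

model = LogisticRegression().fit(X, y)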
For larger training sets that do not fit into memory, or when distributed DataFrames are not an option, you can iterate through them in mini-batches.
import featureform as ff

client = ff.Client(host)

# Run through a shuffled dataset 5 times in batches of 64.
dataset = client.training_set(name, variant).repeat(5).shuffle(1000).batch(64)
for feature_batch, label_batch in dataset:
    # Train on each mini-batch here.
    ...