Training Sets
Anatomy of a Training Set
A training set consists of a single label paired with one or more features. For example, a training set named “fraud_training” with the variant “quickstart” comprises the “fraudulent/quickstart” label and a single feature, “avg_transactions/quickstart.”
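The anatomy described above can be sketched as plain data, independent of any client library. This is a minimal illustration only: `TrainingSetSpec` and its field names are hypothetical and are not the registration API itself. The resource names are the ones used in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A (name, variant) pair identifies a label or a feature.
NameVariant = Tuple[str, str]


@dataclass(frozen=True)
class TrainingSetSpec:
    """Hypothetical container describing a training set (not a library class)."""
    name: str
    variant: str
    label: NameVariant           # exactly one label
    features: List[NameVariant]  # one or more features

    def __post_init__(self):
        # Enforce the "one or more features" rule from the text.
        if not self.features:
            raise ValueError("a training set needs at least one feature")


fraud_training = TrainingSetSpec(
    name="fraud_training",
    variant="quickstart",
    label=("fraudulent", "quickstart"),
    features=[("avg_transactions", "quickstart")],
)
```

Modelling the label as a single field and the features as a list mirrors the one-label, many-features shape of a training set.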
How a Label Gets Joined with Features into a Training Set
Training sets are constructed by joining features to labels on their entity key. The process loops through the labels and, for each feature, selects the rows with the same entity key as the label. To keep the training set point-in-time correct, the feature value is taken from the row whose timestamp is closest to, but less than, the label timestamp. Each label is therefore paired with the feature value as it was known at the label’s time, and values recorded later cannot leak into training.
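The join can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual implementation: the function name `point_in_time_join`, the tuple layouts, the integer timestamps, and the choice to emit `None` when no earlier feature row exists are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def point_in_time_join(labels, features):
    """Pair each label with the latest feature values strictly before its timestamp.

    labels:   list of (entity, timestamp, label_value)
    features: dict of feature_name -> list of (entity, timestamp, value)
    Returns a list of (feature_values, label_value) rows.
    """
    # Index every feature by entity key, with each history sorted by timestamp.
    index = {}
    for name, rows in features.items():
        by_entity = defaultdict(list)
        for entity, ts, value in rows:
            by_entity[entity].append((ts, value))
        for history in by_entity.values():
            history.sort(key=lambda row: row[0])
        index[name] = by_entity

    training_rows = []
    for entity, label_ts, label_value in labels:
        feature_values = []
        for name in features:
            history = index[name].get(entity, [])
            timestamps = [ts for ts, _ in history]
            # bisect_left counts the timestamps strictly less than label_ts,
            # so position - 1 is the closest earlier row.
            position = bisect_left(timestamps, label_ts)
            # Assumption: a label with no earlier feature row gets None.
            feature_values.append(history[position - 1][1] if position > 0 else None)
        training_rows.append((feature_values, label_value))
    return training_rows


labels = [("C1", 10, True), ("C2", 5, False)]
features = {
    "avg_transactions": [("C1", 3, 20.0), ("C1", 8, 35.0), ("C1", 10, 99.0), ("C2", 7, 12.0)],
}
rows = point_in_time_join(labels, features)
# rows == [([35.0], True), ([None], False)]
```

For customer C1, the feature row at timestamp 10 is skipped because it is not strictly earlier than the label, so the value from timestamp 8 is used. C2 has no feature row before its label, so this sketch fills in `None`.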
Working with Training Sets
Using Dataframes
Every training set can be retrieved directly as a Dataframe. This is the simplest option for small datasets or when working with Dataframe-based training systems like Databricks.
Using Mini-Batches
For larger training sets that do not fit into memory, or when distributed Dataframes are not viable, you can iterate through the training set in mini-batches.
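The idea behind mini-batch iteration can be sketched with a small generator that pulls a fixed number of rows at a time from any iterable, so the full training set never has to be held in memory. The helper name `mini_batches` and the row layout are hypothetical and only illustrate the pattern. A real client would typically stream rows from the provider.

```python
from itertools import islice


def mini_batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable of rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice consumes only the next batch_size rows from the stream.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stream_rows():
    """Stand-in for a training set too large to load at once (hypothetical data)."""
    for i in range(5):
        yield ([float(i)], i % 2 == 0)  # (feature_values, label)


batch_sizes = [len(batch) for batch in mini_batches(stream_rows(), batch_size=2)]
# batch_sizes == [2, 2, 1]
```

The final batch may be smaller than `batch_size`. Training loops that need fixed-size batches can drop or pad it.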