To illustrate point-in-time correctness, consider the creation of a training set using the following fraud label data:

| Transaction | User | Fraudulent Charge | Timestamp |
|---|---|---|---|
| 1 | A | False | Jan 3, 2022 |
| 2 | A | True | Jan 5, 2022 |

Suppose we have a feature denoting the user’s average purchase price:

| User | Avg Purchase Price | Timestamp |
|---|---|---|
| A | 5 | Jan 2, 2022 |
| A | 10 | Jan 4, 2022 |

The resulting training set should appear as follows:

| Avg Purchase Price | Fraudulent Charge |
|---|---|
| 5 | False |
| 10 | True |

The first row’s feature value is 5, the value in effect on Jan 3, 2022, the time of the first label. The second row’s feature value is 10, the value in effect on Jan 5, 2022, the time of the second label. Each label is paired with the most recent feature value recorded at or before the label’s timestamp, so no row sees a feature value from its own future. Using historical feature values in this way is what makes the training set point-in-time correct.
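The join described above can be sketched in plain Python. This is only an illustration of the general idea, not Featureform's implementation: the function name `point_in_time_join` and the tuple layouts are hypothetical, and treating a feature value stamped exactly at the label's time as visible is an assumption.

```python
from bisect import bisect_right
from datetime import date

# Label rows from the example: (transaction, user, fraudulent, timestamp)
labels = [
    (1, "A", False, date(2022, 1, 3)),
    (2, "A", True, date(2022, 1, 5)),
]

# Feature rows from the example: (user, avg_purchase_price, timestamp)
features = [
    ("A", 5, date(2022, 1, 2)),
    ("A", 10, date(2022, 1, 4)),
]


def point_in_time_join(labels, features):
    """Pair each label with the latest feature value at or before its timestamp.

    Hypothetical helper for illustration only.
    """
    # Build a per-user feature history, ordered by timestamp.
    history = {}
    for user, value, ts in sorted(features, key=lambda row: row[2]):
        history.setdefault(user, []).append((ts, value))

    rows = []
    for _, user, label, label_ts in labels:
        user_history = history.get(user, [])
        timestamps = [ts for ts, _ in user_history]
        # Index of the first feature timestamp strictly after the label time.
        idx = bisect_right(timestamps, label_ts)
        # No earlier feature value exists: leave the feature empty (None).
        value = user_history[idx - 1][1] if idx > 0 else None
        rows.append((value, label))
    return rows


training_set = point_in_time_join(labels, features)
```

A label dated before any feature value exists gets `None` rather than a later value, which is the leakage that point-in-time correctness is meant to prevent.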

In Featureform, you can include a timestamp column when defining features and labels. When you then create a training set, Featureform automatically aligns each label with the feature values as of that label’s timestamp.

Here’s an example of defining features and labels with timestamps in Featureform:

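import featureform as ff

# Assumes `redis`, `average_user_transaction`, and `transactions` were
# registered earlier, and that `client` is an existing Featureform client.
@ff.entity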
class User:
    avg_transactions = ff.Feature(
        average_user_transaction[["CustomerID", "TransactionAmount"]],
        variant="quickstart",
        type=ff.Float32,
        inference_store=redis,
        timestamp_column="timestamp",
    )
    fraudulent = ff.Label(
        transactions[["CustomerID", "IsFraud"]],
        variant="quickstart",
        type=ff.Bool,
        timestamp_column="timestamp",
    )

ff.register_training_set(
    "fraud_training",
    "quickstart",
    label=("fraudulent", "quickstart"),
    features=[("avg_transactions", "quickstart")],
)

client.apply()

# The training set's feature values will be point-in-time correct.
ts = client.training_set("fraud_training", "quickstart").dataframe()

When you work with the streaming data API in Featureform Enterprise, timestamps are tracked as well, and you can backfill data to build point-in-time correct training sets. Historical feature values therefore stay intact across both batch and streaming workflows.