Documentation Index
Fetch the complete documentation index at: https://docs.featureform.com/llms.txt
Use this file to discover all available pages before exploring further.
Registering Primary Data Sources
All features and labels originate from a set of initial data sources. These sources can be streams, files, or tables. A feature can simply be a single field of a data source. More commonly, it’s created via a set of transformations from one or multiple sources.
Files
In the case of Offline Stores that work with a file system, the primary data may be a file or a set of files. Even when the Offline Store is a database, the primary data may be files that should be copied into a table in the Offline Store.
s3.register_file(
name = "titanic",
variant = "kaggle",
description = "The Titanic Dataset from Kaggle",
path = "bucket/kaggle/titanic.csv",
)
Tables
In the case of an Offline Store, the primary data may be a table that already exists. In this case, we can register the table with Featureform.
titanic = postgres.register_table(
name = "titanic",
variant = "kaggle",
description = "The Titanic Dataset from Kaggle",
table = "Titanic", # This is the table's name in Postgres
)
Streams
Streaming support in Enterprise Featureform. Book a call with us today!
There are two supported transformation types: SQL and Dataframes. Not all providers support all transformation types. For example, a dataframe transformation cannot currently be run on Snowflake. Each transformation definition also includes a set of metadata like its name, variant, and description.
SQL
SQL transformations are defined by a decorated python function that returns a templated SQL string. The decorator specifies the necessary metadata, and the SQL string has its table names replaced with templated source names and versions.
@postgres.register_transformation(variant="quickstart")
def fare_per_family_member():
""" The average fare paid per family member in a family
"""
return "SELECT PassengerId, Fare / Parch FROM {{titanic.kaggle}}"
Dataframes
Dataframe are defined by a decorated python function that takes a set of input dataframes and returns an output dataframe. The decorator specifies the necessary metadata and inputs, then the function outlines the transformation logic.
@spark.register_transformation(
variant="quickstart",
inputs=[("titanic", "kaggle")],
)
def fare_per_family_member(titanic):
""" The average fare paid per family member in a family
"""
titanic["Fare/Parch"] = titanic["Fare"] / titanic["Parch"]
return titanic[["PassengerId", "Fare/Parch"]]
Transformations can be chained together, where one transformation is the input into another transformation. Transformations will wait for any dependent transformations to complete successfully before running.
@postgres.register_transformation(variant="quickstart")
def survival_first_class():
""" The age and survival status of first class passengers
"""
return "SELECT age, survival FROM {{titanic.kaggle}} WHERE pclass='1st'"
@postgres.register_transformation(variant="quickstart")
def average_age_survival_first_class():
""" The average survivability of first class passengers by age
"""
return "SELECT age, AVG(survival) FROM {{survival_first_class.quickstart}} GROUP BY age"
Checking status
Since some registrations may take longer than other depending on size of dataset and complexity of a query, the Featureform API has the ability to check the status of any resource programatically.
Fetching the resource
import featureform as ff
client = ff.ResourceClient(host)
source = client.get_source("name", "variant")
training_set = client.get_training_set("name", "variant")
label = client.get_label("name", "variant")
feature = client.get_feature("name", "variant")
Checking the resource status
import featureform as ff
training_set = client.get_training_set("name", "variant")
print(training_set.status == "READY")
print(training_set.get_status() == ff.ResourceStatus.READY)
print(training_set.is_ready())