Featureform supports Spark as an Offline Store.

This means that Featureform can handle all flavors of Spark using S3, GCS, Azure Blob Store or HDFS as a backing store.

Common use cases of Featureform with Spark include:

- Spark with EMR
- Spark with Databricks

Understanding The Different Flavors of Spark

Spark is a powerful, open-source, general-purpose computing framework developed for large-scale data processing.

Databricks is a managed data and analytics platform developed on top of Spark.

Spark can be self-hosted on Kubernetes and other non-cloud environments, run through managed offerings from popular cloud providers such as AWS, Azure, and GCP, or used via Databricks' managed platform.

| Category | Featureform Supported |
| --- | --- |
| Types of Spark | Spark, Databricks |
| Specific Cloud Offerings | EMR |
| File stores | S3, GCS, Azure Blobs, HDFS |
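
A Spark offline store is registered by pairing an executor (an EMR or Databricks cluster) with one of the supported file stores. The sketch below assumes a Databricks cluster and an S3 bucket; the angle-bracketed values are placeholders, and exact credential parameter names can differ between Featureform versions, so check the provider registration docs for your release.

```python
import featureform as ff

# Executor: a Databricks cluster (an EMR cluster via ff.EMRCredentials is analogous).
databricks = ff.DatabricksCredentials(
    host="<DATABRICKS_HOST>",
    token="<DATABRICKS_TOKEN>",
    cluster_id="<DATABRICKS_CLUSTER_ID>",
)

# Backing file store: an S3 bucket (GCS, Azure Blob Storage, and HDFS are registered similarly).
# Credential parameter names vary by Featureform version.
s3 = ff.register_s3(
    name="spark-s3",
    credentials=ff.AWSCredentials(
        access_key="<AWS_ACCESS_KEY>",
        secret_key="<AWS_SECRET_KEY>",
    ),
    bucket_name="<S3_BUCKET_NAME>",
    bucket_region="<S3_BUCKET_REGION>",
)

# Combine the executor and file store into a single Spark offline store.
spark = ff.register_spark(
    name="spark-databricks-s3",
    executor=databricks,
    filestore=s3,
)
```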

Transforming Data

Transformation Sources

Using Spark and a file store (GCS, Azure Blob Storage, S3, HDFS) as an Offline Store, you can define new transformations via SQL and Spark DataFrames. Starting from either these transformations or pre-existing files in your file store, you can chain transformations and register columns in the resulting tables as new features and labels.

Training Sets and Inference Store Materialization

Any column in a preexisting table or user-created transformation can be registered as a feature or label. These features and labels can be used, as with any other Offline Store, for creating training sets and inference serving.
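
As a rough sketch of that flow, the snippet below registers a column from the average_user_transaction transformation (defined later on this page) as a feature, an assumed isfraud column from the transactions source as a label, and combines the two into a training set. The Redis inference store, the entity and resource names, and the label column are illustrative assumptions, and the registration API may differ slightly between Featureform versions.

```python
import featureform as ff

user = ff.register_entity("user")

# Register a column of the transformation output as a feature,
# materialized into an assumed, previously registered Redis inference store.
average_user_transaction.register_resources(
    entity=user,
    entity_column="customerid",
    inference_store=redis,
    features=[
        {"name": "avg_transactions", "variant": "default",
         "column": "average_transaction_amt", "type": "float32"},
    ],
)

# Register a label column from the original source table (isfraud is an assumed column).
transactions.register_resources(
    entity=user,
    entity_column="customerid",
    labels=[
        {"name": "fraudulent", "variant": "default",
         "column": "isfraud", "type": "bool"},
    ],
)

# Combine the feature and label into a training set.
ff.register_training_set(
    "fraud_training", "default",
    label=("fraudulent", "default"),
    features=[("avg_transactions", "default")],
)
```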

Dataframe Transformations

Using Spark with Featureform, a user can define transformations in SQL, just as with other offline providers:

```python
transactions = spark.register_file(
    name="transactions",
    variant="kaggle",
    file_path="<insert_file_path_here>",
)

@spark.sql_transformation()
def max_transaction_amount():
    """the maximum transaction amount for a user"""
    return "select customerid as user_id, max(transactionamount) " \
           "as max_transaction_amt from {{transactions.kaggle}} group by user_id"
```

In addition, registering a provider via Spark allows you to perform DataFrame transformations using your source tables as inputs.

   inputs=[("transactions", "kaggle")], variant="default")
def average_user_transaction(df):
   from pyspark.sql.functions import avg
   return df
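
Transformations can also be chained, with the output of one transformation serving as the input to another. The sketch below builds on the average_user_transaction transformation above; the high_value_users name and the 1000 threshold are purely illustrative.

```python
@spark.df_transformation(
    inputs=[("average_user_transaction", "default")], variant="default")
def high_value_users(df):
    """users whose average transaction amount exceeds an illustrative threshold"""
    return df.filter(df.average_transaction_amt > 1000)
```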