Spark
Featureform supports Spark on AWS as an Offline Store.

Implementation

The AWS Spark Offline Store uses AWS Elastic MapReduce (EMR) as its compute layer and S3 as its storage layer. The transformations, training sets, and feature definitions a user registers via the Featureform client are stored as Parquet tables in S3.
Featureform leverages EMR to compute user-defined transformations and training sets with Spark. The user can author new tables and iterate through training sets sourced directly from S3 via the Featureform CLI.
Features registered on the Spark Offline Store can be materialized to an Inference Store (e.g. Redis) for real-time feature serving.
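
As a rough sketch of that last step, an online store such as Redis can be registered alongside the Spark provider so that materialized features have somewhere to be served from. The name, host, and port below are placeholders rather than values from this guide:
import featureform as ff

# Placeholder Redis instance to act as the Inference Store for features
# materialized from the Spark Offline Store.
redis = ff.register_redis(
    name="redis-inference-store",
    description="Online store for features materialized from Spark",
    host="my-redis-host",  # placeholder hostname
    port=6379,
)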

Requirements

Using Spark on AWS requires a running EMR cluster that supports Spark, an S3 bucket for storage, and AWS credentials (an access key ID and secret access key) with access to both.

Transformation Sources

Using Spark as an Offline Store, you can define new transformations via SQL and DataFrames. Using either these transformations or preexisting tables in S3, you can chain transformations and register columns in the resulting tables as new features and labels.
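
As a sketch of chaining, one transformation can reference another in its SQL template. The source, variant, and column names below mirror the hypothetical examples later on this page and are not part of a real setup:
# Illustrative only: builds on the "transactions" source registered as shown
# in the Dataframe Transformations section below.
@spark.sql_transformation(variant="default")
def user_transaction_totals():
    """Total transaction amount per user."""
    return "SELECT CustomerID AS user_id, SUM(TransactionAmount) AS total_amt " \
           "FROM {{transactions.kaggle}} GROUP BY user_id"

# A second transformation chained on top of the first.
@spark.sql_transformation(variant="default")
def high_value_users():
    """Users whose total spend exceeds 1000."""
    return "SELECT user_id, total_amt FROM {{user_transaction_totals.default}} " \
           "WHERE total_amt > 1000"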

Training Sets and Inference Store Materialization

Any column in a preexisting table or user-created transformation can be registered as a feature or label. These features and labels can be used, as with any other Offline Store, for creating training sets and inference serving.
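
As a minimal sketch of that registration flow, using the Featureform client API with the hypothetical entity, resource, and column names from this page's examples (the "IsFraud" label column is assumed purely for illustration):
user = ff.register_entity("user")

# Register a column from the average_user_transaction transformation
# (defined below) as a feature, materialized to the Redis inference store.
average_user_transaction.register_resources(
    entity=user,
    entity_column="CustomerID",
    inference_store=redis,
    features=[
        {"name": "avg_transactions", "variant": "default",
         "column": "average_user_transaction", "type": "float32"},
    ],
)

# Register a label column from the base transactions table
# ("IsFraud" is an assumed column name in the source data).
transactions.register_resources(
    entity=user,
    entity_column="CustomerID",
    labels=[
        {"name": "fraudulent", "variant": "default",
         "column": "IsFraud", "type": "bool"},
    ],
)

# Combine the feature and label into a training set.
ff.register_training_set(
    "fraud_training", "default",
    label=("fraudulent", "default"),
    features=[("avg_transactions", "default")],
)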

Configuration

To configure a Spark provider on AWS, you need an IAM role with access to your account's EMR cluster and S3 bucket.
Your AWS access key ID and AWS secret access key are used as credentials when registering your Spark Offline Store.
Your EMR cluster must be running and support Spark.
Before being deployed, the EMR cluster must run a bootstrap action that installs the Python packages needed to run Featureform's Spark script (a provisioning sketch follows the registration example below). The following link contains the script that must be added as a bootstrap action for your cluster to be compatible with Featureform:
spark_quickstart.py
import featureform as ff
spark = ff.register_spark(
    name="spark_offline_store",
    description="A Spark provider that can create transformations and training sets",
    team="featureform data team",
    emr_cluster_id="j-ExampleCluster",
    emr_cluster_region="us-east-1",
    bucket_path="my-spark-bucket",  # excluding the "s3://" prefix
    bucket_region="us-east-2",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<aws-secret-access-key>",
)
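
Cluster provisioning itself happens outside of Featureform. As a rough sketch of the EMR requirements above (not part of the Featureform API), a Spark-enabled cluster with a bootstrap action can be created via boto3; the bucket path, script name, instance types, and release label are placeholders:
import boto3

# Sketch of provisioning an EMR cluster that runs Spark and installs
# Featureform's Python dependencies through a bootstrap action.
# All names, paths, and sizes below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="featureform-spark-cluster",
    ReleaseLabel="emr-6.7.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "Install Featureform dependencies",
            "ScriptBootstrapAction": {
                # Placeholder path to the bootstrap script referenced above,
                # uploaded to your own S3 bucket.
                "Path": "s3://my-spark-bucket/bootstrap/featureform_bootstrap.sh",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # e.g. "j-ExampleCluster"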

Dataframe Transformations

Using Spark with Featureform, a user can define transformations in SQL, as with other offline providers.
spark_quickstart.py
transactions = spark.register_parquet_file(
    name="transactions",
    variant="kaggle",
    owner="featureformer",
    file_path="s3://my-spark-bucket/source_datasets/transaction_short/",
)

@spark.sql_transformation()
def max_transaction_amount():
    """The maximum transaction amount per user."""
    return "SELECT CustomerID as user_id, max(TransactionAmount) " \
           "as max_transaction_amt from {{transactions.kaggle}} GROUP BY user_id"
In addition, registering a provider via Spark allows you to perform DataFrame transformations using your source tables as inputs.
spark_quickstart.py
@spark.df_transformation(
    inputs=[("transactions", "kaggle")], variant="default")
def average_user_transaction(df):
    from pyspark.sql.functions import avg
    # Aggregate the average transaction amount per customer.
    df = df.groupBy("CustomerID").agg(
        avg("TransactionAmount").alias("average_user_transaction"))
    return df
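
Once these definitions are applied with the Featureform CLI's apply command, the registered resources can be consumed through the serving client. A minimal sketch, assuming the hypothetical fraud_training training set from the registration sketch above and a placeholder host:
from featureform import ServingClient

# Placeholder host; point this at your Featureform instance.
serving = ServingClient(host="<featureform-host>", insecure=True)

# Iterate over the hypothetical training set registered earlier on this page.
dataset = serving.training_set("fraud_training", "default")
for i, row in enumerate(dataset):
    if i >= 5:
        break
    print(row)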