Featureform supports Spark as an Offline Store. This means that Featureform can handle all flavors of Spark using S3, GCS, Azure Blob Store, or HDFS as a backing store. Common use cases of Featureform with Spark include:

- Spark with EMR
- Spark with Databricks
Spark is a powerful, open-source, general-purpose computing framework developed for large-scale data processing. Databricks is a managed data and analytics platform built on top of Spark. Spark can be self-hosted in Kubernetes and other non-cloud environments, hosted with popular cloud providers such as AWS, Azure, and GCP, or run on Databricks.
Using Spark and a file store (GCS, Azure Blob Storage, S3, or HDFS) as an Offline Store, you can define new transformations via SQL or Spark DataFrames. Starting from either these transformations or pre-existing files in your file store, you can chain transformations together and register columns in the resulting tables as new features and labels.
Any column in a pre-existing table or user-created transformation can be registered as a feature or label. These features and labels can be used, as with any other Offline Store, for creating training sets and inference serving.
Using Spark with Featureform, a user can define transformations in SQL, just as with other offline providers:
spark_quickstart.py
```python
transactions = spark.register_file(
    name="transactions",
    variant="kaggle",
    owner="featureformer",
    file_path="<insert_file_path_here>",
)

@spark.sql_transformation()
def max_transaction_amount():
    """The maximum transaction amount for each user."""
    return "select customerid as user_id, max(transactionamount) " \
           "as max_transaction_amt from {{transactions.kaggle}} group by user_id"
```
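Once a transformation like this is registered, its columns can be registered as features and labels and combined into a training set. The snippet below is a minimal sketch rather than a verbatim excerpt of Featureform's API: the `user` entity, the `redis` inference store, the `isfraud` label column, and the `register_resources` / `register_training_set` calls are illustrative assumptions.

```python
import featureform as ff

# The "user" entity ties features and labels to the same key column.
user = ff.register_entity("user")

# Register the aggregated column as a feature of "user".
# `redis` is assumed to be a previously registered online store.
max_transaction_amount.register_resources(
    entity=user,
    entity_column="user_id",
    inference_store=redis,
    features=[
        {"name": "max_transactions", "variant": "kaggle",
         "column": "max_transaction_amt", "type": "float32"},
    ],
)

# Register a label column straight from the source file.
# The "isfraud" column is illustrative and depends on your dataset.
transactions.register_resources(
    entity=user,
    entity_column="customerid",
    labels=[
        {"name": "fraudulent", "variant": "kaggle",
         "column": "isfraud", "type": "bool"},
    ],
)

# Combine the feature and the label into a training set.
ff.register_training_set(
    "fraud_training", "kaggle",
    label=("fraudulent", "kaggle"),
    features=[("max_transactions", "kaggle")],
)
```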
In addition, registering Spark as a provider allows you to define DataFrame transformations that use your source tables as inputs.
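As a sketch of what that can look like (assuming a `df_transformation` decorator that takes the registered source as a `(name, variant)` input and passes it to the function as a Spark DataFrame):

```python
@spark.df_transformation(inputs=[("transactions", "kaggle")])
def average_transaction_amount(transactions_df):
    """The average transaction amount per user, computed with the DataFrame API."""
    # Import inside the function so it serializes cleanly to the Spark job.
    from pyspark.sql import functions as F

    return (
        transactions_df
        .groupBy("customerid")
        .agg(F.avg("transactionamount").alias("avg_transaction_amt"))
        .withColumnRenamed("customerid", "user_id")
    )
```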