Spark with EMR
Featureform uses Spark on EMR for computation: EMR runs user-defined transformations and builds training sets. Users can author new tables and iterate on training sets sourced directly from S3 via the Featureform CLI.
Before the EMR cluster can be used, it must run a bootstrap action that installs the Python packages required by Featureform's Spark script. The following link contains the script that must be added as a bootstrap action for your cluster to be compatible with Featureform:
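As a sketch of how the bootstrap action is wired up: EMR accepts bootstrap actions as a list of `{Name, ScriptBootstrapAction}` entries (this is the shape boto3's `run_job_flow` expects). The S3 path and action name below are placeholders, not values from Featureform's docs:

```python
# Placeholder S3 path to the Featureform bootstrap script linked above.
BOOTSTRAP_SCRIPT = "s3://my-bucket/featureform/bootstrap.sh"

def featureform_bootstrap_action(script_path):
    """Build the BootstrapActions entry EMR expects; each node runs the
    script before the cluster accepts work."""
    return {
        "Name": "Install Featureform dependencies",
        "ScriptBootstrapAction": {"Path": script_path},
    }

# Attach it at cluster creation (other run_job_flow arguments omitted), e.g.:
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.run_job_flow(..., BootstrapActions=[featureform_bootstrap_action(BOOTSTRAP_SCRIPT)])
```

The same entry can also be supplied through the EMR console or the `--bootstrap-actions` flag of `aws emr create-cluster`.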
```python
import featureform as ff

# aws_creds holds the AWS credentials listed below; argument and
# placeholder names here are illustrative -- check your Featureform
# version's API reference for exact signatures.
emr = ff.EMRCredentials(emr_cluster_id="<cluster-id>", emr_cluster_region="<region>", credentials=aws_creds)
s3 = ff.register_s3(name="s3-quickstart", credentials=aws_creds, bucket_name="<bucket>", bucket_region="<region>")
spark = ff.register_spark(
    name="spark-emr-s3",
    description="A Spark deployment we created for the Featureform quickstart",
    executor=emr,
    filestore=s3,
)
```
The following AWS credentials are required:

- `aws_access_key_id` (Executor and File Store)
- `aws_secret_access_key` (Executor and File Store)
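Since the executor and the file store consume the same two credentials, one pattern is to read them once from the environment and reuse them for both registrations. A minimal sketch, assuming the standard `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` environment variables; the Featureform call in the trailing comment is illustrative, not a confirmed signature:

```python
import os

def load_aws_credentials():
    """Read the two credentials both the EMR executor and the S3 file
    store need; fail fast if either is missing."""
    creds = {
        "aws_access_key_id": os.environ.get("AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": os.environ.get("AWS_SECRET_ACCESS_KEY"),
    }
    missing = [k for k, v in creds.items() if not v]
    if missing:
        raise RuntimeError(f"Missing AWS credentials: {missing}")
    return creds

# One credentials object can then back both registrations, e.g. (hypothetical):
#   aws_creds = ff.AWSCredentials(**load_aws_credentials())
```

Failing fast here surfaces a missing credential at registration time rather than as an opaque failure inside an EMR job.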
Because Featureform supports a generic implementation of Spark, transformations written as SQL or DataFrame operations are nearly identical across the different Spark providers; only the file path or table name differs.
Examples of both SQL and DataFrame transformations can be found on the main Spark providers page.
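To make the portability point concrete, a toy sketch in plain Python (no Featureform required): the SQL transformation body stays the same across providers, and only the source reference substituted in changes, an S3 path on EMR versus a table name elsewhere. The column names and query are invented for illustration:

```python
# Identical SQL body for every Spark provider; only {source} varies.
SQL_TEMPLATE = (
    "SELECT user_id, AVG(amount) AS avg_amount FROM {source} GROUP BY user_id"
)

def render(source_ref):
    """Substitute a provider-specific source reference into the query."""
    return SQL_TEMPLATE.format(source=source_ref)

emr_query = render("s3://my-bucket/transactions/")  # EMR reads from S3
generic_query = render("transactions")              # table-backed provider
```

Featureform's own transformation decorators perform this substitution for you via source templating, so the transformation definition itself does not change when you switch providers.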