Spark with EMR
Featureform supports Spark on AWS as an Offline Store.
Implementation
The AWS Spark Offline Store uses AWS Elastic MapReduce (EMR) as its compute layer and S3 as its storage layer. The transformations, training sets, and feature definitions a user registers via the Featureform client are stored as Parquet tables in S3.
Featureform leverages EMR to compute user-defined transformations and training sets. Via the Featureform CLI, users can author new tables and iterate on training sets sourced directly from S3.
Features registered on the Spark Offline Store can be materialized to an Inference Store (e.g., Redis) for real-time feature serving.
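For example, a feature defined on a Spark-backed source can name an inference store at registration time; on apply, Featureform materializes it from S3 into that store. The sketch below is illustrative only: `redis`, the file path, and the column names are placeholders, `spark` refers to the Spark offline store registered under Configuration below, and the entity/feature API shown here may differ between Featureform SDK versions.

```python
import featureform as ff

# Hypothetical inference store for real-time serving.
redis = ff.register_redis(name="redis-cache", host="my.redis.host", port=6379)

# A primary table stored as Parquet in S3, registered on the Spark provider
# ("spark" is the offline store registered under Configuration below).
transactions = spark.register_file(
    name="transactions",
    file_path="s3://my-bucket/path/to/transactions.parquet",
)

@ff.entity
class Customer:
    # Materialized from the Spark offline store into Redis on apply.
    transaction_amount = ff.Feature(
        transactions[["CustomerID", "TransactionAmount"]],
        type=ff.Float32,
        inference_store=redis,
    )
```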
Requirements
- A running EMR cluster that supports Spark
- An S3 bucket
- An AWS access key ID and secret access key with access to both
Configuration
To configure a Spark provider on AWS, you need an IAM role with access to your account's EMR cluster and S3 bucket.
Your AWS access key ID and AWS secret access key are used as credentials when registering your Spark Offline Store.
Your EMR cluster must be running and support Spark.
Before it is deployed, the EMR cluster must run a bootstrap action that installs the Python packages required by Featureform's Spark script. The following script must be added as a bootstrap action to make your cluster compatible with Featureform:
https://featureform-demo-files.s3.amazonaws.com/python_packages.sh
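Once the cluster is bootstrapped, the provider can be registered from the Python client. The following is a minimal sketch: the provider names, cluster ID, and bucket are placeholders, and exact class and parameter names (e.g., `ff.AWSCredentials` vs. `ff.AWSStaticCredentials`, `bucket_name` vs. `bucket_path`) vary between Featureform SDK versions.

```python
import featureform as ff

# Placeholder credentials with access to both EMR and S3.
aws_creds = ff.AWSCredentials(
    access_key="<AWS_ACCESS_KEY_ID>",
    secret_key="<AWS_SECRET_ACCESS_KEY>",
)

# EMR as the executor (compute layer).
emr = ff.EMRCredentials(
    emr_cluster_id="<EMR_CLUSTER_ID>",
    emr_cluster_region="<EMR_CLUSTER_REGION>",
    credentials=aws_creds,
)

# S3 as the file store (storage layer) where Parquet tables are written.
s3 = ff.register_s3(
    name="s3-store",
    credentials=aws_creds,
    bucket_name="<S3_BUCKET_NAME>",
    bucket_region="<S3_BUCKET_REGION>",
)

# Combine executor and file store into the Spark offline store.
spark = ff.register_spark(
    name="spark-emr",
    description="Spark on EMR backed by S3",
    executor=emr,
    filestore=s3,
)
```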
Mutable Configuration Fields
- description
- aws_access_key_id (Executor and File Store)
- aws_secret_access_key (Executor and File Store)
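These fields can be changed after the provider is created. A minimal sketch, assuming the registration shown above: re-applying the provider definition under the same name with new values updates the mutable fields.

```python
# Rotated credentials; rebuild the executor and file store with them as needed.
new_creds = ff.AWSCredentials(
    access_key="<NEW_AWS_ACCESS_KEY_ID>",
    secret_key="<NEW_AWS_SECRET_ACCESS_KEY>",
)

# Re-apply the provider with an updated description. The name must match
# the existing provider; fields not listed as mutable cannot change.
spark = ff.register_spark(
    name="spark-emr",
    description="Spark on EMR backed by S3 (updated)",
    executor=emr,
    filestore=s3,
)
```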
Dataframe Transformations
Because Featureform supports a generic Spark implementation, SQL and dataframe transformations are nearly identical across the different Spark providers; only the file path or table name differs.
Examples of both SQL and dataframe transformations can be found on the main Spark providers page.
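As an illustration, transformations against the Spark provider registered above might look like the following sketch. The source name `transactions`, the variant `default`, and the column names are placeholders, and decorator signatures may vary between Featureform SDK versions.

```python
# A dataframe transformation: the decorated function receives the input
# sources as PySpark dataframes and runs as a Spark job on EMR.
@spark.df_transformation(inputs=[("transactions", "default")])
def avg_transaction_amount(transactions):
    """Average transaction amount per customer."""
    return transactions.groupBy("CustomerID").agg(
        {"TransactionAmount": "avg"}
    )

# The equivalent SQL transformation: the template placeholder resolves
# to the S3-backed table for the named source and variant.
@spark.sql_transformation()
def avg_transaction_amount_sql():
    return (
        "SELECT CustomerID, AVG(TransactionAmount) AS avg_amount "
        "FROM {{ transactions.default }} GROUP BY CustomerID"
    )
```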