Spark with Databricks
Featureform supports Databricks as an Offline Store.
Implementation
With Databricks, you can leverage your Databricks cluster to compute transformations and training sets. Featureform, however, does not handle storage in non-local mode, so you must separately register a file store provider, such as Azure Blob Storage, to hold the results of its computations.
Requirements
- Databricks Cluster
Required Azure Configurations
Azure Blob Storage
If you encounter the following error, or one similar to it, when registering a transformation, you may need to disable certain default container configurations:
To disable these configurations, navigate to Home > Storage Accounts > YOUR STORAGE ACCOUNT and select “Data Protection” under the “Data Management” section. Then uncheck:
- Enable soft delete for blobs
- Enable soft delete for containers
Databricks
If you encounter the following error, or one similar to it, when registering a transformation, you may need to add credentials for your Azure Blob Storage account to your Databricks cluster:
To add Azure Blob Store account credentials to your Databricks cluster, you’ll need to:
- Launch your Databricks workspace from the Azure portal (e.g. by clicking “Launch Workspace” from the “Overview” page of your Databricks workspace)
- Select “Compute” from the left-hand menu and click on your cluster
- Click “Edit”, then select the “Advanced Options” tab to show the “Spark Config” text input field
- Add the following configuration to the “Spark Config” text input field:
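A minimal sketch of the setting, assuming account-key access; substitute your own storage account name and key (or, preferably, a reference to a Databricks secret). If your file paths use the Data Lake Gen2 abfss:// scheme rather than the Blob wasbs:// scheme, use the property fs.azure.account.key.&lt;storage-account-name&gt;.dfs.core.windows.net instead:

```
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
```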
Once you’ve clicked “Confirm”, your cluster will need to restart before you can apply the transformation again.
Transformation Sources
Using Databricks as an Offline Store, you can define new transformations via SQL and Spark DataFrames. Using either these transformations or preexisting tables in your file store, you can chain transformations and register columns in the resulting tables as new features and labels.
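As a rough illustration, the sketch below registers one SQL transformation and one DataFrame transformation against a Spark provider. The provider handle `spark`, the source name `transactions`, its variant, and all column names are placeholders used for illustration only:

```python
from pyspark.sql import functions as F

# `spark` is the Spark offline store provider returned by ff.register_spark(...)
# (see the Configuration section below).

# SQL transformation: the {{source.variant}} template is resolved by Featureform
@spark.sql_transformation(variant="quickstart")
def max_transaction_amount():
    """Largest transaction per customer."""
    return (
        "SELECT CustomerID, MAX(TransactionAmount) AS max_transaction_amount "
        "FROM {{transactions.quickstart}} GROUP BY CustomerID"
    )

# DataFrame transformation: each input arrives as a Spark DataFrame
@spark.df_transformation(inputs=[("transactions", "quickstart")], variant="quickstart")
def average_user_transaction(transactions):
    """Average transaction amount per customer."""
    return transactions.groupBy("CustomerID").agg(
        F.avg("TransactionAmount").alias("avg_transaction_amount")
    )
```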
Training Sets and Inference Store Materialization
Any column in a preexisting table or user-created transformation can be registered as a feature or label. These features and labels can be used, as with any other Offline Store, for creating training sets and inference serving.
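For example, a sketch along these lines registers a feature from the transformation above, a label from the raw source, and a training set joining the two. The entity, column, provider, and resource names are placeholders, and the `register_resources` / `register_training_set` calls follow Featureform's Python registration API as of this writing, so check the current API reference:

```python
import featureform as ff

user = ff.register_entity("user")

# Register a feature from the transformation's output (placeholder names)
average_user_transaction.register_resources(
    entity=user,
    entity_column="CustomerID",
    inference_store=redis,  # a previously registered inference store, e.g. Redis
    features=[
        {"name": "avg_transactions", "variant": "quickstart",
         "column": "avg_transaction_amount", "type": "float32"},
    ],
)

# Register a label directly from the preexisting source
transactions.register_resources(
    entity=user,
    entity_column="CustomerID",
    labels=[
        {"name": "fraudulent", "variant": "quickstart",
         "column": "IsFraud", "type": "bool"},
    ],
)

# Combine the feature and label into a training set
ff.register_training_set(
    "fraud_training", "quickstart",
    label=("fraudulent", "quickstart"),
    features=[("avg_transactions", "quickstart")],
)
```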
Configuration
To configure Databricks as a provider, you need to have a Databricks cluster. Featureform automatically downloads and uploads the files required to provide the same functionality as a native offline store like Postgres or BigQuery.
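The snippet below is a sketch of registering the executor and file store pair with Featureform's Python client; all names, hosts, and credentials are placeholders, and the Databricks executor can be configured with either username/password or host/token (see the mutable fields below):

```python
import featureform as ff

# Executor: the Databricks cluster that runs transformations
databricks = ff.DatabricksCredentials(
    host="<databricks-workspace-url>",
    token="<databricks-access-token>",  # or username/password instead of host/token
    cluster_id="<cluster-id>",
)

# File store: the Azure Blob Storage container that holds transformation results
azure_blob = ff.register_blob_store(
    name="azure-blob",
    account_name="<azure-storage-account-name>",
    account_key="<azure-storage-account-key>",
    container_name="<container-name>",
    root_path="featureform/path/within/container",
)

# Spark offline store built from the executor + file store pair
spark = ff.register_spark(
    name="spark-databricks-azure",
    description="Databricks cluster with Azure Blob Storage",
    executor=databricks,
    filestore=azure_blob,
)
```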
Mutable Configuration Fields
- description
- username (Executor)
- password (Executor)
- token (Executor)
- account_key (File Store)
Dataframe Transformations
Because Featureform supports a generic implementation of Spark, transformations written in SQL and DataFrame operations are nearly identical across the different Spark providers; only the file_path or table name changes.
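For instance, the sketch below registers the same file-based source on a Databricks + Azure Blob Storage deployment, assuming the provider object exposes a register_file method; the names and paths are placeholders, and the transformation code that consumes the source stays the same regardless of provider, with only the file_path differing:

```python
# Databricks + Azure Blob Storage deployment (abfss:// path)
transactions = spark.register_file(
    name="transactions",
    variant="quickstart",
    description="Raw transactions parquet file",
    file_path="abfss://<container>@<account>.dfs.core.windows.net/transactions.parquet",
)

# On an EMR + S3 deployment, only the file_path would differ, e.g.:
# file_path="s3://<bucket>/transactions.parquet"
```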