Spark with Databricks
With Databricks, you can leverage your Databricks cluster to compute transformations and training sets. However, Featureform does not handle storage in non-local mode, so you must separately register a file store provider, such as Azure Blob Storage, to store the results of its computations.
If you encounter the following error, or one similar to it, when registering a transformation, you may need to disable certain default container configurations:
Caused by: Operation failed: "This endpoint does not support BlobStorageEvents or SoftDelete. Please disable these account features if you would like to use this endpoint."
To disable these configurations, navigate to Home > Storage Accounts > YOUR STORAGE ACCOUNT, select "Data Protection" under the "Data Management" section, and uncheck:
- Enable soft delete for blobs
- Enable soft delete for containers
If you encounter the following error, or one similar to it, when registering a transformation, you may need to add credentials for your Azure Blob Storage account to your Databricks cluster:
Cannot read the python file abfss://<container_name>@<storage_account_name>/<root_path>/featureform/scripts/spark/offline_store_spark_runner.py. Please check driver logs for more details.
To add Azure Blob Store account credentials to your Databricks cluster, you'll need to:
- Launch your Databricks workspace from the Azure portal (e.g. by clicking "Launch Workspace" from the "Overview" page of your Databricks workspace)
- Select "Compute" from the left-hand menu and click on your cluster
- Click "Edit" and then select "Advanced Options" tab to show the "Spark Config" text input field
- Add the following configuration to the "Spark Config" text input field:
spark.hadoop.fs.azure.account.key.<AZURE BLOB STORAGE SERVICE ACCOUNT NAME>.blob.core.windows.net <AZURE BLOB STORAGE SERVICE ACCOUNT KEY>
Once you've clicked "Confirm", your cluster will need to restart before you can apply the transformation again.
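If you want to verify the credential before restarting the cluster, the same property can also be set on the active Spark session from a notebook. This is a minimal sketch, assuming a Databricks notebook where `spark` is the preconfigured SparkSession; the account name and key placeholders are yours to fill in:

```python
# Minimal sketch: set the storage account key on the current Spark session.
# Assumes a Databricks notebook where `spark` is already defined; replace the
# placeholders with your storage account name and key.
spark.conf.set(
    "fs.azure.account.key.<AZURE BLOB STORAGE SERVICE ACCOUNT NAME>.blob.core.windows.net",
    "<AZURE BLOB STORAGE SERVICE ACCOUNT KEY>",
)
```

Session-level settings only last for the life of the notebook session, so the cluster-level Spark Config above is still the configuration Featureform's jobs will rely on.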
Any column in a preexisting table or user-created transformation can be registered as a feature or label. These features and labels can then be used, as with any other Offline Store, to build training sets and serve inference (see the sketch following the registration code below).
To configure Databricks as a provider, you need a Databricks cluster. Featureform automatically downloads and uploads the necessary files so that it can offer all the functionality of a native offline store like Postgres or BigQuery.
```python
import featureform as ff

databricks = ff.DatabricksCredentials(
    # You can either use username and password ...
    username="<username>", password="<password>",
    # ... or host and token
    host="<host>", token="<token>", cluster_id="<cluster_id>",
)

azure_blob = ff.register_blob_store(
    name="azure-blob",
    description="An azure blob store provider to store offline and inference data",
    account_name="<storage_account_name>", account_key="<storage_account_key>",
    # Will either be the container name or the container name plus a path if you plan to read/write
    # to a specific directory in your container
    container_name="<container_name>", root_path="<path_within_container>",
)

spark = ff.register_spark(name="spark-databricks-azure", executor=databricks, filestore=azure_blob)

transactions = spark.register_file(
    name="transactions",
    # Must be an absolute path using the abfss:// protocol
    file_path="abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path_to_file>",
)
```
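The following is an illustrative sketch only: the entity, column names, variants, and the Redis inference store are assumptions rather than part of this guide, and the exact registration API differs between Featureform versions (this example follows the dictionary-based `register_resources` style):

```python
# Illustrative sketch: the entity, column names, variants, and the Redis
# provider are assumptions; adjust them to your own data.
redis = ff.register_redis(name="redis-quickstart", host="<redis_host>", port=6379)

user = ff.register_entity("user")

transactions.register_resources(
    entity=user,
    entity_column="CustomerID",
    inference_store=redis,
    features=[
        {"name": "transaction_amount", "variant": "default", "column": "TransactionAmount", "type": "float32"},
    ],
    labels=[
        {"name": "fraudulent", "variant": "default", "column": "IsFraud", "type": "bool"},
    ],
)

ff.register_training_set(
    "fraud_training", "default",
    label=("fraudulent", "default"),
    features=[("transaction_amount", "default")],
)
```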
Because Featureform supports the generic implementation of Spark, transformations written with SQL and DataFrame operations will be very similar across the different Spark providers; only the file_path or table name changes.
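As a hedged sketch of what that looks like on this provider: the column names, the aggregation, and the templated source reference below are illustrative assumptions, and depending on your Featureform version the `{{ ... }}` reference may need the variant appended while `df_transformation` may expect a `(name, variant)` tuple in `inputs`:

```python
# Illustrative sketch: column names and aggregations are assumptions.
@spark.sql_transformation()
def average_user_transaction():
    """The average transaction amount per user, written as SQL."""
    return (
        "SELECT CustomerID AS user_id, AVG(TransactionAmount) AS avg_transaction_amt "
        "FROM {{ transactions }} GROUP BY user_id"
    )

# The same idea with DataFrame operations; older versions may expect
# inputs=[("transactions", "<variant>")] instead of the dataset object.
@spark.df_transformation(inputs=[transactions])
def max_user_transaction(df):
    """The maximum transaction amount per user, written with DataFrame operations."""
    return df.groupBy("CustomerID").max("TransactionAmount")
```

On another Spark provider (for example EMR with S3), the only part of these definitions that would change is the registered source that the file_path points to.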