Snowflake
Featureform supports Snowflake as an Offline Store.
Configuration
First, we have to add a declarative Snowflake configuration in Python.
Credentials with Account and Organization
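Below is a minimal registration sketch; the credential values are placeholders, and the full parameter set should be confirmed against the Featureform API reference.

```python
import featureform as ff

snowflake = ff.register_snowflake(
    name="snowflake-quickstart",
    description="Snowflake offline store",
    # Placeholder credentials -- replace with your own.
    username="<username>",
    password="<password>",
    account="<account>",
    organization="<organization>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
    role="<role>",
)
```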
Legacy Credentials
Older Snowflake accounts may have credentials that use an Account Locator rather than an account and organization to connect. Featureform provides a separate registration function to support these credentials.
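A sketch of the legacy path, assuming the register_snowflake_legacy function, which accepts an account_locator in place of account and organization:

```python
import featureform as ff

snowflake = ff.register_snowflake_legacy(
    name="snowflake-legacy",
    # Placeholder credentials -- replace with your own.
    username="<username>",
    password="<password>",
    account_locator="<account-locator>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)
```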
Dynamic Apache Iceberg™ Tables
Featureform’s Snowflake integration leverages dynamic Apache Iceberg™ tables to offer automated data transformation pipelines from SQL transformations to training sets.
This requires users to create an external volume prior to registering Snowflake as a provider. See the repo terraforming-snowflake-external-volumes for an example of creating an external volume backed by an AWS S3 bucket using Terraform.
Additionally, users may have to manually enable change tracking on source tables for incremental refreshes to work.
The name of the external volume, as well as a folder name in the bucket (i.e. base_location), is required to instantiate ff.SnowflakeCatalog. Currently, the only option for the catalog is to use Snowflake as the Iceberg catalog.
Additionally, users can set “provider-level” values for target lag and refresh mode that will be used by all resources (i.e. SQL transformations, features, labels and training sets) unless otherwise specified at the resource level.
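The sketch below ties these pieces together. ff.SnowflakeDynamicTableConfig and ff.RefreshMode are assumed names for the provider-level target lag and refresh mode settings described above, and the volume and folder names are placeholders; verify the exact types against the API reference.

```python
import featureform as ff

snowflake = ff.register_snowflake(
    name="snowflake-iceberg",
    username="<username>",
    password="<password>",
    account="<account>",
    organization="<organization>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
    catalog=ff.SnowflakeCatalog(
        # Pre-created external volume and a folder name in its bucket.
        external_volume="<external-volume>",
        base_location="<folder-name>",
        # Provider-level defaults applied to all dynamic tables unless
        # overridden at the resource level. (Assumed config type.)
        table_config=ff.SnowflakeDynamicTableConfig(
            target_lag="1 hours",
            refresh_mode=ff.RefreshMode.AUTO,
        ),
    ),
)
```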
Session Parameters
Users may optionally add session parameters when configuring Snowflake as an offline store. The parameters will be used when Featureform creates a connection with Snowflake.
For example, providing a value for query_tag at configuration ensures all queries issued by Featureform will have the value of query_tag applied to them, which will be visible in the Snowflake UI.
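For example, a sketch passing a QUERY_TAG session parameter (the exact dictionary shape should be verified against the API reference):

```python
import featureform as ff

snowflake = ff.register_snowflake(
    name="snowflake-tagged",
    username="<username>",
    password="<password>",
    account="<account>",
    organization="<organization>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
    # Applied to every connection Featureform opens; the tag is visible
    # on Featureform-issued queries in the Snowflake UI.
    session_params={"QUERY_TAG": "featureform"},
)
```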
Applying
Once our config file is complete, we can apply it to our Featureform deployment:
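For example, with the Featureform CLI, assuming the configuration is saved as snowflake_config.py:

```shell
featureform apply snowflake_config.py
```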
We can re-verify that the provider is created by checking the Providers tab of the Feature Registry.
Mutable Configuration Fields
Many of the values provided at the time of registration can be changed if users re-apply the same Python file (e.g. snowflake_config.py) with different values. The following configuration fields are mutable:
- description
- username
- password
- role
- warehouse
- database
- schema
- session_params
Implementation
Datasets
Primary Datasets
Primary datasets (i.e. tables) are read-only and are only used as inputs in transformations, features and/or labels.
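A sketch of registering an existing table as a primary dataset, continuing from the provider registration above (the register_table call and its arguments should be checked against the API reference):

```python
# Register an existing Snowflake table as a read-only primary dataset.
transactions = snowflake.register_table(
    name="transactions",
    table="TRANSACTIONS",  # table name within the configured database/schema
)
```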
Transformation Sources
SQL transformations are used to create a dynamic Apache Iceberg™ table. The refresh mode (i.e. full or incremental) depends on the user-defined SQL query, so users should be aware of query expressions, keywords, etc. that might prevent incremental refreshes.
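A sketch of a SQL transformation, continuing from the primary dataset above; the decorator-with-template pattern follows Featureform's registration API, and the query itself is a hypothetical example:

```python
# The transformation is materialized as a dynamic Iceberg table. A plain
# aggregation like this keeps the query eligible for incremental refresh;
# {{ tx }} is replaced with the registered input dataset at run time.
@snowflake.sql_transformation(inputs=[transactions])
def avg_transaction_amount(tx):
    return (
        "SELECT customer_id, "
        "AVG(transaction_amount) AS avg_transaction_amount "
        "FROM {{ tx }} GROUP BY customer_id"
    )
```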
Offline to Inference Store Materialization
When a feature is registered, Featureform creates an internal transformation to get the newest value of each feature for every associated entity. A Kubernetes job is then kicked off to sync these values to the inference store.
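A sketch of a feature registration that triggers this materialization; the entity class pattern follows Featureform's API, and redis stands in for any previously registered inference store:

```python
import featureform as ff

@ff.entity
class Customer:
    # The newest value per entity is synced to the inference store.
    avg_transaction_amount = ff.Feature(
        avg_transaction_amount[["customer_id", "avg_transaction_amount"]],
        type=ff.Float32,
        inference_store=redis,  # assumed: a registered inference store
    )
```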
Training Set Generation
Training sets are created from one or more features and a label. Labels are registered separately as dynamic tables to ensure the latest label values are used in the training set. When users define a timestamp column for a feature and label combination, Featureform uses Snowflake’s ASOF JOIN to combine feature values with their labels.
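A sketch of training set registration, assuming a label named fraudulent was registered beforehand; the (name, variant) tuple form follows Featureform's registration API and should be verified against the current reference:

```python
import featureform as ff

# Features are joined to the latest label values; when a timestamp
# column is defined, the join is performed with Snowflake's ASOF JOIN.
ff.register_training_set(
    "fraud_training_set",
    label=("fraudulent", "default"),  # hypothetical label name/variant
    features=[("avg_transaction_amount", "default")],
)
```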