Requirements

Before starting, you'll need:

  - A GCP project with billing enabled
  - A domain name you control (for the Featureform endpoint)
  - The gcloud CLI, Terraform, kubectl, and Python with pip installed locally

Step 1: Clone the Featureform Repo

git clone https://github.com/featureform/featureform.git
cd featureform/terraform/gcp

Step 2: Create GCP Services

We’ll provision BigQuery, Firestore, and a Google Kubernetes Engine (GKE) cluster. (Specific services can be enabled/disabled as needed in terraform.auto.tfvars.)

We need to set a few environment variables:

export PROJECT_ID=           # Your GCP Project ID
export DATASET_ID=featureform                 # The BigQuery Dataset we'll use
export BUCKET_NAME=         # A GCP Storage Bucket where we can store test data
export COLLECTION_ID=featureform_collection   # A Firestore Collection ID
export FEATUREFORM_HOST=    # The domain name that you own

Next, point the gcloud CLI at our project and apply the Terraform configuration:

cd gcp_services
gcloud auth application-default login   # Gives Terraform access to GCP
gcloud config set project $PROJECT_ID   # Sets our GCP project
terraform init; \
terraform apply -auto-approve \
-var="project_id=$PROJECT_ID" \
-var="bigquery_dataset_id=$DATASET_ID" \
-var="storage_bucket_name=$BUCKET_NAME" \
-var="firestore_collection_name=$COLLECTION_ID"

Step 3: Configure Kubectl

We need to load the GKE config into our kubeconfig.

gcloud container clusters get-credentials $(terraform output -raw kubernetes_cluster_name) --region $(terraform output -raw region)
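
To verify that kubectl now points at the new cluster, check the current context and list the nodes:

kubectl config current-context
kubectl get nodes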

Step 4: Install Featureform

We’ll use Terraform to install Featureform on our GKE cluster.

cd ../featureform
terraform init; terraform apply -auto-approve -var="featureform_hostname=$FEATUREFORM_HOST"
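
Once the apply finishes, you can watch the Featureform pods come up (pod names vary by release):

kubectl get pods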

Step 5: Direct Your Domain To Featureform

Featureform automatically provisions a public certificate for your domain name.

To connect, you need to point your domain name at the Featureform GKE Cluster.

We can get the IP address of the cluster using:

kubectl get ingress | grep "grpc-ingress" | awk '{print $4}' | column -t

You need to add two records to your DNS provider for the (sub)domain you intend to use:

  1. A CAA record with the value 0 issuewild "letsencrypt.org". This allows Let's Encrypt to automatically issue a certificate for the domain.

  2. An A record pointing at the IP address output by the command above (example below).
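
With a hypothetical domain featureform.example.com and cluster IP 203.0.113.10, the two records would look like this in zone-file form:

featureform.example.com.  IN  CAA  0 issuewild "letsencrypt.org"
featureform.example.com.  IN  A    203.0.113.10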

Step 6: Load Demo Data

We can load some demo data into BigQuery that we can transform and serve.

# Load sample data into a bucket in the same project
curl https://featureform-demo-files.s3.amazonaws.com/transactions_short.csv | gsutil cp - gs://$BUCKET_NAME/transactions.csv

# Load the bucket data into BigQuery
bq load --autodetect --source_format=CSV $DATASET_ID.Transactions gs://$BUCKET_NAME/transactions.csv
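
To sanity-check the load, you can count the rows in the new table:

bq query --use_legacy_sql=false "SELECT COUNT(*) FROM $DATASET_ID.Transactions"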

Step 7: Install the Featureform SDK

pip install featureform

Step 8: Register providers

Registering the GCP providers requires a GCP credentials file for a service account with permissions for Firestore and BigQuery.
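
If you don't already have such a file, here's a minimal sketch of creating one with gcloud; the service-account name featureform-quickstart and the broad roles below are assumptions you should tighten for production:

# Create a service account for Featureform
gcloud iam service-accounts create featureform-quickstart

# Grant it BigQuery and Firestore (Datastore) access
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:featureform-quickstart@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.admin"
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:featureform-quickstart@$PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/datastore.user"

# Download a JSON key to use as credentials_path below
gcloud iam service-accounts keys create gcp_credentials.json \
  --iam-account="featureform-quickstart@$PROJECT_ID.iam.gserviceaccount.com"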

definitions.py
import os
import featureform as ff

project_id = os.getenv("PROJECT_ID")
collection_id = os.getenv("COLLECTION_ID")
dataset_id = os.getenv("DATASET_ID")

firestore = ff.register_firestore(
    name="firestore-quickstart",
    description="A Firestore deployment we created for the Featureform quickstart",
    project_id=project_id,
    collection=collection_id,
    credentials_path=""
)

bigquery = ff.register_bigquery(
    name="bigquery-quickstart",
    description="A BigQuery deployment we created for the Featureform quickstart",
    project_id=project_id,
    dataset_id=dataset_id,
    credentials_path=""
)

Once we’ve created our definitions file, we can apply it to our Featureform deployment.

featureform apply definitions.py

Step 9: Define our resources

First, we’ll create a user profile and set it as the default owner for all of the following resource definitions.
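
A minimal sketch of that registration (the username featureformer is just an example):

definitions.py
# Create a user profile and make it the default owner of later resources
ff.register_user("featureformer").make_default_owner()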

Then we’ll register our fraud dataset in Featureform.

definitions.py
transactions = bigquery.register_table(
    name="transactions",
    description="Fraud Dataset From Kaggle",
    table="Transactions", # This is the table's name in BigQuery
)

Next, we’ll define a SQL transformation on our dataset.

definitions.py
@bigquery.sql_transformation()
def average_user_transaction():
    return "SELECT CustomerID as user_id, avg(TransactionAmount) " \
           "as avg_transaction_amt from {{transactions.default}} GROUP BY user_id"
    

Next, we’ll register a user entity to associate with a feature and label.

definitions.py
# Register a column from our transformation as a feature
user = ff.register_entity("user")

average_user_transaction.register_resources(
    entity=user,
    entity_column="user_id",
    inference_store=firestore,
    features=[
        {"name": "avg_transactions", "column": "avg_transaction_amt", "type": "float32"},
    ],
)
# Register label from our base Transactions table
transactions.register_resources(
    entity=user,
    entity_column="customerid",
    labels=[
        {"name": "fraudulent", "column": "isfraud", "type": "bool"},
    ],
)

Finally, we’ll join together the feature and label into a training set.

definitions.py
ff.register_training_set(
    "fraud_training",
    label="fraudulent",
    features=["avg_transactions"],
)

Now that our definitions are complete, we can apply them to our Featureform instance.

featureform apply definitions.py

Step 10: Serve features for training and inference

Once we have our training set and features registered, we can train our model.

import featureform as ff

client = ff.ServingClient()
dataset = client.training_set("fraud_training")
training_set = dataset.shuffle(10000)
for batch in training_set:
    print(batch)

Example Output:

Features: [279.76] , Label: False
Features: [254.] , Label: False
Features: [1000.] , Label: False
Features: [5036.] , Label: False
Features: [10.] , Label: False
Features: [884.08] , Label: False
Features: [56.] , Label: False
...
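
From here we can feed the batches into any training loop. Below is a minimal sketch with scikit-learn, assuming each batch exposes the features() and label() accessors from Featureform's serving API:

import featureform as ff
from sklearn.linear_model import LogisticRegression

client = ff.ServingClient()
dataset = client.training_set("fraud_training")

# Collect the served rows into plain Python lists
X, y = [], []
for batch in dataset.shuffle(10000):
    X.append(batch.features())  # e.g. [279.76]
    y.append(batch.label())     # e.g. False

# Fit a simple classifier on the fraud training set
model = LogisticRegression().fit(X, y)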

Once we deploy our trained model, we can serve features for inference in production as well.

import featureform as ff

client = ff.ServingClient()
fpf = client.features(["avg_transactions"], {"user": "C1011381"})
print(fpf)

Example Output:

[1500.0]
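
These served values can be passed straight to a trained model; for example, reusing the model from the training sketch above:

# Predict whether this user's activity looks fraudulent
print(model.predict([fpf]))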