Google Cloud
A quick start guide for Featureform on GCP using Terraform.
This quickstart will walk through creating a few simple features, labels, and a training set using BigQuery and Firestore, built around a transaction fraud dataset.
- Python 3.7+
- An available domain/subdomain you own that can be pointed at your cluster IP
git clone https://github.com/featureform/featureform.git
cd featureform/terraform/gcp
We'll provision BigQuery, Firestore, and Google Kubernetes Engine (GKE). (Specific services can be enabled/disabled as needed in terraform.auto.tfvars.)
We need to set a few environment variables:
export PROJECT_ID=<your-project-id> # Your GCP Project ID
export DATASET_ID=featureform # The BigQuery Dataset we'll use
export BUCKET_NAME=<your-bucket-name> # A GCP Storage Bucket where we can store test data
export COLLECTION_ID=featureform_collection # A Firestore Collection ID
export FEATUREFORM_HOST=<your-domain-name> # The domain name that you own
cd gcp_services
gcloud auth application-default login # Gives Terraform access to GCP
gcloud config set project $PROJECT_ID # Sets our GCP Project
terraform init; \
terraform apply -auto-approve \
-var="project_id=$PROJECT_ID" \
-var="bigquery_dataset_id=$DATASET_ID" \
-var="storage_bucket_name=$BUCKET_NAME" \
-var="firestore_collection_name=$COLLECTION_ID"
We need to load the GKE config into our kubeconfig.
gcloud container clusters get-credentials $(terraform output -raw kubernetes_cluster_name) --region $(terraform output -raw region)
We'll use Terraform to install Featureform on our GKE cluster.
cd ../featureform
terraform init; terraform apply -auto-approve -var="featureform_hostname=$FEATUREFORM_HOST"
Featureform automatically provisions a public certificate for your domain name.
To connect, you need to point your domain name at the Featureform GKE Cluster.
We can get the IP address of the cluster using:
kubectl get ingress | grep "grpc-ingress" | awk '{print $4}' | column -t
You need to add two records to your DNS provider for the (sub)domain you intend to use:
1. A CAA record for letsencrypt.org with the value 0 issuewild "letsencrypt.org". This allows Let's Encrypt to automatically generate a public certificate.
2. An A record with the IP address output by the command above.
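Once the records propagate, you can sanity-check that the A record resolves to the ingress IP; a minimal sketch using only Python's standard library:
import os
import socket

# Compare this against the ingress IP printed by kubectl above
print(socket.gethostbyname(os.environ["FEATUREFORM_HOST"]))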
We can load some demo data into BigQuery that we can transform and serve.
# Load sample data into a bucket in the same project
curl https://featureform-demo-files.s3.amazonaws.com/transactions_short.csv | gsutil cp - gs://$BUCKET_NAME/transactions.csv
# Load the bucket data into BigQuery
bq load --autodetect --source_format=CSV $DATASET_ID.Transactions gs://$BUCKET_NAME/transactions.csv
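To confirm the load succeeded, we can count the rows; a minimal sketch, assuming the google-cloud-bigquery client library is installed:
import os
from google.cloud import bigquery

client = bigquery.Client(project=os.environ["PROJECT_ID"])
table = f"{os.environ['PROJECT_ID']}.{os.environ['DATASET_ID']}.Transactions"
rows = client.query(f"SELECT COUNT(*) AS n FROM `{table}`")
print(list(rows)[0]["n"])  # expect a non-zero row count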
pip install featureform
Registered GCP providers require a GCP credentials file for a user with permissions for Firestore and BigQuery.
definitions.py
import os
import featureform as ff
project_id = os.getenv("PROJECT_ID")
collection_id = os.getenv("COLLECTION_ID")
dataset_id = os.getenv("DATASET_ID")

firestore = ff.register_firestore(
    name="firestore-quickstart",
    description="A Firestore deployment we created for the Featureform quickstart",
    project_id=project_id,
    collection=collection_id,
    credentials_path="<path-to-firestore-credentials-file>"
)

bigquery = ff.register_bigquery(
    name="bigquery-quickstart",
    description="A BigQuery deployment we created for the Featureform quickstart",
    project_id=project_id,
    dataset_id=dataset_id,
    credentials_path="<path-to-bigquery-credentials-file>"
)
Once we create our config file, we can apply it to our Featureform deployment.
featureform apply definitions.py
We will create a user profile for ourselves and set it as the default owner for all the following resource definitions.
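The snippet for this step is a one-liner; a minimal sketch, assuming the SDK's register_user helper ("featureformer" is a placeholder username):
definitions.py
# Appended to the same definitions.py; ff is already imported above
ff.register_user("featureformer").make_default_owner()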
Now we'll register our user fraud dataset in Featureform.
definitions.py
transactions = bigquery.register_table(
    name="transactions",
    description="Fraud Dataset From Kaggle",
    table="Transactions",  # This is the table's name in BigQuery
)
Next, we'll define a SQL transformation on our dataset.
definitions.py
@bigquery.sql_transformation()
def average_user_transaction():
    return "SELECT CustomerID as user_id, avg(TransactionAmount) " \
           "as avg_transaction_amt from {{transactions.default}} GROUP BY user_id"
Next, we'll register a user entity to associate with a feature and label.
definitions.py
user = ff.register_entity("user")

# Register a column from our transformation as a feature
average_user_transaction.register_resources(
    entity=user,
    entity_column="user_id",
    inference_store=firestore,
    features=[
        {"name": "avg_transactions", "column": "avg_transaction_amt", "type": "float32"},
    ],
)

# Register label from our base Transactions table
transactions.register_resources(
    entity=user,
    entity_column="customerid",
    labels=[
        {"name": "fraudulent", "column": "isfraud", "type": "bool"},
    ],
)
Finally, we'll join together the feature and label into a training set.
definitions.py
ff.register_training_set(
    "fraud_training",
    label="fraudulent",
    features=["avg_transactions"],
)
Now that our definitions are complete, we can apply them to our Featureform instance.
featureform apply definitions.py
Once we have our training set and features registered, we can train our model.
import featureform as ff
client = ff.ServingClient()
dataset = client.training_set("fraud_training")
training_set = dataset.shuffle(10000)
for batch in training_set:
    print(batch)
Example Output:
Features: [279.76] , Label: False
Features: [254.] , Label: False
Features: [1000.] , Label: False
Features: [5036.] , Label: False
Features: [10.] , Label: False
Features: [884.08] , Label: False
Features: [56.] , Label: False
...
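To actually train a model on these rows rather than print them, we can stream them into an incremental learner. A minimal sketch, assuming scikit-learn is installed and that each served row exposes its values via .features() and .label(), consistent with the printed output above:
import featureform as ff
from sklearn.linear_model import SGDClassifier

client = ff.ServingClient()
dataset = client.training_set("fraud_training").shuffle(10000)

model = SGDClassifier()  # a simple classifier that supports incremental fitting
for row in dataset:
    # Wrap each row as a single-example batch for partial_fit
    model.partial_fit([row.features()], [row.label()], classes=[True, False])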
Once we deploy our trained model, we can serve features in production as well.
import featureform as ff
client = ff.ServingClient()
fpf = client.features(["avg_transactions"], {"user": "C1011381"})
print(fpf)
Example Output:
[1500.0]
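That served vector can be passed straight to a deployed model, such as the hypothetical classifier from the training sketch above:
# fpf is the feature vector served above; wrap it as a single-row batch
print(model.predict([fpf]))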