Kubernetes

Metadata

Coordinator

Serving

Monitoring

Dashboard

System Architecture

Featureform

What is Featureform?

A quick start guide for Featureform with Docker.

Quickstart

The feature engineering process involves three key stages: experimentation, production, and evaluation. Collaboration among data scientists is crucial during these stages, as it often leads to the creation of innovative features and insights. Featureform streamlines the feature engineering workflow, facilitating collaboration and enhancing the efficiency of the entire process.

Featureform Workflow

Featureform's architecture and components are designed to streamline the feature engineering process. It follows a Virtual Feature Store architecture, allowing for pluggable data infrastructure and serving as an overarching application framework for feature definition, management, and serving. Let's explore the key components and interfaces of Featureform's architecture:

Architecture and Components

Featureform coordinates a set of infrastructure providers to act together as a feature store. This Virtual Feature Store approach allows teams to choose the right infrastructure to meet their needs and interface across them with the same abstraction. Teams can also use multiple infrastructure providers for different use cases across the organization while maintaining one unified abstraction across all of them.

Registering Infrastructure Providers

Once we have everything registered (e.g. features, training sets, providers), we can see information about them on the Feature Registry.

Exploring the Feature Registry

Abstractions

Featureform functions as a Virtual Feature Store that sits on top of your existing data infrastructure. It acts as an orchestrator, directing your infrastructure to build and serve the features you define. Your data therefore stays within your infrastructure; it is not stored in Featureform. Transformations also run in the same language and with the same performance characteristics as they would without Featureform. To work this way, you need to register and configure your data infrastructure.

Data Infrastructure Provider

Once you've configured your data infrastructure, you can commence the process of registering your primary data sets. These primary data sets either directly contain your features and labels or serve as the foundation for their creation.

Primary Data Sets

In most scenarios, primary data sets serve as the raw materials, which are then transformed into data sets containing the set of features and labels required for serving and training machine learning models. These transformations can be directly applied to primary data sets or sequenced and executed on other previously transformed data sets. It's essential to understand that, with the exception of pandas, Featureform itself doesn't perform the data transformations. Instead, it orchestrates your existing data infrastructure to execute the transformations.

Transforming Data Sets

An **entity** serves as a collection of semantically related features and labels. Users define entities to map to the domain of their specific use cases. For instance, in the context of a ride-hailing service, entities could include customers and drivers, grouping related features and labels associated with these respective entities.

Entities

**Features** represent the core abstraction in Featureform. They serve as inputs to machine learning models, providing context or observations that the model leverages to make inferences. In practice, feature engineering often yields the highest return on investment for data scientists, significantly improving model performance and reliability. Features are employed in two primary contexts: building training sets and serving for inference.

Features

**Embeddings** represent a specialized type of [feature](/abstractions/feature) stored in a [Vector DB](/llms-embeddings-and-vector-databases/llm-workflow-with-featureform). They are primarily used for nearest neighbor lookups. If you'd like to explore a comprehensive explanation of what an embedding is, you can read our [definitive guide to embeddings](https://www.featureform.com/post/the-definitive-guide-to-embeddings).

Embeddings

**Labels** are a core component of a [training set](/abstractions/training-set). A model relies on a set of [features](/abstractions/feature) to make an inference. During the training process, this inference is compared to a label, and the model is adjusted incrementally.

Labels

Models require training, a process that typically involves feeding in a set of [features](/abstractions/feature) with known [labels](/abstractions/label). During training, the model makes inferences based on these features, and the labels are used to adjust the model's weights.

Training Sets

LLM Workflow

Building a Chatbot with OpenAI and a Vector Database

Fraud detection use-cases showcase several key advantages of the Featureform platform.

Fraud Detection and Featureform

The retrieval augmented generation workflow pulls information that’s relevant to the user’s query and feeds it into the LLM via the prompt. That information might be similar documents pulled from a vector database, or features looked up from an inference store.

Retrieval Augmented Generation (RAG) Workflow for Chatbots with Featureform

Featureform is designed around a Virtual Feature Store architecture, which manages metadata and orchestrates various infrastructure providers. This approach allows data scientists to interact with their data using the Featureform framework while ensuring that data continues to be stored and processed in a manner consistent with existing infrastructure. Featureform supports four primary provider abstractions: Offline Stores, Object/File Stores, Inference Stores, and Vector Databases.

Overview of Infrastructure Providers

Featureform's architecture is built upon a foundation of provider abstractions, which include Offline Stores, Object/File Stores, Inference Stores, and Vector Databases. Each of these providers adheres to a generic interface, allowing Featureform to seamlessly manage various types of infrastructure. This flexibility is achieved without the need for writing custom code for every specific use case, making it adaptable to heterogeneous infrastructure environments.

Extending Featureform with Custom Providers and Requesting New Providers

Local mode

This guide will walk through deploying Featureform on Kubernetes. The Featureform ingress currently supports AWS load balancers.

This quickstart will walk through creating a few simple features, labels, and a training set using Postgres and Redis. We will use a transaction fraud training set.

Azure

This quickstart will walk through creating a few simple features, labels, and a training set using BigQuery and Firestore. We will use a transaction fraud training set.

Google Cloud

Featureform can be configured to take periodic snapshots of itself that are backed up to your specified cloud storage. In case of an incident, this snapshot can be pulled and reloaded to restore Featureform to a previous state.

Backup and Restore

Community

GitHub

Python SDK

Schedule a Demo

In the Featureform ecosystem, our declarative API establishes a DAG that outlines the relationship between resources. Starting from primary sources, these resources undergo transformations and ultimately evolve into features and training sets. For enhanced visibility and insight, you can readily explore the lineage of features through both our dashboard and CLI interfaces.

Immutability, Lineage, and Directed Acyclic Graphs (DAGs)
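
To make the DAG idea concrete, here is a minimal sketch in plain Python. It does not use the Featureform API; the resource names and the `DEPENDENCIES` mapping are hypothetical, chosen only to show how lineage can be traced from a training set back to its primary source and how a valid build order falls out of the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical resources: each key depends on the resources in its set.
DEPENDENCIES = {
    "transactions": set(),                                  # primary source
    "avg_user_transaction": {"transactions"},               # transformation
    "avg_transaction_feature": {"avg_user_transaction"},    # feature
    "is_fraud_label": {"transactions"},                     # label
    "fraud_training_set": {"avg_transaction_feature", "is_fraud_label"},
}


def lineage(resource, dependencies):
    """Return every upstream resource that `resource` depends on."""
    seen = set()
    stack = [resource]
    while stack:
        for parent in dependencies[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_order(dependencies):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())
```

Calling `lineage("fraud_training_set", DEPENDENCIES)` returns the four upstream resources, and `build_order(DEPENDENCIES)` always places `transactions` before anything derived from it.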

Managing versioning is crucial for effective ML resource management. Featureform empowers you to implement versioning across your data sources, transformations, features, labels, and training sets. Each of these resources is inherently immutable by default, ensuring you can confidently utilize versioned resources created by others without the risk of disruption due to upstream modifications.

Versioning and Variants
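
The sketch below illustrates why immutable, versioned resources are safe to share. It is a toy model in plain Python, not Featureform's implementation: the `ImmutableRegistry` class, the variant names, and the SQL strings are all assumptions made for illustration.

```python
class ImmutableRegistry:
    """Toy registry where a (name, variant) pair can be defined only once."""

    def __init__(self):
        self._resources = {}

    def register(self, name, variant, definition):
        key = (name, variant)
        existing = self._resources.get(key)
        # Re-registering an identical definition is a no-op; changing it is an error.
        if existing is not None and existing != definition:
            raise ValueError(f"{name}:{variant} already exists and is immutable")
        self._resources[key] = definition
        return key

    def get(self, name, variant):
        return self._resources[(name, variant)]


V1_SQL = "SELECT user_id, AVG(amount) FROM transactions GROUP BY user_id"
V2_SQL = "SELECT user_id, AVG(amount) FROM recent_transactions GROUP BY user_id"

registry = ImmutableRegistry()
registry.register("avg_transaction", "v1", V1_SQL)
# A changed definition goes in as a new variant instead of overwriting v1.
registry.register("avg_transaction", "v2", V2_SQL)
```

Anything that depends on `avg_transaction:v1` keeps seeing the same definition, because the only way to change the logic is to publish a new variant.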

In Featureform, your ML data resources' metadata is stored comprehensively. Our metadata engine offers adaptability, enabling you to establish personalized tags and properties. These become particularly crucial when utilizing our Governance APIs within Featureform Enterprise.

Setting Custom Tags and Properties

Certain features necessitate continuous updates through a data stream, surpassing the capabilities of scheduled batch processing or triggered executions. *Featureform Enterprise* offers an API tailored for streaming feature values. This not only ensures real-time relevance but also retains historical values to build point-in-time correct training datasets.

Streaming Data: Real-time Updates

When it comes to working with data for machine learning, dataframes are ubiquitous. Featureform simplifies interaction with its sources and transformations, allowing you to fetch them into local memory as dataframes using the *client.dataframe()* API.

Exploring Resources with Dataframes

Data scientists have diverse preferences in tools, with some favoring SQL while others lean towards Dataframes. Featureform transformations also exhibit varying compatibility with each API, often influenced by underlying data infrastructure like Postgres, which may support only one of the two.

Dataframe and SQL Transformation Support

Large Language Models (LLMs) are pre-trained models that take a text prompt as input and generate a response based on the prompt.

The LLM Workflow with Featureform

Certain machine learning predictions rely on data available only at the time of the request. For instance, testing a user transaction for fraud might require data that's passed with the request and cannot be preprocessed. While stream processing offers near real-time features, it can lead to race conditions, potentially rendering the current data unavailable when you access features from the feature store. In such cases, an ideal approach is to compute the required feature at the moment of the request. To achieve this, Featureform exposes an On-Demand Feature API.

Calculating On-Demand Features at Request Time
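
As a rough illustration of the idea, the sketch below computes a feature at request time by combining data passed with the request and a precomputed value. It is plain Python and deliberately does not use the On-Demand Feature API; the function name, the request shape, and the `PRECOMPUTED_AVERAGES` lookup are hypothetical.

```python
# Hypothetical precomputed feature values, as an inference store might hold them.
PRECOMPUTED_AVERAGES = {"user_1": 50.0}


def amount_vs_average(request, precomputed):
    """Ratio of the incoming transaction amount to the user's stored average.

    Returns None when the user is unknown or the stored average is zero.
    """
    average = precomputed.get(request["user_id"])
    if not average:
        return None
    return request["amount"] / average


# The transaction amount only exists at request time, so it cannot be precomputed.
ratio = amount_vs_average({"user_id": "user_1", "amount": 200.0}, PRECOMPUTED_AVERAGES)
```

Because the ratio is computed from the request itself, it cannot be stale in the way a streamed value might be when the request arrives.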

Features and the transformations backing them often hold value that transcends individual models and teams. Featureform offers built-in search and discovery capabilities, accessible via various avenues:

Search and Discover Features and Transformations

In the realm of time-series data, it's a common scenario for feature values to evolve over time. For instance, in a fraud detection model, you might define a feature like user's average transaction amount based on a series of transactions from your users. This value will continuously change as new transactions pour in. A typical training set comprises a label (what the model aims to predict) and a set of features. Each row often represents a historical transaction. Therefore, it's crucial for the feature values in these rows to reflect their state at the time of the associated label. This concept is known as point-in-time correctness, where we need to obtain the historical values of features.

Achieving Point-in-Time Correctness and Handling Historical Features with Time Series Data
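
A point-in-time join can be sketched in a few lines of plain Python. This is an illustrative assumption, not how Featureform builds training sets: the `AVG_AMOUNT_HISTORY` and `LABELS` data, the integer timestamps, and the function names are all made up for the example. For each label, it picks the most recent feature value recorded at or before the label's timestamp.

```python
from bisect import bisect_right

# Hypothetical feature history: (timestamp, value) pairs sorted by timestamp.
AVG_AMOUNT_HISTORY = {
    "user_1": [(1, 10.0), (5, 20.0), (9, 35.0)],
}

# Hypothetical labels: (entity, timestamp, label).
LABELS = [("user_1", 4, False), ("user_1", 6, True), ("user_1", 0, False)]


def value_as_of(history, timestamp):
    """Return the latest value recorded at or before `timestamp`, else None."""
    times = [t for t, _ in history]
    index = bisect_right(times, timestamp)
    return history[index - 1][1] if index else None


def point_in_time_rows(labels, feature_history):
    """Pair each label with the feature value that was current at label time."""
    rows = []
    for entity, timestamp, label in labels:
        value = value_as_of(feature_history.get(entity, []), timestamp)
        rows.append((value, label))
    return rows
```

The label at timestamp 6 is paired with the value written at timestamp 5, not the later value from timestamp 9, which avoids leaking future information into the training row. A label that predates all feature values gets `None`.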

Featureform Enterprise includes governance, access controls, and audit logs. Many machine learning models, such as fraud detection and recommender systems, rely on features that are derived from personally identifiable information (PII). Establishing and enforcing robust policies, backed by comprehensive audit logs, is therefore imperative for your organization to maintain compliance.

Governance and Access Control: Ensuring Compliance

At Featureform, we are deeply committed to delivering an exceptional standalone experience with our open-source product. However, as a business, we adhere to an open-core philosophy. Since the inception of our open-source offering, we've been transparent about the two primary distinctions between the open-source version and Featureform Enterprise: **streaming data support** and **governance, ACLs, and audit logs**. All other features we charge for fall under the purview of SLAs, cloud hosting, and dedicated support.

Open-Source vs. Enterprise Featureform: What Sets Them Apart

Overview

We often want to keep our Features and Training Sets up to date with the latest data. Featureform offers the ability to schedule and run updates for Transformations, Features, Labels, and Training Sets.

Scheduling Resources

Once you've created your primary data sets, you can define features, labels, and training sets based on them.

Register and Serve Features, Labels, and Training Sets

Model to Feature Lineage

The Offline Store provider is a versatile component within Featureform, serving multiple key functions. It plays a central role in running transformations and storing data. Due to Featureform's virtual architecture, you can expect similar performance and cost characteristics from the Offline Store as you would without Featureform, making it a nearly seamless abstraction layer.

Offline Store

Featureform supports [Snowflake](https://www.snowflake.com/) as an Offline Store.

Snowflake

Spark

Featureform supports [Databricks](https://www.databricks.com) as an Offline Store.

Spark with Databricks

Featureform supports [Spark on AWS](https://aws.amazon.com/emr/features/spark/) as an Offline Store.

Spark with EMR

Featureform supports [Kubernetes](https://kubernetes.io/) as an Offline Store.

Featureform supports [BigQuery](https://cloud.google.com/bigquery) as an Offline Store.

BigQuery

Featureform supports [Postgres](https://www.postgresql.org/) as an Offline Store.

Postgres

Featureform supports [Redshift](https://aws.amazon.com/redshift/) as an Offline Store.

Redshift

Featureform supports [ClickHouse](https://clickhouse.com/) as an Offline Store.

ClickHouse

Object and File Stores serve as fundamental components within the Featureform framework, particularly in the context of ETL-based offline stores like Spark and Pandas on K8s. These stores fulfill two primary functions within Featureform:

Object and File Stores

Featureform supports [AWS S3](https://aws.amazon.com/s3/) as a [File Store](/providers/object-and-file-stores).

Featureform supports [Azure Blob Store](https://azure.microsoft.com/en-us/products/storage/blobs/) as a [File Store](/providers/object-and-file-stores).

Azure Blobs

Featureform supports [Google Cloud Storage (GCS)](https://cloud.google.com/storage) as a [File Store](/providers/object-and-file-stores).

Google Cloud (GCS)

HDFS

Featureform supports [Cassandra](https://cassandra.apache.org/%5F/index.html) as an Inference Store.

Cassandra

Featureform supports [DynamoDB](https://aws.amazon.com/dynamodb/) as an Inference Store.

DynamoDB

Featureform supports [Firestore](https://firebase.google.com/docs/firestore) as an Inference Store.

Firestore

Featureform supports [MongoDB](https://www.mongodb.com/) as an Inference Store.

MongoDB

Featureform supports [Redis](https://redis.io/) as an Inference Store.

Redis

A Vector Database provider is designed to facilitate nearest neighbor lookups. It shares several similarities with an inference store but is distinguished by its support for the `client.nearest` API. Configuration is typically done when registering an [embedding](../abstractions/embedding.md) associated with an [entity](../abstractions/entity.md). This setup enables efficient retrieval of nearest neighbors based on feature vectors.

Vector Database
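
To show what a nearest neighbor lookup does conceptually, here is a brute-force sketch in plain Python. It is not the `client.nearest` API and makes no claim about its signature; the `nearest` function, the `EMBEDDINGS` data, and the choice of cosine similarity are assumptions for illustration. A real vector database would use an index instead of scanning every vector.

```python
import math

# Hypothetical embeddings keyed by entity.
EMBEDDINGS = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(embeddings, query, k):
    """Return the k entity keys whose vectors are most similar to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda entity: cosine_similarity(embeddings[entity], query),
        reverse=True,
    )
    return ranked[:k]
```

With the sample data, `nearest(EMBEDDINGS, [1.0, 0.0], 2)` ranks `doc_a` first and `doc_b` second, since both point in nearly the same direction as the query.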

**The RediSearch module is required to use Redis as a Vector DB.**

Featureform supports [Pinecone](https://pinecone.io/) as a [Vector DB](/providers/vector-db).

Pinecone

Featureform supports [Weaviate](https://weaviate.io/) as a [Vector DB](/providers/vector-db).

Overview

Using Featureform

Featureform Resource Types

LLMs, Embeddings, and Vector Databases

Common Use Cases and Examples

Supported Infrastructure Providers

Deployment
