Understanding The Different Flavors of Spark
Spark is a powerful, open-source, general-purpose computing framework built for large-scale data processing. Databricks is a managed data and analytics platform built on top of Spark. Spark can be self-hosted on Kubernetes or other non-cloud infrastructure, or run as a hosted service with popular cloud providers such as AWS, Azure, and GCP, as well as on Databricks.

Category | Featureform Supported |
---|---|
Types of Spark | Spark, Databricks |
Specific Cloud Offerings | EMR |
File stores | S3, GCS, Azure Blobs, HDFS |
Transforming Data
Transformation Sources
Using Spark and a file store (GCS, Azure Blob Storage, S3, HDFS) as an Offline Store, you can define new transformations via SQL and Spark DataFrames. Starting from either these transformations or pre-existing files in your file store, you can chain transformations and register columns in the resulting tables as new features and labels.

Training Sets and Inference Store Materialization
Any column in a pre-existing table or user-created transformation can be registered as a feature or label. As with any other Offline Store, these features and labels can then be used to create training sets and to serve features for inference.

DataFrame Transformations
Using Spark with Featureform, you can define transformations in SQL, as with other offline providers.
spark_quickstart.py
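The sketch below is not the contents of `spark_quickstart.py` and does not use Spark or the Featureform API. It is a minimal, hedged illustration of the idea described above: one SQL transformation reads a source table, a second SQL transformation is chained onto the first, and a column of the final result is treated as a feature keyed by an entity. Python's built-in `sqlite3` module stands in for Spark SQL so the example runs anywhere. The table `transactions`, its columns, the view names, and the helper `build_features` are all hypothetical.

```python
import sqlite3

# Hypothetical source rows (customer_id, amount), standing in for a file
# that would normally live in the file store.
ROWS = [("c1", 10.0), ("c1", 30.0), ("c2", 5.0)]

# First transformation: per-customer aggregates over the source table.
CUSTOMER_TOTALS_SQL = """
CREATE VIEW customer_totals AS
SELECT customer_id,
       COUNT(*)    AS n_transactions,
       SUM(amount) AS total_amount
FROM transactions
GROUP BY customer_id
"""

# Second transformation, chained onto the first: it reads from
# customer_totals rather than from the raw table.
CUSTOMER_AVG_SQL = """
CREATE VIEW customer_avg AS
SELECT customer_id,
       total_amount / n_transactions AS avg_transaction
FROM customer_totals
"""


def build_features(rows):
    """Run both transformations and return {customer_id: avg_transaction}."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE transactions (customer_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
        conn.execute(CUSTOMER_TOTALS_SQL)
        conn.execute(CUSTOMER_AVG_SQL)
        cursor = conn.execute(
            "SELECT customer_id, avg_transaction "
            "FROM customer_avg ORDER BY customer_id"
        )
        # The avg_transaction column plays the role of a registered feature,
        # with customer_id as its entity key.
        return dict(cursor.fetchall())
    finally:
        conn.close()


features = build_features(ROWS)
print(features)
```

In a real deployment the same two queries would be expressed as transformations on the Spark provider, and the `avg_transaction` column would be registered as a feature instead of being collected into a dictionary.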