Feature Store as a Service (FSaaS) with Data Virtualization

Lyftrondata

Go from data siloes and data mess into analysis-ready data in minutes without any engineering.

Published: October 8, 2024

Feature Store as a Service

Discover why Data Virtualization is a top choice for Feature Store as a Service (FSaaS) and learn about Feature Store, a new idea in AI and advanced analytics.

In today's AI and Advanced Analytics, feature stores are a hot topic. Many companies are actively researching this topic and developing products and solutions to meet the requirements. Let's first go over what a feature is, and then why data virtualization works well with FSaaS.

Characteristic

A feature is simply a type of data that is constructed from raw data or existing feature(s), although the term "feature" is more frequently used in AI and advanced analytics. New types of data were previously produced via integration, ETL, RPA, and other technologies, but features are generally understood to be data types produced by feature engineering procedures and intended for use by AI services.

Examining some actual feature implementations reveals that most features are not very complicated or AI-specific and that they are frequently applicable to integration, analytics, and reporting.

No matter how sophisticated the feature engineering is, a feature can be an enterprise data asset with potential uses beyond artificial intelligence. If it proves to be an enterprise asset, it still requires a centralized repository, a data catalog, governance, and the ability to be shared with authorized individuals or systems.

Let's now examine a few feature-generating examples:

Adding new columns, or changing or deleting existing columns, in raw data or in feature(s) that already exist
Transforming the current raw data, for example through feature normalization, feature scaling, dimension reduction, or the handling of outliers and anomalies
Using the current raw data/feature(s) to create new tabular objects (tables, views, materialized views, CSV, etc.), new unstructured objects (blobs, texts, JSON, etc.), and new semi-structured objects (No-SQL, graphs, etc.)
Generating AI algorithm objects, including auto-generated features, matrices, and vectors
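To make the first two examples concrete, here is a minimal sketch in plain Python. The records, the column names (`order_total`, `items`), and the helper functions are assumptions made up for illustration; they are not taken from any particular product.

```python
# Hypothetical raw records; the column names are illustrative assumptions.
raw_rows = [
    {"customer_id": 1, "order_total": 20.0, "items": 2},
    {"customer_id": 2, "order_total": 50.0, "items": 5},
    {"customer_id": 3, "order_total": 80.0, "items": 4},
]


def add_derived_column(rows):
    """Add a new column (average price per item) computed from existing columns."""
    return [{**r, "avg_item_price": r["order_total"] / r["items"]} for r in rows]


def min_max_scale(rows, column):
    """Add a copy of `column` rescaled onto [0, 1] (simple feature scaling)."""
    values = [r[column] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        # Guard against a constant column, where the span would be zero.
        {**r, f"{column}_scaled": (r[column] - lo) / span if span else 0.0}
        for r in rows
    ]


# The raw rows stay untouched; each step returns new rows with an extra feature.
features = min_max_scale(add_derived_column(raw_rows), "order_total")
```

Neither step is AI-specific, which is the point made above: the same derived and scaled columns would be just as useful in a report.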

Further details regarding the feature

The goal of the feature

The rationale behind a feature's generation is crucial since it clarifies where the feature falls on the feature spectrum (see the following item). The following objectives lead to the generation of a feature:

Enhancement of data

A baseline data asset is created through the feature generation process known as data enrichment.

Reports & Analytics

Among the analytics and reporting features are summary tables, snapshot tables, materialized views, semantic views, and so forth. Snapshot tables are variations of a feature in analytics that are determined by the DateTime dimension of the snapshot.

Integration

Systems can be integrated via streaming, APIs, ODBC/JDBC connections, interface files, etc. In the majority of these techniques the integrating component is data, which is frequently transferred between parties.

AI

Three key factors drive the demand for features in AI: training, scoring, and accuracy.

Training

AI model training frequently requires a large amount of historical data, which does not need to be current. AI training features are also called offline (cold) features.

Scoring

AI scoring requires smaller but more current data. AI scoring features are also called online (hot) features.
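The difference between the two can be sketched in a few lines of Python. The event log, the entity ids, and the function names below are hypothetical; the sketch only illustrates that a training view keeps the history while a scoring view keeps the latest value per entity.

```python
from datetime import datetime

# Hypothetical event log: (entity_id, event_time, feature_value).
events = [
    ("u1", datetime(2024, 1, 1), 3.0),
    ("u1", datetime(2024, 6, 1), 5.0),
    ("u2", datetime(2024, 3, 1), 7.0),
    ("u2", datetime(2024, 9, 1), 1.0),
]


def offline_features(log, as_of):
    """Training (cold) view: every historical row up to and including `as_of`."""
    return [row for row in log if row[1] <= as_of]


def online_features(log):
    """Scoring (hot) view: only the most recent value for each entity."""
    latest = {}
    for entity, ts, value in log:
        if entity not in latest or ts > latest[entity][0]:
            latest[entity] = (ts, value)
    return {entity: value for entity, (_, value) in latest.items()}


training_set = offline_features(events, as_of=datetime(2024, 6, 30))
serving_view = online_features(events)
```

Here `training_set` holds three historical rows, while `serving_view` holds one current value each for `u1` and `u2`.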

Accuracy

Common ways to improve the accuracy of AI models and algorithms are increasing the number of iterations, automatically creating new features (columns and/or cells) from preexisting features, and trying new algorithms with fewer errors. Many new and current AI products on the market today produce these features automatically. Citation: Python-Based Automated Feature Engineering.
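As a rough idea of what "automatically creating new features from preexisting features" can mean, the sketch below derives a product feature for every pair of numeric columns. It is a deliberately simple stand-in written for this article, not the method of the cited work or of any specific product; the function name and the sample row are assumptions.

```python
from itertools import combinations


def generate_interaction_features(row):
    """Return a copy of `row` with one product feature per pair of numeric columns."""
    numeric = {
        key: value
        for key, value in row.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
    expanded = dict(row)
    for a, b in combinations(sorted(numeric), 2):
        expanded[f"{a}_x_{b}"] = numeric[a] * numeric[b]
    return expanded


row = {"age": 30, "income": 2.0, "visits": 4, "segment": "retail"}
expanded = generate_interaction_features(row)
```

Three numeric columns yield three new candidate features (`age_x_income`, `age_x_visits`, `income_x_visits`); a real tool would also score and prune such candidates.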

Spectrum of features

The new type of data (feature) could be an alteration to the current data, an analytical or reporting object, or an AI-related object, depending on what we do with the existing data/features and what features we develop. When we merely add a column to already-existing data, the result is still basic data. When data is de-normalized, it becomes an object that works well for reporting, analytics, and integration use cases; when a vector object is generated, however, it is only helpful for AI use cases. Here's an illustration of a feature spectrum.

Storage features

A feature may have a data type that differs from the data type(s) of the raw data or feature(s) from which it originally came. As an illustration, we take an audio file and use an AI cognitive service to create an audio transcript feature. The new feature (transcript) now has a semi-structured No-SQL format instead of the unstructured blob format of the original audio (I chose No-SQL because it allows someone to simply drop it into a text file!).

Additionally, we may create a "Bag of Words" feature based on the preceding transcript; this feature is closer to a JSON or Key-Value feature. Lastly, a bag of words feature can also be represented as a matrix feature that is kept in an in-memory multidimensional array.

Depending on the type of data it contains, a feature may therefore require a different sort of storage than the raw data or feature(s) it was built from. An illustration of features, their forms of storage, and the available storage technology is provided below. Remember that features do not have set data types or storage formats. For instance, one user can store a transcript as a text blob in AWS S3, while another stores it as No-SQL data in MongoDB; this is a rather individualized design.
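The transcript example can be sketched with the Python standard library. The sample transcripts and the tokenization rule are assumptions for illustration; the point is that the same feature appears first as key-value data and then as a matrix.

```python
import re
from collections import Counter


def bag_of_words(transcript):
    """Key-value form: map each lowercase word to its count in the transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    return Counter(tokens)


def to_matrix(bags):
    """Matrix form: one row per transcript, one column per vocabulary word."""
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag.get(word, 0) for word in vocabulary] for bag in bags]
    return vocabulary, matrix


# Hypothetical transcripts, standing in for the output of a speech-to-text service.
transcripts = ["The store is open. The store is new.", "Open the feature store"]
bags = [bag_of_words(t) for t in transcripts]
vocabulary, matrix = to_matrix(bags)
```

The `bags` would fit naturally in a JSON or key-value store, while `matrix` is the in-memory array form an algorithm would consume.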

Tool for feature engineering

We may select distinct feature engineering techniques based on the feature spectrum. While T-SQL-supported tools are frequently used to generate data-oriented features, advanced AI services or AI scripts (such as Python, R, etc.) are typically used to generate AI features. Analytical features fall in the middle and may only require a programming tool or T-SQL, or they may require both. As a result, data engineering and feature engineering tools need to be supported by Feature Store as a Service (FSaaS).

Configuring features

Other important properties of a feature are its size, frequency of updates, and variations. For instance, AI deep learning requires large amounts of data, but the data doesn't have to be new (most of the time). When integrating systems via APIs, we require more precise, more recent, and smaller data. For stream analytics, for instance, we are discussing only the most recent data frame. In addition, features change over time with the organization, and there are two reasons to version them:

A) Modifications to a feature's logic or schema; and

B) Data that a feature uses and exposes.
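One way to picture the two versioning reasons is a registry that tracks them as separate counters. This is a minimal sketch under assumed names (`FeatureVersion`, `FeatureRegistry`); resetting the data version whenever the logic changes is a design choice of this example, not a rule from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureVersion:
    name: str
    logic_version: int  # bumped when the feature's logic or schema changes (A)
    data_version: int   # bumped when the data the feature exposes changes (B)


class FeatureRegistry:
    def __init__(self):
        self._history = {}  # feature name -> list of versions, oldest first

    def register(self, name):
        first = FeatureVersion(name, logic_version=1, data_version=1)
        self._history[name] = [first]
        return first

    def latest(self, name):
        return self._history[name][-1]

    def change_logic(self, name):
        """Case A: new logic or schema, so data versioning restarts at 1."""
        current = self.latest(name)
        new = FeatureVersion(name, current.logic_version + 1, 1)
        self._history[name].append(new)
        return new

    def refresh_data(self, name):
        """Case B: same logic, newer underlying data."""
        current = self.latest(name)
        new = FeatureVersion(name, current.logic_version, current.data_version + 1)
        self._history[name].append(new)
        return new

    def history(self, name):
        return list(self._history[name])


registry = FeatureRegistry()
registry.register("customer_lifetime_value")
registry.refresh_data("customer_lifetime_value")
registry.change_logic("customer_lifetime_value")
```

Keeping the full history lets a consumer pin a model to the exact feature version it was trained on.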

What each of the aforementioned scenarios teaches us is

A feature might be a common enterprise data asset and is not always an AI object.

It is preferable to separate feature stores from storage technologies and feature engineering tools.

Feature stores are better equipped to provide capabilities such as configuration, governance, security, and sharing.

Data virtualization

The FSaaS requirements above make it clear that a data virtualization platform is one of the best options. The following is a summary of the capabilities of a Data Virtualization platform that help us evaluate it for FSaaS:

Disconnection from storage

Data virtualization creates virtual data objects (VDOs) from any source of data, anywhere in the world, and it doesn't depend on the source or destination storage technology.

Decoupling of tools

Data virtualization is essentially a No-ETL concept that is decoupled from tooling: a VDO can be accessed and used from any programming language, script, or tool (T-SQL, ETL tools, visualization tools, AI platforms) via the virtualization platform, APIs, or ODBC.

The hub for data and sharing

Data virtualization is used to construct real data hubs, and only authorized users and systems can access any VDO.

Governance and security

Every VDO is protected and supervised by a single process.

Databank

Because VDOs are logical, the data catalog and metadata are generated automatically whenever a VDO is created or changed.
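The properties listed above can be pictured with a toy model of a VDO. Everything here (the class names, the role names, the in-memory `orders` source) is a hypothetical sketch for illustration and does not describe how any data virtualization product is implemented.

```python
from datetime import datetime, timezone


class VirtualDataObject:
    """A logical object: it stores how to read and shape data, not a copy of it."""

    def __init__(self, name, read_source, transform, allowed_roles):
        self.name = name
        self._read_source = read_source
        self._transform = transform
        self._allowed_roles = frozenset(allowed_roles)

    def read(self, role):
        # Single access path, so one check governs every consumer.
        if role not in self._allowed_roles:
            raise PermissionError(f"role {role!r} may not read {self.name!r}")
        return [self._transform(row) for row in self._read_source()]


class Catalog:
    def __init__(self):
        self.entries = {}

    def create_vdo(self, name, read_source, transform, allowed_roles):
        """Create a VDO and record its catalog entry in the same step."""
        vdo = VirtualDataObject(name, read_source, transform, allowed_roles)
        self.entries[name] = {
            "allowed_roles": sorted(allowed_roles),
            "registered_at": datetime.now(timezone.utc),
        }
        return vdo


# Hypothetical source system, represented here by an in-memory list.
orders = [{"id": 1, "total": 20.0}, {"id": 2, "total": 50.0}]

catalog = Catalog()
order_flags = catalog.create_vdo(
    "order_flags",
    read_source=lambda: orders,
    transform=lambda row: {"id": row["id"], "is_large": row["total"] >= 40.0},
    allowed_roles={"analyst", "ml_service"},
)
```

Because nothing is copied, a row added to `orders` shows up in the next `order_flags.read("analyst")` call, and a role outside the allowed set gets a `PermissionError`.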

Plus additional

We have two more options for a more accurate comparison: FEAST is a feature store solution, and Snowflake is a generic data/analytic tool that can be used as one.

FEAST

FEAST is one of the best feature store products out there, and we can see that:

It focuses on AI.

It has local storage and supports only a limited range of data types, so data security and governance are a concern.

It lacks enterprise data platform capabilities, and its features for enterprise use are limited.

Snowflake

Snowflake uses DataRobot to provide auto-AI features; nonetheless, as an FSaaS, Snowflake has the following drawbacks:

It cannot produce cross-platform features since it is a centralized data store rather than a distributed platform.

It is not a platform that is independent of cloud providers; rather, its engine is tied to one of them.

Here is an FSaaS comparison among Data Virtualization, typical data platforms, and some of the existing feature store solutions:
