Feature Store as a Service (FSaaS) with Data Virtualization
Lyftrondata
Go from data siloes and data mess into analysis-ready data in minutes without any engineering.
Discover why Data Virtualization is a top choice for Feature Store as a Service (FSaaS) and learn about Feature Store, a new idea in AI and advanced analytics.
Feature stores are a hot topic in today's AI and advanced analytics. Many companies are actively researching the topic and developing products and solutions to meet its requirements. Let's first review what a feature is, and then why data virtualization works well for FSaaS.
Feature
A feature is simply a type of data constructed from raw data or from existing features; the term is just more commonly used in AI and advanced analytics. New types of data have long been produced through integration, ETL, RPA, and other technologies, but features are generally understood to be data produced by feature engineering processes and intended for use by AI services.
Examining some actual feature implementations reveals that most features are not very complicated or AI-specific and that they are frequently applicable to integration, analytics, and reporting.
No matter how sophisticated the feature engineering is, a feature can be an enterprise data asset with potential uses beyond artificial intelligence. If it is an enterprise asset, it requires a centralized repository, a data catalog, governance, and the ability to be shared with authorized individuals or systems.
Let's now examine a few feature-generating examples:
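As a minimal sketch of what generating features from raw data can look like (the records, field names, and the `build_features` helper are all hypothetical and chosen only for illustration), the following derives an age, an order count, and an average order value from raw customer records. None of these features is AI-specific; they would serve reporting or integration equally well.

```python
from datetime import date

# Hypothetical raw customer records (illustrative data only).
raw_customers = [
    {"customer_id": 1, "birth_date": date(1990, 5, 17), "orders": [120.0, 80.0]},
    {"customer_id": 2, "birth_date": date(1975, 11, 2), "orders": []},
]


def build_features(record, as_of):
    """Derive simple features from one raw record."""
    birth = record["birth_date"]
    # Age in whole years: subtract one if the birthday has not occurred yet.
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    age = as_of.year - birth.year - (0 if had_birthday else 1)
    orders = record["orders"]
    return {
        "customer_id": record["customer_id"],
        "age": age,
        "order_count": len(orders),
        "avg_order_value": sum(orders) / len(orders) if orders else 0.0,
    }


features = [build_features(r, as_of=date(2024, 1, 1)) for r in raw_customers]
```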
Further details regarding the feature
The goal of the feature
The rationale behind a feature's generation is crucial since it clarifies where the feature falls on the feature spectrum (see "Spectrum of features" below). A feature is generated for one of the following purposes:
Data enrichment
Data enrichment is a feature generation process that enriches a baseline data asset.
Reports & Analytics
Analytics and reporting features include summary tables, snapshot tables, materialized views, semantic views, and so forth. In analytics, snapshot tables are versions of the same feature, distinguished by the snapshot's DateTime dimension.
Integration
Systems can be integrated via streaming, APIs, ODBC/JDBC connections, interface files, etc. In most of these techniques, the integrating component is data that is exchanged between the parties.
AI
Three key needs drive the demand for features in AI: training, scoring, and algorithm accuracy.
Training
AI model training frequently requires a large amount of historical data, which may be stale. AI training features are also called offline (cold) features.
Scoring
AI scoring requires smaller but more current data. AI scoring features are also called online (hot) features.
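The difference between the two can be sketched in a few lines. This is only an illustration under assumed names (`feature_rows`, `offline_view`, and `online_view` are hypothetical, and the rows are made up): the offline view keeps the full history for training, while the online view keeps only the latest value per entity for scoring.

```python
# Hypothetical feature rows: (entity_id, event_time, value).
# ISO-8601 timestamps in the same format sort correctly as plain strings.
feature_rows = [
    ("cust-1", "2024-01-01T10:00:00", 3),
    ("cust-1", "2024-03-01T10:00:00", 5),
    ("cust-2", "2024-02-15T09:30:00", 1),
]


def offline_view(rows):
    """Offline (cold) view: the full history, ordered by time, for training."""
    return sorted(rows, key=lambda r: r[1])


def online_view(rows):
    """Online (hot) view: only the most recent value per entity, for scoring."""
    latest = {}
    for entity_id, _event_time, value in offline_view(rows):
        latest[entity_id] = value  # later rows overwrite earlier ones
    return latest
```

In this sketch both views are computed from the same rows; a real feature store would typically keep them in different storage tuned for volume and latency respectively.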
Accuracy
Common ways to improve the accuracy of AI models and algorithms include increasing the number of iterations, automatically creating new features (columns and/or cells) from existing features, and trying new algorithms with fewer errors. Many AI products on the market today generate these features automatically. Reference: Python-Based Automated Feature Engineering.
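To make "automatically creating new features from existing features" concrete, here is a deliberately naive sketch. It is not the method of any particular product or of the referenced article; the `generate_features` helper and the sample row are assumptions for illustration. It simply combines every pair of numeric columns into a product and a difference.

```python
from itertools import combinations


def generate_features(row):
    """Return the row plus pairwise product and difference features."""
    out = dict(row)
    # Sort the column names so the generated feature names are deterministic.
    for a, b in combinations(sorted(row), 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
        out[f"{a}_minus_{b}"] = row[a] - row[b]
    return out


# Hypothetical numeric row (illustrative values only).
row = {"income": 5000.0, "debt": 1200.0, "age": 40.0}
expanded = generate_features(row)
```

Real automated feature engineering tools are far more selective, since most generated columns carry no signal; the point here is only that the new columns are themselves features that need storing and cataloging.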
Spectrum of features
The new type of data (feature) could be a modification of the current data, an analytical or reporting object, or an AI-specific object, depending on what we do with the existing data or features and what we build. When we merely add a column to existing data, the result is still basic data. When data is de-normalized, it becomes an object that suits reporting, analytics, and integration use cases; when a vector object is generated, it is useful only for AI use cases. Here's an illustration of a feature spectrum.
Feature storage
A feature may have a data type that differs from that of the raw data or features it was derived from. For example, we can take an audio file and use an AI cognitive service to create an audio transcript feature. The source is an unstructured blob, while the new feature (the transcript) is in a semi-structured No-SQL format (I chose No-SQL, though someone could simply drop it into a text file!). We may then create a "Bag of Words" feature from the transcript; this feature is closer to JSON or key-value data. Lastly, a bag of words can also be represented as a matrix feature kept in an in-memory multidimensional array.
Depending on the type of data it contains, a feature may therefore require a different kind of storage than the raw data or features it came from. An illustration of features, their storage formats, and the available storage technologies is provided below. Remember that features do not have fixed data types or storage formats. For instance, one user may store a transcript as a text blob in AWS S3 while another stores it as No-SQL data in MongoDB; this is very much an individual design choice.
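The transcript, bag-of-words, and matrix chain described above can be sketched with the Python standard library. The transcripts here are hypothetical stand-ins for the output of a speech-to-text service, and the tokenization is intentionally simplistic.

```python
import re
from collections import Counter

# Hypothetical transcripts, standing in for the output of a cognitive service.
transcripts = {
    "call-1": "The order was late and the customer was unhappy",
    "call-2": "The customer placed a new order",
}


def bag_of_words(text):
    """Key-value form of the feature: lowercase word -> count."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


bags = {doc_id: bag_of_words(text) for doc_id, text in transcripts.items()}

# Matrix form of the same feature: one row per transcript,
# one column per vocabulary word (a Counter returns 0 for missing words).
vocabulary = sorted(set().union(*bags.values()))
matrix = [[bags[doc_id][word] for word in vocabulary] for doc_id in transcripts]
```

The same information appears three times in three shapes (text, key-value, matrix), which is why each form may end up in a different kind of storage.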
Feature engineering tools
We may choose different feature engineering tools depending on where a feature sits on the spectrum. Data-oriented features are frequently generated with tools that support T-SQL, while AI features are typically generated with advanced AI services or AI scripts (Python, R, etc.). Analytical features fall in the middle: they may need only T-SQL, only a programming tool, or both. As a result, Feature Store as a Service (FSaaS) needs to support both data engineering and feature engineering tools.
Feature configuration
Other important characteristics of a feature include its size, update frequency, and versioning. For instance, AI deep learning requires large amounts of data, but most of the time the data doesn't have to be fresh. When integrating systems via APIs, we need smaller, more precise, and more recent data. For stream analytics, we are dealing with only the most recent data frame. Features also need to be versioned, because they change over time along with the organization, for two reasons:
A) Modifications to a feature's logic or schema; and
B) Changes to the data that a feature uses and exposes.
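One way to keep the two kinds of change apart is to version them separately. The sketch below is an assumption-laden illustration, not a description of any existing feature store: `FeatureVersion`, `FeatureRegistry`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class FeatureVersion:
    name: str
    logic_version: int  # bumped when the feature's logic or schema changes (A)
    data_version: str   # identifies the data snapshot the feature exposes (B)


class FeatureRegistry:
    """In-memory registry keyed by name, logic version, and data version."""

    def __init__(self) -> None:
        self._store: Dict[FeatureVersion, Any] = {}

    def publish(self, version: FeatureVersion, values: Any) -> None:
        self._store[version] = values

    def get(self, name: str, logic_version: int, data_version: str) -> Any:
        return self._store[FeatureVersion(name, logic_version, data_version)]

    def versions(self, name: str):
        """List (logic_version, data_version) pairs published for a feature."""
        return sorted(
            (v.logic_version, v.data_version) for v in self._store if v.name == name
        )


registry = FeatureRegistry()
# Same logic, new data (case B).
registry.publish(FeatureVersion("avg_order_value", 1, "2024-01"), {"cust-1": 100.0})
registry.publish(FeatureVersion("avg_order_value", 1, "2024-02"), {"cust-1": 110.0})
# New logic over the same data (case A).
registry.publish(FeatureVersion("avg_order_value", 2, "2024-02"), {"cust-1": 95.0})
```

Keeping both identifiers lets a consumer reproduce a training run by pinning the exact logic and data it used.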
What the scenarios above teach us is:
A feature might be a common enterprise data asset and is not always an AI object.
It is preferable to separate feature stores from storage technologies and feature engineering tools.
Feature stores are better equipped to provide capabilities such as configuration, governance, security, and sharing.
Data virtualization
Given the FSaaS requirements above, a data virtualization platform is clearly one of the best options. The following summarizes the capabilities of a data virtualization platform that help us evaluate it for FSaaS:
Decoupling from storage
Data virtualization creates virtual data objects (VDOs) from any data source, anywhere in the world, and it doesn't depend on the source or destination storage technology.
Decoupling from tools
Data virtualization is essentially a No-ETL concept that is decoupled from tooling: VDOs can be accessed and used from any programming language, script, or tool (T-SQL, ETL tools, visualization tools, AI platforms) through the virtualization platform, APIs, or ODBC.
Data hub and sharing
Data virtualization is used to build true data hubs, in which any VDO can be accessed only by authorized users and systems.
Governance and security
Every VDO is secured and governed through a single process.
Data catalog
Because VDOs are logical, the data catalog and metadata are generated automatically whenever a VDO is created or changed.
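The capabilities listed so far can be pictured with a toy model. This is a conceptual sketch only and does not reflect the API of any data virtualization product; `VirtualDataObject`, `Catalog`, `crm_rows`, and the role names are hypothetical. The VDO stores a definition rather than a copy of the data, checks authorization on every read, and produces a catalog entry when it is registered.

```python
class VirtualDataObject:
    """A logical object: it stores a definition, not a copy of the data."""

    def __init__(self, name, source, transform, allowed_roles):
        self.name = name
        self.source = source        # callable returning rows from any backend
        self.transform = transform  # callable applied to each row at read time
        self.allowed_roles = set(allowed_roles)

    def read(self, role):
        # Single place where access is enforced for every consumer.
        if role not in self.allowed_roles:
            raise PermissionError(f"{role} may not read {self.name}")
        return [self.transform(row) for row in self.source()]


class Catalog:
    def __init__(self):
        self.entries = {}

    def register(self, vdo):
        # Metadata is derived from the definition when the VDO is registered.
        self.entries[vdo.name] = {
            "source": vdo.source.__name__,
            "roles": sorted(vdo.allowed_roles),
        }
        return vdo


def crm_rows():
    """Hypothetical source; in practice this could be any database or API."""
    return [{"id": 1, "first": "Ada", "last": "Lovelace"}]


catalog = Catalog()
full_name = catalog.register(
    VirtualDataObject(
        "customer_full_name",
        crm_rows,
        lambda r: {"id": r["id"], "full_name": f"{r['first']} {r['last']}"},
        allowed_roles=["analyst"],
    )
)
```

Swapping `crm_rows` for a different backend would not change how consumers read the feature, which is the decoupling the list above describes.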
And more
For a fairer comparison, we consider two more options: FEAST, a feature store solution, and Snowflake, a generic data and analytics platform that can be used as one.
FEAST
FEAST is one of the best feature store products out there. We can see that:
It focuses on AI.
It supports only a limited range of data types and uses local storage, which raises data security and governance concerns.
It lacks enterprise data platform capabilities, which limits its enterprise use.
Snowflake
Snowflake works with DataRobot to provide auto-AI features; nonetheless, as an FSaaS, Snowflake has the following drawbacks:
It cannot produce cross-platform features since it is a centralized data store rather than a distributed platform.
It is not a platform that is independent of cloud providers; rather, its engine is tied to one of them.
Here is an FSaaS comparison among Data Virtualization, typical data platforms, and some of the existing feature store solutions: