Advancing Feature Engineering for Structured Data Beyond Generative AI
As data scientists, we continuously seek tools that not only streamline our workflows but also push the boundaries of what's achievable with data. While Generative AI models like Large Language Models (LLMs) have made significant strides in handling unstructured data such as text and images, they often hit a wall when dealing with structured, relational, and time-series data. This limitation becomes particularly apparent in feature engineering for predictive modeling.
In this post, we'll delve deeper into why LLMs struggle with feature engineering on structured data and how getML addresses these challenges by generalizing gradient boosting to multi-relational decision trees, effectively bringing supervised learning directly to raw relational data.
The Limitations of LLMs in Feature Engineering
LLMs are trained on vast amounts of unstructured data, allowing them to generate coherent text and perform tasks like translation, summarization, and question-answering. However, they are not inherently designed to understand the complexities of structured data, especially when it involves multiple related tables and temporal relationships.
Consider the task of feature engineering in a relational database. LLMs can provide generic suggestions based on common patterns, but they lack the capability to perform data-specific, supervised feature learning. They cannot access your dataset to understand distributions, relationships, or the target variable's influence. This gap makes them insufficient for advanced feature engineering tasks required in predictive modeling with structured data.
A Practical Example: Predicting Customer Churn
Let's consider predicting customer churn for AdventureWorks, a fictional bicycle company. The company uses a relational database that includes multiple tables: customers, orders, products, special offers, and more.
We define customer churn as a customer not making another purchase within 180 days of their last purchase. Our goal is to engineer features that accurately predict this behavior by leveraging the rich relational data available.
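The 180-day definition above can be turned into training labels roughly as follows. This is a minimal sketch in plain Python, not code from the getML notebook. The `orders` list of `(customer_id, order_date)` tuples, the `label_churn` helper, and the `data_end` cutoff are all hypothetical names introduced for illustration. It also assumes that orders whose 180-day follow-up window extends past the end of the data should be skipped, since their outcome is not yet known.

```python
from collections import defaultdict
from datetime import date, timedelta

CHURN_WINDOW = timedelta(days=180)

def label_churn(orders, data_end):
    """Return {(customer_id, order_date): churned} for every order whose
    180-day follow-up window lies fully inside the observed data."""
    by_customer = defaultdict(list)
    for customer_id, order_date in orders:
        by_customer[customer_id].append(order_date)

    labels = {}
    for customer_id, dates in by_customer.items():
        dates.sort()
        for i, current in enumerate(dates):
            # Assumption: skip orders we cannot judge yet because the
            # follow-up window runs past the end of the data.
            if current + CHURN_WINDOW > data_end:
                continue
            next_date = dates[i + 1] if i + 1 < len(dates) else None
            came_back = next_date is not None and next_date - current <= CHURN_WINDOW
            labels[(customer_id, current)] = not came_back
    return labels

# Hypothetical toy data, not taken from AdventureWorks.
orders = [
    (1, date(2013, 1, 10)),
    (1, date(2013, 3, 1)),   # 50 days after the first order: no churn
    (1, date(2014, 1, 15)),  # 320 days after the second order: churn
    (2, date(2013, 6, 1)),   # customer 2 never returns: churn
]
labels = label_churn(orders, data_end=date(2014, 6, 30))
```

Labeling each order rather than each customer gives one training row per purchase, so a customer who returns several times contributes several examples.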
The Complexity of the Data
The AdventureWorks database has intricate relationships:
This complexity makes manual feature engineering time-consuming and error-prone. It involves writing extensive SQL queries to perform joins, aggregations, and time-based calculations, often resulting in thousands of lines of code.
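To give a sense of what a single hand-written feature involves, here is a sketch of one time-windowed aggregation in plain Python: the total a customer spent in the 90 days before a reference date. The record layout (`customer_id`, `order_date`, `total`) and the `spend_in_window` helper are assumptions made for illustration and do not mirror the actual AdventureWorks schema.

```python
from datetime import date, timedelta

def spend_in_window(orders, customer_id, as_of, days=90):
    """Sum the order totals of one customer in the `days` before `as_of`.

    Only orders strictly before `as_of` count, so the feature never
    leaks information from the future into the training row."""
    start = as_of - timedelta(days=days)
    return sum(
        o["total"]
        for o in orders
        if o["customer_id"] == customer_id and start <= o["order_date"] < as_of
    )

# Hypothetical toy data.
orders = [
    {"customer_id": 1, "order_date": date(2013, 1, 10), "total": 120.0},
    {"customer_id": 1, "order_date": date(2013, 3, 1), "total": 80.0},
    {"customer_id": 1, "order_date": date(2013, 6, 1), "total": 300.0},
    {"customer_id": 2, "order_date": date(2013, 3, 5), "total": 45.0},
]
feature = spend_in_window(orders, customer_id=1, as_of=date(2013, 3, 15))
```

This covers one window length, one aggregation, and one table. Varying each of those, and adding joins to further tables, is what multiplies the manual effort.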
Here's a simplified diagram representing some relationships in the AdventureWorks database:
The Need for Advanced Feature Learning
Effective feature engineering in such a setting requires:
LLMs aren't built to handle these tasks. Their sequence-based attention mechanisms are great for unstructured data, but they can't perform relational operations or manage supervised feedback loops. This makes them unsuitable for feature engineering on structured data.
Introducing getML: Generalizing Gradient Boosting to Relational Data
getML addresses these challenges by extending gradient boosting algorithms to handle multi-relational data through the construction of multi-relational decision trees. This innovative approach enables getML to perform supervised feature learning directly on raw relational data, eliminating the need for extensive manual feature engineering.
How getML Works
getML automates the feature learning process by integrating several key components:
Data Ingestion & Modelling
getML offers an easy-to-use API for modeling complex relationships in raw data. It lets you represent multivariate time series and construct star or snowflake schemas. This flexibility makes it possible to accurately define the search space for features, whether the data is a simple table or a complex graph of related entities.
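The notion of a data model as a search space can be illustrated without any library. The sketch below is not the getML API. `Table`, `join`, and `join_paths` are hypothetical names, and the four tables are a simplified, assumed subset of a sales schema. It describes a snowflake schema as a tree of joins and lists every join path a feature could be built along, assuming the schema contains no cycles.

```python
from dataclasses import dataclass, field

@dataclass
class Table:
    name: str
    joins: list = field(default_factory=list)  # list of (join_key, Table)

    def join(self, other, on):
        """Declare that `other` can be joined to this table via key `on`."""
        self.joins.append((on, other))
        return self

def join_paths(table, prefix=()):
    """Yield every chain of tables reachable from `table`, root first."""
    path = prefix + (table.name,)
    yield path
    for _, child in table.joins:
        yield from join_paths(child, path)

# Hypothetical, simplified subset of a sales schema.
product = Table("product")
order_detail = Table("order_detail").join(product, on="product_id")
orders = Table("orders").join(order_detail, on="order_id")
customer = Table("customer").join(orders, on="customer_id")

paths = list(join_paths(customer))
```

Each path is a place where aggregations can be applied, so adding a table to the model enlarges the set of candidate features.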
Supervised Feature Learning
getML includes several feature learners, each designed to capture different aspects of the data. The three most prominent feature learners in getML Enterprise are:
Using the target variable (e.g., churn), getML evaluates the predictive power of generated features. It selects the most relevant ones based on their contribution to reducing the loss function in the gradient boosting framework.
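The idea of scoring generated features against the target can be shown with a toy example. The sketch below is a deliberately simplified stand-in, not getML's algorithm. It builds three candidate aggregations over a hypothetical orders table and ranks them by the largest squared-error reduction that a single threshold split on the feature achieves, which is the criterion a regression tree would use. All data and names are made up for illustration.

```python
from statistics import mean

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split_gain(feature, target):
    """Largest squared-error reduction achievable by one threshold split."""
    base = sse(target)
    best = 0.0
    for threshold in sorted(set(feature))[:-1]:
        left = [t for f, t in zip(feature, target) if f <= threshold]
        right = [t for f, t in zip(feature, target) if f > threshold]
        best = max(best, base - sse(left) - sse(right))
    return best

# Hypothetical toy data: order totals per customer and a churn label.
orders = {1: [120.0, 80.0, 95.0], 2: [45.0], 3: [300.0, 10.0], 4: [400.0]}
churn = {1: 0, 2: 1, 3: 0, 4: 1}

# Candidate features: one aggregation over each customer's orders.
candidates = {
    "COUNT(orders)": len,
    "SUM(total)": sum,
    "MIN(total)": min,
}

customers = sorted(orders)
target = [churn[c] for c in customers]
scores = {
    name: best_split_gain([agg(orders[c]) for c in customers], target)
    for name, agg in candidates.items()
}
best_feature = max(scores, key=scores.get)
```

In this toy data the order count separates churners from non-churners perfectly, so it scores highest. An LLM that only sees the schema cannot compute such a ranking, because the ranking depends on the data and the target.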
A feature learner automatically generates features by:
Prediction Pipelines and Evaluation
getML seamlessly transitions from feature learning to model training, using the selected features to build predictive models. It provides performance metrics and insights into feature importance, aiding in model interpretation and validation.
An In-Depth Look: Feature Generation Example
To illustrate getML's capability, let's examine an example of a feature generated by the Relboost feature learner.
Suppose getML generates the following SQL-like feature:
This feature comes from getML’s adventure_works.ipynb example and has been truncated for brevity. Note that getML is not an SQL generator. getML functionality relies on a relational database that
Interpretation:
This feature encapsulates complex relationships and interactions that would be challenging to identify and code manually. By leveraging supervised learning, getML ensures that such features are predictive of the target variable.
Advantages of getML for Advanced Data Scientists
getML offers several benefits tailored for experienced practitioners:
Overcoming Feature Drift and Ensuring Model Relevance
One of the challenges in predictive modeling is feature drift—when the relationships in the data change over time, leading to model degradation. getML addresses this by allowing for quick retraining and feature re-evaluation. Since the feature generation process is automated and directly tied to the data, updating the model to reflect new patterns is streamlined.
Conclusion
While LLMs have transformed many aspects of data science, they are not equipped to handle the complexities of feature engineering in structured, relational data. getML fills this gap by extending gradient boosting to multi-relational decision trees, enabling advanced, supervised feature learning directly on raw relational data.
For advanced data scientists dealing with complex datasets, getML offers a powerful tool that enhances predictive modeling capabilities while reducing the overhead of manual feature engineering. By embracing this approach, practitioners can focus on higher-level insights and modeling strategies, driving better outcomes in their projects.
Ready to Elevate Your Feature Engineering?
Explore getML by checking out our notebook on Predicting Customer Churn on the AdventureWorks data set and see getML in action. We'd love to hear from you and help you get started! Reach out here: Contact getML
Let's push the boundaries of what's possible with structured data.