Advancing Feature Engineering for Structured Data Beyond Generative AI

As data scientists, we continuously seek tools that not only streamline our workflows but also push the boundaries of what's achievable with data. While Generative AI models like Large Language Models (LLMs) have made significant strides in handling unstructured data such as text and images, they often hit a wall when dealing with structured, relational, and time-series data. This limitation becomes particularly apparent in feature engineering for predictive modeling.

In this post, we'll delve deeper into why LLMs struggle with feature engineering on structured data and how getML addresses these challenges by generalizing gradient boosting to multi-relational decision trees, effectively bringing supervised learning directly to raw relational data.

The Limitations of LLMs in Feature Engineering

LLMs are trained on vast amounts of unstructured data, allowing them to generate coherent text and perform tasks like translation, summarization, and question-answering. However, they are not inherently designed to understand the complexities of structured data, especially when it involves multiple related tables and temporal relationships.

Consider the task of feature engineering in a relational database. LLMs can provide generic suggestions based on common patterns, but they lack the capability to perform data-specific, supervised feature learning. They cannot access your dataset to understand distributions, relationships, or the target variable's influence. This gap makes them insufficient for advanced feature engineering tasks required in predictive modeling with structured data.

A Practical Example: Predicting Customer Churn

Let's consider predicting customer churn for AdventureWorks, a fictional bicycle company. The company uses a relational database that includes multiple tables: customers, orders, products, special offers, and more.

We define customer churn as a customer not making another purchase within 180 days of their last purchase. Our goal is to engineer features that accurately predict this behavior by leveraging the rich relational data available.
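To make the definition concrete, here is a minimal sketch of how such a churn label could be derived from a customer's order dates. The function name `label_churn`, the `cutoff` parameter, and the sample dates are assumptions made for illustration; they are not taken from the AdventureWorks notebook.

```python
from datetime import date, timedelta

CHURN_WINDOW = timedelta(days=180)  # churn horizon from the definition above


def label_churn(order_dates, cutoff):
    """Return (order_date, churned) pairs for one customer.

    order_dates: all order dates of a single customer.
    cutoff: last date covered by the data. Orders whose 180-day window
            extends past the cutoff cannot be labeled yet and are skipped.
    """
    dates = sorted(order_dates)
    labels = []
    for i, d in enumerate(dates):
        if d + CHURN_WINDOW > cutoff:
            continue  # outcome not observable yet
        next_order = dates[i + 1] if i + 1 < len(dates) else None
        # Churned if there is no later order, or the next one comes too late.
        churned = next_order is None or next_order - d > CHURN_WINDOW
        labels.append((d, churned))
    return labels


# Hypothetical order history of a single customer.
orders = [date(2013, 1, 10), date(2013, 3, 1), date(2013, 12, 24)]
labels = label_churn(orders, cutoff=date(2014, 6, 30))
for order_date, churned in labels:
    print(order_date, churned)
```

Each labeled order becomes one row of the population table, with the order date acting as the reference date that any feature must respect.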

The Complexity of the Data

The AdventureWorks database has intricate relationships:

Customers have multiple orders.
Orders contain multiple products.
Products may be associated with special offers.
Salespersons are linked to stores and influence customer interactions.
Temporal aspects like order dates, ship dates, and special offer periods add another layer of complexity.

This complexity makes manual feature engineering time-consuming and error-prone. It involves writing extensive SQL queries to perform joins, aggregations, and time-based calculations, often resulting in thousands of lines of code.
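As a rough illustration of what a single hand-written feature involves, the sketch below builds a tiny in-memory SQLite database and computes two aggregates over a 90-day window. The table layout, column names, and sample rows are simplified assumptions, not the actual AdventureWorks schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, reference_date TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, quantity INTEGER);

INSERT INTO customers VALUES (1, '2014-01-01'), (2, '2014-01-01');
INSERT INTO orders VALUES
    (10, 1, '2013-11-15'), (11, 1, '2013-12-20'), (12, 1, '2014-02-01'),
    (20, 2, '2013-05-05');
INSERT INTO order_items VALUES
    (10, 100, 2), (11, 101, 1), (11, 102, 3), (12, 100, 5), (20, 100, 1);
""")

# One manual feature pair: orders and items in the 90 days before the
# reference date. Orders after the reference date are excluded so that
# no information from the future leaks into the feature.
FEATURE_SQL = """
SELECT c.customer_id,
       COUNT(DISTINCT o.order_id)   AS orders_last_90d,
       COALESCE(SUM(i.quantity), 0) AS items_last_90d
FROM customers c
LEFT JOIN orders o
       ON o.customer_id = c.customer_id
      AND o.order_date <= c.reference_date
      AND o.order_date > date(c.reference_date, '-90 days')
LEFT JOIN order_items i
       ON i.order_id = o.order_id
GROUP BY c.customer_id
ORDER BY c.customer_id
"""

rows = conn.execute(FEATURE_SQL).fetchall()
for row in rows:
    print(row)
```

Even this small query needs two joins, a time window, and a guard against future data. Repeating that for every combination of table, column, aggregation, and window is where the manual effort accumulates.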

[Figure: simplified diagram of some relationships in the AdventureWorks database]

The Need for Advanced Feature Learning

Effective feature engineering in such a setting requires:

Understanding Temporal Dynamics: Capturing how customer behavior evolves over time.
Leveraging Relational Structures: Utilizing the relationships between different entities (e.g., customers, orders, products) to extract meaningful patterns.
Incorporating Hierarchical Data: Managing data with inherent hierarchies or nested relationships.
Supervised Learning Feedback: Using the target variable to guide feature selection and transformation, ensuring that engineered features improve model performance.

LLMs aren't built to handle these tasks. Their sequence-based attention mechanisms are great for unstructured data, but they can't perform relational operations or manage supervised feedback loops. This makes them unsuitable for feature engineering on structured data.

Introducing getML: Generalizing Gradient Boosting to Relational Data

getML addresses these challenges by extending gradient boosting algorithms to handle multi-relational data through the construction of multi-relational decision trees. This innovative approach enables getML to perform supervised feature learning directly on raw relational data, eliminating the need for extensive manual feature engineering.

How getML Works

getML automates the feature learning process by integrating several key components:

Data Ingestion & Modelling

getML offers an easy-to-use API for modeling complex relationships within raw data. It lets you represent multivariate time series and construct star or snowflake schemas. This flexibility makes it possible to define the search space for features accurately, whether the data is a single table or a complex graph of related entities.
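The idea of a data model that defines the search space can be sketched in plain Python. The `Table`, `Join`, and `join_paths` names below are hypothetical and are not getML's API; they only show how declared joins, with join keys and optional time stamps, determine which table paths a feature learner could explore.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Table:
    name: str
    joins: List["Join"] = field(default_factory=list)

    def join(self, other: "Table", on: str,
             time_stamp: Optional[str] = None) -> "Table":
        """Declare that `other` can be joined onto this table."""
        self.joins.append(Join(other, on, time_stamp))
        return self


@dataclass
class Join:
    table: Table
    on: str                           # join key shared by both tables
    time_stamp: Optional[str] = None  # column used to exclude future rows


def join_paths(table: Table,
               prefix: Tuple[str, ...] = ()) -> List[Tuple[str, ...]]:
    """List every chain of joins reachable from `table`."""
    path = prefix + (table.name,)
    paths = [path] if prefix else []
    for j in table.joins:
        paths.extend(join_paths(j.table, path))
    return paths


# A small snowflake schema: customers -> orders -> order_items -> products.
products = Table("products")
order_items = Table("order_items").join(products, on="product_id")
orders = Table("orders").join(order_items, on="order_id")
customers = Table("customers").join(orders, on="customer_id",
                                    time_stamp="order_date")

paths = join_paths(customers)
for p in paths:
    print(" -> ".join(p))
```

In this toy version, a star schema is simply the case where every join hangs directly off the population table, while a snowflake schema allows joins on the joined tables as well.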

Supervised Feature Learning

getML includes several feature learners, each designed to capture different aspects of the data. The three most prominent feature learners of getML Enterprise are:

Relboost extends the gradient boosting approach to relational learning by aggregating learnable weights rather than columns, which addresses the computational complexity of an exponentially growing feature space.
RelMT adapts linear model trees to relational data, fitting a linear model at each tree leaf to capture both linear and non-linear relationships, which makes it particularly well suited to modelling time-series data.
Fastboost uses a simpler, faster, and more scalable algorithm than Relboost, making it ideal for large datasets and many cross-joins. Fastboost can outperform FastProp in speed for datasets with many columns.

Using the target variable (e.g., churn), getML evaluates the predictive power of generated features. It selects the most relevant ones based on their contribution to reducing the loss function in the gradient boosting framework.

The feature learners automatically generate features by:

Exploring Joins: Navigating through the relationships between tables to combine data meaningfully.
Applying Aggregations: Calculating statistics like sums, averages, counts over specified time windows or groupings.
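The combination of joins, aggregations, and target-driven selection can be illustrated with a deliberately naive sketch. It enumerates aggregation and time-window candidates by brute force and scores each one by the squared-error reduction of its best single split. This is not how Relboost or the other getML learners search the feature space; the data, the helper names, and the squared-error criterion are all assumptions chosen to keep the example short.

```python
from itertools import product

# Population rows: (customer_id, reference_day, churned). Toy data.
population = [(1, 400, 0.0), (2, 400, 0.0), (3, 400, 1.0), (4, 400, 1.0)]
# Order rows: (customer_id, order_day, amount). Toy data.
orders = [(1, 390, 50.0), (1, 350, 20.0), (2, 380, 30.0),
          (3, 100, 500.0), (4, 200, 40.0), (4, 60, 10.0)]

AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "max": lambda xs: max(xs) if xs else 0.0,
}
WINDOWS = (30, 90, 365)  # look-back windows in days


def build_feature(agg, window):
    """Join orders to each customer and aggregate amounts in the window."""
    values = []
    for cust_id, ref_day, _ in population:
        amounts = [a for cid, day, a in orders
                   if cid == cust_id and ref_day - window < day <= ref_day]
        values.append(float(AGGREGATIONS[agg](amounts)))
    return values


def sse(ys):
    """Sum of squared errors around the mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def loss_reduction(feature, target):
    """Best squared-error reduction achievable with one split on the feature."""
    base = sse(target)
    best = 0.0
    for threshold in sorted(set(feature)):
        left = [y for x, y in zip(feature, target) if x <= threshold]
        right = [y for x, y in zip(feature, target) if x > threshold]
        if left and right:
            best = max(best, base - sse(left) - sse(right))
    return best


target = [churned for _, _, churned in population]
scores = {
    (agg, window): loss_reduction(build_feature(agg, window), target)
    for agg, window in product(AGGREGATIONS, WINDOWS)
}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for (agg, window), score in ranked:
    print(f"{agg:>5} over {window:>3}d: loss reduction {score:.3f}")
```

In this toy data, short windows separate churners from active customers, while the 365-day order count carries no signal. The target decides which candidates are kept, which is the feedback loop an LLM suggestion lacks.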

Prediction Pipelines and Evaluation

getML seamlessly transitions from feature learning to model training, using the selected features to build predictive models. It provides performance metrics and insights into feature importance, aiding in model interpretation and validation.

An In-Depth Look: Feature Generation Example

To illustrate getML's capability, let's examine an example of a feature generated by the Relboost feature learner.

Suppose getML generates the following SQL-like feature:

This feature comes from getML’s adventure_works.ipynb example and has been truncated for brevity. Note that getML is not an SQL generator. getML functionality relies on a relational database that

Interpretation:

Temporal Conditions: The feature considers the time since the customer's last order (t1.orderdate - t2.startdate > threshold), capturing recency effects.
Categorical Splits: It segments customers based on salespersonid and territoryid, recognizing that interactions with certain salespersons or regions may influence churn risk.
Aggregated Metrics: It uses avg_monthly_orders, reflecting purchasing frequency and customer engagement.
Product Preferences: By examining productmodelid, it accounts for customer affinity towards specific product models.
Scoring Mechanism: Different conditions assign different score values, representing the learned impact on churn risk.

This feature encapsulates complex relationships and interactions that would be challenging to identify and code manually. By leveraging supervised learning, getML ensures that such features are predictive of the target variable.
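The general shape of such a feature, conditions that assign a weight to each joined row followed by an aggregation of those weights, can be sketched as follows. This is a hypothetical toy and not the feature from the notebook: the thresholds, territory IDs, and weights are invented for illustration, whereas in Relboost they would be learned from the target.

```python
def row_weight(days_since_order, territory_id):
    """Assign a weight to one joined order row.

    The thresholds and weights are illustrative assumptions; a trained
    feature learner would derive them from the data.
    """
    if days_since_order > 90:                  # temporal condition
        if territory_id in {1, 4}:             # categorical split
            return 0.75
        return 0.5
    return -0.25                               # recent orders lower the score


def relational_feature(reference_day, customer_orders):
    """Sum the row weights over all of a customer's past orders.

    customer_orders: (order_day, territory_id) pairs for one customer.
    """
    return sum(
        row_weight(reference_day - order_day, territory_id)
        for order_day, territory_id in customer_orders
        if order_day <= reference_day          # ignore future rows
    )


# Hypothetical customer: one recent order and two old ones.
score = relational_feature(400, [(390, 1), (250, 4), (100, 2)])
print(score)
```

The conditions play the role of the tree's splits and the returned weights play the role of its leaf values, so the aggregation runs over learned quantities rather than over a raw column.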

Advantages of getML for Advanced Data Scientists

getML offers several benefits tailored for experienced practitioners:

Direct Use of Relational Data: It operates directly on raw relational data without the need for flattening, preserving the richness of the data's relational structure.
Automated, Supervised Feature Learning: By automating the feature generation process and using supervised learning for feature selection, getML saves time and enhances model performance.
Handling of Complex Relationships: It can model intricate interactions across multiple tables and time periods, capturing nuances that traditional methods might miss.
Scalability and Efficiency: getML efficiently processes large datasets with complex schemas, making it suitable for enterprise-scale applications.
Interpretability: The features generated are transparent and can be inspected, aiding in understanding the model's behavior and facilitating explainability.

Overcoming Feature Drift and Ensuring Model Relevance

One of the challenges in predictive modeling is feature drift: the relationships in the data change over time, and model performance degrades as a result. getML addresses this by allowing for quick retraining and feature re-evaluation. Because feature generation is automated and tied directly to the data, updating the model to reflect new patterns requires rerunning the pipeline rather than rewriting features by hand.

Conclusion

While LLMs have transformed many aspects of data science, they are not equipped to handle the complexities of feature engineering in structured, relational data. getML fills this gap by extending gradient boosting to multi-relational decision trees, enabling advanced, supervised feature learning directly on raw relational data.

For advanced data scientists dealing with complex datasets, getML offers a powerful tool that enhances predictive modeling capabilities while reducing the overhead of manual feature engineering. By embracing this approach, practitioners can focus on higher-level insights and modeling strategies, driving better outcomes in their projects.

Ready to Elevate Your Feature Engineering?

Explore getML by checking out our notebook on Predicting Customer Churn on the AdventureWorks data set and see getML in action. We'd love to hear from you and help you get started! Reach out here: Contact getML

Let's push the boundaries of what's possible with structured data.

Alexander Uhlig

Data Scientist, Econophysicist | Co-Founder at getML
