TransmogrifAI

TransmogrifAI is a machine learning automation framework designed to simplify the machine learning workflow. It was created by Salesforce and released as open-source software. TransmogrifAI is built on top of Apache Spark and is designed to work with big data. The name is a reference to the comic strip Calvin and Hobbes, in which Calvin uses a transmogrifier to transform himself into different creatures.

What is TransmogrifAI?

TransmogrifAI provides a unified API for data cleaning, feature engineering, and model training. It uses automated feature engineering to derive new features from the input data, and feature selection to keep only the most useful ones, which reduces the risk of overfitting. It also provides automatic hyperparameter tuning to help optimize the performance of the model.

Why use TransmogrifAI?

TransmogrifAI makes it easier to build and deploy machine learning models. It provides a simple, unified API that abstracts away much of the complexity of the underlying machine learning algorithms, along with a range of tools for data cleaning, feature engineering, and model training. Because it runs on Apache Spark, it is designed to scale to large datasets.

How does TransmogrifAI work?

TransmogrifAI works by automating many of the steps in the machine learning workflow: feature engineering to create new features from the input data, feature selection to keep the best features for the model, and hyperparameter tuning to optimize the performance of the model. Because it is built on top of Apache Spark, these steps can run on large, distributed datasets.


The TransmogrifAI Workflow

Feature Inference: The first step in any machine learning process is data preparation. A data scientist collects all relevant data, then joins, combines, and aggregates the different data sources to extract raw signals that could have predictive power. The extracted signals are placed into a flexible data structure, commonly known as a data frame, where they can be further manipulated. Although data frames are simple and easy to manipulate, they do not protect data scientists against downstream errors such as incorrect assumptions about types or nulls in the data.

In TransmogrifAI, features are strongly typed, and the framework supports a rich and extensible hierarchy of feature types. In addition to accepting user-specified types, TransmogrifAI also infers types on its own. For example, if it detects that a text feature with low cardinality is really a categorical feature in disguise, it records it as categorical and treats it accordingly. Strongly typed features allow developers to catch most errors at compile time rather than at runtime. They are also key to automating the type-specific processing common to machine learning pipelines.
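To make the idea of inferring a categorical feature from low-cardinality text more concrete, here is a minimal sketch in plain Python. It is not TransmogrifAI's API (the library itself is written in Scala), and the function name `infer_text_kind` and the threshold `max_cardinality` are assumptions made up for this illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def infer_text_kind(values: Sequence[Optional[str]], max_cardinality: int = 10) -> str:
    """Label a text column 'categorical' or 'free_text' by its distinct-value count.

    max_cardinality is an illustrative threshold, not a TransmogrifAI default.
    """
    # Nulls are dropped explicitly instead of being counted as a category.
    present = [v for v in values if v is not None]
    distinct = Counter(present)
    if present and len(distinct) <= max_cardinality:
        return "categorical"
    return "free_text"


states = ["CA", "NY", None, "CA", "TX", "NY"]
notes = [f"customer note {i}" for i in range(50)]

print(infer_text_kind(states))  # only three distinct values
print(infer_text_kind(notes))   # every value is unique
```

A real implementation would likely also look at the ratio of distinct values to rows, but the principle is the same: the type is derived from the data rather than assumed.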


[Figure: The TransmogrifAI feature type hierarchy]

Transmogrification (a.k.a. automated feature engineering): While strongly typed features are very helpful for reasoning about your data and minimizing downstream errors, ultimately all features need to be transformed into a numerical representation that exposes the patterns in the data in a way that machine learning algorithms can easily exploit. This process is known as feature engineering. There are endless ways to transform the feature types in the hierarchy above, and doing it well is the art of data science.

As an example, let's ask ourselves how we would go about transforming a US state (e.g. CA, NY, TX) into a number. One option is to map each state to an arbitrary integer. The problem with this encoding is that it does not store any information about the geographical proximity of the states, yet proximity can be an important property when modelling purchasing behaviour. Another option is to encode each state by its distance from the center of the country. This would partly solve the first problem, but would still not encode whether a state is in the north, south, west, or east of the country. This was a simple illustration for a single feature; imagine doing it for hundreds or thousands of them! What makes the process particularly challenging is that there is no single correct way, and successful approaches depend heavily on the problem we are trying to optimize.
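The trade-off in the state example can be sketched with two common encodings: an arbitrary integer index and a one-hot vector. One-hot encoding is a standard technique brought in here for comparison, not something the article attributes to TransmogrifAI, and the function names are made up for this sketch.

```python
from typing import Dict, List


def index_encode(categories: List[str]) -> Dict[str, int]:
    """Map each category to an arbitrary integer (alphabetical order here)."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}


def one_hot_encode(categories: List[str]) -> Dict[str, List[int]]:
    """Map each category to a vector with a single 1 in its own position."""
    index = index_encode(categories)
    size = len(index)
    return {c: [1 if j == i else 0 for j in range(size)] for c, i in index.items()}


states = ["CA", "NY", "TX"]
as_index = index_encode(states)
as_one_hot = one_hot_encode(states)

# The integer encoding implies NY lies "between" CA and TX, which means nothing.
print(as_index)
# One-hot vectors are all equally far apart, so no false ordering is introduced.
print(as_one_hot)
```

Neither encoding captures geography. That would need extra information, such as coordinates or a region label, which is exactly why the right transformation depends on the problem.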

TransmogrifAI comes with a myriad of techniques for all supported feature types, from phone numbers, email addresses, and geographic locations to text data. These transformations are not just about getting data into a format that algorithms can use. TransmogrifAI also optimizes the transformations to make it easier for machine learning algorithms to learn from the data. For example, it can transform a numerical feature such as age into the most appropriate age groups for a particular problem: the age groups that matter in the fashion industry may differ from those that matter in wealth management.
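One way to pick problem-specific age groups is to let the label decide where the cut goes. The sketch below finds the single threshold that best separates buyers from non-buyers by minimizing weighted Gini impurity, the criterion used by decision trees. It is an illustration under made-up data; the article does not say which method TransmogrifAI uses, and the names `gini` and `best_split` are hypothetical.

```python
from typing import Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(ages: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, weighted impurity) of the best single cut point."""
    pairs = sorted(zip(ages, labels))
    distinct = sorted({a for a, _ in pairs})
    best = (float("nan"), float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct ages.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for a, y in pairs if a < t]
        right = [y for a, y in pairs if a >= t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best


ages = [19, 22, 25, 31, 38, 44, 52, 60]
bought = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up labels: did the customer buy?
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)
```

With a different label, say interest in a retirement product, the same ages would likely produce a different cut, which is the point the paragraph above makes.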

Automated Feature Validation: Feature engineering can lead to an explosion in the dimensionality of the data, and high-dimensional data is often full of problems. For example, the usage of particular fields may change over time, and models trained on those fields may perform poorly on fresh data.

Another big (and often overlooked) problem is hindsight bias, or data leakage. This occurs when the training examples contain information that will not actually be available at prediction time. The result is models that look amazing on paper but are useless in practice. Consider a dataset of sales deals where the task is to predict which deals are likely to close. Imagine a field in this dataset called "Deal Amount" that is populated only after a deal is closed. A model trained naively on this data would treat the field as highly predictive, because every deal with an amount filled in has closed. In reality, however, this field will never be filled in for a deal that is still open, so the model will perform poorly on exactly the deals where predictions matter.

TransmogrifAI's automated feature validation algorithms are particularly useful for maintaining sanity when dealing with high-dimensional, unfamiliar data that may be fraught with hindsight bias. They apply a range of statistical tests based on feature types, and additionally use feature lineage to detect and remove such bias.
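A simple check in the spirit of the "Deal Amount" example is to flag any field whose filled/empty pattern lines up almost perfectly with the label. The sketch below is a hand-rolled heuristic for illustration only, not one of TransmogrifAI's statistical tests; the function name, the 0.95 threshold, and the field `num_meetings` are assumptions.

```python
from typing import Dict, List, Optional


def leakage_suspects(
    rows: List[Dict[str, Optional[float]]],
    labels: List[int],
    threshold: float = 0.95,
) -> List[str]:
    """Return fields whose filled/empty pattern agrees with the label too often."""
    suspects = []
    fields = sorted({name for row in rows for name in row})
    for name in fields:
        filled = [int(row.get(name) is not None) for row in rows]
        agreement = sum(f == y for f, y in zip(filled, labels)) / len(labels)
        # Agreement near 0 is just as suspicious as agreement near 1.
        if max(agreement, 1 - agreement) >= threshold:
            suspects.append(name)
    return suspects


deals = [
    {"deal_amount": 1200.0, "num_meetings": 3.0},
    {"deal_amount": None, "num_meetings": 5.0},
    {"deal_amount": 800.0, "num_meetings": None},
    {"deal_amount": None, "num_meetings": 1.0},
]
closed = [1, 0, 1, 0]

print(leakage_suspects(deals, closed))
```

A flagged field is only a suspect. Deciding whether it truly leaks the outcome still requires knowing when the field gets populated.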

Automated Model Selection: The final stage of the data scientist's process involves applying machine learning algorithms to the prepared data to create a predictive model. There are many different algorithms to try, each with a number of knobs that can be tweaked to varying degrees. Finding the right algorithm and parameter settings can mean the difference between a powerful model and one that is no better than a coin toss.

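TransmogrifAI automates this search by training several candidate algorithms on the data and choosing the best one based on validation performance. It also automatically deals with the problem of imbalanced data by appropriately sampling the data and recalibrating predictions to match the true priors. There is often a significant gap in performance between the best and worst models a data scientist trains on the same data, so exploring the space of possible models is important.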
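The idea of trying several algorithms and keeping the one that validates best can be sketched with k-fold cross-validation over two toy models. This is plain Python rather than TransmogrifAI's API, the two "algorithms" (a majority-class baseline and a one-threshold stump) are deliberately trivial, and it does not attempt the imbalanced-data handling mentioned above.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[float], int]
Trainer = Callable[[Sequence[float], Sequence[int]], Model]


def train_majority(xs: Sequence[float], ys: Sequence[int]) -> Model:
    """Baseline: always predict the most common training label."""
    majority = int(sum(ys) * 2 >= len(ys))
    return lambda x: majority


def train_stump(xs: Sequence[float], ys: Sequence[int]) -> Model:
    """Predict 1 at or above the cut point with the highest training accuracy."""
    def hits(cut: float) -> int:
        return sum((x >= cut) == bool(y) for x, y in zip(xs, ys))
    best_cut = max(xs, key=hits)
    return lambda x: int(x >= best_cut)


def cv_accuracy(trainer: Trainer, xs: Sequence[float], ys: Sequence[int], k: int = 4) -> float:
    """Average validation accuracy over k interleaved folds."""
    fold_scores: List[float] = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        valid = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k == fold]
        model = trainer([x for x, _ in train], [y for _, y in train])
        correct = sum(model(x) == y for x, y in valid)
        fold_scores.append(correct / len(valid))
    return sum(fold_scores) / k


def select_model(candidates: Dict[str, Trainer], xs: Sequence[float], ys: Sequence[int]) -> Tuple[str, float]:
    """Return the name and score of the candidate with the best CV accuracy."""
    scores = {name: cv_accuracy(t, xs, ys) for name, t in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
winner, score = select_model({"majority": train_majority, "stump": train_stump}, xs, ys)
print(winner, score)
```

Scoring on held-out folds rather than on the training data is what keeps the comparison honest: a model that merely memorizes its training set gains nothing from it.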


Hyperparameter Optimization: Underlying all of the stages above is a hyperparameter optimization layer. The term hyperparameter usually refers to the tunable parameters of a machine learning algorithm. In reality, however, all of the stages above come with a variety of knobs that matter. The sampling rate for dealing with imbalanced data, for example, is one such knob. Tuning all of these parameters can be overwhelming for a data scientist, but it can make the difference between a great model and one that is essentially a random number generator. This is why TransmogrifAI comes with some techniques for automatically tuning these hyperparameters, and a framework that can be extended with more advanced tuning techniques.
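As an illustration of tuning several knobs at once, the sketch below runs an exhaustive grid search over two settings of a tiny nearest-neighbour classifier. Grid search is just one simple strategy chosen for the example; the article does not specify which techniques TransmogrifAI ships with. The data, the knob names `k` and `weighted`, and both functions are made up.

```python
from itertools import product
from typing import Any, Callable, Dict, Sequence, Tuple

# Made-up 1-D data: (value, label). The point at 3.6 is a deliberately noisy label.
TRAIN = [(1.0, 0), (2.0, 0), (3.0, 0), (3.6, 1), (6.0, 1), (7.0, 1), (8.0, 1)]
VALID = [(3.5, 0), (2.5, 0), (6.5, 1), (5.5, 1)]


def knn_accuracy(k: int, weighted: bool) -> float:
    """Validation accuracy of k-nearest-neighbours with optional distance weighting."""
    correct = 0
    for x, y in VALID:
        nearest = sorted(TRAIN, key=lambda p: abs(p[0] - x))[:k]
        votes = [0.0, 0.0]
        for nx, ny in nearest:
            # Closer neighbours count more when weighting is switched on.
            votes[ny] += 1.0 / (abs(nx - x) + 1e-9) if weighted else 1.0
        correct += int(votes[1] > votes[0]) == y
    return correct / len(VALID)


def grid_search(
    evaluate: Callable[..., float],
    grid: Dict[str, Sequence[Any]],
) -> Tuple[Dict[str, Any], float]:
    """Try every combination in the grid and return the first best-scoring one."""
    names = list(grid)
    best_params: Dict[str, Any] = {}
    best_score = float("-inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(knn_accuracy, {"k": [1, 3, 5], "weighted": [False, True]})
print(best_params, best_score)
```

The cost of a grid grows multiplicatively with every knob added, which is one reason smarter strategies such as random search are often preferred once knobs from several stages are tuned together.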

Benefits of using TransmogrifAI:

TransmogrifAI simplifies the machine learning workflow by automating many of its steps. This means that data scientists can focus on the more creative aspects of machine learning, such as framing the problem and interpreting the results. TransmogrifAI provides a range of tools for data cleaning, feature engineering, and model training. Because it is built on Apache Spark, it is also designed to scale to large datasets.

Conclusion:

TransmogrifAI is a machine learning automation framework designed to simplify the machine learning workflow. It provides a unified API for data cleaning, feature engineering, and model training, and it automates feature engineering, feature validation, model selection, and hyperparameter tuning. Because it is built on top of Apache Spark, it can work with big data. By automating many of the steps in the workflow, TransmogrifAI lets data scientists focus on the more creative aspects of machine learning.

