Deep Neural Networks and Tabular Data Survey Review
The aim of the survey is to give an overview of state-of-the-art deep learning methods for tabular data, organized into three groups: data transformations, specialized architectures, and regularization models. In addition to deep learning approaches for tabular data generation, strategies for explaining deep models on tabular data are also discussed, including the main research approaches and established methodologies, as well as relevant challenges and open questions in this field.
The availability of large, labeled data sets, together with affordable computational and storage resources, has driven the development of deep neural networks that exhibit high performance in classification and data generation tasks on homogeneous data (e.g., image, audio, and text data). Deep neural networks can also generate synthetic tabular data, which can serve as input in certain learning applications and other scenarios. However, heterogeneous data collection is a costly and time-consuming task, and generating realistic synthetic data is challenging, since heterogeneous tabular data mix discrete and continuous variables that may be multimodal or imbalanced. Combined with data noise and fragmentation, tabular data generation remains a challenge even for modern deep neural network models.
Homogeneous Data and Machine Learning Applications
In contrast to homogeneous data such as images, tabular data are heterogeneous, with dense numerical and sparse categorical features that are more weakly correlated with one another than features in homogeneous data sets. In general, heterogeneous data contain a variety of attribute types, such as continuous or discrete numerical variables, as well as categorical variables (qualitative values without a numerical ordering).
More specifically, in tabular data, variables can be correlated or independent, and the features lack the spatial structure that has been helpful in establishing correlations in homogeneous data sets. Formally, tabular data (also called structured data) is a subcategory of the heterogeneous data format that is usually structured as a table, where the rows correspond to data points and the columns correspond to features.
Columns can be either numerical or categorical, and can be regarded as random variables following a joint distribution. It is also important to note that there is limited prior knowledge regarding the structure and relationships between the features in a tabular data set.
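To make this concrete, the following toy example (with made-up column names and values, not taken from the survey) shows how such a table is typically represented in Python with pandas:

import pandas as pd

# Rows correspond to data points, columns to features of mixed types.
df = pd.DataFrame({
    "age": [34, 51, 27, 45],                                  # continuous numerical
    "num_accounts": [2, 1, 4, 3],                             # discrete numerical
    "occupation": ["nurse", "engineer", "clerk", "teacher"],  # categorical
})

# Categorical columns carry qualitative values without a numerical ordering;
# pandas can mark them explicitly as such.
df["occupation"] = df["occupation"].astype("category")
print(df.dtypes)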
Heterogeneous data are the most commonly used form of data in many practical machine learning applications, such as medical diagnosis, predictive analysis in finance (risk analysis, investment strategies, creditworthiness estimation, and portfolio management), recommendation systems, customer churn prediction, security and privacy, and many other data science tasks.
Tabular data are the oldest form of data collection and were therefore used in early machine learning research. Although the development of deep neural networks has focused on homogeneous data, there are various recent deep learning approaches aimed specifically at tabular data modeling, stimulated by some of the aforementioned applications, such as finance and advertising, where traditional machine learning methods fall short in terms of performance and accuracy. These novel methods are able to identify complex dependencies in the data, and are based on various flexible data transformation methods, specialized architectures, regularization, or attention-based approaches.
Deep Neural Networks for Tabular Data
Deep neural networks are usually inferior to more traditional machine learning methods on tabular data in terms of predictive quality and performance. While this discrepancy has been associated with the large complexity and non-linearity of dependencies in the tabular data format, it remains unclear why deep neural networks cannot achieve performance similar to that in other application domains where they handle homogeneous data, such as image classification and natural language processing. Four possible reasons are identified as potential answers:
- Inappropriate training data: Real-world tabular data sets are often of low quality; they include missing values and erroneous or inconsistent entries, and they are small relative to the large number of features generated from them.
- Missing or Complex Irregular Spatial Dependencies: There is often no spatial correlation between the variables in tabular data sets, and the dependencies between features are complex and irregular, making the analytic methods employed by models for homogeneous data unsuitable for modeling data of this format.
- Extensive preprocessing: When dealing with categorical features, it is common to represent them numerically. However, for high-cardinality features this can lead to a very sparse feature representation that makes embedding these features problematic (see the encoding sketch after this list). Extensive preprocessing can also lead to information loss with respect to the original data and thus reduce the predictive performance of deep neural networks.
- Model sensitivity: Deep neural networks can be sensitive to small perturbations of the input data, with a small change of a categorical feature potentially having a large impact on the prediction.
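A minimal sketch of the preprocessing and sensitivity issues above, assuming scikit-learn's OneHotEncoder and a made-up high-cardinality feature (the zip-code values are purely illustrative):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality categorical feature (1,000 distinct zip codes).
zip_codes = np.array([[f"zip_{i % 1000}"] for i in range(5000)])

encoder = OneHotEncoder()  # returns a sparse matrix by default
encoded = encoder.fit_transform(zip_codes)

# Each row is a 1,000-dimensional vector with a single non-zero entry,
# i.e., 99.9% of the representation consists of zeros.
print(encoded.shape)                          # (5000, 1000)
print(encoded.nnz / np.prod(encoded.shape))   # density = 0.001

# Sensitivity: editing one category flips two coordinates at once, a "small"
# change in the raw table but a comparatively large jump in input space.
a = encoder.transform([["zip_1"]]).toarray()
b = encoder.transform([["zip_2"]]).toarray()
print(np.abs(a - b).sum())                    # 2.0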
The main approaches and methods of deep neural networks for tabular data sets can be classified into the following categories:
- Data Transformation Methods, where categorical and numerical data are transformed so that a deep neural network model can extract the information more effectively. These methods do not require new architectures or data pipeline adaptations, but they do require more preprocessing time. Most traditional approaches for deep neural networks on tabular data fall into this category (a combined sketch of this and the regularization category follows this list).
- Specialized Architectures, where novel deep neural network architectures designed specifically for heterogeneous tabular data are investigated. The two most important types of architecture proposed are hybrid models, which merge classical machine learning approaches (e.g., decision trees) with neural networks, and transformer-based models, which rely on the attention mechanism inspired by deep neural network methods for text and visual data. Specialized architectures form the largest group of approaches for deep tabular data learning.
- Regularization Models, where strong regularization schemes are proposed to improve the performance of deep learning models on extremely non-linear and complex tabular data sets. In this class of approaches, the extreme flexibility of deep learning models for tabular data is considered one of the main learning obstacles, and strong regularization of the learned parameters is therefore expected to increase overall performance. A potential downside is that extensive regularization and optimization can be computationally expensive; however, several works indicate that strong regularization of deep neural networks is beneficial for tabular data.
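As a combined sketch of the first and third categories, the following assumes a scikit-learn style pipeline on a hypothetical heterogeneous data set: a data transformation step makes the table digestible for a neural network, and an L2 penalty (the alpha parameter) regularizes a small multilayer perceptron. This is an illustration of the two ideas, not a specific method from the survey.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical heterogeneous tabular data set.
X = pd.DataFrame({
    "income": [42_000, 58_000, 31_000, 77_000, 50_000, 64_000],
    "age": [34, 51, 27, 45, 38, 60],
    "occupation": ["nurse", "engineer", "clerk", "teacher", "nurse", "clerk"],
})
y = [0, 0, 1, 0, 1, 0]

# Data transformation: scale numerical features and one-hot encode categorical
# ones, so that a standard neural network can consume the table.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income", "age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
])

# Regularization: alpha adds an L2 penalty on the weights of this small MLP.
model = Pipeline([
    ("prep", preprocess),
    ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-2,
                          max_iter=2000, random_state=0)),
])
model.fit(X, y)
print(model.predict(X))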
The interpretation of deep neural networks on tabular data is another important topic, directly related to the problem of the explainability of machine learning models. The most common approaches for interpreting deep neural networks are based on methods employed in the computer vision domain.
For tabular data sets, it is just as essential to highlight the relationships between variables, and many existing methods are based on attention mechanisms. Broadly, research indicates two distinct explainability classes:
- Feature Highlighting Explanations, where techniques seek to explain the behavior of machine learning models instance by instance. These methods aim to understand how all the available inputs to a model are used to arrive at a certain prediction. This can be done by enforcing local linearity on the deep neural network, or by accessing the model parameters, if they are known, and using them to generate the explanation. If the model parameters are inaccessible, model-agnostic approaches can be employed, such as applying surrogate models that are interpretable by design.
- Counterfactual Explanations, the main purpose of which is to suggest constructive interventions to deep neural network inputs so that the output changes to the advantage of the end user. By emphasizing both the feature importance and the recommendation, counterfactual explanations can aid model explainability (a minimal sketch follows).
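A deliberately simple, hypothetical illustration of the counterfactual idea (not a method from the survey): given a fitted classifier, search along a single feature for the smallest intervention that flips the prediction in the user's favor.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical credit data: columns are [income in k, debt in k]; 1 = approved.
X = np.array([[40, 30], [70, 10], [30, 40], [90, 5], [50, 25], [80, 8]])
y = np.array([0, 1, 0, 1, 0, 1])
clf = LogisticRegression().fit(X, y)

applicant = np.array([45.0, 35.0])
print(clf.predict([applicant])[0])  # expected: 0 (rejected)

# Constructive intervention: reduce debt step by step until the decision flips.
for debt in np.arange(applicant[1], -1.0, -1.0):
    candidate = [applicant[0], debt]
    if clf.predict([candidate])[0] == 1:
        print(f"Counterfactual: reduce debt from {applicant[1]} to {debt}")
        break
else:
    print("No counterfactual found along the debt feature alone.")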
Summing up
Research drawing on comparisons across multiple data sets has demonstrated that models based on tree ensembles still outperform deep learning models when dealing with heterogeneous tabular data. Whether there is a benefit to be gained from applying deep learning methods to tabular data thus remains unclear. A fundamental reorientation of the domain, as well as a refocus on information-preserving representation mechanisms for deep learning with tabular data, may therefore be necessary and beneficial.
Many of the challenges for deep neural networks on tabular data stem from the heterogeneity of the data; deep learning solutions that transform it into a more suitable, homogeneous representation can therefore significantly boost performance and are a recommended strategy for real-world applications.
Regularization has also been shown to reduce hypersensitivity and to improve the overall performance and robustness of deep neural network models on tabular data. When it comes to interpretability, counterfactual explanations for deep tabular learning can be used to improve the perceived fairness in human-AI interactions and to enable personalized decision-making.
However, the heterogeneity of tabular data renders the deployment of counterfactual explanation methods problematic in practice, and the development of efficient methods for handling heterogeneous tabular data in the presence of feasibility constraints remains an open question. Given the high importance of tabular data to industry, finance, and research, novel ideas that alleviate some of the challenges associated with efficient handling of heterogeneous tabular data are potentially significant.