In artificial intelligence (AI) and data science projects, baseline models are built using the available data, domain knowledge, a set of original or transformed features, and machine learning (ML) algorithms suitable for the domain, the problem type, the size of the data and feature set, and the infrastructure and high-performance computing (HPC) resources available. The performance of the baseline model is then assessed, and improvement is usually attempted using the following techniques:
- Further capturing and embedding domain knowledge and business processes into the AI and ML algorithms can provide more perspective on and understanding of the problem, the variables (inputs and targets), and the relationships between variables.
- Collecting and adding more high-quality data to the training dataset provides more information for the algorithms to learn from and is likely to improve model performance. Adding low-quality, non-conformant data not only does not help but can worsen model performance.
- Quality control and data wrangling, e.g. cleaning or imputation (replacing missing values with meaningful data), are likely to improve model performance; precautions need to be taken, because if important data is removed in the data-wrangling step or missing data is replaced with wrong values, performance is likely to deteriorate (see the imputation sketch after this list).
- Feature selection, transformation, and engineering, if done appropriately, can boost the performance and accuracy of the models significantly. All three require domain knowledge and help extract more information from the data (a feature-engineering sketch follows the list).
- Hyperparameter tuning of the algorithms is both an art and a science; it is usually time-consuming, requires experience, and usually cannot be fully automated. There are many tools and packages to facilitate or automate the process; however, human intelligence combined with those tools usually yields better results than fully automated hyperparameter tuning (a tuning sketch follows the list).
- The split and cross-validation strategy strongly affects the results of hyperparameter tuning. With luck, a simple train-validation-test split may yield a good improvement in the accuracy metrics; however, a more robust and generalisable model is likely to be achieved with k-fold cross-validation (for large sample sizes) or leave-one-out cross-validation (for small sample sizes).
- Algorithm selection usually depends on the problem type, the context, and the size of the data (number of samples and features). Other important selection factors are the ability to tune the algorithm to balance bias and variance (correcting under- and over-fitting), explainability, and runtime and HPC requirements (a comparison sketch follows the list).
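As a concrete illustration of technique 3, below is a minimal sketch of quality control and imputation with pandas and scikit-learn. The file name and column names are hypothetical placeholders, and the median fill is only a conservative default that domain knowledge may override.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical input file and column names.
df = pd.read_csv("training_data.csv")

# Quality control: drop rows whose target is missing rather than fabricate it.
df = df.dropna(subset=["target"])

# Impute gaps in numeric features with the median, a conservative default;
# domain knowledge may suggest better per-column fill values.
numeric_cols = df.select_dtypes(include="number").columns.drop("target", errors="ignore")
imputer = SimpleImputer(strategy="median")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```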
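For technique 4, the sketch below shows two simple feature-engineering moves: a log transform of a skewed variable and a ratio of two raw inputs. The column names (volume, pressure, temperature) are purely illustrative assumptions.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative engineered features added."""
    out = df.copy()
    # Log-transform a heavily skewed variable so the model sees a more
    # symmetric distribution.
    out["log_volume"] = np.log1p(out["volume"])
    # Combine raw inputs into a ratio with physical meaning, which often
    # carries more signal than either column alone.
    out["pressure_per_temperature"] = out["pressure"] / out["temperature"]
    return out
```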
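For techniques 5 and 6, the following sketch combines hyperparameter tuning and k-fold cross-validation using scikit-learn's GridSearchCV. The estimator, the parameter grid, and the X_train/y_train arrays are assumptions; for small sample sizes, KFold could be swapped for LeaveOneOut.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Illustrative parameter grid; the useful ranges depend on the data.
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

# 5-fold cross-validation; LeaveOneOut is an option for very small datasets.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=cv,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train, y_train assumed to be prepared already
print(search.best_params_, search.best_score_)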
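Finally, for technique 7, a quick way to shortlist candidate algorithms is to score each under the same cross-validation scheme. The two candidate models and the X/y data below are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Illustrative candidates; in practice the shortlist follows from the
# problem type, data size, and explainability requirements.
candidates = {
    "ridge": Ridge(alpha=1.0),
    "gbm": GradientBoostingRegressor(random_state=42),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```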
Of the seven techniques above, 1 to 3 relate to the data. Our experience in real-life AI applications has shown that the data-related techniques are the more impactful and helpful ones for improving model performance. This contrasts with the toy, learning-purpose, and standard AI and ML applications found on the net, for which data science and AI enthusiasts rely on techniques 4 to 7 to improve performance.
We always need to be mindful of the concept of garbage in, garbage out (GIGO). To avoid the garbage-in situation, errors and biases in the data need to be identified and corrected with help from subject matter experts (SMEs). If they are not corrected, input data errors lead to false or biased results, regardless of the algorithm, the techniques, and the effort we put into model improvement.
Comments:
Professor, GeoData Science: Spatial Stats & UQ (2 yr): How about the model set-up? The way you feed the data can vary dramatically.
Senior Data Scientist at Marathon Petroleum Corporation (2 yr): Exactly Aso, I spend all my time on the quality and transformation of the (subsurface) data, particularly if you can coerce it into non-dimensional groups or compound features with physical relevance (e.g. material-balance groups). Usually the first-pass ML model isn't much improved by changing the model family or hyperparameter tuning (assuming you made a reasonable first selection of the statistical model to be used), but generalisability can really be improved by adding physical constraints to the loss function.