Unmasking Real-World Data Science: A Departure from Kaggle’s Accuracy Frenzy and Model-Centric Approaches

TLDR: This article takes you off the beaten path in real-world data science, ditching Kaggle’s obsession with accuracy and rigid models. Dive into the messy world of unfiltered data, where solutions go beyond chasing small accuracy gains. Get ready for a shift from precision focus to impactful stories, uncovering the true heart of data science.

Introduction

My suggestion to aspiring data scientists: learning the intuition behind basic algorithms and then applying them blindly to any dataset is a naive approach, and it says a lot about your skill set. Avoid replicating your Kaggle methods in real-life use cases; your business acumen, sense of usability, and understanding of the data need to be at the next level. Always strive to comprehend the problem at hand before hastily deciding aspects like the choice of model or labeling, or categorizing it as a time series problem when it might actually be a simple statistical problem, and so forth.

Here is my 5-step strategy for solving real-life business problems with data science:

1. Understand and formulate the problem:

Determine the end goal and desired outcomes, and identify the pain points that need resolution. Define the Key Performance Indicators (KPIs) needed to help the business address the problem, distinguishing between descriptive, diagnostic, predictive (proactive), and prescriptive measures.

2. Data Modeling:

Once you’ve completed the formulation, the next step is data modeling. This involves preparing the data that will be fed into the model. Data modeling goes beyond simple preprocessing and feature engineering; it requires a deep understanding of the granularity of the data. Using this granularity, you must model both the data and the label. Often, people focus solely on independent features without properly modeling the target variable. Aligning the modeling of the target variable with your desired outcome is crucial. For instance, if predicting next month’s sales or the timing of an event, you need to define multiple KPIs for the target label. Feature engineering typically involves creating features for independent variables, but in time-based data (e.g., historical sales) or panel datasets (e.g., machine data or equipment logs recorded over time), you should also generate lag features of the target variable, as shown in the sketch below. This approach can uncover autocorrelation between the current value and previous lags.
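
Here is a minimal sketch of that idea, assuming a hypothetical pandas panel DataFrame with columns store_id, date, and sales; the lag and rolling features are built per store, and the rolling mean is shifted first so no future target information leaks in.

```python
import pandas as pd

# Hypothetical monthly panel data: two stores, four months each.
df = pd.DataFrame({
    "store_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "date": list(pd.date_range("2024-01-01", periods=4, freq="MS")) * 2,
    "sales": [100, 120, 130, 125, 80, 85, 90, 95],
})
df = df.sort_values(["store_id", "date"])

# Lag features of the target itself: previous 1 and 2 periods, per store.
for lag in (1, 2):
    df[f"sales_lag_{lag}"] = df.groupby("store_id")["sales"].shift(lag)

# Rolling mean of past values only (shift first to avoid target leakage).
df["sales_roll_mean_3"] = df.groupby("store_id")["sales"].transform(
    lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
)
print(df)
```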

3. Predictive Modeling:

Begin with simple models in the first iteration, but make sure you have at least a rudimentary understanding of each model and its limitations. For instance, when using a regression-based model, encode categorical features into dummy variables, eliminate null values in features, ensure consistent feature scaling (a minimal pipeline sketch follows), and be mindful of the sensitivity to outliers in distance-based algorithms. If dealing with time series data, maintain stationarity, address outliers, and incorporate the lag features created during the data modeling stage. Understand the limitations and evaluation metrics of the chosen model, as well as the potential for target variable extrapolation leading to over-forecasting. When opting for tree-based regressors, be well-versed in the model’s pros and cons: these models create homogeneous splits, are not sensitive to outliers, and allow a single generic model across categories. However, they are complex, black-box models with numerous estimators or trees, and their inability to extrapolate the target may result in consistent under-forecasting.
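
A minimal sketch of that preprocessing, assuming hypothetical column names, using a scikit-learn pipeline that imputes nulls, one-hot encodes categoricals, and scales numeric features before a ridge regression:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists for illustration only.
numeric_features = ["price", "promo_discount"]
categorical_features = ["store_region", "product_category"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

model = Pipeline([("preprocess", preprocess), ("regressor", Ridge(alpha=1.0))])
# model.fit(X_train, y_train)  # X_train: a DataFrame with the columns above
```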

Simple linear, ridge, or polynomial regression can extrapolate beyond the training range, and time series models such as ARIMA and SARIMA can likewise over-forecast. Which behavior is ideal depends on the business use case. In supply chain replenishment planning, if SKU holding costs are low and the business requires never being out of stock, even at the cost of extra safety stock, then models trained for extrapolation that tend to over-forecast are acceptable. If stock-keeping costs are extremely high and hurt revenue more than lost sales do, favor models that under-forecast or cannot extrapolate, such as XGBoost or random forest regressors. The short example below illustrates the difference.
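
A small illustration on synthetic data of the extrapolation point: the linear model keeps following the trend beyond the training range, while the tree-based regressor flattens out near the last observed values and under-forecasts.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_train = np.arange(1, 25).reshape(-1, 1)                    # months 1..24
y_train = 100 + 5 * X_train.ravel() + rng.normal(0, 3, 24)   # upward trend

X_future = np.array([[30], [36]])                            # beyond the training range

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("linear:", linear.predict(X_future))   # continues the trend (can over-forecast)
print("forest:", forest.predict(X_future))   # capped near the last observed values
```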

When using unsupervised learning models like clustering, you must choose the model carefully. For example, in customer segmentation, if your data shows crescent-shaped structures, concentric circles, or clusters of varying density, you cannot use K-means: it will not carve out your data properly, regardless of the K value you choose with the elbow method. Similarly, cluster validation with the silhouette score may return a very good value yet still be misleading; in such cases you should use DBSCAN instead (see the sketch below). K-means also cannot be used if your data contains categorical information, such as customer IDs, department IDs, or demographics; use K-Modes instead.
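
A minimal sketch of this on synthetic crescent-shaped data: K-means carves the space incorrectly, while DBSCAN recovers the two moons by density, which shows up in the adjusted Rand index against the known labels.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving crescents, the classic failure case for K-means.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-means ARI:", adjusted_rand_score(y_true, kmeans_labels))  # well below 1.0
print("DBSCAN  ARI:", adjusted_rand_score(y_true, dbscan_labels))  # close to 1.0
```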

Use approaches like the Synthetic Minority Over-sampling Technique (SMOTE) if your dataset is imbalanced, for example in use cases like anomaly detection or spam classification. SMOTE creates synthetic examples of the minority class by interpolating between existing instances, addressing class imbalance during modeling (a minimal example follows). You can try techniques ranging from a simple IQR rule to deep-learning-based autoencoders to detect anomalies in logs, capture fraudulent transactions, or build proactive KPIs for predictive maintenance that minimize unscheduled downtime. Alternatively, you can use GANs for synthetic data generation. This helps balance the class distribution and can improve model performance on imbalanced datasets. If you have labeled audio logs coming from equipment such as wind turbines, you can use a vector database and an embedding model to index the historical audio logs, and then label new audio streams in real time with possible equipment issues that could lead to failure in the near future. This is a very good use case for continuous monitoring.
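
A minimal sketch of rebalancing with SMOTE on synthetic data (requires the imbalanced-learn package); note that SMOTE is applied to the training split only, never to the test set.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly a 3% minority class.
X, y = make_classification(n_samples=5000, weights=[0.97, 0.03], flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# Oversample the minority class on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("class counts before:", np.bincount(y_train))
print("class counts after: ", np.bincount(y_res))
```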

4. Hybrid Modeling and Feature Selection:

Think outside the box when solving business problems. Some issues may require multiple models: ML models alone, ML models combined with optimization, or simulation models. Always identify whether a feature leads or lags your target; using a feature that is a consequence of the target to predict it is not a recommended approach. For example, use weather data to predict rain, and then predict potato sales from the rain forecast. Weather data comes before the rain, while potato sales come after it; you cannot use potato sales to forecast rain. So before building any model, try to understand from a domain expert which features lead and which lag. In many use cases these features can be identified with your own knowledge and understanding; the small check below illustrates the idea.
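
A minimal sketch on synthetic data (hypothetical columns rain_mm and potato_sales) of checking lead/lag structure: shift the candidate feature by several offsets and see where the correlation with the target peaks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
rain = rng.gamma(2.0, 2.0, n)
# Synthetic sales that react to rain with a one-step delay plus noise.
sales = 50 + 4 * np.roll(rain, 1) + rng.normal(0, 2, n)
df = pd.DataFrame({"rain_mm": rain, "potato_sales": sales}).iloc[1:]

# Correlate sales at time t with rain shifted by various offsets:
# a peak at a positive shift means rain leads sales, so rain is a valid predictor.
for shift in range(-2, 3):
    corr = df["potato_sales"].corr(df["rain_mm"].shift(shift))
    print(f"shift={shift:+d}: corr={corr:.2f}")
```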

5. Evaluation and Tuning:

In this final step, evaluate your model. Don’t rely on R² or adjusted R² to evaluate tree-based regressors; use RMSE or MAPE instead. For simple data, employ k-fold cross-validation, but for time-based data or panel datasets, use time-ordered folds (see the sketch below). Optimize your model’s hyperparameters using methods such as random search, grid search, or Bayesian search, and packages like Optuna. Keep in mind that grid search can be slow, so avoid it if you have limited computational resources, a large dataset, or many hyperparameter combinations to explore. If you are working with LLMs, you can evaluate your results against ground truth using ROUGE and BLEU scores: ROUGE is commonly used for summarization and automatic evaluation of text summaries, while BLEU is widely used in machine translation. ROUGE and BLEU have limitations, which is why refined variants such as ROUGE-L and ROUGE-N exist, along with related metrics like TER and NIST; keep them in mind when choosing among LLMs, vector databases, and the many RAG/hybrid RAG approaches.
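
A minimal sketch of time-ordered evaluation and tuning on synthetic data: TimeSeriesSplit keeps the temporal order so future rows never leak into the training folds, and randomized search is cheaper than an exhaustive grid search.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # stand-in for time-ordered features
y = X[:, 0] * 3 + rng.normal(size=500)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [100, 200, 400],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 5, 20],
    },
    n_iter=10,
    cv=TimeSeriesSplit(n_splits=5),           # expanding-window, time-ordered folds
    scoring="neg_root_mean_squared_error",    # RMSE, as suggested above
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```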

It all comes down to your use case and the outcomes you are getting; however, if you can afford it and aren’t too concerned about data privacy, I would recommend giving OpenAI’s LLMs a try. I have found through my own experience solving multiple real-world use cases that, aside from benchmarks like MMLU and HELM popping up daily to win the leaderboard race, there isn’t currently an open-source LLM that can replace GPT-3.5 or GPT-4 without sacrificing accuracy, well-formed responses, and comprehension of the given query.

The choice of RAG approach is:

A. Use case and data specific

B. A trial-and-error process

Likewise with vector databases.

Whether to select a knowledge graph over an RDBMS and a vector database, to combine two or all three to improve RAG, or to use different chunking techniques depends on the quantity and complexity of the relationships in your unstructured data. Besides good textual data, an ideal RAG system for complex unstructured data should also preserve all the information it contains, including tables, graphics, entities, receipts, and domain-specific lingo.

In such cases, you should go with custom parsers or train a custom model by labeling your data with Azure Document Intelligence, instead of doing the heavy lifting of building custom OCR models. You can train a custom model with 5-10 images and get good accuracy, and with a few hundred examples you will be able to get very high-quality results.

I personally used it for a POC to extract highly unstructured, domain-specific information, and the results were outstanding.

This model can also provide JSON output that can be used to build a knowledge graph that preserves relationships, while the textual chunks from those documents can be used to build a vector index or stored in separate nodes of the KG. You can then run hybrid queries to get correct answers with perfectly aligned relationships. In this way, you can build a lossless, information-preserving RAG pipeline; a rough sketch of the idea follows.
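
A hedged, library-agnostic sketch of that hybrid setup, not the exact pipeline described above: the embed function, node labels, and documents are all hypothetical, networkx stands in for a graph database, and a plain dictionary stands in for a vector index. The query hops the graph from an entity to its linked chunks, then ranks those chunks by embedding similarity.

```python
import numpy as np
import networkx as nx

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; in practice this would call an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Knowledge graph built from the (hypothetical) JSON output of the document model.
kg = nx.DiGraph()
kg.add_edge("Invoice-1042", "Acme Corp", relation="billed_to")
kg.add_edge("Invoice-1042", "Chunk-7", relation="described_in")

# Vector index: chunk id -> (embedding, raw text).
chunks = {"Chunk-7": "Invoice 1042 covers Q3 maintenance for Acme Corp turbines."}
index = {cid: (embed(text), text) for cid, text in chunks.items()}

def hybrid_query(entity: str, question: str) -> str:
    # Graph hop: find chunks linked to the entity, then rank them by similarity.
    linked = [n for n in kg.successors(entity) if n in index]
    q = embed(question)
    best = max(linked, key=lambda cid: float(index[cid][0] @ q))
    return index[best][1]

print(hybrid_query("Invoice-1042", "What does invoice 1042 cover?"))
```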

Last point: before getting overwhelmed by your model’s results, try to understand the economics of the data science problem. You should know which metric helps you obtain the optimal values or numbers that address those economics. Whether it is precision, recall, F1, ROUGE, BLEU, IoU, or a custom metric, consider what additional value it will generate and how it will improve your current ROI and KPIs; a small worked example follows.
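
A small worked sketch of putting economics behind a metric: assign a hypothetical business cost to false positives and false negatives, then compare models on expected cost rather than on accuracy alone.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions for a failure-detection classifier.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 1])

COST_FP = 5.0     # assumed cost of an unnecessary maintenance visit
COST_FN = 200.0   # assumed cost of a missed failure / unscheduled downtime

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
expected_cost = fp * COST_FP + fn * COST_FN
print(f"FP={fp}, FN={fn}, expected cost = {expected_cost:.0f}")
```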

Thank you for reading. Please use LinkedIn to contact me with any questions, comments, or advice.

Medium Link

https://www.dhirubhai.net/in/zaid-ahmad-awan-486039222?utm_source=share&utm_campaign=share_via&utm_content=profile
