Garbage-in, Garbage-out – whether in the AIML Data world or any other field.

Yes, you read that correctly. If you eat bad or excessive food, you will get sick no matter how physically fit you are. Similarly, an AIML model trained on poor-quality data will produce unreliable results, regardless of how advanced the model is, even if it runs on high-performance computing (HPC) systems built on CPUs, GPUs, TPUs, NPUs, or DPUs. High speed or low latency with bad results will not benefit your business.

Data plays a central role in the AIML world. To improve your AIML model's accuracy, you may need to re-train, re-rank, or re-classify it. It is common MLOps practice to evaluate models and retrain them as the data changes over time, with those changes typically tracked through change data capture (CDC). If the changes captured by CDC fall below your business-value threshold, the MLOps cycle can run less often. Training and inference are expensive, and data is immensely valuable, so leverage it properly for your business growth.
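As a minimal sketch of the threshold idea above: the check below compares the share of rows changed since the last training run against a business-defined threshold. The `ChangeSummary` fields, the 10% default, and the function names are illustrative assumptions, not part of any specific CDC or MLOps tool.

```python
from dataclasses import dataclass


@dataclass
class ChangeSummary:
    """Counts reported by a (hypothetical) CDC feed since the last training run."""
    rows_at_last_training: int
    inserted: int
    updated: int
    deleted: int


def change_ratio(summary: ChangeSummary) -> float:
    """Fraction of the training-time dataset that has changed since then."""
    if summary.rows_at_last_training <= 0:
        # No baseline yet: treat everything as changed so a first run is triggered.
        return 1.0
    changed = summary.inserted + summary.updated + summary.deleted
    return changed / summary.rows_at_last_training


def should_retrain(summary: ChangeSummary, threshold: float = 0.10) -> bool:
    """Retrain only when the change ratio reaches the assumed business threshold."""
    return change_ratio(summary) >= threshold


if __name__ == "__main__":
    quiet_week = ChangeSummary(100_000, inserted=1_200, updated=800, deleted=0)
    busy_week = ChangeSummary(100_000, inserted=9_000, updated=4_000, deleted=500)
    print(should_retrain(quiet_week))  # 2% of rows changed
    print(should_retrain(busy_week))   # 13.5% of rows changed
```

In practice the threshold would likely be set per use case, and a row-count ratio is only one possible signal; drift in feature distributions or model accuracy could be used instead.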

At a high level (the 10,000-foot view):

· Data: Curated or "gold" data is immensely valuable. Extracting it from raw data is a key Data Engineering effort. Data pipelines capture changes (CDC) in the behavior and dynamics of your business data, keeping it up to date and actionable. Data can exist without AIML, but AIML cannot exist without data.

· Training: Building a model requires curated training datasets from which it learns patterns. This process demands significant storage and compute power, and it involves trying many permutations and combinations during fine-tuning until the error rate is acceptable.

· Inference: Using a trained model to analyze new data and produce predictions, decisions, or question-answer (QA) pairs.

Mining extracts useful materials from the earth, like gold or iron. Similarly, data mining extracts raw data and converts it into useful, functional data for various use cases, using the data curation techniques and pipelines that data engineers build. Eliminating data silos and integrating data from various sources, at whatever frequency is needed, is essential to create unified data that serves all of its consumers.
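To make the curation step concrete, here is a small sketch that merges raw records from two sources into a "gold" set by validating, normalizing, and de-duplicating them. The field names (`customer_id`, `amount`), the rule that negative amounts are invalid, and the `curate` function itself are hypothetical choices for illustration only.

```python
from typing import Iterable


def curate(records: Iterable[dict],
           required: tuple[str, ...] = ("customer_id", "amount")) -> list[dict]:
    """Turn raw records into a small 'gold' set: validate, normalize, de-duplicate."""
    gold: dict[str, dict] = {}
    for rec in records:
        # Drop records that are missing a required field.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            continue  # amount is not numeric: treat it as garbage
        if amount < 0:
            continue  # assumed business rule: negative amounts are invalid
        # Normalize the key so " c1 " and "C1" are recognized as the same customer.
        key = str(rec["customer_id"]).strip().upper()
        # Later records overwrite earlier ones, so the most recent version wins.
        gold[key] = {"customer_id": key, "amount": amount}
    return list(gold.values())


if __name__ == "__main__":
    crm = [{"customer_id": " c1 ", "amount": "10.5"},
           {"customer_id": "c2", "amount": None}]
    billing = [{"customer_id": "C1", "amount": 12},
               {"customer_id": "c3", "amount": "abc"},
               {"customer_id": "c4", "amount": -5}]
    print(curate(crm + billing))
```

Only one record survives here: the two `C1` rows collapse into the later one, and the rows with a missing, non-numeric, or negative amount are dropped before they can reach a model.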

Data around AIML

All three areas (training, inference, and data) are expensive. Training has to run on a recurring cycle (CI/CD) whenever the data changes rapidly. Inference is less computationally intensive than training. Turning data into a valuable gold dataset for AIML use cases is key, and it requires the help of functional and data domain experts.

AI helps drive innovation, increase competitiveness, and ultimately boost productivity and revenue. So why do so many AI applications and ML models still struggle in production with enterprise data? Data intelligence, or AIML, is more meaningful and accurate when it is built on your own data rather than on public or simulated sample data, because a model that has not been trained on your business data does not understand its semantics. Common reasons for failure include bias due to insufficient or inconsistent data, failure to choose and use the right features, variance issues from overfitting to specific data patterns, training data that does not reflect real-world fluctuations, and significant outliers at inference time. Trust your own data, make it useful and meaningful, and feed it into your models so they can learn from it.
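One of the failure modes above, outliers at inference time, can be illustrated with a simple guard. The sketch below learns an acceptable range for one feature from the training data and flags inference inputs that fall outside it. The 3-standard-deviation rule, the sample values, and the function names are assumptions chosen for illustration; real systems often use more robust methods.

```python
import statistics


def fit_bounds(training_values: list[float], z: float = 3.0) -> tuple[float, float]:
    """Learn an acceptable range (mean +/- z standard deviations) from training data."""
    mean = statistics.fmean(training_values)
    std = statistics.pstdev(training_values)
    return mean - z * std, mean + z * std


def is_outlier(value: float, bounds: tuple[float, float]) -> bool:
    """Flag an inference-time value that falls outside the range seen in training."""
    low, high = bounds
    return value < low or value > high


if __name__ == "__main__":
    # Hypothetical feature values observed during training.
    training = [10, 12, 11, 13, 9, 10, 11, 12]
    bounds = fit_bounds(training)
    for incoming in (11.5, 40):
        status = "outlier" if is_outlier(incoming, bounds) else "ok"
        print(incoming, status)
```

A flagged value does not have to be rejected; it could instead be logged or routed for review, since a steady stream of such values is a hint that the training data no longer reflects the business.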

Catalogs play a major role in the AIML Data world, much as the five senses (sight, hearing, smell, taste, and touch) do for the human body. A catalog provides visibility into and access to the data, showing what data consumers can do with it and how it can best be leveraged to help the business.

In conclusion: Just as you need the right groceries, meat, and ingredients to prepare proper food, and should consume only what is required to stay healthy, you need the right data from valid sources, prepared according to business needs, so that models are trained only on proper, valid, and required data. Data that has not been weighed and vetted this way can seriously harm the efficiency and effectiveness of your models.

**** DATA SPEAKS LOUDER THAN WORDS ****

#data #machinelearning #dataengineering #dataanalysts #datascientists #AIML #artificialintelligence

