How can you validate and test data for machine learning?

Powered by AI and the LinkedIn community

Data engineering is the process of preparing and managing data for machine learning and other analytical tasks. It involves collecting, cleaning, transforming, integrating, and storing data from various sources and formats. Data engineering also requires validating and testing data to ensure its quality, reliability, and suitability for machine learning models. In this article, you will learn some of the common methods and tools for data validation and testing in data engineering for machine learning.

Top experts in this article

Selected by the community from 31 contributions.

Earl Mark Joseph Santos

Data Science | Quantitative Trading | Engineering
Andre De Almeida

Founder, CEO, Board member @ Dom Rock | Generative AI, Business Development | Member of the R&D group in Artificial…
Dominic Ligot

Technologist, Social Impact, Data Ethics, AI

1 Data validation

Data validation is the process of checking if the data meets certain criteria or expectations, such as data types, ranges, formats, completeness, accuracy, consistency, and uniqueness. Data validation can help you identify and correct errors, outliers, missing values, duplicates, and anomalies in your data before feeding it to machine learning algorithms. Data validation can be performed at different stages of the data pipeline, such as during data ingestion, transformation, integration, or loading. Some of the tools and frameworks that can help you with data validation are:

- Pandas: A popular Python library for data analysis and manipulation that provides methods for inspecting and validating data, such as info(), describe(), isnull(), dropna(), fillna(), unique(), duplicated(), and drop_duplicates(), plus assert_frame_equal() in pandas.testing for comparing two DataFrames.

- Great Expectations: An open-source Python library that allows you to define and test data quality expectations using a declarative syntax. You can use Great Expectations to validate data against schemas, rules, distributions, patterns, and thresholds, and generate data documentation and profiling reports.

- Deequ: An open-source Scala library that enables you to define and verify data quality metrics using Apache Spark. You can use Deequ to compute data quality statistics, such as completeness, uniqueness, distinctness, compliance, and correlation, and apply data quality constraints and checks.
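As a rough illustration of the Pandas checks listed above, the sketch below tests completeness, uniqueness, and a value range on a small DataFrame. The column names (`user_id`, `age`), the 0 to 120 age range, and the `validate_frame` helper are assumptions made for this example, not part of any library.

```python
import pandas as pd


def validate_frame(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable validation failures (empty if clean)."""
    problems = []

    # Completeness: required columns must not contain missing values.
    for col in ("user_id", "age"):
        n_missing = int(df[col].isnull().sum())
        if n_missing:
            problems.append(f"{col}: {n_missing} missing value(s)")

    # Uniqueness: user_id is assumed to be a key and must not repeat.
    n_dupes = int(df["user_id"].duplicated().sum())
    if n_dupes:
        problems.append(f"user_id: {n_dupes} duplicate(s)")

    # Range: ages outside 0..120 are treated as errors (assumed bounds).
    # Missing ages are skipped here because the completeness check reports them.
    ages = df["age"].dropna()
    n_out_of_range = int((~ages.between(0, 120)).sum())
    if n_out_of_range:
        problems.append(f"age: {n_out_of_range} value(s) outside 0..120")

    return problems


raw = pd.DataFrame({"user_id": [1, 2, 2, 4], "age": [34, None, 151, 28]})
problems = validate_frame(raw)
for line in problems:
    print(line)
```

Returning a list of failures instead of raising on the first one makes it easier to report every problem in a batch at once; a pipeline could still fail the run whenever the list is non-empty.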

Earl Mark Joseph Santos

Data Science | Quantitative Trading | Engineering
When diving into data validation, visualization packages like Matplotlib and Seaborn are super handy. They help you visually spot any oddities or trends right from the get-go. If you're focusing on a specific sector or field, you'll want to make sure your data lines up with what's typically expected there. For those working with time-bound data, ensuring there aren't unexpected gaps or jumps is key. And a pro tip? Always double-check where your data's coming from – a reliable source can save you a ton of validation headaches down the road!
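One way to check time-bound data for unexpected gaps is to compare consecutive timestamps against the expected sampling interval. This is a minimal sketch with Pandas; the `find_gaps` helper, the hourly interval, and the sample readings are assumptions made for illustration.

```python
import pandas as pd


def find_gaps(timestamps: pd.Series, expected: pd.Timedelta) -> pd.DataFrame:
    """List every pair of consecutive timestamps further apart than `expected`."""
    ordered = timestamps.sort_values().reset_index(drop=True)
    delta = ordered.diff()   # time since the previous reading
    mask = delta > expected  # the first row is NaT and compares as False
    return pd.DataFrame(
        {
            "gap_start": ordered.shift(1)[mask],
            "gap_end": ordered[mask],
            "length": delta[mask],
        }
    ).reset_index(drop=True)


# Hypothetical hourly readings with two hours missing after 01:00.
readings = pd.Series(
    pd.to_datetime(
        [
            "2024-01-01 00:00",
            "2024-01-01 01:00",
            "2024-01-01 04:00",
            "2024-01-01 05:00",
        ]
    )
)
gaps = find_gaps(readings, pd.Timedelta(hours=1))
print(gaps)
```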

Andre De Almeida

Founder, CEO, Board member @ Dom Rock | Generative AI, Business Development | Member of the R&D group in Artificial Cognitive Systems @ UNIFESP
Alongside the technical aspects of the data itself, there is room to build business criteria into the data validation process. Examples include fields that must not be null, values that must fall within certain ranges, and fields that only carry business meaning when linked to other fields. This approach can be implemented as a class or function driven by YAML configuration files.
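A configuration-driven validator along these lines could look like the sketch below. To keep it self-contained, a plain dictionary stands in for the parsed YAML file; the rule keys (`nullable`, `min`, `max`), the column names, and the `check_rules` function are all hypothetical. Only null and range rules are covered; cross-field rules would need an extra rule type.

```python
import pandas as pd

# Hypothetical rule set. In practice this dict could be loaded from a YAML file.
RULES = {
    "order_id": {"nullable": False},
    "quantity": {"nullable": False, "min": 1, "max": 1000},
    "discount": {"nullable": True, "min": 0.0, "max": 1.0},
}


def check_rules(df: pd.DataFrame, rules: dict) -> dict:
    """Return {column: number of violating values} for columns that break a rule."""
    violations = {}
    for column, rule in rules.items():
        series = df[column]
        count = 0
        # Null rule: columns are nullable unless the config says otherwise.
        if not rule.get("nullable", True):
            count += int(series.isnull().sum())
        # Range rules apply only to the values that are present.
        present = series.dropna()
        if "min" in rule:
            count += int((present < rule["min"]).sum())
        if "max" in rule:
            count += int((present > rule["max"]).sum())
        if count:
            violations[column] = count
    return violations


orders = pd.DataFrame(
    {
        "order_id": [101, 102, None],
        "quantity": [2, 0, 5],
        "discount": [0.1, None, 1.5],
    }
)
violations = check_rules(orders, RULES)
print(violations)
```

Keeping the rules in configuration rather than code lets business owners review and change thresholds without touching the validation logic.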

Aurimas Griciūnas

AI Engineer | Follow me to Learn about AI Systems | Author of SwirlAI Newsletter | Public Speaker
On top of the regular data validation techniques common in data engineering pipelines, I would also add:

- Check for feature drift at the validation stage. Monitoring drift after the ML model is deployed is good practice, but it is expensive to implement, so checking incoming training data for drift is a good alternative for flagging distribution shifts that call for retraining.

- Version the data used to train your ML models. Also track the training/validation splits and the random seeds used in your training runs for full reproducibility.
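One possible way to check incoming training data for feature drift is a two-sample Kolmogorov-Smirnov test per numeric feature, comparing the new batch against a reference sample. The sketch below uses `scipy.stats.ks_2samp`; the `drifted_features` helper, the feature names, the synthetic data, and the 0.01 significance level are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(reference: dict, incoming: dict, alpha: float = 0.01) -> list:
    """Return names of features whose incoming distribution differs from the reference."""
    flagged = []
    for name, reference_values in reference.items():
        result = ks_2samp(reference_values, incoming[name])
        # A small p-value suggests the two samples come from different distributions.
        if result.pvalue < alpha:
            flagged.append(name)
    return flagged


# Synthetic data: "price" shifts upward in the new batch, "rating" does not.
rng = np.random.default_rng(seed=0)
reference = {
    "price": rng.normal(100, 10, 2000),
    "rating": rng.normal(4, 0.5, 2000),
}
incoming = {
    "price": rng.normal(130, 10, 2000),
    "rating": rng.normal(4, 0.5, 2000),
}
flagged = drifted_features(reference, incoming)
print(flagged)
```

With large samples the test becomes sensitive to very small shifts, so the threshold usually needs tuning against what actually degrades the model.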

Mahsut Demiroğlu

Generative AI & LLMs | Machine Learning | Data Science | Data Engineering | Cloud
TensorFlow Extended (TFX) offers a full-fledged data validation framework. It comprises a sequence of components for data ingestion, validation, transformation, and preparation. ExampleGen ingests and optionally splits the input dataset. StatisticsGen calculates statistics for the dataset. SchemaGen examines the statistics and creates a data schema. ExampleValidator looks for anomalies and missing values in the dataset. Transform performs feature engineering on the dataset. TFX is especially helpful for developing production-grade ML models, and its built-in data validation capabilities let you inspect training and test data distributions visually.

Christelle JULIAS

AI & ML Eng. | DSML Advocate | DeepRL | GenAI | CX Specialist | Query Valkyrie | Educator
One thing I have found helpful is visualizing the data before and after each operation. This small step highlights possible flaws in your data and helps you plan how to arrive at a balanced dataset.

2 Data testing

Data testing is the process of verifying if the data meets the requirements and specifications of the machine learning models, such as data size, shape, distribution, balance, and features. Data testing can help you evaluate and improve the performance, accuracy, and robustness of your machine learning models. Data testing can be performed at different stages of the machine learning lifecycle, such as during data preprocessing, feature engineering, model training, validation, or deployment. Some of the tools and frameworks that can help you with data testing are:

- Scikit-learn: A widely used Python library for machine learning that provides various methods and functions for data testing, such as train_test_split(), cross_validate(), GridSearchCV(), RandomizedSearchCV(), accuracy_score(), confusion_matrix(), classification_report(), and roc_curve().

- pytest: A popular Python testing framework that allows you to write and run automated tests for your data and code. You can use pytest to create test cases, fixtures, mocks, and assertions for your data engineering and machine learning projects.

- MLflow: An open-source platform for managing the end-to-end machine learning lifecycle that enables you to track, compare, and reproduce your data and model experiments. You can use MLflow to log and monitor your data and model metrics, parameters, artifacts, and versions, and deploy your models to various environments.
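Data tests of this kind can be written as ordinary test functions that pytest discovers by their `test_` prefix. The sketch below checks shape, missing values, and class balance; the `make_dataset` loader, the expected shape, and the 30% minority-class threshold are assumptions standing in for a real project's data and requirements.

```python
import numpy as np


def make_dataset():
    """Hypothetical stand-in for the project's real data-loading step."""
    rng = np.random.default_rng(seed=42)
    features = rng.normal(size=(200, 4))
    labels = np.array([0] * 120 + [1] * 80)
    return features, labels


def test_shape_matches_model_input():
    features, labels = make_dataset()
    assert features.shape == (200, 4)  # the model is assumed to expect 4 features
    assert labels.shape == (200,)


def test_no_missing_or_infinite_values():
    features, _ = make_dataset()
    assert np.isfinite(features).all()


def test_classes_are_reasonably_balanced():
    _, labels = make_dataset()
    minority_share = np.bincount(labels).min() / labels.size
    assert minority_share >= 0.3  # assumed threshold; adjust per use case
```

Saved in a file such as `test_data.py` (a hypothetical name), these run with the `pytest` command; in a larger suite, the loader would typically become a fixture shared across tests.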

Data engineering for machine learning is a complex and iterative process that requires constant validation and testing of your data. By using the methods and tools discussed in this article, you can ensure that your data is of high quality, reliability, and suitability for your machine learning models, and that your models are performing as expected and meeting your objectives.

César R. F.

Data Analytics Manager | Data Project Manager | Power BI | Python | SQL | DAX | M | HTML | CSS | JS
To validate and test data for machine learning, I usually split the data into three sets: a training set, a validation set, and a test set. Use the training set to train the model, the validation set to tune the model's hyperparameters, and the test set to evaluate the model's performance on unseen data. Following these steps helps you check that the model is not overfitting to the training data and that it generalizes to unseen data.
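A three-way split can be sketched with two calls to Scikit-learn's `train_test_split`. The 60/20/20 proportions, the synthetic arrays, and the fixed random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 500 rows, 2 features, perfectly balanced binary labels.
X = np.arange(1000).reshape(500, 2)
y = np.array([0, 1] * 250)

# Step 1: set aside 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: split the remainder into training and validation sets.
# 25% of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions the same in each split, and fixing `random_state` makes the split reproducible across runs.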

Diego Horna

Azure AI & Data Analytics Management | ITIL 4 | Agile | Lean 6 Sigma | Driving Delivery Excellence and Digital Capability
Begin by assessing data quality, size, distribution, and balance. Use Scikit-learn's functions for unbiased validation. Employ pytest for automated, robust testing, creating test cases and fixtures. Validate data transformations against the model's needs. For monitoring and reproducibility, use MLflow to log metrics, parameters, and artifacts, which makes model comparison easier. Analyze the results, then iterate and refine. Test during preprocessing, feature engineering, training, and deployment, and validate that the data aligns with your objectives. This approach supports informed decisions and improves model robustness.

Earl Mark Joseph Santos

Data Science | Quantitative Trading | Engineering
Strategies like stratified splits or k-fold cross-validation should be employed for comprehensive testing. Tools from Scikit-learn, like SelectKBest, can help in prioritizing relevant features. For a robust assessment, synthetic data generation methods, including SMOTE, are recommended along with adversarial testing to gauge model resilience.
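To illustrate the stratified k-fold idea, the sketch below uses Scikit-learn's `StratifiedKFold` on an imbalanced label vector and counts the minority-class samples in each test fold. The 90/10 class ratio, the five folds, and the placeholder feature matrix are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 3))  # placeholder features; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many minority-class samples land in each test fold.
minority_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(minority_per_fold)
```

Stratification keeps the class ratio roughly constant across folds, so no fold ends up without minority-class examples, which can happen with a plain random split.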

Ganesh DG

Data Engineering & Project Management professional | VP - Sr Technical Manager with Expertise in Financial and Manufacturing Domain | Mentor | Views expressed are my own
Test how efficiently the code handles larger volumes of data than the lab-condition load. Check whether the code still performs well when the data grows to 10 to 100 times the size of the test data. Initial results are often good, but once large pipelines are added, real-world runs can become slow and ineffective, which should be avoided.

Shafkat Rahman

ML Engineering | LLMOps | Generative AI applications + solutions
Here is the correct process:

1. Split the dataset first and set your test set aside.
2. Fit the transformation on the train set and transform it.
3. Transform the rest of the data.

After transforming the train set, use the same fitted parameters to transform the rest of the data. With min-max scaling, for example, the min and max values computed on the train set are used to scale the test samples.
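This split-then-transform order can be sketched with Scikit-learn's `MinMaxScaler`, fitting on the training set only and reusing the fitted parameters on the test set. The toy single-feature array and the 75/25 split are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy data: one feature with values 0..19.
X = np.arange(20, dtype=float).reshape(-1, 1)

# 1. Split first and set the test set aside.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# 2. Fit the scaler on the training set only, then transform it.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse the training min and max to transform the test set.
#    Test values may fall outside [0, 1] if they exceed the training range.
X_test_scaled = scaler.transform(X_test)

print(scaler.data_min_, scaler.data_max_)
```

Fitting the scaler on the full dataset before splitting would leak information about the test set into training, which tends to make evaluation results look better than they are.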

3 Here’s what else to consider

This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?

Dominic Ligot

Technologist, Social Impact, Data Ethics, AI
One thing that's not commonly discussed is model drift. This is how models deteriorate over time. At regular intervals it's always good to test how the through-the-door population or dataset compares with the original development, test, and validation samples. A significant variance can presage a model deterioration.
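One common way to compare a through-the-door population with the development sample is the population stability index (PSI), sketched below with NumPy. The `population_stability_index` function, the ten quantile bins, and the synthetic score distributions are assumptions for illustration. Cutoffs around 0.1 and 0.25 are often quoted as rules of thumb for moderate and significant shift, but the right threshold depends on the use case.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a development sample (expected) and a recent sample (actual)."""
    # Bin edges come from quantiles of the development sample.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip so values outside the development range fall into the outer bins.
    expected_counts, _ = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)
    actual_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)
    # A small floor avoids division by zero and log(0) for empty bins.
    eps = 1e-6
    expected_share = np.clip(expected_counts / len(expected), eps, None)
    actual_share = np.clip(actual_counts / len(actual), eps, None)
    return float(
        np.sum((actual_share - expected_share) * np.log(actual_share / expected_share))
    )


rng = np.random.default_rng(seed=1)
development = rng.normal(0.0, 1.0, 5000)  # sample the model was built on
recent = rng.normal(1.0, 1.0, 5000)       # shifted through-the-door population

psi_same = population_stability_index(development, development)
psi_shifted = population_stability_index(development, recent)
print(psi_same, psi_shifted)
```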

Aurimas Griciūnas

AI Engineer | Follow me to Learn about AI Systems | Author of SwirlAI Newsletter | Public Speaker
ML models are only useful as long as they solve our business problems. Always monitor business metrics and how they improve or deteriorate as you iterate on new versions of your model. Several widely used approaches to online model testing exist, including A/B testing, interleaving experiments, and multi-armed bandits.

Ganesh DG

Data Engineering & Project Management professional | VP - Sr Technical Manager with Expertise in Financial and Manufacturing Domain | Mentor | Views expressed are my own
When building a framework, it is important to understand which data points make a real difference to the output and the hypothesis. Outliers and low-impact data points sometimes skew the results. Working with the business to understand the data points and their impact is critical.

Diego Horna

Azure AI & Data Analytics Management | ITIL 4 | Agile | Lean 6 Sigma | Driving Delivery Excellence and Digital Capability
Validating and testing data for machine learning extends beyond technical processes. Cultural aspects like collaboration and mindset play a crucial role. Encourage open communication between data engineers and domain experts to refine validation rules effectively, and keep an adaptable mindset. Consider an e-commerce scenario: a model for customer preferences failed because seasonal trends were never validated. Collaboration could have revealed the importance of temporal validation. Stories like this show why understanding the data's context matters. Don't rely solely on tools; contextual insights matter.

Yooki

Data Scientist/Engineer @ Gleeson Recruitment Group | Professional bear
Ensure that the data is free from biases, as biased data can lead to unfair or discriminatory model outcomes. It's a blend of technical insight and a deep understanding of the data origin and implications.
