How to build a good database for AI and machine learning?

Understanding and solving the business problem is the main objective of any AI or data science project within a company. But once you understand the problem, do you have the data you need to drive the business outcome? Is that data ready to be used for analysis, AI, and data science? Understanding the data is a fundamental step in creating AI and data science solutions.

The quality and availability of data are crucial to the success of AI models: they determine whether the models are accurate and useful. Data experts must therefore use data cleaning and normalization techniques to ensure that data is consistent and accurate. In addition, machine learning models need high-quality training and validation datasets to produce accurate results. AI teams must also adopt an iterative approach to developing models, continuously testing and adjusting them based on results. For image-based AI models, data teams need accurately labeled, high-quality datasets to ensure accurate results. Techniques such as active learning and crowdsourcing can help label data more efficiently.

Moreover, it is important to ensure the privacy and security of the data used in AI models. Encryption and anonymization techniques can help maintain data privacy and security. Ultimately, data science and AI require a collaborative approach: multidisciplinary teams that include data experts, software developers, and domain-specific experts must work together to avoid biases and promote accuracy. Through this blog, we want to show you how to prepare your dataset for use in Arkangel AI, taking into account the type of data you are using and the project you want to carry out.

How to improve your data?

Once you have imported your data, it is time to understand and clean it. Our platform performs an automatic analysis and suggests best practices for doing so.

Preparing your data is an iterative process. Even if you clean and prep your training data before using it, you can still improve its quality by assessing features during EDA (Exploratory Data Analysis).

  1. EDA1 (data ingest): detect issues in your data such as outliers, inliers, excess zeros, and disguised missing values.

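If you want to reproduce some of these checks outside the platform, here is a minimal sketch in Python/pandas. The file name `training_data.csv` and the list of sentinel values are assumptions; adapt them to your own dataset.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
sentinels = {"?", "na", "n/a", "none", "null", "-", "-999"}

# Disguised missing values: sentinel strings that really mean "missing".
for col in df.select_dtypes("object"):
    disguised = df[col].astype(str).str.strip().str.lower().isin(sentinels).sum()
    if disguised:
        print(f"{col}: {disguised} disguised missing values")

for col in df.select_dtypes("number"):
    series = df[col].dropna()
    # Excess zeros: a numeric column dominated by zeros is often suspect.
    zero_ratio = (series == 0).mean()
    if zero_ratio > 0.5:
        print(f"{col}: {zero_ratio:.0%} zeros")
    # Outliers flagged with the interquartile-range rule.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)).sum()
    if outliers:
        print(f"{col}: {outliers} potential outliers")
```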

2. Investigate feature importance: calculate each feature's significance and its correlation with the selected prediction target; our platform does this step automatically. With this information you can assess your data, improve it in EDA2, and keep iterating.
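Outside the platform, you can approximate this step with correlation as a rough proxy for importance. A minimal sketch, assuming the hypothetical `training_data.csv` with `subject_id` and a categorical `target` column:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Encode the categorical target as integer codes so we can correlate with it.
y = df["target"].astype("category").cat.codes

# Absolute correlation of each numeric feature with the target,
# strongest first. Low values suggest features worth improving or dropping.
numeric = df.drop(columns=["target", "subject_id"]).select_dtypes("number")
print(numeric.corrwith(y).abs().sort_values(ascending=False))
```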

Tabular Data

Basic requirements:

  • The data must be in a flat file, tabular, and saved as a CSV file.
  • Each record must be a line in the file and the supporting variables must be columns.
  • The first column must be the subject_id which must be a number for each record.

1. For Classification (multiclass/multilabel) projects

  • You must have a column that includes the target you are trying to predict, which must be categorical.

As a rule of thumb, we recommend a maximum of 10 categories. If you need more than 10, try to divide the problem into multiple prediction steps.

2. For Regression projects

  • You must have a column that includes the target you are trying to predict, which must be a number.

Example of a Tabular Classification Project

To prepare this training data we require a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. Because the information must be anonymized, this identification number cannot be a person's real ID; it must be a number assigned when the database is created. This number is vital for building the algorithm, as it is used to correctly handle multiple entries in the dataset for the same person.
  • Second column [Target]: In this column, you will assign the gold standard that the algorithm must recognize. In other words, you will give a category to each entry that the model will learn from. For example, if there are two classes, you will have “Benign” and “Malignant” for each record.
  • Third column onward [Supporting variables]: In these columns, you will assign the specific characteristics of each entry that will be used for training the algorithm. Each characteristic can be a string, a boolean (true/false), or a number. We strongly recommend including columns that are complete, or that have at most 20% empty cells. You can include as many supporting variables as you wish; the only crucial aspect is that all of them fulfill the requirements mentioned above.
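As an illustration, here is a small sketch that builds a file with this layout using pandas. The column names and values beyond `subject_id` and `target` are invented for the example:

```python
import pandas as pd

# A toy dataset following the required layout: subject_id, target,
# then supporting variables (string, boolean, or number).
records = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "target": ["Benign", "Malignant", "Benign"],
    "age": [54, 61, 47],                              # number
    "smoker": [False, True, False],                   # boolean
    "tissue_type": ["ductal", "lobular", "ductal"],   # string
})
records.to_csv("training_data.csv", index=False)
```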

[Image] Example of a CSV file created for a data algorithm: the first column is named "subject_id", the second column is named "target" and is categorical, and 6 more columns serve as supporting variables.

Imaging Data

Data Best Practices

1. Avoid high cardinality for your target

  • Make sure the number of classes is between 2 and 10, for better performance and a healthier distribution of examples.
  • Avoid columns or rows with more than 30% missing values. Depending on the scenario, such columns can sometimes be imputed.

2. Avoid target leakage

Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction. Target leakage can cause your model to show excellent evaluation metrics but perform poorly on real data.
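One simple (and admittedly rough) heuristic is to flag features that correlate almost perfectly with the target; these often turn out to be leaked information rather than genuinely predictive signals. A sketch, again assuming the hypothetical `training_data.csv`:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
y = df["target"].astype("category").cat.codes

# Near-perfect correlation with the target deserves a manual review:
# was this value really known at prediction time?
suspicious = (
    df.drop(columns=["target", "subject_id"])
      .select_dtypes("number")
      .corrwith(y)
      .abs()
      .loc[lambda s: s > 0.95]
)
print("Review these columns for leakage:", list(suspicious.index))
```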

3. Avoid training-serving skew

Training-serving skew happens when you generate your training data differently than you generate the data you use to request predictions.

For example, if you are planning on making predictions about user lifetime value (LTV) over the next 30 days, make sure the training data is fed in the same format as the serving data: features from the context of a given day, and the outcome measured 30 days later.

In general, any difference between how you generate your training data and your serving data (the data you use to generate predictions) should be reviewed to prevent training-serving skew.
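For the LTV example above, a sketch of how to build training rows that mirror serving conditions might look like this. The `user_events.csv` file and its `user_id`/`date`/`revenue` columns are assumptions:

```python
import pandas as pd

events = pd.read_csv("user_events.csv", parse_dates=["date"])  # hypothetical

def build_ltv_row(user_events: pd.DataFrame, as_of: pd.Timestamp) -> dict:
    """Features use only data up to `as_of`; the label covers the next
    30 days, exactly mirroring how predictions are requested in production."""
    history = user_events[user_events["date"] <= as_of]
    future = user_events[
        (user_events["date"] > as_of)
        & (user_events["date"] <= as_of + pd.Timedelta(days=30))
    ]
    return {
        "n_past_purchases": len(history),
        "past_revenue": history["revenue"].sum(),
        "ltv_next_30d": future["revenue"].sum(),  # the target
    }

# Example: one training row for a single (hypothetical) user.
row = build_ltv_row(events[events["user_id"] == 42], pd.Timestamp("2023-06-01"))
```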

4. Provide a time signal

For classification and regression models, if the underlying pattern in your data is likely to shift over time (it is not randomly distributed in time), make sure you provide that information. You can provide a time signal in several ways:

  • If each row of data has a timestamp, make sure that column is included, has a transformation type of Timestamp, and is set as the Time column when you train your model.
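In a CSV workflow, that usually means parsing the timestamp column into a proper datetime type before upload. A minimal sketch, where the column name `event_date` is an assumption:

```python
import pandas as pd

# Parse the timestamp column on load and keep the temporal order explicit.
df = pd.read_csv("training_data.csv", parse_dates=["event_date"])
df = df.sort_values("event_date")
```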

5. Make information explicit

Some data types that might benefit from feature engineering:

  • Longitude/Latitude
  • URLs
  • IP addresses
  • Email addresses
  • Phone numbers
  • Addresses
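A sketch of what such feature engineering can look like for a few of these types; all column names and values are illustrative:

```python
import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({
    "url": ["https://example.com/products/123"],
    "email": ["jane.doe@hospital.org"],
    "latitude": [4.711],
    "longitude": [-74.072],
})

# Decompose opaque values into parts a model can actually use.
df["url_domain"] = df["url"].map(lambda u: urlparse(u).netloc)
df["email_domain"] = df["email"].str.split("@").str[1]
df["lat_lon_bucket"] = (df["latitude"].round(1).astype(str)
                        + "_" + df["longitude"].round(1).astype(str))
```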

6. Include calculated or aggregated data in a row

Arkangel AI uses only the input data in a single row to predict the target value for that row. If you have calculated or aggregated data from other rows or sources that would be valuable in determining the predicted value for a row, include that data in the row.

For example, if you want to predict next week's demand for a healthcare product, you can improve the quality of the prediction by including columns with the following values:

  • The total number of items in stock from the same category as the product.
  • The average price of items in stock from the same category as the product.
  • The number of days before a known holiday when the prediction is requested.
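With pandas, category-level aggregates like these can be attached to every row with `groupby(...).transform(...)`. A sketch with hypothetical column names:

```python
import pandas as pd

products = pd.read_csv("products.csv")  # hypothetical: one row per product

# Attach category-level aggregates to each product row.
products["category_stock_total"] = (
    products.groupby("category")["items_in_stock"].transform("sum")
)
products["category_avg_price"] = (
    products.groupby("category")["price"].transform("mean")
)

# Days until the next known holiday at prediction time (date is illustrative).
next_holiday = pd.Timestamp("2023-12-25")
products["days_to_holiday"] = (next_holiday - pd.Timestamp.today()).days
```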

7. Avoid bias

Make sure that your training data is representative of the entire universe of potential data that you will be making predictions for.

[Image] Example of an Image Classification Project

Classification problems

1. Represent null values appropriately

  • If you are importing from CSV, use empty strings to represent null values.
  • If your data uses special characters or numbers, including zero, to represent null values, those values will be misinterpreted as ordinary values, reducing model quality.
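With pandas, you can map such sentinel values to real nulls at load time. A sketch, where the sentinel list is an assumption:

```python
import pandas as pd

# Convert sentinel strings to real nulls instead of letting them pass
# through as ordinary values.
df = pd.read_csv(
    "training_data.csv",               # hypothetical file name
    na_values=["?", "N/A", "-999", "none"],
    keep_default_na=True,
)
print(df.isna().mean().sort_values(ascending=False))  # share of nulls per column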

2. Avoid missing values where possible

Check your data for missing values, and correct them if possible. Otherwise, you can leave the value blank, and it is treated as a null value. We treat each missing value with different techniques to improve your training dataset.

3. Use spaces to separate text

Arkangel AI tokenizes text strings and can derive training signals from individual words. It uses spaces to separate words; words separated by other characters are treated as a single entity.

For example, if you provide the text "red/green/blue", it is not tokenized into "red", "green", and "blue". If those individual words might be important for training the model, you should transform the text to "red green blue" before including it in your training data.
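A one-line normalization pass can take care of this before you export your CSV; for example:

```python
import re
import pandas as pd

df = pd.DataFrame({"colors": ["red/green/blue", "red-green", "blue"]})

# Replace common separators with spaces so each word becomes its own token.
df["colors"] = df["colors"].map(lambda s: re.sub(r"[/\-_|;,]+", " ", s))
# "red/green/blue" -> "red green blue"
```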

4. Make sure your categorical features are accurate and clean

Data inconsistencies can cause categories to be incorrectly split. For example, if your data includes "Brown", “bròwn” and "brown", each is treated as a separate category. Misspellings can have a similar effect. Make sure you remove these kinds of inconsistencies from your categorical data before creating your training data.
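A small normalization helper that lowercases, trims, and strips accents collapses these variants into a single category:

```python
import unicodedata
import pandas as pd

def normalize_category(value: str) -> str:
    """Lowercase, trim, and strip accents so 'Brown', 'bròwn' and 'brown '
    collapse into a single category."""
    value = value.strip().lower()
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

df = pd.DataFrame({"color": ["Brown", "bròwn", "brown "]})
df["color"] = df["color"].map(normalize_category)
print(df["color"].unique())  # ['brown']
```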

5. Use extra care with imbalanced classes for classification models

If you have imbalanced classes (a classification problem with one or more outcomes that is seen rarely), review the following tips.

6. Provide sufficient training data for the minority class

Having too few rows of data for one class degrades model quality. If possible, you should provide at least 100 rows of data for every class.

7. Consider using a manual split

Arkangel AI selects the rows for the test dataset randomly (but deterministically). For imbalanced classes, you could end up with a small number of the minority class in your test dataset, or even none, which causes training to fail.

If you have imbalanced classes, you might want to assign a manual split to make sure enough rows with the minority outcomes are included in every split.
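If you are preparing that split yourself, one common way to build it is a stratified split, which preserves class proportions in both partitions. A sketch using scikit-learn and the hypothetical `training_data.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # hypothetical file name

# `stratify` keeps the class proportions in both splits, so even a rare
# minority class is represented in the test set.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["target"]
)
print(train_df["target"].value_counts(), test_df["target"].value_counts(), sep="\n")
```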

Takeaways

  • The success in implementing AI and data science solutions largely depends on having the right data, in the appropriate quantity and quality.
  • It's important to clearly understand and define the business problem being addressed before starting to collect data.
  • Data is not just an input for AI and data science, but also an output: models and algorithms generate useful data that can feed into other applications or systems.
  • Data management should be a constant and continuous process, not just something done at the beginning of a project. Data quality should be monitored and improved continuously.
  • Data privacy and security are crucial and should be considered from the outset of any data project.
  • Collaboration and teamwork are essential for effective data management and the creation of successful AI and data science solutions.

Do you still need help with your data? We can help you! Schedule a free consultation with us:

