How to build a good database for AI and machine learning?

Understanding and solving the business problem is the main objective of any AI or data science project within a company. But once you understand the problem, do you have the data you need to drive the business outcome? Is that data ready to be used for analysis, AI, and data science? Understanding the data is a fundamental step in creating AI and data science solutions.

The quality and availability of data are crucial to the success of AI models: they determine whether the models are accurate and useful. Data experts must therefore use data cleaning and normalization techniques to ensure that data is consistent and accurate. In addition, machine learning models need high-quality training and validation datasets to produce accurate results. AI teams must also adopt an iterative approach to developing models, continuously testing and adjusting them based on results. For image-based AI models, data teams need accurately labeled, high-quality datasets to ensure accurate results. Techniques such as active learning and crowdsourcing can help label data more efficiently.

Moreover, it is important to ensure the privacy and security of the data used in AI models. Encryption and anonymization techniques can help maintain data privacy and security. Ultimately, data science and AI require a collaborative approach: multidisciplinary teams that include data experts, software developers, and domain-specific experts must work together to avoid biases and promote accuracy. Through this blog, we want to show you how to prepare your dataset for use in Arkangel AI, taking into account the type of data you are using and the project you want to carry out.

How to improve your data?

Once you have imported your data, it is time to understand and clean it. Our platform performs an automatic analysis and suggests best practices for doing so.

Preparing your data is an iterative process. Even if you clean and prep your training data before using it, you can still improve its quality by assessing features during EDA (Exploratory Data Analysis).

  1. EDA1 (data ingest): detect issues in your data such as outliers, inliers, excess zeros, and disguised missing values.

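If you want to reproduce some of these checks outside the platform, here is a minimal sketch in Python/pandas. The file name `training_data.csv` and the list of sentinel values are assumptions; adapt them to your own dataset.

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
sentinels = {"?", "na", "n/a", "none", "null", "-", "-999"}

# Disguised missing values: sentinel strings that really mean "missing".
for col in df.select_dtypes("object"):
    disguised = df[col].astype(str).str.strip().str.lower().isin(sentinels).sum()
    if disguised:
        print(f"{col}: {disguised} disguised missing values")

for col in df.select_dtypes("number"):
    series = df[col].dropna()
    # Excess zeros: a numeric column dominated by zeros is often suspect.
    zero_ratio = (series == 0).mean()
    if zero_ratio > 0.5:
        print(f"{col}: {zero_ratio:.0%} zeros")
    # Outliers flagged with the interquartile-range rule.
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = ((series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)).sum()
    if outliers:
        print(f"{col}: {outliers} potential outliers")
```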

2. Investigate feature importance: calculate each feature's significance and its correlation with the selected prediction target; our platform does this step automatically. With this information you can assess your data, improve it in EDA2, and keep iterating.
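Outside the platform, you can approximate this step with correlation as a rough proxy for importance. A minimal sketch, assuming the hypothetical `training_data.csv` with `subject_id` and a categorical `target` column:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name

# Encode the categorical target as integer codes so we can correlate with it.
y = df["target"].astype("category").cat.codes

# Absolute correlation of each numeric feature with the target,
# strongest first. Low values suggest features worth improving or dropping.
numeric = df.drop(columns=["target", "subject_id"]).select_dtypes("number")
print(numeric.corrwith(y).abs().sort_values(ascending=False))
```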

Tabular Data

Basic requirements:

  • The data must be in a flat file, tabular, and saved as a CSV file.
  • Each record must be a line in the file and the supporting variables must be columns.
  • The first column must be the subject_id which must be a number for each record.

1. For Classification (multiclass/multilabel) projects

  • You must have a column that includes the target you are trying to predict, which must be categorical.

As a rule of thumb, we recommend a maximum of 10 categories. If you need more than 10, try to divide the problem into multiple prediction steps.

2. For Regression projects

  • You must have a column that includes the target you are trying to predict, which must be a number.

Example of a Tabular Classification Project

To prepare this training data we require a minimum of 3 columns:

  • First column [Subject_id]: You must assign an identification number to each record in this column. Because the information must be anonymized, this identification number cannot be a person's real ID; it must be a number assigned when the database is created. This number is vital for building the algorithm, as it is used to correctly handle multiple entries in the dataset for the same person.
  • Second column [Target]: In this column, you will assign the gold standard that the algorithm must recognize. In other words, you will give a category to each entry that the model will learn from. For example, if there are two classes, you will have “Benign” and “Malignant” for each record.
  • Third column onward [Supporting variables]: In these columns, you will assign the specific characteristics of each entry that will be used for training the algorithm. Each characteristic can be a string, a boolean (true/false), or a number. We strongly recommend including columns that are complete, or that have at most 20% empty cells. You can include as many supporting variables as you wish; the only crucial aspect is that all of them fulfill the requirements mentioned above.
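As an illustration, here is a small sketch that builds a file with this layout using pandas. The column names and values beyond `subject_id` and `target` are invented for the example:

```python
import pandas as pd

# A toy dataset following the required layout: subject_id, target,
# then supporting variables (string, boolean, or number).
records = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "target": ["Benign", "Malignant", "Benign"],
    "age": [54, 61, 47],                              # number
    "smoker": [False, True, False],                   # boolean
    "tissue_type": ["ductal", "lobular", "ductal"],   # string
})
records.to_csv("training_data.csv", index=False)
```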

[Image] Example of a CSV file created for a data algorithm: the first column is named "subject_id", the second column is named "target" and is categorical, and 6 more columns serve as supporting variables.

Imaging Data

Data Best Practices

1. Avoid high cardinality for your target

  • Make sure the number of classes is between 2 and 10, for better performance and a healthier distribution of examples.
  • Avoid columns or rows with more than 30% missing values. Depending on the scenario, such columns can sometimes be imputed.

2. Avoid target leakage

Target leakage happens when your training data includes predictive information that is not available when you ask for a prediction. Target leakage can cause your model to show excellent evaluation metrics but perform poorly on real data.
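One simple (and admittedly rough) heuristic is to flag features that correlate almost perfectly with the target; these often turn out to be leaked information rather than genuinely predictive signals. A sketch, again assuming the hypothetical `training_data.csv`:

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
y = df["target"].astype("category").cat.codes

# Near-perfect correlation with the target deserves a manual review:
# was this value really known at prediction time?
suspicious = (
    df.drop(columns=["target", "subject_id"])
      .select_dtypes("number")
      .corrwith(y)
      .abs()
      .loc[lambda s: s > 0.95]
)
print("Review these columns for leakage:", list(suspicious.index))
```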

3. Avoid training-serving skew

Training-serving skew happens when you generate your training data differently than you generate the data you use to request predictions.

For example, if you are planning on making predictions about user lifetime value (LTV) over the next 30 days, make sure the training data is fed in the same format as the serving data: features from the context of a given day, and the outcome measured 30 days later.

In general, any difference between how you generate your training data and your serving data (the data you use to generate predictions) should be reviewed to prevent training-serving skew.
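For the LTV example above, a sketch of how to build training rows that mirror serving conditions might look like this. The `user_events.csv` file and its `user_id`/`date`/`revenue` columns are assumptions:

```python
import pandas as pd

events = pd.read_csv("user_events.csv", parse_dates=["date"])  # hypothetical

def build_ltv_row(user_events: pd.DataFrame, as_of: pd.Timestamp) -> dict:
    """Features use only data up to `as_of`; the label covers the next
    30 days, exactly mirroring how predictions are requested in production."""
    history = user_events[user_events["date"] <= as_of]
    future = user_events[
        (user_events["date"] > as_of)
        & (user_events["date"] <= as_of + pd.Timedelta(days=30))
    ]
    return {
        "n_past_purchases": len(history),
        "past_revenue": history["revenue"].sum(),
        "ltv_next_30d": future["revenue"].sum(),  # the target
    }

# Example: one training row for a single (hypothetical) user.
row = build_ltv_row(events[events["user_id"] == 42], pd.Timestamp("2023-06-01"))
```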

4. Provide a time signal

For classification and regression models, if the underlying pattern in your data is likely to shift over time (it is not randomly distributed in time), make sure you provide that information. You can provide a time signal in several ways:

  • If each row of data has a timestamp, make sure that column is included, has a transformation type of Timestamp, and is set as the Time column when you train your model.
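In a CSV workflow, that usually means parsing the timestamp column into a proper datetime type before upload. A minimal sketch, where the column name `event_date` is an assumption:

```python
import pandas as pd

# Parse the timestamp column on load and keep the temporal order explicit.
df = pd.read_csv("training_data.csv", parse_dates=["event_date"])
df = df.sort_values("event_date")
```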

5. Make information explicit

Some data types that might benefit from feature engineering:

  • Longitude/Latitude
  • URLs
  • IP addresses
  • Email addresses
  • Phone numbers
  • Addresses
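A sketch of what such feature engineering can look like for a few of these types; all column names and values are illustrative:

```python
import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({
    "url": ["https://example.com/products/123"],
    "email": ["jane.doe@hospital.org"],
    "latitude": [4.711],
    "longitude": [-74.072],
})

# Decompose opaque values into parts a model can actually use.
df["url_domain"] = df["url"].map(lambda u: urlparse(u).netloc)
df["email_domain"] = df["email"].str.split("@").str[1]
df["lat_lon_bucket"] = (df["latitude"].round(1).astype(str)
                        + "_" + df["longitude"].round(1).astype(str))
```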

6. Include calculated or aggregated data in a row

Arkangel AI uses only the input data in a single row to predict the target value for that row. If you have calculated or aggregated data from other rows or sources that would be valuable in determining the predicted value for a row, include that data in the row.

For example, if you want to predict next week's demand for a healthcare product, you can improve the quality of the prediction by including columns with the following values:

  • The total number of items in stock from the same category as the product.
  • The average price of items in stock from the same category as the product.
  • The number of days before a known holiday when the prediction is requested.
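With pandas, category-level aggregates like these can be attached to every row with `groupby(...).transform(...)`. A sketch with hypothetical column names:

```python
import pandas as pd

products = pd.read_csv("products.csv")  # hypothetical: one row per product

# Attach category-level aggregates to each product row.
products["category_stock_total"] = (
    products.groupby("category")["items_in_stock"].transform("sum")
)
products["category_avg_price"] = (
    products.groupby("category")["price"].transform("mean")
)

# Days until the next known holiday at prediction time (date is illustrative).
next_holiday = pd.Timestamp("2023-12-25")
products["days_to_holiday"] = (next_holiday - pd.Timestamp.today()).days
```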

7. Avoid bias

Make sure that your training data is representative of the entire universe of potential data that you will be making predictions for.

[Image] Example of an Image Classification Project

Classification problems

1. Represent null values appropriately

  • If you are importing from CSV, use empty strings to represent null values.
  • If your data uses special characters or numbers, including zero, to represent null values, those values will be misinterpreted as ordinary values, reducing model quality.
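With pandas, you can map such sentinel values to real nulls at load time. A sketch, where the sentinel list is an assumption:

```python
import pandas as pd

# Convert sentinel strings to real nulls instead of letting them pass
# through as ordinary values.
df = pd.read_csv(
    "training_data.csv",               # hypothetical file name
    na_values=["?", "N/A", "-999", "none"],
    keep_default_na=True,
)
print(df.isna().mean().sort_values(ascending=False))  # share of nulls per column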

2. Avoid missing values where possible

Check your data for missing values, and correct them if possible. Otherwise, you can leave the value blank, and it is treated as a null value. We treat each missing value with different techniques to improve your training dataset.

3. Use spaces to separate text

Arkangel AI tokenizes text strings and can derive training signals from individual words. It uses spaces to separate words; words separated by other characters are treated as a single entity.

For example, if you provide the text "red/green/blue", it is not tokenized into "red", "green", and "blue". If those individual words might be important for training the model, you should transform the text to "red green blue" before including it in your training data.
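A one-line normalization pass can take care of this before you export your CSV; for example:

```python
import re
import pandas as pd

df = pd.DataFrame({"colors": ["red/green/blue", "red-green", "blue"]})

# Replace common separators with spaces so each word becomes its own token.
df["colors"] = df["colors"].map(lambda s: re.sub(r"[/\-_|;,]+", " ", s))
# "red/green/blue" -> "red green blue"
```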

4. Make sure your categorical features are accurate and clean

Data inconsistencies can cause categories to be incorrectly split. For example, if your data includes "Brown", “bròwn” and "brown", each is treated as a separate category. Misspellings can have a similar effect. Make sure you remove these kinds of inconsistencies from your categorical data before creating your training data.
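A small normalization helper that lowercases, trims, and strips accents collapses these variants into a single category:

```python
import unicodedata
import pandas as pd

def normalize_category(value: str) -> str:
    """Lowercase, trim, and strip accents so 'Brown', 'bròwn' and 'brown '
    collapse into a single category."""
    value = value.strip().lower()
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

df = pd.DataFrame({"color": ["Brown", "bròwn", "brown "]})
df["color"] = df["color"].map(normalize_category)
print(df["color"].unique())  # ['brown']
```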

5. Use extra care with imbalanced classes for classification models

If you have imbalanced classes (a classification problem with one or more outcomes that is seen rarely), review the following tips.

6. Provide sufficient training data for the minority class

Having too few rows of data for one class degrades model quality. If possible, you should provide at least 100 rows of data for every class.

7. Consider using a manual split

Arkangel AI selects the rows for the test dataset randomly (but deterministically). For imbalanced classes, you could end up with a small number of the minority class in your test dataset, or even none, which causes training to fail.

If you have imbalanced classes, you might want to assign a manual split to make sure enough rows with the minority outcomes are included in every split.
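If you are preparing that split yourself, one common way to build it is a stratified split, which preserves class proportions in both partitions. A sketch using scikit-learn and the hypothetical `training_data.csv`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # hypothetical file name

# `stratify` keeps the class proportions in both splits, so even a rare
# minority class is represented in the test set.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["target"]
)
print(train_df["target"].value_counts(), test_df["target"].value_counts(), sep="\n")
```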

Takeaways

  • The success in implementing AI and data science solutions largely depends on having the right data, in the appropriate quantity and quality.
  • It's important to clearly understand and define the business problem being addressed before starting to collect data.
  • Data is not just an input for AI and data science, but also an output: models and algorithms generate useful data that can feed into other applications or systems.
  • Data management should be a constant and continuous process, not just something done at the beginning of a project. Data quality should be monitored and improved continuously.
  • Data privacy and security are crucial and should be considered from the outset of any data project.
  • Collaboration and teamwork are essential for effective data management and the creation of successful AI and data science solutions.

Do you still need help with your data? We can help you! Schedule a free consultation with us:

