How to Preprocess Text Data
Many companies run data pipelines and integrate with third-party software or data for their business purposes.
However, before or after using any such software, the data (text, audio, etc.) has to go through a pipeline of preparation tasks to assure that the data being ingested is actually valid.
As with any data product, if you put bad data in, you will get bad output.
For example, for translation you may need to make sure that special characters and Unicode have been handled correctly. In other cases, a company needs to process a large number of records or documents and categorize them before ingesting them into its content platform. It would be unfeasible to categorize thousands (or more) of documents manually, so an automatic preprocessing step that labels them and extracts summaries and metadata is useful. We have seen this in many industries: large legal practices and ministries are one example, but also regulated industries like insurance, trading operations, and even shipping.
In a similar way, after you have processed the data, you need to validate that it will reach its 'final destination' correctly. Data preparation also helps provide context for each piece of data: categorization, summaries, keywords, labels, and snapshots all stem from a good data preparation process.
Businesses that aggregate datasets from many sources (public data, third-party proprietary data, or data obtained from clients or other departments) may encounter a variety of data-related problems. Examples are legal documents, product documentation, process planning and delivery, etc.
Textual data in its raw form typically has the following issues:
- Duplicated data: Two or more records are the same. This can lead to erroneous inventory counts, redundant marketing materials, or needless invoicing.
- Conflicting data: Identical records have distinct properties. For instance, delivery problems can arise when a business uses many addresses, or a word can be ambiguous depending on context ('cat' as the animal or as short for Caterpillar).
- Missing attributes: Gaps make the data incomplete. For example, employees whose social security numbers are absent from the database may not get their payroll executed.
- Invalid data: The attributes of the data do not follow standards. For instance, phone numbers are recorded with nine digits instead of ten.
- Irrelevant data: Data that is not useful for the purpose of the task and is out of context.
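To make these issues concrete, here is a minimal sketch of how each one can be flagged with pandas. The records and column names (customer_id, address, ssn, phone) are hypothetical, chosen only to mirror the examples above:

```python
import pandas as pd

# Hypothetical records; column names are illustrative, not from a real system.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "address": ["1 Main St", "1 Main St", "9 Oak Ave", "5 Elm Rd", "7 Pine Ln"],
    "ssn": ["123-45-6789", "123-45-6789", None, "987-65-4321", "987-65-4321"],
    "phone": ["5551234567", "5551234567", "555123456", "5559876543", "5559876543"],
})

# Duplicated data: identical records repeated verbatim
duplicates = df[df.duplicated()]

# Conflicting data: the same customer_id appearing with different addresses
addr_counts = df.groupby("customer_id")["address"].nunique()
conflicting_ids = addr_counts[addr_counts > 1].index.tolist()

# Missing attributes: records without a social security number
missing_ssn = df[df["ssn"].isna()]

# Invalid data: phone numbers that are not exactly ten digits
invalid_phone = df[~df["phone"].str.fullmatch(r"\d{10}")]

print(len(duplicates), conflicting_ids, len(missing_ssn), len(invalid_phone))
```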
To preprocess the data, we suggest the following 5 steps:
1. Inspection: Find inconsistent, erroneous, and unexpected data.
We run scripts and code to get summary statistics about the data.
For instance, check whether a specific column complies with a set of guidelines or conventions. Is a string or a number recorded in the column? How long is each piece of text on average? Which words are statistically the most frequent?
How are the unique values inside a column distributed (if any)? Is there a relationship or link between the pieces of text?
Then it may be beneficial to produce some visualizations: unexpected, and hence incorrect, results can be found by employing statistical measures like the mean, standard deviation, range, and quantiles to analyze and visualize the data.
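As a minimal sketch of such an inspection pass, assuming the documents have already been loaded as strings (the sample corpus below is invented for illustration):

```python
import pandas as pd
from collections import Counter

# Hypothetical corpus; in practice, load your documents here.
docs = pd.Series([
    "Holidays in Greece are popular in summer.",
    "A small restaurant in London serves great pie.",
    "Holidays in Greece are popular in summer.",
])

# Descriptive statistics over text lengths
lengths = docs.str.len()
print("Average length:", lengths.mean())
print("Range:", lengths.min(), "-", lengths.max())
print("Std deviation:", lengths.std())
print("Quantiles:\n", lengths.quantile([0.25, 0.5, 0.75]))

# Most frequent words across the corpus
words = Counter(w.lower() for doc in docs for w in doc.split())
print("Top words:", words.most_common(5))

# Distribution of unique values (exact duplicates surface as a side effect)
print("Unique texts:", docs.nunique(), "of", len(docs))
```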
It is also useful at this stage to run the text through embeddings and visualize the texts with clustering techniques, just to see if any immediate relationship is evident. This also allows detecting outliers, if any.
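A minimal sketch of the embedding-and-clustering idea, assuming the sentence-transformers and scikit-learn packages are available; the model name, texts, and cluster count are illustrative choices, not recommendations:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "Holidays in Greece are popular in summer.",
    "Beach resorts on the islands fill up quickly.",
    "A small restaurant in London serves great pie.",
    "Quarterly invoice totals did not reconcile.",
]

# Embed each text as a dense vector
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = model.encode(texts)

# Cluster the embeddings to surface groups of related texts
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
print("Cluster labels:", kmeans.labels_)

# Flag a potential outlier: the text farthest from its cluster centroid
dists = np.linalg.norm(embeddings - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print("Possible outlier:", texts[int(np.argmax(dists))])
```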
2. Cleaning: Correct or eliminate any irregularities that are found, for example by using dictionaries or other data to fill the gaps or to replace invalid values.
Depending on the issue and the type of data, several strategies are used in data cleaning, and each has pros and cons of its own.
In the case of purely textual errors, the techniques often used include Unicode normalization, whitespace and punctuation cleanup, spell correction against a dictionary, and removal of exact duplicates.
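A minimal sketch of a few of these techniques using only the Python standard library (the sample string is invented to show the effect of each step):

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters compare equal
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (tabs, non-breaking spaces, vertical tabs)
    text = re.sub(r"\s+", " ", text).strip()
    # Drop any control characters that survived extraction
    return "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")

raw = "Caf\u0065\u0301   menu\u00a0\u2013 summer\u000b2024"
print(clean_text(raw))  # "Café menu – summer 2024"

# Exact-duplicate removal after cleaning, preserving order
docs = ["Hello  world", "Hello world"]
unique_docs = list(dict.fromkeys(clean_text(d) for d in docs))
print(unique_docs)  # ["Hello world"]
```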
3. Contextualizing:
At this stage, the data has been cleaned and we are getting ready to use it in production: ingesting it into a data catalogue or content platform, or processing it in other ways (such as translation).
Before doing this, it is extremely beneficial to add contextualization to the textual data, to make it easier to retrieve information and perform further analysis in the future.
For example, one piece of text may talk about holidays in Greece, another about a specific restaurant in London, UK. In this case you may want to store metadata respectively as:
- location: Greece, London
- action: holiday, eating out
- sentiment: positive/negative
The examples and possibilities are endless, and we stress that all of this is nowadays possible automatically and in a cost-effective way (i.e. using bespoke AI solutions, not manual labor).
It is worth saying that you can even add 'predictive' information from complex text using AI (LLMs specifically).
For example, we have successfully labelled court hearing documents and lawsuit filings: 'effective defense', 'defendant denied', and 'judge approved/denied' can all be extracted from complex text in many languages (including Arabic, Italian, etc.).
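A sketch of how such metadata extraction can be automated with an LLM. Here call_llm is a hypothetical helper standing in for whichever provider you use (the prompt, label set, and canned response are all illustrative):

```python
import json

PROMPT = """Extract metadata from the text below and answer in JSON with
keys: location, action, sentiment.

Text: {text}
JSON:"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call (hosted or local);
    # here it just returns a canned answer for demonstration.
    return '{"location": "Greece", "action": "holiday", "sentiment": "positive"}'

def contextualize(text: str) -> dict:
    raw = call_llm(PROMPT.format(text=text))
    # In production, validate the schema before ingesting the metadata
    return json.loads(raw)

print(contextualize("We loved our two weeks on the Greek islands."))
```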
4. Assurance checks: The accuracy of the results is checked after cleaning.
At this stage, the data is cleaned and contextualized according to its purpose. Now we put in place a process to ensure the data quality (and the context extracted) is as intended.
In order to do so, we define metrics to quantify and measure the quality of the data and of the extracted context.
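For instance, a minimal sketch with illustrative metrics and thresholds: track the share of missing, duplicate, and invalid records, and stop the pipeline when a threshold is exceeded (the validity rule reuses the ten-digit phone number example from earlier):

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    return {
        # Fraction of rows with at least one missing attribute
        "missing_rate": df.isna().any(axis=1).mean(),
        # Fraction of rows that are exact duplicates of an earlier row
        "duplicate_rate": df.duplicated().mean(),
        # Illustrative validity rule: phone numbers must be exactly ten digits
        "invalid_phone_rate": (~df["phone"].str.fullmatch(r"\d{10}")).mean(),
    }

df = pd.DataFrame({
    "phone": ["5551234567", "555123456", "5551234567"],
    "ssn": ["123-45-6789", None, "123-45-6789"],
})

report = quality_report(df)
print(report)
assert report["invalid_phone_rate"] < 0.5, "Too many invalid phone numbers"
```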
5. Reporting: An account of the modifications made, and of the quality of the data currently being saved, is kept on file.
Reporting on the data's health is just as vital as cleansing it. In this report, all the techniques used and the metrics defined for assurance are explained, and the results are reported.
This will be useful when the data is periodically reviewed, modified, and checked.
At this point, the data can be used in applications (such as translation) or ingested into content platforms, ready to be used and retrieved as necessary.
#ai #artificialintelligence #technology #innovation #business #data