How to Preprocess Text Data

Many companies run data pipelines and integrate with third-party software or data for their business purposes.

However, before or after using any software, the data (text, audio) has to go through a pipeline of preparation tasks to quality-assure it, ensuring the data to ingest is actually valid.

As with anything involving data products, if you input 'bad' data you will get bad output.

For example, for translation you may need to make sure special characters and Unicode have been handled correctly. In other cases, a company needs to process a large number of records or documents and categorize them before ingesting them into its content platform. It would be unfeasible to categorize thousands of documents (or more) manually, so an automatic preprocessing step that labels them and extracts summaries and metadata is useful. We have seen this in many industries: large legal practices and ministries are one example, but also regulated industries like insurance, trading operations, and even shipping.

In a similar way, after you have processed the data, you need to validate or ensure that it will reach its 'final destination' in the right shape. Data preparation also helps to provide 'context' for each piece of data: categorization, summaries, keywords, labels, and snapshots all stem from a good data preparation process.

Businesses that aggregate datasets from many sources (public data, third-party proprietary data, or data obtained from clients or other departments) may encounter a variety of data-related problems. Examples are legal documents, product documentation, process planning and delivery, etc.

Textual data in its raw form typically has the following issues:

- Data duplication: Two or more records are the same. This could lead to erroneous inventory counts, redundant marketing materials, or needless invoicing.

- Conflicting data: Conflicting data occurs when identical records have distinct properties. For instance, delivery problems can arise from a business using many addresses, or a word can be ambiguous depending on context ('cat' as the animal or short for 'caterpillar').

- Missing data: Missing attributes make the data incomplete. Example: employees whose social security numbers are absent from the database may not have their payroll processed.

- Invalid data: The characteristics of the data do not follow standards. For instance, phone numbers are recorded with nine digits as opposed to ten (see the sketch after this list).

- Irrelevant data: Data not useful for the purpose of the task and out of context.
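As a quick illustration, here is a minimal sketch of how duplicated and invalid records can be flagged programmatically, assuming the records sit in a pandas DataFrame (the column names are hypothetical, not from any specific system):

```python
import pandas as pd

# Hypothetical records; column names are illustrative.
df = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "phone": ["5551234567", "555123456", "5551234567", "5559876543"],
    "text": ["invoice A", "invoice A", "invoice A", "invoice B"],
})

# Data duplication: identical rows across all columns except the id.
duplicates = df[df.duplicated(subset=["phone", "text"], keep=False)]

# Invalid data: phone numbers that are not exactly ten digits.
invalid_phones = df[~df["phone"].str.fullmatch(r"\d{10}")]

print(duplicates)
print(invalid_phones)
```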


To preprocess the data, we suggest the following 5 steps:

1. Inspection: Find inconsistent, erroneous, and unexpected data.

We run scripts and code to get summary statistics about the data.

For instance, see if a specific column complies with a set of guidelines or conventions. Is a string or a number recorded in the column? How long is each piece of text on average? Which words appear most frequently?

How are the unique values inside a column (if any) distributed? Is there a relationship or link between the pieces of text?

Then, it may be beneficial to produce some visualizations; unexpected, and hence incorrect, values can be found by employing statistical measures like mean, standard deviation, range, and quantiles to analyze and visualize the data.
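Here is a minimal sketch of this kind of inspection, assuming the texts sit in a pandas DataFrame column named text (a hypothetical name):

```python
import pandas as pd

# Toy corpus with a missing value and a duplicate.
df = pd.DataFrame({"text": ["Holidays in Greece were great!",
                            "The delivery was late.",
                            None,
                            "The delivery was late."]})

# Basic health checks: type, missing values, duplicates.
print(df["text"].dtype)
print("missing:", df["text"].isna().sum())
print("duplicated:", df["text"].duplicated().sum())

# Length statistics: mean, std, range, and quantiles flag outliers.
print(df["text"].dropna().str.len().describe())

# Most frequent words across the corpus.
words = df["text"].dropna().str.lower().str.split().explode()
print(words.value_counts().head(10))
```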

It is also useful at this stage to run the texts through embeddings and visualize them with clustering techniques, just to see if any immediate relationship is evident. This will also allow detecting outliers, if any.
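For example, a minimal sketch using the sentence-transformers and scikit-learn libraries (the model name and the cluster count are assumptions for illustration):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.cluster import KMeans

texts = ["Holidays in Greece", "A restaurant in London",
         "Beaches in Crete", "qwerty asdf zxcv"]  # the last one is a likely outlier

# Embed the texts into dense vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts)

# Cluster the embeddings; k=2 is an arbitrary choice for this toy corpus.
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)
print(kmeans.labels_)

# Flag the text farthest from its cluster centroid as a candidate outlier.
dists = np.linalg.norm(embeddings - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print(texts[int(np.argmax(dists))])
```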

2. Cleaning: Correct or eliminate any irregularities that are found. For example, using dictionaries or other data to 'fill the gaps', or replacing invalid values with valid ones.

Depending on the issue and the type of data, several strategies are used in data cleaning, each with pros and cons of its own.

In the case of purely textual errors, here are the techniques often used (a combined sketch follows the list):

  • Standardization of cases. This is one of the most popular NLP preprocessing steps, where the text is changed to the same case, usually lower case. However, in some NLP tasks this step may result in information loss. Words written in uppercase, for instance, can represent strong emotions like wrath or enthusiasm in a sentiment analysis task. In such situations we might want to skip this step entirely or conduct it differently.
  • Harmonizing accented characters. Accented characters like à, é, and so on indicate stress on a certain letter when pronouncing a word. Accent marks can sometimes shed light on a word's semantics, which could be unclear otherwise. Even though you might not often come across accented characters, it is a good idea to normalize them into plain ASCII characters.
  • Handling contractions. Contractions are words or syllables condensed by taking out one or more letters, sometimes combining several words. 'I'll', for instance, is the contraction of 'I will', and 'don't' of 'do not'. Treating 'I will' and 'I'll' differently could lead to the model performing poorly, so expanding each contraction to its full form is a healthy habit. The contractions library allows us to expand contractions into their full form.
  • Eliminating special characters. Non-alphanumeric characters such as %, $, &, etc. are known as special characters. In most NLP tasks they cause noise in algorithms and offer no benefit to text interpretation. Regular expressions are a useful tool for eliminating them.
  • Misspellings or wrong syntax. Sometimes a single word is misspelled, like 'reslove' instead of 'resolve', or the syntax is wrong, as in 'the code not functioning is well'. For these tasks we use a combination of open-source tools (like TextBlob) and proprietary tools to fix the text.
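Here is a minimal sketch combining these techniques, assuming the contractions and textblob packages are installed (the function name and the example sentence are illustrative):

```python
import re
import unicodedata

import contractions            # pip install contractions
from textblob import TextBlob  # pip install textblob

def clean_text(text: str) -> str:
    # Expand contractions: "I'll" -> "I will".
    text = contractions.fix(text)
    # Standardize case (consider skipping this for sentiment tasks).
    text = text.lower()
    # Harmonize accented characters into plain ASCII: "café" -> "cafe".
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Eliminate special characters with a regular expression.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Fix simple misspellings, e.g. "reslove" -> "resolve".
    return str(TextBlob(text).correct())

print(clean_text("I'll reslove the café issue!!! ($100)"))
```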

3. Contextualizing:

At this stage, the data has been cleaned and we are getting ready to use it in production: ingesting it into a data catalogue or content platform, or processing it in other ways (like translation, for example).

Before doing this, it is extremely beneficial to add 'contextualization' to the textual data, to help retrieve information in the future and perform further analysis easily.

For example, one piece of text may talk about holidays in Greece, another about a specific restaurant in London, UK. In this case you may want to store metadata respectively as (a sketch of automatic extraction follows the list):

- location: Greece, London

- action: holiday, eating out

- sentiment: positive/negative
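As one possible approach, here is a minimal sketch that extracts locations with spaCy's named entity recognizer and sentiment with TextBlob (the metadata schema is an illustrative assumption; in practice an LLM can extract richer labels such as the 'action' above):

```python
import spacy                   # pip install spacy
from textblob import TextBlob  # pip install textblob

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def contextualize(text: str) -> dict:
    doc = nlp(text)
    # GPE = geopolitical entity (countries, cities, states).
    locations = [ent.text for ent in doc.ents if ent.label_ == "GPE"]
    # Polarity ranges from -1 (negative) to +1 (positive).
    polarity = TextBlob(text).sentiment.polarity
    return {"location": locations,
            "sentiment": "positive" if polarity >= 0 else "negative"}

print(contextualize("Our holidays in Greece were wonderful."))
print(contextualize("The restaurant in London was a letdown."))
```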

The examples and possibilities are endless, and we stress that all of this is nowadays possible automatically and in a cost-effective way (i.e. using bespoke AI solutions, not manual labor).

It is worth saying that you can even add 'predictive' information from complex text using AI (LLMs specifically).

For example, we have successfully labelled court hearing documents and lawsuit filings: 'effective defense', 'defendant denied', and 'judge approved/denied' can all be extracted from complex text in many languages (including Arabic, Italian, etc.).

4. Assurance checks: The accuracy of the results is checked after cleaning.

At this stage, the data is cleaned and contextualized according to the purpose. Now we put in place a process to ensure the data quality (and the context extracted) is as intended.

In order to do so, we define metrics to quantify and measure (a sketch follows the list):

  • Relevance: the extent to which the data abides by the specified business requirements. Here the context extracted in the previous step may prove useful, as the metric can be defined on the extracted metadata.
  • Accuracy/Precision: the extent to which the data matches the true values. Although describing every conceivable legitimate value makes it easier to identify invalid values, it does not imply that the values are accurate. Random checks and ad hoc metrics can be defined to verify the data.
  • Completeness: the degree to which all required data is known. We may have few texts, or none, for a certain category of text, or all records may miss certain information (example: the location is not specified in a restaurant review).
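A minimal sketch of such metrics over the extracted metadata, assuming records shaped like the contextualize() output above (the field names are illustrative):

```python
import pandas as pd

# Hypothetical contextualized records.
records = pd.DataFrame([
    {"location": ["Greece"], "sentiment": "positive"},
    {"location": [],         "sentiment": "negative"},
    {"location": ["London"], "sentiment": None},
])

# Completeness: share of records where each required field is present.
completeness = {
    "location": (records["location"].str.len() > 0).mean(),
    "sentiment": records["sentiment"].notna().mean(),
}
print(completeness)  # e.g. {'location': 0.67, 'sentiment': 0.67}

# A relevance-style validity check: sentiment must be a known value.
print(records["sentiment"].isin(["positive", "negative"]).mean())
```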

5. Reporting: An account of the modifications made, and of the quality of the data currently being saved, is kept on file.

Reporting on the data's health is just as vital as cleansing. In this report, all the techniques used and the metrics defined for assurance are explained, and the results are reported.
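For instance, a minimal sketch that keeps such a report on file as JSON (the field names and values are illustrative):

```python
import json
from datetime import datetime, timezone

report = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "techniques": ["contraction expansion", "lowercasing",
                   "accent normalization", "spell correction"],
    "metrics": {"completeness_location": 0.67,
                "completeness_sentiment": 0.67,
                "valid_sentiment": 0.67},
    "records_processed": 3,
}

# Persist next to the processed data for future periodic reviews.
with open("data_quality_report.json", "w") as f:
    json.dump(report, f, indent=2)
```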

This will be useful when the data is periodically reviewed, modified, and checked.

At this point, the data can be used in applications (such as translation) or ingested into content platforms, ready to be used and retrieved as necessary.


#ai #artificialintelligence #technology #innovation #business #data
