WAZEN: An Arabic NLP Project to Find Word Variation Patterns, with Word Root Extraction as an Application
Motivation
Natural Language Processing (NLP) is a field that combines Artificial Intelligence and Linguistics, and it is concerned with making computers understand statements or words written or spoken in human languages. NLP applications have spread across various fields such as information extraction, chatbots, spam detection, etc. Arabic NLP suffers from a lack of libraries that provide robust solutions for basic operations such as Arabic stemming, Arabic stop words, "tashkeel", roots, lemmatizing, etc.
Every language has its own characteristics. One characteristic of the Arabic language is the structure of its words, which makes it possible to extract a variation pattern [Wazen] from each word. An Arabic word is composed of two main parts: the root, a set of three letters, and the affix, the added letters, which can produce a vast number of words that all share the root's letters but reflect different semantic variations.
It was hard to find a labeled dataset to train the classifiers, so a large labeled dataset was created using a semi-automated method and reviewed manually.
In this work, three supervised machine learning algorithms were trained and validated on 20k labeled words in order to detect the variation pattern of a given word, a preprocessing step needed for Arabic NLP text, and about 96% accuracy was recorded. More details about potential applications and ideas are given in the Applications section.
Arabic Language Background
The word variation pattern indicates whether a word is plural or singular, past or present, noun or verb, subject or object, and can even indicate the gender of the speaker, etc. The syntactic pattern of Arabic words is the key concept behind our extraction algorithm. Consequently, we do not have to deal with every word in the Arabic vocabulary; instead we only deal with a finite set of word forms, which helps in stemming and in understanding each word better.
As mentioned in the Motivation section, the added letters (affixes), which are placed before the root, after it, or in the middle (prefix, suffix, and infix respectively), give information about the word's meaning, for example:
? "?" is added at the beginning of a verb to indicate that is plural and present: ????
? “??” is added at the end of the verb to indicate that it is past and plural: ?????
?This is also applicable for nouns for example: ?????? ??????
? "?" is added after the first letter to indicate that this is a subject. ????
? "?" is added to the beginning of the word and "?" is added after the second letter to indicate that this is an object: ?????
This structure is helpful because any word in the Arabic vocabulary can be traced back to its three-letter root. Every word in the Arabic vocabulary can be matched with a corresponding word in a list called the variation patterns, for example:
? "???" is matched with "???"??
? "???? " is matched with "????"
Each variation pattern has a specific meaning (a minimal code sketch of this matching follows these examples), for example:
? "???--> ???" is a verb that means hit.?
? "???? --->????" is a noun that means a bat (something you use to hit).?
Another major aspect of the Arabic language is the special marks called harakat, which can be added to letters and change the meaning. Nowadays, most people write words without harakat.
Methodology
Data preparation
The dataset was prepared semi-automatically as described in the following steps:
1. A web scraping script was written to gather basic Arabic verbs [those that consist of 3 letters] from websites, noting that we tried to cover all the verb types in Arabic, such as:
We call the letters [ا, و, ي] the "long vowels".
- ???? : that doesn’t have long vowels like: ???, ???
- ???? ???? : that starts with one of the long vowels like: ???, ???
- ???? ???? : that has one of the long vowels in the middle like: ???, ???
- ???? ???? : that ends with one of long vowels like: ???, ???
- ???? ????? : that starts and ends with one of the long vowels like: ???, ???
- ???? ????? : that has one of the long vowels as 2nd and 3rd letters?like: ???, ???
- ????? : that has ? character like:???, ???
- ?????: that has ? character like: ???, ???
2. We wrote a script to derive words from the original verb by adding the needed affixes, taking into consideration the different formats of the above types (a simplified sketch follows this example), e.g.:
ضرب → نضرب → ضارب → مضروب → ضربوا, etc.
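A simplified sketch of this derivation step (illustrative, not the project's actual script): derivation can be seen as filling pattern templates with the root letters.

```python
# Illustrative sketch of step 2: generate labeled word variants from a
# 3-letter root by filling pattern templates, where ف, ع, ل mark the
# positions of the root letters.

PATTERNS = ["فعل", "نفعل", "فاعل", "مفعول"]  # a tiny sample of the 110 classes

def apply_pattern(root: str, pattern: str) -> str:
    """Fill the template letters of `pattern` with the letters of `root`."""
    mapping = dict(zip("فعل", root))  # ف -> 1st, ع -> 2nd, ل -> 3rd letter
    return "".join(mapping.get(ch, ch) for ch in pattern)

for p in PATTERNS:
    print(p, "->", apply_pattern("ضرب", p))  # each output is one labeled sample
# فعل -> ضرب, نفعل -> نضرب, فاعل -> ضارب, مفعول -> مضروب
```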
3. The output of the above two steps is about 20K words, which were reviewed manually. Not all of these words are in everyday use, but we found that their structure is still useful for training.
Table 1 shows the word variation types that we used for training the models; we limited the number of classes to 110.
Feature extraction algorithm
We studied and analyzed the structure of the words to extract the most informative and distinguishing characters and sub-words. Representing each word as a vector of numbers is the key operation that enables machine learning algorithms to understand words, since ML algorithms deal with numbers far more conveniently than with raw text. So the first question when working with words is how to encode them in a way that reflects their structure. One-hot encoding, bag of words, and Word2vec are examples of encoding words as vectors of numbers, and they demonstrate how better word representations can be learned from much bigger datasets [1].
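As an illustration of this encoding step (WAZEN's exact feature set is described only in general terms above), character n-grams are one simple way to turn words into vectors that preserve affix information:

```python
# Illustration only: encode Arabic words as character n-gram count
# vectors with scikit-learn; unigrams and bigrams capture affixes such
# as a leading م or ن. WAZEN's actual features may differ.
from sklearn.feature_extraction.text import CountVectorizer

words = ["ضرب", "نضرب", "ضارب", "مضروب"]

vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(words)

print(vectorizer.get_feature_names_out())  # the learned character n-grams
print(X.toarray())                         # one numeric row per word
```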
Machine Learning Techniques and Results
Three supervised algorithms were used in this project to train classifiers: NN, Naïve Bayes, and SVM. Each classifier was trained and validated on the same training data. The experimental results show that the classifiers yielded better results with NN than with Naïve Bayes, and lastly SVM. We trained the NN and SVM using the scikit-learn library, while we used the Naïve Bayes implementation from NLTK. A hedged sketch of such a setup is shown below.
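This sketch uses placeholder hyperparameters and toy data (WAZEN's actual configuration and its NLTK-based Naïve Bayes are not reproduced here; scikit-learn's MultinomialNB stands in):

```python
# Sketch of the training setup; hyperparameters and data are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# toy stand-in for the 20K labeled (word, pattern) pairs
words = ["ضرب", "نضرب", "ضارب", "مضروب"]
labels = ["فعل", "نفعل", "فاعل", "مفعول"]

models = {
    "NN": MLPClassifier(max_iter=500),
    "Naive Bayes": MultinomialNB(),  # the project used NLTK's implementation
    "SVM": SVC(),
}
for name, model in models.items():
    clf = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 2)), model
    ).fit(words, labels)
    print(name, clf.predict(["يضرب"]))  # classify an unseen word form
```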
The overall results for our various ML algorithms are summarized in Table 2 below.
NN achieved the best result, which is understandable given its complexity. However, Naïve Bayes generally performs well on text classification tasks; in our case it was very robust despite its simplicity and gave better accuracy than SVM. In the end, developers tend to choose the simplest algorithm that gives good enough results. The standard deviation across the k-fold validations shows that the models are stable, and we observed minimal differences between the final retained model and the mean of the k-fold models.
Validation
The purpose of this step is to see how our final model will perform in the wild. It is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set. Here is the methodology we used: all the data we have [around 20K words] is split into two parts:
1. Training data: 70% of the data [around 14K words], used to train and prepare the model.
2. Testing data: 30% of the data [around 6K words], representing unseen observations that were not used in model training.
In one round of cross validation, the original training dataset is split into two parts: I. a cross validation training set, and II. a cross validation testing set. Let's unpack this further with the pseudo sketch below:
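(A schematic stand-in for the sketch, using toy data and a simple classifier just to show the mechanics; the original sketch was a diagram.)

```python
# Schematic of the validation methodology: a 70/30 train/test split,
# then k-fold cross validation inside the 70% training portion.
import numpy as np
from sklearn.datasets import make_classification  # toy stand-in data
from sklearn.model_selection import KFold, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

# 70% training data / 30% unseen testing data, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for tr, val in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    model = GaussianNB().fit(X_train[tr], y_train[tr])      # I. CV training set
    scores.append(model.score(X_train[val], y_train[val]))  # II. CV testing set

print("CV mean accuracy:", np.mean(scores))  # performance on average
print("CV std:", np.std(scores))             # expected variation in practice
print("held-out test accuracy:", GaussianNB().fit(X_train, y_train).score(X_test, y_test))
```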
Why did we use k-fold cross validation?
By using cross validation, we tested our machine learning models in the training phase to check for overfitting and to estimate the generalization capability of the classifier, i.e. to get an idea of how the models will generalize to independent data, which is the test dataset. We can estimate the mean of the model's score across rounds to see how well the procedure performs on average, and in addition calculate the standard deviation of these scores to see how much the result is expected to vary in practice.
Future work
WAZEN, in the first released version, supports:
1. Detecting the variation pattern of a word.
2. Giving the three-letter root of the word when possible, since some Arabic word forms consist of only 1 or 2 letters (illustrated in the sketch below).
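A simplified sketch of that root extraction (illustrative, not WAZEN's actual code): once the variation pattern of a word is known, the root is read off the letters aligned with ف, ع, ل.

```python
# Simplified sketch of root extraction: the root letters are the
# characters of the word aligned with the template letters ف, ع, ل
# of its detected variation pattern.

def extract_root(word: str, pattern: str) -> str:
    """e.g. extract_root("مضروب", "مفعول") -> "ضرب"."""
    return "".join(w for w, p in zip(word, pattern) if p in "فعل")

print(extract_root("مضروب", "مفعول"))  # ضرب
print(extract_root("ضارب", "فاعل"))    # ضرب
```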
As an extension to this work, we would like to work on the following:
1. Handle the special marks (harakat) in order to give a precise meaning to the form of the word and to distinguish between two words that have the same letters but different harakat.
2. Work on a robust solution to decide whether a word is a noun or a verb, and whether it is singular, dual, or plural, which could be useful in chatbots.
3. Add more variation patterns in order to cover all the variations in the Arabic language.
4. Evaluate the feature extraction algorithm that we used and explore the potential for a better one.
Applications
WAZEN could be used in the following (a sketch of the indexing idea follows this list):
1. Indexing [treating words that refer to the same root as one word] and searching.
2. Understanding the context of sentences: if a sentence contains a word on the pattern استفعل then it expresses a request, while the pattern فعّال means doing that action a lot.
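A hedged sketch of the indexing idea in application 1 (the root lookup table here is a hypothetical stand-in for the classifier's output, not a published WAZEN API):

```python
# Illustrative sketch of root-based indexing: words sharing a root are
# indexed under one key, so a search for any form finds its siblings.
from collections import defaultdict

# hypothetical output of pattern detection + root extraction
word_to_root = {"ضرب": "ضرب", "نضرب": "ضرب", "ضارب": "ضرب", "مضروب": "ضرب"}

documents = {1: ["ضارب"], 2: ["مضروب", "نضرب"]}

index = defaultdict(set)
for doc_id, doc_words in documents.items():
    for w in doc_words:
        index[word_to_root.get(w, w)].add(doc_id)

print(index["ضرب"])  # {1, 2}: one query form retrieves all related forms
```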
The final conclusion is that harakat/tashkeel restoration should be the first preprocessing step for Arabic text in order to fully understand context.