Machine Learning and Document Analysis: A Practical Approach
Giovanni Emanuele Nocco
SVP Tech & Enterprise Architect at UTU Singapore [Remote]
AI (Artificial Intelligence) and Machine Learning are very popular these days. A lot of resources are available, ranging from traditional papers to the most advanced web platforms. But without a solid mathematical background, any approach will only scratch the surface.
Three books were my starting point for understanding this world, all of them very interesting and engaging.
- Tom Michael Mitchell - Machine Learning
- Christopher M. Bishop - Pattern Recognition And Machine Learning
- Ian Goodfellow, Yoshua Bengio and Aaron Courville - Deep Learning
The first one is quite old, but it offers a lot of useful insight into the mathematical aspects.
The other two give a broad view of how the algorithms work, as well as which algorithm best fits each specific case.
Math
The following mathematical topics are strongly required (although the formal aspects can be skipped):
- Linear algebra
- Probability theory
- Decision theory
- Information theory
The Goal
To get hands-on and see what can be done, I tried to solve a well-defined problem.
The problem:
Can I use ML to implement an advanced NLP algorithm? In particular: can I create a model able to identify specific features inside a document?
The aim of the lab was to identify law references inside institutional documents using a well-defined ML algorithm.
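One possible way to frame this goal as a learning problem is to label every token of a document as "inside a law reference" or "outside". The sketch below is only an illustration of that framing, not the method used in the lab: the example sentence, the `LAW_REFERENCE` pattern and the `label_tokens` helper are all hypothetical, and the pattern is used here just to produce training labels for a toy sentence.

```python
import re

# Hypothetical example sentence; the article, law number and year are illustrative.
text = "The grant is regulated by Article 5 of Law No. 123 of 2001 and later amendments."

# Hypothetical hand-written pattern, used only to generate labels for this toy case.
LAW_REFERENCE = re.compile(r"Article \d+ of Law No\. \d+ of \d{4}")


def label_tokens(text):
    """Return (token, label) pairs: 1 if the token lies inside a law reference, else 0."""
    spans = [m.span() for m in LAW_REFERENCE.finditer(text)]
    pairs = []
    for m in re.finditer(r"\S+", text):
        inside = any(start <= m.start() and m.end() <= end for start, end in spans)
        pairs.append((m.group(), int(inside)))
    return pairs


labeled = label_tokens(text)
for token, label in labeled:
    print(label, token)
```

Pairs like these are the kind of supervised examples a model could be trained on, so that it learns to recognize references that no hand-written pattern anticipates.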
The Algorithms
To analyze big documents, the first issue was to identify an algorithm able to scan large documents.
The second one was to identify word similarity. In other words, how to convert the sparse vector built on a single word into a much denser vector and, at the same time, how to normalize the distance between words based on their meaning.
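The difference between sparse and dense word vectors can be seen in a small sketch. The vocabulary and the dense values below are hand-picked assumptions for illustration; a trained model such as Word2Vec would learn such values from a corpus.

```python
import math

# Hypothetical toy vocabulary.
vocabulary = ["law", "decree", "article", "banana"]


def one_hot(word):
    """Sparse vector: a single 1 at the word's index, 0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocabulary]


# Hypothetical dense vectors, hand-picked so that related words point the same way.
dense = {
    "law": [0.9, 0.8, 0.1],
    "decree": [0.8, 0.9, 0.2],
    "article": [0.7, 0.6, 0.3],
    "banana": [0.1, 0.0, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Distinct one-hot vectors are always orthogonal, so they carry no notion of meaning.
sparse_similarity = cosine(one_hot("law"), one_hot("decree"))
# Dense vectors can place related words close together and unrelated ones far apart.
dense_related = cosine(dense["law"], dense["decree"])
dense_unrelated = cosine(dense["law"], dense["banana"])

print(sparse_similarity, round(dense_related, 2), round(dense_unrelated, 2))
```

With one-hot vectors every pair of different words is equally distant, while the dense representation lets the distance reflect how close two words are in meaning.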
To do that, a mix of a Convolutional Neural Network and the Word2Vec algorithm was chosen.
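How a convolution scans a sequence of word vectors can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the network used in the lab: the two-dimensional embeddings and the kernel weights are set by hand, whereas a real network would learn both, and a framework layer would replace the hypothetical `conv1d` helper.

```python
# Hypothetical 2-dimensional embeddings: the first component stands for
# "legal term", the second for "number-like token".
embeddings = {
    "the": [0.0, 0.0],
    "article": [1.0, 0.0],
    "12": [0.0, 1.0],
    "was": [0.0, 0.0],
    "signed": [0.0, 0.0],
}

# Hand-set kernel of width 2 that responds to a legal term followed by a number.
kernel = [[1.0, 0.0], [0.0, 1.0]]


def conv1d(sequence, kernel):
    """Slide the kernel over the sequence and return one ReLU score per window."""
    width = len(kernel)
    scores = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(
            x * w
            for vec, kvec in zip(window, kernel)
            for x, w in zip(vec, kvec)
        )
        scores.append(max(0.0, score))  # ReLU keeps only positive responses
    return scores


tokens = ["the", "article", "12", "was", "signed"]
scores = conv1d([embeddings[t] for t in tokens], kernel)

# Max pooling: keep the strongest response and remember where it occurred.
best_start = scores.index(max(scores))
print(scores, tokens[best_start:best_start + len(kernel)])
```

Because the same small kernel is reused at every position, the cost grows only linearly with document length, which is one reason convolutions suit the scanning of large documents.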
The environment
The magic world of Python, TensorFlow and CUDA.
The Anaconda environment
With Anaconda, one host can contain many different, isolated environments.
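When several environments live on the same host, it helps to check which one a script is actually running in. A minimal sketch, assuming the environment was activated with `conda activate` (which sets the `CONDA_DEFAULT_ENV` variable); the `describe_environment` helper is a hypothetical name.

```python
import os
import sys


def describe_environment():
    """Report the running Python interpreter and the active conda environment, if any."""
    return {
        "python": sys.version.split()[0],
        "prefix": sys.prefix,  # root directory of the running environment
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


info = describe_environment()
print(info)
```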
TensorFlow is a framework designed by Google for developing Machine Learning algorithms. It is open source, simple to use, and has a huge community. It allows both quick prototypes and complex solutions to be developed using Python as the programming language.
Nvidia CUDA is a parallel computing platform developed by Nvidia that allows using the GPU to process floating-point values efficiently. The high number of GPU cores allows repetitive tasks to run simultaneously.
CPU vs GPU
The following pictures compare the same computational work performed on CPU vs GPU
The same work shown by means of the Python Spyder IDE
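A comparison of this kind could be reproduced with a sketch like the one below. It assumes TensorFlow 2.x; the workload (a square matrix multiplication), the matrix size and the helper names `time_matmul` and `compare_devices` are assumptions chosen for illustration, not the benchmark behind the pictures. If TensorFlow is not installed, or no GPU is visible, the corresponding timing is simply skipped.

```python
import time

try:
    import tensorflow as tf  # assumes TensorFlow 2.x (eager execution)
except ImportError:
    tf = None


def time_matmul(device_name, size=512, repeats=3):
    """Return the best wall-clock time in seconds for a size x size matmul on one device."""
    with tf.device(device_name):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        tf.matmul(a, b).numpy()  # warm-up; .numpy() waits for the result
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            tf.matmul(a, b).numpy()  # includes copying the result back to the host
            best = min(best, time.perf_counter() - start)
    return best


def compare_devices(size=512):
    """Time the same work on the CPU and, when one is visible, on the first GPU."""
    if tf is None:
        return {}
    timings = {"/CPU:0": time_matmul("/CPU:0", size)}
    if tf.config.list_physical_devices("GPU"):
        timings["/GPU:0"] = time_matmul("/GPU:0", size)
    return timings


results = compare_devices()
for device, seconds in results.items():
    print(f"{device}: {seconds:.4f} s")
```

Small matrices may run faster on the CPU because of transfer overhead, so the gap usually only becomes visible as `size` grows.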
The results
Using a Machine Learning approach, it is possible to identify specific patterns such as nouns, locations, chemical/biological compounds, law references, and many more terms inside a textual document.
Using well-known algorithms, frameworks and programming languages, together with a specific execution platform, it is possible to implement awesome products quickly and efficiently.