Machine Learning and Document Analysis: A Practical Approach

AI (Artificial Intelligence) and Machine Learning are very popular these days. Plenty of resources are available, ranging from traditional printed books to the most advanced web platforms. But without a solid mathematical background, any approach will only scratch the surface.

Three books were my starting point for understanding this field, all of them very interesting and engaging.

The first one is quite old, but it offers a lot of useful insight into the mathematical aspects.

The others give a broad view of how the algorithms work and of which algorithm best suits each specific case.

Math

The following mathematical topics are strongly required (although the formal aspects can be skipped):

  • Linear algebra
  • Probability theory
  • Decision theory
  • Information theory

The Goal

To move from theory to practice and see what can be done, I tried to solve a well-defined problem.

The problem:

Can I use ML to implement an advanced NLP algorithm? In particular, can I create a model able to identify specific features inside a document?

The aim of the lab was to identify law references inside institutional documents using a well-defined ML algorithm.

The Algorithms

The first issue was to find an algorithm able to scan large documents.

The second was to capture word similarity: in other words, how to convert the sparse vector built for a single word into a much denser vector and, at the same time, how to make the distance between words reflect their meaning.
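As a rough illustration of the sparse-versus-dense idea, the sketch below compares words with cosine similarity. The three-word vocabulary and the 2-dimensional dense vectors are hypothetical values picked by hand, not the output of any trained model, and the code uses only the standard library to stay self-contained.

```python
import math


def one_hot(word, vocabulary):
    """Sparse encoding: one dimension per vocabulary word, a single 1."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary = ["law", "decree", "banana"]

# Hypothetical 2-dimensional embeddings, hand-picked for illustration.
# A trained Word2Vec model would learn such vectors from word contexts.
dense = {
    "law": [0.9, 0.1],
    "decree": [0.8, 0.2],
    "banana": [0.1, 0.9],
}

# One-hot vectors of two different words never overlap, so they carry
# no notion of similarity at all.
sparse_sim = cosine(one_hot("law", vocabulary), one_hot("decree", vocabulary))

# Dense vectors can place related words close together.
dense_sim = cosine(dense["law"], dense["decree"])
dense_far = cosine(dense["law"], dense["banana"])

print(sparse_sim, round(dense_sim, 3), round(dense_far, 3))
```

With one-hot vectors every pair of distinct words is equally unrelated, whereas the dense vectors let "law" sit closer to "decree" than to "banana".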

To do that, a mix of a Convolutional Neural Network and the Word2Vec algorithm was chosen.
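The article does not describe the network that was actually built, so the following is only a minimal sketch of how a convolutional layer can scan a sequence of word vectors: a small window slides over the text, and a pooling step keeps the strongest response wherever it occurs. The function names, the kernel weights and the toy 2-dimensional vectors are all assumptions made for illustration.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide a kernel over a sequence of word vectors.

    sequence: list of word vectors (each a list of floats)
    kernel:   list of weight vectors, one per position in the window
    Returns one ReLU activation per window position.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias
        for vector, weights in zip(window, kernel):
            total += sum(x * w for x, w in zip(vector, weights))
        outputs.append(max(0.0, total))  # ReLU
    return outputs


def global_max_pool(activations):
    """Keep the strongest response, regardless of its position."""
    return max(activations) if activations else 0.0


# Toy document: four tokens already mapped to hypothetical 2-d vectors.
document = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical kernel of width 2 that responds to two consecutive
# tokens with a high first component.
kernel = [[1.0, 0.0], [1.0, 0.0]]

activations = conv1d(document, kernel)
score = global_max_pool(activations)
print(activations, score)
```

Because the same kernel is reused at every position, the cost grows linearly with document length, which is one reason this kind of layer suits long texts. In a real model the kernel weights would be learned during training rather than written by hand.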

The Environment

The magic world of Python, TensorFlow and CUDA.

The Anaconda environment

With Anaconda, one host can contain many different environments, each with its own Python version and packages.


TensorFlow is a framework designed by Google for developing Machine Learning algorithms. It is open source, simple to use, and has a huge community. It allows developing both quick and complex solutions using Python as the programming language.

Nvidia CUDA is a parallel computing platform developed by Nvidia that allows using the GPU to process floating-point values efficiently. The high number of cores allows repetitive tasks to run simultaneously.

CPU vs GPU

The following pictures compare the same computational work run on a CPU and on a GPU.

The same work shown in the Spyder IDE for Python.
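For readers who want to try a similar comparison, here is a minimal sketch that times a matrix product on a chosen device. It is not the workload shown in the pictures. It assumes TensorFlow 2.x in its default eager mode; the function names, matrix size and repeat count are hypothetical choices.

```python
import time


def time_matmul(device_name, size=2000, repeats=5):
    """Return the average seconds per size x size matrix product."""
    # Imported here so this file still loads where TensorFlow is missing.
    import tensorflow as tf

    with tf.device(device_name):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        tf.linalg.matmul(a, b).numpy()  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            # .numpy() waits for the result, so the clock covers the full
            # computation (plus the copy back to host memory on a GPU).
            tf.linalg.matmul(a, b).numpy()
        return (time.perf_counter() - start) / repeats


def compare_devices():
    """Print CPU timing and, if a GPU is visible, GPU timing as well."""
    import tensorflow as tf

    print("CPU:", time_matmul("/CPU:0"))
    if tf.config.list_physical_devices("GPU"):
        print("GPU:", time_matmul("/GPU:0"))
    else:
        print("No GPU visible to TensorFlow")
```

Calling `compare_devices()` in an environment with TensorFlow installed prints the timings. The numbers depend heavily on the hardware and on the matrix size, so small matrices may show little or no advantage for the GPU.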

The results

Using a Machine Learning approach, it is possible to identify specific patterns such as nouns, locations, chemical/biological compounds, law references, and many more terms inside a textual document.

Using well-known algorithms, frameworks, and programming languages, together with a specific execution platform, it is possible to implement awesome products quickly and efficiently.

