Leveraging AI to automate industry watch: a competitive advantage
Illustration with the Hydrogen use case: how AI expertise combined with UX/WebDev expertise allows Sia Partners’ Hydrogen experts to monitor the Hydrogen sector more efficiently
A bit of context
In many evolving industries, monitoring the regulatory, competitive and news environment is essential for Strategic and Marketing departments. It helps them better anticipate the industry’s future needs and therefore optimize their offering.
This challenge is faced by a team of experts within Sia Partners’ Energy team (the ‘Club Hydrogen’). In order to stay up-to-date with the industry’s latest developments, these experts publish a monthly newsletter on the Hydrogen sector. It provides news, views and opinions related to the sector. This requires the team to be on permanent watch for Hydrogen news, new projects and new regulations around the world.
The Hydrogen ecosystem is currently booming and many articles are published daily by a wide variety of sources, from generalist newspapers to H2 news pure players. To monitor all these sources, the team used an RSS aggregator (after having tried different tools). However, this solution did not match their needs because the articles were neither prioritized nor classified. It was therefore difficult to:
The H2 experts wanted a solution enabling real monitoring of H2 news:
Solution
Unable to find a tool matching their needs on the market, the experts leveraged the expertise of Sia Partners’ Data Science Business Line and its accelerator platform for quickly industrializing custom solutions: Heka. After several workshops, Sia Partners’ Hydrogen experts and data scientists came up with the following solution: a Web Application that automatically scrapes several sources and then leverages NLP (Natural Language Processing) techniques to extract key information from the news.
As the underlying business need is not specific to the Hydrogen industry, the solution is Hydrogen-agnostic by design and can therefore be used in any field.
Scraping & Translation
The Hydrogen team identified a list of 39 different sources to be scraped in order to capture the diversity of H2 news around the world. New articles and their content need to be scraped every night, as the experts monitor the news on a daily basis. Because the sources are in a wide variety of languages, the articles also need to be automatically translated to English.
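As a minimal sketch of the nightly ingestion step described above (field names like `url` and the set of already-stored URLs are illustrative assumptions, not the actual codebase):

```python
from datetime import datetime, timezone

def select_new_articles(scraped, known_urls):
    """Keep only articles whose URL has not been stored yet,
    stamping each with its ingestion time (UTC)."""
    fresh = []
    for article in scraped:
        if article["url"] not in known_urls:
            fresh.append({**article,
                          "ingested_at": datetime.now(timezone.utc).isoformat()})
            known_urls.add(article["url"])  # also dedupes within the batch
    return fresh

# Example: two articles scraped tonight, one of them already known.
known = {"https://h2-news.example/a1"}
batch = [
    {"url": "https://h2-news.example/a1", "title": "Old news"},
    {"url": "https://h2-news.example/a2", "title": "New electrolyser project"},
]
new = select_new_articles(batch, known)
```

Running this nightly on each of the 39 sources keeps the database free of duplicates; translation would then be applied only to the `new` articles.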
Information extraction
The following key information needs to be extracted from each article:
Additionally, all the articles need to be classified along the following axes:
Since the H2 market is evolving quickly, these classification axes and the extracted key information need to be easily editable by non-technical users via the Web Application.
Web application
The articles and analyses are accessible on a Web Application called SentiNews. This Web Application offers:
The Web Application also displays a Settings page where users can edit the kind of information they wish to extract.
Technical Approach
Leverage Heka
At the heart of the Consulting 4.0 strategy, Heka is Sia Partners’ industrial artificial intelligence platform. It comprises all the technical building blocks necessary for Data Science projects and allows fast deployment of Machine Learning or Deep Learning projects of all types: voice or image recognition, natural language processing, etc. Heka guarantees a quick time-to-market from idea to POC and from POC to industrialization, as well as the scalability and robustness of our artificial intelligence solutions.
Heka was therefore naturally used as the platform underlying the Web Application and its associated AI algorithms.
Regarding the data model, we chose a combination of a MongoDB database and Elasticsearch to allow efficient filtering and full-text search in the scraped articles.
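A hedged sketch of how an article stored in MongoDB might be flattened into the document indexed in Elasticsearch; all field names here are assumptions, and real indexing would go through the official pymongo and elasticsearch clients:

```python
def to_search_doc(article):
    """Flatten a stored article into a shape suitable for Elasticsearch:
    full text for search, plus simple fields for filtering."""
    return {
        "title": article["title"],
        # prefer the English translation for full-text search
        "body": article.get("translated_text") or article["text"],
        "source": article["source"],
        "published_at": article["published_at"],
        "topics": [t["name"] for t in article.get("topics", [])],
        "entities": [e["text"] for e in article.get("entities", [])],
    }

doc = to_search_doc({
    "title": "New 100 MW electrolyser announced",
    "text": "…",
    "translated_text": "An energy company announced a 100 MW electrolyser.",
    "source": "h2-news.example",
    "published_at": "2022-03-01",
    "topics": [{"name": "Project", "score": 0.91}],
    "entities": [{"text": "100 MW", "label": "CAPACITY"}],
})
```

Keeping the rich nested records in MongoDB and indexing only this flattened view is one common way to combine the two stores.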
Designed to be field-agnostic
Even though SentiNews was originally imagined for an H2-related use case, all the technical choices were made with the goal of being applicable to any other field. All the structural bricks of the application are therefore H2-agnostic.
The data pipeline is composed of two bricks: Scraping and NLP (Natural Language Processing), with the latter consisting of two tasks, Topic Classification and Entity Extraction.
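The two-brick pipeline can be sketched as a simple composition, with stubs standing in for the real scraping, classification and extraction steps (all names are illustrative):

```python
def run_pipeline(inputs, scrape_step, classify_step, extract_step):
    """Scraping feeds NLP, which splits into topic classification
    and entity extraction applied to each article."""
    articles = scrape_step(inputs)
    for article in articles:
        article["topics"] = classify_step(article["text"])
        article["entities"] = extract_step(article["text"])
    return articles

# Stub steps standing in for the real bricks:
processed = run_pipeline(
    [{"text": "Airbus tests a hydrogen engine."}],
    scrape_step=lambda items: items,               # already fetched here
    classify_step=lambda text: ["Aviation"],       # zero-shot model in reality
    extract_step=lambda text: [("Airbus", "ORG")], # NER model in reality
)
```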
NLP Topic Classification
In order to perform this task, supervised classification could have been a suitable solution: it was quite feasible to scrape a large number of articles, and the Hydrogen team had the expertise to label them, thereby providing labeled data to train a model. However, we needed our model to work regardless of the choice of topics, since users can add new axes of classification or change existing topic definitions. Re-training the classification model every time a topic or axis was created or modified was out of the question. In this situation, the constraint was not the quantity of available data, as in most cases, but rather the evolving state of the application’s functional needs. The idea of supervised classification was therefore discarded.
We (Heka’s Data Scientists) chose to use a zero-shot classification model. HuggingFace provides zero-shot text classification models capable of making predictions without labelled data. They rely on NLI (Natural Language Inference) to classify text: given two inputs, a premise and a hypothesis, the model predicts a score according to how strongly the premise entails the hypothesis (e.g. ‘This text talks about a partnership’ to evaluate the Partnership topic). The precision achieved with this technique is lower than with supervised learning, but the approach is far more versatile.
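As an illustration, this setup can be sketched with HuggingFace's `pipeline` API; the model name (`facebook/bart-large-mnli`) and the topic list are illustrative assumptions, and the hypothesis template mirrors the premise/hypothesis mechanism described above:

```python
def build_hypotheses(topics, template="This text talks about {}."):
    """Turn user-defined topics into the NLI hypotheses fed to the model,
    so new topics need no re-training, only a new hypothesis."""
    return [template.format(t.lower()) for t in topics]

def classify(text, topics):
    """Score each topic against an article via zero-shot NLI.
    Not called here: it downloads a large model on first use."""
    from transformers import pipeline
    clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    return clf(text, candidate_labels=topics,
               hypothesis_template="This text talks about {}.")

hyps = build_hypotheses(["Partnership", "Regulation"])
```

Because the topics enter only through the hypothesis strings, editing an axis in the Settings page simply changes the candidate labels passed at inference time.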
The choice of the keywords used in the hypotheses is crucial to reach good performance on the topic detection task. Our Data Science team worked alongside the business experts to choose these words carefully and improve model performance. We also shared best practices, since the H2 experts will be able to add new topics on their own during the application’s lifetime.
Entity Extraction
The energy consultants were also very interested in the ability to extract specific entities from the articles, for example to identify which countries and players were mentioned and what the key figures were.
To identify this kind of information, we used spaCy’s Named Entity Recognition (NER) algorithm. This algorithm combines a deep convolutional neural network with residual connections, word embeddings using subword features and “Bloom” embeddings, and a transition-based approach to provide accurate and efficient results. It detects the following named entities:
In this case, the Hydrogen team was interested in Locations, Organizations, Dates, Times, Money, Numerals, Percentages, and Quantities.
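A minimal sketch of this extraction step using spaCy; the `en_core_web_sm` model and the exact label set are assumptions, with spaCy's built-in labels such as GPE, ORG and MONEY mapping onto the types listed above:

```python
# spaCy label names for the entity types the Hydrogen team cares about
# (GPE/LOC = locations, CARDINAL = numerals).
WANTED_LABELS = {"GPE", "LOC", "ORG", "DATE", "TIME",
                 "MONEY", "CARDINAL", "PERCENT", "QUANTITY"}

def filter_entities(entities, wanted=WANTED_LABELS):
    """Keep only the wanted entity types.
    `entities` is a list of (text, label) pairs as produced by spaCy."""
    return [(text, label) for text, label in entities if label in wanted]

def extract_entities(text):
    """Run spaCy NER (not called here: requires the en_core_web_sm model)."""
    import spacy
    nlp = spacy.load("en_core_web_sm")
    return filter_entities([(ent.text, ent.label_) for ent in nlp(text).ents])

ents = filter_entities([("Airbus", "ORG"), ("Toulouse", "GPE"),
                        ("hydrogen", "PRODUCT"), ("2035", "DATE")])
```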
Additionally, it was important to have more specific entities for the “quantities” in order to track mentions of Energy, Capacity, and the Cost of Energy. We therefore defined custom regular expressions to identify these new entities, which take precedence over spaCy’s predictions.
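To illustrate the idea, here is a hedged sketch of such custom patterns and of the precedence rule; the actual regular expressions used in SentiNews are not public, so these are illustrative:

```python
import re

# Illustrative patterns for the three custom quantity types.
CUSTOM_PATTERNS = {
    "ENERGY":   re.compile(r"\b\d+(?:\.\d+)?\s?(?:kWh|MWh|GWh|TWh)\b"),
    "CAPACITY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:kW|MW|GW)\b"),
    "COST":     re.compile(r"[$€]\s?\d+(?:\.\d+)?\s?/\s?kg\b"),
}

def custom_entities(text):
    """Find Energy / Capacity / Cost mentions as (start, end, label, text) spans."""
    spans = []
    for label, pattern in CUSTOM_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

def merge(custom, model_spans):
    """Custom spans take precedence: drop any model span overlapping one."""
    kept = [s for s in model_spans
            if not any(s[0] < c[1] and c[0] < s[1] for c in custom)]
    return sorted(custom + kept)

text = "The plant pairs a 100 MW electrolyser with hydrogen at $2/kg."
found = custom_entities(text)
```

The overlap test in `merge` implements the precedence: a generic QUANTITY span from the model is discarded whenever a more specific custom span covers the same characters.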
Finally, specific post-processing methods were developed to harmonize the entities, such as:
Conclusion & Next steps
The solution is now used daily by 10 experts from the Energy team and helps them write the monthly H2 newsletter more easily. As of now, more than 1,600 articles have been scraped and translated, with their associated entities and topics extracted.
For example, the solution can quickly be used to search for all news related to H2 being used in Aviation and involving Airbus:
If an article looks interesting, the experts can click on it to open a detailed view with the main information highlighted.
To further improve the solution, a newsletter functionality is currently under development. It will allow users to receive by email either all the scraped articles related to a specific topic or organization, or, at a defined frequency, the most important articles for each topic.
Next steps also include role-based permissions, allowing multiple users with different areas of expertise to share the same platform. For example, users could access analyses of both the H2 and nuclear fields, each scraped from custom sources and configured by different experts.
The solution has already been adapted and is currently used by a municipality that needs to monitor Design topics. It could be reused as-is thanks to the Hydrogen-agnostic architecture: the only adaptation needed was the list of sources to scrape automatically.