Leveraging AI to automate industry watch: a competitive advantage

Illustration with the Hydrogen use case: how AI expertise combined with UX and web development expertise allows Sia Partners’ Hydrogen experts to monitor the Hydrogen sector more efficiently



A bit of context

In many evolving industries, monitoring the regulatory, competitive, and news environment is essential for Strategy and Marketing departments: it helps them anticipate the industry’s future needs and optimize their offering accordingly.

This challenge is faced by a team of experts within Sia Partners’ Energy team (the ‘Club Hydrogen’). In order to stay up-to-date with the industry’s latest developments, these experts publish a monthly newsletter on the Hydrogen sector. It provides news, views and opinions related to the sector. This requires the team to be on permanent watch for Hydrogen news, new projects and new regulations around the world.

The Hydrogen ecosystem is currently booming, and many articles are published daily by a wide variety of sources: generic newspapers as well as pure-player H2 news outlets. To monitor all these sources, the team used an RSS aggregator (after having tried different tools). However, this solution did not match their needs because the articles were neither prioritized nor classified. It was therefore difficult to:

  • Extract the key news from the Hydrogen noise online
  • Identify underlying trends
  • Monitor sources published in different languages

The H2 experts wanted a solution that would enable genuine monitoring of H2 news:

  • Access to a sorted and searchable history of H2 news
  • Overview of main quantitative information (Subsidy, Power capacity, etc.)
  • Overview of main players acting in the field (Countries, Companies, International organizations)
  • Automatic translation into English of all sources
  • Intuitive dashboard to quickly identify trending topics


Solution

Unable to find a tool on the market matching their needs, the team leveraged the expertise of Sia Partners’ Data Science Business Line and its accelerator platform for quickly industrializing custom solutions, Heka. After several workshops, Sia Partners’ Hydrogen experts and data scientists came up with the following solution: a Web Application that automatically scrapes several sources, then leverages NLP (Natural Language Processing) techniques to extract key information from the news.

As the underlying business need is not specific to the Hydrogen industry, the solution is Hydrogen-agnostic by design and can therefore be used in any field.


Scraping & Translation

The Hydrogen team identified a list of 39 different sources to be scraped in order to capture the diversity of H2 news around the world. New articles and their content need to be scraped every night, as the experts monitor the news daily. Because the sources are in a wide variety of languages, the articles also need to be automatically translated into English.
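As an illustration, the nightly ingestion step can be sketched as follows; the Article structure and function names are hypothetical, not the production code:

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

@dataclass
class Article:
    url: str
    title: str
    body: str
    lang: str  # language code reported by the source or a language detector
    translated_body: Optional[str] = None  # filled in by the translation step

def nightly_ingest(scraped: List[Article],
                   seen_urls: Set[str]) -> Tuple[List[Article], List[Article]]:
    """Keep only articles not ingested on a previous night and flag
    non-English ones for machine translation."""
    new_articles, to_translate = [], []
    for art in scraped:
        if art.url in seen_urls:
            continue  # already stored during an earlier run
        seen_urls.add(art.url)
        new_articles.append(art)
        if art.lang != "en":
            to_translate.append(art)  # queued for translation into English
    return new_articles, to_translate
```

In the real pipeline, the translation queue would then be handed to a machine-translation service before the articles are stored.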


Information extraction

The following key information needs to be extracted from each article:

  • Key figures (dates, amount, capacity, duration, etc.)
  • Countries concerned by the news
  • Players (Companies and organizations) mentioned in the article

Additionally, all the articles need to be classified along the following axes:

  • Position within the Hydrogen Value Chain (e.g. Production, Storage, Transportation, etc.)
  • Topic of the article (e.g. Partnership announcement, Regulation, Financing, etc.)

Since the H2 market is quickly evolving, these axes of classification and the key information extracted need to be easily editable by non-technical users via the Web Application.
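One way to make the axes editable without touching any model is to keep them as plain data. The structure below is a hypothetical sketch of such settings, not the actual schema:

```python
# Each classification axis maps topic names to the natural-language
# hypothesis the classifier will test against an article.
classification_axes = {
    "value_chain": {
        "Production": "This text talks about hydrogen production.",
        "Storage": "This text talks about hydrogen storage.",
        "Transportation": "This text talks about hydrogen transportation.",
    },
    "topic": {
        "Partnership announcement": "This text talks about a partnership.",
        "Regulation": "This text talks about a regulation.",
        "Financing": "This text talks about financing.",
    },
}

def add_topic(axes, axis, topic, hypothesis):
    """Editing a topic only changes this dictionary: no retraining or code
    change is needed, which is what makes the axes editable by
    non-technical users."""
    axes.setdefault(axis, {})[topic] = hypothesis
    return axes
```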


Web application

The articles and analyses are accessible on a Web Application called SentiNews. This Web Application offers:

  • A search functionality that simplifies monitoring and analysis by:
      • Accessing the historical data
      • Filtering by player, date, topic, etc.
      • Providing full-text search across all articles
  • An aggregated view displaying the key extracted information from each article in a table format
  • A detailed view allowing users to directly read the article inside the application, with the main information highlighted in the text.



The Web Application also provides a Settings page where users can edit the kind of information they wish to extract.


Technical Approach

Leverage Heka

At the heart of the Consulting 4.0 strategy, Heka is Sia Partners’ industrial artificial intelligence platform. It comprises all the technical building blocks necessary for Data Science projects and enables fast deployment of Machine Learning and Deep Learning projects of all types: voice or image recognition, natural language processing, etc. Heka guarantees a quick time-to-market from idea to POC and from POC to industrialization, as well as the scalability and robustness of our artificial intelligence solutions.

Heka was therefore the natural platform to host the Web Application and its associated AI algorithms.

Regarding the data model, we chose a combination of a MongoDB database and Elasticsearch to allow for efficient filtering and full-text search in the scraped articles.
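For instance, the search screen’s combination of full-text relevance and exact filters maps naturally to an Elasticsearch bool query. The field names below are assumptions for illustration, not the production mapping:

```python
def build_search_query(text, players=None, topics=None, date_from=None):
    """Build an Elasticsearch bool query body: full-text relevance on the
    article body plus exact-match filters, mirroring the search screen."""
    must = [{"match": {"body": text}}] if text else [{"match_all": {}}]
    filters = []
    if players:
        filters.append({"terms": {"players": players}})
    if topics:
        filters.append({"terms": {"topics": topics}})
    if date_from:
        filters.append({"range": {"published_at": {"gte": date_from}}})
    return {"query": {"bool": {"must": must, "filter": filters}}}
```

Filters in the `filter` clause do not affect relevance scoring, so the ranking stays driven by the full-text match alone.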


Designed to be field-agnostic

Although SentiNews was originally imagined for an H2-related use case, all technical choices were made with the goal of being applicable to any other field. All the structural bricks of the application are therefore H2-agnostic.

The data pipeline is composed of two bricks: Scraping and NLP (Natural Language Processing), with the latter consisting of two tasks, Topic Classification and Entity Extraction.


NLP Topic Classification

In order to perform this task, supervised classification could have been a suitable solution: it was quite feasible to scrape a large number of articles, and the Hydrogen team had the expertise to label them, providing labeled data to train a model. However, we needed the model to work regardless of the choice of topics, since users can add new axes of classification or change existing topic definitions. Re-training the classification model every time a topic or axis was created or modified was inconceivable. In this situation, the constraint was not the quantity of available data, as is usually the case, but the evolving nature of the application’s functional needs. Supervised classification was therefore ruled out.

We (Heka’s Data Scientists) chose to use a zero-shot classification model. HuggingFace provides zero-shot text classification models capable of making predictions without labeled data. These models rely on NLI (Natural Language Inference) to classify text: given two inputs, a premise and a hypothesis, the model predicts a score for how strongly the premise entails the hypothesis (e.g. ‘This text talks about a partnership’ for evaluating the Partnership topic). The precision achieved with this technique is lower than with supervised learning, but the approach is far more versatile.
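The mechanics can be sketched as follows: one hypothesis is generated per candidate topic and scored against the article. The scoring function is pluggable in this sketch; in practice it would be an NLI model such as the ones behind HuggingFace’s zero-shot-classification pipeline:

```python
def classify_zero_shot(premise, topics, entailment_score,
                       template="This text talks about {}."):
    """Score each candidate topic by how strongly the NLI model judges its
    hypothesis to be entailed by the article text (the premise)."""
    scores = {t: entailment_score(premise, template.format(t)) for t in topics}
    best = max(scores, key=scores.get)
    return best, scores

# With the transformers library, the same idea is available directly:
#   from transformers import pipeline
#   clf = pipeline("zero-shot-classification")
#   clf(article_text, candidate_labels=["Partnership", "Regulation"])
```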

The choice of the keywords used in the hypotheses is crucial to reach good performance on the topic detection task. Our Data Science team worked alongside the business experts to choose these words carefully and improve model performance. We also shared best-practice insights, since the H2 experts will be able to add new topics on their own during the application’s lifetime.




Entity Extraction

The energy consultants were also very interested in the ability to extract specific entities from the articles: for example, which countries and players were mentioned, and what the key figures were.

To identify this kind of information, we used spaCy’s Named Entity Recognition. The algorithm combines a deep convolutional neural network with residual connections, word embeddings using subword features and “Bloom” embeddings, and a transition-based approach to provide accurate and efficient results. spaCy’s pretrained English models detect entity types including PERSON, ORG, GPE, LOC, DATE, TIME, MONEY, PERCENT, QUANTITY, ORDINAL, and CARDINAL.

In this case, the Hydrogen team was interested in Locations, Organizations, Dates, Times, Money, Numerals, Percentages, and Quantities.

Additionally, it was important to have more specific entities for the “quantities”, in order to track mentions of Energy, Capacity, and the Cost of Energy. We therefore defined custom regular expressions to identify these new entities, which take precedence over spaCy’s predictions.
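As an illustration, such custom quantity entities can be captured with patterns along these lines (hypothetical patterns and labels, simpler than the production ones):

```python
import re

# Hypothetical patterns in the spirit of the custom quantity entities.
CUSTOM_ENTITY_PATTERNS = {
    "CAPACITY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:k|M|G)W\b"),       # e.g. 100 MW
    "ENERGY": re.compile(r"\b\d+(?:\.\d+)?\s?(?:k|M|G)Wh\b"),        # e.g. 500 MWh
    "ENERGY_COST": re.compile(r"[€$£]\s?\d+(?:\.\d+)?\s?/\s?(?:MWh|kg)"),  # e.g. €50/MWh
}

def extract_custom_entities(text):
    """Return (label, value, span) triples; in the pipeline these would
    override any overlapping spaCy prediction."""
    return [(label, m.group(), m.span())
            for label, pattern in CUSTOM_ENTITY_PATTERNS.items()
            for m in pattern.finditer(text)]
```

The word boundary after the unit keeps “100 MW” from matching inside “100 MWh”, so capacity and energy stay distinct.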

Finally, specific post-processing methods were developed to harmonize the entities, such as:

  • Filtering out dates too close to the article’s publication date (e.g. “last Tuesday”) to focus on key dates
  • Geolocating detected locations to extract the country name
  • Normalizing organization names so that variants such as “European Union”, “the European Union’s”, and “EU” match
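A minimal sketch of two of these harmonization steps, with a hypothetical alias table standing in for the real one:

```python
from datetime import date

# Hypothetical alias table; in the application it is maintained by the experts.
ORG_ALIASES = {
    "eu": "European Union",
    "european union": "European Union",
    "the european union": "European Union",
    "the european union's": "European Union",
}

def keep_key_dates(dates, published, window_days=7):
    """Drop dates within a week of publication (catches relative mentions
    such as 'last Tuesday') so only genuinely informative dates remain."""
    return [d for d in dates if abs((d - published).days) > window_days]

def normalize_org(name):
    """Map surface variants of an organization name to a canonical form."""
    return ORG_ALIASES.get(name.strip().lower(), name)
```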


Conclusion & Next steps

The solution is now used daily by 10 experts from the Energy team and helps them write the monthly H2 newsletter. To date, more than 1,600 articles have been scraped and translated, with their entities and topics extracted.

For example, the solution can quickly be used to search for all news related to H2 being used in Aviation and involving Airbus.


If an article looks interesting, the experts can click on it directly to open a detailed view with the main information highlighted.


To further improve the solution, a newsletter functionality is currently under development. It will allow users to receive by email either all scraped articles related to a specific topic or organization, or a digest of the most important articles per topic at a chosen frequency.

Next steps also include role-based permissions, allowing multiple users with different areas of expertise to use the same platform. For example, users could have access to analyses of both the H2 and nuclear fields, each scraped from custom sources and defined by different experts.

The solution has already been adapted and is currently used by a municipality that needs to monitor Design topics. It was reused as-is thanks to the Hydrogen-agnostic architecture; the only adaptation needed was the list of sources to scrape.
