Cleora.ai - a Swiss Army knife, an essential element of systems operating on data in the form of a network of connected nodes.
Jaroslaw Krolewski
synerise.com | basemodel.ai | cleora.ai | wislakrakow.com | agh.edu.pl
We created Cleora, one of the fastest graph-embedding algorithms in existence. How was it created and what is the purpose of this project? How can open-source projects grow with the help of the community? How can cleora.ai accelerate the implementation of AI in modern companies?
What's the story behind Cleora's origins? How many people worked on it and for how long?
Since the inception of the AI team at Synerise, our ambition has been to quickly and easily process giant, heterogeneous interaction data. Existing libraries, such as StarSpace, Node2Vec, DeepWalk, or various graph convolutional networks, did not meet our requirements.
Each of them had a drawback: very slow performance, impractical limits on maximum graph size, or unsatisfactory quality of results. We needed a solution that would let us quickly and accurately compute embeddings for graphs with millions of vertices and billions of edges in order to represent user behavior. Waiting several months for a result was unacceptable for us.
The first version of Cleora was created at the beginning of 2019 and was implemented in Scala. It quickly became apparent that the tool successfully replaced all existing graph embedding libraries.
In the next iteration, at the beginning of 2020, in addition to optimizing the algorithm, we decided to get rid of the JVM. The entire solution was rewritten from Scala to Rust, which gave us more control over memory and CPU consumption and more than doubled the speed.
Initially, a team of several people was involved in creating Cleora. Its development gave us additional opportunities to create a number of solutions based on it, including generation of recommendations, scoring, segmentation and various predictions.
The experience gathered by the entire AI team allowed us to make Cleora what it is today, a universal and reliable "Swiss Army knife" for computing graph embeddings.
What is the purpose of this tool and how can Cleora help entrepreneurs? Who should definitely be interested in using it?
Cleora is one of the fastest graph embedding algorithms in existence. It is an essential element for systems operating on data in the form of a network of connected nodes. These include recommendation systems, systems which predict connections between users in social media (e.g. like/follow), and even systems predicting the biological functions of protein networks, enabling the creation of new drugs.
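To illustrate how such systems use embeddings: once every node has a vector, link prediction (e.g. suggesting whom to follow) reduces to scoring vector similarity. The node names, vector values, and cosine scoring below are made-up assumptions for a minimal sketch, not part of Cleora itself:

```python
import numpy as np

# Toy embedding table: 4 nodes, 3-dimensional vectors (values are made up).
embeddings = {
    "alice": np.array([0.9, 0.1, 0.0]),
    "bob":   np.array([0.8, 0.2, 0.1]),
    "carol": np.array([0.0, 0.9, 0.4]),
    "dave":  np.array([0.1, 0.8, 0.5]),
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Score candidate "follow" edges for alice: a higher score means
# the nodes sit closer in embedding space, i.e. a more likely link.
scores = {name: cosine(embeddings["alice"], vec)
          for name, vec in embeddings.items() if name != "alice"}
best = max(scores, key=scores.get)  # "bob", whose vector is nearly parallel
```

The same similarity scoring underlies recommendations: candidate items are ranked by their distance to a user's vector.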
No wonder, then, that such algorithms are created by digital giants such as Facebook and Google, which release a number of new solutions each year. However, Cleora has a significant advantage over these algorithms.
First, it is much faster. Second, it does not require specialized hardware (e.g. GPUs to accelerate calculations), and it still produces high-quality embedding vectors. This means that systems (e.g. recommenders) using Cleora can run faster and with greater accuracy.
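Part of why no GPU is needed is the simplicity of the published idea behind Cleora: start from random vectors and repeatedly multiply them by the graph's random-walk transition matrix, normalizing after each step, which is just sparse matrix arithmetic on a CPU. Below is a toy NumPy sketch of that general idea (an illustrative assumption-laden simplification, not the actual Rust implementation):

```python
import numpy as np

def embed(edges, n_nodes, dim=8, iters=3, seed=0):
    """Toy sketch of iterative embedding propagation: start from random
    vectors, repeatedly average each node's neighbours via the random-walk
    transition matrix, and L2-normalize the rows after every step."""
    rng = np.random.default_rng(seed)
    # Build the row-normalized adjacency (random-walk transition) matrix.
    M = np.zeros((n_nodes, n_nodes))
    for u, v in edges:
        M[u, v] = M[v, u] = 1.0            # undirected edge
    M /= M.sum(axis=1, keepdims=True)      # assumes no isolated nodes
    T = rng.uniform(-1.0, 1.0, size=(n_nodes, dim))
    for _ in range(iters):
        T = M @ T                                       # average neighbours
        T /= np.linalg.norm(T, axis=1, keepdims=True)   # L2-normalize rows
    return T

# Tiny 4-node path graph: 0-1-2-3
emb = embed([(0, 1), (1, 2), (2, 3)], n_nodes=4)
```

In a production setting the transition matrix is sparse and the multiplications parallelize across CPU cores, which is what makes graphs with hundreds of millions of nodes tractable without accelerators.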
Cleora is capable of processing graphs of hundreds of millions of nodes. In social networks, one node usually corresponds to a single user, so Cleora can be used to process datasets on a global scale, at the level of the number of users of the largest social networking sites such as Twitter.
The release of the software under an open-source license means that from now on any company, individual, or research institution can use Cleora for any purpose. We recommend Cleora when working with large graphs, especially under limited computing power. The implementation is available on GitHub.
Cleora is reported to run 8 times faster than PyTorch-BigGraph, created by Facebook, and Synerise itself has been recognized by Microsoft. Has the time come for Polish companies to make their mark on the international arena?
In the scientific sphere, the Synerise team achieved significant success by winning the Rakuten Data Challenge competition at the SIGIR (Special Interest Group on Information Retrieval) conference. The subject of the competition was creating recommendations in e-commerce, and the organizers included Tracy H. King (Adobe), Shervin Malmasi (Amazon), Dietmar Jannach (University of Klagenfurt), Weihua Luo (Alibaba), Surya Kallumadi (The Home Depot).
A Synerise publication on methods for detecting the product features that most determine a user's interest also appeared in the proceedings of the ICML 2020 conference. A few months later, at ICONIP 2020, we presented an article describing a model that recommends similar clothes based on photo galleries from producers and users.
What is the plan for further Synerise activities?
Our research goal is to enable automatic and very efficient processing of various data sources that are owned by our clients, both in terms of the quality of the results and the calculation time.
Graph algorithms can process the interaction data typically found in banking, telecommunications, and e-commerce ecosystems. However, there are many other types of data, such as images, text, sounds, and structured data, and any company looking to improve its performance must be able to seamlessly synthesize all of its data into a form that allows easy and instant real-time predictions.
The business priority for Synerise is now international expansion to Western and Middle Eastern markets.
Why is Cleora open-source? What are the benefits of opening the project to the community?
Activities of this type bring companies a lot of publicity, especially if the published tool is of high quality, i.e. it is quick, easy to run and comprehensive, and when it is offered under an open license allowing for commercial use. This is the case with Cleora.
Synerise wants to stimulate knowledge sharing, following the example of digital giants like Google and Facebook who publish some of their solutions. At the same time, we do not see companies on the Polish technology market, even very large ones, that would share their knowledge in an equally open way.
Of course, Google or Facebook can afford to publish some of their property for free given their dominant market positions. However, Cleora is just one of many proprietary technologies being developed at Synerise, so we do not think its publication will have a negative impact on our company.
One should also remember that tools of this type are not easy to use (although we tried to make Cleora as easy as possible), and in practice they often require consulting assistance, so we also treat it as a potential source of new clients.
Cleora underpins some parts of our ecosystem and we continuously implement solutions based on it in many companies. Opening the source code has a positive effect on transparency and increases trust in AI solutions, which for many people are still something incomprehensible.
The transparent approach has many advantages. Scientists employed by our clients may use Cleora to carry out corporate or personal projects to better understand the principle of operation or to validate our performance claims.
The recruitment aspect should not be underestimated either. High-quality open-source often becomes an element that attracts the most ambitious candidates.
The advantage of innovation hubs such as Silicon Valley is largely due to the synergy effect: creating a friendly environment for sharing ideas and inspiring each other. We have taken a bold step in this direction. Opening the Cleora code is an important experiment for us, whose consequences will surely return to us through various endeavors; we will observe them with curiosity and draw conclusions. Within days of publication, the first volunteer contributors had already appeared with their improvements to Cleora, which makes us very happy.
How can open-source software contribute to the development of a project?
Publishing the source code allows an informal group of users to form around the tool, contributing their own ideas and improvements. In this way, Cleora has a chance to become a permanent fixture in the catalog of graph embedding solutions, as it is constantly developed and updated.
Of course, this is only possible if the tool is interesting for the community and provides a source of inspiration. We encourage all Rust and Python developers to contribute to our solution, which is and will remain open. Any suggestions for improvements to the project can be submitted via GitHub.
Are there any risks related to sharing your projects as free software? Why do so few companies decide to make such a move?
A very important goal of many companies is the protection of intellectual property, understood as a strategic resource. From a company's perspective, revealing the source code of a key tool often means losing a competitive edge.
However, the advantage of Synerise is not based on one unique technology, but on the synergy of many proprietary solutions, concentrated in one ecosystem. Therefore, we believe that the disclosure of a single or even several tools is not a threat to us. The most important reason for such a small number of similar initiatives may be the fact that most companies in our region use technologies created by someone else and profit from implementations.
What can we expect in connection with the development of artificial intelligence in business?
We can expect a gradual elimination of secondary, non-creative activities, as well as superhuman capabilities for synthesizing giant data sets. Companies will invest in more and more AI services to improve their operations, e.g. searching for target groups, serving well-matched advertisements, accelerating internal processes, or speeding up communication between technical support departments and customers.
The artificial intelligence industry is currently at a very early stage, but consolidation will come in a few years. Then only the players with the most universal and comprehensive offer will remain on the market.
Which processes in the future could be automated with the help of artificial intelligence? Can we talk about self-operating companies?
It depends on the industry, but automation and artificial intelligence allow you to achieve better and better results with less and less human effort.
Probably in a few years we will be able to expect the first, initially very simple "self-operating" companies.
Of course, just as on board a jet, a human pilot will oversee their operation for many years to come, but this will let him focus his attention on truly creative problems and innovation instead of unambitious, repetitive tasks. Fully automatic grocery stores are being tested even today.
How is artificial intelligence affecting our lives right now?
Machine learning has been around for decades in hedge funds, banks, and other parts of the financial sector. For over a dozen years it has been penetrating ever wider branches of the economy.
Recommendation systems are booming in e-commerce, where they are already responsible for over 30% of turnover. Without artificial intelligence, companies of this type have no chance of competing with market leaders today.
The increasing adoption of AI solutions above all makes it possible to automate the most tedious and labor-intensive activities: not only those performed by people so far, but also those that, due to the enormous amount of work involved, have until now been beyond the reach of humanity.
We are witnessing successive gigantic achievements, from impressive models generating natural text on any topic, such as GPT-3 (OpenAI), to models such as AlphaFold (DeepMind), which simulates protein folding with unprecedented accuracy. Despite their relative conceptual simplicity (these models do not yet have much in common with "human intelligence"), such advances will revolutionize entire fields of science, and soon industry as well.
And how does AI affect the everyday life of the average person?
AI is everywhere today. Our phones collect data about the pages we view and display personalized ads. GPS data analysis systems track the routes we travel. We have tools for recognizing human speech (e.g. Siri, Google Assistant), for machine translation, and for advanced image analysis (thanks to which we can unlock a phone with a scan of our own face).
What does the artificial intelligence market look like today? Is it easy to find specialists?
The artificial intelligence market is paradoxical today.
On the one hand, there is great interest in machine learning and AI among candidates; deep learning in particular, as a newly established area, attracts much attention. On the other hand, universities practically ignored this field until recently. For this reason, most specialists with more than 6-8 years of experience are self-taught; there are very few of them, and that is why they are in high demand.
However, not all work related to the field of artificial intelligence requires the participation of a specialist. The vast majority of companies use solutions created and implemented by others, for which even a general knowledge of programming and data analysis is sufficient.
Which skills would you single out as necessary for working on artificial intelligence?
There are many positions in AI work, which can be divided into two main groups: research and implementation.
Research positions require knowledge of mathematics, understanding of the internals of machine learning models, and staying up to date with the latest developments in this rapidly growing field by reading scientific publications, blogs, and reports. In addition, programming skills in languages particularly useful for building AI models are required; the most common choices are Python and R. Experience in publishing and familiarity with the scientific world is an additional advantage.
In implementation positions, skills related to big data processing are an important part of the job. Here we emphasize knowledge of databases and of technologies dedicated to big data or streaming data, such as Hadoop, Spark, Hive, Kafka, and ClickHouse. We attach great importance to the ability to write high-quality code in languages such as Scala, Java, C++, and Python.
Interview with Barbara Rychalska