Artificial Intelligence in efficient image search on MEO Cloud

As technology improves, people are storing more data. Cloud platforms like MEO Cloud from Altice help keep this data safe and synced across devices. These platforms often hold large numbers of images, which makes finding a specific one difficult. New image search methods built on deep learning models can help. The CLIP model links text and images, the BLIP model generates textual descriptions of images, and the MiniLM-L6-v2 model compares those descriptions with the user’s query. This article discusses a system that combines these models to make searching for images on MEO Cloud easier and faster.

Continue reading to learn more about the system developed to integrate efficient and dynamic image search into the MEO Cloud platform using artificial intelligence, or download the full white paper.


State of the Art

Thanks to technology, people now capture and manage more images than ever, driven by better cameras, cheaper storage, and easier ways to share them. Images are essential for remembering moments and sharing messages. About 5.3 billion images are taken every day, and in 2023 over 1.8 trillion images were taken worldwide. Since 2012, the number of images taken each year has grown by 10%-14% annually, apart from a roughly 20% drop during the pandemic years, and is expected to reach 1.94 trillion in 2024. MEO Cloud, a product from Altice, is a cloud storage service where users can store and manage various files, including images. As the number of images grows, efficient search methods become necessary: traditional criteria such as date or file type are no longer enough. Deep learning, a subfield of AI, performs well with large data volumes and can improve image search. This article discusses a new image search system for MEO Cloud that uses deep learning models.


Different Approaches

A successful image search system needs to organize and find images accurately. Text-Based Image Retrieval (TBIR) and Content-Based Image Retrieval (CBIR) are two main methods. AI has improved these methods, leading to Multi-Modal Image Retrieval (MMIR) and Cross-Modal Image Retrieval (CMIR). These newer methods refine results and are collectively known as Semantic-Based Image Retrieval (SBIR).


Text-Based Image Retrieval

The Text-Based Image Retrieval (TBIR) approach is used when users search for images with keywords or descriptions. Text-matching algorithms compare these with the image’s textual annotations, such as names, tags, or descriptions. Because manually describing images takes considerable effort, research has focused on automatic image annotation, which can be grouped into three methods (a keyword-based sketch follows the list):

  • Keyword-based methods use models like Convolutional Neural Networks (CNN) to assign a class or classes to the image.

  • Ontology-based methods represent keywords and their relationships in hierarchical categories to capture semantic content.

  • Generative methods generate textual descriptions that represent the visual content, capturing the semantic context.
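As an illustration of the keyword-based method, the sketch below assigns keywords to an image with an off-the-shelf ImageNet classifier. The model choice, label source, and file name are assumptions for the example rather than details of the system described here.

```python
# Minimal keyword-based annotation sketch: an ImageNet-pretrained classifier
# assigns its top predicted class names to an image as keyword annotations.
from PIL import Image
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]  # ImageNet class names used as keywords

def annotate(image_path: str, top_k: int = 3) -> list[str]:
    """Return the top-k predicted class names as keyword annotations."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = model(image).softmax(dim=1)[0]
    top = probs.topk(top_k).indices.tolist()
    return [labels[i] for i in top]

# Example (hypothetical file): annotate("beach.jpg") might return
# ["seashore", "sandbar", "lakeside"].
```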


Content-Based Image Retrieval

Content-Based Image Retrieval (CBIR) systems store the visual features of images in a database. When searching, they compare the search image's colour, shape, and texture to these stored features. This is often used in e-commerce to find similar products. However, these systems can struggle when features are taken from the whole image due to backgrounds, overlaps, and clutter. Also, they assume that if images look similar, they have similar meanings, which isn’t always true.
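To make the idea concrete, the sketch below compares two images through a simple colour-histogram feature; a real CBIR system would combine richer colour, shape, and texture descriptors, and the file names here are purely illustrative.

```python
# Minimal CBIR sketch: compare images by a simple colour-histogram feature vector.
import numpy as np
from PIL import Image

def colour_histogram(image_path: str, bins: int = 8) -> np.ndarray:
    """Build a normalised per-channel colour histogram as the feature vector."""
    pixels = np.asarray(Image.open(image_path).convert("RGB")).reshape(-1, 3)
    hist = np.concatenate(
        [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0] for c in range(3)]
    ).astype(np.float32)
    return hist / np.linalg.norm(hist)

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two already-normalised feature vectors."""
    return float(a @ b)

# Example (hypothetical files):
# similarity(colour_histogram("query.jpg"), colour_histogram("stored.jpg"))
```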


Multi-Modal Image Retrieval

The Multi-Modal Image Retrieval (MMIR) approach combines text and image features to improve search results. It’s used in e-commerce and makes searches more accurate and detailed. For example, if a user searches for ‘beach without people’, the system considers all words in the search. The images shown match the search words and have similar visual features. Combining text and images in searches helps reduce differences between what the user is looking for and the results they get.


Cross-Modal Image Retrieval

The Cross-Modal Image Retrieval (CMIR) approach compares the user’s text directly to the image’s visual content. It analyses the characteristics and relationships of these data, known as multimodal text-to-image relationships. The goal is to measure the similarity of text and image by projecting both into the same multi-dimensional space using embeddings. A model that can generate both text and image embeddings is used during the storage and search phases. CMIR relies on pairwise learning: a contrastive loss function is computed over pairs of text-image embeddings so that related pairs end up with higher similarity than unrelated ones. This way, the text description and the images to be searched are represented in the same way, without relying on manual textual annotations or hand-crafted visual features.
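As a rough sketch of this pairwise learning step (in the style of CLIP-like contrastive training, which is not necessarily how the production models were trained), the snippet below computes a contrastive loss over a batch of matching text-image embedding pairs; the tensor shapes and temperature value are assumptions.

```python
# Sketch of the pairwise (contrastive) objective used in CMIR-style training,
# assuming a batch of N matching text-image embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matching text-image pairs together, push mismatched pairs apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # N x N similarity matrix
    targets = torch.arange(len(text_emb))           # matching pair sits on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```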

Figure: CMIR approach

Embeddings are low-dimensional vectors that capture the meaning of data. They allow similar items to be close in a multi-dimensional space. The similarity between vectors is often measured using cosine similarity, which is suitable for unstructured data. It measures the cosine of the angle between two vectors, with smaller angles indicating more similarity.
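For reference, cosine similarity can be computed in a few lines; the example vectors below are arbitrary.

```python
# Cosine similarity between two embedding vectors (illustrative values).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([0.2, 0.9, 0.1]),
                        np.array([0.25, 0.85, 0.05])))  # close to 1.0
```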

The Cross-Modal Image Retrieval (CMIR) approach uses vector databases to store these embeddings. These databases are designed to store, index, and retrieve data in a multi-dimensional vector space, making them suitable for AI applications. Images are processed by a model that generates image embeddings for storage. When searching for images, the database compares the image embeddings with the embeddings of the text description processed by the model, using algorithms like Hierarchical Navigable Small World (HNSW).
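The sketch below shows what such an approximate nearest-neighbour lookup with HNSW could look like; FAISS, the embedding dimensionality, and the random placeholder data are assumptions, since the article does not name the library used.

```python
# Approximate nearest-neighbour search over image embeddings with an HNSW index.
import faiss
import numpy as np

dim = 512                                   # e.g. CLIP ViT-B/32 embedding size
index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)

image_embeddings = np.random.rand(1000, dim).astype("float32")   # placeholder data
faiss.normalize_L2(image_embeddings)        # normalised vectors -> inner product == cosine
index.add(image_embeddings)

query = np.random.rand(1, dim).astype("float32")                 # text-description embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, k=5)      # five most similar stored images
```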


Altice’s Very Own

Users interact with the MEO Cloud platform continuously and iteratively to store, search, and remove images. To ensure the developed system integrates with the platform, the following functionalities are included:

Image Storage

This functionality allows users to keep images in the cloud, uploaded from devices such as computers, smartphones, or tablets, ensuring that the images remain accessible.

Image Search

This functionality locates specific images based on defined criteria, such as textual descriptions, allowing users to specify desired characteristics to find the images.

Image Removal

This functionality allows users to delete images from the cloud, enabling efficient image gallery management.

The developed system is a microservice of the MEO Cloud platform, offering a scalable way to manage and search images. It syncs with the platform to ensure user data is accurate and consistent. It has two layers: a logical layer for implementing image search methods and a persistence layer for storing data. To keep data consistent between the platform and the microservice, unique identifiers are used (a storage sketch follows the list):

  • User UUID: this is used to store image search data in a separate collection in the system’s vector database.
  • Image UUID: this is used to store image data as objects in the user’s collection.
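A hypothetical sketch of this layout using Qdrant as the vector database (the article does not name the database, so the client, embedding size, and caption are assumptions): the User UUID names a dedicated collection and the Image UUID identifies each stored point.

```python
# Hypothetical per-user data layout in a vector database:
# one collection per user (User UUID), one point per image (Image UUID).
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(url="http://localhost:6333")     # assumed local instance

user_uuid = str(uuid.uuid4())                          # collection named after the user
client.create_collection(
    collection_name=user_uuid,
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)

image_uuid = str(uuid.uuid4())                         # one point per stored image
client.upsert(
    collection_name=user_uuid,
    points=[PointStruct(id=image_uuid,
                        vector=[0.0] * 512,            # placeholder image embedding
                        payload={"caption": "a sandy beach at sunset"})],
)
```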


Method and Strategy

Text-Based Image Retrieval (TBIR), Content-Based Image Retrieval (CBIR), and Multi-Modal Image Retrieval (MMIR) all have limitations that affect their accuracy. TBIR struggles to represent the relationship between words and images, leading to irrelevant results when inappropriate terms are used. CBIR often fails to capture the semantic content of images, creating a gap between high-level concepts and low-level features. MMIR, which combines TBIR and CBIR, also suffers from mismatches between user descriptions and images. The main issue with these approaches is the semantic discrepancy between the user’s query and the results, which is influenced by factors like:

  • Ontological Reasoning: The ability to analyze semantic relationships between words, such as synonyms, hypernyms, and homonyms.
  • Objects and Regions: The ability to identify objects or regions of interest in the image.
  • Spatial Context: The spatial context of regions, objects, and scenes within an image, especially for queries involving spatial prepositions like ‘next to’, ‘over’, ‘left’, and ‘bottom’.

To address these limitations, the Cross-Modal Image Retrieval (CMIR) and Text-Based Image Retrieval (TBIR) approaches were chosen. They complement each other and effectively capture the semantic relationships in user searches, which are expressed as textual descriptions. The combination is implemented through state-of-the-art AI models, and the architecture was designed to integrate seamlessly with the MEO Cloud platform, incorporating the chosen approaches, the AI models, and the vector database.

Figure: System architecture

The Cross-Modal Image Retrieval (CMIR) approach combines text and image through embeddings and checks the similarity between text-image pairs. The Text-Based Image Retrieval (TBIR) approach filters out irrelevant images from the search by analyzing the text of the captions generated from the images. The system focuses on three main tasks, illustrated in the sketch after this list:

  • Embeddings Generation: the CLIP model creates embeddings of the images and of the user descriptions used for the search. It’s a multi-modal vision-language Transformer trained on 400 million text-image pairs that can embed text and image in the same space.
  • Captions Generation: the BLIP model automatically generates captions that reflect the visual content of the images. These captions are stored as image metadata in the vector database for filtering results.
  • Semantic Textual Analysis: the MiniLM-L6-v2 model evaluates the similarity of image captions and the user’s description during image search. It’s a Sentence Transformer trained with over a billion training pairs, learning robust representations of textual sequences.
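The sketch below strings the three models together for one image and one query. It is illustrative only: the HuggingFace checkpoints, file name, and query text are assumptions, and the production pipeline (batching, thresholds, storage) is not shown.

```python
# Sketch of the three tasks: CLIP embeddings, BLIP captioning, MiniLM text similarity.
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          BlipForConditionalGeneration, BlipProcessor)
from sentence_transformers import SentenceTransformer, util

image = Image.open("photo.jpg").convert("RGB")        # hypothetical stored image
query = "a dog playing on the beach"                  # hypothetical user description

# 1) Embeddings generation with CLIP: image and text land in the same vector space.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_emb = clip.get_image_features(**clip_proc(images=image, return_tensors="pt"))
text_emb = clip.get_text_features(**clip_proc(text=[query], return_tensors="pt", padding=True))

# 2) Caption generation with BLIP: a textual description stored as image metadata.
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_ids = blip.generate(**blip_proc(images=image, return_tensors="pt"))
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

# 3) Semantic textual analysis with MiniLM-L6-v2: compare caption and user query.
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
caption_score = util.cos_sim(minilm.encode(caption, convert_to_tensor=True),
                             minilm.encode(query, convert_to_tensor=True)).item()

clip_score = util.cos_sim(image_emb, text_emb).item()
print(f"CLIP image-text similarity: {clip_score:.3f}, "
      f"caption-query similarity: {caption_score:.3f}")
```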

The system uses a vector database to store the necessary data. It keeps the embeddings of the images and the generated captions as metadata while the images stay on the MEO Cloud platform. The vector database ensures the system can scale in operations and user numbers, keeps each user’s data separate, and makes image searches more efficient.


Discover more in the white paper about the image storage pipeline and the conclusions derived from integrating innovative image search techniques with AI models that merge text and images within a unified multidimensional space.


Authors


Keywords: Artificial Intelligence, Search, Image, Cloud Platform, MEO Cloud



Contact us if you want to engage in a deeper discussion on this topic!
