Vector-based Search to Identify Duplicate Unstructured Data
Ganapathy Shankar
CxO Advisor | AI Strategy Consulting | AI Lead Business Transformation | Product Development | Corporate Training on AI & GenAI
The complications of unstructured data
Unstructured data is data that lacks a predefined data model or that cannot be stored in relational databases. According to industry estimates, 80% to 90% of the world’s data is unstructured, most of it created in the last few years, and it is growing at a rate of 55%-65% every year. Unstructured data often contains large amounts of duplicate content, limiting enterprises' ability to analyze their data.
Here are a few issues with unstructured data (duplicate data in particular) and its impact on any system and its efficiency:
Increase in storage requirements: The more duplicate data there is, the more storage is required, which substantially increases the operating costs for applications.
Redundant data also creates disarray in the system. It therefore becomes imperative for organizations to identify and eliminate duplicate files: a clean data store free of duplicates avoids unnecessary computation and improves efficiency.
Challenges in duplicate data detection
Detecting duplicate files with search functions that compare file characteristics like name, size and type may seem to be the easiest method. However, it is not the most reliable one, especially at large scale: file metadata can differ even when two files hold identical content, and identical metadata does not guarantee identical content, so accurate detection ultimately requires examining the content itself.
The proposed solution
A suitable solution for detecting duplicate files must handle large volumes of data and multiple media formats while keeping latency low. If each file is converted into a multi-dimensional vector and fed into a nearest neighbors algorithm, the top 5-10 most likely duplicate copies of that file can be retrieved. Once files are represented as vectors, duplicates are easy to identify, because the distance between the vectors of duplicate files is almost negligible.
Here’s how different types of files can be converted to multi-dimensional vectors.
Image files: Images are multi-dimensional arrays of pixels, with each pixel holding three values: red, green and blue. When passed through a pre-trained convolutional neural network (CNN), an image or video frame is converted into a vector. A CNN is a deep learning architecture designed specifically for image inputs. Standard architectures like VGG16, ResNet, MobileNet and AlexNet have proven very effective; they are trained on large benchmark datasets like ImageNet with classification layers at the top.
The images are fed through a stack of convolutional layers, which are trained to identify underlying patterns in the input. Each convolutional layer applies its own set of filters to the incoming pixel values. Pooling layers then downsample the resulting feature maps (for example, by averaging over local regions), reducing the spatial size passed to the next stage of the network. Finally, a flatten (or global pooling) layer collects the output of the last pooling stage and produces the vector representation of the image.
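As a rough illustration, here is a minimal sketch of this image-to-vector step, assuming TensorFlow/Keras is available and using VGG16 (one of the architectures mentioned above) with its classification layers removed; the function name and the 224x224 input size are illustrative choices, not a reference implementation.

```python
# Minimal sketch: embed an image with a pre-trained CNN (VGG16, ImageNet weights).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image

# Drop the classification layers and average-pool the final feature maps
# so every image maps to a single fixed-length vector.
model = VGG16(weights="imagenet", include_top=False, pooling="avg")

def image_to_vector(path: str) -> np.ndarray:
    """Convert one image file into a fixed-length embedding vector."""
    img = image.load_img(path, target_size=(224, 224))  # resize to the input size VGG16 expects
    arr = image.img_to_array(img)                        # (224, 224, 3) pixel array
    arr = preprocess_input(arr[np.newaxis, ...])          # scale pixels the way VGG16 was trained
    return model.predict(arr, verbose=0)[0]               # 512-dimensional vector
```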
Text files: To convert a text file into a vector, the words that make up the file are used. Individual characters have ASCII codes, but there is no such built-in numeric representation for a complete word. Pre-trained word vectors such as Word2Vec or GloVe fill this gap: they are obtained by training a deep learning model such as the skip-gram model on large text corpora (more details on the skip-gram model are available in the TensorFlow documentation). The output vector dimension depends on the chosen pre-trained word representation model.
Since a text document contains a variable number of words, an Average Word2Vec representation can be used for the complete document. The Average Word2Vec vector is calculated using the formula below:
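In its standard form, for a document containing N words w_1, w_2, …, w_N, the Average Word2Vec vector is simply the mean of the individual pre-trained word vectors:

Average Word2Vec(document) = (1/N) × [ w2v(w_1) + w2v(w_2) + … + w2v(w_N) ]

where w2v(w_i) is the pre-trained vector for the i-th word.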
This solution can be made more robust by adding a 36-dimensional vector (26 letters + 10 digits) as an extension to the final representation of the text file. This helps in cases where two text files contain the same characters but in different sequences.
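A minimal sketch of this text pipeline is shown below, assuming gensim's downloadable pre-trained Word2Vec model; the model name, the helper name, and the interpretation of the 36-dimensional extension as per-character counts are assumptions, not the article's exact implementation.

```python
# Minimal sketch: Average Word2Vec vector plus a 36-dimensional character/digit extension.
import string
import numpy as np
import gensim.downloader as api

# Load pre-trained 300-dimensional Word2Vec vectors (an assumed model choice).
w2v = api.load("word2vec-google-news-300")

def text_to_vector(text: str) -> np.ndarray:
    """Average Word2Vec vector extended with 36 character/digit counts."""
    words = text.lower().split()
    word_vecs = [w2v[w] for w in words if w in w2v]
    avg = np.mean(word_vecs, axis=0) if word_vecs else np.zeros(w2v.vector_size)

    # 36-dimensional extension: counts of 'a'-'z' and '0'-'9' in the raw text
    # (interpreting the 26 letters + 10 digits mentioned above as character counts).
    chars = string.ascii_lowercase + string.digits
    counts = np.array([text.lower().count(c) for c in chars], dtype=float)

    return np.concatenate([avg, counts])
```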
PDF files: PDF files usually contain text, images or a mix of both. To make the solution more inclusive, vector conversion for both text and images is built in, combining the approaches discussed earlier.
First, the text is extracted from the PDF file and passed through the text pipeline discussed before. For the image component, each page of the PDF is treated as an image and passed through the pre-trained convolutional neural network discussed before. Since a PDF can have multiple pages, the average of all page vectors is taken to get the final representation.
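As a rough sketch of how the two components could be combined, assuming pdfminer.six for text extraction and pdf2image (which requires poppler) for rendering pages, and reusing text_to_vector, model and preprocess_input from the earlier sketches:

```python
# Rough sketch: combine text and per-page image embeddings for a PDF.
import numpy as np
from pdfminer.high_level import extract_text
from pdf2image import convert_from_path

def pdf_to_vector(path: str) -> np.ndarray:
    # Text component: extract all text from the PDF and vectorize it.
    text_vec = text_to_vector(extract_text(path))

    # Image component: render each page, embed it with the pre-trained CNN,
    # then average the page vectors so PDFs of any length give one vector.
    page_vecs = []
    for page in convert_from_path(path):
        arr = np.array(page.convert("RGB").resize((224, 224)), dtype="float32")
        arr = preprocess_input(arr[np.newaxis, ...])
        page_vecs.append(model.predict(arr, verbose=0)[0])
    image_vec = np.mean(page_vecs, axis=0)

    return np.concatenate([text_vec, image_vec])
```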
Audio files: Audio files stored in .wav or .mp3 formats contain sampled values of the audio signal. Audio signals are analogue, and to store them digitally they undergo sampling: an analogue-to-digital converter measures the sound wave at regular intervals of time and stores the measurements (samples). The sampling rate can vary from file to file, so when converting audio files to vectors, a fixed resampling step is applied to bring every file to a standard sampling rate.
Another difficulty when converting audio files into vectors is that their lengths may vary. To solve this, every file is mapped to a fixed-length vector by padding (adding zeros at the start or end) or trimming (cutting the vector down to the fixed length), depending on the audio length.
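A minimal sketch of this resampling and pad/trim step, assuming librosa is used for loading and resampling; the 16 kHz rate and 10-second target length are illustrative choices, not prescribed values:

```python
# Minimal sketch: resample audio to a fixed rate and pad/trim to a fixed length.
import numpy as np
import librosa

TARGET_SR = 16000             # resample every file to this standard rate (assumed value)
TARGET_LEN = TARGET_SR * 10   # fixed vector length: 10 seconds of audio (assumed value)

def audio_to_vector(path: str) -> np.ndarray:
    """Load a .wav/.mp3 file, resample it, and pad or trim to a fixed length."""
    samples, _ = librosa.load(path, sr=TARGET_SR)           # resamples to TARGET_SR
    if len(samples) >= TARGET_LEN:
        return samples[:TARGET_LEN]                          # trim long files
    return np.pad(samples, (0, TARGET_LEN - len(samples)))   # zero-pad short files
```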
Finding duplicates with vector representations
With vector representations for all file types, duplicates can now be found based on the distance between vectors. As previously stated, comparing every vector against every other vector is not an efficient method, as it increases latency. A more efficient, lower-latency approach is to use the nearest neighbors algorithm.
This algorithm takes the vectors as input and computes the Euclidean or cosine distance between all candidate pairs of vectors. Files whose vectors are the shortest distance apart are the most likely duplicates.
Computing Euclidean distances between all pairs is expensive (O(n^2)), but the optimized scikit-learn implementation backed by KD-trees reduces the computational cost to roughly O(n(k + log n)), where k is the dimension of the input vector.
Note that images, text, PDFs and audio files are converted to vectors through different processing pipelines, so the resulting vectors may be on very different scales. Since the nearest neighbors algorithm is distance-based, results will be skewed if the vectors are not on the same scale. For instance, if one vector's values range from 0 to 1 while another's range from 100 to 200, the second vector will dominate the distance calculation irrespective of actual similarity.
The nearest neighbors algorithm also tells us how similar the files are: the smaller the distance, the more similar the files. Each file vector therefore has to be scaled to a standard range to obtain a uniform distance measure, which can be done with a pre-processing technique such as StandardScaler from scikit-learn. After pre-processing, the nearest neighbors algorithm is applied to find the nearest vector for each file. Since the Euclidean distances are returned along with the nearest neighbor vectors, a distance threshold can be applied to filter out less probable duplicates.
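Putting these pieces together, here is a minimal sketch of the scaling, nearest neighbor search and distance thresholding described above, assuming the vectors come from the same file-type pipeline (and therefore share the same dimensionality) and are stacked into a 2-D array; the 0.5 threshold is an illustrative value to be tuned on real data.

```python
# Minimal sketch: scale vectors, find nearest neighbors, and threshold distances.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

def find_probable_duplicates(vectors, file_names, threshold=0.5):
    # Scale each dimension to a common range so no single dimension
    # dominates the distance computation.
    scaled = StandardScaler().fit_transform(vectors)

    # KD-tree backed nearest-neighbor search; n_neighbors=2 because the
    # closest neighbor of every vector is the vector itself.
    nn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree").fit(scaled)
    distances, indices = nn.kneighbors(scaled)

    # Keep only pairs whose nearest-neighbor distance falls below the threshold.
    duplicates = []
    for i, (dist, j) in enumerate(zip(distances[:, 1], indices[:, 1])):
        if dist < threshold:
            duplicates.append((file_names[i], file_names[j], dist))
    return duplicates
```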
Conclusion
Data duplication in any system degrades its performance and drives up infrastructure costs. Duplicate detection based on file characteristics alone is not recommended, because accurate results ultimately require examining the content itself. Vector-based search is a more efficient technique for duplicate detection, and a successful implementation of this methodology can help identify the most and least probable duplicate files in unstructured data storage systems.