Snippet: Speeding Up Bulk Uploads to Elastic with Parallelisation in Python

Scenario

Uploading 35,000 text documents in the format below, roughly 1-1500 words each, to an Elastic Cloud index without additional processing or ingest pipelines. The total corpus size is 350MB.

{
  "_id": "d2ab8863-c548-43a4-8645-402d0986a33b",
  "text": "Database Administrator - Family Private Care LLC Lawrenceville, GA A self-motivated Production SQL Server Database Administrator .... "
}

Setting

Destination: Elastic Cloud on GCP, asia-southeast1, Compute-Optimized cluster

Origin: M3 Pro, 12 threads
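
The snippet below assumes the destination index already exists. If it doesn't, it can be created up front; a minimal sketch, assuming a plain text mapping for the body field the snippet writes (the actual mapping isn't specified in the original):

import os
from elasticsearch import Elasticsearch

es = Elasticsearch(os.environ.get("ELASTIC_ENDPOINT"), api_key=os.environ.get("ELASTIC_API_KEY"))

# Create the target index with an explicit mapping for the indexed text field
es.indices.create(
    index="resumes_base",
    mappings={"properties": {"body": {"type": "text"}}},
)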

Problem

Uploading under the prescribed scenario with the elasticsearch.helpers.bulk API in batches of 500 documents takes 70-80 seconds on average. We want this faster.
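
For reference, the sequential baseline looks roughly like this; a sketch, since the baseline code isn't shown (helpers.bulk splits the action generator into batch-sized requests via chunk_size):

from elasticsearch import helpers

# documents: list of {"_id": ..., "text": ...} dicts; es: a connected Elasticsearch client
actions = ({"_index": "resumes_base", "_id": doc["_id"], "body": doc["text"]} for doc in documents)
success, errors = helpers.bulk(es, actions, chunk_size=500, raise_on_error=False)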

Solution

Parallel batch upload with the concurrent.futures module.

Result

Naive (Sequential, batch size 1) - 695 seconds

Base (Sequential, batch size 500) - 73.4 seconds

Parallel (5 workers maximum, batch size 500) - 37.1 seconds

Parallel (10 workers maximum, batch size 500) - 27.2 seconds

Parallel (10 workers, batch size 250) - 29 seconds

Parallel (10 workers, batch size 1000) - 41 seconds

Ten workers with a batch size of 500 was the sweet spot here, roughly a 2.7x speedup over the sequential 500-document baseline.
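
These timings can be reproduced with a plain wall-clock wrapper around the upload call; a minimal sketch using the function from the Code Snippet below:

import time

start = time.perf_counter()
uploaded, failed = bulk_upload_pickle_to_elasticsearch(documents, "resumes_base", es_client)
print(f"Uploaded {uploaded} documents ({failed} failed) in {time.perf_counter() - start:.1f}s")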


Code Snippet

import os
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
from elasticsearch import Elasticsearch, helpers # elasticsearch==8.14.0
from tqdm import tqdm # tqdm==4.66.4

def bulk_upload_pickle_to_elasticsearch(data, index_name, es, batch_size=500, max_workers=10):
    ''' 
    data: [ {document} ]
        document: {
                    "_id": str
                    ...
                  }
    index_name: str 
    es: Elasticsearch 
    batch_size: int 
    max_workers: int
    '''
    total_documents = len(data)
    success_bar = tqdm(total=total_documents, desc="Successful uploads", colour="green")
    failed_bar = tqdm(total=total_documents, desc="Failed uploads", colour="red")

    def create_action(doc):
        '''
        Define upload action from source documents
        '''
        return {
            "_index": index_name,
            "_id": doc["_id"],  # matches the "_id" key in the source documents
            "body": doc["text"]  # non-meta keys become fields of the indexed document
        }

    def read_and_create_batches(data):
        ''' 
        Yield document batches
        '''
        batch = []
        for doc in data:
            batch.append(create_action(doc))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    def upload_batch(batch):
        ''' 
        Make bulk call for batch
        '''
        try:
            # Per-request options like request_timeout go through .options() in the 8.x client
            success, failed = helpers.bulk(es.options(request_timeout=45), batch, raise_on_error=False)
            # With raise_on_error=False, failures are returned as a list of error dicts
            if isinstance(failed, list):
                failed = len(failed)
            return success, failed
        except Exception as e:
            print(f"Error during bulk upload: {str(e)}")
            return 0, len(batch)

    # Upload batches in parallel; each future resolves to (success_count, failed_count)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_batch = {executor.submit(upload_batch, batch): batch for batch in read_and_create_batches(data)}
        for future in as_completed(future_to_batch):
            success, failed = future.result()
            success_bar.update(success)
            failed_bar.update(failed)

    # Collect totals and close the progress bars
    total_uploaded = success_bar.n
    total_failed = failed_bar.n
    success_bar.close()
    failed_bar.close()

    return total_uploaded, total_failed

try:
    es_client = Elasticsearch(
        os.environ.get("ELASTIC_ENDPOINT"),
        api_key=os.environ.get("ELASTIC_API_KEY")
    )
    es_client.info()  # the constructor doesn't connect, so force a round trip to fail fast
except Exception as e:
    raise SystemExit(f"Could not connect to Elasticsearch: {e}")

# documents: list of {"_id": ..., "text": ...} dicts, loaded elsewhere (e.g. from a pickle)
total_uploaded, total_failed = bulk_upload_pickle_to_elasticsearch(documents, "resumes_base", es_client)
print(f"Total uploaded: {total_uploaded}, Total failed: {total_failed}")        
