Purge Methods in Elasticsearch, and What Comes After

Recently, a lot of questions have come my way about purging documents from Elasticsearch indexes, for example:

  1. What is the best way to delete from ES?
  2. What batch size we can prefer?
  3. What all things we need to take care once data is deleted?
  4. Can we use reindexing or clone?

This article talks about purge activity on Elasticsearch. It is detailed yet generic: there are many ways to deal with this, so I picked the common ones. Everything below reflects my own understanding; I strongly recommend going through the official documentation and practicing carefully.

So let's start by answering the questions one by one:

a) What is the best way to delete from ES?

Solution:

There are two methods, handled below case by case:

(i) the Bulk API

(ii) the Delete By Query API

Case 1: Using Delete by query

First, I inserted three documents into the deltest index:

curl --insecure -XPUT -u XXXX:XXXXX https://XXXXX:9201/deltest/_doc/1?pretty -H 'Content-Type: application/json' -d'
{
  "tweet": "test1"
}
'

curl --insecure -XPUT -u XXXXX:XXXXX https://XXXXX:9201/deltest/_doc/2?pretty -H 'Content-Type: application/json' -d'
{
  "tweet": "test2"
}
'

curl --insecure -XPUT -u es_admin:XXXXX https://XXXXX:9201/deltest/_doc/3?pretty -H 'Content-Type: application/json' -d'
{
  "tweet": "test3"
}
'

See the time taken by delete by query:

curl --insecure -u XXXXX:XXXXX -X POST https://XXXXXX:9201/deltest/_delete_by_query?pretty -H 'Content-Type: application/json' -d'
{
  "query": {
    "terms": {
      "_id": [1, 2, 3]
    }
  }
}
'

{
  "took" : 256,  <-- it took 256 milliseconds, which is quite high for a deletion
  "timed_out" : false,
  "total" : 3,
  "deleted" : 3,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

Case 2: Using Bulk api

Here the same 3 documents were deleted via the Bulk API. Observe the time.


curl --insecure -XPOST -u XXXX:XXXXX https://XXXXXX:9201/_bulk?pretty -H 'Content-Type: application/json' -d'
{ "delete" : { "_index" : "deltest", "_id" : "1" } }
{ "delete" : { "_index" : "deltest", "_id" : "2" } }
{ "delete" : { "_index" : "deltest", "_id" : "3" } }
'

? "took" : 5,-----------> See this
? "errors" : false,
? "items" : [
??? {
????? "delete" : {
??????? "_index" : "deltest",
??????? "_type" : "_doc",
??????? "_id" : "1",
??????? "_version" : 2,
??????? "result" : "deleted",
??????? "_shards" : {
????????? "total" : 2,
????????? "successful" : 2,
????????? "failed" : 0
??????? },
??????? "_seq_no" : 27,
??????? "_primary_term" : 1,
??????? "status" : 200
????? }
??? },
??? {
????? "delete" : {
??????? "_index" : "deltest",
??????? "_type" : "_doc",
??????? "_id" : "2",
??????? "_version" : 2,
??????? "result" : "deleted",
??????? "_shards" : {
????????? "total" : 2,
????????? "successful" : 2,
????????? "failed" : 0
??????? },
??????? "_seq_no" : 28,
??????? "_primary_term" : 1,
??????? "status" : 200
????? }
??? },
??? {
????? "delete" : {
??????? "_index" : "deltest",
??????? "_type" : "_doc",
??????? "_id" : "3",
??????? "_version" : 2,
??????? "result" : "deleted",
??????? "_shards" : {
????????? "total" : 2,
????????? "successful" : 2,
????????? "failed" : 0
??????? },
??????? "_seq_no" : 29,
??????? "_primary_term" : 1,
??????? "status" : 200
????? }
??? }
? ]
}        

So you can see the time taken by the Bulk API is much less than delete by query. When you submit a delete by query request, Elasticsearch takes a snapshot of the data stream or index as it begins processing the request and deletes matching documents using internal versioning. If a document changes between the time the snapshot is taken and the time the delete operation is processed, it results in a version conflict and the delete operation fails.

While processing a delete by query request, Elasticsearch performs multiple search requests sequentially to find all of the matching documents to delete. A delete request is performed for each batch of matching documents.

So, in my view, this internal snapshot preparation and the sequential searches take more time than the Bulk API does. Each method comes with its own features, so choose based on the nature of your application and the load on the system.

At least in my view, based on the time taken, I prefer the Bulk API for deletes instead of delete by query.
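The Bulk API takes its delete actions as newline-delimited JSON (NDJSON), one action per line, and the body must end with a trailing newline. As a rough sketch of how such a body can be assembled programmatically (the helper name and structure are my own, not from the article):

```python
import json

def build_bulk_delete_body(index, doc_ids):
    """Build the newline-delimited JSON (NDJSON) body for a _bulk delete.

    One action line per document; the _bulk endpoint requires the body
    to end with a trailing newline.
    """
    lines = [json.dumps({"delete": {"_index": index, "_id": str(doc_id)}})
             for doc_id in doc_ids]
    return "\n".join(lines) + "\n"

body = build_bulk_delete_body("deltest", [1, 2, 3])
```

The resulting string is exactly what the curl example above posts to `_bulk`, so the same helper scales from 3 documents to a batch of thousands.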

b) What batch size should we prefer?

Now, how much can we delete at once? It is not simple to answer, as it depends on the load on the system. Still, on a 16 vCPU machine, I suggest first measuring how long a batch of 5,000 takes; if 5,000 is doing fine, increase it gradually towards 10,000.
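That ramp-up (start at 5,000 and grow only while batches stay fast) can be sketched as a simple feedback loop. The function, latency target, and floor/ceiling below are illustrative assumptions of mine, not values measured on any particular cluster:

```python
def next_batch_size(current, elapsed_ms, target_ms=2000,
                    ceiling=10000, floor=1000):
    """Grow the purge batch size while deletes stay fast; halve it when a
    batch gets slow. All thresholds are illustrative starting points --
    tune them against the real load on your system.
    """
    if elapsed_ms <= target_ms:
        # Last batch was quick: it is safe to try a bigger one.
        return min(current * 2, ceiling)
    # Last batch was slow: back off to relieve pressure on the cluster.
    return max(current // 2, floor)

size = next_batch_size(5000, elapsed_ms=800)   # fast batch -> grow
```

A loop like this keeps the purge self-regulating: it never exceeds the ceiling you validated, and it backs off automatically when the cluster is busy.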

c) What do we need to take care of once data is deleted?

Once a huge amount of data is deleted, the deleted documents are only marked as deleted (ghost documents/tombstones); the old versions are not immediately removed. These soft-deleted documents are cleaned up automatically during regular segment merges, but that is a slow process. Also note that soft deletes count towards the total number of documents, and a single shard cannot hold more than about 2 billion documents. So if you delete in huge numbers, two things will occur:

(i) The index will get fragmented.

(ii) The number of deleted documents will increase and be counted in the total number of documents within a shard.

So what to do here? I generally ask the team to execute the force merge API with the expunge option. It can be executed without any downtime, and in my experience it takes roughly 30-40 minutes to remove 120 million soft deletes (on a 6-node cluster of 16 vCPU machines). With the expunge option, the merge rewrites only those segments where deleted documents make up more than about 10%:
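Before running the force merge, the ~10% figure maps to a simple ratio you can compute from the `docs.count` and `docs.deleted` values in the index stats. A minimal sketch (the helper and threshold handling are my own; the exact merge-policy internals in Lucene differ):

```python
def needs_expunge(docs_count, docs_deleted, threshold=0.10):
    """Return True when soft-deleted documents exceed the given share of
    the total (live + deleted) documents -- roughly the condition under
    which only_expunge_deletes=true will rewrite a segment.
    """
    total = docs_count + docs_deleted
    # Empty index: nothing to merge.
    return total > 0 and docs_deleted / total > threshold

# 120M soft deletes against 100M live docs -> well past the 10% mark.
heavily_fragmented = needs_expunge(100_000_000, 120_000_000)
```

A check like this lets a purge job skip the force merge entirely when the deleted-document ratio is still low, rather than paying the merge cost on every run.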

curl --insecure -X POST -u XXXXX:XXXX "https://xxxxx:9201/index_name/_forcemerge?only_expunge_deletes=true"        

But again, I suggest testing it according to the load on your system.

d) Can we use reindexing or clone?

Of course we can, provided all the fields we need are kept in the source field. Reindexing works on `_source`, meaning the fields retained by the `_source` settings in the mapping. If fields are excluded from `_source`, reindexing won't be effective: it will copy the data, but any field that was not part of `_source` will be missing from the copy and thus not searchable in the new index.

Say, in our case, `_source` filtering is configured as below, i.e. the ES index keeps only the document id and its metadata in `_source`. A pushed document's other fields are still searchable in the original index, but they won't appear in the fetched `_source` of the document.

"_source" : {

???????"includes" : [

?????????"meta.*",

?????????"doc.id"

???????]

Now, when we reindex, it copies all the documents, but because of the `_source` filtering above, ES stores only `doc.id` and the metadata, and only those get copied. Any other field, such as a status field that was indexed but not included in `_source`, will not exist in the new index at all and will no longer be searchable, which creates a challenge. The new index will have only `doc.id` and `meta.*`, so this option is not useful here.
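The effect of `_source` filtering on what reindex can copy can be simulated with a small filter over flattened field names. This is a simplification, and the helper below is hypothetical (real `_source` documents are nested JSON, not flat key/value pairs):

```python
from fnmatch import fnmatch

def filter_source(doc, includes):
    """Keep only the (flattened) field names matched by the
    _source.includes patterns -- a rough simulation of what Elasticsearch
    stores, and hence what reindex can copy, when _source filtering
    is configured.
    """
    return {k: v for k, v in doc.items()
            if any(fnmatch(k, pattern) for pattern in includes)}

doc = {"meta.author": "a", "meta.ts": 1, "doc.id": "42", "status": "open"}
kept = filter_source(doc, ["meta.*", "doc.id"])
# "status" is searchable in the old index (it was indexed), but the
# reindexed copy will not contain it: it was never stored in _source.
```

Running the filter makes the limitation concrete: everything outside the include patterns is simply absent from what reindex has to work with.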

Cloning of the index: only possible with downtime, because cloning requires the original index to be made read-only first, and while it is blocked, no writes can happen.

So I have tried to touch on most of the aspects here. Feel free to write to me in case of any concern.

Enjoy learning!
