Optimizing Disk Space Utilization in EFK Nodes

Efficient disk space utilization is crucial for maintaining the performance and stability of an Elasticsearch, Fluentd, and Kibana (EFK) stack. Although shard balancing is managed automatically across nodes, uneven disk consumption can still result from differences in shard sizes. To address this, administrators can take strategic actions to optimize storage distribution.

When is Manual Shard Relocation Needed?

There are specific scenarios where manually relocating shards is beneficial:

  • Uneven Disk Space Utilization: When some nodes are running out of storage while others have sufficient space.
  • Performance Optimization: Redistributing shards to balance the query and indexing load across nodes.
  • Failure Recovery: Moving shards away from failing or corrupted nodes to ensure data availability and prevent data loss.
  • Infrastructure Changes: When adding or removing nodes in a cluster, manual shard relocation can help balance the storage distribution.
  • Automated Balancing Limitations: In cases where Elasticsearch’s built-in balancing does not effectively resolve storage disparities, manual intervention is necessary.


Benefits of Manual Shard Relocation

Manual shard relocation offers several advantages:

  • Improved Load Balancing: Ensures that no single node becomes overwhelmed with excessive disk usage, improving cluster stability.
  • Enhanced Performance: By redistributing data efficiently, query performance and indexing speed can be optimized.
  • Preventing Disk Overflows: Helps prevent critical failures due to nodes running out of storage space.
  • Greater Control: Allows administrators to make informed decisions about shard placement rather than relying solely on automated balancing mechanisms.
  • Minimized Downtime: Strategic shard relocation reduces the risk of unexpected failures, leading to a more resilient system.


Identifying Storage Imbalances

The first step in optimizing disk space is monitoring shard distribution and usage. The following Elasticsearch command can be used to retrieve and display shard information in a human-readable format:

GET _cat/shards?v=true&h=index,shard,prirep,state,node,store&s=store        

_cat/shards → Fetches details about all shards in the cluster.

?v=true → Includes a header row for better readability.

&h=index,shard,prirep,state,node,store → Specifies the columns to display.

&s=store → Sorts the output by shard store size.

This command helps identify nodes with abnormally high disk usage, enabling targeted actions to redistribute shards effectively.
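To turn that tabular output into a per-node storage summary, the rows can be parsed and totaled programmatically. The following is a minimal sketch, not part of any official Elasticsearch client; the sample output embedded below is illustrative, not from a real cluster:

```python
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}

def parse_size(text: str) -> int:
    """Convert a _cat size string such as '12.5gb' to bytes."""
    for suffix in ("tb", "gb", "mb", "kb", "b"):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * UNITS[suffix])
    raise ValueError(f"unrecognized size: {text}")

def store_per_node(cat_shards_output: str) -> dict:
    """Sum shard store sizes per node from _cat/shards output."""
    totals = {}
    for line in cat_shards_output.strip().splitlines()[1:]:  # skip header row
        index, shard, prirep, state, node, store = line.split()
        totals[node] = totals.get(node, 0) + parse_size(store)
    return totals

sample = """\
index                     shard prirep state   node   store
example-index-2025.02.13  0     p      STARTED node-1 40gb
example-index-2025.02.13  0     r      STARTED node-2 40gb
example-index-2025.02.12  0     p      STARTED node-1 25gb
"""
print(store_per_node(sample))
```

Sorting the resulting totals immediately highlights which nodes carry a disproportionate share of the cluster's data.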


Redistributing Large Indices

If certain indices are consuming excessive storage on specific nodes, they can be relocated to nodes with more available space. To prevent duplication and conflicts, ensure the target node does not already hold a primary or replica shard of the same index before relocating.

Understanding Primary and Replica Indices

Elasticsearch uses a distributed architecture to improve data availability and fault tolerance (the ability of a system to keep working even when a component fails). Each index consists of multiple shards, categorized as primary and replica shards:

  • Primary Shard: This is the main data-holding shard responsible for indexing and searching. Each document is stored in one primary shard.
  • Replica Shard: This is a copy of the primary shard that provides redundancy. Replicas help in fault tolerance and load balancing by handling search requests.

Elasticsearch automatically ensures that a replica shard is never placed on the same node as its corresponding primary shard to prevent data loss in case of node failure.

Why Avoid Moving Indices to Nodes Containing Their Primary or Replica Shards?

If a primary or replica shard already exists on the target node, relocating another shard of the same index to that node can lead to:

  • Data Redundancy Issues: The system might store multiple copies of the same data unnecessarily.
  • Imbalanced Storage Utilization: One node could become overloaded while others remain underutilized.
  • Potential Conflicts: Elasticsearch’s balancing algorithms might continuously try to correct incorrect placements, causing instability.

Ensuring that the target node does not already hold a copy of the same shard prevents indexing issues and maintains balanced data distribution.
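That pre-move safety check can be expressed as a short helper over the parsed `_cat/shards` rows. This is an illustrative sketch with hypothetical row and node names, not an official API:

```python
def holds_shard(cat_rows, index, shard, node):
    """Return True if `node` already hosts a primary or replica copy of the
    given shard of `index`, in which case moving another copy there is unsafe."""
    for row in cat_rows:
        if row["index"] == index and row["shard"] == shard and row["node"] == node:
            return True
    return False

rows = [
    {"index": "example-index-2025.02.13", "shard": 0, "prirep": "p", "node": "node-1"},
    {"index": "example-index-2025.02.13", "shard": 0, "prirep": "r", "node": "node-2"},
]
# Moving shard 0 to node-2 would collide with its existing replica:
print(holds_shard(rows, "example-index-2025.02.13", 0, "node-2"))  # True
print(holds_shard(rows, "example-index-2025.02.13", 0, "node-3"))  # False
```

Note that Elasticsearch's allocation deciders would also reject such a move, but checking first avoids a failed reroute command.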


A manual shard relocation can be performed using the _cluster/reroute API:

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "example-index-2025.02.13",
        "shard": 0,
        "from_node": "node-1.example.com",
        "to_node": "node-2.example.com"
      }
    }
  ]
}        

This API call manually moves a shard from one node to another, thereby balancing disk space utilization across the cluster.
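When issuing several moves in sequence, it can help to assemble the request body programmatically rather than hand-editing JSON. A small sketch (illustrative, not part of any official client) that builds the reroute payload shown above:

```python
import json

def build_move_command(index: str, shard: int, from_node: str, to_node: str) -> str:
    """Build the JSON body for a _cluster/reroute 'move' command."""
    body = {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }
    return json.dumps(body, indent=2)

print(build_move_command("example-index-2025.02.13", 0,
                         "node-1.example.com", "node-2.example.com"))
```

The returned string can then be POSTed to `_cluster/reroute` with any HTTP client.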


Verifying the Relocation Process

Once the relocation command is executed, it is important to verify its progress and completion. The following command provides real-time status updates of the overall health of the cluster in a human-readable format:

GET _cluster/health?pretty

During the relocation process, the relocating_shards value should be greater than zero. Once the migration completes successfully, this value returns to zero, indicating a balanced storage distribution. A sample response:

{
  "cluster_name": "my-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 2,
  "active_primary_shards": 5,
  "active_shards": 8,
  "relocating_shards": 0,
  "initializing_shards": 1,
  "unassigned_shards": 2,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 80.0
}        
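The completion check described above can be automated by inspecting the health response. A minimal sketch (the sample JSON is abbreviated and illustrative):

```python
import json

def relocation_in_progress(health_json: str) -> bool:
    """Return True while the cluster still reports relocating shards."""
    return json.loads(health_json)["relocating_shards"] > 0

sample = '{"cluster_name": "my-cluster", "status": "yellow", "relocating_shards": 0}'
print(relocation_in_progress(sample))  # False
```

In practice this check would be polled periodically against `GET _cluster/health` until it returns False.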

Optimizing disk space utilization in an EFK cluster ensures better performance, prevents failures due to storage overload, and enhances operational efficiency. By implementing these techniques, organizations can proactively manage disk usage and ensure seamless EFK cluster operations.
