Bootstrapping OpenSearch - Part 2. - Indexes and DataStreams

In my previous article we went through the process of creating Helm charts to bootstrap an OpenSearch cluster in a local Kubernetes environment running in Rancher Desktop.

In this article we continue to build up our development environment with an Infrastructure as Code mindset and provision data streams and indexes declaratively. This frees us from going to the OpenSearch console, or using Postman or curl, to submit the right REST API requests to create these resources. Additionally, to make our local configuration more lifelike, Index State Management policies will be configured too, so that the indexes are rotated daily.

Indexes

Indexes in OpenSearch are collections of documents that are stored and indexed for search and retrieval. Each index is a physical storage unit that contains multiple documents, and each document is a collection of fields. Indexes are the fundamental building blocks for storing and querying data in OpenSearch, suitable for a wide range of use cases beyond just time-series data.

Data Streams

Data streams in OpenSearch are logical collections of time-series data, typically used for logging or metrics. They allow for efficient indexing and querying of data that arrives continuously over time. A data stream manages its underlying (backing) indices for you: combined with a rollover policy, it rolls over to a new backing index when specified conditions are met (e.g., index size or age), which simplifies the handling of large volumes of time-series data.

Aliases

Aliases in OpenSearch are pointers or shortcuts to one or more indices. They provide a flexible way to manage and interact with indices without directly referencing them by their actual names. Aliases are particularly useful for managing index rollovers. For example, a large amount of streaming data can be split into daily indexes. The writing application publishes data by referring to the alias, and the infrastructure selects the right index to sink the data into. If a certain condition is met (such as the index exceeding a certain size, or a new day starting), Index Management creates a new index and continues writing to that new index. The new index is added to the alias alongside the older indexes, allowing seamless searching and writing across the whole dataset.
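
As a rough illustration, an alias can be inspected from the Dashboards' Dev Tools. This is a minimal sketch that assumes the pageviews alias and the pageviews-index-* indexes created later in this article already exist; the output depends on your cluster.

```console
# List every index behind the alias; the entry flagged with
# "is_write_index": true is the one that receives new documents
GET _alias/pageviews

# The same information in tabular form
GET _cat/aliases/pageviews?v
```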

Index templates

Index templates in OpenSearch are predefined configurations that automatically apply settings, data schema mappings, and aliases to new indices that match a specified pattern. They help standardize index configurations and simplify the management of multiple indices.

Provisioning Indexes and Data Streams with the OpenSearch Kubernetes Operator

In the previous article, we set up a local OpenSearch cluster using the OpenSearch Kubernetes Operator. We will now continue by setting up indexes and data streams using a Helm chart.

During this process, we define Index Templates, which describe the schema of the data we intend to store in the index. For example, we might want to ingest page-view data as documents, which comprise the following fields:

  • event_time (timestamp for the time-series data)
  • pageid
  • userid
  • viewtime

The schema, with the appropriate data types, is provided as “mappings,” which instruct OpenSearch on how to handle the value of each specific field. An index pattern (such as “pageviews-index-*”) tells OpenSearch which newly created indexes the template applies to. Index data is split into shards and replicated across nodes; both counts can be configured in the template settings. Additionally, an Index State Management policy is defined to handle index rotation.

With an Index Template in place, users can easily create a new index or data stream and automatically receive the designated settings without hassle.

Example #1 Provisioning Indexes with Rollover Policy

Our first example involves creating an Index Template for indexes.

Note: As we intend to rotate the indexes using lifecycle management, the alias and writer index must not be specified in the template. Doing so would disrupt index rolling, resulting in an error: “Rollover alias [pageviews] can point to multiple indices, found duplicated alias [[pageviews]] in index template [pageviews_template].”

# file: helm/opensearch/templates/index.yaml
apiVersion: opensearch.opster.io/v1
kind: OpensearchIndexTemplate
metadata:
  name: {{ .Release.Name }}-pageviews
spec:
  opensearchCluster:
    name: {{ .Release.Name }}

  name: pageviews_template

  version: 1 
  _meta: {}

  indexPatterns:
    - "pageviews-index-*"   
  priority: 100     

  template:
    settings: 
      number_of_shards: 2
      number_of_replicas: 2
      index.plugins.index_state_management.policy_id: pageviews-index-cleanup-policy
      index.plugins.index_state_management.rollover_alias: "pageviews"

    mappings: 
      properties:
        event_time:
          type: date
          format: "strict_date_optional_time||epoch_millis"
        pageid:
          type: keyword
        userid:          
          type: keyword
        viewtime:
          type: long        
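
Once the chart is installed, one way to check that the operator has pushed the template to the cluster is to query it from Dev Tools. This is a sketch, assuming the template name pageviews_template from the manifest above. The second request previews, without creating anything, the settings and mappings that an index named pageviews-index-0 would receive.

```console
# Show the stored index template
GET _index_template/pageviews_template

# Dry run: resolve the templates that would apply to this index name
POST _index_template/_simulate_index/pageviews-index-0
```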

Next, the Index State Management (ISM) Policy is defined to roll over the index daily. This policy also ensures that after one day, the number of replicas is decreased from 2 to 1, and after seven days, the index is deleted. This approach helps reduce disk capacity requirements for older data while maintaining high availability and performance for recent data.

# file: helm/opensearch/templates/index-ism.yaml
apiVersion: opensearch.opster.io/v1
kind: OpenSearchISMPolicy
metadata:
  name: {{ .Release.Name }}-policy
  namespace: {{ .Release.Namespace }}
spec:
   opensearchCluster:
      name: {{ .Release.Name }}
   description: "Index State Management Policy to clean up old data."
   policyId: pageviews-index-cleanup-policy
   ismTemplate:
    priority: 1
    indexPatterns:
      - "pageviews-index-*"
   defaultState: hot
   states:
      - name: hot
        actions:
           - rollover:           
               minIndexAge: "24h"  
        transitions:
           - stateName: warm
             conditions:
               minIndexAge: "24h"
      - name: warm
        actions:
           - replicaCount:
                numberOfReplicas: 1
        transitions:
           - stateName: delete
             conditions:
                minIndexAge: "7d"
      - name: delete
        actions:
           - delete: {}
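
To confirm that the policy reached the cluster, it can be fetched by its ID through the ISM API. A sketch, assuming the policyId from the manifest above:

```console
# Returns the stored policy, including its states, actions and transitions
GET _plugins/_ism/policies/pageviews-index-cleanup-policy
```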
        

Creating the initial index

Once the Index Template and the Index State Management Policy are set up, a manual step is required to create the first index and assign it to the appropriate index alias as the write index.

The following command can be executed in the Dashboards’ Dev Tools:

PUT pageviews-index-0
{
  "aliases": {
    "pageviews": {
      "is_write_index": true
    }
  }
}        

This command creates the initial index named pageviews-index-0 and assigns the alias pageviews to it, marking it as the write index. This setup ensures that new data is ingested into the correct index, and the rollover policy can manage subsequent indices efficiently.

Then you can verify that the alias and the index are properly initialized with:

GET pageviews        

The response should be similar to this:

{
  "pageviews-index-0": {
    "aliases": {
      "pageviews": {
        "is_write_index": true
      }
    },
    "mappings": {
      "properties": {
        "event_time": {
          "type": "date"
        },
        "pageid": {
          "type": "keyword"
        },
        "userid": {
          "type": "keyword"
        },
        "viewtime": {
          "type": "long"
        }
      }
    },
    "settings": {
      "index": {
        "replication": {
          "type": "DOCUMENT"
        },
        "number_of_shards": "2",
        "plugins": {
          "index_state_management": {
            "policy_id": "pageviews-index-cleanup-policy",
            "rollover_alias": "pageviews"
          }
        },
        "provided_name": "pageviews-index-0",
        "creation_date": "1729359423003",
        "number_of_replicas": "2",
        "uuid": "cxa6o5tlSQKo_Hm8CKQtxA",
        "version": {
          "created": "136387927"
        }
      }
    }
  }
}        

From there new documents can be ingested using the alias:

POST pageviews/_doc
{
  "userid": "user_8",
  "pageid": "page_4",
  "viewtime": 12345,
  "event_time": "2024-10-18T00:02:00Z"
}        
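
To check that the document landed, the alias can be searched in the same way it is written to. A minimal sketch using the standard search API; the time range is only an example that covers the document above.

```console
# Search across all indexes behind the alias, newest events first
GET pageviews/_search
{
  "query": {
    "range": {
      "event_time": {
        "gte": "2024-10-18T00:00:00Z",
        "lt": "2024-10-19T00:00:00Z"
      }
    }
  },
  "sort": [
    { "event_time": "desc" }
  ]
}
```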

Monitor index rollover

The following command is useful to monitor the lifecycle state of the indexes belonging to the alias. You can discover potential issues related to rollover too.

GET _plugins/_ism/explain/pageviews

After a few minutes it will look like:

{
  "pageviews-index-0": {
    "index.plugins.index_state_management.policy_id": "pageviews-index-cleanup-policy",
    "index.opendistro.index_state_management.policy_id": "pageviews-index-cleanup-policy",
    "index": "pageviews-index-0",
    "index_uuid": "tyBu3IUPTpyMM_5_RtKHGA",
    "policy_id": "pageviews-index-cleanup-policy",
    "policy_seq_no": 3209,
    "policy_primary_term": 2,
    "index_creation_date": 1729360610499,
    "state": {
      "name": "hot",
      "start_time": 1729361075268
    },
    "retry_info": {
      "failed": false,
      "consumed_retries": 0
    },
    "info": {
      "message": "Successfully initialized policy: pageviews-index-cleanup-policy"
    },
    "enabled": true
  },
  "total_managed_indices": 1
}        
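
Waiting a full day to see a rollover is impractical on a development cluster. As a hedged sketch, the standard rollover API can trigger one by hand through the alias. This bypasses the ISM schedule, so treat it as a test aid rather than part of the setup.

```console
# Force a rollover of the current write index behind the alias
POST pageviews/_rollover

# List the indexes matching the template pattern to see the new one
GET _cat/indices/pageviews-index-*?v
```

Because pageviews-index-0 ends in a number, the rollover API derives the next index name by incrementing that number and zero-padding it to six digits.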

Example #2 Provisioning Data Streams with Rollover Policy

Our second example involves creating an Index Template for Data Streams. This process is more convenient because we do not need to manually create the “first index”; it is created automatically when the first write request arrives at the Data Stream. Additionally, note that the configuration does not include an alias with the is_write_index setting, as this would prevent proper index rollover.

Here are the YAML resources:

# file: helm/opensearch/templates/datastream.yaml
apiVersion: opensearch.opster.io/v1
kind: OpensearchComponentTemplate
metadata:
  name: {{ .Release.Name }}-pageviews-ds
spec:
  opensearchCluster:
    name: {{ .Release.Name }}

  template:
    settings:
      number_of_shards: 2
      number_of_replicas: 2
    mappings:
      properties:
        event_time:
          type: date
          format: "strict_date_optional_time||epoch_millis"
        pageid:
          type: keyword
        userid:          
          type: keyword
        viewtime:
          type: long                     

---
apiVersion: opensearch.opster.io/v1
kind: OpensearchIndexTemplate
metadata:
  name: {{ .Release.Name }}-pageviews-ds
spec:
  opensearchCluster:
    name: {{ .Release.Name }}

  name: pageviews-ds_template

  indexPatterns:
    - "pageviews-ds"
  composedOf:
    - {{ .Release.Name }}-pageviews-ds
  priority: 100    

  dataStream:
    timestamp_field:
      name: "event_time"
        

Together, these two resources define the schema and settings for the Data Stream: the component template carries the settings and mappings, and the index template applies them to the index pattern pageviews-ds. The dataStream section indicates that this template is for a Data Stream, ensuring that the first backing index is automatically created and managed. The "event_time" field is specified there as the timestamp for the time-series data, because we do not use the standard "@timestamp" field.

Similarly to the first example, we define a policy to manage the indexes. As an "index pattern" we use the name of the data stream.

apiVersion: opensearch.opster.io/v1
kind: OpenSearchISMPolicy
metadata:
  name: {{ .Release.Name }}-policy-ds
  namespace: {{ .Release.Namespace }}
spec:
   opensearchCluster:
      name: {{ .Release.Name }}
   description: "Index State Management Policy to clean up old data."
   policyId: pageviews-ds-cleanup-policy
   ismTemplate:
    priority: 1
    indexPatterns:
      - "pageviews-ds"
   defaultState: hot
   states:
      - name: hot
        actions:
           - rollover:       
               minIndexAge: "24h"
        transitions:
           - stateName: warm
             conditions:
               minIndexAge: "24h"
      - name: warm
        actions:
           - replicaCount:
                numberOfReplicas: 1
        transitions:
           - stateName: delete
             conditions:
                minIndexAge: "7d"
      - name: delete
        actions:
           - delete: {}        

From there new documents can be ingested into the data stream:

POST pageviews-ds/_doc
{
  "userid": "user_8",
  "pageid": "page_4",
  "viewtime": 12345,
  "event_time": "2024-10-18T00:02:00Z"
}        
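
A sketch for inspecting the result, assuming the data stream name pageviews-ds used above. The first request shows the data stream with its timestamp field and backing indexes; the second searches it by name, just like an alias.

```console
# Show the data stream, its generation and its backing indexes
GET _data_stream/pageviews-ds

# Search all backing indexes through the data stream name
GET pageviews-ds/_search
{
  "query": {
    "term": { "userid": "user_8" }
  }
}
```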

Conclusion

Across this article and the previous one, we have explored the process of bootstrapping OpenSearch in a local Kubernetes environment using Rancher Desktop and Helm. We began by setting up the OpenSearch cluster with the OpenSearch Kubernetes Operator, followed by configuring indexes and data streams to efficiently manage and query time-series data.

We delved into the creation of Index Templates to define the schema and settings for our data, ensuring consistency and ease of management. Additionally, we implemented Index State Management Policies to automate index rollover and optimize resource usage, maintaining high performance and availability for recent data.

By leveraging these tools and configurations, you can streamline the deployment and management of OpenSearch.

We hope this guide has been informative and valuable in your journey to mastering OpenSearch. For more advanced configurations and best practices, refer to the official OpenSearch documentation and community resources.

Thank you for following along, and happy searching!

