My evaluation of Elastic Cloud Serverless on Microsoft Azure (Technical Preview)

Introduction

Last week I decided to try out the technical preview of Elastic Cloud Serverless on Azure. After reading through the announcement and gathering some first thoughts, I spun up a serverless project to give it a try.

In this blog we will try to verify the following claims made in the announcement.

  • No compromise on speed or scale
  • Hassle-free operations
  • Purpose-built product experience
  • Simplified pricing model
  • Security and compliance certified

Let’s see what we can conclude here.

Project creation

First I’ve selected the use case to try-out. I’ve chosen “Elastic for Security“.

Use case selection for Project type.

It took around 3 minutes to spin up a fully functional instance with a Kibana-like console. The big difference is that the console is fully focused on the chosen use case; Kibana only powers it under the hood.

Project console Powered by Kibana.

Awesome, that’s already an A+ for the project deployment, but I want to see more.

Elastic Cloud

In the Elastic Cloud portal you immediately notice the two separate sections for Hosted deployments and Serverless projects.

Elastic Cloud portal sections Deployments vs Projects.

We have just created the serverless project called “My Security project“, which runs on Azure in the East US (Virginia) region.

Comparing the two, Serverless projects no longer let you choose a hardware profile or version. You just select the project type, such as Security, and go.

Serverless project Management

Now we go to Manage our first Serverless project called “My Security project“.

Serverless Project Management.

Here we look at the Overview section, which shows the project ID and connection details such as URLs. Big tiles are available for managing Data and Integrations, managing API keys and adding project Membership.

The first two are native project features within the console; Membership is managed through the familiar organization feature in the Elastic Cloud portal.
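
If you prefer the API route, an API key can also be created from Developer Tools using the security API. Below is a minimal sketch; the key name and expiration are hypothetical values.

POST /_security/api_key
{
  "name": "my-ingest-key",
  "expiration": "30d"
}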

Also notice we can only change the Project name or features. Project features are all related to the type, in my case Security. Below you see the configuration options available to us.

Project features for Security

Important to keep in mind: you can't change the project type or the cloud provider/region. Changing either requires recreating the project. Data migration appears to be your responsibility as a customer, since only a Delete project action is available.

One open question is how to handle snapshots or duplication when migrating or cloning a project. This would be a desirable feature request.

Renewed Storage Architecture

It took a while, but Elastic did a good job creating an object-storage-based architecture, which they call the Search AI Lake architecture.

It combines object storage capabilities, such as Azure Blob Storage, to ensure performance and durability while minimising latency. As an Elastic user you only choose the cloud region where your data resides, which is helpful for compliance reasons such as GDPR.

And, as many would expect, there is no hot-warm architecture anymore.

Everything is abstracted away in the Search AI Lake. Another benefit is that you don't have to manage the storage layer yourself. You can configure a default retention that fits all data, set a maximum retention, or tune retention per data stream. Everything seems straightforward, and we assume our data is stored in the Azure East US region.

Below is how Data Retention is configured in the Elastic Cloud portal.

Data Retention Configuration.
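
Besides the portal, retention can also be tuned per data stream from Developer Tools using the data stream lifecycle API. A minimal sketch, assuming a hypothetical data stream called logs-custom-default:

PUT _data_stream/logs-custom-default/_lifecycle
{
  "data_retention": "30d"
}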

If you want to know more about Search AI Lake Architecture, read the blog here.

Exploring the Project Console

When we open the project console we land directly in the Elastic Security UI. Everything is focused on the type of project you have chosen. At the time of writing I didn't spot any limitations or missing features. Core functionality like Alerts, Cases, Findings, Rules and Attack Discovery is available, including the Security AI Assistant, which requires you to configure your Azure OpenAI instance first.

Security AI Assistant initial setup.

We also have Stack Management. Most options are familiar, but I’m curious about the Index creation process. Let’s look into this.

Index Creation

Index management is still available, but technically everything has evolved towards the Search AI Lake architecture. This makes creating an index simple: there is no need to set number_of_replicas or number_of_shards anymore. Only a name and an index mode are required.

Index Management.
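
For comparison, here is what creating an index looks like from Developer Tools. Note the absence of any shard or replica settings; the index name and mapped fields below are just hypothetical examples.

PUT my-test-index
{
  "mappings": {
    "properties": {
      "message":    { "type": "text" },
      "@timestamp": { "type": "date" }
    }
  }
}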

Luckily we still have Developer Tools available, let’s go over this in a new section.

Developer Tools

Developer Tools is a crucial and powerful console for managing your Elastic environment. It provides an API-driven management approach, for example through the Compact and Aligned Text (CAT) APIs. Here I'm curious what the differences are when using the famous CAT APIs.

Exploring the basics, we immediately find differences, which is expected since we no longer manage the infrastructure components. Familiar insights like shards, segments, allocation and recovery are gone. I do miss things like health, thread_pool and snapshot insights, so the question remains how we would troubleshoot or handle data recovery.

From a platform perspective only the indices and data related APIs are still available. See the list below; a quick example follows after it. Try it yourself using “GET _cat”.

=^.^=
/_cat/indices
/_cat/indices/{index}
/_cat/count
/_cat/count/{index}
/_cat/aliases
/_cat/aliases/{alias}
/_cat/component_templates
/_cat/ml/anomaly_detectors
/_cat/ml/anomaly_detectors/{job_id}
/_cat/ml/datafeeds
/_cat/ml/datafeeds/{datafeed_id}
/_cat/ml/trained_models
/_cat/ml/trained_models/{model_id}
/_cat/ml/data_frame/analytics
/_cat/ml/data_frame/analytics/{id}
/_cat/transforms
/_cat/transforms/{transform_id}        
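
For example, the indices and count endpoints still behave as you would expect from Developer Tools; the v parameter simply adds column headers.

GET _cat/indices?v
GET _cat/count?v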

Data Ingestion

Now that we have explored most features it's time to do some actual data ingestion. As input I'm using a publicly available data set from Kaggle, an interesting community website for sharing Machine Learning and Data Science data.

I chose a CSV data set that includes crime data from 2020 to present. This data set is available here. Credits to Avis02 for publishing this data set.

CSV Integration

For uploading file contents like CSV, TSV and JSON there is a helpful Integration available. Let’s first add this integration.

Upload file Integration

Now provide the index name, keep the defaults and start the import.

Import data setup.

After the import completed, 1,004,847 documents were created. The import took around 5 minutes, so the ingest rate was roughly 3,350 documents per second.
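
A quick way to double-check the document count from Developer Tools, assuming the index was named crimes (the name used in the queries later in this blog):

GET crimes/_count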

Normally I would look into the index stats, which show statistics about indexing and search. Unfortunately these are unavailable for serverless projects. For troubleshooting and monitoring purposes, such insights would still be helpful.

Query performance

Now let's look further into query performance. To validate this we are going to use the bool query below. Let's try it in the Developer Tools console first.

POST crimes/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "Crm Cd Desc": "VEHICLE"
          }
        },
        {
          "range": {
            "AREA": {
              "gte": 10
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "Status": "IC"
          }
        }
      ]
    }
  }
}        

The first query response took 35 ms, but subsequent responses averaged around 4 ms. Here the caching layer seems to kick in.

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 2.8923671,
    "hits": [
      ...
    ]
        

Search Profiler

Now let us analyse the Bool query above using the Search Profiler, which is still available to us, next to other great tools like the Grok Debugger and Painless Lab.

Now select the crimes index, copy over the query part and send the request.
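
The Search Profiler is backed by the profile option of the search API, which you can also call directly from Developer Tools. A sketch reusing the bool query from above:

POST crimes/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        { "match": { "Crm Cd Desc": "VEHICLE" } },
        { "range": { "AREA": { "gte": 10 } } }
      ],
      "should": [
        { "match": { "Status": "IC" } }
      ]
    }
  }
}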

First execution is around 22 ms. See below.

Initial query results using Search Profiler

After the first query, which again seems to warm up the caching layer, we are consistently around 12-13 ms. We could slightly optimize the range query by also setting an ‘lte’ bound, such as 50 (sketched below the screenshot).

Warmed up cache results using Search Profiler
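
As an illustration of that suggestion, the revised range clause inside the bool query would read as follows; the value 50 is just an example upper bound.

{
  "range": {
    "AREA": {
      "gte": 10,
      "lte": 50
    }
  }
}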

Benchmarking Performance with Rally

Now let's burn some of that serverless power. For this I'm going to use Rally, an open-source tool for benchmarking Elasticsearch environments. There is a dedicated documentation section that explains its serverless capabilities here.

Setting up Rally

Installation on Ubuntu is simple when Python3 (default nowadays) is installed. Just follow the steps below.

sudo apt install python3-pip -y
sudo apt install git -y
sudo apt install pbzip2 -y
python3 -m pip install --user --upgrade pip
sudo apt install python3.10-venv -y
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install esrally        

Now first have a look at the various tracks available to race. Ensure you are in the virtual environment (venv).

esrally list tracks        

Most applicable for us is the security track. Take notice of the requirements, such as local storage for Rally and the required API Key.

Let’s execute and enjoy the ride!

Security Track

Below is an example that hits the project in test mode.

esrally race --track=elastic/security --target-hosts=${ES_HOST}:443 --pipeline=benchmark-only --client-options="use_ssl:true,api_key:${ES_API_KEY}" --on-error=abort --test-mode        
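
The ${ES_HOST} and ${ES_API_KEY} variables used above hold the project's Elasticsearch endpoint (host only, since the port is passed separately) and an API key. The values below are purely hypothetical placeholders.

export ES_HOST=my-security-project-abc123.es.eastus.azure.elastic.cloud
export ES_API_KEY="<base64-encoded-API-key>"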

Follow the logging as shown below. Notice that serverless mode is detected and the cluster health checks are skipped.


    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [c0f06d40-dc46-48f6-a794-a5f465fab00b]
[INFO] Detected Elasticsearch Serverless mode with operator=[False].
[INFO] Installing track dependencies [geneve==0.2.0, pyyaml, elastic-transport==8.4.1, elasticsearch==8.6.1]
[INFO] Treating parallel task in challenge [security-querying] as public.
[INFO] Excluding [check-cluster-health] as challenge [security-querying] is run on serverless.
[INFO] Downloading track data (97.2 kB total size)                                [100.0%]        

During the run of our Rally track I immediately missed the insights from Stack Monitoring. Looking for an equivalent, I opened the Elastic Cloud portal and looked at the Usage and Performance metrics. It seems the Ingest rate has a delay, so my Rally-driven ingestion was not shown yet. Something to keep in mind. Below is a screenshot.

Delayed Usage and Performance metrics

After the ‘elastic/security’ track I looked for a smaller track to execute that includes nested documents.

Nested Track

This track is a good performance check, especially since nested documents can be complex and can cause performance bottlenecks.
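
For reference, a nested query in Elasticsearch looks roughly like the sketch below, assuming a hypothetical index with an answers field mapped as nested.

POST my-nested-index/_search
{
  "query": {
    "nested": {
      "path": "answers",
      "query": {
        "match": { "answers.body": "elasticsearch" }
      }
    }
  }
}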

Let’s start the nested track.

esrally race --track=nested --target-hosts=${ES_HOST}:443 --pipeline=benchmark-only --client-options="use_ssl:true,api_key:${ES_API_KEY}"        

Looking at the indices (using the CAT API) I can see an index called sonested growing.
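
The request used is along these lines (add ?v if you want column headers):

GET _cat/indices/sonested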

green open sonested erb3My1MRoya0_cXHHhXvw 1 1 25567891 0 1.1gb 1.1gb 2.8gb

After a successful run that took 2,226 seconds the following report was returned.

    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/

[INFO] Race id is [8255f112-e75d-48ee-9337-90a66cb03f75]
[INFO] Detected Elasticsearch Serverless mode with operator=[False].
[INFO] Excluding [check-cluster-health], [force-merge], [wait-until-merges-finish] as challenge [nested-search-challenge] is run on serverless.
[INFO] Racing on track [nested], challenge [nested-search-challenge] and car ['external'] with version [serverless].

Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running index-append                                                           [100% done]
Running refresh-after-index                                                    [100% done]
Running refresh-after-force-merge                                              [100% done]
Running randomized-nested-queries                                              [100% done]
Running randomized-term-queries                                                [100% done]
Running randomized-sorted-term-queries                                         [100% done]
Running match-all                                                              [100% done]
Running nested-date-histo                                                      [100% done]
Running randomized-nested-queries-with-inner-hits_default                      [100% done]
Running randomized-nested-queries-with-inner-hits_default_big_size             [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                         Metric |                                                       Task |      Value |   Unit |
|-------------------------------:|-----------------------------------------------------------:|-----------:|-------:|
|                 Min Throughput |                                               index-append | 19304.2    | docs/s |
|                Mean Throughput |                                               index-append | 20042.7    | docs/s |
|              Median Throughput |                                               index-append | 20093.6    | docs/s |
|                 Max Throughput |                                               index-append | 20262.3    | docs/s |
|        50th percentile latency |                                               index-append |   797.295  |     ms |
|        90th percentile latency |                                               index-append |  1069.43   |     ms |
|        99th percentile latency |                                               index-append |  2194.52   |     ms |
|      99.9th percentile latency |                                               index-append |  2760.21   |     ms |
|       100th percentile latency |                                               index-append |  2965.42   |     ms |
|   50th percentile service time |                                               index-append |   797.295  |     ms |
|   90th percentile service time |                                               index-append |  1069.43   |     ms |
|   99th percentile service time |                                               index-append |  2194.52   |     ms |
| 99.9th percentile service time |                                               index-append |  2760.21   |     ms |
|  100th percentile service time |                                               index-append |  2965.42   |     ms |
|                     error rate |                                               index-append |     0      |      % |
|                 Min Throughput |                                  randomized-nested-queries |    17.17   |  ops/s |
|                Mean Throughput |                                  randomized-nested-queries |    18.52   |  ops/s |
|              Median Throughput |                                  randomized-nested-queries |    18.65   |  ops/s |
|                 Max Throughput |                                  randomized-nested-queries |    19.23   |  ops/s |
|        50th percentile latency |                                  randomized-nested-queries |  3647.95   |     ms |
|        90th percentile latency |                                  randomized-nested-queries |  4602.93   |     ms |
|        99th percentile latency |                                  randomized-nested-queries |  5026.31   |     ms |
|      99.9th percentile latency |                                  randomized-nested-queries |  5058.38   |     ms |
|       100th percentile latency |                                  randomized-nested-queries |  5066.09   |     ms |
|   50th percentile service time |                                  randomized-nested-queries |    95.8041 |     ms |
|   90th percentile service time |                                  randomized-nested-queries |   101.063  |     ms |
|   99th percentile service time |                                  randomized-nested-queries |   103.555  |     ms |
| 99.9th percentile service time |                                  randomized-nested-queries |   115.023  |     ms |
|  100th percentile service time |                                  randomized-nested-queries |   235.051  |     ms |
|                     error rate |                                  randomized-nested-queries |     0      |      % |
|                 Min Throughput |                                    randomized-term-queries |    23.59   |  ops/s |
|                Mean Throughput |                                    randomized-term-queries |    23.65   |  ops/s |
|              Median Throughput |                                    randomized-term-queries |    23.65   |  ops/s |
|                 Max Throughput |                                    randomized-term-queries |    23.68   |  ops/s |
|        50th percentile latency |                                    randomized-term-queries |  2541.45   |     ms |
|        90th percentile latency |                                    randomized-term-queries |  3384.52   |     ms |
|        99th percentile latency |                                    randomized-term-queries |  3539.69   |     ms |
|       100th percentile latency |                                    randomized-term-queries |  3557.58   |     ms |
|   50th percentile service time |                                    randomized-term-queries |    82.8646 |     ms |
|   90th percentile service time |                                    randomized-term-queries |    83.9267 |     ms |
|   99th percentile service time |                                    randomized-term-queries |    87.8161 |     ms |
|  100th percentile service time |                                    randomized-term-queries |   109.894  |     ms |
|                     error rate |                                    randomized-term-queries |     0      |      % |
|                 Min Throughput |                             randomized-sorted-term-queries |    11.77   |  ops/s |
|                Mean Throughput |                             randomized-sorted-term-queries |    11.91   |  ops/s |
|              Median Throughput |                             randomized-sorted-term-queries |    11.92   |  ops/s |
|                 Max Throughput |                             randomized-sorted-term-queries |    12.02   |  ops/s |
|        50th percentile latency |                             randomized-sorted-term-queries | 24935.3    |     ms |
|        90th percentile latency |                             randomized-sorted-term-queries | 27700.2    |     ms |
|        99th percentile latency |                             randomized-sorted-term-queries | 28473.5    |     ms |
|       100th percentile latency |                             randomized-sorted-term-queries | 28597.1    |     ms |
|   50th percentile service time |                             randomized-sorted-term-queries |   148.491  |     ms |
|   90th percentile service time |                             randomized-sorted-term-queries |   201.326  |     ms |
|   99th percentile service time |                             randomized-sorted-term-queries |   226.093  |     ms |
|  100th percentile service time |                             randomized-sorted-term-queries |   227.999  |     ms |
|                     error rate |                             randomized-sorted-term-queries |     0      |      % |
|                 Min Throughput |                                                  match-all |     5      |  ops/s |
|                Mean Throughput |                                                  match-all |     5      |  ops/s |
|              Median Throughput |                                                  match-all |     5      |  ops/s |
|                 Max Throughput |                                                  match-all |     5      |  ops/s |
|        50th percentile latency |                                                  match-all |    82.9413 |     ms |
|        90th percentile latency |                                                  match-all |    84.1061 |     ms |
|        99th percentile latency |                                                  match-all |    86.3922 |     ms |
|       100th percentile latency |                                                  match-all |   101.242  |     ms |
|   50th percentile service time |                                                  match-all |    81.2109 |     ms |
|   90th percentile service time |                                                  match-all |    82.1217 |     ms |
|   99th percentile service time |                                                  match-all |    84.0972 |     ms |
|  100th percentile service time |                                                  match-all |    98.8632 |     ms |
|                     error rate |                                                  match-all |     0      |      % |
|                 Min Throughput |                                          nested-date-histo |     1      |  ops/s |
|                Mean Throughput |                                          nested-date-histo |     1      |  ops/s |
|              Median Throughput |                                          nested-date-histo |     1      |  ops/s |
|                 Max Throughput |                                          nested-date-histo |     1      |  ops/s |
|        50th percentile latency |                                          nested-date-histo |   740.46   |     ms |
|        90th percentile latency |                                          nested-date-histo |   745.375  |     ms |
|        99th percentile latency |                                          nested-date-histo |   757.155  |     ms |
|       100th percentile latency |                                          nested-date-histo |   774.359  |     ms |
|   50th percentile service time |                                          nested-date-histo |   737.498  |     ms |
|   90th percentile service time |                                          nested-date-histo |   742.6    |     ms |
|   99th percentile service time |                                          nested-date-histo |   753.964  |     ms |
|  100th percentile service time |                                          nested-date-histo |   771.635  |     ms |
|                     error rate |                                          nested-date-histo |     0      |      % |
|                 Min Throughput |          randomized-nested-queries-with-inner-hits_default |    17.92   |  ops/s |
|                Mean Throughput |          randomized-nested-queries-with-inner-hits_default |    17.95   |  ops/s |
|              Median Throughput |          randomized-nested-queries-with-inner-hits_default |    17.96   |  ops/s |
|                 Max Throughput |          randomized-nested-queries-with-inner-hits_default |    17.97   |  ops/s |
|        50th percentile latency |          randomized-nested-queries-with-inner-hits_default |    98.135  |     ms |
|        90th percentile latency |          randomized-nested-queries-with-inner-hits_default |   103.348  |     ms |
|        99th percentile latency |          randomized-nested-queries-with-inner-hits_default |   107.32   |     ms |
|      99.9th percentile latency |          randomized-nested-queries-with-inner-hits_default |   132.372  |     ms |
|       100th percentile latency |          randomized-nested-queries-with-inner-hits_default |   143.779  |     ms |
|   50th percentile service time |          randomized-nested-queries-with-inner-hits_default |    96.3666 |     ms |
|   90th percentile service time |          randomized-nested-queries-with-inner-hits_default |   101.587  |     ms |
|   99th percentile service time |          randomized-nested-queries-with-inner-hits_default |   104.37   |     ms |
| 99.9th percentile service time |          randomized-nested-queries-with-inner-hits_default |   112.876  |     ms |
|  100th percentile service time |          randomized-nested-queries-with-inner-hits_default |   117.065  |     ms |
|                     error rate |          randomized-nested-queries-with-inner-hits_default |     0      |      % |
|                 Min Throughput | randomized-nested-queries-with-inner-hits_default_big_size |    15.88   |  ops/s |
|                Mean Throughput | randomized-nested-queries-with-inner-hits_default_big_size |    15.93   |  ops/s |
|              Median Throughput | randomized-nested-queries-with-inner-hits_default_big_size |    15.94   |  ops/s |
|                 Max Throughput | randomized-nested-queries-with-inner-hits_default_big_size |    15.96   |  ops/s |
|        50th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size |   114.945  |     ms |
|        90th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size |   120.141  |     ms |
|        99th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size |   123.643  |     ms |
|      99.9th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size |   148.916  |     ms |
|       100th percentile latency | randomized-nested-queries-with-inner-hits_default_big_size |   153.608  |     ms |
|   50th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size |   113.26   |     ms |
|   90th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size |   118.335  |     ms |
|   99th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size |   121.079  |     ms |
| 99.9th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size |   123.385  |     ms |
|  100th percentile service time | randomized-nested-queries-with-inner-hits_default_big_size |   125.117  |     ms |
|                     error rate | randomized-nested-queries-with-inner-hits_default_big_size |     0      |      % |


----------------------------------
[INFO] SUCCESS (took 2226 seconds)
----------------------------------        

After going through the Final Score we can conclude that:

  • No errors occurred during any of the tests, which is great.
  • Throughput seems stable and performance is good.
  • No large latency outliers were spotted; everything stayed within an acceptable range.
  • Unfortunately it took many minutes before the Usage and Performance metrics were updated correctly.

See the Ingest rate below.

Elastic Cloud Portal Usage and Performance metrics

Again, this is just a benchmark, but it gives a good view of stability and performance.

Conclusion

Now it's time to wrap up and draw our conclusions. We see Elastic transforming into a true SaaS platform. From a user perspective this is great. I'm impressed by the work and the current offering, with some minor remarks.

Let’s go through the promises below.

No compromise on speed or scale

I can agree with this. Performance tests show that some warm-up is still needed, but after that performance is excellent.

Hassle-free operations

Indeed, infrastructure challenges are gone, but of course other challenges will pop up. For troubleshooting, recovery and monitoring I do miss some functionality.

Purpose-built product experience

Fully agree here. Elastic Serverless projects are built to provide a full product experience. There is no infrastructure complexity anymore, since it's all abstracted away. Just ingest, store and use your data.

Simplified pricing model

With a usage-based pricing model, pricing becomes flexible and truly data-centric. This helps teams make valuable business decisions about data storage, but Elastic should be careful not to scare customers away from storing large volumes of data. Luckily there are pricing packages available for that. Read more about this here.

Security and compliance certified

This is an important topic, especially when you process and store PII or PHI data. In Europe, for example, we have the GDPR. If you want to assess the service, you can download most documentation from the Trust portal.

Still, I have some questions that came up while writing this blog. Below is a list of open questions that I would like to see answered in the future.

  • How can customers handle snapshots or duplication from a project migration or clone perspective? What is the recommended approach for migrating to a new project, for example to a different CSP or region?
  • How can customers troubleshoot or handle data recovery? In some cases insights such as thread pools, health status and snapshots are helpful for troubleshooting issues and recovering data.
  • How can customers look into more in-depth index (and search) statistics? In particular cases these are helpful for understanding ingestion statistics in a programmatic way.
  • Will the current refresh time of the Usage and Performance metrics be improved? When troubleshooting production issues it is not acceptable to wait several minutes to see your ingestion metrics. I also miss more metrics that could be used for troubleshooting.

Next steps

Ready to Unlock the Full Potential of Elastic Cloud Serverless?

Are you looking to streamline your search, observability, and security workflows with Elastic Cloud Serverless but unsure where to start? Let’s connect!

Whether you need guidance on implementation, architectural design, or best practices for optimization, we are available to help you during the journey.

Drop me a message or book a consultation to discuss how you can get the most out of Elastic Cloud.

Let’s build secure, scalable, and efficient solutions together!
