Mental Transitions from SQL to Splunk: Office Buildings vs Public Libraries
By Dal Jeanis
I came to Splunk from the relational world last year (December 2016), and let me tell you, translating directly from SQL code to SPL will screw you up nearly every time. Many things that are efficient in SQL will turn your SPL into a swamp.
Here's an analogy that will hopefully help you transition over...
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
Repeat that to yourself a couple of times, then read on.
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
THE RELATIONAL OFFICE BUILDING AND WAREHOUSES
So, relational databases are based on the idea of everything having a place: all the stuff of a particular type is stored in one warehouse, or on one floor of the office building, in nice little organized rows. If you want anything out, you ask for the stuff that's in one particular warehouse, THEN tell the system how to connect it to the stuff that's stored somewhere else. That's efficient for the forklifts.
You with me here?
That will be inefficient in Splunk, every time.
THE SPLUNK MAIN LIBRARIAN AND BRANCH LIBRARIES
Now, if you remember how libraries work, there's a bunch of stuff lying around on tables, and there are shelves, some of which are pretty well indexed and some of which aren't. In the old days, there were card files of 3x5 index cards that let you find books in exactly one of three ways: author, subject, or title. And they had stacks in the back for stuff in odd sizes that didn't fit.
That’s Splunk.
Let me go a little further here: if you want to know what's in ALL the libraries, how does that work? Assuming you are the mayor and have the authority to demand that knowledge, you send a request to the head librarian. She writes instructions for all the other librarians, sends them out to the branch libraries, and they send back their results, which are then collated by the head librarian.
And, one more little thing, ONLY the head librarian's senior staff is allowed to compile and analyze stuff, compare books to other books, and so on. Union rules. Well, that and the head librarian is sure that her staff can read and count.
So, if you send a request that requires everything of a particular type to be TOGETHER before analyzing it and connecting it to something else, then that means that EVERY BOOK needs to be sent to the head librarian. Twice.
Okay, maybe not always twice, but at least once more than it needs to be. Because, unions, right? Anyway, she has to have all the work done by her own staff, instead of having most of it done by all the good (cheap) junior people at the branch libraries.
When possible, write your Splunk queries so that the branch librarians are only sending back the LIST of information you really need, one line per shelf item, not the shelf items themselves. And write them so that the branches send back ALL the materials you need at the same time, rather than having the head librarian compile things, then ask again. (Commands like map and most uses of join are the repeat offenders here.)
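To see what the ask-again pattern looks like in practice, here is a sketch of the classic direct translation of a SQL inner join. The sourcetypes and field names are invented for illustration:

sourcetype=orders
| join type=inner customer_id
    [ search sourcetype=customers ]
| stats count by customer_name

The subsearch inside the brackets runs to completion first, its results get collected at the search head, and then every orders event has to be matched against them there. On top of the extra trips, the subsearch can be silently truncated when it exceeds its result limit. The next section shows the pattern to reach for instead.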
THE SEARCH PSEUDOCODE FOR SPLUNK SOUP
Martin Mueller wrote the following pseudocode for the general method of combining data in Splunk. I call this method “Splunk Soup”: you throw everything in a pot and stir until it turns into what you want. The above discussion explains the first line of the pseudocode, which sends back everything you want in a single request:
((filters identifying events of type A) OR (filters identifying events of type B))
| fields ... the list of every field that you need from either type A or B ...
| eval joinfield = case(expression to detect type A,
                        functions(to(transform(events, of, type, A))),
                        expression to detect type B,
                        functions(to(transform(events, of, type, B))))
| stats values(field1) as field1 ... values(fieldN) as fieldN by joinfield
The second line is a | fields command, limiting the values returned to the fields you absolutely need. The Splunk optimizer will do this for you most of the time, but when you are starting out with SPL it's better to do it explicitly, as early as possible, so that you learn to think about it.
You are telling the branch librarians, i.e. the indexers, that you need nothing else besides fields x, y and z. Let's keep those books on the shelves.
After that, we create a field that will help us to “join” the two types of records together, in this case joining them using the stats command.
There is a separate essay you can refer to that explains all the different structures you can use in Splunk to join data. There are at least six distinct ways, each of which has different applicability and efficiency based on the characteristics of the underlying data. Stats is the one to aim for, and “Splunk Soup” the default architecture, unless there is a reason to do otherwise.
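To make the template concrete, here is a hedged sketch of Splunk Soup applied to the hypothetical orders and customers events from earlier, where one event type carries the key as cust_id and the other as customer_id (every name here is illustrative):

(sourcetype=orders) OR (sourcetype=customers)
| fields sourcetype cust_id customer_id order_id customer_name
| eval joinfield = case(sourcetype == "orders", cust_id,
                        sourcetype == "customers", customer_id)
| stats values(order_id) as order_id values(customer_name) as customer_name by joinfield

One pass over the data, the branch libraries trim each event down to a handful of fields before shipping it, and the head librarian only has to group index cards by joinfield. (Note that sourcetype has to be listed in the fields command, since the eval still needs it.)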
So, what's our mantra for today?
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
STREAMING vs NON-STREAMING, DISTRIBUTABLE vs EVERYTHING ELSE
Okay, one more thing. The key to knowing what a branch librarian can do is the words "streaming" and "distributable".
In our analogy, "streaming" means the librarians can do it to one book at a time. "Distributable" means it can be done at a branch library. (Yes, there ARE things the head librarian can do one at a time, if she has to. In fact, she has to do so for the remainder of the search as soon as you include a command that no one else can do.)
Whether something is streaming and/or distributable is normally pretty intuitive, but there are exceptions if you don't know all the functions of a command. The table command was a gotcha for me. I initially thought it just reformatted the output, so it would of course be streaming and distributable; in reality, among other things, it also potentially limits the number of records returned... which cannot be done at the branch libraries, because they are not aware of what the other branches are doing.
Therefore, table is non-streaming, so use fields instead, as in our sample pseudocode.
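Here is a hedged before-and-after with invented names, showing the difference in practice. The first version lets the branch libraries trim each event down to two fields before anything ships:

sourcetype=access_combined status>=500
| fields host status
| stats count by host

The second returns the same answer, but because table is non-streaming, the events all travel to the head librarian before being trimmed:

sourcetype=access_combined status>=500
| table host status
| stats count by host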
PERFORMANCE, IN THEORY AND IN PRACTICE
When you are trying to speed up a query, as a general case, you want to…
· limit by time first,
· limit by records second,
· reduce to the required fields third, then
· do streaming commands, then
· non-streaming commands.
Subject to the above (a sketch putting all of this together follows the list)…
· Do reformats after all aggregations if the formatted field is one of the aggregation keys (and especially if the reformat is increasing the field size).
· Do reformats and renames that are solely for display purposes last, after everything else has been completed.
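As a hedged sketch of that ordering, with every name invented for illustration:

earliest=-24h index=web sourcetype=access_combined status>=500
| fields host status uri_path
| eval uri_root = mvindex(split(uri_path, "/"), 1)
| stats count as errors by host uri_root
| sort - errors
| rename uri_root as "URI Root", errors as "Error Count"

Time first (earliest), records second (the index, sourcetype, and status filters), required fields third, the streaming eval next, then the non-streaming stats and sort, with the display-only rename dead last.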
When you have multiple options and a bit of time, or when you have a slow search, then write the code up multiple ways and TEST THEM. Performance in Splunk, as in everything else, is highly data-dependent and architecture-dependent; the same "efficient" query on sparse data might be inefficient on dense data, and so on.
Second mantra for the day:
Where theory and real performance differ, believe the real performance.
There are some guidelines and a few new mental maps laid down. Happy splunking!
Dal Jeanis, August 2017
This essay has been reformatted and expanded to stand alone, from an answer originally posted on Answers.splunk.com at https://answers.splunk.com/answers/561130/sql-analogy-for-my-log-search-join-two-tables-wher.html.