Mental Transitions from SQL to Splunk: Office Buildings vs Public Libraries
By Dal Jeanis
I came to Splunk from the relational world last year (December 2016), and let me tell you, translating directly from SQL code to SPL will screw you up nearly every time. Many things that are efficient in SQL will turn your SPL into a swamp.
Here's an analogy that will hopefully help you transition over...
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
Repeat that to yourself a couple of times, then read on.
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
THE RELATIONAL OFFICE BUILDING AND WAREHOUSES
So, relational databases are based on the idea of everything having a place: all the stuff of a particular type is stored in one warehouse, or on one floor of the office building, in nice little organized rows. If you want anything out, you ask for the stuff that's in one particular warehouse, THEN tell the system how to connect it to the stuff that's stored somewhere else. That's efficient for the forklifts.
You with me here?
That will be inefficient in Splunk, every time.
THE SPLUNK MAIN LIBRARIAN AND BRANCH LIBRARIES
Now, if you remember how libraries work, there's a bunch of stuff lying around on tables, and there are shelves, some of which are pretty well indexed and some of which aren't. In the old days, there were card files of 3x5 index cards that let you find books in exactly one of three ways: author, subject, or title. And they had stacks in the back for stuff in odd sizes that didn't fit.
That’s Splunk.
Let me go a little further here: if you want to know what's in ALL the libraries, how does that work? Assuming you are the mayor and have the authority to demand that knowledge, you send a request to the head librarian. She writes instructions for all the other librarians, sends them out to the branch libraries, and they send back their results, which are then collated by the head librarian.
And, one more little thing, ONLY the head librarian's senior staff is allowed to compile and analyze stuff, compare books to other books, and so on. Union rules. Well, that and the head librarian is sure that her staff can read and count.
So, if you send a request that requires everything of a particular type to be TOGETHER before analyzing it and connecting it to something else, then that means that EVERY BOOK needs to be sent to the head librarian. Twice.
Okay, maybe not always twice, but at least once more than it needs to be. Because, unions, right? Anyway, she has to have all the work done by her own staff, instead of having most of it done by all the good (cheap) junior people at the branch libraries.
When possible, write your Splunk queries so that the branch librarians are only sending back the LIST of information you really need, one line per shelf item, not the shelf items themselves. And write them so that the branches send back ALL the materials you need at the same time, rather than having the head librarian compile things, then ask again. (Commands like map and most uses of join are the repeat offenders here.)
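To see what the ask-again pattern looks like in practice, here is a sketch of the classic direct translation of a SQL inner join. The sourcetypes and field names are invented for illustration:

sourcetype=orders
| join type=inner customer_id
    [ search sourcetype=customers ]
| stats count by customer_name

The subsearch inside the brackets runs to completion first, its results get collected at the search head, and then every orders event has to be matched against them there. On top of the extra trips, the subsearch can be silently truncated when it exceeds its result limit. The next section shows the pattern to reach for instead.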
THE SEARCH PSEUDOCODE FOR SPLUNK SOUP
Martin Mueller wrote the following pseudocode for the general method of combining data in Splunk. I call this method “Splunk Soup”: you throw everything in a pot and stir until it turns into what you want. The above discussion explains the first line of the pseudocode, which sends back everything you want in a single request:
((filters identifying events of type A) OR (filters identifying events of type B))
| fields ... the list of every field that you need from either type A or B ...
| eval joinfield = case(expression to detect type A,
                        functions(to(transform(events, of, type, A))),
                        expression to detect type B,
                        functions(to(transform(events, of, type, B))))
| stats values(field1) as field1 ... values(fieldN) as fieldN by joinfield
The second line is a | fields command, limiting the values returned to the fields you absolutely need. The Splunk optimizer will do this for you most of the time, but when you are starting out with SPL it's better to do it explicitly, as early as possible, so that you learn to think about it.
You are telling the branch librarians, i.e. the indexers, that you need nothing else besides fields x, y and z. Let's keep those books on the shelves.
After that, we create a field that will help us to “join” the two types of records together, in this case joining them using the stats command.
There is a separate essay you can refer to that explains all the different structures you can use in Splunk to join data. There are at least six distinct ways, each of which has different applicability and efficiency based on the characteristics of the underlying data. Stats is the one to aim for, and “Splunk Soup” the default architecture, unless there is a reason to do otherwise.
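To make the template concrete, here is a hedged sketch of Splunk Soup applied to the hypothetical orders and customers events from earlier, where one event type carries the key as cust_id and the other as customer_id (every name here is illustrative):

(sourcetype=orders) OR (sourcetype=customers)
| fields sourcetype cust_id customer_id order_id customer_name
| eval joinfield = case(sourcetype == "orders", cust_id,
                        sourcetype == "customers", customer_id)
| stats values(order_id) as order_id values(customer_name) as customer_name by joinfield

One pass over the data, the branch libraries trim each event down to a handful of fields before shipping it, and the head librarian only has to group index cards by joinfield. (Note that sourcetype has to be listed in the fields command, since the eval still needs it.)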
So, what's our mantra for today?
Relational Databases are like an office building attached to a bunch of warehouses.
Splunk is like a city library system.
STREAMING vs NON-STREAMING, DISTRIBUTABLE vs EVERYTHING ELSE
Okay, one more thing. The key to knowing what a branch librarian can do is the words "streaming" and "distributable".
In our analogy, "streaming" means the librarians can do it to one book at a time. "Distributable" means it can be done at a branch library. (Yes, there ARE things the head librarian can do one at a time, if she has to. In fact, she has to do so for the remainder of the search as soon as you include a command that no one else can do.)
Whether something is streaming and/or distributable is normally pretty intuitive, but there are exceptions if you don't know all the functions of a command. The table command was a gotcha for me. I initially thought it just reformatted the output, so it would of course be streaming and distributable; in reality, among other things, it also potentially limits the number of records returned... which cannot be done at the branch libraries, because they are not aware of what the other branches are doing.
Therefore, table is non-streaming, so use fields instead, as in our sample pseudocode.
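Here is a hedged before-and-after with invented names, showing the difference in practice. The first version lets the branch libraries trim each event down to two fields before anything ships:

sourcetype=access_combined status>=500
| fields host status
| stats count by host

The second returns the same answer, but because table is non-streaming, the events all travel to the head librarian before being trimmed:

sourcetype=access_combined status>=500
| table host status
| stats count by host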
PERFORMANCE, IN THEORY AND IN PRACTICE
When you are trying to speed up a query, as a general case, you want to…
· limit by time first,
· limit by records second,
· reduce to the required fields third, then
· do streaming commands, then
· non-streaming commands.
Subject to the above (a sketch putting all of this together follows the list)…
· Do reformats after all aggregations if the formatted field is one of the aggregation keys (and especially if the reformat is increasing the field size).
· Do reformats and renames that are solely for display purposes last, after everything else has been completed.
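As a hedged sketch of that ordering, with every name invented for illustration:

earliest=-24h index=web sourcetype=access_combined status>=500
| fields host status uri_path
| eval uri_root = mvindex(split(uri_path, "/"), 1)
| stats count as errors by host uri_root
| sort - errors
| rename uri_root as "URI Root", errors as "Error Count"

Time first (earliest), records second (the index, sourcetype, and status filters), required fields third, the streaming eval next, then the non-streaming stats and sort, with the display-only rename dead last.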
When you have multiple options and a bit of time, or when you have a slow search, then write the code up multiple ways and TEST THEM. Performance in Splunk, as in everything else, is highly data-dependent and architecture-dependent; the same "efficient" query on sparse data might be inefficient on dense data, and so on.
Second mantra for the day:
Where theory and real performance differ, believe the real performance.
There are some guidelines and a few new mental maps laid down. Happy splunking!
Dal Jeanis, August 2017
This essay has been reformatted and expanded to stand alone, from an answer originally posted on Answers.splunk.com at https://answers.splunk.com/answers/561130/sql-analogy-for-my-log-search-join-two-tables-wher.html.