Splunk and Data-Driven IT - Using outlier analysis of properties and attributes of fields within events.
Using data analysis to find outlier attributes

When #DataMining, you should look for outlier properties and attributes that lead to investigative deep queries, surfacing the most frequent, most abusive, or otherwise most interesting aspects of the traffic you are trying to identify.

This outlier analysis determines the most common username length across the top six bad top-level domains [.top, .xyz, .download, etc.] whose messages result in automatic deletion. Using that attribute as a filter reveals a wealth of information about nefarious traffic and about the sources of traffic that users and internal computers should not be exposed to (or whose exposure should at least be minimized).
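
A minimal SPL sketch of that analysis might look like the following. The index name (mail), the action value (DELETE), the sender field, and the exact TLD list are assumptions for illustration; your sourcetypes, field extractions, and bad-TLD list will differ:

    index=mail action=DELETE
    | rex field=sender "^(?<username>[^@]+)@[^@]*\.(?<tld>[a-z]+)$"
    | search tld IN ("top", "xyz", "download", "loan", "click", "stream")
    | eval ulen=len(username)
    | stats count BY ulen
    | sort ulen

The rex pulls the username (local part) and top-level domain out of the sender address, and the stats/sort pair produces the length distribution discussed below.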

This distribution's prominent peak at len=11 does not exist in accepted / low-SCL [Spam Confidence Level] traffic. The "accepted" .com traffic has a bump at 11, but its overall distribution more closely follows the standard curve before skewing into the right tail.

When does this seemingly esoteric detail come in handy? These property and attribute details come into play when, for example, you are extracting field data to build a word database.

When collecting and extracting inbound messages flagged as UCE (Unsolicited Commercial Email, i.e. spam) or BEC (Business Email Compromise - spam with criminal intent: invoice-jacking, payment fraud, etc.), some usernames are obviously false, offensive, inappropriate for the workplace, or outright bogus (e.g. Zer0-Prnet-Creditcards@). When using #BigData platforms like @Splunk, it is a straightforward project to collect extractions into a dedicated index for later use. I want to use this word database/index as a known-good versus known-bad username list for multi-elimination queries.
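
In Splunk terms, one way to sketch this is a scheduled search that writes the extractions to a summary index with the collect command. The index name username_corpus is hypothetical and must already exist; the sender field is the same assumption as above:

    index=mail
    | rex field=sender "^(?<username>[^@]+)@(?<sender_domain>.+)$"
    | eval ulen=len(username)
    | table _time username sender_domain ulen
    | collect index=username_corpus

Once collected, the corpus can be queried on its own without re-touching the raw message-tracking events.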

As a side note: I separately imported a sampling of first and last names reported to the Census Bureau and the Social Security Administration from 1880 to 2008. Together this forms a large word database focused on rooting out bad message senders and their source domains, subnets, and even their ISPs/datacenters.
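
Assuming that name sample lives in a lookup file (here a hypothetical census_names.csv with a single name column), classifying collected usernames against it is a short lookup:

    index=username_corpus
    | lookup census_names.csv name AS username OUTPUT name AS known_name
    | eval verdict=if(isnotnull(known_name), "known-good", "suspect")
    | stats count BY verdict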

Knowing the range of the properties or attributes of specific (or "key") data fields makes for much more efficient searches. The deeper the query, the more those empty range values become time-consuming null cycles that return no usable results. Is there any purpose in searching for nefarious senders with usernames shorter than eight characters? Not really; the distribution in the bad top-level domain graph explicitly shows the range of interest and opportunity for this investigative vector.

Further down the data road, you can combine the property with other specific factors to drill down on a particular set of activities. An examination of one of the worst-of-the-worst top-level domains in messaging, .top, reveals a different distribution signature than the combined top six top-level domains. This information can be very useful for compartmentalizing or subsetting queries based on the specific observed ranges of those properties and/or attributes.

The graph above shows the distribution for the .top top-level domain, which can be used to create .top-specific searches limited to the known observed ranges. Searching hundreds of millions, or even billions, of events with a range of c(10,62) is a much smaller query than having to search all possible property values of the field. In this case you eliminate roughly 3/4 of the query (and, by extension, the query time) by excluding events without observed properties, attributes, or values thereof, since the spec's acceptable range is c(1,253).
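
A sketch of that range-bounded .top search, with the same hypothetical field names as above:

    index=mail sender="*@*.top"
    | rex field=sender "^(?<username>[^@]+)@"
    | eval ulen=len(username)
    | where ulen>=10 AND ulen<=62
    | stats count BY username
    | sort -count

The where clause carries the c(10,62) observed range; events outside it are discarded before the aggregation stage ever sees them.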

Writing the exception query against the observed values of a given property or attribute is straightforward. Catching what falls outside your observed historical ranges is a trivial piece of code. Remember that every time you run the deep query, as in a cyclical alert process, you compound the impact of deep queries with large numbers of null results. This has a growing negative operational impact across your search cluster topology.
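
A minimal sketch of that exception query, again using the hypothetical .top ranges, simply inverts the bounds:

    index=mail sender="*@*.top"
    | rex field=sender "^(?<username>[^@]+)@"
    | eval ulen=len(username)
    | where ulen<10 OR ulen>62
    | stats count BY ulen

Scheduled as an alert, this fires only when the historical range shifts, which is exactly the case where re-running the full deep query is worth its cost.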

Also remember that each related factor (this graph is .stream, lately the worst-of-the-worst) may produce its own distribution signature, and that detail helps you write hyper-specific, well-performing queries over very large datasets.
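
To compare those signatures side by side, a chart split by TLD makes the per-factor distributions visible in one pass (same assumed fields as the earlier sketches):

    index=mail
    | rex field=sender "^(?<username>[^@]+)@[^@]*\.(?<tld>[a-z]+)$"
    | search tld IN ("top", "stream")
    | eval ulen=len(username)
    | chart count OVER ulen BY tld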


About the author: Gregg is an in-the-trenches Silicon Valley engineer with a serious case of Data Compulsion Disorder who has been working in various levels of IT, DevOps, data engineering and maths for over 20 years, including L3 reverse engineering and debugging. Gregg is currently a working IT consultant and a co-founder/data and performance engineer/acting data scientist in a data-focused start-up company.

Disclaimer: The information provided in this post is my opinion and my proprietary research. This is not a recommendation, warranty, surety or guarantee in any form whatsoever.

