Big Data Analytics

Like so many buzzwords, it's impossible to avoid the term “Big Data” these days. Today I thought I'd explain the history of how the term came to be, and its implications for Analytics. Wikipedia defines it as “a term for data sets that are so large or complex that traditional data processing applications are inadequate”. The term “big data” is relatively recent. My first recollection of it was in the context of the large volumes of unstructured data from social media sites like Facebook and Twitter. When you consider the volume of posts to these sites (literally billions per day), it is easy to see why traditional data processing models can break down: either they are not up to the task of analyzing the data at all, or they can't produce results on a timely basis, perhaps taking days or weeks.

These days, the meaning of “big data” has morphed from this original usage to simply “lots of data”, like the normal structured transactional volumes of any large organization. This is because transactional data volumes continue to skyrocket, and as a result they can suffer from the same processing challenges as other big data sources.

Over my 30+ years in the analytics field I have regularly heard from customers who say they have extremely large files (remember, “big data” is a relatively new term). In the early days this often meant laughably small quantities like a million records. The reason they considered this “large” was the tools they were using at the time: a million records is impossibly large when your Excel 97 allows only 65,536 rows. Of course, their real problem was choosing a tool that was not suited to their data. Luckily, I have always been associated with tools that thrive on big data.

Moving to the 21st century, data volumes are obviously orders of magnitude larger, but that doesn't mean users are making better tool choices. For example, Excel now allows just over a million rows, so a 10 million record table might still seem “big” (it isn't). The best example of a tool choice associated with big data is Hadoop. Hadoop (and its associated technologies) allows you to spread your processing across multiple processors. Simply put, if it takes you 100 hours to perform an analysis on your volume of data, then spreading it across 100 simultaneous processors might theoretically make it happen in an hour (though it seldom comes close to this). Unfortunately, the infrastructure and maintenance costs of this type of solution are extreme. The sketch below illustrates the underlying idea.
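
To make the divide-and-conquer idea concrete, here is a minimal sketch using Python's multiprocessing on a single machine. It is not Hadoop; the records, the worker count, and the count_large_payments() check are invented purely for illustration. The point is only that independent workers each process a slice of the data, and their partial results are combined at the end.

    # Illustration of "spread the work across N processors"; not Hadoop itself.
    from multiprocessing import Pool

    def count_large_payments(chunk):
        # Scan one slice of the records and count amounts over a made-up threshold.
        return sum(1 for amount in chunk if amount > 10_000)

    if __name__ == "__main__":
        records = list(range(1_000_000))    # stand-in for transaction amounts
        workers = 4                         # number of simultaneous processors
        chunks = [records[i::workers] for i in range(workers)]   # split into N slices

        with Pool(workers) as pool:
            partial_counts = pool.map(count_large_payments, chunks)

        print(sum(partial_counts))          # combine ("reduce") the partial results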

Imagine, if you will, that you had a processing alternative that didn't require heroic commitments to hardware and technological complexity to achieve timely results on big data. This is exactly what the Arbutus solution offers. It may not suit the very largest data sets, but even my desktop machine (around $3K) processes up to 6 million records per second. This rivals far more expensive and complex alternatives, and is orders of magnitude faster than typical SQL-based data solutions. I admit that I have a pretty good desktop, but the point is that in many cases you don't need to go overboard to handle big data. A quick back-of-envelope calculation makes the point.
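
The figures below are my own illustration, not a benchmark: the throughput comes from the paragraph above, and the 400-million-record file is hypothetical. Even so, at that rate a sequential pass over hundreds of millions of records takes only about a minute.

    # Back-of-envelope arithmetic only; throughput figure from the text above,
    # file size hypothetical.
    records_per_second = 6_000_000
    records_in_file = 400_000_000
    print(records_in_file / records_per_second, "seconds")   # roughly 67 seconds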

I'm not saying to dump Hadoop and the like for all big data projects, but before you up the ante on your processing budget, consider the simpler alternatives that are already available.

Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm

Nishith Seth

Professional in Analytics Automation | Advanced Analytics | Digital Transformation | Artificial Intelligence | Machine Learning | Certified Master Trainer for Advanced Analytics

8y

We found the Arbutus Server very good and fast for analysing server logs, with almost 400 million records covering a 12-month period under analysis...

