Big Data Analytics
Like so many buzzwords, “Big Data” is impossible to avoid these days. Today I thought I’d explain how the term came to be, and its implications for analytics. Wikipedia defines it as “a term for data sets that are so large or complex that traditional data processing applications are inadequate”. The term “big data” is relatively recent; my first recollection of it was in the context of the large volumes of unstructured data coming from social media sites like Facebook and Twitter. When you consider the volume of posts to these sites (literally billions per day), it is easy to see why traditional data processing models break down. Either they are not up to the task of analyzing the data at all, or they can’t produce results on a timely basis, perhaps taking days or weeks to deliver an answer.
These days, “big data” has morphed from this original meaning to simply mean “lots of data”, including the normal structured transactional volumes of any large organization. This is because transactional data volumes continue to skyrocket, and as a result these data sets can suffer from the same processing challenges as other big data sources.
Over my 30+ years in the analytics field I have regularly heard from customers who say they have extremely large files (remember, “big data” is a relatively new term). In the early days this often meant laughably small quantities, like a million records. They considered this “large” because of the tools they were using at the time: a million records is impossibly large when your copy of Excel 97 allows only 65,536 rows. Of course, their real problem was choosing a tool that was not suited to their data. Luckily, I have always been associated with tools that thrive on big data.
Moving to the 21st century, data volumes are obviously orders of magnitude larger, but that doesn’t mean users are making better tool choices. Excel now allows just over a million rows, which might still make a 10-million-record table seem “big” (it isn’t). The best-known example of a tool choice associated with big data is Hadoop. Hadoop (and its associated technologies) lets you spread your processing across many processors. Simply put, if an analysis takes you 100 hours on your volume of data, spreading it across 100 simultaneous processors might theoretically finish it in an hour (though in practice it seldom even approaches this). Unfortunately, the infrastructure and maintenance costs of this type of solution are extreme.
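To see why the theoretical 100x rarely materializes, here is a minimal sketch (Python, purely illustrative and not tied to Hadoop or any particular product) using Amdahl’s law. The assumption that 95% of the work parallelizes cleanly is mine, chosen only for the example.

```python
# Minimal sketch of why "100 processors = 100x faster" is only a theoretical
# ceiling. Amdahl's law: if any fraction of the work can't be parallelized,
# speedup flattens out well before the processor count.

def ideal_speedup(workers: int) -> float:
    """Perfect scaling: 100 workers finish 100 hours of work in 1 hour."""
    return float(workers)

def amdahl_speedup(workers: int, parallel_fraction: float) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

if __name__ == "__main__":
    hours_on_one_machine = 100.0   # the 100-hour job from the example above
    workers = 100
    p = 0.95                       # assumed: 95% of the work parallelizes cleanly

    print(f"Ideal:  {hours_on_one_machine / ideal_speedup(workers):.1f} hours")
    print(f"Amdahl: {hours_on_one_machine / amdahl_speedup(workers, p):.1f} hours")
    # Ideal:  1.0 hours
    # Amdahl: ~6 hours -- before counting cluster setup, data shuffles, and I/O.
```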
Imagine, if you will, that you had a processing alternative that didn’t require heroic commitments to hardware and technological complexity to achieve timely results on big data. This is exactly what the Arbutus solution offers. It may not suit the very largest data sets, but even my desktop machine (around $3K) processes up to 6 million records per second. That rivals much more expensive and complex alternatives, and is orders of magnitude faster than typical SQL-based solutions. I admit that I have a pretty good desktop, but the point is that in many cases you don’t need to go overboard to handle big data.
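As a rough back-of-envelope, here is what that throughput implies for a single sequential pass over some typical “big” table sizes (assuming the 6 million records per second figure above, which is of course hardware- and workload-dependent):

```python
# Back-of-envelope sketch: elapsed time for one sequential pass over a table,
# assuming the ~6 million records/second throughput quoted above.

RECORDS_PER_SECOND = 6_000_000  # assumed throughput from the text

def scan_time_seconds(record_count: int, rate: int = RECORDS_PER_SECOND) -> float:
    """Seconds for a single full pass over record_count records."""
    return record_count / rate

for records in (10_000_000, 100_000_000, 1_000_000_000):
    print(f"{records:>13,} records: ~{scan_time_seconds(records):,.0f} seconds")
# 10 million  records: ~2 seconds
# 100 million records: ~17 seconds
# 1 billion   records: ~167 seconds (under 3 minutes)
```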
I’m not saying you should dump Hadoop and the like for every big data project, but before you up the ante on your processing budget, consider the simpler alternatives that are already available.
Check out some of my other weekly posts at https://www.dhirubhai.net/today/author/0_1c4mnoBSwKJ9wfyxYP_FLh?trk=prof-sm