Mind the Swamp
IDC predicts that the average person will have nearly 5,000 digital interactions per day by 2025, up from the 700 to 800 or so that people average today. It further estimates that worldwide data will grow 61% to 175 zettabytes, with roughly as much data residing in the cloud as in data centres. That sounds like a lot of data, but how do you visualise it? If you could store 175 ZB on Blu-ray discs, the stack of discs would reach the moon 23 times.
With that much data, it is easy to see why big-data analytics is expected to grow across all sectors. On the surface, the idea sounds fantastic and full of possibilities. Many enterprises jumped on the bandwagon, created Hadoop-based repositories and started filling them with all kinds of data. Yet research from PwC, conducted almost four years ago, found that while 75 percent of business leaders from companies of all sizes, locations and sectors felt they were "making the most of their information assets", in reality only 4 percent were set up for success. Overall, 43 percent of companies surveyed "obtain little tangible benefit from their information", while 23 percent "derive no benefit whatsoever".
Avi Perez, CTO of business intelligence (BI) software specialist Pyramid Analytics, says one of the biggest mistakes organisations make is collecting too much data simply because they can. Consider your smartphone. If you own one, chances are you have hundreds or even thousands of pictures stored on it, and 99 percent of them are probably junk you would delete in a heartbeat. Taking pictures has become so easy that it is essentially free, and you tell yourself, "One day I'll go and clean it up", but of course no one ever does. You collect an enormous amount of information with no way to work through it effectively, so when you inevitably want to show someone a particular photograph, finding it means scrolling through an enormous volume of junk. The same thing happens with data swamps: massive repositories of data that lack efficient accessibility, are exceptionally difficult to manage, and may never generate any real business value.
Building a data-driven business is the new trend in CXO talk, and it comes up in almost every interview you attend for a senior IT position. There is no doubt that data is the new oil, or gold, whichever you prefer to call it. But looking closely at the practical implementation of the concept, my understanding is that this trend is based more on perception than on fact. You go to a tech talk or a business meetup, or read a Fortune 500 whitepaper, and try to copy it; there is a cultural problem of always viewing big-data analytics through the lens of a technical challenge, and of failing to be an original thinker in one's own domain.
While reading through successful implementations of big-data analytics in the SME world, of which you will find only a few, I came across business leaders describing their own experience. But most material online, whether from business leaders or tech writers, quotes Amazon, Google, Mercedes, Barclays or other industry giants, which sounds great for a general audience. Speak to the people in the back office of an SME, though, and they roll their eyes, because their company has spent a lot building a so-called "Big Data" analytics capability that has velocity and volume but no real-world value component. Slowly, people are realising this particular problem of failing to generate adequate business value through big-data analytics, and businesses are increasingly hesitant to buy into the fad unless someone shows them real value. Yet the prevailing attitude remains that every company must have a big-data analytics team because its competitors have one, even when the net effect on performance and cost is negative.
For a while I have been trying to understand why this specific problem of the "Data Swamp" occurs in organisations. There is a lot of good content on the internet on this topic, but very little of what I came across could I practically visualise. Based on my experience and reading, I have tried to summarise some of the reasons, in the hope that they help you avoid a data swamp.
Business "Value" Must Drive Data "Strategy"
The whole point of big-data analytics is to generate some form of value, whether for the core business, industrial operations, social engineering, or natural resource management. The aim is to use data and the power of analytics to gain an edge. We need to start with a clear vision of the business problem we are trying to solve. With an objective in mind, it should be relatively easy to zero in on the data we need to collect and the best machine-learning technique for gleaning insight from that data. Hence, even before we start thinking about the technical approach, we need to sit with the end customer who has working domain knowledge, listen to what they want or need to achieve, then trace backwards and work with them to understand the context in which they wish to use the data. We need domain knowledge driving the data analytics project, not a data unicorn.
Domain contextualising of data must be 'the' focus if we wish to generate value. This is a whole subject in its own right. One example is understanding how historic data from heterogeneous sources can be used alongside current or real-time data to form some type of contextualised decision, as in the sketch below. Another is how data from different domains can be combined into a real-world model that makes decisions based on evolving experience. Whatever the case may be, contextualising data for value creation is where domain knowledge is key; it is not a technical problem to solve.
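As a minimal sketch of that first example, the snippet below joins a hypothetical historic baseline with hypothetical real-time readings to flag unusual values; the sensor names, columns and threshold are all made up for illustration.

```python
import pandas as pd

# Hypothetical historic data: readings from heterogeneous sources,
# already consolidated into one table.
historic = pd.DataFrame({
    "sensor":   ["pump_a", "pump_a", "pump_b", "pump_b"],
    "flow_lpm": [102.0, 98.0, 55.0, 57.0],
})

# Per-sensor baseline derived from history (mean and spread).
baseline = historic.groupby("sensor")["flow_lpm"].agg(["mean", "std"]).reset_index()

# Hypothetical real-time readings arriving now.
live = pd.DataFrame({
    "sensor":   ["pump_a", "pump_b"],
    "flow_lpm": [130.0, 56.0],
})

# Contextualise: compare each live reading against its historic baseline.
merged = live.merge(baseline, on="sensor")
# The 3-standard-deviation rule here is a placeholder: the right
# threshold is a domain-knowledge decision, not a technical default.
merged["anomalous"] = (merged["flow_lpm"] - merged["mean"]).abs() > 3 * merged["std"]
print(merged)
```

Note the division of labour: deriving and storing the baseline is engineering work, while deciding what counts as anomalous is domain knowledge.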
Business needs to take ownership and lead from the front on this particular subject of value creation in the big-data era. Once we have built a capability with a business initiative in mind, it is often possible to iterate on that capability to provide the business with even more targeted solutions.
Separate Data ‘Science’ from Data ‘Engineering’
I have used Data Science and Data Engineering quite interchangeably in the past. Now I see the difference, and also the potential problem in mixing them up. It is rare, almost impossible, to find someone with deep domain knowledge of the business plus a sharp understanding of the technical challenges that data volume, velocity and contextualised processing put in front of the tech team. Often, someone with a few years of experience in either data science or data engineering presents themselves as able to handle both: a data unicorn. That is where the biggest mistake is made.
Someone with data science experience will be great at analysing data using domain knowledge, applying statistical methods or mathematical algorithms to find value in a large dataset. But the same person does not necessarily understand the CAP theorem's limits on big-data processing and persistence, the parallel processing of high-velocity data, the challenge of data traffic over the network, or how to run a fault-tolerant system on current technologies. When a data engineer tries to act as a data scientist, the project is bound to fail for business-value reasons; when a data scientist tries to act as a data engineer, it is bound to fail for technical implementation reasons.
Having said that, I believe data science should guide the strategy for data engineering, not the other way around. Data engineering exists to facilitate data science and to let the business leverage the outcomes of data science to generate value.
Tomer Shiran, CEO and co-founder of analytics startup Dremio and a driving force behind the open source Apache Arrow project, predicts that enterprises will see the need for a new role: the data curator.
The data curator, Shiran says, sits between data consumers (analysts and data scientists who use tools like Tableau and Python to answer important questions with data) and data engineers (the people who move and transform data between systems using scripting languages, Spark, Hive, and MapReduce). To be successful, data curators must understand the meaning of the data as well as the technologies that are applied to the data. “The data curator is responsible for understanding the types of analysis that need to be performed by different groups across the organization, what datasets are well suited for this work, and the steps involved in taking the data from its raw state to the shape and form needed for the job a data consumer will perform,” Shiran says.
Data Indexing as a Top Priority
In my opinion this topic is highly undervalued, yet it is among the most important for setting up, maintaining and using a big-data analytics system. If a data engineering team has not planned its data indexing, be assured it is heading for a data swamp very soon.
Indexing techniques are vital for accessing data quickly; when the volume of data is this high, indexing plays a crucial role in reaching data in time. Without it, we end up either duplicating data for each use case, or packing storage so densely that no query can find the right records in adequate time.
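One concrete form of this is laying out big-data storage so that queries can skip irrelevant files entirely. Below is a minimal sketch using Apache Arrow's Parquet support; the event table, column names and partition keys are hypothetical, and hive-style partitioning is just one of several layout techniques.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical event data; in practice this would be far larger.
events = pa.table({
    "year":   [2023, 2023, 2024, 2024],
    "region": ["EU", "US", "EU", "US"],
    "amount": [10.0, 12.5, 9.0, 20.0],
})

# Partitioning by year and region turns those columns into directory
# structure, so filters on them prune whole files instead of scanning.
pq.write_to_dataset(events, root_path="events", partition_cols=["year", "region"])

# A query for 2024/EU only touches the matching partition.
dataset = ds.dataset("events", format="parquet", partitioning="hive")
result = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("region") == "EU"))
print(result.to_pandas())
```

The choice of partition columns is exactly the kind of decision that has to anticipate how the business will query the data, which is why it belongs in the plan from day one.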
At the time of ingesting data you have some idea of how it will be used, but you cannot be sure of every context in which it might be used; people may want to use it on its own or mixed with data from other domains. What we can do is make data easy to search and easy to reference, enabling data science to reach the data via different referential routes (a sketch follows below).
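As a minimal, self-contained sketch of "different referential routes", the snippet below builds an inverted index over hypothetical dataset metadata, so the same dataset can be found by tag, domain, or source rather than only by its storage path. All dataset names and tags are made up.

```python
from collections import defaultdict

# Hypothetical catalogue entries: dataset name -> descriptive tags.
catalogue = {
    "sensor_readings_2024": ["iot", "operations", "real-time"],
    "invoices_2024":        ["finance", "operations"],
    "crm_contacts":         ["sales", "customers"],
}

# Inverted index: tag -> datasets. Each tag is an independent route
# to the data, so a consumer does not need to know storage paths.
index = defaultdict(set)
for dataset, tags in catalogue.items():
    for tag in tags:
        index[tag].add(dataset)

# Find every dataset relevant to "operations", whatever its origin.
print(sorted(index["operations"]))
# ['invoices_2024', 'sensor_readings_2024']
```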
Another very important point is that data governance will fall on its face without adequate data indexing, and that can create a bigger overhead for the whole company than we think. Data archiving and cleaning is another important aspect of big data that will fail without proper indexing.
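To illustrate the archiving point: if the layout is partitioned by ingestion date (a variant of the earlier layout sketch), retiring old data reduces to dropping directories older than the retention window. The paths and retention period below are hypothetical.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

RETENTION_DAYS = 365  # hypothetical retention policy
cutoff = date.today() - timedelta(days=RETENTION_DAYS)

# Assumes hive-style partitions such as events/ingest_date=2023-01-31/.
for partition in Path("events").glob("ingest_date=*"):
    partition_date = date.fromisoformat(partition.name.split("=", 1)[1])
    if partition_date < cutoff:
        # Without an indexed, partitioned layout, the same policy would
        # require scanning every record to decide what to delete.
        shutil.rmtree(partition)
```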
Two important questions to think about ahead of time are:
- How could we change the data layout in our big-data storage to better suit analytical query processing?
- What happens if the query workload is not known at data upload time? Is there a way to create those indexes adaptively (see the sketch after this list)?
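On the second question, one well-known direction is adaptive, query-driven indexing: the system watches the predicates queries actually use and builds an index only once a column proves hot. The sketch below is a toy, in-memory illustration of that idea, not a real storage engine; the table and threshold are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical table held as a list of row dicts.
rows = [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 12.5},
    {"region": "EU", "amount": 9.0},
]

HOT_THRESHOLD = 2          # build an index once a column is filtered twice
filter_counts = Counter()  # how often each column appears in a predicate
indexes = {}               # column -> {value: [row positions]}

def query(column, value):
    """Filter rows, adaptively indexing columns the workload keeps hitting."""
    filter_counts[column] += 1
    if column not in indexes and filter_counts[column] >= HOT_THRESHOLD:
        idx = defaultdict(list)
        for pos, row in enumerate(rows):
            idx[row[column]].append(pos)
        indexes[column] = idx
    if column in indexes:  # fast path: index lookup
        return [rows[pos] for pos in indexes[column].get(value, [])]
    return [row for row in rows if row[column] == value]  # full scan

print(query("region", "EU"))  # first query: full scan
print(query("region", "EU"))  # second query: index built, then used
```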
The hallmark of a successful big-data system is an enterprise catalogue that brings information discovery, AI, and information stewardship together to deliver new insights to the business.
Conclusion
A data swamp is not just about technology. Organisations that want to succeed in the big-data analytics era need to work hard on the organisational structures, processes and practices that let them collect, utilise and dispose of high-volume, high-velocity data in a way that is efficient to process and easy to maintain with minimal human intervention.