Big Data - What The Heck Are Data Lakes?
Bernard Marr
Internationally Best-selling #Author | #KeynoteSpeaker | #Futurist | #Business, #Tech & #Strategy Advisor
You’ve probably heard of data warehousing, but now there’s a newer phrase doing the rounds, and it’s one you’re likely to hear more in the future if you’re involved in big data: ‘Data Lakes’.
So what are they? Well, the best way to describe them is to compare them to data warehouses, because the difference is much the same as that between storing something in a warehouse and storing something in a lake.
In a warehouse, everything is archived and ordered in a defined way – the products are inside containers, the containers on shelves, the shelves are in rows, and so on. This is the way that data is stored in a traditional data warehouse.
In a data lake, everything is just poured in, in an unstructured way. A molecule of water in the lake is equal to any other molecule and can be moved to any part of the lake where it will feel equally at home.
This means that data in a lake has a great deal of agility – another word which is becoming more frequently used these days – in that it can be configured or reconfigured as necessary, depending on the job you want to do with it.
A data lake contains data in its rawest form – fresh from capture, and unadulterated by processing or analysis.
It uses what is known as object-based storage, because each individual piece of data is treated as an object, made up of the information itself packaged together with its associated metadata, and a unique identifier.
No piece of information is “higher-level” than any other, because it is not a hierarchically archived system, like a warehouse – it is basically a big free-for-all, as water molecules exist in a lake.
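To make the idea of an "object" a little more concrete, here is a minimal Python sketch of a flat, in-memory store. It is only an illustration of the concept described above, not how any real product works: the names `StoredObject`, `FlatObjectStore`, `put`, `get` and `find` are hypothetical, as are the sample payloads and metadata keys.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One item in the lake: raw payload, its metadata, and a unique ID."""
    data: bytes
    metadata: dict
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class FlatObjectStore:
    """A toy in-memory store with a single flat namespace (no folders)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        # Package the raw data with its metadata and a generated identifier.
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        return self._objects[object_id]

    def find(self, **criteria):
        """Return objects whose metadata matches every given key/value."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


# Hypothetical sample data: two unrelated items sit side by side.
store = FlatObjectStore()
log_id = store.put(b"GET /index.html 200", source="web", kind="log")
store.put(b'{"sku": 42, "qty": 3}', source="shop", kind="order")

print(store.get(log_id).data)
print([o.metadata for o in store.find(source="shop")])
```

Notice that nothing is filed under anything else: an item is reached either by its identifier or by searching the metadata, which is why good metadata matters so much in a lake.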
The term is thought to have first been used in 2011 by Pentaho CTO James Dixon, who didn’t invent the concept but gave a name to the type of innovative data architecture solutions being put to use by companies such as Google and Facebook.
It didn’t take long for the name to make it into marketing material. Pivotal refer to their product as a “business data lake” and Hortonworks include it in the name of their service, Hortonworks Datalakes.
It is a practice which is expected to become more popular in the future, as more organizations become aware of the increased agility afforded by storing data in data lakes rather than strict hierarchical databases.
For example, the way that data is stored in a database (its “schema”) is often defined in the early days of the design of a data strategy, yet the needs and priorities of the organization may well change as time goes on.
One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled.
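This approach is often called "schema-on-read": the raw data is kept as captured, and a structure is applied only when a question is asked. The short Python sketch below illustrates the idea under made-up assumptions. The sample events, the field names and the helper `read_with_schema` are all hypothetical.

```python
import json

# Raw events kept exactly as captured; note the fields are not uniform.
raw_events = [
    '{"user": "ann", "amount": "19.99", "country": "UK"}',
    '{"user": "bob", "amount": "5.00"}',
    '{"user": "ann", "amount": "3.50", "country": "UK", "coupon": "SPRING"}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading: pick fields, convert types, fill gaps."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        }


# Today's question: how much has each user spent?
spend_schema = {"user": str, "amount": float}
totals = {}
for row in read_with_schema(raw_events, spend_schema):
    totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]

# A later question: which countries do events come from?
geo_schema = {"user": str, "country": str}
countries = [row["country"] for row in read_with_schema(raw_events, geo_schema)]

print(totals)
print(countries)
```

The same raw events answer both questions without being reloaded or restructured; only the schema passed in at read time changes.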
Another advantage is that the data is available to anyone in the organization, and can be analyzed and interrogated via different tools and interfaces as appropriate for each job.
It also means that all of an organization’s data is kept in one place – rather than having separate data stores for individual departments or applications, as is often the case.
This brings its own advantages and disadvantages: on the one hand, it makes auditing and compliance simpler, with only one store to manage. On the other, there are obvious security implications if you’re keeping “all your eggs in one basket”.
Data lakes are usually built within the Hadoop framework, because the datasets they hold are “big” and need the volume of storage offered by distributed systems.
A lot of it is theoretical at the moment because there are very few organizations which are ready to make the move to keeping all of their data in a lake. Many are bogged down in a “data swamp” – hard-to-navigate mishmashes of land and water where their data has been stored in various, uncoordinated ways over the years.
And it has its critics of course – some say that the name itself is a problem (and I am inclined to agree) as it implies a lack of architectural awareness, when a more careful consideration of data architecture is what’s really needed when designing new solutions.
But for better or worse, it is a term that you will probably be hearing more of in the near future if you’re involved in big data and business intelligence.
Are you ready to dive head first into the data lake or do you prefer to keep your data high and dry? Let me know using the comments section below.
As always, I hope this was useful. Please let me know if you have any views or comments on the topic or would like to add something to this description.
--------------
About: Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance.
His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance
Solution Architect (data) - contract
7 years ago: I think that when using a data lake you need to create a data dictionary or data glossary that describes the data you are pushing into it. Otherwise, over time, this data becomes difficult to use. In my eyes, a data lake should be what was traditionally the raw layer of a data warehouse.
Associate Director at Virtusa Chennai
7 years ago: Said in a simpler way! Well done.
Director - Head of Technology Operations | Cloud Transformations | Automation | SRE | MLOps | Data | Engineering | Strategy | AWS | Azure | DevOps | Data-Driven Innovation at London Stock Exchange Group (LSEG)
7 years ago: Data lake: "Dump the data and move to the next dump", and let us decide in 2 years what we do next? Unfortunately, not many organisations think it through end to end before jumping in!
Enterprise Search Consultant at New Idea Engineering, Inc.
9 years ago: Great article defining what a popular buzzword means. It seems to me, though, that a 'data lake' is a new name for a file share, say an "F:" drive for those of us old enough to remember Novell NetWare drives. Except that F drives had a benefit that data lakes do not: at least you could have a directory on a drive, so you could put your financial reports in F:\financials, which at least gives a *little* metadata that your search engine could use to actually FIND content once it's stored. Unless you're willing to seriously improve your document metadata, shoving your documents into a data lake might be convenient, but what you've done is seriously decrease the likelihood of that document ever being seen again.