Big Data - What The Heck Are Data Lakes?

Bernard Marr

Internationally Best-selling #Author, #KeynoteSpeaker, #Futurist, #Business, #Tech & #Strategy Advisor

Published: January 14, 2015

You’ve probably heard of data warehousing, but now there’s a newer phrase doing the rounds, and it’s one you’re likely to hear more in the future if you’re involved in big data: ‘Data Lakes’.

So what are they? Well, the best way to describe them is to compare them to data warehouses, because the difference is very much the same as between storing something in a warehouse and storing something in a lake.

In a warehouse, everything is archived and ordered in a defined way – the products are inside containers, the containers on shelves, the shelves are in rows, and so on. This is the way that data is stored in a traditional data warehouse.

In a data lake, everything is just poured in, in an unstructured way. A molecule of water in the lake is equal to any other molecule and can be moved to any part of the lake where it will feel equally at home.

This means that data in a lake has a great deal of agility – another word which is becoming more frequently used these days – in that it can be configured or reconfigured as necessary, depending on the job you want to do with it.

A data lake contains data in its rawest form – fresh from capture, and unadulterated by processing or analysis.

It uses what is known as object-based storage, because each individual piece of data is treated as an object, made up of the information itself packaged together with its associated metadata, and a unique identifier.
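
To make the idea concrete, here is a minimal sketch of an object store in Python. It is an illustration only, not how any particular product works: the class names `StoredObject` and `FlatObjectStore`, the in-memory dictionary, and the sample records are all assumptions made for this example. It shows each piece of data packaged with its metadata and a unique identifier, inside one flat namespace with no folders or hierarchy.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """One item in the lake: raw payload, descriptive metadata, unique ID."""
    object_id: str
    data: bytes
    metadata: dict


class FlatObjectStore:
    """Hypothetical in-memory store with a single flat namespace (no folders)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        """Store raw bytes untouched and return the generated identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = StoredObject(object_id, data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        """Fetch an object directly by its identifier."""
        return self._objects[object_id]

    def find(self, **criteria) -> list:
        """Return objects whose metadata matches every given key/value pair."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
log_id = store.put(b"2015-01-14 12:00:01 GET /index.html", source="web", kind="log")
csv_id = store.put(b"order_id,amount\n1,9.99\n", source="sales", kind="csv")

# Objects are located by identifier or by metadata, never by a folder path.
web_objects = store.find(source="web")
```

Because nothing is found by its position in a hierarchy, the metadata attached at storage time is the only way to locate an object later without knowing its identifier.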

No piece of information is “higher-level” than any other, because it is not a hierarchically archived system, like a warehouse – it is basically a big free-for-all, as water molecules exist in a lake.

The term is thought to have first been used in 2011 by Pentaho CTO James Dixon, who didn’t invent the concept but gave a name to the type of innovative data architecture solutions being put to use by companies such as Google and Facebook.

It didn’t take long for the name to make it into marketing material. Pivotal refer to their product as a “business data lake” and Hortonworks include it in the name of their service, Hortonworks Datalakes.

It is a practice which is expected to become more popular in the future, as more organizations become aware of the increased agility afforded by storing data in data lakes rather than strict hierarchical databases.

For example, the way that data is stored in a database (its “schema”) is often defined in the early days of the design of a data strategy, yet the needs and priorities of the organization may well change as time goes on.

One way of thinking about it is that data stored without structure can be shaped into whatever form is needed more quickly than data whose previous structure must first be disassembled and then reassembled.
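
This approach is often called schema-on-read: the structure is applied when the data is read, not when it is stored. The following Python sketch illustrates it under stated assumptions. The sample events, the field names, and the helper `read_with_schema` are all hypothetical, invented for this example. It shows two different questions being asked of the same raw records, each with its own shape imposed at read time.

```python
import json

# Raw events exactly as captured; no schema was enforced when they were stored.
raw_events = [
    '{"user": "alice", "action": "purchase", "amount": 30.0, "country": "UK"}',
    '{"user": "bob", "action": "view", "page": "/pricing"}',
    '{"user": "alice", "action": "purchase", "amount": 12.5, "country": "UK"}',
]


def read_with_schema(lines, schema):
    """Parse raw lines and project each record onto `schema` at read time.

    `schema` maps a field name to the default used when a record lacks it.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name, default in schema.items()}


# One team asks about revenue per user...
revenue_schema = {"user": None, "amount": 0.0}
revenue_by_user = {}
for row in read_with_schema(raw_events, revenue_schema):
    revenue_by_user[row["user"]] = revenue_by_user.get(row["user"], 0.0) + row["amount"]

# ...and another asks about activity, over the same untouched data.
activity_schema = {"user": None, "action": "unknown"}
actions = [row["action"] for row in read_with_schema(raw_events, activity_schema)]
```

Neither question required the stored records to be migrated or restructured; only the schema passed to the reader changed.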

Another advantage is that the data is available to anyone in the organization, and can be analyzed and interrogated via different tools and interfaces as appropriate for each job.

It also means that all of an organization’s data is kept in one place – rather than having separate data stores for individual departments or applications, as is often the case.

This brings its own advantages and disadvantages. On the one hand, it makes auditing and compliance simpler, with only one store to manage. On the other, there are obvious security implications if you’re keeping “all your eggs in one basket”.

Data lakes are usually built within the Hadoop framework, because the datasets they contain are “big” and need the volume of storage offered by distributed systems.

A lot of it is theoretical at the moment because there are very few organizations which are ready to make the move to keeping all of their data in a lake. Many are bogged down in a “data swamp”: a hard-to-navigate mishmash of land and water where their data has been stored in various, uncoordinated ways over the years.

And it has its critics of course – some say that the name itself is a problem (and I am inclined to agree) as it implies a lack of architectural awareness, when a more careful consideration of data architecture is what’s really needed when designing new solutions.

But for better or worse, it is a term that you will probably be hearing more of in the near future if you’re involved in big data and business intelligence.

Are you ready to dive head first into the data lake or do you prefer to keep your data high and dry? Let me know using the comments section below.

As always, I hope this was useful. Please let me know if you have any views or comments on the topic, or would like to add something to this description.

--------------

I really appreciate that you are reading my post. Here, at LinkedIn, I regularly write about big data as well as management and technology issues and trends. If you would like to read my regular posts then please click 'Follow' and send me a LinkedIn invite. And, of course, feel free to also connect via Twitter, Facebook and The Advanced Performance Institute.

About: Bernard Marr is a globally recognized expert in strategy, performance management, analytics, KPIs and big data. He helps companies and executive teams manage, measure, analyze and improve performance.

His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance

Photo: Shutterstock.com

Pete Gadsby

Solution Architect (data) - contract

7 years ago

I think when using a Data lake you need to create a data dictionary / data glossary that describes the data you are pushing into the data lake. Otherwise over time this data becomes difficult to use. In my eyes a data lake should be what was traditionally the Raw layer of a data warehouse.

Balaji Canchi Srinivasalu

Associate Director at Virtusa Chennai

7 years ago

Said in a simpler Way! Well done

Asif Raza

7 years ago

Data Lake: "Dump the data and move on to the next dump; let us decide in 2 yrs what we do next"... Unfortunately, not many organisations think things through end to end before jumping in!

Miles Kehoe

Enterprise Search Consultant at New Idea Engineering, Inc.

9 years ago

Great article defining what a popular buzzword means. It seems to me though that a 'data lake' is a new name for a file share - say an "F:" drive for those of us old enough to remember Novell Netware drives. Except that F drives had a benefit that data lakes do not: at least you could have a directory on a drive, so you could put your financial reports in F:\financials - which at least gives a *little* metadata that your search engine could use to actually FIND content once it's stored. Unless you're willing to seriously improve your document metadata, shoving your documents into a data lake might be convenient; but what you've done is seriously decrease the likelihood of that document ever being seen again.
