Hadoop: Powering the next generation of analytics
Much like the many emerging enterprise technologies that surface every year, the hype over Apache Hadoop will eventually subside. Hadoop, the framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models, needs to recede into the background to deliver on its promise of becoming the enterprise data hub of choice, much as the relational database management system quietly powers most of the online world today. It cannot remain an exclusive, specialized skillset; it needs to rethink how it operates and become the preferred platform that powers the next generation of analytics. In other words, Hadoop must outgrow the hype and evolve.
That is not to say that Hadoop has not made tremendous progress over the years. From a monolithic storage and batch architecture, it has transformed into a modern, modular data platform. Hadoop now supports interactive data discovery through analytic SQL engines such as Impala, and Apache Spark provides a next-generation data processing layer for a combination of batch and streaming workloads. These developments have arrived without compromising ease of use or performance for developers.
Nevertheless, there is still more to be done. Hadoop needs to address the fundamental challenges that users still face; we identify the three most pressing issues below.
Better data engineering: Improving Spark for the enterprise
The role of data engineering needs to be addressed before we move on to data analytics. Some of you may be asking, “Data engineering, really?” Yes, and rightly so!
With the responsibility of designing and building the infrastructure jointly with the data science team, data engineers are quite literally providing the foundation for the next generation of analytics.
Exceptional data engineering needs tools that are easy to use, flexible, and able to perform. Take Spark, for example: a general compute engine that supports a wide range of applications, it has become widely popular among users. In addition to data processing, applications also need ways to ingest, store, and serve data, while enterprise teams need tools for operations, security, and data governance. In this sense, Spark naturally complements the comprehensive and complex Hadoop ecosystem with its ease of use, flexibility, and performance.
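To make the "general compute engine" point concrete, here is a minimal PySpark sketch in which the same DataFrame-style logic is applied to a batch workload and a streaming one. The paths, host, port, and column names are illustrative assumptions, not details from this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-and-streaming").getOrCreate()

# Batch: aggregate historical click data already sitting in HDFS.
clicks = spark.read.parquet("hdfs:///data/clicks")
daily = clicks.groupBy("date").agg(F.count("*").alias("events"))
daily.write.mode("overwrite").parquet("hdfs:///reports/daily_clicks")

# Streaming: a similar aggregation expressed over a live source
# (Structured Streaming; a socket source is used here for simplicity).
live = (
    spark.readStream.format("socket")
    .option("host", "localhost").option("port", 9999)
    .load()
)
live_counts = live.groupBy("value").count()
query = (
    live_counts.writeStream.outputMode("complete")
    .format("console").start()
)
query.awaitTermination()
```

The appeal for data engineers is that both pipelines share one API and one cluster, rather than requiring one system for batch work and a separate one for streaming.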
Comprehensive security: Enforcing unified access control policy
One of Hadoop’s defining characteristics is its ability to allow access to vast amounts of data in a variety of ways. By moving beyond MapReduce, the programming paradigm that achieves massive scalability through parallel processing of large data sets, complex application architectures that once required many separate systems for data preparation, staging, discovery, modeling, and operational deployment can now be consolidated into a single end-to-end workflow on a common platform. This empowers users with more diverse skills to extract value from data.
Of course, this flexibility needs to be balanced with security requirements. To keep sensitive data out of the wrong hands, a comprehensive security approach should ensure that every access path to data respects policy in the same way, right down to the most granular level.
However, the reality today is that each access engine handles security differently. Take Impala and Hive, for example: these components offer row- and column-based access controls with shared policy definitions through Apache Sentry. In contrast, Spark and MapReduce support only file- or table-level controls. This fragmentation forces a reliance on the lowest common denominator of coarse-grained permissions, which often results in undesirable outcomes: restricted data and access, security silos, or, worse, inconsistent policy due to human error in policy replication. Ultimately, this constrains the types of applications that can be built.
A common API for policy-compliant data access would allow third-party products to integrate cleanly with the Hadoop cluster, while also providing dynamic data masking everywhere.
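To illustrate what "every access path respects policy in the same way" implies, here is a purely hypothetical, self-contained Python sketch of a column-level policy with dynamic masking. It is not the API of Sentry or any specific product; the column names, roles, and masking rules are invented for illustration.

```python
from typing import Dict

# Column-level policy: which roles may see a column, and how to mask it otherwise.
POLICY: Dict[str, Dict] = {
    "name":   {"allowed_roles": {"analyst", "admin"}, "mask": lambda v: v},
    "ssn":    {"allowed_roles": {"admin"},            "mask": lambda v: "***-**-" + v[-4:]},
    "salary": {"allowed_roles": {"admin"},            "mask": lambda v: None},
}

def read_record(record: Dict[str, str], user_roles: set) -> Dict[str, object]:
    """Return the record with any column the user may not see dynamically masked."""
    out = {}
    for column, value in record.items():
        rule = POLICY.get(column)
        if rule is None or user_roles & rule["allowed_roles"]:
            out[column] = value            # permitted: pass the value through
        else:
            out[column] = rule["mask"](value)  # not permitted: apply the mask
    return out

# The same policy produces the same answer whatever tool issues the read,
# whether a SQL engine, a Spark job, or a third-party BI product:
print(read_record({"name": "Ada", "ssn": "123-45-6789", "salary": "90000"}, {"analyst"}))
# {'name': 'Ada', 'ssn': '***-**-6789', 'salary': None}
```

The point of a unified layer is that this check-and-mask step happens once, beneath every engine, rather than being re-implemented (and drifting) in each tool.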
With users empowered to securely gain value from data using their tools of choice, the focus should then shift to a more fundamental problem – how to store data for the next generation of analytics.
Fast analytics on fast data
The next generation of applications built on Hadoop are collapsing the distance between data collection, insight, and action; in other words, they are becoming more real-time. In the best case, analytical models are embedded right in the operational application, directly influencing business outcomes as users interact with them. Even in a simpler case, an operational dashboard requires the ability to ingest data and analyze it immediately.
It turns out that this is pretty difficult to achieve with Hadoop today, largely because of storage constraints around updates. At an early stage, users already face a dilemma: do I pick HDFS, whose high-throughput reads are great for analytics but which offers no way to update files, or Apache HBase, whose low-latency updates are great for operational applications but which performs poorly for analytics?
Often, the result is a complicated hybrid of the two, with HBase for updates and periodic syncs to HDFS for analytics. This is arduous for a few reasons:
- Data pipelines need to be maintained to move data and ensure synchronization between storage systems
- The same data is being stored multiple times, which increases the total cost of ownership
- There is latency between when data arrives and when it can be analyzed
- Data written to HDFS must be completely rewritten if it ever needs to be corrected (remember, no updates)
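For concreteness, here is a minimal sketch of the hybrid pattern described above, assuming the happybase and pyspark packages and purely hypothetical hosts, table names, and paths. It is not a recommended design; it simply shows where the duplication and latency come from.

```python
import happybase
from pyspark.sql import SparkSession

# Operational side: low-latency upserts land in HBase (the row key and the
# column family "d" are illustrative).
hbase = happybase.Connection("hbase-master.example.com")
events = hbase.table("events")
events.put(b"user42#2016-03-01", {b"d:status": b"active"})  # updates are cheap here

# Analytical side: a periodic batch job copies a snapshot into HDFS as Parquet,
# because HDFS files cannot be updated in place.
spark = SparkSession.builder.appName("hbase-to-hdfs-sync").getOrCreate()

# In practice this read would go through an HBase-Spark connector or a bulk
# export; a driver-side scan is used here only to keep the sketch short.
rows = [(key.decode(), data[b"d:status"].decode()) for key, data in events.scan()]
snapshot = spark.createDataFrame(rows, ["row_key", "status"])
snapshot.write.mode("overwrite").parquet("hdfs:///warehouse/events_snapshot")
```

Every run stores the same records a second time, and anything written to HBase after the sync is invisible to analytics until the next run, which is exactly the cost, latency, and pipeline-maintenance burden listed above.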
Looking ahead
Hadoop has come a long way in its first 10 years. As Matt Aslett of 451 Research recently summarized, “Hadoop has evolved from a batch processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem that includes in-memory processing and high-performance SQL analytics.”
Naturally, Hadoop’s storage options are also evolving, and this is just the beginning. With Spark as the new data processing architecture, a new unified security layer, and a new storage engine for simplified real-time analytic applications, Hadoop is ready for its next phase: Powering the next generation of analytics.
Hadoop 10th anniversary video: https://www.youtube.com/watch?v=QTVsLsKysUQ