Awesome Insights Into How Ancestry.com Uses Big Data

Bernard Marr

Internationally Best-selling #Author, #KeynoteSpeaker, #Futurist, #Business, #Tech & #Strategy Advisor

Published: September 27, 2016

As more and more businesses recognize the value locked inside the data they collect, many are also running into a pressing problem. Traditionally, a company has employed a data team to store, organize and distribute data, but the sheer growth in the volume of data that large businesses deal with today means that model is quickly becoming outdated.

The most popular solutions involve implementing a company-wide data strategy and ensuring staff at all levels are engaged with data-driven operations (a lack of engagement is often quoted as a major obstacle to achieving a truly data-driven culture). One aspect of this is becoming known as data stewardship – giving all staff who work with data responsibility for its management.

A good example of a business built on a lot of data is Ancestry.com. The genealogy website has become hugely popular thanks to the data it has built up on family connections dating back almost 1,000 years.

Ancestry.com undertook a thorough restructuring of its data operations while deploying the open source Kafka platform. The primary aim was to move from a once-per-day batch processing data operation to real-time, on-the-fly processing. However, a by-product was an increased understanding of how data was used throughout the business.

Neha Narkhede, CTO of Confluent and one of the original developers of Kafka while it was an internal project at LinkedIn, tells me, “Traditionally there is one team or set of people who really care about data, and that is the data warehouse team.

“However, if we look at how companies work, there are thousands of developers who are really creating data by the second, writing applications that produce data which is critical to the business.

“And usually they are the people who just created it and threw it over the fence.”

When data isn’t properly looked after it becomes meaningless and valueless. Worse, if it is out of date, divorced from its context or incorrectly categorized, it can be damaging if decisions are based on it.

Confluent’s solution, using Kafka, is to code a “metadata repository” into the system, allowing whoever is working with the data to define and redefine its format, in real time.

“This is a pretty big practical game changer,” says Narkhede, “as it allows applications to automatically publish metadata, and it allows applications which are interested in consuming that data to understand it, and to evolve it.”
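The idea can be illustrated with a minimal sketch. This is not Confluent's API or Ancestry's implementation; the `SchemaRegistry` class, the `read_record` helper, the `record_views` topic and the field names are all hypothetical, and the compatibility rule shown (new fields need a default, existing fields keep their type) is just one simple policy assumed for illustration.

```python
class SchemaRegistry:
    """Toy in-memory metadata repository: topic name -> list of schema versions.

    Hypothetical illustration only. A schema is a dict mapping each field
    name to {"type": ..., and optionally "default": ...}.
    """

    def __init__(self):
        self._schemas = {}

    def register(self, topic, schema):
        """Publish a schema for a topic and return its 1-based version number."""
        versions = self._schemas.setdefault(topic, [])
        if versions:
            self._check_compatible(versions[-1], schema)
        versions.append(schema)
        return len(versions)

    def latest(self, topic):
        """Return (version, schema) for the newest schema of a topic."""
        versions = self._schemas[topic]
        return len(versions), versions[-1]

    @staticmethod
    def _check_compatible(old, new):
        # Assumed policy: a new field must carry a default so that records
        # written under the old schema can still be read, and an existing
        # field may not change its type.
        for name, spec in new.items():
            if name not in old and "default" not in spec:
                raise ValueError(f"new field {name!r} needs a default")
            if name in old and old[name]["type"] != spec["type"]:
                raise ValueError(f"field {name!r} changed type")


def read_record(registry, topic, record):
    """Interpret a raw record using the latest schema, filling in defaults."""
    _, schema = registry.latest(topic)
    return {name: record.get(name, spec.get("default"))
            for name, spec in schema.items()}


registry = SchemaRegistry()

# A producing application publishes the format of its data...
v1 = registry.register("record_views", {
    "user_id": {"type": "string"},
    "record_id": {"type": "string"},
})

# ...and later evolves it by adding a field with a default.
v2 = registry.register("record_views", {
    "user_id": {"type": "string"},
    "record_id": {"type": "string"},
    "source": {"type": "string", "default": "web"},
})

# A consuming application can still make sense of an older record.
old_record = {"user_id": "u1", "record_id": "r9"}
print(read_record(registry, "record_views", old_record))
```

The point of the sketch is that the format lives alongside the data rather than in one team's heads: the producer declares it, the repository rejects changes that would break readers, and consumers look it up instead of guessing.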

Missing and mismatched metadata can cause serious problems for a business such as Ancestry, with a database containing over 13 billion records spread across more than 10 petabytes of storage.

Chris Sanders, director of data warehousing, says “We ran into problems where data just didn’t exist or it was inaccurate. For data warehousing, business intelligence, reporting and legal obligations, or to pay royalties, that’s a nightmare.

“Developers can now come in near real time and see that their production data is not just getting dropped off into a message queue or something where they have no idea – they can actually become data stewards now.”

Ancestry’s approach is certainly one which I can see becoming more and more popular as businesses find themselves dealing with an ever-increasing amount of data that touches the workload of a greater number of employees.

As well as the move from batch to real-time data processing, Ancestry’s adoption of Kafka also reflects a broad trend in the industry towards increasing use of open source technology.

“The problem is very simple,” says Narkhede, “Data is critical to companies and any kind of system which locks people into a certain software that essentially holds the most essential aspect of a company – which is their data – is unacceptable.

“So open source is changing that, because even if they change vendors the customer knows their data is there because it flows through an open source system.”

Companies, for the foreseeable future, are going to be busy coming up with ways to make sure they are extracting value from their data. They know full well that if they are not, then they have competitors who certainly will be.

Open source greatly reduces the workload by removing the need for a great deal of investment in expensive, bespoke infrastructure. And data stewardship – when it is rolled out throughout a business – reduces the risks posed by bad, out-of-date or inaccurate information.

Both are tactics which I can see becoming popular for companies of all sizes as data and analytics become increasingly important in maintaining a competitive edge.

As always, I would love to hear your thoughts on this topic.

Thank you for reading my post. Here at LinkedIn and at Forbes I regularly write about management, technology and Big Data. If you would like to read my future posts then please click 'Follow' and feel free to also connect via Twitter, Facebook, Slideshare, and The Advanced Performance Institute.

You might also be interested in my brand new and free ebook, Big Data in Practice, which includes three amazing use cases from NASA, Domino's Pizza and the NFL.
