Awesome Insights Into How Ancestry.com Uses Big Data

As more businesses recognize the value locked inside the data they collect, many are also running into a pressing problem. Traditionally, a company has employed a data team to store, organize and distribute data, but the sheer growth in the volume of data that large businesses deal with today means that model is quickly becoming outdated.

The most popular solutions involve implementing a company-wide data strategy and ensuring staff at all levels are engaged with data-driven operations (a lack of engagement is often cited as a major obstacle to achieving a truly data-driven culture). One aspect of this is becoming known as data stewardship: giving all staff who work with data responsibility for its management.

A good example of a business built on a lot of data is Ancestry.com. The genealogy website has become hugely popular thanks to the data it has built up on family connections dating back almost 1,000 years.

Ancestry.com undertook a thorough restructuring of its data operations while deploying the open source Kafka platform. The primary aim was to move from a once-per-day batch processing data operation to real-time, on-the-fly processing. However, a by-product was an increased understanding of how data was used throughout the business.
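To make the difference between the two models concrete, here is a minimal sketch in plain Python. It does not use Kafka or reflect Ancestry's actual code; the event shape and the function names `batch_counts` and `streaming_counts` are assumptions made up for illustration. The batch version waits for a full day of events before producing an answer, while the streaming version keeps a running answer that is current after every event.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (user_id, record_type). Invented for illustration.
Event = Tuple[str, str]


def batch_counts(days_events: List[Event]) -> Counter:
    """Batch style: wait until the whole day's events exist, then count once."""
    return Counter(record_type for _, record_type in days_events)


def streaming_counts(events: Iterable[Event]) -> Iterator[Counter]:
    """Streaming style: update the running counts as each event arrives."""
    running: Counter = Counter()
    for _, record_type in events:
        running[record_type] += 1
        # Yield a copy so each snapshot reflects the state at that moment.
        yield running.copy()


events = [("u1", "census"), ("u2", "birth"), ("u3", "census")]

batch_result = batch_counts(events)            # one answer, at the end of the day
snapshots = list(streaming_counts(events))     # one answer per incoming event
```

Both approaches end up with the same totals. The practical difference is that the streaming version has a usable answer after the first event rather than after the last one.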

Neha Narkhede, CTO of Confluent, and one of the original developers of Kafka while it was an internal project at LinkedIn, tells me, “Traditionally there is one team or set of people who really care about data, and that is the data warehouse team.

“However, if we look at how companies work, there are thousands of developers who are really creating data by the second, writing applications that produce data which is critical to the business.

“And usually they are the people who just created it and threw it over the fence.”

When data isn’t properly looked after it becomes meaningless and valueless. Worse, if it is out of date, divorced from its context or incorrectly categorized, it can be damaging if decisions are based on it.

Confluent’s solution, using Kafka, is to code a “metadata repository” into the system, allowing whoever is working with the data to define and redefine its format, in real time.

“This is a pretty big practical game changer,” says Narkhede, “as it allows applications to automatically publish metadata, and it allows applications which are interested in consuming that data to understand it, and to evolve it.”
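The idea of a metadata repository can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not Confluent's API: the class `MetadataRepository`, the `Field` type, the topic name and the compatibility rule (existing fields keep their type, new fields must carry a default) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Field:
    """One field in a data format. Hypothetical type, for illustration only."""
    name: str
    type: str
    has_default: bool = False


class MetadataRepository:
    """Toy in-memory store of format versions, keyed by topic name."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Dict[str, Field]]] = {}

    def publish(self, topic: str, fields: List[Field]) -> int:
        """Register a new format version and return its version number."""
        new = {f.name: f for f in fields}
        history = self._versions.setdefault(topic, [])
        if history:
            # Reject changes that would confuse existing consumers.
            self._check_compatible(history[-1], new)
        history.append(new)
        return len(history)

    def latest(self, topic: str) -> Dict[str, Field]:
        """Let a consuming application look up the current format."""
        return self._versions[topic][-1]

    @staticmethod
    def _check_compatible(old: Dict[str, Field], new: Dict[str, Field]) -> None:
        # Assumed rule: existing fields keep their type; added fields need a default.
        for name, f in new.items():
            if name in old:
                if old[name].type != f.type:
                    raise ValueError(f"field {name!r} changed type")
            elif not f.has_default:
                raise ValueError(f"new field {name!r} needs a default")


repo = MetadataRepository()
base = [Field("user_id", "string"), Field("record_id", "string")]
v1 = repo.publish("record-views", base)
v2 = repo.publish("record-views", base + [Field("source", "string", has_default=True)])

try:
    # Changing the type of an existing field is refused under the assumed rule.
    repo.publish("record-views", [Field("user_id", "int"), Field("record_id", "string")])
    rejected = False
except ValueError:
    rejected = True
```

The point of the design is that the producer of the data declares its format at the moment it publishes, and the repository refuses changes that would leave consumers unable to read it, so the format can evolve without a separate team policing it after the fact.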

Missing and mismatched metadata can cause serious problems for a business such as Ancestry, with a database containing over 13 billion records spread across more than 10 petabytes of storage.

Chris Sanders, director of data warehousing, says, “We ran into problems where data just didn’t exist or it was inaccurate. For data warehousing, business intelligence, reporting and legal obligations, or to pay royalties, that’s a nightmare.

“Developers can now come in near real time and see that their production data is not just getting dropped off into a message queue or something where they have no idea – they can actually become data stewards now.”

Ancestry’s approach is certainly one that I can see becoming more and more popular as businesses find themselves dealing with an ever-increasing amount of data that touches the workload of a greater number of employees.

As well as the move from batch to real-time data processing, Ancestry’s adoption of Kafka also reflects a broad trend in the industry towards increasing use of open source technology.

“The problem is very simple,” says Narkhede. “Data is critical to companies and any kind of system which locks people into a certain software that essentially holds the most essential aspect of a company – which is their data – is unacceptable.

“So open source is changing that, because even if they change vendors the customer knows their data is there because it flows through an open source system.”

Companies, for the foreseeable future, are going to be busy coming up with ways to make sure they are extracting value from their data. They know full well that if they are not, then they have competitors who certainly will be.

Open source greatly reduces the workload by removing the need for heavy investment in expensive, bespoke infrastructure. And data stewardship, when it is rolled out throughout a business, reduces the risks posed by bad, out-of-date or inaccurate information.

Both are tactics which I can see becoming popular for companies of all sizes as data and analytics become increasingly important in maintaining a competitive edge.

As always, I would love to hear your thoughts on this topic. 

Thank you for reading my post. Here at LinkedIn and at Forbes I regularly write about management, technology and Big Data. If you would like to read my future posts then please click 'Follow' and feel free to also connect via Twitter, Facebook, Slideshare and The Advanced Performance Institute.

You might also be interested in my new, free ebook, Big Data in Practice, which includes three use cases from NASA, Domino's Pizza and the NFL.
