Pachyderm - A new big data stack for the container era
Thanks to the Internet of Things, experts predict that by 2020, the “digital universe” will hold up to 44 zettabytes of data.[i] To give you some context, a single zettabyte contains one trillion gigabytes of data.[ii] That’s approximately 250 billion full-length DVDs!
The potential for extracting useful insights from this infinitude of data is enormous. But data alone don’t tell us anything; it’s the ability to analyze those data that unlocks the insights. And until recently, big data analysis was very hard to do.
Here’s an example of what I mean. Say you’re a journalist, and you come across a large data set of NASDAQ trading and transaction records. You realize you could analyze those records for signs of insider trading.
But there’s a problem: analyzing big, complicated data sets like these is impossible for a layman, and even a talented programmer wouldn’t be able to do much with them. You need a special program that can handle an extremely large data set: one that splits the file up into many smaller pieces, analyzes those pieces on thousands of computers, and then combines the results into coherent insights. The program most people use for this kind of task is called Hadoop, and because it’s deeply complex, you often need a few specialists to operate it for you.
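To make that split/analyze/combine pattern concrete, here is a minimal sketch in Python. It is only an illustration: the file name trades.csv and its CSV layout are assumptions, and the parallel step runs as local processes where Hadoop would spread the same work across thousands of machines.

```python
from collections import Counter
from multiprocessing import Pool

def analyze_chunk(lines):
    """'Analyze' step: tally ticker symbols in one chunk of records."""
    counts = Counter()
    for line in lines:
        ticker = line.split(",")[0]  # assumes rows like "AAPL,2015-06-01,..."
        counts[ticker] += 1
    return counts

def split_into_chunks(path, chunk_size=100_000):
    """'Split' step: break the big file into smaller pieces."""
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == "__main__":
    # Analyze each piece in parallel (local processes here; thousands
    # of machines in Hadoop), then combine the partial results.
    with Pool() as pool:
        partials = pool.map(analyze_chunk, split_into_chunks("trades.csv"))
    print(sum(partials, Counter()).most_common(10))
```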
The complexity problem—and the barrier to entry it represents—is why I’m really excited about one of our latest investments: a company called Pachyderm. Quite simply, Pachyderm has the power to democratize analysis of large data sets.
Pachyderm's open-source software is a massive simplification of that stack. Rather than requiring racks of hardware and on-premises servers, Pachyderm takes computing to the cloud. Rather than having to re-run an analysis every time you get more data, Pachyderm automatically processes new data and updates your findings. And rather than running separate analyses for related conclusions, Pachyderm “chains” them together so you can follow one chain of logic across a few large data sets.
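For a flavor of how this works, here is a rough sketch of a Pachyderm pipeline specification. The repo name, container image, and script are hypothetical, and the exact fields have varied across Pachyderm versions, so treat it as an outline of the general shape rather than a recipe.

```json
{
  "pipeline": { "name": "insider_trading_scan" },
  "input": {
    "pfs": { "repo": "trades", "glob": "/*" }
  },
  "transform": {
    "image": "python:3",
    "cmd": ["python3", "/analyze.py"]
  }
}
```

The idea is that the analysis code runs inside a container, reading its slice of a versioned data repository from a mount under /pfs and writing results to /pfs/out. Committing new data to the trades repo triggers the pipeline to update its output, and chaining analyses amounts to using one pipeline’s output repo as another pipeline’s input.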
Perhaps most important, with Pachyderm, that journalist can hire pretty much any programmer in the world, load the data into Pachyderm’s system, and get the analysis they’re looking for.
With Pachyderm, we see a future where businesses both small and large can benefit from the insights hiding in both their proprietary data and the world at large. We see a future in which academics can tell us even more about our behavior and our planet. We see a future in which governments can better serve their citizens, and citizens can better hold their governments accountable. This is a tool that can clear a path to greater insight.
But don’t take my word for it. Go to Pachyderm’s website, and check out their software for yourself.
Reader comment (8 years ago): You should have clearly stated your affiliation with Pachyderm as your portfolio company in a disclaimer; not everyone is curious enough to visit your LinkedIn profile to find it out. Otherwise, an interesting OSS big data project. Having said that, the same (or a similar) result could be achieved by other means, such as manual Dockerization of data sets (https://www.datadan.io/containerized-data-science-and-engineering, which incidentally mentions Pachyderm) or the OpenStack Swift storage engine, which provides both containerization and versioning.