Planning Your Day (and the Next Decade) at the Data Lake
(Below are several excerpts from Chapter 2 of the newly published DATA LAKES FOR DUMMIES)
Suppose that you and about 15 other family members or friends all head to your favorite lake for a weeklong summer vacation.
You love going to the lake because you jump into your sailboat every day and spend hours out on the water. Others in your group, though, have their own favorite pastimes. Some prefer a boat with a little more “oomph” and spend their days in speedboats, zooming up and down the length of the lake. Others prefer leisurely canoeing. Some are into waterskiing, so they take turns latching onto one of those speedboats and zipping along the water. Others in your group are into fishing, and that’s how they spend most of their time at the lake. Still others aren’t all that interested in even going out on the water at all — they plop down on the beach to read, soak up some rays, and even grab a snooze every afternoon.
A data lake is very much like that weeklong trip to your favorite lake. Because a data lake is an enterprise-scale effort, spanning numerous organizations and departments, as well as many different business functions, you and your coworkers will likely seek a variety of benefits and outcomes from all that hard work.
The best data lakes are those that satisfy the needs of a broad range of constituencies — basically, something for everyone to make the results well worth the effort.
Seizing the Day With Big Data: Carpe Diem
Maybe your organization has been dabbling in the world of big data for a while, going back to when Hadoop was one of the hottest new technologies. You’ve built some pretty nifty predictive analytics models, and now you’re fairly adept at discovering important patterns buried in mountains of data.
So far, though, your AAA — adventures in advanced analytics — have been highly fragmented. In fact, your analytical data is all over the place. You don’t have consistent approaches to cleansing and refining raw data to get the data ready for analytics; different groups do their own thing. It’s like the Wild West out there!
The concept of a data lake helps you harness the power of big data technology to the benefit of your entire organization. By following emerging best practices, avoiding traps and pitfalls, and building a solidly architected data lake, you can seize the day and help take your organization to new heights when it comes to analytics and data-driven insights.
You’ll achieve economies of scale for the data side of analytics throughout your organization, which means that you’ll get “more bang for your buck” when it comes to acquiring, consolidating, preparing, and storing your analytical data on behalf of your enterprise as a whole rather than repetitively doing so for numerous smaller groups.
.......
Constructing a Bionic Data Environment
Maybe you’ve heard of a B-52. No, not a member of the American new wave music group (so don’t start singing “Love Shack”), but rather the U.S. Air Force plane.
The B-52 first flew in 1952. The normal life span for an Air Force plane is around 28 years before it’s shuffled off to retirement, which means that B-52s should’ve gone out of service around 1980. Instead, the B-52 will eventually be retired sometime in the 2050s. That’s a hundred years — an entire century!
However, a B-52 today bears only a slight resemblance to one made in the ’50s or ’60s. Sure, if you were to put one of the original B-52s side by side with one of today’s planes, the two aircraft would look nearly identical. But the engines, the avionics, the flight controls . . . pretty much every major subsystem has been significantly upgraded and replaced in each operational B-52 at least a couple of times over the years.
Better yet, a B-52 isn’t just some old plane that you may see flying at an airshow but that otherwise doesn’t have much purpose due to the passage of time. Not only is the B-52 still a viable, operational plane, but its mission has continually expanded over the years thanks to new technologies and capabilities.
In fact, you can think of a B-52 as sort of a bionic airplane. Its components and subsystems have been — and will continue to be — swapped out and substantially upgraded on a regular basis, giving the plane a planned life span of almost four times the normal longevity of the typical Air Force plane. Talk about an awe-inspiring feat of engineering!
However, all those enhancements and modifications to the B-52 happened gradually over time, not all at once. Plus, the changes were all carefully planned and implemented, with longevity and continued viability top of mind.
Your data lake should follow the same model: a “bionic” enterprise-scale analytical data environment that regularly incorporates new and improved technologies to replace older ones, enhancing its overall function along the way. You almost certainly won’t get an entire century’s usage out of a data lake that you build today, but if you do a good job with your planning and implementation, 10 or even 20 years of value from your data lake is certainly achievable.
More important, your data lake won’t be just another aging system hanging around long past when it should’ve been retired. You almost certainly have plenty of those antiquated systems stashed in your company’s overall IT portfolio. That’s why the B-52 is the perfect analogy for the data lake, with a “bionic” approach to regularly replacing major subsystems helping to keep your data lake viable for years to come.
..........
Speedboats, Canoes, and Lake Cruises: Traversing the Variable-Speed Data Lake
You can stream all kinds of data into your data lake, as quickly as that data is created in your source applications. Suppose that you dedicate a portion of your data lake to analyzing your overall computer network traffic and server performance, to help you detect possible security threats, network bottlenecks, and database performance slowdowns.
You’ll be streaming tons of log data from your routers, gateways, firewalls, servers, databases — pretty much any piece of hardware in your enterprise — into your data lake, as quickly as possible, while traffic flows across your network and transactions hit your databases. Then, just as quickly, you and your coworkers can analyze the rapidly incoming data and take necessary actions to keep everything running smoothly.
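To make the streaming idea a little more concrete, here’s a minimal sketch in Python. Everything in it — the field names, the slow-query threshold, and the in-memory lists standing in for the lake’s raw storage and an alerting feed — is an illustrative assumption, not anything prescribed in this chapter:

```python
import json
from datetime import datetime, timezone

# Illustrative threshold -- a real deployment would tune this per database.
SLOW_QUERY_MS = 500

raw_zone = []   # stands in for the data lake's raw storage zone
alerts = []     # stands in for a monitoring/alerting feed

def ingest(log_line: str) -> None:
    """Parse one log event, land it in the raw zone, and flag slow queries."""
    event = json.loads(log_line)
    event["ingested_at"] = datetime.now(timezone.utc).isoformat()
    raw_zone.append(event)                        # land the raw data immediately
    if event.get("duration_ms", 0) > SLOW_QUERY_MS:
        alerts.append(event)                      # surface for immediate analysis

# Events arrive as quickly as the sources emit them:
ingest('{"source": "db01", "query": "SELECT ...", "duration_ms": 1200}')
ingest('{"source": "db02", "query": "SELECT ...", "duration_ms": 40}')
```

The key point the sketch illustrates is that the raw event lands in the lake first, untouched, and the immediate analysis (here, the slow-query check) happens right alongside ingestion rather than waiting for a later batch job.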
At the same time, not everything needs to zoom into your data lake at lightning-fast speed. Think about a lake that not only has speedboats zipping all over but also has much larger ferry-type vessels that take hundreds of passengers at a time all around the lake. Some of those ferries also offer evening gourmet dinner cruises in addition to their daytime excursions.
You’re not going to have much success trying to water-ski behind a lake ferry, nor will you have much success trying to eat a six-course gourmet meal served on the finest china while you’re bouncing all over the place on a speedboat. You need to find the proper water vessel for what you’re trying to do out on the lake, right?
You should think of your data lake as a variable-speed transportation engine for your enterprise data. If you need certain data blasted into your data lake as quickly as possible because you need to do immediate analysis, no problem! On the other hand, other data can be batched up and periodically brought into the data lake in bulk, on sort of a time-delayed basis, because you don’t need to do real-time analysis. You can mix and match the data feeds in whatever combination makes sense for your organization’s data lake.
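One way to picture that mix-and-match, variable-speed idea is a small routing sketch in Python. The feed names and the two-mode split below are hypothetical examples for illustration; the chapter doesn’t prescribe any particular tooling:

```python
from dataclasses import dataclass

@dataclass
class Feed:
    name: str
    mode: str  # "stream" for near-real-time, "batch" for periodic bulk loads

# Hypothetical feeds -- each organization mixes and matches as it sees fit.
feeds = [
    Feed("network_logs", "stream"),     # the speedboat: analyze immediately
    Feed("monthly_invoices", "batch"),  # the ferry: bulk load on a schedule
]

def plan_ingestion(feeds):
    """Group feeds by ingestion speed -- vessels matched to their purpose."""
    plan = {"stream": [], "batch": []}
    for feed in feeds:
        plan[feed.mode].append(feed.name)
    return plan

print(plan_ingestion(feeds))
```

The design point is simply that speed is a property of each feed, decided by how quickly you need to analyze that data, rather than a single setting for the whole data lake.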
__________________
Alan Simon is the Managing Principal of Thinking Helmet, Inc., a boutique management and technology strategy consultancy specializing in analytical business process management, business intelligence/analytics, and enterprise-scale data management.
Alan is the author or co-author of 32 business and technology books, dating back to 1985, including the just-published Data Lakes For Dummies (Wiley).