Doing Big Data Backup and Migration the Right Way

“Reality is far more vicious than Russian roulette. First, it delivers the fatal bullet rather infrequently, like a revolver that would have hundreds, even thousands of chambers instead of six. After a few dozen tries, one forgets about the existence of a bullet, under a numbing false sense of security. Second, unlike a well-defined precise game like Russian roulette, where the risks are visible to anyone capable of multiplying and dividing by six, one does not observe the barrel of reality. One is capable of unwittingly playing Russian roulette - and calling it by some alternative “low risk” game.” ― Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable

For enterprise companies currently running Hadoop distributions, the recent news about some Hadoop vendors’ financial troubles has been unsettling. I’ll let you do your own research on the specifics.

This has led many enterprise Hadoop customers to re-evaluate their Hadoop disaster recovery and migration plans, as they are no longer playing a low-risk game. Customers have been asking themselves the following questions:

  • What happens if my current Hadoop vendor has to pivot out of the Hadoop space due to a lack of revenue?
  • What happens if my current Hadoop vendor gets acquired by another company that changes its fundamental business?
  • Do we have a Hadoop backup and migration strategy, or are we locked into our current vendor?
  • We’re a world-class company; is our Hadoop migration and disaster recovery strategy the best in the world?


All of these are great questions, and I will help answer them in this post.

Why Should Hadoop Customers Be Concerned?

With the financial distress at some Hadoop vendors, customers have reason for concern: financial distress leads to uncertainty. The last thing a customer wants is to build a big data infrastructure on top of an uncertain Hadoop vendor.

Financial distress makes companies more prone to pivoting out of a domain, or to being acquired by other companies.

All we have to do is look back in history for examples.

In terms of acquisitions, tech titans are eager to purchase smaller companies at a fraction of their former valuation, only to quickly shut them down, in what’s known as acqui-hires.

Yahoo and Apple are well known for closing down companies soon after purchasing them. Apple shut down Beats Music, a streaming competitor it purchased for $3 billion.

I have experienced the same insecurity myself. As a startup entrepreneur, I built an app on Parse, a mobile-backend-as-a-service (mBaaS) provider. Parse was purchased by Facebook and later shut down for failing to generate revenue, leaving users in a desperate search for an alternative.

As you can see, financial distress is a serious risk that companies must take into account when evaluating and building their IT infrastructure.

With the financial distress some Hadoop vendors are experiencing, customers must have a backup plan. Now is the time to start evaluating your migration, disaster recovery, and backup plans, in case your current vendor bows out of enterprise Hadoop or is involved in a merger or acquisition.

As New York Times best-selling author Nassim Taleb would say, companies have to prepare for black swan events: improbable occurrences with outsized impact.

“Those who believe in the unconditional benefits of past experience should consider this pearl of wisdom allegedly voiced by a famous ship’s captain: ‘But in all my experience, I have never been in any accident… of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort.’ — E. J. Smith, 1907, Captain, RMS Titanic. Captain Smith’s ship sank in 1912 in what became the most talked-about shipwreck in history.” ― Nassim Nicholas Taleb, The Black Swan: The Impact of the Highly Improbable

Hadoop Migration and Backup Solutions

Maybe, after evaluating your Hadoop vendor’s long-term viability, your company wants to copy its Hadoop clusters to another vendor’s clusters.

Or maybe you still believe in your current vendor as your Hadoop vendor of choice, but would like a backup for disaster recovery scenarios.

Perhaps you simply want to copy your on-premise Hadoop clusters to the cloud, or vice versa. How would you go about doing that?

Here are the current options available.

DistCP (Distributed Copy)

DistCP, also known as distributed copy, is an open source tool used for large inter- and intra-cluster copying, and is bundled with most Hadoop distributions. It is designed to copy files on a batch schedule.

The challenge with DistCP is that it operates in batch mode rather than providing continuous, near-real-time replication. Because it is built on MapReduce, it consumes significant cluster resources when copying data, so DistCP jobs are typically scheduled during a business’s off-peak hours. An outage between runs can mean losing an entire day’s data.
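
To make the batch-mode limitation concrete, here is a minimal sketch of how a nightly DistCP backup job is often wrapped and scheduled. The namenode URIs, paths, and retry behavior below are hypothetical stand-ins, not any vendor’s recommended setup.

```python
# Minimal sketch of a nightly DistCP backup job (illustrative only).
# The namenode URIs and paths below are hypothetical.
import subprocess

SOURCE = "hdfs://prod-nn:8020/data/events"
TARGET = "hdfs://backup-nn:8020/data/events"

def run_nightly_backup() -> None:
    # DistCP runs as a MapReduce job and competes with production
    # workloads for cluster resources, which is why teams typically
    # schedule it off-peak (e.g., via cron in the early morning).
    result = subprocess.run(
        ["hadoop", "distcp", "-update", SOURCE, TARGET],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # If a run fails and nothing retries it before the next
        # scheduled job, the backup can lag production by a full day.
        raise RuntimeError(f"DistCP failed: {result.stderr}")

if __name__ == "__main__":
    run_nightly_backup()
```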

It also does not guarantee data consistency, and recovery is manually intensive, which introduces the risk of human error.

When it comes to data security, Hadoop vendor solutions built on DistCP require every cluster and data node to have access to each other, which means opening ports and firewall rules for every single one. The last thing an enterprise wants is a security breach through open ports, and this requirement alone won’t pass a network security audit at a number of organizations. Migrating and backing up Hadoop data is hard enough; having to worry about enterprise-level security on top of it is another job in itself, one that companies should not have to take on.

Dual Ingest

Another option is dual ingest, which funnels data through a load balancer that routes it to two different clusters. This sounds great in theory, but hiccups in the load balancer or in either Hadoop cluster can cause the data to diverge. If the network connection to either cluster goes down, you may run out of buffer space, which demands active attention and man-hours from a system administrator.

In addition, any data created directly on one of the clusters is not replicated to the other, as the sketch below illustrates. This is a huge concern for an enterprise-scale company that requires a single version of the truth: ambiguity is a serious business risk when fundamental business decisions are made from the data.
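
Here is a hypothetical sketch of the dual-ingest write path, assuming per-cluster buffers in front of two clusters; the names and buffer size are illustrative, not any particular vendor’s API.

```python
# Hypothetical dual-ingest sketch (illustrative only): each incoming
# record is fanned out to per-cluster buffers that downstream writers
# drain into cluster A and cluster B.
import queue

BUFFER_CAPACITY = 100_000  # assumed buffer size per cluster

buffers = {
    "cluster-a": queue.Queue(maxsize=BUFFER_CAPACITY),
    "cluster-b": queue.Queue(maxsize=BUFFER_CAPACITY),
}

def ingest(record: bytes) -> None:
    # The load balancer fans each record out to both clusters.
    for name, buf in buffers.items():
        try:
            buf.put_nowait(record)
        except queue.Full:
            # If one cluster's link stays down long enough, its buffer
            # overflows and records are lost on that side only. Nothing
            # in this path reconciles the two clusters afterwards, nor
            # replicates data written directly to just one of them.
            print(f"buffer full for {name}; record dropped -> clusters diverge")
```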

Enterprise Ready Solution for Hadoop Migrations and Backups

A clear alternative to small-scale open source tools like DistCP and to dual ingest strategies is a solution called IBM Big Replicate.

IBM Big Replicate is a disaster recovery solution for Hadoop. It’s built on patented real-time replication with guaranteed data consistency and always-on availability.

Big Replicate offers LAN-speed performance across any combination of Hadoop clusters, regardless of storage, distance, vendor distribution, vendor version, or infrastructure type (on-premise or cloud).

So how does IBM Big Replicate work?

Put simply, it’s one or more proxy servers that talk to your Hadoop clusters. As a proxy, it frees you from being locked into a particular Hadoop vendor or distribution version. It’s vendor agnostic, as long as the cluster supports the Hadoop Compatible File System (HCFS) interface.

Your API client apps connect to IBM Big Replicate instead of connecting to Hadoop directly. Consequently, it unifies Hadoop clusters running different vendor distributions and versions, whether on-premise or in the cloud. And it’s not limited to Hadoop: it also replicates data across cloud object storage, local file systems, and NFS-mounted file systems.
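
To give a feel for the proxy model, here is a minimal sketch of a client application that targets a replication proxy endpoint rather than a specific vendor’s namenode. The host name and path are hypothetical; the actual endpoint configuration comes from the Big Replicate documentation.

```python
# Illustrative only: the client addresses the replication proxy, which
# presents a Hadoop Compatible File System (HCFS) interface, rather
# than a vendor-specific namenode. Host, port, and path are hypothetical.
from pyarrow import fs

# Before: fs.HadoopFileSystem(host="vendor-nn", port=8020)
# After:  point at the proxy endpoint instead.
hdfs = fs.HadoopFileSystem(host="replicate-proxy.example.com", port=8020)

with hdfs.open_input_stream("/data/events/part-00000") as stream:
    print(stream.read(100))  # application code is otherwise unchanged
```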

Why Big Replicate and What are the Benefits?

So how does Big Replicate overcome the challenges we discussed with DistCP and Dual Ingest?

By enabling distributed clusters to appear, act, and operate as one. Its patented active-transactional replication allows the same files to be written and read from any location while maintaining data consistency.

Big Replicate maintains availability during hardware and network outages. Clusters can fail, and service will still be available. Network outages do not cause a data center outage or data conflicts.

Recovery is automatic, and downtime and data loss (RTO and RPO) are virtually zero, thanks to continuous active replication as opposed to intermittent batch replication. In other words, if your Hadoop clusters go down at noon, you recover the last actively written data, which is at most moments old, rather than stale data last written during non-business hours, like a 5AM DistCP job that only has a backup from its last scheduled run. Big Replicate is always on, always backing up and replicating.
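
As a back-of-envelope illustration of the difference in recovery point, here is the noon-failure example above worked through with assumed numbers; the one-second replication lag is an assumption for the sketch, not a vendor benchmark.

```python
# Back-of-envelope RPO comparison using the example above
# (illustrative numbers, not vendor benchmarks).
batch_last_run_hour = 5    # nightly DistCP job finished at 5AM
failure_hour = 12          # cluster lost at noon

# Batch replication: everything written since the last run is at risk.
batch_rpo_hours = failure_hour - batch_last_run_hour  # 7 hours

# Continuous replication: only the in-flight window is at risk
# (assumed here to be on the order of seconds).
continuous_rpo_seconds = 1

print(f"Batch RPO: ~{batch_rpo_hours} hours of data at risk")
print(f"Continuous RPO: ~{continuous_rpo_seconds} second(s) at risk")
```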

Big Replicate Use Cases

Hybrid Cloud, Backup and Migration

Companies are now looking to leverage the power of the cloud. Big Replicate is a perfect solution for migrating on-premise Hadoop clusters to the cloud without having to bring additional hardware or expertise into the company.

Another use case is offsite backup and recovery. If a company’s on-premise infrastructure were to suffer a disaster, having a backup in the cloud would be a great option. However, the cloud backup has to have data consistency; it isn’t much of an option for real-time use cases if the data in the cloud is hours or days old. A true hybrid cloud allows you to scale up, back up, and recover production-scale data without days of downtime.

So for current Hadoop customers that want to migrate or back up their Hadoop clusters to the cloud, or to a different vendor, IBM Big Replicate is a great solution to achieve that goal.

Data Lakes

The most common use case for IBM Big Replicate is the Data Lake. A Data Lake is a storage repository that holds vast amounts of data in its native format until needed. This data can be in both unstructured and semi-structured formats.

Data lakes are used by businesses for applications such as building a 360-degree view of a company’s supply chain, gathering data from numerous sources: historical purchase data, predictive analytics, internet-of-things device data, compliance monitoring, trend analysis, and much more.

Data lakes integrate data across various storage platforms, from a mixture of Hadoop clusters, to legacy data warehouses and relational databases. The overall goal is to remove information silos.

Data lakes require ingestion of data from multiple sources and locations while complying with data protection and privacy regulations. Since real-time analytics applications vital to the business may use this data, data lakes are often required to be available 24 hours a day, seven days a week.

Because of the design of IBM Big Replicate, it is well-suited for Data Lake use cases.

Real-Time (Streaming) Analytics

Real-time analytics or big data analytics applications, like credit card fraud detection, high-frequency securities trading, and real-time analysis of Internet-of-Things sensor data, require processing and analyzing Fast Data in motion.

This Fast Data needs to be captured and analyzed as it’s being generated so businesses can act in real time on decisions that affect the fundamentals of their business. The data can come from various sources.

The Data Lake and real-time analytics use cases complement each other well: the Data Lake provides historical context, while Fast Data tells us what’s happening now. This, too, has to be available 24 hours a day, seven days a week.

Why Go With IBM for Hadoop Backup and Migration?

I have provided an in-depth overview of why enterprise Hadoop customers should start exploring alternatives, due to the long-term uncertainty surrounding some vendors.

However, why would a customer migrate their Hadoop clusters to IBM?

First, IBM is a leader in the hybrid cloud. If you want to operate both on-premise and cloud Hadoop clusters, IBM is the best vendor available. According to industry analysts Forrester and Gartner, IBM Big Insights and Big Insights on the Cloud are leaders in a very competitive field.

Whether you want to migrate from your current vendor or backup your on-premise data to the cloud, IBM is the best vendor for you.

Second, I would ask you to research IBM's big data portfolio. IBM can provide end-to-end big data solutions for all roles in an enterprise: from the data scientist, to the data engineer, to the business analyst, we have solutions for every user role in a company.

Third, IBM is by no means the hottest stock in the world. However, IBM is a company that has been in existence for over 100 years. Year to date, IBM’s stock is up more than 18%. You can bet that IBM is going to be around for a while. The last thing a customer needs to worry about is the financial uncertainty of the vendor behind their IT infrastructure. In addition, 47 of the Fortune 50 companies have at least one IBM solution as part of their IT infrastructure. That level of achievement is attained by being a reliable partner for enterprise companies.

Action Plan for Backing Up and Migrating Hadoop

With all this talk, how do we actually create an action plan for backing up and migrating your current Hadoop clusters?

The course of action I suggest is to start by using IBM Big Replicate to make real-time backups of your current clusters to IBM Big Insights on the Cloud.

In other words, use IBM Big Replicate to keep additional copies of your Hadoop clusters in the cloud. That way, if an unfortunate disaster were to strike your current Hadoop clusters, you simply point your Big Replicate nodes to Big Insights on the Cloud, and your business continues with no downtime.

I will end this post with a few wise words said by a wise person…

"You don't train to be the best in the world... you train to be the best in the world on your worst day." 

The question now: is your Hadoop cluster set up to be the best in the world on its worst day?

If a black swan event were to occur, would your Hadoop cluster survive without downtime?

If the answer to either of these questions isn’t yes, then your company should get serious about such a plan.

IBM Big Replicate is a great first step. Reach out to me, and let's start a conversation around your big data backup and migration strategy.

Your friend in big data analytics,

Ian Balina
IBM North America
Big Data Analytics Evangelist (Retail and Travel Industries)
Email: [email protected]
Twitter: @ibalina88
