One of the data lake environments we built as a POC became, within a couple of years, a production data lake and hit its capacity. The Elastic cluster health was constantly red, and the only solution was to increase capacity. But here were the challenges.
Challenges / dependencies:
- Why don't you increase the capacity? Well, there were no free VMs. Additional VMs had to be provisioned to add data nodes, and they had to match the capacity of the existing data nodes.
- No free licenses. All the Elastic licenses were already consumed.
- While adding a data node, what if the Elastic cluster goes down or data gets corrupted in the process? You can restore from backup, right? Well, there was no Elastic snapshot backup.
- Why don't you take a backup? Well, we needed a NAS for the backup to go to, and no NAS was provisioned.
- Why don't you test the backup in non-prod and then do the same in prod? Well, there was no non-prod environment.
- Why don't you take time, plan everything and do it properly? Well, there was no time, as the business was impacted and we needed to restore services ASAP.
- Wow, so many dependencies. Do you need more complications? Well, this data lake cluster was running on a single Kubernetes master on a legacy version.
- Oh really! Don't tell me there are more challenges. Well, there is one. The SME who managed the environment had quit. Nobody in the team knew how to take an Elastic snapshot backup.
- Man, you are doomed!
What we did:
- Got 4 additional high-configuration VMs provisioned.
- Got additional Elastic licenses during license renewal.
- Created a small non-prod environment.
- Got a NAS provisioned and mapped it across prod and non-prod.
- Tested the Elastic snapshot backup in non-prod and then in prod.
- Took the first successful backup of the Elastic data lake.
- Added the 4 data nodes to the Kubernetes (K8s) and Elastic clusters at periodic intervals.
- Resumed all new ingestions and ML jobs.
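For readers who, like our team at the time, have never taken an Elastic snapshot, here is a minimal sketch of the backup-then-expand sequence using the Elasticsearch snapshot and cluster health REST APIs. It is an illustration under assumptions, not a record of the exact commands we ran: the repository name, snapshot name and NAS mount path in the usage note are hypothetical, and the rule for deciding when to add the next node (wait until no shards are relocating or initializing) is one plausible way to pace the additions. A shared-filesystem repository also requires the NAS path to be listed under `path.repo` in `elasticsearch.yml` on every node.

```python
import json
import urllib.request
from typing import Any, Dict, Optional, Tuple

# (HTTP method, URL path, JSON body or None)
EsRequest = Tuple[str, str, Optional[Dict[str, Any]]]


def register_fs_repository(repo: str, location: str) -> EsRequest:
    """Request that registers a shared-filesystem ("fs") snapshot repository.

    `location` must sit under a `path.repo` entry configured on every node.
    """
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", body)


def create_snapshot(repo: str, snapshot: str, indices: str = "*") -> EsRequest:
    """Request that snapshots the given indices and waits for completion."""
    body = {"indices": indices, "include_global_state": True}
    return ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", body)


def ready_for_next_node(health: Dict[str, Any]) -> bool:
    """True once shard movement has settled, given a GET /_cluster/health response.

    Assumption: "settled" means no relocating and no initializing shards.
    """
    return (
        health.get("relocating_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
    )


def send(base_url: str, request: EsRequest) -> Dict[str, Any]:
    """Send one request with the standard library (no auth or TLS setup shown)."""
    method, path, body = request
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

With a reachable cluster, the order would be roughly: `send(url, register_fs_repository("nas_backup", "/mnt/nas/es-snapshots"))`, then `send(url, create_snapshot("nas_backup", "first-backup"))`, and only then add a data node, polling `send(url, ("GET", "/_cluster/health", None))` until `ready_for_next_node` returns true before adding the next one. Trying this in non-prod first, as we did, is the safer path.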
A hurdle along the way: the VMs provisioned for the new Elastic data nodes were downgraded by a periodic capacity optimization initiative, and restoring their configuration needed executive approval.
Outcome:
- Elastic data lake cluster status returned to green.
- The Elastic snapshot backup is scheduled to run daily to the NAS.
- Addressed the legacy Kubernetes version and single-master issue separately, which is another article to write.
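One way to schedule a daily snapshot is Elasticsearch's snapshot lifecycle management (SLM), assuming the cluster version includes it (SLM is available from Elasticsearch 7.4); an external cron job calling the snapshot API would be an alternative. The sketch below only builds the policy request. The policy id, repository name, run time and the 30-day retention are hypothetical values chosen for illustration, not the settings of our environment.

```python
from typing import Any, Dict, Tuple

# (HTTP method, URL path, JSON body)
EsRequest = Tuple[str, str, Dict[str, Any]]


def daily_slm_policy(
    policy_id: str,
    repo: str,
    hour: int = 1,
    minute: int = 30,
    expire_after: str = "30d",  # hypothetical retention, adjust to NAS capacity
) -> EsRequest:
    """Build a PUT /_slm/policy/<id> request for one snapshot per day (UTC)."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    body = {
        # Elasticsearch cron syntax: seconds minutes hours day-of-month month day-of-week
        "schedule": f"0 {minute} {hour} * * ?",
        # Date math in the name gives each snapshot a dated, unique name
        "name": "<daily-snap-{now/d}>",
        "repository": repo,
        "config": {"indices": ["*"], "include_global_state": True},
        "retention": {"expire_after": expire_after},
    }
    return ("PUT", f"/_slm/policy/{policy_id}", body)
```

For example, `daily_slm_policy("daily-nas", "nas_backup")` yields a policy that runs at 01:30 UTC every day against a repository named `nas_backup`; the body can be sent with any HTTP client. Retention matters on a NAS of fixed size, since snapshots that are never expired will eventually fill it.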