Cleaning & Archiving - Accelerate applications and reduce TCO (Part I.)
Data is, was, and will be the foundation on which everything is built. Topics such as data literacy, the data lifecycle, and the ability to delete data are essential to ensuring application health and performance and keeping costs under control. Let's see how to do proper cleaning & archiving.
Note: This article focuses on complex archiving for typically custom applications (it is not about easy archiving for e.g. MS Exchange email communication, where spam emails can be automatically discarded and the rest of the emails can be archived without deep analysis).
1. How to identify the need for cleaning and archiving?
1.1. Are these horror sentences familiar to you?
1.2. Are you also solving similar issues?
These are real questions and issues from several companies :-(
If you encounter any of the areas described above, sooner or later you will have to start dealing with data cleaning and archiving. It affects the whole company: Business (processes, use cases, etc.), IT development (code quality, performance, knowledge transfer, etc.), IT operations (number of incidents, migration, etc.), IT infrastructure (cost, etc.), RISK (Fraud, BCM, SoD, etc.), Legal & Compliance (litigation, GDPR, etc.), etc.
Ask yourself the basic questions: Do I really need to keep all historical data? Will I be able to use all of them or only some of them? What for (litigations, legal, audit, data mining, policy, reporting, etc.)? Do I have enough human and HW resources to work with all historical data? Can I continue like this with rising costs for another 3-5 years?
In most cases, 3-5 years of history is sufficient. Older data are usually needed only for legal & compliance purposes, such as resolving litigation, where retention periods are 10, 20, 30, or even more years, depending on the legislation in each country (and only a small, selected subset of business information is required).
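As a rough illustration, such retention rules can be captured as a simple policy table per record type. The record types and periods below are hypothetical examples, not legal advice; real values must come from legal & compliance in each country.

```python
from datetime import date

# Hypothetical retention policy: record type -> retention period in years.
RETENTION_YEARS = {
    "transaction": 10,
    "contract": 30,
    "marketing_consent": 3,
}

def retention_expiry(record_type: str, closed: date) -> date:
    """Return the date after which the record may be discarded."""
    years = RETENTION_YEARS[record_type]
    return closed.replace(year=closed.year + years)

print(retention_expiry("contract", date(2020, 5, 1)))  # 2050-05-01
```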
2. The main motivation
I see these seven basic reasons why you should focus on cleaning and archiving.
2.1. Data volumes are growing in time
Growth is driven not only by business (an argument that proved irrelevant during e.g. COVID) but also by technology changes, e.g. voice processing, liveness checks, a focus on client behavior, real-time processing, etc.
Do you see your annual data growth factor?
2.2. Additional HW and storage are needed
You can see it in budget items, such as annual growth of 5-20% for "Add application data storage", "Add backup storage", "Add data nodes", "Increase network line", etc.
If you need to add 1 terabyte of data, you will typically consume between 2.5 and 8 terabytes of storage once you consider PROD (including PDC/DRC), all non-production environments (PrePROD, UAT, DEV, EDU), backups, copies in DWH/DataLake, etc.
How many times do you have duplicate data?
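The multiplication effect above can be sketched with illustrative per-environment copy factors. All factors below are assumptions for the sake of the example; measure your own landscape to get real numbers.

```python
def total_footprint_tb(primary_tb: float,
                       drc_copies: int = 1,
                       non_prod_envs: int = 4,
                       non_prod_ratio: float = 0.5,
                       backup_copies: int = 2,
                       dwh_copy: bool = True) -> float:
    """Estimate total storage consumed when `primary_tb` is added to PROD.

    Counts PROD itself, DRC replicas, scaled-down non-production
    environments (PrePROD, UAT, DEV, EDU), backup generations, and an
    optional DWH/DataLake copy. All factors are rough assumptions.
    """
    total = primary_tb                                    # PROD (PDC)
    total += primary_tb * drc_copies                      # DRC replica(s)
    total += primary_tb * non_prod_ratio * non_prod_envs  # non-production
    total += primary_tb * backup_copies                   # backups
    total += primary_tb if dwh_copy else 0.0              # DWH/DataLake
    return total

print(total_footprint_tb(1.0))  # 7.0 TB consumed for 1 TB of new PROD data
```

With these example factors, 1 TB of new production data ends up occupying 7 TB in total, which sits inside the 2.5-8 TB range mentioned above.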
2.3. Application maintenance growth
We keep all historical data in production, and in that case, we also retain all obsolete application code there.
You can see change requests to cover performance issues, e.g. splitting data into hot/cold tiers, asynchronous data loading in the GUI, tuning execution plans, adding indexes, pre-calculating/aggregating values, etc.
We practically cannot decommission closed applications (because we keep all their data there in case of litigation, etc.).
You still need to keep all the application logic in mind, which affects your delivery team (longer learning time, etc.).
How does your maintenance grow in time (> three years) and why?
2.4. Processing time growth
Processing time for application rollout, including data migration, reporting, backup/restore, synchronization, etc., grows with data volume, affecting not only storage but also network traffic and CPU.
How much time do you need today for data backup and migration?
2.5 Harder to keep BCM (Business Continuity Management)
Time for backup/restore often takes more than 24 hours, which is critical in an emergency.
You have probably heard of real outages in banks, e-shops, or service providers (it is not just a theoretical exercise).
What does a half-day or day outage of your core system mean in terms of money (see BIA as part of BCM)?
2.6 Affected SLA/incidents
The three factors above (data growth, maintenance growth, and processing time growth) have a negative impact on the number of incidents and adherence to SLAs.
It has a strong relation to client satisfaction and client retention.
Do you know how the number of incidents and system availability have changed over the last 3-5 years?
2.7 Have to cover RISK & Compliance
You have to cover RISK & Compliance for cases of client complaints, litigations, requests from the police, internal or external audits.
You can see requirements for the fulfilment of audit/compliance obligations (e.g. GDPR for Europe, "near-to-GDPR" regulations for Asia, etc.), and these affect data retention & termination, the elimination of large data leaks in production, etc. The primary aim is to avoid sanctions and fines or the revocation of a license.
Do you know how much effort you spend on the provision of data for client complaints, litigations, police, etc.?
3. What should be done for cleaning and archiving?
Typically, data cleaning and archiving are delivered through the following work streams, see the mind-map (link).
What will you have to do? See these typical steps.
3.1. Analyze
You can start by preparing data descriptions, data ownership, impacted use cases, and the data lifecycle, i.e. identify at least these data states: Created, ... Closed, Discard, including data retention periods and approval from business stakeholders, legal, etc.
This activity represents the largest part of the labor spent (for complex solutions, it can reach 90% of the whole effort).
Note: You will only archive information (fully described data with clear meaning) in the state Closed (not live data).
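A minimal sketch of the lifecycle gate described above, using the state names from the analysis step. The intermediate Active state and the allowed transitions are illustrative assumptions; only Closed records pass the archiving check.

```python
from enum import Enum

class DataState(Enum):
    CREATED = "Created"
    ACTIVE = "Active"        # illustrative intermediate state
    CLOSED = "Closed"
    DISCARDED = "Discard"

# Illustrative allowed transitions in the data lifecycle.
TRANSITIONS = {
    DataState.CREATED: {DataState.ACTIVE},
    DataState.ACTIVE: {DataState.CLOSED},
    DataState.CLOSED: {DataState.DISCARDED},
    DataState.DISCARDED: set(),
}

def is_archivable(state: DataState) -> bool:
    """Only fully closed information may be archived (not live data)."""
    return state is DataState.CLOSED

assert is_archivable(DataState.CLOSED)
assert not is_archivable(DataState.ACTIVE)
```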
3.2. Development
Don't forget to prepare
3.3. Archiving
You need to have the right technical solution for data archiving, one that addresses fundamental topics such as long-term storage, ensuring authenticity, immutability (fixity), long-term readability, intelligibility, access/role management, SLA for queries, etc.
3.4. Delete application data
You can start deleting data from source applications after archiving.
Note: Don't forget to delete data after expiration on the archiving system side (don't forget to extend the archiving period for records/cases that are or have been part of litigation).
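The deletion rule in the note above can be sketched as a guard that honours both the retention period and any litigation hold. The field names are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ArchivedRecord:
    closed_on: date
    retention_years: int
    legal_hold: bool = False   # set while the record is part of litigation

def may_delete(record: ArchivedRecord, today: date) -> bool:
    """A record may be deleted only after retention expires and no hold applies."""
    if record.legal_hold:
        return False
    expiry = record.closed_on.replace(
        year=record.closed_on.year + record.retention_years)
    return today >= expiry

r = ArchivedRecord(closed_on=date(2010, 1, 1), retention_years=10)
assert may_delete(r, date(2021, 1, 1))       # retention expired
r.legal_hold = True
assert not may_delete(r, date(2021, 1, 1))   # litigation hold blocks deletion
```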
3.5. Clean application code
You can start removing the application code that was there to work with the old data and simplify the application as much as possible.
Your goal is to ensure efficient and sustainable application development.
4. The conclusion for part I.
In this first part, I outlined the complexity of the topic of cleaning and archiving, which has a fundamental impact on the health of the entire company. In the next parts, I would like to focus on other details, e.g. the differences between archive and backup, data lifecycle, FAQ, etc.
BTW: Please don't be fooled into thinking that you don't need to archive data because you have a backup.
Backup does not remove your data; it just creates a copy of the same data on other, cheaper storage (for the purpose of business continuity management). This is a well-known joke among infrastructure people, for whom archiving is a very difficult subject compared to backup (where you can focus only on data movement, without the real analysis and development).
As the volume of data you manage increases, so does the need for cleaning and archiving (if the volume of your data exceeds 0.5+ PB, the topic will surely be interesting for you).
Thank you for your attention and have a nice day.
#archiving #archive #dataarchive #backup #storage #hdd #magnetictape #litigation #legal #compliance #police #regulatory #audit #gdpr #fraudprevention #bcm #bia #bcp #disaster #sustainabledevelopment #sustainablecosts #backwardcompatibility #mdm #datalifecycle #savestorage #esg #efficiency #petabytes #exabytes #zetabytes #bigdata #objectstorage #datacenter #longtimestorage
#microsoft #google #smarsh #proofpoint #globalrelay #mimecast #veritas #microfocus #solix #barracuda #jatheon #mithi #archive360 #nasuni #netapp #dell #hitachivantara #hpe #huawei #ibm #seagate #westerndigital #hynix #intel #kingston #samsung #kioxia #micron #quantum