Cleaning & Archiving - Accelerate applications and reduce TCO (Part I.)

Data is, was, and always will be the foundation on which everything is built. Topics such as data literacy, the data lifecycle, and the ability to delete data are essential to keeping applications healthy and performant and keeping costs under control. Let's look at how to do proper cleaning & archiving.

Note: This article focuses on complex archiving for typically custom applications (it is not about simple archiving of, e.g., MS Exchange email communication, where spam can be automatically discarded and the rest of the emails archived without deep analysis).

1. How to identify the need for cleaning and archiving?

1.1. Are these horror sentences familiar to you?

  • Business: I want to keep unlimited data history in the company; disks are cheap.
  • IT team: We keep all historical data in production; it is easy for us and end users to find everything in one place.
  • IT team: We cannot delete any data for legal reasons (you know legal, they want everything).
  • IT team: We don't keep a history of changes, only the last state; a full history would kill us in terms of data volume.
  • IT team: We move older data from the application DB to an "archiving DB" with the same relational structure; it is our "archive-like" solution (this may be sufficient).
  • Infra team: To be safe, we back up all data for 10 years (this should be enough from a legal point of view).
  • IT team: We are deleting old data directly from production (without backup) because we have really big performance issues (and we still keep billions of records in master tables for real-time queries). Hopefully no one will need the deleted data.

1.2. Are you also solving similar issues?

  • Business: Why are production queries taking so long?
  • Infra team: Data backup takes more than a day (hopefully we won't have to restore it).
  • IT management: We are putting more and more effort into paying down technical debt (we need to maintain the SLA and speed up the system).
  • IT management: The costs of HW, including CPU, RAM, storage, and network, are regularly increasing (we can't do anything about it).
  • Legal: Data older than a year is unusable from the backup (we are not able to find anything for the needs of litigation; we have to go to the production applications for the data).
  • IT team: It takes a very long time to introduce a new product because of its relations to historical products.
  • IT team: Training a new team member takes a long time; they have to understand the old code as well.
  • IT Security: In the last attack on the production systems, all data (including historical) was stolen.
  • Legal: We need to keep a history of data by type and country/legislation, from 5 to 30 years or for the entire lifetime of the company, and we only have this data in production applications.

These are real questions and issues from several companies :-(

If you encounter any of the areas described above, sooner or later you will have to start dealing with data cleaning and archiving. It has an impact on the whole company: Business (processes, use cases, etc.), IT development (code quality, performance, knowledge transfer, etc.), IT operations (number of incidents, migrations, etc.), IT infrastructure (costs, etc.), Risk (fraud, BCM, SoD, etc.), and Legal & Compliance (litigation, GDPR, etc.).

Complexity

Ask yourself the basic questions: Do I really need to keep all historical data? Will I be able to use all of it, or only some of it? What for (litigation, legal, audit, data mining, policy, reporting, etc.)? Do I have enough human and HW resources to work with all historical data? Can I continue like this, with rising costs, for another 3-5 years?

In most cases, 3-5 years of history is sufficient. Older data is usually required only by legal & compliance, e.g. to resolve litigation, where data retention periods are 10, 20, 30 or even more years depending on the country's legislation (and only a small, selected amount of business information is needed).

2. The main motivation

I see seven basic reasons why you should focus on cleaning and archiving.

2.1. Data volumes are growing in time

Not only because of business growth (an irrelevant argument during, e.g., COVID) but because of technology changes, e.g., voice processing, liveness checks, focus on client behavior, real-time processing, etc.

Do you see your annual data growth factor?

2.2. Additional HW and storage are needed

You can see it in budget items, such as annual growth of 5-20% for "Add application data storage", "Add backup storage", "Add data nodes", "Increase network line", etc.

If you need to add 1 Terabyte of data, you will typically consume between 2.5 and 8 Terabytes of storage, considering PROD (including PDC/DRC), all non-production environments (PrePROD, UAT, DEV, EDU), backups, copies in DWH/DataLake, etc.
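The multiplier can be sketched with simple arithmetic. The per-environment copy counts below are assumptions chosen for the illustration, not measured figures; your landscape will differ.

```python
# Illustrative sketch: how 1 TB of new production data grows into
# several TB across the whole landscape. Copy factors are assumptions.
NEW_PROD_DATA_TB = 1.0

copy_factors = {
    "PROD (PDC)": 1.0,
    "PROD (DRC mirror)": 1.0,
    "non-production (PrePROD, UAT, DEV, EDU)": 2.0,  # assumed subsets, ~0.5 TB each
    "backups (several generations)": 2.0,
    "DWH / DataLake copies": 1.0,
}

total_tb = NEW_PROD_DATA_TB * sum(copy_factors.values())
for env, factor in copy_factors.items():
    print(f"{env}: {NEW_PROD_DATA_TB * factor:.1f} TB")
print(f"Total footprint: {total_tb:.1f} TB")  # 7.0 TB, inside the 2.5-8x range
```

With these assumed factors, one new terabyte of production data costs seven terabytes across the landscape, which is why deleting one terabyte at the source pays off several times over.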

How many times do you have duplicate data?

2.3. Application maintenance growth

If we keep all historical data in production, we also retain all the obsolete application code that works with it.

You can see change requests to cover performance issues, e.g., splitting data into cold/hot, asynchronous data loading in the GUI, tuning execution plans, adding indexes, pre-calculating/aggregating values, etc.

We practically cannot decommission closed applications (because we keep all their data there in case of litigation, etc.).

You still need to keep all the application logic in mind, and this affects your delivery team (longer learning time, etc.).

How has your maintenance grown over time (more than three years), and why?

2.4. Processing time growth

Processing time for application rollout, including data migration, reporting, backup/restore, synchronization, etc., grows with data volume, with an impact not only on storage but also on network traffic and CPU.

How much time do you need today for data backup and migration?

2.5 Harder to keep BCM (Business Continuity Management)

Time for backup/restore often takes more than 24 hours, which is critical in an emergency.

You have probably heard of real outages in banks, e-shops, or service providers (it is not only a theoretical exercise).

What does a half-day or day outage of your core system mean in terms of money (see BIA as part of BCM)?

2.6 Affected SLA/incidents

The three previous points (data growth, maintenance growth, and processing time growth) have a negative impact on the number of incidents and adherence to SLAs.

It has a strong relation to client satisfaction and client retention.

Do you know how the number of incidents and the system availability have changed in the last 3-5 years?

2.7 Have to cover RISK & Compliance

You have to cover Risk & Compliance in cases of client complaints, litigation, requests from the police, and internal or external audits.

You can see requirements for the fulfilment of audit/compliance obligations (e.g. GDPR for Europe, "near-GDPR" regulations for Asia, etc.), and they affect data retention & termination, the prevention of large data leaks in production, etc. The primary goal is to avoid sanctions and fines, or the revocation of a license.

Do you know how much effort you spend on the provision of data for client complaints, litigations, police, etc.?

3. What should be done for cleaning and archiving?

Typically, data cleaning and archiving are delivered through the following aims; see the mind map (link).

Cleaning & Archiving, Delivery Aims, Motivation

What will you have to do? See these typical steps.

3.1. Analyze

You can start by preparing data descriptions, data ownership, impacted use cases, and the data life cycle, i.e., identify at least these data states: Created, ..., Closed, Discard, including data retention periods and approval from business stakeholders, legal, etc.

This activity represents the largest part of the labor spent (for complex solutions, it can reach 90% of the whole effort).

Note: You will only archive information (fully described data with a clear meaning) in the state Closed (never live data).
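The lifecycle rule above can be sketched as a simple state check. The state set is illustrative (the intermediate `ACTIVE` state is an assumption; your analysis must define the real states and retention per data type):

```python
from enum import Enum

class DataState(Enum):
    # Illustrative lifecycle; a real analysis defines the full set of
    # states and the retention period for each data type.
    CREATED = "created"
    ACTIVE = "active"    # assumed intermediate state for this sketch
    CLOSED = "closed"    # fully described, eligible for archiving
    DISCARD = "discard"  # retention expired, to be deleted from the archive

def is_archivable(state: DataState) -> bool:
    """Only records in the Closed state are archived, never live data."""
    return state is DataState.CLOSED

print(is_archivable(DataState.CLOSED))  # True
print(is_archivable(DataState.ACTIVE))  # False
```

Encoding the rule this way makes the "never archive live data" constraint testable instead of tribal knowledge.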

3.2. Development

Don't forget to prepare:

  • The data export from the source application to the archive. Consider a flat structure, independent of the existing form (which can change very often with each release). Expect versioning of the exported data in the archive (data content may change over time).
  • Deletion of the data on the source system after successful archiving.
  • Deletion of the data on the archive side after data expiration (e.g. after 10, 20, etc. years; see the data state Discard after the state Closed).
  • Tests that focus on detecting data consistency errors as well as on presenting missing data on the user side (e.g. limited history, availability of history in the archive for specific roles).
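The first bullet (a flat, versioned export independent of the source schema) can be sketched as follows. All field names and the example case are hypothetical, invented for the illustration:

```python
import json
from datetime import datetime, timezone

def build_archive_record(case: dict, export_version: int) -> str:
    """Flatten a source-application case into a self-describing,
    versioned archive record, independent of the source schema."""
    record = {
        "schema_version": export_version,       # expect re-exports over time
        "archived_at": datetime.now(timezone.utc).isoformat(),
        "case_id": case["id"],
        "client_name": case["client"]["name"],  # flattened, no nested relations
        "closed_on": case["closed_on"],
        "retention_years": case["retention_years"],
    }
    return json.dumps(record, sort_keys=True)

# Hypothetical closed case from the source application:
source_case = {
    "id": "C-2021-0042",
    "client": {"name": "ACME Ltd"},
    "closed_on": "2021-06-30",
    "retention_years": 10,
}
print(build_archive_record(source_case, export_version=1))
```

The point of the flat structure is that the archive record stays readable decades later, even after the source application's relational schema has changed or the application itself has been decommissioned.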

3.3. Archiving

You need the right technical solution for data archiving, one that addresses fundamental topics such as long-term storage, authenticity, immutability (fixity), long-term readability, intelligibility, access/role management, SLAs for queries, etc.
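Fixity is typically verified with checksums: a digest is stored alongside the archived object, and periodic audits recompute it to prove the object has not changed. A minimal sketch:

```python
import hashlib

def fixity_digest(payload: bytes) -> str:
    """SHA-256 digest stored alongside the archived object; recomputing
    it later and comparing proves the object is unchanged (fixity)."""
    return hashlib.sha256(payload).hexdigest()

archived = b"archived record content"
stored_digest = fixity_digest(archived)  # saved at archiving time

# Later, during a periodic fixity audit:
assert fixity_digest(archived) == stored_digest             # unchanged
assert fixity_digest(b"tampered content") != stored_digest  # any change is detected
print("fixity check passed")
```

Real archive products layer more on top (signed timestamps, re-hashing on media migration), but the core mechanism is this comparison.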

3.4. Delete application data

You can start deleting data from source applications after archiving.

Note: Don't forget to delete data after expiration on the archiving system side (and don't forget to extend the archiving period for records/cases that are or have been part of litigation).
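The expiration rule in the note, including the litigation exception, can be sketched like this (function and field names are assumptions for the illustration):

```python
from datetime import date

def is_due_for_deletion(closed_on: date, retention_years: int,
                        litigation_hold: bool, today: date) -> bool:
    """An archived record may be deleted once its retention period has
    passed, unless it is (or was) part of litigation and is on hold."""
    if litigation_hold:
        return False  # a hold extends the archiving period indefinitely
    # Naive year arithmetic; a Feb 29 closing date would need special handling.
    expiry = closed_on.replace(year=closed_on.year + retention_years)
    return today >= expiry

today = date(2024, 1, 1)
print(is_due_for_deletion(date(2010, 5, 1), 10, False, today))  # True: expired
print(is_due_for_deletion(date(2010, 5, 1), 10, True, today))   # False: on hold
print(is_due_for_deletion(date(2020, 5, 1), 10, False, today))  # False: still retained
```

The hold flag is checked first on purpose: deleting a record that is under litigation hold is usually a worse failure than retaining an expired one too long.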

3.5. Clean application code

You can start removing the application code that existed only to work with the old data, and simplify the application as much as possible.

Your goal is to ensure efficient and sustainable application development.

4. The conclusion for part I.

In this first part, I outlined the complexity of the topic of cleaning and archiving, which has a fundamental impact on the health of the entire company. In the next parts, I would like to focus on other details, e.g. the differences between archive and backup, data lifecycle, FAQ, etc.

BTW: Please don't be fooled into thinking that you don't need to archive data because you have a backup.

Backup does not remove your data; it just creates a copy of the same data on another, cheaper storage (for the purposes of business continuity management). This is a well-known joke among infrastructure people, for whom archiving is a very difficult subject compared to backup (where you can focus only on data movement, without real analysis and development).

As the volume of data you manage increases, so does the need for cleaning and archiving (if the volume of your data exceeds 0.5+ PB, the topic will surely be interesting for you).

Thank you for your attention, and have a nice day.

#archiving #archive #dataarchive #backup #storage #hdd #magnetictape #litigation #legal #compliance #police #regulatory #audit #gdpr #fraudprevention #bcm #bia #bcp #disaster #sustainabledevelopment #sustainablecosts #backwardcompatibility #mdm #datalifecycle #savestorage #esg #efficiency #petabytes #exabytes #zetabytes #bigdata #objectstorage #datacenter #longtimestorage

#microsoft #google #smarsh #proofpoint #globalrelay #mimecast #veritas #microfocus #solix #barracuda #jatheon #mithi #archive360 #nasuni #netapp #dell #hitachivantara #hpe #huawei #ibm #seagate #westerndigital #hynix #intel #kingston #samsung #kioxia #micron #quantum

