Archive Data Migration via Vendor API – “Legacy Technology” or Core Component?
In a recent conversation with a highly experienced migration partner, the question came up whether extracting data from an archiving system such as Enterprise Vault can deliver the state-of-the-art performance required for today’s projects.
I would like to share my point of view…
Performance
In the migration industry, there is a perception that extraction via APIs can be slow and unreliable. As always, if you want maximum performance and reliability, you need an in-depth understanding of how the underlying system works in order to get the most out of it. For example, API access is much faster when performed locally on the archiving server rather than across the network. This is one of the reasons behind QUADROtech’s modular approach to migration.
Extracting data, even from a single Enterprise Vault archiving server, doesn’t have to be slow:
As demonstrated in Figure 1, 53 GB/h were extracted in a recent test. With the transparent ArchiveShuttle approach, migration can run 24 hours a day, which in this case translates into a daily extraction rate of roughly 1.3 TB from a single EV server.
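As a quick sanity check on that figure, here is a back-of-the-envelope sketch, assuming the 53 GB/h rate is sustained around the clock:

```python
# Back-of-the-envelope check of the daily rate quoted above (illustrative only).
hourly_rate_gb = 53        # observed extraction rate of a single EV server, GB/hour
hours_per_day = 24         # extraction can keep running around the clock

daily_rate_tb = hourly_rate_gb * hours_per_day / 1000   # using 1 TB = 1000 GB
print(f"Daily extraction per EV server: ~{daily_rate_tb:.2f} TB")   # ~1.27 TB, i.e. roughly 1.3 TB
```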
What if the environment is larger?
Because of its flexible architecture and modular approach, ArchiveShuttle can scale with the number of EV servers involved. If two servers are involved, both extract in parallel; if ten are involved, all ten extract. You can multiply the extraction rate by the number of servers involved until the archive storage becomes the bottleneck. We’ve seen extraction rates of over 200 GB/h in a single datacenter, with multiple EV servers involved.
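A minimal sketch of that scaling behaviour (the per-server rate reflects the test above, but the server counts and the storage cap are hypothetical figures, not measurements):

```python
# Illustrative scaling model, not product code: extraction scales linearly with the
# number of EV servers until the archive storage becomes the limiting factor.
def effective_extraction_rate(per_server_gb_h: float, server_count: int,
                              storage_limit_gb_h: float) -> float:
    """Combined parallel extraction rate, capped by what the storage can deliver."""
    return min(per_server_gb_h * server_count, storage_limit_gb_h)

# Hypothetical example: 53 GB/h per server, storage capped at 250 GB/h.
print(effective_extraction_rate(53, 2, 250))    # 106 -> two servers extract in parallel
print(effective_extraction_rate(53, 10, 250))   # 250 -> storage is now the bottleneck
```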
As explained in my blog post, “Why Extraction speed is only half the story” (https://www.quadrotech-it.com/archive-migration-why-extraction-speed-is-only-half-the-story/), it doesn’t make sense to extract faster than you can ingest into the target system. Remember, extraction speed is only a single component of the total speed of an end-to-end migration project. Several steps are required to complete a migration: provisioning the target archives, the ingestion process, switching the user from the source to the target system and, finally, fixing the shortcuts and re-linking everything that still points to the source environment. With ArchiveShuttle, all of these otherwise manual steps are fully automated, thanks to the APIs. In addition, our cloud servers remove some of the more time-consuming elements of a project by reducing the need for hardware provisioning and configuration. This helps us to deliver much shorter total project times.
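To make those end-to-end stages concrete, here is a minimal sketch of the workflow; every name below is an illustrative placeholder, not part of the ArchiveShuttle or Enterprise Vault API:

```python
# Sketch of the end-to-end stages described above, with stubbed-out steps.
# All functions are illustrative placeholders, not a real vendor API.

def provision_target_archive(user: str) -> str:
    print(f"provisioning target archive for {user}")
    return f"{user}-target"

def extract_items(user: str) -> list[str]:
    print(f"extracting items from {user}'s source archive")
    return ["item-1", "item-2"]               # stand-ins for archived messages

def ingest_item(target: str, item: str) -> None:
    print(f"ingesting {item} into {target}")

def switch_user(user: str, target: str) -> None:
    print(f"switching {user} over to {target}")

def fix_shortcuts(user: str) -> None:
    print(f"re-linking shortcuts for {user}")

def migrate_archive(user: str) -> None:
    target = provision_target_archive(user)    # 1. provision the target archive
    for item in extract_items(user):           # 2. extract via the source API
        ingest_item(target, item)              # 3. ingest via the target API
    switch_user(user, target)                  # 4. switch the user to the target
    fix_shortcuts(user)                        # 5. fix shortcuts / re-link to the target

migrate_archive("alice")
```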
On the ingestion side, taking an Enterprise Vault to Enterprise Vault migration as an example, ingesting without access to the APIs is effectively impossible. The only alternative would be a slow and unreliable PST file import, which some migration vendors script in an attempt to semi-automate the migration. As illustrated in Figure 2, ArchiveShuttle, through the use of the Enterprise Vault API, was able to ingest over 53 GB/h of data into a single target EV server.
Chain of Custody
Migration tools that access data on the storage layer (a “direct connector to storage”) have to reverse engineer the proprietary format of the archive system. No vendor has published these internal formats (e.g. EV’s .dvs, .dvscc, etc.), so you have to trust completely that the migration vendor hasn’t missed anything and has considered every possible format combination. In the case of Enterprise Vault, for example, the storage formats have changed between versions and often differ between types of archiving storage, which means a single message may be spread across several files and must be reconstituted. With such direct access to the storage, how can one be sure that the data extracted (and ingested) is the SAME as the data originally archived by the archiving system, possibly years ago?
According to the IT Law wiki, a proper chain of custody requires the protection of evidence (e.g. the archived emails to be migrated) against loss, breakage, alteration or unauthorized handling (https://itlaw.wikia.com/wiki/Chain_of_custody).
My point of view is that only the vendor’s API allows a forensically verifiable chain of custody. ArchiveShuttle, for example, ensures that data is not tampered with, altered or corrupted in any way during the migration by calculating checksums/hashes during extraction and again before ingestion into the target system. Only when both hashes match does a forensically verifiable chain of custody exist. Even retrospective verification is possible, comparing the ingested item in the target to the archived item in the source.
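A minimal sketch of that hash comparison (SHA-256 is used here purely as an example; this is not the ArchiveShuttle implementation):

```python
import hashlib

def item_hash(item_bytes: bytes) -> str:
    """Checksum of an archived item's content; SHA-256 chosen here as an example algorithm."""
    return hashlib.sha256(item_bytes).hexdigest()

# Hash taken at extraction time from the source archive...
extracted_item = b"archived message content"
source_hash = item_hash(extracted_item)

# ...and calculated again immediately before ingestion into the target.
item_to_ingest = extracted_item        # in a real migration this is the transported item
target_hash = item_hash(item_to_ingest)

# Only a match preserves a forensically verifiable chain of custody.
assert source_hash == target_hash, "chain of custody broken: item was altered in transit"
```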
Furthermore, some migration vendors use PST files as an intermediate format between extraction and ingestion, which can break the chain of custody: while unprotected at this stage, items can be modified or, in the worst case, falsified.
QUADROtech’s ArchiveShuttle has been gold certified by Migration Forensics, a company specializing in the forensic verification of migrations, and therefore meets the highest compliance standards possible. Find out more about the open standards here: https://www.migrationforensics.com/standard-compliance/.
Storage Transparency
Different customers use different kinds of archiving storage, and archiving systems support these storage platforms to address individual needs. Some are CAS systems, such as EMC Centera or the Hitachi Content Archive Platform, while others are used to move collections of old archived mails to a secondary archive storage tier. All of this is transparent to ArchiveShuttle, because the underlying system (e.g. Enterprise Vault) that wrote the data reads it back using the verified processes certified by the storage vendors. Importantly, the same system that wrote the data is the one reading it.
Migration products that try to access the data directly on the storage layer either have no support for a specific storage system or must rely on reverse engineering. As explained above, without knowing exactly what the archiving software has written into this specific format, and why, it is difficult to assess what might have changed between versions.
To summarize the facts:
- The proper use of the APIs doesn’t have to be slower than direct access to the storage.
- For performant ingestion, the use of an API is mandatory.
- Without APIs, provisioning/de-provisioning of archives has to be performed manually and can’t be automated.
- Archiving systems use proprietary storage formats to store the archived data. These formats are not published and have changed between versions and even service packs.
- Different storage platforms may contain different formats.
- A proper chain of custody requires the use of the API; otherwise it can’t be forensically proven that the data extracted is the same data that was originally archived.
- Be aware that PST files as an intermediate format might break your chain of custody by compromising the integrity of the items.
Let me know what you think and what experiences you have had with migrations, with and without APIs! Please feel free to comment here...