What is deduplication, and what is its role in eDiscovery? Find out with GoldFynch!
GoldFynch eDiscovery
If you’re looking for an eDiscovery solution that’s fast, cost-effective and easy to use, GoldFynch is the answer.
In eDiscovery, where vast quantities of data need to be analyzed, being able to cull unnecessary information accurately and efficiently is a big deal: it saves both time and money.
Deduplication is a key tool for doing this. It makes large volumes of data more manageable and, in the process, makes review more effective. Here’s a quick dive into the uses of deduplication in the eDiscovery workflow and a peek into how GoldFynch tackles the process.
If you’re interested in a deeper look at the specifics of deduplication using GoldFynch, check out our support article on the topic.
What is deduplication?
Deduplication is the process of identifying and eliminating duplicate data within a dataset. In eDiscovery, where datasets can be massive and include a lot of redundant information, deduplication plays a crucial role in streamlining the review process. By using software to remove (or “cull”) duplicate files, legal teams can reduce costs, save time, and review more accurately and efficiently.
There are many “strategies” that can be used for deduplication. In simpler datasets where all the information lives in a single table (e.g. marketing databases), each row can be compared directly against the other rows. But in eDiscovery, where a dataset frequently contains thousands or even millions of files, and where a single file can run to hundreds of pages, it would be extremely wasteful to compare all the data in each file. So, instead, files are compared using unique identifiers associated with them.
Deduplication in GoldFynch
Deduplication in GoldFynch can be performed in two ways: by comparing either the MD5 hash values of your files or, in the case of emails, their Message-IDs. An MD5 file hash serves as a digital "signature" for a file: even the slightest change to the file's data (whether to its visible content or to the underlying binary data) produces a different hash value.
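To make the idea of a hash "signature" concrete, here is a minimal Python sketch of hash-based and Message-ID-based comparison. It is not GoldFynch's implementation: the function names (md5_of_bytes, group_by_hash, message_id) and the sample files are hypothetical and exist only for illustration. It relies solely on Python's standard hashlib and email modules.

```python
import hashlib
from collections import defaultdict
from email import message_from_string
from typing import Dict, List, Optional


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 hash of a file's raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def group_by_hash(files: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Group file names by MD5 hash and keep only groups with 2+ files."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[md5_of_bytes(data)].append(name)
    return {h: names for h, names in groups.items() if len(names) > 1}


def message_id(raw_email: str) -> Optional[str]:
    """Read the Message-ID header of an email (None if it is missing)."""
    return message_from_string(raw_email)["Message-ID"]


# Hypothetical sample data: two identical files and one with a 1-character edit.
files = {
    "contract_v1.txt": b"Payment is due in 30 days.",
    "copy_of_contract.txt": b"Payment is due in 30 days.",
    "contract_v2.txt": b"Payment is due in 31 days.",
}
duplicates = group_by_hash(files)

# Two copies of the same email share a Message-ID.
email_a = "Message-ID: <abc123@example.com>\nSubject: Hello\n\nBody text"
email_b = "Message-ID: <abc123@example.com>\nSubject: Hello\n\nBody text"
same_email = message_id(email_a) == message_id(email_b)
```

In this sketch, only the two identical files end up in the same group; the file with the one-character edit gets a different hash and is left out.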
It's also worth noting that deduplication is carried out at the root family level, since excluding or removing duplicate attachments that belong to non-duplicate parent files is not typically desirable. So GoldFynch will not mark attachment files as duplicates, even if their file hashes are the same, unless the parent files are also duplicates.
This means that if you run a deduplication session using the hash-based strategy and no duplicates are detected, you can be confident that the duplicate-looking files are either attachments to non-duplicate parent files or aren't, in fact, exact duplicates.
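The family-level rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not GoldFynch's actual logic: the File and Family types and the mark_duplicates function are hypothetical, the hash values are placeholders, and the sketch assumes a family counts as a duplicate when its parent's hash has already been seen.

```python
from typing import List, NamedTuple, Set


class File(NamedTuple):
    name: str
    md5: str  # placeholder hash values are used in the example below


class Family(NamedTuple):
    parent: File
    attachments: List[File]


def mark_duplicates(families: List[Family]) -> Set[str]:
    """Return names of files marked as duplicates at the family level.

    Assumption for this sketch: a family is a duplicate when its parent's
    hash matches a parent seen earlier. Attachments are never compared
    on their own.
    """
    seen_parent_hashes: Set[str] = set()
    dupes: Set[str] = set()
    for family in families:
        if family.parent.md5 in seen_parent_hashes:
            # The whole family is marked: the parent and its attachments.
            dupes.add(family.parent.name)
            dupes.update(a.name for a in family.attachments)
        else:
            seen_parent_hashes.add(family.parent.md5)
    return dupes


families = [
    Family(File("email_A.eml", "h1"), [File("report.pdf", "h9")]),
    # Same attachment hash, but a different parent: not marked.
    Family(File("email_B.eml", "h2"), [File("report_from_B.pdf", "h9")]),
    # Same parent hash as email_A: the whole family is marked.
    Family(File("email_A_copy.eml", "h1"), [File("report_copy.pdf", "h9")]),
]
marked = mark_duplicates(families)
```

Here the attachment to email_B.eml shares a hash with the other attachments but is left alone, because its parent email is not a duplicate.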
Reviewing GoldFynch's deduplication results
When you run a deduplication operation in GoldFynch, you can generate a report of the files detected as duplicates before deciding what to do with them.
The GoldFynch duplicate report contains the following information:
In case the files are emails, the following fields are populated with the available metadata:
Note: If the source files do not have metadata, these fields will be blank, even if the files are emails.
What's next?
Once the deduplication session has run, the system marks (or “tags”) the duplicates with a special “DUPE” tag and gives you the option to delete them. Whether or not you delete the duplicate files, we recommend creating a review set for your case, since review sets automatically exclude any system-marked duplicates. So you’ll be covered even if you decide to hold onto your duplicate files (for example, if you want to show that a specific file was present in a particular folder, even though it was a duplicate). You can learn about the other benefits of reviewing your files using review sets here.
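As a rough picture of how a review set can leave out tagged duplicates while the files themselves stay in the case, consider the Python sketch below. The data layout and the build_review_set function are hypothetical and are not GoldFynch's API; only the “DUPE” tag name comes from the description above.

```python
from typing import Dict, List

DUPE_TAG = "DUPE"  # tag name taken from the article; everything else is illustrative


def build_review_set(case_files: List[Dict]) -> List[str]:
    """Return the names of files that do not carry the duplicate tag."""
    return [f["name"] for f in case_files if DUPE_TAG not in f["tags"]]


# Hypothetical case contents: the duplicate is kept but tagged.
case_files = [
    {"name": "memo.docx", "tags": ["privileged"]},
    {"name": "memo_copy.docx", "tags": ["DUPE"]},
    {"name": "invoice.pdf", "tags": []},
]
review_set = build_review_set(case_files)
```

The duplicate stays in the case (so its folder location can still be shown) but does not appear in the set that reviewers work through.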
All in all, deduplication is a powerful tool in your eDiscovery arsenal. It can help significantly reduce costs, save time, increase accuracy, and enhance review efficiency. Because when it comes down to it, you want to be able to focus on your case, not on managing your files!