What is deduplication, and what is its role in eDiscovery? Find out with GoldFynch!

In the world of eDiscovery, where vast quantities of data need to be analyzed, being able to accurately and efficiently cull unnecessary information is a big deal: it saves both time and money.

Deduplication is a key tool for doing this: it makes large volumes of data more manageable and, in the process, makes review more effective. Here’s a quick dive into the uses of deduplication in the eDiscovery workflow and a peek into how GoldFynch tackles the process.

If you’re interested in a deeper look at the specifics of deduplication using GoldFynch, check out our support article on the topic.

What is deduplication?

Deduplication is the process of identifying and eliminating duplicate data within a dataset. In eDiscovery, where datasets can be massive and include a lot of redundant information, deduplication plays a crucial role in streamlining the review process. By using software to remove (or “cull”) duplicate files, legal teams can:

  1. Reduce costs: Deduplication reduces the volume of data that needs to be processed and reviewed, leading to significant cost savings in storage and review expenses.
  2. Save time: Removing duplicates from your review accelerates the review process, allowing legal teams to focus on relevant information promptly.
  3. Increase accuracy: By eliminating duplicate documents from your review, deduplication ensures that you can focus on unique content and avoid having multiple partially-reviewed versions of documents. This helps reduce the risk of inconsistencies, contradictions, and review metadata being lost.
  4. Enhance review efficiency: With a unique dataset, reviewers can efficiently identify key documents and make informed decisions.

There are many “strategies” that can be used for deduplication. In simpler datasets where all information exists within a single table (e.g. marketing databases), the data in each row can be compared directly against the other rows. But in eDiscovery, where a dataset frequently holds thousands, or even millions, of files, and where a single file can run to hundreds of pages, it would be extremely wasteful to compare all the data in each file. So, instead, files are compared using unique identifiers associated with them.
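To make the idea concrete, here is a minimal Python sketch of identifier-based grouping. It is not how any particular eDiscovery tool is implemented: the function name `group_by_identifier`, the `fingerprint` field, and the sample records are all hypothetical. The point is that each item is visited once and bucketed by a compact key, rather than compared in full against every other item.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")


def group_by_identifier(
    items: Iterable[T], key: Callable[[T], Hashable]
) -> dict[Hashable, list[T]]:
    """Bucket items by a compact identifier in a single pass."""
    groups: dict[Hashable, list[T]] = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    # Keep only buckets with more than one member: these are duplicate candidates.
    return {k: v for k, v in groups.items() if len(v) > 1}


# Hypothetical sample records; "fingerprint" stands in for any unique identifier.
records = [
    {"name": "contract_v1.pdf", "fingerprint": "a1"},
    {"name": "copy of contract_v1.pdf", "fingerprint": "a1"},
    {"name": "invoice.pdf", "fingerprint": "b2"},
]

duplicates = group_by_identifier(records, key=lambda r: r["fingerprint"])
for fingerprint, group in duplicates.items():
    print(fingerprint, [r["name"] for r in group])
```

Because the grouping uses a dictionary lookup per item, the work grows roughly in line with the number of files instead of with the number of file pairs.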

Deduplication in GoldFynch

Deduplication in GoldFynch can be performed in two ways: by comparing either the MD5 hash values of your files or, in the case of emails, their Message-IDs. An MD5 file hash serves as a digital "signature" for a file: even the slightest change to the file's data (whether to its visible content or to the underlying binary data) will, for all practical purposes, produce a different hash value.
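As an illustration of why a hash works as a signature, the following Python sketch computes MD5 digests with the standard library's `hashlib`. The helper name `md5_of_file` and the sample byte strings are made up for this example, and nothing here describes GoldFynch's internals.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two byte strings that differ by a single trailing space.
original = b"Quarterly report, final version"
altered = b"Quarterly report, final version "

hash_original = hashlib.md5(original).hexdigest()
hash_copy = hashlib.md5(bytes(original)).hexdigest()
hash_altered = hashlib.md5(altered).hexdigest()

print(hash_original == hash_copy)     # identical bytes give the same digest
print(hash_original == hash_altered)  # a one-byte change gives a different digest
```

This is also why two files that look the same on screen may not be flagged as duplicates: if their underlying bytes differ at all, their digests differ too.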

It's also worth noting that deduplication is carried out on a root family level since excluding or removing duplicate attachments belonging to non-duplicate parent files is not typically desirable. So GoldFynch will not mark attachment files as duplicates even if the file hashes are the same, unless the parent files are also duplicates.

This means that if you run a deduplication session using the hash-based strategy and no duplicates are detected, you can be confident that the duplicate-looking files are either attachments to non-duplicate parent files or aren't, in fact, exact duplicates.
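The family-level rule can be sketched in a few lines of Python. This is only an illustration of the behavior described above, under simplifying assumptions: the `Doc` class, the `find_family_duplicates` function, and the sample IDs and hashes are hypothetical, a root file is represented by `parent_id=None`, and the first root seen with a given hash is treated as the primary copy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    md5: str
    parent_id: Optional[str] = None  # None means the file is a root (has no parent)


def find_family_duplicates(docs: list[Doc]) -> set[str]:
    """Return IDs of files to treat as duplicates under a root-family rule."""
    by_id = {d.doc_id: d for d in docs}

    def root_of(doc: Doc) -> Doc:
        # Walk up the parent chain until a file with no parent is reached.
        while doc.parent_id is not None:
            doc = by_id[doc.parent_id]
        return doc

    seen_root_hashes: set[str] = set()
    duplicate_roots: set[str] = set()
    for doc in docs:
        if doc.parent_id is None:
            if doc.md5 in seen_root_hashes:
                duplicate_roots.add(doc.doc_id)
            else:
                seen_root_hashes.add(doc.md5)

    # A file is a duplicate only if the root of its family is a duplicate.
    return {d.doc_id for d in docs if root_of(d).doc_id in duplicate_roots}


docs = [
    Doc("email-1", md5="aaa"),
    Doc("att-1", md5="fff", parent_id="email-1"),
    Doc("email-2", md5="bbb"),                     # a different email...
    Doc("att-2", md5="fff", parent_id="email-2"),  # ...with the same attachment hash
    Doc("email-3", md5="aaa"),                     # exact copy of email-1
    Doc("att-3", md5="fff", parent_id="email-3"),
]

print(sorted(find_family_duplicates(docs)))
```

In this sample, `att-2` shares a hash with `att-1` but is not flagged, because its parent `email-2` is not a duplicate. Only `email-3` and its attachment are returned.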

Reviewing GoldFynch's deduplication results

When you run a deduplication operation in GoldFynch, you can generate a report of the files detected as duplicates before deciding what to do with them.

The GoldFynch duplicate report contains the following information:

  • APP Link - This is a direct link to the document in your GoldFynch case (only accessible if you are logged into an account that has access to your case)
  • APP ID - GoldFynch's internal ID, which is used to track each individual file that is uploaded
  • APP Parent ID - This is the ID of the parent document. If there is no parent, then it is the same as the APP ID
  • Keep? - When the value is “TRUE” it indicates that the file is primary, and “FALSE” indicates that the file is a duplicate
  • File Name - File name of the document
  • Pathname - Path of the document in GoldFynch
  • Tags - All tags attached to the document will be listed

In case the files are emails, the following fields are populated with the available metadata:

  • Subject
  • From
  • To
  • Cc
  • Bcc
  • Sent
  • Message ID

Note: If the source files do not contain this metadata, these fields will be blank, even if the files are emails.
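If you want to tally a report like this programmatically, a short Python sketch might look like the following. It assumes the report is available as CSV text whose column headers match the field names listed above. That format, the sample rows, and the function name `summarize_report` are all assumptions made for illustration, and only a subset of the columns is shown.

```python
import csv
import io

# Hypothetical report contents; the real export format may differ.
SAMPLE_REPORT = """\
APP ID,APP Parent ID,Keep?,File Name,Pathname
101,101,TRUE,memo.docx,/custodian_a/memo.docx
205,205,FALSE,memo.docx,/custodian_b/memo.docx
310,310,TRUE,budget.xlsx,/custodian_a/budget.xlsx
"""


def summarize_report(report_text: str) -> dict[str, list[str]]:
    """Split report rows into primary files and duplicates using the Keep? column."""
    summary: dict[str, list[str]] = {"primary": [], "duplicate": []}
    for row in csv.DictReader(io.StringIO(report_text)):
        # "TRUE" marks the primary copy; anything else is treated as a duplicate.
        is_primary = row["Keep?"].strip().upper() == "TRUE"
        summary["primary" if is_primary else "duplicate"].append(row["Pathname"])
    return summary


summary = summarize_report(SAMPLE_REPORT)
print(len(summary["primary"]), "primary,", len(summary["duplicate"]), "duplicate")
for path in summary["duplicate"]:
    print("duplicate:", path)
```

A quick summary like this can help you sanity-check which copies would be kept before you decide what to do with the duplicates.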

What's next?

Once the deduplication session is run, the system will mark (or “tag”) the duplicates with a special “DUPE” tag and give you the option to delete them if you wish. Whether or not you delete the duplicate files, we recommend creating a review set for your case, since review sets automatically exclude any system-marked duplicates. So you’ll be covered even if you decide to hold onto your duplicate files (for example, if you want to show that a specific file was present in a particular folder, even though it was a duplicate). You can learn about the other benefits of reviewing your files using review sets here.

All in all, deduplication is a powerful tool in your eDiscovery arsenal. It can help significantly reduce costs, save time, increase accuracy, and enhance review efficiency. Because when it comes down to it, you want to be able to focus on your case, not on managing your files!
