The DOJ's Proposal for Machine Learning in the Michael Cohen Case
Yesterday, the United States Attorney’s Office for the Southern District of New York proposed that a special master, aided by technology-assisted review, be assigned to make the initial privilege determinations on materials that the FBI seized in its raid on Michael Cohen’s office and hotel room. The DOJ proposal is a fascinating document for the electronic discovery industry, and perhaps a confusing one for others following the Cohen case. Here's a primer on the technologies involved.
What is Technology-Assisted Review?
Technology-assisted review (TAR) is the use of machine learning and related technologies to accelerate the review of documents in lawsuits, investigations, and other legal matters. Over the past few decades, the volume of documents -- particularly email and text messages -- relevant to any given legal matter has exploded. It is often impossible for attorneys and investigators to read every document of potential interest, so they are increasingly turning to TAR. TAR substantially speeds review and, given the frailties of human reviewers, studies have shown it can be as effective as, or more effective than, brute-force human examination of every document.
What is Supervised Learning?
Supervised learning is the most widely used TAR technology. “Supervised” simply means that people are driving the process, teaching the software by example. The same technology is used when you teach your email software to recognize junk mail, or a streaming music system to pick tunes you like.
In the Cohen case, USAO-SDNY proposed that a retired judge or other “special master” play the role of reviewer and teacher. The special master would review some documents, and determine whether they were or were not privileged. Those decisions would then be used to teach the software to predict whether other documents are likely to be privileged.
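To make "teaching by example" concrete, here is a minimal sketch of a tiny naive Bayes text classifier in plain Python. It is an illustration only, not the method described in the DOJ letter or used by any TAR product; the function names (train, privilege_score) and the four sample documents are invented for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(reviewed_docs):
    """Learn word statistics from (text, is_privileged) pairs labeled by a reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_privileged in reviewed_docs:
        doc_counts[is_privileged] += 1
        word_counts[is_privileged].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def privilege_score(model, text):
    """Estimate the probability (0 to 1) that an unreviewed document is privileged."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_prob = {}
    for label in (True, False):
        # Smoothed prior: how common this label was among the reviewed documents.
        lp = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training carry no evidence
                lp += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_prob[label] = lp
    # Turn the two log scores into a probability without numeric underflow.
    top = max(log_prob.values())
    p_true = math.exp(log_prob[True] - top)
    p_false = math.exp(log_prob[False] - top)
    return p_true / (p_true + p_false)


# Hypothetical decisions by the reviewer (the "teacher").
reviewed = [
    ("legal advice regarding settlement strategy", True),
    ("attorney client memo on litigation advice", True),
    ("lunch menu for friday", False),
    ("taxi medallion invoice payment", False),
]
model = train(reviewed)
print(privilege_score(model, "memo with legal advice"))  # scores above 0.5
print(privilege_score(model, "friday invoice"))          # scores below 0.5
```

Real systems use far richer features and models, but the division of labor is the same: a person supplies the labels, and the software generalizes from them to documents nobody has read yet.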
What is Active Learning?
Supervised learning learns from examples, but some examples teach the machine more than others. In active learning, the software selects those documents from which it will learn the most, and presents them to the expert.
When documents of interest are rare, as DOJ's letter posits that privileged documents would be in the Cohen data set, the system learns the most from those few documents that are predicted most likely to be on topic. When assessed, some of those documents would turn out to actually be privileged, while others would be near-misses that weren't privileged. Both types are very useful for the software to learn from. Computer scientists refer to an approach of training on top-ranked documents as relevance feedback. In e-discovery this approach has sometimes been referred to as Continuous Active Learning, as referenced in the DOJ letter. (The latter term has a pending trademark application, so I capitalize it here.)
In the relevance feedback approach proposed by DOJ, the software would iteratively bring potentially privileged documents to the special master's attention for review. Their decisions on those documents would then be fed back to the software, and the cycle repeated until satisfactory results were achieved.
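The review cycle just described can be sketched in a few lines of Python. This is a toy under stated assumptions, not the DOJ's or any vendor's implementation: the scoring rule is a simple smoothed word log-ratio, ask_reviewer stands in for the special master, and the corpus, the true labels, the batch size, and the number of rounds are all made up.

```python
import math
from collections import Counter


def score(doc, priv_words, other_words):
    """Higher means the document looks more like reviewed privileged documents."""
    return sum(
        math.log((priv_words[w] + 1) / (other_words[w] + 1))
        for w in doc.lower().split()
    )


def relevance_feedback(corpus, seed_labels, ask_reviewer, batch_size=2, rounds=2):
    """Repeatedly send the top-ranked unreviewed documents to the reviewer."""
    labels = dict(seed_labels)  # document index -> True if privileged
    for _ in range(rounds):
        # Retrain from every decision made so far.
        priv_words, other_words = Counter(), Counter()
        for i, is_priv in labels.items():
            target = priv_words if is_priv else other_words
            target.update(corpus[i].lower().split())
        unreviewed = [i for i in range(len(corpus)) if i not in labels]
        if not unreviewed:
            break
        # Relevance feedback: review the documents predicted most likely privileged.
        ranked = sorted(
            unreviewed,
            key=lambda i: score(corpus[i], priv_words, other_words),
            reverse=True,
        )
        for i in ranked[:batch_size]:
            labels[i] = ask_reviewer(i)
    return labels


corpus = [
    "legal advice on settlement",     # 0: reviewed up front, privileged
    "lunch order for friday",         # 1: reviewed up front, not privileged
    "advice on litigation strategy",  # 2
    "settlement advice draft",        # 3
    "friday lunch receipt",           # 4
    "parking receipt for office",     # 5
    "litigation strategy memo",       # 6
    "office party on friday",         # 7
]
truly_privileged = {0, 2, 3, 6}  # hypothetical ground truth, known only to the reviewer

labels = relevance_feedback(
    corpus,
    seed_labels={0: True, 1: False},
    ask_reviewer=lambda i: i in truly_privileged,
)
found = {i for i, is_priv in labels.items() if is_priv}
print(sorted(found), "found after reviewing", len(labels), "of", len(corpus), "documents")
```

Notice that document 6 shares no words with the first privileged example. It rises to the top of the ranking only after the reviewer's decisions on documents 2 and 3 are fed back in, which is the point of repeating the cycle.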
What are Other Advantages of Supervised Learning in Complex Legal Cases?
A major advantage of supervised learning is that documents found by any means can be used to teach the system. A variety of other TAR technologies such as conceptual search, other forms of machine learning, analysis of communication patterns, and so on can be used along with relevance feedback to find privileged documents, with all reviewed documents usable for teaching the software. My company's product, Brainspace, supports using a range of such capabilities, including relevance feedback, through our Continuous Multimodal Learning (CMML) capability.
Further, all parties in the case could submit documents they assert to be privileged or non-privileged to the special master to begin or continue the training process. So even though a neutral third party would be making privilege decisions, both sides could contribute to the process without being in the room. If desired, the neutral could even train the system twice, once seeded with each party's interpretation of disputed documents.
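One way to picture the "train twice" idea is sketched below in Python. It is purely illustrative: the scoring rule, the zero threshold, the helper names (fit, score, sensitive_documents), and the sample documents are all my assumptions for the example, not anything proposed in the DOJ letter.

```python
import math
from collections import Counter


def fit(labeled):
    """Count words in documents labeled privileged versus not privileged."""
    priv, other = Counter(), Counter()
    for text, is_priv in labeled:
        (priv if is_priv else other).update(text.lower().split())
    return priv, other


def score(model, text):
    """Positive means the document looks privileged under this model (assumed threshold: 0)."""
    priv, other = model
    return sum(math.log((priv[w] + 1) / (other[w] + 1)) for w in text.lower().split())


def sensitive_documents(agreed, disputed, unreviewed):
    """Return unreviewed documents whose predicted status flips between the two seedings.

    disputed holds (text, party_a_label, party_b_label) triples.
    """
    model_a = fit(agreed + [(text, a) for text, a, _ in disputed])
    model_b = fit(agreed + [(text, b) for text, _, b in disputed])
    return [
        text for text in unreviewed
        if (score(model_a, text) > 0) != (score(model_b, text) > 0)
    ]


agreed = [
    ("legal advice on settlement", True),
    ("lunch order for friday", False),
]
disputed = [
    # Party A asserts privilege; Party B says it is ordinary business.
    ("business plan for taxi medallions", True, False),
]
unreviewed = [
    "taxi medallions invoice",
    "settlement advice draft",
    "friday lunch receipt",
]
print(sensitive_documents(agreed, disputed, unreviewed))
```

Documents that land on the same side under both seedings are unaffected by the dispute. The ones that flip are where the neutral's ruling on the disputed examples actually matters, so they are natural candidates for direct review.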
We may or may not see machine learning used in the Michael Cohen case. But it has already had a great influence on the legal world, and this will only grow over time.
Sr. Principal Business Critical Engineer eDiscovery Platform at Veritas Technologies LLC
6 years ago: Where does privilege stand here if, as I understand it, this data was captured during a court-approved raid?
Computer Scientist
6 years ago: Thanks Dave, very interesting! Two questions: 1. Active learning as described looks like it belongs in the area of reinforcement learning - would you agree? 2. In the contested cases, will the same technology be available to all parties? TIA
Attorney, AI Whisperer, Open to work as independent Board member of for-profit corps. Business, Emp. & Lit. experience, all industries. Losey.ai - CEO ** e-DiscoveryTeam.com
6 years ago: Honestly, Eli, I don't think it is that hard a project. Humans are still involved. The multimodal search using TAR is far more powerful than any human-alone efforts. I do it every day.
Senior Counsel at Redgrave LLP
6 years ago: Thanks for posting this, Dave! Using TAR for identifying privilege in the Michael Cohen matter seems like a very tough use case. The fact that these are an attorney's communications will generally create a strong bias in favor of privilege relative to typical business communications, and may not reliably signal waiver issues or purely business (or other) communications without some serious pre-planning. I'd guess that the folks involved in deploying TAR here would want to do some preliminary mapping of entities and relationships identified in the data to help refine feature weights for stuff associated with actual legal advice versus everything else. Very interesting thought exercise!
The Data Diva | Data Privacy & Emerging Technologies Advisor | Technologist | Keynote Speaker | Helping Companies Make Data Privacy and Business Advantage | Advisor | Futurist | #1 Data Privacy Podcast Host | Polymath
6 years ago: It is fantastic to see eDiscovery in the news and talked about in the highest-profile kinds of matters. This case will be a true education for all, especially those who have no idea about the kinds of things these tech tools can do in litigation and in investigations. Maybe someday the Cohen corpus will replace the Enron dataset.