The DOJ's Proposal for Machine Learning in the Michael Cohen Case

Yesterday, the United States Attorney’s Office for the Southern District of New York proposed that a special master, aided by technology-assisted review, be assigned to make the initial privilege determinations on materials that the FBI seized in its raid on Michael Cohen’s office and hotel room. The DOJ proposal is a fascinating document for the electronic discovery industry, and perhaps a confusing one for others following the Cohen case.  Here's a primer on the technologies involved.

What is Technology-Assisted Review?

Technology-assisted review (TAR) is the use of machine learning and related technologies to accelerate the review of documents in lawsuits, investigations, and other legal matters. Over the past few decades, the volume of documents -- particularly email and text messages -- relevant to any given legal matter has exploded. It is often impossible for the attorneys and investigators to read every document of potential interest. Increasingly they are turning to TAR. TAR substantially speeds review and, given the frailties of human reviewers, studies have shown it can be as effective as, or more effective than, brute-force human examination of every document.

What is Supervised Learning?

Supervised learning is the most widely used TAR technology. “Supervised” simply means that people are driving the process, teaching the software by example. The same technology is used when you teach your email software to recognize junk mail, or a streaming music system to pick tunes you like.

In the Cohen case, USAO-SDNY proposed that a retired judge or other “special master” play the role of reviewer and teacher. The special master would review some documents, and determine whether they were or were not privileged. Those decisions would then be used to teach the software to predict whether other documents are likely to be privileged.
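
To make "teaching by example" concrete, here is a minimal sketch in Python. It assumes a simple naive Bayes word-count model, chosen only because it fits in a few lines; nothing in the DOJ letter says which algorithm would be used, and commercial TAR tools are far more sophisticated. The documents, labels, and function names are all invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def train(labeled_docs):
    """Learn word counts from (text, is_privileged) pairs supplied by the reviewer."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, is_privileged in labeled_docs:
        doc_counts[is_privileged] += 1
        word_counts[is_privileged].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def privilege_probability(model, text):
    """Estimate the probability that an unreviewed document is privileged."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        # Class prior with add-one smoothing.
        log_score[label] = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                log_score[label] += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
    # Turn the two log scores into a probability for the "privileged" class.
    top = max(log_score.values())
    exp_true = math.exp(log_score[True] - top)
    exp_false = math.exp(log_score[False] - top)
    return exp_true / (exp_true + exp_false)

# Hypothetical decisions made by the special master (invented examples).
reviewed = [
    ("legal advice regarding the settlement agreement", True),
    ("attorney client memo on litigation strategy", True),
    ("lunch reservation for friday", False),
    ("invoice for taxi medallion payment", False),
]
model = train(reviewed)

p_memo = privilege_probability(model, "memo with legal advice on strategy")
p_taxi = privilege_probability(model, "taxi invoice for friday")
print(f"memo: {p_memo:.2f}  taxi invoice: {p_taxi:.2f}")
```

The point of the sketch is the division of labor: the human supplies judgments on a handful of documents, and the software generalizes those judgments into a score for every document nobody has read yet.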

What is Active Learning? 

Supervised learning learns from examples, but some examples teach the machine more than others. In active learning, the software selects those documents from which it will learn the most, and presents them to the expert. 

When documents of interest are rare, as DOJ's letter posits that privileged documents would be in the Cohen data set, the system learns the most from those few documents that are predicted most likely to be privileged. When assessed, some of those documents would turn out to actually be privileged, while others would be near-misses that weren't privileged. Both types are very useful for the software to learn from. Computer scientists refer to an approach of training on top-ranked documents as relevance feedback. In e-discovery this approach has sometimes been referred to as Continuous Active Learning, as referenced in the DOJ letter. (The latter term has a pending trademark application, so I capitalize it here.)

In the relevance feedback approach proposed by DOJ, the software would iteratively bring potentially privileged documents to the special master's attention for review. Their decisions on those documents would then be fed back to the software, and the cycle repeated until satisfactory results were achieved.
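
The review cycle described above can be sketched as a short loop. This is an illustration under stated assumptions, not the procedure in the DOJ letter: the scoring function is a simple smoothed log-odds over words, the "special master" is simulated by a lookup table of invented answers, and every document and name here is hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def score(doc, reviewed):
    """Smoothed log-odds that doc looks like the privileged reviewed documents."""
    pos, neg = Counter(), Counter()
    for text, is_privileged in reviewed:
        # Count each word once per document (document frequency).
        (pos if is_privileged else neg).update(set(tokenize(text)))
    n_pos = sum(1 for _, y in reviewed if y)
    n_neg = len(reviewed) - n_pos
    total = 0.0
    for word in set(tokenize(doc)):
        total += math.log((pos[word] + 1) / (n_pos + 2))
        total -= math.log((neg[word] + 1) / (n_neg + 2))
    return total

def relevance_feedback(collection, seed, ask_reviewer, rounds):
    """Repeatedly show the reviewer the top-ranked unreviewed document."""
    reviewed = list(seed)
    seen = {text for text, _ in reviewed}
    for _ in range(rounds):
        unreviewed = [doc for doc in collection if doc not in seen]
        if not unreviewed:
            break
        # Re-rank with everything learned so far, then pick the top document.
        top = max(unreviewed, key=lambda doc: score(doc, reviewed))
        reviewed.append((top, ask_reviewer(top)))  # the human decision is fed back
        seen.add(top)
    return reviewed

# Invented collection; the dict values stand in for the special master's rulings.
truth = {
    "attorney advice on settlement terms": True,
    "attorney memo on litigation advice": True,
    "attorney lunch on friday": False,  # a near-miss
    "taxi medallion invoice": False,
    "friday dinner reservation": False,
    "hotel invoice for travel": False,
}
seed = [("legal advice from attorney", True), ("invoice for dinner", False)]

result = relevance_feedback(list(truth), seed, truth.get, rounds=3)
for text, is_privileged in result[len(seed):]:
    print(is_privileged, text)
```

In this toy run the first two documents surfaced are privileged and the third is a near-miss that mentions an attorney but is not privileged, which is the mix of examples the paragraph above describes. A real deployment would also need a stopping rule for deciding when results are "satisfactory"; the fixed `rounds` parameter here is a placeholder for that decision.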

What are Other Advantages of Supervised Learning in Complex Legal Cases?

A major advantage of supervised learning is that documents found by any means can be used to teach the system.  A variety of other TAR technologies such as conceptual search, other forms of machine learning, analysis of communication patterns, and so on can be used along with relevance feedback to find privileged documents, with all reviewed documents usable for teaching the software. My company's product, Brainspace, supports using a range of such capabilities, including relevance feedback, through our Continuous Multimodal Learning (CMML) capability.

Further, all parties in the case could submit documents they assert to be privileged or non-privileged to the special master to begin or continue the training process. So even though a neutral third party would be making privilege decisions, both sides could contribute to the process without being in the room. If desired, the neutral could even train the system twice, once seeded with each party's interpretation of disputed documents.

We may or may not see machine learning used in the Michael Cohen case. But it has already had a great influence on the legal world, and this will only grow over time.


James Harris, CEDS

Sr. Principal Business Critical Engineer eDiscovery Platform at Veritas Technologies LLC

6 years ago

Where does privilege stand here if as I understand this data was captured during a court approved raid?

Alexander Genkin

Computer Scientist

6 years ago

Thanks Dave, very interesting! Two questions: 1. Active learning as described looks like it belongs in the area of reinforcement learning - would you agree? 2. In the contested cases, will the same technology be available to all parties? TIA

Ralph Losey

Attorney, AI Whisperer, Open to work as independent Board member of for-profit corps. Business, Emp. & Lit. experience, all industries. Losey.ai - CEO ** e-DiscoveryTeam.com

6 years ago

Honestly, Eli, I don't think it is that hard a project. Humans are still involved. The multimodal search using TAR is far more powerful than any human-alone efforts. I do it everyday.

Eli Nelson

Senior Counsel at Redgrave LLP

6 years ago

Thanks for posting this, Dave! Using TAR for identifying privilege in the Michael Cohen matter seems like a very tough use case. The fact that these are an attorney's communications will generally create a strong bias in favor of privilege relative to typical business communications, and may not reliably signal waiver issues or purely business (or other) communications without some serious pre-planning. I'd guess that the folks involved in deploying TAR here would want to do some preliminary mapping of entities and relationships identified in the data to help refine feature weights for stuff associated with actual legal advice versus everything else. Very interesting thought exercise!

Debbie Reynolds

The Data Diva | Data Privacy & Emerging Technologies Advisor | Technologist | Keynote Speaker | Helping Companies Make Data Privacy and Business Advantage | Advisor | Futurist | #1 Data Privacy Podcast Host | Polymath

6 years ago

It is fantastic to see eDiscovery in the news and talked about in the highest profile kinds of matters. This case will be a true education for all especially those who have no idea about the kinds of things these tech tools can do in litigation and in investigations. Maybe someday the Cohen corpus will replace the Enron dataset.
