Scraping Copyrighted Content - Deceitful Data Security Bypassing Schemes that Security Analysts should be aware of

This article examines evolving, deceitful content-theft practices that persist despite data owners' efforts to secure their content with technical and business controls.

Background

Ross Intelligence, a Canadian company formed in 2015, developed AI-based products to augment lawyers' cognitive abilities. To build such a product, it required vast amounts of enriched legal content, in the form of legal questions and answers, to train and test its IBM Watson deep learning model. Ross explored obtaining content from Thomson Reuters, the leading provider of curated legal text, but as a competitor it was explicitly denied access to Thomson's copyrighted and proprietary content.

Ross Intelligence "AI"

After raising $13.1M, Ross leveraged IBM Watson's natural language processing (NLP) service for advanced text analytics to develop its legal technology product. The neural network, trained on data obtained from Thomson, let Ross users ask questions in natural language, such as "Can a bankrupt company continue to transact business?", instead of running keyword or Boolean searches such as "(Bankruptcy) AND (business)".

The Scheme

Upon being denied access to Thomson's data on the grounds of being a competitor, Ross devised a novel scheme to steal Thomson's password-protected, copyrighted content through a third party, LegalEase Solutions LLC. Ross struck a business partnership with LegalEase and signed a contract for bulk legal data, which LegalEase was to procure from Thomson in violation of its own contract with Thomson. LegalEase willingly played along with the scheme.

As an existing Thomson customer, LegalEase had access to Thomson's password-protected and copyrighted content. Under its SLA, LegalEase was prohibited from misusing proprietary content, including bulk transfer to third parties. LegalEase knowingly chose to breach its SLA and covertly ran automated programs to scrape copyrighted content for purposes other than its authorized use.

Before July 2017, LegalEase averaged ~6,000 web-based content access transactions per month. Starting in July 2017, its content access spiked 40x (see chart below).

Automated data scraping

Thomson Reuters traced the traffic spike back to LegalEase. Upon being questioned, LegalEase explained that a "machine learning research firm" was being supplied with "tons and tons" of Thomson's content, which Ross was using to build a product competing with Thomson's. Thomson immediately severed its business relationship with LegalEase for material breach of the SLA and sued Ross.

The Lawsuit

Thomson Reuters sued Ross Intelligence for deliberate copyright infringement under the federal Copyright Act of 1976 and for creating a derivative product. Secondarily, Thomson alleges that Ross is liable for LegalEase's copyright infringement.

The complaint here differs from that in hiQ v. LinkedIn, a case recently decided in favor of the screen-scraping party by a federal appeals court. There, the court found that the scraped information was publicly available and hence not subject to copyright protection under the Digital Millennium Copyright Act. LinkedIn also unsuccessfully asserted that hiQ violated the federal Computer Fraud and Abuse Act (CFAA) by improperly accessing LinkedIn's systems.

Thomson's complaint does not allege a CFAA violation and does not claim authorship of, or copyright protection for, the federal and state court opinions it digests. It does, however, go to lengths to describe the significant resources invested in developing its unique, copyright-protected “Key Number System”. The Westlaw system is password-protected and accessible only to subscribers. Thomson's data is neither publicly accessible nor considered "open law", unlike content available via PACER.

A strategy that legal and risk management professionals may find noteworthy is the surprise move by the Swiss legal warriors: they settled out of court with LegalEase, which had built the plumbing to siphon the content. Having sealed LegalEase's lips, the Swiss warriors can be expected to concentrate all their legal firepower on Ross Intelligence, the beneficiary of the ill-gotten content.

Relief requested by Thomson

Thomson has asked the court to order the removal and destruction of its content from Ross's products; punitive damages for interference with its contractual relationship with its customer LegalEase; and statutory and actual damages, along with an injunction prohibiting Ross from further acts of infringement.

What could Security Analysts and Risk Management personnel do to prevent looting of copyrighted content?

Create content usage baselines for paying and non-paying users. Monitor each user's average content use and its standard deviation to set per-user consumption boundaries. Set alerts for deviations and investigate the root cause.

Routinely examine log files for automated bot activity. Carefully examine the content being scraped.

Establish OSINT practices to identify misuse of copyrighted materials on the public web or the dark web. Promptly issue DMCA takedown notices for content found on the open web.

Monitor product license use and misuse; monitor data pub-sub APIs for accidental and intentional misuse.
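The baseline-and-alert recommendation above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the subscriber names, the monthly counts, and the choice of three standard deviations as the boundary are all assumptions made for the example. Only the ~6,000 monthly average and the 40x spike come from the case described earlier.

```python
from statistics import mean, stdev

def usage_threshold(history, k=3.0):
    """Upper alert boundary: mean + k sample standard deviations of past monthly counts."""
    if len(history) < 2:
        raise ValueError("need at least two months of history")
    return mean(history) + k * stdev(history)

def flag_anomalies(baselines, current, k=3.0):
    """Return {user: (count, boundary)} for users whose current count exceeds their boundary."""
    flagged = {}
    for user, count in current.items():
        history = baselines.get(user)
        if not history or len(history) < 2:
            continue  # no usable baseline yet; new accounts need separate handling
        limit = usage_threshold(history, k)
        if count > limit:
            flagged[user] = (count, round(limit))
    return flagged

# Hypothetical monthly content-access counts per subscriber (illustrative values).
baselines = {
    "subscriber_a": [5800, 6100, 5950, 6200, 6000, 5900],
    "subscriber_b": [1200, 1350, 1100, 1280, 1300, 1250],
}
# subscriber_a jumps to roughly 40x its usual volume; subscriber_b stays in range.
current = {"subscriber_a": 240_000, "subscriber_b": 1320}

alerts = flag_anomalies(baselines, current)
for user, (count, limit) in alerts.items():
    print(f"ALERT {user}: {count} accesses this month, boundary {limit}")
```

A per-user boundary matters here because a volume that is normal for one subscriber can be a 40x anomaly for another. The value of k is a tuning choice; a real deployment would likely also account for seasonality and for contractually permitted usage.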

Insights and Takeaways - What Security Analysts Must Know

Accessing copyrighted content secured behind a paywall, and reproducing or distributing copyrighted materials without consent, may expose one to liability. The copyright owner is entitled to recover actual damages suffered.

Security Analysts play an important role in detection engineering. While automated bots may look like mere pests in web logs, Analysts must be cognizant of authenticated bots accessing copyrighted content in violation of established Service Level Agreements. Refer findings of abuse to Business Analysts, Contract Management and concerned stakeholders to verify content entitlements and the methods approved for data consumption.
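One simple way to start looking for authenticated bots is to check how regular an account's request timing is. The sketch below is a heuristic only, under stated assumptions: the function names, the 20-request minimum, and the 0.1 cutoff are hypothetical choices, and the timestamps are synthetic. It will catch only naive automation, since a scraper can randomize its delays.

```python
from itertools import accumulate
from statistics import mean, pstdev

def interval_cv(timestamps):
    """Coefficient of variation of the gaps between consecutive requests (seconds)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # too little data, or all requests share one timestamp
    return pstdev(gaps) / mean(gaps)

def looks_automated(timestamps, min_requests=20, max_cv=0.1):
    """Flag a session whose request spacing is suspiciously uniform."""
    if len(timestamps) < min_requests:
        return False
    cv = interval_cv(timestamps)
    return cv is not None and cv < max_cv

# Synthetic sessions: a script fetching every 2 seconds vs. an irregular reader.
bot_session = [i * 2.0 for i in range(50)]
human_gaps = [3, 45, 7, 120, 15, 60, 2, 90, 30, 8] * 3
human_session = list(accumulate(human_gaps))

print("bot session flagged:", looks_automated(bot_session))
print("human session flagged:", looks_automated(human_session))
```

A flag from a check like this is a lead for investigation and referral, not proof of a contract breach. It works best when combined with volume baselines and entitlement checks.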

While testing and implementing detection engineering, do not get fixated on server-side analysis alone. Why? Bots can be engineered to leave exactly the same fingerprints as humans, so server-side log file analysis alone may not suffice. Collect client-side signals. [Warning: there are privacy issues connected with this method. Consult your privacy practitioner before employing this technique.]

The fields of Natural Language Processing (NLP) and Deep Learning are teaching computers to read text, make sense of it, and perform complex tasks better than humans. In the context of legal text, the use of stolen data, that is, labelled and structured content obtained unlawfully and unethically, may create complications for regulated users. Attorneys could be disciplined for knowingly using AI products developed from stolen content that infringes copyright. Attorneys must tread carefully when using suspect products, or risk being disciplined by the bar or even sued by their clients.

NLP libraries and cloud ML toolkits are creating opportunities for legal text mining and tech development at an unprecedented rate. Expect looting and unauthorized reproduction of copyrighted legal content to continue, using bots and other sophisticated methods that may not leave traditional digital forensic traces.

In using technologies such as Ross Intelligence in legal practice, lawyers must understand their ethical obligation: the advice the client receives must be the result of the lawyer's independent judgment and not guided by [Q&A] or predictions made by legal tech tools built on stolen content.

The role of a Security Analyst tasked with monitoring and securing legal technologies will be stretched beyond log analysis and dashboard viewing. Legal text content is a precious, ever-evolving commodity. Demand for processed content, case outcome predictions, judge analytics and more creative legal tech products will make crooks innovate. Expect "white collar" tactics to evolve to a level that makes the role of the Security Analyst inconsequential while curated, high-quality, expert-analysed content is looted in plain sight.

Disclaimers and Disclosures

The author has not received any compensation from the parties involved in this litigation. The author has formal education in legal research and has built NLP-based legal tech tools as a hobby.

Richard Stiennon

On a mission to provide actionable insights and foster informed decision-making with complete data on the cybersecurity industry.

4 years ago

This is great Pradeep. Thanks for posting. I love the "What does this mean for security analysts" part.