Do Web Scraper Bots Violate the Computer Fraud and Abuse Act? What Security Analysts Can Do While Awaiting the Supreme Court's Decision

Pradeep K.

Published: May 3, 2020

Publicly visible LinkedIn profiles have no data privacy rights. Anonymous web scraping bots can scrape such users' content without authorization.

Anonymous web scraping bot controversy

LinkedIn and HiQ, both Silicon Valley darlings, are at loggerheads over the collection of social media user profile data. The controversy has now reached the United States Supreme Court.

LinkedIn claims that the scraping of social media user data by anonymous, unauthorized bots constitutes hacking. HiQ, on the contrary, claims it will suffer monetary loss if it is not permitted to scrape LinkedIn. Strangely, HiQ can scrape all the public data it wants until the Supreme Court rules on the controversy. For now, it is clearly a "huge win" for the party scraping LinkedIn, but is this truly a "win", or is it "hacking" under the Computer Fraud and Abuse Act (CFAA)?

What is the CFAA? The CFAA reflects America's baseline expectation that people are entitled to control their computing infrastructure, and it protects Americans against criminals who hack into computers to steal information, install malicious code, or delete files.

The Computer Fraud and Abuse Act (“CFAA”), codified at Title 18, United States Code, Section 1030, is the statute Federal prosecutors use to address cyber-based crimes. LinkedIn and its counsel are not Federal prosecutors, nor do they have standing to "charge" HiQ with a federal crime, which makes this case more sensational than one can imagine.

The controversy shines a light on the 95 million data scraping bots that scrape LinkedIn users' publicly posted content every day. 95 million bots/day x 365 days = 34,675,000,000 (about 34.7 billion) bots/year scraping LinkedIn users' public content! What impact do 95 million bots have on LinkedIn's infrastructure? Are these bots operating lawfully?

This is neither the first nor the last data scraping controversy. On April 10, 2018, people witnessed Facebook's CEO testifying to Congress about the infamous Cambridge Analytica data scandal. Cambridge Analytica obtained the data of Facebook users who had not explicitly given permission for access to their data, and developed technology to target people with conspiratorial thinking.

Scope of web scraping controversy

The crawler case concerns Computer Fraud and Abuse Act (CFAA) liability for web scraping of publicly available social media profile data. The law remains unsettled despite the blockbuster ruling in hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019).

The legislative history of the CFAA reflects an expectation that persons who “exceed authorized access” are insiders, while persons who access computers “without authorization” are outsiders (e.g., hackers). Can outside persons and organizations that operate the 95 million web scraping bots without LinkedIn's authorization be treated as hackers for the purposes of the CFAA? The Supreme Court would have to decide this question: whether HiQ's unauthorized access to LinkedIn's data constitutes hacking under the CFAA.

Background

In September 2019, the Ninth Circuit Court of Appeals decided an appeal [1] filed by LinkedIn seeking to overturn the trial court's grant of a preliminary injunction to HiQ Labs, Inc. At issue was LinkedIn's right to impose anti-scraping technical measures to prevent HiQ from collecting data from LinkedIn members' publicly accessible profiles.

HiQ's products predict which employees are at the greatest risk of being recruited away and identify skill gaps in the workforce. HiQ developed these products using LinkedIn data and monetized them.

Discussion

Although it obviously signals the court's view of the merits, a preliminary injunction does not end a lawsuit. Rather, it directs a party to do or refrain from doing something while the case is being litigated, though a permanent injunction may ultimately result once the case is over.

Invoking several provisions of federal and state computer and contract laws, HiQ obtained a preliminary injunction directing LinkedIn to cease its efforts to stop HiQ's screen scraping activity - LinkedIn appealed.

A preliminary injunction may only be granted if four distinct questions are answered affirmatively:

1. Is the requesting party [HiQ] likely to suffer irreparable harm, meaning harm that money cannot remedy, if the injunction isn't granted?
2. Does it appear that the requesting party is likely to succeed on the merits of the case?
3. Do the equities (the essential fairness of granting the injunction) favor the requesting party?
4. Would granting the injunction promote the public interest? This question cannot be answered favorably if granting the injunction would have a significant negative impact on third parties who aren't involved in the lawsuit.

The court found that the answer to each of these questions is “yes”:

1. As to the threat of irreparable harm, HiQ, whose only business is collecting data, asserted that it would be driven out of business if LinkedIn prevailed. The court agreed that this meets the first element of the four-prong injunction test.
2. The court reviewed federal and California law and found that HiQ is likely to succeed on the merits of the case. Specifically, it found that LinkedIn could not successfully argue that, by obtaining publicly available information, HiQ accesses the LinkedIn website “without authorization” in violation of the federal Computer Fraud and Abuse Act, the principal federal computer fraud statute. The court also rejected LinkedIn's arguments that HiQ was in violation of the Digital Millennium Copyright Act (DMCA) and California law.
3. Granting the injunction is fairer to HiQ than it is unfair to LinkedIn. The potential destruction of HiQ's business is a greater threat than LinkedIn's ostensible loss of members' goodwill resulting from HiQ's collection of their publicly available information.
4. Finally, granting the injunction would serve the public interest by maximizing the free flow of publicly available information on the Internet.

In March 2020, LinkedIn filed a petition asking the Supreme Court to review the case. LinkedIn asserts that the court should hear the appeal because, in 2003, the First Circuit Court of Appeals reached a starkly contrasting conclusion in EF Cultural Travel v. Zefer Corporation [2], a factually similar case.

What was the similarity? Zefer built a web scraper to scrape pricing data from EF's website. Just as LinkedIn discovered HiQ's scraping activities, EF discovered Zefer's crawlers and sued Zefer. EF prevailed. The Zefer court ruled that use of the scraper tool went beyond the "reasonable expectation" of ordinary users! The story ended with Zefer filing for bankruptcy.

This conflict between Circuit Courts may, but does not automatically, prompt the Supreme Court to decide which decision is correct. As of the date of this writing, the high court had not ruled on whether to grant LinkedIn's petition. Does this conflict in case law give HiQ the right to violate the CFAA?

Implications

The controversy has made its way up to the Supreme Court, which may resolve the disagreement between the Ninth and First Circuits and remand the case to the lower court.

Unless and until that occurs, only web scrapers located in one of the Ninth Circuit states can safely rely on the Ninth Circuit court's reasoning insofar as the CFAA, the DMCA or California state law are concerned.

If they haven't already done so, web scraping operators in the First Circuit jurisdictions (ME, MA, NH, PR and RI) and elsewhere throughout the United States would be well advised to review with their counsel the implications of the EF Cultural Travel decision for their operations [bypassing technical controls and scraping]. Even first-time CFAA and DMCA violations are serious: each is punishable by fines and imprisonment, up to and including life for repeat offenses. The Supreme Court's findings on the nexus between web scraping and the CFAA could be of enormous import.

Tips for Anti-Scraping Security Analysts

Develop an adversarial mindset while analyzing and investigating anonymous web scraping bot activity conducted without authorization.

Form an Anti-Scraping team to deceive, rate limit and disrupt data-stealing bots that violate user-agent directives. Test the rigor of anti-scraping technical controls. Explore creative applications of L4-7 switching and web engineering (e.g. changing the DOM/HTML markup structure) to achieve maximum impact when bots land on or move inside the property. Implement the crawl-delay directive. Remember, the stop sign is for the law abiding. Use CAPTCHA, but be aware that some of these controls might impact user experience.
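As one illustration of the rate limiting mentioned above, here is a minimal token-bucket sketch. It is an assumption about how a team might start, not a description of any product: the class names, the choice of client key (an IP address, an account ID or a bot fingerprint) and the rate and burst values are all hypothetical.

```python
import time


class TokenBucket:
    """Allows `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity   # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Refill in proportion to the time elapsed, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClientRateLimiter:
    """Keeps one bucket per client key (hypothetical: IP, account or fingerprint)."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_key, now=None):
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_key] = bucket
        return bucket.allow(now)


# Example: 1 request per second with a burst of 3 (illustrative values only).
limiter = ClientRateLimiter(rate=1.0, capacity=3)
burst = [limiter.allow("203.0.113.5", now=0.0) for _ in range(5)]
```

In this example the first three requests in the burst pass and the next two are refused until the bucket refills. A refused request could be answered with HTTP 429, slowed down, or routed to a decoy, depending on the team's deception plan.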

Should the log files indicate crawlers siphoning off PDF files, proceed immediately to check if the PDF files are leaking sensitive information.

Develop a feedback loop from the Anti-Scraping team to Engineering, Legal and other stakeholders who can help reach the desired end state. Define the end state. The mission statement must be crystal clear to every member of the team.

Augment the Anti-Scraping team with cross-functional, skilled personnel to perform qualitative and quantitative analysis of unauthorized bot impact. Develop a feedback loop from the analytics team back to Engineering personnel, Legal and other relevant stakeholders.

Measure the impact on network, bandwidth, packet loss, memory allocation errors, latency, API usage and any other measurements that could quantify loss, damage or harm to the infrastructure or business. (A must-read for Security Analysts: eBay v. Bidder’s Edge [3]. Take cues to quantify, qualify and characterize “harm.”)

Gather intelligence on adversarial bot proxies and bot crawling infrastructure. Capture each crawler's true IP address and user-agent identity.
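One simple way to begin quantifying bandwidth impact is to total the bytes served per user agent from the web server's access log. The sketch below assumes the common "combined" log format; the sample log lines, the function names and the rule used to label an agent as a bot are hypothetical and would need to be replaced with the organisation's own data and classification.

```python
import re
from collections import defaultdict

# Assumed "combined" access log format: IP, identity, user, [time],
# "request", status, bytes, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def bandwidth_by_agent(lines):
    """Return total response bytes per user-agent string."""
    totals = defaultdict(int)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        size = match.group("bytes")
        totals[match.group("agent")] += 0 if size == "-" else int(size)
    return dict(totals)


def bot_share(totals, is_bot):
    """Fraction of all bytes served to agents that `is_bot` flags."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    bot_bytes = sum(n for agent, n in totals.items() if is_bot(agent))
    return bot_bytes / total


# Hypothetical sample lines for illustration only.
sample = [
    '203.0.113.5 - - [03/May/2020:10:00:00 +0000] "GET /profile/1 HTTP/1.1" '
    '200 3000 "-" "python-requests/2.23"',
    '198.51.100.7 - - [03/May/2020:10:00:01 +0000] "GET /home HTTP/1.1" '
    '200 1000 "-" "Mozilla/5.0"',
]
totals = bandwidth_by_agent(sample)
share = bot_share(totals, lambda agent: "python-requests" in agent)
```

A figure such as the share of bytes served to bots, tracked over time, is the kind of measurement that can feed the damage quantification discussed in this article. The user-agent header is trivially forged, so it should be treated as one signal among several.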

Prepare adversarial data crawling investigative reports for executive oversight with specific anti-scraping countermeasure recommendations. Include damage quantification in the reports.

Investigate attempts to bypass anti-scraping controls. Exploit machine learning, NLP and deep learning techniques to characterize bots and bot behavior.

Attempt to discover the nature of the data sought by the bots, if possible. Are specific pages and content being sought, or are the bots indiscriminately scraping the entire web content? Targeted bots can be detected and characterized to understand whether specific content is considered valuable by the data scrapers. Understand which bots are scraping content, price and contact information.

Develop a Trust-No-Bot, Verify-All-Bots policy. Similar to the financial sector's KYC, develop a Know Your Bot (KYB) program. This is in addition to whitelisting and scoring bot risk. There may be more to a targeted bot than meets the eye.
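The question of targeted versus indiscriminate scraping can be approached by measuring how concentrated each client's requests are on one section of the site. The following is a rough sketch under stated assumptions: the client identifiers, the use of the first path segment as the "section" and the 0.8 focus threshold are all hypothetical choices.

```python
from collections import Counter, defaultdict


def section_of(path):
    """Reduce a URL path to its first segment, e.g. /pricing/42?x=1 -> /pricing."""
    parts = path.split("?")[0].strip("/").split("/")
    return "/" + parts[0] if parts[0] else "/"


def crawl_profile(requests):
    """Summarise (client_id, path) pairs into per-client focus statistics."""
    per_client = defaultdict(Counter)
    for client, path in requests:
        per_client[client][section_of(path)] += 1

    profile = {}
    for client, sections in per_client.items():
        total = sum(sections.values())
        top_section, top_count = sections.most_common(1)[0]
        profile[client] = {
            "requests": total,
            "top_section": top_section,
            "focus": top_count / total,  # share of requests to the top section
        }
    return profile


def classify(entry, focus_threshold=0.8):
    """Label a client 'targeted' when most requests hit one section."""
    return "targeted" if entry["focus"] >= focus_threshold else "broad"


# Hypothetical request data for illustration only.
requests = [
    ("bot-a", "/pricing/1"), ("bot-a", "/pricing/2"),
    ("bot-a", "/pricing/3?x=1"), ("bot-a", "/pricing/4"),
    ("bot-b", "/pricing/1"), ("bot-b", "/about"),
    ("bot-b", "/blog/x"), ("bot-b", "/"),
]
profile = crawl_profile(requests)
labels = {client: classify(entry) for client, entry in profile.items()}
```

Here "bot-a" concentrates entirely on the pricing section while "bot-b" spreads across the site. The top section of a targeted bot hints at what the scraper's operator considers valuable.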

Identify authenticated crawling activity by monitoring new or existing user accounts with high levels of HTTP/HTTPS activity and no purchases.

Detecting outliers in product views and social media profile views helps to distinguish human from non-human activity. Track and detect competitor-led scraping of prices, product catalogues, user profiles and other specific data.

Several bots are built from open source libraries or code available on GitHub. Develop an adversarial application security plan with the end goal of disrupting crawling activity by changing the structure of web content (a landing page, for example) depending on the bot's signature and the code base used to develop the bot, thus defeating the crawler's logic.
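A minimal sketch of the two ideas above, high activity without purchases and outlier detection on views, might look like the following. It uses a robust score based on the median and the median absolute deviation (MAD) rather than any particular machine learning method; the account names, view counts and the 3.5 threshold are assumptions for illustration.

```python
from statistics import median


def robust_outliers(views_by_account, threshold=3.5):
    """Flag accounts whose view count is far above the median.

    Uses the modified z-score 0.6745 * (x - median) / MAD, which is less
    distorted by the extreme values it is trying to find than a mean-based score.
    """
    counts = list(views_by_account.values())
    med = median(counts)
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        return []  # no spread to measure against; fall back to other signals
    flagged = []
    for account, count in views_by_account.items():
        score = 0.6745 * (count - med) / mad
        if score > threshold:
            flagged.append(account)
    return sorted(flagged)


def likely_scrapers(views_by_account, buyers, threshold=3.5):
    """Outlier accounts that have never made a purchase."""
    return [a for a in robust_outliers(views_by_account, threshold)
            if a not in buyers]


# Hypothetical daily view counts per authenticated account.
views = {"u1": 12, "u2": 15, "u3": 9, "u4": 14, "u5": 11, "u6": 5000, "u7": 4000}
buyers = {"u7"}
outliers = robust_outliers(views)
suspects = likely_scrapers(views, buyers)
```

In this example both "u6" and "u7" are view outliers, but only "u6" has no purchases and remains a suspect. A heavy buyer can still be a scraper, so the purchase filter is a prioritisation aid, not proof either way.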
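Many scrapers locate data with fixed CSS selectors, so one way to change the structure of web content is to rotate the class names used in the markup. The sketch below derives class names from a secret and a rotation counter; the function names, the secret and the idea of rotating per "epoch" (for example per hour or per suspected bot signature) are hypothetical, and the site's stylesheets would have to be generated with the same names.

```python
import hashlib
import hmac
import html


def rotated_class(logical_name, secret, epoch):
    """Map a stable logical name (e.g. 'price') to a class name that changes
    whenever `epoch` changes, so hard-coded selectors stop matching."""
    message = f"{epoch}:{logical_name}".encode()
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return "c" + digest[:10]  # leading letter keeps it a valid CSS class


def render_price(price, secret, epoch):
    """Render a price element whose class name depends on the current epoch."""
    cls = rotated_class("price", secret, epoch)
    return f'<span class="{cls}">{html.escape(price)}</span>'


# Hypothetical secret and epochs for illustration only.
secret = b"rotate-me"
first = render_price("$19.99", secret, epoch=1)
second = render_price("$19.99", secret, epoch=2)
```

The visible content is identical in both renderings, but a crawler that relies on the class name from the first rendering will not find the price in the second. This raises the scraper's maintenance cost; it does not stop a bot that parses by text pattern or renders the page.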

Get creative with hot-linking. Instead of preventing hot-linking across the infrastructure, develop a controlled hot-linking plan to study the adversary.

Blacklist aggressive bots, and routinely submit an offender list to ISPs and reputation service providers.

Involve legal personnel to fine-tune website terms and conditions to restrict and warn offenders. Incorporate appropriate language to create a deterrent effect. Develop legal and IT processes to automatically send cease and desist notices, duly reviewed by Counsel, to offenders.

Develop data-consumption APIs protected by auth tokens as the only method of extracting data. If such mechanisms already exist, develop mechanisms to control data use and misuse.
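As a minimal sketch of token-protected data access, the example below issues and verifies an HMAC-signed token with an expiry time. It is an assumption about one possible design, not a recommendation over an established standard such as OAuth 2.0; the token layout, function names, client identifier and secret are all hypothetical.

```python
import hashlib
import hmac
import time


def issue_token(client_id, secret, expires_at):
    """Create a token of the form client_id:expiry:signature."""
    payload = f"{client_id}:{int(expires_at)}"
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def verify_token(token, secret, now=None):
    """Return the client id for a valid, unexpired token, otherwise None."""
    now = time.time() if now is None else now
    try:
        client_id, expires, signature = token.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return None  # malformed token
    payload = f"{client_id}:{expires}"
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    if now >= expires_at:
        return None
    return client_id


# Hypothetical secret and client for illustration only.
secret = b"server-side-secret"
token = issue_token("partner-1", secret, expires_at=1000)
valid = verify_token(token, secret, now=500)
expired = verify_token(token, secret, now=1000)
tampered = verify_token(token.replace("partner-1", "partner-2"), secret, now=500)
```

Because every request carries an identified client, usage can be metered and revoked per client, which supports the control of data use and misuse described above.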

Several bot controls can be deployed at the web server configuration or OS level and are not covered explicitly in this analysis. Traditional web server and OS hardening are recommended.

Develop internal training and education programs to ramp up Security Analysts' knowledge of NLP, GANs, machine learning, deep learning, facial recognition and similar disruptive technologies, which may pose threats that are not easy to detect or mitigate. Fusing a LinkedIn user's profile with the same user's Facebook profile could generate enriched data, leading to privacy concerns. Traditional approaches and compliance with privacy laws may not be sufficient to protect user privacy. A bot could be trained to mimic a human or to impersonate a specific human user; these are different problems. Develop appropriate TTPs.

Short-staffed or resource-constrained? Consider cloud-based solutions such as the Cloudflare Bot Management service.

Security Analysts involved in web scraping defense must hope for the best and prepare for the worst. Expect automated data looting to rise until the Supreme Court rules one way or the other. Businesses such as HiQ may ramp up their data crawling efforts and devise a Plan B to obtain data indirectly for their own survival.

Internal mock trials and moot court exercises may create opportunities to fine-tune an organisation's strategy, especially if one anticipates getting involved in litigation connected to data harvesting. Present live data, analytics and persuasive arguments to demonstrate loss, damage or harm. Understand the role of expert witnesses and scientifically sound evidence in litigation. Understand what kind of data-driven evidence can rise above the threshold needed to prevail in litigation. Well-prepared, scientifically sound evidence could be pressed into action in confidential pre-litigation settlement negotiations. Empower inside counsel with evidence of damage, losses, lost revenues, lost customers, cart abandonment, etc.

The "profession" of automated and anonymous web scraping resides squarely in the grey zone of legal and ethical business practices. Expect this market to overlap with the dark web and introduce complexities that Security Analysts may not be ready or willing to handle without additional assistance. Forewarned is forearmed. Prepare now to defend territory from anonymous, unauthorized bots; don't wait for the Supreme Court's decision.

What would you do to resolve the dispute between LinkedIn and HiQ? Is scraping OK or not OK?

Disclaimers

The author was not compensated for writing this article and does not endorse any products or services mentioned in it. This article does not constitute legal advice. All trademarks are the property of their respective owners.

References:

1. hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019)

2. EF Cultural Travel BV v. Zefer Corp., 318 F.3d 58 (1st Cir. 2003)

3. eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058, 2000 U.S. Dist. LEXIS 7287, 54 U.S.P.Q.2D (BNA) 1798 (N.D. Cal. May 23, 2000)
