⚖️ Is Data Scraping Legal?

Disclaimer: I am not a lawyer - I just know a lot about data scraping after onboarding hundreds of clients onto Stevesie Data as the founder. The following are general guidelines I’ve seen in industry for what constitutes responsible data scraping practices. Nothing in this article guarantees that what you are doing is legal or illegal to any extent. This is not legal advice!


Data Scraping in Industry

If data scraping were illegal, we would not have Google. Search engines like Google, Bing & DuckDuckGo scrape data from millions of websites that want to be scraped so that they show up in search results.

However, it’s not just the “big guys” who do scraping. There are thousands (perhaps millions) of smaller businesses that also rely on data scraping: for research (such as pricing), social media marketing (e.g. Instagram analysis tools), travel aggregation (discount websites), etc. You’ve probably used one or more of these services directly or indirectly without even knowing it.

So what are all these companies doing right that allows them to keep operating without legal issues?

From my experience, I’ve found that you must abide by these 2 simple guidelines to avoid any legal problems with data scraping:

You have the right to access the data (e.g. it is public, or you have permission to view it from the online account you are scraping from)
You are accessing the data at a reasonable rate that causes no harm to the data provider (e.g. at the same rate as if you were doing it manually or had hired someone to do it)

That’s it! Just follow these 2 simple rules and 99% of the time you will be operating in the spirit of the law and following industry best practices.

Illegal Cousins of Data Scraping

Since data scraping is a very broad term, it sometimes gets mixed up with other illegal practices that violate the 2 guidelines above: the content is not authorized to be accessed and/or the rate at which it is accessed causes harm to the data provider or is deemed “theft” if the content itself is financially valuable.

Data Scraping Legality Matrix

DDoS (Distributed Denial of Service)

When you have access to a resource (e.g. public data that a website is hosting), but you access it at a rate that far exceeds reasonable access - and spread the requests across multiple proxies so that they are distributed - you are causing harm to the website operator and may be denying access to the service for other users. This is typically a deliberate & malicious attack and very serious (capable of bringing down large websites), but it can also happen accidentally because of a misbehaving bot.

Since this can cause real harm and financial hardship to website owners, it is a very serious concern and you must avoid doing this at all costs, or you will get in trouble.

What it means: When scraping data, you need to be careful not to exceed the rate at which a normal person would access the data, otherwise you may be guilty of causing the website harm. Thankfully, most modern-day APIs and sites implement rate limiting and will simply block you before you can do too much damage. Usually one request every few seconds is well below the threshold that can cause harm.
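One way to stay at “one request every few seconds” is to put a small throttle in front of every request your scraper makes. The sketch below is only an illustration: the `Throttle` class, its `min_interval` default of 3 seconds and the injectable `clock` / `sleep` parameters are my own assumptions, not part of any particular library.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, None at start

    def wait(self):
        """Pause until min_interval has passed since the previous call.

        Returns the number of seconds that were slept.
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage: call throttle.wait() immediately before each request.
# throttle = Throttle(min_interval=3.0)
# for url in urls:
#     throttle.wait()
#     ...fetch url with the HTTP client of your choice...
```

The first call returns immediately; every later call sleeps only for whatever is left of the interval, so slow responses do not add extra waiting on top.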

Theft

When the data you are accessing has financial value (compared to user-contributed social media content), you may be subject to the $5,000 threshold of the CFAA (Computer Fraud and Abuse Act), where you can face civil & criminal charges.

Think about a paywall website hosting financially valuable data like a buffet, where you pay to access a little of the data at a time (using the site manually). If you’re caught scraping this data in excess, then you could be charged with trying to steal the entire buffet “to-go” rather than consuming it like you’re intended to, inside the buffet (or website in this example).

So you need to be very considerate when scraping data that you must pay to access. Take Netflix for example - your paid subscription is meant for you and only you to use - you would never share the password with anyone ;).

Netflix probably doesn’t mind too much if they see 2 or 3 people accessing the same account at the same time - they’re in the business of keeping their users happy. However, if they see you download the entire catalog in a matter of 5 minutes, they will probably flag your account and ask you some questions. They can claim your excessive use breached your agreement and if they really want to, they can try and claim you stole more than $5,000 worth of their content.

What it means: If you have a paid subscription to a service that does not offer an API (with explicit limits of how much data you can scrape for what price), you probably don’t want to try scraping its data unofficially because they can claim theft.

Hacking

Hacking generally means you’ve gone way beyond just recording data you were already intended to see from a website and are now reverse-engineering the website to get information that was never intended for you to see. This clearly violates the right to access guideline, even if the data is technically “publicly available” but requires a clever “hack” to obtain.

Consider the AT&T iPad Breach, where the hacker was able to scrape one of the largest lists of email addresses ever using a security hole in the AT&T website. This not only violated the right to access guideline (as the hacker was getting email addresses not intended for him), but it also violated the rate of access guideline as he did this at scale to build a very large list.

What it means: Don’t do it. If your gut tells you that you shouldn’t be seeing the data you’re seeing, be a good citizen and report this as a security hole to the website. You may even get a bug bounty! 🤑

Copyright Issues

Even if you’re scraping public data from an official API (e.g. public facts about the world, like the color of the sky), the website operator serving that data to you still owns the copyright to the structure of the data - basically the JSON structure you get back from most websites. E.g. if I publish this on my website:

{"sky": {"color": "blue"}}

And you copy this and then re-publish this JSON verbatim on your website without my permission, it is copyright infringement as you are copying how I represented this public fact. However, if you simply write “the sky is blue” on your website or re-structure this data format, you are in the clear since you’re simply re-publishing a public fact alone (and not how I represented it).

What it means: When you scrape data, you need to be mindful of the structure in which it came and you cannot re-publish it as-is, otherwise this would be copyright infringement as you would be re-publishing the data structure without permission.
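To illustrate what “re-structuring” can look like in practice, here is a minimal sketch that takes the nested JSON from the example above and turns it into a flat list of facts in a layout of your own. The `restructure` function and the `subject` / `attribute` / `value` field names are made up for this example, and whether a given transformation is enough for your situation is not something code can decide.

```python
import json


def restructure(raw_json):
    """Flatten {"subject": {"attribute": value}} into a list of fact records.

    The output field names are an arbitrary choice made for this sketch.
    """
    data = json.loads(raw_json)
    facts = []
    for subject, attributes in data.items():
        for attribute, value in attributes.items():
            facts.append(
                {"subject": subject, "attribute": attribute, "value": value}
            )
    return facts


scraped = '{"sky": {"color": "blue"}}'
print(restructure(scraped))
```

The facts themselves (sky, color, blue) survive, but they are stored in your own schema rather than the one you scraped.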

Getting Blocked

While most companies in 2020 are down-to-earth and either offer public APIs or don’t throw a fit if you want to do some casual scraping, there are some websites out there that are not so happy about having their data scraped.

While a website can try to block you, put in its terms of service that automated access is prohibited, or even directly tell you to stop via a cease & desist letter - it still is not illegal to scrape public data even if the company tells you to stop. See “Victory! Ruling in hiQ v. Linkedin Protects Scraping of Public Data” for more information.

What it means: Some companies really don’t want to be scraped and are willing to ban users & IP addresses (at the cost of blocking legitimate users) to prevent scraping. If they make it difficult, you may need to hire someone to build you a custom scraper (using Selenium & Chromedriver) and may need to invest in residential or 4G proxies. There’s always a way to do it; it’s just a matter of difficulty.

Conclusions

Be considerate of where you are scraping your data from. Remember, there are people on the other end and you need to be considerate of them! Always check if the service offers an official API first and use that.

If an official API is not available, you can try using traffic interception to uncover hidden APIs you can use instead. As long as these hidden APIs are revealing data that you otherwise have access to, then it is most likely not illegal to scrape them on an ongoing basis as long as you are respectful of rate limiting and “cause no harm.”
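Being “respectful of rate limiting” usually means backing off when the server tells you to slow down. The sketch below assumes the server signals this with HTTP status 429 (Too Many Requests) or 503 and, optionally, a `Retry-After` header given in seconds. The `retry_delay` function and its `base` and `cap` defaults are hypothetical choices, and the header lookup assumes a plain dict with that exact key spelling.

```python
def retry_delay(status_code, headers, attempt, base=2.0, cap=300.0):
    """Return how many seconds to wait before retrying, or None if no wait is needed.

    attempt is the zero-based count of retries already made for this request.
    """
    if status_code not in (429, 503):
        return None  # not a "slow down" response

    # Honor the server's own wait time when it is given as whole seconds.
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())

    # Otherwise (header missing, or given as an HTTP date, which this sketch
    # does not parse) fall back to capped exponential back-off.
    return min(base * (2 ** attempt), cap)


print(retry_delay(429, {"Retry-After": "30"}, attempt=0))
print(retry_delay(429, {}, attempt=3))
```

If the function returns a number, sleep for that long before sending the same request again; if it keeps returning larger values, that is a good hint to lower your overall request rate.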