Reddit Scraping
As with many social media sites, public data on Reddit can be extremely valuable for marketing, analysis, research & even machine learning. However, it can be a little tedious to get this data - say, to build an Excel database of Reddit comments for a niche you want to analyze.
While there are a lot of web scraping tools available, many of them are unofficial Reddit scrapers that attempt (poorly) to crawl the public Reddit website and extract the data into CSV files or a database. These tools violate the Reddit Terms of Service, making it legally risky for third parties to profit from this activity - lawsuits have already been filed.
Instead of relying on screen scraping, we'll introduce a Reddit scraper that uses the official Reddit API rather than attempting to crawl and parse the Reddit website.
Reddit Data API
Reddit actually encourages you to build new apps and work with their data, so long as you do it with the official Reddit API. Some common examples we'll cover are scraping Reddit posts and Reddit comments using these API endpoints.
1. Create a Reddit App
Simply log in to Reddit with your account and view your Authorized API Applications where you can find the button on the bottom of the screen labeled “create another app…” Click this and you’ll be able to create a new app:
Be sure to select “Script for personal use” instead of “web app.”
When prompted for a redirect URI, you can simply enter http://localhost:8080 as your local machine's development address for now.
2. Get Your Client ID & Client Secret
Once the app is created, you’ll see it under your list of apps. Your Client ID is the string provided here (highlighted in this screenshot):
To see your secret, click the “edit” link and you’ll see it listed under “secret.” Save this in a password manager and do not share it publicly with anyone!
3. Generate an Access Token
Now, with your client ID & secret, we’re ready to generate an access token to use with the API endpoints. Run the following curl command, substituting your own values for $CLIENT_ID and $CLIENT_SECRET:
curl -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
  -A "my-script/0.1" \
  -d "grant_type=https://oauth.reddit.com/grants/installed_client&device_id=00000000000000000000" \
  https://www.reddit.com/api/v1/access_token

Note that the grant parameters go in the POST body (-d) rather than the query string, and Reddit's API expects a descriptive User-Agent header (-A) - substitute your own app name there.
You’ll then get back your access token in the response - save this somewhere to use for other endpoints.
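If you'd rather fetch the token from Python, the same request can be sketched with the standard library. This is a minimal sketch of the step above; the User-Agent app name is a placeholder you should replace with your own:

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.reddit.com/api/v1/access_token"

def build_token_request(client_id, client_secret,
                        device_id="00000000000000000000"):
    """Build the token POST: Basic auth header + form-encoded grant body."""
    body = urllib.parse.urlencode({
        "grant_type": "https://oauth.reddit.com/grants/installed_client",
        "device_id": device_id,
    }).encode()
    creds = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {creds}",
            "User-Agent": "my-token-script/0.1",  # placeholder; use a descriptive app name
        },
        method="POST",
    )

def parse_token(payload: dict) -> str:
    """Pull the bearer token out of the decoded JSON response."""
    return payload["access_token"]

if __name__ == "__main__":
    req = build_token_request("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
    with urllib.request.urlopen(req) as resp:
        print(parse_token(json.load(resp)))
```

Using only the standard library keeps the example dependency-free; if you already use the requests package, the same call is a one-liner with requests.post.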
As an alternative, you can use a Python wrapper like PRAW to handle authentication via its Python API. You'll then be tied to PRAW for all interactions with the Reddit API, but it lets you do more than scrape data, since you can also write back to Reddit using the submission object, as shown in the posting-to-Reddit docs.
Scrape Reddit Endpoints for Data
If you’d rather work with the raw Reddit API, we’ll briefly show how to get information like Reddit post titles or specific post comments.
For scraping Reddit posts, you can see the Subreddit Posts Endpoint, which will query the Official Subreddit Posts Endpoint on your behalf and parse out the data into downloadable CSV files. This will contain a list of recent posts and you can see the other options available for sorting and filtering the posts in the response.
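The official listing endpoints return JSON shaped like {"data": {"children": [...]}}, where each child wraps a post under its own "data" key. A sketch of fetching a subreddit's newest posts with the bearer token from step 3 and flattening them into CSV rows (the token, subreddit, and User-Agent values are placeholders):

```python
import csv
import json
import urllib.request

API_BASE = "https://oauth.reddit.com"

def fetch_listing(token: str, subreddit: str, limit: int = 25) -> dict:
    """GET the newest posts for a subreddit using the OAuth bearer token."""
    req = urllib.request.Request(
        f"{API_BASE}/r/{subreddit}/new?limit={limit}",
        headers={
            "Authorization": f"Bearer {token}",
            "User-Agent": "my-scraper/0.1",  # placeholder
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def listing_to_rows(listing: dict) -> list:
    """Flatten a Reddit Listing into (title, author, score, permalink) rows."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append((post["title"], post["author"], post["score"], post["permalink"]))
    return rows

if __name__ == "__main__":
    listing = fetch_listing("YOUR_ACCESS_TOKEN", "python")
    with open("posts.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "author", "score", "permalink"])
        writer.writerows(listing_to_rows(listing))
```

Swapping /new for /hot or /top changes the sort, mirroring the sorting options mentioned above.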
Once you have a post that’s interesting (or a list of them), you can scrape its comments using the Subreddit Post Comments Endpoint, which will query the Official Subreddit Post Comments Endpoint on your behalf and parse out the response. You’ll need to pay attention to the pagination instructions though: in the official Reddit API, collapsed comment threads come back as “more” stub objects that must be expanded with a separate request.
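To show what that looks like against the raw API: the official comments endpoint returns a two-element array - a Listing holding the post itself, then a Listing of top-level comments - and any "more" stubs in the tree need a follow-up call to /api/morechildren. A sketch that walks the tree and skips the stubs (the token and post ID are placeholders):

```python
import json
import urllib.request

API_BASE = "https://oauth.reddit.com"

def fetch_comment_listings(token: str, subreddit: str, article_id: str):
    """GET a post's comment tree. The response is a two-element array:
    [0] a Listing wrapping the post, [1] a Listing of top-level comments."""
    req = urllib.request.Request(
        f"{API_BASE}/r/{subreddit}/comments/{article_id}",
        headers={
            "Authorization": f"Bearer {token}",
            "User-Agent": "my-scraper/0.1",  # placeholder
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_comments(comment_listing: dict) -> list:
    """Recursively collect (author, body) pairs, skipping 'more' stubs -
    the pagination placeholders that require /api/morechildren to expand."""
    out = []
    for child in comment_listing["data"]["children"]:
        if child["kind"] == "more":
            continue  # collapsed thread; expand separately via /api/morechildren
        data = child["data"]
        out.append((data["author"], data["body"]))
        replies = data.get("replies")
        if isinstance(replies, dict):  # an empty reply list comes back as ""
            out.extend(extract_comments(replies))
    return out

if __name__ == "__main__":
    listings = fetch_comment_listings("YOUR_ACCESS_TOKEN", "python", "abc123")
    for author, body in extract_comments(listings[1]):
        print(author, ":", body[:80])
```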