« Twitter Tweets Data Scraper

Twitter API Historical Archive Scraper

Download Data to Excel & CSV Files

Steve Spagnola
Written by Steve Spagnola
Last updated April 11, 2024

Scraping Twitter Historical Data

IMPORTANT: Twitter Pro API Access is required to scrape historical data with no time limit (up to 2006) from full archive search. If you only have basic access, you should check out Scraping Twitter Search Results for the same functionality, but limited to the past 7 days.

Scrape Twitter's Historical Archive

The immense number of Tweets in existence since 2006 is mind boggling, and harvesting these Tweets may prove useful for generating Twitter historical reports & trend analysis. However, we cannot simply scrape (or download) ALL of the Tweets in existence, they would likely not fit on your hard drive. Instead, we can use the Twitter API to query Tweets containing a general keyword relating to our interest, all the way back to 2006.

1. Build & Test Your Query

If you’re lucky enough to have access to the Twitter Pro API Access, then you can use the Search Tweets Scraper with the “Recent or All” set to all so you can search Twitter from 2006 for your query!

Before building your search query, you should take a look at Twitter’s Building a Search Query documentation with some examples. You can use more advanced operators like AND, OR and - to require or exclude keywords appear in the Tweets you get back.

The functionality is similar to Twitter’s advanced search on its website, but you have a few more options while on the Pro API Tier, such as the ability to filter by tagged location for Tweets. And just like the Twitter website, the results returned are near real time so you can capture just about everything on Twitter up until this current moment.

So let’s say we’re interested in scraping Tweets about “craft beer” - we can start by trying the query craft beer on the Search Tweets Integration:

Provide a search term

And we can get a sample of the results back. Note that while many of the Tweets contain the phrase “craft beer” exactly, some of them match on semantically similar terms, such as “craft breweries”:

Craft Beer initial search results

If we only want to make sure we only get back Tweets that have the phrase “craft beer” in it, we can simply add quotes to our query and instead search for "craft beer":

Using quotes for exact phrase matches

Check out the other search options under “Operators” on Twitter’s Query Documentation - you can do a lot more and even search for Tweets by emoji or stock ticker mention!

The default behavior of this endpoint is to only return results from the past 7 days (available to Basic Twitter API users), but if you have access to the Pro API Twitter API, you’ll want to set the “Recent or All” input to all and you can even explore changing the Time Range inputs to ensure your search query will work for Tweets in the past.

Pro Tip: Exclude Retweets, Quotes & Replies

If you want to exclude retweets, quotes and replies from your search results, just tack on -is:retweet -is:reply -is:quote to the end of your query. E.g. if we’re searching for "craft beer" then we’d make the query "craft beer" -is:retweet -is:reply -is:quote to only get back original Tweets from authors.

2. Estimate Number of Results

While you’re building your query, you want to keep in mind how many results you’ll get back versus how many results you actually need or have the capacity to analyze. If you pick too broad of a query, you may get overwhlemed with the volume of data you get back, but if you pick an overly-specific query you may not get back enough historical data.

Fortunately though, Twitter provides a “Tweet Count” endpoint that allows us to provide the same query as above (any text query that will work with the V2 search API) and get back the total number of Tweets within the time range provided.

Provide a search query like in Step 1

If you leave the time ranges blank, you’ll get back the total count for either the past 7 days (if you don’t override “Recent or All”) or set “Recent or All” to all to get a count for all Tweets matching your query from all time. However, we suggest providing a time range to when you want to get the results back from:

Provide Tweet Time Range

Now when you get the results back, you’ll see how many total Tweets match your search query within the time range provided:

Total Tweet Count

So now you can keep tweaking your search query until you’re comfortable with the number of results you get back. For example, if I remove the quotes around my query, the total number of results jumps to 2,646 - so based on how many Tweets you’d like to analyze you can tune this to your needs.

3. Configure Response Fields

The V2 Twitter API returns a very “thin” response by default, just the Tweet text as seen in step 1. If you only care about the actual text of the Tweets matching your query (don’t care about the author, timestamp, metadata, etc…) then you can skip this step. However, odds are you probably want some “metadata” back about the Tweets your researching.

For this, we’ll revisit the Search Tweets Integration with our query that we constructed in steps 1 & 2 and pay close attention to the “Expansions” and “Fields” inputs.

Under the “Expansions” input, you can see an “example” value we provide of all the possible extra information you can get back. However, in practice you will want to select the data that’s relevant to you. E.g. if you want to know more about the authors of the Tweets matching your query, then you should enter in author_id in expansions as follows:

Enter in author_id to get back author details

Now Twitter will return an additional column called author_id with each Tweet and then a separate collection called includes > users with data about each author’s Twitter account (this saves resources if duplicate authors match the search) - and we can see basic information about each author in the includes > users collection:

Author ID and Basic Author Fields

While this provides us with more information, it would be better if we could get more data back about the individual Tweets (like timestamp, like count, reply count, etc…) and users (follower count, number of Tweets, etc…). To do this, we need to tell Twitter to give us more fields back, so we can paste in the provided “Example” values into the Fields inputs for Tweets and Users:

Specify additional fields for Tweets & Users

Now your response will look as follows, where you can download additional fields about each Tweet matching our query:

Tweet additional columns

And more information, such as follower count, from the Tweet authors:

Tweet author additional columns

Once you’re happy with the type of data you’re getting back, it’s time to run this at scale and get back millions of results!

4. Scrape Tweets at Scale with Workflows

Now that you’re an expert on the Search Tweets Integration, you’ll want to get more than 100 results back per data request. To do this, we need to use “pagination” and pass the meta.next_token value in the <root> collection from a response into the pagination_token input of the next response to get the next 100 results and so on.

Fortunately, we don’t have to do all of this manually and can use a “workflow” version of this endpoint to automatically paginate and combine the results for us into bulk CSV files.

4.1 Import the Workflow Formula

Check out the Twitter Tweets & Archive Search - Pagination workflow formula page and click “Import” to add the workflow to your account.

4.2 Specify Workflow Inputs

Enter the inputs like you did with the endpoint version in the previous step. Put in your query, time range, fields and specify all under “Recent or All” to get ALL Tweets back (not just the past 7 days):

Provide workflow inputs

Pro tip: You can enter multiple queries (one per line), but it will take longer to run as the workflow will make queries for each search term separately, taking more time.

4.3 Execute the Workflow

Review your inputs and then execute the workflow. You can optionally give it a name to help you keep track of multiple runs:

Execute the workflow

4.4 Download Results

Once your workflow finishes running, you’ll see the results in the green section below (which will combine the responses from all pages together for you):

Archive search results

You’ll probably just want the “Tweets.csv” file, but you can also download other collections like Expanded Users, Tweet Mentions, Hashtags, etc… which are provided as separate collections since one Tweet can contain multiple mentions, etc…

You’ll notice my search only returned about 2,000 results - this is because I’m not on the Pro API Tier - but if you are then the workflow will continue to paginate until Twitter cuts it off, downloading and saving all results here as CSV files (if downloading in the millions, the workflows will split up to not get overloaded, but you can combine the individual files afterwards… just contact support about this if you get stuck).