Download dataset via Twitter API using Tweepy

Suyash Dabhane
4 min read · May 17, 2021

In this tutorial, I will show you how to fetch and download tweets that contain specific words via the Twitter API using the Tweepy library. After that, we will store that dataset in a CSV file.

Image credit: realpython

What is Tweepy?

Tweepy is an open-source, easy-to-use Python library that gives you a convenient way to access the Twitter API from your own code.

Now that you know what Tweepy is, let’s jump into the process.

Step 1 — Getting Started with the Twitter API

Before we can download any data, we need Twitter API keys. For that, we need to apply for a Twitter developer account and receive approval.

You can apply and get access to the Twitter API by following these steps — click here.

After approval, you will have a Twitter developer account, and you can find your API keys and access tokens by logging in to it.

Step 2 —Install Tweepy

To install Tweepy, run the following command. The tqdm package imported in Step 3 can be installed the same way.

pip install tweepy

Step 3 — Setting up a connection with Twitter using the API keys

1. Importing the required libraries

import tweepy
import csv
from tqdm import tqdm  # helpful for tracking progress
import string
import datetime

2. Initializing API Key Credentials

consumer_key = 'enter_your_consumer_key_here'
consumer_secret = 'enter_your_consumer_secret_key_here'
access_token = 'enter_your_access_token_key_here'
access_token_secret = 'enter_your_access_token_secret_key_here'

Note: you will have to paste your own API keys into the above code snippet. You can find all of these keys by logging in to your developer account.
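If you would rather not paste secrets directly into the script, one option is to read them from environment variables. The following is only a minimal sketch and not part of the original steps: the variable names (TWITTER_CONSUMER_KEY and so on) and the helper load_credentials are hypothetical, chosen for illustration.

```python
import os

# Hypothetical environment variable names, assumed for this sketch only.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [var for var in CREDENTIAL_VARS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in CREDENTIAL_VARS.items()}
```

With the variables exported in your shell, creds = load_credentials() gives you a dict whose values can be assigned to consumer_key, consumer_secret, access_token and access_token_secret.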

3. Authentication on Twitter

We need to authenticate with our API keys to set up a connection with the Twitter platform. Tweepy makes this authentication process pretty smooth using its ‘tweepy.OAuthHandler’ class.

So first we need to create an instance of this class, passing in our consumer key and consumer secret.

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

The next step is to give the handler our access token and access token secret. The access token is the “key” for opening the Twitter API.

auth.set_access_token(access_token, access_token_secret)

So now that we have our OAuthHandler equipped with an access token, we are ready to move forward.

api = tweepy.API(auth, wait_on_rate_limit=True)

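This creates the API object that we will use for every request to Twitter, authenticated with our keys. Setting ‘wait_on_rate_limit=True’ tells Tweepy to wait when a rate limit is reached instead of raising an error.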
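The three authentication calls above can also be collected in one place. This is a minimal sketch under my own assumptions: the function name connect is hypothetical, and importing tweepy inside the function is just a choice for this example so that the helper can be defined even before Tweepy is installed.

```python
def connect(consumer_key, consumer_secret, access_token, access_token_secret):
    """Build an authenticated API object from the four credentials."""
    import tweepy  # imported lazily; an assumption of this sketch, not a requirement

    # Same three steps as in the tutorial, in the same order.
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth, wait_on_rate_limit=True)
```

You would then write api = connect(consumer_key, consumer_secret, access_token, access_token_secret). Calling api.verify_credentials() afterwards is one way to check that the keys are accepted, since it makes a real request to Twitter.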

Step 4 — Fetching and storing tweets

Here I will show you how to:

  1. Fetch tweets that contain the given words.
  2. Save the dataset as a CSV file.

Let’s start.

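First, let’s create and open a CSV file, held in a variable named ‘csvFile’, where we are going to store the fetched tweets. We open it in append mode (‘a’), so running the script again adds new rows, including another header row, to the existing file.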

csvFile = open('tweet_dataset.csv', 'a', newline='')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(["id", "timestamp", "tweet_text"])

In the second line of the above code, we create a ‘csv.writer’, which will write the fetched tweets to our CSV file.

In the third line, we are declaring the column names by writing the first row of our dataset.
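Because the file is opened in append mode, the header row is written again on every run. One possible way around that, sketched here with a hypothetical helper named open_tweet_csv, is to write the header only when the file is new or empty. Opening the file with encoding='utf-8' is an extra assumption of this sketch.

```python
import csv
import os


def open_tweet_csv(path):
    """Open `path` for appending; write the header only if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    csv_file = open(path, "a", newline="", encoding="utf-8")
    writer = csv.writer(csv_file)
    if is_new:
        writer.writerow(["id", "timestamp", "tweet_text"])
    # The caller is responsible for calling csv_file.close() when done.
    return csv_file, writer
```

In the tutorial's terms this would be used as csvFile, csvWriter = open_tweet_csv('tweet_dataset.csv'), followed by csvFile.close() once all tweets are written.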

Now, let’s define the list of words that should appear in the tweets we fetch. Suppose we need COVID-19 related tweets, so let’s pick words related to COVID-19.

query = ['covid-19', 'hospital', 'vaccine', 'mortality']

So, we will be fetching tweets that contain any of the words in the above list, searching for one word at a time.
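Twitter's search syntax also supports an OR operator, so an alternative to searching word by word is to send one combined query. This is a different approach from the one this tutorial uses (it gives a single pool of results rather than 500 per word), and the helper name build_or_query is hypothetical.

```python
def build_or_query(words):
    """Join search words with OR, quoting any multi-word phrase."""
    parts = ['"{}"'.format(word) if " " in word else word for word in words]
    return " OR ".join(parts)


query = ['covid-19', 'hospital', 'vaccine', 'mortality']
combined = build_or_query(query)
print(combined)  # covid-19 OR hospital OR vaccine OR mortality
```

The resulting string could be passed as the q argument of a single search instead of looping over the list.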

Finally, the main code:

for qu in query:
    # Note: Tweepy 4.x renamed api.search to api.search_tweets
    for tweet in tqdm(tweepy.Cursor(api.search, q=qu, tweet_mode="extended",
                                    lang="en", since="2020-04-01").items(500)):
        # Skip retweets, then write one row per remaining tweet
        if (not tweet.retweeted) and ('RT @' not in tweet.full_text):
            csvWriter.writerow([tweet.id, tweet.created_at,
                                tweet.full_text.encode('utf-8')])

Didn’t understand the code? Let me read it for you.

First, there is a ‘for’ loop. Inside it there is another ‘for’ loop, which in turn contains an ‘if’ statement. (Beware of indentation errors.)

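So basically we are saying: for every word ‘qu’ in the list named ‘query’, search for tweets that contain that word. The tweets should be in English and tweeted after 1st April 2020. Keep in mind that Twitter's standard search only returns tweets from roughly the last seven days, so older tweets will not show up regardless of this date.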

After that, for every fetched tweet, we check whether it is a retweet (‘tweet.retweeted’ is set, or the text contains ‘RT @’). If it is, we skip it; otherwise we write the id, timestamp and tweet_text of that tweet into the .csv file as a row using csvWriter.

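The tweet_text column holds the ‘utf-8’ encoded text. Note that in Python 3 this writes the bytes representation (the text wrapped in b'...') to the file. To store plain text instead, write tweet.full_text directly and open the file with encoding='utf-8'.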

Here, for every word in the list ‘query’, we fetch up to 500 tweets.

So the dataset has at most 4 words in ‘query’ * 500 = 2000 entries. The actual number is usually lower, because retweets are skipped.
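To see the filtering step on its own, here is a small sketch that separates it from the network call. It is an illustration under stated assumptions, not the tutorial's code: the helper write_original_tweets is hypothetical, it writes the text as a plain string rather than encoded bytes, and it additionally skips ids it has already seen (a tweet containing two of the search words would otherwise be stored twice). The sample tweets are made-up stand-ins so the example runs without API access.

```python
import csv
import io
from types import SimpleNamespace


def write_original_tweets(tweets, writer, seen_ids=None):
    """Write non-retweets to `writer`, skipping ids already seen. Returns rows written."""
    seen_ids = set() if seen_ids is None else seen_ids
    written = 0
    for tweet in tweets:
        is_retweet = tweet.retweeted or 'RT @' in tweet.full_text
        if is_retweet or tweet.id in seen_ids:
            continue
        seen_ids.add(tweet.id)
        writer.writerow([tweet.id, tweet.created_at, tweet.full_text])
        written += 1
    return written


# Made-up stand-in tweets: one original, one retweet, one duplicate id.
sample = [
    SimpleNamespace(id=1, created_at="2021-05-17 10:00:00", retweeted=False,
                    full_text="vaccine news"),
    SimpleNamespace(id=2, created_at="2021-05-17 10:05:00", retweeted=False,
                    full_text="RT @someone: vaccine news"),
    SimpleNamespace(id=1, created_at="2021-05-17 10:00:00", retweeted=False,
                    full_text="vaccine news"),
]
buffer = io.StringIO()
count = write_original_tweets(sample, csv.writer(buffer))
print(count)  # 1
```

In the real loop, the tweets argument would be the Cursor's .items(500) iterator, and sharing one seen_ids set across all words keeps the duplicate check working between searches. The returned count also shows how far below 2000 the final dataset ends up.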

In the end, your dataset will look something like this: one row per tweet, with the columns id, timestamp and tweet_text.

In this way, we learned how to use Tweepy and the Twitter API to fetch and download a dataset of tweets that contain specific words.

Connect with me on LinkedIn: linkedin.com/in/suyash-dabhane/
