Web scraping with R

Part 2: Using APIs


In the previous tutorial, we saw how to scrape data in a way that essentially mimicked what a human user would do: we went to a URL, identified the information we wanted by parsing the HTML code and using CSS selectors, then we “copied” that information into a dataframe in R or into separate text files. This method is a common and very versatile approach to scraping, but it is not the only, or even the most efficient, way of getting data from websites.

In this tutorial, we’ll see how to use R to request information from specific sites using public APIs. APIs are another very common way to access and acquire data from the Web.

What is an API?

Instead of downloading a dataset or scraping a site directly, you can often request data from a website through what’s called an Application Programming Interface (API). Many large sites like Reddit, Twitter, Spotify, and Facebook provide APIs so that other users can access data quickly, reliably, and legally. This last bit is important: always check if a website has an API before scraping by other means. The following brief explanation of APIs is adapted from this post at dataquest.io.

‘API’ is a general term for the place where one computer program interacts with another, or with itself. We will be working with web APIs here, where two different computers—a client and server—interact with each other to request and provide data, respectively. APIs provide a way for us to request clean and curated data from a website. When a website sets up an API, it is essentially setting up a computer that waits for data requests from other users.

Once this computer receives a data request, it will do its own processing of the data and send it to the computer that requested it. From our perspective as the requester, we will need to write code in R that creates the request and tells the computer running the API what we want. That computer will then read our code, process the request, and return nicely-formatted data that we can work with in existing R libraries.
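The request-and-response cycle just described can be sketched with the {httr} package, which we’ll use directly later in this tutorial. Here I point it at GitHub’s public API simply as a stand-in for any web API:

```r
library(httr) # for sending HTTP requests

# The client (our R session) sends a request to the server...
response <- GET("https://api.github.com")

# ...and the server sends back a response containing a status code
# (200 means success) along with the requested content
status_code(response)
```

The details of building requests like this are covered in the final section; for now the point is just that a request goes out and a structured response comes back.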

Why is this valuable? Contrast the API approach with the “pure” web scraping that we used in the previous tutorial. When a programmer scrapes a web page, they receive the data in a messy chunk of HTML. While we were able to use libraries like {rvest} to make parsing HTML text easier, we still had to go through multiple steps to identify the page URLs and the correct bits of HTML and CSS code to give us what we wanted. This wasn’t too hard with our toy examples, but it can often be quite complicated with real sites.

APIs offer a way to get data that we can immediately use, which can save us a lot of time and frustration. Many commonly used sites have R packages that are specifically dedicated to interfacing with those sites’ APIs. {rtweet} is one such library, for getting tweets from Twitter’s API. Other examples include {RedditExtractoR}, {twitteR}, {Rfacebook}, {geniusr}, and {spotifyr}. Otherwise, you can use the {httr} and {jsonlite} packages to work with APIs more generally. These are a bit more advanced, and we will only briefly go into them at the end of this session (but see here and here for an introduction).

Getting started

Again, some familiarity with R and RStudio is a necessity. This tutorial assumes you are familiar with a few of the core aspects of the “tidy” coding style, particularly the use of pipes, i.e. |> or %>%. If you are new to R, I recommend the following to get started:

  • swirl. This is a tutorial library that works directly in R. There really isn’t a better way to learn R!
  • R for Data Science by Hadley Wickham and Garrett Grolemund (Wickham & Grolemund 2016). This covers all the basics for working with R using the “tidy” approach to programming.

tip: It’s not necessary, but I highly recommend familiarising yourself with how to use RStudio projects, and particularly the {here} package for managing file paths. These will make your life much easier as you use R for more and more projects.

R libraries

Libraries we’ll be using:

library(tidyverse) # for data wrangling
library(tictoc) # for timing processes
# library(tidytext) # for text mining
library(here) # for creating consistent file paths
library(usethis) # for editing environment files

In the following section we’ll take a quick look at a few different packages for interfacing with specific APIs. The first two are “wrapper” packages, which are designed to make accessing specific sites much easier. The latter two are general packages for working with any API you might want.

# R libraries for interfacing with APIs
library(RedditExtractoR) # for scraping reddit forums
library(rtweet) # for getting tweets

library(httr) # for interfacing with APIs in general
library(jsonlite) # for parsing JSON

Working with API wrapper packages

Reddit

We’ll start with reddit. There’s a lot you could do with reddit data, and fortunately the {RedditExtractoR} package makes it very easy to get data.

For example, suppose we want to see what people are saying about Spinosaurus, a genus of dinosaur that turns out to have been even cooler and more unusual than we thought.

The find_thread_urls() function returns a dataframe of comment threads and other information for a given search term. We’ll use the simple search term “spinosaurus.” This will give us all the comment threads matching this term.

spinosaur_threads <- find_thread_urls(keywords = "spinosaurus")
parsing URLs on page 1...
parsing URLs on page 2...

Notice that this function can take a while to run depending on your terms and how many threads there are.
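Since we loaded {tictoc} at the start, we can wrap the call in tic() and toc() to see exactly how long it takes. The elapsed time will of course vary with your connection and the number of matching threads:

```r
library(tictoc)
library(RedditExtractoR)

tic("searching reddit") # start the timer
spinosaur_threads <- find_thread_urls(keywords = "spinosaurus")
toc() # prints a message like "searching reddit: X sec elapsed"
```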

Let’s see what we have in there.

spinosaur_threads %>%
  glimpse()
Rows: 159
Columns: 7
$ date_utc  <chr> "2022-04-27", "2022-05-08", "2022-04-28", "2022-04-30", "202~
$ timestamp <dbl> 1651087254, 1651979718, 1651149930, 1651331593, 1649876282, ~
$ title     <chr> "Glitch in Jp3 chaos theory", "All Cannon Prehistoric Creatu~
$ text      <chr> "I was playing jp3 chaos theory, after transporting the Cera~
$ subreddit <chr> "Jurassicworldevol2", "JurassicPark", "Paleontology", "Clash~
$ comments  <dbl> 0, 15, 3, 2, 1, 2, 0, 0, 0, 0, 0, 0, 21, 0, 0, 0, 0, 0, 0, 0~
$ url       <chr> "https://www.reddit.com/r/Jurassicworldevol2/comments/udbi6v~

So we have 159 threads with this keyword. If we arrange the threads by number of comments, we can look at the threads with the most comments. (As usual, you can scroll through the columns with the little arrow in the upper right.)

spinosaur_threads %>%
  select(comments, title, date_utc, subreddit) %>%
  arrange(desc(comments)) # put most popular thread first

So now let’s pull the content of the top thread. We’ll use get_thread_content(), which takes a vector of URLs as its argument. The output is a list containing information about the threads and the comments. Here I’ll pull out the comments from the largest thread, the one titled “What dinosaurs do you think are underrated?”

spinosaur_content <- spinosaur_threads |>
  arrange(desc(comments)) |>
  pull(url) |> # pull out the `url` column as a vector
  first() |> # just use the first element
  get_thread_content()

glimpse(spinosaur_content$comments)
Rows: 122
Columns: 10
$ url        <chr> "https://www.reddit.com/r/Dinosaurs/comments/ugg4c3/what_di~
$ author     <chr> "MistyLuHu", "Welikefortnite07", "spooderfbi", "GovernorSan~
$ date       <chr> "2022-05-02", "2022-05-02", "2022-05-02", "2022-05-02", "20~
$ timestamp  <dbl> 1651462551, 1651479343, 1651483426, 1651499622, 1651500493,~
$ score      <dbl> 78, 20, 15, 9, 3, 2, 8, 5, 9, 4, 0, 1, 1, 9, 1, 1, 1, 31, 3~
$ upvotes    <dbl> 78, 20, 15, 9, 3, 2, 8, 5, 9, 4, 0, 1, 1, 9, 1, 1, 1, 31, 3~
$ downvotes  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ golds      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ comment    <chr> "Deinonychus. Velociraptor gets all kinds of recognition be~
$ comment_id <chr> "1", "1_1", "1_2", "1_2_1", "1_2_2", "1_2_3", "1_3", "1_4",~

We can see the comments.

spinosaur_content$comments |>
  select(comment)

This is cool. The comments will clearly need some cleaning up, but the data is all there. Super simple!

How you might use this information is open-ended, but this seems like a very useful tool for studying language in use.

Watch for changes! Earlier versions of this package offered much more functionality, and many of the older functions no longer work. This can happen as packages are continually updated to deal with changes to APIs. API policies are changing all the time, so package creators have to be vigilant to adapt. If you find the code here does not work, check the version you are using and make sure it is up to date.

Twitter

Now let’s look at Twitter. I like the {rtweet} package for getting tweets very easily. But in order to use it, you need a Twitter account so you can authorize {rtweet} to use your specific account credentials. This is because there are limits to how many tweets you can download in a given time period. In the next section I’ll go over how to set up your credentials for using Twitter’s API.

Creating a Twitter app

The first thing to know is that every request to the Twitter API has to go through an “app.” Normally, someone else has created the app for you, but now that you’re using Twitter programmatically, you need to create your own app. (It’s still called an app even though you’ll be using it through an R package.)

To create a Twitter app, you need to first apply for a free developer account by following the instructions at https://developer.twitter.com. Once you have been approved (which may take some time), go to the developer portal and click the “Create App” button at the bottom of the page. You’ll need to give your app a name. The name can be whatever you want, but it needs to be unique across all Twitter apps.

After you’ve named your app, you’ll see a screen that gives you some important information: your API key, your API secret key, and your Bearer token:

Twitter API keys and bearer token

You’ll only see these once, so you need to record them in a secure location. For now you can copy them to a text file. Don’t worry though—if you don’t record these or lose them, you can always regenerate them.

Once you’ve done this, click the “App settings” button, and go to the “Keys and tokens” tab. In addition to the API key and secret you recorded earlier, you’ll also need to generate an “Access Token and Secret,” which you can get by clicking the “Generate” button next to “Access Token and Secret”:

Twitter Access keys and tokens

Again, copy or write these down somewhere secure. What I like to do is create a list in R with these values, and save that to my project directory. This way I can easily load it when I start a new session.

# Your values will differ
jgtwitterapp_keylist <- list(
  api_key = "xxxxxxxxxxx",
  api_secret = "xxxxxxxxxxx",
  access_token = "xxxxxxxxxxx",
  access_secret = "xxxxxxxxxxx",
  bearer_token = "xxxxxxxxxxx"
)
# save to file
saveRDS(jgtwitterapp_keylist, here("keys", "jgtwitterapp_keylist.rds"))
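An alternative, and the reason {usethis} was loaded earlier, is to store the keys as environment variables in your .Renviron file rather than in an .rds file. The variable names below are just my own choices:

```r
library(usethis)

# Opens your .Renviron file for editing; add lines such as
#   TWITTER_API_KEY=xxxxxxxxxxx
#   TWITTER_API_SECRET=xxxxxxxxxxx
# then restart R so the new values are loaded
edit_r_environ()

# In any later session, read the values without hard-coding them
api_key <- Sys.getenv("TWITTER_API_KEY")
api_secret <- Sys.getenv("TWITTER_API_SECRET")
```

This keeps secrets out of your scripts entirely, which is especially useful if your code is under version control.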

So when I want to use this with {rtweet}, I can just load it and create a token with create_token().

jgtwitterapp_keylist <- readRDS(here("keys", "jgtwitterapp_keylist.rds"))

my_token <- create_token(
  app = "jgtwitterapp", # the name of my app
  consumer_key = jgtwitterapp_keylist$api_key,
  consumer_secret = jgtwitterapp_keylist$api_secret,
  access_token = jgtwitterapp_keylist$access_token,
  access_secret = jgtwitterapp_keylist$access_secret
)

Now we should be able to use {rtweet}!

Search tweets

In this example, let’s look for tweets using the word cheugy, which has been a popular topic of discussion the past few years (see e.g. articles in The Guardian, The Telegraph, Vox, and The New York Times). If you’re not familiar with the term, Urban Dictionary defines it as

Another way to describe aesthetics/people/experiences that are basic. It was coined by a now 23 year old white woman in 2013 while a student at Beverly Hills High School, on whom the irony is apparently lost. According to the New York Times, “cheugy (pronounced chew-gee) can be used, broadly, to describe someone who is out of date or trying too hard.”

So this is not a brand new word, but it’s new enough for most people to find interesting. More important, it’s a perfect example of how fast language can change, and Twitter, perhaps more than any other source, can help us investigate such rapid (and maybe fleeting) changes in the language of social media.

For a simple search we’ll use the search_tweets() function in {rtweet}, which takes a search term and returns a dataframe. The function returns tweets from all languages, so we’ll make sure to include "lang:en" in our search term to limit our searches to English. To keep it simple, we’ll just do the most recent 200 tweets with this word in them (note this may get different results each time you run it).

cheugy_tweets <- rtweet::search_tweets(
  "cheugy lang:en", # the terms to search for
  n = 200, # the number of tweets to collect
  include_rts = FALSE, # don't include retweets
  token = my_token # the token we created
)

Normally you’d want many more than this, and you can collect up to 18k tweets in a 15-minute period, but I’m just keeping it quick and simple here. Now what do we get…

glimpse(cheugy_tweets)
Rows: 200
Columns: 90
$ user_id                 <chr> "1283085920300924929", "294964014", "138731540~
$ status_id               <chr> "1524728325633105921", "1524725481639186432", ~
$ created_at              <dttm> 2022-05-12 12:28:55, 2022-05-12 12:17:37, 202~
$ screen_name             <chr> "nicfavic23", "zippy_gp", "the_stuffoflife", "~
$ text                    <chr> "“cheugy” is an excellent word/adjective and s~
$ source                  <chr> "Twitter for iPhone", "Twitter for iPhone", "T~
$ display_text_width      <dbl> 94, 231, 257, 208, 176, 210, 94, 29, 80, 21, 2~
$ reply_to_status_id      <chr> NA, NA, "1524653422779445249", "15246454230336~
$ reply_to_user_id        <chr> NA, NA, "1387315404385652741", "286096686", "1~
$ reply_to_screen_name    <chr> NA, NA, "the_stuffoflife", "saghir7", "ReemAbd~
$ is_quote                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
$ is_retweet              <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
$ favorite_count          <int> 0, 3, 0, 0, 0, 1, 2, 3, 0, 0, 3, 0, 2, 5, 1, 0~
$ retweet_count           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ quote_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ reply_count             <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ hashtags                <list> NA, NA, NA, NA, "girlboss", NA, NA, NA, NA, N~
$ symbols                 <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ urls_url                <list> NA, NA, "vox.com/the-goods/2257…", NA, NA, NA~
$ urls_t.co               <list> NA, NA, "https://t.co/9WHgcWPRvH", NA, NA, NA~
$ urls_expanded_url       <list> NA, NA, "https://www.vox.com/the-goods/225700~
$ media_url               <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ media_t.co              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ media_expanded_url      <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ media_type              <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ ext_media_url           <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ ext_media_t.co          <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ ext_media_expanded_url  <list> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ ext_media_type          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ mentions_user_id        <list> NA, NA, NA, <"286096686", "148153617">, <"131~
$ mentions_screen_name    <list> NA, NA, NA, <"saghir7", "ghulamesposito">, <"~
$ lang                    <chr> "en", "en", "en", "en", "en", "en", "en", "en"~
$ quoted_status_id        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_text             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_created_at       <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ quoted_source           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_favorite_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_retweet_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_user_id          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_screen_name      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_name             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_followers_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_friends_count    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_statuses_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_description      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ quoted_verified         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_status_id       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_text            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_created_at      <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
$ retweet_source          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_favorite_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_retweet_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_user_id         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_screen_name     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_name            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_followers_count <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_friends_count   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_statuses_count  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_location        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_description     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ retweet_verified        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ place_url               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ place_name              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ place_full_name         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ place_type              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ country                 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ country_code            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ geo_coords              <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, ~
$ coords_coords           <list> <NA, NA>, <NA, NA>, <NA, NA>, <NA, NA>, <NA, ~
$ bbox_coords             <list> <NA, NA, NA, NA, NA, NA, NA, NA>, <NA, NA, NA~
$ status_url              <chr> "https://twitter.com/nicfavic23/status/1524728~
$ name                    <chr> "The Gay Shoe Clerk (1903)", "greatest hits ed~
$ location                <chr> "", "TX", "", "Durham, England", "Los Angeles,~
$ description             <chr> "he/him/they/them. i got my feet on the ground~
$ url                     <chr> NA, NA, "https://t.co/TTXMVb0WmR", NA, "https:~
$ protected               <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
$ followers_count         <int> 88, 373, 27, 1640, 925, 132, 851, 902, 460, 12~
$ friends_count           <int> 298, 158, 109, 1523, 650, 763, 430, 905, 399, ~
$ listed_count            <int> 1, 2, 0, 11, 9, 6, 5, 2, 0, 0, 42, 1, 45, 95, ~
$ statuses_count          <int> 4071, 12058, 674, 9891, 2441, 4593, 51737, 205~
$ favourites_count        <int> 17296, 11755, 159, 27194, 2206, 44849, 66154, ~
$ account_created_at      <dttm> 2020-07-14 17:08:27, 2011-05-08 03:34:04, 202~
$ verified                <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
$ profile_url             <chr> NA, NA, "https://t.co/TTXMVb0WmR", NA, "https:~
$ profile_expanded_url    <chr> NA, NA, "http://www.thestuffoflife.in", NA, "h~
$ account_lang            <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
$ profile_banner_url      <chr> "https://pbs.twimg.com/profile_banners/1283085~
$ profile_background_url  <chr> NA, "http://abs.twimg.com/images/themes/theme1~
$ profile_image_url       <chr> "http://pbs.twimg.com/profile_images/135057377~

There’s A LOT of information here, and you can go to ?search_tweets to see the full run down of what these columns contain. But you should be able to see some potentially useful info. For example, we can look at who tweeted (screen_name), the date and time of the tweet, and the location they supplied, if any.

cheugy_tweets %>%
  select(screen_name, created_at, location)

So there’s lots we could do. We can see the text of the tweets in the text column.

cheugy_tweets %>%
  select(text)

We can see who is using this word and when.

cheugy_tweets %>%
  select(screen_name, created_at)

It’s worth noting that the query syntax is a bit different from other packages. From the ?search_tweets help file:

Spaces behave like boolean “AND” operator. To search for tweets containing at least one of multiple possible terms, separate each search term with spaces and “OR” (in caps). For example, the search q = "data science" looks for tweets containing both “data” and “science” located anywhere in the tweets and in any order. When “OR” is entered between search terms, query = "data OR science", Twitter’s REST API should return any tweet that contains either “data” or “science.” It is also possible to search for exact phrases using double quotes. To do this, either wrap single quotes around a search query using double quotes, e.g., q = '"data science"' or escape each internal double quote with a single backslash, e.g., q = "\"data science\"".

So just a warning to be careful with your searches. For example, if we wanted to look for cheugy or cheug (as in “I’m a ‘cheug’ and proud of it”) we’d need to specify the query like so.

cheugy_tweets <- rtweet::search_tweets(
  q = "cheugy OR cheug lang:en", # the terms to search for
  n = 200, # the number of tweets to collect
  include_rts = FALSE, # don't include retweets
  token = my_token
)
cheugy_tweets %>%
  select(text)

Stream tweets

We can also collect tweets from the live stream in real time. (With an empty query, stream_tweets() returns a random sample of approximately 1% of all tweets; with a keyword query like ours, it filters the stream for matching tweets.) Here we’ll stream for 30 seconds.

cheugy_stream <- stream_tweets(
  q = "cheugy OR cheug lang:en",
  timeout = 30, # time in seconds
  token = my_token
)

Now in this case we don’t get anything, which is not surprising for such a rare word in a 30-second window.

cheugy_stream
NULL

But we could stream all tweets mentioning cheugy or cheug for a week and see what happens.

# stream tweets for a week (60 secs * 60 mins * 24 hours *  7 days)
# DO NOT RUN THIS!!
stream_tweets(
  q = "cheugy OR cheug lang:en",
  timeout = 60 * 60 * 24 * 7,
  file_name = here("data_raw", "live_cheugy_tweets.json"),
  parse = FALSE
)

A couple of things to note here. I’ve set this to save the unparsed output to the file “live_cheugy_tweets.json” in my project’s “data_raw” folder. This is a better method for longer streams. As noted in the ?stream_tweets documentation,

By default, parse = TRUE, this function does the parsing for you. However, for larger streams, or for automated scripts designed to continuously collect data, this should be set to FALSE as the parsing process can eat up processing resources and time. For other uses, setting parse to TRUE saves you from having to sort and parse the messy list structure returned by Twitter.

We can easily load and parse this to a tidy dataframe with parse_stream() like so.

cheugy_tweets2 <- here("data_raw", "live_cheugy_tweets.json") |>
  parse_stream()

Get friends and followers

It’s also easy to track who follows and who is followed by a given user. The get_friends() function collects a list of accounts followed by a particular user.

# get user IDs of accounts followed by @Rbloggers
rblog_friends <- get_friends("@Rbloggers", n = 1000)

This gives us a dataframe with a single column of user IDs. If we want more info on these users we can find it with lookup_users().

rblog_friends %>%
  pull(user_id) %>% # pull out the column as a vector
  lookup_users()

The same applies to the get_followers() function, which gets the accounts following a user.

# get user IDs of accounts following @Rbloggers
rblog_followers <- get_followers("@Rbloggers", n = 1000)

rblog_followers %>%
  slice_head(n = 25) |> # get the first 25 followers
  pull(user_id) %>% # pull out the column as a vector
  lookup_users()

Both get_friends() and get_followers() allow a maximum of 5,000 results in a single API call, and you are limited to 15 such calls per 15 minutes. See the documentation for these functions for more information.
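For accounts with more than 5,000 followers, recent versions of {rtweet} let you request more and wait out the rate limit automatically via the retryonratelimit argument. This is a sketch; check ?get_followers for your installed version, as the argument has changed across releases:

```r
library(rtweet)

# Request up to 75,000 followers (15 calls x 5,000 results);
# retryonratelimit = TRUE tells {rtweet} to pause and resume
# whenever the 15-minute rate limit is hit
big_followers <- get_followers(
  "@Rbloggers",
  n = 75000,
  retryonratelimit = TRUE
)
```

Be aware that a request like this can take over an hour to complete, since the function sleeps between batches.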

There are lots of other functions for looking into Twitter data. You can find more information at https://docs.ropensci.org/rtweet/

Get timelines

You can also get up to the most recent 3,200 tweets from a single user with get_timeline().

bbc_tml <- get_timeline(
  "@BBCNews",
  n = 100,
  token = my_token
)
bbc_tml

It works! Annoyingly, at the moment it seems that you need to run this same create_token() process each time you start a new R session. This is a bit clunky, but it’s the only way I have found to make it work consistently. The get_token() function should be able to load your app token automatically, but I think the code is a bit buggy. You can see this thread for more discussion.

Locating tweets

We can also try to locate tweets geographically. One issue with {rtweet} is that it does not seem to work well for getting locations and geo coordinates for tweets. The package provides the lookup_coords() function for looking up coordinates, but this relies on information from Google’s API, which apparently does not play well with others. From the lookup_coords() help file:

Since Google Maps implemented stricter API requirements, sending requests to Google’s API isn’t very convenient. To enable basic uses without requiring a Google Maps API key, a number of the major cities throughout the world and the following two larger locations are baked into this function: ‘world’ and ‘usa.’ If ‘world’ is supplied then a bounding box of maximum latitude/longitude values, i.e., c(-180, -90, 180, 90), and a center point c(0, 0) are returned. If ‘usa’ is supplied then estimates of the United States’ bounding box and mid-point are returned. To specify a city, provide the city name followed by a space and then the US state abbreviation or country name. To see a list of all included cities, enter rtweet:::citycoords in the R console to see coordinates data.

We can see some of the cities listed here.

rtweet:::citycoords

So for example, if we wanted to collect tweets from here in Birmingham, we’d use the name in the dataframe above in lookup_coords() like so.

lookup_coords("birmingham england")
$place
[1] "birmingham england"

$box
sw.lng.lng sw.lat.lat ne.lng.lng ne.lat.lat 
 -1.966667  52.366667  -1.866667  52.466667 

$point
      lat       lng 
52.416667 -1.916667 

attr(,"class")
[1] "coords" "list"  

And we’ll include a geocode argument in our search to get only tweets from within these coordinates.

bham_tweets <- search_tweets(
  q = "lang:en",
  n = 1000,
  include_rts = FALSE, # don't include retweets
  geocode = lookup_coords("birmingham england")
)

bham_tweets %>%
  select(screen_name, location, place_full_name, geo_coords)

There are several sources of geographic information in our tweets.

  • location: This is the user-defined location for an account’s profile. This can really be anything, so you have to be careful.
  • place_name and place_full_name: When users decide to assign a location to their Tweet, they are presented with a list of candidate Twitter Places, and these contain the human-readable names of those places.
  • geo_coords, coord_coords, bbox_coords: These contain the latitude and longitude coordinates of the tweet, if available.

Let’s see what we find in our example.

bham_tweets %>%
  count(location, sort = TRUE) # most common locations first

# add lat/lng columns computed from available geo data
bham_tweets <- lat_lng(bham_tweets)
bham_tweets %>%
  select(lat, lng)

There’s a lot more you can do with {rtweet}, and I encourage you to check out some of the online guides available, e.g. the creator Michael Kearney’s help here and here.

Working with general APIs

Not all APIs have convenient packages dedicated to their use, so it’s likely that you may need to interface with an API directly. To do this we’ll use the {httr} package to work with web APIs. Again, web APIs involve two computers: a client and a server. The client submits a Hypertext Transfer Protocol (HTTP) request to the server, and the server returns a response to the client. The response contains status information about the request and may also contain the requested content. The other library we’ll use is {jsonlite}, which is important for parsing the output of the API requests we get. The packages we’ve just seen do all this as well, but it happens behind the scenes. Now we’re going to pull the curtain back a bit and see how it works.

Warning: This bit can get complicated, so I’ll just cover the very basics here. I recommend the “Getting started with httr” vignette for more details. I’m also going to be using some more complex functions to make things efficient. It might help to work through the chapter on functions in Hadley Wickham’s Advanced R book (Wickham 2019) to better understand how these work.

library(httr)
library(jsonlite)

Basic steps

In the simplest case, to make a request all you need is the URL for the API. The example I’ll use here is https://github.com/beanboi7/yomomma-apiv2, which is a free site that stores “yo momma” jokes. I found this among this list of free APIs. There are many more you can find if you poke around.

We send a request with the GET() function, along with any additional information the API needs. If you go to the website above, it gives information about the endpoint, which is the URL we’ll use in our request, as well as any query parameters that we can set. Generally, most sites with web APIs will give you some details about how to use them.

So we store our endpoint, and include it in our GET() request:

ym_path <- "https://yomomma-api.herokuapp.com/jokes"

ym_request <- GET(
  url = ym_path,
  query = list(count = 10) # the number of jokes to get
)
ym_request
Response [https://yomomma-api.herokuapp.com/jokes?count=10]
  Date: 2022-05-12 12:47
  Status: 200
  Content-Type: application/json
  Size: 867 B

We can check whether our request worked just to be sure.

http_status(ym_request)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

That’s what we want to see. Now we can extract the content.

ym_content <- content(ym_request, as = "text", encoding = "UTF-8")
ym_content
[1] "[{\"joke\":\"Yo momma is so old that she knew Burger King while he was still a prince.\"},{\"joke\":\"Yo momma is so fat that she uses redwoods to pick her teeth\"},{\"joke\":\"Yo momma is so old that when she was born, the Dead Sea was just getting sick.\"},{\"joke\":\"Yo momma is so stupid that she was on the corner with a sign that said 'Will eat for food.'\"},{\"joke\":\"Yo momma is so fat that whenever she goes to the beach the tide comes in!\"},{\"joke\":\"Yo momma is so dirty that that she was banned from a sewage facility because of sanitation concerns.\"},{\"joke\":\"Yo momma's so fat that the Dragon Ball Z crew uses her to make craters on set.\"},{\"joke\":\"Yo momma is so stupid that it took her 2 hours to watch 60 Minutes!\"},{\"joke\":\"Yo momma is so fat that it took Usain Bolt 3 years to run around her.\"},{\"joke\":\"Yo momma's so ugly that she makes Sailor Bubba feel dirty.\"}]"

This isn’t in a very readable format because the content is in JSON, which stands for JavaScript Object Notation. JSON is useful because it is easily read by computers, and for this reason it has become the primary way that data is communicated through APIs. Most APIs will send their responses in JSON format.

This is where the {jsonlite} package comes in, since it contains useful functions for converting JSON code into more familiar data objects in R.
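As a quick illustration (using a made-up JSON string here, not the API response), fromJSON() turns a JSON array of objects into a data frame:

```r
library(jsonlite)

# a toy JSON array of objects, shaped like the response above
toy_json <- '[{"joke":"Joke one"},{"joke":"Joke two"}]'

# fromJSON() simplifies this into a data frame with a single "joke" column
fromJSON(toy_json)
```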

# flatten tells it to create a single unnested dataframe
ym_jokes_df <- fromJSON(ym_content, flatten = TRUE) %>%
  data.frame()

ym_jokes_df

That’s all there is to it! In reality, things are not always so simple: for access to many APIs you will need to register an application with the website. I’ve included a few more examples below.

More examples

Cat facts

You can scrape a list of facts about cats here: https://alexwohlbruck.github.io/cat-facts/docs/

cat_path <- "https://cat-fact.herokuapp.com/facts"

cat_facts <- GET(
  url = cat_path
)

http_status(cat_facts)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

cat_df <- content(cat_facts, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  data.frame()

cat_df %>%
  select(text)

Articles in the Guardian

The Guardian has a free open API for anyone to use. All you need to do is register for a developer app here: https://open-platform.theguardian.com/documentation/

Once you do, you will be sent an API key, and your request URL will look like this.

# the XXXXXXXXXXXXXXXXXX will be your API key
"https://content.guardianapis.com/search?api-key=XXXXXXXXXXXXXXXXXXXXXXXXXXX"

Alternatively, you can save your key and then load it when you need it. Your key should not be shared (which is why I don’t include it here).
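For example, you might save your key once from an interactive session; the list structure and file path below are just the convention I use, so adapt them to your own setup:

```r
# run once, interactively -- never commit this file to a public repository
saveRDS(
  list(api_key = "XXXXXXXXXXXXXXXXXX"), # replace with your actual key
  here::here("keys", "guardian_api.rds")
)
```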

# read in my saved key and paste it to the path
gd_api <- readRDS(here::here("keys", "guardian_api.rds"))
gd_path <- paste0(
  "https://content.guardianapis.com/search?api-key=",
  gd_api$api_key
)
# send the request
gd_request <- GET(
  url = gd_path,
  query = list(
    q = "dinosaur" # pieces mentioning dinosaurs
  )
)

# Check status
http_status(gd_request)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

Now get the content and parse the JSON. Different sites provide different information, so you need to check what’s there.

gd_content <- content(gd_request, as = "text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE) %>%
  data.frame()
# What's in there?
names(gd_content)
 [1] "response.status"                     "response.userTier"                  
 [3] "response.total"                      "response.startIndex"                
 [5] "response.pageSize"                   "response.currentPage"               
 [7] "response.pages"                      "response.orderBy"                   
 [9] "response.results.id"                 "response.results.type"              
[11] "response.results.sectionId"          "response.results.sectionName"       
[13] "response.results.webPublicationDate" "response.results.webTitle"          
[15] "response.results.webUrl"             "response.results.apiUrl"            
[17] "response.results.isHosted"           "response.results.pillarId"          
[19] "response.results.pillarName"        

Look at the dates and titles.

gd_content %>%
  select(response.results.webPublicationDate, response.results.webTitle)

Entries in the Oxford English Dictionary

Again, you’ll need to register a developer account (https://developer.oxforddictionaries.com/), and you can get a free version (with limits) easily. In this case you’ll need both your app ID and your app key, and we include these via add_headers() in the GET() request. I figured this out by looking at the small Python example at the bottom of the developer page (they don’t have an R example, but the two work very similarly).

Information about how to create your GET request from the OED API

oed_keys <- readRDS(here::here("keys", "oed_keys.rds"))

# note that the path contains the word you are searching for
word <- "dinosaur"
ox_path <- paste0("https://od-api.oxforddictionaries.com/api/v2/entries/en-gb/", word)

ox_request <- GET(
  url = ox_path,
  add_headers(
    app_id = oed_keys$app_id,
    app_key = oed_keys$app_key
  )
)

http_status(ox_request)
$category
[1] "Success"

$reason
[1] "OK"

$message
[1] "Success: (200) OK"

It works! Now we could create our own wrapper function that gets the request and parses it all in one go:

# function for getting entries from the OED API
get_OED_entry <- function(word, lang = "en-gb") {
  # load the keys if not already in the workspace
  if (!exists("oed_keys")) oed_keys <- readRDS(here::here("keys", "oed_keys.rds"))

  path <- paste(
    "https://od-api.oxforddictionaries.com/api/v2/entries",
    lang,
    tolower(word),
    sep = "/"
  )
  ox_request <- GET(
    url = path,
    add_headers(
      app_id = oed_keys$app_id,
      app_key = oed_keys$app_key
    )
  )

  if (ox_request$status_code != 200) {
    http_status(ox_request) %>%
      print()
  } else {
    ox_request %>%
      content(as = "text", encoding = "UTF-8") %>%
      fromJSON(flatten = TRUE) %>%
      data.frame()
  }
}
kraken_entry <- get_OED_entry("kraken")
kraken_entry %>%
  glimpse()
Rows: 1
Columns: 10
$ id                     <chr> "kraken"
$ metadata.operation     <chr> "retrieve"
$ metadata.provider      <chr> "Oxford University Press"
$ metadata.schema        <chr> "RetrieveEntry"
$ results.id             <chr> "kraken"
$ results.language       <chr> "en-gb"
$ results.lexicalEntries <list> [<data.frame[1 x 5]>]
$ results.type           <chr> "headword"
$ results.word           <chr> "kraken"
$ word                   <chr> "kraken"

Notice that the object returned by the API is a complex list with multiple embedded lists and dataframes, so you’ll have to do some exploration.

kraken_entry$results.lexicalEntries %>%
  glimpse()
List of 1
 $ :'data.frame':   1 obs. of  5 variables:
  ..$ entries             :List of 1
  .. ..$ :'data.frame': 1 obs. of  3 variables:
  ..$ language            : chr "en-gb"
  ..$ text                : chr "kraken"
  ..$ lexicalCategory.id  : chr "noun"
  ..$ lexicalCategory.text: chr "Noun"

This is an unnamed list whose first element is a dataframe of entries. We can see what this looks like:

kraken_entry$results.lexicalEntries %>%
  first() %>% # pull the first item of a list
  glimpse()
Rows: 1
Columns: 5
$ entries              <list> [<data.frame[1 x 3]>]
$ language             <chr> "en-gb"
$ text                 <chr> "kraken"
$ lexicalCategory.id   <chr> "noun"
$ lexicalCategory.text <chr> "Noun"

So the entries are a list containing a single dataframe (this is getting ridiculous…). Let’s see what that looks like…

kraken_entry$results.lexicalEntries %>%
  first() %>%
  pull(entries) %>% # pull the content of a data.frame column
  first() %>%
  glimpse()
Rows: 1
Columns: 3
$ etymologies    <list> "Norwegian"
$ pronunciations <list> [<data.frame[2 x 4]>]
$ senses         <list> [<data.frame[1 x 5]>]

Oh good grief! This seems crazy, but it actually makes some sense, as there is a lot of information in a dictionary entry, and a complex object like this is not a bad way to keep it organised. Once we know the structure, it would be rather simple to create functions to get it quickly.

# get our definition
kraken_entry$results.lexicalEntries %>%
  first() %>%
  pull(entries) %>% # pull the content of a data.frame column
  first() %>%
  pull(senses) %>%
  first() %>%
  pull(definitions) %>%
  simplify() # collapse a list to a vector
[1] "an enormous mythical sea monster said to appear off the coast of Norway."

So we have a process for getting definitions. We can test it on a form with multiple meanings.

bank_entry <- get_OED_entry("bank")

bank_entry$results.lexicalEntries %>%
  first() %>%
  pull(entries) %>% # pull the content of a data.frame column
  first() %>%
  pull(senses) %>%
  first() %>%
  pull(definitions) %>%
  simplify()
[1] "the land alongside or sloping down to a river or lake"                                         
[2] "a long, high mass or mound of a particular substance"                                          
[3] "a set of similar things, especially electrical or electronic devices, grouped together in rows"
[4] "the cushion of a pool table"                                                                   

Nice. You can imagine creating functions that extract definitions (or pronunciations, etymologies, etc.) from entry objects very easily.
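As a sketch (assuming every entry follows the same nesting we just explored, which may not hold for every word), such a function might look like this:

```r
# sketch: pull the definitions out of an entry object from get_OED_entry()
# assumes the results.lexicalEntries > entries > senses structure seen above
get_definitions <- function(entry) {
  entry$results.lexicalEntries %>%
    first() %>%
    pull(entries) %>%
    first() %>%
    pull(senses) %>%
    first() %>%
    pull(definitions) %>%
    simplify()
}

get_definitions(bank_entry)
```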


Citation & Session Info

Grafmiller, Jason. 2022. Webscraping with R. Part 2: Using APIs. University of Birmingham. url: https://jasongrafmiller.netlify.app/tutorials/tutorial_webscraping_with_R_part2.html (version 2022.05.12).

The following is my current setup on my machine.

sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] jsonlite_1.8.0        httr_1.4.3            rtweet_0.7.0         
 [4] RedditExtractoR_3.0.6 usethis_2.1.5         here_1.0.1           
 [7] tictoc_1.0.1          forcats_0.5.1         stringr_1.4.0        
[10] dplyr_1.0.9           purrr_0.3.4           readr_2.1.2          
[13] tidyr_1.2.0           tibble_3.1.7          ggplot2_3.3.6        
[16] tidyverse_1.3.1       fontawesome_0.2.2     knitr_1.39           

loaded via a namespace (and not attached):
 [1] sass_0.4.1        bit64_4.0.5       vroom_1.5.7       R.utils_2.11.0   
 [5] modelr_0.1.8      bslib_0.3.1       assertthat_0.2.1  askpass_1.1      
 [9] cellranger_1.1.0  yaml_2.3.5        progress_1.2.2    pillar_1.7.0     
[13] backports_1.4.1   glue_1.6.2        digest_0.6.29     rvest_1.0.2      
[17] colorspace_2.0-3  htmltools_0.5.2   R.oo_1.24.0       pkgconfig_2.0.3  
[21] broom_0.8.0       haven_2.5.0       bookdown_0.26     scales_1.2.0     
[25] tzdb_0.3.0        openssl_2.0.0     styler_1.7.0      generics_0.1.2   
[29] ellipsis_0.3.2    withr_2.5.0       cli_3.3.0         RJSONIO_1.3-1.6  
[33] magrittr_2.0.3    crayon_1.5.1      readxl_1.4.0      evaluate_0.15    
[37] R.methodsS3_1.8.1 fs_1.5.2          fansi_1.0.3       R.cache_0.15.0   
[41] xml2_1.3.3        tools_4.1.3       prettyunits_1.1.1 hms_1.1.1        
[45] lifecycle_1.0.1   munsell_0.5.0     reprex_2.0.1      compiler_4.1.3   
[49] jquerylib_0.1.4   rlang_1.0.2       grid_4.1.3        rstudioapi_0.13  
[53] rmarkdown_2.14    gtable_0.3.0      DBI_1.1.2         curl_4.3.2       
[57] rematch2_2.1.2    R6_2.5.1          lubridate_1.8.0   bit_4.0.4        
[61] fastmap_1.1.0     utf8_1.2.2        rprojroot_2.0.3   stringi_1.7.6    
[65] parallel_4.1.3    rmdformats_1.0.3  vctrs_0.4.1       dbplyr_2.1.1     
[69] tidyselect_1.1.2  xfun_0.30        
