03/01/2018

What is web scraping?

Web scraping is a technique for converting data that sits on the web in an unstructured format (HTML tags) into a structured format that can easily be accessed and used. The job is carried out by a piece of code called a “scraper”.

First, it sends a “GET” request to a specific website. Then, it parses the HTML document returned in the response. Next, the scraper searches the document for the data you need and, finally, converts it into the specified format.
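
The four steps can be sketched in R with the rvest package. This is a minimal sketch rather than a definitive implementation: the HTML snippet, the `film`, `title` and `year` class names, and the column names are made up for illustration. The page is supplied as a string so the example runs offline; `read_html()` also accepts a URL, in which case it fetches the page for you.

```r
library(rvest)

# Steps 1-2: fetch and parse. A hypothetical page is given inline here;
# passing a URL to read_html() would retrieve the page over HTTP instead.
page <- read_html('
  <html><body>
    <div class="film"><span class="title">Film A</span><span class="year">2016</span></div>
    <div class="film"><span class="title">Film B</span><span class="year">2017</span></div>
  </body></html>')

# Step 3: search the parsed document for the data we need
titles <- html_text(html_nodes(page, ".film .title"))
years  <- html_text(html_nodes(page, ".film .year"))

# Step 4: convert to a structured format (a data frame)
films <- data.frame(title = titles, year = as.numeric(years),
                    stringsAsFactors = FALSE)
films
```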

Web crawling vs web scraping: what are the differences?

Web crawling

Web crawling means using programs or automated scripts that browse the World Wide Web in a methodical, automated manner. Search engines such as Google, Bing and Yahoo operate the major web crawlers. Crawlers can also be used to gather specific types of information from web pages, such as harvesting e-mail addresses (usually for spam).

Web scraping

Web scraping is likewise about finding required data on web pages. But in scraping, we know the exact pages from which we need to scrape the data. We also know the HTML element structure of those pages, since it is fixed. Compared to web crawling, it is more targeted.

Why do we need web scraping?

Web scraping has a wide range of applications. Examples are:

  • Scraping movie rating data from IMDB to create movie recommendation engines
  • Scraping text data from Wikipedia and other sources to make NLP-based systems
  • Scraping labeled image data from websites like Google, Pinterest, Flickr, etc. to train image classification models
  • Scraping data from social media sites like Facebook and Twitter for performing sentiment analysis, opinion mining, etc.
  • Scraping user reviews and feedback from e-commerce sites like Amazon

Ways to scrape data

  • Human copy-paste: slow and inefficient
  • Text pattern matching: use regular expression matching
  • API interface: many websites, such as Facebook and Twitter, provide public or private APIs that can be called using standard code to retrieve data in a prescribed format
  • DOM parsing: by driving a web browser, programs can retrieve the dynamic content generated by client-side scripts. It’s also possible to parse web pages into a DOM tree, from which programs can retrieve parts of these pages
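
To illustrate the text pattern matching approach, here is a minimal base R sketch that pulls values out of raw HTML with a regular expression. The HTML string and the `rating` class are invented for the example. Regular expressions can work for simple, regular markup like this, but they tend to break on nested or reformatted HTML, which is one reason DOM parsing is often preferred.

```r
# Hypothetical raw HTML, as it might arrive in the body of a response
html <- '<li class="rating">7.7</li><li class="rating">8.3</li><li class="other">n/a</li>'

# Match every <li class="rating">...</li> element
pattern <- '<li class="rating">[^<]*</li>'
matches <- regmatches(html, gregexpr(pattern, html))[[1]]

# Strip the tags and convert the remaining text to numbers
ratings <- as.numeric(gsub("<[^>]+>", "", matches))
ratings
```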

Packages for web scraping in R

  • httr: provides a user-friendly interface for executing HTTP methods

  • RCurl: provides a closer interface between R and the libcurl C library, but is less user-friendly

  • curl: another libcurl client; provides the curl() function as an SSL-compatible replacement for base R’s url(), with support for HTTP/2, SSL, gzip, deflate and more

  • request: a high-level package that is useful for developing other API client packages

  • RSelenium: automate interactions and extract page contents of dynamically generated webpages (those requiring user interaction to display results)

  • Rcrawler: performs parallel web crawling and web scraping; designed to crawl, parse and store web pages to produce data that can be used directly in analysis applications

  • rvest: a higher-level alternative package useful for web scraping; works with magrittr to make it easy to express common web scraping tasks

  • twitteR: an R-based Twitter client that provides an interface to the Twitter web API

For more information: CRAN Task View - Web Technologies and Services

Example 1: IMDB Movies

Let’s scrape the IMDB website for the 100 most popular feature films released in 2017, using the search URL below.

library(rvest)
# specify the url for desired website
url <- "http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature"

# read in the HTML code from the website
webpage <- read_html(url)

What does the page contain?

  • Rank: the rank of the film by popularity

  • Title: the title of the feature film

  • Description: the description of the film

  • Runtime: the duration of the film

  • Genre: the genre of the feature film

  • Rating: the IMDB rating of the feature film

  • Metascore: the metascore on IMDB website for the feature film

  • Votes: votes cast in favor of the feature film

  • Gross_Earning_in_Mil: gross earnings of the feature film in millions

  • Director: The main director of the feature film. In cases where there are multiple directors, we take the first one.

  • Actor: The main actor/actress of the feature film. In cases where there are multiple actors/actresses, we take the first one.

Start with the rankings and titles

# rankings
rank_data_html <- html_nodes(webpage, ".text-primary")
rank_data <- as.numeric(html_text(rank_data_html))
# names
title_data_html <- html_nodes(webpage, ".lister-item-header a")
title_data <- html_text(title_data_html)

head(data.frame(title_data, rank_data))
##                                  title_data rank_data
## 1                        The Shape of Water         1
## 2 Three Billboards Outside Ebbing, Missouri         2
## 3                            Thor: Ragnarok         3
## 4                      The Greatest Showman         4
## 5                            Justice League         5
## 6                      Call Me by Your Name         6

Similarly, identify the Description, Runtime, Genre, Rating, Metascore, Votes, Gross Earning, Director and Actor

description_data_html <- html_nodes(webpage, 
                        ".ratings-bar+ .text-muted")
description_data <- html_text(description_data_html)

head(description_data)
## [1] "\nAt a top secret research facility in the 1960s, a lonely janitor forms a unique relationship with an amphibious creature that is being held in captivity."                                                                    
## [2] "\nA mother personally challenges the local authorities to solve her daughter's murder when they fail to catch the culprit."                                                                                                     
## [3] "\nThor is imprisoned on the other side of the universe and finds himself in a race against time to get back to Asgard to stop Ragnarok, the destruction of his homeworld and the end of ...                See full summary »\n"
## [4] "\nCelebrates the birth of show business, and tells of a visionary who rose from nothing to create a spectacle that became a worldwide sensation."                                                                               
## [5] "\nFueled by his restored faith in humanity and inspired by Superman's selfless act, Bruce Wayne enlists the help of his newfound ally, Diana Prince, to face an even greater enemy."                                            
## [6] "\nIn 1980s Italy, a romance blossoms between a seventeen year-old student and the older man hired as his father's research assistant."
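
The raw descriptions carry a leading newline, and truncated ones end with a "See full summary »" link label. One possible cleanup step in base R is sketched below. The two input strings are abbreviated versions of the output above, and `clean_description` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: trim whitespace and drop the "See full summary" trailer
clean_description <- function(x) {
    x <- trimws(x)                            # remove leading/trailing whitespace
    sub("\\s*See full summary.*$", "", x)     # remove the link label, if present
}

# Abbreviated examples of scraped descriptions
raw <- c("\nA mother personally challenges the local authorities.",
         "\nThor is imprisoned on the other side of the universe ...                See full summary »\n")
cleaned <- clean_description(raw)
cleaned
```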

runtime_data_html <- html_nodes(webpage, ".text-muted .runtime")
runtime_data <- html_text(runtime_data_html)
head(runtime_data)
## [1] "123 min" "115 min" "130 min" "105 min" "120 min" "132 min"
runtime_data <- as.numeric(gsub(" min","", runtime_data))

Metascore: what goes wrong?

metascore_data_html <- html_nodes(webpage, ".metascore")
metascore_data <- html_text(metascore_data_html)
# strip the padding spaces around each score
metascore_data <- gsub(" ", "", metascore_data)
length(metascore_data)
## [1] 94

Six of the top 100 movies didn’t have a metascore!

# positions of the six films without a metascore
for (i in c(11, 29, 33, 39, 59, 85)) {
    # insert a missing value so that it ends up at position i
    metascore_data <- append(metascore_data, NA_character_, after = i - 1)
}
metascore_data <- as.numeric(metascore_data)
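
Hard-coded positions stop working as soon as the list of films changes. One alternative, sketched here on a made-up two-film snippet, is to select each film's container first and then call `html_node()` (singular) on that set: it returns exactly one result per container, so films without a score yield `NA`. The `.lister-item-content` container class is an assumption about the page layout, not something taken from the code above.

```r
library(rvest)

# Hypothetical excerpt: the second film has no metascore element
page <- read_html('
  <div class="lister-item-content"><span class="metascore">87 </span></div>
  <div class="lister-item-content"><span class="runtime">94 min</span></div>')

# One node per film; html_node() keeps a placeholder where nothing matches
items <- html_nodes(page, ".lister-item-content")
metascore_data <- html_text(html_node(items, ".metascore"))

# Missing entries come back as NA, so the vector stays aligned with the films
metascore_data <- as.numeric(trimws(metascore_data))
metascore_data
```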

Combine everything into a single dataframe.

## 'data.frame':    100 obs. of  11 variables:
##  $ Rank                : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title               : Factor w/ 100 levels "A Bad Moms Christmas",..: 84 89 87 74 38 13 96 42 10 18 ...
##  $ Description         : Factor w/ 100 levels "\n1920, rural Ireland. Anglo Irish twins Rachel and Edward share a strange existence in their crumbling family "| __truncated__,..: 37 10 88 45 54 58 41 59 18 49 ...
##  $ Runtime             : num  123 115 130 105 120 132 113 94 164 125 ...
##  $ Genre               : Factor w/ 9 levels "Action","Adventure",..: 2 6 1 4 1 7 7 5 7 4 ...
##  $ Rating              : num  7.7 8.3 8 8 6.8 8.1 8 7.7 8.2 7.4 ...
##  $ Metascore           : num  87 88 74 48 45 93 66 94 81 75 ...
##  $ Votes               : num  100594 141988 253283 71881 203815 ...
##  $ Gross_Earning_in_Mil: num  55.7 50.5 314.8 161.4 228.9 ...
##  $ Director            : Factor w/ 99 levels "Aaron Sorkin",..: 39 61 89 67 99 57 86 38 30 48 ...
##  $ Actor               : Factor w/ 92 levels "Andy Serkis",..: 77 34 21 41 8 6 43 78 40 36 ...
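
The code that builds the data frame is not shown here. A minimal sketch, assuming each field has already been scraped into a vector of equal length, could look like the following. The column names and the name `movies_df` come from the slides; `rating_data` is an assumed variable name, the three-element vectors are toy stand-ins copied from the printed output, and only four of the eleven columns are included for brevity.

```r
# Toy stand-ins for the scraped vectors (first three films from the output)
rank_data    <- c(1, 2, 3)
title_data   <- c("The Shape of Water",
                  "Three Billboards Outside Ebbing, Missouri",
                  "Thor: Ragnarok")
runtime_data <- c(123, 115, 130)
rating_data  <- c(7.7, 8.3, 8.0)

# One column per field; stringsAsFactors = TRUE reproduces the Factor columns
movies_df <- data.frame(Rank = rank_data, Title = title_data,
                        Runtime = runtime_data, Rating = rating_data,
                        stringsAsFactors = TRUE)
str(movies_df)
```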

library(ggplot2)
library(plotly)
# rating vs. runtime; point size shows votes, colour shows genre
p1 <- ggplot(movies_df, aes(x = Runtime, y = Rating)) +
    geom_point(aes(size = Votes, col = Genre))
ggplotly(p1)

ggplotly(ggplot(movies_df, aes(x = Runtime, fill = Genre)) + 
             geom_bar(stat = "count"))

Example 2: Twitter Data

Initial setup: generate consumer_key, consumer_secret, access_token, access_secret from Twitter Application Management, and use them to authorize your R session.

library(twitteR)
setup_twitter_oauth(consumer_key, consumer_secret,
                    access_token, access_secret)
## [1] "Using direct authentication"

searchResults <- searchTwitteR("#coco", n = 100)
head(searchResults)
## [[1]]
## [1] "LorenaMeritano: Llorando a moco tendido ... #Coco ❤️ #majestuosa una obra de arte.\nGracias \xed\xa0\xbd\xed\xb2\x9c\nNos morimos… https://t.co/zffjBwHljg"
## 
## [[2]]
## [1] "im_bala_aadvik: RT @sridevisreedhar: Hearing such good reports about #KolamaavuKokila #CoCo from inside sources.... \nA quirky comedy-drama which has #Nayan…"
## 
## [[3]]
## [1] "sdxacademy: RT @ninafurfur: What’s your favorite?\n#vintageshop #nfv #vintagegucci #chanel #vintagechanel #bag #coco #fashion… https://t.co/0eqtcyN61T"
## 
## [[4]]
## [1] "verito_carolina: Vere #Coco para saber que tanta cuestión con la película... si no lloro me sentire estafada."
## 
## [[5]]
## [1] "beccaa_s: Gonna start watching this everyday \xed\xa0\xbd\xed\xb8\x8d my favorite Disney movie \xed\xa0\xbd\xed\xb6\xa4\xed\xa0\xbc\xed\xb7\xb2\xed\xa0\xbc\xed\xb7\xbd #Coco #Disney https://t.co/DQkDzb2S5v"
## 
## [[6]]
## [1] "YhingG7love1: RT @NinkMickey: แปลไอจีท่าน Ars\nสายตาที่แน่วแน่...แต่ว่า ดาเมะเดส...\n(ดาเมะเดส=ไม่ได้นะครับ)\n#GOT7 #Youngjae #Coco\n@GOT7Official https://t.…"

searchResults <- searchTwitteR("#coco", n = 3000, lang = "en")
head(searchResults)
## [[1]]
## [1] "im_bala_aadvik: RT @sridevisreedhar: Hearing such good reports about #KolamaavuKokila #CoCo from inside sources.... \nA quirky comedy-drama which has #Nayan…"
## 
## [[2]]
## [1] "sdxacademy: RT @ninafurfur: What’s your favorite?\n#vintageshop #nfv #vintagegucci #chanel #vintagechanel #bag #coco #fashion… https://t.co/0eqtcyN61T"
## 
## [[3]]
## [1] "beccaa_s: Gonna start watching this everyday \xed\xa0\xbd\xed\xb8\x8d my favorite Disney movie \xed\xa0\xbd\xed\xb6\xa4\xed\xa0\xbc\xed\xb7\xb2\xed\xa0\xbc\xed\xb7\xbd #Coco #Disney https://t.co/DQkDzb2S5v"
## 
## [[4]]
## [1] "RssSathisha: RT @NayantharaLive: Marvellous March updates....\n#VikatanAwards Part2 - 4th March\n#Coco 1st Look 5th March\n#Coco 1st Single 8th March\n#Imai…"
## 
## [[5]]
## [1] "coweeeeen: ...for even if I'm far away, I hold you in my heart. I sing a secret song to you each night we are apart...\n\n#Coco \xed\xa0\xbd\xed\xb2\x94"
## 
## [[6]]
## [1] "al_brogan: RT @MitchBenn: Just come out of #Coco and I can’t remember the last time a movie left me so utterly emotionally spent. It’s simultaneously…"

One more thing: word cloud about Donald Trump
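
The word cloud itself is not reproduced here. As a rough sketch of the step that precedes it, the base R code below turns tweet text into a word-frequency table, which is the kind of input a word cloud is drawn from. The two strings are adapted from the #coco results above; the cleaning rules and the tiny stop-word list are assumptions, not the original analysis.

```r
# Two example tweets (adapted from the search output)
tweets <- c("RT @MitchBenn: Just come out of #Coco and I can't remember the last time a movie left me so spent.",
            "Gonna start watching this everyday, my favorite Disney movie #Coco #Disney https://t.co/DQkDzb2S5v")

clean <- tolower(tweets)
clean <- gsub("https?://\\S+", "", clean)   # drop URLs
clean <- gsub("[@#]\\w+", "", clean)        # drop mentions and hashtags
clean <- gsub("[^a-z' ]", " ", clean)       # keep letters, apostrophes, spaces

# Split into words and remove a few (hypothetical) stop words
words <- unlist(strsplit(clean, "\\s+"))
stop_words <- c("rt", "a", "and", "i", "of", "the", "this", "my", "me", "so", "")
words <- words[!words %in% stop_words]

# Frequency table, most common words first
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq)
```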

For large, formulated websites: customized packages

Rfacebook, Rlinkedin, RGoogleData, googlesheets

These packages are very similar to twitteR, as they all rely on the websites’ own APIs. You’ll need to create a web app on each site to get access.

facebook

linkedin

googlesheets, RGoogleData

More reading: