12.1. Web Scraping with Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

More information about web scraping can be found at this site: https://towardsdatascience.com/a-step-by-step-guide-to-web-scraping-in-python-5c4d9cef76e8

First, we need to install Beautiful Soup, which we can do with pip.
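On the command line (or prefixed with ! in a notebook cell):

pip install beautifulsoup4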

We can fetch and parse any web page with Beautiful Soup as follows.

import pandas as pd
import requests as rq
from bs4 import BeautifulSoup as bs
from IPython.display import display, HTML

url = 'https://en.wikipedia.org/wiki/List_of_municipalities_in_Connecticut'
page = rq.get(url).text        # Download the raw HTML
soup = bs(page, 'html.parser') # Parse it; naming a parser avoids a warning
display(HTML(page))            # Render the page inside the notebook

We can navigate to any part of the HTML tree using the specific tag we want.

soup.title
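Tag access can also be chained to move around the tree; for example:

soup.title.string      # Just the text inside the <title> tag
soup.title.parent.name # The name of the tag that contains <title>, i.e. 'head'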

We can find any part of the HTML tree in a similar way, but it will default to the first appearance of that tag. We can specify that we want the text component.

soup.find('h1').text

We can find all appearances of that tag.

for x in soup.find_all('h2'):
    print(x.text)
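find_all also accepts attribute filters. As a small sketch, this prints the first few links on the page that actually have an href attribute:

for a in soup.find_all('a', href=True)[:5]:
    print(a['href'])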

If we just want a table as a dataframe, the easiest way to do this is with pandas.

# Extract all tables on the page
dfs = pd.read_html(url)

# Get the municipalities table (the second table on the page)
df = dfs[1]

# Inspect the columns
df.info()
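Since pd.read_html returns a list of every table it finds on the page, it is worth checking how many came back and previewing one before settling on an index:

print(len(dfs)) # How many tables read_html found on the page
dfs[1].head()   # Preview the municipalities table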

However, if we want to get information from different webpages, this method is not sufficient. In this example, we use Beautiful Soup to create an easy way to access information about CT towns. We start by going through the table to create a list of links for the towns of Connecticut. Then we will extract some information from each page's infobox.

table = soup.find('table', {'class': "wikitable sortable"}) # Find the municipalities table
rows = table.find_all('tr')[1:] # All rows of the table, skipping the header row
base_url = 'https://en.wikipedia.org' # Base URL to which each href will be appended
town_links = []
for r in range(0, 169): # Connecticut has 169 towns
    town_links.append(rows[r].find('a')['href'])
town_urls = [base_url + link for link in town_links]
town_urls
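As a quick sanity check, the list should contain one URL per town:

len(town_urls) # Connecticut has 169 towns, so this should be 169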

Here, we define a function that extracts which county a town is in.

def get_county(link):
    link_page = rq.get(link).text
    link_soup = bs(link_page, 'html.parser')
    table = link_soup.find('table')
    c = table.find(text = 'County') # Finds part of the infobox labeled 'County'
    b = link_soup.find('h1').text   # The page title, i.e. the town name
    if c is None:
        return [b, ' ']
    else:
        county = c.next.next.text
        return [b, ' '.join(county.split('\xa0'))] # Replace non-breaking spaces
get_county('https://en.wikipedia.org/wiki/Andover,_Connecticut')
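Before looping over all 169 pages, it is worth spot-checking the function on a few links, pausing between requests to be polite to Wikipedia's servers:

import time
for link in town_urls[:3]:
    print(get_county(link))
    time.sleep(1)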

We could do a separate function for each piece of information, or we could do things a little faster.

def get_info(link):
    link_page = rq.get(link).text
    link_soup = bs(link_page, 'html.parser')
    table = link_soup.find('table')
    rows = table.find_all('tr', {'class' : 'mergedrow'}) # The infobox data rows
    town_info = [link_soup.find('h1').text] # Start with the town name
    for row in rows:
        r = row.find('td', {'class' : 'infobox-data'})
        if r is None:
            town_info.append('')
        else:
            town_info.append(' '.join(r.text.split('\xa0')))
    return town_info
get_info('https://en.wikipedia.org/wiki/Hartford,_Connecticut')
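The list above has values but no labels, which makes it hard to read. As a variant sketch (assuming the infobox also marks its label cells with the class infobox-label, as these Wikipedia pages appear to), we can pair each label with its value in a dictionary:

def get_info_dict(link):
    link_soup = bs(rq.get(link).text, 'html.parser')
    table = link_soup.find('table')
    info = {}
    for row in table.find_all('tr', {'class': 'mergedrow'}):
        label = row.find('th', {'class': 'infobox-label'}) # Assumed label class
        data = row.find('td', {'class': 'infobox-data'})
        if label is not None and data is not None:
            info[' '.join(label.text.split('\xa0'))] = ' '.join(data.text.split('\xa0'))
    return info

get_info_dict('https://en.wikipedia.org/wiki/Hartford,_Connecticut')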

We can also define a function to extract a specific piece of info. Note that the keyword must match the infobox label text exactly, including the leading ' • ' on sub-items like Density.

def get_info_2(link, key): # This function takes in a keyword
    link_page = rq.get(link).text
    link_soup = bs(link_page, 'html.parser')
    table = link_soup.find('table')
    i = table.find(text = key) # Finds part of the infobox with this keyword
    b = link_soup.find('h1').text
    if i is None:
        return [b, ' ']
    else:
        info = i.next.next.text
        return [b, ' '.join(info.split('\xa0'))]
get_info_2('https://en.wikipedia.org/wiki/Mansfield,_Connecticut', ' • Density')

Here we use the above function to create a list of CT towns and their elevation.

# To add pauses between requests
import time
# To store each town's info
town_info_list = []

for link in town_urls:
    # Get the town's elevation; skip the town if the request fails
    try:
        town_info = get_info_2(link, 'Elevation')
    except Exception:
        continue
    if town_info:
        town_info_list.append(town_info)
    # Pause a second between each town
    time.sleep(1)
town_info_list

We can convert the list to a dataframe and use the data to make a nice bar plot.

df1 = pd.DataFrame(town_info_list, columns = ['town', 'elevation'])
df1.info()
df1
# Drop towns whose pages had no elevation entry
df1 = df1[df1['elevation'] != ' ']
df1.info()
# Split 'Andover, Connecticut' into just the town name
df1['town_name'] = [x.split(',')[0] for x in df1['town']]
# Take the leading number of feet, e.g. '630 ft (192 m)' -> 630
df1['elevation_feet'] = [int(x.split(' ')[0].replace(',', '')) for x in df1['elevation']]
# Keep only the towns above 700 feet, sorted from highest to lowest
df1 = df1[df1['elevation_feet'] >= 700]
df1 = df1.sort_values(by = 'elevation_feet', ascending = False)
df1
import matplotlib.pyplot as plt
plt.figure(figsize = (30, 30))
plt.bar(df1['town_name'], df1['elevation_feet'], align = 'center', alpha = 0.5)
plt.xticks(rotation = 90) # Rotate town names so they stay readable
plt.ylabel('Elevation (feet)')
plt.show()