Previous articles

Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1

Next articles

Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8

Recap

In Part 1, we installed Python, pip, and the Requests library, and set up a basic structure for a program that fetches the content of your target URL.

In this post, we will learn how to parse the content of the page to extract what we want out of it.

Parsing the data: introducing BeautifulSoup

First, you'll need to install the BeautifulSoup library. We can leverage pip for this again, via "sudo pip3 install beautifulsoup4" for Python 3.x or "sudo pip install beautifulsoup4" for Python 2.x.

BeautifulSoup takes HTML or XML data as input (which we fetched in this example via the requests.get() method) and returns a BeautifulSoup object that lets you easily navigate the parsed tree structure. By default, it uses the html.parser module included in the Python standard library; however, lxml is the recommended HTML parser. You can install it via "sudo pip3 install lxml" for Python 3.x or "sudo pip install lxml" for Python 2.x, and you can tell BeautifulSoup to use it by passing "lxml" as the second argument to the BeautifulSoup constructor.

Let us now update the basic skeleton to parse the data via lxml in BeautifulSoup.

import requests
# Add import of BeautifulSoup
from bs4 import BeautifulSoup
import time

if __name__ == '__main__':
    response = requests.get("http://andythemoron.com")
    # Add use of BeautifulSoup to parse the response content
    soup = BeautifulSoup(response.content, "lxml")
    # Let's print out the BeautifulSoup object for now instead
    print(soup)

This "soup" variable refers to a BeautifulSoup object that we can use to work with the parsed HTML document. Typically, you'll use the find_all() or find() methods on this object to retrieve elements matching certain filters, and then do something with them. Most commonly, you'll fetch all elements of a specific tag type and/or fetch elements based on the value of one or more of their attributes, usually their class or id. In my personal use cases, I haven't needed any methods beyond these, although many more are available. Check out the documentation for more information.

The first named argument to find() and find_all() is name, which filters on the tag/element type being searched for. For example, in the above code, soup.find_all('div') would return a list of all elements with the div tag. A keyword argument that is not recognized is turned into a filter on one of the tag's attributes, so you could search for elements with a specific id by invoking find_all(id="some-id-value") on a BeautifulSoup object. One of the most useful filters is searching by class. However, since "class" is a reserved keyword in Python, filtering by class is done by appending an underscore, e.g. find_all(class_='post-listing').

Here are some examples:

  • find_all('div') # find all elements with the div tag
  • find_all('div', class_='post-listing') # find all elements with the div tag that have a class of 'post-listing'
  • find_all(class_='post-listing') # find all elements with a class of 'post-listing'
  • find(id='avatar') # find the element with id 'avatar'
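
To make these concrete, here's a minimal sketch that runs a few of these filters against the "soup" object from the skeleton above. The class and id values ('post-listing', 'avatar') are just illustrations taken from the examples; substitute whatever the page you're scraping actually uses.

divs = soup.find_all('div')                              # every element with the div tag
listings = soup.find_all('div', class_='post-listing')   # divs with a class of 'post-listing'
by_class = soup.find_all(class_='post-listing')          # any tag with a class of 'post-listing'
avatar = soup.find(id='avatar')                          # the single tag with id 'avatar', or None

print(len(divs), "div elements found")
print(len(listings), "post listings found")
if avatar is not None:
    print("Found the avatar element:", avatar.name)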

As previously alluded to, there are many other options available here. BeautifulSoup supports a variety of filters for searching the document tree; check out the documentation for more information. You can also use BeautifulSoup to traverse the tree, rather than querying it.

The elements that BeautifulSoup returns from a find() or find_all() query are Tag objects. A tag may have any number of attributes, and you can view this attribute dictionary directly via .attrs, e.g. tag.attrs. You can access specific attributes by treating the tag itself like a dictionary, e.g. tag['class'] will return the value of the class attribute defined on the tag. If an attribute may or may not be present, use the get() method instead, e.g. tag.get('class'), since a direct dictionary access raises an exception when the attribute is missing. You can retrieve the text from within a tag via .string, e.g. tag.string. These tools, along with the occasional bit of string formatting (especially .strip() to remove whitespace at the beginning and end), will be sufficient for the vast majority of basic scraping tasks.
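
Here's a short sketch of these Tag operations, again using the "soup" object from above. The tag and attribute names are only illustrative, and the checks guard against attributes or text that may not be present on a given page.

tag = soup.find('a')                  # first 'a' element on the page, or None if there isn't one
if tag is not None:
    print(tag.attrs)                  # the full attribute dictionary
    print(tag.get('href'))            # safe lookup: returns None if 'href' is absent
    if 'class' in tag.attrs:
        print(tag['class'])           # direct lookup, safe here because we checked first
    if tag.string is not None:
        print(tag.string.strip())     # the tag's text with leading/trailing whitespace removed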

But how do I know what I need? Using browser development tools to inspect a web page

Built into both Firefox and Chrome for the desktop is the ability to simply right-click anywhere on the page and inspect the associated HTML element. Awesome, right? In Chrome, the menu item is "Inspect", and in Firefox, it is "Inspect Element". This will open up the relevant page inspector.

As you hover over the html elements in the page inspector, you will notice the associated elements become highlighted in the browser. You can expand and contract the tree structure to figure out the location of the data you want to extract. Have fun and play around with it!

Putting it all together

Following our example of checking this page for new posts, what information might you want to collect? The title and URL should be sufficient for this use case of wanting to be notified when a new entry is posted.

There are many different ways you could fetch this information via BeautifulSoup. Using the page inspector, you can see that both of these items are present in an "a" element with class "post-title", so in this case we can simply search for those directly. We can fetch the title via the 'title' attribute and the URL via the 'href' attribute. Note that the link is relative, so you'll have to prepend "http://andythemoron.com" to the href value to generate a valid link.

Putting it all together now, we'll request the page, parse it with BeautifulSoup, extract the title and link of each post on the home page, and collect this data into a list of Python dictionaries with the keys 'title' and 'link'.

import requests
from bs4 import BeautifulSoup
import time

if __name__ == '__main__':
    response = requests.get("http://andythemoron.com")
    soup = BeautifulSoup(response.content, "lxml")
    post_titles = soup.find_all("a", class_="post-title")
    extracted = []
    for post_title in post_titles:
        extracted.append({
            'title' : post_title['title'],
            'link'  : "andythemoron.com" + post_title['href']
        })
    # Verify that we're getting valid data
    for data in extracted:
        print(data)

Try playing around with it and extracting different data!
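
As one possible variation (a sketch, not part of the final script), you could collect every link on the home page rather than just the post titles; .get() is used because some "a" tags may not have an href attribute.

links = []
for a in soup.find_all('a'):          # every 'a' element on the page
    href = a.get('href')              # None if the tag has no href attribute
    if href is not None:
        links.append(href)
print(links)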

If you're following my advice on version control, stage your changes for commit with "git add scraper.py", and commit them with something like "git commit -m 'Use BeautifulSoup to parse the response and fetch the data'".

Stay tuned for Part 3

