In Part 1 we installed Python, pip, and the Requests library, and set up a basic program that fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page and extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine whether the fetched data is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send that SMS. In Part 5, we added a few command-line options to help us in future development as we add support for new sites. In Part 6, we went over basic error handling and how to automate this script.
In this part, we are going to do some refactoring to clean up the code so that it is easier to add new requests.
Refactoring
One of the main reasons to refactor is to improve the architecture of your program in the process of adding new functionality. The work we do here will make it very easy to add support for new requests, as you’ll see in the next post. The --verbose and --dryrun options we added in Part 5 will make this work much easier as you introduce changes into your program!
In this post, we’re going to work towards a Python dictionary holding the parameters unique to a given request, hollowing out scraper.py into the generic mechanism that dispatches the heavy lifting as needed. The end result will be simpler, clearer code that is easier to work with going forward. For now, we are going to separate out the base URL, the extraction method, the formatting method, and the MongoDB collection, and while we’re at it, we’ll move some of the implementation out to a separate scraper_utils.py file.
Pull out extraction logic
Let’s start by pulling out the logic for extracting posts from the BeautifulSoup object into a new method extract_posts in a new file andy_the_moron_parser.py:
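Something along these lines — a minimal sketch, since the exact selectors depend on the markup you targeted back in Part 2 (the `<article>`/link structure here is an assumption, so swap in your own extraction logic):

```python
# andy_the_moron_parser.py


def extract_posts(soup):
    """Extract post data from a BeautifulSoup object.

    This is the logic lifted out of scraper.py. The selectors below assume
    each post is an <article> with a linked title; adjust them to match
    whatever you built in Part 2.
    """
    posts = []
    for article in soup.find_all('article'):
        link = article.find('a')
        if link is None:
            continue
        posts.append({
            'title': link.get_text(strip=True),
            'url': link.get('href'),
        })
    return posts
```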
Pull out formatting logic
Now let’s move over the format_message function into andy_the_moron_parser.py, but let’s rename it format_post to make the naming more consistent with extract_posts:
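Again a sketch — the exact message text is whatever you settled on in Part 4, and the 'title'/'url' keys match the extract_posts sketch above:

```python
# andy_the_moron_parser.py


def format_post(post):
    """Format an extracted post for SMS delivery (formerly format_message)."""
    return "New post: {title}\n{url}".format(**post)
```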
Get the request
Let’s create a file scraper_requests.py that returns the request dictionary for the current request. (In the next post, we’ll make this a list of dictionaries, but let’s stick to the singular for now…) Since we only need the MongoDB client to fetch the collection, let’s do that work here:
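Here’s one possible shape for it. The key names ('base_url', 'extract_method', 'format_method', 'collection') are the ones referenced in the snippets below; the database and collection names passed to pymongo are placeholders, so use your own from Part 3:

```python
# scraper_requests.py
from pymongo import MongoClient

import andy_the_moron_parser


def get_scraper_request():
    """Return the request dictionary for the current (single) request."""
    mongo_client = MongoClient()  # assumes a local MongoDB instance, as in Part 3
    return {
        'base_url': 'https://andythemoron.com',
        'extract_method': andy_the_moron_parser.extract_posts,
        'format_method': andy_the_moron_parser.format_post,
        'collection': mongo_client.scraper.posts,  # placeholder db/collection names
    }
```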
Since we are getting the collection from scraper_requests.py, we no longer need it in scraper.py.
Using the extraction method
Now let’s use the ‘extract_method’ from the request dictionary:
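In scraper.py, that looks something like this (soup being whatever name you gave the parsed page in Part 2):

```python
# scraper.py
import scraper_requests

request = scraper_requests.get_scraper_request()

# ...after parsing the fetched page into `soup` with BeautifulSoup:
extracted = request['extract_method'](soup)
```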
Let’s also use the request dictionary ‘base_url’ value for requesting the data and sending failure notifications:
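Roughly like so — notify_failure is a hypothetical stand-in for whatever failure notification you built in Part 6:

```python
# scraper.py
response = requests.get(request['base_url'])
if response.status_code != 200:
    # notify_failure is a placeholder for your Part 6 failure notification
    notify_failure("Failed to fetch {}".format(request['base_url']))
```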
Using the collection
Let’s update the references to the collection with the collection from the request dictionary:
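Assuming the duplicate check from Part 3 keys off the post URL (use whichever field you actually stored), the change looks like this:

```python
# scraper.py
collection = request['collection']

# ...inside the loop over extracted posts:
if collection.find_one({'url': post['url']}) is None:
    # this post is new; store it before formatting and sending
    collection.insert_one(post)
```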
Using the formatting method
Let’s update the call to format_message() to use the request’s format method:
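Which is a one-line change:

```python
# scraper.py
# was: message = format_message(post)
message = request['format_method'](post)
```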
Test it out
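Assuming the flag spellings from Part 5 (adjust if yours differ), a verbose dry run is a safe way to check:

```
python scraper.py --verbose --dryrun
```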
Awesome, it works as before! If not, the verbose output should help you figure out where the failure lies…
Pull out extracted data processing loop
Let’s copy the logic from within the “for post in extracted” loop into a new method process_extracted in a new file scraper_utils.py. This new method will need to be passed the Twilio client, the extracted data, the collection, the from_number, the to_number, the format_method, and the option values for seed_db, dry_run, and verbose. We’ll pass these values in and adjust the references accordingly. You may notice that I also changed “post” to “item” for a more generic name.
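Here’s a sketch of what that might look like — the dedup check, seeding behavior, and Twilio call reflect my reading of Parts 3 through 5, so adjust the details to match your own loop:

```python
# scraper_utils.py


def process_extracted(twilio_client, extracted, collection, from_number,
                      to_number, format_method, seed_db, dry_run, verbose):
    """Store new items and send SMS notifications for them."""
    for item in extracted:
        if collection.find_one({'url': item['url']}) is not None:
            continue  # we've already seen this item
        if verbose:
            print("New item found: {}".format(item['url']))
        if not dry_run:
            collection.insert_one(item)
        if seed_db or dry_run:
            continue  # don't send SMS while seeding the database or on a dry run
        twilio_client.messages.create(
            body=format_method(item),
            from_=from_number,
            to=to_number,
        )
```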
We’ll call it in place of the previous code like so:
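With twilio_client, from_number, to_number, seed_db, dry_run, and verbose being whatever names your scraper.py already uses for the Twilio client, phone numbers, and parsed options:

```python
# scraper.py
import scraper_utils

# ...
scraper_utils.process_extracted(
    twilio_client,
    extracted,
    request['collection'],
    from_number,
    to_number,
    request['format_method'],
    seed_db,
    dry_run,
    verbose,
)
```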
Let’s test it out!
Extract the request loop
Just one more bit of refactoring for now! Let’s pull out the request loop into a new method get_response() in scraper_utils.py:
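I’m assuming the request loop here is the retry-on-failure loop from Part 6; the retry count and back-off below are placeholders, so carry over your own values:

```python
# scraper_utils.py
import time

import requests


def get_response(url, retries=3, verbose=False):
    """Fetch a URL, retrying on failure; return the response or None."""
    for attempt in range(retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response
            if verbose:
                print("Got status {} from {}".format(response.status_code, url))
        except requests.RequestException as exc:
            if verbose:
                print("Request failed: {}".format(exc))
        time.sleep(2 ** attempt)  # simple exponential back-off between attempts
    return None
```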
And let’s use it in scraper.py:
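Again using the hypothetical notify_failure stand-in from earlier:

```python
# scraper.py
response = scraper_utils.get_response(request['base_url'], verbose=verbose)
if response is None:
    # placeholder for your Part 6 failure notification
    notify_failure("Failed to fetch {}".format(request['base_url']))
```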
Let’s run the scraper once more to verify that this works as expected.
Tada!
Your scraper.py file should now look something like this:
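Something along these lines — the Twilio credentials, phone numbers, and flag definitions are placeholders standing in for whatever you set up in Parts 4 and 5:

```python
# scraper.py
import argparse

from bs4 import BeautifulSoup
from twilio.rest import Client

import scraper_requests
import scraper_utils

# Placeholders -- use your own Twilio credentials and numbers from Part 4
TWILIO_ACCOUNT_SID = 'your_account_sid'
TWILIO_AUTH_TOKEN = 'your_auth_token'
FROM_NUMBER = '+15551234567'
TO_NUMBER = '+15557654321'


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--verbose', action='store_true')
    parser.add_argument('--dryrun', action='store_true')
    parser.add_argument('--seed-db', action='store_true')
    args = parser.parse_args()

    request = scraper_requests.get_scraper_request()
    twilio_client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

    response = scraper_utils.get_response(request['base_url'],
                                          verbose=args.verbose)
    if response is None:
        # your Part 6 failure notification goes here
        print("Failed to fetch {}".format(request['base_url']))
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    extracted = request['extract_method'](soup)

    scraper_utils.process_extracted(
        twilio_client,
        extracted,
        request['collection'],
        FROM_NUMBER,
        TO_NUMBER,
        request['format_method'],
        args.seed_db,
        args.dryrun,
        args.verbose,
    )


if __name__ == '__main__':
    main()
```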
Much cleaner, eh?
If you’re following along with my suggestion to use version control, stage your files and commit them.
In the next post, we’ll see how much simpler it is to add a new request after the work we’ve done to this point. Till next time!