Previous articles

Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5

Next articles

Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8

Recap

In Part 1 we installed Python, pip, and the Requests library. We set up a basic program which fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine if the data fetched is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send this SMS. In Part 5, we added a few command-line options to help us in future development as we add support for new sites.

In this part, we will add some basic error handling and automation so that our scraper can be executed at specified times or intervals of time.

Error handling


Remember in Part 1, when we added the initial request "response = requests.get('http://andythemoron.com')"? We'll want to add a timeout argument so that the program doesn't just hang indefinitely:

response = requests.get("http://andythemoron.com", timeout=10)

If the server doesn't respond within the number of seconds supplied via the timeout parameter, a Timeout exception will be raised. (Strictly speaking, the timeout applies to establishing the connection and to each wait for data, not to the total download time.) Network issues, such as DNS failures and refused connections, will raise a ConnectionError exception. If, however, your request comes back with an error HTTP status (4xx or 5xx), you'll have to explicitly invoke raise_for_status() on the returned Response object to raise an HTTPError. All of these exceptions inherit from requests.exceptions.RequestException.
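
To make that concrete, here's a quick sketch of what catching each of these individually might look like, if you wanted to handle the different failure modes differently. In our script we'll just catch the base RequestException, since we treat them all the same:

try:
    response = requests.get("http://andythemoron.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.Timeout:
    print("The request timed out")
except requests.exceptions.ConnectionError as e:
    print("Network problem (DNS failure, refused connection, etc.):")
    print(e)
except requests.exceptions.HTTPError as e:
    print("Got an error HTTP status:")
    print(e)
except requests.exceptions.RequestException as e:
    print("Some other Requests error:")
    print(e)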

Depending on how frequently you plan on running the script, you may or may not want to retry failed requests a certain number of times. Let's say you don't really need up-to-the-minute information and only run the script every hour, but still want to retry a few times on failure. You could implement something like this to try 5 times, waiting a minute between attempts (this assumes time and sys have been imported, as in the full listing below):

response = None
for _ in range(5):
    try:
        response = requests.get("http://andythemoron.com", timeout=10)
        response.raise_for_status()
        break
    except requests.exceptions.RequestException as e:
        print("Caught exception...")
        print(e)
        # Clear the response so a failed HTTP status doesn't count as success below
        response = None
        time.sleep(60)
if response is None:
    # Depending on how critical this is, you could send yourself an SMS alert
    # to the effect that the request could not be completed.
    sys.exit(1)

Another possibility is that the logic you implemented to parse the web page becomes obsolete because the site's owners restructure their page elements. If you expect to always find matches for a given request, you might want to send yourself an SMS whenever no data is extracted, so you know your parsing logic no longer works and needs updating.

And since we're now sending out notifications on failure as well, let's keep the Twilio credentials and phone numbers in module-level variables so we can reuse them wherever we send a message. Your code might now look something like this:

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from twilio.rest import TwilioRestClient
import argparse
import sys
import time

# Use your own values for these
accountSid = "ACXXXXXXXXXXXXXXXXX"
authToken = "YYYYYYYYYYYYYYYYYY"
to_number = "+11231234567"
from_number = "+11239876543"

def format_message(post):
    return "New post: " + post['title'] + "\n" + post['link']

if __name__ == '__main__':
    # Add options for dry run, seeding the DB, and verbose output
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("-d", "--dryrun", help="dry run mode- no changes", action="store_true")
    arg_parser.add_argument("-s", "--seed", help="seed database, no messages", action="store_true")
    arg_parser.add_argument("-v", "--verbose", help="enable verbose output", action="store_true")
    args = arg_parser.parse_args()

    # Disallow seed and dryrun options from both being enabled
    if args.seed and args.dryrun:
        print("--seed and --dryrun are mutually exclusive")
        sys.exit(1)

    if args.verbose:
        print("Connecting to MongoDB")
    db_client = MongoClient()

    if args.verbose:
        print("Connecting to Twilio")
    twilio_client = TwilioRestClient(accountSid, authToken)

    if args.verbose:
        print("Requesting page...")
    response = None
    for _ in range(5):
        try:
            response = requests.get("http://andythemoron.com", timeout=10)
            response.raise_for_status()
            break
        except requests.exceptions.RequestException as e:
            print("Caught exception...")
            print(e)
            # Clear the response so a failed HTTP status doesn't count as success below
            response = None
            time.sleep(60)
    if response is None:
        if not args.dryrun:
            message = "Error fetching andythemoron.com"
            twilio_client.messages.create(to=to_number, from_=from_number, body=message)
        sys.exit(1)

    if args.verbose:
        print("Parsing response with BeautifulSoup")
    soup = BeautifulSoup(response.content, "lxml")

    if args.verbose:
        print("Extracting data from response")
    post_titles = soup.find_all("a", class_="post-title")
    extracted = []
    for post_title in post_titles:
        extracted.append({
            'title' : post_title['title'],
            'link'  : "andythemoron.com" + post_title['href']
        })
    if args.verbose:
        if not extracted:
            print("Failed to extract data")
        else:
            print("Extracted " + str(len(extracted)) + " items")

    # Assuming we should always find some data, exit if nothing found...
    if not extracted:
        if not args.dryrun:
            message = "Error parsing andythemoron.com"
            twilio_client.messages.create(to=to_number, from_=from_number, body=message)
        sys.exit(1)

    for post in extracted:
        if db_client.my_db.my_posts.find_one({'link': post['link']}) is None:
            if args.verbose:
                print("Found a new item: " + str(post))

            # If in dry run mode, continue to skip sending alerts or DB inserts
            if args.dryrun:
                continue

            # Send SMS via Twilio if not in seed mode
            if not args.seed:
                if args.verbose:
                    print("Sending SMS")
                message = format_message(post)
                twilio_client.messages.create(to=to_number, from_=from_number, body=message)
                
            # Add item to DB
            if args.verbose:
                print("Adding item to DB")
            db_client.my_db.my_posts.insert(post)

            # Sleep here if we sent out an SMS
            if not args.seed:
                time.sleep(3)
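
Before wiring this up to run automatically, you'll probably want to run it once in seed mode (the --seed flag we added in Part 5) so the posts that already exist get written to the database without blasting your phone with SMS messages, for example:

$ python scraper.py --seed --verbose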

If you're following along with my suggestion on using version control, stage your script for commit via "git add scraper.py" and commit with something like "git commit -m 'Improve error handling; retry on failure; notify on failure to fetch or parse data'".

Automate all the things!


Windows

In Windows, you can schedule tasks via the “Task Scheduler” application. It appears to be a fairly intuitive mechanism, but I don’t have any personal experience with it, so YMMV.
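
If you'd rather not click through the GUI, Windows also ships with a schtasks command-line tool for creating scheduled tasks. Something along these lines should run the scraper every 10 minutes (untested on my part, and the task name is arbitrary, so adjust the paths and name for your setup):

C:\> schtasks /Create /SC MINUTE /MO 10 /TN "WebScraper" /TR "C:\path\to\python.exe C:\path\to\scraper.py"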

Linux and MacOS

In Linux and MacOS, you can use cron to automate tasks. The cron daemon (background process) should already be running on your system. It wakes up every minute and checks all loaded crontabs to see if there are any tasks that need to be executed that minute. To modify your user-specific crontab, simply run:

$ crontab -e

This will open your personal crontab file for editing. You can create a crontab entry by specifying 6 whitespace-separated fields: the minute, hour, day of the month, month, and day of the week at which to run, followed by the command to execute. For the time fields you can specify individual values; match everything with *; use ranges, e.g., hours 8-17 to run from 8AM to 5PM; or use lists of values and ranges, e.g., hours 6,8-12,17 to run at hours 6, 8, 9, 10, 11, 12, and 17. You can also specify step intervals via /, e.g., minutes */5 to match every 5 minutes. Here are a few examples:

# minute    hour    day_of_month    month    day_of_week    command

# Run every 5 minutes
*/5 * * * * /path/to/python /path/to/scraper.py

# Run every 10 minutes from 8AM up until 1AM
*/10 0,8-23 * * * /path/to/python /path/to/scraper.py

# Run every hour from 8:30AM to 5:30PM on the first of the month
30 8-17 1 * * /path/to/python /path/to/scraper.py

# Run every day at 9AM for the month of February
0 9 * 2 * /path/to/python /path/to/scraper.py

# Run every Monday at 9AM (Sunday is both 0 and 7, Monday 1, etc.)
0 9 * * 1 /path/to/python /path/to/scraper.py

You can verify that you successfully updated your crontab by listing your crontab entries like so:

$ crontab -l
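
One last tip: cron normally emails a job's output to your local user (or silently drops it, depending on how mail is set up on your system), so it's worth redirecting the script's output to a log file so you can see what all those print() calls said when something goes wrong. For example:

*/10 * * * * /path/to/python /path/to/scraper.py >> /path/to/scraper.log 2>&1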

Right on! We now have an automated setup for fetching data from the web and sending out alerts via text message! Next time, we’ll refactor our code to handle different requests.

