Previous articles

Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4

Next articles

Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8

Recap

In Part 1 we installed Python, pip, and the Requests library. We set up a basic program which fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine if the data fetched is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send this SMS.

In this part, we will add a few command-line options that will help us as we refactor and extend this program to handle a wide variety of sites and use cases.

Let’s add some command-line options to help us out!

At this point, it would be handy to add some options to the script to help our development when we want to support new sites or extract different data. We’re going to add options for verbose output, dry run mode, and seed mode.

Let’s start by importing the argparse library and initializing an ArgumentParser:

>>> import argparse
>>> parser = argparse.ArgumentParser()

--verbose

Verbose output to aid with debugging, etc.

>>> parser.add_argument("-v", "--verbose", help="enable verbose output", action="store_true")
_StoreTrueAction(option_strings=['-v', '--verbose'], dest='verbose', nargs=0, const=True, default=False, type=None, choices=None, help='enable verbose output', metavar=None)

--dryrun

Don’t make any changes to the database, or send any messages. Useful if you’re developing a new parser or extracting new data.

>>> parser.add_argument("-d", "--dryrun", help="dry run mode- no changes", action="store_true")
_StoreTrueAction(option_strings=['-d', '--dryrun'], dest='dryrun', nargs=0, const=True, default=False, type=None, choices=None, help='dry run mode- no changes', metavar=None)

--seed

Populate the database, but don’t send out any messages. Useful when you want to add the initial data from a request, but you don’t want 30 text messages telling you what you already see…

>>> parser.add_argument("-s", "--seed", help="seed database, no messages", action="store_true")
_StoreTrueAction(option_strings=['-s', '--seed'], dest='seed', nargs=0, const=True, default=False, type=None, choices=None, help='seed database, no messages', metavar=None)
>>> args = parser.parse_args()
>>> args
Namespace(dryrun=False, seed=False, verbose=False)
>>> args.verbose
False
>>> args.seed
False
>>> args.dryrun
False

Let’s go ahead and implement these:

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from twilio.rest import TwilioRestClient
import time
# Import argparse library to parse command-line options
import argparse
# Import sys so we can exit with an error status if the options conflict
import sys

# Use your own values for these
accountSid = "ACXXXXXXXXXXXXXXXXX"
authToken = "YYYYYYYYYYYYYYYYYY"

def format_message(post):
    return "New post: " + post['title'] + "\n" + post['link']

if __name__ == '__main__':
    # Add options for dry run, seeding the DB, and verbose output
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("-d", "--dryrun", help="dry run mode- no changes", action="store_true")
    arg_parser.add_argument("-s", "--seed", help="seed database, no messages", action="store_true")
    arg_parser.add_argument("-v", "--verbose", help="enable verbose output", action="store_true")
    args = arg_parser.parse_args()

    # Disallow seed and dryrun options from both being enabled
    if args.seed and args.dryrun:
        print("--seed and --dryrun are mutually exclusive")
        sys.exit(1)

    if args.verbose:
        print("Connecting to MongoDB")
    db_client = MongoClient()

    if args.verbose:
        print("Connecting to Twilio")
    twilio_client = TwilioRestClient(accountSid, authToken)

    if args.verbose:
        print("Requesting page...")
    response = requests.get("http://andythemoron.com")

    if args.verbose:
        print("Parsing response with BeautifulSoup")
    soup = BeautifulSoup(response.content, "lxml")

    if args.verbose:
        print("Extracting data from response")
    post_titles = soup.find_all("a", class_="post-title")
    extracted = []
    for post_title in post_titles:
        extracted.append({
            'title' : post_title['title'],
            'link'  : "andythemoron.com" + post_title['href']
        })
    if args.verbose:
        if not extracted:
            print("Failed to extract data")
        else:
            print("Extracted " + str(len(extracted)) + " items")

    for post in extracted:
        if db_client.my_db.my_posts.find_one({'link': post['link']}) is None:
            if args.verbose:
                print("Found a new item: " + str(post))

            # If in dry run mode, continue to skip sending alerts or DB inserts
            if args.dryrun:
                continue

            # Send SMS via Twilio if not in seed mode
            if not args.seed:
                if args.verbose:
                    print("Sending SMS")
                toNumber = "+11231234567"  # Use your own cell number here
                fromNumber = "+11239876543"  # Use your own Twilio number here
                message = format_message(post)
                twilio_client.messages.create(to=toNumber, from_=fromNumber, body=message)
                
            # Add item to DB
            if args.verbose:
                print("Adding item to DB")
            db_client.my_db.my_posts.insert_one(post)

            # Sleep here if we sent out an SMS
            if not args.seed:
                time.sleep(3)
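As an aside, argparse can enforce the "--seed and --dryrun are mutually exclusive" rule for us, which would make the manual check above unnecessary. A minimal sketch using the same option names:

arg_parser = argparse.ArgumentParser()
# Options in this group cannot be combined; argparse prints a usage error if they are
exclusive_group = arg_parser.add_mutually_exclusive_group()
exclusive_group.add_argument("-d", "--dryrun", help="dry run mode - no changes", action="store_true")
exclusive_group.add_argument("-s", "--seed", help="seed database, no messages", action="store_true")
# --verbose stays outside the group since it can combine with either option
arg_parser.add_argument("-v", "--verbose", help="enable verbose output", action="store_true")
args = arg_parser.parse_args()

I kept the explicit check in the script because it's a little more obvious to a reader, but either approach works.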

You can now pass in command-line arguments using Unix-style single-hyphen short options or GNU-style double-hyphen long options. To illustrate the many different ways you can specify verbose output and dry-run mode: “-dv”, “-d -v”, “-d --verbose”, “--dryrun -v”, and “--dryrun --verbose” all represent the same options.
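You can convince yourself these are equivalent back in the interpreter by handing parse_args an explicit argument list instead of letting it read the real command line:

>>> parser.parse_args(["-dv"])
Namespace(dryrun=True, seed=False, verbose=True)
>>> parser.parse_args(["--dryrun", "--verbose"])
Namespace(dryrun=True, seed=False, verbose=True)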

For example, here's a dry run with verbose output:

$ python3 scraper.py --dryrun --verbose
Connecting to MongoDB
Connecting to Twilio
Requesting page...
Parsing response with BeautifulSoup
Extracting data from response
Extracted 7 items
Found a new item: {'title': 'Web Scraping for Fun and Profit: Part 5', 'link': 'andythemoron.com/blog/2017-03-29/Scraping-For-Fun-And-Profit-Part-5'}
Found a new item: {'title': 'Web Scraping for Fun and Profit: Part 4', 'link': 'andythemoron.com/blog/2017-03-27/Scraping-For-Fun-And-Profit-Part-4'}
Found a new item: {'title': 'Web Scraping for Fun and Profit: Part 3', 'link': 'andythemoron.com/blog/2017-03-22/Scraping-For-Fun-And-Profit-Part-3'}
Found a new item: {'title': 'Web Scraping for Fun and Profit: Part 2', 'link': 'andythemoron.com/blog/2017-03-18/Scraping-For-Fun-And-Profit-Part-2'}
Found a new item: {'title': 'Web Scraping for Fun and Profit: Part 1', 'link': 'andythemoron.com/blog/2017-03-15/Scraping-For-Fun-And-Profit-Part-1'}
Found a new item: {'title': 'Web Scraping for Fun and Profit: A Primer', 'link': 'andythemoron.com/blog/2017-03-12/Scraping-For-Fun-And-Profit'}
Found a new item: {'title': 'A Different Kind of Moron', 'link': 'andythemoron.com/blog/2017-03-08/A-Different-Kind-Of-Moron'}

What’s left from here?

By now, you can probably see which components we'll want to vary across different sites.

A given request will be associated with a specific URL, an optional dictionary of request parameters (we didn't need any for this initial example), a function for extracting the relevant data, and a function for formatting the extracted data for delivery. You may also want to specify different numbers for SMS delivery on a per-request basis, or the specific collection into which to insert the extracted data for a given request. These are the elements we can refactor around, and we'll return to them in future posts.
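To make that concrete, here is one rough sketch of how per-request configuration might eventually look. Every name here (SITES, extract_blog_posts, format_blog_message) is hypothetical and not part of scraper.py yet:

def extract_blog_posts(soup):
    # The same extraction logic we already use for this site
    return [{'title': a['title'], 'link': "andythemoron.com" + a['href']}
            for a in soup.find_all("a", class_="post-title")]

def format_blog_message(post):
    return "New post: " + post['title'] + "\n" + post['link']

# One entry per request we want the scraper to make
SITES = [
    {
        'url': "http://andythemoron.com",
        'params': None,                    # optional dict of request parameters
        'extract': extract_blog_posts,     # parsed soup -> list of items
        'format': format_blog_message,     # item -> SMS body
        'to_numbers': ["+11231234567"],    # per-request SMS recipients
        'collection': "my_posts",          # MongoDB collection for this request
    },
]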

In the next post we’ll automate the execution of this script, which will vary depending on your operating system. I will cover the basic configuration of cron for Linux and macOS. For Windows, you can use a “scheduled task”, which seems relatively easy to configure, although I don’t have any personal experience setting one up, so YMMV.

We’ll also add some basic error handling for the connection and timeout errors that will inevitably occur when fetching pages.
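As a rough preview (using the requests and sys modules the script already imports, and an arbitrary 10-second timeout), the page fetch might end up wrapped in something like this:

try:
    response = requests.get("http://andythemoron.com", timeout=10)
    response.raise_for_status()  # treat HTTP error status codes as failures too
except requests.exceptions.RequestException as e:
    print("Request failed: " + str(e))
    sys.exit(1)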

If you’re following my advice on version control, stage your changes for commit with “git add scraper.py”, and commit them with something like “git commit -m ‘Add command-line options for dry run, seed database, and verbose output’”.

