In Part 1, we installed Python, pip, and the Requests library, and set up a basic program that fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page and extract the data we care about. In Part 3, we installed MongoDB and the pymongo library and used them to determine whether the fetched data is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send that SMS.
In this part, we'll add a few command-line options that will help as we refactor and extend this program to handle a wide variety of sites and use cases.
Let’s add some command-line options to help us out!
At this point, it would be handy to add some options to the script to help our development when we want to support new sites or extract different data. We’re going to add options for verbose output, dry run mode, and seed mode.
Let’s start by importing the argparse library and initializing an ArgumentParser:
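A minimal version looks something like this (the description string is just a placeholder, so use whatever fits your script):

```python
import argparse

# Create the parser; the description is shown when the script is run with --help.
parser = argparse.ArgumentParser(
    description="Fetch a page, extract new posts, and send them via SMS."
)
```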
--verbose
Verbose output to aid with debugging, etc.
--dryrun
Don’t make any changes to the database, or send any messages. Useful if you’re developing a new parser or extracting new data.
--seed
Populate the database, but don’t send out any messages. Useful when you want to add the initial data from a request, but you don’t want 30 text messages telling you what you already see…
Let’s go ahead and implement these:
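Here's a sketch; the single-letter short forms ("-v", "-d", "-s") are included alongside the long options so that the combined style shown below works:

```python
# Add the three flags; store_true makes each one a simple on/off switch
# that defaults to False.
parser.add_argument("-v", "--verbose", action="store_true",
                    help="verbose output to aid with debugging")
parser.add_argument("-d", "--dryrun", action="store_true",
                    help="don't make changes to the database or send any messages")
parser.add_argument("-s", "--seed", action="store_true",
                    help="populate the database but don't send any messages")

args = parser.parse_args()
```

Elsewhere in the script you can then consult args.verbose, args.dryrun, and args.seed to decide whether to print debug output, write to MongoDB, or send an SMS.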
You can now pass command-line arguments in the Unix-style single-hyphen format or via GNU-style long options. To illustrate the many different ways you can specify verbose output and dry-run mode: “-dv”, “-d -v”, “-d --verbose”, “--dryrun -v”, and “--dryrun --verbose” all represent the same options.
For example, assuming the script is saved as scraper.py (as in the earlier parts):
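```
$ python scraper.py --dryrun --verbose
$ python scraper.py -d -v
$ python scraper.py -dv
```

Each of these invocations sets both args.dryrun and args.verbose to True (given the flags defined in the snippet above).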
What’s left from here?
By now, you can probably glean some of the components we'll want to vary across different sites.
A given request will be associated with a specific URL, an optional dictionary of request parameters (we didn't need any for this initial example), a function for extracting the relevant data, and a function for formatting the extracted data for delivery. You may also want to specify different phone number(s) for SMS delivery on a per-request basis, or the specific collection into which to insert the extracted data for a given request. These are the elements we'll refactor around in future posts.
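To make the shape of that refactor a little more concrete, here's a purely hypothetical sketch; none of these names exist in the script yet, and the extract and format entries stand in for the functions we wrote in Parts 2 and 4:

```python
# Hypothetical sketch only -- these names are placeholders, not part of the
# script yet. extract_posts and format_post stand in for the parsing and
# SMS-formatting functions from Parts 2 and 4.
SITES = [
    {
        "url": "https://example.com/forum",
        "params": {},                     # optional request parameters
        "extract": extract_posts,         # response -> list of post dicts
        "format": format_post,            # post dict -> SMS body text
        "sms_numbers": ["+15555550100"],  # per-request recipients
        "collection": "posts",            # per-request MongoDB collection
    },
]
```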
In the next post, we'll automate the execution of this script; the implementation will vary depending on your operating system. I will cover the basic configuration of cron for Linux and macOS. For Windows, you can use a “scheduled task”, which seems like it's relatively easy to configure, although I don't have any personal experience setting it up, so YMMV.
We'll also add some basic error handling for the connection and timeout errors that will inevitably occur while fetching pages.
If you're following my advice on version control, stage your changes for commit with "git add scraper.py", and commit them with something like "git commit -m 'Add command-line options for dry run, seed database, and verbose output'".