Previous articles
Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Next articles
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8
Recap
In Part 1 we installed Python, pip, and the Requests library. We set up a basic program which fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine if the data fetched is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send this SMS. In Part 5, we added a few command-line options to help us in future development as we add support for new sites.
In this part, we will add some basic error handling and automation so that our scraper can run at specified times or intervals.
Error handling
Remember in Part 1, when we added the initial request "response = requests.get('http://andythemoron.com')"? We'll want to add a timeout argument so that the program doesn't just hang indefinitely:
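Something like this should do (the 10-second limit is just an assumption; pick whatever makes sense for your use case):

import requests

# Give up if the server hasn't responded within 10 seconds
response = requests.get('http://andythemoron.com', timeout=10)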
If the request takes longer than the number of seconds supplied via the timeout parameter, a Timeout exception will be raised. Network issues, such as DNS failures and refused connections, will raise a ConnectionError exception. If, however, your request receives a non-successful HTTP response, you’ll have to specifically invoke raise_for_status() on the returned Response object to raise an HTTPError. All of these exceptions inherit from requests.exceptions.RequestException.
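Since they share that common base class, one way to handle all of them in a single place (just a sketch; we'll fold this into retry logic below) is:

import requests

try:
    response = requests.get('http://andythemoron.com', timeout=10)
    response.raise_for_status()  # turn a 4xx/5xx response into an HTTPError
except requests.exceptions.RequestException as e:
    # Catches Timeout, ConnectionError, and HTTPError alike
    print('Failed to fetch the page:', e)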
Depending on how frequently you plan on running the script, you may or may not want to retry a certain number of times before giving up. Let's say that you don't really need incredibly up-to-date information and only run the script every hour, but want to retry a few times on failure. You could implement something like this to try 5 times, waiting a minute between attempts:
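Here's one way that might look (the fetch_page helper and the constant names are placeholders of mine, not necessarily how you've structured your script):

import time
import requests

MAX_ATTEMPTS = 5
RETRY_DELAY_SECONDS = 60

def fetch_page(url):
    # Return the Response on success, or None if every attempt fails
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(RETRY_DELAY_SECONDS)  # wait a minute, then try again
    return None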
Another possibility is that the logic you implemented to parse the web page becomes obsolete if the owners decide to restructure their page elements. If you expect to always find matches for a given request, you might want to send yourself an SMS whenever no data is extracted, alerting you that your parsing logic is no longer working.
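For instance, assuming the Twilio setup from Part 4, and using extract_posts as a stand-in for whatever your parsing function is called, you could do something like:

from twilio.rest import Client

# Placeholder credentials and numbers -- reuse the ones you set up in Part 4
twilio_client = Client('ACCOUNT_SID', 'AUTH_TOKEN')

posts = extract_posts(response.text)  # stand-in for your BeautifulSoup parsing from Part 2
if not posts:
    twilio_client.messages.create(
        to='+15551234567',     # your phone number
        from_='+15557654321',  # your Twilio number
        body='Scraper warning: no posts extracted; the parsing logic may be stale')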
And since we're now sending out notifications on failure, let's pull the values we reuse for them (like the Twilio credentials and phone numbers) into global variables. Your code might now look something like this:
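Here's a rough sketch of how the pieces might fit together (the URL, credentials, numbers, and extract_posts are all stand-ins for what you already have from the earlier parts):

import time
import requests
from twilio.rest import Client

URL = 'http://andythemoron.com'
MAX_ATTEMPTS = 5
RETRY_DELAY_SECONDS = 60

# Placeholders -- fill in the values you set up in Part 4
TWILIO_ACCOUNT_SID = 'ACCOUNT_SID'
TWILIO_AUTH_TOKEN = 'AUTH_TOKEN'
SMS_TO = '+15551234567'
SMS_FROM = '+15557654321'

twilio_client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

def send_sms(body):
    twilio_client.messages.create(to=SMS_TO, from_=SMS_FROM, body=body)

def fetch_page(url):
    # Try up to MAX_ATTEMPTS times, waiting a minute between attempts
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException:
            if attempt < MAX_ATTEMPTS - 1:
                time.sleep(RETRY_DELAY_SECONDS)
    return None

response = fetch_page(URL)
if response is None:
    send_sms('Scraper error: could not fetch ' + URL)
else:
    posts = extract_posts(response.text)  # your BeautifulSoup parsing from Part 2
    if not posts:
        send_sms('Scraper warning: no posts extracted; the parsing logic may be stale')
    # ...otherwise, carry on with the MongoDB check and SMS delivery from Parts 3 and 4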
If you're following along with my suggestion on using version control, stage your script for commit via "git add scraper.py" and commit with something like "git commit -m 'Improve error handling; retry on failure; notify on failure to fetch or parse data'".
Automate all the things!
Windows
In Windows, you can schedule tasks via the “Task Scheduler” application. It appears to be a fairly intuitive mechanism, but I don’t have any personal experience with it, so YMMV.
Linux and MacOS
In Linux and MacOS, you can use cron to automate tasks. The cron daemon (background process) should already be running on your system. It wakes up every minute and checks all loaded crontabs to see if there are any tasks that need to be executed that minute. To modify your user-specific crontab, simply run:
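crontab -e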
This will open your personal crontab file for editing. You can create a crontab entry by specifying 6 fields, separated by whitespace. These fields correspond to specific minutes, hours, days of the month, months, days of the week, and the command to execute. You can specify individual values, match all (*), ranges, e.g., hours 8-17 to run from 8AM to 5PM, or lists of individual values or ranges, e.g., hours 6,8-12,17 to run at hours 6, 8, 9, 10, 11, 12, and 17. You can also specify intervals via /, e.g., minutes */5 to match every 5 minutes. Here are a few examples:
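(The interpreter and script paths below are placeholders; point them at your own Python and scraper.py.)

# Run the scraper at the top of every hour
0 * * * * /usr/bin/python3 /path/to/scraper.py

# Run it every 5 minutes during the 8AM-5PM hours, Monday through Friday
*/5 8-17 * * 1-5 /usr/bin/python3 /path/to/scraper.py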
You can verify that you successfully updated your crontab by listing your crontab entries like so:
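crontab -l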
Right on! We now have an automated setup for fetching data from the web and sending out alerts via text message! Next time, we’ll refactor our code to handle different requests.