In Part 1 we installed Python, pip, and the Requests library, and set up a basic program that fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page and extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine whether the fetched data is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send that SMS. In Part 5, we added a few command-line options to help us in future development as we add support for new sites. In Part 6, we went over basic error handling and how to automate this script.
In this part, we are going to do some refactoring to clean up the code so that it is easier to add new requests.
Refactoring
One of the main reasons to refactor is to improve the architecture of your program in the process of adding new functionality. The work we do here will make it very easy to add support for new requests, as you’ll see in the next post. The --verbose and --dryrun options we added in Part 5 will make this work much easier as you introduce changes into your program!
In this post, we’re going to work towards a Python dictionary holding the parameters unique to a given request, hollowing out scraper.py into the generic mechanism that dispatches the heavy lifting as needed. The end result will be simpler, clearer code that is easier to work with going forward. For now, we are going to separate out the base URL, the extraction method, the formatting method, and the MongoDB collection, and while we’re at it, we’ll move some of the implementation out to a separate scraper_utils.py file.
Pull out extraction logic
Let’s start by pulling out the logic for extracting posts from the BeautifulSoup object into a new method extract_posts in a new file andy_the_moron_parser.py:
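Something along these lines — a minimal sketch, since the exact selectors depend on the markup you targeted back in Part 2 (the `<article>`/link structure here is an assumption, so swap in your own extraction logic):

```python
# andy_the_moron_parser.py


def extract_posts(soup):
    """Extract post data from a BeautifulSoup object.

    This is the logic lifted out of scraper.py. The selectors below assume
    each post is an <article> with a linked title; adjust them to match
    whatever you built in Part 2.
    """
    posts = []
    for article in soup.find_all('article'):
        link = article.find('a')
        if link is None:
            continue
        posts.append({
            'title': link.get_text(strip=True),
            'url': link.get('href'),
        })
    return posts
```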
Pull out formatting logic
Now let’s move over the format_message function into andy_the_moron_parser.py, but let’s rename it format_post to make the naming more consistent with extract_posts:
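Again a sketch — the exact message text is whatever you settled on in Part 4, and the 'title'/'url' keys match the extract_posts sketch above:

```python
# andy_the_moron_parser.py


def format_post(post):
    """Format an extracted post for SMS delivery (formerly format_message)."""
    return "New post: {title}\n{url}".format(**post)
```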
Get the request
Let’s create a file scraper_requests.py that returns the request dictionary for the current request. (In the next post, we’ll make this a list of dictionaries, but let’s stick to the singular for now…) Since we only need the MongoDB client to fetch the collection, let’s do that work here:
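Here’s one possible shape for it. The key names ('base_url', 'extract_method', 'format_method', 'collection') are the ones referenced in the snippets below; the database and collection names passed to pymongo are placeholders, so use your own from Part 3:

```python
# scraper_requests.py
from pymongo import MongoClient

import andy_the_moron_parser


def get_scraper_request():
    """Return the request dictionary for the current (single) request."""
    mongo_client = MongoClient()  # assumes a local MongoDB instance, as in Part 3
    return {
        'base_url': 'https://andythemoron.com',
        'extract_method': andy_the_moron_parser.extract_posts,
        'format_method': andy_the_moron_parser.format_post,
        'collection': mongo_client.scraper.posts,  # placeholder db/collection names
    }
```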
Since we are getting the collection from scraper_requests.py, we no longer need it in scraper.py.
Using the extraction method
Now let’s use the ‘extract_method’ from the request dictionary:
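In scraper.py, that looks something like this (soup being whatever name you gave the parsed page in Part 2):

```python
# scraper.py
import scraper_requests

request = scraper_requests.get_scraper_request()

# ...after parsing the fetched page into `soup` with BeautifulSoup:
extracted = request['extract_method'](soup)
```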
Let’s also use the request dictionary ‘base_url’ value for requesting the data and sending failure notifications:
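Roughly like so — notify_failure is a hypothetical stand-in for whatever failure notification you built in Part 6:

```python
# scraper.py
response = requests.get(request['base_url'])
if response.status_code != 200:
    # notify_failure is a placeholder for your Part 6 failure notification
    notify_failure("Failed to fetch {}".format(request['base_url']))
```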
Using the collection
Let’s update the references to the collection with the collection from the request dictionary:
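Assuming the duplicate check from Part 3 keys off the post URL (use whichever field you actually stored), the change looks like this:

```python
# scraper.py
collection = request['collection']

# ...inside the loop over extracted posts:
if collection.find_one({'url': post['url']}) is None:
    # this post is new; store it before formatting and sending
    collection.insert_one(post)
```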
Using the formatting method
Let’s update the call to format_message() to use the request’s format method:
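Which is a one-line change:

```python
# scraper.py
# was: message = format_message(post)
message = request['format_method'](post)
```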
Test it out
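Assuming the flag spellings from Part 5 (adjust if yours differ), a verbose dry run is a safe way to check:

```
python scraper.py --verbose --dryrun
```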
Awesome, it works as before! If not, the verbose output should help you figure out where the failure lies…
Pull out extracted data processing loop
Let’s copy the logic from within the “for post in extracted” loop into a new method process_extracted in a new file scraper_utils.py. This new method will need to be passed the Twilio client, the extracted data, the collection, the from_number, the to_number, the format_method, and the option values for seed_db, dry_run, and verbose. We’ll pass these values in and adjust the references accordingly. You may notice that I also changed “post” to “item” for a more generic name.
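Here’s a sketch of what that might look like — the dedup check, seeding behavior, and Twilio call reflect my reading of Parts 3 through 5, so adjust the details to match your own loop:

```python
# scraper_utils.py


def process_extracted(twilio_client, extracted, collection, from_number,
                      to_number, format_method, seed_db, dry_run, verbose):
    """Store new items and send SMS notifications for them."""
    for item in extracted:
        if collection.find_one({'url': item['url']}) is not None:
            continue  # we've already seen this item
        if verbose:
            print("New item found: {}".format(item['url']))
        if not dry_run:
            collection.insert_one(item)
        if seed_db or dry_run:
            continue  # don't send SMS while seeding the database or on a dry run
        twilio_client.messages.create(
            body=format_method(item),
            from_=from_number,
            to=to_number,
        )
```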
We’ll call it in place of the previous code like so:
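With twilio_client, from_number, to_number, seed_db, dry_run, and verbose being whatever names your scraper.py already uses for the Twilio client, phone numbers, and parsed options:

```python
# scraper.py
import scraper_utils

# ...
scraper_utils.process_extracted(
    twilio_client,
    extracted,
    request['collection'],
    from_number,
    to_number,
    request['format_method'],
    seed_db,
    dry_run,
    verbose,
)
```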
Let’s test it out!
Extract the request loop
Just one more bit of refactoring for now! Let’s pull out the request loop into a new method get_response() in scraper_utils.py:
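I’m assuming the request loop here is the retry-on-failure loop from Part 6; the retry count and back-off below are placeholders, so carry over your own values:

```python
# scraper_utils.py
import time

import requests


def get_response(url, retries=3, verbose=False):
    """Fetch a URL, retrying on failure; return the response or None."""
    for attempt in range(retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                return response
            if verbose:
                print("Got status {} from {}".format(response.status_code, url))
        except requests.RequestException as exc:
            if verbose:
                print("Request failed: {}".format(exc))
        time.sleep(2 ** attempt)  # simple exponential back-off between attempts
    return None
```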
And let’s use it in scraper.py:
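Again using the hypothetical notify_failure stand-in from earlier:

```python
# scraper.py
response = scraper_utils.get_response(request['base_url'], verbose=verbose)
if response is None:
    # placeholder for your Part 6 failure notification
    notify_failure("Failed to fetch {}".format(request['base_url']))
```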
Let’s run the scraper once more to verify that this works as expected.
Tada!
Your scraper.py file should now look something like this:
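Something along these lines — the Twilio credentials, phone numbers, and flag definitions are placeholders standing in for whatever you set up in Parts 4 and 5:

```python
# scraper.py
import argparse

from bs4 import BeautifulSoup
from twilio.rest import Client

import scraper_requests
import scraper_utils

# Placeholders -- use your own Twilio credentials and numbers from Part 4
TWILIO_ACCOUNT_SID = 'your_account_sid'
TWILIO_AUTH_TOKEN = 'your_auth_token'
FROM_NUMBER = '+15551234567'
TO_NUMBER = '+15557654321'


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--verbose', action='store_true')
    parser.add_argument('--dryrun', action='store_true')
    parser.add_argument('--seed-db', action='store_true')
    args = parser.parse_args()

    request = scraper_requests.get_scraper_request()
    twilio_client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

    response = scraper_utils.get_response(request['base_url'],
                                          verbose=args.verbose)
    if response is None:
        # your Part 6 failure notification goes here
        print("Failed to fetch {}".format(request['base_url']))
        return

    soup = BeautifulSoup(response.text, 'html.parser')
    extracted = request['extract_method'](soup)

    scraper_utils.process_extracted(
        twilio_client,
        extracted,
        request['collection'],
        FROM_NUMBER,
        TO_NUMBER,
        request['format_method'],
        args.seed_db,
        args.dryrun,
        args.verbose,
    )


if __name__ == '__main__':
    main()
```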
Much cleaner, eh?
If you’re following along with my suggestion to use version control, stage your files and commit them.
In the next post, we’ll see how much simpler it is to add a new request after the work we’ve done to this point. Till next time!