Previous articles
Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Next articles
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8
Recap
In Part 1 we installed Python, pip, and the Requests library. We set up a basic program which fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about.
In this post, we will set up a database for post listings, which we will use to determine if a post is new, and hence, whether a notification should be sent out.
Introducing MongoDB
So why use MongoDB and not a standard lightweight relational database management system like SQLite? For the uninitiated, standard relational databases organize data into columns and rows that allow for easy querying and maintenance of the database (usually using SQL). This requires you to specify the structure of your data up front, which imposes a cost in time spent designing and managing the database schema.
MongoDB is a NoSQL database which simply stores arbitrary documents (i.e., blobs of JSON data). Given that the basic functionality we need here is simply the ability to add new listings and to query for whether a given listing already exists, MongoDB’s schema-less flexibility is a boon rather than a liability, and fits in perfectly with storing each saved entry as a Python dictionary.
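For example, a saved listing might end up looking something like this (the exact fields are up to you; these are just illustrative):

    # One listing, stored exactly as the Python dict we build while scraping
    listing = {
        "title": "Web Scraping for Fun and Profit: Part 3",
        "url": "/web-scraping-for-fun-and-profit-part-3",
    }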
Install MongoDB
If you are using a Linux distribution or macOS, I would highly recommend using a package manager for this. Otherwise, you can check out the official MongoDB installation instructions on the MongoDB website.
If you are using macOS, the choice of package manager is simple: Homebrew. You can install it with the following command:
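    # The install command published on https://brew.sh (check there for the current version)
    /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"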
After installing Homebrew, simply run “brew install mongodb” to install MongoDB. You should be able to start MongoDB, and have it run automatically on system start, by running “brew services start mongodb”. However, this did not work on my setup… To manually start the MongoDB server, create the default MongoDB data directory by running “mkdir -p /data/db” (you may need sudo for this) and then invoke “mongod” to start the MongoDB server.
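In command form:

    brew install mongodb
    brew services start mongodb      # starts mongod now and on every boot
    # If that doesn't work, fall back to starting it by hand:
    mkdir -p /data/db                # may need sudo
    mongod
    # Note: newer Homebrew versions ship MongoDB from its own tap instead:
    #   brew tap mongodb/brew && brew install mongodb-community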
For Linux, the picture is a little murkier, as there are a variety of distributions supporting different package managers. Perhaps the most widely supported package manager for Linux distributions, including Ubuntu, is apt, which you can use to install MongoDB like so: “sudo apt-get install mongodb”. Several others use yum, which you can similarly use: “sudo yum install mongodb”. I believe that both of these methods will start mongod automatically (the server process your code connects to) and configure it to run when your computer boots. Individual distributions vary in terms of how startup services are managed.
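That is:

    # apt-based distributions (Ubuntu, Debian, ...)
    sudo apt-get install mongodb

    # yum-based distributions
    sudo yum install mongodb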
Start MongoDB and verify that it is running
On Linux and macOS, you can run the following command to check that the mongod process is actually running:
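    $ ps aux | grep mongod           # your output will differ, but the shape is the same
    yourname    819   ...   mongod
    yourname   1021   ...   grep mongod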
This should return two lines similar to the output in the example above: one for the mongod process itself (the first line), and one for the grep process searching for “mongod” (the second line).
Don’t worry, I’ll undoubtedly come back and dive further into processes and utilities like ‘ps’ in a future post.
Install pymongo, the Python driver for MongoDB
Let’s use pip to install the Python drivers for MongoDB.
If using Python 3.x:
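    pip3 install pymongo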
If using Python 2.x:
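    pip install pymongo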
MongoDB basics
A single MongoDB instance contains 0 or more databases, each of which contains 0 or more associated collections of documents. MongoDB will automatically create databases and collections for you if they are referenced, but do not yet exist.
Let’s play around with MongoDB in the Python REPL by invoking “python3” or “python” as appropriate, and see how this works.
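Here’s roughly what that looks like (the “scraper” database and “listings” collection names below are arbitrary; pick whatever suits your project):

    >>> from pymongo import MongoClient
    >>> client = MongoClient()        # connects to localhost:27017 by default
    >>> db = client.scraper           # this database doesn't exist yet...
    >>> collection = db.listings      # ...and neither does this collection
    >>> db.list_collection_names()    # nothing has been written, so nothing has been created
    []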
Within a collection, we’ll be using the find_one() and insert_one() methods (called insert() in older versions of pymongo), which do exactly what you’d think. It might surprise you that invoking find() on an empty collection still returns a pymongo.cursor.Cursor object. All find() queries return this cursor, which lets you iterate over all the results, even when there are none. Since all we need to check is whether a given listing already exists, find_one() is enough for our purposes. Let’s see how this works in the REPL, continuing on from before:
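    >>> collection.find({"title": "A post that isn't there"})
    <pymongo.cursor.Cursor object at 0x...>
    >>> collection.find_one({"title": "A post that isn't there"})     # returns None, prints nothing
    >>> collection.insert_one({"title": "Hello", "url": "/hello"})    # the fields are just an example
    <pymongo.results.InsertOneResult object at 0x...>
    >>> collection.find_one({"title": "Hello"})
    {'_id': ObjectId('...'), 'title': 'Hello', 'url': '/hello'}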
Putting it all together
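Below is a sketch of what scraper.py might look like at this point. The URL, the CSS selector, and the database and collection names are placeholders; swap in the URL and the parsing logic you built in Parts 1 and 2.

    # scraper.py
    import requests
    from bs4 import BeautifulSoup
    from pymongo import MongoClient

    URL = "https://example.com"            # placeholder: the site you've been scraping

    client = MongoClient()                 # local MongoDB instance on the default port
    collection = client.scraper.listings   # created automatically on first insert

    response = requests.get(URL)
    soup = BeautifulSoup(response.text, "html.parser")

    new_listings = 0
    for link in soup.select("h2 a"):       # placeholder selector: use the one from Part 2
        listing = {"title": link.get_text(strip=True), "url": link.get("href")}
        # Only insert (and, eventually, notify) if we haven't seen this listing before
        if collection.find_one({"url": listing["url"]}) is None:
            collection.insert_one(listing)
            print("New listing:", listing["title"])
            new_listings += 1

    if new_listings == 0:
        print("No new listings found.")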
Now run the program and verify that the posts were added on the first run. Try running it again, and verify that there are no new posts to be added to the database.
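With the sketch above, the two runs would look something along these lines (your titles will be whatever the site actually lists):

    $ python3 scraper.py
    New listing: Web Scraping for Fun and Profit: A Primer
    New listing: Web Scraping for Fun and Profit: Part 1
    ...
    $ python3 scraper.py
    No new listings found.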
If you’re following my advice on version control, stage your changes and commit them with a message along these lines:
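    git add scraper.py
    git commit -m "Insert new listings in the database"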
Voila!