
Recap

In Part 1 we installed Python, pip, and the Requests library. We set up a basic program which fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about.

In this post, we will set up a database for post listings, which we will use to determine if a post is new, and hence, whether a notification should be sent out.

Introducing MongoDB

So why use MongoDB rather than a standard lightweight relational database management system like SQLite? For the uninitiated, relational databases organize data into tables of rows and columns, which makes querying and maintaining the database easy (usually via SQL). The trade-off is that you must specify the structure of your data up front, which costs time spent defining and managing a schema.

MongoDB is a NoSQL database which simply stores arbitrary documents (i.e., blobs of JSON data). Given that the basic functionality we need here is simply the ability to add new listings and to query for whether a given listing already exists, MongoDB’s schema-less flexibility is a boon rather than a liability, and fits in perfectly with storing each saved entry as a Python dictionary.
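
Concretely, each saved listing will just be a plain Python dictionary, stored as-is. For example (the field values here are placeholders, not real data):

# A MongoDB document is an arbitrary JSON-like blob; in Python, a plain dict.
# No schema to declare up front, and documents in the same collection
# need not even share the same fields.
listing = {
    'title': 'Some Post Title',
    'link': 'andythemoron.com/blog/some-post'
}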

Install MongoDB

If you are using a Linux distribution or macOS, I would highly recommend using a package manager for this. Otherwise, you can follow the official MongoDB installation instructions for your platform.

If you are using macOS, the choice of package manager is simple: Homebrew. You can install it with the following command:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

After installing Homebrew, simply run “brew install mongodb” to install MongoDB. You should then be able to start MongoDB, and have it run automatically on system start, by running “brew services start mongodb”. However, this did not work on my setup… To start the MongoDB server manually instead, create the default MongoDB data directory with “mkdir -p /data/db” and then invoke “mongod”.

For Linux, the picture is a little murkier, as different distributions use different package managers. Perhaps the most widely supported is apt, used by Ubuntu and other Debian-based distributions, with which you can install MongoDB like so: “sudo apt-get install mongodb”. Several others use yum, which works similarly: “sudo yum install mongodb”. I believe that both of these methods will automatically start mongod (the server process your applications connect to) and configure it to run when your computer boots, though individual distributions vary in how startup services are managed.

Start MongoDB and verify that it is running

On Linux and MacOS, you can run the following command to check that the mongod process is actually running:

$ ps -ef | grep mongod
mongodb  842       1   0  7:51PM ?          00:00:11 /usr/bin/mongod --config /etc/mongodb.conf
andy     84676 23435   0  8:02PM ttys004    00:00:00 grep mongod

This should return two lines similar to the example output: one for the mongod process itself (the first line), and one for the grep command searching for “mongod” (the second line).

Don’t worry, I’ll undoubtedly come back and dive further into processes and utilities like ‘ps’ in a future post.

Install pymongo, the Python driver for MongoDB

Let’s use pip to install the Python drivers for MongoDB.

If using Python 3.x:

$ sudo pip3 install pymongo

If using Python 2.x:

$ sudo pip install pymongo
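
Once pymongo is installed and mongod is running, you can sanity-check the whole setup from a Python REPL by pinging the server. A quick sketch: “ping” is a standard MongoDB admin command, and the timeout here is an arbitrary value I chose so that a dead server fails fast rather than hanging:

>>> from pymongo import MongoClient
>>> client = MongoClient(serverSelectionTimeoutMS=2000)
>>> client.admin.command('ping')
{u'ok': 1.0}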

MongoDB basics

A single MongoDB instance contains zero or more databases, each of which contains zero or more collections of documents. MongoDB will automatically create databases and collections for you if they are referenced but do not yet exist.

Let’s play around with MongoDB in the Python REPL by invoking “python3” or “python” as appropriate, and see how this works.

>>> from pymongo import MongoClient
>>> db_client = MongoClient()
>>> db_client
MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True)
>>> my_db = db_client.my_db
>>> my_db
Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'my_db')
>>> my_posts = my_db.posts
>>> my_posts
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True), u'my_db'), u'posts')

Within a collection, we’ll be using the find_one() and insert() methods, which do exactly what you’d think. It might surprise you that invoking find() on an empty collection still returns a pymongo.cursor.Cursor object. All find() queries return this cursor, which lets you iterate over the results, even when there are none. Since we only need to check whether a listing already exists in the database, find_one() is all we need: it returns the first matching document, or None if there is no match. Let’s see how this works in the REPL, continuing on from before.

>>> my_posts.find_one()
>>> my_posts.insert({'movie': 'Pulp Fiction'})
ObjectId('58d33418caa3635ece979a82')
>>> my_posts.find_one()
{u'movie': u'Pulp Fiction', u'_id': ObjectId('58d33418caa3635ece979a82')}
>>> my_posts.find_one({'movie': 'Star Wars'})
>>> my_posts.find_one({'movie': 'Pulp Fiction'})
{u'movie': u'Pulp Fiction', u'_id': ObjectId('58d33418caa3635ece979a82')}
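
As an aside, if you ever do need every matching document, you can iterate over the cursor that find() returns; the loop simply runs zero times when nothing matches. A quick sketch, continuing the same session (note that newer versions of pymongo deprecate insert() in favor of insert_one(), which is what the final program below uses):

>>> result = my_posts.insert_one({'movie': 'Reservoir Dogs'})
>>> for post in my_posts.find():
...     print(post['movie'])
...
Pulp Fiction
Reservoir Dogs
>>> for post in my_posts.find({'movie': 'The Room'}):
...     print(post['movie'])  # no matches, so this body never runs
...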

Putting it all together

import requests
from bs4 import BeautifulSoup
# Import MongoClient from pymongo so we can connect to the database
from pymongo import MongoClient

if __name__ == '__main__':
    # Instantiate a client to our MongoDB instance
    db_client = MongoClient()
    response = requests.get("http://andythemoron.com")
    soup = BeautifulSoup(response.content, "lxml")

    post_titles = soup.find_all("a", class_="post-title")

    extracted = []
    for post_title in post_titles:
        extracted.append({
            'title' : post_title['title'],
            'link'  : "andythemoron.com" + post_title['href']
        })

    # Grab a handle to our collection of posts
    my_posts = db_client.my_db.my_posts

    # Iterate over each post. If the link does not exist in the database, it's new! Add it.
    for post in extracted:
        if my_posts.find_one({'link': post['link']}) is None:
            # Let's print it out to verify that we found a new post
            print("Found a new listing at the following url:", post['link'])
            my_posts.insert_one(post)

Now run the program and verify that the posts were added on the first run. Try running it again, and verify that there are no new posts to be added to the database.
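
If you’d like to double-check the database directly, you can count the stored documents from the REPL. A small sketch, assuming pymongo 3.7 or later (which added count_documents()):

>>> from pymongo import MongoClient
>>> MongoClient().my_db.my_posts.count_documents({})

On the first run this should match the number of posts scraped; running the program again should leave it unchanged.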

If you’re following my advice on version control, stage your changes for commit with “git add scraper.py”, and commit them with something like “git commit -m ‘Insert new listings in the database’”.

Voilà!

