Previous articles

Web Scraping for Fun and Profit: A Primer
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3

Next articles

Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8

Recap

In Part 1, we installed Python, pip, and the Requests library, and set up a basic program that fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page in order to extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine whether the fetched data is new.

In this part, we will set up a Twilio account for sending ourselves a text message (SMS) when new data arrives.

Why Twilio?

Twilio provides a simple mechanism for delivering SMS messages programmatically out of the box. It’s fairly easy to get up and running, and the trial account comes with a generous amount of free credit, so long as you’re content with the “Sent from your Twilio trial account - ” prefix showing up in your text messages.

You could alternatively use an email account with Python’s smtplib library and send alert texts via your carrier’s email-to-SMS gateway (see the sketch below). That’s the solution I eventually adopted, after my first Twilio number got spam-blocked when I accidentally sent a bunch of texts all at once. Big takeaway: always add a quick sleep between sending text messages, just in case! Also, when fetching the data from a page for the first time, it’s useful to have a command-line option that just seeds the database with the initial entries, rather than sending out a massive flurry of notifications on the first run. We will add this and other options in subsequent posts.
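
If you’d rather go the email route, here’s a minimal sketch using smtplib. The SMTP host, credentials, and gateway address below are all placeholders - check your email provider’s SMTP settings and your carrier’s email-to-SMS gateway domain.

import smtplib
from email.mime.text import MIMEText

# NOTE: all of these values are placeholders - substitute your own
smtp_host = "smtp.example.com"        # your email provider's SMTP server
username = "you@example.com"          # your email account
password = "your-password"            # or an app-specific password
to_address = "1231234567@vtext.com"   # your carrier's email-to-SMS gateway

# Build a plain-text message addressed to the SMS gateway
msg = MIMEText("New post: ...")
msg['From'] = username
msg['To'] = to_address

# Connect over SSL, log in, send, and clean up
server = smtplib.SMTP_SSL(smtp_host, 465)
server.login(username, password)
server.sendmail(username, [to_address], msg.as_string())
server.quit()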

Create a Twilio account

Head over to twilio.com/try-twilio to create an account. Once you’re set up and logged in, open the Twilio console and take note of your “Account SID” and “Auth Token” - you will need these to connect to Twilio from Python. Your trial account also comes with one free phone number, which you should note down for later.

Install twilio-python

Let’s use pip to install the Python helper library for Twilio.

If using Python 3.x:

$ sudo pip3 install twilio

If using Python 2.x:

$ sudo pip install twilio
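
Either way, a quick import from a Python shell confirms that the library is available:

>>> import twilio
>>> from twilio.rest import TwilioRestClient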

Twilio API: SMS basics

What you need to know to get the basics working here is pretty simple.

1) Instantiate a Twilio client, passing in your “Account SID” and “Auth Token” as constructor parameters.

>>> from twilio.rest import TwilioRestClient
>>> accountSid = "ACXXXXXXXXXXXXXXXXX"
>>> authToken = "YYYYYYYYYYYYYYYYYY"
>>> client = TwilioRestClient(accountSid, authToken)

Alternatively, a TwilioRestClient constructed without these parameters will look for the TWILIO_ACCOUNT_SID and TWILIO_AUTH_TOKEN environment variables, which is the preferred method since it keeps credentials out of your source code. Undoubtedly, I’ll come back around to such things as configuring your environment/shell in future posts. :)
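
For instance, assuming you’ve exported those variables in your shell beforehand, the no-argument constructor just works:

$ export TWILIO_ACCOUNT_SID="ACXXXXXXXXXXXXXXXXX"
$ export TWILIO_AUTH_TOKEN="YYYYYYYYYYYYYYYYYY"
$ python
>>> from twilio.rest import TwilioRestClient
>>> client = TwilioRestClient()  # credentials are read from the environment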

2) Create a new message with parameters “to”, “from_”, and “body”. (Note the trailing underscore in “from_” - “from” is a reserved keyword in Python.)

>>> client.messages.create(to="+11231234567", from_="+11239876543", body="hey, what's up?")
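
The create call also returns a message object; if you want a quick confirmation that Twilio accepted the request, you can hold onto it and inspect its sid (the value shown here is made up):

>>> message = client.messages.create(to="+11231234567", from_="+11239876543", body="hey, what's up?")
>>> message.sid
'SMXXXXXXXXXXXXXXXXX'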

A brief note on character encodings

In theory, Twilio recognizes the character encoding you need and performs the encoding automatically, with full unicode support… but in practice, I’ve noticed some inconsistencies in this regard. If you don’t anticipate needing non-ASCII characters, you can strip them out or replace them with another character via something like the snippet below.

>>> message = "I love heiße liebe!"
>>> message = ''.join([i if ord(i) < 128 else '_' for i in message])
>>> message
'I love hei_e liebe!'
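
On Python 3, an equivalent one-liner is the built-in encode/decode round-trip, which substitutes a “?” for anything outside ASCII:

>>> message = "I love heiße liebe!"
>>> message.encode('ascii', errors='replace').decode('ascii')
'I love hei?e liebe!'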

Putting it all together

We’ll create a simple method for formatting an extracted post for SMS delivery, and pass the returned value as the body of the message.

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient
from twilio.rest import TwilioRestClient
# Import the time library so that we can sleep between SMS messages - see warning in introduction
import time

# Use your own values for these
accountSid = "ACXXXXXXXXXXXXXXXXX"
authToken = "YYYYYYYYYYYYYYYYYY"

# Create a simple method for creating a formatted message from an extracted post
def format_message(post):
    return "New post: " + post['title'] + "\n" + post['link']

if __name__ == '__main__':
    db_client = MongoClient()
    # Instantiate a Twilio client
    twilio_client = TwilioRestClient(accountSid, authToken)
    response = requests.get("http://andythemoron.com")
    soup = BeautifulSoup(response.content, "lxml")

    post_titles = soup.find_all("a", class_="post-title")

    extracted = []
    for post_title in post_titles:
        extracted.append({
            'title' : post_title['title'],
            'link'  : "andythemoron.com" + post_title['href']
        })

    for post in extracted:
        if db_client.my_db.my_posts.find_one({'link': post['link']}) is None:
            # Send an SMS message via Twilio
            toNumber = "+11231234567"  # Use your own cell number here
            fromNumber = "+11239876543"  # Use your own Twilio number here
            message = format_message(post)
            twilio_client.messages.create(to=toNumber, from_=fromNumber, body=message)
            
            # Add the post to MongoDB
            db_client.my_db.my_posts.insert_one(post)

            # Sleep for a few seconds to avoid getting spam-blocked
            time.sleep(3)

If you’re following my advice on version control, stage your changes for commit with “git add scraper.py”, and commit them with something like “git commit -m ‘Format new post for SMS delivery and send via Twilio’”.

Hey, we’ve got a basic, functional alert program set up now! We’ll want to automate its execution, either by running the main logic in a perpetual loop that sleeps for a good while after each iteration, or by using a service such as cron to execute the program at specified times. We’ll also want to refactor the code to make it more extensible, so that we can do this kind of thing for a wide variety of sites and data. These will be covered in coming posts. Stay tuned!
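
As a quick preview of the cron approach, a crontab entry along these lines - the path and schedule here are hypothetical, so adjust to taste - would run the scraper every 15 minutes:

$ crontab -e

# Run the scraper every 15 minutes
*/15 * * * * python3 /path/to/scraper.py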

