In Part 1 we installed Python, pip, and the Requests library, and set up a basic program that fetches the content of this site. In Part 2, we installed BeautifulSoup and used it to parse the page and extract the data we care about. In Part 3, we installed MongoDB and the pymongo library, and used them to determine whether the fetched data is new. In Part 4, we added a basic function to format an extracted post for SMS delivery and used Twilio to send that SMS. In Part 5, we added a few command-line options to help with future development as we add support for new sites. In Part 6, we covered basic error handling and how to automate the script. In Part 7, we did some refactoring to make it simpler to add new requests.
In this part, we are going to add a request for fetching tweets from the realDonaldTrump Twitter account. If you only cared about checking Twitter for updates, you would probably just use the Twitter APIs, but it’s hard to find sites with Terms and Conditions and a robots.txt that permit scraping, and Twitter fits the bill as of the time of this posting. And besides, you can get the latest Trump tweets before they’re disseminated all over the interwebs!
Support multiple requests
Let’s start by updating scraper_requests.py and scraper.py to handle multiple requests.
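What that change might look like, sketched as a minimal stand-in — the file split comes from Part 7’s refactor, but every name here is illustrative rather than the series’ actual code:

```python
# Hypothetical sketch: scraper_requests.py imagined as exposing a list of
# request definitions instead of a single one. All names are illustrative.

SCRAPER_REQUESTS = [
    {
        "name": "blog",
        "url": "https://www.example-blog.com/",  # placeholder URL
        "parser": "blog_parser",  # module that knows how to parse this site
    },
    # New sites get appended here as we add them.
]


def get_scraper_requests():
    """Return a copy of every configured request."""
    return list(SCRAPER_REQUESTS)


def run_all(handle_request):
    """scraper.py's main loop: apply the handler to each request in turn."""
    for request in get_scraper_requests():
        handle_request(request)
```

With this shape, supporting a new site later is just a matter of appending another dict to the list.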
Test it out in dryrun mode to verify that we didn’t break anything.
Choose your index
To check for new blog posts, we’ve been keying on the URLs of the articles. Tweets, however, do not have a URL in the markup we’re scraping, but each one carries a unique identifier. Let’s update our code to return a tuple of (index to key on, extracted data) instead of just the extracted data.
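As a hedged sketch of the idea — the function and field names below are made up for illustration; your real parse functions come from the earlier parts:

```python
# Hypothetical sketch: parse functions now return (index, data) tuples
# instead of bare data, so the caller can key the MongoDB "have we seen
# this?" check on whatever is unique for each site.

def parse_blog_post(post):
    """For blog posts, the article URL is the natural unique index."""
    url = post["url"]  # illustrative field names
    data = {"title": post["title"], "url": url}
    return url, data   # (index, extracted data)


def parse_tweet(tweet):
    """Tweets have no URL here, so key on the unique tweet id instead."""
    return tweet["data-tweet-id"], {"text": tweet["text"]}


def is_new(seen_indexes, index):
    """Stand-in for the pymongo lookup from Part 3."""
    return index not in seen_indexes
```

The caller no longer needs to know what makes an item unique; it just stores and checks whatever index the parser hands back.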
Let’s verify that everything still works…
Add Twitter parsing
Let’s start by adding the request for Donald Trump’s tweets.
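For illustration, assuming requests are kept as plain dicts after the Part 7 refactor (the field names are hypothetical, not the series’ actual code):

```python
# Hypothetical request entry for the Trump Twitter timeline; the dict
# shape and field names are illustrative.
TRUMP_TWITTER_REQUEST = {
    "name": "trump_twitter",
    "url": "https://twitter.com/realDonaldTrump",
    "parser": "twitter_parser",  # the new module we create in this part
}
```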
Now let’s add the Twitter-specific parsing and formatting methods. Using your browser’s developer tools (see Part 2), you’ll notice that tweets are `li` elements wrapping a `div` element with the following classes: `tweet`, `js-stream-tweet`, `js-actionable-tweet`, `js-profile-popup-actionable`, `original-tweet`, `js-original-tweet`, `has-cards`, and `has-content`. In writing a parsing method, I initially tried extracting the `div` elements of class `tweet`, but that class is also used for one additional element that is not a visible tweet. `js-stream-tweet` fits the bill for the tweets we want to scrape. We can fetch them like so: `soup.find_all('div', class_='js-stream-tweet')`.
These `div` elements of class `js-stream-tweet` contain metadata we might want, such as the unique tweet id (`data-tweet-id`), the tweeter’s Twitter handle (`data-screen-name`), and the retweeter’s handle, if applicable (`data-retweeter`).
You’ll notice that the actual text of the tweet is contained in a `p` element of class `js-tweet-text` (among others), inside a `div` element of class `js-tweet-text-container`, inside a `div` element of class `content`. Given a `tweet` that is one of the `js-stream-tweet` divs we extracted earlier, we can fetch its text like this: `tweet_text = tweet.find('div', class_='js-tweet-text-container').find('p', class_='js-tweet-text').text`.
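Putting those selectors together on a trimmed-down, hand-made fragment of the timeline markup (real pages carry far more classes and attributes than this stand-in):

```python
from bs4 import BeautifulSoup

# Hand-made stand-in for one tweet's worth of timeline markup.
html = """
<li>
  <div class="tweet js-stream-tweet original-tweet"
       data-tweet-id="987654321"
       data-screen-name="realDonaldTrump">
    <div class="content">
      <div class="js-tweet-text-container">
        <p class="js-tweet-text">Example tweet text.</p>
      </div>
    </div>
  </div>
</li>
"""

soup = BeautifulSoup(html, "html.parser")

for tweet in soup.find_all("div", class_="js-stream-tweet"):
    tweet_id = tweet["data-tweet-id"]        # unique id to key on
    screen_name = tweet["data-screen-name"]  # tweeter's handle
    retweeter = tweet.get("data-retweeter")  # None unless it's a retweet
    text = tweet.find("div", class_="js-tweet-text-container") \
                .find("p", class_="js-tweet-text").text
    print(tweet_id, screen_name, retweeter, text)
```

Note that `class_="js-stream-tweet"` matches even though the element carries several other classes; BeautifulSoup checks class membership, not an exact class string.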
Let’s create a new file twitter_parser.py to parse and format tweet data.
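Here is a sketch of what twitter_parser.py could contain, using the selectors above. The function names and the SMS format are assumptions for illustration, not the series’ exact code:

```python
# twitter_parser.py -- hypothetical sketch. Parses fetched timeline HTML
# and formats each tweet for SMS delivery (see Part 4).
from bs4 import BeautifulSoup


def parse(html):
    """Yield (index, data) tuples, keyed on the unique tweet id."""
    soup = BeautifulSoup(html, "html.parser")
    for tweet in soup.find_all("div", class_="js-stream-tweet"):
        text = tweet.find("div", class_="js-tweet-text-container") \
                    .find("p", class_="js-tweet-text").text
        data = {
            "screen_name": tweet["data-screen-name"],
            "retweeter": tweet.get("data-retweeter"),  # None if not a RT
            "text": text,
        }
        yield tweet["data-tweet-id"], data


def format_for_sms(data):
    """Collapse a parsed tweet into a short SMS body."""
    body = "@{}: {}".format(data["screen_name"], data["text"])
    if data["retweeter"]:
        body = "RT by @{}: {}".format(data["retweeter"], body)
    return body
```

Because `parse` yields `(index, data)` tuples, it plugs straight into the keying change we made above: the tweet id is what gets checked against MongoDB.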
Let’s check it out!
Awesome! After the groundwork we’ve laid, it wasn’t too much of a pain to add parsing of Twitter for new Donald Trump tweets. From here, you can try adding your own requests and parsers, and finding all sorts of applications for this tool!