Previous article

Web Scraping for Fun and Profit: A Primer

Next articles

Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8

Web scraping and the law

Before proceeding, there are a few things you should be aware of when it comes to web scraping and the law. A full legal analysis of the practice of web scraping is beyond the scope of this post, and it is by no means a settled issue. A summary of the standards you should adhere to in order to stay on the right side of the law is included below. The source, which contains further details, can be found here.

  • Content being scraped is not copyright protected
  • The act of scraping does not burden the services of the site being scraped
  • The scraper does not violate the Terms of Use of the site being scraped
  • The scraper does not gather sensitive user information
  • The scraped content adheres to fair use standards

The Terms of Use of some useful sites for scraping are highly prohibitive. For instance, the Craigslist Terms of Use state that “Robots, spiders, scripts, scrapers, crawlers, etc. are prohibited”. Others are not so strict. To avoid any potential conflicts with any Terms of Use (as these may change at any time and render this post retroactively illegal), I’ll be sticking with the use case of checking this site for new blog posts. In the course of working through this project, you’ll find that it’s relatively easy to write your own code for checking a wide variety of different sites for updates.

Please be mindful of the traffic you generate from any programmatic web request code, and please don’t be a dick!

Install Python

Okay, now that we’ve gotten that out of the way, let’s get to installing Python!

If you are using a Linux machine, Python should already be installed. You can verify this and check which version you have by invoking “python --version” or “python -V” in a terminal for Python 2.x, and “python3 --version” or “python3 -V” for Python 3.x.
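You can also check the version from within Python itself; the standard library’s sys module exposes the running interpreter’s version:

```python
import sys

# sys.version is a human-readable string; sys.version_info is a
# comparable tuple of (major, minor, micro, releaselevel, serial).
print(sys.version)
print(sys.version_info.major, sys.version_info.minor)
```

This is handy when you’re not sure which interpreter a given “python” command actually resolves to.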

If you are running Windows or macOS but don’t have Python installed, you can visit https://www.python.org/downloads/ to download the latest 2.x or 3.x release.

Python 2.x or 3.x?

If you are new to Python, I’d probably recommend starting with the latest 3.x release, since it is the language’s future. Python 2.x, while still widely used, is considered a legacy language at this point. Its development, which is essentially restricted to bug fixes and maintenance, is slated to cease after 2020.

I’ll be testing the code we implement on both Python 2.7 and Python 3.6, and will note wherever any divergences are needed.

In calling the script we’re creating, I’ll be invoking “python3” to run on Python 3.x. If you wish to run on Python 2.x, simply invoke with “python” instead.
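If you want a single script to behave consistently under both interpreters, one common idiom is the __future__ import, which makes Python 2.x treat print as a function the way 3.x does (it’s a harmless no-op on 3.x):

```python
from __future__ import print_function  # no-op on Python 3.x

# With this import in place, print() behaves identically on 2.7 and 3.x,
# so the same script can be invoked with either "python" or "python3".
print("Hello from a 2/3-compatible script")
```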

Install pip

To install new Python libraries, you’ll want to use pip, which is the de facto package manager for Python.

If you just installed Python 2 >= 2.7.9 or Python 3 >= 3.4 from python.org, pip will already be installed, although you may want to upgrade it. Otherwise, you can check out pip’s installation guide to get pip up and running. You may need to run the install script with superuser privileges, i.e., “sudo python get-pip.py”.

Install the Python requests library via pip

To fetch data from the web, we are going to use the Python requests library. The requests library is the de facto standard for fetching data from the web via Python. As the documentation states, “The NSA, Her Majesty’s Government, Amazon, Google, Twilio, Runscope, Mozilla, Heroku, PayPal, NPR, Obama for America, Transifex, Native Instruments, The Washington Post, Twitter, SoundCloud, Kippt, Sony, and Federal U.S. Institutions that prefer to be unnamed claim to use Requests internally.”

To install the requests library, simply run “sudo pip install requests” in the terminal! If Python 2.x is your system’s default Python version, and you want to develop this in Python 3.x, run “sudo pip3 install requests” instead.
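Once the install finishes, you can sanity-check it from a Python shell. Assuming a recent requests release (which exposes a version string), this should print the installed version rather than raising an ImportError:

```python
import requests

# If this import fails, requests was likely installed for a different
# Python version than the one you're currently running.
print(requests.__version__)
```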

Set up a basic skeleton

Now that we’ve got the initial pieces installed, let’s put together a basic skeleton that simply fetches andythemoron.com, as we work towards the goal of fetching web data on a periodic basis, checking for new data, and sending out a message when new data is discovered.

I’d highly recommend that you create a new folder for this project, e.g. Scraper, and begin getting acquainted with version control: install git, configure a user name and email, and then invoke “git init” from the command line within this new folder to initialize a new git repository. I’ll be sprinkling git basics throughout this tutorial series, and will most certainly come back to it in further detail in later posts. To globally configure your name and email, run the following commands, substituting your name and email for mine: “git config --global user.name ‘Andy Moran’” and “git config --global user.email [email protected]”.

Create a new file scraper.py in this folder:

import requests

if __name__ == '__main__':
    response = requests.get("http://andythemoron.com")
    # Let's print out the response for now to make sure we got some valid data back
    print(response.text)

You can run this code with “python3 scraper.py” for Python 3.x or “python scraper.py” for Python 2.x, and observe that we print out the content of andythemoron.com.

This little snippet of code begins by importing the requests library, which we use to fetch data from the web via its get() function. Note that it is generally considered a best practice to include all library imports at the beginning of your code.

The “if __name__ == ‘__main__’” bit basically means: execute the code contained in the “if” block if this file is being run as a script (as opposed to being imported by another script). Within this “if” block, we simply invoke the requests library’s get() function with the argument “http://andythemoron.com”, which returns a Response object that we assign to the variable “response”. We then call the print() function with response.text, the content of the response, to ensure that we’re actually getting meaningful data back.
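To make the __name__ behavior concrete, here’s a tiny stand-alone example (greet() is just a hypothetical function for illustration):

```python
# demo.py - illustrates the __name__ idiom

def greet():
    return "Hello from greet()"

if __name__ == '__main__':
    # This block runs when the file is invoked directly ("python3 demo.py"),
    # but not when another script does "import demo".
    print(greet())
```

If another file ran “import demo”, it could still call demo.greet(), but the print inside the “if” block would be skipped.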

Ahem, version control

If you took my advice on creating a new folder for this project and initializing a git repository in it via “git init”, you can run “git status” in the terminal from within this new folder, which shows that this new file scraper.py is untracked. You’ll want to run “git add scraper.py” to indicate to git that you want the changes made to this file to be included in the next commit to this code repository. Another invocation of “git status” reveals that the addition of scraper.py is now a change that is to be committed. You can now run “git commit -m ‘Initial checkin of basic scraper skeleton’” to commit this change to your repository. If you run “git log” in this folder, you’ll see that this change has been committed. To “push” these changes to another location, e.g. to github or a private repository, you’d invoke “git push” after configuring a remote repository. I’ll undoubtedly come out with a post on how to create a free cloud-based repository and configure it as the remote repository in the near future.

Thanks for reading!

