Next articles
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8
Why web scraping? The beauty of simplicity and positive momentum
So I know I just ripped on the bifurcation of learning materials into introductory tutorials and more advanced material, but I think the process of putting together a basic web scraping utility for “real-time” alerts provides an excellent opportunity for learning that is not too difficult, yet provides valuable learning opportunities, tangible benefits, and contributes towards the needed positive momentum of completing real projects.
I don’t know about you, but when I started out on the path of programming, I’d often struggle to come up with ideas for projects. Having completed a few degrees in different fields, I was intimately aware of the importance of frequent repetition for solidifying one’s mental models and conceptual understanding. But outside of completing required homework projects, I’d fail to come up with meaningful projects on which to apply the skills I was trying to learn. And my learning curve took significantly longer than perhaps it needed to.
You see, you’ll learn a tremendous amount from starting on, and completing, projects! On a smaller scale, you’ll learn from your mistakes, and learn from experience what it means to write “good” code. By “good” code, I primarily mean code that is easy to understand and easy to maintain.
To a certain extent, these aims coalesce. The clearer and more explicit your code is, the easier it is to maintain. And the harder your code is to maintain, the harder it will become to understand when you inevitably want to add new features or fix existing issues. What constitutes “simple” code and the steps needed to make it happen are not always easy when you’re trying to stay afloat in the deep end, but I’ll try my best to help you through the murky waters by breaking things down in a (hopefully) understandable manner.
Entropy is a thing
One of the most fundamental (and obvious, when you think about it) realities of coding for a living is that you will almost certainly be writing code that is going to be seen, and likely modified, by other people. There is a certain degree of entropy that is endemic to any codebase, and this is usually magnified tremendously in accord with the scale of the project. Given this fundamental reality, ease of understandability and ease of maintenance are king. Sure, there are undoubtedly instances where a “clever” solution may be ideal, but such cases are the exception, rather than the rule, especially if you’re not highly experienced. As a general rule, a “clever” solution is a failure to make the code adequately simple and expressive.
You won’t want to write simple and expressive code just so that others may understand it, however. You’ll also want to make your code clear and expressive so that you can understand it. “But I wrote it, so I’ll understand it when I need to make changes,” you say. This attitude will undoubtedly be challenged after getting burned a few times. I cannot recall how many times I’ve had to go back and look at my own code to figure out how something was implemented or why. To the extent that you make your code easier to understand, you are doing your future self a huge favor! You may have to trust me on this for now, but the lesson will sink in for you on a personal level as you continue to have the experience of looking at your own code, or others’, and try to figure out what the hell is going on and why.
Given entropy, use version control for everything!
Seriously. Any professional work you’ll do will utilize some form of a version control system. If you’re offered a job at a place that doesn’t use version control, I’d be more than a little concerned.
I’ll undoubtedly come back to version control in greater detail in future posts, but at its core, a version control system (or VCS) records changes to files over time. Among other things, this allows you to trace out what happened to the code over time, and it helps you the manage the entropy of the codebase you’re working with. This becomes critically important when managing multiple versions of the same software. Even for small personal projects, version control allows you to take snapshots of your progress in case you go down a path you end up wanting to get out of, or you accidentally delete a file, for instance.
I won’t dive in too deep in this project, but I’ll be utilizing some basic version control over the course of this little project to get you acquainted.
Okay, whatever dude. Where’s my scraper?
I’m going to spend a few posts going through the process of developing this utility, and thinking through the basic design, rather than just spoon feeding you a finished piece of code. Although, I’ll probably set up a github repo for each section as we go along as well. :)
Fundamentally, what do we need to do to create a generalized mechanism for receiving automated alerts for new listings of apartments, unique items, jobs, etc. on the web? What are the core components for automating a single search? You’ll want to automate something like the following process:
- Fetch the data you need from the web.
- Parse this data and extract what you care about from it.
- Compare this extracted data against what you’ve already processed to determine if it’s new.
- If the data is new, format it for the appropriate delivery mechanism.
- Send this formatted data.
We’ll be working through all of these steps in subsequent posts. And once we have the basics in place, we’ll do some refactoring (making changes to the code which don’t alter the external behavior, but which improve its internal structure) to make this basic framework more generally useful. Code entropy may be a thing, but refactoring is the common cure, and a fundamental component of the development process. Refactoring is typically thought of as a discrete thing you do, but usually you’ll just want to refactor as a natural part of the development process in order to achieve some other end more easily. Such is the natural time for refactoring.
I’ll be using Python as the language for implementing this scraper. If you are not familiar with Python, but have some experience with other modern languages, you’ll probably be able to pick it up without too much difficulty. One of the many things I love about Python is how easy it is to learn! It embodies a lot of the principles I’ve been expounding, viz., readability, clarity, expressiveness. Hopefully the subsequent series of posts will help illuminate what I mean if you are coming from a background which doesn’t include Python exposure.
Thank you for tuning in! Till next time!
Next articles
Web Scraping for Fun and Profit: Part 1
Web Scraping for Fun and Profit: Part 2
Web Scraping for Fun and Profit: Part 3
Web Scraping for Fun and Profit: Part 4
Web Scraping for Fun and Profit: Part 5
Web Scraping for Fun and Profit: Part 6
Web Scraping for Fun and Profit: Part 7
Web Scraping for Fun and Profit: Part 8