Building a scraping tool with Python and storing the data in Airtable (with real code)
A startup often needs extremely custom tools to achieve its goals.
At Arbington.com we've had to build scraping tools, data analytics tools, and custom email functions.
None of it required a traditional database. We sometimes used plain files as our "database," but mostly we used Airtable.
Nobody wants to admit it, but scraping is pretty important for gathering huge amounts of useful data.
It's frowned upon, but frankly, everyone does it. Whether they use an automated tool or manually sift through thousands of websites to collect email addresses - most organizations do it.
In fact, scraping is what made the world's best search engine: Google.
And in Python, this is REALLY easy.
The hardest part is reading through various forms of HTML, but even then, we have a tool for that. Let's take a look at an example that I've adjusted so you can scrape my website.
We'll use https://kalob.io/teaching/ as the example and get all the courses I teach.
First, we look for a pattern in the DOM. Open up that page, right click, inspect element, and look for all the blue buttons.
You'll see they all have class="btn btn-primary". Interesting, we've found a pattern. Great! We can work with that.
Now let's jump right into the code. And if you're a Python dev, feel free to paste this into your Python shell.
import requests
response = requests.get("https://kalob.io/teaching/")
print(response.content)
You'll see the HTML for my website. Now, all we need to do is parse the HTML.
Note: utf-8 encoding is most commonly used on the internet. So we'll want to decode the HTML we scraped into utf-8 compatible text (in a giant string)
Our code now looks like this:
import requests
response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
print(html)
And you'll see the HTML looks a little nicer now.
Now here's a big, hairy problem: parsing HTML. Some people write attributes as attr="", some write attr='', some people use XHTML and some don't.
So how do we get around this?
Introducing: Beautiful Soup 4.
In your Python environment pip install this package:
pip install beautifulsoup4
And your code now looks like this:
import requests
import bs4  # You'll need to `pip install beautifulsoup4`

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
print(soup)  # Shows the parsed HTML
print(type(soup))  # Returns <class 'bs4.BeautifulSoup'>
So our soup variable is no longer a string, but an object. This means we can call methods on it - like looking for certain elements in the HTML we scraped.
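To see what those methods buy us before touching the live page, here's a tiny sketch against a made-up HTML snippet (the markup below is invented for illustration, not taken from my site):

```python
import bs4  # pip install beautifulsoup4

# A tiny, made-up HTML snippet so this example runs without a network request.
html = (
    '<div>'
    '<a class="btn btn-primary" href="/course-1">Course 1</a>'
    '<a class="btn btn-primary" href="/course-2">Course 2</a>'
    '</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

first = soup.find("a")                                    # first match only
all_links = soup.find_all("a", class_="btn btn-primary")  # every match

print(first.get("href"))  # /course-1
print(len(all_links))     # 2
```

find() returns the first matching element, while find_all() returns a list of every match - that's the one we'll use below.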
Let's put together a list of all the links on this page.
import requests
import bs4  # You'll need to `pip install beautifulsoup4`

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
courses = soup.find_all("a", class_="btn btn-primary")
print(courses)
Look at that - now we have a list of the buttons from the page we scraped at the beginning of this article.
Lastly, let's loop through them to get the button text and the link:
for course in courses:
    print(course.get("href"))
    print(course.text.strip())
    print("\n")
Listen, I wrote 3 print statements to make this clear - but typically I'd write this in a single line.
Now we have something to work with! We have the entire HTML element, the href attribute, and the innerText without any surrounding whitespace.
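Before pushing anything anywhere, it can help to gather those values into plain Python dicts. A minimal sketch, again using a made-up HTML snippet in place of the live page:

```python
import bs4  # pip install beautifulsoup4

# Made-up HTML standing in for the scraped page, so this runs offline.
html = (
    '<a class="btn btn-primary" href="/python-101"> Python 101 </a>'
    '<a class="btn btn-primary" href="/web-dev"> Web Dev </a>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# One dict per button: the href attribute and the whitespace-stripped text.
records = [
    {"Link": a.get("href"), "Text": a.text.strip()}
    for a in soup.find_all("a", class_="btn btn-primary")
]
print(records)
```

A list of dicts like this drops straight into most database or API clients - including the Airtable one we'll use shortly.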
The entire script is eight lines of code and looks like this:
import requests
import bs4  # You'll need to `pip install beautifulsoup4`

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
courses = soup.find_all("a", class_="btn btn-primary")
for course in courses:
    print(f"{course.get('href')} -> {course.text.strip()}")
You know me, I'm a HUGE fan of Airtable.
And instead of using a local database or a cloud-based database, I like to use Airtable so my team and I can work with the data and easily expand the tables if we need to - like if we needed to add a column to track whether a course meets our criteria to be on Arbington.com.
For this we use Airtable's API and the Python package known as airtable-python-wrapper.
Go ahead and install it through pip.
pip install airtable-python-wrapper
Now before we continue, you'll need a free Airtable account - that's our referral link. No need to use it, it's just a nice kickback for us for constantly promoting Airtable.
Once you have an account, you need to dig up your base ID (it starts with "app"), your table name, and your API key (it starts with "key"). It would look something like this in Python:
from airtable import Airtable
airtable = Airtable('appXXXXXXXXX', 'Links', 'keyXXXXXXXXXX')  # base ID, table name, API key
Lastly, all we need to do is create a dictionary of Airtable Column Names, and insert the record.
import requests
import bs4  # You'll need to `pip install beautifulsoup4`
from airtable import Airtable

response = requests.get("https://kalob.io/teaching/")
html = response.content.decode("utf-8")
soup = bs4.BeautifulSoup(html, "html.parser")
courses = soup.find_all("a", class_="btn btn-primary")
airtable = Airtable('appXXXXXXXXX', 'Links', 'keyXXXXXXXXXX')
for course in courses:
    new_record = {
        "Link": course.get('href'),
        "Text": course.text.strip(),
    }
    airtable.insert(new_record)
Assuming you set up your Airtable columns, table, and API key properly, you should see my course links and button text appear in your Airtable.
Now you and your team can scrape webpages and store the data in Airtable for the rest of your team to use!
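One note on efficiency: the loop above makes one API call per record. The airtable-python-wrapper package also offers a batch_insert() method that sends records in chunks. A sketch, with the actual API call left commented out since it needs your own credentials (the record values here are made up):

```python
# Records shaped like the ones we built in the loop above.
records = [
    {"Link": "/python-101", "Text": "Python 101"},
    {"Link": "/web-dev", "Text": "Web Dev"},
]

# Uncomment once your base ID, table name, and API key are filled in:
# from airtable import Airtable
# airtable = Airtable('appXXXXXXXXX', 'Links', 'keyXXXXXXXXXX')
# airtable.batch_insert(records)  # chunked inserts instead of one call per record

print(len(records))
```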
Now that all the data we want is in Airtable, we can use the same Python package to pull the data out, work with it, scrape more data, and update each record.
But that's for another day đ
If you're looking for online courses, take a look at Arbington.com - there are over 40 Python courses available.
And it comes with a free 14-day trial to access over 1,500 courses immediately!