Scrape Your First Website in Minutes with Python
Ever felt the need to pull data from a website? What would you do? Visit the pages one after another and gather the information by hand?
Well, that would work if you only have a page or two. However, if you have lots of them, manual extraction becomes far too tedious a task; this is where web scraping comes to the rescue!
Web scraping, as the name suggests, is a method of extracting data from web pages in an automated fashion. Scraping is super helpful for price comparisons, R&D, gathering data from social media, job listings, and more.
Many methods can be used to perform web scraping, such as online services, APIs, or even writing your own script. And that’s why we are here. This article will teach you the basics of how to scrape data from the web. Before we get into that, let’s take a quick look at why we would even want to scrape data from the web.
Websites, in general, hold huge quantities of information, most of it unstructured or cluttered. When users visit a website, they typically need only a small fraction of what’s available.
While they can access that information manually, the process is quite cumbersome, especially when it has to be repeated because the data is dynamic and updated frequently. Hence the need for web scraping.
Once the script is set up for a particular webpage, it can be executed any number of times to extract data and use it as required.
Let’s get started!
This script will extract weather data from a webpage and save it to a .csv file. We will be using the following libraries to help us with the scraping and with managing the extracted data:
Requests - This library is required to send an HTTP request to the web page. This will give us access to the HTML content of the webpage we want to scrape.
Beautiful Soup - This library gives us functions to help extract data from the HTML content we receive when we send an HTTP request.
Pandas - This library helps us manage the data that has been extracted. In this case we will use it to save our data to a .csv file.
If you don’t have these libraries installed yet, run the commands below to install them:
Installing Beautiful Soup
pip install beautifulsoup4
Installing Requests
pip install requests
Installing Pandas
pip install pandas
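To confirm the installations worked, a quick sanity check is to import each package from a Python shell and print its version; all three packages expose a __version__ attribute:

import requests
import bs4
import pandas

print(requests.__version__, bs4.__version__, pandas.__version__)

If any of these imports fails with a ModuleNotFoundError, re-run the corresponding pip command above.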
Once you have the libraries installed, follow the steps below to scrape data from the web in Python 3; a sketch that puts all the steps together follows the list.
Start by importing all the libraries.
Send an HTTP request to the webpage using its URL. Make sure the response’s status code is 200, which means the request was successful.
Use the BeautifulSoup constructor to parse the raw HTML received in the response.
From the parsed HTML, extract the data we need using selectors; here, the elements are selected by their ‘class’ and ‘id’ attributes.
Collect the extracted data into a Python dictionary and load it into a pandas DataFrame.
Save the DataFrame to a .csv file. Note: we are using the utf-16BE encoding so that the degree symbol renders properly in the .csv file.
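Putting the six steps together, here is a minimal sketch of what such a script might look like. The URL, tag names, and the ‘class’/‘id’ values are placeholders for illustration; replace them with the ones from the weather page you actually want to scrape.

# Step 1: import all the libraries.
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Step 2: send an HTTP request and make sure it succeeded (status code 200).
url = "https://example.com/weather"  # placeholder URL
response = requests.get(url)
if response.status_code != 200:
    raise SystemExit(f"Request failed with status code {response.status_code}")

# Step 3: parse the raw HTML received in the response.
soup = BeautifulSoup(response.text, "html.parser")

# Step 4: extract the data we need using 'id' and 'class' selectors.
# The tag names, id, and class values below are made up for illustration.
city = soup.find("h1", id="city-name").get_text(strip=True)
temperature = soup.find("span", class_="temperature").get_text(strip=True)
condition = soup.find("span", class_="condition").get_text(strip=True)

# Step 5: arrange the data as a Python dictionary and load it into a DataFrame.
data = {"City": [city], "Temperature": [temperature], "Condition": [condition]}
df = pd.DataFrame(data)

# Step 6: save the DataFrame to a .csv file.
# utf-16BE keeps the degree symbol in the temperature readable.
df.to_csv("weather.csv", index=False, encoding="utf-16BE")

If one of the find() calls returns None, the selector doesn’t match anything on the page; inspect the page’s HTML with your browser’s developer tools and adjust the tag, class, and id names accordingly.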
Once you have your code ready, you can deploy it directly to the cloud using Codesphere. Codesphere lets you avoid the hassle of configuration so that you can spend more time doing what you do best: actually coding!
Let us know what you’re going to scrape down below!
Till then, happy coding.