How to Transform Data Extracted from Wikipedia into a Map in Python
In this tutorial I describe a strategy to extract geographical items, organised in lists, from Wikipedia, and then show them on a geographical map. I exploit the following Python libraries:
- selenium, which is a Python library for extracting data from Web sites. For more details on how to use selenium, you can read my previous article, entitled Scraping Data from Nested HTML Pages with Python Selenium.
- geopy, which is a Python library working as a client for the most famous geocoding services. More details can be found in this interesting article by Eunjoo Byeon, entitled Introduction to Geopy: Using Your Latitude & Longitude Data in Python.
- folium, which is a Python library for geographical data visualisation. For more details, you can read this interesting article by Dario Radečić, entitled How to Make Stunning Interactive Maps with Python and Folium in Minutes. A quick sketch of how geopy and folium fit together is shown right after this list.
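To give an idea of how these two libraries will be used in the later steps, here is a minimal sketch. It assumes the geocoding query succeeds; the user_agent string, the place name "Firenze, Italia" and the output file name are just illustrative examples, not part of this tutorial's dataset:
from geopy.geocoders import Nominatim
import folium

# geocode a sample locality (the place name is only an illustrative example)
geolocator = Nominatim(user_agent="wikipedia-map-tutorial")  # user_agent is an arbitrary string
location = geolocator.geocode("Firenze, Italia")  # may return None if the service cannot resolve the name

# build a map centred roughly on Italy and add a marker for the geocoded locality
sample_map = folium.Map(location=[41.9, 12.5], zoom_start=6)
folium.Marker([location.latitude, location.longitude], popup="Firenze").add_to(sample_map)
sample_map.save("sample_map.html")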
As an example, I exploit five Wikipedia pages related to Italian Jewish communities:
- Communities
- Museums
- Cemeteries
- Ghettos
- Synagogues
All the considered Wikipedia pages contain a list of items, each representing a geographical entity, i.e. an Italian city. Thus, the idea is to build a geographical map with all those localities extracted from Wikipedia. The procedure is organized in four steps:
- Data Extraction
- Data Cleaning
- Data Enrichment
- Data Visualisation
In all the considered Wikipedia pages, the localities are represented as bullets of unordered lists. Thus, they can be easily extracted through a common procedure, implemented by means of the selenium library. In order to make the code work, you should install the correct selenium driver for your browser, as explained in this video.
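As a quick sanity check, not part of the extraction procedure itself, you can try to open and immediately close a headless browser; if the driver is not correctly installed, this snippet raises an exception:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# open a headless Chrome session and close it immediately:
# if no exception is raised, the driver is correctly installed
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.quit()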
Now, I am ready to write the code.
Firstly, I import the driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
Then, I define a function, called extract_list, which receives as input the URL of the Wikipedia page as well as the XPath expression used to extract data from that page. The function extracts all the text associated with that XPath, splits the extracted text into lines and returns the list of items as a result:
def extract_list(url, xpath):
    # run the browser in headless mode, with the Italian locale
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--lang=it")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    table = []
    # get the list of terms: extract the text of the node matching the XPath
    words = driver.find_element(By.XPATH, xpath).text
    # split the text into lines, one per list item
    table.extend(words.split('\n'))
    driver.close()
    return table
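Before running the full loop of the next step, the function can be tried on a single page. The URL and the XPath below are the same ones used later in this tutorial; the number of extracted items obviously depends on the current content of the page:
# quick test on a single page (same URL and XPath used in the next step)
url = 'https://it.wikipedia.org/wiki/Musei_ebraici_in_Italia'
xpath = '//*[@id="mw-content-text"]'
items = extract_list(url, xpath)
print(len(items))   # number of extracted lines
print(items[:5])    # first few extracted lines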
Now, I can invoke the function for each considered Wikipedia page. To this aim, I define the list of pages:
pages = ['Comunit%C3%A0_ebraiche_italiane', 'Cimiteri_ebraici_in_Italia', 'Musei_ebraici_in_Italia','Ghetti_ebraici_in_Italia','Sinagoghe_in_Italia']
Then, I loop over the list of pages and invoke the extract_list function. I also convert each extracted table into a pandas DataFrame and associate each extracted item with a category, corresponding to the considered page (with some stylistic changes to its name):
import pandas as pd
df_dict = {}
# XPath of the main content node of every Wikipedia page
xpath = '//*[@id="mw-content-text"]'
table = {}
base_url = 'https://it.wikipedia.org/wiki/'
for page in pages:
    # build a readable category name from the page name (also decode the URL-encoded 'à')
    name = page.replace('_', ' ').title().replace('%C3%A0', 'à')
    print(name)
    url = base_url + page
    table[page] = extract_list(url, xpath)
    df_dict[page] = pd.DataFrame(table[page], columns=['value'])
    df_dict[page]['category'] = name
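At this point, each entry of df_dict contains a small DataFrame with two columns, value and category. A quick inspection of, for instance, the first page helps to verify that the extraction worked as expected:
# inspect the DataFrame extracted from the first page
print(df_dict[pages[0]].shape)
print(df_dict[pages[0]].head())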
Finally, I build a single DataFrame by concatenating the previously built DataFrames:
df = pd.DataFrame(df_dict[pages[0]])
for i in range(1, len(pages)):
    df = pd.concat([df, df_dict[pages[i]]])
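As an equivalent alternative, the same result can be obtained with a single call to pd.concat, which accepts a list of DataFrames; this is just a stylistic variant of the loop above:
# equivalent one-liner: concatenate all the extracted DataFrames at once
df = pd.concat([df_dict[page] for page in pages])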
The extracted data contain many errors, which will need to be corrected during the Data Cleaning step. However, I can already store this first raw dataset as a CSV file:
df.to_csv('data/raw_data.csv')