How to Transform Data Extracted from Wikipedia into a Map in Python
In this tutorial I describe a strategy for extracting geographical items, organised as lists, from Wikipedia and then showing them on a geographical map. In this part I exploit the following Python libraries: selenium, to scrape the lists from the Wikipedia pages, and pandas, to organise the extracted items into DataFrames.
As an example, I exploit five Wikipedia pages related to the Italian Jewish communities:
- Comunità ebraiche italiane (Italian Jewish communities)
- Cimiteri ebraici in Italia (Jewish cemeteries in Italy)
- Musei ebraici in Italia (Jewish museums in Italy)
- Ghetti ebraici in Italia (Jewish ghettos in Italy)
- Sinagoghe in Italia (Synagogues in Italy)
All the considered Wikipedia pages contain a list of items, each representing a geographical entity, i.e. an Italian city. Thus, the idea is to build a geographical map with all those localities extracted from Wikipedia. The procedure is organised in three steps: extracting the raw lists from the Wikipedia pages, cleaning the extracted data, and showing the resulting localities on a map.
In all the considered Wikipedia pages, the localities are represented as bullets of unordered lists. Thus, they can be easily extracted through a common procedure, implemented by means of the selenium library. In order to make the code work, you should install the correct selenium driver for your browser, as explained in this video.
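As an alternative to a manual download, the driver can also be fetched automatically. This is a minimal sketch, assuming the third-party webdriver-manager package (not used in the original code) is installed:
# Sketch: let webdriver-manager download a matching ChromeDriver
# (assumption: the webdriver-manager package is installed via pip)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))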
Now, I am ready to write the code.
Firstly, I import the driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
Then, I define a function, called extract_list, which receives as input the URL of the Wikipedia page as well as the XPath expression used to extract data from that page. The function extracts all the text associated with that XPath, splits the extracted text into lines and returns the list of items as a result:
def extract_list(url, xpath):
    # run the browser in headless mode, with the Italian locale
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--lang=it")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    table = []
    # get the list of terms: take the text of the node matching the XPath
    # (find_element(By.XPATH, ...) is the current Selenium 4 API;
    # the older find_element_by_xpath() has been removed)
    words = driver.find_element(By.XPATH, xpath).text
    table.extend(words.split('\n'))
    driver.close()
    return table
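The function can be tried on a single page to check that it works (a quick sketch; the printed output depends on the live page content):
# Quick check (sketch): extract the items of one page and peek at the result
items = extract_list('https://it.wikipedia.org/wiki/Sinagoghe_in_Italia',
                     '//*[@id="mw-content-text"]')
print(len(items))   # number of extracted lines
print(items[:5])    # first few extracted lines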
Now, I can invoke the function for each considered Wikipedia page, for which I define a list:
pages = ['Comunit%C3%A0_ebraiche_italiane', 'Cimiteri_ebraici_in_Italia', 'Musei_ebraici_in_Italia','Ghetti_ebraici_in_Italia','Sinagoghe_in_Italia']
Then, I loop over the created list of pages and invoke the extract_list function. I also convert each extracted table into a pandas DataFrame and associate with each extracted item a category, corresponding to the considered page (with some stylistic changes):
import pandas as pd

df_dict = {}
xpath = '//*[@id="mw-content-text"]'
table = {}
base_url = 'https://it.wikipedia.org/wiki/'
for page in pages:
    # turn the page slug into a readable category name
    name = page.replace('_', ' ').title().replace('%C3%A0', 'à')
    print(name)
    url = base_url + page
    table[page] = extract_list(url, xpath)
    df_dict[page] = pd.DataFrame(table[page], columns=['value'])
    df_dict[page]['category'] = name
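As a side note, the percent-escapes in the page names could also be decoded with the standard library instead of hard-coding each sequence; a minimal sketch, using urllib.parse.unquote:
from urllib.parse import unquote

# Sketch: decode any percent-escape (e.g. '%C3%A0' -> 'à') in one call
name = unquote(page).replace('_', ' ').title()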
Finally, I build a single DataFrame by concatenating the previously built DataFrames:
df = pd.DataFrame(df_dict[pages[0]])
for i in range(1, len(pages)):
    df = pd.concat([df, df_dict[pages[i]]])
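Equivalently, the per-page DataFrames could be concatenated in a single call; a sketch (note that ignore_index=True also resets the row index, unlike the loop above):
# Sketch: concatenate all the per-page DataFrames at once
df = pd.concat(df_dict.values(), ignore_index=True)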
The extracted data contain many errors, which will need to be corrected later. However, I can already store this first raw dataset as a CSV file:
df.to_csv('data/raw_data.csv')
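A first cleaning pass could, for instance, strip whitespace and drop empty or duplicated rows; a minimal sketch (the actual cleaning rules depend on the noise present in the extracted rows):
# Minimal cleaning sketch (assumption: real rules depend on the actual data)
df['value'] = df['value'].str.strip()  # remove leading/trailing whitespace
df = df[df['value'] != '']             # drop empty lines
df = df.drop_duplicates()              # drop repeated rows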