How to Transform Data Extracted from Wikipedia into a Map in Python
In this tutorial I describe a strategy for extracting geographical items, organised as lists, from Wikipedia and then showing them on a geographical map. In this part I exploit the following Python libraries: selenium, to scrape the lists from the Wikipedia pages, and pandas, to organise the extracted items into DataFrames.
As an example, I exploit five Wikipedia pages related to the Italian Jewish communities:
- Comunità ebraiche italiane (Italian Jewish communities)
- Cimiteri ebraici in Italia (Jewish cemeteries in Italy)
- Musei ebraici in Italia (Jewish museums in Italy)
- Ghetti ebraici in Italia (Jewish ghettos in Italy)
- Sinagoghe in Italia (Synagogues in Italy)
All the considered Wikipedia pages contain a list of items, each representing a geographical entity, i.e. an Italian city. Thus, the idea is to build a geographical map with all those localities extracted from Wikipedia. The procedure is organised in three steps: extracting the raw lists from the Wikipedia pages, cleaning the extracted data, and showing the resulting localities on a map.
In all the considered Wikipedia pages, the localities are represented as bullets of unordered lists. Thus, they can be easily extracted through a common procedure, implemented by means of the selenium library. In order to make the code work, you should install the correct selenium driver for your browser, as explained in this video.
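As an alternative to a manual download, the driver can also be fetched automatically. This is a minimal sketch, assuming the third-party webdriver-manager package (not used in the original code) is installed:
# Sketch: let webdriver-manager download a matching ChromeDriver
# (assumption: the webdriver-manager package is installed via pip)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))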
Now, I am ready to write the code.
Firstly, I import the driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
Then, I define a function, called extract_list, which receives as input the URL of the Wikipedia page as well as the XPath expression used to extract data from that page. The function extracts all the text associated with that XPath, splits the extracted text into lines and returns the list of items as a result:
def extract_list(url, xpath):
    # run the browser in headless mode, with the Italian locale
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--lang=it")
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    table = []
    # get the list of terms: take the text of the node matching the XPath
    # (find_element(By.XPATH, ...) is the current Selenium 4 API;
    # the older find_element_by_xpath() has been removed)
    words = driver.find_element(By.XPATH, xpath).text
    table.extend(words.split('\n'))
    driver.close()
    return table
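The function can be tried on a single page to check that it works (a quick sketch; the printed output depends on the live page content):
# Quick check (sketch): extract the items of one page and peek at the result
items = extract_list('https://it.wikipedia.org/wiki/Sinagoghe_in_Italia',
                     '//*[@id="mw-content-text"]')
print(len(items))   # number of extracted lines
print(items[:5])    # first few extracted lines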
Now, I can invoke the function for each considered Wikipedia page, for which I define a list:
pages = ['Comunit%C3%A0_ebraiche_italiane', 'Cimiteri_ebraici_in_Italia', 'Musei_ebraici_in_Italia','Ghetti_ebraici_in_Italia','Sinagoghe_in_Italia']
Then, I loop over the created list of pages and invoke the extract_list function. I also convert each extracted table into a pandas DataFrame and associate with each extracted item a category, corresponding to the considered page (with some stylistic changes):
import pandas as pd

df_dict = {}
xpath = '//*[@id="mw-content-text"]'
table = {}
base_url = 'https://it.wikipedia.org/wiki/'
for page in pages:
    # turn the page slug into a readable category name
    name = page.replace('_', ' ').title().replace('%C3%A0', 'à')
    print(name)
    url = base_url + page
    table[page] = extract_list(url, xpath)
    df_dict[page] = pd.DataFrame(table[page], columns=['value'])
    df_dict[page]['category'] = name
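As a side note, the percent-escapes in the page names could also be decoded with the standard library instead of hard-coding each sequence; a minimal sketch, using urllib.parse.unquote:
from urllib.parse import unquote

# Sketch: decode any percent-escape (e.g. '%C3%A0' -> 'à') in one call
name = unquote(page).replace('_', ' ').title()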
Finally, I build a single DataFrame by concatenating the previously built DataFrames:
df = pd.DataFrame(df_dict[pages[0]])
for i in range(1, len(pages)):
    df = pd.concat([df, df_dict[pages[i]]])
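Equivalently, the per-page DataFrames could be concatenated in a single call; a sketch (note that ignore_index=True also resets the row index, unlike the loop above):
# Sketch: concatenate all the per-page DataFrames at once
df = pd.concat(df_dict.values(), ignore_index=True)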
The extracted data contain many errors, which will need to be corrected later. However, I can already store this first raw dataset as a CSV file:
df.to_csv('data/raw_data.csv')
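A first cleaning pass could, for instance, strip whitespace and drop empty or duplicated rows; a minimal sketch (the actual cleaning rules depend on the noise present in the extracted rows):
# Minimal cleaning sketch (assumption: real rules depend on the actual data)
df['value'] = df['value'].str.strip()  # remove leading/trailing whitespace
df = df[df['value'] != '']             # drop empty lines
df = df.drop_duplicates()              # drop repeated rows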