Building a Random Movie Picker Using Python and Selenium
My wife (Dany) has been reviewing movies every week and posting them on her YouTube channel. She recently asked me if it was possible to scrape this list of directors and return a random movie they directed, to help her decide which movie to review next.
Sounds like a job for some good ol' web scraping with Selenium.
Scraping websites can be finicky. You're at the whim of the content creator's markup decisions.
The markup of Elacervo is tricky. Their directors page is consistent, but the individual director posts are not. Some director pages have their movie list in individual p tags, while others have their entire movie list formatted within a single span element. This is a problem.
Rather than trying to hack together a way to get most of the directors' movies, I chose to get the list of directors from the site and gather their movie lists from a more reliable source.
I chose IMDb, which has a convenient Python wrapper around its API: IMDbPY.
Dany is a beginner web developer and is curious about Python. Selenium gives the developer visual confirmation through its automated browser interactions. Spawning a new browser instance and clicking through the site does impact performance, but I believe the visual benefits of Selenium outweigh the performance cost.
A framework such as Scrapy would fetch the data faster, but a big reason I'm building this is to help Dany learn Python.
Selenium takes a bit of setting up in order to start automating your browser.
Using Python 3's built-in package manager pip, download Selenium with the command pip install selenium.
I highly recommend taking advantage of a virtualenv and creating an isolated Python environment.
You will also need to download the appropriate web driver. Selenium's documentation has links to the most popular browser drivers. For this tutorial, I will be using Google Chrome and Chromedriver.
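Before diving into the full script, it can be worth running a tiny smoke test to confirm that Python can see both Selenium and the driver. Here's a minimal sketch, assuming the chromedriver binary sits in your working directory and using the same Selenium 3-style call the script below relies on:

# Minimal smoke test: spawn Chrome, load a page, print its title, and close.
# Assumes the chromedriver binary is in the current working directory.
from selenium import webdriver

driver = webdriver.Chrome('chromedriver')
driver.get('https://www.python.org')
print(driver.title)  # prints the page title if everything is wired up correctly
driver.quit()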
Below is the code snippet I decided to use. I've added numbered comments to describe the important choices made within the snippet.
import json
import time

from selenium import webdriver

driver = webdriver.Chrome('chromedriver')  # 1
driver.get('https://www.elacervo.com/directores')  # 2

# 3
for i in range(4):
    time.sleep(5)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# 4
directors = driver.find_elements_by_css_selector(
    "a[href*='https://www.elacervo.com/post/']"
)

# 5
unique_directors = []
for link in directors:
    if link.get_attribute("href") not in unique_directors:
        unique_directors.append(link.get_attribute("href"))

# 6
names = []
for link in unique_directors:
    slug = link.split('/')[-1]
    name = slug.replace('-', ' ').title()
    names.append({"name": name})

# 7
with open('directors.json', 'w') as outfile:
    json.dump(names, outfile)

# 8
driver.quit()
1. driver = webdriver.Chrome('chromedriver') - This is where we tell Selenium to spawn a new Google Chrome instance. The value, chromedriver, that we pass to the .Chrome() method is the location of the chromedriver file we downloaded in the previous step.
2. driver.get('https://www.elacervo.com/directores') - Here we tell our newly created Selenium driver to navigate to the URL https://www.elacervo.com/directores.
3. The website I am scraping has some lazy-loading logic where only a certain number of directors are loaded until the page is scrolled to the bottom. This step executes some client-side JavaScript to scroll to the bottom of the page, wait a few seconds for the new directors to load, then scroll to the new bottom of the page.
4. Here I gather all of the HTML elements that contain an a tag with an href that contains https://www.elacervo.com/post/. This uses the attribute selector href*=, where the * means the href only needs to contain that string rather than match it exactly.
5. This extracts each director's URL from the href attribute and places it into a unique_directors list. Some of the directors on this page have their link listed twice, so I remove any duplicate URLs.
6. I clean up the URL links to get the directors' names out of them. The links gathered look like https://www.elacervo.com/post/martin-scorsese. The logic here takes everything after the last / character, replaces the -'s with spaces, and then capitalizes the first letter of each word in the name.
7. I then use json.dump to write the gathered director names into a JSON file for quicker use later on. Reading from a JSON file is much quicker than spawning a browser to click around and extract data.
8. driver.quit() - This closes the Selenium Chrome instance.
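One caveat worth flagging: newer Selenium releases (4 and up) removed the find_elements_by_* helper methods, so if the snippet above raises an AttributeError on your machine, the equivalent lookup uses the By locator class instead. A small sketch, reusing the driver object from above:

# Equivalent element lookup on Selenium 4+, where the
# find_elements_by_* helpers no longer exist.
from selenium.webdriver.common.by import By

directors = driver.find_elements(
    By.CSS_SELECTOR, "a[href*='https://www.elacervo.com/post/']"
)

With directors.json written to disk, the next script uses IMDbPY to pull each director's filmography from IMDb.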
import json

from imdb import IMDb

file = open('directors.json')
directors = json.load(file)

movies = []
ia = IMDb()

for person in directors:
    try:
        director = ia.search_person(person['name'])[0]
        try:
            films = ia.get_person_filmography(director.personID)['data']['filmography']['director']
            for film in films:
                if film['kind'] == 'movie':
                    try:
                        if film['year']:
                            movies.append(film)
                    except KeyError:
                        continue
        except AttributeError:
            continue
    except IndexError:
        continue

with open('movies.json', 'w') as outfile:
    json.dump([{"title": movie['title'], 'year': movie['year']} for movie in movies], outfile)
We use Python's built-in open function to open the directors.json file we created in the Selenium section. Then, using Python's JSON decoder, we load the data from the file into a usable Python list.
Initializing an IMDb object gives us access to the package's functions, allowing us to query IMDb's database.
The method .search_person(person['name']) returns a list of people IMDb has in its database. It appears the first result in the returned list is the most popular match, which is the reasoning behind the [0]. For this project, I'm making the assumption that this is the director I want to work with.
The properties of IMDbPY Movie objects are documented here. For this project, I'm only interested in movies, so I apply a conditional check and append the accepted films to a movies list.
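If you want to poke at what a given result actually contains before filtering it, a quick exploratory check like the one below works; this is just a sketch, and the exact keys vary from title to title:

from imdb import IMDb

ia = IMDb()

# Grab one film and list the fields IMDbPY actually populated for it.
film = ia.search_movie('Taxi Driver')[0]
print(film.keys())       # e.g. ['title', 'kind', 'year', ...]
print(film.get('year'))  # .get() avoids a KeyError when a field is missing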
It turns out that IMDbPY's Movie object only has the year property if the movie has been released; otherwise, it has a status property instead. I only want movies that are watchable now, so I filter the data accordingly.
movies = []

if film['kind'] == 'movie':
    try:
        if film['year']:
            movies.append(film)
    except KeyError:
        continue
As with the data extracted using Selenium, I decided to reduce the number of API requests to IMDb by writing the acquired data into a reusable JSON file.
Rather than extracting all of the IMDb Movie object's data, I decided to simply take the movie's title and year values. In the future, it could be cool to extract additional data to allow for advanced filtering: for example, only picking a movie made between 1970 and 1980 with a rating greater than 9.0 (I sketch what that kind of filter could look like right after the snippet below). Although the IMDb Movie object says it has certain properties, it ended up being a bit finicky, so I decided against it for now.
with open('movies.json', 'w') as outfile:
    json.dump([{"title": movie['title'], 'year': movie['year']} for movie in movies], outfile)
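Just to illustrate the filtering idea mentioned above, here is a rough sketch of what that could look like over movies.json. Note that the rating key here is purely hypothetical: the script above only saves title and year, so this assumes an extended export that also stores a rating.

import json

# Hypothetical filter: only movies from the 1970s with a very high rating.
# NOTE: the current movies.json only stores "title" and "year"; the
# "rating" field is an assumption, not something the script above produces.
with open('movies.json') as infile:
    movies = json.load(infile)

picks = [
    m for m in movies
    if 1970 <= m['year'] <= 1980 and m.get('rating', 0) > 9.0
]

print(picks)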
Now that we have a list of all of the movies from our group of directors in a JSON file, we can use Python's random.choice to randomly select one.
import random
import json
file = open('movies.json')
data = json.load(file)
print(random.choice(data))
Seriously, give it a shot. At the very least, randomly select a movie and watch its trailer. Many of these movies I had never heard of, but they are fascinating, creative, and artistic.
I hope this article was helpful!