Content-based Recommender System with Python

Recommender systems are methods that predict the interests of users and generate relevant recommendations for products or services. These can range from songs to play on Apple Music and movies to watch on a streaming service to articles to read in a news journal or products from Amazon.

Recommender systems are differentiated mainly by the type of data in use.

Whereas content-based recommenders rely on features of users and/or items, collaborative filtering uses information on the interactions between users and items, as recorded in the user-item matrix.

Recommender systems are generally divided into three main approaches:

content-based recommendation engines
collaborative filtering recommendation engines
and hybrid recommendation systems
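
To make the difference between the first two approaches concrete, here is a minimal sketch of the data each one works with. The movie titles, users and ratings are made up for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical item features: the kind of data a content-based recommender uses.
item_features = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C"],
        "genre": ["comedy", "drama", "comedy"],
    }
)

# Hypothetical interaction log: the kind of data collaborative filtering uses.
ratings = pd.DataFrame(
    {
        "user": ["u1", "u1", "u2", "u3"],
        "title": ["Movie A", "Movie B", "Movie A", "Movie C"],
        "rating": [5.0, 3.0, 4.0, 2.0],
    }
)

# Pivot the log into a user-item matrix; NaN marks "no interaction yet".
user_item_matrix = ratings.pivot(index="user", columns="title", values="rating")
print(item_features)
print(user_item_matrix)
```

A hybrid system would combine both tables.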

What are content-based recommender systems?

Content-based recommenders produce recommendations using the features or attributes of items and/or users.

User attributes can include age, sex, job and other personal information. Item attributes, in contrast, are descriptive properties that distinguish items from each other.

Example features for movies would be title, cast, description, genre and others.

Because they rely on features, content-based methods are similar to traditional machine learning models, which are often feature-based.

One of the inherent advantages of content-based recommenders is a certain degree of user independence: to generate recommendations for a user, they do not need information about other users, as CF (collaborative filtering) methods do.

The content-based approach is thus easier to scale. Explainability of AI models has also become very important in recent years, and a whole field called XAI (explainable AI) has developed from efforts in this area.

There are many nice libraries that help explain AI predictions; personally, I like SHAP and LIME.

Content-based methods are better in terms of explainability, as it is easier to explain their recommendations than those of collaborative filtering.

CF methods do offer some explainability, though. For example, the CF library https://github.com/benfred/implicit, which I used a lot in past projects, provides the method model.explain for this purpose.

Returning to the content-based approach, it also has its drawbacks. One of them is over-specialization: if the user is only interested in specific categories, the recommender will have difficulty recommending items outside of this area. This can keep the user confined to items similar to the ones they already know.

I will now build an example of a content-based recommender in Python, using the MovieLens data.

Content-based recommender system for recommendation of movies
Our recommender system will be able to recommend movies to us.

First, we import the libraries:

import pandas as pd
import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt


Next, we get our data set from https://www.kaggle.com/rounakbanik/the-movies-dataset and https://grouplens.org/datasets/movielens/latest/:

df_data = pd.read_csv('movies_metadata.csv', low_memory=False)

As part of pre-processing, we remove movies that have a low number of votes:

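# keep only movies with a known vote count
df_data = df_data[df_data['vote_count'].notna()]

plt.figure(figsize=(20,5))
# histplot is used here because distplot is deprecated since seaborn 0.11
sns.histplot(df_data['vote_count'])
plt.title("Histogram of vote counts")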

# determine the minimum number of votes that the movie must have to be included 

min_votes = np.percentile(df_data['vote_count'].values, 85)
# exclude movies that do not have minimum number of votes

df = df_data.copy(deep=True).loc[df_data['vote_count'] > min_votes]
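
To see what the percentile threshold does, here is a small sketch on made-up vote counts (the numbers are hypothetical, not taken from the data set). With NumPy's default linear interpolation, the 85th percentile falls between two observed values, and only movies strictly above it are kept.

```python
import numpy as np

# Hypothetical vote counts: nine obscure movies and one blockbuster.
vote_counts = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# 85th percentile with NumPy's default linear interpolation (roughly 8.65 here).
threshold = np.percentile(vote_counts, 85)

# Same strict ">" comparison as in the filtering step above.
kept = vote_counts[vote_counts > threshold]
print(threshold, kept)
```

Note that a heavily skewed distribution, like the one in the histogram, means this cut-off discards most of the catalogue and keeps only the most-voted movies.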

The content-based recommender will have the goal of recommending movies whose plot is similar to that of a selected movie.

We will use the “overview” feature from our dataset:

# removing rows with missing overview and resetting the index to 0..n-1,
# so that row positions match the rows of the similarity matrix computed later
df = df[df['overview'].notna()].reset_index(drop=True)

# processing of overviews
def process_text(text):
    # replace multiple spaces with one
    text = ' '.join(text.split())
    # lowercase
    text = text.lower()
    return text

df['overview'] = df['overview'].apply(process_text)
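
As a quick check of the text clean-up, the following sketch applies the same two steps to a made-up overview string. Note that split() without arguments also treats tabs and newlines as whitespace, so they are collapsed as well.

```python
def process_text(text):
    # collapse any run of whitespace (spaces, tabs, newlines) into one space
    text = " ".join(text.split())
    # lowercase
    return text.lower()

# Hypothetical overview with messy whitespace and mixed case.
sample = "  A  Young   WIZARD\tdiscovers\nhis destiny. "
cleaned = process_text(sample)
print(cleaned)  # a young wizard discovers his destiny.
```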


To compare movie plots, we first need to compute their vector representations. Various methods are available, from bag of words and word embeddings to TF-IDF; we will use TF-IDF.

TF-IDF approach
The TF-IDF of a word in a document that is part of a larger corpus is a combination of two values. One is term frequency (TF), which measures how frequently the word occurs in the document.

However, some words, such as “the” and “is”, occur frequently in all documents, and we want to reduce their importance. This is done by multiplying the term frequency by the inverse document frequency (IDF), which gets smaller the more documents contain the word.

In this way, only words that are frequent in a document but comparatively rare in the rest of the corpus are considered relevant for that document.
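
The idea can be sketched by hand on a tiny made-up corpus. I assume the common textbook variant here, with TF as the relative frequency of the word in the document and IDF as log(N / df), where N is the number of documents and df the number of documents containing the word. Libraries often use slightly different variants, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of three "plots".
corpus = [
    "the wizard fights the dragon",
    "the detective solves the case",
    "the wizard teaches the detective",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    tf = doc.count(word) / len(doc)            # relative term frequency
    idf = math.log(n_docs / doc_freq[word])    # textbook IDF (assumed variant)
    return tf * idf

# TF-IDF scores of every word in the first document.
scores = {word: tf_idf(word, docs[0]) for word in set(docs[0])}
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{word:8s} {score:.3f}")
```

Here “the” occurs in every document, so its IDF is log(1) = 0 and it drops out entirely, while “dragon”, which appears in only one document, scores higher than “wizard”, which appears in two.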

To build the TF-IDF representation of movie plots, we will use the TfidfVectorizer from scikit-learn. We first fit the TfidfVectorizer on the data set of movie plot descriptions and then transform the plots into their numerical TF-IDF representation (fit_transform does both in one call):

tf_idf = TfidfVectorizer(stop_words='english')
tf_idf_matrix = tf_idf.fit_transform(df['overview'])
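
To get a feeling for what fit_transform returns, here is a small sketch on three made-up overviews (not from the data set). It shows that the result is a sparse matrix with one row per document and one column per vocabulary word, and that, with the default settings, every row is L2-normalized.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews, used only for illustration.
toy_overviews = [
    "a young wizard fights a dragon",
    "a detective solves a murder case",
    "a wizard teaches a young detective",
]

vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = vectorizer.fit_transform(toy_overviews)

# One row per document, one column per word kept in the vocabulary.
print(toy_matrix.shape)
print(sorted(vectorizer.vocabulary_))

# With the default norm='l2', each row has unit length.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

The stop word “a” does not appear in the vocabulary, which is the effect of stop_words='english'.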


We can now compute the similarity of movies by calculating their pairwise cosine similarities and storing them in a cosine similarity matrix:

# calculating cosine similarity between movies
cosine_similarity_matrix = cosine_similarity(tf_idf_matrix, tf_idf_matrix)
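
The properties of the resulting matrix are easiest to see on a toy example. The sketch below uses three made-up overviews and checks that the matrix is square and symmetric, with ones on the diagonal. Because TfidfVectorizer L2-normalizes its rows by default, the plain dot product computed by linear_kernel gives the same values; whether this is noticeably faster on the real data is something I have not measured here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, used only for illustration.
toy_overviews = [
    "a young wizard fights a dragon",
    "a detective solves a murder case",
    "a wizard teaches a young detective",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_overviews)

# n x n matrix: entry [i, j] is the similarity between plot i and plot j.
sim = cosine_similarity(toy_matrix)

# Dot product of unit-length rows equals their cosine similarity.
sim_dot = linear_kernel(toy_matrix, toy_matrix)

print(np.round(sim, 2))
print(np.allclose(sim, sim_dot))
```

Plots 0 and 2 share the word “wizard” and therefore get a higher similarity than plots 0 and 1, which have no words in common. Keep in mind that the full matrix has one entry per pair of movies, so its memory footprint grows quadratically with the number of movies.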

With the cosine similarity matrix computed, we can define the function “recommendations” that will return the top recommendations for a given movie:

# function that returns the index of the movie from its title
def index_from_title(df, title):
    return df[df['original_title'] == title].index.values[0]

# function that returns the title of the movie from its index
def title_from_index(df, index):
    return df[df.index == index].original_title.values[0]

# generating recommendations for given title
def recommendations(original_title, df, cosine_similarity_matrix, number_of_recommendations):
    index = index_from_title(df, original_title)
    # pair every movie position with its similarity to the selected movie
    similarity_scores = list(enumerate(cosine_similarity_matrix[index]))
    # most similar movies first
    similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # skip position 0, which is the selected movie itself
    recommendations_indices = [t[0] for t in similarity_scores_sorted[1:(number_of_recommendations+1)]]
    return df['original_title'].iloc[recommendations_indices]
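
To show how the function is called, here is a self-contained sketch on a toy data frame. The titles and overviews are invented for illustration; with the real data you would pass df and cosine_similarity_matrix from the steps above and an existing title from the original_title column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-catalogue with a default 0..n-1 index.
toy_df = pd.DataFrame(
    {
        "original_title": ["Wizard School", "Dragon Quest", "Murder Case", "Detective Story"],
        "overview": [
            "a young wizard learns magic at school",
            "a wizard and a knight hunt a dragon",
            "a detective investigates a murder",
            "a retired detective reopens an old murder investigation",
        ],
    }
)

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_df["overview"])
toy_similarity = cosine_similarity(toy_tfidf)

def recommendations(original_title, df, cosine_similarity_matrix, number_of_recommendations):
    # position of the selected movie (assumes the title exists in df)
    index = df[df["original_title"] == original_title].index.values[0]
    similarity_scores = list(enumerate(cosine_similarity_matrix[index]))
    similarity_scores_sorted = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    # skip the first entry, which is the selected movie itself
    top = [t[0] for t in similarity_scores_sorted[1:number_of_recommendations + 1]]
    return df["original_title"].iloc[top]

for_murder_case = recommendations("Murder Case", toy_df, toy_similarity, 1)
for_wizard_school = recommendations("Wizard School", toy_df, toy_similarity, 1)
print(for_murder_case.tolist())
print(for_wizard_school.tolist())
```

Two caveats apply: a title that is not present in the data frame raises an IndexError, and if several movies share the same original_title, only the first match is used.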

