Observe differences in the behavior of recommendation models using RecBole

This post is a translation (DeepL & human) of my Qiita article: RecBole を使ってレコメンドモデルの挙動の違いを観察する.
In this article I will focus on recommendation; specifically, I'm going to talk about RecBole.
What is RecBole?
I wrote a similar article on my company's blog, so I'll quote the explanation of this part from there.

RecBole seems to be a joint project started by laboratories at Renmin University of China and Peking University, and it appeared on arXiv in November 2020. In August 2021 the library reached v1.0, and it seems to be used in earnest by various people.

The most attractive feature of RecBole is that it implements a large number of recommendation models behind a unified interface so they can be compared. The number of implemented models and applicable datasets is tremendous: there are currently more than 70 models (model list) and more than 20 datasets (dataset list) that you can try immediately.

pip install recbole
python run_recbole.py --model=<your favorite model> --dataset=ml-100k

That's all. You can instantly try out over 70 models (some models require additional configuration) against the MovieLens-100K dataset, the most famous benchmark in the recommendation community. There are not many environments where you can try this many models and datasets. All of the 70+ models in the collection have been carefully reimplemented in PyTorch and are very reliable, and the basic interfaces, such as the predict function, are standardized to make experimenting easy.

In summary, RecBole provides an ecosystem for trying out an unusual number of recommendation models (more than 70) on an unusual number of datasets (about 20) right away.
In the RecBole paper, there is a comparison with other similarly positioned products.
(nits: I don't understand why they compare Fork and Star/Issue numbers...)
"I have no idea which model to use for this recommendation task."
If you work on recommendation tasks, you have probably run into things like this a lot:
  • I often see state-of-the-art results reported for recommendation tasks.
    • However, the benchmark data is not MovieLens 100K, which makes comparison difficult.
  • Even if I try to use a public model implementation, my data only has purchases (implicit feedback), not ratings (explicit feedback) like MovieLens.
  • I don't know how to split train, valid, and test, so I end up building a model and submitting it for online evaluation without doing any offline testing.
  • We manage to implement model A and get a good result, but are told that model B is actually better. I don't have time to reproduce model B (e.g., adapt the implementation to my company's data), and since we haven't done any offline evaluation, there is no way to make that call.
  • In the first place, our own data is too big to train on!
  • Model A worked well for our service α, so we applied it to service β, but the results were not as good (it turned out we needed model C to get good results there).
    I'm sure you have experienced hardships like these. I always thought the same thing: "I don't know which model I should use for the recommendation task."
    I like the recommendation task itself, but the only models I really understood were Item2Vec, which I could run with gensim, and Matrix Factorization, so my knowledge never expanded. I picked up scraps of information, like "VAEs are strong", "there are graph-based models", "you can use Transformers for recommendation", but I never had the time (or capacity) to understand and implement them well enough to use them whenever needed... or so my excuses went.
    With RecBole, however, we can leave the implementation to RecBole and try these models right away. It is also quite easy to make RecBole work with your own data.
    With RecBole, you have one less thing to worry about when working on a recommendation task.
    So the question has now moved to the next stage.
    "Why does the recommendation task vary so much in terms of which models are strong depending on the data?"
    When you start using RecBole to test various models on various data, you will notice that the distribution of strong and weak models is completely different depending on the data.
    I'm sure you've all heard that the strongest models in machine learning differ depending on the task. I'm not sure this is the best concrete example, but roughly speaking: CNNs are generally strong when dealing with images (although "generally" has weakened a lot recently with Vision Transformers, I think we can still acknowledge that CNN-like structures have been effective), neural networks with sequential structure are strong when dealing with language data, and gradient boosting trees beat neural networks on tabular data. Something like that.
    I understand that the strongest model differs from task to task. What I didn't understand is why the top models change when the data changes within the same task...
    In this article, I try to confirm that which models are strong differs from dataset to dataset, using datasets available through RecBole.
    First, I will introduce the results of experiments with two datasets.
    Case 1. MovieLens 1M
    The first dataset is MovieLens, a well-known dataset in which users rate movies. Here we use the 1M version of the rating history.
    This can be used in RecBole without any special procedures.
    from recbole.quick_start import run_recbole
    run_recbole(model=model_name, dataset="movielens-1m")  # model_name: e.g. "BPR"
    This code downloads the MovieLens data on its own, which is great.
    However, with just this, the settings you want to customize are not applied, so you need to prepare a yaml file like the following.
    # general
    gpu_id: 0
    use_gpu: True
    seed: 2020
    state: INFO
    reproducibility: True
    data_path: 'dataset/'
    checkpoint_dir: 'saved/movielens-1m'
    show_progress: True
    save_dataset: False
    save_dataloaders: False
    
    # Atomic File Format
    field_separator: "\t"
    seq_separator: "@"
    
    # Common Features
    USER_ID_FIELD: user_id
    ITEM_ID_FIELD: item_id
    RATING_FIELD: rating
    TIME_FIELD: timestamp
    seq_len: ~
    # Label for Point-wise DataLoader
    LABEL_FIELD: label
    # NegSample Prefix for Pair-wise DataLoader
    NEG_PREFIX: neg_
    # Sequential Model Needed
    ITEM_LIST_LENGTH_FIELD: item_length
    LIST_SUFFIX: _list
    MAX_ITEM_LIST_LENGTH: 50
    POSITION_FIELD: position_id
    # Knowledge-based Model Needed
    HEAD_ENTITY_ID_FIELD: head_id
    TAIL_ENTITY_ID_FIELD: tail_id
    RELATION_ID_FIELD: relation_id
    ENTITY_ID_FIELD: entity_id
    
    # Selectively Loading
    load_col:
        inter: [user_id, item_id, timestamp, rating]
        user: [user_id, age, gender, occupation, zip_code]
        item: [item_id, movie_title, release_year, genre]
    unused_col:
        inter: [timestamp, rating]
    
    # Filtering
    rm_dup_inter: ~
    val_interval: ~
    filter_inter_by_user_or_item: True
    user_inter_num_interval: "[1,inf]"
    item_inter_num_interval: "[1,inf]"
    
    # Preprocessing
    alias_of_user_id: ~
    alias_of_item_id: ~
    alias_of_entity_id: ~
    alias_of_relation_id: ~
    preload_weight: ~
    normalize_field: ~
    normalize_all: True
    
    # Training and evaluation config
    epochs: 50
    stopping_step: 10
    train_batch_size: 4096
    eval_batch_size: 4096
    neg_sampling:
        uniform: 1
    eval_args:
        group_by: user
        order: TO
        split: {'RS': [0.8,0.1,0.1]}
        mode: full
    metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
    topk: 10
    valid_metric: MRR@10
    metric_decimal_place: 4
    Save this as config/movielens-1m.yml or something similar and run:
    run_recbole(model=model_name, dataset="movielens-1m", config_file_list=["config/movielens-1m.yml"])
    Some notes about the yaml:
  • If you don't set seq_separator, sequence fields are split on whitespace. That is fine for model training, but it makes it impossible to look up item names correctly when analyzing the top-k later, so I set a character that is unlikely to appear ("@"). (I don't think this is really the right way to do it, since you could also work around it when looking up the item names.)
  • You need to pay special attention to ITEM_ID_FIELD, because the column name of the item id differs from dataset to dataset.
  • MAX_ITEM_LIST_LENGTH, user_inter_num_interval, and item_inter_num_interval are all settings that cut down the data. If you cannot run a model because there is too much data, adjust these settings to trim it (although sometimes there is nothing left to trim).
  • (Most important) In eval_args, set order to TO and group_by to user! This groups the interactions by user and sorts them chronologically. The split setting then takes effect; here it creates a train:valid:test split of 80%:10%:10%. I think this is the most realistic way to split recommendation data.
  • With this experimental setup I tested 36 models; in RecBole's categories, these are the General and Context-Aware (factorization machine) models. I skipped the Sequential models because of a bug that raises errors when outputting top-k, and the Knowledge-aware models because preparing the required data in practice seems quite difficult. Each model was run with its default parameters; I didn't have time to tune them all... (A sketch of the experiment loop follows this list.)
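    As a rough illustration of how such a sweep could be driven, here is a minimal sketch that loops run_recbole over a few models with the shared yaml. The model list is abbreviated and hypothetical; the result handling assumes run_recbole returns a dict containing "test_result", as it does in RecBole 1.x.

    from recbole.quick_start import run_recbole

    # minimal sketch: run several General / Context-Aware models with default
    # parameters against the same dataset and config, and collect test metrics
    model_names = ["BPR", "LightGCN", "NGCF", "DeepFM", "AutoInt"]  # abbreviated list

    results = {}
    for model_name in model_names:
        result = run_recbole(
            model=model_name,
            dataset="movielens-1m",
            config_file_list=["config/movielens-1m.yml"],
        )
        # run_recbole's return dict includes "best_valid_result" and "test_result"
        results[model_name] = result["test_result"]

    for name, metrics in results.items():
        print(name, metrics)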
    Now, RecBole prints the basic statistics of the data, so let's check those first.
    The number of users: 6041
    Average actions of users: 165.60
    The number of items: 3707
    Average actions of items: 269.89
    The number of inters: 1000209
    The sparsity of the dataset: 95.53%
    The number of users and items is not that large. On average, each user interacts with about 165 items, and each item is interacted with by about 269 users.
    Let's take a look at the experimental results for the 36 models, sorted in descending order by NDCG@10.
    Name recall@10 precision@10 ndcg@10 mrr@10 hit@10
    NGCF 0.0581 0.0647 0.0813 0.1616 0.3745
    LightGCN 0.0578 0.0644 0.081 0.1594 0.3671
    DGCF 0.0587 0.0633 0.0802 0.1585 0.3608
    SLIMElastic 0.0631 0.0612 0.0801 0.1546 0.3664
    BPR 0.0572 0.0638 0.0798 0.1566 0.3618
    AutoInt 0.0552 0.0635 0.0797 0.1588 0.3591
    GCMC 0.0567 0.0631 0.0793 0.1571 0.3596
    AFM 0.0535 0.0637 0.0789 0.1592 0.3652
    NNCF 0.0554 0.0626 0.0788 0.1583 0.3609
    NAIS 0.0592 0.0608 0.0782 0.1528 0.3589
    EASE 0.0658 0.0583 0.0779 0.1473 0.3598
    DeepFM 0.054 0.0621 0.0779 0.1579 0.3566
    DCN 0.0548 0.0618 0.0775 0.1538 0.3505
    WideDeep 0.0534 0.062 0.0774 0.1564 0.356
    Item2Vec 0.0591 0.0609 0.0773 0.1477 0.3598
    FM 0.0553 0.0611 0.0773 0.1557 0.3611
    SpectralCF 0.0527 0.0608 0.0768 0.1575 0.3531
    RecVAE 0.0563 0.0602 0.0765 0.1499 0.347
    NeuMF 0.055 0.0606 0.0763 0.152 0.3543
    FFM 0.0551 0.0612 0.076 0.1507 0.3614
    xDeepFM 0.0532 0.0599 0.0754 0.1545 0.3551
    NFM 0.0515 0.0605 0.075 0.153 0.3543
    DMF 0.0575 0.0582 0.0748 0.1455 0.345
    PNN 0.0524 0.06 0.0745 0.1509 0.3518
    ItemKNN 0.0558 0.0549 0.0716 0.1376 0.3243
    FNN 0.0476 0.058 0.0711 0.1434 0.3296
    MultiDAE 0.0513 0.0566 0.0707 0.1403 0.3336
    MacridVAE 0.0493 0.0536 0.0666 0.1341 0.3321
    CDAE 0.0384 0.0532 0.0632 0.1293 0.2965
    FwFM 0.0386 0.0532 0.063 0.1262 0.2921
    LR 0.0381 0.0534 0.0628 0.1271 0.2949
    Pop 0.0358 0.0494 0.0556 0.1095 0.2891
    LINE 0.0253 0.0485 0.054 0.1185 0.2609
    DSSM 0.0305 0.0411 0.0483 0.104 0.2627
    ENMF 0.0115 0.0176 0.0193 0.0461 0.1442
    It would take too long to explain what each model is, so please look at RecBole's model list alongside this table.
    There are four things I want to point out here.
  • Graph-based models (NGCF, LightGCN, DGCF, GCMC) dominate the top.
  • Long-established models such as SLIMElastic and BPR are also near the top.
  • The only context-aware models near the top are AutoInt and AFM.
  • RecVAE and the other VAE-based models I had high hopes for are not doing so well.
    Case 2. FourSquare NYC
    Now let's try RecBole on another dataset, the second one being FourSquare NYC. I quote the description from https://github.com/RUCAIBox/RecSysDatasets:

    This dataset contains check-ins in NYC and Tokyo collected for about 10 months. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning.

    As with MovieLens, let's first look at the basic statistics of the data.
    The number of users: 1084
    Average actions of users: 84.04801477377654
    The number of items: 38334
    Average actions of items: 2.374559778780685
    The number of inters: 91024
    The sparsity of the dataset: 99.78095038424168%
    In this dataset, the number of users and the number of items are not balanced at all. The average user interacts with 84 items, but the average item is interacted with by only about 2 users, so there is clearly a long tail. The sparsity is also higher: about 95% for MovieLens versus about 99.8% here. (A small sketch of how these statistics are derived follows.)
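    To make the relationship between these numbers concrete, here is a small sketch of how the averages and the sparsity follow from the user, item, and interaction counts (the logged user/item counts include a [PAD] token, hence the off-by-one):

    # minimal sketch: reproduce the reported statistics from the raw counts
    n_users = 1083    # FourSquare NYC users (the log reports 1084 incl. [PAD])
    n_items = 38333   # venues (the log reports 38334 incl. [PAD])
    n_inters = 91024  # check-in interactions

    avg_actions_per_user = n_inters / n_users       # ~84.05
    avg_actions_per_item = n_inters / n_items       # ~2.37
    sparsity = 1 - n_inters / (n_users * n_items)   # ~99.78%

    print(f"{avg_actions_per_user:.2f} {avg_actions_per_item:.2f} {sparsity:.4%}")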
    The yaml is almost the same as for MovieLens, but the Selectively Loading section and the item id field differ.
    # Selectively Loading
    load_col:
        inter: [user_id, venue_id, timestamp]
    unused_col:
        inter: [timestamp]
    Note that ITEM_ID_FIELD is changed to venue_id. Also, I didn't use any of the item features this time because I wasn't sure which ones to use, so the results of the context-aware models might look quite different depending on this setting.
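    For reference, these dataset-specific differences can also be passed programmatically: run_recbole accepts a config_dict of overrides on top of yaml files. A minimal sketch, assuming the dataset folder is named foursquare-nyc-merged (the name of the item file used later in this post) and that a shared base yaml exists at a hypothetical path:

    from recbole.quick_start import run_recbole

    # minimal sketch: keep a shared base yaml and override only the
    # FourSquare-specific fields via config_dict
    foursquare_overrides = {
        "ITEM_ID_FIELD": "venue_id",
        "load_col": {"inter": ["user_id", "venue_id", "timestamp"]},
        "unused_col": {"inter": ["timestamp"]},
    }

    run_recbole(
        model="LightGCN",
        dataset="foursquare-nyc-merged",                 # assumed dataset folder name
        config_file_list=["config/foursquare-nyc.yml"],  # hypothetical base yaml
        config_dict=foursquare_overrides,
    )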
    Let's look at the results of the experiment in the same way. This time there are 33 results (the number of models is slightly reduced).
    Name hit@10 mrr@10 ndcg@10 precision@10 recall@10
    LightGCN 0.2004 0.1089 0.0401 0.0243 0.0323
    SLIMElastic 0.205 0.1071 0.0399 0.0246 0.0332
    RecVAE 0.1958 0.0979 0.0373 0.0236 0.0316
    FNN 0.1884 0.098 0.0367 0.0224 0.0303
    DMF 0.1911 0.0953 0.0364 0.023 0.0307
    DeepFM 0.1911 0.0931 0.0359 0.0229 0.0312
    GCMC 0.1819 0.0955 0.0359 0.0222 0.0297
    MacridVAE 0.1911 0.0934 0.0358 0.0228 0.0306
    MultiVAE 0.1745 0.0936 0.0349 0.0211 0.0282
    NeuMF 0.1791 0.0867 0.0343 0.0223 0.0299
    MultiDAE 0.1671 0.0928 0.0341 0.0203 0.0281
    xDeepFM 0.1662 0.0948 0.034 0.0198 0.026
    FwFM 0.169 0.0931 0.0335 0.0202 0.0264
    WideDeep 0.1616 0.0937 0.0335 0.0192 0.0259
    LR 0.1801 0.086 0.0329 0.0219 0.0292
    AutoInt 0.1644 0.0914 0.0323 0.0189 0.0253
    SpectralCF 0.1653 0.0888 0.0322 0.0194 0.026
    PNN 0.1717 0.0827 0.0319 0.0209 0.0283
    AFM 0.1717 0.0802 0.0315 0.0208 0.0281
    DCN 0.169 0.0839 0.0315 0.0207 0.0271
    FFM 0.1634 0.0829 0.0308 0.0191 0.0257
    Pop 0.1791 0.0711 0.0301 0.0214 0.0289
    FM 0.1468 0.0594 0.0243 0.0172 0.0244
    BPR 0.1505 0.0557 0.0237 0.0173 0.0238
    NNCF 0.1274 0.063 0.0232 0.0144 0.0204
    NFM 0.0822 0.022 0.0098 0.0084 0.0106
    LINE 0.0683 0.0225 0.0094 0.0074 0.0098
    NGCF 0.0572 0.023 0.0093 0.0059 0.0086
    ItemKNN 0.0406 0.0178 0.0062 0.0043 0.0054
    Item2Vec 0.0194 0.007 0.0026 0.0021 0.0024
    DSSM 0.0037 0.0015 0.0007 0.0004 0.0007
    ENMF 0.0028 0.0013 0.0005 0.0003 0.0005
    CDAE 0.0009 0.0003 0.0001 0.0001 0.0001
  • LightGCN is still at the top, but NGCF and GCMC have fallen (I forgot to run DGCF).
  • Similarly, SLIMElastic is at the top, but BPR has dropped to the lower ranks.
    • I was surprised, because I had the impression that BPR was always near the top.
  • VAE-based models such as RecVAE and MacridVAE rank higher. FNN, DeepFM, and other context-aware models are also in the top ranks.
  • However, NDCG@10 is only about 0.04 even at the top, so the task is unusually difficult to begin with; on MovieLens the top was about 0.08.
  • These results are not definitive because nothing was tuned, but I think you can see that the results are completely different depending on the data, even though the models were compared under the same conditions. We were able to show with actual figures that "which model is stronger in a recommendation task varies considerably depending on the data."
    Visualizing how similar the recommendation models are to each other as a network graph
    From here on, I would like to discuss the recommendation models obtained from the two datasets.
    Each recommendation model produces a top-10 list for each user, and we can use those lists to quantify how much any two models' recommendations match (or don't match).
    Points I find interesting:
  • Whether models of the same family, such as graph-based models or factorization-machine-based models, have a high degree of overlap with each other.
  • Whether the top-performing (or bottom-performing) models make very different recommendations from the rest.
  • This may help us understand the behavior of the recommendation models.
    Here I call this metric the "recommendation list match degree" and try to calculate it.
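    Concretely, for one user the match degree between two models is the size of the intersection of their top-10 lists divided by 10, and the per-model-pair value is the average over all users. A minimal sketch of this definition (hypothetical helper names):

    import numpy as np

    def match_degree(topk_a, topk_b):
        """Overlap of two top-k lists for a single user, in [0, 1]."""
        return len(set(topk_a) & set(topk_b)) / len(topk_a)

    def model_match_degree(topk_lists_a, topk_lists_b):
        """Average overlap over all users (one top-k list per user and model)."""
        return float(np.mean([match_degree(a, b) for a, b in zip(topk_lists_a, topk_lists_b)]))

    # two top-3 lists sharing 2 of 3 items -> 0.666...
    print(match_degree(["i1", "i2", "i3"], ["i1", "i2", "i9"]))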
    First, we get the topk list from the RecBole model.
    import json
    import click
    import numpy as np
    import torch
    from recbole.config import Config
    from recbole.data import create_dataset, data_preparation
    from recbole.quick_start.quick_start import load_data_and_model
    from recbole.utils import get_model
    from recbole.utils.case_study import full_sort_topk
    import pandas as pd
    from src.custom_models.Item2Vec import Item2Vec
    from src.metrics import calculate_indicators
    from tqdm.auto import tqdm
    
    @click.command()
    @click.option(
        "--model_file",
        required=True,
        type=str,
        help="example. saved/ckpd_recipe/Item2Vec-Nov-06-2021_02-41-35.pth",
    )
    @click.option(
        "--output_file",
        required=True,
        type=str,
        help="example. pop.json",
    )
    @click.option(
        "--is_item2vec",
        type=bool,
        is_flag=True
    )
    def main(model_file, output_file, is_item2vec):
        print("=====")
        print(model_file)
        print("=====")
    
        # for get Item title
        # e.g. foursquare
        _df = pd.read_csv("dataset/foursquare-nyc-merged/foursquare-nyc-merged.item", sep="\t")
        internal_id_to_title = _df["venue_category_name:token"].to_dict()
    
        # custom model(Item2Vec)
        if is_item2vec:
            checkpoint = torch.load(model_file)
            config = checkpoint["config"]
    
            config.seq_separator = "@"
            dataset = create_dataset(config)
            train_data, valid_data, test_data = data_preparation(config, dataset)
    
            model = Item2Vec(config, train_data.dataset).to(config["device"])
            model.load_state_dict(checkpoint["state_dict"])
            model.load_other_parameter(checkpoint.get("other_parameter"))
    
        # when not custom model
        else:
            config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
                model_file=model_file
            )
    
        ground_list = []
        uid_list = []
        for batch_idx, batched_data in enumerate(test_data):
            interaction, row_idx, positive_u, positive_i = batched_data
            ground_list.append([int(v) for v in positive_i.numpy().tolist()])
            uid_list.append(interaction.user_id.numpy()[0])
    
        ranked_list = []
        for uid in tqdm(uid_list):
            topk_score, topk_iid_list = full_sort_topk(
                [uid], model, test_data, k=10, device="cuda"
            )
            ranked_list += topk_iid_list.cpu()
    
        all_metrics_results = {}
        for uid, g_list, r_list in zip(uid_list, ground_list, ranked_list):
            external_uid = dataset.id2token(dataset.uid_field, uid)
            all_metrics_results[external_uid] = {
                "ground_list_id": [v for v in dataset.id2token(dataset.iid_field, g_list)],
                "predict_list_id": [v for v in dataset.id2token(dataset.iid_field, r_list)],
                "ground_list": [internal_id_to_title[v-1] for v in g_list],
                "predict_list": [internal_id_to_title[v-1] for v in r_list.numpy()],
            }
    
        text = json.dumps(all_metrics_results, sort_keys=True, ensure_ascii=False, indent=2)
        with open(output_file, "w") as fh:
            fh.write(text)
    
    
    if __name__ == "__main__":
        main()
    Item2Vec is a custom model that I implemented myself, so it needs some special handling.
    The function full_sort_topk from recbole.utils.case_study produces top-k output, so I use it in this script.
    (The module name recbole.utils.case_study suggests that top-k output is not the main purpose of RecBole.)
    Then I use the dumped top-k lists to compute the recommendation list match degree between every pair of models with the following script.
    import glob
    import itertools
    
    import numpy as np
    import pandas as pd
    from tqdm.auto import tqdm
    
    
    def main():
        filelist = glob.glob("output/foursquare_nyc_case_study/*.json")
        model_results = {}
        for file in tqdm(filelist):
            _model = file.split("/")[-1].split(".")[0]
            try:
                _df = pd.read_json(file).T
                model_results[_model] = _df
            except:
                print(f"{_model} read is failed")
        _models = model_results.keys()
        combis = list(itertools.combinations(_models, 2))
        model_similarities = []
    
        for c in tqdm(combis):
            model1 = c[0]
            model2 = c[1]
            model1_result = model_results[model1]
            model2_result = model_results[model2]
    
            model1_predict_list = model1_result["predict_list_id"].values
            model2_predict_list = model2_result["predict_list_id"].values
            sims = []
            for m1_preds, m2_preds in zip(model1_predict_list, model2_predict_list):
                _sim = len(set(m1_preds) & set(m2_preds)) / len(m1_preds)
                sims.append(_sim)
            similarity = np.mean(sims)
            model_similarities.append([model1, model2, similarity])
            result = pd.DataFrame(
                model_similarities, columns=["source_model", "dest_model", "similarity"]
            )
            result.to_csv("foursquare_nyc_survey_with_recbole.csv", index=False)
    
    
    if __name__ == "__main__":
        main()
    Running this generates a table like the one below (this is the result for FourSquare).
    The full spreadsheet link is here.
    We visualize this similarity list as a network graph: if the similarity between two models is greater than 0.5, we put an edge between the corresponding nodes.
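    The modularity clustering and the Yifan Hu Multilevel layout suggest the graphs were drawn in Gephi. As a rough sketch of the graph-construction step, this is how the thresholded graph could be built from the similarity CSV with networkx and exported for Gephi (file names follow the script above):

    import networkx as nx
    import pandas as pd

    # minimal sketch: add an edge wherever the average top-10 overlap
    # between two models exceeds 0.5
    df = pd.read_csv("foursquare_nyc_survey_with_recbole.csv")

    G = nx.Graph()
    G.add_nodes_from(set(df["source_model"]) | set(df["dest_model"]))
    for _, row in df.iterrows():
        if row["similarity"] > 0.5:
            G.add_edge(row["source_model"], row["dest_model"], weight=row["similarity"])

    # export for Gephi, where the layout and modularity clustering are applied
    nx.write_gexf(G, "foursquare_model_similarity.gexf")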
    The resulting network graphs look like this. The colors reflect the result of modularity clustering, and the layout algorithm is Yifan Hu Multilevel.
    MovieLens
    FourSquare
    Discussion
    Totally different results.
  • Overall, the shape of the network is completely different.
  • In the MovieLens network, the factorization-machine models form a single community, but this is not the case in the FourSquare network.
  • LightGCN, GCMC, and DGCF (graph-based models) tend to fall into the same community.
    • I forgot to run DGCF on FourSquare, so it's missing there. 😭
  • LightGCN had high NDCG on both datasets, but its centrality was high in the MovieLens network and low in the FourSquare network (see the sketch after this list).
    • eigenvector centrality: 0.97 (MovieLens) -> 0.34 (FourSquare)
    • By the way, the nodes with the highest centrality are DGCF and BPR for MovieLens, and DMF and Pop (which simply recommends the most popular items) for FourSquare!
  • There are significantly more nodes without any edges in the FourSquare network than in the MovieLens network.
  • So as the data changes, the behavior of the recommendation models seems to change greatly.
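    For reference, a minimal sketch of how eigenvector centrality figures like the ones above could be computed with networkx, reusing the graph G built in the earlier sketch:

    import networkx as nx

    # minimal sketch: eigenvector centrality of each model node
    # in the similarity graph built in the previous snippet
    centrality = nx.eigenvector_centrality(G, max_iter=1000)

    for model, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {score:.2f}")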
    What is the cause?
    Through the analysis so far, we have seen that the top models get replaced because the behavior of the models changes significantly depending on the data. So what is the cause, in the end?
    To be honest, I don't have a definite answer yet. But the basic statistics already show that the number of users, the number of items, and the sparsity differ between the two datasets, so I suspect the cause is the nature (bias) of the service that produced each dataset.
    Being able to confirm this tendency with many models and multiple datasets was a great outcome in itself.
    Future Work
    I'd like to be able to express the bias of the data described above as some kind of index. If we can express the bias as an index, we can make guesses such as "model A or B is likely to be strong because this index is high."
    Summary
    It is thanks to RecBole that we can even aim at such a goal. Give RecBole a try!
