Observe differences in the behavior of recommendation models using RecBole

This post is a translation (DeepL & human) of my Qiita article: RecBole を使ってレコメンドモデルの挙動の違いを観察する.
In this article I will focus on recommendation; specifically, I'm going to talk about RecBole.
What is RecBole?
I wrote a similar article on my company's blog, so I'll quote the explanation of this part from there.

RecBole seems to be a joint project started by laboratories at Renmin University of China and Peking University, and it appeared on arXiv in November 2020. In August 2021 the library reached v1.0, and it seems to be used in earnest by various people.

The most attractive feature of RecBole is that it implements a large number of recommendation models behind a unified interface so they can be compared. The number of implemented models and applicable datasets is tremendous: there are currently more than 70 models (model list) and more than 20 datasets (dataset list) that you can try immediately.

pip install recbole
python run_recbole.py --model=<your favorite model> --dataset=ml-100k

That's all. You can instantly try out over 70 models (some models require additional configuration) against the MovieLens-100K dataset, the most famous benchmark in the recommendation community. There are not many environments where you can try this many models and datasets. All of the 70+ models in the collection have been carefully reimplemented in PyTorch and are very reliable, and the basic interfaces, such as the predict function, are standardized to make experimenting easy.

In summary, RecBole provides an ecosystem for trying out an unusual number of recommendation models (more than 70) on an unusual number of datasets (about 20) right away.
In the RecBole paper, there is a comparison with other similarly positioned products.
(nits: I don't understand why they compare Fork and Star/Issue numbers...)
"I have no idea which model to use for this recommendation task."
If you work on recommendation tasks, you have probably run into things like this a lot:
  • I often see state-of-the-art results reported for recommendation tasks.
    • However, the benchmark data is not MovieLens 100K, which makes comparison difficult.
  • Even if I try to use a public model implementation, my data only has purchases (implicit feedback), not ratings (explicit feedback) like MovieLens.
  • I don't know how to split train, valid, and test, so I end up building a model and submitting it for online evaluation without doing any offline testing.
  • We manage to implement model A and get a good result, but are told that model B is actually better. I don't have time to reproduce model B (e.g., adapt the implementation to my company's data), and since we haven't done any offline evaluation, there is no way to make that call.
  • In the first place, our own data is too big to train on!
  • Model A worked well for our service α, so we applied it to service β, but the results were not as good (it turned out we needed model C to get good results there).
    I'm sure you have experienced hardships like these. I always thought the same thing: "I don't know which model I should use for the recommendation task."
    I like the recommendation task itself, but the only models I really understood were Item2Vec, which I could run with gensim, and Matrix Factorization, so my knowledge never expanded. I picked up scraps of information, like "VAEs are strong", "there are graph-based models", "you can use Transformers for recommendation", but I never had the time (or capacity) to understand and implement them well enough to use them whenever needed... or so my excuses went.
    With RecBole, however, we can leave the implementation to RecBole and try these models right away. It is also quite easy to make RecBole work with your own data.
    With RecBole, you have one less thing to worry about when working on a recommendation task.
    So the question has now moved to the next stage.
    "Why does the recommendation task vary so much in terms of which models are strong depending on the data?"
    When you start using RecBole to test various models on various data, you will notice that the distribution of strong and weak models is completely different depending on the data.
    I'm sure you've all heard that the strongest models in machine learning differ depending on the task. I'm not sure this is the best concrete example, but roughly speaking: CNNs are generally strong when dealing with images (although "generally" has weakened a lot recently with Vision Transformers, I think we can still acknowledge that CNN-like structures have been effective), neural networks with sequential structure are strong when dealing with language data, and gradient boosting trees beat neural networks on tabular data. Something like that.
    I understand that the strongest model differs from task to task. What I didn't understand is why the top models change when the data changes within the same task...
    In this article, I try to confirm that which models are strong differs from dataset to dataset, using datasets available through RecBole.
    First, I will introduce the results of experiments with two datasets.
    Case 1. MovieLens 1M
    The first dataset is MovieLens, a well-known dataset in which users rate movies. Here we use the 1M version of the rating history.
    This can be used in RecBole without any special procedures.
    from recbole.quick_start import run_recbole
    run_recbole(model=model_name, dataset="movielens-1m")  # model_name: e.g. "BPR"
    This code downloads the MovieLens data on its own, which is great.
    However, with just this, the settings you want to customize are not applied, so you need to prepare a yaml file like the following.
    # general
    gpu_id: 0
    use_gpu: True
    seed: 2020
    state: INFO
    reproducibility: True
    data_path: 'dataset/'
    checkpoint_dir: 'saved/movielens-1m'
    show_progress: True
    save_dataset: False
    save_dataloaders: False
    
    # Atomic File Format
    field_separator: "\t"
    seq_separator: "@"
    
    # Common Features
    USER_ID_FIELD: user_id
    ITEM_ID_FIELD: item_id
    RATING_FIELD: rating
    TIME_FIELD: timestamp
    seq_len: ~
    # Label for Point-wise DataLoader
    LABEL_FIELD: label
    # NegSample Prefix for Pair-wise DataLoader
    NEG_PREFIX: neg_
    # Sequential Model Needed
    ITEM_LIST_LENGTH_FIELD: item_length
    LIST_SUFFIX: _list
    MAX_ITEM_LIST_LENGTH: 50
    POSITION_FIELD: position_id
    # Knowledge-based Model Needed
    HEAD_ENTITY_ID_FIELD: head_id
    TAIL_ENTITY_ID_FIELD: tail_id
    RELATION_ID_FIELD: relation_id
    ENTITY_ID_FIELD: entity_id
    
    # Selectively Loading
    load_col:
        inter: [user_id, item_id, timestamp, rating]
        user: [user_id, age, gender, occupation, zip_code]
        item: [item_id, movie_title, release_year, genre]
    unused_col:
        inter: [timestamp, rating]
    
    # Filtering
    rm_dup_inter: ~
    val_interval: ~
    filter_inter_by_user_or_item: True
    user_inter_num_interval: "[1,inf]"
    item_inter_num_interval: "[1,inf]"
    
    # Preprocessing
    alias_of_user_id: ~
    alias_of_item_id: ~
    alias_of_entity_id: ~
    alias_of_relation_id: ~
    preload_weight: ~
    normalize_field: ~
    normalize_all: True
    
    # Training and evaluation config
    epochs: 50
    stopping_step: 10
    train_batch_size: 4096
    eval_batch_size: 4096
    neg_sampling:
        uniform: 1
    eval_args:
        group_by: user
        order: TO
        split: {'RS': [0.8,0.1,0.1]}
        mode: full
    metrics: ['Recall', 'MRR', 'NDCG', 'Hit', 'Precision']
    topk: 10
    valid_metric: MRR@10
    metric_decimal_place: 4
    Save this as config/movielens-1m.yml or something similar and run:
    run_recbole(model=model_name, dataset="movielens-1m", config_file_list=["config/movielens-1m.yml"])
    Some notes about the yaml:
  • If you don't set seq_separator, sequence fields are split on whitespace. That is fine for model training, but it makes it impossible to look up item names correctly when analyzing the top-k later, so I set a character that is unlikely to appear ("@"). (I don't think this is really the right way to do it, since you could also work around it when looking up the item names.)
  • You need to pay special attention to ITEM_ID_FIELD, because the column name of the item id differs from dataset to dataset.
  • MAX_ITEM_LIST_LENGTH, user_inter_num_interval, and item_inter_num_interval are all settings that cut down the data. If you cannot run a model because there is too much data, adjust these settings to trim it (although sometimes there is nothing left to trim).
  • (Most important) In eval_args, set order to TO and group_by to user! This groups the interactions by user and sorts them chronologically. The split setting then takes effect; here it creates a train:valid:test split of 80%:10%:10%. I think this is the most realistic way to split recommendation data.
  • With this experimental setup I tested 36 models; in RecBole's categories, these are the General and Context-Aware (factorization machine) models. I skipped the Sequential models because of a bug that raises errors when outputting top-k, and the Knowledge-aware models because preparing the required data in practice seems quite difficult. Each model was run with its default parameters; I didn't have time to tune them all... (A sketch of the experiment loop follows this list.)
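    As a rough illustration of how such a sweep could be driven, here is a minimal sketch that loops run_recbole over a few models with the shared yaml. The model list is abbreviated and hypothetical; the result handling assumes run_recbole returns a dict containing "test_result", as it does in RecBole 1.x.

    from recbole.quick_start import run_recbole

    # minimal sketch: run several General / Context-Aware models with default
    # parameters against the same dataset and config, and collect test metrics
    model_names = ["BPR", "LightGCN", "NGCF", "DeepFM", "AutoInt"]  # abbreviated list

    results = {}
    for model_name in model_names:
        result = run_recbole(
            model=model_name,
            dataset="movielens-1m",
            config_file_list=["config/movielens-1m.yml"],
        )
        # run_recbole's return dict includes "best_valid_result" and "test_result"
        results[model_name] = result["test_result"]

    for name, metrics in results.items():
        print(name, metrics)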
    Now, RecBole prints the basic statistics of the data, so let's check those first.
    The number of users: 6041
    Average actions of users: 165.60
    The number of items: 3707
    Average actions of items: 269.89
    The number of inters: 1000209
    The sparsity of the dataset: 95.53%
    The number of users and items is not that large. On average, each user interacts with about 165 items, and each item is interacted with by about 269 users.
    Let's take a look at the experimental results for the 36 models, sorted in descending order by NDCG@10.
    Name recall@10 precision@10 ndcg@10 mrr@10 hit@10
    NGCF 0.0581 0.0647 0.0813 0.1616 0.3745
    LightGCN 0.0578 0.0644 0.081 0.1594 0.3671
    DGCF 0.0587 0.0633 0.0802 0.1585 0.3608
    SLIMElastic 0.0631 0.0612 0.0801 0.1546 0.3664
    BPR 0.0572 0.0638 0.0798 0.1566 0.3618
    AutoInt 0.0552 0.0635 0.0797 0.1588 0.3591
    GCMC 0.0567 0.0631 0.0793 0.1571 0.3596
    AFM 0.0535 0.0637 0.0789 0.1592 0.3652
    NNCF 0.0554 0.0626 0.0788 0.1583 0.3609
    NAIS 0.0592 0.0608 0.0782 0.1528 0.3589
    EASE 0.0658 0.0583 0.0779 0.1473 0.3598
    DeepFM 0.054 0.0621 0.0779 0.1579 0.3566
    DCN 0.0548 0.0618 0.0775 0.1538 0.3505
    WideDeep 0.0534 0.062 0.0774 0.1564 0.356
    Item2Vec 0.0591 0.0609 0.0773 0.1477 0.3598
    FM 0.0553 0.0611 0.0773 0.1557 0.3611
    SpectralCF 0.0527 0.0608 0.0768 0.1575 0.3531
    RecVAE 0.0563 0.0602 0.0765 0.1499 0.347
    NeuMF 0.055 0.0606 0.0763 0.152 0.3543
    FFM 0.0551 0.0612 0.076 0.1507 0.3614
    xDeepFM 0.0532 0.0599 0.0754 0.1545 0.3551
    NFM 0.0515 0.0605 0.075 0.153 0.3543
    DMF 0.0575 0.0582 0.0748 0.1455 0.345
    PNN 0.0524 0.06 0.0745 0.1509 0.3518
    ItemKNN 0.0558 0.0549 0.0716 0.1376 0.3243
    FNN 0.0476 0.058 0.0711 0.1434 0.3296
    MultiDAE 0.0513 0.0566 0.0707 0.1403 0.3336
    MacridVAE 0.0493 0.0536 0.0666 0.1341 0.3321
    CDAE 0.0384 0.0532 0.0632 0.1293 0.2965
    FwFM 0.0386 0.0532 0.063 0.1262 0.2921
    LR 0.0381 0.0534 0.0628 0.1271 0.2949
    Pop 0.0358 0.0494 0.0556 0.1095 0.2891
    LINE 0.0253 0.0485 0.054 0.1185 0.2609
    DSSM 0.0305 0.0411 0.0483 0.104 0.2627
    ENMF 0.0115 0.0176 0.0193 0.0461 0.1442
    It would take too long to explain what each model is, so please look at RecBole's model list alongside this table.
    There are four things I want to point out here.
  • Graph-based models (NGCF, LightGCN, DGCF, GCMC) dominate the top.
  • Long-established models such as SLIMElastic and BPR are also near the top.
  • The only context-aware models near the top are AutoInt and AFM.
  • RecVAE and the other VAE-based models I had high hopes for are not doing so well.
    Case 2. FourSquare NYC
    Now let's try RecBole on another dataset, the second one being FourSquare NYC. I quote the description from https://github.com/RUCAIBox/RecSysDatasets:

    This dataset contains check-ins in NYC and Tokyo collected for about 10 months. Each check-in is associated with its time stamp, its GPS coordinates and its semantic meaning.

    As with MovieLens, let's first look at the basic statistics of the data.
    The number of users: 1084
    Average actions of users: 84.04801477377654
    The number of items: 38334
    Average actions of items: 2.374559778780685
    The number of inters: 91024
    The sparsity of the dataset: 99.78095038424168%
    In this dataset, the number of users and the number of items are not balanced at all. The average user interacts with 84 items, but the average item is interacted with by only about 2 users, so there is clearly a long tail. The sparsity is also higher: about 95% for MovieLens versus about 99.8% here. (A small sketch of how these statistics are derived follows.)
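    To make the relationship between these numbers concrete, here is a small sketch of how the averages and the sparsity follow from the user, item, and interaction counts (the logged user/item counts include a [PAD] token, hence the off-by-one):

    # minimal sketch: reproduce the reported statistics from the raw counts
    n_users = 1083    # FourSquare NYC users (the log reports 1084 incl. [PAD])
    n_items = 38333   # venues (the log reports 38334 incl. [PAD])
    n_inters = 91024  # check-in interactions

    avg_actions_per_user = n_inters / n_users       # ~84.05
    avg_actions_per_item = n_inters / n_items       # ~2.37
    sparsity = 1 - n_inters / (n_users * n_items)   # ~99.78%

    print(f"{avg_actions_per_user:.2f} {avg_actions_per_item:.2f} {sparsity:.4%}")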
    The yaml is almost the same as for MovieLens, but the Selectively Loading section and the item id field differ.
    # Selectively Loading
    load_col:
        inter: [user_id, venue_id, timestamp]
    unused_col:
        inter: [timestamp]
    Note that ITEM_ID_FIELD is changed to venue_id. Also, I didn't use any of the item features this time because I wasn't sure which ones to use, so the results of the context-aware models might look quite different depending on this setting.
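    For reference, these dataset-specific differences can also be passed programmatically: run_recbole accepts a config_dict of overrides on top of yaml files. A minimal sketch, assuming the dataset folder is named foursquare-nyc-merged (the name of the item file used later in this post) and that a shared base yaml exists at a hypothetical path:

    from recbole.quick_start import run_recbole

    # minimal sketch: keep a shared base yaml and override only the
    # FourSquare-specific fields via config_dict
    foursquare_overrides = {
        "ITEM_ID_FIELD": "venue_id",
        "load_col": {"inter": ["user_id", "venue_id", "timestamp"]},
        "unused_col": {"inter": ["timestamp"]},
    }

    run_recbole(
        model="LightGCN",
        dataset="foursquare-nyc-merged",                 # assumed dataset folder name
        config_file_list=["config/foursquare-nyc.yml"],  # hypothetical base yaml
        config_dict=foursquare_overrides,
    )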
    Let's look at the results of the experiment in the same way. This time there are 33 results (the number of models is slightly reduced).
    Name hit@10 mrr@10 ndcg@10 precision@10 recall@10
    LightGCN 0.2004 0.1089 0.0401 0.0243 0.0323
    SLIMElastic 0.205 0.1071 0.0399 0.0246 0.0332
    RecVAE 0.1958 0.0979 0.0373 0.0236 0.0316
    FNN 0.1884 0.098 0.0367 0.0224 0.0303
    DMF 0.1911 0.0953 0.0364 0.023 0.0307
    DeepFM 0.1911 0.0931 0.0359 0.0229 0.0312
    GCMC 0.1819 0.0955 0.0359 0.0222 0.0297
    MacridVAE 0.1911 0.0934 0.0358 0.0228 0.0306
    MultiVAE 0.1745 0.0936 0.0349 0.0211 0.0282
    NeuMF 0.1791 0.0867 0.0343 0.0223 0.0299
    MultiDAE 0.1671 0.0928 0.0341 0.0203 0.0281
    xDeepFM 0.1662 0.0948 0.034 0.0198 0.026
    FwFM 0.169 0.0931 0.0335 0.0202 0.0264
    WideDeep 0.1616 0.0937 0.0335 0.0192 0.0259
    LR 0.1801 0.086 0.0329 0.0219 0.0292
    AutoInt 0.1644 0.0914 0.0323 0.0189 0.0253
    SpectralCF 0.1653 0.0888 0.0322 0.0194 0.026
    PNN 0.1717 0.0827 0.0319 0.0209 0.0283
    AFM 0.1717 0.0802 0.0315 0.0208 0.0281
    DCN 0.169 0.0839 0.0315 0.0207 0.0271
    FFM 0.1634 0.0829 0.0308 0.0191 0.0257
    Pop 0.1791 0.0711 0.0301 0.0214 0.0289
    FM 0.1468 0.0594 0.0243 0.0172 0.0244
    BPR 0.1505 0.0557 0.0237 0.0173 0.0238
    NNCF 0.1274 0.063 0.0232 0.0144 0.0204
    NFM 0.0822 0.022 0.0098 0.0084 0.0106
    LINE 0.0683 0.0225 0.0094 0.0074 0.0098
    NGCF 0.0572 0.023 0.0093 0.0059 0.0086
    ItemKNN 0.0406 0.0178 0.0062 0.0043 0.0054
    Item2Vec 0.0194 0.007 0.0026 0.0021 0.0024
    DSSM 0.0037 0.0015 0.0007 0.0004 0.0007
    ENMF 0.0028 0.0013 0.0005 0.0003 0.0005
    CDAE 0.0009 0.0003 0.0001 0.0001 0.0001
  • LightGCN is still at the top, but NGCF and GCMC have fallen (I forgot to run DGCF).
  • Similarly, SLIMElastic is at the top, but BPR has dropped to the lower ranks.
    • I was surprised, because I had the impression that BPR was always near the top.
  • VAE-based models such as RecVAE and MacridVAE rank higher. FNN, DeepFM, and other context-aware models are also in the top ranks.
  • However, NDCG@10 is only about 0.04 even at the top, so the task is unusually difficult to begin with; on MovieLens the top was about 0.08.
  • These results are not definitive because nothing was tuned, but I think you can see that the results are completely different depending on the data, even though the models were compared under the same conditions. We were able to show with actual figures that "which model is stronger in a recommendation task varies considerably depending on the data."
    Visualizing how similar the recommendation models are to each other as a network graph
    From here on, I would like to discuss the recommendation models obtained from the two datasets.
    Each recommendation model produces a top-10 list for each user, and we can use those lists to quantify how much any two models' recommendations match (or don't match).
    Points I find interesting:
  • Whether models of the same family, such as graph-based models or factorization-machine-based models, have a high degree of overlap with each other.
  • Whether the top-performing (or bottom-performing) models make very different recommendations from the rest.
  • This may help us understand the behavior of the recommendation models.
    Here I call this metric the "recommendation list match degree" and try to calculate it.
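    Concretely, for one user the match degree between two models is the size of the intersection of their top-10 lists divided by 10, and the per-model-pair value is the average over all users. A minimal sketch of this definition (hypothetical helper names):

    import numpy as np

    def match_degree(topk_a, topk_b):
        """Overlap of two top-k lists for a single user, in [0, 1]."""
        return len(set(topk_a) & set(topk_b)) / len(topk_a)

    def model_match_degree(topk_lists_a, topk_lists_b):
        """Average overlap over all users (one top-k list per user and model)."""
        return float(np.mean([match_degree(a, b) for a, b in zip(topk_lists_a, topk_lists_b)]))

    # two top-3 lists sharing 2 of 3 items -> 0.666...
    print(match_degree(["i1", "i2", "i3"], ["i1", "i2", "i9"]))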
    First, we get the topk list from the RecBole model.
    import json
    import click
    import numpy as np
    import torch
    from recbole.config import Config
    from recbole.data import create_dataset, data_preparation
    from recbole.quick_start.quick_start import load_data_and_model
    from recbole.utils import get_model
    from recbole.utils.case_study import full_sort_topk
    import pandas as pd
    from src.custom_models.Item2Vec import Item2Vec
    from src.metrics import calculate_indicators
    from tqdm.auto import tqdm
    
    @click.command()
    @click.option(
        "--model_file",
        required=True,
        type=str,
        help="example. saved/ckpd_recipe/Item2Vec-Nov-06-2021_02-41-35.pth",
    )
    @click.option(
        "--output_file",
        required=True,
        type=str,
        help="example. pop.json",
    )
    @click.option(
        "--is_item2vec",
        type=bool,
        is_flag=True
    )
    def main(model_file, output_file, is_item2vec):
        print("=====")
        print(model_file)
        print("=====")
    
        # for get Item title
        # e.g. foursquare
        _df = pd.read_csv("dataset/foursquare-nyc-merged/foursquare-nyc-merged.item", sep="\t")
        internal_id_to_title = _df["venue_category_name:token"].to_dict()
    
        # custom model(Item2Vec)
        if is_item2vec:
            checkpoint = torch.load(model_file)
            config = checkpoint["config"]
    
            config.seq_separator = "@"
            dataset = create_dataset(config)
            train_data, valid_data, test_data = data_preparation(config, dataset)
    
            model = Item2Vec(config, train_data.dataset).to(config["device"])
            model.load_state_dict(checkpoint["state_dict"])
            model.load_other_parameter(checkpoint.get("other_parameter"))
    
        # when not custom model
        else:
            config, model, dataset, train_data, valid_data, test_data = load_data_and_model(
                model_file=model_file
            )
    
        ground_list = []
        uid_list = []
        for batch_idx, batched_data in enumerate(test_data):
            interaction, row_idx, positive_u, positive_i = batched_data
            ground_list.append([int(v) for v in positive_i.numpy().tolist()])
            uid_list.append(interaction.user_id.numpy()[0])
    
        ranked_list = []
        for uid in tqdm(uid_list):
            topk_score, topk_iid_list = full_sort_topk(
                [uid], model, test_data, k=10, device="cuda"
            )
            ranked_list += topk_iid_list.cpu()
    
        all_metrics_results = {}
        for uid, g_list, r_list in zip(uid_list, ground_list, ranked_list):
            external_uid = dataset.id2token(dataset.uid_field, uid)
            all_metrics_results[external_uid] = {
                "ground_list_id": [v for v in dataset.id2token(dataset.iid_field, g_list)],
                "predict_list_id": [v for v in dataset.id2token(dataset.iid_field, r_list)],
                "ground_list": [internal_id_to_title[v-1] for v in g_list],
                "predict_list": [internal_id_to_title[v-1] for v in r_list.numpy()],
            }
    
        text = json.dumps(all_metrics_results, sort_keys=True, ensure_ascii=False, indent=2)
        with open(output_file, "w") as fh:
            fh.write(text)
    
    
    if __name__ == "__main__":
        main()
    Item2Vec is a custom model that I implemented myself, so it needs some special handling.
    The function full_sort_topk from recbole.utils.case_study produces top-k output, so I use it in this script.
    (The module name recbole.utils.case_study suggests that top-k output is not the main purpose of RecBole.)
    Then I use the dumped top-k lists to compute the recommendation list match degree between every pair of models with the following script.
    import glob
    import itertools
    
    import numpy as np
    import pandas as pd
    from tqdm.auto import tqdm
    
    
    def main():
        filelist = glob.glob("output/foursquare_nyc_case_study/*.json")
        model_results = {}
        for file in tqdm(filelist):
            _model = file.split("/")[-1].split(".")[0]
            try:
                _df = pd.read_json(file).T
                model_results[_model] = _df
            except:
                print(f"{_model} read is failed")
        _models = model_results.keys()
        combis = list(itertools.combinations(_models, 2))
        model_similarities = []
    
        for c in tqdm(combis):
            model1 = c[0]
            model2 = c[1]
            model1_result = model_results[model1]
            model2_result = model_results[model2]
    
            model1_predict_list = model1_result["predict_list_id"].values
            model2_predict_list = model2_result["predict_list_id"].values
            sims = []
            for m1_preds, m2_preds in zip(model1_predict_list, model2_predict_list):
                _sim = len(set(m1_preds) & set(m2_preds)) / len(m1_preds)
                sims.append(_sim)
            similarity = np.mean(sims)
            model_similarities.append([model1, model2, similarity])
            result = pd.DataFrame(
                model_similarities, columns=["source_model", "dest_model", "similarity"]
            )
            result.to_csv("foursquare_nyc_survey_with_recbole.csv", index=False)
    
    
    if __name__ == "__main__":
        main()
    Running this generates a table like the one below (this is the result for FourSquare).
    The full spreadsheet link is here.
    We visualize this similarity list as a network graph: if the similarity between two models is greater than 0.5, we put an edge between the corresponding nodes.
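    The modularity clustering and the Yifan Hu Multilevel layout suggest the graphs were drawn in Gephi. As a rough sketch of the graph-construction step, this is how the thresholded graph could be built from the similarity CSV with networkx and exported for Gephi (file names follow the script above):

    import networkx as nx
    import pandas as pd

    # minimal sketch: add an edge wherever the average top-10 overlap
    # between two models exceeds 0.5
    df = pd.read_csv("foursquare_nyc_survey_with_recbole.csv")

    G = nx.Graph()
    G.add_nodes_from(set(df["source_model"]) | set(df["dest_model"]))
    for _, row in df.iterrows():
        if row["similarity"] > 0.5:
            G.add_edge(row["source_model"], row["dest_model"], weight=row["similarity"])

    # export for Gephi, where the layout and modularity clustering are applied
    nx.write_gexf(G, "foursquare_model_similarity.gexf")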
    The resulting network graphs look like this. The colors reflect the result of modularity clustering, and the layout algorithm is Yifan Hu Multilevel.
    MovieLens
    FourSquare
    Discussion
    Totally different results.
  • Overall, the shape of the network is completely different.
  • In the MovieLens network, the factorization-machine models form a single community, but this is not the case in the FourSquare network.
  • LightGCN, GCMC, and DGCF (graph-based models) tend to fall into the same community.
    • I forgot to run DGCF on FourSquare, so it's missing there. 😭
  • LightGCN had high NDCG on both datasets, but its centrality was high in the MovieLens network and low in the FourSquare network (see the sketch after this list).
    • eigenvector centrality: 0.97 (MovieLens) -> 0.34 (FourSquare)
    • By the way, the nodes with the highest centrality are DGCF and BPR for MovieLens, and DMF and Pop (which simply recommends the most popular items) for FourSquare!
  • There are significantly more nodes without any edges in the FourSquare network than in the MovieLens network.
  • So as the data changes, the behavior of the recommendation models seems to change greatly.
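    For reference, a minimal sketch of how eigenvector centrality figures like the ones above could be computed with networkx, reusing the graph G built in the earlier sketch:

    import networkx as nx

    # minimal sketch: eigenvector centrality of each model node
    # in the similarity graph built in the previous snippet
    centrality = nx.eigenvector_centrality(G, max_iter=1000)

    for model, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{model}: {score:.2f}")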
    What is the cause?
    Through the analysis so far, we have seen that the top models get replaced because the behavior of the models changes significantly depending on the data. So what is the cause, in the end?
    To be honest, I don't have a definite answer yet. But the basic statistics already show that the number of users, the number of items, and the sparsity differ between the two datasets, so I suspect the cause is the nature (bias) of the service that produced each dataset.
    Being able to confirm this tendency with many models and multiple datasets was a great outcome in itself.
    Future Work
    I'd like to be able to express the bias of the data described above as some kind of index. If we can express the bias as an index, we can make guesses such as "model A or B is likely to be strong because this index is high."
    Summary
    It is thanks to RecBole that we can even aim at such a goal. Give RecBole a try!
