diffsasrec

https://github.com/shinypuff/diffsasrec

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity (low similarity, 9.3%, to scientific vocabulary)


Repository

Basic Info

Host: GitHub
Owner: Shinypuff
Language: Python
Default Branch: main
Size: 16.9 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme Citation

DiffSASRec: Diffusion-based Sequential Recommendation

This repository implements DiffSASRec, a diffusion-based variant of the SASRec sequential recommendation model.

Overview

The repository provides two main model variants:
- Original SASRec, based on pmixer's PyTorch implementation
- Diffusion-based language modeling inspired by LLaDA:
  - Additional mask token embedding
  - Forward diffusion process to add noise to sequences
  - Reverse diffusion process for generative recommendation

Training

To train the diffusion model, use the following command:

    python main.py \
        --data_path your_data.csv \
        --train_dir experiment_name \
        --model_type diffusion \
        --num_recs 10 \
        --batch_size 128 \
        --maxlen 200 \
        --hidden_units 50 \
        --num_blocks 2 \
        --num_heads 2 \
        --dropout_rate 0.2 \
        --device cuda

Key parameters:
- --data_path: Path to your input data CSV file
- --train_dir: Directory to save model checkpoints and logs
- --model_type: Choose between 'vanilla' (original SASRec) or 'diffusion'
- --diffusion_type: Choose between 'multi' or 'single' for diffusion and topK inference respectively
- --num_recs: Number of recommendations (mask tokens for diffusion inference or K in topK)
- --maxlen: Maximum sequence length
- --hidden_units: Hidden dimension size
- --num_blocks: Number of transformer blocks
- --num_heads: Number of attention heads
- --SFT: Enable supervised fine-tuning after diffusion pretraining

Data Format

The input data should be a CSV file with the following columns (the default names can be changed through the corresponding arguments):
- UserId (--users_col)
- ProductId (--items_col)
- Timestamp (--time_col)
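As an illustration of how such a file can be turned into per-user interaction sequences, here is a minimal sketch using only the standard library. The function name `load_sequences` is hypothetical and is not taken from the repository, and the sketch assumes that timestamps are numeric; the repository's own loader may work differently.

```python
import csv
import io
from collections import defaultdict


def load_sequences(csv_file, users_col="UserId", items_col="ProductId", time_col="Timestamp"):
    """Group interactions by user and order each user's items by timestamp.

    Hypothetical helper for illustration; assumes numeric timestamps.
    """
    events = defaultdict(list)
    for row in csv.DictReader(csv_file):
        events[row[users_col]].append((float(row[time_col]), row[items_col]))
    # sorted() is stable, so interactions with equal timestamps keep file order.
    return {
        user: [item for _, item in sorted(pairs, key=lambda p: p[0])]
        for user, pairs in events.items()
    }


sample = io.StringIO(
    "UserId,ProductId,Timestamp\n"
    "u1,A,30\n"
    "u1,B,10\n"
    "u2,C,5\n"
    "u1,C,20\n"
)
sequences = load_sequences(sample)
print(sequences)  # {'u1': ['B', 'C', 'A'], 'u2': ['C']}
```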

Training Process

Diffusion Pretraining: Similarly to LLaDA, our implementation defines a model distribution $p_{\theta}(x_0)$ through a forward process and a reverse process. With $t \in (0,1)$, the forward process generates a partially masked sequence $x_t$, in which each token of $x_0$ is masked with probability $t$ or remains unmasked with probability $1 - t$. Thus, the per-token distribution of $x_t$ given $x_0$ is:

$$ q_{t|0}(x_t^i \mid x_0^i) = \begin{cases} 1 - t, & x_t^i = x_0^i, \\ t, & x_t^i = \mathbf{M} \text{ (mask token)}. \end{cases} $$
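The forward process can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the repository's code: `NUM_ITEMS`, `MASK`, and `forward_mask` are hypothetical names, and the mask id is placed one past the largest item id so that it cannot collide with a real item.

```python
import random

NUM_ITEMS = 100          # hypothetical catalogue size
MASK = NUM_ITEMS + 1     # hypothetical id reserved for the mask token


def forward_mask(x0, t, rng):
    """Mask each token of x0 independently with probability t."""
    return [MASK if rng.random() < t else item for item in x0]


rng = random.Random(0)
x0 = [5, 12, 7, 3, 9, 21, 4, 8]
xt = forward_mask(x0, t=0.5, rng=rng)
# Every position is either the original token or the mask token.
print(xt)
```

With `t` close to 0 the sequence is left almost untouched, and with `t` close to 1 almost every token is replaced by the mask.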

The predictor of DiffSASRec is a parametric model $p_{\theta}(\cdot \mid x_t)$ that takes $x_t$ as input and predicts all masked tokens simultaneously. It is trained using a cross-entropy loss computed only on the masked tokens:

$$ \mathcal{L}(\theta) = -\mathbb{E}_{t, x_0, x_t} \left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}[x_t^i = \mathbf{M}] \log p_{\theta}(x_0^i \mid x_t) \right] $$

where $L$ is the sequence length.

Thus, the training algorithm is the following:
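A single-sample estimate of the objective above can be sketched as follows. This is a framework-free illustration, not the repository's training loop: `uniform_predictor` is a hypothetical stand-in for $p_{\theta}$, and drawing $t$ from a range bounded away from 0 is an assumption made here to avoid division by zero.

```python
import math
import random

NUM_ITEMS = 5
MASK = NUM_ITEMS + 1  # hypothetical mask-token id


def uniform_predictor(xt):
    """Stand-in for p_theta: a uniform distribution over items at every position."""
    return [{item: 1.0 / NUM_ITEMS for item in range(1, NUM_ITEMS + 1)} for _ in xt]


def masked_loss(x0, xt, probs, t):
    """Cross-entropy summed over masked positions only, scaled by 1/t."""
    total = 0.0
    for target, token, dist in zip(x0, xt, probs):
        if token == MASK:
            total -= math.log(dist[target])
    return total / t


def training_sample(x0, predictor, rng):
    """Draw t, mask x0 with probability t per token, and return the loss estimate."""
    t = rng.uniform(1e-3, 1.0)  # assumption: keep t away from 0
    xt = [MASK if rng.random() < t else item for item in x0]
    return masked_loss(x0, xt, predictor(xt), t)


loss = masked_loss([1, 2, 3], [1, MASK, MASK], uniform_predictor([1, MASK, MASK]), t=0.5)
print(loss)  # two masked tokens, each costing log(5), divided by t = 0.5
```

In an actual training run the estimate would be averaged over a batch and minimised with a gradient-based optimiser; unmasked positions contribute nothing to the loss.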

Inference

Inference is based on the reverse process: given a user interaction history $p_0$, we recover the data distribution by iteratively predicting masked tokens as $t$ moves from 1 to 0.

However, our objective is to provide K recommendations such that the next relevant item is among our predictions. Thus, there are two ways to sample recommendations:

Single-step inference: Predicts the next item directly. Top K logits are considered to compute metrics @K.
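Selecting the top K logits can be sketched as below. The dictionary of logits and the name `top_k_items` are hypothetical; in the model the logits would come from the prediction at the next-item position.

```python
import heapq


def top_k_items(logits, k):
    """Return the k item ids with the largest logits, best first."""
    return heapq.nlargest(k, logits, key=logits.get)


logits = {"A": 0.1, "B": 2.3, "C": -0.5, "D": 1.7}
recs = top_k_items(logits, k=2)
print(recs)  # ['B', 'D']
```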

Multi-step inference (diffusion-like): The algorithm progressively replaces K masked tokens in an iterative manner. At each step, it predicts possible values for the masked positions and assigns confidence scores to these predictions. Only the tokens with confidence scores exceeding a predefined threshold are updated in the sequence. If no predictions meet this threshold, the confidence requirement is gradually lowered.

The multi-step inference procedure is presented in Algorithm 2:
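The loop described above can be sketched as follows. This is an interpretation of the prose, not the repository's Algorithm 2: the starting threshold, the decay step, and `toy_predict` are hypothetical, and the predictor is assumed to return a proposal with a confidence in [0, 1] for every masked position.

```python
MASK = None  # hypothetical marker for a masked slot


def multi_step_decode(history, k, predict, threshold=0.9, decay=0.1):
    """Fill k appended mask slots, committing only confident predictions each step."""
    seq = list(history) + [MASK] * k
    while MASK in seq:
        # predict(seq) maps each masked position to (item, confidence).
        proposals = predict(seq)
        accepted = {pos: item for pos, (item, conf) in proposals.items() if conf >= threshold}
        if not accepted:
            threshold -= decay  # nothing was confident enough: relax the requirement
            continue
        for pos, item in accepted.items():
            seq[pos] = item
    return seq[len(history):]


def toy_predict(seq):
    """Hypothetical predictor: proposes item 100 + position, most confident on the first open slot."""
    masked = [i for i, tok in enumerate(seq) if tok is MASK]
    return {pos: (100 + pos, 0.85 / (rank + 1)) for rank, pos in enumerate(masked)}


recs = multi_step_decode([1, 2, 3], k=3, predict=toy_predict)
print(recs)  # [103, 104, 105]
```

In this toy run no proposal reaches the initial threshold of 0.9, so the threshold is lowered once, after which one slot is filled per step.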

Data split

The repository provides a time-based split to simulate realistic sequential recommendation settings. The splitting strategy defines a time cutoff over the dataset (e.g. the 95th percentile of timestamps).

To determine the holdout item, the first interaction of each user after the time cutoff is considered. However, this item is chosen only if both the user and the item were present in the dataset before the cutoff. If the first item does not meet this requirement (either because it is a new item that did not appear in the training set or because the user had no prior interactions with it), it is skipped, and the user's next interaction is checked. This process continues until a suitable holdout item is found, ensuring that every user in the evaluation set has prior interactions and that the model has seen the selected item during training. The time-based splitting strategy is presented below:
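The splitting rule can be sketched as follows. The function `time_split`, the tuple layout of the events, and the choice to place interactions at exactly the cutoff time on the evaluation side are assumptions made for illustration; the repository's implementation may differ in these details.

```python
def time_split(events, quantile=0.95):
    """Split (user, item, timestamp) events at a timestamp quantile.

    Returns (train_events, holdout), where holdout maps each user to the first
    post-cutoff item for which both user and item appear in the training part.
    """
    times = sorted(ts for _, _, ts in events)
    cutoff = times[min(int(quantile * len(times)), len(times) - 1)]
    train = [e for e in events if e[2] < cutoff]
    known_users = {user for user, _, _ in train}
    known_items = {item for _, item, _ in train}

    holdout = {}
    later = sorted((e for e in events if e[2] >= cutoff), key=lambda e: e[2])
    for user, item, _ in later:
        if user in holdout:
            continue  # this user already has a holdout item
        if user in known_users and item in known_items:
            holdout[user] = item  # otherwise skip and check the next interaction
    return train, holdout


events = [
    ("u1", "A", 1), ("u1", "B", 2), ("u2", "A", 3), ("u2", "C", 4),
    ("u1", "D", 5), ("u1", "C", 6), ("u3", "A", 7), ("u2", "B", 8),
]
train, holdout = time_split(events, quantile=0.5)
print(holdout)  # {'u1': 'C', 'u2': 'B'}
```

In the example, item D is skipped for u1 because it never occurs before the cutoff, and u3 receives no holdout item because that user has no earlier interactions.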

Evaluation

The model is evaluated using standard recommendation metrics:
- NDCG@10
- HR@10
- MRR@10
- Coverage
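For a single held-out item per user, these metrics reduce to simple functions of the item's rank. The sketch below uses their common definitions; the function names are hypothetical, and Coverage is assumed here to mean the share of the catalogue that appears in at least one top-K list, which may not match the repository's exact definition.

```python
import math


def hit_rate(recs, target, k=10):
    """1 if the target is among the top-k recommendations, else 0."""
    return 1.0 if target in recs[:k] else 0.0


def mrr(recs, target, k=10):
    """Reciprocal of the target's 1-based rank within the top k, else 0."""
    top = recs[:k]
    return 1.0 / (top.index(target) + 1) if target in top else 0.0


def ndcg(recs, target, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 1)."""
    top = recs[:k]
    return 1.0 / math.log2(top.index(target) + 2) if target in top else 0.0


def coverage(all_recs, num_items, k=10):
    """Fraction of the catalogue recommended to at least one user (assumed definition)."""
    recommended = {item for recs in all_recs for item in recs[:k]}
    return len(recommended) / num_items


recs = ["B", "D", "A"]
print(hit_rate(recs, "D"), mrr(recs, "D"), round(ndcg(recs, "D"), 4))  # 1.0 0.5 0.6309
print(coverage([["A", "B"], ["B", "C"]], num_items=6))  # 0.5
```

The per-user values are averaged over all evaluation users to obtain the reported HR@10, MRR@10, and NDCG@10.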

Owner

Login: Shinypuff
Kind: user

Repositories: 1
Profile: https://github.com/Shinypuff

GitHub Events

Total

Release event: 1
Watch event: 1
Public event: 1
Push event: 17
Pull request event: 2
Create event: 1

Last Year

Release event: 1
Watch event: 1
Public event: 1
Push event: 17
Pull request event: 2
Create event: 1
