Science Score: 31.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: goyalkaraniit
- Language: Python
- Default Branch: main
- Size: 4.76 MB
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
This is the official repository for the paper "SymTax: Symbiotic Relationship and Taxonomy Fusion for Effective Citation Recommendation" presented in ACL 2024.
The ArSyTa dataset introduced in the paper is now hosted at HuggingFace https://huggingface.co/datasets/goyalkaraniit/ArSyTa
SymTax
The steps below cover running the models, generating the taxonomy, and custom training, with the paths, files, and commands needed for each.
Running the SymTax models
python per.py
Step 1: Running Baseline Datasets
To run the model on the arXiv, RefSeer, acl, or PeerRead datasets, follow these steps:
- Open the `baseline_datasets.py` file.
- Set the `enrich` variable to `True` or `False`, depending on whether you want to use the enricher.
- Set the `dataset` variable to the desired dataset (e.g., `arxiv`, `refseer`, `peerread`, `acl`).
```python
# baseline_datasets.py
enrich = True  # or False
dataset = 'arxiv'  # or 'refseer', 'peerread', 'acl'
```
- Run the script to get the output metrics (recall, MRR, NCG):
```bash
python baseline_datasets.py
```
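The README does not define these metrics. As a rough sketch under the usual information-retrieval definitions, recall@k and MRR can be computed as shown below. The function names and toy paper IDs are illustrative assumptions, not code from `baseline_datasets.py`, and NCG is left out because its exact definition is not stated here.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, relevants) -> float:
    """Average of the reciprocal ranks over all queries."""
    values = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(values) / len(values) if values else 0.0


# Toy data (hypothetical): two queries, each with a ranked list of paper IDs
# and the set of papers that were actually cited.
rankings = [["p3", "p1", "p7"], ["p2", "p9", "p4"]]
relevants = [{"p1"}, {"p4", "p8"}]

recall_2 = [recall_at_k(r, rel, 2) for r, rel in zip(rankings, relevants)]
mrr = mean_reciprocal_rank(rankings, relevants)
print(recall_2, mrr)
```

Here the first query retrieves its only cited paper at rank 2, while the second retrieves one of its two cited papers at rank 3, outside the top 2.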
The `arxiv_acm_mapping_updated.csv` file should contain the mapping of the flat-level arXiv taxonomy to the tree-level ACM taxonomy.
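To make the flat-to-tree idea concrete, here is a minimal sketch of reading such a mapping into a dictionary. The column names (`arxiv_category`, `acm_path`), the `>` separator, and the two sample rows are assumptions for illustration only; the actual layout of `arxiv_acm_mapping_updated.csv` may differ.

```python
import csv
import io

# Hypothetical stand-in for arxiv_acm_mapping_updated.csv.
# The real column names and rows may differ.
SAMPLE_CSV = """arxiv_category,acm_path
cs.CL,Computing methodologies > Artificial intelligence > Natural language processing
cs.IR,Information systems > Information retrieval
"""


def load_mapping(handle):
    """Map each flat arXiv category to a list of ACM tree nodes, root first."""
    mapping = {}
    for row in csv.DictReader(handle):
        path = [node.strip() for node in row["acm_path"].split(">")]
        mapping[row["arxiv_category"]] = path
    return mapping


mapping = load_mapping(io.StringIO(SAMPLE_CSV))

# Depth of each category's position in the (assumed) ACM tree.
depths = {category: len(path) for category, path in mapping.items()}
print(depths)
```

A flat arXiv label such as `cs.IR` then resolves to a root-to-leaf path, which is the kind of hierarchical signal a tree-level taxonomy adds over a flat one.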
Step 2: Custom Training
To custom train the model, follow these steps:
- Edit the relevant arguments in `src/rerank/train.py`.
```python
# src/rerank/train.py
# Ensure the correct paths and parameters are set
# Example placeholder
train_data_path = 'path/to/train/data'
val_data_path = 'path/to/validation/data'
model_save_path = 'path/to/save/model'
# Add or modify other training parameters as needed
```
- Modify the configuration file `src/rerank/config/arxiv/scibert/training_NN_prefetch.config` to reflect your desired settings.
```yaml
# src/rerank/config/arxiv/scibert/training_NN_prefetch.config
# Ensure the configuration parameters match your training requirements
batch_size: 32
learning_rate: 1e-5
num_epochs: 10
# Add or modify other configurations as needed
```
- Navigate to the `src/rerank` directory and run the training script:
```bash
cd src/rerank
python train.py -config_file_path config/arxiv/scibert/training_NN_prefetch.config
```
Summary of Commands
- Running baseline datasets:
```bash
python baseline_datasets.py
```
- Generating arXiv fusion taxonomy (in Jupyter Notebook):
```bash
# Open and run all cells in arxiv_acm_fusion.ipynb
```
- Custom training:
```bash
cd src/rerank
python train.py -config_file_path config/arxiv/scibert/training_NN_prefetch.config
```
By following these steps, you can ensure your workflow is clear, and each part of the process is correctly configured and executed.
Owner
- Name: Karan Goyal
- Login: goyalkaraniit
- Kind: user
- Location: New Delhi
- Website: https://www.linkedin.com/in/karan-goyal-757727a6/
- Repositories: 1
- Profile: https://github.com/goyalkaraniit
PhD in CSE@IIIT Delhi | Ex-Qualcomm | Masters@IIT Delhi
Citation (citation_recommender.py)
from src.prefetch.rankers import PrefetchEncoder
from src.prefetch.rankers import Ranker as PrefetchRanker
import pickle
import json
import numpy as np
from tqdm import tqdm
from src.rerank.datautils import RerankDataset
from src.rerank.model import Scorer, Scorer_PER_v1, Scorer_PER_v2
from transformers import AutoTokenizer
import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel, DistributedDataParallel
class Prefetcher:
    def __init__( self, model_path,
                  embedding_path,
                  gpu_list = [],
                  vocab_path = "model/glove/vocabulary_200dim.pkl",
                  embed_dim = 200,
                  num_heads = 8,
                  hidden_dim = 1024,
                  max_seq_len = 512,
                  max_doc_len = 3,
                  n_para_types = 100,
                  num_enc_layers = 1,
                  citation_title_label = 0,
                  citation_abstract_label = 1,
                  citation_context_label = 3,
                ):
        with open( vocab_path,"rb") as f:
            words = pickle.load(f)
        encoder = PrefetchEncoder( model_path, vocab_path,
                                   embed_dim, gpu_list[:1] ,
                                   num_heads, hidden_dim,
                                   max_seq_len, max_doc_len,
                                   n_para_types, num_enc_layers
                                 )
        ranker = PrefetchRanker( embedding_path, embed_dim , gpu_list = gpu_list )
        ranker.encoder = encoder
        self.ranker = ranker
        self.encoder = encoder
        self.citation_title_label = citation_title_label
        self.citation_abstract_label = citation_abstract_label
        self.citation_context_label = citation_context_label

    def get_top_n( self, query, n = 10 ):
        """
        query structure
        {
            "citing_title": The title of the citing paper, default = ""
            "citing_abstract": The abstract of the citing paper, default = ""
            "local_context": The local citation sentence as the local context
        }
        """
        query_text = [
            [
                [ query.get("citing_title",""), self.citation_title_label ],
                [ query.get("citing_abstract",""), self.citation_abstract_label ],
                [ query.get("local_context", "") , self.citation_context_label ]
            ]
        ]
        candidates = self.ranker.get_top_n( n, query_text )
        return candidates

class Reranker:
    def __init__(self,
                 model_path,
                 gpu_list = [],
                 initial_model_path = "allenai/scibert_scivocab_uncased" ):
        self.tokenizer = AutoTokenizer.from_pretrained( initial_model_path )
        self.tokenizer.add_special_tokens( { 'additional_special_tokens': ['<cit>','<sep>','<eos>'] } )
        vocab_size = len( self.tokenizer )
        ckpt = torch.load( model_path, map_location=torch.device('cpu') )
        self.scorer = Scorer( initial_model_path, vocab_size )
        self.scorer.load_state_dict( ckpt["scorer"] )
        self.device = torch.device( "cuda:%d"%(gpu_list[0]) if torch.cuda.is_available() and len(gpu_list) > 0 else "cpu" )
        self.scorer = self.scorer.to(self.device)
        if self.device.type == "cuda" and len( gpu_list ) > 1:
            self.scorer = DataParallel(self.scorer, gpu_list)
        self.sep_token = "<sep>"

    def rerank(self, citing_title = "", citing_abstract="", local_context="", original_candidate_list=[ {} ], max_input_length = 512, reranking_batch_size = 50 ):
        candidate_list = original_candidate_list.copy()
        if len(candidate_list) == 0:
            return []
        global_context = citing_title + " " + citing_abstract
        # add section in query_text
        query_text = " ".join(global_context.split()[:int(max_input_length * 0.35)]) + self.sep_token + local_context
        score_list = []
        for pos in range(0, len(candidate_list), reranking_batch_size):
            candidate_batch = candidate_list[pos: pos + reranking_batch_size]
            query_text_batch = [query_text for _ in range(len(candidate_batch))]
            candidate_text_batch = [item.get("title", "") + " " + item.get("abstract", "") for item in candidate_batch]
            encoded_seqs = self.tokenizer(query_text_batch, candidate_text_batch, max_length=max_input_length,
                                          padding="max_length", truncation=True)
            for key in encoded_seqs:
                encoded_seqs[key] = torch.from_numpy(np.asarray(encoded_seqs[key])).to(self.device)
            with torch.no_grad():
                score_list.append(self.scorer({
                    "input_ids": encoded_seqs["input_ids"],
                    "token_type_ids": encoded_seqs["token_type_ids"],
                    "attention_mask": encoded_seqs["attention_mask"]
                }).detach())
        score_list = torch.cat(score_list, dim=0).view(-1).cpu().numpy().tolist()
        candidate_list, _ = list(zip(*sorted(zip(candidate_list, score_list), key=lambda x: -x[1])))
        return candidate_list

class Reranker_PER_v1:
    def __init__(self,
                 model_path,
                 gpu_list=[],
                 initial_model_path="allenai/scibert_scivocab_uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(initial_model_path)
        self.tokenizer.add_special_tokens({'additional_special_tokens': ['<cit>', '<sep>', '<eos>']})
        vocab_size = len(self.tokenizer)
        ckpt = torch.load(model_path, map_location=torch.device('cpu'))
        self.scorer = Scorer_PER_v1(initial_model_path, vocab_size)
        self.scorer.load_state_dict(ckpt["scorer"])
        self.device = torch.device(
            "cuda:%d" % (gpu_list[0]) if torch.cuda.is_available() and len(gpu_list) > 0 else "cpu")
        self.scorer = self.scorer.to(self.device)
        if self.device.type == "cuda" and len(gpu_list) > 1:
            self.scorer = DataParallel(self.scorer, gpu_list)
        self.sep_token = "<sep>"

    def rerank(self, citing_title="", citing_abstract="", local_context="", citing_category="",
               original_candidate_list=[{}], cat_map={}, max_input_length=512, reranking_batch_size=50, heading=""):
        candidate_list = original_candidate_list.copy()
        if len(candidate_list) == 0:
            return []
        global_context = citing_title + " " + citing_abstract
        # add section in query_text
        query_text = " ".join(global_context.split()[:int(max_input_length * 0.35)]) + self.sep_token + local_context
        score_list = []
        for pos in range(0, len(candidate_list), reranking_batch_size):
            candidate_batch = candidate_list[pos: pos + reranking_batch_size]
            query_text_batch = [query_text for _ in range(len(candidate_batch))]
            candidate_text_batch = [item.get("title", "") + " " + item.get("abstract", "") for item in candidate_batch]
            encoded_seqs = self.tokenizer(query_text_batch, candidate_text_batch, max_length=max_input_length,
                                          padding="max_length", truncation=True)
            for key in encoded_seqs:
                encoded_seqs[key] = torch.from_numpy(np.asarray(encoded_seqs[key])).to(self.device)
            citing_category_batch = [cat_map[citing_category][0] for _ in range(len(candidate_batch))]
            citing_category_batch = torch.from_numpy(np.array(citing_category_batch)).to(self.device)
            candidate_category_batch = [cat_map[item['categories'][0]][0] for item in candidate_batch]
            candidate_category_batch = torch.from_numpy(np.array(candidate_category_batch)).to(self.device)
            with torch.no_grad():
                param = {
                    "input_ids": encoded_seqs["input_ids"],
                    "token_type_ids": encoded_seqs["token_type_ids"],
                    "attention_mask": encoded_seqs["attention_mask"]
                }
                score_list.append(self.scorer({'param': param, 'category_batch_query': citing_category_batch,
                                               'category_batch_candidate': candidate_category_batch}).detach())
        score_list = torch.cat(score_list, dim=0).view(-1).cpu().numpy().tolist()
        candidate_list, _ = list(zip(*sorted(zip(candidate_list, score_list), key=lambda x: -x[1])))
        return candidate_list

class Reranker_PER_v2:
    def __init__(self,
                 model_path,
                 gpu_list=[],
                 initial_model_path="allenai/scibert_scivocab_uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(initial_model_path)
        self.tokenizer.add_special_tokens({'additional_special_tokens': ['<cit>', '<sep>', '<eos>']})
        vocab_size = len(self.tokenizer)
        ckpt = torch.load(model_path, map_location=torch.device('cpu'))
        self.scorer = Scorer_PER_v2(initial_model_path, vocab_size)
        self.scorer.load_state_dict(ckpt["scorer"])
        self.device = torch.device(
            "cuda:%d" % (gpu_list[0]) if torch.cuda.is_available() and len(gpu_list) > 0 else "cpu")
        self.scorer = self.scorer.to(self.device)
        if self.device.type == "cuda" and len(gpu_list) > 1:
            self.scorer = DataParallel(self.scorer, gpu_list)
        self.sep_token = "<sep>"

    def rerank(self, citing_title="", citing_abstract="", local_context="", citing_category="",
               original_candidate_list=[{}], cat_map={}, max_input_length=512, reranking_batch_size=50, heading=""):
        candidate_list = original_candidate_list.copy()
        if len(candidate_list) == 0:
            return []
        global_context = citing_title + " " + citing_abstract
        # add section in query_text
        query_text = " ".join(
            global_context.split()[:int(max_input_length * 0.35)]) + self.sep_token + local_context + heading
        score_list = []
        for pos in range(0, len(candidate_list), reranking_batch_size):
            candidate_batch = candidate_list[pos: pos + reranking_batch_size]
            query_text_batch = [query_text for _ in range(len(candidate_batch))]
            candidate_text_batch = [item.get("title", "") + " " + item.get("abstract", "") for item in candidate_batch]
            encoded_seqs = self.tokenizer(query_text_batch, candidate_text_batch, max_length=max_input_length,
                                          padding="max_length", truncation=True)
            for key in encoded_seqs:
                encoded_seqs[key] = torch.from_numpy(np.asarray(encoded_seqs[key])).to(self.device)
            citing_category_batch = [cat_map[citing_category][0] for _ in range(len(candidate_batch))]
            citing_category_batch = torch.from_numpy(np.array(citing_category_batch)).to(self.device)
            candidate_category_batch = [cat_map[item['categories'][0]][0] for item in candidate_batch]
            candidate_category_batch = torch.from_numpy(np.array(candidate_category_batch)).to(self.device)
            with torch.no_grad():
                param = {
                    "input_ids": encoded_seqs["input_ids"],
                    "token_type_ids": encoded_seqs["token_type_ids"],
                    "attention_mask": encoded_seqs["attention_mask"]
                }
                score_list.append(self.scorer({'param': param, 'category_batch_query': citing_category_batch,
                                               'category_batch_candidate': candidate_category_batch}).detach())
        score_list = torch.cat(score_list, dim=0).view(-1).cpu().numpy().tolist()
        candidate_list, _ = list(zip(*sorted(zip(candidate_list, score_list), key=lambda x: -x[1])))
        return candidate_list
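
The three `rerank` methods in the listing above share the same model-independent steps: the global context (title plus abstract) is cut to 35% of `max_input_length` in whitespace-separated words, joined to the local context with `<sep>`, candidates are scored in batches, and the list is sorted by descending score. The following is a minimal sketch of just those steps in plain Python, with no tokenizer or scorer involved. The helper names, the toy scores, and the shape of `cat_map` (a category name mapped to a sequence whose first element is an integer ID) are assumptions made for illustration.

```python
SEP_TOKEN = "<sep>"


def build_query_text(title, abstract, local_context, max_input_length=512, heading=""):
    """Truncate the global context to 35% of max_input_length words, then
    append the separator, the local context and (as in the v2 class) a heading."""
    global_context = title + " " + abstract
    budget = int(max_input_length * 0.35)
    return " ".join(global_context.split()[:budget]) + SEP_TOKEN + local_context + heading


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for pos in range(0, len(items), batch_size):
        yield items[pos: pos + batch_size]


def sort_by_score(candidates, scores):
    """Return the candidates ordered by descending score."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return [candidate for candidate, _ in ranked]


def category_ids(categories, cat_map):
    """Look up the integer ID stored first in each cat_map entry."""
    return [cat_map[name][0] for name in categories]


# Toy inputs (hypothetical); a real run would take scores from the scorer model.
query = build_query_text("A B C D", "E F", "ctx", max_input_length=10)
candidates = [{"title": "x"}, {"title": "y"}, {"title": "z"}]
scores = [0.1, 0.9, 0.5]
ranked = sort_by_score(candidates, scores)
batch_sizes = [len(batch) for batch in batches(candidates, 2)]
cat_map = {"cs.CL": (0,), "cs.IR": (1,)}
ids = category_ids(["cs.IR", "cs.CL"], cat_map)
print(query, [c["title"] for c in ranked], batch_sizes, ids)
```

Unlike the listing, which returns a tuple produced by `zip`, this sketch returns a list. Note also that the word budget applies before tokenization, so the tokenizer's own `truncation=True` still decides the final input length.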
GitHub Events
Total
- Issues event: 1
- Push event: 4
- Fork event: 1
Last Year
- Issues event: 1
- Push event: 4
- Fork event: 1