nlp_citation_prediction
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file found
- ✓ codemeta.json file found
- ✓ .zenodo.json file found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: okostis
- Language: Jupyter Notebook
- Default Branch: main
- Size: 6.84 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
This repository contains a Jupyter notebook developed during the University of Ioannina's MYE053 NLP course. Its main focus is citation link prediction for research papers. The notebook's core task is predicting citation links: we extract semantic features from paper abstracts using Sentence-BERT (SBERT) embeddings and combine them with author-based features such as shared-author counts and Jaccard similarity. A LightGBM classifier then takes these engineered features and learns to predict the likelihood of a citation existing between any given pair of papers, offering an effective approach to uncovering connections within academic citation graphs.
More info can be found in the course's Kaggle challenge: https://www.kaggle.com/competitions/nlp-cse-uoi-2025/overview
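The pair-feature construction described above (SBERT pair embeddings combined with author overlap) can be sketched in a minimal standalone form. This mirrors the notebook's `compute_features_batch`; the function names and the toy 4-dimensional "embeddings" are illustrative, not from the repository.

```python
import numpy as np

def author_jaccard(authors_a, authors_b):
    """Jaccard similarity between two author sets."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(vec_i, vec_j, authors_i, authors_j):
    """Feature vector for one candidate citation pair:
    [vec_i, vec_j, |vec_i - vec_j|, vec_i * vec_j, jaccard, shared, cosine]."""
    shared = len(set(authors_i) & set(authors_j))
    jac = author_jaccard(authors_i, authors_j)
    # cosine similarity, guarding against zero-norm vectors
    ni, nj = np.linalg.norm(vec_i), np.linalg.norm(vec_j)
    cos = float(vec_i @ vec_j / (ni * nj)) if ni and nj else 0.0
    return np.concatenate([vec_i, vec_j, np.abs(vec_i - vec_j),
                           vec_i * vec_j, [jac, shared, cos]])

# Toy example with 4-dim embeddings: feature size is 4*4 + 3 = 19.
vi = np.array([1.0, 0.0, 0.0, 0.0])
vj = np.array([1.0, 0.0, 0.0, 0.0])
feats = pair_features(vi, vj, ["smith", "lee"], ["lee", "chen"])
print(feats.shape)  # (19,)
```

With the notebook's 384-dimensional SBERT vectors the same layout gives 4 * 384 + 3 = 1539 features, matching the `(640000, 1539)` training shape in the output above.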
Owner
- Name: okostis
- Login: okostis
- Kind: user
- Repositories: 2
- Profile: https://github.com/okostis
Citation (citation_prediction.ipynb)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
"_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
"execution": {
"iopub.execute_input": "2025-06-15T15:26:09.200044Z",
"iopub.status.busy": "2025-06-15T15:26:09.199705Z",
"iopub.status.idle": "2025-06-15T16:45:10.952356Z",
"shell.execute_reply": "2025-06-15T16:45:10.951007Z",
"shell.execute_reply.started": "2025-06-15T15:26:09.200019Z"
},
"trusted": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-06-15 15:26:39.256752: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
"E0000 00:00:1750001199.539223 35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"E0000 00:00:1750001199.618137 35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading data...\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 138499 entries, 0 to 138498\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 article_id 138499 non-null int32 \n",
" 1 authors 138499 non-null object\n",
" 2 abstract 138499 non-null object\n",
"dtypes: int32(1), object(2)\n",
"memory usage: 162.7 MB\n",
" col1 col2\n",
"0 34977 59394\n",
"1 22518 46602\n",
"2 36762 22813\n",
"3 44960 110384\n",
"4 29015 26366\n",
"Loading pre-computed SBERT mean vectors from /kaggle/working/nlp-cse-uoi-2025/data_new/sbert_mean_vec_all-MiniLM-L6-v2.csv\n",
"Shape of SBERT_mean_vectors: (138499, 384)\n",
"Loaded 1091955 raw positive edges from edgelist.txt.\n",
"Using 400000 positive edges for training.\n",
"Generating 400000 negative samples...\n",
"Generated 400000 negative samples.\n",
"Total samples for training: 800000 (Positive: 400000, Negative: 400000)\n",
"Computing features for training data...\n",
"Shape of X_train: (640000, 1539), y_train: (640000,)\n",
"Shape of X_test: (160000, 1539), y_test: (160000,)\n",
"\n",
"Starting LightGBM model training...\n",
"[100]\tvalid_0's auc: 0.970799\n",
"[200]\tvalid_0's auc: 0.973946\n",
"[300]\tvalid_0's auc: 0.975484\n",
"[400]\tvalid_0's auc: 0.976496\n",
"[500]\tvalid_0's auc: 0.977208\n",
"[600]\tvalid_0's auc: 0.977706\n",
"[700]\tvalid_0's auc: 0.978084\n",
"[800]\tvalid_0's auc: 0.978371\n",
"[900]\tvalid_0's auc: 0.978605\n",
"[1000]\tvalid_0's auc: 0.978808\n",
"[1100]\tvalid_0's auc: 0.979007\n",
"[1200]\tvalid_0's auc: 0.979183\n",
"[1300]\tvalid_0's auc: 0.979317\n",
"[1400]\tvalid_0's auc: 0.979408\n",
"[1500]\tvalid_0's auc: 0.979528\n",
"[1600]\tvalid_0's auc: 0.979649\n",
"[1700]\tvalid_0's auc: 0.979744\n",
"[1800]\tvalid_0's auc: 0.979812\n",
"[1900]\tvalid_0's auc: 0.979874\n",
"[2000]\tvalid_0's auc: 0.979935\n",
"\n",
"LightGBM Train AUC: 0.9928\n",
"LightGBM Test AUC: 0.9799\n",
"LightGBM Test Accuracy: 92.66%\n",
"\n",
"Making predictions on test.txt...\n",
"Submission file created at /kaggle/working/nlp-cse-uoi-2025/data_new/submission.csv\n",
" ID Label\n",
"0 0 0.100375\n",
"1 1 0.248836\n",
"2 2 0.811147\n",
"3 3 0.580352\n",
"4 4 0.017505\n"
]
}
],
"source": [
"##all-MiniLM-L6-v2\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from pathlib import Path\n",
"import logging\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"import gc \n",
"import random \n",
"\n",
"\n",
"#!pip3 install lightgbm\n",
"import lightgbm as lgb\n",
"from sklearn.metrics import roc_auc_score, accuracy_score\n",
"\n",
"\n",
"#!pip3 install sentence-transformers\n",
"\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n",
"\n",
"\n",
"data_path = Path(\"/kaggle/working/nlp-cse-uoi-2025/data_new\")\n",
"TARGET_TOTAL_SAMPLES = 800_000 \n",
"SBERT_MODEL_NAME = 'all-MiniLM-L6-v2' \n",
"EMBEDDING_DIM = 384 \n",
"\n",
"\n",
"print(\"Loading data...\")\n",
"with open(data_path / \"authors.txt\") as f: \n",
" authors_data = [i.split(\"|--|\") for i in f.read().splitlines()]\n",
" authors = pd.DataFrame({\n",
" \"article_id\": np.int32(np.array(authors_data)[:, 0]),\n",
" \"authors\": np.array(authors_data)[:, 1]\n",
" })\n",
"\n",
"edgelist = pd.read_csv(data_path / \"edgelist.txt\", names=[\"article_id\", \"cited_id\"], header=None, sep=\",\", dtype=np.int32) \n",
"\n",
"with open(data_path / \"abstracts.txt\") as f:\n",
" abstracts_data = [i.split(\"|--|\") for i in f.read().splitlines()]\n",
" abstracts = pd.DataFrame({\n",
" \"article_id\": np.int32(np.array(abstracts_data)[:, 0]),\n",
" \"abstract\": np.array(abstracts_data)[:, 1]\n",
" })\n",
"\n",
"test_df = pd.read_csv(data_path / \"test.txt\", header=None, names=['col1', 'col2'], dtype=np.int32)\n",
"\n",
"assert len(authors) == len(abstracts)\n",
"data = authors.merge(abstracts, on=\"article_id\") \n",
"\n",
"data = data.sort_values(by='article_id').reset_index(drop=True) \n",
"\n",
"article_id_to_idx = pd.Series(data.index, index=data['article_id']).to_dict() \n",
"\n",
"del authors, abstracts, authors_data, abstracts_data\n",
"gc.collect()\n",
"\n",
"data.info(verbose=True, memory_usage=\"deep\")\n",
"print(test_df.head())\n",
"\n",
"\n",
"sbert_pretrained_dir = data_path / \"sbert_pretrained\" \n",
"sbert_means_csv = data_path / f\"sbert_mean_vec_{SBERT_MODEL_NAME.replace('/', '_')}.csv\"\n",
"\n",
"sbert_mean_vectors = None\n",
"\n",
"if sbert_means_csv.exists():\n",
" print(f\"Loading pre-computed SBERT mean vectors from {sbert_means_csv}\")\n",
" sbert_mean_vectors = pd.read_csv(sbert_means_csv, header=None).values.astype(np.float32)\n",
"else:\n",
" sbert_pretrained_dir.mkdir(parents=True, exist_ok=True)\n",
" print(f\"Downloading/Loading SBERT model: {SBERT_MODEL_NAME}... This may take a while.\")\n",
" \n",
" model_sbert = SentenceTransformer(SBERT_MODEL_NAME)\n",
"\n",
" print(\"Computing SBERT mean vectors for abstracts...\")\n",
" \n",
" sbert_mean_vectors = model_sbert.encode(data['abstract'].tolist(), convert_to_numpy=True, show_progress_bar=True, batch_size=32) \n",
" sbert_mean_vectors = sbert_mean_vectors.astype(np.float32) \n",
"\n",
" pd.DataFrame(sbert_mean_vectors).to_csv(sbert_means_csv, header=False, index=False)\n",
" print(f\"Saved SBERT mean vectors to {sbert_means_csv}\") \n",
"\n",
" del model_sbert\n",
" gc.collect()\n",
"\n",
"\n",
"\n",
"print(f\"Shape of SBERT_mean_vectors: {sbert_mean_vectors.shape}\") \n",
"\n",
"all_positive_edges = edgelist.values.tolist()\n",
"print(f\"Loaded {len(all_positive_edges)} raw positive edges from edgelist.txt.\")\n",
"\n",
"num_positive_samples_to_use = min(len(all_positive_edges), TARGET_TOTAL_SAMPLES // 2) \n",
"num_negative_samples_to_generate = TARGET_TOTAL_SAMPLES - num_positive_samples_to_use \n",
"\n",
"random.seed(42)\n",
"positive_edges = random.sample(all_positive_edges, num_positive_samples_to_use) \n",
"print(f\"Using {len(positive_edges)} positive edges for training.\")\n",
"\n",
"existing_edges_set = set()\n",
"for p_id, c_id in all_positive_edges:\n",
" existing_edges_set.add(tuple(sorted((p_id, c_id))))\n",
"\n",
"print(f\"Generating {num_negative_samples_to_generate} negative samples...\")\n",
"negative_edges = []\n",
"all_article_ids = data['article_id'].unique()\n",
"num_articles = len(all_article_ids)\n",
"\n",
"while len(negative_edges) < num_negative_samples_to_generate: \n",
" idx_pair = np.random.choice(num_articles, 2, replace=False) \n",
" article_id1 = all_article_ids[idx_pair[0]]\n",
" article_id2 = all_article_ids[idx_pair[1]]\n",
"\n",
" current_pair = tuple(sorted((article_id1, article_id2)))\n",
"\n",
" if current_pair not in existing_edges_set: \n",
" negative_edges.append((article_id1, article_id2))\n",
"\n",
"print(f\"Generated {len(negative_edges)} negative samples.\")\n",
"\n",
"pairs = np.array(positive_edges + negative_edges, dtype=np.int32)\n",
"labels = np.array([1] * len(positive_edges) + [0] * len(negative_edges), dtype=np.float32)\n",
"\n",
"indices = np.arange(len(pairs))\n",
"np.random.shuffle(indices) \n",
"pairs_shuffled = pairs[indices]\n",
"labels_shuffled = labels[indices]\n",
"\n",
"print(f\"Total samples for training: {len(pairs_shuffled)} (Positive: {len(positive_edges)}, Negative: {len(negative_edges)})\")\n",
"\n",
"del positive_edges, negative_edges\n",
"gc.collect()\n",
"\n",
"FEATURE_SIZE = 4 * EMBEDDING_DIM + 1 + 1 + 1 \n",
"\n",
"def compute_features_batch(article_id_pairs, embeddings, data_df, article_id_map):\n",
" num_pairs = len(article_id_pairs)\n",
" features_array = np.zeros((num_pairs, FEATURE_SIZE), dtype=np.float32)\n",
"\n",
" processed_authors = {} \n",
" for idx, row in data_df.iterrows():\n",
" authors_str = row['authors'] \n",
" processed_authors[row['article_id']] = set(authors_str.lower().replace(\" \", \"\").split(';')) if authors_str else set()\n",
"\n",
" for i, (id1, id2) in enumerate(article_id_pairs):\n",
" idx1 = article_id_map.get(id1)\n",
" idx2 = article_id_map.get(id2)\n",
" \n",
" if idx1 is None or idx2 is None:\n",
" features_array[i] = np.zeros(FEATURE_SIZE, dtype=np.float32)\n",
" continue\n",
"\n",
" vec_i = embeddings[idx1] \n",
" vec_j = embeddings[idx2] \n",
"\n",
" combined_embeddings = np.concatenate([ \n",
" vec_i,\n",
" vec_j,\n",
" np.abs(vec_i - vec_j),\n",
" vec_i * vec_j\n",
" ])\n",
"\n",
" authors_i_set = processed_authors.get(id1, set())\n",
" authors_j_set = processed_authors.get(id2, set())\n",
"\n",
" intersection_len = len(authors_i_set & authors_j_set) \n",
" union_len = len(authors_i_set | authors_j_set) \n",
" author_sim = intersection_len / union_len if union_len > 0 else 0.0 \n",
" shared_authors_count = intersection_len\n",
" \n",
" vec_i_reshaped = vec_i.reshape(1, -1)\n",
" vec_j_reshaped = vec_j.reshape(1, -1)\n",
" if np.linalg.norm(vec_i) == 0.0 or np.linalg.norm(vec_j) == 0.0:\n",
" abstract_cos_sim = 0.0\n",
" else:\n",
" abstract_cos_sim = cosine_similarity(vec_i_reshaped, vec_j_reshaped)[0][0]\n",
" \n",
" all_features = np.concatenate([combined_embeddings, [author_sim, shared_authors_count, abstract_cos_sim]])\n",
" features_array[i] = all_features\n",
"\n",
" return features_array\n",
"\n",
"print(\"Computing features for training data...\") \n",
"X_features = compute_features_batch(pairs_shuffled, sbert_mean_vectors, data, article_id_to_idx)\n",
"y = labels_shuffled\n",
"\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X_features) \n",
"\n",
"del X_features, labels_shuffled, pairs \n",
"gc.collect()\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X_scaled, y,\n",
" test_size=0.2,\n",
" stratify=y, \n",
" random_state=42\n",
")\n",
"\n",
"del X_scaled, y \n",
"gc.collect()\n",
"\n",
"print(f\"Shape of X_train: {X_train.shape}, y_train: {y_train.shape}\")\n",
"print(f\"Shape of X_test: {X_test.shape}, y_test: {y_test.shape}\")\n",
"\n",
"\n",
"print(\"\\nStarting LightGBM model training...\")\n",
"\n",
"lgb_params = { \n",
" 'objective': 'binary', \n",
" 'metric': 'auc',\n",
" 'boosting_type': 'gbdt', \n",
" 'num_leaves': 31, \n",
" 'learning_rate': 0.05,\n",
" 'n_estimators': 2000, \n",
" 'feature_fraction': 0.8, \n",
" 'bagging_fraction': 0.8, \n",
" 'bagging_freq': 1,\n",
" 'lambda_l1': 0.1,\n",
" 'lambda_l2': 0.1,\n",
" 'num_threads': -1, \n",
" 'verbose': -1,\n",
" 'seed': 42,\n",
" 'n_jobs': -1 \n",
"}\n",
"\n",
"model = lgb.LGBMClassifier(**lgb_params)\n",
"\n",
"model.fit(X_train, y_train, \n",
" eval_set=[(X_test, y_test)],\n",
" eval_metric='auc',\n",
" callbacks=[lgb.log_evaluation(period=100), lgb.early_stopping(100, verbose=False)])\n",
"\n",
"y_pred_proba_train = model.predict_proba(X_train)[:, 1] \n",
"y_pred_proba_test = model.predict_proba(X_test)[:, 1]\n",
"y_pred_test_class = (y_pred_proba_test > 0.5).astype(int)\n",
"# Compute train/test metrics\n",
"train_auc = roc_auc_score(y_train, y_pred_proba_train)\n",
"test_auc = roc_auc_score(y_test, y_pred_proba_test)\n",
"test_accuracy = accuracy_score(y_test, y_pred_test_class)\n",
"\n",
"print(f\"\\nLightGBM Train AUC: {train_auc:.4f}\")\n",
"print(f\"LightGBM Test AUC: {test_auc:.4f}\")\n",
"print(f\"LightGBM Test Accuracy: {test_accuracy*100:.2f}%\")\n",
"\n",
"\n",
"\n",
"print(\"\\nMaking predictions on test.txt...\")\n",
"test_results = []\n",
"\n",
"processed_authors_test = {} \n",
"for idx, row in data.iterrows():\n",
" authors_str = row['authors']\n",
" processed_authors_test[row['article_id']] = set(authors_str.lower().replace(\" \", \"\").split(';')) if authors_str else set()\n",
"\n",
"test_pairs = test_df.values\n",
"PREDICTION_BATCH_SIZE = 1024\n",
"\n",
"for i in range(0, len(test_pairs), PREDICTION_BATCH_SIZE):\n",
" batch_pairs = test_pairs[i:i + PREDICTION_BATCH_SIZE]\n",
" batch_indices = np.arange(i, min(i + PREDICTION_BATCH_SIZE, len(test_pairs)))\n",
"\n",
" batch_feature_vectors = np.zeros((len(batch_pairs), FEATURE_SIZE), dtype=np.float32)\n",
" for j, (id1, id2) in enumerate(batch_pairs):\n",
" idx1 = article_id_to_idx.get(id1)\n",
" idx2 = article_id_to_idx.get(id2)\n",
"\n",
" if idx1 is None or idx2 is None:\n",
" batch_feature_vectors[j] = np.zeros(FEATURE_SIZE, dtype=np.float32)\n",
" continue\n",
"\n",
" vec_i = sbert_mean_vectors[idx1] \n",
" vec_j = sbert_mean_vectors[idx2] \n",
"\n",
" combined_embeddings = np.concatenate([\n",
" vec_i,\n",
" vec_j,\n",
" np.abs(vec_i - vec_j),\n",
" vec_i * vec_j\n",
" ])\n",
"\n",
" authors_i_set = processed_authors_test.get(id1, set())\n",
" authors_j_set = processed_authors_test.get(id2, set())\n",
"\n",
" intersection_len = len(authors_i_set & authors_j_set)\n",
" union_len = len(authors_i_set | authors_j_set)\n",
" author_sim = intersection_len / union_len if union_len > 0 else 0.0\n",
" shared_authors_count = intersection_len\n",
"\n",
" vec_i_reshaped = vec_i.reshape(1, -1)\n",
" vec_j_reshaped = vec_j.reshape(1, -1)\n",
" if np.linalg.norm(vec_i) == 0.0 or np.linalg.norm(vec_j) == 0.0:\n",
" abstract_cos_sim = 0.0\n",
" else:\n",
" abstract_cos_sim = cosine_similarity(vec_i_reshaped, vec_j_reshaped)[0][0]\n",
"\n",
" all_features = np.concatenate([combined_embeddings, [author_sim, shared_authors_count, abstract_cos_sim]])\n",
" batch_feature_vectors[j] = all_features\n",
"\n",
" scaled_batch_vectors = scaler.transform(batch_feature_vectors)\n",
" batch_predictions = model.predict_proba(scaled_batch_vectors)[:, 1]\n",
"\n",
" for k, prob in zip(batch_indices, batch_predictions):\n",
" test_results.append((f\"{k}\", prob))\n",
"\n",
"result_df = pd.DataFrame(test_results, columns=[\"ID\", \"Label\"]) \n",
"result_df.to_csv(data_path / \"submission.csv\", index=False)\n",
"\n",
"print(f\"Submission file created at {data_path / 'submission.csv'}\")\n",
"print(result_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"trusted": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kaggle": {
"accelerator": "none",
"dataSources": [
{
"databundleVersionId": 11214388,
"sourceId": 93866,
"sourceType": "competition"
}
],
"dockerImageVersionId": 31040,
"isGpuEnabled": false,
"isInternetEnabled": true,
"language": "python",
"sourceType": "notebook"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
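The notebook above builds its negative training pairs by rejection sampling: draw random article pairs and keep only those not present in the citation edge list. A minimal standalone sketch of that idea (function name and toy data are illustrative, not from the repository):

```python
import random

def sample_negative_pairs(article_ids, positive_edges, k, seed=42):
    """Rejection-sample k undirected pairs that are not existing edges."""
    rng = random.Random(seed)
    # Store edges as sorted tuples so (a, b) and (b, a) are treated the same.
    existing = {tuple(sorted(e)) for e in positive_edges}
    negatives = []
    while len(negatives) < k:
        a, b = rng.sample(article_ids, 2)
        if tuple(sorted((a, b))) not in existing:
            negatives.append((a, b))
    return negatives

# Toy graph: 10 articles, 2 known citation edges.
ids = list(range(10))
pos = [(0, 1), (2, 3)]
neg = sample_negative_pairs(ids, pos, 5)
print(len(neg))  # 5
```

This is efficient when the graph is sparse (as here: ~1.1M edges over 138k articles), since a random pair is almost never an existing edge and few draws are rejected.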
GitHub Events
Total
- Member event: 1
- Push event: 2
Last Year
- Member event: 1
- Push event: 2