nlp_citation_prediction
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file found
- ✓ codemeta.json file found
- ✓ .zenodo.json file found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.7%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: okostis
- Language: Jupyter Notebook
- Default Branch: main
- Size: 6.84 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
This repository contains a Jupyter notebook developed during the University of Ioannina's MYE053 NLP course. Its main focus is citation link prediction for research papers. The notebook's core task is predicting citation links: we extract semantic features from paper abstracts using Sentence-BERT (SBERT) embeddings and combine them with author-based features such as shared-author counts and Jaccard similarity. A LightGBM classifier then takes these engineered features and learns to predict the likelihood of a citation existing between any given pair of papers, offering an effective approach to uncovering connections within academic citation graphs.
More info can be found in the course's Kaggle challenge: https://www.kaggle.com/competitions/nlp-cse-uoi-2025/overview
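The pair-feature construction described above (SBERT pair embeddings combined with author overlap) can be sketched in a minimal standalone form. This mirrors the notebook's `compute_features_batch`; the function names and the toy 4-dimensional "embeddings" are illustrative, not from the repository.

```python
import numpy as np

def author_jaccard(authors_a, authors_b):
    """Jaccard similarity between two author sets."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(vec_i, vec_j, authors_i, authors_j):
    """Feature vector for one candidate citation pair:
    [vec_i, vec_j, |vec_i - vec_j|, vec_i * vec_j, jaccard, shared, cosine]."""
    shared = len(set(authors_i) & set(authors_j))
    jac = author_jaccard(authors_i, authors_j)
    # cosine similarity, guarding against zero-norm vectors
    ni, nj = np.linalg.norm(vec_i), np.linalg.norm(vec_j)
    cos = float(vec_i @ vec_j / (ni * nj)) if ni and nj else 0.0
    return np.concatenate([vec_i, vec_j, np.abs(vec_i - vec_j),
                           vec_i * vec_j, [jac, shared, cos]])

# Toy example with 4-dim embeddings: feature size is 4*4 + 3 = 19.
vi = np.array([1.0, 0.0, 0.0, 0.0])
vj = np.array([1.0, 0.0, 0.0, 0.0])
feats = pair_features(vi, vj, ["smith", "lee"], ["lee", "chen"])
print(feats.shape)  # (19,)
```

With the notebook's 384-dimensional SBERT vectors the same layout gives 4 * 384 + 3 = 1539 features, matching the `(640000, 1539)` training shape in the output above.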
Owner
- Name: okostis
- Login: okostis
- Kind: user
- Repositories: 2
- Profile: https://github.com/okostis
Citation (citation_prediction.ipynb)
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
"_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
"execution": {
"iopub.execute_input": "2025-06-15T15:26:09.200044Z",
"iopub.status.busy": "2025-06-15T15:26:09.199705Z",
"iopub.status.idle": "2025-06-15T16:45:10.952356Z",
"shell.execute_reply": "2025-06-15T16:45:10.951007Z",
"shell.execute_reply.started": "2025-06-15T15:26:09.200019Z"
},
"trusted": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-06-15 15:26:39.256752: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
"E0000 00:00:1750001199.539223 35 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"E0000 00:00:1750001199.618137 35 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loading data...\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 138499 entries, 0 to 138498\n",
"Data columns (total 3 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 article_id 138499 non-null int32 \n",
" 1 authors 138499 non-null object\n",
" 2 abstract 138499 non-null object\n",
"dtypes: int32(1), object(2)\n",
"memory usage: 162.7 MB\n",
" col1 col2\n",
"0 34977 59394\n",
"1 22518 46602\n",
"2 36762 22813\n",
"3 44960 110384\n",
"4 29015 26366\n",
"Loading pre-computed SBERT mean vectors from /kaggle/working/nlp-cse-uoi-2025/data_new/sbert_mean_vec_all-MiniLM-L6-v2.csv\n",
"Shape of SBERT_mean_vectors: (138499, 384)\n",
"Loaded 1091955 raw positive edges from edgelist.txt.\n",
"Using 400000 positive edges for training.\n",
"Generating 400000 negative samples...\n",
"Generated 400000 negative samples.\n",
"Total samples for training: 800000 (Positive: 400000, Negative: 400000)\n",
"Computing features for training data...\n",
"Shape of X_train: (640000, 1539), y_train: (640000,)\n",
"Shape of X_test: (160000, 1539), y_test: (160000,)\n",
"\n",
"Starting LightGBM model training...\n",
"[100]\tvalid_0's auc: 0.970799\n",
"[200]\tvalid_0's auc: 0.973946\n",
"[300]\tvalid_0's auc: 0.975484\n",
"[400]\tvalid_0's auc: 0.976496\n",
"[500]\tvalid_0's auc: 0.977208\n",
"[600]\tvalid_0's auc: 0.977706\n",
"[700]\tvalid_0's auc: 0.978084\n",
"[800]\tvalid_0's auc: 0.978371\n",
"[900]\tvalid_0's auc: 0.978605\n",
"[1000]\tvalid_0's auc: 0.978808\n",
"[1100]\tvalid_0's auc: 0.979007\n",
"[1200]\tvalid_0's auc: 0.979183\n",
"[1300]\tvalid_0's auc: 0.979317\n",
"[1400]\tvalid_0's auc: 0.979408\n",
"[1500]\tvalid_0's auc: 0.979528\n",
"[1600]\tvalid_0's auc: 0.979649\n",
"[1700]\tvalid_0's auc: 0.979744\n",
"[1800]\tvalid_0's auc: 0.979812\n",
"[1900]\tvalid_0's auc: 0.979874\n",
"[2000]\tvalid_0's auc: 0.979935\n",
"\n",
"LightGBM Train AUC: 0.9928\n",
"LightGBM Test AUC: 0.9799\n",
"LightGBM Test Accuracy: 92.66%\n",
"\n",
"Making predictions on test.txt...\n",
"Submission file created at /kaggle/working/nlp-cse-uoi-2025/data_new/submission.csv\n",
" ID Label\n",
"0 0 0.100375\n",
"1 1 0.248836\n",
"2 2 0.811147\n",
"3 3 0.580352\n",
"4 4 0.017505\n"
]
}
],
"source": [
"##all-MiniLM-L6-v2\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from pathlib import Path\n",
"import logging\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.metrics.pairwise import cosine_similarity\n",
"import gc \n",
"import random \n",
"\n",
"\n",
"#!pip3 install lightgbm\n",
"import lightgbm as lgb\n",
"from sklearn.metrics import roc_auc_score, accuracy_score\n",
"\n",
"\n",
"#!pip3 install sentence-transformers\n",
"\n",
"\n",
"from sentence_transformers import SentenceTransformer\n",
"\n",
"logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)\n",
"\n",
"\n",
"data_path = Path(\"/kaggle/working/nlp-cse-uoi-2025/data_new\")\n",
"TARGET_TOTAL_SAMPLES = 800_000 \n",
"SBERT_MODEL_NAME = 'all-MiniLM-L6-v2' \n",
"EMBEDDING_DIM = 384 \n",
"\n",
"\n",
"print(\"Loading data...\")\n",
"with open(data_path / \"authors.txt\") as f: \n",
" authors_data = [i.split(\"|--|\") for i in f.read().splitlines()]\n",
" authors = pd.DataFrame({\n",
" \"article_id\": np.int32(np.array(authors_data)[:, 0]),\n",
" \"authors\": np.array(authors_data)[:, 1]\n",
" })\n",
"\n",
"edgelist = pd.read_csv(data_path / \"edgelist.txt\", names=[\"article_id\", \"cited_id\"], header=None, sep=\",\", dtype=np.int32) \n",
"\n",
"with open(data_path / \"abstracts.txt\") as f:\n",
" abstracts_data = [i.split(\"|--|\") for i in f.read().splitlines()]\n",
" abstracts = pd.DataFrame({\n",
" \"article_id\": np.int32(np.array(abstracts_data)[:, 0]),\n",
" \"abstract\": np.array(abstracts_data)[:, 1]\n",
" })\n",
"\n",
"test_df = pd.read_csv(data_path / \"test.txt\", header=None, names=['col1', 'col2'], dtype=np.int32)\n",
"\n",
"assert len(authors) == len(abstracts)\n",
"data = authors.merge(abstracts, on=\"article_id\") \n",
"\n",
"data = data.sort_values(by='article_id').reset_index(drop=True) \n",
"\n",
"article_id_to_idx = pd.Series(data.index, index=data['article_id']).to_dict() \n",
"\n",
"del authors, abstracts, authors_data, abstracts_data\n",
"gc.collect()\n",
"\n",
"data.info(verbose=True, memory_usage=\"deep\")\n",
"print(test_df.head())\n",
"\n",
"\n",
"sbert_pretrained_dir = data_path / \"sbert_pretrained\" \n",
"sbert_means_csv = data_path / f\"sbert_mean_vec_{SBERT_MODEL_NAME.replace('/', '_')}.csv\"\n",
"\n",
"sbert_mean_vectors = None\n",
"\n",
"if sbert_means_csv.exists():\n",
" print(f\"Loading pre-computed SBERT mean vectors from {sbert_means_csv}\")\n",
" sbert_mean_vectors = pd.read_csv(sbert_means_csv, header=None).values.astype(np.float32)\n",
"else:\n",
" sbert_pretrained_dir.mkdir(parents=True, exist_ok=True)\n",
" print(f\"Downloading/Loading SBERT model: {SBERT_MODEL_NAME}... This may take a while.\")\n",
" \n",
" model_sbert = SentenceTransformer(SBERT_MODEL_NAME)\n",
"\n",
" print(\"Computing SBERT mean vectors for abstracts...\")\n",
" \n",
" sbert_mean_vectors = model_sbert.encode(data['abstract'].tolist(), convert_to_numpy=True, show_progress_bar=True, batch_size=32) \n",
" sbert_mean_vectors = sbert_mean_vectors.astype(np.float32) \n",
"\n",
" pd.DataFrame(sbert_mean_vectors).to_csv(sbert_means_csv, header=False, index=False)\n",
" print(f\"Saved SBERT mean vectors to {sbert_means_csv}\") \n",
"\n",
" del model_sbert\n",
" gc.collect()\n",
"\n",
"\n",
"\n",
"print(f\"Shape of SBERT_mean_vectors: {sbert_mean_vectors.shape}\") \n",
"\n",
"all_positive_edges = edgelist.values.tolist()\n",
"print(f\"Loaded {len(all_positive_edges)} raw positive edges from edgelist.txt.\")\n",
"\n",
"num_positive_samples_to_use = min(len(all_positive_edges), TARGET_TOTAL_SAMPLES // 2) \n",
"num_negative_samples_to_generate = TARGET_TOTAL_SAMPLES - num_positive_samples_to_use \n",
"\n",
"random.seed(42)\n",
"positive_edges = random.sample(all_positive_edges, num_positive_samples_to_use) \n",
"print(f\"Using {len(positive_edges)} positive edges for training.\")\n",
"\n",
"existing_edges_set = set()\n",
"for p_id, c_id in all_positive_edges:\n",
" existing_edges_set.add(tuple(sorted((p_id, c_id))))\n",
"\n",
"print(f\"Generating {num_negative_samples_to_generate} negative samples...\")\n",
"negative_edges = []\n",
"all_article_ids = data['article_id'].unique()\n",
"num_articles = len(all_article_ids)\n",
"\n",
"while len(negative_edges) < num_negative_samples_to_generate: \n",
" idx_pair = np.random.choice(num_articles, 2, replace=False) \n",
" article_id1 = all_article_ids[idx_pair[0]]\n",
" article_id2 = all_article_ids[idx_pair[1]]\n",
"\n",
" current_pair = tuple(sorted((article_id1, article_id2)))\n",
"\n",
" if current_pair not in existing_edges_set: \n",
" negative_edges.append((article_id1, article_id2))\n",
"\n",
"print(f\"Generated {len(negative_edges)} negative samples.\")\n",
"\n",
"pairs = np.array(positive_edges + negative_edges, dtype=np.int32)\n",
"labels = np.array([1] * len(positive_edges) + [0] * len(negative_edges), dtype=np.float32)\n",
"\n",
"indices = np.arange(len(pairs))\n",
"np.random.shuffle(indices) \n",
"pairs_shuffled = pairs[indices]\n",
"labels_shuffled = labels[indices]\n",
"\n",
"print(f\"Total samples for training: {len(pairs_shuffled)} (Positive: {len(positive_edges)}, Negative: {len(negative_edges)})\")\n",
"\n",
"del positive_edges, negative_edges\n",
"gc.collect()\n",
"\n",
"FEATURE_SIZE = 4 * EMBEDDING_DIM + 1 + 1 + 1 \n",
"\n",
"def compute_features_batch(article_id_pairs, embeddings, data_df, article_id_map):\n",
" num_pairs = len(article_id_pairs)\n",
" features_array = np.zeros((num_pairs, FEATURE_SIZE), dtype=np.float32)\n",
"\n",
" processed_authors = {} \n",
" for idx, row in data_df.iterrows():\n",
" authors_str = row['authors'] \n",
" processed_authors[row['article_id']] = set(authors_str.lower().replace(\" \", \"\").split(';')) if authors_str else set()\n",
"\n",
" for i, (id1, id2) in enumerate(article_id_pairs):\n",
" idx1 = article_id_map.get(id1)\n",
" idx2 = article_id_map.get(id2)\n",
" \n",
" if idx1 is None or idx2 is None:\n",
" features_array[i] = np.zeros(FEATURE_SIZE, dtype=np.float32)\n",
" continue\n",
"\n",
" vec_i = embeddings[idx1] \n",
" vec_j = embeddings[idx2] \n",
"\n",
" combined_embeddings = np.concatenate([ \n",
" vec_i,\n",
" vec_j,\n",
" np.abs(vec_i - vec_j),\n",
" vec_i * vec_j\n",
" ])\n",
"\n",
" authors_i_set = processed_authors.get(id1, set())\n",
" authors_j_set = processed_authors.get(id2, set())\n",
"\n",
" intersection_len = len(authors_i_set & authors_j_set) \n",
" union_len = len(authors_i_set | authors_j_set) \n",
" author_sim = intersection_len / union_len if union_len > 0 else 0.0 \n",
" shared_authors_count = intersection_len\n",
" \n",
" vec_i_reshaped = vec_i.reshape(1, -1)\n",
" vec_j_reshaped = vec_j.reshape(1, -1)\n",
" if np.linalg.norm(vec_i) == 0.0 or np.linalg.norm(vec_j) == 0.0:\n",
" abstract_cos_sim = 0.0\n",
" else:\n",
" abstract_cos_sim = cosine_similarity(vec_i_reshaped, vec_j_reshaped)[0][0]\n",
" \n",
" all_features = np.concatenate([combined_embeddings, [author_sim, shared_authors_count, abstract_cos_sim]])\n",
" features_array[i] = all_features\n",
"\n",
" return features_array\n",
"\n",
"print(\"Computing features for training data...\") \n",
"X_features = compute_features_batch(pairs_shuffled, sbert_mean_vectors, data, article_id_to_idx)\n",
"y = labels_shuffled\n",
"\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X_features) \n",
"\n",
"del X_features, labels_shuffled, pairs \n",
"gc.collect()\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X_scaled, y,\n",
" test_size=0.2,\n",
" stratify=y, \n",
" random_state=42\n",
")\n",
"\n",
"del X_scaled, y \n",
"gc.collect()\n",
"\n",
"print(f\"Shape of X_train: {X_train.shape}, y_train: {y_train.shape}\")\n",
"print(f\"Shape of X_test: {X_test.shape}, y_test: {y_test.shape}\")\n",
"\n",
"\n",
"print(\"\\nStarting LightGBM model training...\")\n",
"\n",
"lgb_params = { \n",
" 'objective': 'binary', \n",
" 'metric': 'auc',\n",
" 'boosting_type': 'gbdt', \n",
" 'num_leaves': 31, \n",
" 'learning_rate': 0.05,\n",
" 'n_estimators': 2000, \n",
" 'feature_fraction': 0.8, \n",
" 'bagging_fraction': 0.8, \n",
" 'bagging_freq': 1,\n",
" 'lambda_l1': 0.1,\n",
" 'lambda_l2': 0.1,\n",
" 'num_threads': -1, \n",
" 'verbose': -1,\n",
" 'seed': 42,\n",
" 'n_jobs': -1 \n",
"}\n",
"\n",
"model = lgb.LGBMClassifier(**lgb_params)\n",
"\n",
"model.fit(X_train, y_train, \n",
" eval_set=[(X_test, y_test)],\n",
" eval_metric='auc',\n",
" callbacks=[lgb.log_evaluation(period=100), lgb.early_stopping(100, verbose=False)])\n",
"\n",
"y_pred_proba_train = model.predict_proba(X_train)[:, 1] \n",
"y_pred_proba_test = model.predict_proba(X_test)[:, 1]\n",
"y_pred_test_class = (y_pred_proba_test > 0.5).astype(int)\n",
"# Compute train/test metrics\n",
"train_auc = roc_auc_score(y_train, y_pred_proba_train)\n",
"test_auc = roc_auc_score(y_test, y_pred_proba_test)\n",
"test_accuracy = accuracy_score(y_test, y_pred_test_class)\n",
"\n",
"print(f\"\\nLightGBM Train AUC: {train_auc:.4f}\")\n",
"print(f\"LightGBM Test AUC: {test_auc:.4f}\")\n",
"print(f\"LightGBM Test Accuracy: {test_accuracy*100:.2f}%\")\n",
"\n",
"\n",
"\n",
"print(\"\\nMaking predictions on test.txt...\")\n",
"test_results = []\n",
"\n",
"processed_authors_test = {} \n",
"for idx, row in data.iterrows():\n",
" authors_str = row['authors']\n",
" processed_authors_test[row['article_id']] = set(authors_str.lower().replace(\" \", \"\").split(';')) if authors_str else set()\n",
"\n",
"test_pairs = test_df.values\n",
"PREDICTION_BATCH_SIZE = 1024\n",
"\n",
"for i in range(0, len(test_pairs), PREDICTION_BATCH_SIZE):\n",
" batch_pairs = test_pairs[i:i + PREDICTION_BATCH_SIZE]\n",
" batch_indices = np.arange(i, min(i + PREDICTION_BATCH_SIZE, len(test_pairs)))\n",
"\n",
" batch_feature_vectors = np.zeros((len(batch_pairs), FEATURE_SIZE), dtype=np.float32)\n",
" for j, (id1, id2) in enumerate(batch_pairs):\n",
" idx1 = article_id_to_idx.get(id1)\n",
" idx2 = article_id_to_idx.get(id2)\n",
"\n",
" if idx1 is None or idx2 is None:\n",
" batch_feature_vectors[j] = np.zeros(FEATURE_SIZE, dtype=np.float32)\n",
" continue\n",
"\n",
" vec_i = sbert_mean_vectors[idx1] \n",
" vec_j = sbert_mean_vectors[idx2] \n",
"\n",
" combined_embeddings = np.concatenate([\n",
" vec_i,\n",
" vec_j,\n",
" np.abs(vec_i - vec_j),\n",
" vec_i * vec_j\n",
" ])\n",
"\n",
" authors_i_set = processed_authors_test.get(id1, set())\n",
" authors_j_set = processed_authors_test.get(id2, set())\n",
"\n",
" intersection_len = len(authors_i_set & authors_j_set)\n",
" union_len = len(authors_i_set | authors_j_set)\n",
" author_sim = intersection_len / union_len if union_len > 0 else 0.0\n",
" shared_authors_count = intersection_len\n",
"\n",
" vec_i_reshaped = vec_i.reshape(1, -1)\n",
" vec_j_reshaped = vec_j.reshape(1, -1)\n",
" if np.linalg.norm(vec_i) == 0.0 or np.linalg.norm(vec_j) == 0.0:\n",
" abstract_cos_sim = 0.0\n",
" else:\n",
" abstract_cos_sim = cosine_similarity(vec_i_reshaped, vec_j_reshaped)[0][0]\n",
"\n",
" all_features = np.concatenate([combined_embeddings, [author_sim, shared_authors_count, abstract_cos_sim]])\n",
" batch_feature_vectors[j] = all_features\n",
"\n",
" scaled_batch_vectors = scaler.transform(batch_feature_vectors)\n",
" batch_predictions = model.predict_proba(scaled_batch_vectors)[:, 1]\n",
"\n",
" for k, prob in zip(batch_indices, batch_predictions):\n",
" test_results.append((f\"{k}\", prob))\n",
"\n",
"result_df = pd.DataFrame(test_results, columns=[\"ID\", \"Label\"]) \n",
"result_df.to_csv(data_path / \"submission.csv\", index=False)\n",
"\n",
"print(f\"Submission file created at {data_path / 'submission.csv'}\")\n",
"print(result_df.head())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"trusted": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kaggle": {
"accelerator": "none",
"dataSources": [
{
"databundleVersionId": 11214388,
"sourceId": 93866,
"sourceType": "competition"
}
],
"dockerImageVersionId": 31040,
"isGpuEnabled": false,
"isInternetEnabled": true,
"language": "python",
"sourceType": "notebook"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.11"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
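The notebook above builds its negative training pairs by rejection sampling: draw random article pairs and keep only those not present in the citation edge list. A minimal standalone sketch of that idea (function name and toy data are illustrative, not from the repository):

```python
import random

def sample_negative_pairs(article_ids, positive_edges, k, seed=42):
    """Rejection-sample k undirected pairs that are not existing edges."""
    rng = random.Random(seed)
    # Store edges as sorted tuples so (a, b) and (b, a) are treated the same.
    existing = {tuple(sorted(e)) for e in positive_edges}
    negatives = []
    while len(negatives) < k:
        a, b = rng.sample(article_ids, 2)
        if tuple(sorted((a, b))) not in existing:
            negatives.append((a, b))
    return negatives

# Toy graph: 10 articles, 2 known citation edges.
ids = list(range(10))
pos = [(0, 1), (2, 3)]
neg = sample_negative_pairs(ids, pos, 5)
print(len(neg))  # 5
```

This is efficient when the graph is sparse (as here: ~1.1M edges over 138k articles), since a random pair is almost never an existing edge and few draws are rejected.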
GitHub Events
Total
- Member event: 1
- Push event: 2
Last Year
- Member event: 1
- Push event: 2