citation-prediction-modelling-project
https://github.com/semihalperdundar/citation-prediction-modelling-project
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low: 14.4%)
Scientific Fields
Repository
Basic Info
- Host: GitHub
- Owner: semihalperdundar
- Language: Python
- Default Branch: main
- Size: 0 Bytes
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Citation Prediction Modeling
This repository contains a machine learning model designed to predict the number of citations a scientific paper will receive based on its metadata. The model is built with Python, scikit-learn, and XGBoost, and uses feature engineering and hyperparameter tuning to optimize predictive accuracy.
📂 Project Structure
```
├── Citation_Prediction_Modelling_ML.py
└── README.md
```
📌 Files Overview
- Citation_Prediction_Modelling_ML.py → The main Python script implementing feature engineering, model selection, training, and prediction.
- description.docx → Detailed explanation of the model pipeline, evaluation metrics, and optimization techniques.
- README.md → This document.
🎯 Objective
The goal of this project is to develop a citation prediction model using metadata-based features extracted from academic papers. The model predicts future citation counts using various machine learning techniques and optimizations.
🔬 Methodology
Feature Engineering
The model extracts and processes the following features:
- Numerical features:
  - Year of publication
  - Number of authors
  - Number of references cited
  - Paper age (current year minus publication year)
  - Title word count
- Categorical features:
  - Venue (converted into numerical format using label encoding)
- Text features:
  - Title and abstract transformed using TF-IDF vectorization
Model Selection & Training
The project evaluates several models to determine the best approach for citation prediction:
- Ridge Regression (regularized linear model)
- Gradient Boosting (sequential ensemble learning)
- XGBoost (optimized gradient boosting)
- Random Forest (best-performing model)
- Ensemble Model (combining multiple models for improved accuracy)
Optimization & Validation
- Hyperparameter tuning: Conducted using RandomizedSearchCV and GridSearchCV
- Cross-validation: Applied to improve generalization and avoid overfitting
- Log transformation: Citation counts were log-transformed for better model learning
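The script below imports RandomizedSearchCV but does not show the search itself, so here is a minimal sketch of how the tuning and log-transformed target described above could be combined. The parameter ranges and dataset are illustrative assumptions, not the project's actual search space:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the citation data; counts are non-negative.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)
y = np.abs(y)

# Randomized search over illustrative hyperparameter ranges.
# The target is log1p-transformed so that a few highly cited
# papers do not dominate the loss; predictions would be mapped
# back with np.expm1, as in the script.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(100, 300),
        "max_depth": randint(4, 12),
        "min_samples_leaf": randint(2, 8),
    },
    n_iter=5,
    scoring="neg_mean_absolute_error",
    cv=3,                      # cross-validation, as described above
    random_state=42,
)
search.fit(X, np.log1p(y))
print(search.best_params_)
```

Note the MAE here is measured on the log scale; the script instead evaluates MAE on back-transformed predictions.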
🛠️ Installation & Setup
To run the model, install the required dependencies:
```bash
pip install pandas numpy scikit-learn xgboost sentence-transformers torch
```
🚀 Running the Model
To execute the citation prediction model:
```bash
python Citation_Prediction_Modelling_ML.py
```
The script will:
1. Load the training dataset
2. Perform feature engineering
3. Train multiple models
4. Evaluate models using Mean Absolute Error (MAE)
5. Select the best-performing model
6. Generate predictions for test data
📊 Results & Performance
- Best Model: Random Forest Regressor & Ensemble Model
- Validation Mean Absolute Error (MAE): 31.27
- Hyperparameters Tuned:
  `n_estimators=200, max_depth=8, min_samples_split=10, min_samples_leaf=5, max_features=0.8`
🔍 Key Findings
- TF-IDF vectorization of abstract and title significantly improves performance
- Random Forest outperforms Gradient Boosting, Ridge, and XGBoost in citation prediction
- Ensemble modeling further refines predictions by combining different model strengths
- Cross-validation helps in preventing overfitting and improves model robustness
📌 Future Improvements
- Experiment with deep learning models like Transformer-based architectures (BERT, SciBERT)
- Introduce additional contextual metadata, such as journal impact factors
- Enhance feature engineering with network-based features (e.g., citation graphs)
👨‍💻 Author
- Semih Alper Dundar
- Tilburg University - Data Science & Society
📝 License
This project is released under the MIT License.
This repository contributes to research on scientific impact prediction, helping to understand factors influencing paper citations and improving data-driven decision-making in academia. 📚📈
Owner
- Login: semihalperdundar
- Kind: user
- Repositories: 1
- Profile: https://github.com/semihalperdundar
Citation (Citation_Prediction_Modelling_ML.py)
import pandas as pd
import numpy as np
import logging
import json
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sentence_transformers import SentenceTransformer
import torch
import math
# Set up logging configuration
logging.basicConfig(level=logging.INFO)
# Load training data
data = pd.DataFrame.from_records(json.load(open('train.json', 'r')))
# Fill missing values with empty strings
data.fillna('', inplace=True)
# Convert 'year' to integer format
data['year'] = pd.to_numeric(data['year'], errors='coerce').fillna(0).astype(int)
# Feature Engineering: Extract additional features from data
# Count the number of authors in the paper
data['num_authors'] = data['authors'].apply(lambda x: len(x.split(',')))
# Count the number of references cited in the paper
data['num_references'] = data['references'].apply(lambda x: len(x))
# Calculate the age of the paper by subtracting the publication year from the current year
data['paper_age'] = 2024 - data['year']
# Count the number of words in the title of the paper
data['title_word_count'] = data['title'].apply(lambda x: len(x.split()))
# Encode categorical variable 'venue' using LabelEncoder
venue_encoder = LabelEncoder()
data['venue_encoded'] = venue_encoder.fit_transform(data['venue'])
# Split dataset into training and validation sets
train_set, validation = train_test_split(data, test_size=0.15, random_state=123)
# Load test data
test = pd.DataFrame.from_records(json.load(open('test.json', 'r')))
# Fill missing values in test set
test.fillna('', inplace=True)
# Convert 'year' to integer format
test['year'] = pd.to_numeric(test['year'], errors='coerce').fillna(0).astype(int)
# Apply the same feature engineering to test set
# Count the number of authors in the test data
test['num_authors'] = test['authors'].apply(lambda x: len(x.split(',')))
# Count the number of references in the test data
test['num_references'] = test['references'].apply(lambda x: len(x))
# Calculate the age of the paper in the test data
test['paper_age'] = 2024 - test['year']
# Count the number of words in the title of the paper in the test data
test['title_word_count'] = test['title'].apply(lambda x: len(x.split()))
# Encode 'venue' in test set using previously trained encoder
venue_mapping = {venue: idx for idx, venue in enumerate(venue_encoder.classes_)}
test['venue_encoded'] = test['venue'].apply(lambda x: venue_mapping.get(x, -1))
# Apply TF-IDF transformation for text features
title_tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=3000)
abstract_tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=5000)
# Generate embeddings using SentenceTransformer model
embedder = SentenceTransformer('all-MiniLM-L6-v2')
if torch.cuda.is_available():
    embedder = embedder.to('cuda')
# Note: this helper is defined but never called in the pipeline below;
# the featurizer relies on TF-IDF rather than sentence embeddings.
def generate_embedding_in_batches(text_list, batch_size=128):
    embeddings = []
    for i in range(0, len(text_list), batch_size):
        batch_texts = text_list[i:i + batch_size]
        batch_embeddings = embedder.encode(batch_texts, show_progress_bar=True, batch_size=batch_size, device='cuda' if torch.cuda.is_available() else 'cpu')
        embeddings.append(batch_embeddings)
    return np.vstack(embeddings)
# Define ColumnTransformer for feature processing
featurizer = ColumnTransformer([
    ("year", 'passthrough', ['year']),
    ("num_authors", 'passthrough', ['num_authors']),
    ("num_references", 'passthrough', ['num_references']),
    ("paper_age", 'passthrough', ['paper_age']),
    ("title_word_count", 'passthrough', ['title_word_count']),
    ("venue_encoded", 'passthrough', ['venue_encoded']),
    ("title_tfidf", title_tfidf_vectorizer, 'title'),
    ("abstract_tfidf", abstract_tfidf_vectorizer, 'abstract')
], remainder='drop')
# Define ML models with their respective hyperparameters
# Ridge Regression: A linear model with L2 regularization to prevent overfitting
ridge = make_pipeline(
    featurizer,
    Ridge(alpha=0.5, random_state=42)
)
# Gradient Boosting Regressor: An ensemble model that builds trees sequentially to minimize error
gbr = make_pipeline(
    featurizer,
    GradientBoostingRegressor(
        n_estimators=100,      # Number of boosting stages
        learning_rate=0.01,    # Controls the contribution of each tree
        max_depth=3,           # Maximum depth of each tree
        subsample=0.5,         # Fraction of samples used for fitting individual trees
        min_samples_split=10,  # Minimum samples required to split a node
        min_samples_leaf=5,    # Minimum samples required in a leaf node
        random_state=42
    )
)
# Random Forest Regressor: A bagging ensemble of decision trees that reduces variance
rf = make_pipeline(
    featurizer,
    RandomForestRegressor(
        n_estimators=200,      # Number of trees in the forest
        max_depth=8,           # Maximum depth of each tree
        min_samples_split=10,  # Minimum number of samples required to split an internal node
        min_samples_leaf=5,    # Minimum number of samples required to be at a leaf node
        max_features=0.8,      # Fraction of features considered for splitting
        random_state=42,
        n_jobs=-1              # Use all available CPU cores
    )
)
# XGBoost Regressor: An optimized gradient boosting model that handles missing data efficiently
xgb = make_pipeline(
    featurizer,
    XGBRegressor(
        n_estimators=100,      # Number of boosting rounds
        learning_rate=0.005,   # Step size shrinkage to prevent overfitting
        max_depth=3,           # Maximum loss depth of a tree
        gamma=1,               # Minimum loss reduction required to make a split
        subsample=0.5,         # Fraction of samples used for training
        random_state=42,
        # use_label_encoder removed: it applies only to XGBClassifier and
        # was dropped from recent xgboost releases
        eval_metric='mae',     # Use Mean Absolute Error as evaluation metric
        n_jobs=-1              # Use all available CPU cores
    )
)
# Train and evaluate models
models = {"Ridge": ridge, "GradientBoosting": gbr, "RandomForest": rf, "XGBoost": xgb}
validation_scores = {}
for model_name, model in models.items():
    logging.info(f"Training {model_name}")
    model.fit(train_set.drop(columns=['n_citation'], errors='ignore'), np.log1p(train_set['n_citation'].values))
    validation_pred = np.expm1(model.predict(validation.drop(columns=['n_citation'], errors='ignore')))
    validation_mae = mean_absolute_error(validation['n_citation'], validation_pred)
    validation_scores[model_name] = validation_mae
    logging.info(f"{model_name} validation MAE: {validation_mae:.2f}")
# Identify the best model based on validation performance
best_model_name = min(validation_scores, key=validation_scores.get)
best_model = models[best_model_name]
logging.info(f"Best model is {best_model_name} with MAE: {validation_scores[best_model_name]:.2f}")
# Generate predictions for the best model
test['n_citation'] = np.expm1(best_model.predict(test))
# Save the predictions to a JSON file
json.dump(test[['n_citation']].to_dict(orient='records'), open(f'predicted_{best_model_name}.json', 'w'),indent=2)
logging.info(f"Predictions for the best model '{best_model_name}' saved to 'predicted_{best_model_name}.json'")
# Set logging level
logging.getLogger().setLevel(logging.INFO)
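The script above reads train.json and test.json but their schema is not documented in the repository. A hedged sketch of the record format inferred from the field accesses in the code (all values are illustrative, and the exact reference-id format is an assumption):

```python
import json

# Fields inferred from the script: 'authors' is a comma-separated string,
# 'references' is a list, 'n_citation' is the target (training data only).
record = {
    "title": "A Study of Citation Prediction",
    "abstract": "We predict citation counts from paper metadata.",
    "authors": "A. Author, B. Author",
    "venue": "Example Conference",
    "year": 2020,
    "references": ["id1", "id2", "id3"],
    "n_citation": 12,
}

# Write a one-record training file in the shape the script expects.
with open("train.json", "w") as f:
    json.dump([record], f, indent=2)
```

With this shape, the engineered features would come out as num_authors=2, num_references=3, paper_age=4, and title_word_count=5.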
GitHub Events
Total
- Push event: 1
- Create event: 2
Last Year
- Push event: 1
- Create event: 2