Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.4%) to scientific vocabulary

Scientific Fields

Economics Social Sciences - 40% confidence
Computer Science Computer Science - 40% confidence
Engineering Computer Science - 40% confidence
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: semihalperdundar
  • Language: Python
  • Default Branch: main
  • Size: 0 Bytes
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme Citation

README.md

Citation Prediction Modeling

This repository contains a Machine Learning Model designed to predict the number of citations a scientific paper will receive based on its metadata. The model is built using Python, Scikit-learn, and XGBoost, incorporating advanced feature engineering and hyperparameter tuning to optimize predictive accuracy.

📂 Project Structure

├── Citation_Prediction_Modelling_ML.py
└── README.md

📌 Files Overview

  • Citation_Prediction_Modelling_ML.py → The main Python script implementing feature engineering, model selection, training, and prediction.
  • description.docx → Detailed explanation of the model pipeline, evaluation metrics, and optimization techniques.
  • README.md → This document.

🎯 Objective

The goal of this project is to develop a citation prediction model using metadata-based features extracted from academic papers. The model predicts future citation counts using various machine learning techniques and optimizations.

🔬 Methodology

Feature Engineering

The model extracts and processes the following features:

  • Numerical Features:
      • Year of publication
      • Number of authors
      • Number of references cited
      • Paper age (difference between publication year and current year)
      • Title word count
  • Categorical Features:
      • Venue (converted into numerical format using Label Encoding)
  • Text Features:
      • Title and abstract transformed using TF-IDF vectorization
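As a rough illustration of the numerical features listed above, the sketch below derives them for a single paper record using only the standard library. The function name `metadata_features`, the record layout (a comma-separated `authors` string and a list of `references`), and the example values are assumptions made for this example, not taken from the dataset. Unlike a plain `split(',')`, it counts an empty author string as zero authors.

```python
from datetime import date


def metadata_features(record, current_year=None):
    """Derive numerical metadata features for one (hypothetical) paper record."""
    if current_year is None:
        current_year = date.today().year
    # Drop empty fragments so an empty author string yields 0 authors.
    authors = [a for a in record.get("authors", "").split(",") if a.strip()]
    year = int(record.get("year", 0) or 0)
    return {
        "year": year,
        "num_authors": len(authors),
        "num_references": len(record.get("references", [])),
        "paper_age": current_year - year,
        "title_word_count": len(record.get("title", "").split()),
    }


# Made-up record, used only to show the shape of the output.
example = {
    "year": 2015,
    "authors": "A. Author, B. Author",
    "references": ["r1", "r2", "r3"],
    "title": "A study of citation counts",
}
features = metadata_features(example, current_year=2024)
print(features)
```

Passing `current_year` explicitly keeps the paper-age feature reproducible across runs in different years.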

Model Selection & Training

The project evaluates several models to determine the best approach for citation prediction:

  • Ridge Regression (Regularized linear model)
  • Gradient Boosting (Ensemble learning)
  • XGBoost (Optimized gradient boosting)
  • Random Forest (Best-performing model)
  • Ensemble Model (Combining multiple models for improved accuracy)
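The README does not say how the ensemble combines its members. One common option, shown here purely as an assumption, is a weighted average of each model's predictions on the log1p scale, followed by `expm1` to return to citation counts. The helper `blend_log_predictions` and the toy prediction arrays are hypothetical.

```python
import numpy as np


def blend_log_predictions(log_preds, weights=None):
    """Average log1p-scale predictions from several models, then invert.

    log_preds: dict mapping a model name to an array of log1p predictions.
    weights:   optional dict of non-negative weights keyed by model name.
    """
    names = sorted(log_preds)
    stacked = np.vstack([np.asarray(log_preds[n], dtype=float) for n in names])
    if weights is None:
        w = np.full(len(names), 1.0 / len(names))
    else:
        w = np.array([weights[n] for n in names], dtype=float)
        w = w / w.sum()  # normalise so the weights sum to 1
    blended_log = w @ stacked     # weighted mean in log space
    return np.expm1(blended_log)  # back to the citation-count scale


# Toy predictions from two hypothetical models for two papers.
log_preds = {
    "ridge": np.log1p([0.0, 9.0]),
    "forest": np.log1p([3.0, 9.0]),
}
blended = blend_log_predictions(log_preds)
weighted = blend_log_predictions(log_preds, weights={"ridge": 1.0, "forest": 3.0})
print(blended, weighted)
```

Averaging in log space behaves like a geometric mean of `1 + count`, which damps the influence of a single model that predicts a very large count.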

Optimization & Validation

  • Hyperparameter tuning: Conducted using RandomizedSearchCV and GridSearchCV
  • Cross-validation: Applied to improve generalization and avoid overfitting
  • Log transformation: Citation counts were log-transformed for better model learning
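The three techniques above can be combined in a few lines. The sketch below is not the repository's tuning code: it uses synthetic data, a deliberately small search space, and scikit-learn's `TransformedTargetRegressor` to apply the `log1p`/`expm1` pair around a random forest, so that the cross-validated MAE reported by `RandomizedSearchCV` is on the original count scale. All data and parameter ranges are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data: 200 samples, 5 numeric features, skewed counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.floor(np.exp(1.0 + X[:, 0] + 0.5 * rng.normal(size=200)))

# Fit on log1p(y) and map predictions back with expm1, so that MAE is
# measured on the original (untransformed) target scale.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=42),
    func=np.log1p,
    inverse_func=np.expm1,
)

# Small, illustrative search space (parameter names are prefixed with
# "regressor__" because the forest is wrapped by the target transformer).
param_distributions = {
    "regressor__n_estimators": [20, 50],
    "regressor__max_depth": [4, 8],
    "regressor__min_samples_leaf": [1, 5],
}

search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=4,                          # number of sampled configurations
    cv=3,                              # 3-fold cross-validation
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X, y)

print("best params:", search.best_params_)
print("cross-validated MAE:", -search.best_score_)
```

`GridSearchCV` takes the same estimator and scoring arguments but evaluates every combination in a `param_grid`, which is practical once the random search has narrowed the ranges.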

🛠️ Installation & Setup

To run the model, install the required dependencies:

pip install pandas numpy scikit-learn xgboost sentence-transformers torch

🚀 Running the Model

To execute the citation prediction model, place train.json and test.json in the working directory and run:

python Citation_Prediction_Modelling_ML.py

The script will:

  1. Load the training dataset
  2. Perform feature engineering
  3. Train multiple models
  4. Evaluate models using Mean Absolute Error (MAE)
  5. Select the best-performing model
  6. Generate predictions for test data

📊 Results & Performance

  • Best Model: Random Forest Regressor & Ensemble Model
  • Validation Mean Absolute Error (MAE): 31.27
  • Hyperparameters Tuned: n_estimators=200, max_depth=8, min_samples_split=10, min_samples_leaf=5, max_features=0.8

🔍 Key Findings

  • TF-IDF vectorization of abstract and title significantly improves performance
  • Random Forest outperforms Gradient Boosting, Ridge, and XGBoost in citation prediction
  • Ensemble modeling further refines predictions by combining different model strengths
  • Cross-validation helps in preventing overfitting and improves model robustness

📌 Future Improvements

  • Experiment with deep learning models like Transformer-based architectures (BERT, SciBERT)
  • Introduce additional contextual metadata, such as journal impact factors
  • Enhance feature engineering with network-based features (e.g., citation graphs)

👨‍💻 Author

  • Semih Alper Dundar
  • Tilburg University - Data Science & Society

📝 License

This project is released under the MIT License.


This repository contributes to research on scientific impact prediction, helping to understand factors influencing paper citations and improving data-driven decision-making in academia. 📚📈

Owner

  • Login: semihalperdundar
  • Kind: user

Citation (Citation_Prediction_Modelling_ML.py)

import pandas as pd
import numpy as np
import logging
import json

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.preprocessing import LabelEncoder, MaxAbsScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from sentence_transformers import SentenceTransformer
import torch
import math

# Set up logging configuration
logging.basicConfig(level=logging.INFO)

# Load training data (the context manager closes the file after reading)
with open('train.json', 'r') as f:
    data = pd.DataFrame.from_records(json.load(f))

# Fill missing values with empty strings
data.fillna('', inplace=True)

# Convert 'year' to integer format
data['year'] = pd.to_numeric(data['year'], errors='coerce').fillna(0).astype(int)

# Feature Engineering: Extract additional features from data

# Count the number of authors in the paper
data['num_authors'] = data['authors'].apply(lambda x: len(x.split(',')))

# Count the number of references cited in the paper
data['num_references'] = data['references'].apply(lambda x: len(x))

# Calculate the age of the paper by subtracting the publication year from the current year
data['paper_age'] = 2024 - data['year']

# Count the number of words in the title of the paper
data['title_word_count'] = data['title'].apply(lambda x: len(x.split()))

# Encode categorical variable 'venue' using LabelEncoder
venue_encoder = LabelEncoder()
data['venue_encoded'] = venue_encoder.fit_transform(data['venue'])

# Split dataset into training and validation sets
train_set, validation = train_test_split(data, test_size=0.15, random_state=123)

# Load test data (the context manager closes the file after reading)
with open('test.json', 'r') as f:
    test = pd.DataFrame.from_records(json.load(f))

# Fill missing values in test set
test.fillna('', inplace=True)

# Convert 'year' to integer format
test['year'] = pd.to_numeric(test['year'], errors='coerce').fillna(0).astype(int)

# Apply the same feature engineering to test set

# Count the number of authors in the test data
test['num_authors'] = test['authors'].apply(lambda x: len(x.split(',')))

# Count the number of references in the test data
test['num_references'] = test['references'].apply(lambda x: len(x))

# Calculate the age of the paper in the test data
test['paper_age'] = 2024 - test['year']

# Count the number of words in the title of the paper in the test data
test['title_word_count'] = test['title'].apply(lambda x: len(x.split()))

# Encode 'venue' in test set using previously trained encoder
venue_mapping = {venue: idx for idx, venue in enumerate(venue_encoder.classes_)}
test['venue_encoded'] = test['venue'].apply(lambda x: venue_mapping.get(x, -1))

# Apply TF-IDF transformation for text features
title_tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=3000)
abstract_tfidf_vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1,2), max_features=5000)

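# Load a SentenceTransformer model for sentence embeddings.
# Note: the embedder and the helper below are defined but are not used by
# the featurizer or by any of the models trained in this script.
embedder = SentenceTransformer('all-MiniLM-L6-v2')

# Move the model to the GPU when one is available
if torch.cuda.is_available():
    embedder = embedder.to('cuda')

def generate_embedding_in_batches(text_list, batch_size=128):
    """Encode a list of strings in chunks and return one stacked 2-D array."""
    embeddings = []
    for i in range(0, len(text_list), batch_size):
        # Slice out the next chunk of texts
        batch_texts = text_list[i:i + batch_size]
        batch_embeddings = embedder.encode(batch_texts, show_progress_bar=True, batch_size=batch_size, device='cuda' if torch.cuda.is_available() else 'cpu')
        embeddings.append(batch_embeddings)
    # Stack the per-chunk arrays row-wise into a single matrix
    return np.vstack(embeddings)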

# Define ColumnTransformer for feature processing
featurizer = ColumnTransformer([
    ("year", 'passthrough', ['year']),
    ("num_authors", 'passthrough', ['num_authors']),
    ("num_references", 'passthrough', ['num_references']),
    ("paper_age", 'passthrough', ['paper_age']),
    ("title_word_count", 'passthrough', ['title_word_count']),
    ("venue_encoded", 'passthrough', ['venue_encoded']),
    ("title_tfidf", title_tfidf_vectorizer, 'title'),
    ("abstract_tfidf", abstract_tfidf_vectorizer, 'abstract')
], remainder='drop')

# Define ML models with their respective hyperparameters

# Ridge Regression: A linear model with L2 regularization to prevent overfitting
ridge = make_pipeline(
    featurizer, 
    Ridge(alpha=0.5, random_state=42)
)

# Gradient Boosting Regressor: An ensemble model that builds trees sequentially to minimize error
gbr = make_pipeline(
    featurizer, 
    GradientBoostingRegressor(
        n_estimators=100,   # Number of boosting stages
        learning_rate=0.01,  # Controls the contribution of each tree
        max_depth=3,        # Maximum depth of each tree
        subsample=0.5,      # Fraction of samples used for fitting individual trees
        min_samples_split=10,  # Minimum samples required to split a node
        min_samples_leaf=5,    # Minimum samples required in a leaf node
        random_state=42
    )
)

# Random Forest Regressor: A bagging ensemble of decision trees that reduces variance
rf = make_pipeline(
    featurizer, 
    RandomForestRegressor(
        n_estimators=200,   # Number of trees in the forest
        max_depth=8,        # Maximum depth of each tree
        min_samples_split=10,  # Minimum number of samples required to split an internal node
        min_samples_leaf=5,    # Minimum number of samples required to be at a leaf node
        max_features=0.8,   # Fraction of features considered for splitting
        random_state=42,
        n_jobs=-1  # Use all available CPU cores
    )
)

# XGBoost Regressor: An optimized gradient boosting model that handles missing data efficiently
xgb = make_pipeline(
    featurizer, 
    XGBRegressor(
        n_estimators=100,   # Number of boosting rounds
        learning_rate=0.005, # Step size shrinkage to prevent overfitting
        max_depth=3,       # Maximum depth of a tree
        gamma=1,           # Minimum loss reduction required to make a split
        subsample=0.5,     # Fraction of samples used for training
        random_state=42,
        eval_metric='mae',  # Use Mean Absolute Error as evaluation metric
        n_jobs=-1  # Use all available CPU cores
    )
)

# Train and evaluate models
models = {"Ridge": ridge, "GradientBoosting": gbr, "RandomForest": rf, "XGBoost": xgb}
validation_scores = {}

for model_name, model in models.items():
    logging.info(f"Training {model_name}")
    model.fit(train_set.drop(columns=['n_citation'], errors='ignore'), np.log1p(train_set['n_citation'].values))
    validation_pred = np.expm1(model.predict(validation.drop(columns=['n_citation'], errors='ignore')))
    validation_mae = mean_absolute_error(validation['n_citation'], validation_pred)
    validation_scores[model_name] = validation_mae
    logging.info(f"{model_name} validation MAE: {validation_mae:.2f}")

# Identify the best model based on validation performance
best_model_name = min(validation_scores, key=validation_scores.get)
best_model = models[best_model_name]
logging.info(f"Best model is {best_model_name} with MAE: {validation_scores[best_model_name]:.2f}")

# Generate predictions for the best model
test['n_citation'] = np.expm1(best_model.predict(test))

# Save the predictions to a JSON file (the context manager flushes and closes it)
with open(f'predicted_{best_model_name}.json', 'w') as f:
    json.dump(test[['n_citation']].to_dict(orient='records'), f, indent=2)
logging.info(f"Predictions for the best model '{best_model_name}' saved to 'predicted_{best_model_name}.json'")

# Set logging level
logging.getLogger().setLevel(logging.INFO)
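The script above encodes `venue` with a `LabelEncoder` and maps test venues that were not seen during fitting to -1 through a hand-built dictionary. A possible alternative, sketched here under the assumption that scikit-learn 0.24 or newer is installed, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"`, which performs the same fallback and can be fitted on the training split only. The tiny data frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up venues; the empty string mimics a missing venue after fillna('').
train_df = pd.DataFrame({"venue": ["ACL", "EMNLP", "ACL", ""]})
test_df = pd.DataFrame({"venue": ["EMNLP", "NeurIPS"]})

# Categories not seen during fit are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_df[["venue"]])

train_codes = encoder.transform(train_df[["venue"]]).ravel().astype(int)
test_codes = encoder.transform(test_df[["venue"]]).ravel().astype(int)

print(train_codes.tolist())
print(test_codes.tolist())
```

Because `OrdinalEncoder` works on 2-D input, it could also be placed directly inside a `ColumnTransformer` in place of the precomputed `venue_encoded` column, so that the encoding is learned as part of each pipeline's `fit`.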

GitHub Events

Total
  • Push event: 1
  • Create event: 2
Last Year
  • Push event: 1
  • Create event: 2