boun-mis-citations

Bogazici University MIS Department Citation Scraper

https://github.com/fusuyfusuy/boun-mis-citations

Science Score: 44.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file: found
  • codemeta.json file: found
  • .zenodo.json file: found
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (11.9%)
Last synced: 6 months ago

Repository

Bogazici University MIS Department Citation Scraper

Basic Info
  • Host: GitHub
  • Owner: fusuyfusuy
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 166 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Bogazici University MIS Department Citation Scraper

A comprehensive web scraping and analysis tool for extracting faculty publications and citations from the Bogazici University Management Information Systems (MIS) department website.

🎯 Project Overview

This project consists of two main components:

  1. Faculty Scraper (faculty_scraper.py) - Extracts faculty profiles and publications from the MIS website
  2. Citation Analyzer (citation_analyzer.py) - Processes and organizes the scraped data into structured formats

📊 Features

Web Scraping Capabilities

  • ✅ Extracts faculty data from multiple department pages (full-time, part-time, contributing faculty, teaching assistants)
  • ✅ Comprehensive profile scraping (contact info, education, research interests)
  • ✅ Citation-only extraction mode for faster processing
  • ✅ Automatic URL deduplication
  • ✅ Respectful rate limiting (configurable delays)
  • ✅ Robust error handling and retry mechanisms
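The delay-plus-retry pattern behind the last two bullets can be sketched as follows. This is a minimal illustration, not code from faculty_scraper.py; the helper name and parameters are made up, and in the scraper itself `fetch` would be something like `session.get`.

```python
import time

def fetch_with_retry(fetch, url, delay=1.0, retries=3):
    """Call fetch(url), pausing `delay` seconds before each attempt
    and retrying on any exception (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        time.sleep(delay)  # respectful rate limiting before every request
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                return None  # graceful degradation: caller decides what to skip
    return None
```

Returning `None` instead of raising lets a long scraping run skip one broken profile page and continue with the rest.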

Data Processing & Analysis

  • 📈 Year-based citation organization (1950-2030 range)
  • 🌍 Bilingual support (English/Turkish)
  • 📋 Multiple export formats (JSON, CSV, HTML)
  • 📊 Statistical analysis and metadata generation
  • 🏷️ Citation categorization by publication type

🚀 Quick Start

Prerequisites

```bash
pip install requests beautifulsoup4 lxml
```

Basic Usage

```bash
# Step 1: Scrape faculty data
python faculty_scraper.py

# Step 2: Process and analyze citations
python citation_analyzer.py
```

📁 File Structure

```
├── faculty_scraper.py           # Main scraping engine
├── citation_analyzer.py         # Data processing and export
├── complete_faculty_data.json   # Raw scraped data (generated)
├── citations_en.csv             # English CSV export (generated)
├── citations_tr.csv             # Turkish CSV export (generated)
├── citations_en.html            # English HTML export (generated)
├── citations_tr.html            # Turkish HTML export (generated)
└── README.md                    # This file
```

🔧 Configuration Options

Scraping Parameters

```python
# Adjust delay between requests (seconds)
delay = 1.0  # Default: 1 second

# Target URLs (customizable)
faculty_pages = [
    "https://mis.bogazici.edu.tr/fulltimefaculty",
    "https://mis.bogazici.edu.tr/parttimefaculty",
    "https://mis.bogazici.edu.tr/facultymemberscontributingtodepartment",
    "https://mis.bogazici.edu.tr/teachingassistants"
]
```

Processing Options

```python
# Year extraction range
YEAR_RANGE = (1950, 2030)

# Supported languages
LANGUAGES = ['en', 'tr']

# Publication categories
CATEGORIES = [
    'international_articles',
    'international_book_chapters',
    'national_articles',
    'international_conference_papers',
    'national_conference_papers'
]
```

📈 Output Formats

JSON Structure

```json
{
  "name": "Faculty Name",
  "email": "email@example.com",
  "citations": {
    "international_articles": ["Citation 1", "Citation 2"],
    "international_conference_papers": ["Paper 1", "Paper 2"]
  }
}
```

CSV Columns

| Column | Description |
|--------|-------------|
| Category | Publication type (localized) |
| Year | Publication year |
| Author | Faculty member name |
| Citation | Full citation text |

HTML Format

  • Organized by publication category
  • Chronologically sorted (newest first)
  • Citation counts per category/year
  • Clean, readable formatting

⚡ Performance Optimizations

Algorithm Complexity

  • URL Extraction: O(n) where n = number of faculty links
  • Data Processing: O(m) where m = total citations
  • Year Extraction: O(1) per citation using regex optimization
  • Memory Usage: O(k) where k = total scraped data size

Built-in Optimizations

  • ✅ Single-pass citation processing
  • ✅ Hash-based URL deduplication: O(1) lookup
  • ✅ Lazy loading with generators for large datasets
  • ✅ In-place text processing to minimize memory allocation
  • ✅ Early termination on invalid year ranges
  • ✅ Session reuse for HTTP connection pooling
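Hash-based deduplication boils down to a set membership test. A sketch under the assumption that profile URLs arrive as a flat list (the function name is illustrative, not taken from the scraper):

```python
def dedupe_urls(urls):
    """Order-preserving deduplication; the set gives O(1) membership checks."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)   # remember every URL we have emitted
            unique.append(url)
    return unique
```

Session reuse in the last bullet means creating one `requests.Session()` up front and issuing every request through it, so the underlying TCP connection is pooled instead of reopened for each profile page.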

Best Practices Implemented

```python
# Input validation with bounds checking
if not (1950 <= year <= 2030):
    return None

# Hash map for O(1) category lookups
TRANSLATIONS = {...}  # Pre-computed translations

# Single regex compilation for performance
YEAR_PATTERN = re.compile(r'\((\d{4})\)')
```

🛡️ Error Handling

Network Resilience

  • HTTP timeout handling
  • Connection retry mechanisms
  • Graceful degradation on failed requests
  • Status code validation

Data Validation

  • Empty input checking
  • Null value handling
  • Duplicate detection and removal
  • Year range validation
  • Character encoding safeguards

📊 Statistical Output

The analyzer provides comprehensive statistics:

```
=== CITATION METADATA ===
International Articles: 245 total citations
International Conference Papers: 189 total citations
...

=== OVERALL STATISTICS ===
Total citations: 1,234
Year range: 1995 - 2024
Top productive years: 2023 (89), 2022 (76), 2021 (68)
```

⚠️ Legal & Ethical Considerations

Compliance Features

  • ✅ Respectful crawling with configurable delays
  • ✅ User-Agent headers for transparency
  • ✅ No authentication bypass attempts
  • ✅ Public data only (no private content access)
  • ✅ Rate limiting to prevent server overload

Usage Guidelines

  • Use responsibly and within reasonable limits
  • Respect the website's robots.txt if present
  • Consider reaching out to the institution for bulk data needs
  • Ensure compliance with local data protection regulations
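The robots.txt check suggested above can be automated with the standard library's `urllib.robotparser`. A minimal sketch (the helper name and user-agent string are illustrative; in practice the rules would be fetched from the site's own /robots.txt):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent='boun-mis-citations'):
    """Return True when the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # feed rules line by line
    return parser.can_fetch(user_agent, url)
```

Calling this once per target page before scraping keeps the crawler inside the site's stated limits at negligible cost.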

🔄 Development Workflow

Adding New Faculty Pages

```python
# Extend the URL list in main()
new_urls = scraper.get_faculty_urls("https://mis.bogazici.edu.tr/newpage")
urls.extend(new_urls)
```

Adding Publication Categories

```python
# Update TRANSLATIONS dictionary
TRANSLATIONS['new_category'] = {
    'en': 'New Category',
    'tr': 'Yeni Kategori'
}
```

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Follow the existing code style and optimization patterns
  4. Add appropriate error handling and input validation
  5. Test with small datasets before full runs
  6. Submit a pull request with clear documentation

📝 Changelog

v1.0.0

  • Initial release
  • Basic faculty scraping functionality
  • Citation extraction and organization
  • Multi-language support
  • Statistical analysis features

🆘 Troubleshooting

Common Issues

"No faculty URLs found"
  • Check if the target website structure has changed
  • Verify CSS selectors in get_faculty_urls()
  • Ensure network connectivity

"Error scraping profile"
  • Website may be temporarily unavailable
  • Check if the profile page structure has changed
  • Increase the delay between requests

"Year extraction failing"
  • Citations may use non-standard date formats
  • Update regex patterns in extract_year()
  • Check citation text encoding

Performance Issues

  • Reduce concurrent requests (increase delay)
  • Process data in smaller batches
  • Use citation-only mode for faster extraction
  • Check available memory for large datasets
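Processing in smaller batches, as suggested above, needs nothing more than a slicing generator; this is a sketch, not code from the repository:

```python
def batched(items, size):
    """Yield successive batches of at most `size` items, so a large
    citation list never has to be expanded in memory all at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch can then be parsed, exported, and discarded before the next one is materialized.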

📧 Support

For questions, issues, or contributions, please create an issue in the repository or contact the development team.


Note: This tool is designed for academic and research purposes. Please use responsibly and in accordance with the target website's terms of service.

Owner

  • Name: Yusuf Akçakaya
  • Login: fusuyfusuy
  • Kind: user
  • Location: Istanbul, TR
  • Company: Bogazici University

Citation (citation_analyzer.py)

import json
import re
import csv
import html
from collections import defaultdict

# Translation mappings
TRANSLATIONS = {
    'international_articles': {
        'en': 'International Articles',
        'tr': 'Uluslararası Makaleler'
    },
    'international_book_chapters': {
        'en': 'International Book Chapters', 
        'tr': 'Uluslararası Kitap Bölümleri'
    },
    'international_conference_papers': {
        'en': 'International Conference Papers',
        'tr': 'Uluslararası Bildiriler'
    },
    'national_conference_papers': {
        'en': 'National Conference Papers',
        'tr': 'Ulusal Bildiriler'
    },
    'national_articles': {
        'en': 'National Articles',
        'tr': 'Ulusal Makaleler'
    },
    'national_books': {
        'en': 'National Books',
        'tr': 'Ulusal Kitaplar'
    },
    'national_conferences': {
        'en': 'National Conferences', 
        'tr': 'Ulusal Konferanslar'
    }
}

def get_category_name(category_key, language='en'):
    """Get translated category name."""
    if category_key in TRANSLATIONS:
        return TRANSLATIONS[category_key][language]
    # Fallback: clean up the key
    return category_key.replace('_', ' ').title()

def extract_year(citation_text):
    """Extract year from citation - looks for (YYYY) format first."""
    # Try (YYYY) format first - most common in academic citations
    match = re.search(r'\((\d{4})\)', citation_text)
    if match:
        year = int(match.group(1))
        if 1950 <= year <= 2030:  # Valid range
            return year
    
    # Fallback: find any 4-digit number in valid range
    numbers = re.findall(r'\b(\d{4})\b', citation_text)
    for num in numbers:
        year = int(num)
        if 1950 <= year <= 2030:
            return year
    
    return None

def parse_citations(json_file):
    """Parse faculty JSON and organize citations by category and year."""
    with open(json_file, 'r', encoding='utf-8') as f:
        faculty_data = json.load(f)
    
    # Group citations by category and year
    organized = defaultdict(lambda: defaultdict(list))
    
    for faculty in faculty_data:
        if 'citations' not in faculty:
            continue
            
        faculty_name = faculty.get('name', 'Unknown')
        
        for category, citations in faculty['citations'].items():
            for citation in citations:
                year = extract_year(citation)
                if year:
                    organized[category][year].append({
                        'text': citation,
                        'author': faculty_name
                    })
    
    # Sort years (newest first)
    for category in organized:
        organized[category] = dict(sorted(
            organized[category].items(), 
            reverse=True
        ))
    
    return dict(organized)

def save_to_csv(organized_citations, filename, language='en'):
    """Save citations to CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Category', 'Year', 'Author', 'Citation'])
        
        for category, years in organized_citations.items():
            category_name = get_category_name(category, language)
            for year, citations in years.items():
                for citation in citations:
                    writer.writerow([
                        category_name,
                        year,
                        citation['author'],
                        citation['text']
                    ])

def save_to_html(organized_citations, filename, language='en'):
    """Save citations to HTML file."""
    with open(filename, 'w', encoding='utf-8') as f:
        for category, years in organized_citations.items():
            # Count total citations in this category
            total_count = sum(len(citations) for citations in years.values())
            
            # Category header with translation
            category_name = get_category_name(category, language)
            f.write(f'<h1><strong>{category_name} ({total_count} articles)</strong></h1>')
            
            # Years and citations
            for year, citations in years.items():
                f.write(f'<h2><strong>{year} </strong></h2>')
                f.write('<ol>')
                
                for citation in citations:
                    # Escape HTML entities
                    escaped_citation = html.escape(citation['text'])
                    f.write(f'<li>{escaped_citation}</li>')
                
                f.write('</ol>')

def print_metadata(organized_citations):
    """Print detailed metadata about citations."""
    print("\n=== CITATION METADATA ===")
    
    # Collect all years and count citations per year
    year_counts = defaultdict(int)
    category_year_counts = defaultdict(lambda: defaultdict(int))
    
    total_citations = 0
    
    for category, years in organized_citations.items():
        category_total = 0
        for year, citations in years.items():
            count = len(citations)
            year_counts[year] += count
            category_year_counts[category][year] = count
            category_total += count
            total_citations += count
        
        print(f"\n{get_category_name(category, 'en')}: {category_total} total citations")
        # Show year breakdown for this category
        for year in sorted(years.keys(), reverse=True):
            print(f"  {year}: {len(years[year])} citations")
    
    print(f"\n=== OVERALL STATISTICS ===")
    print(f"Total citations: {total_citations}")
    print(f"Year range: {min(year_counts.keys())} - {max(year_counts.keys())}")
    print(f"Total years covered: {len(year_counts)}")
    
    print(f"\n=== CITATIONS PER YEAR (ALL CATEGORIES) ===")
    for year in sorted(year_counts.keys(), reverse=True):
        print(f"{year}: {year_counts[year]} citations")
    
    print(f"\n=== TOP PRODUCTIVE YEARS ===")
    top_years = sorted(year_counts.items(), key=lambda x: x[1], reverse=True)[:5]
    for year, count in top_years:
        print(f"{year}: {count} citations")

def print_summary(organized_citations):
    """Print a simple summary."""
    total = 0
    for category, years in organized_citations.items():
        cat_total = sum(len(citations) for citations in years.values())
        total += cat_total
        print(f"{get_category_name(category, 'en')}: {cat_total} citations")
    
    print(f"Total citations with years: {total}")

def main():
    input_file = "complete_faculty_data.json"
    
    try:
        print("Parsing citations...")
        organized = parse_citations(input_file)
        
        print("\n=== SUMMARY ===")
        print_summary(organized)
        
        # Print detailed metadata
        print_metadata(organized)
        
        # Save to CSV (both languages)
        save_to_csv(organized, 'citations_en.csv', 'en')
        save_to_csv(organized, 'citations_tr.csv', 'tr')
        print("✓ Saved to citations_en.csv (English)")
        print("✓ Saved to citations_tr.csv (Turkish)")
        
        # Save to HTML (both languages)
        save_to_html(organized, 'citations_en.html', 'en')
        save_to_html(organized, 'citations_tr.html', 'tr')
        print("✓ Saved to citations_en.html (English)")
        print("✓ Saved to citations_tr.html (Turkish)")
        
        # Show example
        if organized:
            first_category = list(organized.keys())[0]
            first_year = list(organized[first_category].keys())[0]
            category_name = get_category_name(first_category, 'en')
            print(f"\nExample: {category_name} in {first_year}:")
            for i, citation in enumerate(organized[first_category][first_year][:2], 1):
                print(f"{i}. {citation['author']}: {citation['text'][:80]}...")
        
    except FileNotFoundError:
        print(f"Error: {input_file} not found. Run the first script first!")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

GitHub Events

Total
  • Push event: 1
  • Public event: 1
Last Year
  • Push event: 1
  • Public event: 1