boun-mis-citations

Bogazici University MIS Department Citation Scraper

https://github.com/fusuyfusuy/boun-mis-citations

Science Score: 44.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file: found
  • codemeta.json file: found
  • .zenodo.json file: found
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (11.9%)
Last synced: 6 months ago

Repository

Bogazici University MIS Department Citation Scraper

Basic Info
  • Host: GitHub
  • Owner: fusuyfusuy
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 166 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

Bogazici University MIS Department Citation Scraper

A comprehensive web scraping and analysis tool for extracting faculty publications and citations from the Bogazici University Management Information Systems (MIS) department website.

🎯 Project Overview

This project consists of two main components:

  1. Faculty Scraper (faculty_scraper.py) - Extracts faculty profiles and publications from the MIS website
  2. Citation Analyzer (citation_analyzer.py) - Processes and organizes the scraped data into structured formats

📊 Features

Web Scraping Capabilities

  • ✅ Extracts faculty data from multiple department pages (full-time, part-time, contributing faculty, teaching assistants)
  • ✅ Comprehensive profile scraping (contact info, education, research interests)
  • ✅ Citation-only extraction mode for faster processing
  • ✅ Automatic URL deduplication
  • ✅ Respectful rate limiting (configurable delays)
  • ✅ Robust error handling and retry mechanisms
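The delay-plus-retry pattern behind the last two bullets can be sketched as follows. This is a minimal illustration, not code from faculty_scraper.py; the helper name and parameters are made up, and in the scraper itself `fetch` would be something like `session.get`.

```python
import time

def fetch_with_retry(fetch, url, delay=1.0, retries=3):
    """Call fetch(url), pausing `delay` seconds before each attempt
    and retrying on any exception (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        time.sleep(delay)  # respectful rate limiting before every request
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                return None  # graceful degradation: caller decides what to skip
    return None
```

Returning `None` instead of raising lets a long scraping run skip one broken profile page and continue with the rest.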

Data Processing & Analysis

  • 📈 Year-based citation organization (1950-2030 range)
  • 🌍 Bilingual support (English/Turkish)
  • 📋 Multiple export formats (JSON, CSV, HTML)
  • 📊 Statistical analysis and metadata generation
  • 🏷️ Citation categorization by publication type

🚀 Quick Start

Prerequisites

```bash
pip install requests beautifulsoup4 lxml
```

Basic Usage

```bash
# Step 1: Scrape faculty data
python faculty_scraper.py

# Step 2: Process and analyze citations
python citation_analyzer.py
```

📁 File Structure

```
├── faculty_scraper.py           # Main scraping engine
├── citation_analyzer.py         # Data processing and export
├── complete_faculty_data.json   # Raw scraped data (generated)
├── citations_en.csv             # English CSV export (generated)
├── citations_tr.csv             # Turkish CSV export (generated)
├── citations_en.html            # English HTML export (generated)
├── citations_tr.html            # Turkish HTML export (generated)
└── README.md                    # This file
```

🔧 Configuration Options

Scraping Parameters

```python
# Adjust delay between requests (seconds)
delay = 1.0  # Default: 1 second

# Target URLs (customizable)
faculty_pages = [
    "https://mis.bogazici.edu.tr/fulltimefaculty",
    "https://mis.bogazici.edu.tr/parttimefaculty",
    "https://mis.bogazici.edu.tr/facultymemberscontributingtodepartment",
    "https://mis.bogazici.edu.tr/teachingassistants"
]
```

Processing Options

```python
# Year extraction range
YEAR_RANGE = (1950, 2030)

# Supported languages
LANGUAGES = ['en', 'tr']

# Publication categories
CATEGORIES = [
    'international_articles',
    'international_book_chapters',
    'national_articles',
    'international_conference_papers',
    'national_conference_papers'
]
```

📈 Output Formats

JSON Structure

```json
{
  "name": "Faculty Name",
  "email": "email@example.com",
  "citations": {
    "international_articles": ["Citation 1", "Citation 2"],
    "international_conference_papers": ["Paper 1", "Paper 2"]
  }
}
```

CSV Columns

| Column | Description |
|--------|-------------|
| Category | Publication type (localized) |
| Year | Publication year |
| Author | Faculty member name |
| Citation | Full citation text |

HTML Format

  • Organized by publication category
  • Chronologically sorted (newest first)
  • Citation counts per category/year
  • Clean, readable formatting

⚡ Performance Optimizations

Algorithm Complexity

  • URL Extraction: O(n) where n = number of faculty links
  • Data Processing: O(m) where m = total citations
  • Year Extraction: O(1) per citation using regex optimization
  • Memory Usage: O(k) where k = total scraped data size

Built-in Optimizations

  • ✅ Single-pass citation processing
  • ✅ Hash-based URL deduplication: O(1) lookup
  • ✅ Lazy loading with generators for large datasets
  • ✅ In-place text processing to minimize memory allocation
  • ✅ Early termination on invalid year ranges
  • ✅ Session reuse for HTTP connection pooling
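Hash-based deduplication boils down to a set membership test. A sketch under the assumption that profile URLs arrive as a flat list (the function name is illustrative, not taken from the scraper):

```python
def dedupe_urls(urls):
    """Order-preserving deduplication; the set gives O(1) membership checks."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)   # remember every URL we have emitted
            unique.append(url)
    return unique
```

Session reuse in the last bullet means creating one `requests.Session()` up front and issuing every request through it, so the underlying TCP connection is pooled instead of reopened for each profile page.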

Best Practices Implemented

```python
# Input validation with bounds checking
if not (1950 <= year <= 2030):
    return None

# Hash map for O(1) category lookups
TRANSLATIONS = {...}  # Pre-computed translations

# Single regex compilation for performance
YEAR_PATTERN = re.compile(r'\((\d{4})\)')
```

🛡️ Error Handling

Network Resilience

  • HTTP timeout handling
  • Connection retry mechanisms
  • Graceful degradation on failed requests
  • Status code validation

Data Validation

  • Empty input checking
  • Null value handling
  • Duplicate detection and removal
  • Year range validation
  • Character encoding safeguards

📊 Statistical Output

The analyzer provides comprehensive statistics:

```
=== CITATION METADATA ===
International Articles: 245 total citations
International Conference Papers: 189 total citations
...

=== OVERALL STATISTICS ===
Total citations: 1,234
Year range: 1995 - 2024
Top productive years: 2023 (89), 2022 (76), 2021 (68)
```

⚠️ Legal & Ethical Considerations

Compliance Features

  • ✅ Respectful crawling with configurable delays
  • ✅ User-Agent headers for transparency
  • ✅ No authentication bypass attempts
  • ✅ Public data only (no private content access)
  • ✅ Rate limiting to prevent server overload

Usage Guidelines

  • Use responsibly and within reasonable limits
  • Respect the website's robots.txt if present
  • Consider reaching out to the institution for bulk data needs
  • Ensure compliance with local data protection regulations
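The robots.txt check suggested above can be automated with the standard library's `urllib.robotparser`. A minimal sketch (the helper name and user-agent string are illustrative; in practice the rules would be fetched from the site's own /robots.txt):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent='boun-mis-citations'):
    """Return True when the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # feed rules line by line
    return parser.can_fetch(user_agent, url)
```

Calling this once per target page before scraping keeps the crawler inside the site's stated limits at negligible cost.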

🔄 Development Workflow

Adding New Faculty Pages

```python
# Extend the URL list in main()
new_urls = scraper.get_faculty_urls("https://mis.bogazici.edu.tr/newpage")
urls.extend(new_urls)
```

Adding Publication Categories

```python
# Update TRANSLATIONS dictionary
TRANSLATIONS['new_category'] = {
    'en': 'New Category',
    'tr': 'Yeni Kategori'
}
```

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Follow the existing code style and optimization patterns
  4. Add appropriate error handling and input validation
  5. Test with small datasets before full runs
  6. Submit a pull request with clear documentation

📝 Changelog

v1.0.0

  • Initial release
  • Basic faculty scraping functionality
  • Citation extraction and organization
  • Multi-language support
  • Statistical analysis features

🆘 Troubleshooting

Common Issues

"No faculty URLs found"
  • Check if the target website structure has changed
  • Verify CSS selectors in get_faculty_urls()
  • Ensure network connectivity

"Error scraping profile"
  • Website may be temporarily unavailable
  • Check if the profile page structure has changed
  • Increase the delay between requests

"Year extraction failing"
  • Citations may use non-standard date formats
  • Update regex patterns in extract_year()
  • Check citation text encoding

Performance Issues

  • Reduce concurrent requests (increase delay)
  • Process data in smaller batches
  • Use citation-only mode for faster extraction
  • Check available memory for large datasets
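Processing in smaller batches, as suggested above, needs nothing more than a slicing generator; this is a sketch, not code from the repository:

```python
def batched(items, size):
    """Yield successive batches of at most `size` items, so a large
    citation list never has to be expanded in memory all at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch can then be parsed, exported, and discarded before the next one is materialized.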

📧 Support

For questions, issues, or contributions, please create an issue in the repository or contact the development team.


Note: This tool is designed for academic and research purposes. Please use responsibly and in accordance with the target website's terms of service.

Owner

  • Name: Yusuf Akçakaya
  • Login: fusuyfusuy
  • Kind: user
  • Location: Istanbul, TR
  • Company: Bogazici University

Citation (citation_analyzer.py)

import json
import re
import csv
import html
from collections import defaultdict

# Translation mappings
TRANSLATIONS = {
    'international_articles': {
        'en': 'International Articles',
        'tr': 'Uluslararası Makaleler'
    },
    'international_book_chapters': {
        'en': 'International Book Chapters', 
        'tr': 'Uluslararası Kitap Bölümleri'
    },
    'international_conference_papers': {
        'en': 'International Conference Papers',
        'tr': 'Uluslararası Bildiriler'
    },
    'national_conference_papers': {
        'en': 'National Conference Papers',
        'tr': 'Ulusal Bildiriler'
    },
    'national_articles': {
        'en': 'National Articles',
        'tr': 'Ulusal Makaleler'
    },
    'national_books': {
        'en': 'National Books',
        'tr': 'Ulusal Kitaplar'
    },
    'national_conferences': {
        'en': 'National Conferences', 
        'tr': 'Ulusal Konferanslar'
    }
}

def get_category_name(category_key, language='en'):
    """Get translated category name."""
    if category_key in TRANSLATIONS:
        return TRANSLATIONS[category_key][language]
    # Fallback: clean up the key
    return category_key.replace('_', ' ').title()

def extract_year(citation_text):
    """Extract year from citation - looks for (YYYY) format first."""
    # Try (YYYY) format first - most common in academic citations
    match = re.search(r'\((\d{4})\)', citation_text)
    if match:
        year = int(match.group(1))
        if 1950 <= year <= 2030:  # Valid range
            return year
    
    # Fallback: find any 4-digit number in valid range
    numbers = re.findall(r'\b(\d{4})\b', citation_text)
    for num in numbers:
        year = int(num)
        if 1950 <= year <= 2030:
            return year
    
    return None

def parse_citations(json_file):
    """Parse faculty JSON and organize citations by category and year."""
    with open(json_file, 'r', encoding='utf-8') as f:
        faculty_data = json.load(f)
    
    # Group citations by category and year
    organized = defaultdict(lambda: defaultdict(list))
    
    for faculty in faculty_data:
        if 'citations' not in faculty:
            continue
            
        faculty_name = faculty.get('name', 'Unknown')
        
        for category, citations in faculty['citations'].items():
            for citation in citations:
                year = extract_year(citation)
                if year:
                    organized[category][year].append({
                        'text': citation,
                        'author': faculty_name
                    })
    
    # Sort years (newest first)
    for category in organized:
        organized[category] = dict(sorted(
            organized[category].items(), 
            reverse=True
        ))
    
    return dict(organized)

def save_to_csv(organized_citations, filename, language='en'):
    """Save citations to CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Category', 'Year', 'Author', 'Citation'])
        
        for category, years in organized_citations.items():
            category_name = get_category_name(category, language)
            for year, citations in years.items():
                for citation in citations:
                    writer.writerow([
                        category_name,
                        year,
                        citation['author'],
                        citation['text']
                    ])

def save_to_html(organized_citations, filename, language='en'):
    """Save citations to HTML file."""
    with open(filename, 'w', encoding='utf-8') as f:
        for category, years in organized_citations.items():
            # Count total citations in this category
            total_count = sum(len(citations) for citations in years.values())
            
            # Category header with translation
            category_name = get_category_name(category, language)
            f.write(f'<h1><strong>{category_name} ({total_count} articles)</strong></h1>')
            
            # Years and citations
            for year, citations in years.items():
                f.write(f'<h2><strong>{year} </strong></h2>')
                f.write('<ol>')
                
                for citation in citations:
                    # Escape HTML entities
                    escaped_citation = html.escape(citation['text'])
                    f.write(f'<li>{escaped_citation}</li>')
                
                f.write('</ol>')

def print_metadata(organized_citations):
    """Print detailed metadata about citations."""
    print("\n=== CITATION METADATA ===")
    
    # Collect all years and count citations per year
    year_counts = defaultdict(int)
    category_year_counts = defaultdict(lambda: defaultdict(int))
    
    total_citations = 0
    
    for category, years in organized_citations.items():
        category_total = 0
        for year, citations in years.items():
            count = len(citations)
            year_counts[year] += count
            category_year_counts[category][year] = count
            category_total += count
            total_citations += count
        
        print(f"\n{get_category_name(category, 'en')}: {category_total} total citations")
        # Show year breakdown for this category
        for year in sorted(years.keys(), reverse=True):
            print(f"  {year}: {len(years[year])} citations")
    
    print(f"\n=== OVERALL STATISTICS ===")
    print(f"Total citations: {total_citations}")
    print(f"Year range: {min(year_counts.keys())} - {max(year_counts.keys())}")
    print(f"Total years covered: {len(year_counts)}")
    
    print(f"\n=== CITATIONS PER YEAR (ALL CATEGORIES) ===")
    for year in sorted(year_counts.keys(), reverse=True):
        print(f"{year}: {year_counts[year]} citations")
    
    print(f"\n=== TOP PRODUCTIVE YEARS ===")
    top_years = sorted(year_counts.items(), key=lambda x: x[1], reverse=True)[:5]
    for year, count in top_years:
        print(f"{year}: {count} citations")

def print_summary(organized_citations):
    """Print a simple summary."""
    total = 0
    for category, years in organized_citations.items():
        cat_total = sum(len(citations) for citations in years.values())
        total += cat_total
        print(f"{get_category_name(category, 'en')}: {cat_total} citations")
    
    print(f"Total citations with years: {total}")

def main():
    input_file = "complete_faculty_data.json"
    
    try:
        print("Parsing citations...")
        organized = parse_citations(input_file)
        
        print("\n=== SUMMARY ===")
        print_summary(organized)
        
        # Print detailed metadata
        print_metadata(organized)
        
        # Save to CSV (both languages)
        save_to_csv(organized, 'citations_en.csv', 'en')
        save_to_csv(organized, 'citations_tr.csv', 'tr')
        print("✓ Saved to citations_en.csv (English)")
        print("✓ Saved to citations_tr.csv (Turkish)")
        
        # Save to HTML (both languages)
        save_to_html(organized, 'citations_en.html', 'en')
        save_to_html(organized, 'citations_tr.html', 'tr')
        print("✓ Saved to citations_en.html (English)")
        print("✓ Saved to citations_tr.html (Turkish)")
        
        # Show example
        if organized:
            first_category = list(organized.keys())[0]
            first_year = list(organized[first_category].keys())[0]
            category_name = get_category_name(first_category, 'en')
            print(f"\nExample: {category_name} in {first_year}:")
            for i, citation in enumerate(organized[first_category][first_year][:2], 1):
                print(f"{i}. {citation['author']}: {citation['text'][:80]}...")
        
    except FileNotFoundError:
        print(f"Error: {input_file} not found. Run the first script first!")
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

GitHub Events

Total
  • Push event: 1
  • Public event: 1
Last Year
  • Push event: 1
  • Public event: 1