boun-mis-citations
Bogazici University MIS Department Citation Scraper
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Repository
Bogazici University MIS Department Citation Scraper
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Bogazici University MIS Department Citation Scraper
A comprehensive web scraping and analysis tool for extracting faculty publications and citations from the Bogazici University Management Information Systems (MIS) department website.
🎯 Project Overview
This project consists of two main components:
1. Faculty Scraper (faculty_scraper.py) - Extracts faculty profiles and publications from the MIS website
2. Citation Analyzer (citation_analyzer.py) - Processes and organizes the scraped data into structured formats
📊 Features
Web Scraping Capabilities
- ✅ Extracts faculty data from multiple department pages (full-time, part-time, contributing faculty, teaching assistants)
- ✅ Comprehensive profile scraping (contact info, education, research interests)
- ✅ Citation-only extraction mode for faster processing
- ✅ Automatic URL deduplication
- ✅ Respectful rate limiting (configurable delays)
- ✅ Robust error handling and retry mechanisms
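The rate limiting and retry behavior listed above can be sketched as follows. This is a minimal illustration, not the project's actual API: `backoff_delays` and `fetch_with_retry` are hypothetical helper names, and the exponential backoff schedule is an assumption.

```python
import time

def backoff_delays(base_delay, retries):
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** i) for i in range(retries)]

def fetch_with_retry(fetch, url, base_delay=1.0, retries=3):
    """Call fetch(url), sleeping with exponential backoff after each failure."""
    last_error = None
    for delay in backoff_delays(base_delay, retries):
        try:
            return fetch(url)
        except OSError as error:  # network-level failures
            last_error = error
            time.sleep(delay)
    raise last_error
```

Passing the fetch function as a callable (e.g. a wrapper around `requests.get`) keeps the retry policy independent of the HTTP library.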
Data Processing & Analysis
- 📈 Year-based citation organization (1950-2030 range)
- 🌍 Bilingual support (English/Turkish)
- 📋 Multiple export formats (JSON, CSV, HTML)
- 📊 Statistical analysis and metadata generation
- 🏷️ Citation categorization by publication type
🚀 Quick Start
Prerequisites
```bash
pip install requests beautifulsoup4 lxml
```
Basic Usage
```bash
# Step 1: Scrape faculty data
python faculty_scraper.py

# Step 2: Process and analyze citations
python citation_analyzer.py
```
📁 File Structure
```
├── faculty_scraper.py          # Main scraping engine
├── citation_analyzer.py        # Data processing and export
├── complete_faculty_data.json  # Raw scraped data (generated)
├── citations_en.csv            # English CSV export (generated)
├── citations_tr.csv            # Turkish CSV export (generated)
├── citations_en.html           # English HTML export (generated)
├── citations_tr.html           # Turkish HTML export (generated)
└── README.md                   # This file
```
🔧 Configuration Options
Scraping Parameters
```python
# Adjust delay between requests (seconds)
delay = 1.0  # Default: 1 second

# Target URLs (customizable)
faculty_pages = [
    "https://mis.bogazici.edu.tr/fulltimefaculty",
    "https://mis.bogazici.edu.tr/parttimefaculty",
    "https://mis.bogazici.edu.tr/facultymemberscontributingtodepartment",
    "https://mis.bogazici.edu.tr/teachingassistants",
]
```
Processing Options
```python
# Year extraction range
YEAR_RANGE = (1950, 2030)

# Supported languages
LANGUAGES = ['en', 'tr']

# Publication categories
CATEGORIES = [
    'international_articles',
    'international_book_chapters',
    'national_articles',
    'international_conference_papers',
    'national_conference_papers',
]
```
📈 Output Formats
JSON Structure
```json
{
  "name": "Faculty Name",
  "email": "email@example.com",
  "citations": {
    "international_articles": ["Citation 1", "Citation 2"],
    "international_conference_papers": ["Paper 1", "Paper 2"]
  }
}
```
CSV Columns
| Column   | Description                  |
|----------|------------------------------|
| Category | Publication type (localized) |
| Year     | Publication year             |
| Author   | Faculty member name          |
| Citation | Full citation text           |
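A consumer can read these columns back with Python's standard `csv` module. The sample row below is illustrative, not taken from real output:

```python
import csv
import io

# Illustrative sample matching the exported column layout
sample = io.StringIO(
    "Category,Year,Author,Citation\n"
    'International Articles,2023,Faculty Name,"Example citation text (2023)"\n'
)

rows = list(csv.DictReader(sample))
print(rows[0]["Category"], rows[0]["Year"])  # International Articles 2023
```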
HTML Format
- Organized by publication category
- Chronologically sorted (newest first)
- Citation counts per category/year
- Clean, readable formatting
⚡ Performance Optimizations
Algorithm Complexity
- URL Extraction: O(n) where n = number of faculty links
- Data Processing: O(m) where m = total citations
- Year Extraction: O(1) per citation using regex optimization
- Memory Usage: O(k) where k = total scraped data size
Built-in Optimizations
- ✅ Single-pass citation processing
- ✅ Hash-based URL deduplication: O(1) lookup
- ✅ Lazy loading with generators for large datasets
- ✅ In-place text processing to minimize memory allocation
- ✅ Early termination on invalid year ranges
- ✅ Session reuse for HTTP connection pooling
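The hash-based deduplication above amounts to a single pass over the link list with a set for O(1) membership checks. A minimal sketch (the function name is illustrative):

```python
def dedupe_urls(urls):
    """Remove duplicate URLs in one pass, preserving first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique
```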
Best Practices Implemented
```python
# Input validation with bounds checking
if not (1950 <= year <= 2030):
    return None

# Hash map for O(1) category lookups
TRANSLATIONS = {...}  # Pre-computed translations

# Single regex compilation for performance
YEAR_PATTERN = re.compile(r'\((\d{4})\)')
```
🛡️ Error Handling
Network Resilience
- HTTP timeout handling
- Connection retry mechanisms
- Graceful degradation on failed requests
- Status code validation
Data Validation
- Empty input checking
- Null value handling
- Duplicate detection and removal
- Year range validation
- Character encoding safeguards
📊 Statistical Output
The analyzer provides comprehensive statistics:
```
=== CITATION METADATA ===
International Articles: 245 total citations
International Conference Papers: 189 total citations
...

=== OVERALL STATISTICS ===
Total citations: 1,234
Year range: 1995 - 2024
Top productive years: 2023 (89), 2022 (76), 2021 (68)
```
⚠️ Legal & Ethical Considerations
Compliance Features
- ✅ Respectful crawling with configurable delays
- ✅ User-Agent headers for transparency
- ✅ No authentication bypass attempts
- ✅ Public data only (no private content access)
- ✅ Rate limiting to prevent server overload
Usage Guidelines
- Use responsibly and within reasonable limits
- Respect the website's robots.txt if present
- Consider reaching out to the institution for bulk data needs
- Ensure compliance with local data protection regulations
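Checking robots.txt can be done with Python's standard `urllib.robotparser`. The rules below are hypothetical and are parsed from text rather than fetched, so the sketch runs without network access:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("*", "https://example.com/faculty"))       # allowed
print(parser.can_fetch("*", "https://example.com/private/data"))  # disallowed
```

In a real run, `parser.set_url(".../robots.txt")` followed by `parser.read()` would fetch the live rules instead.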
🔄 Development Workflow
Adding New Faculty Pages
```python
# Extend the URL list in main()
new_urls = scraper.get_faculty_urls("https://mis.bogazici.edu.tr/newpage")
urls.extend(new_urls)
```
Adding Publication Categories
```python
# Update the TRANSLATIONS dictionary
TRANSLATIONS['new_category'] = {
    'en': 'New Category',
    'tr': 'Yeni Kategori'
}
```
🤝 Contributing
- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Follow the existing code style and optimization patterns
- Add appropriate error handling and input validation
- Test with small datasets before full runs
- Submit a pull request with clear documentation
📝 Changelog
v1.0.0
- Initial release
- Basic faculty scraping functionality
- Citation extraction and organization
- Multi-language support
- Statistical analysis features
🆘 Troubleshooting
Common Issues
"No faculty URLs found"
- Check if the target website structure has changed
- Verify CSS selectors in get_faculty_urls()
- Ensure network connectivity
"Error scraping profile"
- Website may be temporarily unavailable
- Check if profile page structure has changed
- Increase delay between requests
"Year extraction failing"
- Citations may use non-standard date formats
- Update regex patterns in extract_year()
- Check citation text encoding
Performance Issues
- Reduce concurrent requests (increase delay)
- Process data in smaller batches
- Use citation-only mode for faster extraction
- Check available memory for large datasets
📧 Support
For questions, issues, or contributions, please create an issue in the repository or contact the development team.
Note: This tool is designed for academic and research purposes. Please use responsibly and in accordance with the target website's terms of service.
Owner
- Name: Yusuf Akçakaya
- Login: fusuyfusuy
- Kind: user
- Location: Istanbul, TR
- Company: Bogazici University
- Website: https://hepyeni.net
- Repositories: 2
- Profile: https://github.com/fusuyfusuy
Citation (citation_analyzer.py)
import json
import re
import csv
import html
from collections import defaultdict

# Translation mappings
TRANSLATIONS = {
    'international_articles': {
        'en': 'International Articles',
        'tr': 'Uluslararası Makaleler'
    },
    'international_book_chapters': {
        'en': 'International Book Chapters',
        'tr': 'Uluslararası Kitap Bölümleri'
    },
    'international_conference_papers': {
        'en': 'International Conference Papers',
        'tr': 'Uluslararası Bildiriler'
    },
    'national_conference_papers': {
        'en': 'National Conference Papers',
        'tr': 'Ulusal Bildiriler'
    },
    'national_articles': {
        'en': 'National Articles',
        'tr': 'Ulusal Makaleler'
    },
    'national_books': {
        'en': 'National Books',
        'tr': 'Ulusal Kitaplar'
    },
    'national_conferences': {
        'en': 'National Conferences',
        'tr': 'Ulusal Konferanslar'
    }
}


def get_category_name(category_key, language='en'):
    """Get translated category name."""
    if category_key in TRANSLATIONS:
        return TRANSLATIONS[category_key][language]
    # Fallback: clean up the key
    return category_key.replace('_', ' ').title()


def extract_year(citation_text):
    """Extract year from citation - looks for (YYYY) format first."""
    # Try (YYYY) format first - most common in academic citations
    match = re.search(r'\((\d{4})\)', citation_text)
    if match:
        year = int(match.group(1))
        if 1950 <= year <= 2030:  # Valid range
            return year
    # Fallback: find any 4-digit number in valid range
    numbers = re.findall(r'\b(\d{4})\b', citation_text)
    for num in numbers:
        year = int(num)
        if 1950 <= year <= 2030:
            return year
    return None


def parse_citations(json_file):
    """Parse faculty JSON and organize citations by category and year."""
    with open(json_file, 'r', encoding='utf-8') as f:
        faculty_data = json.load(f)
    # Group citations by category and year
    organized = defaultdict(lambda: defaultdict(list))
    for faculty in faculty_data:
        if 'citations' not in faculty:
            continue
        faculty_name = faculty.get('name', 'Unknown')
        for category, citations in faculty['citations'].items():
            for citation in citations:
                year = extract_year(citation)
                if year:
                    organized[category][year].append({
                        'text': citation,
                        'author': faculty_name
                    })
    # Sort years (newest first)
    for category in organized:
        organized[category] = dict(sorted(
            organized[category].items(),
            reverse=True
        ))
    return dict(organized)


def save_to_csv(organized_citations, filename, language='en'):
    """Save citations to CSV file."""
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Category', 'Year', 'Author', 'Citation'])
        for category, years in organized_citations.items():
            category_name = get_category_name(category, language)
            for year, citations in years.items():
                for citation in citations:
                    writer.writerow([
                        category_name,
                        year,
                        citation['author'],
                        citation['text']
                    ])


def save_to_html(organized_citations, filename, language='en'):
    """Save citations to HTML file."""
    with open(filename, 'w', encoding='utf-8') as f:
        for category, years in organized_citations.items():
            # Count total citations in this category
            total_count = sum(len(citations) for citations in years.values())
            # Category header with translation
            category_name = get_category_name(category, language)
            f.write(f'<h1><strong>{category_name} ({total_count} articles)</strong></h1>')
            # Years and citations
            for year, citations in years.items():
                f.write(f'<h2><strong>{year}</strong></h2>')
                f.write('<ol>')
                for citation in citations:
                    # Escape HTML entities
                    escaped_citation = html.escape(citation['text'])
                    f.write(f'<li>{escaped_citation}</li>')
                f.write('</ol>')


def print_metadata(organized_citations):
    """Print detailed metadata about citations."""
    print("\n=== CITATION METADATA ===")
    # Collect all years and count citations per year
    year_counts = defaultdict(int)
    category_year_counts = defaultdict(lambda: defaultdict(int))
    total_citations = 0
    for category, years in organized_citations.items():
        category_total = 0
        for year, citations in years.items():
            count = len(citations)
            year_counts[year] += count
            category_year_counts[category][year] = count
            category_total += count
            total_citations += count
        print(f"\n{get_category_name(category, 'en')}: {category_total} total citations")
        # Show year breakdown for this category
        for year in sorted(years.keys(), reverse=True):
            print(f"  {year}: {len(years[year])} citations")
    print("\n=== OVERALL STATISTICS ===")
    print(f"Total citations: {total_citations}")
    print(f"Year range: {min(year_counts.keys())} - {max(year_counts.keys())}")
    print(f"Total years covered: {len(year_counts)}")
    print("\n=== CITATIONS PER YEAR (ALL CATEGORIES) ===")
    for year in sorted(year_counts.keys(), reverse=True):
        print(f"{year}: {year_counts[year]} citations")
    print("\n=== TOP PRODUCTIVE YEARS ===")
    top_years = sorted(year_counts.items(), key=lambda x: x[1], reverse=True)[:5]
    for year, count in top_years:
        print(f"{year}: {count} citations")


def print_summary(organized_citations):
    """Print a simple summary."""
    total = 0
    for category, years in organized_citations.items():
        cat_total = sum(len(citations) for citations in years.values())
        total += cat_total
        print(f"{get_category_name(category, 'en')}: {cat_total} citations")
    print(f"Total citations with years: {total}")


def main():
    input_file = "complete_faculty_data.json"
    try:
        print("Parsing citations...")
        organized = parse_citations(input_file)
        print("\n=== SUMMARY ===")
        print_summary(organized)
        # Print detailed metadata
        print_metadata(organized)
        # Save to CSV (both languages)
        save_to_csv(organized, 'citations_en.csv', 'en')
        save_to_csv(organized, 'citations_tr.csv', 'tr')
        print("✓ Saved to citations_en.csv (English)")
        print("✓ Saved to citations_tr.csv (Turkish)")
        # Save to HTML (both languages)
        save_to_html(organized, 'citations_en.html', 'en')
        save_to_html(organized, 'citations_tr.html', 'tr')
        print("✓ Saved to citations_en.html (English)")
        print("✓ Saved to citations_tr.html (Turkish)")
        # Show example
        if organized:
            first_category = list(organized.keys())[0]
            first_year = list(organized[first_category].keys())[0]
            category_name = get_category_name(first_category, 'en')
            print(f"\nExample: {category_name} in {first_year}:")
            for i, citation in enumerate(organized[first_category][first_year][:2], 1):
                print(f"{i}. {citation['author']}: {citation['text'][:80]}...")
    except FileNotFoundError:
        print(f"Error: {input_file} not found. Run the first script first!")
    except Exception as e:
        print(f"Error: {e}")


if __name__ == "__main__":
    main()
GitHub Events
Total
- Push event: 1
- Public event: 1
Last Year
- Push event: 1
- Public event: 1