bis-scraper
Script to scrape speeches from Bank for International Settlements
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (18.9%) to scientific vocabulary
Repository
Script to scrape speeches from Bank for International Settlements
Basic Info
- Host: GitHub
- Owner: HanssonMagnus
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 25.4 KB
Statistics
- Stars: 8
- Watchers: 1
- Forks: 4
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
BIS Scraper
A Python package for downloading and processing central bank speeches from the Bank for International Settlements (BIS) website.
Overview
BIS Scraper allows you to download speeches from various central banks worldwide that are collected on the BIS website. It organizes these speeches by institution, downloads the PDFs, and can convert them to text format for further analysis.
Features
- Download speeches from the BIS website by date range
- Filter by specific institutions
- Convert PDFs to text format
- Clean organization structure for the downloaded files
- Efficient caching to avoid re-downloading existing files
- Command-line interface for easy usage
Documentation
- API Documentation: Detailed information about the package's APIs
- Test Coverage: Testing approach and requirements
- Project Plan: Current development status and roadmap
Installation
From PyPI (recommended)
```bash
pip install bis-scraper
```
From Source
```bash
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper
./install.sh  # This creates a virtual environment and installs the package
source .venv/bin/activate
```
Dependencies
The package requires Python 3.9+ and the following main dependencies:
- requests
- beautifulsoup4
- textract
- click
- pydantic
Usage
BIS Scraper provides two ways to use its functionality:
- Command-Line Interface (CLI): a terminal-based tool called `bis-scraper` for easy use
- Python API: functions and classes that can be imported into your Python scripts

Both methods provide the same core functionality but suit different use cases:
- Use the CLI for quick downloads or one-off tasks
- Use the Python API for integration with your own code or for more complex workflows
Command-Line Interface
Download Speeches
Download speeches for a specific date range:
```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31
```
Filter by institution:
```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --institutions "European Central Bank" --institutions "Federal Reserve System"
```
Force re-download of existing files:
```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --force
```
Convert to Text
Convert all downloaded PDFs to text:
```bash
bis-scraper convert
```
Convert only specific institutions:
```bash
bis-scraper convert --institutions "European Central Bank"
```
Run Both Steps
Run both scraping and conversion in one command:
```bash
bis-scraper run-all --start-date 2020-01-01 --end-date 2020-01-31
```
Helper Scripts
For large-scale scraping operations, we provide helper scripts in the scripts directory:
```bash
# Run a full scraping and conversion process
scripts/run_full_scrape.sh

# Analyze the results
scripts/analyze_results.sh
```
See Scripts README for more details.
Python API
Basic Usage
```python
import datetime
from pathlib import Path

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs

# Download speeches
result = scrape_bis(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    start_date=datetime.datetime(2020, 1, 1),
    end_date=datetime.datetime(2020, 1, 31),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Convert to text
convert_result = convert_pdfs(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Print results
print(f"Downloaded: {result.downloaded}, Skipped: {result.skipped}, Failed: {result.failed}")
print(f"Converted: {convert_result.successful}, Skipped: {convert_result.skipped}, Failed: {convert_result.failed}")
```
Advanced Usage
```python
import datetime
import logging
from pathlib import Path

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs
from bis_scraper.utils.constants import INSTITUTIONS
from bis_scraper.utils.institution_utils import get_all_institutions

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("bis_scraper.log"),
        logging.StreamHandler()
    ]
)

# Custom data directories
data_dir = Path("custom_data_dir")
log_dir = Path("logs")

# Get all available institutions
all_institutions = get_all_institutions()
print(f"Available institutions: {len(all_institutions)}")

# Select specific institutions
selected_institutions = [
    "European Central Bank",
    "Board of Governors of the Federal Reserve System",
    "Bank of England",
]

# Date range - past 3 months
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=90)

# Download with a limit of 10 speeches per institution
scrape_result = scrape_bis(
    data_dir=data_dir,
    log_dir=log_dir,
    start_date=start_date,
    end_date=end_date,
    institutions=selected_institutions,
    force=False,
    limit=10,
)

# Convert all downloaded speeches
convert_result = convert_pdfs(
    data_dir=data_dir,
    log_dir=log_dir,
    institutions=selected_institutions,
    force=False,
)

# Process the results
print(f"Downloaded {scrape_result.downloaded} speeches")
print(f"Skipped {scrape_result.skipped} speeches")
if scrape_result.failed > 0:
    print(f"Failed to download {scrape_result.failed} speeches")

print(f"Converted {convert_result.successful} PDFs to text")
print(f"Skipped {convert_result.skipped} already converted PDFs")
if convert_result.failed > 0:
    print(f"Failed to convert {convert_result.failed} PDFs")
    for code, error in convert_result.errors.items():
        print(f"  - {code}: {error}")
```
Data Organization
By default, the data is organized as follows:
```
data/
├── pdfs/                        # Raw PDF files
│   ├── european_central_bank/
│   │   ├── 200101a.pdf          # Speech from 2020-01-01, first speech of the day
│   │   ├── 200103b.pdf          # Speech from 2020-01-03, second speech of the day
│   │   └── metadata.json        # Structured metadata in JSON format
│   └── federal_reserve_system/
│       └── ...
└── texts/                       # Converted text files
    ├── european_central_bank/
    │   ├── 200101a.txt
    │   └── ...
    └── federal_reserve_system/
        └── ...
```
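The file names encode the speech date plus a per-day letter (so `200101a` is the first speech of 2020-01-01). The package itself does not, as far as this README shows, expose a parser for these codes, but a minimal sketch is easy to write; `parse_file_code` below is a hypothetical helper name, and it assumes the YYMMDD-plus-letter convention shown in the tree above, with two-digit years mapping to 20xx:

```python
import datetime
import re

def parse_file_code(code: str) -> tuple[datetime.date, str]:
    """Split a file code like '200101a' into its date and per-day letter.

    Assumes the YYMMDD + letter naming convention used for downloaded files,
    with two-digit years interpreted as 20xx.
    """
    m = re.fullmatch(r"(\d{2})(\d{2})(\d{2})([a-z])", code)
    if m is None:
        raise ValueError(f"Unrecognized file code: {code}")
    yy, mm, dd, letter = m.groups()
    return datetime.date(2000 + int(yy), int(mm), int(dd)), letter

date, letter = parse_file_code("200101a")
print(date.isoformat(), letter)  # → 2020-01-01 a
```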
Metadata JSON Format
Each institution directory contains a metadata.json file with structured information about the speeches:
```json
{
  "200101a": {
    "raw_text": "Original metadata text from the BIS website",
    "speech_type": "Speech",
    "speaker": "Ms Jane Smith",
    "role": "Governor of the Central Bank",
    "event": "Annual Banking Conference",
    "speech_date": "1 January 2020",
    "location": "Frankfurt, Germany",
    "organizer": "European Banking Association",
    "date": "2020-01-01"
  },
  "200103b": {
    ...
  }
}
```
The structured format makes it easier to extract specific information about each speech for analysis.
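Since the file is plain JSON, the standard library is enough to work with it. The sketch below parses an abridged sample matching the format above and indexes the speeches by ISO date; in practice you would load `data/pdfs/<institution>/metadata.json` with `json.load` instead of the inline string:

```python
import json

# Abridged sample matching the metadata.json format shown above
sample = """
{
  "200101a": {
    "speech_type": "Speech",
    "speaker": "Ms Jane Smith",
    "role": "Governor of the Central Bank",
    "speech_date": "1 January 2020",
    "date": "2020-01-01"
  }
}
"""

metadata = json.loads(sample)

# Index speech codes by ISO date for quick lookup
by_date = {entry["date"]: code for code, entry in metadata.items()}
print(by_date["2020-01-01"])  # → 200101a
```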
Extending with Custom Text Processing
You can extend the functionality to perform custom text processing on the downloaded speeches. Here's an example:
```python
import glob
import os
import re
from pathlib import Path

import pandas as pd

def analyze_speeches(data_dir, institution, keywords):
    """Analyze text files for keyword frequency."""
    # Path to text files for the institution
    institution_dir = Path(data_dir) / "texts" / institution.lower().replace(" ", "_")
    results = []

    # Process each text file
    for txt_file in glob.glob(f"{institution_dir}/*.txt"):
        file_code = os.path.basename(txt_file).split('.')[0]
        with open(txt_file, 'r', encoding='utf-8') as f:
            text = f.read().lower()
        # Count keywords
        word_counts = {}
        for keyword in keywords:
            pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
            word_counts[keyword] = len(re.findall(pattern, text))
        # Get total word count
        total_words = len(re.findall(r'\b\w+\b', text))
        # Add to results
        results.append({
            'file_code': file_code,
            'total_words': total_words,
            **word_counts
        })

    # Convert to DataFrame for analysis
    df = pd.DataFrame(results)
    return df

# Example usage
keywords = ['inflation', 'recession', 'policy', 'interest', 'rate']
results_df = analyze_speeches('data', 'European Central Bank', keywords)

# Display summary
print(f"Analyzed {len(results_df)} speeches")
print("\nAverage word counts:")
for keyword in keywords:
    print(f"- {keyword}: {results_df[keyword].mean():.2f}")

# Most mentioned keywords by speech
print("\nSpeeches with most 'inflation' mentions:")
print(results_df.sort_values('inflation', ascending=False)[['file_code', 'inflation']].head())
```
Development
Setting Up Development Environment
```bash
# Clone the repository
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper

# Run the install script (creates virtual environment and installs package)
./install.sh

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"
```
Running Tests
The project uses pytest for testing. Tests are organized into unit tests and integration tests.
```bash
# Run all tests
pytest

# Run only unit tests
pytest tests/unit/

# Run only integration tests
pytest tests/integration/

# Run tests with coverage report
pytest --cov=bis_scraper
```
Code Quality
This project uses several tools to ensure code quality:
- `black` for code formatting
- `isort` for import sorting
- `mypy` for type checking
- `ruff` for linting
You can run all these checks using the provided script:
```bash
# Check code quality
./check_code_quality.py

# Fix issues automatically where possible
./check_code_quality.py --fix
```
Or run each tool individually:
```bash
# Format code
black bis_scraper tests
isort bis_scraper tests

# Check types
mypy bis_scraper

# Run linter
ruff bis_scraper tests
```
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request

Before submitting your PR, please make sure:
- All tests pass
- Code is formatted with Black
- Imports are sorted with isort
- Type hints are correct (mypy)
- Linting passes with ruff
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Bank for International Settlements for providing access to central bank speeches
- All the central banks for making their speeches publicly available
Owner
- Name: Magnus Hansson
- Login: HanssonMagnus
- Kind: user
- Location: Stockholm, Sweden
- Website: https://magnushansson.xyz/
- Repositories: 1
- Profile: https://github.com/HanssonMagnus
Citation (CITATION.bib)
@TECHREPORT{RePEc:hhs:gunwpe:0811,
title = {Evolution of topics in central bank speech communication},
author = {Hansson, Magnus},
year = {2021},
institution = {University of Gothenburg, Department of Economics},
type = {Working Papers in Economics},
number = {811},
abstract = {This paper studies the content of central bank speech communication from 1997 through 2020 and asks the following questions: (i) What global topics do central banks talk about? (ii) How do these topics evolve over time? I turn to natural language processing, and more specifically Dynamic Topic Models, to answer these questions. The analysis consists of an aggregate study of nine major central banks and a case study of the Federal Reserve, which allows for region specific control variables. I show that: (i) Central banks address a broad range of topics. (ii) The topics are well captured by Dynamic Topic Models. (iii) The global topics exhibit strong and significant autoregressive properties not easily explained by financial control variables.},
keywords = {Central bank communication; Monetary policy; Textual analysis; Dynamic topic models; Narratives},
url = {https://EconPapers.repec.org/RePEc:hhs:gunwpe:0811}
}
GitHub Events
Total
- Push event: 7
Last Year
- Push event: 7
Dependencies
- beautifulsoup4 *
- requests *
- textract *