bis-scraper

Script to scrape speeches from Bank for International Settlements

https://github.com/hanssonmagnus/bis-scraper

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Script to scrape speeches from Bank for International Settlements

Basic Info
  • Host: GitHub
  • Owner: HanssonMagnus
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 25.4 KB
Statistics
  • Stars: 8
  • Watchers: 1
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

BIS Scraper


A Python package for downloading and processing central bank speeches from the Bank for International Settlements (BIS) website.

Overview

BIS Scraper allows you to download speeches from various central banks worldwide that are collected on the BIS website. It organizes these speeches by institution, downloads the PDFs, and can convert them to text format for further analysis.

Features

  • Download speeches from the BIS website by date range
  • Filter by specific institutions
  • Convert PDFs to text format
  • Clean organization structure for the downloaded files
  • Efficient caching to avoid re-downloading existing files
  • Command-line interface for easy usage
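The skip-if-exists caching mentioned above can be sketched roughly as follows. This is a hypothetical illustration with an invented helper name (`download_if_missing`), not the package's actual code; the real implementation may differ.

```python
import urllib.request
from pathlib import Path

def download_if_missing(url: str, dest: Path, force: bool = False) -> bool:
    """Download url to dest unless the file already exists (illustrative sketch).

    Returns True if a download happened, False if the cached file was kept.
    """
    if dest.exists() and not force:
        return False  # cached: skip re-download
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return True
```

A check like this is what makes repeated runs over the same date range cheap: only files not yet on disk are fetched, and `--force` bypasses the cache.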

Documentation

Installation

From PyPI (recommended)

```bash
pip install bis-scraper
```

From Source

```bash
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper
./install.sh  # This creates a virtual environment and installs the package
source .venv/bin/activate
```

Dependencies

The package requires Python 3.9+ and the following main dependencies:

  • requests
  • beautifulsoup4
  • textract
  • click
  • pydantic

Usage

BIS Scraper provides two ways to use its functionality:

  1. Command-Line Interface (CLI): A terminal-based tool called bis-scraper for easy use
  2. Python API: Functions and classes that can be imported into your Python scripts

Both methods provide the same core functionality but are suited for different use cases:

  • Use the CLI for quick downloads or one-off tasks
  • Use the Python API for integration with your own code or for more complex workflows

Command-Line Interface

Download Speeches

Download speeches for a specific date range:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31
```

Filter by institution:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --institutions "European Central Bank" --institutions "Federal Reserve System"
```

Force re-download of existing files:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --force
```

Convert to Text

Convert all downloaded PDFs to text:

```bash
bis-scraper convert
```

Convert only specific institutions:

```bash
bis-scraper convert --institutions "European Central Bank"
```

Run Both Steps

Run both scraping and conversion in one command:

```bash
bis-scraper run-all --start-date 2020-01-01 --end-date 2020-01-31
```

Helper Scripts

For large-scale scraping operations, we provide helper scripts in the scripts directory:

```bash
# Run a full scraping and conversion process
scripts/run_full_scrape.sh

# Analyze the results
scripts/analyze_results.sh
```

See Scripts README for more details.

Python API

Basic Usage

```python
from pathlib import Path
import datetime

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs

# Download speeches
result = scrape_bis(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    start_date=datetime.datetime(2020, 1, 1),
    end_date=datetime.datetime(2020, 1, 31),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Convert to text
convert_result = convert_pdfs(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Print results
print(f"Downloaded: {result.downloaded}, Skipped: {result.skipped}, Failed: {result.failed}")
print(f"Converted: {convert_result.successful}, Skipped: {convert_result.skipped}, Failed: {convert_result.failed}")
```

Advanced Usage

```python
import datetime
import logging
from pathlib import Path

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs
from bis_scraper.utils.constants import INSTITUTIONS
from bis_scraper.utils.institution_utils import get_all_institutions

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("bis_scraper.log"),
        logging.StreamHandler(),
    ],
)

# Custom data directories
data_dir = Path("custom_data_dir")
log_dir = Path("logs")

# Get all available institutions
all_institutions = get_all_institutions()
print(f"Available institutions: {len(all_institutions)}")

# Select specific institutions
selected_institutions = [
    "European Central Bank",
    "Board of Governors of the Federal Reserve System",
    "Bank of England",
]

# Date range - past 3 months
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=90)

# Download with a limit of 10 speeches per institution
scrape_result = scrape_bis(
    data_dir=data_dir,
    log_dir=log_dir,
    start_date=start_date,
    end_date=end_date,
    institutions=selected_institutions,
    force=False,
    limit=10,
)

# Convert all downloaded speeches
convert_result = convert_pdfs(
    data_dir=data_dir,
    log_dir=log_dir,
    institutions=selected_institutions,
    force=False,
)

# Process the results
print(f"Downloaded {scrape_result.downloaded} speeches")
print(f"Skipped {scrape_result.skipped} speeches")
if scrape_result.failed > 0:
    print(f"Failed to download {scrape_result.failed} speeches")

print(f"Converted {convert_result.successful} PDFs to text")
print(f"Skipped {convert_result.skipped} already converted PDFs")
if convert_result.failed > 0:
    print(f"Failed to convert {convert_result.failed} PDFs")
    for code, error in convert_result.errors.items():
        print(f"  - {code}: {error}")
```

Data Organization

By default, the data is organized as follows:

```
data/
├── pdfs/                        # Raw PDF files
│   ├── european_central_bank/
│   │   ├── 200101a.pdf          # Speech from 2020-01-01, first speech of the day
│   │   ├── 200103b.pdf          # Speech from 2020-01-03, second speech of the day
│   │   └── metadata.json        # Structured metadata in JSON format
│   └── federal_reserve_system/
│       └── ...
└── texts/                       # Converted text files
    ├── european_central_bank/
    │   ├── 200101a.txt
    │   └── ...
    └── federal_reserve_system/
        └── ...
```
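Judging from the layout above, file codes encode the speech date as YYMMDD plus a letter ordering speeches within a day. A small helper to decode them might look like this (a hypothetical sketch, `parse_speech_code` is not part of the package's API):

```python
import datetime

def parse_speech_code(code: str) -> tuple:
    """Parse a file code like '200103b' into (date, index of speech that day)."""
    date = datetime.datetime.strptime(code[:6], "%y%m%d").date()
    index = ord(code[6]) - ord("a")  # 'a' -> 0 (first speech of the day)
    return date, index

print(parse_speech_code("200103b"))  # (datetime.date(2020, 1, 3), 1)
```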

Metadata JSON Format

Each institution directory contains a metadata.json file with structured information about the speeches:

```json
{
  "200101a": {
    "raw_text": "Original metadata text from the BIS website",
    "speech_type": "Speech",
    "speaker": "Ms Jane Smith",
    "role": "Governor of the Central Bank",
    "event": "Annual Banking Conference",
    "speech_date": "1 January 2020",
    "location": "Frankfurt, Germany",
    "organizer": "European Banking Association",
    "date": "2020-01-01"
  },
  "200103b": { ... }
}
```

The structured format makes it easier to extract specific information about each speech for analysis.
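For instance, loading a metadata file and listing the speeches it describes takes only a few lines. This sketch assumes the directory layout shown above; `load_metadata` is an illustrative helper, not part of the package's API:

```python
import json
from pathlib import Path

def load_metadata(data_dir: Path, institution_slug: str) -> dict:
    """Load metadata.json for one institution (per the layout above)."""
    path = data_dir / "pdfs" / institution_slug / "metadata.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical usage: print speaker and date for each downloaded speech
meta_path = Path("data") / "pdfs" / "european_central_bank" / "metadata.json"
if meta_path.exists():
    for code, entry in load_metadata(Path("data"), "european_central_bank").items():
        print(code, entry["speaker"], entry["date"])
```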

Extending with Custom Text Processing

You can extend the functionality to perform custom text processing on the downloaded speeches. Here's an example:

```python
import glob
import os
import re
from pathlib import Path

import pandas as pd

def analyze_speeches(data_dir, institution, keywords):
    """Analyze text files for keyword frequency."""
    # Path to text files for the institution
    institution_dir = Path(data_dir) / "texts" / institution.lower().replace(" ", "_")
    results = []

    # Process each text file
    for txt_file in glob.glob(f"{institution_dir}/*.txt"):
        file_code = os.path.basename(txt_file).split('.')[0]

        with open(txt_file, 'r', encoding='utf-8') as f:
            text = f.read().lower()

            # Count keywords
            word_counts = {}
            for keyword in keywords:
                pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
                word_counts[keyword] = len(re.findall(pattern, text))

            # Get total word count
            total_words = len(re.findall(r'\b\w+\b', text))

            # Add to results
            results.append({
                'file_code': file_code,
                'total_words': total_words,
                **word_counts
            })

    # Convert to DataFrame for analysis
    df = pd.DataFrame(results)
    return df

# Example usage
keywords = ['inflation', 'recession', 'policy', 'interest', 'rate']
results_df = analyze_speeches('data', 'European Central Bank', keywords)

# Display summary
print(f"Analyzed {len(results_df)} speeches")
print("\nAverage word counts:")
for keyword in keywords:
    print(f"- {keyword}: {results_df[keyword].mean():.2f}")

# Most mentioned keywords by speech
print("\nSpeeches with most 'inflation' mentions:")
print(results_df.sort_values('inflation', ascending=False)[['file_code', 'inflation']].head())
```

Development

Setting Up Development Environment

```bash
# Clone the repository
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper

# Run the install script (creates virtual environment and installs package)
./install.sh

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"
```

Running Tests

The project uses pytest for testing. Tests are organized into unit tests and integration tests.

```bash
# Run all tests
pytest

# Run only unit tests
pytest tests/unit/

# Run only integration tests
pytest tests/integration/

# Run tests with coverage report
pytest --cov=bis_scraper
```

Code Quality

This project uses several tools to ensure code quality:

  • black for code formatting
  • isort for import sorting
  • mypy for type checking
  • ruff for linting

You can run all these checks using the provided script:

```bash
# Check code quality
./check_code_quality.py

# Fix issues automatically where possible
./check_code_quality.py --fix
```

Or run each tool individually:

```bash
# Format code
black bis_scraper tests
isort bis_scraper tests

# Check types
mypy bis_scraper

# Run linter
ruff bis_scraper tests
```

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Before submitting your PR, please make sure:

  • All tests pass
  • Code is formatted with Black
  • Imports are sorted with isort
  • Type hints are correct (mypy)
  • Linting passes with ruff

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Bank for International Settlements for providing access to central bank speeches
  • All the central banks for making their speeches publicly available

Owner

  • Name: Magnus Hansson
  • Login: HanssonMagnus
  • Kind: user
  • Location: Stockholm, Sweden

Citation (CITATION.bib)

@TECHREPORT{RePEc:hhs:gunwpe:0811,
title = {Evolution of topics in central bank speech communication},
author = {Hansson, Magnus},
year = {2021},
institution = {University of Gothenburg, Department of Economics},
type = {Working Papers in Economics},
number = {811},
abstract = {This paper studies the content of central bank speech communication from 1997 through 2020 and asks the following questions: (i) What global topics do central banks talk about? (ii) How do these topics evolve over time? I turn to natural language processing, and more specifically Dynamic Topic Models, to answer these questions. The analysis consists of an aggregate study of nine major central banks and a case study of the Federal Reserve, which allows for region specific control variables. I show that: (i) Central banks address a broad range of topics. (ii) The topics are well captured by Dynamic Topic Models. (iii) The global topics exhibit strong and significant autoregressive properties not easily explained by financial control variables.},
keywords = {Central bank communication; Monetary policy; Textual analysis; Dynamic topic models; Narratives},
url = {https://EconPapers.repec.org/RePEc:hhs:gunwpe:0811}
}

GitHub Events

Total
  • Push event: 7
Last Year
  • Push event: 7

Dependencies

Pipfile pypi
  • beautifulsoup4 *
  • requests *
  • textract *