bis-scraper

Script to scrape speeches from Bank for International Settlements

https://github.com/hanssonmagnus/bis-scraper

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

Script to scrape speeches from Bank for International Settlements

Basic Info
  • Host: GitHub
  • Owner: HanssonMagnus
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Size: 25.4 KB
Statistics
  • Stars: 8
  • Watchers: 1
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

BIS Scraper


A Python package for downloading and processing central bank speeches from the Bank for International Settlements (BIS) website.

Overview

BIS Scraper allows you to download speeches from various central banks worldwide that are collected on the BIS website. It organizes these speeches by institution, downloads the PDFs, and can convert them to text format for further analysis.

Features

  • Download speeches from the BIS website by date range
  • Filter by specific institutions
  • Convert PDFs to text format
  • Clean organization structure for the downloaded files
  • Efficient caching to avoid re-downloading existing files
  • Command-line interface for easy usage
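The skip-if-exists caching mentioned above can be sketched roughly as follows. This is a hypothetical illustration with an invented helper name (`download_if_missing`), not the package's actual code; the real implementation may differ.

```python
import urllib.request
from pathlib import Path

def download_if_missing(url: str, dest: Path, force: bool = False) -> bool:
    """Download url to dest unless the file already exists (illustrative sketch).

    Returns True if a download happened, False if the cached file was kept.
    """
    if dest.exists() and not force:
        return False  # cached: skip re-download
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return True
```

A check like this is what makes repeated runs over the same date range cheap: only files not yet on disk are fetched, and `--force` bypasses the cache.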

Documentation

Installation

From PyPI (recommended)

```bash
pip install bis-scraper
```

From Source

```bash
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper
./install.sh  # This creates a virtual environment and installs the package
source .venv/bin/activate
```

Dependencies

The package requires Python 3.9+ and the following main dependencies:

  • requests
  • beautifulsoup4
  • textract
  • click
  • pydantic

Usage

BIS Scraper provides two ways to use its functionality:

  1. Command-Line Interface (CLI): A terminal-based tool called bis-scraper for easy use
  2. Python API: Functions and classes that can be imported into your Python scripts

Both methods provide the same core functionality but are suited for different use cases:

  • Use the CLI for quick downloads or one-off tasks
  • Use the Python API for integration with your own code or for more complex workflows

Command-Line Interface

Download Speeches

Download speeches for a specific date range:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31
```

Filter by institution:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --institutions "European Central Bank" --institutions "Federal Reserve System"
```

Force re-download of existing files:

```bash
bis-scraper scrape --start-date 2020-01-01 --end-date 2020-01-31 --force
```

Convert to Text

Convert all downloaded PDFs to text:

```bash
bis-scraper convert
```

Convert only specific institutions:

```bash
bis-scraper convert --institutions "European Central Bank"
```

Run Both Steps

Run both scraping and conversion in one command:

```bash
bis-scraper run-all --start-date 2020-01-01 --end-date 2020-01-31
```

Helper Scripts

For large-scale scraping operations, we provide helper scripts in the scripts directory:

```bash
# Run a full scraping and conversion process
scripts/run_full_scrape.sh

# Analyze the results
scripts/analyze_results.sh
```

See Scripts README for more details.

Python API

Basic Usage

```python
from pathlib import Path
import datetime

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs

# Download speeches
result = scrape_bis(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    start_date=datetime.datetime(2020, 1, 1),
    end_date=datetime.datetime(2020, 1, 31),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Convert to text
convert_result = convert_pdfs(
    data_dir=Path("data"),
    log_dir=Path("logs"),
    institutions=["European Central Bank"],
    force=False,
    limit=None,
)

# Print results
print(f"Downloaded: {result.downloaded}, Skipped: {result.skipped}, Failed: {result.failed}")
print(f"Converted: {convert_result.successful}, Skipped: {convert_result.skipped}, Failed: {convert_result.failed}")
```

Advanced Usage

```python
import datetime
import logging
from pathlib import Path

from bis_scraper.scrapers.controller import scrape_bis
from bis_scraper.converters.controller import convert_pdfs
from bis_scraper.utils.constants import INSTITUTIONS
from bis_scraper.utils.institution_utils import get_all_institutions

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("bis_scraper.log"),
        logging.StreamHandler(),
    ],
)

# Custom data directories
data_dir = Path("custom_data_dir")
log_dir = Path("logs")

# Get all available institutions
all_institutions = get_all_institutions()
print(f"Available institutions: {len(all_institutions)}")

# Select specific institutions
selected_institutions = [
    "European Central Bank",
    "Board of Governors of the Federal Reserve System",
    "Bank of England",
]

# Date range - past 3 months
end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=90)

# Download with a limit of 10 speeches per institution
scrape_result = scrape_bis(
    data_dir=data_dir,
    log_dir=log_dir,
    start_date=start_date,
    end_date=end_date,
    institutions=selected_institutions,
    force=False,
    limit=10,
)

# Convert all downloaded speeches
convert_result = convert_pdfs(
    data_dir=data_dir,
    log_dir=log_dir,
    institutions=selected_institutions,
    force=False,
)

# Process the results
print(f"Downloaded {scrape_result.downloaded} speeches")
print(f"Skipped {scrape_result.skipped} speeches")
if scrape_result.failed > 0:
    print(f"Failed to download {scrape_result.failed} speeches")

print(f"Converted {convert_result.successful} PDFs to text")
print(f"Skipped {convert_result.skipped} already converted PDFs")
if convert_result.failed > 0:
    print(f"Failed to convert {convert_result.failed} PDFs")
    for code, error in convert_result.errors.items():
        print(f"  - {code}: {error}")
```

Data Organization

By default, the data is organized as follows:

```
data/
├── pdfs/                        # Raw PDF files
│   ├── european_central_bank/
│   │   ├── 200101a.pdf          # Speech from 2020-01-01, first speech of the day
│   │   ├── 200103b.pdf          # Speech from 2020-01-03, second speech of the day
│   │   └── metadata.json        # Structured metadata in JSON format
│   └── federal_reserve_system/
│       └── ...
└── texts/                       # Converted text files
    ├── european_central_bank/
    │   ├── 200101a.txt
    │   └── ...
    └── federal_reserve_system/
        └── ...
```
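Judging from the layout above, file codes encode the speech date as YYMMDD plus a letter ordering speeches within a day. A small helper to decode them might look like this (a hypothetical sketch, `parse_speech_code` is not part of the package's API):

```python
import datetime

def parse_speech_code(code: str) -> tuple:
    """Parse a file code like '200103b' into (date, index of speech that day)."""
    date = datetime.datetime.strptime(code[:6], "%y%m%d").date()
    index = ord(code[6]) - ord("a")  # 'a' -> 0 (first speech of the day)
    return date, index

print(parse_speech_code("200103b"))  # (datetime.date(2020, 1, 3), 1)
```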

Metadata JSON Format

Each institution directory contains a metadata.json file with structured information about the speeches:

```json
{
  "200101a": {
    "raw_text": "Original metadata text from the BIS website",
    "speech_type": "Speech",
    "speaker": "Ms Jane Smith",
    "role": "Governor of the Central Bank",
    "event": "Annual Banking Conference",
    "speech_date": "1 January 2020",
    "location": "Frankfurt, Germany",
    "organizer": "European Banking Association",
    "date": "2020-01-01"
  },
  "200103b": { ... }
}
```

The structured format makes it easier to extract specific information about each speech for analysis.
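For instance, loading a metadata file and listing the speeches it describes takes only a few lines. This sketch assumes the directory layout shown above; `load_metadata` is an illustrative helper, not part of the package's API:

```python
import json
from pathlib import Path

def load_metadata(data_dir: Path, institution_slug: str) -> dict:
    """Load metadata.json for one institution (per the layout above)."""
    path = data_dir / "pdfs" / institution_slug / "metadata.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical usage: print speaker and date for each downloaded speech
meta_path = Path("data") / "pdfs" / "european_central_bank" / "metadata.json"
if meta_path.exists():
    for code, entry in load_metadata(Path("data"), "european_central_bank").items():
        print(code, entry["speaker"], entry["date"])
```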

Extending with Custom Text Processing

You can extend the functionality to perform custom text processing on the downloaded speeches. Here's an example:

```python
import glob
import os
import re
from pathlib import Path

import pandas as pd

def analyze_speeches(data_dir, institution, keywords):
    """Analyze text files for keyword frequency."""
    # Path to text files for the institution
    institution_dir = Path(data_dir) / "texts" / institution.lower().replace(" ", "_")
    results = []

    # Process each text file
    for txt_file in glob.glob(f"{institution_dir}/*.txt"):
        file_code = os.path.basename(txt_file).split('.')[0]

        with open(txt_file, 'r', encoding='utf-8') as f:
            text = f.read().lower()

            # Count keywords
            word_counts = {}
            for keyword in keywords:
                pattern = r'\b' + re.escape(keyword.lower()) + r'\b'
                word_counts[keyword] = len(re.findall(pattern, text))

            # Get total word count
            total_words = len(re.findall(r'\b\w+\b', text))

            # Add to results
            results.append({
                'file_code': file_code,
                'total_words': total_words,
                **word_counts
            })

    # Convert to DataFrame for analysis
    df = pd.DataFrame(results)
    return df

# Example usage
keywords = ['inflation', 'recession', 'policy', 'interest', 'rate']
results_df = analyze_speeches('data', 'European Central Bank', keywords)

# Display summary
print(f"Analyzed {len(results_df)} speeches")
print("\nAverage word counts:")
for keyword in keywords:
    print(f"- {keyword}: {results_df[keyword].mean():.2f}")

# Most mentioned keywords by speech
print("\nSpeeches with most 'inflation' mentions:")
print(results_df.sort_values('inflation', ascending=False)[['file_code', 'inflation']].head())
```

Development

Setting Up Development Environment

```bash
# Clone the repository
git clone https://github.com/HanssonMagnus/bis-scraper.git
cd bis-scraper

# Run the install script (creates virtual environment and installs package)
./install.sh

# Activate the virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"
```

Running Tests

The project uses pytest for testing. Tests are organized into unit tests and integration tests.

```bash
# Run all tests
pytest

# Run only unit tests
pytest tests/unit/

# Run only integration tests
pytest tests/integration/

# Run tests with coverage report
pytest --cov=bis_scraper
```

Code Quality

This project uses several tools to ensure code quality:

  • black for code formatting
  • isort for import sorting
  • mypy for type checking
  • ruff for linting

You can run all these checks using the provided script:

```bash
# Check code quality
./check_code_quality.py

# Fix issues automatically where possible
./check_code_quality.py --fix
```

Or run each tool individually:

```bash
# Format code
black bis_scraper tests
isort bis_scraper tests

# Check types
mypy bis_scraper

# Run linter
ruff bis_scraper tests
```

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Before submitting your PR, please make sure:

  • All tests pass
  • Code is formatted with Black
  • Imports are sorted with isort
  • Type hints are correct (mypy)
  • Linting passes with ruff

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Bank for International Settlements for providing access to central bank speeches
  • All the central banks for making their speeches publicly available

Owner

  • Name: Magnus Hansson
  • Login: HanssonMagnus
  • Kind: user
  • Location: Stockholm, Sweden

Citation (CITATION.bib)

@TECHREPORT{RePEc:hhs:gunwpe:0811,
title = {Evolution of topics in central bank speech communication},
author = {Hansson, Magnus},
year = {2021},
institution = {University of Gothenburg, Department of Economics},
type = {Working Papers in Economics},
number = {811},
abstract = {This paper studies the content of central bank speech communication from 1997 through 2020 and asks the following questions: (i) What global topics do central banks talk about? (ii) How do these topics evolve over time? I turn to natural language processing, and more specifically Dynamic Topic Models, to answer these questions. The analysis consists of an aggregate study of nine major central banks and a case study of the Federal Reserve, which allows for region specific control variables. I show that: (i) Central banks address a broad range of topics. (ii) The topics are well captured by Dynamic Topic Models. (iii) The global topics exhibit strong and significant autoregressive properties not easily explained by financial control variables.},
keywords = {Central bank communication; Monetary policy; Textual analysis; Dynamic topic models; Narratives},
url = {https://EconPapers.repec.org/RePEc:hhs:gunwpe:0811}
}

GitHub Events

Total
  • Push event: 7
Last Year
  • Push event: 7

Dependencies

Pipfile pypi
  • beautifulsoup4 *
  • requests *
  • textract *