za-bank-risk

This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leverages work by El-Haj et al. on the Corporate Financial Information Environment-Final Report Structure Extractor (CFIE-FRSE), which created a corpus of annual reports of United Kingdom (UK) companies.

https://github.com/dsfsi/za-bank-risk

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

african bank-risk dataset dsfsi-datasets finance machine-learning natural-language-processing nlp nlproc south-africa
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dsfsi
  • License: cc-by-sa-4.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 1.31 GB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 5
Topics
african bank-risk dataset dsfsi-datasets finance machine-learning natural-language-processing nlp nlproc south-africa
Created almost 5 years ago · Last pushed over 2 years ago
Metadata Files
Readme License Citation

README.md

South African Bank Risk Dataset

DOI

Give Feedback: DSFSI Resource Feedback Form

About Dataset

This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leverages work by El-Haj et al. on the Corporate Financial Information Environment-Final Report Structure Extractor (CFIE-FRSE), which created a corpus of annual reports of United Kingdom (UK) companies.

About Data collection methodology

A register of banks licensed in SA was used to download annual reports from company websites and online portals. Company structures, trading practices and branding complicated the collection; the South African Bank of Athens Limited (SABA), for example, traded as Grobank after its acquisition by GroCapital Holdings Proprietary Limited (GroCapital Holdings). Subsidiary and group companies were analysed as separate entities; this applied to ABSA, Investec, Nedbank and Standard Bank. A UK report was selected as a reference to validate pre-processing results against the CFIE-FRSE tool. Pre-processing also mapped reports to companies and years, and labelled reports as positive or negative with respect to risk. Data was prepared by extracting text, counting all words and those appearing in wordlists, and validating the results against the reference. Processing included Bag of Words, word embeddings, feature scaling and topic analysis. Modelling entailed various classifiers whose results were compared; the best performing models were applied in a prediction use case.
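
The word-counting step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's actual code: the PDF extractor (PyMuPDF here), the example file name and the output column names are assumptions, while the wordlist file names come from the repository layout shown under Repository Organisation.

```python
# Hedged sketch of the pre-processing described above: extract text per page
# from an annual-report PDF and count tokens that appear in selected wordlists.
# Assumes PyMuPDF (pip install pymupdf); the actual notebooks may use another extractor.
import re
from pathlib import Path

import fitz  # PyMuPDF

WORDLIST_DIR = Path("data/interim/wordlists")

def load_wordlist(name: str) -> set:
    """Read one term per line, lower-cased, ignoring blank lines."""
    lines = (WORDLIST_DIR / name).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}

def count_terms(tokens, terms) -> int:
    """Count tokens that occur in the given wordlist."""
    return sum(1 for t in tokens if t in terms)

wordlists = {
    "lm_negative": load_wordlist("LMnegative.txt"),
    "lm_positive": load_wordlist("LMpositive.txt"),
    "uncertainty": load_wordlist("Uncertainty.txt"),
}

report = Path("data/raw/Annual Reports/Bank/example_report.pdf")  # hypothetical file name
rows = []
with fitz.open(str(report)) as doc:
    for page_no, page in enumerate(doc, start=1):
        tokens = re.findall(r"[a-z']+", page.get_text().lower())
        row = {"document": report.name, "page": page_no, "word_count": len(tokens)}
        row.update({name: count_terms(tokens, terms) for name, terms in wordlists.items()})
        rows.append(row)
# `rows` mirrors the per-page, per-wordlist counts written to pageText.csv.
```

In the actual pipeline the same counts are produced for every wordlist in data/interim/wordlists and validated against the CFIE-FRSE output for the UK reference report.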

Description of the data

Of the potential 297 annual reports from 2009 to 2019, 258 were sourced; the balance were either not found or the bank was not operational in that year. Only 7 reports comprised multiple documents, so for simplicity each document was treated as a report.

Repository Organisation


├── data
│   ├── interim
│   │   ├── wordlists                    <- Lists of keywords/substrings to extract linguistic features
│   │   │   ├── causal.txt               <- Causal reasoning wordlist relating to performance commentary, based on a composite wordlist from El-Haj et al.
│   │   │   ├── causalMartin50.txt       <- Causal reasoning wordlist based on El-Haj et al.
│   │   │   ├── causalMartinAll.txt      <- Causal reasoning wordlist based on El-Haj et al.
│   │   │   ├── ForwardLooking.txt       <- Forward-looking wordlist based on the list proposed by Hussainey et al.
│   │   │   ├── forwardLookingNew.txt    <- Forward-looking wordlist based on an updated version of the list proposed by Hussainey et al.
│   │   │   ├── HenryNeg2006.txt         <- Negative wordlist based on Henry (2006)
│   │   │   ├── HenryNeg2008.txt         <- Negative wordlist based on Henry (2008)
│   │   │   ├── HenryPos2006.txt         <- Positive wordlist based on Henry (2006)
│   │   │   ├── HenryPos2008.txt         <- Positive wordlist based on Henry (2008)
│   │   │   ├── LMnegative.txt           <- Negative wordlist based on Loughran and McDonald
│   │   │   ├── LMpositive.txt           <- Positive wordlist based on Loughran and McDonald
│   │   │   ├── newStrategy.txt          <- Wordlist for identifying strategy-related commentary based on El-Haj et al.
│   │   │   ├── performance.txt          <- Wordlist for identifying performance-related commentary based on El-Haj et al.
│   │   │   └── Uncertainty.txt          <- Uncertainty wordlist based on Loughran and McDonald
│   │   ├── coMap.csv                    <- Reference data: Company mapping file
│   │   ├── docMap.csv                   <- Reference data: Document mapping file
│   │   ├── fileNoMap.csv                <- Output data: PDF files read that were not in the document mapping file
│   │   ├── header.csv                   <- Output data: Header text extracted from PDF files
│   │   ├── match.csv                    <- Keywords to match and classify headers based on El-Haj et al.
│   │   └── MatchToC.txt                 <- Keywords to identify the Table of Contents header
│   ├── processed
│   │   ├── docReadability.csv           <- Readability results per document without the text
│   │   ├── pageBlocks.csv               <- Text blocks per page per document with word counts per page and per wordlist
│   │   ├── pageText.csv                 <- Text per page per document with word counts per page and per wordlist
│   │   ├── pageTextRef.csv              <- Text per page of the reference report to validate word counts against CFIE-FRSE
│   │   ├── prob_test_LR.csv             <- Probability of risk on test dataset reports predicted by Logistic Regression
│   │   └── prob_test_SVMa.csv           <- Probability of risk on test dataset reports predicted by Support Vector Machine with auto gamma
│   └── raw
│       ├── Annual Reports               <- PDF files downloaded from internet websites
│       │   ├── Bank                     <- Annual reports read and processed to create the dataset
│       │   └── Insurer                  <- Empty folder for insurer reports to be stored in future
│       └── Other                        <- Other risk-related reports downloaded from internet websites
├── notebooks                                        <- Python code
│   ├── colab                                        <- Code for Google Colaboratory and cloud runtime
│   │   ├── 1_0_Colab Import.ipynb                   <- Extract PDF text, convert booklets, count words by page and write pageText.csv
│   │   ├── 2_0_Colab EDA.ipynb                      <- Exploratory Data Analysis (incl. Chi-Square) and write docReadability.csv
│   │   ├── 3_0_Colab Classifier.ipynb               <- Loop through raw and stemmed/lemmatized tokens as well as classifiers
│   │   ├── 4_0_Colab Classifier Applied LR.ipynb    <- Logistic Regression prediction, LIME, feature selection and write prob_test_LR.csv
│   │   └── 4_1_Colab Classifier Applied SVMa.ipynb  <- Support Vector Machine (with auto gamma) prediction, LIME, feature selection and write prob_test_SVMa.csv
│   └── jupyter                                      <- Code for Python 3 and local runtime, e.g. using Jupyter or JupyterLab
│       ├── 1_0_Import.ipynb                         <- Extract PDF text, convert booklets, count words by page and write pageText.csv
│       └── 1_1_Import Count Blocks.ipynb            <- Extract PDF text, convert booklets, count words by page and write pageBlocks.csv
├── .gitignore
├── LICENSE
└── README.md
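
The notebook descriptions above outline a Bag-of-Words classification stage. The following is a rough, hedged sketch of that stage, not the notebooks' actual code: the pageText.csv column names ("document", "text", "risk_label") and the train/test split are assumptions, and the output merely mimics prob_test_LR.csv and prob_test_SVMa.csv.

```python
# Hedged sketch of the modelling stage described by the notebook comments above:
# Bag-of-Words features, Logistic Regression and an SVM with auto gamma, writing
# per-report risk probabilities analogous to prob_test_LR.csv / prob_test_SVMa.csv.
# Column names ("document", "text", "risk_label") are assumptions about pageText.csv.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

pages = pd.read_csv("data/processed/pageText.csv")

# Aggregate pages into one text blob and one risk label per document (assumed schema).
docs = (pages.groupby("document")
             .agg(text=("text", " ".join), risk_label=("risk_label", "first"))
             .reset_index())

train_df, test_df = train_test_split(
    docs, test_size=0.3, random_state=42, stratify=docs["risk_label"])

vectoriser = CountVectorizer(stop_words="english", min_df=2)  # Bag of Words
X_train = vectoriser.fit_transform(train_df["text"])
X_test = vectoriser.transform(test_df["text"])

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVMa": SVC(gamma="auto", probability=True),
}
for name, model in models.items():
    model.fit(X_train, train_df["risk_label"])
    # predict_proba[:, 1] assumes the second class corresponds to the "risk" label
    prob_risk = model.predict_proba(X_test)[:, 1]
    pd.DataFrame({"document": test_df["document"].values,
                  "prob_risk": prob_risk}).to_csv(f"prob_test_{name}.csv", index=False)
    print(name, "test accuracy:", model.score(X_test, test_df["risk_label"]))
```

The actual notebooks additionally loop over raw and stemmed/lemmatized tokens, apply LIME and feature selection, and compare several classifiers before applying the best performing models in the prediction use case.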

Online Repository link

Authors

  • Lamont Theron
  • Vukosi Marivate - @vukosi

See also the list of contributors who participated in this project.

License

Data is licensed under CC BY-SA 4.0. Code is licensed under the MIT License.

Owner

  • Name: Data Science for Social Impact Research Group @ University of Pretoria
  • Login: dsfsi
  • Kind: organization
  • Email: vukosi.marivate@cs.up.ac.za
  • Location: University of Pretoria, South Africa

We are the Data Science for Social Impact research group at the Computer Science Department, University of Pretoria.

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0