za-bank-risk
This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leveraged work by the Corporate Financial Information Environment-Final Report Structure Extractor (CFIE–FRSE) of El-Haj et al. which created a corpus of annual reports of United Kingdom (UK) companies.
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
✓Academic publication links
Links to: zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (9.8%) to scientific vocabulary
Keywords
Repository
This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leveraged work by the Corporate Financial Information Environment-Final Report Structure Extractor (CFIE–FRSE) of El-Haj et al. which created a corpus of annual reports of United Kingdom (UK) companies.
Basic Info
Statistics
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 5
Topics
Metadata Files
README.md
South African Bank Risk Dataset
Give Feedback : DSFSI Resource Feedback Form
About Dataset
This repository is an initial pipeline for reading, processing, labelling and classifying unstructured annual reports of South African (SA) banks with the aim of identifying financial risk. It leveraged work by the Corporate Financial Information Environment-Final Report Structure Extractor (CFIEFRSE) of El-Haj et al. which created a corpus of annual reports of United Kingdom (UK) companies.
About Data collection methodology
A register of banks licensed in SA was used to download annual reports from company websites and online portals. Company structures, trading practices and branding complicated the collection. The South African Bank of Athens Limited (SABA) is an example, trading as Grobank after acquisition by GroCapital Holdings Proprietary Limited (GroCapital Holdings). Subsidiary and group companies were analysed as separate entities. This applied to ABSA, Investec, Nedbank and Standard Bank. A UK report was selected as reference to validate pre-processing results against the CFIEFRSE tool. Pre-porcessing also mapped reports to companies and years, and labeled reports as positive or negative with respect to risk. Data was prepared by extracting text, counting all words and those in wordlists, and validating results against the reference. Processing included Bag of Words, word embedding, feature scaling and topic analysis. Modelling entailed various classifiers for which results were compared and the best performing models were applied in a prediction use case.
Description of the data
Of the potential 297 annual reports from 2009 to 2019, 258 were sourced with the balance not found or the bank was not operational that year. Only 7 reports comprised multiple documents so for simplicity each document was treated as a report.
Repository Organisation
data
interim
wordlists <- Lists of keywords/substrings to extract linguistic features
causal.txt <- Causal reasoning wordlist relating to performance commentary, based on a composite wordlist from El-Haj et al.
causalMartin50.txt <- Causal reasoning wordlist based on El-Haj et al.
causalMartinAll.txt <- Causal reasoning wordlist based on El-Haj et al.
ForwardLooking.txt <- Forwardlooking wordlist based on the list proposed by Hussainey et al.
forwardLookingNew.txt <- Forwardlooking wordlist based on an updated version of the list proposed by Hussainey et al.
HenryNeg2006.txt <- Negative wordlist based on Henry (2006)
HenryNeg2008.txt <- Negative wordlist based on Henry (2008)
HenryPos2006.txt <- Positive wordlist based on Henry (2006)
HenryPos2008.txt <- Positive wordlist based on Henry (2008)
LMnegative.txt <- Negative wordlist based on Loughran and McDonald
LMpositive.txt <- Positive wordlist based on Loughran and McDonald
newStrategy.txt <- Wordlist for identifying strategy-related commentary based on El-Haj et al.
performance.txt <- Wordlist for identifying performance-related commentary based on El-Haj et al.
Uncertainty.txt <- Uncertainty wordlist based on Loughran and McDonald
coMap.csv <- Reference data: Company mapping file
docMap.csv <- Reference data: Document mapping file
fileNoMap.csv <- Output data: PDF files read that were not in the document mapping file
header.csv <- Output data: Header text extracted from PDF files
match.csv <- Keywords to match and classify headers based on El-Haj et al.
MatchToC.txt <- Keywords to identify Table of Content header
processed
docReadability.csv <- Readability results per document without the text
pageBlocks.csv <- Text blocks per page per document with word counts per page and per wordlist
pageText.csv <- Text per page per document with word counts per page and per wordlist
pageTextRef.csv <- Text per page of reference report to validate word counts with CFIEFRSE
prob_test_LR.csv <- Probability of risk on test dataset reports predicted by Logistic Regression
prob_test_SVMa.csv <- Probability of risk on test dataset reports predicted by Support Vector Machine with auto gamma
raw
Annual Reports <- PDF files downloaded from internet websites
Bank <- Annual reports read and processed to create the dataset
Insurer <- Empty folder for insurer reports to be stored in future
Other <- Other risk-related reports downloaded from internet websites
notebooks <- Python code
colab <- Code for Google Colaboratory and cloud runtime
1_0_Colab Import.ipynb <- Extract PDF text, convert booklets, count words by page and write pageText.csv
2_0_Colab EDA.ipynb <- Exploratory Data Analysis (incl. Chi-Square) and write docReadability.csv
3_0_Colab Classifier.ipynb <- Loop through raw and stemmed/lemmatized tokens as well as classifiers
4_0_Colab Classifier Applied LR.ipynb <- Logistic Regression prediction, LIME, feature selection and write prob_test_LR.csv
4_1_Colab Classifier Applied SVMa.ipynb <- Support Vector Machine (with auto gamma) prediction, LIME, feature selection and write prob_test_SVMa.csv
jupyter <- Code for Python 3 and local runtime e.g. using Jupyter or JupyterLab
1_0_Import.ipynb <- Extract PDF text, convert booklets, count words by page and write pageText.csv
1_1_Import Count Blocks.ipynb <- Extract PDF text, convert booklets, count words by page and write pageBlocks.csv
.gitignore
LICENSE
README.md
Online Repository link
- Zenodo Data Repository - Link to the data repository.
Authors
- Lamont Theron
- Vukosi Marivate - @vukosi
See also the list of contributors who participated in this project.
License
Data is Licensed under CC 4.0 BY SA Code is Licences under MIT License.
Owner
- Name: Data Science for Social Impact Research Group @ University of Pretoria
- Login: dsfsi
- Kind: organization
- Email: vukosi.marivate@cs.up.ac.za
- Location: University of Pretoria, South Africa
- Website: https://dsfsi.github.io
- Twitter: dsfsi_research
- Repositories: 14
- Profile: https://github.com/dsfsi
We are the Data Science for Social Impact research group at the Computer Science Department, University of Pretoria.
GitHub Events
Total
Last Year
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0