https://github.com/chiragagg5k/finscraper

Scraping and processing SEC EDGAR financial data and training a Faster R-CNN table detection model with it

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Keywords

financial-data python web-scraping

Last synced: 10 months ago · JSON representation

Repository

Scraping and processing SEC EDGAR financial data and training a Faster R-CNN table detection model with it

Basic Info

Host: GitHub
Owner: ChiragAgg5k
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 5.21 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

financial-data python web-scraping

Created almost 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License

FINSCRAPER

Unleashing financial insights with cutting-edge tech!

Table of Contents

- [Overview](#overview)
- [Features](#features)
- [Repository Structure](#repository-structure)
- [Modules](#modules)
- [Getting Started](#getting-started)
  - [Installation](#installation)
  - [Usage](#usage)
  - [Tests](#tests)
- [Project Roadmap](#project-roadmap)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)

Overview

The finscraper project encompasses a suite of financial data extraction and processing functionalities. It includes modules for scraping SEC filings, model training with Fast R-CNN, and preprocessing financial statement images. The project manages dependencies via its pyproject.toml file, ensuring seamless integration for scraping tasks. With capabilities to download, store, analyze, and train models on financial data, finscraper offers a robust platform for financial data extraction, preparation, and analysis.
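As an illustration of the scraping step, here is a minimal sketch of how 10-K document URLs could be derived from SEC EDGAR's public submissions JSON. It is not the repository's scraper.py: the function names, the `USER_AGENT` value, and the use of `urllib` instead of the project's `requests` dependency are assumptions made to keep the example self-contained.

```python
import json
import urllib.request

# Assumption: the SEC asks automated clients to identify themselves.
# Replace this placeholder with your own name and contact address.
USER_AGENT = "example-user example@example.com"


def fetch_submissions(cik: str) -> dict:
    """Download the EDGAR submissions JSON for a 10-digit, zero-padded CIK."""
    url = f"https://data.sec.gov/submissions/CIK{cik}.json"
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def list_filing_urls(submissions: dict, form_type: str = "10-K") -> list:
    """Build archive URLs for every recent filing of the given form type."""
    # Archive paths use the CIK without leading zeros.
    cik = int(submissions["cik"])
    recent = submissions["filings"]["recent"]
    urls = []
    # "form", "accessionNumber" and "primaryDocument" are parallel arrays.
    for form, accession, document in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form == form_type:
            folder = accession.replace("-", "")
            urls.append(
                f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{document}"
            )
    return urls
```

With network access, `list_filing_urls(fetch_submissions("0000320193"))` would list the recent 10-K documents for Apple Inc. (CIK 0000320193). Keeping the URL construction separate from the download makes the filtering logic easy to test offline.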

Features

| | Feature | Description |
| --- | --- | --- |
| ⚙️ | Architecture | The project uses a modular architecture with separate modules for scraping, testing, model training, and pre-processing financial data. It leverages external libraries and tools for specific tasks, promoting code separation and reusability. |
| 🔩 | Code Quality | The codebase maintains high code quality standards with clear naming conventions, consistent formatting, and well-structured functions. Code comments are used effectively to enhance code readability and maintainability. |
| 📄 | Documentation | The project includes informative documentation in the form of code comments, README files, and module descriptions. It provides detailed explanations of the project structure, functionality, and usage, making it easier for developers to onboard and contribute. |
| 🔌 | Integrations | Key external dependencies include pytesseract, bs4, matplotlib, torch, and more for tasks such as image processing, web scraping, data visualization, and machine learning. These integrations enhance the project's capabilities and extend its functionality. |
| 🧩 | Modularity | The codebase demonstrates high modularity and reusability by encapsulating distinct functionalities into separate modules. This design allows for easy maintenance, testing, and scalability of individual components without affecting the entire system. |
| 🧪 | Testing | The project utilizes testing frameworks like pytest to ensure the correctness and reliability of various modules. Test cases are written to validate different functionalities, improving code robustness and facilitating future changes. |
| ⚡️ | Performance | The project focuses on efficiency and resource optimization by using libraries like tensorflow and scikit-learn for machine learning tasks. It employs best practices for data processing and model training, enhancing performance and speed. |
| 🛡️ | Security | Measures such as data encryption, secure access control, and data validation are implemented to ensure data protection and prevent unauthorized access. The project follows security best practices to safeguard sensitive financial information. |
| 📦 | Dependencies | Key external libraries and dependencies include pytesseract, bs4, matplotlib, torch, scikit-learn, and more. These libraries provide essential functionalities for tasks such as image processing, web scraping, and machine learning model training. |

Repository Structure

└── finscraper/
    ├── README.md
    ├── finscraper
    │   ├── __init__.py
    │   ├── model_training.py
    │   ├── preprocessor.py
    │   ├── scraper.py
    │   └── test_setup.py
    ├── poetry.lock
    ├── pyproject.toml
    └── tests
        └── __init__.py

Modules

| File | Summary |
| --- | --- |
| [pyproject.toml](https://github.com/ChiragAgg5k/finscraper/blob/master/pyproject.toml) | Manages project dependencies and metadata, including Python version and external libraries. Ensures smooth integration and compatibility for financial data scraping tasks in the parent finscraper repository. |

finscraper

| File | Summary |
| --- | --- |
| [scraper.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/scraper.py) | Scrapes the SEC for Apple Inc.'s 10-K documents, then downloads and stores them in a structured format. This module provides the repository's financial data extraction and storage capabilities. |
| [test_setup.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/test_setup.py) | Verifies software dependencies and GPU availability. Reports versions of key libraries and checks for CUDA and TensorFlow GPU devices. Helps ensure the system is configured correctly for model training. |
| [model_training.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/model_training.py) | Trains a Fast R-CNN model on financial table images. Prepares and splits data for training. Automates HTML to image conversion. Cleans and preprocesses images. Uses a dataset class and data loaders for model training. Saves the fine-tuned model after training. |
| [preprocessor.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/preprocessor.py) | Analyzes, converts, cleans, and splits financial statement images. Extracts tables from HTML, converts them to images, enhances image quality, and divides the dataset for model training. |
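The summaries of model_training.py and preprocessor.py both mention dividing the dataset for training. A minimal sketch of a reproducible train/validation split is shown here; the function name, the 80/20 ratio, the seed, and the file names in the usage example are hypothetical and are not taken from the repository.

```python
import random


def split_dataset(paths, val_fraction=0.2, seed=42):
    """Split file paths into (train, validation) lists, reproducibly."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    # Sort first so the result does not depend on directory listing order.
    shuffled = sorted(paths)
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    # Hypothetical image names standing in for rendered table images.
    images = [f"table_{i:03d}.png" for i in range(10)]
    train, val = split_dataset(images)
    print(len(train), "training images,", len(val), "validation images")
```

Fixing the seed means the same images land in the validation set on every run, which keeps evaluation numbers comparable between training runs.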

Getting Started

System Requirements:

Python: version 3.11 or later (pyproject.toml declares python ^3.11)
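One way to check a machine against these requirements, in the spirit of the test_setup.py module described above, is sketched below. This is not the repository's code: the function names and the choice of packages to report are assumptions, and the Python 3.11 minimum comes from the `python ^3.11` entry in pyproject.toml.

```python
import sys
from importlib import metadata
from typing import Dict, List, Optional

# Distribution names taken from the pyproject.toml dependency list.
PACKAGES = ["torch", "torchvision", "tensorflow", "opencv-python", "pytesseract"]


def python_ok(minimum=(3, 11)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
    print("CUDA available:", cuda_available())
```

Importing torch lazily inside `cuda_available` lets the script still report missing packages on a machine where torch is not installed yet.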

Installation

From `source`

Clone the finscraper repository:

$ git clone https://github.com/ChiragAgg5k/finscraper

Change to the project directory:

$ cd finscraper

Install the dependencies (the repository ships pyproject.toml and poetry.lock rather than a requirements.txt, so Poetry is used here):

$ poetry install

Usage

From `source`

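Run a finscraper module using the command below (the repository structure lists no main.py; the modules live under finscraper/, for example the scraper):

$ python finscraper/scraper.py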

Tests

Run the test suite using the command below:

$ pytest

Project Roadmap

[x] ► Create a pipeline for scraping financial data.
[ ] ► Implement a model training module.
[ ] ► Develop a preprocessor for financial statement images.

Contributing

Contributions are welcome! Here are several ways you can contribute:

Report Issues: Submit bugs found or log feature requests for the finscraper project.
Submit Pull Requests: Review open PRs, and submit your own PRs.
Join the Discussions: Share your insights, provide feedback, or ask questions.

Contributing Guidelines

1. **Fork the Repository**: Start by forking the project repository to your GitHub account.
2. **Clone Locally**: Clone the forked repository to your local machine using a git client.
   ```sh
   git clone https://github.com/ChiragAgg5k/finscraper
   ```
3. **Create a New Branch**: Always work on a new branch, giving it a descriptive name.
   ```sh
   git checkout -b new-feature-x
   ```
4. **Make Your Changes**: Develop and test your changes locally.
5. **Commit Your Changes**: Commit with a clear message describing your updates.
   ```sh
   git commit -m 'Implemented new feature x.'
   ```
6. **Push to GitHub**: Push the changes to your forked repository.
   ```sh
   git push origin new-feature-x
   ```
7. **Submit a Pull Request**: Create a PR against the original project repository. Clearly describe the changes and their motivations.
8. **Review**: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

License

This project is licensed under the MIT License. For more details, refer to the LICENSE file.

Acknowledgments

List any resources, contributors, inspiration, etc. here.

Owner

Name: Chirag Aggarwal
Login: ChiragAgg5k
Kind: user
Location: Noida, Uttar Pradesh, India
Company: Bennett University

Twitter: ChiragAgg5k
Repositories: 3
Profile: https://github.com/ChiragAgg5k

CSE Undergrad | Student at Bennett University

Committers

Last synced: about 1 year ago

All Time

Total Commits: 8
Total Committers: 1
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 8
Committers: 1
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
ChiragAgg5k	1****k	8

Issues and Pull Requests

Last synced: about 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

poetry.lock pypi

absl-py 2.1.0
astunparse 1.6.3
attrs 23.2.0
beautifulsoup4 4.12.3
bs4 0.0.2
certifi 2024.7.4
cffi 1.16.0
charset-normalizer 3.3.2
colorama 0.4.6
contourpy 1.2.1
cycler 0.12.1
filelock 3.15.4
flatbuffers 24.3.25
fonttools 4.53.1
fsspec 2024.6.1
gast 0.6.0
google-pasta 0.2.0
grpcio 1.65.1
h11 0.14.0
h5py 3.11.0
idna 3.7
imgkit 1.2.3
jinja2 3.1.4
joblib 1.4.2
keras 3.4.1
kiwisolver 1.4.5
libclang 18.1.1
markdown 3.6
markdown-it-py 3.0.0
markupsafe 2.1.5
matplotlib 3.9.1
mdurl 0.1.2
ml-dtypes 0.4.0
mpmath 1.3.0
namex 0.0.8
networkx 3.3
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.5.82
nvidia-nvtx-cu12 12.1.105
opencv-python 4.10.0.84
opt-einsum 3.3.0
optree 0.12.1
outcome 1.3.0.post0
packaging 24.1
pandas 2.2.2
pdf2image 1.17.0
pillow 10.4.0
protobuf 4.25.3
pycparser 2.22
pygments 2.18.0
pyparsing 3.1.2
pysocks 1.7.1
pytesseract 0.3.10
python-dateutil 2.9.0.post0
pytz 2024.1
requests 2.32.3
rich 13.7.1
scikit-learn 1.5.1
scipy 1.14.0
selenium 4.23.1
setuptools 71.1.0
six 1.16.0
sniffio 1.3.1
sortedcontainers 2.4.0
soupsieve 2.5
sympy 1.13.1
tensorboard 2.17.0
tensorboard-data-server 0.7.2
tensorflow 2.17.0
tensorflow-io-gcs-filesystem 0.37.1
termcolor 2.4.0
threadpoolctl 3.5.0
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.4
trio 0.26.0
trio-websocket 0.11.1
triton 3.0.0
typing-extensions 4.12.2
tzdata 2024.1
urllib3 2.2.2
websocket-client 1.8.0
werkzeug 3.0.3
wheel 0.43.0
wrapt 1.16.0
wsproto 1.2.0

pyproject.toml pypi

bs4 ^0.0.2
imgkit ^1.2.3
matplotlib ^3.9.1
opencv-python ^4.10.0.84
pandas ^2.2.2
pdf2image ^1.17.0
pillow ^10.4.0
pytesseract ^0.3.10
python ^3.11
requests ^2.32.3
scikit-learn ^1.5.1
selenium ^4.23.1
tensorflow ^2.17.0
torch ^2.4.0
torchvision ^0.19.0
tqdm ^4.66.4
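
The `^` prefix in the list above is Poetry's caret constraint: it allows updates that leave the left-most non-zero version component unchanged. The sketch below expands a caret constraint into its lower and upper bounds. The `caret_range` helper is hypothetical, written for illustration, and it only handles plain dotted numeric versions.

```python
def caret_range(constraint: str):
    """Expand a caret constraint into (inclusive lower, exclusive upper) bounds."""
    if not constraint.startswith("^"):
        raise ValueError(f"not a caret constraint: {constraint!r}")
    lower = constraint[1:]
    parts = [int(part) for part in lower.split(".")]
    # Bump the left-most non-zero component; if all are zero, bump the last one.
    index = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = parts[:index] + [parts[index] + 1] + [0] * (len(parts) - index - 1)
    return lower, ".".join(str(p) for p in upper)


if __name__ == "__main__":
    for name, spec in [("torch", "^2.4.0"), ("torchvision", "^0.19.0"), ("bs4", "^0.0.2")]:
        low, high = caret_range(spec)
        print(f"{name}: >={low},<{high}")
```

For example, `torch ^2.4.0` allows any 2.x release from 2.4.0 onward, while `torchvision ^0.19.0` is limited to 0.19.x and `bs4 ^0.0.2` to 0.0.2 only, because versions below 1.0 are treated as unstable.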
