https://github.com/chiragagg5k/finscraper
Scraping and processing SEC EDGAR financial data and training a Faster R-CNN table detection model with it.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.6%) to scientific vocabulary
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
FINSCRAPER
Unleashing financial insights with cutting-edge tech!
Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Repository Structure](#repository-structure)
- [Modules](#modules)
- [Getting Started](#getting-started)
- [Installation](#installation)
- [Usage](#usage)
- [Tests](#tests)
- [Project Roadmap](#project-roadmap)
- [Contributing](#contributing)
- [License](#license)
- [Acknowledgments](#acknowledgments)
Overview
The finscraper project is a suite of tools for extracting and processing financial data. It includes modules for scraping SEC filings, training a Faster R-CNN table detection model, and preprocessing financial statement images. Dependencies are declared in pyproject.toml. Together, the modules download and store filings, convert the tables in them to images, and train a model on the result.
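To make the scraping step concrete, here is a minimal sketch of listing a company's 10-K documents through EDGAR's public JSON filing index. It is not taken from scraper.py, which may work differently (the dependency list includes bs4 and selenium). The function names are hypothetical, the sketch uses only the standard library, and Apple's CIK is supplied as an assumption.

```python
import json
import urllib.request

# Assumption: Apple Inc.'s SEC Central Index Key (CIK).
APPLE_CIK = 320193


def submissions_url(cik: int) -> str:
    """Return the URL of EDGAR's JSON filing index (CIK zero-padded to 10 digits)."""
    return f"https://data.sec.gov/submissions/CIK{cik:010d}.json"


def filing_urls(submissions: dict, cik: int, form: str = "10-K") -> list[str]:
    """Build archive URLs for every recent filing of the given form type.

    `submissions` is the parsed JSON index; its `filings.recent` section
    holds parallel lists, one entry per filing.
    """
    recent = submissions["filings"]["recent"]
    urls = []
    for form_type, accession, document in zip(
        recent["form"], recent["accessionNumber"], recent["primaryDocument"]
    ):
        if form_type == form:
            # The archive folder name is the accession number without dashes.
            folder = accession.replace("-", "")
            urls.append(
                f"https://www.sec.gov/Archives/edgar/data/{cik}/{folder}/{document}"
            )
    return urls


def fetch_submissions(cik: int, user_agent: str) -> dict:
    """Download the filing index. The SEC asks for a descriptive User-Agent."""
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

A caller would pass the result of `fetch_submissions(APPLE_CIK, "Your Name you@example.com")` to `filing_urls` and then download each URL. Keeping the URL construction separate from the network call makes it testable offline.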
Features
| | Feature | Description |
| --- | --- | --- |
| ⚙️ | Architecture | The project uses a modular architecture with separate modules for scraping, testing, model training, and pre-processing financial data. It leverages external libraries and tools for specific tasks, promoting code separation and reusability. |
| 🔩 | Code Quality | The codebase maintains high code quality standards with clear naming conventions, consistent formatting, and well-structured functions. Code comments are used effectively to enhance code readability and maintainability. |
| 📄 | Documentation | The project includes informative documentation in the form of code comments, README files, and module descriptions. It provides detailed explanations of the project structure, functionality, and usage, making it easier for developers to onboard and contribute. |
| 🔌 | Integrations | Key external dependencies include pytesseract, bs4, matplotlib, torch, and more for tasks such as image processing, web scraping, data visualization, and machine learning. These integrations enhance the project's capabilities and extend its functionality. |
| 🧩 | Modularity | The codebase demonstrates high modularity and reusability by encapsulating distinct functionalities into separate modules. This design allows for easy maintenance, testing, and scalability of individual components without affecting the entire system. |
| 🧪 | Testing | The project utilizes testing frameworks like pytest to ensure the correctness and reliability of various modules. Test cases are written to validate different functionalities, improving code robustness and facilitating future changes. |
| ⚡️ | Performance | The project focuses on efficiency and resource optimization by using libraries like tensorflow and scikit-learn for machine learning tasks. It employs best practices for data processing and model training, enhancing performance and speed. |
| 🛡️ | Security | Measures such as data encryption, secure access control, and data validation are implemented to ensure data protection and prevent unauthorized access. The project follows security best practices to safeguard sensitive financial information. |
| 📦 | Dependencies | Key external libraries and dependencies include pytesseract, bs4, matplotlib, torch, scikit-learn, and more. These libraries provide essential functionalities for tasks such as image processing, web scraping, and machine learning model training. |
Repository Structure
└── finscraper/
    ├── README.md
    ├── finscraper
    │   ├── __init__.py
    │   ├── model_training.py
    │   ├── preprocessor.py
    │   ├── scraper.py
    │   └── test_setup.py
    ├── poetry.lock
    ├── pyproject.toml
    └── tests
        └── __init__.py
Modules
.
| File | Summary |
| --- | --- |
| [pyproject.toml](https://github.com/ChiragAgg5k/finscraper/blob/master/pyproject.toml) | Manages project dependencies and metadata, including the Python version and external libraries. |

finscraper
| File | Summary |
| --- | --- |
| [scraper.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/scraper.py) | Scrapes the SEC for Apple Inc.'s 10-K documents, then downloads and stores them in a structured format. |
| [test_setup.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/test_setup.py) | Verifies software dependencies and GPU availability. Reports versions of key libraries and checks for CUDA and TensorFlow GPU devices, so the system can be confirmed ready for model training. |
| [model_training.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/model_training.py) | Trains a Faster R-CNN model on financial table images. Prepares and splits data for training, automates HTML-to-image conversion, cleans and preprocesses images, uses a dataset class and data loaders, and saves the fine-tuned model after training. |
| [preprocessor.py](https://github.com/ChiragAgg5k/finscraper/blob/master/finscraper/preprocessor.py) | Analyzes, converts, cleans, and splits financial statement images. Extracts tables from HTML, renders them as images, enhances image quality, and divides the dataset for model training. |

Getting Started
System Requirements:
- Python: version 3.11 or later (pyproject.toml declares `python ^3.11`)
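Before training, it can help to confirm that the environment meets these requirements. The sketch below shows one way to report installed package versions and GPU availability. It is an illustration, not the contents of test_setup.py. The package list is an assumption drawn from the dependency list, and the function names are hypothetical.

```python
import sys
from importlib import metadata

# Assumption: a few packages from the project's dependency list; adjust as needed.
PACKAGES = ("torch", "torchvision", "tensorflow", "opencv-python", "pytesseract")


def python_ok(minimum=(3, 11)) -> bool:
    """Check that the running interpreter is at least the given version."""
    return sys.version_info >= minimum


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    print(f"Python >= 3.11: {python_ok()}")
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    print(f"CUDA available: {cuda_available()}")
```

Looking versions up through `importlib.metadata` avoids importing heavy libraries just to read a version string. Only the CUDA check needs to import torch.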
Installation
From source
- Clone the finscraper repository:
    $ git clone https://github.com/ChiragAgg5k/finscraper
- Change to the project directory:
    $ cd finscraper
- Install the dependencies. The repository ships pyproject.toml and poetry.lock rather than a requirements.txt, so Poetry is the likely tool:
    $ poetry install
Usage
From source
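Run an individual module from the project root. The repository structure lists no main.py, so running a module directly, as in the example below, is an assumption:
    $ python finscraper/scraper.py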
Tests
Run the test suite using the command below:
    $ pytest
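The tests directory currently holds only `__init__.py`, so there is little for pytest to collect. As a starting point, the sketch below shows the kind of test that could cover the dataset-splitting step. `split_dataset` is a hypothetical helper written for this example, not a function from the repository.

```python
import random


def split_dataset(items, val_fraction=0.2, seed=42):
    """Hypothetical helper: shuffle deterministically, then return (train, val)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(items)
    # A private Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction)) if shuffled else 0
    return shuffled[n_val:], shuffled[:n_val]


def test_split_is_disjoint_and_complete():
    files = [f"table_{i}.png" for i in range(10)]
    train, val = split_dataset(files, val_fraction=0.2)
    assert len(train) == 8 and len(val) == 2
    assert set(train).isdisjoint(val)
    assert sorted(train + val) == sorted(files)


def test_split_is_reproducible():
    files = list(range(50))
    assert split_dataset(files, seed=7) == split_dataset(files, seed=7)
```

Saved as, say, `tests/test_split.py`, these functions would be discovered by pytest through its default `test_*.py` and `test_*` naming conventions.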
Project Roadmap
- [x] ► Create a pipeline for scraping financial data.
- [ ] ► Implement a model training module.
- [ ] ► Develop a preprocessor for financial statement images.
Contributing
Contributions are welcome! Here are several ways you can contribute:
- Report Issues: Submit bugs found or log feature requests for the finscraper project.
- Submit Pull Requests: Review open PRs, and submit your own PRs.
- Join the Discussions: Share your insights, provide feedback, or ask questions.
Contributing Guidelines
1. **Fork the Repository**: Start by forking the project repository to your GitHub account.
2. **Clone Locally**: Clone the forked repository to your local machine using a git client.
   ```sh
   git clone https://github.com/ChiragAgg5k/finscraper
   ```
3. **Create a New Branch**: Always work on a new branch, giving it a descriptive name.
   ```sh
   git checkout -b new-feature-x
   ```
4. **Make Your Changes**: Develop and test your changes locally.
5. **Commit Your Changes**: Commit with a clear message describing your updates.
   ```sh
   git commit -m 'Implemented new feature x.'
   ```
6. **Push to GitHub**: Push the changes to your forked repository.
   ```sh
   git push origin new-feature-x
   ```
7. **Submit a Pull Request**: Create a PR against the original project repository. Clearly describe the changes and their motivations.
8. **Review**: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!

License
This project is protected under the SELECT-A-LICENSE License. For more details, refer to the LICENSE file.
Acknowledgments
- List any resources, contributors, inspiration, etc. here.
Owner
- Name: Chirag Aggarwal
- Login: ChiragAgg5k
- Kind: user
- Location: Noida , Uttar Pradesh , India
- Company: Bennett University
- Twitter: ChiragAgg5k
- Repositories: 3
- Profile: https://github.com/ChiragAgg5k
CSE Undergrad | Student at Bennett University
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
poetry.lock (resolved versions)
- absl-py 2.1.0
- astunparse 1.6.3
- attrs 23.2.0
- beautifulsoup4 4.12.3
- bs4 0.0.2
- certifi 2024.7.4
- cffi 1.16.0
- charset-normalizer 3.3.2
- colorama 0.4.6
- contourpy 1.2.1
- cycler 0.12.1
- filelock 3.15.4
- flatbuffers 24.3.25
- fonttools 4.53.1
- fsspec 2024.6.1
- gast 0.6.0
- google-pasta 0.2.0
- grpcio 1.65.1
- h11 0.14.0
- h5py 3.11.0
- idna 3.7
- imgkit 1.2.3
- jinja2 3.1.4
- joblib 1.4.2
- keras 3.4.1
- kiwisolver 1.4.5
- libclang 18.1.1
- markdown 3.6
- markdown-it-py 3.0.0
- markupsafe 2.1.5
- matplotlib 3.9.1
- mdurl 0.1.2
- ml-dtypes 0.4.0
- mpmath 1.3.0
- namex 0.0.8
- networkx 3.3
- numpy 1.26.4
- nvidia-cublas-cu12 12.1.3.1
- nvidia-cuda-cupti-cu12 12.1.105
- nvidia-cuda-nvrtc-cu12 12.1.105
- nvidia-cuda-runtime-cu12 12.1.105
- nvidia-cudnn-cu12 9.1.0.70
- nvidia-cufft-cu12 11.0.2.54
- nvidia-curand-cu12 10.3.2.106
- nvidia-cusolver-cu12 11.4.5.107
- nvidia-cusparse-cu12 12.1.0.106
- nvidia-nccl-cu12 2.20.5
- nvidia-nvjitlink-cu12 12.5.82
- nvidia-nvtx-cu12 12.1.105
- opencv-python 4.10.0.84
- opt-einsum 3.3.0
- optree 0.12.1
- outcome 1.3.0.post0
- packaging 24.1
- pandas 2.2.2
- pdf2image 1.17.0
- pillow 10.4.0
- protobuf 4.25.3
- pycparser 2.22
- pygments 2.18.0
- pyparsing 3.1.2
- pysocks 1.7.1
- pytesseract 0.3.10
- python-dateutil 2.9.0.post0
- pytz 2024.1
- requests 2.32.3
- rich 13.7.1
- scikit-learn 1.5.1
- scipy 1.14.0
- selenium 4.23.1
- setuptools 71.1.0
- six 1.16.0
- sniffio 1.3.1
- sortedcontainers 2.4.0
- soupsieve 2.5
- sympy 1.13.1
- tensorboard 2.17.0
- tensorboard-data-server 0.7.2
- tensorflow 2.17.0
- tensorflow-io-gcs-filesystem 0.37.1
- termcolor 2.4.0
- threadpoolctl 3.5.0
- torch 2.4.0
- torchvision 0.19.0
- tqdm 4.66.4
- trio 0.26.0
- trio-websocket 0.11.1
- triton 3.0.0
- typing-extensions 4.12.2
- tzdata 2024.1
- urllib3 2.2.2
- websocket-client 1.8.0
- werkzeug 3.0.3
- wheel 0.43.0
- wrapt 1.16.0
- wsproto 1.2.0
pyproject.toml (declared constraints)
- bs4 ^0.0.2
- imgkit ^1.2.3
- matplotlib ^3.9.1
- opencv-python ^4.10.0.84
- pandas ^2.2.2
- pdf2image ^1.17.0
- pillow ^10.4.0
- pytesseract ^0.3.10
- python ^3.11
- requests ^2.32.3
- scikit-learn ^1.5.1
- selenium ^4.23.1
- tensorflow ^2.17.0
- torch ^2.4.0
- torchvision ^0.19.0
- tqdm ^4.66.4