dataextratt
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: tercanblg
- Language: Python
- Default Branch: main
- Size: 452 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Data Extraction Project
Introduction
This project extracts data from various sources and formats, such as websites, databases, and documents, using Python-based tools and libraries. The extracted data can be used for analysis, reporting, and machine learning.
Installation
Clone this repository to your local machine:
    git clone https://github.com/tercanblg/dataextraction.git
Navigate to the project directory:
    cd dataextraction
Install the required dependencies:
    pip install -r requirements.txt
Usage
Modify the configuration file (config.ini) to specify the sources and formats from which you want to extract data.
Run the main script to start the data extraction process:
    python main.py
The extracted data will be saved to the output location specified in the configuration.
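The README does not document the layout of config.ini, so the following is only a minimal sketch of how a script could read such a file with Python's standard configparser module. The section names (sources, output), the keys (urls, formats, path), and the helper load_settings are all hypothetical and are not taken from this repository.

```python
import configparser

# Hypothetical config.ini contents. The real file's sections and keys
# are not documented in the README, so these names are assumptions.
EXAMPLE_CONFIG = """
[sources]
urls = https://example.com/a, https://example.com/b
formats = html, csv

[output]
path = output/data.json
"""


def load_settings(text):
    """Parse INI text and return the extraction settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Comma-separated values are split into clean lists.
    urls = [u.strip() for u in parser.get("sources", "urls").split(",")]
    formats = [f.strip() for f in parser.get("sources", "formats").split(",")]
    return {
        "urls": urls,
        "formats": formats,
        "output_path": parser.get("output", "path"),
    }


settings = load_settings(EXAMPLE_CONFIG)
print(settings)
```

To read an actual file rather than an inline string, `parser.read("config.ini")` can replace the `read_string` call.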
Configuration
Modify the config.ini file to customize the extraction process. Specify the sources, formats, output location, and any other parameters required for data extraction.
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, feel free to open an issue or submit a pull request.
License
This project is licensed under the MIT License.
Contact
For any inquiries or feedback, you can reach out to [insert your contact information].
Owner
- Login: tercanblg
- Kind: user
- Repositories: 9
- Profile: https://github.com/tercanblg
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Joneš"
given-names: "Jan"
title: "AI-based Structured Web Data Extraction"
version: 1.0.0
date-released: 2022-05-04
url: "https://github.com/jjonescz/awe"
repository-code: "https://github.com/jjonescz/awe"
repository-artifacts: "https://github.com/jjonescz/awe/releases/tag/v1.0"
preferred-citation:
type: thesis
authors:
- family-names: "Joneš"
given-names: "Jan"
title: "AI-based Structured Web Data Extraction"
thesis-type: MS
year: 2022
department: Department of Software Engineering
institution:
name: Charles University
city: Prague
country: CZ
date-published: 2022-06-15
url: "http://hdl.handle.net/20.500.11956/174143"
Dependencies
- actions/checkout v2 composite
- actions/checkout v2 composite
- superfly/flyctl-actions 1.3 composite
- actions/checkout v2 composite
- actions/checkout v2 composite
- actions/checkout v2 composite
- actions/upload-artifact v2 composite
- nvidia/cuda 9.2-base-ubuntu16.04 build
- @types/cli-progress 3.9.2 development
- @types/express 4.17.13 development
- @types/natural-compare-lite 1.4.0 development
- @types/node 16.11.6 development
- ts-node 10.4.0 development
- typescript 4.4.4 development
- @oclif/command 1.8.0
- @oclif/errors 1.3.5
- cheerio 1.0.0-rc.10
- cli-progress 3.9.1
- domhandler 4.3.1
- express 4.17.3
- fast-glob 3.2.7
- generic-pool 3.8.2
- html-template-tag 4.0.0
- natural-compare-lite 1.4.0
- puppeteer-core 11.0.0
- python-shell 3.0.1
- rxjs 7.4.0
- winston 3.3.3
- 252 dependencies
- absl-py =1.0.0=pypi_0
- anyio =3.5.0=pypi_0
- argon2-cffi =21.3.0=pypi_0
- argon2-cffi-bindings =21.2.0=pypi_0
- astroid =2.9.3=pypi_0
- asttokens =2.0.5=pypi_0
- attrs =18.2.0=pypi_0
- autopep8 =1.6.0=pypi_0
- babel =2.10.1=pypi_0
- backcall =0.2.0=pypi_0
- beautifulsoup4 =4.11.1=py39h06a4308_0
- bleach =5.0.0=pypi_0
- brotlipy =0.7.0=py39h27cfd23_1003
- bzip2 =1.0.8=h7b6447c_0
- ca-certificates =2021.10.8=ha878542_0
- cachetools =5.0.0=pypi_0
- certifi =2021.10.8=py39hf3d152e_2
- cffi =1.15.0=py39hd667e15_1
- chardet =4.0.0=py39h06a4308_1003
- charset-normalizer =2.0.4=pyhd3eb1b0_0
- click =8.1.2=pypi_0
- click-completion =0.5.2=pypi_0
- click-didyoumean =0.3.0=pypi_0
- click-help-colors =0.9.1=pypi_0
- colorama =0.4.3=pypi_0
- conda =4.12.0=py39hf3d152e_0
- conda-build =3.21.8=py39h06a4308_2
- conda-content-trust =0.1.1=pyhd3eb1b0_0
- conda-package-handling =1.7.3=py39h27cfd23_1
- cryptography =36.0.0=py39h9ce1e76_0
- cycler =0.11.0=pypi_0
- cython =0.29.28=pypi_0
- debugpy =1.6.0=pypi_0
- decorator =5.1.1=pypi_0
- defusedxml =0.7.1=pypi_0
- descartes =1.1.0=pypi_0
- entrypoints =0.4=pypi_0
- executing =0.8.3=pypi_0
- fastjsonschema =2.15.3=pypi_0
- filelock =3.6.0=pyhd3eb1b0_0
- fonttools =4.33.2=pypi_0
- gensim =4.1.2=pypi_0
- gh =2.6.0=ha8f183a_0
- glob2 =0.7=pyhd3eb1b0_0
- google-auth =2.6.6=pypi_0
- google-auth-oauthlib =0.4.6=pypi_0
- gql =3.0.0a6=pypi_0
- gradient =2.0.2=pypi_0
- gradient-utils =0.5.0=pypi_0
- graphql-core =3.1.7=pypi_0
- grpcio =1.44.0=pypi_0
- halo =0.0.31=pypi_0
- huggingface-hub =0.5.1=pypi_0
- icu =69.1=h9c3ff4c_0
- idna =3.3=pyhd3eb1b0_0
- ijson =3.1.4=pypi_0
- importlib-metadata =4.11.3=pypi_0
- inflection =0.5.1=pypi_0
- ipykernel =6.13.0=pypi_0
- ipython =8.2.0=pypi_0
- ipython-genutils =0.2.0=pypi_0
- ipywidgets =7.6.5=pypi_0
- isort =5.10.1=pypi_0
- jedi =0.18.1=pypi_0
- jinja2 =3.1.1=pypi_0
- joblib =1.1.0=pypi_0
- json5 =0.9.6=pypi_0
- jsonschema =4.4.0=pypi_0
- jupyter-client =7.2.2=pypi_0
- jupyter-core =4.10.0=pypi_0
- jupyter-server =1.16.0=pypi_0
- jupyterlab =3.2.4=pypi_0
- jupyterlab-pygments =0.2.2=pypi_0
- jupyterlab-server =2.13.0=pypi_0
- jupyterlab-widgets =1.1.0=pypi_0
- kiwisolver =1.4.2=pypi_0
- lazy-object-proxy =1.7.1=pypi_0
- ld_impl_linux-64 =2.35.1=h7274673_9
- libarchive =3.4.2=h62408e4_0
- libffi =3.3=he6710b0_2
- libgcc-ng =11.2.0=h1d223b6_16
- libiconv =1.16=h516909a_0
- liblief =0.11.5=h295c915_1
- libstdcxx-ng =11.2.0=he4da1e4_16
- libuv =1.42.0=h7f98852_0
- libxml2 =2.9.12=h885dcf4_1
- libzlib =1.2.11=h166bdaf_1014
- llvm-openmp =13.0.1=he0ac6c6_1
- log-symbols =0.0.14=pypi_0
- lz4-c =1.9.3=h295c915_1
- markdown =3.3.6=pypi_0
- markupsafe =2.0.1=py39h27cfd23_0
- marshmallow =2.21.0=pypi_0
- matplotlib =3.5.1=pypi_0
- matplotlib-inline =0.1.3=pypi_0
- mccabe =0.6.1=pypi_0
- mistune =0.8.4=pypi_0
- mizani =0.7.4=pypi_0
- multidict =6.0.2=pypi_0
- nbclassic =0.3.7=pypi_0
- nbclient =0.6.0=pypi_0
- nbconvert =6.5.0=pypi_0
- nbformat =5.3.0=pypi_0
- ncurses =6.3=h7f8727e_2
- nest-asyncio =1.5.5=pypi_0
- nodejs =17.1.0=h8ca31f7_2
- notebook =6.4.11=pypi_0
- notebook-shim =0.1.0=pypi_0
- numpy =1.22.3=pypi_0
- oauthlib =3.2.0=pypi_0
- openssl =1.1.1n=h166bdaf_0
- packaging =21.3=pypi_0
- palettable =3.3.0=pypi_0
- pandas =1.4.2=pypi_0
- pandocfilters =1.5.0=pypi_0
- parso =0.8.3=pypi_0
- patchelf =0.13=h295c915_0
- patsy =0.5.2=pypi_0
- pexpect =4.8.0=pypi_0
- pickleshare =0.7.5=pypi_0
- pillow =9.1.0=pypi_0
- pip =21.2.4=py39h06a4308_0
- pkginfo =1.8.2=pyhd3eb1b0_0
- platformdirs =2.5.2=pypi_0
- plotnine =0.8.0=pypi_0
- progressbar2 =4.0.0=pypi_0
- prometheus-client =0.9.0=pypi_0
- prompt-toolkit =3.0.29=pypi_0
- protobuf =3.20.1=pypi_0
- psutil =5.8.0=py39h27cfd23_1
- ptyprocess =0.7.0=pypi_0
- pure-eval =0.2.2=pypi_0
- py-lief =0.11.5=py39h295c915_1
- pyasn1 =0.4.8=pypi_0
- pyasn1-modules =0.2.8=pypi_0
- pycodestyle =2.8.0=pypi_0
- pycosat =0.6.3=py39h27cfd23_0
- pycparser =2.21=pyhd3eb1b0_0
- pygments =2.11.2=pypi_0
- pylint =2.12.2=pypi_0
- pymongo =3.12.3=pypi_0
- pyopenssl =21.0.0=pyhd3eb1b0_1
- pyparsing =3.0.8=pypi_0
- pyrsistent =0.18.1=pypi_0
- pysocks =1.7.1=py39h06a4308_0
- python =3.9.7=h12debd9_1
- python-dateutil =2.8.2=pypi_0
- python-libarchive-c =2.9=pyhd3eb1b0_1
- python-slugify =5.0.2=pypi_0
- python-utils =3.1.0=pypi_0
- python_abi =3.9=2_cp39
- pytz =2021.3=pyhd3eb1b0_0
- pyyaml =5.4.1=pypi_0
- pyzmq =22.3.0=pypi_0
- readline =8.1.2=h7f8727e_1
- regex =2022.3.15=pypi_0
- requests =2.27.1=pyhd3eb1b0_0
- requests-oauthlib =1.3.1=pypi_0
- requests-toolbelt =0.9.1=pypi_0
- ripgrep =12.1.1=0
- rsa =4.8=pypi_0
- ruamel_yaml =0.15.100=py39h27cfd23_0
- sacremoses =0.0.49=pypi_0
- scikit-learn =1.0.2=pypi_0
- scipy =1.8.0=pypi_0
- selectolax =0.3.6=pypi_0
- send2trash =1.8.0=pypi_0
- setuptools =58.0.4=py39h06a4308_0
- shellingham =1.4.0=pypi_0
- six =1.16.0=pyhd3eb1b0_0
- smart-open =5.2.1=pypi_0
- sniffio =1.2.0=pypi_0
- soupsieve =2.3.1=pyhd3eb1b0_0
- spinners =0.0.24=pypi_0
- sqlite =3.37.0=hc218d9a_0
- stack-data =0.2.0=pypi_0
- statsmodels =0.13.2=pypi_0
- tensorboard =2.8.0=pypi_0
- tensorboard-data-server =0.6.1=pypi_0
- tensorboard-plugin-wit =1.8.1=pypi_0
- termcolor =1.1.0=pypi_0
- terminado =0.13.3=pypi_0
- terminaltables =3.1.10=pypi_0
- text-unidecode =1.3=pypi_0
- threadpoolctl =3.1.0=pypi_0
- tinycss2 =1.1.1=pypi_0
- tk =8.6.11=h1ccaba5_0
- tokenizers =0.10.3=pypi_0
- toml =0.10.2=pypi_0
- torch =1.10.0=pypi_0
- torch-tb-profiler =0.2.1=pypi_0
- torchinfo =1.6.5=pypi_0
- torchmetrics =0.6.2=pypi_0
- torchtext =0.11.0=pypi_0
- tornado =6.1=pypi_0
- tqdm =4.62.3=pyhd3eb1b0_1
- traitlets =5.1.1=pypi_0
- transformers =4.15.0=pypi_0
- typing-extensions =4.2.0=pypi_0
- tzdata =2021e=hda174b7_0
- urllib3 =1.26.7=pyhd3eb1b0_0
- wcwidth =0.2.5=pypi_0
- webencodings =0.5.1=pypi_0
- websocket-client =0.57.0=pypi_0
- werkzeug =2.1.1=pypi_0
- wheel =0.35.1=pypi_0
- widgetsnbextension =3.5.2=pypi_0
- wrapt =1.13.3=pypi_0
- xz =5.2.5=h7b6447c_0
- yaml =0.2.5=h7b6447c_0
- yarl =1.7.2=pypi_0
- zipp =3.8.0=pypi_0
- zlib =1.2.11=h166bdaf_1014
- zstd =1.4.9=haebb681_0
- janjones/awe-gradient latest build
- janjones/awe-gradient 1650739890 build
- torch ==1.10.0
- autopep8 ==1.6.0
- gensim ==4.1.2
- gradient ==2.0.2
- ijson ==3.1.4
- inflection ==0.5.1
- ipywidgets ==7.6.5
- jupyterlab ==3.2.4
- matplotlib ==3.5.1
- plotnine ==0.8.0
- pylint ==2.12.2
- python-slugify ==5.0.2
- scikit-learn ==1.0.2
- selectolax ==0.3.6
- torch-tb-profiler ==0.2.1
- torchinfo ==1.6.5
- torchmetrics <0.7
- torchtext ==0.11.0
- transformers ==4.15.0