Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.4%) to scientific vocabulary
Repository
AI-based web extractor
Basic Info
- Host: GitHub
- Owner: jjonescz
- Language: Python
- Default Branch: main
- Homepage: https://bit.ly/awedemo
- Size: 2.16 MB
Statistics
- Stars: 11
- Watchers: 1
- Forks: 2
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
AI-based web extractor
This repository contains the source code of an AI-based structured web data extractor.
- 👨‍💻 Author: Jan Joneš
- 📜 Thesis: PDF, assignment, submission, slides
- 🚀 Demo: live, Docker Hub, examples below
- 🗃️ Data: SWDE with visuals
Directory structure
- 📂 awe/: Python module (data manipulation and machine learning). See awe/README.md.
- 📂 js/: Node.js app (visual attribute extractor and inference demo). See js/README.md.
- 📂 docs/
  - 📂 dev/
    - 📄 env.md: development environment setup.
    - 📄 tips.md: development guidelines and bash snippets.
    - 📄 data.md: dataset preparation.
    - 📄 extractor.md: running the visual extractor.
    - 📄 train.md: training instructions.
    - 📄 release.md: release instructions.
  - 📂 demo/
    - 📄 run.md: developing and running the demo.
    - 📄 deploy.md: online demo deployment.
Quickstart
Running the pre-trained demo locally
```bash
docker pull janjones/awe-demo
docker run --rm -it -p 3000:3000 janjones/awe-demo
```
Open a web browser and navigate to http://localhost:3000/.
For more details, see docs/demo/run.md.
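After `docker run`, the server may need a moment before it accepts connections. The following is a minimal Python sketch for waiting until the demo answers, assuming only the URL given above. The helper names `is_up` and `wait_until_up` are hypothetical and not part of this repository.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if an HTTP GET to `url` succeeds.

    Connection failures, timeouts, and HTTP error statuses all count as "down".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError):
        return False


def wait_until_up(url, attempts=30, delay=1.0, probe=is_up, sleep=time.sleep):
    """Call `probe(url)` up to `attempts` times, pausing `delay` seconds between tries.

    `probe` and `sleep` are injectable so the loop can be exercised without a network.
    """
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_up("http://localhost:3000/")` returns `True` as soon as the page can be fetched, or `False` once all attempts are used up.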
Training on the SWDE dataset
```bash
docker pull janjones/awe-gradient
docker run --rm -it -v awe:/storage -p 3000:3000 janjones/awe-gradient bash
```
Then, run inside the Docker container:
```bash
git clone https://github.com/jjonescz/awe .
git clone https://github.com/jjonescz/swde-visual data/swde
python -m awe.training.params
python -m awe.training.train
# Model is trained, now you can run the demo.
cd js
pnpm install
DEBUG=1 pnpm run server
```
For more details, see
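The two training commands must run in order, and the second should not start if the first fails. The following is a rough Python sketch of that sequencing. The module names are taken from the commands above; the helper `run_steps` and the use of `sys.executable` in place of `python` are assumptions made for illustration only.

```python
import subprocess
import sys

# Module invocations as listed in the quickstart; launching them through the
# current interpreter (sys.executable) rather than `python` is an assumption.
TRAINING_STEPS = [
    [sys.executable, "-m", "awe.training.params"],
    [sys.executable, "-m", "awe.training.train"],
]


def run_steps(steps, run=subprocess.run):
    """Run each command in order and return the commands that completed.

    With check=True, subprocess.run raises CalledProcessError on a non-zero
    exit code, so later steps are skipped after a failure.
    """
    completed = []
    for cmd in steps:
        run(cmd, check=True)
        completed.append(cmd)
    return completed
```

Inside the container, from the directory where the repository was cloned, this would be invoked as `run_steps(TRAINING_STEPS)`.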
Examples
Generated by the live demo.


Owner
- Name: Jan Jones
- Login: jjonescz
- Kind: user
- Location: Prague, Czech Republic
- Company: @Microsoft
- Website: janjones.me
- Repositories: 69
- Profile: https://github.com/jjonescz
C# and Razor compiler dev at @Microsoft
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Joneš"
    given-names: "Jan"
title: "AI-based Structured Web Data Extraction"
version: 1.0.0
date-released: 2022-05-04
url: "https://github.com/jjonescz/awe"
repository-code: "https://github.com/jjonescz/awe"
repository-artifacts: "https://github.com/jjonescz/awe/releases/tag/v1.0"
preferred-citation:
  type: thesis
  authors:
    - family-names: "Joneš"
      given-names: "Jan"
  title: "AI-based Structured Web Data Extraction"
  thesis-type: MS
  year: 2022
  department: Department of Software Engineering
  institution:
    name: Charles University
    city: Prague
    country: CZ
  date-published: 2022-06-15
  url: "http://hdl.handle.net/20.500.11956/174143"
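For readers who cite with BibTeX, the preferred citation can be turned into an entry by hand. The following is a small Python sketch using only the values listed above. Mapping `thesis-type: MS` to `@mastersthesis`, the helper name `to_bibtex`, and the citation key `jones2022awe` are assumptions, not something the repository provides.

```python
# Values copied from the preferred-citation fields of the CITATION.cff above.
CITATION = {
    "family_names": "Joneš",
    "given_names": "Jan",
    "title": "AI-based Structured Web Data Extraction",
    "year": 2022,
    "institution": "Charles University",
    "url": "http://hdl.handle.net/20.500.11956/174143",
}


def to_bibtex(c, key):
    """Format a thesis citation dict as a BibTeX @mastersthesis entry."""
    fields = [
        ("author", f"{c['family_names']}, {c['given_names']}"),
        ("title", c["title"]),
        ("school", c["institution"]),
        ("year", str(c["year"])),
        ("url", c["url"]),
    ]
    # Doubled braces are literal braces inside an f-string.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields)
    return f"@mastersthesis{{{key},\n{body}\n}}"
```

`print(to_bibtex(CITATION, "jones2022awe"))` emits one field per line. The author name is kept in UTF-8, which may need escaping for older BibTeX toolchains.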
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- actions/checkout v2 composite
- actions/checkout v2 composite
- superfly/flyctl-actions 1.3 composite
- actions/checkout v2 composite
- actions/checkout v2 composite
- actions/checkout v2 composite
- actions/upload-artifact v2 composite
- janjones/awe-gradient latest build
- janjones/awe-gradient 1650739890 build
- nvidia/cuda 9.2-base-ubuntu16.04 build
- @types/cli-progress 3.9.2 development
- @types/express 4.17.13 development
- @types/natural-compare-lite 1.4.0 development
- @types/node 16.11.6 development
- ts-node 10.4.0 development
- typescript 4.4.4 development
- @oclif/command 1.8.0
- @oclif/errors 1.3.5
- cheerio 1.0.0-rc.10
- cli-progress 3.9.1
- domhandler 4.3.1
- express 4.17.3
- fast-glob 3.2.7
- generic-pool 3.8.2
- html-template-tag 4.0.0
- natural-compare-lite 1.4.0
- puppeteer-core 11.0.0
- python-shell 3.0.1
- rxjs 7.4.0
- winston 3.3.3
- 252 dependencies
- absl-py =1.0.0=pypi_0
- anyio =3.5.0=pypi_0
- argon2-cffi =21.3.0=pypi_0
- argon2-cffi-bindings =21.2.0=pypi_0
- astroid =2.9.3=pypi_0
- asttokens =2.0.5=pypi_0
- attrs =18.2.0=pypi_0
- autopep8 =1.6.0=pypi_0
- babel =2.10.1=pypi_0
- backcall =0.2.0=pypi_0
- beautifulsoup4 =4.11.1=py39h06a4308_0
- bleach =5.0.0=pypi_0
- brotlipy =0.7.0=py39h27cfd23_1003
- bzip2 =1.0.8=h7b6447c_0
- ca-certificates =2021.10.8=ha878542_0
- cachetools =5.0.0=pypi_0
- certifi =2021.10.8=py39hf3d152e_2
- cffi =1.15.0=py39hd667e15_1
- chardet =4.0.0=py39h06a4308_1003
- charset-normalizer =2.0.4=pyhd3eb1b0_0
- click =8.1.2=pypi_0
- click-completion =0.5.2=pypi_0
- click-didyoumean =0.3.0=pypi_0
- click-help-colors =0.9.1=pypi_0
- colorama =0.4.3=pypi_0
- conda =4.12.0=py39hf3d152e_0
- conda-build =3.21.8=py39h06a4308_2
- conda-content-trust =0.1.1=pyhd3eb1b0_0
- conda-package-handling =1.7.3=py39h27cfd23_1
- cryptography =36.0.0=py39h9ce1e76_0
- cycler =0.11.0=pypi_0
- cython =0.29.28=pypi_0
- debugpy =1.6.0=pypi_0
- decorator =5.1.1=pypi_0
- defusedxml =0.7.1=pypi_0
- descartes =1.1.0=pypi_0
- entrypoints =0.4=pypi_0
- executing =0.8.3=pypi_0
- fastjsonschema =2.15.3=pypi_0
- filelock =3.6.0=pyhd3eb1b0_0
- fonttools =4.33.2=pypi_0
- gensim =4.1.2=pypi_0
- gh =2.6.0=ha8f183a_0
- glob2 =0.7=pyhd3eb1b0_0
- google-auth =2.6.6=pypi_0
- google-auth-oauthlib =0.4.6=pypi_0
- gql =3.0.0a6=pypi_0
- gradient =2.0.2=pypi_0
- gradient-utils =0.5.0=pypi_0
- graphql-core =3.1.7=pypi_0
- grpcio =1.44.0=pypi_0
- halo =0.0.31=pypi_0
- huggingface-hub =0.5.1=pypi_0
- icu =69.1=h9c3ff4c_0
- idna =3.3=pyhd3eb1b0_0
- ijson =3.1.4=pypi_0
- importlib-metadata =4.11.3=pypi_0
- inflection =0.5.1=pypi_0
- ipykernel =6.13.0=pypi_0
- ipython =8.2.0=pypi_0
- ipython-genutils =0.2.0=pypi_0
- ipywidgets =7.6.5=pypi_0
- isort =5.10.1=pypi_0
- jedi =0.18.1=pypi_0
- jinja2 =3.1.1=pypi_0
- joblib =1.1.0=pypi_0
- json5 =0.9.6=pypi_0
- jsonschema =4.4.0=pypi_0
- jupyter-client =7.2.2=pypi_0
- jupyter-core =4.10.0=pypi_0
- jupyter-server =1.16.0=pypi_0
- jupyterlab =3.2.4=pypi_0
- jupyterlab-pygments =0.2.2=pypi_0
- jupyterlab-server =2.13.0=pypi_0
- jupyterlab-widgets =1.1.0=pypi_0
- kiwisolver =1.4.2=pypi_0
- lazy-object-proxy =1.7.1=pypi_0
- ld_impl_linux-64 =2.35.1=h7274673_9
- libarchive =3.4.2=h62408e4_0
- libffi =3.3=he6710b0_2
- libgcc-ng =11.2.0=h1d223b6_16
- libiconv =1.16=h516909a_0
- liblief =0.11.5=h295c915_1
- libstdcxx-ng =11.2.0=he4da1e4_16
- libuv =1.42.0=h7f98852_0
- libxml2 =2.9.12=h885dcf4_1
- libzlib =1.2.11=h166bdaf_1014
- llvm-openmp =13.0.1=he0ac6c6_1
- log-symbols =0.0.14=pypi_0
- lz4-c =1.9.3=h295c915_1
- markdown =3.3.6=pypi_0
- markupsafe =2.0.1=py39h27cfd23_0
- marshmallow =2.21.0=pypi_0
- matplotlib =3.5.1=pypi_0
- matplotlib-inline =0.1.3=pypi_0
- mccabe =0.6.1=pypi_0
- mistune =0.8.4=pypi_0
- mizani =0.7.4=pypi_0
- multidict =6.0.2=pypi_0
- nbclassic =0.3.7=pypi_0
- nbclient =0.6.0=pypi_0
- nbconvert =6.5.0=pypi_0
- nbformat =5.3.0=pypi_0
- ncurses =6.3=h7f8727e_2
- nest-asyncio =1.5.5=pypi_0
- nodejs =17.1.0=h8ca31f7_2
- notebook =6.4.11=pypi_0
- notebook-shim =0.1.0=pypi_0
- numpy =1.22.3=pypi_0
- oauthlib =3.2.0=pypi_0
- openssl =1.1.1n=h166bdaf_0
- packaging =21.3=pypi_0
- palettable =3.3.0=pypi_0
- pandas =1.4.2=pypi_0
- pandocfilters =1.5.0=pypi_0
- parso =0.8.3=pypi_0
- patchelf =0.13=h295c915_0
- patsy =0.5.2=pypi_0
- pexpect =4.8.0=pypi_0
- pickleshare =0.7.5=pypi_0
- pillow =9.1.0=pypi_0
- pip =21.2.4=py39h06a4308_0
- pkginfo =1.8.2=pyhd3eb1b0_0
- platformdirs =2.5.2=pypi_0
- plotnine =0.8.0=pypi_0
- progressbar2 =4.0.0=pypi_0
- prometheus-client =0.9.0=pypi_0
- prompt-toolkit =3.0.29=pypi_0
- protobuf =3.20.1=pypi_0
- psutil =5.8.0=py39h27cfd23_1
- ptyprocess =0.7.0=pypi_0
- pure-eval =0.2.2=pypi_0
- py-lief =0.11.5=py39h295c915_1
- pyasn1 =0.4.8=pypi_0
- pyasn1-modules =0.2.8=pypi_0
- pycodestyle =2.8.0=pypi_0
- pycosat =0.6.3=py39h27cfd23_0
- pycparser =2.21=pyhd3eb1b0_0
- pygments =2.11.2=pypi_0
- pylint =2.12.2=pypi_0
- pymongo =3.12.3=pypi_0
- pyopenssl =21.0.0=pyhd3eb1b0_1
- pyparsing =3.0.8=pypi_0
- pyrsistent =0.18.1=pypi_0
- pysocks =1.7.1=py39h06a4308_0
- python =3.9.7=h12debd9_1
- python-dateutil =2.8.2=pypi_0
- python-libarchive-c =2.9=pyhd3eb1b0_1
- python-slugify =5.0.2=pypi_0
- python-utils =3.1.0=pypi_0
- python_abi =3.9=2_cp39
- pytz =2021.3=pyhd3eb1b0_0
- pyyaml =5.4.1=pypi_0
- pyzmq =22.3.0=pypi_0
- readline =8.1.2=h7f8727e_1
- regex =2022.3.15=pypi_0
- requests =2.27.1=pyhd3eb1b0_0
- requests-oauthlib =1.3.1=pypi_0
- requests-toolbelt =0.9.1=pypi_0
- ripgrep =12.1.1=0
- rsa =4.8=pypi_0
- ruamel_yaml =0.15.100=py39h27cfd23_0
- sacremoses =0.0.49=pypi_0
- scikit-learn =1.0.2=pypi_0
- scipy =1.8.0=pypi_0
- selectolax =0.3.6=pypi_0
- send2trash =1.8.0=pypi_0
- setuptools =58.0.4=py39h06a4308_0
- shellingham =1.4.0=pypi_0
- six =1.16.0=pyhd3eb1b0_0
- smart-open =5.2.1=pypi_0
- sniffio =1.2.0=pypi_0
- soupsieve =2.3.1=pyhd3eb1b0_0
- spinners =0.0.24=pypi_0
- sqlite =3.37.0=hc218d9a_0
- stack-data =0.2.0=pypi_0
- statsmodels =0.13.2=pypi_0
- tensorboard =2.8.0=pypi_0
- tensorboard-data-server =0.6.1=pypi_0
- tensorboard-plugin-wit =1.8.1=pypi_0
- termcolor =1.1.0=pypi_0
- terminado =0.13.3=pypi_0
- terminaltables =3.1.10=pypi_0
- text-unidecode =1.3=pypi_0
- threadpoolctl =3.1.0=pypi_0
- tinycss2 =1.1.1=pypi_0
- tk =8.6.11=h1ccaba5_0
- tokenizers =0.10.3=pypi_0
- toml =0.10.2=pypi_0
- torch =1.10.0=pypi_0
- torch-tb-profiler =0.2.1=pypi_0
- torchinfo =1.6.5=pypi_0
- torchmetrics =0.6.2=pypi_0
- torchtext =0.11.0=pypi_0
- tornado =6.1=pypi_0
- tqdm =4.62.3=pyhd3eb1b0_1
- traitlets =5.1.1=pypi_0
- transformers =4.15.0=pypi_0
- typing-extensions =4.2.0=pypi_0
- tzdata =2021e=hda174b7_0
- urllib3 =1.26.7=pyhd3eb1b0_0
- wcwidth =0.2.5=pypi_0
- webencodings =0.5.1=pypi_0
- websocket-client =0.57.0=pypi_0
- werkzeug =2.1.1=pypi_0
- wheel =0.35.1=pypi_0
- widgetsnbextension =3.5.2=pypi_0
- wrapt =1.13.3=pypi_0
- xz =5.2.5=h7b6447c_0
- yaml =0.2.5=h7b6447c_0
- yarl =1.7.2=pypi_0
- zipp =3.8.0=pypi_0
- zlib =1.2.11=h166bdaf_1014
- zstd =1.4.9=haebb681_0
- torch ==1.10.0
- autopep8 ==1.6.0
- gensim ==4.1.2
- gradient ==2.0.2
- ijson ==3.1.4
- inflection ==0.5.1
- ipywidgets ==7.6.5
- jupyterlab ==3.2.4
- matplotlib ==3.5.1
- plotnine ==0.8.0
- pylint ==2.12.2
- python-slugify ==5.0.2
- scikit-learn ==1.0.2
- selectolax ==0.3.6
- torch-tb-profiler ==0.2.1
- torchinfo ==1.6.5
- torchmetrics <0.7
- torchtext ==0.11.0
- transformers ==4.15.0
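The entries above mix two pin styles: `name =version=build` (as in a conda environment export, where a build string of `pypi_0` typically marks a package installed through pip) and pip-style specifiers such as `name ==version`. The following is a minimal Python sketch for splitting such lines, assuming exactly the layout shown in this listing. The helper `parse_pin` is hypothetical.

```python
def parse_pin(line):
    """Split a dependency line into (name, version, build).

    Handles 'name =version=build' and 'name ==version'. `build` is None when
    absent. Other operators (for example '<0.7') are left inside the version
    string unchanged.
    """
    # Drop the leading list marker, then separate the name from the specifier.
    name, _, spec = line.strip().lstrip("- ").partition(" ")
    # Remove the leading '=' or '==' and split off an optional build string.
    version, _, build = spec.lstrip("=").partition("=")
    return name, version, build or None
```

For example, `parse_pin("- torchmetrics <0.7")` keeps the operator and yields `("torchmetrics", "<0.7", None)`.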