awe

AI-based web extractor

https://github.com/jjonescz/awe

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Keywords

deep-learning information-extraction structured-web-data web-data-extraction web-scraping

Repository

Basic Info
  • Host: GitHub
  • Owner: jjonescz
  • Language: Python
  • Default Branch: main
  • Homepage: https://bit.ly/awedemo
  • Size: 2.16 MB
Statistics
  • Stars: 11
  • Watchers: 1
  • Forks: 2
  • Open Issues: 0
  • Releases: 4
Created over 4 years ago · Last pushed over 3 years ago

README.md

AI-based web extractor

This repository contains the source code of an AI-based structured web data extractor.

Directory structure

  • 📂 awe/: Python module (data manipulation and machine learning). See awe/README.md.
  • 📂 js/: Node.js app (visual attribute extractor and inference demo). See js/README.md.
  • 📂 docs/

Quickstart

Running the pre-trained demo locally

```bash
docker pull janjones/awe-demo
docker run --rm -it -p 3000:3000 janjones/awe-demo
```

Open a web browser and navigate to http://localhost:3000/.

For more details, see docs/demo/run.md.
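
The container may take a moment before it starts serving requests. As a minimal sketch (not part of the repository), the helper below polls the demo URL from the step above until it answers. The function name `wait_for_demo` and its timeout defaults are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_demo(url: str = "http://localhost:3000/",
                  timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns any HTTP response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # 2xx/3xx response: the server is up
        except urllib.error.HTTPError:
            return True  # the server answered, even if with an error status
        except (urllib.error.URLError, OSError):
            time.sleep(interval_s)  # nothing is listening yet; retry
    return False
```

Calling `wait_for_demo()` after `docker run` returns `True` once the page is reachable and `False` if nothing responds within the timeout.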

Training on the SWDE dataset

```bash
docker pull janjones/awe-gradient
docker run --rm -it -v awe:/storage -p 3000:3000 janjones/awe-gradient bash
```

Then, run inside the Docker container:

```bash
git clone https://github.com/jjonescz/awe .
git clone https://github.com/jjonescz/swde-visual data/swde
python -m awe.training.params
python -m awe.training.train

# Model is trained, now you can run the demo.
cd js
pnpm install
DEBUG=1 pnpm run server
```

For more details, see

  1. docs/dev/env.md,
  2. docs/data.md,
  3. docs/train.md, and
  4. docs/demo/run.md.
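
For repeated runs, the clone and training commands above could be scripted. The following is only a sketch under assumptions: `TRAINING_STEPS` and `run_steps` are hypothetical names, not part of the `awe` package, and the commands are copied verbatim from the quickstart. By default nothing is executed, so the parsed commands can be inspected first.

```python
import shlex
import subprocess

# Commands copied from the quickstart, in order.
TRAINING_STEPS = [
    "git clone https://github.com/jjonescz/awe .",
    "git clone https://github.com/jjonescz/swde-visual data/swde",
    "python -m awe.training.params",
    "python -m awe.training.train",
]


def run_steps(steps, dry_run=True):
    """Parse each command; when dry_run is False, run them in order.

    Execution stops at the first failing command because
    subprocess.run(..., check=True) raises CalledProcessError.
    """
    parsed = [shlex.split(step) for step in steps]
    if not dry_run:
        for argv in parsed:
            subprocess.run(argv, check=True)
    return parsed
```

Inside the container, `run_steps(TRAINING_STEPS, dry_run=False)` would execute the four commands one after another.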

Examples

Generated by the live demo.

E-shop 1

E-shop 2

Owner

  • Name: Jan Jones
  • Login: jjonescz
  • Kind: user
  • Location: Prague, Czech Republic
  • Company: @Microsoft

C# and Razor compiler dev at @Microsoft

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Joneš"
  given-names: "Jan"
title: "AI-based Structured Web Data Extraction"
version: 1.0.0
date-released: 2022-05-04
url: "https://github.com/jjonescz/awe"
repository-code: "https://github.com/jjonescz/awe"
repository-artifacts: "https://github.com/jjonescz/awe/releases/tag/v1.0"
preferred-citation:
  type: thesis
  authors:
  - family-names: "Joneš"
    given-names: "Jan"
  title: "AI-based Structured Web Data Extraction"
  thesis-type: MS
  year: 2022
  department: Department of Software Engineering
  institution:
    name: Charles University
    city: Prague
    country: CZ
  date-published: 2022-06-15
  url: "http://hdl.handle.net/20.500.11956/174143"
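
For readers who cite with BibTeX, the `preferred-citation` fields above can be mapped to a `@mastersthesis` entry. This is a sketch, not an official converter: the field mapping (`institution.name` to `school`), the function name, and the citation key `jones2022awe` are assumptions, and the dictionary is copied by hand from the CFF data rather than parsed from the file.

```python
def cff_thesis_to_bibtex(ref: dict, key: str) -> str:
    """Format a CFF thesis reference (as a dict) as a BibTeX @mastersthesis entry."""
    authors = " and ".join(
        f"{a['family-names']}, {a['given-names']}" for a in ref["authors"]
    )
    fields = {
        "author": authors,
        "title": ref["title"],
        "school": ref["institution"]["name"],  # assumed mapping
        "year": str(ref["year"]),
        "url": ref["url"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@mastersthesis{{{key},\n{body}\n}}"


# Values copied from the preferred-citation block of CITATION.cff.
PREFERRED_CITATION = {
    "authors": [{"family-names": "Joneš", "given-names": "Jan"}],
    "title": "AI-based Structured Web Data Extraction",
    "year": 2022,
    "institution": {"name": "Charles University"},
    "url": "http://hdl.handle.net/20.500.11956/174143",
}

# "jones2022awe" is a hypothetical citation key.
print(cff_thesis_to_bibtex(PREFERRED_CITATION, "jones2022awe"))
```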

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

.github/workflows/demo-docker-image.yml actions
  • actions/checkout v2 composite
.github/workflows/fly-deploy.yml actions
  • actions/checkout v2 composite
  • superfly/flyctl-actions 1.3 composite
.github/workflows/gradient-docker-image.yml actions
  • actions/checkout v2 composite
.github/workflows/heroku-deploy.yml actions
  • actions/checkout v2 composite
.github/workflows/training.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v2 composite
demo/Dockerfile docker
  • janjones/awe-gradient latest build
gitpod/Dockerfile docker
  • janjones/awe-gradient 1650739890 build
gradient/Dockerfile docker
  • nvidia/cuda 9.2-base-ubuntu16.04 build
js/package.json npm
  • @types/cli-progress 3.9.2 development
  • @types/express 4.17.13 development
  • @types/natural-compare-lite 1.4.0 development
  • @types/node 16.11.6 development
  • ts-node 10.4.0 development
  • typescript 4.4.4 development
  • @oclif/command 1.8.0
  • @oclif/errors 1.3.5
  • cheerio 1.0.0-rc.10
  • cli-progress 3.9.1
  • domhandler 4.3.1
  • express 4.17.3
  • fast-glob 3.2.7
  • generic-pool 3.8.2
  • html-template-tag 4.0.0
  • natural-compare-lite 1.4.0
  • puppeteer-core 11.0.0
  • python-shell 3.0.1
  • rxjs 7.4.0
  • winston 3.3.3
js/pnpm-lock.yaml npm
  • 252 dependencies
awe/requirements.txt pypi
  • absl-py =1.0.0=pypi_0
  • anyio =3.5.0=pypi_0
  • argon2-cffi =21.3.0=pypi_0
  • argon2-cffi-bindings =21.2.0=pypi_0
  • astroid =2.9.3=pypi_0
  • asttokens =2.0.5=pypi_0
  • attrs =18.2.0=pypi_0
  • autopep8 =1.6.0=pypi_0
  • babel =2.10.1=pypi_0
  • backcall =0.2.0=pypi_0
  • beautifulsoup4 =4.11.1=py39h06a4308_0
  • bleach =5.0.0=pypi_0
  • brotlipy =0.7.0=py39h27cfd23_1003
  • bzip2 =1.0.8=h7b6447c_0
  • ca-certificates =2021.10.8=ha878542_0
  • cachetools =5.0.0=pypi_0
  • certifi =2021.10.8=py39hf3d152e_2
  • cffi =1.15.0=py39hd667e15_1
  • chardet =4.0.0=py39h06a4308_1003
  • charset-normalizer =2.0.4=pyhd3eb1b0_0
  • click =8.1.2=pypi_0
  • click-completion =0.5.2=pypi_0
  • click-didyoumean =0.3.0=pypi_0
  • click-help-colors =0.9.1=pypi_0
  • colorama =0.4.3=pypi_0
  • conda =4.12.0=py39hf3d152e_0
  • conda-build =3.21.8=py39h06a4308_2
  • conda-content-trust =0.1.1=pyhd3eb1b0_0
  • conda-package-handling =1.7.3=py39h27cfd23_1
  • cryptography =36.0.0=py39h9ce1e76_0
  • cycler =0.11.0=pypi_0
  • cython =0.29.28=pypi_0
  • debugpy =1.6.0=pypi_0
  • decorator =5.1.1=pypi_0
  • defusedxml =0.7.1=pypi_0
  • descartes =1.1.0=pypi_0
  • entrypoints =0.4=pypi_0
  • executing =0.8.3=pypi_0
  • fastjsonschema =2.15.3=pypi_0
  • filelock =3.6.0=pyhd3eb1b0_0
  • fonttools =4.33.2=pypi_0
  • gensim =4.1.2=pypi_0
  • gh =2.6.0=ha8f183a_0
  • glob2 =0.7=pyhd3eb1b0_0
  • google-auth =2.6.6=pypi_0
  • google-auth-oauthlib =0.4.6=pypi_0
  • gql =3.0.0a6=pypi_0
  • gradient =2.0.2=pypi_0
  • gradient-utils =0.5.0=pypi_0
  • graphql-core =3.1.7=pypi_0
  • grpcio =1.44.0=pypi_0
  • halo =0.0.31=pypi_0
  • huggingface-hub =0.5.1=pypi_0
  • icu =69.1=h9c3ff4c_0
  • idna =3.3=pyhd3eb1b0_0
  • ijson =3.1.4=pypi_0
  • importlib-metadata =4.11.3=pypi_0
  • inflection =0.5.1=pypi_0
  • ipykernel =6.13.0=pypi_0
  • ipython =8.2.0=pypi_0
  • ipython-genutils =0.2.0=pypi_0
  • ipywidgets =7.6.5=pypi_0
  • isort =5.10.1=pypi_0
  • jedi =0.18.1=pypi_0
  • jinja2 =3.1.1=pypi_0
  • joblib =1.1.0=pypi_0
  • json5 =0.9.6=pypi_0
  • jsonschema =4.4.0=pypi_0
  • jupyter-client =7.2.2=pypi_0
  • jupyter-core =4.10.0=pypi_0
  • jupyter-server =1.16.0=pypi_0
  • jupyterlab =3.2.4=pypi_0
  • jupyterlab-pygments =0.2.2=pypi_0
  • jupyterlab-server =2.13.0=pypi_0
  • jupyterlab-widgets =1.1.0=pypi_0
  • kiwisolver =1.4.2=pypi_0
  • lazy-object-proxy =1.7.1=pypi_0
  • ld_impl_linux-64 =2.35.1=h7274673_9
  • libarchive =3.4.2=h62408e4_0
  • libffi =3.3=he6710b0_2
  • libgcc-ng =11.2.0=h1d223b6_16
  • libiconv =1.16=h516909a_0
  • liblief =0.11.5=h295c915_1
  • libstdcxx-ng =11.2.0=he4da1e4_16
  • libuv =1.42.0=h7f98852_0
  • libxml2 =2.9.12=h885dcf4_1
  • libzlib =1.2.11=h166bdaf_1014
  • llvm-openmp =13.0.1=he0ac6c6_1
  • log-symbols =0.0.14=pypi_0
  • lz4-c =1.9.3=h295c915_1
  • markdown =3.3.6=pypi_0
  • markupsafe =2.0.1=py39h27cfd23_0
  • marshmallow =2.21.0=pypi_0
  • matplotlib =3.5.1=pypi_0
  • matplotlib-inline =0.1.3=pypi_0
  • mccabe =0.6.1=pypi_0
  • mistune =0.8.4=pypi_0
  • mizani =0.7.4=pypi_0
  • multidict =6.0.2=pypi_0
  • nbclassic =0.3.7=pypi_0
  • nbclient =0.6.0=pypi_0
  • nbconvert =6.5.0=pypi_0
  • nbformat =5.3.0=pypi_0
  • ncurses =6.3=h7f8727e_2
  • nest-asyncio =1.5.5=pypi_0
  • nodejs =17.1.0=h8ca31f7_2
  • notebook =6.4.11=pypi_0
  • notebook-shim =0.1.0=pypi_0
  • numpy =1.22.3=pypi_0
  • oauthlib =3.2.0=pypi_0
  • openssl =1.1.1n=h166bdaf_0
  • packaging =21.3=pypi_0
  • palettable =3.3.0=pypi_0
  • pandas =1.4.2=pypi_0
  • pandocfilters =1.5.0=pypi_0
  • parso =0.8.3=pypi_0
  • patchelf =0.13=h295c915_0
  • patsy =0.5.2=pypi_0
  • pexpect =4.8.0=pypi_0
  • pickleshare =0.7.5=pypi_0
  • pillow =9.1.0=pypi_0
  • pip =21.2.4=py39h06a4308_0
  • pkginfo =1.8.2=pyhd3eb1b0_0
  • platformdirs =2.5.2=pypi_0
  • plotnine =0.8.0=pypi_0
  • progressbar2 =4.0.0=pypi_0
  • prometheus-client =0.9.0=pypi_0
  • prompt-toolkit =3.0.29=pypi_0
  • protobuf =3.20.1=pypi_0
  • psutil =5.8.0=py39h27cfd23_1
  • ptyprocess =0.7.0=pypi_0
  • pure-eval =0.2.2=pypi_0
  • py-lief =0.11.5=py39h295c915_1
  • pyasn1 =0.4.8=pypi_0
  • pyasn1-modules =0.2.8=pypi_0
  • pycodestyle =2.8.0=pypi_0
  • pycosat =0.6.3=py39h27cfd23_0
  • pycparser =2.21=pyhd3eb1b0_0
  • pygments =2.11.2=pypi_0
  • pylint =2.12.2=pypi_0
  • pymongo =3.12.3=pypi_0
  • pyopenssl =21.0.0=pyhd3eb1b0_1
  • pyparsing =3.0.8=pypi_0
  • pyrsistent =0.18.1=pypi_0
  • pysocks =1.7.1=py39h06a4308_0
  • python =3.9.7=h12debd9_1
  • python-dateutil =2.8.2=pypi_0
  • python-libarchive-c =2.9=pyhd3eb1b0_1
  • python-slugify =5.0.2=pypi_0
  • python-utils =3.1.0=pypi_0
  • python_abi =3.9=2_cp39
  • pytz =2021.3=pyhd3eb1b0_0
  • pyyaml =5.4.1=pypi_0
  • pyzmq =22.3.0=pypi_0
  • readline =8.1.2=h7f8727e_1
  • regex =2022.3.15=pypi_0
  • requests =2.27.1=pyhd3eb1b0_0
  • requests-oauthlib =1.3.1=pypi_0
  • requests-toolbelt =0.9.1=pypi_0
  • ripgrep =12.1.1=0
  • rsa =4.8=pypi_0
  • ruamel_yaml =0.15.100=py39h27cfd23_0
  • sacremoses =0.0.49=pypi_0
  • scikit-learn =1.0.2=pypi_0
  • scipy =1.8.0=pypi_0
  • selectolax =0.3.6=pypi_0
  • send2trash =1.8.0=pypi_0
  • setuptools =58.0.4=py39h06a4308_0
  • shellingham =1.4.0=pypi_0
  • six =1.16.0=pyhd3eb1b0_0
  • smart-open =5.2.1=pypi_0
  • sniffio =1.2.0=pypi_0
  • soupsieve =2.3.1=pyhd3eb1b0_0
  • spinners =0.0.24=pypi_0
  • sqlite =3.37.0=hc218d9a_0
  • stack-data =0.2.0=pypi_0
  • statsmodels =0.13.2=pypi_0
  • tensorboard =2.8.0=pypi_0
  • tensorboard-data-server =0.6.1=pypi_0
  • tensorboard-plugin-wit =1.8.1=pypi_0
  • termcolor =1.1.0=pypi_0
  • terminado =0.13.3=pypi_0
  • terminaltables =3.1.10=pypi_0
  • text-unidecode =1.3=pypi_0
  • threadpoolctl =3.1.0=pypi_0
  • tinycss2 =1.1.1=pypi_0
  • tk =8.6.11=h1ccaba5_0
  • tokenizers =0.10.3=pypi_0
  • toml =0.10.2=pypi_0
  • torch =1.10.0=pypi_0
  • torch-tb-profiler =0.2.1=pypi_0
  • torchinfo =1.6.5=pypi_0
  • torchmetrics =0.6.2=pypi_0
  • torchtext =0.11.0=pypi_0
  • tornado =6.1=pypi_0
  • tqdm =4.62.3=pyhd3eb1b0_1
  • traitlets =5.1.1=pypi_0
  • transformers =4.15.0=pypi_0
  • typing-extensions =4.2.0=pypi_0
  • tzdata =2021e=hda174b7_0
  • urllib3 =1.26.7=pyhd3eb1b0_0
  • wcwidth =0.2.5=pypi_0
  • webencodings =0.5.1=pypi_0
  • websocket-client =0.57.0=pypi_0
  • werkzeug =2.1.1=pypi_0
  • wheel =0.35.1=pypi_0
  • widgetsnbextension =3.5.2=pypi_0
  • wrapt =1.13.3=pypi_0
  • xz =5.2.5=h7b6447c_0
  • yaml =0.2.5=h7b6447c_0
  • yarl =1.7.2=pypi_0
  • zipp =3.8.0=pypi_0
  • zlib =1.2.11=h166bdaf_1014
  • zstd =1.4.9=haebb681_0
gradient/requirements-torch.txt pypi
  • torch ==1.10.0
gradient/requirements.txt pypi
  • autopep8 ==1.6.0
  • gensim ==4.1.2
  • gradient ==2.0.2
  • ijson ==3.1.4
  • inflection ==0.5.1
  • ipywidgets ==7.6.5
  • jupyterlab ==3.2.4
  • matplotlib ==3.5.1
  • plotnine ==0.8.0
  • pylint ==2.12.2
  • python-slugify ==5.0.2
  • scikit-learn ==1.0.2
  • selectolax ==0.3.6
  • torch-tb-profiler ==0.2.1
  • torchinfo ==1.6.5
  • torchmetrics <0.7
  • torchtext ==0.11.0
  • transformers ==4.15.0
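
The `awe/requirements.txt` entries listed above appear to follow the `name=version=build` layout of a conda environment export, where the build string `pypi_0` marks packages installed through pip. Assuming that reading is right, the sketch below splits such lines into their parts and separates pip-installed pins from the rest. `Pin`, `parse_pin`, and `split_by_source` are hypothetical helpers, not part of the repository.

```python
from typing import NamedTuple


class Pin(NamedTuple):
    name: str
    version: str
    build: str


def parse_pin(line: str) -> Pin:
    """Split 'name =version=build' (spaces optional) into its three parts."""
    name, version, build = (part.strip() for part in line.split("="))
    return Pin(name, version, build)


def split_by_source(lines):
    """Return (pip_pins, other_pins), treating build 'pypi_0' as pip-installed."""
    pins = [parse_pin(line) for line in lines if line.strip()]
    pip = [p for p in pins if p.build == "pypi_0"]
    other = [p for p in pins if p.build != "pypi_0"]
    return pip, other


# Two entries copied from the list above.
sample = ["absl-py =1.0.0=pypi_0", "python =3.9.7=h12debd9_1"]
pip_pins, other_pins = split_by_source(sample)
```

Here `pip_pins` holds the `absl-py` entry and `other_pins` holds the `python` entry; a line that does not contain exactly two `=` signs raises `ValueError`.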