wpextract
Create datasets from WordPress sites for research or archiving
Science Score: 85.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ✓ Institutional organization owner: Organization gatenlp has institutional domain (gate.ac.uk)
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: GateNLP
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://wpextract.readthedocs.io
- Size: 1.94 MB
Statistics
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 4
- Releases: 8
Metadata Files
README.md
WPextract - WordPress Site Extractor
WPextract is a tool to create datasets from WordPress sites.
- Archives posts, pages, tags, categories, media (including files), comments, and users
- Uses the WordPress API to guarantee 100% accurate and complete content
- Resolves internal links and media to IDs
- Automatically parses multilingual sites to create parallel datasets
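To make the "Uses the WordPress API" point concrete, here is a minimal sketch of paging through a WordPress REST API collection. It is not WPextract's own code: the function names `page_url`, `fetch_page`, and `iter_items` are hypothetical, and the sketch only assumes the standard `/wp-json/wp/v2/` route and the `X-WP-TotalPages` response header that WordPress core provides.

```python
import json
from typing import Callable, Iterator, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def page_url(site: str, resource: str, page: int, per_page: int = 100) -> str:
    """Build the URL for one page of a WordPress REST API collection."""
    query = urlencode({"per_page": per_page, "page": page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/{resource}?{query}"


def fetch_page(url: str) -> Tuple[list, int]:
    """Fetch one page; return its items and the total page count.

    WordPress reports the page count in the X-WP-TotalPages header.
    """
    with urlopen(url) as response:
        total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
        return json.load(response), total_pages


def iter_items(
    site: str,
    resource: str,
    fetch: Callable[[str], Tuple[list, int]] = fetch_page,
    per_page: int = 100,
) -> Iterator[dict]:
    """Yield every item of a collection (e.g. "posts"), page by page."""
    page, total_pages = 1, 1
    while page <= total_pages:
        items, total_pages = fetch(page_url(site, resource, page, per_page))
        yield from items
        page += 1
```

Calling `iter_items("https://example.org", "posts")` would walk all pages of posts on a site that exposes the REST API. The `fetch` parameter is injectable so the paging logic can be exercised without network access.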
Quickstart
See the complete documentation (https://wpextract.readthedocs.io) for more detailed usage.
- Install with pipx:
  $ pipx install wpextract
- Download site data:
  $ wpextract download "https://example.org" out_dl
- Process into a dataset:
  $ wpextract extract out_dl out_data
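The download and extract steps can also be chained from a script. The following is a minimal sketch that shells out to the two CLI commands shown in the Quickstart; the helper names `quickstart_commands` and `run_quickstart` are hypothetical, and it assumes `wpextract` is already installed and on the `PATH`.

```python
import shlex
import subprocess
from typing import List


def quickstart_commands(
    site_url: str, download_dir: str = "out_dl", dataset_dir: str = "out_data"
) -> List[List[str]]:
    """Return the two Quickstart commands as argument lists."""
    return [
        # Step 1: download raw site data into download_dir.
        ["wpextract", "download", site_url, download_dir],
        # Step 2: process the downloaded data into a dataset.
        ["wpextract", "extract", download_dir, dataset_dir],
    ]


def run_quickstart(site_url: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in quickstart_commands(site_url):
        print("$", shlex.join(command))
        if not dry_run:
            # check=True stops the pipeline if a step fails.
            subprocess.run(command, check=True)
```

With `dry_run=True` (the default here) the commands are only printed, which makes it easy to confirm the directories before any site is contacted.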
About WPextract
WPextract was built by Freddy Heppell of the GATE Project at the School of Computer Science, University of Sheffield. It was originally created to scrape mis/disinformation websites for research.
License
Available under the Apache 2.0 license. See LICENSE for more information.
Citing
Note: This software was developed for our EMNLP 2023 paper, "Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study". The code has been updated since the paper was written; for archival purposes, the precise version used for the study is available on Zenodo.
We'd love to hear about your use of our tool; you can email us to let us know. Feel free to create issues and/or pull requests for new features or bugs.
If you use this tool in published work, please cite our EMNLP paper:
Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.
Permanent references to each release of this software are available from Zenodo.
Owner
- Name: GateNLP
- Login: GateNLP
- Kind: organization
- Location: Sheffield, UK
- Website: https://gate.ac.uk/
- Twitter: gateAcUk
- Repositories: 170
- Profile: https://github.com/GateNLP
GATE - General Architecture for Text Engineering
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - given-names: Freddy
    family-names: Heppell
    orcid: 'https://orcid.org/0009-0003-7241-5846'
  - given-names: Kalina
    family-names: Bontcheva
    orcid: 'https://orcid.org/0000-0001-6152-9600'
  - given-names: Carolina
    family-names: Scarton
    orcid: 'https://orcid.org/0000-0002-0103-4072'
title: "WPextract"
version: 1.0.0
doi: 10.5281/zenodo.10008086
date-released: 2023-10-20
url: "https://github.com/GateNLP/wpextract"
preferred-citation:
  type: conference-paper
  authors:
    - given-names: Freddy
      family-names: Heppell
      orcid: 'https://orcid.org/0009-0003-7241-5846'
    - given-names: Kalina
      family-names: Bontcheva
      orcid: 'https://orcid.org/0000-0001-6152-9600'
    - given-names: Carolina
      family-names: Scarton
      orcid: 'https://orcid.org/0000-0002-0103-4072'
  collection-title: "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing"
  title: "Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study"
  year: 2023
  month: 12
  start: 5729
  end: 5741
GitHub Events
Total
- Create event: 6
- Issues event: 4
- Release event: 1
- Watch event: 2
- Delete event: 3
- Issue comment event: 2
- Push event: 4
- Pull request event: 8
Last Year
- Create event: 6
- Issues event: 4
- Release event: 1
- Watch event: 2
- Delete event: 3
- Issue comment event: 2
- Push event: 4
- Pull request event: 8
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Freddy Heppell | f****y@f****m | 88 |
| Ian Roberts | i****s@s****k | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 12
- Total pull requests: 43
- Average time to close issues: 3 days
- Average time to close pull requests: 3 days
- Total issue authors: 3
- Total pull request authors: 2
- Average comments per issue: 0.08
- Average comments per pull request: 0.0
- Merged pull requests: 42
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 8
- Average time to close issues: 1 day
- Average time to close pull requests: 17 days
- Issue authors: 3
- Pull request authors: 2
- Average comments per issue: 0.33
- Average comments per pull request: 0.0
- Merged pull requests: 8
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- freddyheppell (3)
- ducnguyenphanhoai (1)
- ducnguyen04071996 (1)
Pull Request Authors
- freddyheppell (56)
- ianroberts (2)
Packages
- Total packages: 1
- Total downloads: 26 last month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 9
- Total maintainers: 1
pypi.org: wpextract
Create datasets from WordPress sites
- Homepage: https://wpextract.readthedocs.io/
- Documentation: https://wpextract.readthedocs.io/
- License: Apache-2.0
- Latest release: 1.1.1 (published about 1 year ago)
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/setup-python v3 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v3 composite
- mariadb 10.6.4-focal
- wordpress 6.6
- actions/checkout v4 composite
- actions/download-artifact v3 composite
- actions/setup-python v5 composite
- actions/upload-artifact v3 composite
- pypa/gh-action-pypi-publish release/v1 composite
- sigstore/gh-action-sigstore-python v2.1.1 composite
- babel 2.15.0
- beautifulsoup4 4.12.3
- black 24.4.2
- build 0.9.0
- certifi 2024.7.4
- charset-normalizer 3.3.2
- click 8.1.7
- click-option-group 0.5.6
- colorama 0.4.6
- coverage 7.5.4
- exceptiongroup 1.2.1
- ghp-import 2.1.0
- griffe 0.47.0
- idna 3.7
- importlib-metadata 8.0.0
- iniconfig 2.0.0
- jinja2 3.1.4
- langcodes 3.4.0
- language-data 1.2.0
- lxml 5.2.2
- marisa-trie 1.2.0
- markdown 3.6
- markupsafe 2.1.5
- mergedeep 1.3.4
- mkdocs 1.6.0
- mkdocs-autorefs 1.0.1
- mkdocs-get-deps 0.2.0
- mkdocs-material 9.5.28
- mkdocs-material-extensions 1.3.1
- mkdocstrings 0.25.1
- mkdocstrings-python 1.10.5
- mypy-extensions 1.0.0
- numpy 2.0.0
- packaging 24.1
- paginate 0.5.6
- pandas 2.2.2
- pathspec 0.12.1
- pep517 0.13.1
- platformdirs 4.2.2
- pluggy 1.5.0
- pygments 2.18.0
- pymdown-extensions 10.8.1
- pytest 8.2.2
- pytest-datadir 1.5.0
- pytest-mock 3.14.0
- python-dateutil 2.9.0.post0
- pytz 2024.1
- pyyaml 6.0.1
- pyyaml-env-tag 0.1
- regex 2024.5.15
- requests 2.32.3
- responses 0.25.3
- ruff 0.5.1
- setuptools 70.3.0
- six 1.16.0
- soupsieve 2.5
- tomli 2.0.1
- tqdm 4.66.4
- typing-extensions 4.12.2
- tzdata 2024.1
- urllib3 2.2.2
- watchdog 4.0.1
- zipp 3.19.2