wpextract

Create datasets from WordPress sites for research or archiving

https://github.com/gatenlp/wpextract

Science Score: 85.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
✓ Institutional organization owner: Organization gatenlp has institutional domain (gate.ac.uk)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.3%) to scientific vocabulary

Keywords

corpus crawler nlp text-extraction text-mining web-scraping wordpress

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: GateNLP
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://wpextract.readthedocs.io
Size: 1.94 MB

Statistics

Stars: 4
Watchers: 3
Forks: 0
Open Issues: 4
Releases: 8

Created over 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme Contributing License Citation

README.md

WPextract - WordPress Site Extractor

WPextract is a tool to create datasets from WordPress sites.

- Archives posts, pages, tags, categories, media (including files), comments, and users
- Uses the WordPress API to guarantee 100% accurate and complete content
- Resolves internal links and media to IDs
- Automatically parses multilingual sites to create parallel datasets
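To illustrate what reading content through the WordPress API involves, here is a minimal sketch of paging through the standard WordPress REST API posts endpoint (`/wp-json/wp/v2/posts`, with the `per_page` and `page` parameters and the `X-WP-TotalPages` response header). It is not WPextract's own implementation; the function names `http_get_json` and `iter_posts` and the injectable `fetch` parameter are hypothetical and exist only for this example.

```python
import json
from typing import Callable, Iterator
from urllib.parse import urlencode
from urllib.request import urlopen


def http_get_json(url: str) -> tuple[list, int]:
    """Fetch one page of results; return (items, total_pages).

    WordPress reports the number of available pages in the
    X-WP-TotalPages response header.
    """
    with urlopen(url) as resp:
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def iter_posts(
    site: str,
    fetch: Callable[[str], tuple[list, int]] = http_get_json,
    per_page: int = 100,  # 100 is the maximum the REST API accepts
) -> Iterator[dict]:
    """Yield every post on the site, one REST API page at a time."""
    page, total_pages = 1, 1
    while page <= total_pages:
        query = urlencode({"per_page": per_page, "page": page})
        url = f"{site.rstrip('/')}/wp-json/wp/v2/posts?{query}"
        items, total_pages = fetch(url)
        yield from items
        page += 1


# Example (performs real HTTP requests, so it is left commented out):
# for post in iter_posts("https://example.org"):
#     print(post["id"], post["link"])
```

Passing the fetch function as a parameter keeps the paging logic testable without network access; a real crawler would also need error handling and rate limiting, which this sketch omits.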

Quickstart

See the complete documentation (https://wpextract.readthedocs.io) for more detailed usage.

1. Install with pipx:

       $ pipx install wpextract

2. Download site data:

       $ wpextract download "https://example.org" out_dl

3. Process into a dataset:

       $ wpextract extract out_dl out_data
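The download and extract steps above can also be driven from a script. The following is a minimal sketch that shells out to the `wpextract` command line exactly as shown in the quickstart; it assumes `wpextract` is already installed and on the `PATH`. The helper names `build_commands` and `run_pipeline` are hypothetical and are not part of WPextract.

```python
import subprocess
from pathlib import Path


def build_commands(site_url: str, download_dir: Path, dataset_dir: Path) -> list[list[str]]:
    """Return the two quickstart commands as argument lists."""
    return [
        # Step 2 of the quickstart: download the raw site data.
        ["wpextract", "download", site_url, str(download_dir)],
        # Step 3 of the quickstart: process the download into a dataset.
        ["wpextract", "extract", str(download_dir), str(dataset_dir)],
    ]


def run_pipeline(site_url: str, download_dir: Path, dataset_dir: Path) -> None:
    """Run download then extract, stopping at the first failure."""
    for cmd in build_commands(site_url, download_dir, dataset_dir):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


# Example (requires wpextract to be installed, so it is left commented out):
# run_pipeline("https://example.org", Path("out_dl"), Path("out_data"))
```

Using argument lists rather than a shell string avoids quoting problems when the site URL contains characters such as `&` or `?`.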

About WPextract

WPextract was built by Freddy Heppell of the GATE Project at the School of Computer Science, University of Sheffield. It was originally created to scrape mis/disinformation websites for research.

License

Available under the Apache 2.0 license. See LICENSE for more information.

Citing

Note: This software was developed for our EMNLP 2023 paper Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. The code has been updated since the paper was written; for archival purposes, the precise version used for the study is available on Zenodo.

We'd love to hear about your use of our tool; you can email us to let us know! Feel free to create issues or pull requests for new features or bugs.

If you use this tool in published work, please cite our EMNLP paper:

Freddy Heppell, Kalina Bontcheva, and Carolina Scarton. 2023. Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5729–5741, Singapore. Association for Computational Linguistics.

Permanent references to each release of this software are available from Zenodo.

Owner

Name: GateNLP
Login: GateNLP
Kind: organization
Location: Sheffield, UK

Website: https://gate.ac.uk/
Twitter: gateAcUk
Repositories: 170
Profile: https://github.com/GateNLP

GATE - General Architecture for Text Engineering

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - given-names: Freddy
    family-names: Heppell
    orcid: 'https://orcid.org/0009-0003-7241-5846'
  - given-names: Kalina
    family-names: Bontcheva
    orcid: 'https://orcid.org/0000-0001-6152-9600'
  - given-names: Carolina
    family-names: Scarton
    orcid: 'https://orcid.org/0000-0002-0103-4072'
title: "WPextract"
version: 1.0.0
doi: 10.5281/zenodo.10008086
date-released: 2023-10-20
url: "https://github.com/GateNLP/wpextract"
preferred-citation:
  type: conference-paper
  authors:
  - given-names: Freddy
    family-names: Heppell
    orcid: 'https://orcid.org/0009-0003-7241-5846'
  - given-names: Kalina
    family-names: Bontcheva
    orcid: 'https://orcid.org/0000-0001-6152-9600'
  - given-names: Carolina
    family-names: Scarton
    orcid: 'https://orcid.org/0000-0002-0103-4072'
  collection-title: "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing"
  title: "Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study"
  year: 2023
  month: 12
  start: 5729
  end: 5741

GitHub Events

Total

Create event: 6
Issues event: 4
Release event: 1
Watch event: 2
Delete event: 3
Issue comment event: 2
Push event: 4
Pull request event: 8

Last Year

Create event: 6
Issues event: 4
Release event: 1
Watch event: 2
Delete event: 3
Issue comment event: 2
Push event: 4
Pull request event: 8

Committers

Last synced: about 1 year ago

All Time

Total Commits: 89
Total Committers: 2
Avg Commits per committer: 44.5
Development Distribution Score (DDS): 0.011

Past Year

Commits: 27
Committers: 2
Avg Commits per committer: 13.5
Development Distribution Score (DDS): 0.037

Top Committers

Name	Email	Commits
Freddy Heppell	f**y@f**m	88
Ian Roberts	i**s@s**k	1
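The Development Distribution Score reported above is consistent with the common definition of one minus the top committer's share of all commits. The sketch below reproduces the listed figures under that assumption; the function name is hypothetical, and the 26/1 split used for the past year is inferred from the reported totals (27 commits, 2 committers, DDS 0.037) rather than stated on this page.

```python
def development_distribution_score(commits_per_committer: list[int]) -> float:
    """Return 1 - (top committer's commits / total commits).

    0.0 means a single person made every commit; values closer to 1.0
    mean the work is spread more evenly across committers.
    """
    total = sum(commits_per_committer)
    if total == 0:
        return 0.0
    return 1 - max(commits_per_committer) / total


# All-time commit counts from the Top Committers table.
all_time = [88, 1]
print(round(development_distribution_score(all_time), 3))  # 0.011
print(sum(all_time) / len(all_time))  # 44.5 average commits per committer

# Past-year split is an assumption inferred from the reported totals.
past_year = [26, 1]
print(round(development_distribution_score(past_year), 3))  # 0.037
```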

Committer Domains (Top 20 + Academic)

sheffield.ac.uk: 1
freddyheppell.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 12
Total pull requests: 43
Average time to close issues: 3 days
Average time to close pull requests: 3 days
Total issue authors: 3
Total pull request authors: 2
Average comments per issue: 0.08
Average comments per pull request: 0.0
Merged pull requests: 42
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 8
Average time to close issues: 1 day
Average time to close pull requests: 17 days
Issue authors: 3
Pull request authors: 2
Average comments per issue: 0.33
Average comments per pull request: 0.0
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

freddyheppell (3)
ducnguyenphanhoai (1)
ducnguyen04071996 (1)

Pull Request Authors

freddyheppell (56)
ianroberts (2)

Top Labels

Issue Labels

documentation (2)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 26 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 9
Total maintainers: 1

pypi.org: wpextract

Create datasets from WordPress sites

Homepage: https://wpextract.readthedocs.io/
Documentation: https://wpextract.readthedocs.io/
License: Apache-2.0
Latest release: 1.1.1 (published over 1 year ago)

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 26 Last month

Rankings

Dependent packages count: 10.7%

Average: 35.4%

Dependent repos count: 60.0%

Maintainers (1)

freddyheppell

Last synced: 11 months ago

Dependencies

.github/workflows/lint.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/test.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v3 composite

pyproject.toml pypi

tests/e2e/tools/docker-compose.yml docker

mariadb 10.6.4-focal
wordpress 6.6

.github/workflows/publish.yml actions

actions/checkout v4 composite
actions/download-artifact v3 composite
actions/setup-python v5 composite
actions/upload-artifact v3 composite
pypa/gh-action-pypi-publish release/v1 composite
sigstore/gh-action-sigstore-python v2.1.1 composite

poetry.lock pypi

babel 2.15.0
beautifulsoup4 4.12.3
black 24.4.2
build 0.9.0
certifi 2024.7.4
charset-normalizer 3.3.2
click 8.1.7
click-option-group 0.5.6
colorama 0.4.6
coverage 7.5.4
exceptiongroup 1.2.1
ghp-import 2.1.0
griffe 0.47.0
idna 3.7
importlib-metadata 8.0.0
iniconfig 2.0.0
jinja2 3.1.4
langcodes 3.4.0
language-data 1.2.0
lxml 5.2.2
marisa-trie 1.2.0
markdown 3.6
markupsafe 2.1.5
mergedeep 1.3.4
mkdocs 1.6.0
mkdocs-autorefs 1.0.1
mkdocs-get-deps 0.2.0
mkdocs-material 9.5.28
mkdocs-material-extensions 1.3.1
mkdocstrings 0.25.1
mkdocstrings-python 1.10.5
mypy-extensions 1.0.0
numpy 2.0.0
packaging 24.1
paginate 0.5.6
pandas 2.2.2
pathspec 0.12.1
pep517 0.13.1
platformdirs 4.2.2
pluggy 1.5.0
pygments 2.18.0
pymdown-extensions 10.8.1
pytest 8.2.2
pytest-datadir 1.5.0
pytest-mock 3.14.0
python-dateutil 2.9.0.post0
pytz 2024.1
pyyaml 6.0.1
pyyaml-env-tag 0.1
regex 2024.5.15
requests 2.32.3
responses 0.25.3
ruff 0.5.1
setuptools 70.3.0
six 1.16.0
soupsieve 2.5
tomli 2.0.1
tqdm 4.66.4
typing-extensions 4.12.2
tzdata 2024.1
urllib3 2.2.2
watchdog 4.0.1
zipp 3.19.2