wiki-entity-summarization-preprocessor

Convert Wikidata and Wikipedia raw files into filterable formats, with a focus on marking Wikidata edges as summaries based on the corresponding Wikipedia abstracts.

https://github.com/msorkhpar/wiki-entity-summarization-preprocessor

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

distilbert java neo4j networkx postgresql python transformers wikes wiki-entity-summarization wikies
Last synced: 4 months ago

Repository

Convert Wikidata and Wikipedia raw files into filterable formats, with a focus on marking Wikidata edges as summaries based on the corresponding Wikipedia abstracts.

Basic Info
  • Host: GitHub
  • Owner: msorkhpar
  • License: cc-by-4.0
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 1.1 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Topics
distilbert java neo4j networkx postgresql python transformers wikes wiki-entity-summarization wikies
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md


Wiki Entity Summarization Pre-processing

Overview

This project focuses on the pre-processing steps required for the Wiki Entity Summarization (Wiki ES) project. It involves building the necessary databases and loading data from various sources to prepare for the entity summarization tasks.

Server Specifications

For the pre-processing steps, we used an r5a.4xlarge instance on AWS with the following specifications:

  • vCPU: 16 (AMD EPYC 7571, 16 MiB cache, 2.5 GHz)
  • Memory: 128 GB (DDR4, 2667 MT/s)
  • Storage: 500 GB (EBS, 2,880 Mbps max bandwidth)

Getting Started

To get started with the pre-processing, follow these steps:

  1. Build the wikimapper database:

```shell
pip install wikimapper
```

If you would like to download the latest version, run the following:

```shell
EN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path} wikimapper download enwiki-latest --dir $EN_WIKI_REDIRECT_AND_PAGES_PATH
```

Once enwiki-{VERSION}-page.sql.gz, enwiki-{VERSION}-redirect.sql.gz, and enwiki-{VERSION}-page_props.sql.gz are available under your data directory, run the following command:

```shell
VERSION={VERSION} EN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path} INDEX_DB_PATH="`pwd`/data/index_enwiki-$VERSION.db" wikimapper create enwiki-$VERSION --dumpdir $EN_WIKI_REDIRECT_AND_PAGES_PATH --target $INDEX_DB_PATH
```
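Once the index database has been created, you can sanity-check it directly from Python. The snippet below is a minimal sketch using wikimapper's `WikiMapper` class; the database path is an assumption and should point at the `$INDEX_DB_PATH` created above.

```python
from wikimapper import WikiMapper

# Assumed path: adjust to the $INDEX_DB_PATH created by `wikimapper create` above.
mapper = WikiMapper("data/index_enwiki-latest.db")

# Map an English Wikipedia title to its Wikidata ID (should print Q28865).
print(mapper.title_to_id("Python_(programming_language)"))

# Map a Wikidata ID back to the Wikipedia titles (including redirects) that resolve to it.
print(mapper.id_to_titles("Q28865"))
```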

  2. Load the created database into the Postgres database (see pgloader's documentation for installation instructions):

```shell
./config-files-generator.sh
source .env
cat <<EOT > sqlite-to-page-migration.load
load database
    from $INDEX_DB_PATH
    into postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$DB_NAME
with include drop, create tables, create indexes, reset sequences
;
EOT

pgloader ./sqlite-to-page-migration.load
```
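After the migration finishes, a quick row count confirms that the mapping data made it into Postgres. This is a minimal sketch, assuming the connection parameters come from the same .env used above and that pgloader kept wikimapper's default table name, `mapping`.

```python
import os

import psycopg2

# Assumes the connection parameters are exported from the same .env used by pgloader.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    port=os.environ["DB_PORT"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASS"],
)

with conn, conn.cursor() as cur:
    # "mapping" is wikimapper's default table name; assumed to be kept by pgloader.
    cur.execute("SELECT COUNT(*) FROM mapping")
    print("migrated rows:", cur.fetchone()[0])
```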

  3. Correct missing data: during our experiments we encountered some issues with the wikimapper library. To correct the missing data, run the following script:

```shell
python3 missing_data_correction.py
```

Data Sources

The pre-processing steps involve loading data from the following sources:

  • Wikidata, wikidatawiki latest version: First, download the latest version of the Wikidata dump. With the dump in place, run the following command to load the metadata of the Wikidata dataset into the Postgres database and the relationships between the entities into the Neo4j database. This module is called Wikidata Graph Builder (wdgp); a quick sanity check against the resulting Neo4j graph is sketched after this list.

```shell
docker-compose up wdgp
```

  • Wikipedia, enwiki latest version: The Wikipedia pages are used to extract the abstract and infobox of the corresponding Wikidata entity; the abstract and infobox are then used to annotate the summary in Wikidata. To provide this information, load the latest version of the Wikipedia dump into the Postgres database. This module is called Wikipedia Page Extractor (wppe).

```shell
docker-compose up wppe
```
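Once the wdgp container has finished, you can check that the graph landed in Neo4j. The snippet below is a minimal sketch using the neo4j Python driver; the bolt URI and credentials are assumptions and should be replaced with the values from your docker-compose/.env configuration.

```python
from neo4j import GraphDatabase

# Assumed connection details: substitute the values from your docker-compose/.env setup.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Count the entities and relationships loaded by the wdgp module.
    nodes = session.run("MATCH (n) RETURN count(n) AS c").single()["c"]
    rels = session.run("MATCH ()-[r]->() RETURN count(r) AS c").single()["c"]
    print(f"entities: {nodes}, relationships: {rels}")

driver.close()
```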

Summary Annotation

When both datasets are loaded into the databases, we process all the available pages in the Wikipedia dataset to extract the abstract and infobox of the corresponding Wikidata entity. The pages found in the extracted data are then marked, and the edges containing the marked pages are marked as summary candidates. Since Wikidata is a heterogeneous graph with multiple types of edges, we need to pick the most relevant edge between two entities as the summary for the summarization task. This module is called Wiki Summary Annotator (wsa), and we use DistilBERT to filter for the most relevant edge.

```shell
docker-compose up wsa
```
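The repository does not spell out how wsa scores edges, so the following is only an illustrative sketch of ranking candidate edge labels against an abstract with DistilBERT embeddings. The checkpoint name, the example abstract, and the cosine-similarity ranking are all assumptions, not the project's actual wsa logic.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; the README only states that DistilBERT is used.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool DistilBERT's last hidden state into a single sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Hypothetical abstract sentence and candidate edge labels between two entities.
abstract = "Aarhus University is a public research university located in Aarhus, Denmark."
candidates = ["located in the administrative territorial entity", "country", "instance of"]

abstract_vec = embed(abstract)
scores = {c: torch.cosine_similarity(abstract_vec, embed(c), dim=0).item() for c in candidates}

# The highest-scoring edge label would be kept as the summary edge for this pair.
print(max(scores, key=scores.get))
```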

Conclusion

By running the above commands, you will have the necessary databases and data loaded to start the Wiki Entity Summarization project. The next steps involve providing a set of seed nodes of your choice, along with other configuration parameters, to generate a fully customized entity summarization dataset.

Citation

If you use this project in your research, please cite the following paper:

```bibtex
@misc{javadi2024wiki,
  title         = {Wiki Entity Summarization Benchmark},
  author        = {Saeedeh Javadi and Atefeh Moradan and Mohammad Sorkhpar and Klim Zaporojets and Davide Mottin and Ira Assent},
  year          = {2024},
  eprint        = {2406.08435},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR}
}
```

License

This project is licensed under the CC BY 4.0 License. See the LICENSE file for details.

Owner

  • Name: Mo Sorkhpar
  • Login: msorkhpar
  • Kind: user
  • Location: IN, USA

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Javadi"
  given-names: "Saeedeh"
- family-names: "Moradan"
  given-names: "Atefeh"
- family-names: "Sorkhpar"
  given-names: "Mohammad"
  orcid: "https://orcid.org/0009-0006-2856-9225"
- family-names: "Zaporojets"
  given-names: "Klim"
  orcid: "https://orcid.org/0000-0003-4988-978X"
- family-names: "Mottin"
  given-names: "Davide"
  orcid: "https://orcid.org/0000-0001-8256-2258"
- family-names: "Assent"
  given-names: "Ira"
  orcid: "https://orcid.org/0000-0002-1091-9948"
title: "Wiki Entity Summarization Benchmark"
version: 1.0.5
doi: 10.48550/arXiv.2406.08435
date-released: 2024-06-12
url: "https://github.com/msorkhpar/wiki-entity-summarization"

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 13
  • Total Committers: 1
  • Avg Commits per committer: 13.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Mo Sorkhpar (S****r@o****m): 13 commits

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

wiki_summary/Dockerfile docker
  • base latest build
  • python 3.10 build
  • python 3.10-slim build
wikidata-graph-builder/Dockerfile docker
  • eclipse-temurin 21_35-jdk build
  • maven 3.9.4-eclipse-temurin-21-alpine build
wikipedia-page-extractor/Dockerfile docker
  • eclipse-temurin 21_35-jdk build
  • maven 3.9.4-eclipse-temurin-21-alpine build
pom.xml maven
wiki-storage/pom.xml maven
  • io.hypersistence:hypersistence-utils-hibernate-63 3.7.3
  • org.flywaydb:flyway-core
  • org.postgresql:postgresql
  • org.projectlombok:lombok
  • org.springframework.boot:spring-boot-devtools
  • org.springframework.boot:spring-boot-starter-data-jpa
  • org.springframework.boot:spring-boot-starter-data-neo4j
  • org.springframework:spring-jdbc
wikidata-graph-builder/pom.xml maven
  • com.github.msorkhpar:wiki-storage 1.0.0-SNAPSHOT
  • org.apache.commons:commons-compress ${apache.commons.compress.version}
  • org.apache.commons:commons-lang3 3.14.0
  • org.projectlombok:lombok
wikipedia-page-extractor/pom.xml maven
  • com.github.msorkhpar:wiki-storage 1.0.0-SNAPSHOT
  • org.apache.commons:commons-compress ${apache.commons.compress.version}
  • org.apache.commons:commons-lang3 3.14.0
  • org.projectlombok:lombok
poetry.lock pypi
  • certifi 2024.2.2
  • charset-normalizer 3.3.2
  • colorama 0.4.6
  • filelock 3.14.0
  • fsspec 2024.5.0
  • html2text 2024.2.26
  • huggingface-hub 0.23.2
  • idna 3.7
  • intel-openmp 2021.4.0
  • jinja2 3.1.4
  • lxml 5.2.2
  • markupsafe 2.1.5
  • mkl 2021.4.0
  • more-itertools 10.2.0
  • mpmath 1.3.0
  • mwparserfromhell 0.6.6
  • neo4j 5.20.0
  • networkx 3.3
  • numpy 1.26.4
  • packaging 24.0
  • pillow 10.3.0
  • psycopg2-binary 2.9.9
  • pycurl 7.45.3
  • pytz 2024.1
  • pyyaml 6.0.1
  • regex 2024.5.15
  • requests 2.32.3
  • safetensors 0.4.3
  • sympy 1.12.1
  • tbb 2021.12.0
  • tokenizers 0.19.1
  • torch 2.3.1+cpu
  • torchvision 0.18.1+cpu
  • tqdm 4.66.4
  • transformers 4.41.2
  • typing-extensions 4.12.0
  • urllib3 2.2.1
  • wcwidth 0.2.13
  • wikitextparser 0.55.13
  • wptools 0.4.17
pyproject.toml pypi
  • html2text ^2024.2.26
  • more-itertools ^10.2.0
  • mwparserfromhell ^0.6.6
  • neo4j ^5.20.0
  • networkx ^3.3
  • psycopg2-binary ^2.9.9
  • python ^3.10
  • torch ^2.3.1+cpu
  • torchvision ^0.18.1+cpu
  • transformers ^4.41.2
  • wikitextparser ^0.55.13
  • wptools ^0.4.17