wiki-entity-summarization

This repository hosts a comprehensive suite for generating graph-based entity-summarization datasets from user-selected Wikipedia pages. Through a series of interconnected modules, it leverages Wikidata and Wikipedia dumps to construct each dataset along with auto-generated ground truths.

https://github.com/msorkhpar/wiki-entity-summarization

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Keywords

dataset dataset-generator entity-summarization neo4j networkx python wiki-entity-summarization wikies
Last synced: 6 months ago

Repository

This repository hosts a comprehensive suite for generating graph-based entity-summarization datasets from user-selected Wikipedia pages. Through a series of interconnected modules, it leverages Wikidata and Wikipedia dumps to construct each dataset along with auto-generated ground truths.

Basic Info
  • Host: GitHub
  • Owner: msorkhpar
  • License: cc-by-4.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 35.3 MB
Statistics
  • Stars: 21
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 4
Topics
dataset dataset-generator entity-summarization neo4j networkx python wiki-entity-summarization wikies
Created almost 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md


Wiki Entity Summarization Benchmark (WikES)

This repository leverages the wiki-entity-summarization-preprocessor project to construct an entity summarization graph from a given set of seed nodes. To preserve the structure of the Wikidata knowledge graph, it performs random-walk sampling of depth K starting from the seed nodes, after all the summary edges have been added to the result. It then checks whether the expanded graph forms a single weakly connected component; if not, it finds B paths to connect the components. The final result is a heterogeneous graph consisting of the seed nodes, their summary edges, the (1..K)-hop neighbors of the seed nodes and their edges, and any intermediary nodes added to ensure graph connectivity. Each node and edge is enriched with metadata obtained from Wikidata and Wikipedia, along with predicate information, providing additional context about the entities and their relationships.

*Figure: a single root entity with its summary edges and the additional edges expanded by random walk.*

Loading the Datasets

Load Using wikes-toolkit

To load the datasets, we have introduced a toolkit that can be used to download, load, work with, and evaluate the 48 Wiki-Entity-Summarization datasets. The toolkit is available as a Python package and can be installed using pip:

```bash
pip install wikes-toolkit
```

A simple example of how to use the toolkit is as follows:

```python
from wikes_toolkit import WikESToolkit, V1, WikESGraph

toolkit = WikESToolkit(save_path="./data")  # save_path is optional
G = toolkit.load_graph(
    WikESGraph,
    V1.WikiLitArt.SMALL,
    entity_formatter=lambda e: f"Entity({e.wikidata_label})",
    predicate_formatter=lambda p: f"Predicate({p.label})",
    triple_formatter=lambda t: f"({t.subject_entity.wikidata_label})-[{t.predicate.label}]-> ({t.object_entity.wikidata_label})",
)

root_nodes = G.root_entities()
nodes = G.entities()
```

Please refer to the Wiki-Entity-Summarization-Toolkit repository for more information.

Using mlcroissant

To load WikES datasets, you can use mlcroissant as well. You can find the metadata JSON files in the dataset details table.

Here is an example of loading our dataset using mlcroissant:

```python
from mlcroissant import Dataset

def print_first_item(record_name):
    for record in dataset.records(record_set=record_name):
        for key, val in record.items():
            if isinstance(val, bytes):
                val = str(val, "utf-8")
            print(f"{key}={val}", end=", ")
        break
    print()

dataset = Dataset(
    jsonld="https://github.com/msorkhpar/wiki-entity-summarization/releases/download/1.0.5/WikiProFem-s.json"
)

print(dataset.metadata.record_sets)

print_first_item("entities")
print_first_item("root-entities")
print_first_item("predicates")
print_first_item("triples")
print_first_item("ground-truths")
"""
The output of the above code:

wikes-dataset
[RecordSet(uuid="entities"), RecordSet(uuid="root-entities"), RecordSet(uuid="predicates"), RecordSet(uuid="triples"), RecordSet(uuid="ground-truths")]
id=0, entity=Q6387338, wikidata_label=Ken Blackwell, wikidata_description=American politician and activist, wikipedia_id=769596, wikipedia_title=Ken_Blackwell,
entity=9, category=singer,
id=0, predicate=P1344, predicate_label=participant in, predicate_desc=event in which a person or organization was/is a participant; inverse of P710 or P1923,
subject=1, predicate=0, object=778,
root_entity=9, subject=9, predicate=8, object=31068,
"""
```

Loading the Pre-processed Databases

As described in wiki-entity-summarization-preprocessor, we have imported en-wikidata items as a graph with their summaries into a Neo4j database using Wikipedia and Wikidata XML dump files. Additionally, all the other related metadata was imported into a Postgres database.

If you want to create your own dataset but do not want to run the pre-processor again, you can download and load the exported files from these two databases. Please refer to the release notes of the current version, 1.0.0 (enwiki-2023-05-1 and wikidata-wiki-2023-05-1).
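
For a quick connectivity check after restoring those exports, the sketch below opens both databases with the same client libraries this project depends on (neo4j and psycopg2-binary). This is a minimal sketch, not the project's configuration: the URIs, credentials, database name, and queries are placeholders you would adjust to your own setup.

```python
from neo4j import GraphDatabase
import psycopg2

# Placeholder connection settings -- adjust to your restored instances.
NEO4J_URI = "bolt://localhost:7687"
NEO4J_AUTH = ("neo4j", "your-password")
PG_DSN = "dbname=wikes user=postgres password=your-password host=localhost port=5432"

# Count the nodes imported into the Neo4j graph.
with GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH) as driver:
    records, _, _ = driver.execute_query("MATCH (n) RETURN count(n) AS nodes")
    print("Neo4j nodes:", records[0]["nodes"])

# List the tables holding the related metadata in Postgres.
with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public'"
    )
    for (table,) in cur.fetchall():
        print("Postgres table:", table)
```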

Process Overview

1. Building the Summary Graph

  • Create a summary graph where each seed node is expanded with its summary edges.

2. Expanding the Summary Graph

  • Perform random walks starting from the seed nodes to mimic the structure of the Wikidata graph.
  • Scale the number of walks based on the degree of the seed nodes.
  • Add new edges to the graph from the random walk results.

3. Connecting Components

  • Check if the expanded graph forms a single weakly connected component.
  • If not, iteratively connect smaller components using the shortest paths until a single component is achieved.

4. Adding Metadata

  • Enhance the final graph with additional metadata for each node and edge.
  • Include labels, descriptions, and other relevant information from Wikidata, Wikipedia, and predicate information.
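
To make the four steps above concrete, here is a simplified, illustrative sketch of the expansion loop using networkx. Everything here is an assumption for illustration rather than the project's actual code (see main.py for that): the function and parameter names are hypothetical, the degree-proportional walk count is one plausible reading of "scale by degree", and the B-bridges step is reduced to a single shortest path per merge.

```python
import random
import networkx as nx

def expand_summary_graph(kg: nx.DiGraph, seeds: dict, depth: int = 3,
                         min_walks: int = 100, max_walks: int = 300) -> nx.DiGraph:
    """Illustrative WikES-style expansion: `kg` is the full knowledge graph,
    `seeds` maps each seed node to its list of summary edges (u, v)."""
    g = nx.DiGraph()

    # Step 1: the summary graph -- every seed node plus its summary edges.
    for summary_edges in seeds.values():
        g.add_edges_from(summary_edges)

    # Step 2: random walks from each seed; the walk count grows with the
    # seed's degree, clamped between min_walks and max_walks.
    undirected = kg.to_undirected(as_view=True)
    max_deg = max(kg.degree(s) for s in seeds) or 1
    for seed in seeds:
        n_walks = min(max_walks, max(min_walks, max_walks * kg.degree(seed) // max_deg))
        for _ in range(n_walks):
            node = seed
            for _ in range(depth):
                neighbors = list(undirected.neighbors(node))
                if not neighbors:
                    break
                nxt = random.choice(neighbors)
                g.add_edge(node, nxt)  # keep every sampled edge
                node = nxt

    # Step 3: bridge components with shortest paths taken from the full
    # graph until a single weakly connected component remains (this
    # assumes kg itself is connected).
    while nx.number_weakly_connected_components(g) > 1:
        first, second, *_ = nx.weakly_connected_components(g)
        path = nx.shortest_path(undirected, next(iter(first)), next(iter(second)))
        nx.add_path(g, path)

    # Step 4 (metadata enrichment from Wikidata/Wikipedia) is omitted here.
    return g
```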

Pre-generated Datasets

We have generated datasets using the A Brief History of Human Time project. These datasets contain different sets of seed nodes, categorized by various human arts and professions. Each dataset is available for download in csv, graphml, and croissant.json formats.

| Dataset (variant, size, train/val/test) | #roots | #summaries | #nodes | #edges | #labels | Root category distribution | Running time (sec) |
|---|---|---|---|---|---|---|---|
| WikiLitArt-s | 494 | 10416 | 85346 | 136950 | 547 | actor=150, composer=35, film=41, novelist=24, painter=59, poet=39, screenwriter=17, singer=72, writer=57 | 91.934 |
| WikiLitArt-s-train | 346 | 7234 | 61885 | 96497 | 508 | actor=105, composer=24, film=29, novelist=17, painter=42, poet=27, screenwriter=12, singer=50, writer=40 | 66.023 |
| WikiLitArt-s-val | 74 | 1572 | 14763 | 20795 | 340 | actor=23, composer=5, film=6, novelist=4, painter=9, poet=6, screenwriter=2, singer=11, writer=8 | 14.364 |
| WikiLitArt-s-test | 74 | 1626 | 15861 | 22029 | 350 | actor=22, composer=6, film=6, novelist=3, painter=8, poet=6, screenwriter=3, singer=11, writer=9 | 14.6 |
| WikiLitArt-m | 494 | 10416 | 128061 | 220263 | 604 | actor=150, composer=35, film=41, novelist=24, painter=59, poet=39, screenwriter=17, singer=72, writer=57 | 155.368 |
| WikiLitArt-m-train | 346 | 7234 | 93251 | 155667 | 566 | actor=105, composer=24, film=29, novelist=17, painter=42, poet=27, screenwriter=12, singer=50, writer=40 | 111.636 |
| WikiLitArt-m-val | 74 | 1572 | 22214 | 33547 | 375 | actor=23, composer=5, film=6, novelist=4, painter=9, poet=6, screenwriter=2, singer=11, writer=8 | 22.957 |
| WikiLitArt-m-test | 74 | 1626 | 24130 | 35980 | 394 | actor=22, composer=6, film=6, novelist=3, painter=8, poet=6, screenwriter=3, singer=11, writer=9 | 26.187 |
| WikiLitArt-l | 494 | 10416 | 239491 | 466905 | 703 | actor=150, composer=35, film=41, novelist=24, painter=59, poet=39, screenwriter=17, singer=72, writer=57 | 353.113 |
| WikiLitArt-l-train | 346 | 7234 | 176057 | 332279 | 661 | actor=105, composer=24, film=29, novelist=17, painter=42, poet=27, screenwriter=12, singer=50, writer=40 | 244.544 |
| WikiLitArt-l-val | 74 | 1572 | 42745 | 71734 | 446 | actor=23, composer=5, film=6, novelist=4, painter=9, poet=6, screenwriter=2, singer=11, writer=8 | 57.263 |
| WikiLitArt-l-test | 74 | 1626 | 46890 | 77931 | 493 | actor=22, composer=6, film=6, novelist=3, painter=8, poet=6, screenwriter=3, singer=11, writer=9 | 60.466 |
| WikiCinema-s | 493 | 11750 | 70753 | 126915 | 469 | actor=405, film=88 | 118.014 |
| WikiCinema-s-train | 345 | 8374 | 52712 | 89306 | 437 | actor=284, film=61 | 84.364 |
| WikiCinema-s-val | 73 | 1650 | 13362 | 19280 | 305 | actor=59, film=14 | 18.651 |
| WikiCinema-s-test | 75 | 1744 | 14777 | 21567 | 313 | actor=62, film=13 | 19.851 |
| WikiCinema-m | 493 | 11750 | 101529 | 196061 | 541 | actor=405, film=88 | 196.413 |
| WikiCinema-m-train | 345 | 8374 | 75900 | 138897 | 491 | actor=284, film=61 | 142.091 |
| WikiCinema-m-val | 73 | 1650 | 19674 | 30152 | 344 | actor=59, film=14 | 31.722 |
| WikiCinema-m-test | 75 | 1744 | 22102 | 34499 | 342 | actor=62, film=13 | 33.674 |
| WikiCinema-l | 493 | 11750 | 185098 | 397546 | 614 | actor=405, film=88 | 475.679 |
| WikiCinema-l-train | 345 | 8374 | 139598 | 284417 | 575 | actor=284, film=61 | 333.148 |
| WikiCinema-l-val | 73 | 1650 | 37352 | 63744 | 412 | actor=59, film=14 | 68.62 |
| WikiCinema-l-test | 75 | 1744 | 43238 | 74205 | 426 | actor=62, film=13 | 87.07 |
| WikiPro-s | 493 | 9853 | 79825 | 125912 | 616 | actor=58, football=156, journalist=14, lawyer=16, painter=23, player=25, politician=125, singer=27, sport=21, writer=28 | 126.119 |
| WikiPro-s-train | 345 | 6832 | 57529 | 87768 | 575 | actor=41, football=109, journalist=10, lawyer=11, painter=16, player=17, politician=87, singer=19, sport=15, writer=20 | 89.874 |
| WikiPro-s-val | 74 | 1548 | 15769 | 21351 | 405 | actor=9, football=23, journalist=2, lawyer=3, painter=3, player=4, politician=19, singer=4, sport=3, writer=4 | 21.021 |
| WikiPro-s-test | 74 | 1484 | 15657 | 21145 | 384 | actor=8, football=24, journalist=2, lawyer=2, painter=4, player=4, politician=19, singer=4, sport=3, writer=4 | 21.743 |
| WikiPro-m | 493 | 9853 | 119305 | 198663 | 670 | actor=58, football=156, journalist=14, lawyer=16, painter=23, player=25, politician=125, singer=27, sport=21, writer=28 | 208.157 |
| WikiPro-m-train | 345 | 6832 | 86434 | 138676 | 633 | actor=41, football=109, journalist=10, lawyer=11, painter=16, player=17, politician=87, singer=19, sport=15, writer=20 | 141.563 |
| WikiPro-m-val | 74 | 1548 | 24230 | 34636 | 463 | actor=9, football=23, journalist=2, lawyer=3, painter=3, player=4, politician=19, singer=4, sport=3, writer=4 | 36.045 |
| WikiPro-m-test | 74 | 1484 | 24117 | 34157 | 462 | actor=8, football=24, journalist=2, lawyer=2, painter=4, player=4, politician=19, singer=4, sport=3, writer=4 | 36.967 |
| WikiPro-l | 493 | 9853 | 230442 | 412766 | 769 | actor=58, football=156, journalist=14, lawyer=16, painter=23, player=25, politician=125, singer=27, sport=21, writer=28 | 489.409 |
| WikiPro-l-train | 345 | 6832 | 166685 | 290069 | 725 | actor=41, football=109, journalist=10, lawyer=11, painter=16, player=17, politician=87, singer=19, sport=15, writer=20 | 334.864 |
| WikiPro-l-val | 74 | 1548 | 48205 | 74387 | 549 | actor=9, football=23, journalist=2, lawyer=3, painter=3, player=4, politician=19, singer=4, sport=3, writer=4 | 84.089 |
| WikiPro-l-test | 74 | 1484 | 47981 | 72845 | 546 | actor=8, football=24, journalist=2, lawyer=2, painter=4, player=4, politician=19, singer=4, sport=3, writer=4 | 92.545 |
| WikiProFem-s | 468 | 8338 | 79926 | 123193 | 571 | actor=141, athletic=25, football=24, journalist=16, painter=16, player=32, politician=81, singer=69, sport=18, writer=46 | 177.63 |
| WikiProFem-s-train | 330 | 5587 | 58329 | 87492 | 521 | actor=98, athletic=18, football=17, journalist=9, painter=13, player=22, politician=57, singer=48, sport=14, writer=34 | 127.614 |
| WikiProFem-s-val | 68 | 1367 | 14148 | 19360 | 344 | actor=21, athletic=4, football=3, journalist=4, painter=1, player=5, politician=13, singer=11, sport=1, writer=5 | 29.081 |
| WikiProFem-s-test | 70 | 1387 | 13642 | 18567 | 360 | actor=22, athletic=3, football=4, journalist=3, painter=2, player=5, politician=11, singer=10, sport=3, writer=7 | 27.466 |
| WikiProFem-m | 468 | 8338 | 122728 | 196838 | 631 | actor=141, athletic=25, football=24, journalist=16, painter=16, player=32, politician=81, singer=69, sport=18, writer=46 | 301.718 |
| WikiProFem-m-train | 330 | 5587 | 89922 | 140505 | 600 | actor=98, athletic=18, football=17, journalist=9, painter=13, player=22, politician=57, singer=48, sport=14, writer=34 | 217.699 |
| WikiProFem-m-val | 68 | 1367 | 21978 | 31230 | 409 | actor=21, athletic=4, football=3, journalist=4, painter=1, player=5, politician=13, singer=11, sport=1, writer=5 | 46.793 |
| WikiProFem-m-test | 70 | 1387 | 21305 | 29919 | 394 | actor=22, athletic=3, football=4, journalist=3, painter=2, player=5, politician=11, singer=10, sport=3, writer=7 | 46.317 |
| WikiProFem-l | 468 | 8338 | 248012 | 413895 | 722 | actor=141, athletic=25, football=24, journalist=16, painter=16, player=32, politician=81, singer=69, sport=18, writer=46 | 768.99 |
| WikiProFem-l-train | 330 | 5587 | 183710 | 297686 | 676 | actor=98, athletic=18, football=17, journalist=9, painter=13, player=22, politician=57, singer=48, sport=14, writer=34 | 544.893 |
| WikiProFem-l-val | 68 | 1367 | 46018 | 67193 | 492 | actor=21, athletic=4, football=3, journalist=4, painter=1, player=5, politician=13, singer=11, sport=1, writer=5 | 116.758 |
| WikiProFem-l-test | 70 | 1387 | 44193 | 63563 | 472 | actor=22, athletic=3, football=4, journalist=3, painter=2, player=5, politician=11, singer=10, sport=3, writer=7 | 118.524 |

Keep in mind that you can generate your own dataset by providing a new set of seed nodes.

Dataset Parameters

| Parameter | Value |
|-------------------------------|-------|
| Min valid summary edges | 5 |
| Random walk depth length | 3 |
| Min random walk number (small) | 100 |
| Min random walk number (medium) | 150 |
| Min random walk number (large) | 300 |
| Max random walk number (small) | 300 |
| Max random walk number (medium) | 600 |
| Max random walk number (large) | 1800 |
| Bridges number | 5 |

Graph Structure

Below is a sample of the graph format (we highly recommend using our toolkit to load the datasets):

CSV Format

After unzipping the {variant}-{size}-{dataset_type}.zip file, you will find the following CSV files:

{variant}-{size}-{dataset_type}-entities.csv contains entities. An entity is a Wikidata item (node) in our dataset.

| Field | Description | Datatype |
|-----------------|--------------------------------------|----------|
| id | Incremental integer starting at zero | int |
| entity | Wikidata QID, e.g. Q76 | string |
| wikidata_label | Wikidata label (nullable) | string |
| wikidata_desc | Wikidata description (nullable) | string |
| wikipedia_title | Wikipedia title (nullable) | string |
| wikipedia_id | Wikipedia page id (nullable) | long |

{variant}-{size}-{dataset_type}-root-entities.csv contains root entities. A root entity is a seed node described previously.

| Field | Description | Datatype |
|----------|---------------------------------------------------------|----------|
| entity | id key in {variant}-{size}-{dataset_type}-entities.csv | int |
| category | Category of the root entity | string |

{variant}-{size}-{dataset_type}-predicates.csv contains predicates. A predicate is a Wikidata property describing a connection between entities.

| Field | Description | Datatype |
|-----------------|-------------------------------------------|----------|
| id | Incremental integer starting at zero | int |
| predicate | Wikidata property id, e.g. P121 | string |
| predicate_label | Wikidata property label (nullable) | string |
| predicate_desc | Wikidata property description (nullable) | string |

{variant}-{size}-{dataset_type}-triples.csv contains triples. A triple is an edge between two entities with a predicate.

| Field | Description | Datatype |
|-----------|------------------------------------------------------------|----------|
| subject | id key in {variant}-{size}-{dataset_type}-entities.csv | int |
| predicate | id key in {variant}-{size}-{dataset_type}-predicates.csv | int |
| object | id key in {variant}-{size}-{dataset_type}-entities.csv | int |

{variant}-{size}-{dataset_type}-ground-truths.csv contains ground-truth triples. A ground-truth triple is an edge that is marked as a summary for a root entity.

| Field | Description | Datatype |
|-------------|------------------------------------------------------------------|----------|
| root_entity | entity key in {variant}-{size}-{dataset_type}-root-entities.csv | int |
| subject | id key in {variant}-{size}-{dataset_type}-entities.csv | int |
| predicate | id key in {variant}-{size}-{dataset_type}-predicates.csv | int |
| object | id key in {variant}-{size}-{dataset_type}-entities.csv | int |

Note: in this file, one of the columns subject or object is always equal to root_entity.

Example of CSV Files

entities.csv:

```csv
id,entity,wikidata_label,wikidata_desc,wikipedia_title,wikipedia_id
0,Q43416,Keanu Reeves,Canadian actor (born 1964),Keanu_Reeves,16603
1,Q3820,Beirut,capital and largest city of Lebanon,Beirut,37428
2,Q639669,musician,"person who composes, conducts or performs music",Musician,38284
3,Q219150,Constantine,2005 film directed by Francis Lawrence,Constantine_(film),1210303
```

root-entities.csv:

```csv
entity,category
0,actor
```

predicates.csv:

```csv
id,predicate,predicate_label,predicate_desc
0,P19,place of birth,location where the subject was born
1,P106,occupation,"occupation of a person; see also ""field of work"" (Property:P101), ""position held"" (Property:P39)"
2,P161,cast member,"actor in the subject production [use ""character role"" (P453) and/or ""name of the character role"" (P4633) as qualifiers] [use ""voice actor"" (P725) for voice-only role]"
```

triples.csv:

```csv
subject,predicate,object
0,0,1
0,1,2
3,2,0
```

ground-truths.csv:

```csv
root_entity,subject,predicate,object
0,0,0,1
3,3,2,0
```
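
As a quick sanity check on the schema above, the following pandas sketch joins the triples back to their entity and predicate labels and verifies the note about ground-truth edges. The WikiLitArt-s file prefix is only an example; adjust it to the files you unzipped.

```python
import pandas as pd

# Hypothetical file prefix -- adjust to the files extracted from the zip.
prefix = "WikiLitArt-s"
entities = pd.read_csv(f"{prefix}-entities.csv")
predicates = pd.read_csv(f"{prefix}-predicates.csv")
triples = pd.read_csv(f"{prefix}-triples.csv")
ground_truths = pd.read_csv(f"{prefix}-ground-truths.csv")

# Resolve the integer ids in triples back to human-readable labels.
labeled = (
    triples
    .merge(entities.add_prefix("subj_"), left_on="subject", right_on="subj_id")
    .merge(predicates.add_prefix("pred_"), left_on="predicate", right_on="pred_id")
    .merge(entities.add_prefix("obj_"), left_on="object", right_on="obj_id")
)
print(labeled[["subj_wikidata_label", "pred_predicate_label", "obj_wikidata_label"]].head())

# Per the note above, every ground-truth edge touches its root entity.
assert ((ground_truths["subject"] == ground_truths["root_entity"]) |
        (ground_truths["object"] == ground_truths["root_entity"])).all()
```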

GraphML Example

The same graph can be represented in GraphML format, available in the dataset details table.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key id="d9" for="edge" attr.name="summary_for" attr.type="string"/>
  <key id="d8" for="edge" attr.name="predicate_desc" attr.type="string"/>
  <key id="d7" for="edge" attr.name="predicate_label" attr.type="string"/>
  <key id="d6" for="edge" attr.name="predicate" attr.type="string"/>
  <key id="d5" for="node" attr.name="category" attr.type="string"/>
  <key id="d4" for="node" attr.name="is_root" attr.type="boolean"/>
  <key id="d3" for="node" attr.name="wikidata_desc" attr.type="string"/>
  <key id="d2" for="node" attr.name="wikipedia_title" attr.type="string"/>
  <key id="d1" for="node" attr.name="wikipedia_id" attr.type="long"/>
  <key id="d0" for="node" attr.name="wikidata_label" attr.type="string"/>
  <graph edgedefault="directed">
    <node id="Q43416">
      <data key="d0">Keanu Reeves</data>
      <data key="d1">16603</data>
      <data key="d2">Keanu_Reeves</data>
      <data key="d3">Canadian actor (born 1964)</data>
      <data key="d4">True</data>
      <data key="d5">actor</data>
    </node>
    <node id="Q3820">
      <data key="d0">Beirut</data>
      <data key="d1">37428</data>
      <data key="d2">Beirut</data>
      <data key="d3">capital and largest city of Lebanon</data>
    </node>
    <node id="Q639669">
      <data key="d0">musician</data>
      <data key="d1">38284</data>
      <data key="d2">Musician</data>
      <data key="d3">person who composes, conducts or performs music</data>
    </node>
    <node id="Q219150">
      <data key="d0">Constantine</data>
      <data key="d1">1210303</data>
      <data key="d2">Constantine_(film)</data>
      <data key="d3">2005 film directed by Francis Lawrence</data>
    </node>
    <edge source="Q43416" target="Q3820" id="P19">
      <data key="d6">P19</data>
      <data key="d7">place of birth</data>
      <data key="d8">location where the subject was born</data>
      <data key="d9">Q43416</data>
    </edge>
    <edge source="Q43416" target="Q639669" id="P106">
      <data key="d6">P106</data>
      <data key="d7">occupation</data>
      <data key="d8">occupation of a person; see also "field of work" (Property:P101), "position held" (Property:P39)</data>
    </edge>
    <edge source="Q219150" target="Q43416" id="P161">
      <data key="d6">P161</data>
      <data key="d7">cast member</data>
      <data key="d8">actor in the subject production [use "character role" (P453) and/or "name of the character role" (P4633) as qualifiers] [use "voice actor" (P725) for voice-only role]</data>
      <data key="d9">Q43416</data>
    </edge>
  </graph>
</graphml>
```
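
Since these files are standard GraphML, networkx can also read them directly (the toolkit remains the recommended loader). A minimal sketch, with a placeholder file name:

```python
import networkx as nx

# Hypothetical path to an extracted GraphML file.
G = nx.read_graphml("WikiLitArt-s.graphml")

# Root (seed) entities carry the is_root node attribute.
roots = [n for n, data in G.nodes(data=True) if data.get("is_root")]
print("root entities:", roots)

# Summary (ground-truth) edges carry a summary_for attribute naming their root.
for u, v, data in G.edges(data=True):
    if "summary_for" in data:
        print(f"{u} -[{data.get('predicate_label')}]-> {v} "
              f"(summary for {data['summary_for']})")
```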

Usage

Generate a New Dataset

To get started with this project, first clone this repository and install the necessary dependencies using Poetry.

```bash
git clone https://github.com/yourusername/wiki-entity-summarization.git
cd wiki-entity-summarization
curl -sSL https://install.python-poetry.org | python3 -
poetry config virtualenvs.in-project true
poetry install
poetry shell
```

You can set the parameters via a .env file instead of providing command-line arguments:

```bash
cp .env_sample .env
```

```
python3 main.py [-h] [--min_valid_summary_edges MIN_VALID_SUMMARY_EDGES]
                [--random_walk_depth_len RANDOM_WALK_DEPTH_LEN]
                [--bridges_number BRIDGES_NUMBER] [--max_threads MAX_THREADS]
                [--output_path OUTPUT_PATH] [--db_name DB_NAME] [--db_user DB_USER]
                [--db_password DB_PASSWORD] [--db_host DB_HOST] [--db_port DB_PORT]
                [--neo4j_user NEO4J_USER] [--neo4j_password NEO4J_PASSWORD]
                [--neo4j_host NEO4J_HOST] [--neo4j_port NEO4J_PORT]
                [dataset_name] [min_random_walk_number] [max_random_walk_number]
                [seed_node_ids] [categories]

options:
        -h, --help                Show this help message and exit
        --min_valid_summary_edges Minimum number of valid summaries for a seed node
        --random_walk_depth_len   Depth of random walks (number of nodes in each walk)
        --bridges_number          Number of connecting path bridges between components
        --max_threads             Maximum number of threads
        --output_path             Path to save output data
        --db_name                 Database name
        --db_user                 Database user
        --db_password             Database password
        --db_host                 Database host
        --db_port                 Database port
        --neo4j_user              Neo4j user
        --neo4j_password          Neo4j password
        --neo4j_host              Neo4j host
        --neo4j_port              Neo4j port

positional arguments:
        dataset_name              Name of the dataset to process (required)
        min_random_walk_number    Minimum number of random walks for each seed node (required)
        max_random_walk_number    Maximum number of random walks for each seed node (required)
        seed_node_ids             Seed node ids in comma-separated format (required)
        categories                Seed node categories in comma-separated format (optional)
```

Re-generate WikES Dataset

To re-construct our pre-generated datasets, you can use the following command:

```bash
python3 human_history_dataset.py
```

This project uses the databases from our pre-processor project. Make sure you have loaded the data and that the databases are running properly.

Citation

If you use this project in your research, please cite the following paper:

```bibtex
@misc{javadi2024wiki,
      title         = {Wiki Entity Summarization Benchmark},
      author        = {Saeedeh Javadi and Atefeh Moradan and Mohammad Sorkhpar and Klim Zaporojets and Davide Mottin and Ira Assent},
      year          = {2024},
      eprint        = {2406.08435},
      archivePrefix = {arXiv},
      primaryClass  = {cs.IR}
}
```

License

This project and its released datasets are licensed under the CC BY 4.0 License. See the LICENSE file for details.

Below are the licenses of the external services, libraries, and software that this project uses. By using this project, you accept these third-party licenses.

  1. Wikipedia:
    • https://www.gnu.org/licenses/fdl-1.3.html
    • https://creativecommons.org/licenses/by-sa/3.0/
    • https://foundation.wikimedia.org/wiki/Policy:Terms_of_Use
  2. Wikidata:
    • https://creativecommons.org/publicdomain/zero/1.0/
    • https://creativecommons.org/licenses/by-sa/3.0/
  3. Python:
    • https://docs.python.org/3/license.html#psf-license
    • https://docs.python.org/3/license.html#bsd0
    • https://docs.python.org/3/license.html#otherlicenses
  4. DistilBERT:
    • https://github.com/RayWilliam46/FineTune-DistilBERT/blob/main/LICENSE
  5. Networkx:
    • https://github.com/networkx/nx-guides/blob/main/LICENSE
  6. Postgres:
    • https://opensource.org/license/postgresql
  7. Neo4j:
    • https://www.gnu.org/licenses/quick-guide-gplv3.html
  8. Docker:
    • https://github.com/moby/moby/blob/master/LICENSE
  9. PyTorch:
    • https://github.com/intel/torch/blob/master/LICENSE.md
  10. Scikit-learn:
    • https://github.com/scikit-learn/scikit-learn/blob/main/COPYING
  11. Pandas:
    • https://github.com/pandas-dev/pandas/blob/main/LICENSE
  12. Numpy:
    • https://numpy.org/doc/stable/license.html
  13. Java-open:
    • https://github.com/openjdk/jdk21/blob/master/LICENSE
  14. Spring framework:
    • https://github.com/spring-projects/spring-boot/blob/main/LICENSE.txt
  15. Other libraries:
    • https://github.com/tatuylonen/wikitextprocessor/blob/main/LICENSE
    • https://github.com/aaronsw/html2text/blob/master/COPYING
    • https://github.com/earwig/mwparserfromhell/blob/main/LICENSE
    • https://github.com/more-itertools/more-itertools/blob/master/LICENSE
    • https://github.com/siznax/wptools/blob/master/LICENSE
    • https://github.com/tqdm/tqdm/blob/master/LICENCE

Owner

  • Name: Mo Sorkhpar
  • Login: msorkhpar
  • Kind: user
  • Location: IN, USA

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 59
  • Total Committers: 5
  • Avg Commits per committer: 11.8
  • Development Distribution Score (DDS): 0.254
Past Year
  • Commits: 59
  • Committers: 5
  • Avg Commits per committer: 11.8
  • Development Distribution Score (DDS): 0.254
Top Committers
Name Email Commits
Mo Sorkhpar S****r@o****m 44
Mo Sorkhpar s****r@o****m 9
saeedehj s****j@g****m 4
Atefeh Moradan a****n@g****m 1
Saeedeh Javadi 3****j 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • msorkhpar (2)
  • saeedehj (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

poetry.lock pypi
  • asttokens 2.4.1
  • colorama 0.4.6
  • decorator 5.1.1
  • exceptiongroup 1.2.1
  • executing 2.0.1
  • ipython 8.25.0
  • itables 2.1.0
  • jedi 0.19.1
  • joblib 1.4.2
  • matplotlib-inline 0.1.7
  • more-itertools 10.2.0
  • neo4j 5.20.0
  • networkx 3.3
  • numpy 1.26.4
  • pandas 2.2.2
  • parso 0.8.4
  • pexpect 4.9.0
  • prompt-toolkit 3.0.46
  • psycopg2-binary 2.9.9
  • ptyprocess 0.7.0
  • pure-eval 0.2.2
  • pygments 2.18.0
  • python-dateutil 2.9.0.post0
  • python-dotenv 1.0.1
  • pytz 2024.1
  • scikit-learn 1.5.0
  • scipy 1.13.1
  • six 1.16.0
  • stack-data 0.6.3
  • threadpoolctl 3.5.0
  • traitlets 5.14.3
  • typing-extensions 4.12.1
  • tzdata 2024.1
  • wcwidth 0.2.13
pyproject.toml pypi
  • itables ^2.1.0
  • more-itertools ^10.2.0
  • neo4j ^5.20.0
  • networkx ^3.3
  • numpy ^1.26.4
  • pandas ^2.2.2
  • psycopg2-binary ^2.9.9
  • python ^3.10
  • python-dotenv ^1.0.1
  • scikit-learn ^1.5.0
  • wikes-toolkit ^1.0.4