mosaico

A multilingual open-text semantically annotated interlinked corpus

https://github.com/sapienzanlp/mosaico


Keywords

artificial-intelligence natural-language-processing natural-language-understanding relation-extraction semantic-parsing semantic-role-labeling word-sense-disambiguation

Repository

Basic Info
  • Host: GitHub
  • Owner: SapienzaNLP
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 332 KB
Statistics
  • Stars: 7
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago

README.md

# MOSAICo: A Multilingual Open-text Semantically Annotated Interlinked Corpus

[![Conference](http://img.shields.io/badge/NAACL-2024-4b44ce.svg)](https://2024.naacl.org/)
[![Paper](http://img.shields.io/badge/paper-ACL--anthology-B31B1B.svg)](https://aclanthology.org/2024.naacl-long.442/)
[![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

![1718634923466](https://github.com/SapienzaNLP/mosaico/assets/26126169/613fb6e9-e7f6-4683-87c9-73f0f359a5c3)

About MOSAICo

This is the repository for the paper MOSAICo: a Multilingual Open-text Semantically Annotated Interlinked Corpus, presented at NAACL 2024 by Simone Conia, Edoardo Barba, Abelardo Carlos Martinez Lorenzo, Pere-Lluís Huguet Cabot, Riccardo Orlando, Luigi Procopio and Roberto Navigli.

Paper abstract

Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks.

Cite this work

If you use any part of this work, please consider citing the paper as follows:

```bibtex
@inproceedings{conia-etal-2024-mosaico,
    title = "{MOSAICo}: a Multilingual Open-text Semantically Annotated Interlinked Corpus",
    author = "Conia, Simone and Barba, Edoardo and Martinez Lorenzo, Abelardo Carlos and Huguet Cabot, Pere-Llu{\'\i}s and Orlando, Riccardo and Procopio, Luigi and Navigli, Roberto",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.442",
    pages = "7983--7997",
}
```

How is MOSAICo built?

MOSAICo provides high-quality silver annotations for four semantic tasks:

* Word Sense Disambiguation: we use ESCHER, a state-of-the-art WSD system adapted for multilingual settings.
* Semantic Role Labeling: we use Multi-SRL, a state-of-the-art multilingual system for dependency- and span-based SRL.
* Semantic Parsing: we use SPRING, a state-of-the-art semantic parser adapted for multilingual settings.
* Relation Extraction: we use mREBEL, a state-of-the-art system for multilingual RE.

Usage

Set up MongoDB

MOSAICo data are released as mongoexported JSON files that can be loaded into a local instance of MongoDB.

First, we need to start a local MongoDB instance (we suggest using Docker):

```bash
docker run \
  -e MONGO_INITDB_ROOT_USERNAME=admin \
  -e MONGO_INITDB_ROOT_PASSWORD=password \
  -p 27017:27017 \
  --name local-mosaico-db \
  --detach \
  mongo:6.0.11
```

Then, we need to mongoimport the data corresponding to three collections.

| Collection | Sample | Mosaico Core |
| --- | --- | --- |
| interlanguage-links | - | link |
| pages | link | link |
| annotations | link | link |

The Sample column refers to an English-only sample of 835 annotated documents.

Once downloaded, you can import the data into the local MongoDB instance.

```bash
# import interlanguage links
docker exec -i local-mosaico-db \
  mongoimport \
  --authenticationDatabase admin -u admin -p password \
  --db mosaico --collection interlanguage-links <

# import pages
docker exec -i local-mosaico-db \
  mongoimport \
  --authenticationDatabase admin -u admin -p password \
  --db mosaico --collection pages <

# import annotations
docker exec -i local-mosaico-db \
  mongoimport \
  --authenticationDatabase admin -u admin -p password \
  --db mosaico --collection annotations <
```
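The redirect target of each command above depends on where the downloaded files were saved. As a minimal sketch, assuming hypothetical file names such as `pages.json` inside a local `data/` folder (these names are illustrative and not taken from the repository), the three commands could be generated like this:

```python
import shlex
from pathlib import Path

# Container name, credentials and database are the ones used in the commands above.
CONTAINER = "local-mosaico-db"
COLLECTIONS = ["interlanguage-links", "pages", "annotations"]


def build_import_command(collection: str, export_path: Path) -> str:
    """Return a shell command that pipes one exported file into mongoimport."""
    args = [
        "docker", "exec", "-i", CONTAINER,
        "mongoimport",
        "--authenticationDatabase", "admin", "-u", "admin", "-p", "password",
        "--db", "mosaico", "--collection", collection,
    ]
    # The file lives on the host, so it is fed to the container through stdin.
    return f"{shlex.join(args)} < {shlex.quote(str(export_path))}"


# Hypothetical layout (an assumption): one file per collection in ./data.
data_dir = Path("data")
commands = [build_import_command(name, data_dir / f"{name}.json") for name in COLLECTIONS]
for command in commands:
    print(command)
```

The printed lines can be pasted into a shell once the paths point at the real downloads.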

Installing the MOSAICo library

```bash
pip install git+https://github.com/SapienzaNLP/mosaico
```

Using the MOSAICo library

The library relies heavily on async programming. If you cannot integrate that into your code (e.g., inside a torch.Dataset), we suggest using a separate script to download the data locally. Moreover, we built this project on top of beanie, an ODM for MongoDB. Before proceeding, we strongly recommend checking out its tutorial, as WikiPage is a beanie.Document.

```python
import asyncio

from mosaico.schema import init, WikiPage


async def main():
    # connect to the local MongoDB instance
    await init(
        mongo_uri="mongodb://admin:password@127.0.0.1:27017/",
        db="mosaico",
    )

    page = await WikiPage.find_one(WikiPage.title == "Barack Obama")
    print(f"# document id: {page.document_id}")
    print(
        f"# wikidata id: {page.wikidata_id if page.wikidata_id is not None else '<not available>'}"
    )
    print(f"# language: {page.language.value}")
    print(f"# text: {page.text[:100]} [...]")

    print("# available annotations:")
    async for annotation in page.list_annotations():
        print(f"  * {annotation.name}")

    print("# available translated pages:")
    async for translated_page in page.list_translations():
        print(f"  * {translated_page.language.value} => {translated_page.document_id}")


if __name__ == "__main__":
    asyncio.run(main())
```

For more information, check out the examples/ folder. If interested in the fields available for each annotation, check out the pydantic models defined in src/mosaico/schema/annotations/.
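When the consuming code cannot be async, one option is the separate download script suggested earlier: run the async queries once, write the results to disk, and read the file from synchronous code. The following is a minimal sketch of that pattern only. `fetch_page` is a hypothetical stand-in for real MOSAICo queries, and the output file name is an assumption.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def fetch_page(title: str) -> dict:
    """Hypothetical stand-in for an async MOSAICo query; replace with real calls."""
    await asyncio.sleep(0)  # simulates awaiting the database
    return {"title": title, "text": f"Text of {title}"}


async def fetch_all(titles: list[str]) -> list[dict]:
    # Run the queries concurrently; results keep the order of `titles`.
    return await asyncio.gather(*(fetch_page(t) for t in titles))


def dump_pages(titles: list[str], out_path: Path) -> int:
    """Synchronous entry point: fetch pages and write them as JSON Lines."""
    pages = asyncio.run(fetch_all(titles))
    with out_path.open("w", encoding="utf-8") as f:
        for page in pages:
            f.write(json.dumps(page, ensure_ascii=False) + "\n")
    return len(pages)


out_file = Path(tempfile.gettempdir()) / "mosaico_pages.jsonl"
written = dump_pages(["Barack Obama", "Rome"], out_file)
print(f"wrote {written} pages to {out_file}")
```

A dataset class can then read the JSON Lines file without touching the event loop.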

Streamlit Demo

This code includes a script to run a streamlit demo that allows for easy data visualization.

```bash
PYTHONPATH=$(pwd) pdm run demo
```

Working on the library

Setup Env

This repository uses PDM as its dependency manager.

```bash
# install pdm package manager
curl -sSL https://pdm-project.org/install-pdm.py | python3 -
# and add binary folder to PATH

pdm install
```

Patch WikiExtractor

We use an alignment algorithm to link the Cirrus text (which does not contain metadata such as sections and links) to the standard Wikipedia source text (which does).

In this process, we compute a cleaned, more easily alignable version of the source text by applying wikiextractor. For best results, we recommend patching the installed version of wikiextractor by updating the following lines in wikiextractor.extract:clean:

```python
for tag in discardElements:
    text = dropNested(text, r'<\s*%s\b[^>/]*>' % tag, r'<\s*/\s*%s>' % tag)
```

to:

```python
for tag in discardElements:
    text = dropNested(text, r'<\s*%s\b[^>]*[^/]*>' % tag, r'<\s*/\s*%s>' % tag)
```

The reason behind this change is that the original regex fails on some edge cases:

```text
Inspired by the first person ever to be cured of HIV, <a href="The%20Berlin%20Patient">The Berlin Patient</a>, StemCyte began collaborations with <a href="Cord%20blood%20bank">Cord blood bank</a>s worldwide to systematically screen <a href="Umbilical%20cord%20blood">Umbilical cord blood</a> samples for the CCR5 mutation beginning in 2011.<ref name="CCR5Δ32/Δ32 HIV-resistant cord blood"></ref>
```

This is the cleaned text returned by the original unpatched function: the trailing `<ref>` element hasn't been deleted because `/` is excluded by the regex (while it should only be excluded if it is the second-to-last character).
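The difference between the two opening-tag patterns can be checked in isolation. The sketch below applies only the regular expressions to the `<ref>` tag from the example; it does not call wikiextractor, and the variable names are illustrative.

```python
import re

ORIGINAL = r'<\s*%s\b[^>/]*>'       # opening-tag pattern before the patch
PATCHED = r'<\s*%s\b[^>]*[^/]*>'    # opening-tag pattern after the patch

# Opening tag from the example; its attribute value contains a "/".
opening_tag = '<ref name="CCR5Δ32/Δ32 HIV-resistant cord blood">'
text = 'beginning in 2011.' + opening_tag + '</ref>'

original_match = re.search(ORIGINAL % 'ref', text)
patched_match = re.search(PATCHED % 'ref', text)

print(original_match)          # no match: "/" inside the attribute blocks it
print(patched_match.group(0))  # the full opening tag
```

Because the original pattern never matches the opening tag, the element is not seen as a nested block and is left in the text.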

More details on the linking process can be found in src/scripts/annotations/sourcetextlinking/link.py.
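To give an intuition for what linking plain text to markup-bearing source text involves, here is a toy character-level alignment built on Python's `difflib`. This is not the algorithm implemented in the repository; the function name, the example strings and the choice of `difflib` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_offsets(plain: str, source: str) -> dict[int, int]:
    """Map character offsets in `plain` to offsets in `source` for matching runs."""
    matcher = SequenceMatcher(None, plain, source, autojunk=False)
    mapping = {}
    for block in matcher.get_matching_blocks():
        for i in range(block.size):
            mapping[block.a + i] = block.b + i
    return mapping


# Toy example: the source text carries link markup that the plain text lacks.
plain_text = "Obama was born in Hawaii."
source_text = "Obama was born in [[Hawaii]]."

mapping = align_offsets(plain_text, source_text)
start = plain_text.index("Hawaii")
end = start + len("Hawaii")
source_span = source_text[mapping[start] : mapping[end - 1] + 1]
print(source_span)
```

With such a mapping, an annotation span computed on the plain text can be projected onto the source text, where metadata such as links and sections is available.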

Cite this work

If you use any part of this work, please consider citing the paper as follows:

```bibtex
@inproceedings{conia-etal-2024-mosaico,
    title = "{MOSAIC}o: a Multilingual Open-text Semantically Annotated Interlinked Corpus",
    author = "Conia, Simone and Barba, Edoardo and Martinez Lorenzo, Abelardo Carlos and Huguet Cabot, Pere-Llu{\'\i}s and Orlando, Riccardo and Procopio, Luigi and Navigli, Roberto",
    editor = "Duh, Kevin and Gomez, Helena and Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.442",
    pages = "7983--7997",
    abstract = "Several Natural Language Understanding (NLU) tasks focus on linking text to explicit knowledge, including Word Sense Disambiguation, Semantic Role Labeling, Semantic Parsing, and Relation Extraction. In addition to the importance of connecting raw text with explicit knowledge bases, the integration of such carefully curated knowledge into deep learning models has been shown to be beneficial across a diverse range of applications, including Language Modeling and Machine Translation. Nevertheless, the scarcity of semantically-annotated corpora across various tasks and languages limits the potential advantages significantly. To address this issue, we put forward MOSAICo, the first endeavor aimed at equipping the research community with the key ingredients to model explicit semantic knowledge at a large scale, providing hundreds of millions of silver yet high-quality annotations for four NLU tasks across five languages. We describe the creation process of MOSAICo, demonstrate its quality and variety, and analyze the interplay between different types of semantic information. MOSAICo, available at https://github.com/SapienzaNLP/mosaico, aims to drop the requirement of closed, licensed datasets and represents a step towards a level playing field across languages and tasks in NLU.",
}
```

License

The data is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0).

Owner

  • Name: Sapienza NLP group
  • Login: SapienzaNLP
  • Kind: organization
  • Location: Rome, Italy

The NLP group at the Sapienza University of Rome

Citation (CITATION.bib)

@inproceedings{conia-etal-2024-mosaico,
    title = "{MOSAICo}: a Multilingual Open-text Semantically Annotated Interlinked Corpus",
    author = "Conia, Simone and Barba, Edoardo and Martinez Lorenzo, Abelardo Carlos and Huguet Cabot, Pere-Llu{\'\i}s and Orlando, Riccardo and Procopio, Luigi and Navigli, Roberto",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.442",
    pages = "7983--7997",
}

GitHub Events

Total
  • Watch event: 3
Last Year
  • Watch event: 3

Committers

All Time
  • Total Commits: 30
  • Total Committers: 4
  • Avg Commits per committer: 7.5
  • Development Distribution Score (DDS): 0.533
Past Year
  • Commits: 5
  • Committers: 3
  • Avg Commits per committer: 1.667
  • Development Distribution Score (DDS): 0.6
Top Committers
Name Email Commits
poccio l****o@g****m 14
Simone Conia s****a@u****t 14
Pere Lluis p****3@g****m 1
Edoardo Barba e****4@g****m 1
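
The all-time figures are mutually consistent if the Development Distribution Score is taken to be one minus the share of commits made by the most active committer. That definition is an assumption here, since this page does not state it. A quick check using the counts from the table:

```python
# Commit counts per committer, copied from the Top Committers table.
commits = {"poccio": 14, "Simone Conia": 14, "Pere Lluis": 1, "Edoardo Barba": 1}

total = sum(commits.values())
average = total / len(commits)
# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, average, round(dds, 3))
```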
Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0