galmisocorpus2023

:bookmark_tabs: Galician corpus for misogyny detection

https://github.com/luciamariaalvarezcrespo/galmisocorpus2023

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.2%) to scientific vocabulary

Keywords

corpus corpus-data galician machine-learning misogyny misogyny-detection nlp nlp-machine-learning
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 17
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Contributing Funding License Code of conduct Citation Security

README.md

:bookmark_tabs: GalMisoCorpus 2023

:octopus: O primeiro corpus galego para a detección de misoxinia :octopus:
:gb: The First Galician corpus for misogyny detection

Corpus :books:

:octopus: Este repositorio contén un corpus de chíos e toots procedentes de Twitter e Mastodon para a detección de misoxinia en lingua galega. Asemade, engádense os modelos adestrados co corpus proposto e os scripts desenvolvidos tanto para a creación do corpus como para o adestramento dos modelos. Este traballo foi aceptado no 16th International Conference on Computational Processing of Portuguese (PROPOR 2024). O artigo está dispoñíbel aquí.

:gb: This repository contains a corpus of tweets and toots from Twitter and Mastodon for the detection of misogyny in the Galician language. Additionally, it includes the trained models with the proposed corpus and the scripts developed both for creating the corpus and training the models. This work was accepted at the 16th International Conference on Computational Processing of Portuguese (PROPOR 2024). The paper is available here.

Cítao como / Cite as :sparkling_heart:

:octopus: Se consideras ao GalMisoCorpus2023 útil para o teu traballo de investigación, podes darlle unha :star: a este repo e citar o noso traballo facendo uso do seguinte BibTeX:

:gb: If you find GalMisoCorpus2023 useful for your research, please consider giving this repo a :star: and citing our work using the following BibTeX:

```bib
@inproceedings{alvarez-crespo-castro-2024-galician,
  title     = {A Galician Corpus for Misogyny Detection Online},
  author    = {{\'A}lvarez-Crespo, Luc{\'\i}a M. and Castro, Laura M.},
  editor    = {Gamallo, Pablo and Claro, Daniela and Teixeira, Ant{\'o}nio and Real, Livy and Garcia, Marcos and Oliveira, Hugo Gon{\c{c}}alo and Amaro, Raquel},
  booktitle = {Proceedings of the 16th International Conference on Computational Processing of Portuguese},
  month     = {mar},
  year      = {2024},
  address   = {Santiago de Compostela, Galicia/Spain},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2024.propor-1.3},
  pages     = {22--31}
}

@inproceedings{alvarez2023unveiling,
  title        = {Unveiling the Dark Side of Social Media: Developing the First Galician Corpus for Misogyny Detection on Twitter and Mastodon},
  author       = {{\'A}lvarez-Crespo, Luc{\'\i}a M. and Castro, Laura M.},
  booktitle    = {VI Congreso Xove TIC: impulsando el talento cient{\'\i}fico. A Coru{\~n}a},
  pages        = {87--90},
  month        = {oct},
  year         = {2023},
  organization = {Universidade da Coru{\~n}a, Servizo de Publicaci{\'o}ns}
}
```

Responsabilidade / Disclaimer :warning:

[!WARNING] :octopus: Este conxunto de datos pode conter discurso de odio, linguaxe ofensiva ou outro material semellante. O contido foi recollido de diversas fontes e non foi creado nin avaliado polas autoras do proxecto, así como non reflicte as súas opinións ou puntos de vista. O conxunto de datos está destinado exclusivamente a fins de investigación, análise ou educativos. As autoras non avalan ningún comportamento prexudicial ou discriminatorio atopado nel. Debido ás políticas de privacidade, non se pode publicar o texto procedente de X/Twitter. As persoas usuarias deben actuar con precaución e sensibilidade ao usar o conxunto de datos e cumprir coas directrices éticas e as leis aplicábeis. As responsables do proxecto non asumen ningunha responsabilidade polo contido nin polo seu uso ou interpretación por terceiros.

[!WARNING] :gb: This dataset may contain hate speech, offensive language, or other objectionable material. The content was collected from various sources and was not created or endorsed by the project authors; it does not reflect their views or opinions. The dataset is intended solely for research, analysis, or educational purposes. The authors do not endorse any harmful or discriminatory behavior found within it. Due to privacy policies, text from X/Twitter cannot be published. Users should exercise caution and sensitivity when using the dataset and adhere to ethical guidelines and applicable laws. The project maintainers disclaim any responsibility for the content and its use or interpretation by others.

Estrutura do repositorio / Repository structure :file_folder:

:octopus: Galego

  • /corpus: aquí atópase o corpus utilizado para os adestramentos, así como o non preprocesado para interese dos grupos de investigación.
  • /scripts: aquí atópanse os scripts usados durante a recompilación do corpus e durante o adestramento dos modelos. Engadíronse, tamén, scripts que axudaron no proceso de colleita de datos e de procesamento dos textos.
  • /models: aquí atópanse os modelos xa adestrados.

:gb: English

  • /corpus: the corpus used for training, plus the non-preprocessed version for interested research groups.
  • /scripts: the scripts used to build the corpus and to train the models, plus helper scripts for data collection and text processing.
  • /models: the trained models.
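
The README names the three directories but not the files inside them. A minimal Python sketch for inspecting a local checkout is shown below. Only the directory names `corpus`, `scripts`, and `models` come from the README; the function name `summarize_repo` and the idea of grouping files by extension are illustrative assumptions, not part of the project.

```python
from collections import Counter
from pathlib import Path

# Directory names taken from the README; everything else is illustrative.
REPO_DIRS = ("corpus", "scripts", "models")


def summarize_repo(repo_root="."):
    """Return {directory: Counter of file extensions}, or None if a directory is absent."""
    root = Path(repo_root)
    summary = {}
    for name in REPO_DIRS:
        folder = root / name
        if not folder.is_dir():
            summary[name] = None  # directory missing in this checkout
            continue
        # Count files recursively, grouped by lower-cased extension.
        summary[name] = Counter(
            p.suffix.lower() or "<no extension>"
            for p in folder.rglob("*")
            if p.is_file()
        )
    return summary


if __name__ == "__main__":
    for name, counts in summarize_repo().items():
        if counts is None:
            print(f"{name}/: not found")
        else:
            print(f"{name}/: {dict(counts)}")
```

Run from the repository root, this prints one line per directory, which is a quick way to see which file formats the corpus and models are stored in before writing any loading code.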

Instalación / Installation :wrench:

:octopus: Utiliza o ficheiro requirements.txt con pip para instalar todas as dependencias.
:gb: Use the requirements.txt file with pip to install all the dependencies.

```bash
pip3 install -r requirements.txt
```
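
To confirm that the installation step succeeded, one option is to check each listed package against the active environment. The sketch below is an assumption-laden helper, not something shipped with the repository: the name `missing_requirements` is hypothetical, and the parsing only handles simple `name`, `name==version`, or `name>=version` lines, skipping comments and pip option lines such as `-r` or `-e`.

```python
import re
from importlib import metadata
from pathlib import Path


def missing_requirements(path="requirements.txt"):
    """Return the names of packages listed in `path` that are not installed."""
    missing = []
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options (-r, -e, ...)
        # Keep only the distribution name, cutting at the first
        # version specifier, extras bracket, marker, or URL separator.
        name = re.split(r"[\s<>=!~;\[@]", line, maxsplit=1)[0]
        if not name:
            continue
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

Calling `missing_requirements()` from the repository root returns an empty list when every listed package is present. It does not check that the installed versions satisfy the pinned constraints.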

Contribucións / Contributing :open_hands:

:octopus: As pull requests son benvidas. Para cambios maiores, abride primeiro unha issue para debater o que queirades cambiar, por favor.

[!TIP] Así é como lle suxerimos que propoña un cambio neste proxecto:

  1. Fai un fork deste proxecto na túa conta.
  2. Crea unha nova póla para o cambio que pretende facer.
  3. Fai os cambios no teu fork.
  4. Envía unha pull request dende a póla do teu fork á nosa póla main.

:gb: Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

[!TIP] Here's how we suggest you go about proposing a change to this project:

  1. Fork this project to your account.
  2. Create a branch for the change you intend to make.
  3. Make your changes to your fork.
  4. Send a pull request from your fork's branch to our main branch.

Licenza / Licensing :unlock:

:octopus: Este proxecto atópase baixo a licenza de Mozilla. Véxase LICENSE para o texto completo.
:gb: This project is licensed under the Mozilla License. See LICENSE for the full license text.

Manteñamos o contacto! / Get in touch! :telephone_receiver:

@luciamac_

Owner

  • Name: Lucía María Álvarez-Crespo
  • Login: luciamariaalvarezcrespo
  • Kind: user
  • Location: A Coruña
  • Company: CTC Group, University of A Coruña

This is not my CV 🤗

Citation (CITATION.cff)

date-released: "2024-03-01"
repository-code: "https://github.com/luciamariaalvarezcrespo/GalMisoCorpus2023"
message: "If you use this software, please cite it as below."
title: "A Galician Corpus for Misogyny Detection Online"
cff-version: "1.2.0"
authors:
  - family-names: "Álvarez-Crespo"
    given-names: "Lucía M."
  - family-names: "Castro"
    given-names: "Laura M."
preferred-citation:
  type: "conference-paper"
  publisher:
    name: "Association for Computational Linguistics"
  conference:
    name: "Proceedings of the 16th International Conference on Computational Processing of Portuguese"
  url: "https://aclanthology.org/2024.propor-1.3"
  date-released: "2024-03-01"
  title: "A Galician Corpus for Misogyny Detection Online"
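  collection-title: "Proceedings of the 16th International Conference on Computational Processing of Portuguese"
  editors:
    - family-names: "Gamallo"
      given-names: "Pablo"
    - family-names: "Claro"
      given-names: "Daniela"
    - family-names: "Teixeira"
      given-names: "António"
    - family-names: "Real"
      given-names: "Livy"
    - family-names: "Garcia"
      given-names: "Marcos"
    - family-names: "Oliveira"
      given-names: "Hugo Gonçalo"
    - family-names: "Amaro"
      given-names: "Raquel"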
  start: "22"
  end: "31"
  authors:
    - family-names: "Álvarez-Crespo"
      given-names: "Lucía M."
    - family-names: "Castro"
      given-names: "Laura M."
