https://github.com/cltl-students/verkijk_stella_rma_thesis_dutch_medical_language_model

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.8%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: cltl-students
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 1.39 MB
Statistics
  • Stars: 10
  • Watchers: 5
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created almost 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License

README.md

Creating Dutch Medical Language Models: a thesis project

This repository contains the code for creating and evaluating two domain-specific Dutch medical language models. Most of the data used in this project cannot be provided due to privacy constraints. Where possible, data is provided.

Overview

The src folder contains all code and data. Each subfolder has its own readme. The subfolders are:
  • gather_traindata: code for gathering, filtering and preparing the data used for pre-training in train_lm
  • train_lm: code to pre-train two medical language models, one from scratch and one by extending RobBERT
  • ICF_test: code to fine-tune and test language models on a medical classification task
  • similarity_test: code to create a similarity test set from hospital notes, plus the code and data to test language models on it
  • NER_test: code to fine-tune and test language models on named entity recognition for general Dutch
  • anonymization: code to anonymize the vocabulary of a language model and to test the level of anonymity of a language model

Thesis report

For more information, or for a copy of the full thesis, you can contact me at s.verkijk@vu.nl.

Owner

  • Name: Computational Lexicology & Terminology Lab
  • Login: cltl-students
  • Kind: organization
  • Email: p.t.j.m.vossen@vu.nl
  • Location: Amsterdam

Thesis and student projects @cltl

GitHub Events

Total
  • Watch event: 5
  • Push event: 1
Last Year
  • Watch event: 5
  • Push event: 1

Dependencies

requirements.txt pypi
  • cuda ==11.2.1
  • python ==3.8.2
  • tensorflow ==2.4.1
  • tokenizers ==0.10.1
  • torch ==1.8.0
  • transformers ==4.3.3
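
Because every dependency is pinned to an exact version, it can help to compare a local environment against these pins before running any of the code. The following is a minimal sketch and not part of the repository: the names `PINS`, `installed_version` and `check_pins` are hypothetical, and the pins are copied by hand from the list above. It assumes Python 3.8 or later for `importlib.metadata`. Note that `cuda` and `python` are not pip-installable packages, so the sketch reads the interpreter version from `sys.version_info` and leaves CUDA unchecked.

```python
import sys
from importlib import metadata

# Pins copied by hand from the dependency list above (hypothetical helper data).
PINS = {
    "cuda": "11.2.1",
    "python": "3.8.2",
    "tensorflow": "2.4.1",
    "tokenizers": "0.10.1",
    "torch": "1.8.0",
    "transformers": "4.3.3",
}


def installed_version(name):
    """Return the installed version string, or None if it cannot be determined."""
    if name == "python":
        # The interpreter is not a pip package; read its version directly.
        return ".".join(str(part) for part in sys.version_info[:3])
    if name == "cuda":
        # CUDA is a system toolkit, not a pip package; it is not checked here.
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each pinned name to a tuple (expected, found, matches)."""
    report = {}
    for name, expected in pins.items():
        found = installed_version(name)
        report[name] = (expected, found, found == expected)
    return report


if __name__ == "__main__":
    for name, (expected, found, ok) in check_pins(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:<13} expected {expected:<8} found {found or 'not found':<10} {status}")
```

A mismatch does not necessarily mean the code will fail, but these library versions are several years old, so a dedicated virtual environment is a reasonable precaution.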