ddxplus_testing
Repository
Basic Info
- Host: GitHub
- Owner: vodaiphuoc
- Language: Python
- Default Branch: main
- Size: 4.24 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
DDXPlus: A New Dataset For Automatic Medical Diagnosis

Appearing in the NeurIPS 2022 Datasets and Benchmarks track.
We are releasing under the CC-BY licence a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain.
The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based ASD system. Patients in the dataset are characterized by their socio-demographic data, a pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. The symptoms and antecedents can be binary, categorical and multi-choice, with the potential of leading to more efficient and natural interactions between ASD/AD systems and patients.
To the best of our knowledge, this is the first large-scale dataset that includes the differential diagnosis, and non-binary symptoms and antecedents.
Availability
- Our paper is available on arXiv.
- The dataset in French is hosted on figshare.
  - This is the original version of DDXPlus, on which all results in our paper were obtained.
- Starting from 9 May 2023, the dataset is also available in English for easier use. This version is hosted on figshare.
  - The English version of DDXPlus contains the same data in the same format as the French version.
  - Wherever possible, English names or non-semantic codes are used instead of French names.
  - Using the English version should lead to the same performance as using the French version.
Dataset documentation
In what follows, we use the term evidence as a general term to refer to a symptom or an antecedent. The dataset contains the following files:
- release_evidences.json: a JSON file describing all possible evidences considered in the dataset.
- release_conditions.json: a JSON file describing all pathologies considered in the dataset.
- release_train_patients.zip: a CSV file containing the patients of the training set.
- release_validate_patients.zip: a CSV file containing the patients of the validation set.
- release_test_patients.zip: a CSV file containing the patients of the test set.
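A minimal sketch of reading one of the patient archives with only the Python standard library. It assumes that each zip contains a single CSV file with a header row, and that the list-valued columns (EVIDENCES and DIFFERENTIAL_DIAGNOSIS) are stored as Python-style list literals. Neither detail is stated in this documentation, so check both against the downloaded files. The helper name read_patients is hypothetical.

```python
import ast
import csv
import io
import zipfile

# Assumption: these columns hold list literals such as "['E_91', 'E_48']".
LIST_COLUMNS = ("EVIDENCES", "DIFFERENTIAL_DIAGNOSIS")


def read_patients(zip_path, limit=None):
    """Return the patients of a release_*_patients.zip archive as dicts."""
    patients = []
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the first .csv member of the archive is the patient table.
        csv_name = next(n for n in archive.namelist() if n.endswith(".csv"))
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for i, row in enumerate(csv.DictReader(text)):
                if limit is not None and i >= limit:
                    break
                for column in LIST_COLUMNS:
                    if row.get(column):
                        row[column] = ast.literal_eval(row[column])
                row["AGE"] = int(row["AGE"])
                patients.append(row)
    return patients
```

For a quick look at the data, something like read_patients("release_validate_patients.zip", limit=5) keeps memory use small; the limit parameter is only a convenience of this sketch.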
Evidence description
Each evidence in the release_evidences.json file is described using the following entries:
- name: name of the evidence.
  - In the English version, this is replaced with a unique, non-semantic code starting with E.
- code_question: a code that identifies which evidences are related. Evidences with the same code_question form a group of related symptoms. The value of code_question refers to the evidence that needs to be simulated/activated before the other members of the group can be simulated.
- question_fr: the query, in French, associated with the evidence.
- question_en: the query, in English, associated with the evidence.
- is_antecedent: a flag indicating whether the evidence is an antecedent or a symptom.
- data_type: the type of the evidence. We use "B" for binary, "C" for categorical, and "M" for multi-choice.
- default_value: the default value of the evidence. If this value is used to characterize the evidence, then it is as if the evidence was not synthesized.
- possible-values: the possible values for the evidence. Only valid for categorical and multi-choice evidences.
  - In the English version, every value is replaced with a unique, non-semantic code starting with V.
- value_meaning: the meaning, in French and English, of each code that is part of the possible-values field. Only valid for categorical and multi-choice evidences.
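The grouping implied by code_question can be sketched as follows. The function assumes the evidences have already been loaded into a list of dictionaries with the entries listed above. The top-level layout of release_evidences.json is not described here, so the loading step may need adapting. group_by_code_question is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_code_question(evidences):
    """Map each code_question to the names of the evidences in its group."""
    groups = defaultdict(list)
    for evidence in evidences:
        groups[evidence["code_question"]].append(evidence["name"])
    return dict(groups)


# Toy entries: the E_130 row mirrors the documented example; the E_129 and
# E_91 rows are assumed for illustration (a group head pointing at itself).
evidences = [
    {"name": "E_129", "code_question": "E_129"},
    {"name": "E_130", "code_question": "E_129"},
    {"name": "E_91", "code_question": "E_91"},
]
groups = group_by_code_question(evidences)
print(groups)
```

Under this reading, an ASD/AD system would only ask about E_130 once the group head E_129 has been activated.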
Example
English
json
{
"name": "E_130",
"code_question": "E_129",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "V_11",
"value_meaning": {
"V_11": {"fr": "NA", "en": "NA"},
"V_86": {"fr": "foncée", "en": "dark"},
"V_107": {"fr": "jaune", "en": "yellow"},
"V_138": {"fr": "pâle", "en": "pale"},
"V_156": {"fr": "rose", "en": "pink"},
"V_157": {"fr": "rouge", "en": "red"}
},
"possible-values": [
"V_11",
"V_86",
"V_107",
"V_138",
"V_156",
"V_157"
],
"data_type": "C"
}
French
json
{
"name": "lesions_peau_couleur",
"code_question": "lesions_peau",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "NA",
"value_meaning": {
"NA": {"fr": "NA", "en": "NA"},
"foncee": {"fr": "foncée", "en": "dark"},
"jaune": {"fr": "jaune", "en": "yellow"},
"pale": {"fr": "pâle", "en": "pale"},
"rose": {"fr": "rose", "en": "pink"},
"rouge": {"fr": "rouge", "en": "red"}
},
"possible-values": [
"NA",
"foncee",
"jaune",
"pale",
"rose",
"rouge"
],
"data_type": "C"
}
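As an illustration of how possible-values and value_meaning fit together, the sketch below looks up the readable meaning of a coded value. describe_value is a hypothetical helper. Falling back to the raw value when no meaning is recorded is an assumption of this sketch, not documented behaviour.

```python
def describe_value(evidence, value, lang="en"):
    """Return the readable meaning of a coded value of an evidence."""
    if value not in evidence.get("possible-values", []):
        raise ValueError(f"{value!r} is not a possible value of {evidence['name']}")
    meaning = evidence.get("value_meaning", {}).get(value)
    # Assumption: if no meaning is recorded, the raw value is shown as-is.
    return meaning[lang] if meaning else str(value)


# Abridged copy of the English example above.
rash_color = {
    "name": "E_130",
    "data_type": "C",
    "default_value": "V_11",
    "possible-values": ["V_11", "V_86", "V_157"],
    "value_meaning": {
        "V_11": {"fr": "NA", "en": "NA"},
        "V_86": {"fr": "foncée", "en": "dark"},
        "V_157": {"fr": "rouge", "en": "red"},
    },
}

print(describe_value(rash_color, "V_157"))       # red
print(describe_value(rash_color, "V_86", "fr"))  # foncée
```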
Pathology description
The file release_conditions.json contains information about the pathologies that patients in the dataset may suffer from. Each pathology has the following attributes:
- condition_name: name of the pathology.
  - In the English version, the English name is used instead of the French name.
- cond-name-fr: name of the pathology in French.
- cond-name-eng: name of the pathology in English.
- icd10-id: ICD-10 code of the pathology.
- severity: the severity associated with the pathology. Lower values indicate more severe pathologies.
- symptoms: data structure describing the set of symptoms characterizing the pathology. Each symptom is represented by its corresponding name entry in the release_evidences.json file.
- antecedents: data structure describing the set of antecedents characterizing the pathology. Each antecedent is represented by its corresponding name entry in the release_evidences.json file.
Example
English
json
{
"condition_name": "Myasthenia gravis",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"E_65": {},
"E_63": {},
"E_52": {},
"E_172": {},
"E_84": {},
"E_66": {},
"E_90": {},
"E_38": {},
"E_176": {}
},
"antecedents": {
"E_28": {},
"E_204": {}
},
"severity": 3
}
French
json
{
"condition_name": "Myasthénie grave",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"dysphagie": {},
"dysarthrie": {},
"diplopie": {},
"ptose": {},
"faiblesse_msmi": {},
"dyspn": {},
"fatigabilité_msk": {},
"claud_mâchoire": {},
"rds_paralys_gen": {}
},
"antecedents": {
"atcdfam_mg": {},
"trav1": {}
},
"severity": 3
}
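A short sketch of working with pathology entries: listing the evidences attached to a pathology and ordering pathologies from most to least severe (lower severity value first, as documented above). It assumes the entries are already available as a list of dictionaries; the top-level layout of release_conditions.json is not described here. Both helper names and the second toy entry are hypothetical.

```python
def evidence_names(condition):
    """Return (symptom names, antecedent names) for one pathology entry."""
    return list(condition["symptoms"]), list(condition["antecedents"])


def most_severe_first(conditions):
    """Sort pathologies so that the lowest severity value comes first."""
    return sorted(conditions, key=lambda c: c["severity"])


conditions = [
    {   # Hypothetical entry, for illustration only.
        "condition_name": "Hypothetical mild condition",
        "symptoms": {"E_91": {}},
        "antecedents": {},
        "severity": 5,
    },
    {   # Abridged copy of the English example above.
        "condition_name": "Myasthenia gravis",
        "symptoms": {"E_65": {}, "E_63": {}},
        "antecedents": {"E_28": {}, "E_204": {}},
        "severity": 3,
    },
]

ordered = most_severe_first(conditions)
print([c["condition_name"] for c in ordered])
print(evidence_names(ordered[0]))
```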
Patient description
Each patient in each of the 3 sets has the following attributes:
- AGE: the age of the synthesized patient.
- SEX: the sex of the synthesized patient.
- PATHOLOGY: name of the ground truth pathology (cf. the condition_name property in the release_conditions.json file) that the synthesized patient is suffering from.
- EVIDENCES: list of evidences experienced by the patient. An evidence can either be binary, categorical or multi-choice. A categorical or multi-choice evidence is represented in the format [evidence-name]_@_[evidence-value] where [evidence-name] is the name of the evidence (name entry in the release_evidences.json file) and [evidence-value] is a value from the possible-values entry. Note that for a multi-choice evidence, it is possible to have several [evidence-name]_@_[evidence-value] items in the evidence list, with each item being associated with a different evidence value. A binary evidence is represented as [evidence-name].
- INITIAL_EVIDENCE: the evidence provided by the patient to kick-start an interaction with an ASD/AD system. This is useful during model evaluation for a fair comparison of ASD/AD systems as they will all begin an interaction with a given patient from the same starting point. The initial evidence is randomly selected from the evidence list mentioned above (i.e., EVIDENCES) and it is part of this list.
- DIFFERENTIAL_DIAGNOSIS: The ground truth differential diagnosis for the patient. It is represented as a list of pairs of the form [[patho_1, proba_1], [patho_2, proba_2], ...] where patho_i is the pathology name (condition_name entry in the release_conditions.json file) and proba_i is its related probability.
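The EVIDENCES encoding described above can be decoded with a small helper. This is a sketch under the stated format only: items containing the _@_ separator are treated as categorical or multi-choice, everything else as binary. parse_evidences is a hypothetical name, and values are kept as strings (so an intensity such as 4 comes back as "4").

```python
from collections import defaultdict

SEPARATOR = "_@_"


def parse_evidences(items):
    """Split an EVIDENCES list into binary names and a name -> values map."""
    binary = []
    valued = defaultdict(list)
    for item in items:
        name, sep, value = item.partition(SEPARATOR)
        if sep:
            # Multi-choice evidences may contribute several values.
            valued[name].append(value)
        else:
            binary.append(name)
    return binary, dict(valued)


# Items in the style of the English version of the dataset.
binary, valued = parse_evidences(
    ["E_48", "E_54_@_V_161", "E_54_@_V_183", "E_56_@_4", "E_91"]
)
print(binary)  # ['E_48', 'E_91']
print(valued)  # {'E_54': ['V_161', 'V_183'], 'E_56': ['4']}
```

Splitting on the first occurrence of the separator also works for the French names, for example douleurxx_carac_@_une_lourdeur_ou_serrement.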
Example
English
json
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchitis", 0.19171203430383882], ["Pneumonia", 0.17579340398940366], ["URTI", 0.1607809719801254], ["Bronchiectasis", 0.12429044460990353], ["Tuberculosis", 0.11367177304035844], ["Influenza", 0.11057936110639896], ["HIV (initial infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "URTI",
"EVIDENCES": ["E_48", "E_50", "E_53", "E_54_@_V_161", "E_54_@_V_183", "E_55_@_V_89", "E_55_@_V_108", "E_55_@_V_167", "E_56_@_4", "E_57_@_V_123", "E_58_@_3", "E_59_@_3", "E_77", "E_79", "E_91", "E_97", "E_201", "E_204_@_V_10", "E_222"],
"INITIAL_EVIDENCE": "E_91"
}
French
json
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchite", 0.19171203430383882], ["Pneumonie", 0.17579340398940366],["IVRS ou virémie", 0.1607809719801254], ["Bronchiectasies", 0.12429044460990353], ["Tuberculose", 0.11367177304035844], ["Possible influenza ou syndrome virémique typique", 0.11057936110639896], ["VIH (Primo-infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "IVRS ou virémie",
"EVIDENCES": ["crowd", "diaph", "douleurxx", "douleurxx_carac_@_sensible", "douleurxx_carac_@_une_lourdeur_ou_serrement", "douleurxx_endroitducorps_@_front", "douleurxx_endroitducorps_@_joue_D_", "douleurxx_endroitducorps_@_tempe_G_", "douleurxx_intens_@_4", "douleurxx_irrad_@_nulle_part", "douleurxx_precis_@_3", "douleurxx_soudain_@_3", "expecto", "f17.210", "fievre", "gorge_dlr", "toux", "trav1_@_N", "z77.22"],
"INITIAL_EVIDENCE": "fievre"
}
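Using the English example above, the sketch below checks two properties of a patient record: where the ground truth pathology ranks in the differential diagnosis, and whether the initial evidence belongs to the evidence list. rank_of_pathology is a hypothetical helper. That the probabilities sum to 1 is observed on this example only and is not claimed for the whole dataset.

```python
import math


def rank_of_pathology(patient):
    """1-based rank of the ground truth pathology in the differential, or None."""
    ranked = sorted(
        patient["DIFFERENTIAL_DIAGNOSIS"], key=lambda pair: pair[1], reverse=True
    )
    for rank, (name, _proba) in enumerate(ranked, start=1):
        if name == patient["PATHOLOGY"]:
            return rank
    return None


patient = {
    "PATHOLOGY": "URTI",
    "INITIAL_EVIDENCE": "E_91",
    "EVIDENCES": ["E_48", "E_50", "E_53", "E_91", "E_201"],  # abridged
    "DIFFERENTIAL_DIAGNOSIS": [
        ["Bronchitis", 0.19171203430383882],
        ["Pneumonia", 0.17579340398940366],
        ["URTI", 0.1607809719801254],
        ["Bronchiectasis", 0.12429044460990353],
        ["Tuberculosis", 0.11367177304035844],
        ["Influenza", 0.11057936110639896],
        ["HIV (initial infection)", 0.07333003867293564],
        ["Chagas", 0.04984197229703562],
    ],
}

total = sum(proba for _name, proba in patient["DIFFERENTIAL_DIAGNOSIS"])
print(rank_of_pathology(patient))                         # 3
print(math.isclose(total, 1.0, abs_tol=1e-9))             # True
print(patient["INITIAL_EVIDENCE"] in patient["EVIDENCES"])  # True
```

In this example the simulated pathology (URTI) is only the third most likely entry of the differential, which illustrates why the differential is provided alongside the single ground truth label.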
Dataset statistics
Pathology statistics

Socio-demographic statistics

Distribution of the evidence types
| | Binary | Categorical | Multi-choice | Total |
|:---------------:|:------:|:-----------:|:------------:|:-----:|
| Evidences | 208 | 10 | 5 | 223 |
| Symptoms | 96 | 9 | 5 | 110 |
| Antecedents | 112 | 1 | 0 | 113 |
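As a small sanity check, the counts in the table above can be verified to be internally consistent: each row should sum to its total, and symptoms plus antecedents should equal evidences in every column. The numbers are copied from the table; the function name is hypothetical.

```python
TYPES = ("Binary", "Categorical", "Multi-choice")

counts = {
    "Evidences": {"Binary": 208, "Categorical": 10, "Multi-choice": 5, "Total": 223},
    "Symptoms": {"Binary": 96, "Categorical": 9, "Multi-choice": 5, "Total": 110},
    "Antecedents": {"Binary": 112, "Categorical": 1, "Multi-choice": 0, "Total": 113},
}


def is_consistent(table):
    """Check row totals and that symptoms + antecedents = evidences."""
    rows_ok = all(sum(row[t] for t in TYPES) == row["Total"] for row in table.values())
    columns_ok = all(
        table["Symptoms"][c] + table["Antecedents"][c] == table["Evidences"][c]
        for c in TYPES + ("Total",)
    )
    return rows_ok and columns_ok


print(is_consistent(counts))  # True
```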
Number of evidences of the synthesized patients
| | Avg | Std dev | Min | 1st quartile | Median | 3rd quartile | Max |
|:---------------:|:-----:|:-------:|:---:|:------------:|:------:|:------------:|:---:|
| Evidences | 13.56 | 5.06 | 1 | 10 | 13 | 17 | 36 |
| Symptoms | 10.07 | 4.69 | 1 | 8 | 10 | 12 | 25 |
| Antecedents | 3.49 | 2.23 | 0 | 2 | 3 | 5 | 12 |
Differential diagnosis statistics

Experiments
Code for reproducing results in the paper can be found in code.
In our paper, we reported results for two methods: AARLC, an RL-based method, and BASD, a supervised method adapted from ASD. For instructions on how to run them, see here for AARLC and here for BASD.
Owner
- Login: vodaiphuoc
- Kind: user
- Repositories: 1
- Profile: https://github.com/vodaiphuoc
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Fansi Tchango"
  given-names: "Arsene"
- family-names: "Goel"
  given-names: "Rishab"
- family-names: "Wen"
  given-names: "Zhi"
- family-names: "Julien"
  given-names: "Martel"
- family-names: "Ghosn"
  given-names: "Joumana"
title: "DDXPlus: A New Dataset For Automatic Medical Diagnosis"
version: 1.0.0
date-released: 2022-05-19
url: "https://github.com/bruzwen/ddxplus"
GitHub Events
Total
- Push event: 2
- Create event: 2
Last Year
- Push event: 2
- Create event: 2
Dependencies
- mlflow *
- numpy *
- orion *
- torch *