sars-cov-2-sequenzdaten_aus_deutschland

Ein zentraler Bestandteil einer erfolgreichen Erregersurveillance ist das Verständnis der Verbreitung eines Erregers sowie seiner pathogenen Eigenschaften. Hierbei stellt das Wissen über das Erregergenom eine wichtige Informationsquelle dar. So erlaubt der Nachweis von Mutationen im Genom eines Erregers, Verwandtschaftsbeziehungen zu rekonstruie...

https://github.com/robert-koch-institut/sars-cov-2-sequenzdaten_aus_deutschland

Keywords

base-sequence basensequenz covid-19 deutschland genom genome germany offene-daten open-data rki sars-cov-2 viral-sequences virussequenzen

Last synced: 9 months ago · JSON representation ·

Repository

Ein zentraler Bestandteil einer erfolgreichen Erregersurveillance ist das Verständnis der Verbreitung eines Erregers sowie seiner pathogenen Eigenschaften. Hierbei stellt das Wissen über das Erregergenom eine wichtige Informationsquelle dar. So erlaubt der Nachweis von Mutationen im Genom eines Erregers, Verwandtschaftsbeziehungen zu rekonstruie...

Basic Info

Host: GitHub
Owner: robert-koch-institut
License: cc-by-4.0
Default Branch: main
Homepage: https://robert-koch-institut.github.io/SARS-CoV-2-Sequenzdaten_aus_Deutschland/
Size: 11.5 GB

Statistics

Stars: 68
Watchers: 6
Forks: 7
Open Issues: 1
Releases: 733

Topics

base-sequence basensequenz covid-19 deutschland genom genome germany offene-daten open-data rki sars-cov-2 viral-sequences virussequenzen

Created almost 5 years ago · Last pushed 10 months ago

Metadata Files

Readme License Citation

Readme.en.md

Documentation

SARS-CoV-2 Sequence Data from Germany

Robert Koch Institute

Cite
Robert Koch Institute. (2025). SARS-CoV-2 Sequence Data from Germany [Data set]. Zenodo. https://doi.org/10.5281/zenodo.16965158

Abstract
The dataset ‘SARS-CoV-2 Sequence Data from Germany’ consists of complete virus genome sequences and associated metadata from samples collected nationwide. The samples are sequenced and bioinformatically analysed in collaboration with the IMSSC2 laboratory network, the National Reference Centre for Coronaviruses at Charité and the RKI. The dataset enables robust molecular epidemiological analyses of the spread of SARS-CoV-2 in Germany and represents a central resource for research and public health surveillance.

Table of Content  - Information on the data set and context of origin - Structure and content of the dataset - Guidelines for reuse of the data

--- die deutsche Version finden Sie hier ---

Information on the data set and context of origin

A central component of successful pathogen surveillance is understanding the spread of a pathogen and its pathogenic properties. Knowledge of the pathogen genome is an important source of information here. The detection of mutations in the genome of a pathogen makes it possible to reconstruct relationships, uncover transmission routes and predict resistance. The Integrated Genomic Surveillance (IGS) of SARS-CoV-2 aims to monitor the spread of the virus and in particular of virus variants of concern in the population and to closely observe any changes in the virus that occur. The public provision of genomic data is of particular importance in order to enable scientists in Germany and worldwide to carry out their own analyses.

As part of the Coronavirus Surveillance Ordinance, SARS-CoV-2 sequence data from all over Germany were transmitted to the RKI via the German Electronic Sequence Data Hub (DESH) until 31.05.2023. With the expiration of the ordinance, samples will be provided by the IMSSC2 laboratory network in the future and sequenced, analyzed and made available here at the RKI. Despite the reduced number of samples, the careful selection of the participating laboratories ensures a representative insight into the virus population (Djin Ye Oh et al. 2022). In addition, sequences from the NRZ Coronaviruses at the Charité will be contributed to complement the IMSSC2 network.

Administrative and organizational information

The dataset "SARS-CoV-2 sequence data from Germany" is provided by the Robert Koch Institute for research work related to SARS-CoV-2 surveillance in the IGS project.

Data collection at the RKI is carried out with the expiry of the Coronavirus Surveillance Ordinance via the IMSSC2 laboratory network under the direction of FG 17 | Influenza viruses and other viruses of the respiratory tract and by the National Reference Center for Coronaviruses.

As part of the IGS project, the data produced by MF1 | Genome Competence Centre will be analyzed bioinformatically. Questions regarding the project can best be directed to IGS@rki.de.

The coordination and collection of reporting data is carried out by FG 36 | Respiratory communicable diseases.

Publication of the data, data curation and quality management of the (meta-)data are carried out by the RKI's MF 4 | Specialized and Research Data Management department. Questions about data management can be directed to the Open Data Team of the MF4 department (OpenData@rki.de).

Data collection

The IMSSC2 laboratory network consists of ~20 laboratory medical facilities in 13 federal states, which send randomly selected SARS-CoV-2-positive sample material to the RKI on a weekly basis. Here, whole genome sequencing and further phylogenetic and genome biology analyses are carried out to identify the most common SARS-CoV-2 lineages circulating in Germany. The results are published promptly on the RKI website and in scientific journals and contribute to the assessment of the current epidemiological situation of COVID-19. The IMSSC2 data is supplemented by sequences collected by the National Consiliary Laboratory for Coronaviruses. The data from both sources is made available to the public via GitHub and other public databases. Also included in the dataset are SARS-CoV-2 sequence data from all over Germany that were submitted to the RKI via the German Electronic Sequence Data Hub (DESH) by May 31, 2023.

Assignment of virus lines based on pangolin

The assignment of known virus lines to the collected sequences is carried out using Pangolin. When a new version or updated lineage definitions of Pangolin are released, the lineage information for the entire sequence collection is reassigned to the entire sequence dataset. The information about the lineage and the Pangolin version used can be found for each sequence in the metadata.

The information provided on the virus lineages corresponds to the current PANGOLIN Lineage Format. Only the "Taxon" column has been renamed SEQUENCE.ID to facilitate subsequent use. The SEQUENCE.ID, which is contained in all three data, is central for linking the developmental lines with the other data. PANGOLIN Lineage Format is authoritative in case of contradictions.

Quality management

The data collected by DESH passed the quality control (QC) of the IGS at the RKI according to published criteria (see: rki.de - DESH Qualitaetskriterien.pdf). In addition, for all sequences, including IMSSC2 samples, a bioinformatic QC of the sequence is performed with PRESIDENT: PaiRwisE Sequence IDENtiTy with an identity threshold of 70% and an N threshold of 20%. The metadata QC checks the metadata for incorrect data and entries that would influence further processing. If the QC for metadata or sequence data is not passed, this data is not made publicly available in order to ensure the high quality of the public dataset.

Structure and content of the dataset

The dataset includes genomic sequences of SARS-CoV-2 isolates from all over Germany and associated metadata. The dataset contains:

Submitted SARS-CoV-2 genome sequences
Metadata on SARS-CoV-2 genome sequences
License including the usage license of the dataset
Metadata file for import into Zenodo
Information on VOCs and VOIs
List of relevant lineages

SARS-CoV-2 sequence data

The SARS-CoV-2 sequence data is provided in the root directory under "SARS-CoV-2-Sequenzdaten_Deutschland.fasta.xz".

SARS-CoV-2-Sequenzdaten_Deutschland.fasta.xz

Structure of the sequence data

The file provided contains sequence entries that are structured according to the FASTA format. In this format, each entry begins with a short description, also known as a header or "description line". This line is identified by a ">" character at the beginning of the line. The header is followed by the sequence itself, which is a sequence of nucleic acids in IUB/IUPAC format

Each sequence ends with the start of a new sequence entry, indicated by a new header, or, in the case of the last sequence entry, with the end of the file.

In the sequence data provided, the header corresponds to the igs_id, which allows a simple link to the metadata provided.

Header: "><igsid> version=<version> id=<genomeid> <contig_index>"
Nucleic acid sequence: IUB/IUPAC standard

This results in the following exemplary structure of a .fasta file:

```fasta=

IGS-101XX-CVDP-XX version=1 id=939421ee-feab-4b79-9f19-6dc248e0ee89 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNACCAACCAACTTTCGATCTCTT...
IGS-101YY-CVDP-YY version=0 id=08f5d734-d135-4d2a-9680-bc5a795b2d34 0 NNNNNNNNNNNNNNNNNNNNNNNNNNNNACCAACTCTCGGCTGCATGCT...
```

Compression of the sequence data

The SARS-CoV-2 sequence data is provided as an xz-compressed .fasta file. This results in the file extension .fasta.xz. Linux line breaks are used.

Character set: UTF-8
Compression: .xz
Included file format: .fasta
Line breaks: Linux line breaks

The files can be unpacked on common operating systems, for example with the programs 7zip or XZ Utils. Compression is performed as the .fasta files in particular are several gigabytes (GB) in size.

Sequence metadata

The sequence metadata is provided in "SARS-CoV-2-Sequenzdaten_Deutschland.tsv.xz". This data also contains the assigned virus lines.

SARS-CoV-2-Sequenzdaten_Deutschland.tsv.xz

Variables and values

The file SARS-CoV-2-Sequenzdaten_Deutschland.tsv.xz contains the variables and their values shown in the following table. A machine-readable data schema is stored in Data Package Format in tableschemaSARS-CoV-2-SequenzdatenDeutschland.en.json:

tableschemaSARS-CoV-2-SequenzdatenDeutschland.en.json

| Variable | Type | Characteristic | Description | |:----------------------------|:--------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | igsid | string | Example: IGS-10099-CVDP-01A2C74B-54A8-4
7B1-B7E4-6562C6231234 | A unique identifier that combines sequence data and metadata. This identifier is used as part of the FASTA ID in the sequence data. | | dateofsampling | date | Format: YYYY-MM-DDTHH:MM:SS | Date of sampling in ISO 8601 format without time zone | | sequencingplatform | string | Example: ILLUMINA | The sequencing platform used based on the ontology approved by ENA | | sequencingreason | string | Values: random, requested, clinical, other | Reason for conducting the sequencing. random: The sample was taken randomly.requested: The sample was taken due to concerns/suspicions about a new variant or something similar. clinical: The sample comes from a clinical setting. other: The reason is none of the above. | | isolationsource | string | Example: Nasopharyngeal swab (specimen) | DEMIS Vocabulary | | labsequenceid | string | Example: 873a7cc28d29e3f17b0544ea6e9e84
36defe32f6d60649159ee8ac78d414
7ac9 | FASTA ID used by the laboratory in encrypted form | | dateofsubmission | date | Format: YYYY-MM-DDTHH:MM:SS | Date of receipt of the genome at the RKI in ISO 8601 format without time zone | | version | integer | Values: ≥0 | Version of the sequence starting with 0 | | diagnosticlab.demislabid | string | Example: DEMIS-10099 | Identification number of the primary diagnostic laboratory | | diagnosticlab.postalcode | string | Example: 50858 | Postal code of the primary diagnostic laboratory | | sequencinglab.demislabid | string | Example: DEMIS-10099 | Identification number of the sequencing laboratory | | sequencinglab.postalcode | string | Example: 50858 | Postal code of the sequencing lab | | genome.gtrs | string | Examples: [{"date_of_creation": "2025-05-19T11:35:46.427598"`, `"method_version": "4.3.1"`, `"database_version": "PUSHER-v1.32"`, `"genomic_typing_result": "BA.2"`, `"date_of_assignment": "2025-01-30T16:14:14.218144"`, `"genomic_method": {"name": "Pangolin Lineage"}`, `"additional_information": "{\"note\": \"Usher placements: BA.2(1/1)\"`, `\"conflict\": 0`, `\"qc_notes\": \"Ambiguous_content:0.02\"`, `\"qc_status\": \"pass\"`, `\"is_designated\": false}"`, `"date_of_modification": "2025-`<br>`05-19T11:35:46.427598"}] | genomic typing results (GTR) in JSON format |

The file SARS-CoV-2-Entwicklungslinien_berichtet.tsv contains the variables and their values shown in the following table. A machine-readable data schema is stored in Data Package Format in tableschemaSARS-CoV-2-Entwicklungslinienberichtet.en.json:

tableschemaSARS-CoV-2-Entwicklungslinienberichtet.en.json

| Variable | Type | Characteristic | Description | |:----------------------|:-------|:---------------------|:--------------------------------------------------------------------| | LINEAGE | string | Example: JN.1 | Assigned Pangolin Lineage | | WHOLABEL | string | Example: Omikron | Name of the virus variant assigned by the World Health Organization | | CONTRIBUTINGLINEAGES | string | Example: JN.1.1.10 | Pangolin lineages derived from the lineage |

The file SARS-CoV-2-EntwicklungslinienzuVarianten.tsv contains the variables and their values shown in the following table. A machine-readable data schema is stored in Data Package Format in tableschemaSARS-CoV-2-Entwicklungslinienzu_Varianten.en.json:

tableschemaSARS-CoV-2-Entwicklungslinienzu_Varianten.en.json

| Variable | Type | Characteristic | Description | |:----------------------|:-------|:---------------------|:-------------------------------------------------------------------------------------------| | LINEAGE | string | Example: BA.2 | Assigned Pangolin Lineage | | WHOLABEL | string | Example: Omikron | Name of the virus variant assigned by the World Health Organization | | CONTRIBUTINGLINEAGES | string | Example: JN.13.1 | Pangolin lineages derived from the lineage | | COLOR | any | | Legacy variable. It is no longer relevant and will be removed perspectively. | | variant_category | string | Values: VOC, VOI | WHO Classification of the variant as VOC (variant of concern) or VOI (variant of interest) |

Formatting the sequence metadata

The sequence metadata is provided as an xz-compressed, comma-separated .csv file. This results in the file extension .csv.xz. The character set used in the .csv file is UTF-8. The individual values are separated by a comma ",". Dates are formatted in the ISO 8601 standard.

Character set: UTF-8
Date format: ISO 8601
Compression: .xz
Included file format: .tsv
.csv separator: Tab "\t"

The files can be unpacked on common operating systems, for example with the programs 7zip or XZ Utils. Compression is performed as the .fasta files in particular are several gigabytes (GB) in size.

Metadata

To increase findability, the provided data are described with metadata. The Metadata are distributed to the relevant platforms via GitHub Actions. There is a specific metadata file for each platform; these are stored in the metadata folder:

Metadaten/

Versioning and DOI assignment are performed via Zenodo.org. The metadata prepared for import into Zenodo are stored in the zenodo.json. Documentation of the individual metadata variables can be found at https://developers.zenodo.org/representation.

Metadaten/zenodo.json

The zenodo.json includes the publication date and the date of the data status in the following format (example): "publication_date": "2024-06-19", "dates": [ { "start": "2023-09-11T15:00:21+02:00", "end": "2023-09-11T15:00:21+02:00", "type": "Collected", "description": "Date when the dataset was created" } ],

Additionally, we describe tabular data using the Data Package Standard.
A Data Package is a structured collection of data and associated metadata that facilitates data exchange and reuse. It consists of a datapackage.json file that contains key information such as the included resources, their formats, and schema definitions.

The Data Package Standard is provided by the Open Knowledge Foundation and is an open format that enables a simple, machine-readable description of datasets.

The list of data included in this repository can be found in the following file:

datapackage.json

For tabular data, we additionally define a Table Schema that describes the structure of the tables, including column names, data types, and validation rules. These schema files can be found in:

Metadaten/schemas/

Guidelines for reuse of the data

Open data from the RKI are available on Zenodo.org, GitHub.com, OpenCoDE, and Edoc.rki.de:

https://zenodo.org/communities/robertkochinstitut
https://github.com/robert-koch-institut
https://gitlab.opencode.de/robert-koch-institut
https://edoc.rki.de/

License

The "SARS-CoV-2 Sequence Data from Germany" dataset is licensed under the Creative Commons Attribution 4.0 International Public License | CC-BY.

The data provided in the dataset are freely available, with the condition of attributing the Robert Koch Institute as the source, for anyone to process and modify, create derivatives of the dataset and use them for commercial and non-commercial purposes.
Further information about the license can be found in the LICENSE or LIZENZ file of the dataset.

Owner

Name: Robert Koch-Institut
Login: robert-koch-institut
Kind: organization
Location: Berlin

Website: http://www.rki.de
Twitter: rki_de
Repositories: 16
Profile: https://github.com/robert-koch-institut

Das RKI ist die zentrale Einrichtung der deutschen Bundesregierung auf dem Gebiet der Krankheitsüberwachung und -prävention.

Citation (citation.cff)

cff-version: 1.2.0
type: dataset
title: SARS-CoV-2 Sequenzdaten aus Deutschland
abstract: >-
  Ein zentraler Bestandteil einer erfolgreichen Erregersurveillance ist das
  Verständnis der Verbreitung eines Erregers sowie seiner pathogenen
  Eigenschaften. Hierbei stellt das Wissen über das Erregergenom eine wichtige
  Informationsquelle dar. So erlaubt der Nachweis von Mutationen im Genom eines
  Erregers, Verwandtschaftsbeziehungen zu rekonstruieren, Übertragungswege
  aufzudecken und Resistenzen vorherzusagen. Die Integrierte Genomische
  Surveillance (IGS) von SARS-CoV-2 zielt darauf ab, die Verbreitung des Virus
  und insbesondere von besorgniserregenden Virusvarianten in der Bevölkerung zu
  überwachen sowie auftretende Veränderungen des Virus genau zu beobachten.
  Besondere Bedeutung kommt dabei der öffentlichen Bereitstellung der
  genomischen Daten zu, um Wissenschaftlern in Deutschland und weltweit die
  Möglichkeit zu eigenständigen Analysen zu eröffnen.  Im Rahmen der
  Coronavirus-Surveillanceverordnung
  (https://www.gesetze-im-internet.de/corsurv/BJNR601910021.html) wurden bis zum
  31.05.2023 SARS-CoV-2 Sequenzdaten aus ganz Deutschland über den Deutschen
  Elektronischen Sequenzdaten-Hub (DESH) an das RKI übermittelt
  (https://doi.org/10.5281/zenodo.7992536). Mit Ablauf der Verordnung werden
  künftig Proben durch das IMSSC2 Labornetzwerk bereitgestellt und am RKI
  sequenziert, analysiert und hier bereitgestellt. Trotz reduzierter 
  Probenanzahl, wird durch die sorgfältige Auswahl der beteiligten Labore ein
  repräsentativer Einblick in die Viruspopluation gesichert (Djin Ye Oh  et al.
  2022) (https://doi.org/10.1093/cid/ciac399). Zusätzlich werden Sequenzen vom
  NRZ Coronaviren an der Charité beigetragen um das IMSSC2 Netzwerk zu ergänzen.
date-released: '2025-09-03'
keywords:
  - COVID-19
  - SARS-CoV-2
  - Virussequenzen
  - Genom
  - Basensequenz
  - Deutschland
  - Viral Sequences
  - Genome
  - Base Sequence
  - Gesundheitsberichterstattung
  - Biochemische Genetik
  - Epidemiologie
  - Biochemical genetics
  - Public health surveillance
  - Epidemiology
  - Open Data
  - Offene Daten
  - Deutschland
  - Germany
  - RKI
message: Cite me!
url: >-
  https://robert-koch-institut.github.io/SARS-CoV-2-Sequenzdaten_aus_Deutschland/
license: CC-BY-4.0
doi: 10.5281/zenodo.17047123
version: '2025-09-02T08:27:00+02:00'
authors:
  - name: Robert Koch-Institut

GitHub Events

Total

Create event: 37
Issues event: 11
Release event: 29
Watch event: 2
Delete event: 3
Member event: 1
Issue comment event: 13
Push event: 94
Pull request review event: 1
Pull request event: 4

Last Year

Create event: 37
Issues event: 11
Release event: 29
Watch event: 2
Delete event: 3
Member event: 1
Issue comment event: 13
Push event: 94
Pull request review event: 1
Pull request event: 4

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 7
Total pull requests: 1
Average time to close issues: 7 days
Average time to close pull requests: 13 days
Total issue authors: 6
Total pull request authors: 1
Average comments per issue: 1.14
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 7
Pull requests: 1
Average time to close issues: 7 days
Average time to close pull requests: 13 days
Issue authors: 6
Pull request authors: 1
Average comments per issue: 1.14
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

icestorm972 (7)
corneliusroemer (5)
RKIOpenData (2)
KlausBC (1)
Rutger265 (1)
caggtaagtat (1)
fessera (1)
HannesWuensche (1)
RKIForschungsdatenmanagement (1)