As part of our effort to enable global scientific collaboration and facilitate international public health response, we share open sequences with the INSDC databases (ENA, DDBJ, NCBI) and ingest public sequences from the INSDC. Read more about how your sequences are shared here and here.
This means that when you browse for a pathogen on Pathoplexus, you should also see all other publicly available sequences of that pathogen (provided they satisfy our sequence alignment requirements).
We download sequences from the INSDC using the NCBI Datasets Virus Data Package. We then map INSDC metadata fields to Loculus metadata fields before uploading the sequences to Loculus. To give users access to as much data as possible, we do not enforce required metadata fields on data ingested from the INSDC. However, we do enforce that each sequence alignment is of acceptable quality. The quality score follows the standard for each pathogen, as defined in the Nextclade datasets that we use.
For multi-segmented organisms, the INSDC often does not offer data where segments have been grouped by isolate. To retain as much information as possible from the samples, we group segments ourselves, based on their isolate and other metadata fields. By default, all metadata fields must be identical across segments for us to group them as one sample, with the exception of segment-specific metadata fields. These fields are either alignment-related (length, totalSnps, totalInsertedNucs, totalDeletedNucs, totalUnknownNucs, totalAmbiguousNucs, totalFrameShifts, frameShifts, completeness) or related to the INSDC accession of that specific segment (ncbiUpdateDate, insdcAccessionBase, insdcAccessionFull, insdcVersion).
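
The grouping rule above can be illustrated with a minimal Python sketch. It is not the actual ingest code: the function names, the flat dictionary record shape, and the `segment` key are assumptions made for illustration. Only the list of segment-specific fields is taken from the text.

```python
from collections import defaultdict

# Segment-specific fields named in the text; these may differ between
# segments of the same sample.
SEGMENT_SPECIFIC_FIELDS = {
    "length", "totalSnps", "totalInsertedNucs", "totalDeletedNucs",
    "totalUnknownNucs", "totalAmbiguousNucs", "totalFrameShifts",
    "frameShifts", "completeness",
    "ncbiUpdateDate", "insdcAccessionBase", "insdcAccessionFull", "insdcVersion",
}

# Assumption: each record stores its segment name under a "segment" key,
# which is also excluded from the comparison.
IGNORED_FIELDS = SEGMENT_SPECIFIC_FIELDS | {"segment"}


def shared_metadata_key(record):
    """Return a hashable key built from all fields that must match."""
    return tuple(sorted(
        (name, value) for name, value in record.items()
        if name not in IGNORED_FIELDS
    ))


def group_segments(records):
    """Group records whose shared metadata fields are all identical."""
    groups = defaultdict(list)
    for record in records:
        groups[shared_metadata_key(record)].append(record)
    return list(groups.values())


records = [
    {"segment": "L", "insdcAccessionFull": "AB000001.1",
     "specimenCollectorSampleId": "isolate-1", "geoLocCountry": "Kenya"},
    {"segment": "S", "insdcAccessionFull": "AB000002.1",
     "specimenCollectorSampleId": "isolate-1", "geoLocCountry": "Kenya"},
    {"segment": "S", "insdcAccessionFull": "AB000003.1",
     "specimenCollectorSampleId": "isolate-1", "geoLocCountry": "Uganda"},
]
groups = group_segments(records)
```

In this example the first two records form one sample because they differ only in segment-specific fields, while the third stays on its own because its `geoLocCountry` differs. A real implementation would also need to decide how to handle records with no isolate information or groups containing the same segment twice, which this sketch does not cover.
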
The table below shows how NCBI Virus fields map to Loculus metadata fields.

NCBI Virus Field Name | Loculus Field Name |
---|---|
Accession | insdcAccessionFull (also produces insdcAccessionBase, insdcVersion) |
BioProjects | bioprojectAccession |
BioSample accession | biosampleAccession |
Geographic Location | geoLocCountry, geoLocAdmin1 |
Geographic Region | geoLocAdmin2 |
Host Common Name | hostNameCommon |
Host Infraspecific Names Sex | hostGender |
Host Name | hostNameScientific |
Host Taxonomic ID | hostTaxonId |
Is Lab Host | isLabHost |
Isolate Collection date | sampleCollectionDate |
Isolate Lineage | specimenCollectorSampleId |
Purpose of Sampling | purposeOfSampling |
Release date | ncbiReleaseDate |
SRA Accessions | sraRunAccession |
Source database | ncbiSourceDb |
Submitter Affiliation | authorAffiliations |
Submitter Names | authors |
Update date | ncbiUpdateDate |
Virus Name | ncbiVirusName |
Virus Taxonomic ID | ncbiVirusTaxId |
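
As a rough illustration of how such a mapping could be applied, the sketch below renames a few fields from the table and derives `insdcAccessionBase` and `insdcVersion` from the full accession. The `FIELD_MAP` subset, the `map_record` function, and the dictionary record shape are hypothetical and are not the actual ingest implementation. The split relies on the standard INSDC `accession.version` format.

```python
# A subset of the table above, used for illustration only.
FIELD_MAP = {
    "BioProjects": "bioprojectAccession",
    "BioSample accession": "biosampleAccession",
    "Host Name": "hostNameScientific",
    "Isolate Collection date": "sampleCollectionDate",
    "Virus Name": "ncbiVirusName",
}


def map_record(ncbi_record):
    """Rename known NCBI Virus fields and split the accession."""
    # Fields that are not in FIELD_MAP are dropped in this sketch.
    loculus = {
        FIELD_MAP[name]: value
        for name, value in ncbi_record.items()
        if name in FIELD_MAP
    }
    accession = ncbi_record.get("Accession")
    if accession:
        # "AB123456.2" -> base "AB123456", version "2" (kept as a string;
        # empty if the accession carries no version suffix).
        base, _, version = accession.partition(".")
        loculus["insdcAccessionFull"] = accession
        loculus["insdcAccessionBase"] = base
        loculus["insdcVersion"] = version
    return loculus


mapped = map_record({
    "Accession": "AB123456.2",
    "Host Name": "Homo sapiens",
    "Unmapped Field": "ignored",
})
```

Fields that map to more than one Loculus field, such as Geographic Location, would need their own parsing step, which is left out here because the text does not describe how they are split.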