Data from: Parasite infection of public databases: a data mining approach to identify apicomplexan contaminations in animal genome and transcriptome assemblies
| Main Authors: | Borner, Janus; Burmester, Thorsten |
|---|---|
| Format: | info dataset Journal |
| Published: | 2018 |
| Online Access: | https://zenodo.org/record/5010344 |
Table of Contents:
- Background: Contaminations from various exogenous sources are a common problem in next-generation sequencing. Endogenous parasites are another possible source of contaminating DNA. On the one hand, undiscovered contaminations of animal sequence assemblies may lead to erroneous interpretation of data; on the other hand, when identified, parasite-derived sequences may provide a valuable source of information. Results: Here we show that sequences deriving from apicomplexan parasites can be found in many animal genome and transcriptome projects, and that in most cases they derive from an infection of the sequenced host specimen. The apicomplexan sequences were extracted from the sequence assemblies using a newly developed bioinformatic pipeline (ContamFinder) and tentatively assigned to distinct taxa employing phylogenetic methods. We analysed 920 assemblies and found 20,907 contigs of apicomplexan origin in 51 of the datasets. The contaminating species were identified as members of the apicomplexan taxa Gregarinasina, Coccidia, Piroplasmida, and Haemosporida. For example, in the platypus genome assembly, we found a high number of contigs derived from a piroplasmid parasite (presumably Theileria ornithorhynchi). For most of the infecting parasite species, no molecular data had been available previously, and some of the datasets contain sequences representing a large part of the parasite's gene repertoire. Conclusion: Our study suggests that parasite-derived contaminations represent a valuable source of information that can help to discover and identify new parasites, and provide information on previously unknown host-parasite interactions. We therefore argue that uncurated assembly data should routinely be made available in addition to the final assemblies.
- extracted contigs (extracted_contigs.zip): FASTA files containing the extracted, parasite-derived contigs. Contigs from each assembly are stored in a separate file.
- predicted amino acids (predicted_aa.zip): FASTA files containing the predicted amino acid sequences based on the extracted contigs. Sequences from each assembly are stored in a separate file.
- dataset 1 single genes (dataset_1_single_genes.zip): FASTA files containing the single gene amino acid alignments of dataset 1 prior to processing by Gblocks.
- dataset 1 single genes after gblocks (dataset_1_single_genes_gblocks.zip): FASTA files containing the single gene amino acid alignments of dataset 1 after processing by Gblocks.
- dataset 2 single genes (dataset_2_single_genes.zip): FASTA files containing the single gene amino acid alignments of dataset 2 prior to processing by Gblocks.
- dataset 2 single genes after gblocks (dataset_2_single_genes_gblocks.zip): FASTA files containing the single gene amino acid alignments of dataset 2 after processing by Gblocks.
- dataset 1 superalignment in FASTA format (dataset_1_superalignment.fa): Concatenated superalignment of all 1420 single gene amino acid alignments of dataset 1 after processing by Gblocks.
- dataset 2 superalignment in FASTA format (dataset_2_superalignment.fa): Concatenated superalignment of all 301 single gene amino acid alignments of dataset 2 after processing by Gblocks.
- mitochondrial sequences from gorilla Plasmodium in FASTA format (mito_gorilla.fa): Nucleotide alignment of mitochondrial Plasmodium sequences, including two sequences that were extracted from the gorilla genome. The alignment is based on data from Liu et al. (2010) and contains only sequences from clades G1 and C1.
- 18S rRNA Piroplasmida in FASTA format (18s_piroplasmida.fa): Nucleotide alignment of 18S rRNA sequences from Piroplasmida, including a sequence that was extracted from the platypus genome. The alignment is based on data from Paparini et al. (2015) and was processed by Gblocks.
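
The files listed above are FASTA files, some of them packed in zip archives, and the superalignments are described as concatenations of the single gene alignments. The following Python sketch illustrates how such files might be read and how a concatenation of this kind could be reproduced. It is not the authors' code and is not part of ContamFinder. The function names `parse_fasta`, `read_fasta_zip`, and `concatenate_alignments` are hypothetical, and the sketch assumes that the archive members are plain UTF-8 text and that each FASTA header is a taxon name used consistently across gene files, which the record does not state.

```python
import io
import zipfile


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_zip(zip_path):
    """Map each member file of a zip archive to its list of FASTA records.

    Assumption: every member is a UTF-8 text file in FASTA format.
    """
    records = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8")
                records[name] = list(parse_fasta(text))
    return records


def concatenate_alignments(alignments, gap="-"):
    """Concatenate per-gene alignments into one superalignment.

    `alignments` is a list of dicts mapping taxon name -> aligned sequence.
    Assumption: a taxon missing from a gene is padded with gap characters.
    """
    # Collect taxa in order of first appearance.
    taxa = []
    for aln in alignments:
        for taxon in aln:
            if taxon not in taxa:
                taxa.append(taxon)

    combined = {taxon: [] for taxon in taxa}
    for aln in alignments:
        if not aln:
            continue  # an empty alignment contributes no columns
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for taxon in taxa:
            combined[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(parts) for taxon, parts in combined.items()}
```

For example, `read_fasta_zip("extracted_contigs.zip")` would return one list of records per assembly file, so `len(records[name])` gives the number of contigs extracted from that assembly. Passing `[dict(r) for r in read_fasta_zip("dataset_1_single_genes_gblocks.zip").values()]` to `concatenate_alignments` would build a concatenated alignment under the assumptions above; whether it matches `dataset_1_superalignment.fa` depends on gene order and header conventions that are not documented here.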