AlienRemover: data and materials

Main Author: Criscuolo Alexis
Format: info dataset Journal
Published: 2020
Subjects:
Online Access: https://zenodo.org/record/4293521
Table of Contents:
  • AlienRemover (https://gitlab.pasteur.fr/GIPhy/AlienRemover) is a program to quickly discard alien reads (e.g. exogenous reads, host, cloning vectors) from FASTQ-formatted files. Alien bases are searched using a fast alien k-mer identification algorithm, and the removal criterion is determined by detecting (based on k-mers) a sufficient proportion of successive alien bases within high-throughput sequencing (HTS) reads. To determine accurate default values for both the k-mer length and the proportion of successive alien bases, different datasets were built and stored in this repository.

    ► Athal_PhiX (Arabidopsis thaliana + PhiX reads)

    This dataset was derived from the 2x300 Illumina MiSeq reads associated with the SRA accession SRR726611. These HTS reads correspond to the whole genome sequencing of A. thaliana, but also of Escherichia virus PhiX (used for control). Technical adapter and primer oligonucleotides were detected using Minion, yielding the following sequences:

    >R1_TruSeq_Adapter_Index_22
    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACcgtacgTAATCTCGTATGCCGTCTTCTGCTTG
    >R2_TruSeq_Universal_Adapter_rc
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    >poly-A
    AAAAAAAAAAAAAAA
    >poly-C
    CCCCCCCCCCCCCCC

    Read pairs were clipped and trimmed using AlienTrimmer v2.0. Only read pairs in which both reads were at least 220 bp long were retained. To assign each remaining read pair to its taxon, the pairs were aligned using minimap2 against genome assemblies of A. thaliana (GCF_000001735.4) and PhiX174 (GCA_002588795.1). Read pairs that did not align against these two genome assemblies were discarded. This procedure led to 6,378,237 read pairs: 6,343,954 (99.46%) and 34,283 (0.54%) are associated with A. thaliana and PhiX174, respectively. This read dataset corresponds to the two gzipped FASTQ files Athal_PhiX.1.fastq.gz and Athal_PhiX.2.fastq.gz. Each FASTQ block (i.e. 4-line block) associated with A. thaliana and PhiX has its first line starting with @Arabidopsis_thaliana and @PhiX, respectively.

    ► hCoV19_Hsap (SARS-CoV-2 + Homo sapiens reads)

    This dataset was derived from the 300 bp Illumina iSeq single-end reads associated with the SRA accession SRR12782936. These HTS reads correspond to the whole genome sequencing of a SARS-CoV-2 virus isolate, but contain alien reads from its human host. These reads were processed following a procedure similar to the one described above. Trimming and clipping were carried out with the following oligonucleotides:

    >TruSeq_DNA
    AGATCGGAAGAGCACACGTCTGAACTCCAGTCACcctatggtATCTCGTATGCCGTCTTCTGCTTG
    >poly-A
    AAAAAAAAAAAAAAA
    >poly-C
    CCCCCCCCCCCCCCC

    Taxonomic assignment was performed by aligning reads against genome assemblies of H. sapiens (GCF_000001405.28) and SARS-CoV-2 (GCA_009858895.3), leading to a total set of 392,404 single-end reads, made up of 222,945 (56.81%) SARS-CoV-2 reads and 169,459 (43.19%) H. sapiens reads. This read dataset corresponds to the gzipped FASTQ file hCoV19_Hsap.fastq.gz. Each FASTQ block associated with SARS-CoV-2 and H. sapiens has its first line starting with @SARS-CoV-2 and @Homo_sapiens, respectively.

    ► Homo.sapiens.*.kmr (Homo sapiens k-mers)

    AlienRemover can save to a file the distinct k-mers associated with alien genomes. To quickly detect and remove alien reads within FASTQ files, these saved alien k-mers can then be read directly by AlienRemover, which is useful when dealing with large alien genomes. When analysing the dataset hCoV19_Hsap, the H. sapiens k-mers were therefore computed by AlienRemover and saved. The H. sapiens k-mer sets correspond to the 11 files Homo.sapiens.k$k.kmr, where $k is odd and varies between 11 and 31.
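Because every FASTQ block in these datasets carries its taxon as the prefix of its header line, the read counts quoted above can be checked with a short script. The following is a minimal Python sketch, not part of AlienRemover: the function names `count_reads_by_taxon` and `kmer_file_names` are hypothetical, the read names in the in-memory example are invented for illustration, and the code assumes strict 4-line FASTQ blocks as described in the record.

```python
import io
from collections import Counter


def count_reads_by_taxon(fastq_handle, prefixes):
    """Tally FASTQ records by the prefix of their header line.

    Assumes strict 4-line FASTQ blocks, so only every fourth line
    (starting with the first) is treated as a header. This avoids
    mistaking a quality line that begins with '@' for a header.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 != 0:
            continue
        for prefix in prefixes:
            if line.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            # Header matched none of the expected taxon prefixes.
            counts["other"] += 1
    return counts


def kmer_file_names():
    """List the k-mer file names: odd k from 11 to 31 inclusive."""
    return [f"Homo.sapiens.k{k}.kmr" for k in range(11, 32, 2)]


# Tiny in-memory example; the read names after the prefix are invented.
example = io.StringIO(
    "@SARS-CoV-2_read1\nACGT\n+\nIIII\n"
    "@Homo_sapiens_read2\nTTGA\n+\nIIII\n"
    "@SARS-CoV-2_read3\nGGCC\n+\nIIII\n"
)
counts = count_reads_by_taxon(example, ["@SARS-CoV-2", "@Homo_sapiens"])
names = kmer_file_names()

# On the real dataset, the same function could be applied to the gzipped file:
#   import gzip
#   with gzip.open("hCoV19_Hsap.fastq.gz", "rt") as fh:
#       print(count_reads_by_taxon(fh, ["@SARS-CoV-2", "@Homo_sapiens"]))
```

Run against hCoV19_Hsap.fastq.gz, the tallies would be expected to match the 222,945 and 169,459 reads stated in the description. For the paired Athal_PhiX files, the same call with the prefixes @Arabidopsis_thaliana and @PhiX applies to either file.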