Improving contig binning of metagenomic data using d2S oligonucleotide frequency dissimilarity

Main Authors: Wang Ying, Wang Kun, LU Yang Young, Sun Fengzhu
Format: info software
Published: 2018
Subjects:
Online Access: https://zenodo.org/record/1217176
Table of Contents:
  • d2SBin is an easy-to-use tool for improving contig binning: it adjusts the assignment of contigs among bins, starting from the output of any existing binning tool. The tool is taxonomy-free and relies only on the k-tuples of a single metagenomic sample.

    d2SBin is based on the observation that relative sequence compositions are similar across different regions of the same genome but differ between genomes. Current tools generally use the normalized k-tuple frequencies directly, which capture the absolute rather than the relative sequence composition. We therefore model the relative sequence composition and measure the dissimilarity between contigs with d2S. We applied d2SBin to adjust the outputs of five widely used contig-binning tools on six datasets. The experiments showed that d2SBin can improve contig-binning performance significantly.

    The input of d2SBin is the output of an existing contig-binning tool. That output comes in one of the following two formats.

    d2SBin_input_format1: .fasta files containing the contig sequences of each bin, such as the outputs of MaxBin, MetaWatt and SCIMM. These tools write one fasta file per bin, and each file holds the IDs and sequences of the contigs clustered in that bin. For example, the outputs of MaxBin are MaxBin.out.001.fasta...MaxBin.out.00X.fasta, where X is the number of bins found by MaxBin. MaxBin.out.001.fasta looks as follows:

      >contig-1.0
      GACACTTTTAGTGGGCGTAAACTTCATCTAGTGGATCT
      >contig-1.2
      CCATGTCAGAAGAAGTTGGTAATCGCCACATTAATTGTTTGTCGTTTGATCGA
      ...

    d2SBin_input_format2: .fa files containing only the contig names of each bin, such as the output of MetaCluster. It writes one .fa file per bin, and each file lists only the IDs of the contigs in that bin, so the original fasta file with the sequences of all contigs is also required. For example, the outputs of MetaCluster are MetaCluster.out.001.fa ... MetaCluster.out.00Y.fa, where Y is the number of bins found by MetaCluster. MetaCluster.out.001.fa looks as follows:

      >contig-1.0
      >contig-1.1
      ...

    The original file with all contigs and their sequences, contigs.fasta, looks as follows:

      >contig-1.0
      GACACTTTTAGTGGGCGTAAACTTCATCTAGTGGATCT
      >contig-1.1
      TGGTAATCGCCACATTAAAGAAGTTGGTAA
      >contig-1.2
      CCATGTCAGAAGAAGTTGGTAATCGCCACATTAATTGTTTGTCGTTTGATCGA
      ...

    This repository contains the input data, the source code, and a detailed description of how to run the tool.
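The two input layouts described above can be read into one common structure, a mapping from bin file name to {contig ID: sequence}. The following is only an illustrative sketch and is not taken from the d2SBin source code: the function names (parse_fasta, read_fasta, load_bins_format1, fill_sequences, load_bins_format2) are hypothetical, and it assumes that a contig ID is the first whitespace-delimited token after ">".

```python
from pathlib import Path


def parse_fasta(lines):
    """Parse FASTA text lines into {contig_id: sequence}.

    Records that consist of a header only (format 2 bin files)
    are kept with an empty sequence.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            current = header[0] if header else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped; join them.
            records[current] += line
    return records


def read_fasta(path):
    """Read one FASTA file from disk."""
    with open(path) as handle:
        return parse_fasta(handle)


def load_bins_format1(bin_paths):
    """Format 1: every bin file already carries IDs and sequences."""
    return {Path(p).name: read_fasta(p) for p in bin_paths}


def fill_sequences(bin_ids, all_contigs):
    """Look up the sequence of every contig ID listed in a bin."""
    return {cid: all_contigs[cid] for cid in bin_ids}


def load_bins_format2(bin_paths, contigs_path):
    """Format 2: bin files list IDs only; sequences come from contigs_path."""
    all_contigs = read_fasta(contigs_path)
    return {
        Path(p).name: fill_sequences(read_fasta(p), all_contigs)
        for p in bin_paths
    }
```

With this layout, a call such as load_bins_format2(["MetaCluster.out.001.fa"], "contigs.fasta") would return the same kind of mapping as load_bins_format1, so later steps do not need to know which binning tool produced the bins.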
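The idea of measuring dissimilarity on relative rather than absolute composition can be sketched as follows. This is a minimal illustration, not the d2SBin implementation: it uses the d2S form commonly given in the alignment-free sequence comparison literature, where each k-tuple count is centered by its expected count under a background model before the two contigs are compared. The record does not state the background model or the tuple length used by d2SBin, so the order-0 (independent bases) background and the default k=4 below are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"


def kmer_counts(seq, k):
    """Count k-tuples made only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(BASES):
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Observed count minus expected count for every possible k-tuple.

    Assumption: the expected count comes from an order-0 background
    estimated from the base frequencies of the contig itself.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    base = Counter(c for c in seq.upper() if c in BASES)
    n_bases = sum(base.values())
    if n_bases == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    freq = {b: base[b] / n_bases for b in BASES}
    centered = {}
    for letters in product(BASES, repeat=k):
        word = "".join(letters)
        expected = total
        for b in word:
            expected *= freq[b]
        centered[word] = counts[word] - expected
    return centered


def d2s_dissimilarity(seq_x, seq_y, k=4):
    """d2S dissimilarity in [0, 1]; 0 means identical relative composition."""
    x = centered_counts(seq_x, k)
    y = centered_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        denom = sqrt(x[word] ** 2 + y[word] ** 2)
        if denom == 0:
            continue  # this k-tuple carries no signal in either contig
        num += x[word] * y[word] / denom
        norm_x += x[word] ** 2 / denom
        norm_y += y[word] ** 2 / denom
    if norm_x == 0 or norm_y == 0:
        return 0.5  # no signal at all: treat as uncorrelated
    return 0.5 * (1 - num / (sqrt(norm_x) * sqrt(norm_y)))
```

A bin-adjustment step could then compare each contig with the contigs of every bin and move it to the closest one; how d2SBin actually performs that adjustment is described in the repository, not in this sketch.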