Spliced alignment software




















MapNext is a comprehensive tool for both spliced and unspliced alignment of short reads and for automated SNP detection from population sequences. Its simplicity, flexibility and efficiency make it a valuable tool for transcriptomic and population genomic research.

Next-generation sequencing technologies, based on sequencing by synthesis (SBS), are starting to deliver a large number of DNA sequences at a relatively low cost, thus opening new areas of genomic research. To this end, sequencing throughput must be increased dramatically. This may be achieved by carrying out many parallel reactions.

Although the read length is short (down to bp), the overall throughput is enormous: each run produces up to several million reads and billions of base pairs of sequence data. While the promise of next-generation sequencing technologies has become a reality, they also present substantial challenges, such as mapping short sequence reads to the genome, detecting polymorphisms, characterizing allele frequencies from population samples, and managing the data.

So far, most effort has gone into developing methods for unspliced mapping of short sequence reads, and several software tools have been developed. ELAND, the alignment tool integrated in the Illumina-Solexa data processing package, optimizes mapping of very short reads and allows at most two mismatches between the read and the genomic sequence. MAQ is another program for ungapped alignment, with probability models to measure alignment quality [1].

SHRiMP is a package for mapping short reads to highly polymorphic genomes with a statistical method for scoring the alignment [4]. QPalma is specifically designed to align short sequence reads over intron boundaries [7], but it requires training sets of spliced reads with sequencing quality and known alignment information. Moreover, the existing programs usually detect SNPs in a single individual, and cannot be applied to population samples where high-quality SNPs and allele frequencies require characterization.

Here we provide a freely available software tool, MapNext, for both spliced and unspliced alignment of short reads and for automated SNP detection from population sequences. For spliced alignments, no training process is needed.

Additionally, MapNext can convert text-based sequence and quality output, mapping results and SNP results into a more flexible SQL database. Some Perl scripts were written to format data, and an SQL script was used to create tables and load the results into the tables in the database.

The implementation of MapNext1. Many short sequence mapping programs use k-mers and a hash index table to accelerate alignment.

To admit two mismatches, these programs split every read into four fragments and use all six combinations of two fragments as seeds.
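The counting behind this scheme can be sketched in a few lines. This is only an illustration, not code from any of the programs mentioned: the function names `split_read` and `seed_pairs` and the near-equal fragment lengths are assumptions. With at most two mismatches, at most two of the four fragments can be affected, so at least one of the six fragment pairs is guaranteed to match exactly.

```python
from itertools import combinations

def split_read(read, parts=4):
    """Split a read into `parts` contiguous fragments of near-equal length.

    Returns (offset, fragment) tuples. The near-equal split is an assumption
    made for illustration.
    """
    base, extra = divmod(len(read), parts)
    fragments, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        fragments.append((start, read[start:end]))
        start = end
    return fragments

def seed_pairs(read):
    """Return every unordered pair of fragments; each pair is one candidate seed."""
    return list(combinations(split_read(read), 2))

pairs = seed_pairs("ACGT" * 8)  # a 32-base read gives four 8-base fragments
```

Choosing 2 fragments out of 4 gives the six combinations mentioned in the text.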

SHRiMP starts with a rapid k-mer hashing step and then executes a vectorized Smith-Waterman step to score and validate the alignment. MapNext also uses a fast k-mer scan to locate regions of potential homology, but it applies the filtration method described by [9], which guarantees that at least one seed contains no mismatches (presumably because each read is covered by more non-overlapping seeds than the number of mismatches allowed).
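The guarantee of a mismatch-free seed follows from a pigeonhole argument, sketched below. The text does not state how many fragments MapNext uses, so splitting into `max_mismatches + 1` fragments is an assumption, and the function names are hypothetical.

```python
def split_into_seeds(read, max_mismatches):
    """Split a read into max_mismatches + 1 contiguous seeds (assumed scheme)."""
    n_seeds = max_mismatches + 1
    base, extra = divmod(len(read), n_seeds)
    seeds, start = [], 0
    for i in range(n_seeds):
        end = start + base + (1 if i < extra else 0)
        seeds.append((start, read[start:end]))
        start = end
    return seeds

def mismatch_free_seeds(read, target, max_mismatches):
    """Return the seeds of `read` that match `target` exactly at the same offset."""
    return [
        (offset, seed)
        for offset, seed in split_into_seeds(read, max_mismatches)
        if target[offset:offset + len(seed)] == seed
    ]

target = "ACGTACGTACGTACGTACGTACGT"
read = "ACGAACGTACGTACGTACGTACTT"  # differs from target at positions 3 and 22
hits = mismatch_free_seeds(read, target, max_mismatches=2)
```

Each mismatch can fall in only one seed, so two mismatches spoil at most two of the three seeds and the middle seed still matches exactly.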

MapNext processes the set of query reads to build a hash table indexed by seed. For every seed, the table lists the query reads in which it occurs and its location within each of them. MapNext then scans the reference sequence with a sliding window whose size equals the k-mer size, advancing one base at a time.

If the k-mer is a key in the query hash table, the corresponding reads are compared at full length with the region of the reference surrounding the k-mer. MapNext counts the mismatches between each read and the reference sequence. If that count does not exceed the maximum number of mismatches set by the user, it stores the read name, the strand of the read, the number of mismatches and the position in the reference sequence.
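The index-then-scan procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not MapNext's implementation: seeds are taken at `max_mismatches + 1` non-overlapping offsets, only the forward strand is handled, and all function names are hypothetical.

```python
from collections import defaultdict

def build_seed_index(reads, k, max_mismatches):
    """Map each k-mer seed to the (read_name, offset_in_read) pairs where it occurs.

    Assumption: seeds are taken at max_mismatches + 1 non-overlapping offsets,
    so reads shorter than k * (max_mismatches + 1) get fewer seeds.
    """
    index = defaultdict(list)
    for name, seq in reads.items():
        for i in range(max_mismatches + 1):
            offset = i * k
            if offset + k <= len(seq):
                index[seq[offset:offset + k]].append((name, offset))
    return index

def count_mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def map_reads(reference, reads, k, max_mismatches):
    """Slide a k-base window over the reference one base at a time and verify seed hits."""
    index = build_seed_index(reads, k, max_mismatches)
    hits = set()  # a set, because several seeds of one read can find the same placement
    for pos in range(len(reference) - k + 1):
        for name, offset in index.get(reference[pos:pos + k], ()):
            read = reads[name]
            start = pos - offset  # where the full read would begin in the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            mismatches = count_mismatches(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches:
                # forward strand only in this sketch
                hits.add((name, "+", mismatches, start))
    return sorted(hits)
```

For example, `map_reads("TTTTACGTTGCAAGCTGGGG", {"r1": "ACGTTGCATGCT"}, k=4, max_mismatches=2)` reports the single placement of `r1` at 0-based position 4 with one mismatch. A full mapper would also index or scan the reverse complement to cover the other strand.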

Once the reads have been mapped onto the reference sequences, their locations can be used to group reads into clusters based on their overlaps in the reference. When a read has multiple hits, the program reports one of them at random. The output is tab-separated and contains the reference and query sequence names, the start and end positions of the alignment in the reference sequence, the strand, and the number of mismatches per read.
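The reporting step might look like the sketch below. The column order follows the description above, but the input structure, the function name `report_hits` and the coordinate convention are assumptions made for illustration.

```python
import random

def report_hits(hits_by_read, rng=None):
    """Format one randomly chosen hit per read as a tab-separated line.

    hits_by_read maps a read name to a list of
    (reference_name, start, end, strand, mismatches) tuples (assumed layout).
    """
    rng = rng or random.Random()
    lines = []
    for read_name in sorted(hits_by_read):
        ref_name, start, end, strand, mismatches = rng.choice(hits_by_read[read_name])
        fields = (ref_name, read_name, start, end, strand, mismatches)
        lines.append("\t".join(str(field) for field in fields))
    return lines
```

Passing a seeded `random.Random` instance makes the choice among multiple hits reproducible, which helps when comparing runs.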

QPalma first needs a dataset of known splice sites and a splice site prediction model. It extends Smith-Waterman alignment to take quality scores, splice sites and intron length into account.

If the P option is specified with a non-zero value, sim4cc will remove any 3'-end poly-A tails that it detects in the alignment. Occasionally, sim4cc may miss an internal exon when surrounded by very large introns, typically longer than Kb. When this is suspected, the H option can be used to reset the exons' weight to compensate for the intron gap penalty.

Ambiguity codes are by default allowed in sequence data, but sim4cc treats them non-differentially. When seqfile2 contains a collection of sequences, the first entry in the file will be used to determine the type of this and all subsequent comparisons. In the description below, the term MSP denotes a maximal segment pair, that is, a pair of highly similar fragments in the two sequences, obtained during the blast-like procedure by extending a spaced seed hit by matches and perhaps a few mismatches.

OPTIONS The algorithm parameters included in the first two sections below have already been tuned and do not normally require adjustment by the user.

Alignment parameters:

Z Sets the spaced seed pattern used to identify approximate matches in the first stage of the algorithm. The default seed pattern was optimized for cDNA-to-genomic sequence alignment and for a large number of species comparisons, but can be reset by the user if desired. In that case, a seed of weight 12 or 11, counting 1 for each 1 in the pattern and 0.

X Controls the limits for terminating word extensions in the blast-like stage of the algorithm.

The default value is If this option is not specified, the threshold is computed from the lengths of the sequences, using statistical criteria. For example, a good value for genomic sequences in the range of a few hundred Kb is To avoid spurious matches, however, a larger value may be needed for longer sequences.

C Sets the threshold for the MSP scores when aligning the as-yet-unmatched fragments, during the second stage of the algorithm. By default, the smaller of the constant 12 and a statistics-based threshold is chosen.

We noted that the annotation-based TopHat2 protocol uses the annotation provided to set the XS tag for unspliced alignments that overlap annotated exons.

As this is a unique feature of TopHat2 that might confer an advantage in the evaluation of transcript reconstruction, we investigated the effect of removing the XS tag from unspliced alignments in the TopHat2 output before running Cufflinks. This modification had a negligible effect on the Cufflinks accuracy metrics presented here (data not shown), demonstrating that provision of XS tags for unspliced alignments cannot explain why the annotation-based TopHat2 protocol resulted in better Cufflinks performance than other protocols.
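One way such a filter could be written is sketched below. It is not the authors' script: the function name is hypothetical, and an alignment is treated as unspliced when its CIGAR string contains no `N` (skipped region) operation, which is an assumption about how the distinction was made. Only the strand tag `XS:A:` is removed.

```python
def strip_xs_from_unspliced(sam_line):
    """Remove the XS:A strand tag from a SAM record whose CIGAR has no N operation.

    Header lines and spliced records are returned unchanged.
    """
    if sam_line.startswith("@"):
        return sam_line  # header line
    newline = "\n" if sam_line.endswith("\n") else ""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return sam_line  # not a complete alignment record
    if "N" in fields[5]:
        return sam_line  # CIGAR contains an intron gap, so the read is spliced
    # The first 11 columns are mandatory; optional tags follow.
    kept_tags = [tag for tag in fields[11:] if not tag.startswith("XS:A:")]
    return "\t".join(fields[:11] + kept_tags) + newline
```

Applied line by line to SAM text, this leaves spliced alignments untouched, so junction strand information is still available downstream.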

For K data, exon precision was defined as the fraction of predicted exons matching GENCODE annotation, and exon recall as the fraction of annotated exons that were predicted. Only exons from protein-coding genes were considered when computing recall, as some noncoding RNA classes are likely to be underrepresented in the RNA-seq libraries.

Results on simulated data were benchmarked against simulated gene models, using analogous definitions of precision and recall, such that exon precision measures the proportion of predicted exons matching an exon in the simulated transcriptome, and transcript precision is the fraction of predicted spliced transcripts matching a simulated spliced transcript.
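The precision and recall definitions above reduce to set operations once features are given comparable keys. The sketch below assumes exons are keyed as `(chromosome, start, end, strand)` tuples and spliced transcripts by strand plus their chain of exon junctions. Both keyings and the function names are illustrative assumptions, not the evaluation code used in the study.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of hashable feature keys."""
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall

def transcript_key(chrom, strand, exons):
    """Key a spliced transcript by chromosome, strand and junction chain.

    `exons` is a list of (start, end) pairs; a junction joins the end of one
    exon to the start of the next.
    """
    ordered = sorted(exons)
    junctions = tuple(
        (ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)
    )
    return (chrom, strand, junctions)

predicted_exons = {
    ("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"),
    ("chr1", 500, 600, "+"), ("chr2", 10, 90, "-"),
}
annotated_exons = {
    ("chr1", 100, 200, "+"), ("chr1", 300, 400, "+"), ("chr1", 500, 600, "+"),
    ("chr1", 700, 800, "+"), ("chr3", 5, 50, "-"), ("chr3", 60, 120, "-"),
}
exon_precision, exon_recall = precision_recall(predicted_exons, annotated_exons)
```

Here three of the four predicted exons match and three of the six annotated exons are recovered. Keying transcripts by junction chain means that differences in the outer transcript ends do not prevent a match.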

To stratify recall by expression, we divided simulated transcripts into three groups of equal size according to expression level Fig. For the simulated data, only exons of spliced transcripts were required to be placed on the correct strand, as the orientation of single-exon transcripts cannot be reliably predicted unless RNA-seq libraries are strand specific.
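Dividing transcripts into three equal-sized expression groups can be sketched as below. The group labels, the handling of remainders when the count is not divisible by three, and the function name are assumptions made for illustration.

```python
def expression_tertiles(expression):
    """Split transcript IDs into three near-equal groups by ascending expression.

    `expression` maps a transcript ID to its expression level.
    """
    ranked = sorted(expression, key=expression.get)
    n = len(ranked)
    cuts = [0, n // 3, 2 * n // 3, n]
    labels = ("low", "medium", "high")
    return {label: ranked[cuts[i]:cuts[i + 1]] for i, label in enumerate(labels)}

groups = expression_tertiles(
    {"t1": 0.5, "t2": 12.0, "t3": 3.0, "t4": 150.0, "t5": 40.0, "t6": 1.0}
)
```

Recall can then be computed separately for the transcripts in each group.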

Spliced transcripts were considered to be correctly assembled if the strand and all exon junctions matched. Consortium members provided alignments for evaluation.

Nature Methods. Published online Nov 3. Received Mar 31; Accepted Sep

Abstract

High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale.

Subject terms: Genome informatics.

Main

Programs for aligning transcript reads to a reference genome address the challenging task of placing spliced reads across introns and correctly determining exon-intron boundaries.

Results

Alignment protocols were evaluated on Illumina nucleotide nt paired-end RNA-seq data from the human leukemia cell line K 1.

Figure 1. Alignment yield.

Alignment yield

There were major differences among protocols in the alignment yield.

Figure 2. Mismatch and truncation frequencies.

Figure 3. Read placement accuracy for simulated spliced reads.

Figure 4. Indel frequency and accuracy.

Positioning of mismatches and gaps in reads

We determined the spatial distribution of mismatches, indels and introns over read sequences Supplementary Fig.

Coverage of annotated genes

We assessed how RNA-seq reads were placed in relation to annotated gene structures from the Ensembl database (Supplementary Note).

Spliced alignment

In assessing spliced-alignment performance, we distinguish between detection of splices in individual reads and detection of unique splice junctions on the genomic sequence.

Figure 5. Spliced alignment performance.

Influence of aligners on transcript reconstruction

To assess the impact of alignment methodology on exon discovery and transcript reconstruction, we applied the transcript assembly program Cufflinks to the alignments.

Figure 6. Aligner influence on transcript assembly.

Competing interests

The authors declare no competing financial interests.

Peter Kosarev, Vladimir Molodtsov, Igor Seledtsov (Softberry Inc.)

We thank G. Pertea and L. We also thank C. Trapnell for the use of his TuxSim simulation program. All authors read and approved the final manuscript.

Since a pair consists of a left and a right read, the type of a pair is determined by the more difficult of its two read types. The plot on the left shows the alignment speed of the programs as the number of pairs processed per second. The plot on the right shows alignment sensitivity. Pairs are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped.


