Evaluation method
Evaluation method for Illumina reads
Target error format (TEF)
We use a format called target error format (TEF) to perform the analysis for Illumina reads. TEF represents the errors in a read as below:
readid n-errors [pos tb wb ind]+
The fields in the above format are described below:

| Field | Description |
|---|---|
| readid | ID of the corrected read. |
| n-errors | Integer. Number of errors corrected in the read. |
| pos | Position of the fix (0 <= pos < length of the read). |
| tb | True value of the base at pos. One of {0,1,2,3,4,5}, where 0 = 'A', 1 = 'C', 2 = 'G', 3 = 'T', 5 = '-'. |
| wb | Wrong value of the base at pos; wb should be the current base in the read. Encoded the same way as tb. |
| ind | Type of error. One of {0,1,2}: 0 = substitution (bad char in the read at pos), 1 = deletion (missing char in the read after pos), 2 = insertion (extra char in the read at pos). |
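A TEF record is plain text and easy to parse; the sketch below (illustrative only, not part of the toolkit) maps the fields to the table above:

```python
# Parse one TEF record: readid n-errors [pos tb wb ind]+
# Base codes per the table: 0='A', 1='C', 2='G', 3='T', 5='-'
BASES = {0: 'A', 1: 'C', 2: 'G', 3: 'T', 5: '-'}

def parse_tef_line(line):
    fields = line.split()
    readid, n_errors = fields[0], int(fields[1])
    errors = []
    for i in range(n_errors):
        pos, tb, wb, ind = (int(x) for x in fields[2 + 4*i : 6 + 4*i])
        errors.append({'pos': pos, 'true': BASES.get(tb, '?'),
                       'wrong': BASES.get(wb, '?'), 'type': ind})
    return readid, errors

# Example: read 17 has one substitution at position 4 (true base G, read as T)
rid, errs = parse_tef_line("17 1 4 2 3 0")
```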
Align uncorrected reads with BWA
- Before performing alignment, the reads are pre-processed (Section Pre-processing Data).
- Uncorrected reads are aligned with BWA, which generates alignments in SAM format. The script sam-analysis.py (Section SAM to TEF Conversion) converts the alignments from SAM to TEF.
Correct reads
- The error correction program is run on the uncorrected reads.
- The output of the error correction program is converted to the target error format (TEF) using the script corresponding to that program.
| Program | Script |
|---|---|
| Coral | coral-analy.pl |
| HiTEC | hitec-analy.pl |
| Quake | quake-analy.py |
| ECHO | quake-analy.py |
| Reptile | (generates output in TEF directly) |

The usage of these scripts is described in the Section Conversion to TEF.
Measure Gain
- The target error format files generated in the previous two sections are compared using Comp2PCAlign (Section Comp2PCAlign), which measures Gain and Sensitivity.
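For orientation, gain and sensitivity are conventionally defined in the error-correction literature from true positives (TP, errors correctly removed), false positives (FP, errors newly introduced), and false negatives (FN, errors left uncorrected). A minimal sketch of that conventional computation (this is an assumption about the exact quantities reported; see the paper for the authoritative definitions):

```python
def gain_and_sensitivity(tp, fp, fn):
    """tp: errors removed, fp: errors introduced, fn: errors missed."""
    gain = (tp - fp) / (tp + fn)      # can be negative if fp > tp
    sensitivity = tp / (tp + fn)
    return gain, sensitivity

# Hypothetical counts for illustration
g, s = gain_and_sensitivity(tp=90, fp=10, fn=10)
# g = 0.8, s = 0.9
```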
Evaluation method for 454/Ion Torrent Reads
Align uncorrected reads with Mosaik/TMAP
- Before performing alignment, the reads are pre-processed as given in Section Pre-processing Data.
- For 454 reads, alignments are performed using Mosaik. For Ion Torrent reads, TMAP is used for alignment. Alignments are converted to SAM format for further processing.
Measure Gain
- The script compute-stats.py (Section Corrected 454/Ion Torrent Reads Analysis) measures gain for 454/Ion Torrent reads using the method explained in the paper.
Tools
Requirements
Tools used for evaluation have the following dependencies:
- GCC C++ compiler v. 4.3
- Perl v. 5
- Python v. 2.7.2
- MPI
- mpi4py python package
Pre-processing Data
- The Short Read Archive stores reads in `.sra` format, which can be converted to FASTQ using the `fastq-dump` tool available with the NCBI SRA toolkit.
- We use FASTA format whenever possible. To convert FASTQ to FASTA, the script `fastq-converter-v2.0.pl` is used as follows:
$ fastq-converter-v2.0.pl fastq/ fasta/ 1
Here, the `fastq/` directory contains all the FASTQ files of the data-set, and output is generated in the `fasta/` directory. The last argument `1` is supplied to ignore reads with `N` characters. `fastq-converter-v2.0.pl` generates a unique id for each read and inserts it in the FASTA header; the read id is unique across all the files. These identifiers are used to compare the reads pre- and post-correction. The order in which `fastq-converter-v2.0.pl` performs the conversion is saved, and the identifiers are assigned to the reads in this order, from 0 to n-1, where n is the total number of reads. For example, for the SRR001665 data-set, `fastq-converter-v2.0.pl` processes the FASTQ files in the following order:
SRR001665_2.fastq SRR001665.fastq SRR001665_1.fastq
Here, the first read in SRR001665_2.fastq gets the id 0, and the last read in SRR001665_1.fastq gets the id n-1.
- Some of the error correction methods only accept FASTQ. To ensure the read ids are the same for all the error correction methods, we use merge-fastq.py to add identifiers to FASTQ files. The script merges all FASTQ files in a given input list, so that the ids in the FASTQ can be compared pre- and post-correction. merge-fastq.py is run as follows:
$ merge-fastq.py --list=lst --outfile=all.fastq
where `lst` is a newline-delimited file containing the paths to the FASTQ files. For example, for the SRR001665 data-set the `--list` argument is the path to a file with the following contents:
SRR001665_2.fastq
SRR001665.fastq
SRR001665_1.fastq
For the comparison to make sense, the order should be the same as the order in which the FASTQ converter (`fastq-converter-v2.0.pl`) performed the conversion.
- While pre-processing, we ignore reads with invalid characters because some error correction programs cannot handle them.
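The sequential id assignment described in this section can be sketched in a few lines of Python (assuming standard four-line FASTQ records; the actual conversion scripts may differ in details such as header formatting):

```python
def assign_ids(fastq_paths):
    """Yield (read_id, sequence) with ids 0..n-1 across files,
    in the order the files are listed -- mirroring how pre-processing
    numbers reads across a data-set."""
    next_id = 0
    for path in fastq_paths:
        with open(path) as fh:
            for lineno, line in enumerate(fh):
                if lineno % 4 == 1:   # sequence line of a 4-line FASTQ record
                    yield next_id, line.strip()
                    next_id += 1
```

Processing files in a different order would assign different ids, which is why the list order must match the original conversion order.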
Conversion to TEF
Scripts for converting output to TEF are used as follows:
- coral-analy.pl converts a Coral-corrected FASTA file to TEF as below:
$ coral-analy.pl corrected.fa all.fa coral-output.er > coral_conv.log
In the above example, corrected.fa is the corrected FASTA file, all.fa is the uncorrected FASTA file, and coral-output.er is the output in TEF.
- The conversion program for both Quake and ECHO is quake-analy.py. It is run as below:
$ quake-analy.py -f all.fastq -c corrected.fastq -o echo-output.er -t echo-trim > missing.log
Here, `all.fastq` is the input file, `corrected.fastq` is the ECHO/Quake-corrected FASTQ, `echo-output.er` is the output in TEF, and `echo-trim` is the list of reads with trimmed regions (which are ignored).
- Output from HiTEC is converted to TEF as below:
$ hitec-analy.pl corrected.fa all.fa hitec-output.er
Again, `all.fa` is the uncorrected FASTA, `corrected.fa` is the corrected FASTA, and `hitec-output.er` is the output in TEF.
All these scripts rely on the identifiers added to the FASTA/FASTQ headers in the pre-processing step (Section Pre-processing Data).
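For equal-length reads with substitution-only corrections, what these scripts compute amounts to diffing each corrected read against its original and encoding the differences; the sketch below works under that simplifying assumption (the actual scripts also handle indels and trimming):

```python
# Base encoding per the TEF field table
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3, '-': 5}

def tef_record(readid, original, corrected):
    """Emit a TEF line for substitution-only corrections.

    tb is the corrected (true) base, wb the original (wrong) base;
    ind 0 marks a substitution."""
    errs = []
    for pos, (wb, tb) in enumerate(zip(original, corrected)):
        if wb != tb:
            errs.extend([pos, CODE[tb], CODE[wb], 0])
    return " ".join(str(x) for x in [readid, len(errs) // 4] + errs)

print(tef_record(17, "ACTT", "ACGT"))  # -> "17 1 2 2 3 0"
```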
SAM to TEF Conversion
Alignments in a SAM file are converted to a TEF file using the script `sam-analysis.py`:
sam-analysis.py --file=/path/to/sam-file-input \
    --outfile=/path/to/err-output \
    --ambig=/path/to/ambig-output \
    --unmapped=/path/to/unmapped-output \
    --trim=/path/to/trim-file-output \
    [--genomeFile=/path/to/genome-file] \
    [--dry (for dry run; no output generated)]
The `--outfile` option gives the path of an output file with write access. Ambiguous reads are written to the file given by the `--ambig` option, and unmapped reads to the file given by the `--unmapped` option; the unmapped and ambiguous outputs may be the same file. The `--trim` option gives the file to which trimmed positions are written (ranges are allowed).
Here, the genome file is optional; it is used if the MD string is not available. If a genome file is given, it will be loaded into memory completely. The script does not handle genomes with multiple chromosomes.
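For orientation, the MD string encodes matches and mismatches against the reference, which is why the genome is only needed when it is absent. A minimal sketch of recovering substitution positions from an MD string (illustrative only; it assumes the CIGAR contains no insertions or clipping, and it skips deletions marked with '^'):

```python
import re

def md_substitutions(md):
    """Return (read_pos, reference_base) pairs from a SAM MD string.

    Sketch only: read positions are valid when the CIGAR has no
    insertions or soft clips; deletions ('^...') consume no read bases."""
    pos, subs = 0, []
    for num, bases in re.findall(r"(\d+)(\^[ACGTN]+|[ACGTN])?", md):
        pos += int(num)                 # run of matching bases
        if bases and not bases.startswith("^"):
            subs.append((pos, bases))   # reference base at this read position
            pos += 1
    return subs

md_substitutions("10A5")  # one substitution at read position 10, reference base 'A'
```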
Comp2PCAlign
Comp2PCAlign measures the Gain and Sensitivity from the outputs generated in the previous two sub-sections. Usage is as below:
$ comp2pcalign [correction-rslt] [pre-correct-aln-rslt] [unmapped-pre-correct-aln] [m-value] [fpfn-rslt] [optional trimmed-file]
It takes six arguments, given in the following order:
1. Correction result converted to TEF.
2. Alignment SAM converted to TEF.
3. File with the list of unmapped reads.
4. Edit distance used for alignment.
5. Output file with write access, to which the statistics are written.
6. [Optional] List of reads with trimmed regions.
(1) is generated from the error correction output as described in Section Conversion to TEF. (2), (3), (4) and (6) are obtained from the alignment as described in Section SAM to TEF Conversion. (3) is a concatenation of both the unmapped and ambiguous reads.
Corrected 454/Ion Torrent Reads Analysis
The procedure to analyse 454/Ion Torrent reads is given in the paper. `compute-stats.py` is the script implementing this procedure. It is used as below:
compute-stats.py --aln=/path/to/pre-correction-alignment-sam-file \
    --corrected=/path/to/corrected-reads-fa-file \
    --outfile=/path/to/stats-output (write access reqd.) \
    --records=<number of reads> \
    [--genomeFile=/path/to/genome-file] \
    [--band=<value of k used for k-band alignment (default 5)>]

or, equivalently, with short options:

compute-stats.py -a /path/to/pre-correction-alignment-sam-file \
    -c /path/to/corrected-reads-fa-file \
    -o /path/to/stats-output-file (write access reqd.) \
    -r <number of reads> \
    [-g /path/to/genome-file] \
    [-b <value of k used for k-band alignment (default 5)>]
The script accepts only FASTA files. It requires that the FASTA is pre-processed as given in Section Pre-processing Data, because it exploits the sorted identifiers to process the SAM and FASTA files in a single pass.
The `--band` option gives the band size used for k-band alignment. The genome file is optional; it is used if the MD string is not available. If a genome file is given, it will be loaded into memory completely. The script does not handle genomes with multiple chromosomes.
`compute-stats.py` requires MPI and the mpi4py package.
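The `--band` option refers to k-band alignment, which restricts the dynamic-programming matrix to a diagonal band of width 2k+1; this suffices when the two sequences differ by at most k edits and makes the alignment linear in read length for fixed k. A sketch of the technique (an illustration only, not the code in `compute-stats.py`):

```python
def kband_edit_distance(a, b, k=5):
    """Edit distance computed only within a band of +/-k around the diagonal.

    Returns None when the distance exceeds the band, in which case a wider
    band (larger k) would be needed."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None                      # lengths alone force > k edits
    INF = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}   # row 0: j insertions
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i               # delete the first i chars of a
                continue
            best = INF
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if j - 1 in prev:
                best = min(best, prev[j - 1] + cost)  # match / substitute
            if j in prev:
                best = min(best, prev[j] + 1)          # delete from a
            if j - 1 in cur:
                best = min(best, cur[j - 1] + 1)       # insert into a
            cur[j] = best
        prev = cur
    d = prev.get(m, INF)
    return None if d > k else d

kband_edit_distance("ACGT", "AGT", k=2)  # edit distance 1 (one deletion)
```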
Download
Source code for the tools is available from here.