Evaluation method
Evaluation method for Illumina reads
Target error format (TEF)
We use a format called target error format (TEF) to perform the analysis for Illumina reads. TEF represents the errors in a read as below:
readid n-errors [pos tb wb ind]+
The fields in the above format are described below:

| Field | Description |
|---|---|
| readid | ID of the corrected read. |
| n-errors | Integer. Number of errors corrected in the read. |
| pos | Position of the fix (0 <= pos < length of the read). |
| tb | True value of the base at pos. One of {0,1,2,3,4,5}, where 0 = 'A', 1 = 'C', 2 = 'G', 3 = 'T', 5 = '-'. |
| wb | Wrong value of the base at pos; wb should be the current base in the read. Encoded the same way as tb. |
| ind | Type of error. One of {0,1,2}: 0 = substitution (bad char in the read at pos), 1 = deletion (missing char in the read after pos), 2 = insertion (extra char in the read at pos). |
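A TEF record is plain text and easy to parse; the sketch below (illustrative only, not part of the toolkit) maps the fields to the table above:

```python
# Parse one TEF record: readid n-errors [pos tb wb ind]+
# Base codes per the table: 0='A', 1='C', 2='G', 3='T', 5='-'
BASES = {0: 'A', 1: 'C', 2: 'G', 3: 'T', 5: '-'}

def parse_tef_line(line):
    fields = line.split()
    readid, n_errors = fields[0], int(fields[1])
    errors = []
    for i in range(n_errors):
        pos, tb, wb, ind = (int(x) for x in fields[2 + 4*i : 6 + 4*i])
        errors.append({'pos': pos, 'true': BASES.get(tb, '?'),
                       'wrong': BASES.get(wb, '?'), 'type': ind})
    return readid, errors

# Example: read 17 has one substitution at position 4 (true base G, read as T)
rid, errs = parse_tef_line("17 1 4 2 3 0")
```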
Align uncorrected reads with BWA
- Before performing alignment, the reads are pre-processed (Section Pre-processing Data).
- Uncorrected reads are aligned with BWA, which generates alignments in SAM format. The script sam-analysis.py (Section SAM to TEF Conversion) converts the alignments from SAM to TEF.
Correct reads
- The error correction program is run on the uncorrected reads.
- The output of the error correction program is converted to the target error format (TEF) using the script corresponding to that program.
| Program | Script |
|---|---|
| Coral | coral-analy.pl |
| HiTEC | hitec-analy.pl |
| Quake | quake-analy.py |
| ECHO | quake-analy.py |
| Reptile | (generates output in TEF directly) |

The usage of these scripts is described in the Section Conversion to TEF.
Measure Gain
- The target error format files generated in the previous two sections are compared using Comp2PCAlign (Section Comp2PCAlign), which measures Gain and Sensitivity.
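For orientation, gain and sensitivity are conventionally defined in the error-correction literature from true positives (TP, errors correctly removed), false positives (FP, errors newly introduced), and false negatives (FN, errors left uncorrected). A minimal sketch of that conventional computation (this is an assumption about the exact quantities reported; see the paper for the authoritative definitions):

```python
def gain_and_sensitivity(tp, fp, fn):
    """tp: errors removed, fp: errors introduced, fn: errors missed."""
    gain = (tp - fp) / (tp + fn)      # can be negative if fp > tp
    sensitivity = tp / (tp + fn)
    return gain, sensitivity

# Hypothetical counts for illustration
g, s = gain_and_sensitivity(tp=90, fp=10, fn=10)
# g = 0.8, s = 0.9
```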
Evaluation method for 454/Ion Torrent Reads
Align uncorrected reads with Mosaik/TMAP
- Before performing alignment, the reads are pre-processed as given in Section Pre-processing Data.
- For 454 reads, alignments are performed using Mosaik. For Ion Torrent reads, TMAP is used for alignment. Alignments are converted to SAM format for further processing.
Measure Gain
- The script compute-stats.py (Section Corrected 454/Ion Torrent Reads Analysis) measures gain for 454/Ion Torrent reads using the method explained in the paper.
Tools
Requirements
Tools used for evaluation have the following dependencies:
- GCC C++ compiler v. 4.3
- Perl v. 5
- Python v. 2.7.2
- MPI
- mpi4py python package
Pre-processing Data
- The Short Read Archive stores reads in `.sra` format, which can be converted to FASTQ using the `fastq-dump` tool available with the NCBI SRA toolkit.
- We use FASTA format whenever possible. To convert FASTQ to FASTA, the script `fastq-converter-v2.0.pl` is used as follows:
$ fastq-converter-v2.0.pl fastq/ fasta/ 1
Here, the `fastq/` directory contains all the FASTQ files of the data-set, and output is generated in the `fasta/` directory. The last argument `1` is supplied to ignore reads with `N` characters. `fastq-converter-v2.0.pl` generates a unique id for each read and inserts it in the FASTA header; the read id is unique across all the files. These identifiers are used to compare the reads pre- and post-correction. The order in which `fastq-converter-v2.0.pl` performs the conversion is saved, and the identifiers are assigned to the reads in this order, from 0 to n-1, where n is the total number of reads. For example, for the SRR001665 data-set, `fastq-converter-v2.0.pl` processes the FASTQ files in the following order:
SRR001665_2.fastq SRR001665.fastq SRR001665_1.fastq
Here, the first read in SRR001665_2.fastq gets the id 0, and the last read in SRR001665_1.fastq gets the id n-1.
- Some of the error correction methods only accept FASTQ. To ensure the read ids are the same for all the error correction methods, we use merge-fastq.py to add identifiers to FASTQ files. The script merges all FASTQ files in a given input list, so that the ids in the FASTQ can be compared pre- and post-correction. merge-fastq.py is run as follows:
$ merge-fastq.py --list=lst --outfile=all.fastq
where `lst` is a newline-delimited file containing the paths to the FASTQ files. For example, for the SRR001665 data-set the `--list` argument is the path to a file with the following contents:
SRR001665_2.fastq
SRR001665.fastq
SRR001665_1.fastq
For the comparison to make sense, the order should be the same as the order in which the FASTQ converter (`fastq-converter-v2.0.pl`) performed the conversion.
- While pre-processing, we ignore reads with invalid characters because some error correction programs cannot handle them.
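The sequential id assignment described in this section can be sketched in a few lines of Python (assuming standard four-line FASTQ records; the actual conversion scripts may differ in details such as header formatting):

```python
def assign_ids(fastq_paths):
    """Yield (read_id, sequence) with ids 0..n-1 across files,
    in the order the files are listed -- mirroring how pre-processing
    numbers reads across a data-set."""
    next_id = 0
    for path in fastq_paths:
        with open(path) as fh:
            for lineno, line in enumerate(fh):
                if lineno % 4 == 1:   # sequence line of a 4-line FASTQ record
                    yield next_id, line.strip()
                    next_id += 1
```

Processing files in a different order would assign different ids, which is why the list order must match the original conversion order.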
Conversion to TEF
Scripts for converting output to TEF are used as follows:
- coral-analy.pl converts a Coral-corrected FASTA file to TEF as below:
$ coral-analy.pl corrected.fa all.fa coral-output.er > coral_conv.log
In the above example, corrected.fa is the corrected FASTA file, all.fa is the uncorrected FASTA file, and coral-output.er is the output in TEF.
- The conversion program for both Quake and ECHO is quake-analy.py. It is run as below:
$ quake-analy.py -f all.fastq -c corrected.fastq -o echo-output.er -t echo-trim > missing.log
Here, `all.fastq` is the input file, `corrected.fastq` is the ECHO/Quake-corrected FASTQ, `echo-output.er` is the output in TEF, and `echo-trim` is the list of reads with trimmed regions (which are ignored).
- Output from HiTEC is converted to TEF as below:
$ hitec-analy.pl corrected.fa all.fa hitec-output.er
Again, `all.fa` is the uncorrected FASTA, `corrected.fa` is the corrected FASTA, and `hitec-output.er` is the output in TEF.
All these scripts rely on the identifiers added to the FASTA/FASTQ headers in the pre-processing step (Section Pre-processing Data).
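For equal-length reads with substitution-only corrections, what these scripts compute amounts to diffing each corrected read against its original and encoding the differences; the sketch below works under that simplifying assumption (the actual scripts also handle indels and trimming):

```python
# Base encoding per the TEF field table
CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3, '-': 5}

def tef_record(readid, original, corrected):
    """Emit a TEF line for substitution-only corrections.

    tb is the corrected (true) base, wb the original (wrong) base;
    ind 0 marks a substitution."""
    errs = []
    for pos, (wb, tb) in enumerate(zip(original, corrected)):
        if wb != tb:
            errs.extend([pos, CODE[tb], CODE[wb], 0])
    return " ".join(str(x) for x in [readid, len(errs) // 4] + errs)

print(tef_record(17, "ACTT", "ACGT"))  # -> "17 1 2 2 3 0"
```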
SAM to TEF Conversion
Alignments in a SAM file are converted to a TEF file using the script `sam-analysis.py`:
sam-analysis.py --file=/path/to/sam-file-input \
    --outfile=/path/to/err-output \
    --ambig=/path/to/ambig-output \
    --unmapped=/path/to/unmapped-output \
    --trim=/path/to/trim-file-output \
    [--genomeFile=/path/to/genome-file] \
    [--dry (for dry run; no output generated)]
The `--outfile` option gives the path of an output file with write access. Ambiguous reads are written to the file given by the `--ambig` option, and unmapped reads to the file given by the `--unmapped` option; the unmapped and ambiguous outputs may be the same file. The `--trim` option gives the file to which trimmed positions are written (ranges are allowed).
Here, the genome file is optional; it is used if the MD string is not available. If a genome file is given, it will be loaded into memory completely. The script does not handle genomes with multiple chromosomes.
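For orientation, the MD string encodes matches and mismatches against the reference, which is why the genome is only needed when it is absent. A minimal sketch of recovering substitution positions from an MD string (illustrative only; it assumes the CIGAR contains no insertions or clipping, and it skips deletions marked with '^'):

```python
import re

def md_substitutions(md):
    """Return (read_pos, reference_base) pairs from a SAM MD string.

    Sketch only: read positions are valid when the CIGAR has no
    insertions or soft clips; deletions ('^...') consume no read bases."""
    pos, subs = 0, []
    for num, bases in re.findall(r"(\d+)(\^[ACGTN]+|[ACGTN])?", md):
        pos += int(num)                 # run of matching bases
        if bases and not bases.startswith("^"):
            subs.append((pos, bases))   # reference base at this read position
            pos += 1
    return subs

md_substitutions("10A5")  # one substitution at read position 10, reference base 'A'
```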
Comp2PCAlign
Comp2PCAlign measures the Gain and Sensitivity from the outputs generated in the previous two sub-sections. Usage is as below:
$ comp2pcalign [correction-rslt] [pre-correct-aln-rslt] [unmapped-pre-correct-aln] [m-value] [fpfn-rslt] [optional trimmed-file]
It takes six arguments, given in the following order:
1. Correction result converted to TEF.
2. Alignment SAM converted to TEF.
3. File with the list of unmapped reads.
4. Edit distance used for alignment.
5. Output file with write access, to which the statistics are written.
6. [Optional] List of reads with trimmed regions.
(1) is generated from the error correction output as described in Section Conversion to TEF. (2), (3), (4) and (6) are obtained from the alignment as described in Section SAM to TEF Conversion. (3) is a concatenation of both the unmapped and ambiguous reads.
Corrected 454/Ion Torrent Reads Analysis
The procedure to analyse 454/Ion Torrent reads is given in the paper. `compute-stats.py` is the script implementing this procedure. It is used as below:
compute-stats.py --aln=/path/to/pre-correction-alignment-sam-file \
    --corrected=/path/to/corrected-reads-fa-file \
    --outfile=/path/to/stats-output (write access reqd.) \
    --records=<number of reads> \
    [--genomeFile=/path/to/genome-file] \
    [--band=<value of k used for k-band alignment (default 5)>]

or, equivalently, with short options:

compute-stats.py -a /path/to/pre-correction-alignment-sam-file \
    -c /path/to/corrected-reads-fa-file \
    -o /path/to/stats-output-file (write access reqd.) \
    -r <number of reads> \
    [-g /path/to/genome-file] \
    [-b <value of k used for k-band alignment (default 5)>]
The script accepts only FASTA files. It requires that the FASTA is pre-processed as given in Section Pre-processing Data, because it exploits the sorted identifiers to process the SAM and FASTA files in a single pass.
The `--band` option gives the band size used for k-band alignment. The genome file is optional; it is used if the MD string is not available. If a genome file is given, it will be loaded into memory completely. The script does not handle genomes with multiple chromosomes.
`compute-stats.py` requires MPI and the mpi4py package.
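The `--band` option refers to k-band alignment, which restricts the dynamic-programming matrix to a diagonal band of width 2k+1; this suffices when the two sequences differ by at most k edits and makes the alignment linear in read length for fixed k. A sketch of the technique (an illustration only, not the code in `compute-stats.py`):

```python
def kband_edit_distance(a, b, k=5):
    """Edit distance computed only within a band of +/-k around the diagonal.

    Returns None when the distance exceeds the band, in which case a wider
    band (larger k) would be needed."""
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None                      # lengths alone force > k edits
    INF = float("inf")
    prev = {j: j for j in range(min(m, k) + 1)}   # row 0: j insertions
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                cur[j] = i               # delete the first i chars of a
                continue
            best = INF
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if j - 1 in prev:
                best = min(best, prev[j - 1] + cost)  # match / substitute
            if j in prev:
                best = min(best, prev[j] + 1)          # delete from a
            if j - 1 in cur:
                best = min(best, cur[j - 1] + 1)       # insert into a
            cur[j] = best
        prev = cur
    d = prev.get(m, INF)
    return None if d > k else d

kband_edit_distance("ACGT", "AGT", k=2)  # edit distance 1 (one deletion)
```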
Download
Source code for the tools is available from here.