Evaluation method

Evaluation method for Illumina reads

Target error format (TEF)

We use a format called target error format (TEF) to perform the analysis for Illumina reads. TEF represents the errors in a read as below:

readid n-errors [pos tb wb ind]+

The fields in the above format are described below:

Field     Description
readid    ID of the corrected read.
n-errors  Integer. Number of errors corrected in the read.
pos       Position of the fix (0 <= pos < length of the read).
tb        True value of the base at pos.
wb        Wrong value of the base at pos.
          wb should be the current base in the read.
          tb and wb are each one of {0,1,2,3,4,5}:
          0 = ‘A’, 1 = ‘C’, 2 = ‘G’, 3 = ‘T’, 5 = ‘-’
ind       Type of error, one of {0,1,2}:
          0 = substitution (bad char in the read at pos)
          1 = deletion (missing char in the read after pos)
          2 = insertion (extra char in the read at pos)
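
As an illustration of the record layout, the following is a minimal Python sketch of a TEF line parser. It assumes that fields are separated by whitespace and that one record occupies one line, neither of which is stated explicitly above. The names `TefError`, `TefRecord` and `parse_tef_line` are hypothetical and are not part of the evaluation tools.

```python
from typing import List, NamedTuple

# Base codes as listed in the field description. Code 4 is allowed by the
# format but its meaning is not given there, so it is left out of this table.
BASE_CODES = {0: "A", 1: "C", 2: "G", 3: "T", 5: "-"}
ERROR_TYPES = {0: "substitution", 1: "deletion", 2: "insertion"}


class TefError(NamedTuple):
    pos: int  # position of the fix in the read
    tb: int   # code of the true base
    wb: int   # code of the wrong (current) base
    ind: int  # error type: 0, 1 or 2


class TefRecord(NamedTuple):
    readid: str
    errors: List[TefError]


def parse_tef_line(line: str) -> TefRecord:
    """Parse one 'readid n-errors [pos tb wb ind]+' line (whitespace-separated, assumed)."""
    fields = line.split()
    readid, n_errors = fields[0], int(fields[1])
    values = [int(v) for v in fields[2:]]
    if len(values) != 4 * n_errors:
        raise ValueError(f"{readid}: expected {4 * n_errors} values, got {len(values)}")
    errors = [TefError(*values[i:i + 4]) for i in range(0, len(values), 4)]
    for err in errors:
        if err.ind not in ERROR_TYPES:
            raise ValueError(f"{readid}: unknown error type {err.ind}")
        if not (0 <= err.tb <= 5 and 0 <= err.wb <= 5):
            raise ValueError(f"{readid}: base code out of range")
    return TefRecord(readid, errors)


if __name__ == "__main__":
    record = parse_tef_line("17 2 3 0 2 0 40 1 3 0")
    for err in record.errors:
        print(record.readid, err.pos, BASE_CODES[err.tb], BASE_CODES[err.wb], ERROR_TYPES[err.ind])
```

The example line describes read 17 with two substitutions: at position 3 the read has ‘G’ where ‘A’ is the true base, and at position 40 it has ‘T’ where ‘C’ is the true base.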

Align uncorrected reads with BWA

Correct reads

Measure Gain

Evaluation method for 454/Ion Torrent Reads

Align uncorrected reads with Mosaik/TMAP

Measure Gain

Tools

Requirements

The tools used for evaluation have the following dependencies:

  1. GCC C++ compiler v. 4.3
  2. Perl v. 5
  3. Python v. 2.7.2
  4. MPI
  5. mpi4py python package

Pre-processing Data

Conversion to TEF

Scripts for converting output to TEF are used as follows:

  1. coral-analy.pl converts a Coral-corrected FASTA file to TEF as below:
    $ coral-analy.pl corrected.fa all.fa coral-output.er > coral_conv.log
          

    In the above example, corrected.fa is the corrected FASTA file, all.fa is the uncorrected FASTA file and coral-output.er is the output in TEF.

  2. The conversion program for both Quake and ECHO is quake-analy.py. It is run as below:
    $ quake-analy.py -f all.fastq -c corrected.fastq -o echo-output.er -t echo-trim > missing.log
          

    Here, 'all.fastq' is the input file, 'corrected.fastq' is the ECHO/Quake-corrected FASTQ file, 'echo-output.er' is the output in TEF, and 'echo-trim' is the list of reads with the trimmed area (which is ignored).

  3. Output from HiTEC is converted to TEF as below.
    $ hitec-analy.pl corrected.fa all.fa hitec-output.er
          

    Again, 'all.fa' is the uncorrected FASTA file, 'corrected.fa' is the corrected FASTA file and 'hitec-output.er' is the output in TEF.

All of these scripts rely on the identifiers added to the FASTA/FASTQ headers in the pre-processing step (Section Pre-processing Data).
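
To show the kind of comparison these conversion scripts perform, here is a minimal Python sketch that builds a TEF line from one uncorrected read and its corrected counterpart. It is not the code of coral-analy.pl, quake-analy.py or hitec-analy.pl. It assumes that the two reads have the same length, so only substitutions can be reported, and that they contain only A, C, G and T. The function name `substitutions_to_tef` is hypothetical.

```python
from typing import Optional

# Base codes from the TEF description (only the four nucleotides are used here).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def substitutions_to_tef(readid: str, original: str, corrected: str) -> Optional[str]:
    """Return a TEF line for the substitutions between two reads, or None if they agree.

    Assumption of this sketch: both reads have equal length, so every
    difference is a substitution (ind = 0). Bases other than A, C, G, T
    raise KeyError because their codes are not defined here.
    """
    if len(original) != len(corrected):
        raise ValueError("this sketch only handles reads of equal length")
    fields = []
    for pos, (wrong, true) in enumerate(zip(original.upper(), corrected.upper())):
        if wrong != true:
            # tb is the base proposed by the corrector, wb is the base currently in the read
            fields += [pos, BASE_TO_CODE[true], BASE_TO_CODE[wrong], 0]
    if not fields:
        return None
    n_errors = len(fields) // 4
    return " ".join(str(v) for v in [readid, n_errors] + fields)


if __name__ == "__main__":
    print(substitutions_to_tef("7", "ACGTAC", "ACCTAA"))  # 7 2 2 1 2 0 5 0 1 0
```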

SAM to TEF Conversion

Alignments in a SAM file are converted to a TEF file using the script 'sam-analysis.py'.

sam-analysis.py --file=/path/to/sam-file-input
                --outfile=/path/to/err-output
                --ambig=/path/to/ambig-output
                --unmapped=/path/to/unmapped-output
                --trim=/path/to/trim-file-output
                [--genomeFile=/path/to/genome-file]
                [--dry (for dry run no output generated)]

The '--outfile' option is the path of the output file, which needs write access. Ambiguous reads are written to the file given for the '--ambig' option, and unmapped reads are written to the file given for the '--unmapped' option. The unmapped and ambiguous files can be the same file. The file given for the '--trim' option lists the trimmed positions (ranges are allowed).

Here, the genome file is optional. It is used if the MD string is not available. If a genome file is given, it is loaded into memory completely. The script does not handle genomes with multiple chromosomes.
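
The MD string of a SAM record lists the reference bases at mismatched positions, which is why the genome is only needed when the MD string is absent. The following Python sketch shows how mismatches could be read off an MD string. It is not taken from sam-analysis.py, and it assumes an ungapped, unclipped, forward-strand alignment, so deletions (the '^' entries of the MD string), insertions and reverse-complemented reads are out of scope. The function name `md_mismatches` is hypothetical.

```python
import re
from typing import List, Tuple


def md_mismatches(md: str, read: str) -> List[Tuple[int, str, str]]:
    """Return (pos, reference_base, read_base) for every mismatch in an MD string.

    Assumptions of this sketch: the alignment has no indels and no clipping,
    and the read is given in the orientation stored in the SAM record.
    """
    if "^" in md:
        raise ValueError("deletions are outside the scope of this sketch")
    mismatches = []
    pos = 0
    # The MD string alternates between run lengths of matches and mismatched reference bases.
    for run, base in re.findall(r"(\d+)|([A-Za-z])", md):
        if run:
            pos += int(run)  # skip over matching bases
        else:
            mismatches.append((pos, base.upper(), read[pos].upper()))
            pos += 1
    if pos != len(read):
        raise ValueError("MD string does not span the whole read")
    return mismatches


if __name__ == "__main__":
    # Three matches, reference A where the read has T, five matches, reference G where the read has C.
    print(md_mismatches("3A5G0", "ACGTACGTAC"))
```

In TEF terms, the reference base would play the role of tb and the read base the role of wb for a substitution.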

Comp2PCAlign

Comp2PCAlign measures the Gain and Sensitivity from the outputs generated in the previous two sub-sections. Usage is as below:

$ comp2pcalign [correction-rslt] [pre-correct-aln-rslt] [unmapped-pre-correct-aln] [m-value] [fpfn-rslt] [optional trimmed-file]

It takes six arguments, the last of which is optional, in the following order:

  1. Correction Result converted to TEF.
  2. Alignment SAM converted to TEF.
  3. File with list of unmapped reads.
  4. Edit distance used for alignment.
  5. Output file, with write access, to which the statistics are written.
  6. [Optional] List of reads with trimmed regions.

(1) is generated from the error-correction output as described in Section Conversion to TEF. (2), (3), (4) and (6) are generated from the alignment as described in Section SAM to TEF Conversion. (3) is a concatenation of the unmapped and ambiguous reads.
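
This section does not define Gain and Sensitivity. The Python sketch below assumes the commonly used definitions, Gain = (TP - FP) / (TP + FN) and Sensitivity = TP / (TP + FN), where TP counts true errors that were corrected, FP counts corrections that do not match a true error, and FN counts true errors that were left uncorrected. It is not the implementation of Comp2PCAlign: it ignores the m-value, unmapped reads and trimmed regions, and the function name `gain_and_sensitivity` is hypothetical.

```python
from typing import Set, Tuple

# One error as (readid, pos, tb, wb, ind), following the TEF fields.
Error = Tuple[str, int, int, int, int]


def gain_and_sensitivity(corrections: Set[Error], true_errors: Set[Error]) -> Tuple[float, float]:
    """Compute (gain, sensitivity) under the assumed definitions.

    corrections: errors reported by the correction tool (from its TEF file).
    true_errors: errors derived from the pre-correction alignment (from its TEF file).
    """
    tp = len(corrections & true_errors)  # true errors that were corrected
    fp = len(corrections - true_errors)  # corrections that match no true error
    fn = len(true_errors - corrections)  # true errors that were missed
    if tp + fn == 0:
        raise ValueError("no true errors: gain and sensitivity are undefined")
    gain = (tp - fp) / (tp + fn)
    sensitivity = tp / (tp + fn)
    return gain, sensitivity


if __name__ == "__main__":
    truth = {("1", 3, 0, 2, 0), ("1", 9, 1, 3, 0), ("2", 5, 2, 0, 0), ("3", 7, 3, 1, 0)}
    fixed = {("1", 3, 0, 2, 0), ("1", 9, 1, 3, 0), ("2", 5, 2, 0, 0), ("4", 1, 0, 1, 0)}
    print(gain_and_sensitivity(fixed, truth))  # (0.5, 0.75)
```

Under these definitions a tool that introduces more errors than it removes has a negative gain, while its sensitivity stays between 0 and 1.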

Corrected 454/Ion Torrent Reads Analysis

The procedure to analyse 454/Ion Torrent reads is given in the paper. 'compute-stats.py' is the script implementing the procedure. It is used as below:

compute-stats.py --aln=/path/to/pre-correction-alignment-sam-file
                --corrected=/path/to/corrected-reads-fa-file
                --outfile=/path/to/stats-output (write access reqd.)
                --records=number of reads
                [--genomeFile=/path/to/genome-file]
                [--band=value of k used for k-band alignment (default 5)]
              (OR)
compute-stats.py -a /path/to/pre-correction-alignment-sam-file
                -c /path/to/corrected-reads-fa-file
                -o /path/to/stats-output-file (write access reqd.)
                -r number of reads
                [-g /path/to/genome-file]
                [-b value of k used for k-band alignment (default 5)]

The script accepts only FASTA files. It requires that the FASTA file is pre-processed as given in Section Pre-processing Data, because it exploits the sorted identifiers to process the SAM and FASTA files together in a single pass.

The '--band' option gives the band size used for k-band alignment. Here, the genome file is optional. It is used if the MD string is not available. If a genome file is given, it is loaded into memory completely. The script does not handle genomes with multiple chromosomes.
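
For readers unfamiliar with k-band alignment, the following Python sketch computes an edit distance in which the dynamic-programming table is restricted to cells within k of the main diagonal. It illustrates the general technique only and is not the alignment code of compute-stats.py, whose scoring scheme is not described here. Unit costs for substitutions, insertions and deletions are an assumption, and the function name `banded_edit_distance` is hypothetical.

```python
from typing import Optional


def banded_edit_distance(a: str, b: str, k: int = 5) -> Optional[int]:
    """Edit distance between a and b using only cells (i, j) with |i - j| <= k.

    Returns None when the lengths differ by more than k, because the end
    cell then lies outside the band. Unit costs are assumed.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = float("inf")
    # Row 0: turning the empty prefix of a into b[:j] takes j insertions (inside the band only).
    prev = [j if j <= k else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        for j in range(lo, hi + 1):  # only cells inside the band are evaluated
            if j == 0:
                curr[j] = i  # delete the first i characters of a
                continue
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # deletion from a
                          curr[j - 1] + 1)     # insertion into a
        prev = curr
    return prev[m]


if __name__ == "__main__":
    print(banded_edit_distance("ACGT", "AGGT", k=1))  # 1
    print(banded_edit_distance("AAAA", "AAAAAAA", k=2))  # None
```

The banded value equals the unrestricted edit distance whenever the latter is at most k, since an optimal alignment with at most k edits never leaves the band; otherwise it is an upper bound.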

'compute-stats.py' requires MPI and mpi4py, as it uses MPI.

Download

Source code for the tools is available from here.