GAAP:
cGOF Assisted Assembly Pipeline

GAAP is a cGOF (core-gene-defined Genome-organization-framework) Assisted Assembly Pipeline. It scaffolds and extends scaffolds and contigs, starting from a de novo assembly of one paired-end library and the core gene clusters of multiple related references.

GAAP is composed of two separate yet sequential sections:

1) cGOF_identification, which extracts the sequences, order and orientation of cGOF segments from the references; it is run once per species.

2) Scaffolding, which uses segments of cGOF genes as anchors to order the target scaffolds and contigs, uses paired-end read mapping for local scaffolding of the ordered scaffolds/contigs to recover more contigs, and then matches the closest organized reference to construct a pseudogenome; it is run once per target.

The framework and algorithm of GAAP are shown in Fig.1.

Figure 1 The framework of GAAP. seg, segment of cGOF; ref, reference; sc, scaffold/contig. Head (closed circle) and tail (open circle) vertices of the syntenic seg in each reference are sequentially connected with a dashed line indicating the permutation (order and orientation) of seg. The graph in the local scaffolding of ordered sc is built by connecting seg-ordered sc and unordered sc wherever the link count is higher than a certain cut-off. If conflicting connections occur, they are prioritized by seg order and link count. The line widths indicate the link count. For each pair of scaffolds and contigs, sci and scj, there exist four types of connection between them: (i) head-to-head, [sci(-),scj(+)] or [scj(-),sci(+)]; (ii) head-to-tail, [sci(-),scj(-)] or [scj(-),sci(-)]; (iii) tail-to-head, [sci(+),scj(+)] or [scj(+),sci(+)]; (iv) tail-to-tail, [sci(+),scj(-)] or [scj(+),sci(-)]; where positive and negative signs indicate the relative orientation of assemblies.
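The connection types and the link cut-off described in the caption can be illustrated with a small Python sketch. This is not GAAP's own code: the function names, the dictionary layout of the links, and the strict "greater than" comparison are assumptions made for illustration. It reads an oriented pair [first(sign), second(sign)] from left to right, so a (+) piece ends at its tail and a (-) piece ends at its head.

```python
# Illustrative sketch only; names and data layout are assumptions, not GAAP's code.

def connection_type(first_sign, second_sign):
    """Name the junction of an ordered, oriented pair [first(sign), second(sign)].

    A '+' piece is read head -> tail; a '-' piece is read tail -> head.
    """
    for sign in (first_sign, second_sign):
        if sign not in ("+", "-"):
            raise ValueError("orientation must be '+' or '-'")
    # End of the first piece that touches the junction.
    first_end = "tail" if first_sign == "+" else "head"
    # End of the second piece that touches the junction.
    second_end = "head" if second_sign == "+" else "tail"
    return first_end + "-to-" + second_end


def filter_links(link_counts, cutoff):
    """Keep links whose count exceeds the cut-off, strongest first.

    link_counts maps a (sc_a, sc_b) tuple to the number of read pairs
    linking the two pieces (a hypothetical layout).
    """
    kept = [(pair, n) for pair, n in link_counts.items() if n > cutoff]
    # Sort by descending count, then by name so the order is deterministic.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


if __name__ == "__main__":
    for signs in (("-", "+"), ("-", "-"), ("+", "+"), ("+", "-")):
        print(signs, connection_type(*signs))
    print(filter_links({("sc1", "sc2"): 5, ("sc2", "sc3"): 1}, 2))
```

With this reading, [sci(-),scj(+)] is head-to-head and [sci(+),scj(+)] is tail-to-head, matching the first form listed for each type in the caption.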


Download

The source code of GAAP can be downloaded from GitHub for free.
To test the performance of GAAP, example datasets are provided here.

Installation

GAAP is written as Python scripts (Python 2.7 or above is required to run the program), so no compilation is necessary. However, the external programs Bowtie2 and BLAT are required to run GAAP. Bowtie2 is available from http://bowtie-bio.sourceforge.net/bowtie2/index.shtml. BLAT is available from ftp://ftp.ncbi.nih.gov/blast/executables/release/.

Before starting, add Bowtie2 and BLAT to your $PATH: 'export PATH=path-to-bowtie2:$PATH' and 'export PATH=path-to-blat:$PATH'.
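To confirm that both programs are visible after editing $PATH, a short check such as the following can help. It is a minimal sketch and not part of GAAP: the helper names are hypothetical, and it assumes the executables are named 'bowtie2' and 'blat' in your installation. It avoids newer standard-library helpers so that it also runs on Python 2.7.

```python
# Minimal PATH check; helper names are hypothetical and not part of GAAP.
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, program)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(programs=("bowtie2", "blat"), path=None):
    """List the programs (assumed executable names) not found on PATH."""
    return [name for name in programs if find_on_path(name, path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("Bowtie2 and BLAT were both found on PATH.")
```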

In addition, PGAP (http://sourceforge.net/projects/pgap/) is recommended for producing the gene cluster file, which is needed to run cGOF identification.


Online Manual

Here we provide two examples, S. aureus and S. suis.

1) For S. aureus, we run command lines as follows.

2) For S. suis, we run command lines as follows.