We downloaded all publicly available Brassica ESTs for four selected species
(2,482 Brassica carinata, 59,736 Brassica oleracea, 567,220
Brassica napus, and 180,611 Brassica rapa) from GenBank.
Assembly Method
We used PTA (Paracel Transcript Assembler) to clean, cluster, and assemble
the ESTs of each species.
During the cleaning stage of PTA, we clipped off low-quality regions,
vector fragments, and poly-A/T tails, and annotated low-complexity regions and
repeats. ESTs with significant hits to RNA genes or to the mitochondrial,
chloroplast, E. coli, or yeast genomes were filtered out. ESTs shorter than
100 bp or longer than 1000 bp after the cleaning stage were also filtered
out.
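
The tail clipping and length filter described above can be sketched as follows. This is only an illustration and not PTA's actual implementation: the function names, the 10-base minimum run for a poly-A/T tail, and the assumption of upper-case sequences are all hypothetical. Only the 100 to 1000 bp retention range comes from the text.

```python
import re

def trim_poly_tails(seq, min_run=10):
    """Clip a trailing poly-A run and a leading poly-T run.

    min_run is an assumed threshold, not a documented PTA setting.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq

def passes_length_filter(seq, min_len=100, max_len=1000):
    """True if the cleaned EST is neither shorter than min_len nor longer than max_len."""
    return min_len <= len(seq) <= max_len

def clean_ests(records, min_len=100, max_len=1000):
    """Trim tails, then keep (name, sequence) pairs within the length range."""
    cleaned = []
    for name, seq in records:
        trimmed = trim_poly_tails(seq)
        if passes_length_filter(trimmed, min_len, max_len):
            cleaned.append((name, trimmed))
    return cleaned
```

An EST of 95 informative bases followed by a 20-base poly-A tail would be trimmed to 95 bp and then dropped, because it falls below the 100 bp minimum.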
After the cleaning stage, we sought to reduce the dataset without losing any
useful information. A megablast search and locally developed Perl scripts were
used to remove sequences that duplicate parts of longer sequences.
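
The idea behind this redundancy removal can be illustrated with a much simpler stand-in. The sketch below drops any sequence that occurs verbatim inside a longer retained sequence. It uses exact substring matching instead of megablast, ignores reverse complements and near-identical matches, and the function name is hypothetical, so it should be read as a toy model of the step, not as the scripts that were actually used.

```python
def remove_contained(records):
    """Drop sequences that occur verbatim inside a longer (or identical) kept sequence.

    records is a list of (name, sequence) pairs. Exact substring matching is an
    assumption made for illustration; the real step relied on megablast hits.
    """
    kept = []
    # Visit the longest sequences first so each candidate is only compared
    # against sequences at least as long as itself.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        if not any(seq in longer for _, longer in kept):
            kept.append((name, seq))
    return kept
```

For example, given `ACGTACGTTT`, `GTACG`, and `TTTTT`, only `GTACG` is removed, because it is the only sequence fully contained in a longer one.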
After cleaning and duplicate removal, 2,071 (71.86%)
Brassica carinata, 47,391 (79.33%) Brassica oleracea, 432,891 (76.32%)
Brassica napus, and 123,848 (68.57%) Brassica rapa ESTs remained for
the next step, clustering and assembly.
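
The percentages in parentheses appear to be the share of the downloaded ESTs that survived. A minimal sketch of that calculation, using the Brassica oleracea counts from the text (the function name is hypothetical):

```python
def retention_percent(retained, downloaded):
    """Percentage of downloaded ESTs still present after filtering."""
    return 100.0 * retained / downloaded

# Brassica oleracea: 47,391 of 59,736 downloaded ESTs remained.
oleracea = round(retention_percent(47391, 59736), 2)
print(oleracea)  # 79.33
```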
During clustering and assembly, we separated all Brassica ESTs with
significant hits to the available complete Brassica coding sequences
(1,254) into different bins so that they could be assembled separately. After
this stage, 1,821 (64.07%) Brassica carinata, 41,739 (69.87%) Brassica oleracea,
363,710 (64.12%) Brassica napus, and 108,044 (59.82%) Brassica rapa ESTs were
pairwise compared, clustered, and assembled.
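
The binning step can be pictured as grouping ESTs by the coding sequence they hit. The sketch below assumes a hypothetical input: a dictionary mapping each EST identifier to the identifier of its best-hit coding sequence, or to `None` when there is no significant hit. How hits were actually scored and assigned is not described here, so this only illustrates the grouping.

```python
from collections import defaultdict

def bin_by_reference(best_hits):
    """Group EST ids by the coding sequence they hit.

    best_hits maps EST id -> coding-sequence id, or None for no significant hit
    (an assumed input format). Returns (bins, unbinned).
    """
    bins = defaultdict(list)
    unbinned = []
    for est_id, cds_id in best_hits.items():
        if cds_id is None:
            unbinned.append(est_id)
        else:
            bins[cds_id].append(est_id)
    return dict(bins), unbinned

# Hypothetical identifiers, for illustration only.
example_bins, example_rest = bin_by_reference(
    {"est1": "cdsA", "est2": None, "est3": "cdsA", "est4": "cdsB"}
)
```

Each bin can then be assembled on its own, while the ESTs without a hit go through general clustering.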
Assembly generated 361 Brassica carinata, 8,370 Brassica oleracea,
50,273 Brassica napus, and 17,902 Brassica rapa contigs.
MAGPIE Annotations
The resulting Brassica contigs and singlets for each species were uploaded
to MAGPIE for annotation.
Sprockets Analysis
The contigs and singlets were also analyzed with the Sprockets
pipeline, which consists of two main parts: pairwise analysis (Smith-Waterman
comparison) and grouping of sequences with Z-score statistics, followed by
hierarchical splitting of clusters into alignable protein families.
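
To make the pairwise step concrete, the sketch below computes a Smith-Waterman local alignment score and a Z-score against shuffled versions of one sequence. It is not Sprockets code. The simple match/mismatch scoring, the linear gap penalty, the shuffle-based background, and all function names are assumptions chosen for brevity; a real protein comparison would normally use a substitution matrix and affine gaps.

```python
import random
import statistics

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def z_score(a, b, shuffles=100, seed=0):
    """Standardize the observed score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    letters = list(b)
    background = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        background.append(smith_waterman_score(a, "".join(letters)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    # With no spread in the background, the Z-score is undefined; return 0.0.
    return (observed - mean) / sd if sd > 0 else 0.0
```

Pairs with a high Z-score score well beyond what their composition alone would explain, which is the kind of signal that grouping sequences into clusters relies on.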
The computational and statistical techniques applied in Sprockets allow it
to act as a massive, selective multiple sequence alignment engine for
combining individual sequence collections with related public sequences. The end
result is a database of gene hidden Markov models, each related to the others by
three levels of similarity: secondary structure, function, and evolutionary
origin.
Sprockets Hierarchical Output Visualization
The hierarchical outputs from Sprockets have been visualized through
SprocketsView,
GBrowse, and
Bluejay
Download
The EST assembly output:
Sprockets output