HUSAR Developments

- HUSAR Tasks
- W2H - the WWW interface to HUSAR
- W3H - the task framework
- Webservices in Bioinformatics - Hobit @ DKFZ

The growing complexity of both biological data and bioinformatics
tools requires knowledge of the underlying biological concepts and
computing methods. Selecting the correct combination of applications
and databases is often difficult and time-consuming. We have therefore
developed a task system that integrates applications and methods into
tailor-made analyses.
At DKFZ the W3H task framework is currently used within the HUSAR
environment (Heidelberg Unix Sequence Analysis Resources), where it
combines the bioinformatics tools available in HUSAR into workflows.
W3H tasks produce XML data containing all relevant information
obtained by combining the individual methods in the
environment. This XML output can be used in subsequent
analyses.
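As a rough illustration of how such XML output might feed a subsequent analysis step, the following Python sketch filters hits from a task result. The element and attribute names (`task`, `method`, `hit`, `score`) are invented for this example and do not reflect the actual W3H schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical task output; the real W3H schema is not reproduced here.
EXAMPLE_XML = """
<task name="ExampleTask">
  <method name="blast">
    <hit id="seqA" score="120.5"/>
    <hit id="seqB" score="48.0"/>
  </method>
  <method name="fasta">
    <hit id="seqA" score="310.0"/>
  </method>
</task>
"""


def hits_above(xml_text, min_score):
    """Return (method, hit id, score) tuples whose score reaches min_score."""
    root = ET.fromstring(xml_text)
    selected = []
    for method in root.findall("method"):
        for hit in method.findall("hit"):
            score = float(hit.get("score"))
            if score >= min_score:
                selected.append((method.get("name"), hit.get("id"), score))
    return selected


if __name__ == "__main__":
    for method, hit_id, score in hits_above(EXAMPLE_XML, 100.0):
        print(method, hit_id, score)
```

A later step would then work only with the selected hits instead of re-running the individual programs.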
The HUSAR team is open to new collaborations with other groups in order to design new tasks.
Available tasks:
| 2DSweep | Secondary Structure Prediction Tool |
| cDNA2Genome | Tool for mapping cDNAs |
| DNASweep | DNA Identification Tool |
| DomainSweep | Protein Family Search Tool |
| ESTAnnotator | EST Identification Tool |
| GeneConsensus | Tool for combining gene prediction programs |
| GeneModel | Tool for calculating complete gene structures |
| GOPET | Tool for GO term prediction and validation |
| IntegrationMap | Tool for mapping human integration sequences to the genome |
| IntegrationSeq | Tool for isolating integration sequences |
| miRpredict | Potential miRNA Identification Tool |
| miRTaCa | miRNA Target Catcher - miRNA Target Prediction in UTR Regions |
| PATH | Phylogenetic Analysis Task in HUSAR |
| PrimerSweep | Primer Search & Analysis Tool |
| ProtSweep | Protein Identification Tool |
| PromoterSweep | Identification of Transcription Factor Binding Sites |
| SERpredict | Detection of tissue- or tumor-specific isoforms created via the exonization of a retroelement |
W2H is a free WWW interface to sequence analysis software
tools such as the GCG package (Genetics Computer Group) and
EMBOSS (European Molecular Biology Open Software Suite),
or to services such as HUSAR (Heidelberg Unix Sequence Analysis Resources).
It tries to cover as much functionality as possible while remaining
as user-friendly as possible. It gives you access to more than a
hundred programs from any computer platform with a
JavaScript-enabled web browser.
The interface is freely available and under constant maintenance.
The development of W2H started in 1996 here at
the HUSAR Bioinformatics Lab at DKFZ (Senger et al.),
and it has been maintained since 1997 in a collaborative project
between DKFZ and EMBL-EBI (European Bioinformatics Institute,
Hinxton, UK).
All information about W2H can be found on the W2H-Homepage.
W3H task framework
The task framework for W2H
The W3H task framework allows the execution of compound jobs, using
meta-data descriptions of the work and data flows in a heterogeneous
bioinformatics environment. By means of these descriptions, the task
system can schedule the necessary execution of the applications
available in the environment, depending on rules specified in the
meta-data (Ernst et al.).
Because the web interface W2H is likewise based on meta-data,
integrating the task framework into it makes web access and data
management immediately available for each task description. Authors of
task descriptions can build on the underlying classes and objects
to describe dependency rules between previously independent
applications. At DKFZ the W3H task framework is currently used within
the HUSAR environment (Heidelberg Unix Sequence Analysis Resources),
where it combines the bioinformatics tools within HUSAR into
workflows. W3H tasks produce XML data containing all relevant information
obtained by combining the individual methods in the environment. The
resulting XML data is translated by XSLT into web pages or
plain text to report the result of the task to the user.
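The dependency-driven scheduling described in this section can be pictured as ordering applications by their declared prerequisites. The sketch below is not the W3H implementation. It assumes a simplified meta-data format (a mapping from each application to the applications whose output it needs), and all application names are illustrative.

```python
from graphlib import TopologicalSorter

# Hypothetical meta-data: each application lists the applications whose
# output it needs. The names are illustrative only.
TASK_DESCRIPTION = {
    "mask_repeats": [],
    "homology_search": ["mask_repeats"],
    "gene_prediction": ["mask_repeats"],
    "report": ["homology_search", "gene_prediction"],
}


def execution_order(description):
    """Return an order in which every application runs after its prerequisites."""
    return list(TopologicalSorter(description).static_order())


if __name__ == "__main__":
    print(execution_order(TASK_DESCRIPTION))
```

A real task system would additionally evaluate rules on intermediate results, for example skipping a step when an earlier search returns no hits.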
2DSweep
Secondary Structure Prediction Tool
2DSweep is a task for performing secondary structure
predictions on protein sequences. It reports predictions for
alpha-helix, beta-strand, coiled-coil, and helix-turn-helix
motifs. The task also predicts transmembrane regions, signal
sequences, hydrophobicity, antigenicity, protease cleavage sites, and
possible protein localization, and provides peptide statistics (pI
value, amino-acid composition, molecular weight).
View XML schema documentation and example output.
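Two of the peptide statistics mentioned above, amino-acid composition and molecular weight, are simple to compute. The sketch below shows one common way to do so, using average residue masses. It is an illustration only and is not taken from the 2DSweep code; the function names are made up for this example.

```python
from collections import Counter

# Average residue masses in daltons; one water molecule is added once
# for the free N- and C-termini of the chain.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528


def composition(seq):
    """Count each amino acid in the sequence, keyed in alphabetical order."""
    return dict(sorted(Counter(seq.upper()).items()))


def molecular_weight(seq):
    """Sum the average residue masses and add one water molecule."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


if __name__ == "__main__":
    print(composition("GGA"), round(molecular_weight("GG"), 2))
```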
cDNA2Genome is an application for the automatic high-throughput mapping
and characterization of cDNAs (del Val et al.).
It uses existing annotation data and improves them where possible
with the most up-to-date databases, especially for ESTs, proteins and
mRNAs. cDNA2Genome focuses on determining the exon-intron structure of
the cDNA, which is assessed exhaustively with a number of gene
prediction approaches. The input cDNA sequence is first masked for
repetitive elements and then searched with BLAST against the human
genomic database. From the BLAST output, the best group of compatible
HSPs (high-scoring segment pairs) is selected. To be selected, the HSPs
must be consecutive in both the genomic sequence and the cDNA, lie on
the same genomic strand, and cover as much of the input cDNA as
possible. cDNA2Genome reports the chromosomal and contig
location of the cDNA. It extracts the genomic sequence in which the cDNA
is located and predicts the exons and introns in this region for both
strands. The gene prediction methods used are GenScan, HMMgene,
GeneID, GeneWise, and Sim4.
View data flow and example output.
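The HSP selection criteria above (same strand, consecutive in both sequences, maximal cDNA coverage) can be sketched as a small dynamic-programming chain search. This is one possible reading of those criteria, not the cDNA2Genome algorithm itself. The `HSP` record and the assumption that genomic coordinates increase along the cDNA on either strand are simplifications made for this example.

```python
from typing import NamedTuple


class HSP(NamedTuple):
    q_start: int  # start on the cDNA
    q_end: int    # end on the cDNA (inclusive)
    g_start: int  # start on the genomic sequence
    g_end: int    # end on the genomic sequence (inclusive)
    strand: str   # "+" or "-"


def cdna_len(h):
    """Number of cDNA bases covered by one HSP."""
    return h.q_end - h.q_start + 1


def best_chain(hsps):
    """Return the same-strand, collinear HSP chain covering most cDNA bases."""
    best, best_cov = [], 0
    for strand in ("+", "-"):
        group = sorted(h for h in hsps if h.strand == strand)
        if not group:
            continue
        # score[i]: cDNA bases covered by the best chain ending at group[i]
        score = [cdna_len(h) for h in group]
        prev = [None] * len(group)
        for i, cur in enumerate(group):
            for j in range(i):
                cand = group[j]
                collinear = cand.q_end < cur.q_start and cand.g_end < cur.g_start
                if collinear and score[j] + cdna_len(cur) > score[i]:
                    score[i], prev[i] = score[j] + cdna_len(cur), j
        i = max(range(len(group)), key=score.__getitem__)
        if score[i] > best_cov:
            best_cov, chain = score[i], []
            while i is not None:  # walk back through the chain
                chain.append(group[i])
                i = prev[i]
            best = chain[::-1]
    return best
```

In this sketch, overlapping HSPs are never placed in the same chain. A production tool would typically tolerate small overlaps at exon boundaries.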
DNASweep tries to identify a piece of eukaryotic
DNA by homology search and locates possible genes or promoter elements
in the sequence. By default, the input DNA sequence is masked for
repetitive elements, and a homology search is then performed against
a database of non-EST sequences; users can choose different
databases or run additional searches against the EST or HTG sequences.
If your sequence is human, you can use the Human_Assembled (NCBI)
database. Because the search for genes and transcription
factors is organism-specific, the organism to which your sequence
belongs should be specified. The programs used are Genscan
for gene and promoter prediction, Factor for the identification of
transcription factor binding sites, and Fasta against the Eukaryotic
Promoter Database (EPD) for the location of eukaryotic promoter
elements.
View example output.
DomainSweep identifies the domain architecture within
a protein sequence and can thus help to find correct functional
assignments for an uncharacterised protein sequence. It employs
different database search methods to scan a number of protein/domain
family databases, because different analytical approaches have been
used to create the family signatures. These models are, in order of
increasing complexity: automatically generated protein
family consensus sequences (Prodom), regular-expression patterns
(Prosite), ungapped position-specific scoring matrices of sequence
segments (Blocks) or sequence motifs (Prints), gapped
position-specific scoring matrices (Prosite profiles), and Hidden Markov
Models (Pfam, Smart, Tigrfams). Each database covers a slightly
different but overlapping set of protein families/domains, and each
model has its own diagnostic strengths and weaknesses. DomainSweep is an
integrated search tool for the most important protein family databases.
In the final result, domains are classified as "Significant" or "Putative"
according to predefined rules such as database-specific cutoff values
or e-value thresholds. Domain hits are linked to the
corresponding protein family database entries and are grouped
together if they belong to the same InterPro family. InterPro, as an
integrated resource, provides extensive domain annotations including
direct access to the GO (Gene Ontology) classification system.
View data flow, XML schema documentation and example output.
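The classification and grouping steps described above can be illustrated with a short sketch. The cutoff values, the hit fields and the InterPro identifiers below are hypothetical placeholders. The actual DomainSweep rules are database-specific and are not given here.

```python
from collections import defaultdict

# Hypothetical per-database e-value cutoffs: (significant, putative).
CUTOFFS = {"Pfam": (1e-5, 1e-2), "Smart": (1e-4, 1e-1)}


def classify(hit):
    """Label a hit "Significant" or "Putative", or None if it fails both cutoffs."""
    significant, putative = CUTOFFS[hit["db"]]
    if hit["evalue"] <= significant:
        return "Significant"
    if hit["evalue"] <= putative:
        return "Putative"
    return None


def group_by_interpro(hits):
    """Group the retained hits by the InterPro family they belong to."""
    groups = defaultdict(list)
    for hit in hits:
        label = classify(hit)
        if label is not None:
            groups[hit["interpro"]].append((hit["db"], hit["name"], label))
    return dict(groups)


if __name__ == "__main__":
    # Placeholder hits; names and identifiers are invented for the example.
    example = [
        {"db": "Pfam", "name": "domA", "evalue": 1e-20, "interpro": "IPR_X"},
        {"db": "Smart", "name": "domA_like", "evalue": 5e-3, "interpro": "IPR_X"},
        {"db": "Pfam", "name": "domB", "evalue": 0.5, "interpro": "IPR_Y"},
    ]
    print(group_by_interpro(example))
```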
ESTAnnotator is a tool for the automatic analysis of
EST sequences that supports the search for functional annotations of
novel transcript sequences (Hotz-Wagenblatt et al.). In a first
quality-check step, repeats, vector parts and low-quality sequences
are masked. Then successive steps of BLAST searching against
suitable databases and EST clustering are performed. Already known
transcripts present in mRNA and genomic DNA reference databases
are identified. Subsequently, tools for the clustering of anonymous
ESTs and for further database searches at the protein level are
executed. ESTAnnotator was successfully applied to the systematic
identification and characterisation of novel human genes involved
in cartilage/bone formation, growth, differentiation and
homeostasis (Zabel et al.).
View data flow and example output.
GeneConsensus combines the predictions of different gene-finding programs:
GenScan, HMMgene and GeneID. From their outputs, it computes a consensus
sequence using one of the following algorithms (selectable by the
user): the "OR"-method for high sensitivity, the "AND"-method for high
specificity, the "EUI"-method, which is suitable for short sequences, and
the "GI"-method, which is optimized for long sequences.
View example output.
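The trade-off between the "OR" and "AND" methods can be seen in a small sketch that combines exon predictions as unions or intersections of nucleotide positions. Combining at the nucleotide level is an assumption made for this example, since the text does not say how GeneConsensus represents predictions. The "EUI" and "GI" methods are not shown because their definitions are not given here.

```python
def to_positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end + 1)}


def to_intervals(positions):
    """Collapse a set of positions back into sorted inclusive intervals."""
    intervals = []
    for p in sorted(positions):
        if intervals and p == intervals[-1][1] + 1:
            intervals[-1][1] = p
        else:
            intervals.append([p, p])
    return [tuple(i) for i in intervals]


def consensus(predictions, method):
    """Combine per-program exon lists: "OR" is the union, "AND" the intersection."""
    sets = [to_positions(exons) for exons in predictions]
    if method == "OR":
        combined = set().union(*sets)
    elif method == "AND":
        combined = set.intersection(*sets)
    else:
        raise ValueError(f"unsupported method: {method}")
    return to_intervals(combined)


if __name__ == "__main__":
    # Made-up exon predictions from three programs.
    preds = [[(10, 20), (40, 50)], [(15, 25), (40, 45)], [(12, 22), (42, 60)]]
    print(consensus(preds, "OR"), consensus(preds, "AND"))
```

The union keeps every base that any program calls coding (higher sensitivity), while the intersection keeps only bases on which all programs agree (higher specificity).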
GeneModel
A Tool for calculating complete gene structures
GeneModel calculates the full-length structure of a gene from an input cDNA or
mRNA sequence. It integrates existing information from different resources such as NCBI,
ENSEMBL, VEGA, and UCSC. To predict the gene structure, it combines CpG-island data, ESTs, and
hand-annotated and computer-predicted genes from these resources, using algorithms from the W3H tasks
Caftan (mapping and comparison of introns and exons) and GeneConsensus (detection of common groups of
compatible exons). Because the annotation procedures and the reliability of the transcripts differ
between data sources, GeneModel applies to each transcript a quality score that depends on the
origin of the annotation, in order to improve the prediction of the structure.
The web output of GeneModel is divided into six sections: (i) General information, (ii) cDNA location,
(iii) Complete gene structure, (iv) Full-length cDNA exon table, (v) cDNA genomic context and (vi)
Genomic summary table. Sections (iii) and (iv) are graphical outputs, while the remaining
sections are tables containing the information used to generate these graphics. Section (i) lists
the parameters used to run GeneModel, and section (ii) gives the organism, chromosome, begin, end
and strand of the cDNA in the genome. Section (iv) describes all the exons that make up the
full-length cDNA, indicating their begin and end in the genome and whether they are constitutive
(present in all transcripts) or alternative. A filtering criterion for overlapping genes is applied
when selecting the exons that form part of the full-length cDNA: in those cases the exon with the
best annotation source is selected. The last table, section (vi), summarises all exons found for the
subject gene with the following fields: exon number sorted by genomic begin and end, the name of the
transcript in which it was found, source and quality of the annotation, type of transcript, status
of the transcript annotation, and whether it was found with sim4.
The user has immediate access to all complete application outputs and database entries via hyperlinks.
At the bottom of the HTML output there is a link to the explanatory legend as well as to the XML output
containing all the generated information.
View data flow and example output.
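The distinction between constitutive and alternative exons can be stated compactly in code. The sketch below assumes that each transcript is given as a set of (begin, end) exon coordinates and that exons are compared by exact coordinates. Both are simplifications for illustration, and the transcript names are placeholders.

```python
def classify_exons(transcripts):
    """Label each exon "constitutive" if every transcript contains it, else "alternative"."""
    all_exons = set().union(*transcripts.values())
    labels = {}
    for exon in sorted(all_exons):
        in_all = all(exon in exons for exons in transcripts.values())
        labels[exon] = "constitutive" if in_all else "alternative"
    return labels


if __name__ == "__main__":
    # Placeholder transcripts with made-up genomic coordinates.
    example = {
        "transcript_1": {(100, 200), (300, 400), (500, 600)},
        "transcript_2": {(100, 200), (500, 600)},
    }
    for exon, label in classify_exons(example).items():
        print(exon, label)
```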
GOPET
A Tool for GO term prediction and validation
GOPET is a fully automated tool for assigning molecular
function or biological process terms to cDNA or protein sequences. It uses
Gene Ontology for the annotation terms, GO-mapped protein databases for
homology searches, and Support Vector Machines for the prediction and the
assignment of confidence values. GOPET provides an organism-independent prediction,
since the databases cover a broad variety of organisms and the selected
attributes are independent of the organism. It was shown previously that the
prediction quality was comparable to high-quality manual annotation and that a high
number of sequences could be annotated when compared to other systems.
View example output.
IntegrationMap can be used to determine and profile integration sites of
viruses or viral vectors at the chromosomal and genomic level. DNA sequences adjacent to
the viral 'long terminal repeat' (LTR), as well as the actual viral insertion site, can be
located exactly in the human genome. The output file displays information about hit or
neighbouring genes and hit or adjacent repetitive elements such as SINEs, LINEs, CpGs and
LTRs, together with their distances to the insertion site. Input sequences must start with
the first base following 5' of the LTR.
View example output.
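The distance report described above can be sketched as follows. The feature tuples and coordinates are made up for illustration. A distance of 0 is used here to mark a feature that is hit directly by the insertion site, which is a convention chosen for this example.

```python
def distance_to_site(site, start, end):
    """Distance from an insertion site to a feature; 0 if the site lies inside it."""
    if start <= site <= end:
        return 0
    return start - site if site < start else site - end


def annotate_site(site, features):
    """Return (distance, name) pairs for all features, nearest first."""
    return sorted((distance_to_site(site, s, e), name) for name, s, e in features)


if __name__ == "__main__":
    # Hypothetical features as (name, start, end) on one chromosome.
    features = [("geneA", 900, 1100), ("LINE_1", 1500, 1800), ("SINE_1", 200, 700)]
    print(annotate_site(1000, features))
```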
IntegrationSeq can be used to prepare raw files from a genetic analyzer for
mapping to the human genome. After an initial quality check, viral 'long terminal repeats' (LTRs),
adaptor sequences and cloning vector backbone sequences are recognized and cut off. Internal vector
sequences ('internal bands') are recognized as well.
The input should be a multiple FASTA sequence file. More detailed information is available.
View example output.
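The trimming step can be illustrated with a minimal sketch that removes everything up to the end of the LTR and everything from the adaptor onwards. Exact string matching and the short sequences used here are simplifying assumptions. A real pipeline would need to tolerate sequencing errors, and the actual IntegrationSeq method is not described in this text.

```python
def trim_read(read, ltr_end, adaptor):
    """Return the genomic insert between the LTR end and the adaptor, or None.

    The read is cut after the first occurrence of ltr_end; if the adaptor
    occurs downstream, the read is also cut where the adaptor begins.
    """
    read = read.upper()
    i = read.find(ltr_end)
    if i == -1:
        return None  # no LTR found, so the read cannot be anchored
    insert = read[i + len(ltr_end):]
    j = insert.find(adaptor)
    if j != -1:
        insert = insert[:j]
    return insert


if __name__ == "__main__":
    # Made-up LTR end and adaptor sequences.
    print(trim_read("ccTTGCAACGTACGTGATTACA", "TTGCA", "GATTACA"))
```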
miRpredict is a tool for the automatic identification of known and
potential new miRNAs in DNA sequences. The sequences are split into overlapping pieces
of miRNA-like size and compared to the known miRNAs of miRBase and to the organism-specific
non-coding RNAs of EnsEMBL. After localization on the genome, the genomic precursor
sequence is built, and potential miRNAs are identified by recognizing a palindrome and
classifying it as miRNA-like with a triplet-SVM classifier
(Xue, C. et al.; BMC Bioinformatics 6, 310, 2005).
View data flow and example output.
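Splitting a sequence into overlapping pieces is the simplest step of this pipeline to show in code. The window size of 22 and the step of 5 below are assumed defaults chosen for illustration (mature miRNAs are roughly 22 nucleotides long). The values miRpredict actually uses are not stated here.

```python
def overlapping_windows(seq, size=22, step=5):
    """Yield (offset, piece) pairs of length `size`, starting every `step` bases.

    A sequence shorter than `size` is returned as a single, shorter piece.
    """
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        yield start, seq[start:start + size]


if __name__ == "__main__":
    for offset, piece in overlapping_windows("ACGTACGTAC", size=4, step=3):
        print(offset, piece)
```

Depending on the sequence length and the step, the last few bases may not be covered by a full window. A production version would add a final window aligned to the end of the sequence.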
miRTaCa
miRNA Target Catcher - miRNA Target Prediction in UTR Regions
miRTaCa can be used to find miRNA binding sites in the 3' UTR of cDNAs.
UTR or cDNA sequences can be given as input. If a cDNA is given, the tool finds the 3' UTR and
checks it for miRNA binding sites with the programs MIRANDA, TARGETSCAN and RNAHYBRID. It also
looks for conserved regions in the UTR if the homologous gene of another organism can be found.
The results can be combined by an AND, OR, or MAJORITY algorithm. The input
should be a single sequence (or a multiple FASTA sequence file). The result is a summary
page listing the miRNA binding sites and the conserved sites that correspond to
miRNA binding sites in the 3' UTR.
View data flow and example output.
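The three combination modes can be sketched as a simple vote over the sites reported by each program. The sketch assumes that a site is identified by an exact (miRNA, position) pair. In practice, predictions from different programs would more likely be matched by overlap, and the names and positions below are placeholders.

```python
from collections import Counter


def combine_sites(predictions, mode):
    """Combine per-program site sets using "OR", "AND" or "MAJORITY" voting."""
    votes = Counter(site for sites in predictions.values() for site in sites)
    n = len(predictions)
    needed = {"OR": 1, "AND": n, "MAJORITY": n // 2 + 1}[mode]
    return sorted(site for site, count in votes.items() if count >= needed)


if __name__ == "__main__":
    # Placeholder predictions; miRNA names and positions are invented.
    predictions = {
        "miranda": {("mirA", 120), ("mirB", 300)},
        "targetscan": {("mirA", 120)},
        "rnahybrid": {("mirA", 120), ("mirB", 300), ("mirC", 45)},
    }
    for mode in ("OR", "AND", "MAJORITY"):
        print(mode, combine_sites(predictions, mode))
```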
PATH
Phylogenetic Analysis Task in HUSAR
PATH is a task for the inference of phylogenies
(del Val et al.).
It executes each of the three main phylogenetic
methods: maximum likelihood (using TREE-PUZZLE), pairwise distance
combined with Neighbor-Joining, and parsimony (using programs of the
PHYLIP package). According to recommendations by Jin and Nei (1990), it
automatically chooses the evolutionary model for each data set in
order to optimize the performance of Neighbor-Joining. The newly created
phylogenetic trees are then compared for consistency of the subgroups. The
output of the task shows the consensus trees together with the full
results obtained from all executed methods, as well as additional
information generated in the process. To find inconsistencies in the
input data, the splittability index of the split decomposition method
is evaluated.
View data flow and example output.
PrimerSweep finds primer pairs for PCR that match your input
sequence and a target region, or checks a given primer pair against
its target and the sequence. The task performs a quality check for
primer pairs by searching a user-defined database for all possible
PCR products of the primers. The result of PrimerSweep is a list of
primer pairs with melting temperature and GC content, and all possible
PCR products created either by two primers or by a single one binding
to the database sequences.
View example output.
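GC content and an approximate melting temperature can be computed directly from a primer sequence. The sketch below uses the Wallace rule (2 °C per A/T and 4 °C per G/C), a rough estimate suited to short oligonucleotides. It is swapped in for illustration only, since the text does not say which formula PrimerSweep uses.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCATGC"  # made-up 20-mer
    print(gc_content(primer), wallace_tm(primer))
```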
ProtSweep can be used for the analysis and possible
identification of newly obtained protein sequences. The result lists
protein features such as molecular weight, and reports predicted
secretory signals, the possible subcellular
localization, and the results of homology searches against general DNA
and protein sequence databases as well as against the protein family
databases Prosite and Blocks.
View XML schema documentation and example output.
PromoterSweep
Identification of Transcription Factor Binding Sites - Analysing
Promoter Sequences
PromoterSweep is an automated bioinformatics pipeline to
analyse promoter sequences and predict transcription factor binding sites.
It uses a combination of different tools: sequence comparison to promoter
databases, identification of transcription factor binding sites provided by the databases
Transfac and JASPAR, collection of orthologous sequences, and application of general motif
discovery tools. The results are combined and classified for reliability.
View data flow and example output.
SERpredict
A tool to predict tissue or tumor-Specific Exonised
Repetitive element containing isoforms
SERpredict is an automated bioinformatics pipeline to predict tissue-
or tumor-specific repetitive element (RE)-containing isoforms in human and mouse DNA.
SERpredict extracts all available exons of the input sequence found in the EnsEMBL
database and screens all of them for REs. For every RE-containing exon, it
aims to detect tissue- or tumor-specific isoforms caused by the exonization of the
repetitive element. To this end, all EST and mRNA sequences are extracted and a
statistical analysis is performed to classify potential tissue- or tumor-specific isoforms.
View data flow and example output.