OCCBIO 2006

Ohio Collaborative Conference on Bioinformatics (OCCBIO)

Connecting Ohio's Bioinformatics and Bioscience Research Leaders
Miami University, Oxford, Ohio, July 9-11, 2007

Session III Abstracts

Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data

X. Liu1,2, W. J. Jessen2, S. Sivaganesan3, B. J. Aronow2, and M. Medvedovic1,2*

1 Department of Environmental Health, University of Cincinnati, Cincinnati, Ohio; 2 Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, Ohio; 3 Mathematical Sciences Department, University of Cincinnati, Cincinnati, OH

Background
Transcriptional modules (TMs) consist of groups of co-regulated genes and the transcription factors (TFs) regulating their expression[1]. Two high-throughput (HT) experimental technologies, gene expression microarrays[2] and Chromatin Immuno-Precipitation[3] on Chip (ChIP-chip), can produce data that are informative about the mechanisms regulating expression on a genome scale. The optimal approach to jointly modeling the data generated by these two complementary biological assays, with the goal of identifying and characterizing TMs, is an important open problem in computational biomedicine.

Results
We developed and validated a novel probabilistic model and related computational procedures for identifying TMs by jointly modeling gene expression and ChIP-chip binding data. We demonstrate improved functional coherence of the TMs produced by the new method compared either with analyzing expression or ChIP-chip data separately or with alternative approaches for joint analysis[4-6]. We also demonstrate the ability of the new algorithm to identify novel regulatory relationships not revealed by ChIP-chip data alone. The new computational procedure can be used in much the same way as simple hierarchical clustering, without any special transformation of the data prior to the analysis. The R and C source code implementing our algorithm is incorporated in the R package gimmR, which is freely available at http://eh3.uc.edu/gimm.

Conclusions
Our results indicate that, whenever available, ChIP-chip and expression data should be analyzed within the unified probabilistic modeling framework, which will likely result in improved clusters of co-regulated genes and improved ability to detect meaningful regulatory relationships. Given the good statistical properties and the ease of use, the new computational procedure offers a worthy new tool for reconstructing transcriptional regulatory networks.

1. Ihmels J, Friedlander G, Bergmann S, Sarig O, Ziv Y, Barkai N: Revealing modular organization in the yeast transcriptional network. Nat Genet 2002, 31(4):370-377.
2. Schena M, Shalon D, Davis RW, Brown PO: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995, 270(5235):467-470.
3. Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E et al: Genome-wide location and function of DNA binding proteins. Science 2000, 290(5500):2306-2309.
4. Tanay A, Sharan R, Kupiec M, Shamir R: Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A 2004, 101(9):2981-2986.
5. Bar-Joseph Z, Gerber GK, Lee TI, Rinaldi NJ, Yoo JY, Robert F, Gordon DB, Fraenkel E, Jaakkola TS, Young RA et al: Computational discovery of gene modules and regulatory networks. Nat Biotechnol 2003, 21(11):1337-1342.
6. Lemmens K, Dhollander T, De Bie T, Monsieurs P, Engelen K, Smets B, Winderickx J, De Moor B, Marchal K: Inferring transcriptional modules from ChIP-chip, motif and microarray data. Genome Biol 2006, 7(5):R37.

Comparison of Statistical Techniques for the Analysis of Metabolic Toxicological Data Derived from NMR Spectroscopy

Nuclear magnetic resonance (NMR) spectroscopy is a non-invasive method of acquiring a metabolic profile from biofluids. Identifying biomarkers from these profiles may provide keys to the early detection of exposure to a toxin. Two common features of NMR data sets are small sample size and a large number of variables (i.e. high dimensionality). The high dimensionality arises from each sample spectrum being divided into a large number of regions, each of which is a dimension. Pattern recognition techniques can then be used to identify biomarkers from a data set that consists of metabolic profiles from a small number of samples. A typical first step of this analysis is to individually identify responsive spectral regions, followed by associating these regions with metabolites and biomarkers. In this paper, we evaluate several common alternatives to identify responsive regions, including the fold test, paired t-test, and logistic regression. Further, when performing these types of analyses, the issues of multiple-comparisons and false positive rates must be addressed. We compare several corrections for these issues including the Bonferroni, Holm’s, Westfall and Young, permutation, and bootstrap methods. The results of these statistical tests in combination with the multiple-comparison corrections were compared on both a simulated data set and an NMR-derived toxicology data set. Based on these results, we present a statistical protocol for determining putative biomarkers, designed to mitigate the low sample size, high dimensionality, and false positive issues associated with NMR data.
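Two of the corrections named above, Bonferroni and Holm's, can be illustrated with a minimal sketch. The p-values and the significance level below are invented for illustration and are not taken from the study; the function names are likewise hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value
    (k = 0, 1, ...) with alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Hypothetical per-region p-values (illustrative only).
p_values = [0.001, 0.012, 0.02, 0.04, 0.30]
print(bonferroni(p_values))
print(holm(p_values))
```

On these invented values Holm's procedure rejects one more hypothesis than Bonferroni, which reflects its being uniformly no more conservative at the same family-wise error level.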

Querying with the Gene Ontology and Its Annotations

Dr. Valerie V. Cross and Yi Sun, Miami University, Oxford OH

With the rapid development of computational biology, many large databases have been built to describe genomic information and the associated experimental data. To keep the biological concepts referenced in different databases consistent, the Gene Ontology (GO), developed by the Gene Ontology Consortium [1], describes biological concepts and their relationships in a species-independent manner. Biologists have been using GO terms to annotate genes in various databases. These annotations create a mapping between the GO and gene products, and they are being used to determine the similarity between genes and gene products, an important task in post-genomic studies. For example, gene similarity measurement is used in validating high-throughput protein interaction data [2], aiding the creation of new pathway modeling tools and clustering methods [3], and facilitating the detection of functionally related gene products independent of homology [4].

Numerous tools have been developed to analyze gene product data. For example, GeneInfoViz [5] is a web-based tool used to retrieve gene information and to construct and visualize gene relation networks based on the GO. The Gene Ontology Categorizer [6] uses the GO terminology to summarize or categorize an input set of genes. A goal of this research is to develop a system for QUerying with Ontological Terminologies and their Annotations (QUOTA) [7] that is applicable to all domains having an ontology of annotating terms and files or databases of annotated objects. This presentation first gives a brief overview of ontologies, using the Gene Ontology as a concrete example. Since a central component of QUOTA is the measurement of similarity between annotated objects, the variations on similarity with respect to fuzzy set theory, ontological semantics, and fuzzy measure theory are described next, along with an initial experiment comparing the similarity of 21 annotated gene products [8] using several combinations of the QUOTA similarity components. A synopsis of the current querying capabilities of QUOTA is then presented. The presentation concludes with a discussion of current and future research plans and solicits ideas for collaborative research on how to adapt or enhance QUOTA for widespread use in computational biology.
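As a rough illustration of annotation-based similarity, the sketch below compares two annotated objects by the Jaccard overlap of their term sets, once directly and once after adding each term's ancestors in the ontology. This is a generic stand-in, not the QUOTA similarity components; the toy ontology, the gene annotations, and the function names are all hypothetical.

```python
def with_ancestors(terms, parents):
    """Return the terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy is-a hierarchy (child -> parents); illustrative only.
parents = {
    "DNA binding": ["binding"],
    "RNA binding": ["binding"],
    "binding": ["molecular_function"],
}

gene_a = {"DNA binding"}  # hypothetical annotations
gene_b = {"RNA binding"}

direct = jaccard(gene_a, gene_b)
semantic = jaccard(with_ancestors(gene_a, parents),
                   with_ancestors(gene_b, parents))
print(direct, semantic)
```

The two toy genes share no term directly, yet they become partially similar once shared ancestors are counted, which is the basic reason an ontology-aware measure is preferred to plain term matching.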

[1] Gene Ontology Consortium, http://www.geneontology.org/
[2] X. Guo, C. D. Shriver, H. Hu, M.N. Liebman, “Semantic similarity based validation of human protein-protein interactions,” Proc. Computational Systems Bioinformatics Conference, pp. 149-150, 2005.
[3] M. Popescu, J. Keller, J. Mitchell, and J. Bezdek, “Functional Summarization of Gene Product Clusters Using Gene Ontology Similarity Measures”, Proc. Int. Conference on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, Australia, December, 2004, pp. 553-559.
[4] F. Azuaje, H. Wang, and O. Bodenreider, “Ontology-driven similarity approaches to supporting gene functional assessment,” In Proceedings of the ISMB’2005 SIG meeting on Bio-ontologies 2005:9-10.
[5] M. Zhou and Y. Cui, “GeneInfoViz: constructing and visualizing gene relation networks,” In Silico Biol. 4(3):323-33, 2004.
[6] C. A. Joslyn, S. Mniszewski, A. Fulmer, G. Heaton, “The Gene Ontology Categorizer,” Bioinformatics. Aug 4;20 Suppl 1:I169-I177, 2004.
[7] Y. Sun, “Querying with Ontological Terminologies and their Annotations,” Master's Thesis, May 2007, Computer Science and Systems Analysis, Miami University, Oxford, OH.
[8] M. Popescu, J. Keller, and J. Mitchell, “Fuzzy Measures on the Gene Ontology for Gene Product Similarity,” IEEE/ACM Transactions on computational biology and bioinformatics, vol. 3, no. 3, pp. 263-274, July/Sept 2006.

Kolmogorov-Smirnov Based Scores for Protein Identification Using Peptide Mass Fingerprinting

Rachana Jain, Department of Biomedical Engineering, University of Cincinnati
Michael Wagner, Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation

Peptide Mass Fingerprinting (PMF) has increasingly gained acceptance as a primary and fast method for protein identification since the early 1990s. PMF is based on the principle that masses of the constituent peptides of a protein may provide a unique fingerprint/map which can be used to identify the protein by comparison with a database of theoretical protein digests. However, various factors, such as the presence of contaminants in the sample, limited databases and post-translational modifications of proteins complicate the task for PMF and limit its success as a protein identification method.

The crucial ingredient in PMF methods is the definition of a scoring function that can accurately distinguish between random hits and true positives. Current database search tools such as MASCOT and ProFound use the number of matches (hits) between experimentally determined peptide masses and the theoretical digest of a database protein as the primary parameter in their scoring functions. Our work focuses on systematically evaluating a number of quality measures (some of which are novel) that measure the degree to which an experimental peak list matches a theoretical digest.

One novel quality measure we investigate here is based on the non-parametric Kolmogorov-Smirnov test. We propose finding the peptide in the theoretical digest that is closest in mass for each mass spectral peak. We then compare the resulting cumulative mass error distribution to a background distribution of false-positive proteins of similar size and compute the one-sided non-parametric Kolmogorov-Smirnov (KS) statistic as a score to indicate how different this distribution is from a random distribution.
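The scoring idea in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: mass errors are taken as absolute differences, the background is represented simply as a pooled list of errors, and every mass and error value below is invented.

```python
from bisect import bisect_left, bisect_right


def nearest_mass_errors(peaks, digest_masses):
    """For each peak, the absolute error to the closest theoretical mass.
    Assumes digest_masses is non-empty."""
    digest = sorted(digest_masses)
    errors = []
    for peak in peaks:
        i = bisect_left(digest, peak)
        candidates = digest[max(i - 1, 0):i + 1]  # neighbours on both sides
        errors.append(min(abs(peak - mass) for mass in candidates))
    return errors


def ecdf(sorted_sample, x):
    """Fraction of the sample that is <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_one_sided(observed, background):
    """One-sided two-sample KS statistic: max over x of F_obs(x) - F_bg(x).
    Large values mean the observed errors are concentrated at smaller
    values than the background errors."""
    obs, bg = sorted(observed), sorted(background)
    return max(ecdf(obs, x) - ecdf(bg, x) for x in obs + bg)


# Invented example data (illustrative only).
peaks = [1000.52, 1500.80, 2000.10]
digest = [1000.50, 1500.75, 2000.00, 2500.00]
background_errors = [0.3, 0.6, 0.9, 1.2]

errors = nearest_mass_errors(peaks, digest)
score = ks_one_sided(errors, background_errors)
print(errors, score)
```

Because both empirical distribution functions are step functions, evaluating the difference at the pooled data points is enough to find its maximum.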

Using publicly available curated PMF datasets from yeast, we compared the relative performance of the KS score to the simpler statistic of the number of hits given a particular mass tolerance. The KS score ranked 266 of 313 proteins correctly in the top position when searched against a database of 3795 non-redundant proteins, outperforming all other quality measures. By comparison, the score based on the number of peptide matches identified only 198 proteins correctly. Furthermore, decision trees trained on the same data sets consistently identified the KS score as the feature with the maximum information gain.

These results, while still preliminary and on a limited dataset, demonstrate that the KS test outperforms traditional measures in identifying the correct protein. We propose that the KS score, especially when coupled with other features and machine-learning type algorithms (which we are currently exploring), has the potential of improving upon the current state of protein identity prediction using PMF. Furthermore, we note that the methodology is extensible to MS/MS data.

A novel approach in identifying spurious and chimeric sequences in dbEST

Alex Kloft, Yuansheng Liu, Lin Liu and Chun Liang, Department of Botany, Miami University, Oxford

dbEST is the most rapidly growing database dedicated to expressed sequence tag (EST) sequences. As of May 25, 2007, there were 43,342,964 entries deposited in dbEST, covering 1,320 different species of model or non-model organisms. While EST data are widely used in many genome characterization approaches, including gene discovery, gene expression profiling, polymorphism detection, and genomic sequence annotation, they represent the most serious challenge in data veracity. Due to imperfections in molecular biology manipulation during cDNA library construction and errors in sequencing procedures, it is estimated that base ambiguity rates of about 3% and spurious sequence contaminations exist in EST sequences. To date, no bioinformatics program has been developed to explore systematically and comprehensively the data abnormalities in the enormous, ever-growing dbEST data.

Recently, we published WebTraceMiner (http://www.conifergdb.org/software/wtm), a unique public web service for processing and mining raw EST sequencer trace files by focusing on the sequence features that characterize and annotate the 3′ and/or 5′ termini of cDNA inserts. Using WebTraceMiner, we reprocessed 172,229 loblolly pine EST trace files downloaded from the NCBI Trace Archive and created the ConiferEST database (http://www.conifergdb.org/coniferEST.php), the first public EST resource that allows biologists to explore both the complexity and abnormality of ESTs in terms of terminus structures. It is clear to us that terminus determination is important to data quality control and validation of error-prone ESTs, and that double-termini adapters appear to be good indicators of EST chimeras. In this study, we extend our research to the whole of the dbEST sequence data, based on the assumption that many cDNA libraries have adopted the same or a similar construction protocol using EcoRI and XhoI as restriction enzyme sites. Among all 43,342,964 entries, we detected about 0.72% of sequence reads that have a 5′ terminus in the sense strand (5TSS), a 3′ terminus in the sense strand (3TSS, containing a polyA tail), a 5′ terminus in the non-sense strand (5TNS, containing a polyT tail), or a 3′ terminus in the non-sense strand (3TNS) in perfect matching patterns, while about 0.13% of sequence reads have double-termini adapters. If one base error is allowed in pattern matching, these figures become 2.16% and 0.17%, respectively.
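The difference between perfect matching and matching with one base error allowed can be sketched as below. The exact terminus patterns used in the study are not given here, so the pattern in this example (a run of eight A's followed by the XhoI recognition site CTCGAG) and both reads are hypothetical, and the function is an illustration rather than the WebTraceMiner code.

```python
def find_with_mismatches(read, pattern, max_mismatches=0):
    """Start positions where pattern aligns to read with at most
    max_mismatches substitutions (no insertions or deletions)."""
    hits = []
    m = len(pattern)
    for start in range(len(read) - m + 1):
        mismatches = 0
        for a, b in zip(read[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # inner loop finished without exceeding the limit
            hits.append(start)
    return hits


# Hypothetical 3' terminus pattern: poly(A) run + XhoI site (CTCGAG).
pattern = "AAAAAAAA" + "CTCGAG"

# Invented reads: one carries the pattern exactly, one has a single
# substitution inside the restriction site (CTCGAG -> CTCGTG).
read_exact = "GGCTTACG" + "AAAAAAAA" + "CTCGAG" + "TT"
read_one_error = "GGCTTACG" + "AAAAAAAA" + "CTCGTG" + "TT"

print(find_with_mismatches(read_exact, pattern))
print(find_with_mismatches(read_one_error, pattern))
print(find_with_mismatches(read_one_error, pattern, max_mismatches=1))
```

Relaxing the tolerance recovers the read with the single substitution, which mirrors why the reported percentages increase when one base error is allowed.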

We conclude that many sequence reads in dbEST can be cleaned by unambiguously determining their terminus structures and accurately extracting their final cleaned sequences. EST termini information will help identify and highlight EST chimeras, as well as other abnormalities, and so reduce the cascading, deleterious impact of the spurious and chimeric sequences in dbEST on many downstream EST analyses (e.g., NCBI UniGene).
