OCCBIO 2006

Ohio Collaborative Conference on Bioinformatics (OCCBIO)

Connecting Ohio's Bioinformatics and Bioscience Research Leaders
Miami University, Oxford, Ohio, July 9-11, 2007

Session I Abstracts

Predicting Protein Functions Using Decision Trees

Venkata Yedida, Chien-Chung Chan, Zhong-Hui Duan, Department of Computer Science, University of Akron

The Human Genome Project and numerous other genome projects have produced a large and ever-increasing amount of sequence data. One of the main research challenges in the post-genomic era is to understand the relationship between the nucleotide sequences of genes and the functions of the proteins they encode. In this study, we develop an automated protein function prediction system based on a set of homologous proteins and Gene Ontology categories. A novel measure based on a set of best local alignments is used to identify the homologues. The biological functions of the homologous proteins are characterized with Gene Ontology annotations. Protein function is then predicted by a data mining model that uses decision trees. The model was trained and tested on the complete proteome of the model organism yeast. We show that the decision tree model is fairly easy to implement and analyze and can be used as an effective tool for protein function prediction. We present the accuracy and stability of the decision tree model for yeast protein function prediction.
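The abstract does not give the details of the decision tree model, so the following is only a minimal sketch of the core step any such tree relies on: choosing the feature whose split yields the highest information gain. The feature names and toy data are hypothetical; they stand in for "do the homologues of this protein carry a given Gene Ontology annotation".

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one binary (0/1) feature."""
    total = len(labels)
    remainder = 0.0
    for value in (0, 1):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        if subset:
            remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Feature a decision tree would test first (highest information gain)."""
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))


# Hypothetical toy data: 1 means the protein's homologues carry the annotation.
rows = [
    {"homologues_have_kinase_term": 1, "homologues_have_membrane_term": 0},
    {"homologues_have_kinase_term": 1, "homologues_have_membrane_term": 1},
    {"homologues_have_kinase_term": 0, "homologues_have_membrane_term": 1},
    {"homologues_have_kinase_term": 0, "homologues_have_membrane_term": 0},
]
labels = ["kinase", "kinase", "other", "other"]

print(best_split(rows, labels))
```

In this toy set the kinase feature separates the two classes perfectly, so it is the one selected; a full tree would repeat the same choice recursively on each subset.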

Lipid Accessibility Prediction in Membrane Proteins Using Low Complexity Regression Models

Mukta Phatak1, Baoqiang Cao2, Michael Wagner3, Jarosław Meller4,5

1Department of Biomedical Engineering, University of Cincinnati, 2University of Nebraska-Lincoln, 3Division of Biomedical Informatics, Cincinnati Children’s Hospital Research Foundation, 4Department of Environmental Health, University of Cincinnati College of Medicine, 5Department of Informatics, Nicholas Copernicus University, Poland

Predicting the 3D structure of a protein from its amino acid sequence is a multi-step process and remains one of the main challenges in computational biology. One important intermediate step towards that larger goal is the prediction of structural attributes such as secondary structure, relative solvent accessibility (RSA), and residue contact number. These attributes can be used to facilitate protein structure prediction and subsequent functional annotation. Previously, we developed novel methods for RSA prediction using both neural network based and support vector regression (SVR) based approaches.

Here, we propose to extend these efforts to membrane proteins. In particular, we develop novel methods to predict the relative lipid accessibility (RLA) of an amino acid residue in a membrane domain, which represents the lipid-exposed surface area of that residue in relative terms. By analogy with RSA prediction in soluble proteins, the problem of predicting RLA from the amino acid sequence can be cast as a regression problem and solved using machine learning techniques. The critical difference between soluble and membrane proteins, which makes the latter significantly more challenging, is the relatively small number of high-resolution structures from which to learn. It is thus essential to carefully design and evaluate compact representations and simple models for RLA prediction. In this work, we use low-complexity Support Vector Regression (SVR) approaches that are suitable for training on the limited number of structurally resolved membrane proteins. Moreover, we develop flexible SVR-based models to represent the uncertainty of RLA assignments for residues at the membrane-water interfaces. Using cross-validation on a non-redundant set of alpha-helical membrane domains, we estimate that our methods yield correlation coefficients between the observed and predicted RLAs of about 0.5. We conclude that RLA prediction methods already show promise for further applications to structure prediction and identification of membrane domain interactions.
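As an illustration of the evaluation described above, here is a minimal sketch of how a relative accessibility value and the correlation between observed and predicted values could be computed. It assumes that "relative" means exposed area divided by a per-residue reference maximum, and that the reported correlation is a Pearson coefficient; the abstract states neither explicitly, and the function names are hypothetical.

```python
from math import sqrt


def relative_accessibility(exposed_area, max_area):
    """Exposed area as a fraction of a reference maximum, clipped to [0, 1]."""
    return max(0.0, min(1.0, exposed_area / max_area))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-residue values for one membrane helix.
observed = [0.10, 0.45, 0.80, 0.30, 0.65]
predicted = [0.20, 0.40, 0.60, 0.50, 0.55]
print(round(pearson(observed, predicted), 3))
```

In a cross-validation setting, the coefficient would be computed only on residues from proteins held out of training.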

Genome-wide analysis of alternative promoters using a custom promoter tiling array platform

Gregory A. C. Singer, Jiejun Wu, Pearlly Yan, Christoph Plass, Tim H.-M. Huang, and Ramana V. Davuluri

Although examples of multiple-promoter genes have been known for over a decade, the one-gene-one-promoter model still dominates, despite many independent lines of evidence showing that alternative promoter usage is very common in the human genome. For example, projects like ECgene and Acembly, which have undertaken the massive task of aligning all sequenced ESTs to the human genome, suggest that more than half of human genes have more than one transcription start site (TSS). Corroborating evidence was recently provided by the Riken group, who performed cap analysis of gene expression (CAGE), generating millions of ~20mer tags from the 5' ends of mRNAs. When mapped back to the genome, these tags indicated the presence of thousands of previously unknown TSSs. Even among the UCSC Known Genes (a set of high-quality gene annotations), 28% of human genes have more than one TSS, and over a thousand genes have more than three annotated TSSs.

Each promoter--especially those separated by hundreds of bases--possesses its own core promoter elements, transcription factor binding sites, and epigenetic environment (including histone modifications and CpG methylation). The promoters can therefore act quite independently of each other, and understanding which promoter is employed in which cellular condition is key to unraveling gene regulatory networks within the cell. To this end, we have annotated all putative promoters in the human genome by integrating ab initio promoter predictions from the program FirstEF, UCSC Known Gene annotations, and CAGE tag evidence. We then designed a custom genome tiling microarray platform that uses 244,000 probes to cover roughly 35,000 putative promoters from a subset of 7,000 genes in the human genome. To demonstrate the utility of this platform, we have analyzed the pattern of promoter usage in the heavily studied MCF7 breast cancer cell line in both control and estradiol-treated conditions. Many promoters that were previously considered putative were found to be active, suggesting that a large number of promoters in the human genome remain undiscovered. These novel promoters were found to occur throughout the length of the gene, from upstream of the current most 5' annotated TSS all the way to the 3'-UTR. Clearly, many of these isoforms encode truncated proteins or non-coding RNAs. The role these unusual isoforms play within the cell is still unknown, but most intriguingly we found a strong tendency for the downstream promoter in E2-sensitive multiple-promoter genes to be close to the 3'-terminus of the gene sequence. We hypothesize that these 3'-located promoters may encode small interfering RNAs, or may simply act to block progression of the RNA polymerase II complex initiated from a more upstream promoter.

Exploring Structural Implications of Positional Dependencies in Protein Sequence Alignments

Hatice Gulcin Ozer, Biophysics Graduate Program, The Ohio State University; William C. Ray, Children's Research Institute and The Department of Pediatrics, The Ohio State University

Predicting physical distances between amino acids in protein alignments provides invaluable information for anticipating their complete 3-dimensional structure. Extracting constraints using only sequence information is an indispensable direction, since the number of known protein or nucleic acid sequences grows much faster than the number of known 3-dimensional structures. Detecting interpositional dependencies within the multiple sequence alignments of protein families and understanding their physical consequences will be a big step in this direction. In studying positional dependencies, we observed that dependencies are often the result of physical proximity. Since physicochemical interactions between many residues in the biomolecule are involved in proper folding and functioning, dependencies amongst some positions are to be expected. Therefore, identifying statistically significant interpositional dependencies within a family alignment will help researchers determine constraints on family structure. In this study, we examined the critical parameters of interpositional dependencies, also called pairwise correlations, to estimate structurally important residues for family alignments.
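The abstract does not say which statistic is used to score interpositional dependencies. As a hedged illustration, the sketch below uses mutual information between two alignment columns, a commonly used measure of pairwise correlation; the toy alignment and function name are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length aligned sequences.
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)
    counts_i = Counter(seq[i] for seq in alignment)
    counts_j = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), n_ab in pair_counts.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (n_ab / n) * log2(n_ab * n / (counts_i[a] * counts_j[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 1 co-vary, column 2 is conserved,
# and column 3 varies independently of column 0.
alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]

for i, j in [(0, 1), (0, 2), (0, 3)]:
    print(i, j, mutual_information(alignment, i, j))
```

High-scoring column pairs would then be candidates for residues in physical proximity; real alignments also need a significance test, since small families inflate the score by chance.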

Word Seeker: Discovering Genome‐wide Patterns

Lonnie Welch1,2,3, Eric Petri1, Dazhang Gu1, Klaus Ecker1

1School of Electrical Engineering and Computer Science, 2Biomedical Engineering Program, 3Molecular and Cellular Biology Program, Ohio University

The purpose of most genomic information is unknown. This limits our ability to understand and address problems that have genetic causes. Does the ‘junk’ portion of genomes have biological meaning? If so, what is the meaning? What are the biological words, phrases, grammar, etc.? The answers will lead to a more complete understanding of the purpose of the genome and the functions of undiscovered genomic elements. This knowledge will help to cure problems that are due to genetic causes. We have implemented a Word Seeker tool as illustrated in two data flow diagrams below. Using suffix tree and Teiresias algorithms, the tool discovered elements in Arabidopsis (a model plant genome) that occur with unexpected frequencies in the ‘junk’ portion of the genome and are statistically overrepresented. Such elements may form biological words, phrases, and grammar that have biological functions. As one biologist put it, there is no junk DNA.
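The idea of a statistically overrepresented word can be sketched as follows. This is not the Word Seeker implementation: it replaces the suffix tree and Teiresias algorithms with plain fixed-length word counting, and it assumes a simple background model in which bases are independent with the composition of the input sequence. The function names and the ratio threshold are hypothetical.

```python
from collections import Counter


def count_words(sequence, k):
    """Count every (overlapping) word of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def expected_count(word, sequence):
    """Expected occurrences of `word` if bases were drawn independently
    with the same base frequencies as `sequence`."""
    n = len(sequence)
    base_freq = Counter(sequence)
    probability = 1.0
    for base in word:
        probability *= base_freq[base] / n
    return probability * (n - len(word) + 1)


def overrepresented(sequence, k, min_ratio=2.0):
    """Words whose observed/expected ratio is at least `min_ratio`."""
    result = {}
    for word, observed in count_words(sequence, k).items():
        ratio = observed / expected_count(word, sequence)
        if ratio >= min_ratio:
            result[word] = ratio
    return result


# Hypothetical toy sequence; real input would be intergenic genome regions.
sequence = "ACGTACGTTTGACGTAAC"
for word, ratio in sorted(overrepresented(sequence, 4).items()):
    print(word, round(ratio, 1))
```

A ratio threshold is only a rough filter; a production tool would use a proper significance test and a richer background model.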
