Today's biologists face a problem similar to the little mermaid's: their treasure trove of protein structures is exciting, but of little use if they don't know what those proteins are for.
The Protein Data Bank, where data about proteins with known structures are stored, contains tens of thousands of proteins that scientists can freely access. But in many of these cases, the biological purpose of those proteins is unknown.
"Within a human cell there are on the order of 25,000 proteins and each of those proteins has a specific function," said Robert Powers, a biochemical researcher at the University of Nebraska-Lincoln.
Powers estimates that researchers have determined the purpose of between 40 and 50% of those proteins. The Powers Research Group hopes to change that with the help of their CPASS (Comparison of Protein Active Site Structures) system and the power of grid computing.
Finding the right keyhole
Proteins interact with other molecules via ligand binding sites. You can think of these sites as keyholes or locks - only molecules that are the right shape can use them to attach to a protein, just as only a key with the right shape will open a lock.
These "keyholes" are not unique to each protein. Many proteins have similar or identical ligand binding sites, which means that they can bind to the same molecules.
The bad news is that if the molecule in question is a drug, it may latch onto proteins that the drug designers were not targeting, causing undesirable side effects. The good news is that if two binding sites are similar, what we know about one of them can tell us a lot about the other.
The CPASS database contains all of the experimentally determined binding sites known so far - approximately 36,000. The CPASS software uses that database to learn about proteins with unknown functions.
"We compare that entire database against one new, novel, experimentally determined binding site, and try to find a match," Powers explained. When a match turns up, it suggests that proteins with the new binding site may serve the same function as proteins with the matching binding site.
Passing CPASS on
The concept of the CPASS system was born sometime around 2003; the group's first paper was published in 2006. The CPASS software must compare all possible orientations of the 3-D structure of the new ligand binding site with the entire database. To save time, it starts by rotating the structure in large angular steps, then refines the search mesh with smaller steps as the program identifies areas of potential interest. With access to only the group's in-house 16-CPU cluster, each new site search took approximately 24 hours to complete.
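The coarse-to-fine idea can be illustrated with a minimal Python sketch. It is not the CPASS code: it assumes rotation about a single axis rather than all 3-D orientations, and it uses a plain root-mean-square distance between paired points as a stand-in for whatever similarity score CPASS actually computes. All function names and parameters here are hypothetical.

```python
import math


def rotate_z(points, angle_deg):
    """Rotate a list of (x, y, z) points about the z axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def rmsd(p, q):
    """Root-mean-square distance between two equally sized, index-paired point lists."""
    total = sum((a - b) ** 2 for u, v in zip(p, q) for a, b in zip(u, v))
    return math.sqrt(total / len(p))


def coarse_to_fine(query, target, step=30.0, min_step=0.5):
    """Return (angle, score) for the z rotation of `query` that best overlays `target`.

    Assumption: lower RMSD means a better match. A real binding-site
    comparison would use a richer score and search all three rotation axes.
    """
    def score(angle):
        return rmsd(rotate_z(query, angle), target)

    # Coarse pass: sample the full circle in large steps.
    n_samples = int(round(360.0 / step))
    best = min((i * step for i in range(n_samples)), key=score)

    # Refinement: halve the step and search only around the current best angle.
    while step > min_step:
        step /= 2.0
        best = min((best - step, best, best + step), key=score)

    return best % 360.0, score(best)


if __name__ == "__main__":
    # Toy data: the "target" is simply the query rotated by a known angle.
    site = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.5), (-1.5, 0.5, 1.0), (0.5, -1.0, -0.5)]
    angle, err = coarse_to_fine(site, rotate_z(site, 47.0))
    print(f"best angle: {angle:.3f} degrees, RMSD: {err:.4f}")
```

With a 30-degree starting step, the coarse pass needs only 12 evaluations, and each refinement round adds three more. Sampling the whole circle at the final resolution would require several hundred. The saving grows quickly once all three rotation axes are searched, which is presumably why a strategy of this kind pays off across a database of tens of thousands of sites.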
Not only was that slow, but it was getting slower. Over the years, the CPASS database had grown by approximately 40%. The Powers team also wanted to open up the software and database to more users; despite their web portal, only a tiny percentage of CPASS users came from outside the Powers Group.
A new version with access to more computing power would be necessary if they wanted to speed up searches, attract more users, and keep pace with their growing database. To accomplish their goals, they worked hand in hand with computing experts from the University of Nebraska-Lincoln's Holland Computing Center, including Adam Caprez.
"They already had a web interface built; it just submitted to their cluster," Caprez explained. "We maintained the look and feel of the web interface, but we had to change the web form to submit Condor jobs."
Today, CPASS submits jobs to the Open Science Grid via Glidein, and the jobs that once required a day to complete are finished in only about an hour. Being on the grid has also freed the Powers Research Group to start reaching out to new potential users; their portal is free for academic users who register. Their outreach efforts via OSG, and at academic conferences, have been successful.
Said Powers, "By going out to the grid now we have a couple hundred users, and we get new users just about every day."