Structural biologists who have spent months or even years trying to determine the structure of intractable proteins may be able to find solutions overnight, thanks to the power of grid computing.
Many advances in medicine and biology depend on understanding how proteins interact with each other and other factors such as drugs, RNA, or DNA. But to do that, researchers must determine how proteins are shaped. It isn't enough to know a protein's sequence; the curly strands of a protein can bend and twist in any number of directions, changing the way it will interact with its environment.
To solve the structure of an unknown protein, structural biologists begin by crystallizing the protein. Then they visit a synchrotron, where they place their crystal in front of a detector. When the extremely powerful synchrotron beam hits the crystal, the protein inside diffracts the beam, resulting in a unique pattern of X-rays hitting the detector. The detector records the intensity of the beam and the pattern it forms, but not the beam's phase; all three are necessary to solve the protein's structure.
Researchers can calculate the phase information by approximating from a similar molecule that has already been solved.
"When you try to solve a new structure, in many cases part of the structure would be in some way similar to another molecule that had been previously determined," said Piotr Sliz, the principal investigator for the Structural Biology Grid.
Researchers feed the structure of a similar known protein, and the detector data from the unknown protein, into a computer program. If the known protein is close enough in structure to the unknown, the program can solve the structure of the unknown protein using a process called molecular replacement.
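The reasoning above can be illustrated with a deliberately simplified, one-dimensional toy. It is only a sketch under stated assumptions: real molecular replacement works on three-dimensional diffraction data and searches over rotations and translations of the known model, and the two "densities" below are invented numbers, not real proteins. The toy shows two things: different densities can produce identical measured amplitudes (so the phase really is lost), and phases borrowed from a similar known model recover the unknown far better than no phases at all.

```python
import cmath
import math

def dft(signal):
    """Discrete Fourier transform of a 1D signal (toy stand-in for diffraction)."""
    n = len(signal)
    return [sum(signal[x] * cmath.exp(-2j * math.pi * k * x / n) for x in range(n))
            for k in range(n)]

def inverse_dft(coeffs):
    """Inverse transform; returns the real part of each reconstructed sample."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * x / n) for k in range(n)) / n).real
            for x in range(n)]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Hypothetical 1D "electron densities" (illustrative values only).
unknown = [0, 1, 4, 2, 0, 0, 3, 1]
similar = [0, 1, 3, 2, 0, 0, 3, 2]        # a known model that resembles the unknown
shifted = unknown[3:] + unknown[:3]       # same shape, different position

# The detector records amplitudes only; the phases are not measured.
measured_amplitudes = [abs(f) for f in dft(unknown)]
shifted_amplitudes = [abs(f) for f in dft(shifted)]

# Borrow the missing phases from the similar, already-solved model.
borrowed_phases = [cmath.phase(f) for f in dft(similar)]

without_phases = inverse_dft([complex(a, 0) for a in measured_amplitudes])
with_borrowed = inverse_dft([cmath.rect(a, p)
                             for a, p in zip(measured_amplitudes, borrowed_phases)])

error_without_phases = rms_error(unknown, without_phases)
error_with_borrowed = rms_error(unknown, with_borrowed)
print(f"RMS error, no phases:       {error_without_phases:.2f}")
print(f"RMS error, borrowed phases: {error_with_borrowed:.2f}")
```

In this toy the borrowed-phase reconstruction lands much closer to the unknown than the phase-free one, which mirrors why a sufficiently similar known structure is so valuable.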
Normally, researchers know that their unknown protein is similar to a protein that has been solved and saved in the Protein Data Bank. Sometimes, however, researchers know very little about their protein, or the known proteins in the subset they've tried were not sufficiently similar. When that happens, they must turn to entirely different methods that would take weeks or even months of costly human time in the lab.
Invoking the grid
By using computational resources to compare unknown proteins to thousands of known proteins in only a few hours' time, Ian Stokes-Rees and his colleagues hoped to help structural biologists solve the structure of intractable proteins. Before they could convince biologists to use their application, however, they needed to provide proof of concept. They tested their process by attempting to solve protein structures that had been solved in 2008 using methods other than molecular replacement.
"About a year ago we stepped back and looked at the results that we'd collected from doing this process about a thousand times," Stokes-Rees said. "The evidence we had one year ago was not that compelling."
They decided to make two major changes to their method. First, they started using a different program called Phaser for the molecular replacement. Although Phaser is about ten times slower than the program they had been using, they hoped that it would produce better results. Second, they resolved to compare unknown proteins to the entire Protein Data Bank, rather than a large subset of the data bank.
These changes increased their computational needs from roughly 100 hours per protein to about 20,000 hours, or from 4,000 computations to 100,000 per protein. Each run generates between 10 and 20 GB of data.
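A quick back-of-the-envelope check helps put those figures in perspective. The sketch below uses only the rounded numbers quoted above, so the results are order-of-magnitude estimates rather than measured values.

```python
# Rounded figures quoted in the article (per protein).
old_hours, old_computations = 100, 4_000
new_hours, new_computations = 20_000, 100_000

# Average cost of a single computation, in minutes.
old_minutes_each = old_hours * 60 / old_computations
new_minutes_each = new_hours * 60 / new_computations

# How much of the growth comes from slower runs versus more runs.
slowdown_per_computation = new_minutes_each / old_minutes_each
growth_in_computations = new_computations / old_computations
growth_in_hours = new_hours / old_hours

print(f"Before: {old_minutes_each:.1f} min per computation")
print(f"After:  {new_minutes_each:.1f} min per computation")
print(f"Each computation is about {slowdown_per_computation:.0f}x slower, "
      f"and there are {growth_in_computations:.0f}x as many of them, "
      f"for a {growth_in_hours:.0f}x increase overall.")
```

On these numbers each computation takes roughly eight times longer, which is in line with Phaser being "about ten times slower," while the switch to the full data bank multiplies the number of computations by 25.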
"We had to work pretty closely with the Open Science Grid people to determine how we could best manage the data, how we could manage the number of jobs, and how we could get enough computing time to complete these in less than a week," Stokes-Rees said.
They created a wrapper for Phaser, which is written in Fortran, and hooked it up to DAGMan, which served as a workflow management system. With this setup, they could send jobs to OSG. But they were still running into difficulties, as jobs would take nearly as long to start up as they took to run.
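To give a sense of what such a workflow description might look like, here is a minimal sketch that writes a DAGMan input file with one molecular replacement job per candidate model and a final job that runs once all of them have finished. The layout is an assumption, not the team's actual workflow: the submit file names (`phaser_wrapper.sub`, `rank_results.sub`), the `model_id` variable, the retry count, and the model identifiers are all hypothetical. Only the `JOB`, `VARS`, `RETRY`, and `PARENT ... CHILD` keywords are standard DAGMan syntax.

```python
def build_dag(model_ids,
              search_submit="phaser_wrapper.sub",   # hypothetical submit file
              rank_submit="rank_results.sub"):      # hypothetical submit file
    """Return DAGMan input text: one search job per model, then a ranking job."""
    lines = []
    job_names = []
    for index, model_id in enumerate(model_ids):
        name = f"mr_{index}"
        job_names.append(name)
        lines.append(f"JOB {name} {search_submit}")
        # Pass the candidate model to the wrapper as a per-job variable.
        lines.append(f'VARS {name} model_id="{model_id}"')
        # Grid jobs fail for transient reasons, so allow a couple of retries.
        lines.append(f"RETRY {name} 2")
    # The ranking job runs only after every search job has completed.
    lines.append(f"JOB rank {rank_submit}")
    lines.append(f"PARENT {' '.join(job_names)} CHILD rank")
    return "\n".join(lines) + "\n"

# Placeholder identifiers; a real run would list entries from the Protein Data Bank.
dag_text = build_dag(["model_A", "model_B", "model_C"])
print(dag_text)
```

A file like this would typically be handed to `condor_submit_dag`, which tracks dependencies and resubmits failed jobs on the user's behalf.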
"Before we implemented glideinWMS we were really struggling to schedule 1000 jobs," Stokes-Rees said. "But once we were using glideinwms we were able to get up above ... 6000 jobs running."
The new technique was a success.
"We were able to show that there were certain cases that occur sufficiently frequently when this technique can be really valuable," Stokes-Rees explained. "So far we've found that about a quarter of the cases that we try, keeping in mind that the cases we try are generally people who are stuck, who have tried the conventional methods and are unable to get a good example for their structure . . . In about a quarter of those cases, we're able to find what look like strong candidate models."
This process will not, of course, work for proteins that are unlike anything in the Protein Data Bank. But for those cases where traditional methods have failed, this new process could save months or even years of work.
The Wide Search Molecular Replacement application is now available to users of the Structural Biology Grid; note that all job requests are screened to confirm that they merit significant computing time from OSG.
For more information about WSMR, please read the paper, which ran in the 22 November issue of the Proceedings of the National Academy of Sciences under the title "Protein structure determination by exhaustive search of the Protein Data Bank derived databases."
-Miriam Boon, iSGTW