For the last 40 years, the way that large-scale services, such as global banks, and scientific experiments, like the LHC at CERN, have managed their data has been reminiscent of Lyman Frank Baum's The Wonderful Wizard of Oz. In the fairytale, Dorothy asks how to get to the Emerald City to see the Wizard of Oz, and is simply told that, "all you do is follow the Yellow Brick Road." Any time she strays from the Yellow Brick Road, Dorothy and her friends encounter serious danger and eventually return to follow the safer road.
Relational databases are the Yellow Brick Road of managing large structured data globally. They are the most popular database model and have been in common use since the 1970s. While other types of databases have been built, none have been as effective.
In relational databases, data is organized in the form of related tables: each table can have many records, and these records can have many data fields. This data can be accessed and added without having to reorganize the tables. The software interface used to build and access data structures within relational databases is Structured Query Language (SQL). It's the most widely known and respected query language used today and the closest thing to a standard in the database world.
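The table-and-record model can be sketched with Python's built-in sqlite3 module; the customer and account tables below are invented for illustration, not drawn from any system mentioned in the article:

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
             "customer_id INTEGER REFERENCES customers(id), balance REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Dorothy')")
conn.execute("INSERT INTO accounts VALUES (10, 1, 250.0)")

# A join relates the two tables through the shared customer id,
# without either table needing to be reorganized.
row = conn.execute(
    "SELECT c.name, a.balance FROM customers c "
    "JOIN accounts a ON a.customer_id = c.id"
).fetchone()
print(row)  # ('Dorothy', 250.0)
```

New records or even new tables can be added later without touching the existing ones, which is the flexibility the relational model is prized for.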
Up until now, this has been a happy marriage of data storage and access for organizations. But there are growing doubts that relational databases can handle the 'data deluge' experienced by the likes of growing web companies and the transition into eScience.
In the 1939 film of The Wizard of Oz, a red brick road is intertwined with the yellow one. Similarly, a new type of database might soon offer a different path: NoSQL, or 'not only SQL', a term popularized in 2009, promises a faster and more scalable database architecture, at least for some cases. It comes in many different implementations, such as Cassandra, MongoDB, and CouchDB, to name just a few. Plus, NoSQL query languages are being developed that are easier to learn than SQL.
Big science and web giants such as Google are looking at NoSQL as the next step in the evolution of database models. Its arrival could shake up the market and replace existing technology within a year – or it could come to nothing. "No one knows yet if it will be a disruptive technology for us," said Tony Cass, leader of the database services group at CERN.
Partitions, paradoxes, and particles
The theory that underpins all distributed systems is known as the CAP (Consistency, Availability and Partition Tolerance) theorem, proposed by computer scientist Eric A. Brewer in 2000.
According to CAP, a distributed system has three desirable properties: all nodes on the system see the same data at the same time (consistency); every data request receives a response, whether it succeeded or failed (availability); and the system continues to operate in spite of arbitrary message loss (partition tolerance).
Brewer's crucial point is that it is impossible to meet all three criteria simultaneously; at most two of the three can be guaranteed. (An analogy is Heisenberg's famous uncertainty principle in physics, which states that an observer cannot know a particle's position and momentum precisely at the same time.) Because network partitions cannot be ruled out in a distributed system, in practice the trade-off is between consistency and availability.
Is one database better than the other?
Different data management approaches suit different goals; one decision companies and organizations face is whether to make their data services consistent.
Consider a multinational bank with millions of customers: when a customer completes a transaction on their account, all their account information located in databases around the world should be updated instantly. This behaviour is captured by the ACID properties (atomicity, consistency, isolation, durability), which guarantee that a database transaction either completes fully or not at all, ensuring banking transactions run smoothly.
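The all-or-nothing guarantee can be sketched with Python's sqlite3 module, whose connections wrap statements in transactions; the account names and overdraft rule below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, "
             "balance REAL CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # opens a transaction; rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE name = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE name = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:  # overdraft violates the CHECK constraint
        return False

transfer(conn, "alice", "bob", 30.0)        # succeeds: both rows change
ok = transfer(conn, "alice", "bob", 999.0)  # fails: rolled back entirely
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)  # False {'alice': 70.0, 'bob': 80.0}
```

The failed transfer leaves no trace: atomicity means the database never shows money debited from one account but not credited to the other.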
The other choice is to sacrifice strict consistency so that data services perform quickly for users. For example, Google's search engine produces results in a fraction of a second. When a system handles thousands or even millions of simultaneous search queries while data are being written to nodes on the network – especially nodes spread between cities and countries – keeping every node consistent would slow it down, so consistency is allowed to lag.
This approach of prioritizing availability and performance over strict consistency is known as BASE ('basically available, soft state, eventually consistent') and is typical of NoSQL architectures.
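A minimal sketch of the BASE idea, assuming a toy two-replica store (not any real NoSQL product): writes are acknowledged immediately and copied to the other replica later, so reads stay available but may briefly return stale data until the replicas converge:

```python
from collections import deque

class Replica:
    """Toy replica holding its own copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Illustrative BASE-style store, not a real database:
    writes are acknowledged at once and replicated asynchronously."""
    def __init__(self, n=2):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = deque()  # replication backlog ('soft state')

    def write(self, key, value):
        self.replicas[0].data[key] = value  # ack right away: available
        self.pending.append((key, value))   # replicate later

    def read(self, key, replica=1):
        return self.replicas[replica].data.get(key)  # may be stale

    def sync(self):
        while self.pending:                 # drain backlog: converge
            key, value = self.pending.popleft()
            for r in self.replicas[1:]:
                r.data[key] = value

store = EventuallyConsistentStore()
store.write("status", "updated")
stale = store.read("status")   # None: replica 1 hasn't seen the write yet
store.sync()
fresh = store.read("status")   # 'updated': replicas have converged
print(stale, fresh)
```

The window between `write` and `sync` is exactly the inconsistency a bank cannot tolerate but a search engine happily accepts in exchange for speed.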
A new mindset
One successful implementation of a NoSQL database is in the Materials Project, a new scientific tool that uses the open-source MongoDB database to be the 'Google' of material properties, according to its creators at the Lawrence Berkeley National Laboratory in the USA. The project will provide scientists with a resource to quickly develop new, clean energy technologies, enabling researchers to analyze and query material properties – for example, when developing new batteries.
“Since we don't always know what properties we need ahead of time, it becomes useful to have a flexible schema that allows us to add properties to objects. In this case, objects represent materials. It also allows us easily to attach certain properties to a set of materials only where they make sense, for example only certain materials will have electrochemical properties,” said Shreyas Cholia, a computer engineer at the US National Energy Research Scientific Computing Center (NERSC) involved in the project.
“This is a much cleaner way to interface with the database, rather than dealing with complex joins [a method for combining fields from two tables by using values common to each] and relationships, you construct a query object with a list of properties or ranges that you are interested in. The MongoDB query language is extremely flexible and powerful, while being more programmer friendly,” said Cholia.
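The query-object style Cholia describes can be illustrated with a toy in-memory matcher for MongoDB-style query documents. This is a sketch of the idea only, not the pymongo API, and the material records and field names below are invented:

```python
# Hypothetical material documents; note the flexible schema: only some
# documents carry electrochemical properties, as in the article.
materials = [
    {"formula": "LiFePO4", "band_gap": 3.7, "electrochem": {"voltage": 3.4}},
    {"formula": "Si",      "band_gap": 1.1},  # no electrochemical fields
]

def matches(doc, query):
    """True if doc satisfies every clause in the query document."""
    for field, cond in query.items():
        if isinstance(cond, dict):          # operator clause, e.g. a range
            value = doc.get(field)
            if value is None:
                return False
            if "$gte" in cond and not value >= cond["$gte"]:
                return False
            if "$lte" in cond and not value <= cond["$lte"]:
                return False
        elif doc.get(field) != cond:        # plain exact match
            return False
    return True

# A query object listing a property range, instead of a SQL join.
query = {"band_gap": {"$gte": 2.0, "$lte": 4.0}}
hits = [m["formula"] for m in materials if matches(m, query)]
print(hits)  # ['LiFePO4']
```

Real MongoDB evaluates query documents like `{"band_gap": {"$gte": 2.0}}` server-side; the point here is only the shape of the interface: properties and ranges, no joins.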
Not so fast
While new projects such as the Materials Project can easily consider new database models, what about the existing large-scale science projects?
According to Tony Cass, relational databases have contributed to the success of CERN today. “We have used the Oracle relational database for 30 years. Most people would probably expect this for administrative applications, but Oracle was introduced at first to support LEP [Large Electron–Positron Collider] construction and operation. Today, if Oracle doesn't work, the LHC accelerator doesn't work. Oracle databases also support critical elements for the LHC experiments.
“These databases, though, have been highly optimised to deliver fast performance for applications that were designed five or more years ago, and it takes time and expertise to adapt the databases for new queries. In contrast, creation of NoSQL solutions for a novel application is often very rapid,” said Cass.
At CERN, the database group is participating in some small-scale tests of NoSQL solutions with three of the four detectors on the LHC, but larger-scale tests need to be done, Cass said.
Cass said he thinks that the popularity of NoSQL rests in part on the fast turnaround for application development, and not simply on the technical advantages that are essential to supporting large-scale websites such as Facebook.
"There is nothing wrong with this," he said, "but you have to be careful when people contrast the performance of NoSQL solutions against relational implementations. Often, people are comparing optimized, small-scale NoSQL solutions against performance of the relational approach on an Oracle platform that has been tuned to deliver high-scale performance in another area.”
“It's important to avoid picking the wrong technology based on small scale experience. Adding an index to a relational database can be painful, but reconfiguring a 150 terabyte NoSQL database is not likely to be much easier. No one as yet has done a comparison of use cases for large scale [petabytes of] data at CERN or science in general. This is true for applications that manage high volumes of data transfer. For example, we manage 3.5 trillion rows of LHC data in the database, which can be accessed via our Oracle system. Will a NoSQL solution be faster? No one knows what happens at these limits,” said Cass.
Oracle, an industry partner of CERN, is also developing a NoSQL system for companies with large-scale web applications that need to handle high volumes of simultaneous reads and writes. "NoSQL offers a new mentality and has already achieved real world success stories. It powers giants like Facebook and LinkedIn," said Charles Lamb, technical consultant at Oracle.
“Our system is not relational and it extends the NoSQL model in some areas. Data is stored as a key-value pair [a pair of related objects - attribute name and value - stored with a unique collection of properties and methods], so we don't use a table-based system,” said Lamb.
“Also, NoSQL databases typically only support single record operations and do not support transactions. We support single and multiple record operations. The main challenge we face is to ensure we can scale up to hundreds of thousands of compute nodes all operating concurrently,” said Lamb.
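A key-value store of the kind Lamb describes can be sketched in a few lines. This toy class and its all-or-nothing multi-record write are hypothetical illustrations, not Oracle's actual API:

```python
class KVStore:
    """Toy key-value store: values live under unique keys, no tables."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_put(self, records):
        """Multi-record operation: apply every write or none of them."""
        backup = dict(self._data)
        try:
            for key, value in records:
                if not isinstance(key, str):
                    raise TypeError("keys must be strings")
                self._data[key] = value
        except Exception:
            self._data = backup  # restore: no partial update visible
            raise

store = KVStore()
store.put("user:42", {"name": "Ada"})
store.multi_put([("user:42/email", "ada@example.org"),
                 ("user:42/city", "London")])
print(store.get("user:42/city"))  # London
```

The contrast with the earlier SQL example is the point: there is no schema to define and no join to write, but also no query planner – the application must know its keys.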
There are also new database systems on the market that do not fit neatly into the relational or NoSQL categories. Daniel Abadi, of Yale University, is chief scientist of Hadapt, a commercial company that recently raised $9.5 million from Bessemer Venture Partners and Norwest Venture Partners. Hadapt is developing a scalable analytical database based on HadoopDB, a research project from Yale University.
It uses Hadoop, a distributed processing framework based on Google's MapReduce model and used by companies such as Amazon, eBay, and Facebook to create web indexes, track user clicks, and make recommendations to customers. Hadapt combines Hadoop's MapReduce processing with the parallel query-execution techniques found in modern scalable relational database systems.
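The MapReduce pattern that Hadoop distributes across a cluster can be sketched in a single process; this word-count example is illustrative only – a real Hadoop job shards the map and reduce phases over many nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key, then sum the counts."""
    groups = defaultdict(int)
    for word, count in pairs:
        groups[word] += count
    return dict(groups)

documents = ["follow the yellow brick road",
             "follow the red brick road"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in documents))
print(counts["follow"], counts["road"], counts["yellow"])  # 2 2 1
```

Because each document is mapped independently and reduction only needs pairs grouped by key, both phases parallelize naturally – which is why the pattern suits the bulk aggregation workloads Abadi contrasts with per-record NoSQL operations.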
“NoSQL databases are useful for large scale individual transaction models [a sequence of queries]; these are read, write, update, and delete tasks, but not for large-scale aggregations or analysis of data. Relational databases and Hadoop are still the best solutions for data analysis,” Abadi said.
"However, relational databases and Hadoop were designed for different workloads and therefore have different sets of strengths and weaknesses. Our research into the HadoopDB project has shown that it's possible to combine the scalability, job complexity, fault tolerance, and ability to process unstructured data of Hadoop with the high performance of relational database systems on structured data." While Hadapt is researching commercial aspects, it is also looking at scientific applications, such as sequence alignment in bioinformatics research.
Meanwhile, data management continues to increase in volume and complexity. For the increasing number of organizations and scientists that need solutions for data-intensive research, though, the path to take is not always clear. At least those working at the forefront can see an end to confusion.
“In a year's time we'll see fewer polemics. We'll see a growing realisation of what is appropriate, where — including a better understanding of the different NoSQL implementations,” said Cass.