- Digitizing biological specimen collections is a daunting challenge
- The Indiana University Herbarium created a streamlined digitization process
- More than 150,000 specimens are now accessible through an online data portal
Computers and the internet have taken much of the grunt work out of science.
Despite this, capturing digital information from a large physical collection is an organizational challenge. Performed poorly, such a process can generate data of limited usefulness.
This presented a huge problem for the Indiana University (IU) Herbarium. Established in 1885, the institution has more than 150,000 plant specimens that represent the flora from Indiana and elsewhere. The collection spans the Herbarium’s entire existence, so many items are irreplaceable.
Someone had to step up to take on this challenge, and Dr. Eric Knox decided his team was right for the job. A senior scientist and director of the IU Herbarium, Knox trained more than 70 undergraduate curatorial assistants to process the specimens.
By partnering with IU Libraries’ Imago digital repository and the Consortium of Midwest Herbaria’s shared data portal, the herbarium can share its digital collection of unique plants with scientists and the general public.
“The specimens that we have in the herbarium are all unique,” Knox says. “Each one has scientific significance in its own right.”
Digital assembly line
The digitization process starts with scientists going out into the field to collect plant specimens, which are then pressed and dried. While the herbarium already has a huge collection, scientists like associate curator Paul Rothrock are still adding to it.
It’s easy to label new specimens, but ensuring the validity of older ones is trickier. According to Knox, making sure older plants were labeled with their correct and current names took about three years. This process also created a digital inventory of the species in the collection, which was used in subsequent steps.
“Once we had the curated specimens, then we were able to image them all with a barcode on every single sheet,” says Knox.
“At the imaging stage, we wanted an efficient process, where students used barcode readers to rename the image file and to create a skeletal database record where the taxonomy and the geography – down to the level of county in the case of the United States – is selected from pre-populated, drop-down lists. This way, there’s no typing involved, which avoids introducing errors.”
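The barcode-driven step Knox describes can be sketched in a few lines of Python. This is only an illustration of the idea, not the herbarium's actual software: the barcode format, the vocabulary contents, and the names `TAXA`, `COUNTIES`, `SkeletalRecord`, and `make_record` are all hypothetical. The point is that the file name comes from the scanned barcode and the taxonomy and geography must match a fixed list, so nothing is typed freehand.

```python
from dataclasses import dataclass
from pathlib import Path
import tempfile

# Hypothetical controlled vocabularies standing in for the
# pre-populated drop-down lists mentioned in the article.
TAXA = {"Quercus alba", "Acer saccharum"}
COUNTIES = {("Indiana", "Monroe"), ("Indiana", "Brown")}


@dataclass(frozen=True)
class SkeletalRecord:
    barcode: str
    taxon: str
    state: str
    county: str
    image_name: str


def make_record(image_path: Path, barcode: str, taxon: str,
                state: str, county: str) -> SkeletalRecord:
    """Rename the image after its barcode and build a minimal record."""
    # Reject anything that is not in the controlled lists.
    if taxon not in TAXA:
        raise ValueError(f"unknown taxon: {taxon!r}")
    if (state, county) not in COUNTIES:
        raise ValueError(f"unknown county: {county!r}, {state!r}")
    # The scanned barcode, not a typed name, becomes the file name.
    new_path = image_path.with_name(barcode + image_path.suffix.lower())
    image_path.rename(new_path)
    return SkeletalRecord(barcode, taxon, state, county, new_path.name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        camera_file = Path(tmp) / "IMG_0001.TIF"
        camera_file.write_bytes(b"placeholder image data")
        # "IND-0012345" is a made-up barcode used only for this demo.
        print(make_record(camera_file, "IND-0012345",
                          "Quercus alba", "Indiana", "Monroe"))
```

Because every field is either scanned or selected, the only way to get a bad value into the record is to pick the wrong list entry, which is easier to catch in review than a typo.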
After this, Knox’s team channels the images through a cyberinfrastructure quality control pipeline that uses IU's research computing resources to copy the images.
The files are converted from TIF format to JPEG, with 60% compression but no loss of resolution and no visible loss of image quality, making the files faster to download. Finally, the images are uploaded to IU’s Imago digital repository, and then linked to the Midwest Herbaria data portal for the whole world to access.
“If we compile specimen images and information once and make them electronically available and, furthermore, use a platform that aggregates resources from all herbaria, that product is available for all people to use,” Knox says. “It fundamentally transforms how people can do science. This is environmental big data finally coming to fruition, but it needs to be digitized one specimen at a time.”
The devil is in the details
Properly handling and storing images of more than 150,000 specimens is no small task. Knox says each TIF file is about 175 megabytes, and discussions with Indiana University’s IT team meant “preparing for petabytes.”
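For a sense of scale, the two figures quoted above can be multiplied out. The short Python sketch below uses only the article's numbers; the function name is illustrative, and decimal units (1 terabyte = 1,000,000 megabytes) are an assumption.

```python
SPECIMENS = 150_000  # "more than 150,000" in the article, so a lower bound
TIF_MB = 175         # approximate size of one TIF file, per Knox


def total_terabytes(n_files: int, mb_per_file: float) -> float:
    """Total size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return n_files * mb_per_file / 1_000_000


print(total_terabytes(SPECIMENS, TIF_MB))  # 26.25
```

That is roughly 26 terabytes for a single copy of the original TIF images. Backup copies, JPEG derivatives, and future additions to the collection would all add to the total.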
“Finally, the technology has caught up with the vision we had for how an information system like this should work.”
“As I tell people, even though I’ve been trying to get this project underway for many, many years, I’m delighted that my early attempts were not successful,” says Knox. “What is easy now would have been difficult then.”
Knox and his team also had to think about what else has changed since specimen collection began in the 19th century.
“If somebody says a plant was collected ‘10 miles north of Bloomington’, does that mean it’s more than nine miles but less than 11, or more than five miles but less than 15?” asks Knox. “Did the collector start from the center of Bloomington or from the edge of town at that time?”
Working with the developers of GEOLocate, the team can now overlay a large set of historical US Geological Survey maps on contemporary maps to help determine where plants were collected in the past.
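Knox's “10 miles north of Bloomington” question can be made concrete with a little arithmetic. The Python sketch below is not how GEOLocate works; it is a simplified illustration that treats the Earth as a sphere and assumes a starting point near the center of present-day Bloomington (the coordinates are approximate). The function names are hypothetical. It shows how much the candidate latitude band widens between the two readings of the label.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (spherical approximation)

# Approximate latitude of central Bloomington, Indiana. Which starting
# point the collector actually used is exactly what the label leaves open.
BLOOMINGTON_LAT = 39.165


def latitude_north_of(lat_deg: float, miles: float) -> float:
    """Latitude reached by travelling due north for the given distance."""
    return lat_deg + math.degrees(miles / EARTH_RADIUS_MILES)


def latitude_band(lat_deg: float, low_miles: float, high_miles: float):
    """(southern, northern) latitude limits for a range of distances."""
    return (latitude_north_of(lat_deg, low_miles),
            latitude_north_of(lat_deg, high_miles))


if __name__ == "__main__":
    # The two readings Knox mentions: 9 to 11 miles, or 5 to 15 miles.
    for low, high in [(9, 11), (5, 15)]:
        south, north = latitude_band(BLOOMINGTON_LAT, low, high)
        print(f"{low}-{high} miles: {south:.4f} N to {north:.4f} N")
```

Travelling due north leaves the longitude unchanged, so only the latitude needs to be computed here. A real georeferencing workflow would also account for the uncertainty in direction and in the starting point.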
Although there’s still a lot of work to do, this five-year project is finally drawing to a close. Much like all the preparations for a complex dinner, the hard work is about to pay off.
“It’s analogous to having Thanksgiving dinner,” says Knox. “You think about how good that food is, but what’s the first thing you do in the morning? You get up and you start chopping onions and celery to make the stuffing. Even after you pull the turkey out of the oven, you still have to make the gravy. We’re in the gravy stage right now. A year from now, the table will be set and any guest can access this information from any computer, anywhere in the world.”
Eric Knox wishes to thank Paul Rothrock, Tanner Mayfield, Daniel Layton, Maggie Vincent, Laura White, and a large army of student curatorial assistants for their excellent work during this 5-year project, which was undertaken with funding from the Indiana University Department of Biology, College of Arts and Sciences, Vice Provost for Research, Vice President for Research, IU Libraries, and University Information Technology Services.