- Amateur genealogists contribute valuable data to scientists
- Large-scale study shows that genes play only a minor role in longevity
- Crowd-sourced data plays an increasing role in scientific research
Before the Internet became so central to our lives, genealogy enthusiasts investigated their roots the old-fashioned way. They asked relatives about family names, dates of birth and death, marriages, occupations, and locations. The next step was to visit libraries, historical societies, and other repositories of public records.
Today, the digitization and commoditization of these records have allowed millions of people to research their ancestry without leaving home. As amateur genealogists search online databases, they also contribute information, records, and photos not previously available. A recent paper describes how a team of researchers used this crowd-generated data to facilitate scientific inquiry.
Yaniv Erlich, a computational geneticist at Columbia University, was curious about the large amount of family tree data collected by the ancestry-themed social media website Geni.com.
After obtaining permission from Geni and its parent company MyHeritage, he downloaded 86 million user profiles. The data contributed by these users represents 13 million individuals spanning an average of 11 generations. That’s about five centuries of human history.
This immense family tree primarily includes people from North America, the British Isles, and Western Europe. Each profile describes one person and any presumed connections to other individuals in the data set. The data includes descriptors such as dates of birth, birth and death locations, names of parents, and gender.
Are you who you say you are?
The researchers faced several challenges during the six months it took to process this massive amount of user-generated data. According to Joanna Kaplanis, a member of Erlich’s research team, the greatest obstacle was verifying the family tree’s accuracy.
One way they did this was by comparing the average lifespan of individuals in the Geni data with the lifespans of people recorded in the Human Mortality Database (HMD). Populated with government-sourced demographic data, the HMD is considered to be accurate.
The researchers were reassured when the Geni data closely matched the HMD. The team also removed records that contained invalid relationships, such as someone being listed as both a parent and a child of the same individual.
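To make the idea of an "invalid relationship" concrete, here is a minimal sketch of how such a check might look. It is an illustration only, not the team's actual cleaning pipeline: the function name `find_mutual_parent_child`, the dictionary layout, and the sample profiles are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def find_mutual_parent_child(parents: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return sorted pairs (a, b) where each person is listed as a parent of the other.

    `parents` maps a profile ID to the set of profile IDs listed as its parents.
    This layout is an assumption made for the example.
    """
    bad = set()
    for child, child_parents in parents.items():
        for parent in child_parents:
            # Invalid if the listed parent also names this child as its own parent.
            if child in parents.get(parent, set()):
                bad.add(tuple(sorted((child, parent))))
    return sorted(bad)


# Hypothetical profiles: "ann" and "bob" each list the other as a parent.
profiles = {
    "ann": {"bob"},
    "bob": {"ann", "cal"},
    "cal": set(),
    "dee": {"cal"},
}

invalid_pairs = find_mutual_parent_child(profiles)
print(invalid_pairs)
```

A check like this catches only direct two-person loops. Longer loops (a person appearing as their own grandparent, for example) would need a general cycle search over the whole tree.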
Team members Tal Shor and Omer Weissbrod then designed an algorithm to determine the expected amount of shared DNA between individuals found in the data. Manual and automatic curation tools, including Yahoo! Placemaker, clarified geographical information.
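The article does not describe how the team's algorithm works. As a rough illustration of the underlying idea, the sketch below uses the textbook kinship-coefficient recursion to estimate the expected fraction of DNA two people share, given only who their parents are. The names `make_relatedness`, `kinship`, and `expected_shared_dna`, and the small pedigree, are hypothetical. The sketch also assumes the pedigree contains no loops and that nobody in it is inbred.

```python
from functools import lru_cache
from typing import Callable, Dict, Optional, Tuple

# Maps a person to their two parents; None marks an unknown parent.
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def make_relatedness(pedigree: Pedigree) -> Callable[[str, str], float]:
    """Build a function giving the expected shared-DNA fraction for two people."""

    @lru_cache(maxsize=None)
    def depth(person: Optional[str]) -> int:
        # Founders (no known parents) have depth 0; unknown parents count as -1.
        if person is None:
            return -1
        p1, p2 = pedigree.get(person, (None, None))
        return 1 + max(depth(p1), depth(p2))

    @lru_cache(maxsize=None)
    def kinship(a: Optional[str], b: Optional[str]) -> float:
        # Probability that one gene copy drawn at random from each person
        # descends from the same ancestral copy.
        if a is None or b is None:
            return 0.0
        if a == b:
            p1, p2 = pedigree.get(a, (None, None))
            return 0.5 * (1.0 + kinship(p1, p2))
        # Recurse through the deeper person, who cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        p1, p2 = pedigree.get(a, (None, None))
        return 0.5 * (kinship(p1, b) + kinship(p2, b))

    def expected_shared_dna(a: str, b: str) -> float:
        # For non-inbred individuals, relatedness is twice the kinship coefficient.
        return 2.0 * kinship(a, b)

    return expected_shared_dna


# Hypothetical pedigree: p1 and p2 are full siblings; c1 and c2 are first cousins.
family: Pedigree = {
    "gp1": (None, None), "gp2": (None, None),
    "s1": (None, None), "s2": (None, None),
    "p1": ("gp1", "gp2"), "p2": ("gp1", "gp2"),
    "c1": ("p1", "s1"), "c2": ("p2", "s2"),
}

expected_shared_dna = make_relatedness(family)
print(expected_shared_dna("p1", "p2"))  # full siblings
print(expected_shared_dna("c1", "c2"))  # first cousins
```

On this small pedigree the sketch gives 0.5 for full siblings and 0.125 for first cousins, the familiar one-half and one-eighth. These are expected values. The DNA actually shared by a given pair of relatives varies around them.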
Analysis of the Geni data led to some surprising findings about the evolution of marriage. Before the Industrial Revolution, most individuals in the data set married someone who was born within about six miles of where they themselves were born.
After 1750, that distance steadily increased, reaching around 60 miles by 1950. Previously, it had been thought that marriages between cousins decreased because of this geographic distance. Instead, the Geni data shows that people continued to marry close relatives for 50 years after the advent of the railway and other improvements in transportation made it easier to move away from one’s place of birth. According to Kaplanis, this suggests that cultural rather than geographic changes drove the shift in mating behavior.
The study also found that genes explain only about 16% of the differences seen in human longevity. That means having the “right genes” only adds about five years to the average lifespan. That isn’t much of an advantage, considering that a habit like smoking reduces lifespan by 10 years.
Kaplanis believes the discoveries made about genetics and longevity are among the most important. She advises, “Even if your parents lived very long lives, unfortunately we still have to eat healthfully and exercise to make sure that we do too!”
A data set of the magnitude used in this study would have been difficult and expensive to obtain without harnessing the power of crowd-sourcing and social media. According to Weissbrod, “The staggering amount of data makes it possible to perform analyses that are much wider in scope than any previous study of this kind.”
All the data used in the study is available to the public, and the branches of the global family tree are sure to yield many more secrets about human life and culture.