While deriving some identity characteristics from a name may seem intuitive, for many, ethnic identification raises serious concerns about labeling and the potential for discrimination. Political scientists, however, can use such statistics to help identify and address social problems.
"Anytime we're interested in questions of discrimination, bias, or inequality as a function of race, identity, or ethnicity, we need actual measures of identity to test whether or not such discrimination exists," Harris posits. "That's where my method becomes useful - it provides those estimates to address questions of legal and moral importance: do racial or ethnic groups have equal access to the ballot box?"
Inferring ethnicity from individual names (e.g., Wong, Sanchez, Smith) is error-prone. For instance, the name Lee could easily be a Caucasian, Asian, or African-American surname. So, instead of trying to categorize each name, Harris looks at the entire list of names simultaneously. The resulting demographics can then be used to pursue larger questions of cultural identity, entrenched biases, and barriers to inclusion.
"Instead of saying that person one is from ethnic group A, I use all of the information in a list of names to estimate the proportion of groups A, B, and C in that list," says Harris. "This method provides more efficient and lower-bias estimates of ethnic group proportions than you would get if you categorized each name and computed the proportions from those categorizations."
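To make the contrast in that quote concrete, here is a minimal Python sketch. It is not Harris's published estimator. It swaps in a standard expectation-maximization (EM) fit of mixture proportions, and every number in it is invented for illustration: the groups "A" and "B", the table `NAME_GIVEN_GROUP` of name probabilities within each group, and the toy roster are all assumptions, as are the function names.

```python
from collections import Counter

# Hypothetical P(name | group) values, invented purely for illustration.
NAME_GIVEN_GROUP = {
    "A": {"Lee": 0.5, "Wong": 0.5, "Smith": 0.0},
    "B": {"Lee": 0.2, "Wong": 0.0, "Smith": 0.8},
}


def classify_then_count(names, name_given_group):
    """Assign each name to its single most likely group, then tally shares."""
    groups = list(name_given_group)
    tally = Counter()
    for name in names:
        best = max(groups, key=lambda g: name_given_group[g].get(name, 0.0))
        tally[best] += 1
    return {g: tally[g] / len(names) for g in groups}


def estimate_proportions(names, name_given_group, iterations=200):
    """Estimate group shares from the whole list at once via EM."""
    groups = list(name_given_group)
    counts = Counter(names)
    props = {g: 1.0 / len(groups) for g in groups}  # start from equal shares
    for _ in range(iterations):
        expected = {g: 0.0 for g in groups}
        for name, n in counts.items():
            # E-step: split each name's count across groups by posterior weight.
            weights = {
                g: props[g] * name_given_group[g].get(name, 0.0) for g in groups
            }
            norm = sum(weights.values())
            if norm == 0.0:
                continue  # name is unknown to every group; skip it
            for g in groups:
                expected[g] += n * weights[g] / norm
        # M-step: new shares are the normalized expected counts.
        assigned = sum(expected.values())
        props = {g: expected[g] / assigned for g in groups}
    return props


# Toy roster built so that the true shares are 30% group A and 70% group B.
roster = ["Lee"] * 29 + ["Wong"] * 15 + ["Smith"] * 56

per_name = classify_then_count(roster, NAME_GIVEN_GROUP)
whole_list = estimate_proportions(roster, NAME_GIVEN_GROUP)
print("classify each name:", per_name)
print("use the whole list:", whole_list)
```

On this toy roster, labeling each name individually puts every "Lee" in group A and so overstates that group's share, while the whole-list estimate treats "Lee" as ambiguous and divides it between the groups. The sketch also assumes the name probabilities are known in advance, which a real application would have to estimate or obtain from reference data.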
Harris's systematic method, recently published in Political Analysis, introduces a context-independent identity estimator and can work in any geographic region or nation. He has used his method to infer racial voting behavior in North Carolina, and to explain decreases in Kenyan ethnic groups in the voter register. What's more, Harris claims, the method could be used to estimate virtually any name-related characteristic, such as a person's religion, class, or ancestry.
To test the method, Harris relied heavily on the BuTinah supercomputer on the NYU Abu Dhabi campus. Named after a protected marine reserve off the coast of Abu Dhabi, BuTinah's 512 super-dense compute nodes are capable of around 70 teraFLOPS.
BuTinah helped him both manipulate the rosters and run a large number of simulations - on both real and simulated data - to develop and demonstrate proof of concept.
"If I had years to spare maybe I could have done the research without BuTinah. But the team at BuTinah made it do-able on a much more reasonable timetable," he says.
Harris's research was funded in part by a National Science Foundation (NSF) doctoral dissertation improvement grant.
Harris foresees using this approach to monitor how individual Twitter or Facebook networks evolve as important events approach. He wants to know if people connect more with certain ethnicities as that politically salient identity is activated.
"The approach will allow us to measure identity in a way that wasn't previously feasible. New ways of understanding inequality, social mobility, social networks - the substantive insights that new measures of identity could afford are huge."