- Hidden areas of the internet mask illegal activity
- Wrangler supercomputer speeds DARPA deep web searches
- 600 terabytes of flash memory help law enforcement spy illegality on the dark web
Much of the internet hides like an iceberg below the surface.
This so-called 'deep web' is estimated to be 500 times bigger than the 'surface web' seen through search engines like Google. For scientists and others, the deep web holds important computer code and licensing agreements.
Nestled further inside the deep web, one finds the 'dark web,' a place where images and video are used by traders in illicit drugs, weapons, and human lives.
“Behind forms and logins, there are bad things,” says Chris Mattmann, chief architect in the instrument and science data systems section of the NASA Jet Propulsion Laboratory (JPL) at the California Institute of Technology.
“Behind the dynamic portions of the web, people are doing nefarious things, and on the dark web, they're doing even more nefarious things. They traffic in guns and human organs. They're doing these activities and then they're tying them back to terrorism.”
In 2014, the Defense Advanced Research Projects Agency (DARPA) started a program called Memex to make the deep web accessible. “The goal of Memex was to provide search engines the retrieval capacity to deal with those situations and to help defense and law enforcement go after the bad guys on the deep web,” Mattmann says.
At the same time, the US National Science Foundation (NSF) invested $11.2 million in a first-of-its-kind data-intensive supercomputer – the Wrangler supercomputer, now housed at the Texas Advanced Computing Center (TACC). The NSF asked engineers and computer scientists at TACC, Indiana University, and the University of Chicago if a computer could be built to handle massive amounts of input and output.
Wrangler does just that, enabling the speedy file transfers needed to fly past big data bottlenecks that can slow down even the fastest computers. It was built to work in tandem with number crunchers such as TACC's Stampede, which in 2013 was the sixth fastest computer in the world.
“Although we have a lot of search-based queries through different search engines like Google, it's still a challenge to query the system in way that answers your questions directly,” says Karanjeet Singh.
Singh is a University of Southern California graduate student who works with Chris Mattmann on Memex and other projects.
“The objective is to get more and more domain-specific information from the internet and to associate facts from that information.”
Once the Memex user extracts the information they need, they can apply tools such as named entity recognizer, sentiment analysis, and topic summarization. This can help law enforcement agencies find links between different activities, such as illegal weapon sales and human trafficking.
The problem is that even the fastest computers like Stampede weren't designed to handle the input and output of the millions of files needed for the Memex project.
“Let's say that we have one system directly in front of us, and there is some crime going on,” Singh says. “What the JPL is trying to do is automate a lot of domain-specific query processes into a system where you can just feed in the questions and receive the answers.”
For that, he works with an open source web crawler called Apache Nutch. It retrieves and collects web page and domain information of the deep web. The MapReduce framework powers those crawls with a divide-and-conquer approach to big data that breaks it up into small pieces that run simultaneously.
Wrangler avoids data overload by virtue of its 600 terabytes of speedy flash storage. What's more, Wrangler supports the Hadoop framework, which runs using MapReduce.
Together, Wrangler and Memex constitute a powerful crime-fighting duo. NSF investment in advanced computation has placed powerful tools in the hands of public defense agencies, moving law enforcement beyond the limitations of commercial search engines.
“Wrangler is a fantastic tool that we didn't have before as a mechanism to do research,” says Mattman. “It has been an amazing resource that has allowed us to develop techniques that are helping save people, stop crime, and stop terrorism around the world.”