
Wrangling crime in the deep, dark web

Speed read
  • Hidden areas of the internet mask illegal activity
  • Wrangler supercomputer speeds DARPA deep web searches
  • 600 terabytes of flash memory help law enforcement spot illegal activity on the dark web

Much of the internet hides like an iceberg below the surface.

This so-called 'deep web' is estimated to be 500 times bigger than the 'surface web' seen through search engines like Google. For scientists and others, the deep web holds important computer code and licensing agreements.

Nestled further inside the deep web, one finds the 'dark web,' a place where images and video are used by traders in illicit drugs, weapons, and human lives.

“Behind forms and logins, there are bad things,” says Chris Mattmann, chief architect in the instrument and science data systems section of the NASA Jet Propulsion Laboratory (JPL) at the California Institute of Technology.

<strong>Chris Mattmann</strong> heads a project that harnesses the Wrangler supercomputer to speed searches into the dark web. Supercomputers provide law enforcement agencies with a tremendous asset in the fight against terrorism. Courtesy Chris Mattmann.

“Behind the dynamic portions of the web, people are doing nefarious things, and on the dark web, they're doing even more nefarious things. They traffic in guns and human organs. They're doing these activities and then they're tying them back to terrorism.”

In 2014, the Defense Advanced Research Projects Agency (DARPA) started a program called Memex to make the deep web accessible. “The goal of Memex was to provide search engines the retrieval capacity to deal with those situations and to help defense and law enforcement go after the bad guys on the deep web,” Mattmann says.

At the same time, the US National Science Foundation (NSF) invested $11.2 million in a first-of-its-kind data-intensive supercomputer – the Wrangler supercomputer, now housed at the Texas Advanced Computing Center (TACC). The NSF asked engineers and computer scientists at TACC, Indiana University, and the University of Chicago if a computer could be built to handle massive amounts of input and output.

Wrangler does just that, enabling the speedy file transfers needed to fly past big data bottlenecks that can slow down even the fastest computers. It was built to work in tandem with number crunchers such as TACC's Stampede, which in 2013 was the sixth fastest computer in the world.

<strong>Wrangler</strong> is a data-intensive supercomputer created in partnership between the Texas Advanced Computing Center, Indiana University, and the University of Chicago. The NSF invested in the Wrangler project, seeking a means to handle large volume, rapid input/output tasks. Courtesy TACC.

“Although we have a lot of search-based queries through different search engines like Google, it's still a challenge to query the system in a way that answers your questions directly,” says Karanjeet Singh.

Singh is a University of Southern California graduate student who works with Chris Mattmann on Memex and other projects.

“The objective is to get more and more domain-specific information from the internet and to associate facts from that information.”

Once Memex users extract the information they need, they can apply tools such as named entity recognition, sentiment analysis, and topic summarization. This can help law enforcement agencies find links between different activities, such as illegal weapon sales and human trafficking.

The problem is that even the fastest computers like Stampede weren't designed to handle the input and output of the millions of files needed for the Memex project.

“Let's say that we have one system directly in front of us, and there is some crime going on,” Singh says. “What the JPL is trying to do is automate a lot of domain-specific query processes into a system where you can just feed in the questions and receive the answers.”

For that, he works with an open source web crawler called Apache Nutch, which retrieves and collects web page and domain information from the deep web. The MapReduce framework powers those crawls with a divide-and-conquer approach to big data: it breaks a large job into small pieces that run simultaneously.
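The divide-and-conquer idea behind MapReduce can be illustrated with a minimal Python sketch. This is not Apache Nutch or Hadoop code, and it does not represent how the Memex team runs its crawls; the URL list, function names, and chunk size are all hypothetical, chosen only to show the pattern of splitting the input, processing the pieces at the same time, and merging the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from urllib.parse import urlparse

# Hypothetical crawl output: these URLs are made up for illustration.
CRAWLED_URLS = [
    "http://example.org/forum/1",
    "http://example.org/forum/2",
    "http://example.com/login",
    "http://example.net/shop?item=3",
    "http://example.com/form",
    "http://example.org/about",
]


def split_into_chunks(items, chunk_size):
    """Divide: break the input into small, independent pieces."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def map_chunk(urls):
    """Map: count the pages seen per host within a single chunk."""
    return Counter(urlparse(url).netloc for url in urls)


def merge_counts(left, right):
    """Reduce: combine two partial counts into one."""
    return left + right


def count_pages_per_host(urls, chunk_size=2, workers=3):
    """Run the map step on all chunks concurrently, then merge the results."""
    chunks = split_into_chunks(urls, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partial_counts, Counter())


if __name__ == "__main__":
    print(count_pages_per_host(CRAWLED_URLS))
```

A thread pool on one machine stands in here for the many nodes a framework like Hadoop would use. The structure is the same in both cases: because each chunk is processed independently, adding more workers lets more pieces run at once.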

<strong>Karanjeet Singh.</strong> Singh is a graduate researcher at the University of Southern California. His work with Chris Mattmann looks to Wrangler's large stores of flash memory to conduct massive data extractions. Courtesy TACC.

Wrangler avoids data overload by virtue of its 600 terabytes of speedy flash storage. What's more, Wrangler supports the Hadoop framework, which runs using MapReduce.

Together, Wrangler and Memex constitute a powerful crime-fighting duo. NSF investment in advanced computation has placed powerful tools in the hands of defense and law enforcement agencies, moving them beyond the limitations of commercial search engines.

“Wrangler is a fantastic tool that we didn't have before as a mechanism to do research,” says Mattmann. “It has been an amazing resource that has allowed us to develop techniques that are helping save people, stop crime, and stop terrorism around the world.”


Copyright © 2022 Science Node ™  |  Privacy Notice  |  Sitemap
