Is data mining safe to use in a national security context? Research and public reports reveal risks of inaccuracy and manipulation.
In recent weeks a vigorous debate has raged over the legality of the NSA's data-mining efforts in the war on terror. The revelations by former NSA contractor turned whistleblower Edward Snowden have alarmed American citizens and governments around the world. Few had grasped the size and scope of US intelligence-gathering.
Most of the debate has revolved around the Fourth Amendment and privacy considerations. There is one component that should be more thoroughly discussed, however. Is data mining safe to use in a national security context?
Data mining has weaknesses. Results become more accurate as the data filters become finer, but a base level of inaccuracy remains in the raw results.
According to public reports, the NSA captures a great deal of American phone record metadata and much of the world's internet communications. It then applies a variety of filters to determine whether a packet of information originated within the United States (where monitoring it is illegal) or outside (where it is not). If analysts are 51 percent sure that the information originated outside the US, they can proceed. But 51 percent is basically a coin toss, a standard that allows them to look at half the traffic.
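To see how thin that margin is, here is a minimal sketch of a confidence-threshold filter of the kind described in public reports. The function name, field values and threshold are illustrative assumptions, not the agency's actual logic.

```python
# Hypothetical sketch of a "foreignness" filter: each packet carries an
# estimated probability that it originated outside the US, and anything
# at or above the threshold is eligible for collection.

def passes_foreignness_filter(confidence_foreign: float, threshold: float = 0.51) -> bool:
    """Return True if the estimated probability that a packet
    originated outside the US meets the (low) threshold."""
    return confidence_foreign >= threshold

# Invented per-packet estimates of P(foreign) for illustration.
packets = [0.99, 0.51, 0.50, 0.30, 0.80]
eligible = [p for p in packets if passes_foreignness_filter(p)]
print(len(eligible), "of", len(packets), "packets clear the 51% bar")
```

Note how little separates an eligible packet (0.51) from an ineligible one (0.50): at a coin-toss threshold, the filter is doing almost no filtering at all.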
“Foreignness” is binary. You are either inside the territorial United States or you are not. Due to the complexities of how communications traffic is routed from server to server to server, some packets of information have a very high degree of location certainty and some have a very low degree. The point is that the NSA has set the bar quite low. Technology can never give us 100 percent, or even 80 percent, certainty that only people outside the US are being monitored.
And what if you want to know whether a person is a terrorist? One man’s terrorist may be another man’s nuisance. In the United Kingdom terrorist acts are “designed seriously to interfere with or seriously to disrupt an electronic system.” Although spammers are life forms somewhere between cockroaches and skunks, they are not terrorists. But under UK law, they very well could be.
Because of these complications, suspicious communications must be examined by a human analyst. That is why thousands of innocent domestic communications have been monitored by the NSA. In fact, the incidence of privacy violations has actually increased because of Big Data.
The government claims that terrorist acts have been detected with data-mining techniques. If true, this is certainly good news. But how many innocent people had their privacy violated in order to achieve this goal? The US government has not released this figure.
Here’s an analogy. If the government monitored the bedrooms of every married couple it would detect many cases of domestic violence. At the same time, it would also monitor far more cases of spouses in intimate relations. Couples who were “collateral damage” in this hypothetical campaign against domestic violence would have a right to be less than pleased.
What happens when a government makes a decision based on information gleaned from data-mining? Will we see innocent people on no-fly lists? False arrests? Inability to get jobs in sensitive areas like finance or law enforcement?
Furthermore, what if the bad guys manipulate the algorithms? In the intelligence world, there is a sub-discipline of counter-intelligence. Part of this involves the use of deception so that an adversary believes things that are not true and makes bad decisions based on bad information.
Data-mining sounds like an obscure technology. But we take advantage of highly sophisticated public data-mining technologies every time we use a search engine. Search engines work by scanning the entire internet so that a consumer searching for dinosaurs sees only web pages related to dinosaurs.
The first item in a Google search on a prominent person will almost always be Wikipedia. That isn’t due to the authority of Wikipedia per se; it’s because so many people link to Wikipedia that data-mining algorithms give it a higher “confidence score”.
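The link-as-vote idea behind that confidence score can be shown in miniature. This is a toy illustration (inbound links counted as votes), assuming an invented link graph; it is not Google's actual ranking algorithm, which weighs many more signals.

```python
from collections import Counter

# Toy link graph: each tuple is (linking_page, linked_page).
# The pages and links are invented for illustration.
links = [
    ("blog-a", "wikipedia"),
    ("blog-b", "wikipedia"),
    ("news-site", "wikipedia"),
    ("blog-a", "fan-site"),
]

# The simplest possible "confidence score": count inbound links as votes.
inbound = Counter(dst for _, dst in links)
ranked = sorted(inbound, key=inbound.get, reverse=True)
print(ranked[0])  # the most-linked page ranks first
```

Because ranking is derived mechanically from the link structure, anyone who can manufacture links can manufacture rank, which is exactly the opening SEO exploits.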
Google has a name for search manipulation: it’s called advertising. If you search for “lawyers” it will display lawyers near where you live. Who rises to the top of the list? The firm with the best advertising agent. I should know: I run a firm that charges good money to ensure that our clients appear at the top of the first page of a search result.
An entire industry has developed around manipulating search engines: search engine optimization (SEO). Most SEO is legitimate marketing – “white hat SEO”. But its nasty little brother, “black hat SEO”, uses similar techniques to infect computers with malicious software or tempt web surfers into buying pornography.
In other words, an SEO expert can manipulate Big Data. Sure, the algorithms used by the NSA are secret, but so are the ones created by Google (and Bing and Yahoo and all the others). Snowden has already released enough information to give black hat SEO experts a head start in deceiving the NSA.
The NSA monitors the metadata for phone calls (the source, the destination and the length of each call). But it isn’t difficult to poison that well. Hackers in the 80s would routinely tap phone lines to make “free phone calls”. The equivalent nowadays is cell phone “cloning”. This is far more difficult, but it is possible. In other words, guys in black hats can log fake phone calls on your cell phone.
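A quick sketch shows why forged call records are so corrosive. The record format, names and naive analysis below are all invented assumptions; the point is only that an analysis built on metadata inherits whatever lies the metadata contains.

```python
from collections import Counter

# Hypothetical call-metadata records: (source, destination, seconds).
# A naive analysis flags whoever contacts a watched number most often.
real_calls = [
    ("alice", "pizza-shop", 120),
    ("bob", "watched-number", 60),
]
# Records injected by an attacker via a cloned phone: alice never made these.
forged_calls = [("alice", "watched-number", 30)] * 5

records = real_calls + forged_calls
contacts = Counter(src for src, dst, _ in records if dst == "watched-number")
suspect = contacts.most_common(1)[0][0]
print(suspect)  # the innocent "alice" now tops the list
```

Nothing in the metadata itself distinguishes the forged records from the real ones, so the analysis confidently points at the wrong person.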
The US Drug Enforcement Administration has been using NSA data to track suspects. Could the black hats manipulate the phone records to prompt a SWAT raid on an innocent citizen? Yes.
If so, why couldn’t SEO experts working for terrorists or hostile governments manipulate Big Data so that intelligence analysts waste time on wild goose chases or create useless blacklists?
Data mining is a useful tool for business but it is still a technology in its infancy. There are real risks of inaccuracy, manipulation and violation of privacy. There needs to be a well-informed examination and discussion of the risks of acting on corrupted data.
John Bambenek is a computer security expert from Champaign, Illinois. He is President of Bambenek Consulting, a cybersecurity firm, and a visiting lecturer in the Department of Computer Science at the University of Illinois at Urbana-Champaign. He can be reached at firstname.lastname@example.org.