Back in 2012, The Atlantic published a behind-the-scenes article about Google Maps. This is the passage that struck me:
The best way to figure out if you can make a left turn at a particular intersection is still to have a person look at a sign — whether that’s a human driving or a human looking at an image generated by a Street View car.
There is an analogy to be made to one of Google’s other impressive projects: Google Translate. What looks like machine intelligence is actually only a recombination of human intelligence. Translate relies on massive bodies of text that have been translated into different languages by humans; it then is able to extract words and phrases that match up. The algorithms are not actually that complex, but they work because of the massive amounts of data (i.e. human intelligence) that go into the task on the front end.
Google Maps has executed a similar operation. Humans are coding every bit of the logic of the road onto a representation of the world so that computers can simply duplicate (infinitely, instantly) the judgments that a person already made…
…I came away convinced that the geographic data Google has assembled is not likely to be matched by any other company. The secret to this success isn’t, as you might expect, Google’s facility with data, but rather its willingness to commit humans to combining and cleaning data about the physical world. Google’s map offerings build in the human intelligence on the front end, and that’s what allows its computers to tell you the best route from San Francisco to Boston.
Even for Google, massive and sophisticated automation is only a first step. Human judgment is also an unavoidable part of documenting web application vulnerabilities. The reason isn’t necessarily obvious: Bayes’ theorem.
“P(A|B)” means “the probability of A, given B.”
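For reference, the theorem itself, in its standard form (not quoted from the article above):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

In the drug-testing example below, A is "is a user" and B is "tested positive."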
Wikipedia explains the concept in terms of drug testing:
Suppose a drug test is 99% sensitive and 99% specific. That is, the test will produce 99% true positive results for drug users and 99% true negative results for non-drug users. Suppose that 0.5% of people are users of the drug. If a randomly selected individual tests positive, what is the probability he or she is a user?
The correct answer is about 33%, and the reason it's counter-intuitive is a bias called base rate neglect. If you have a very accurate test for something that happens infrequently, that test will usually report false positives. That's worth repeating: if you're looking for a needle in a haystack, the best possible tests will usually report false positives.
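The arithmetic behind that 33% can be checked directly with the numbers from the example:

```python
# Posterior probability that a positive test indicates a real drug user,
# using the figures from the Wikipedia example above.
sensitivity = 0.99   # P(positive | user)
specificity = 0.99   # P(negative | non-user)
prevalence = 0.005   # P(user): 0.5% of people

# P(positive) = true positives + false positives
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(user | positive)
posterior = sensitivity * prevalence / p_positive
print(f"{posterior:.1%}")  # → 33.2%
```

Even with a 99%-accurate test, two out of three positives are false, because non-users outnumber users two hundred to one.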
Filtering out false positives is an important part of our service, over and above the scanning technology itself. Because most URLs aren’t vulnerable to most things, we see a lot of false positives. They’re the price of automated scanning.
We also see a lot of duplicates. A website might have a search box on every page that’s vulnerable to cross-site scripting, but it’s not helpful to get a security report that’s more or less a list of the pages on your website. It is helpful to be told there’s a problem with the search box. Once.
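One way to collapse that kind of duplication is to group findings by vulnerability class and affected parameter instead of by URL. A minimal sketch of the idea (the finding fields and values here are hypothetical, not our scanner's actual data model):

```python
from collections import defaultdict

# Hypothetical scanner output: every page with the vulnerable search box
# produces its own report entry.
findings = [
    {"url": "/", "vuln": "xss", "parameter": "q"},
    {"url": "/about", "vuln": "xss", "parameter": "q"},
    {"url": "/contact", "vuln": "xss", "parameter": "q"},
    {"url": "/login", "vuln": "sqli", "parameter": "username"},
]

# Group by (vulnerability class, parameter) so the shared search box
# becomes a single issue with a list of affected pages.
grouped = defaultdict(list)
for f in findings:
    grouped[(f["vuln"], f["parameter"])].append(f["url"])

for (vuln, param), urls in grouped.items():
    print(f"{vuln} in parameter '{param}', {len(urls)} page(s) affected")
```

Four raw findings become two issues: one cross-site scripting problem in the search box, one SQL injection problem in the login form.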
Machine learning is getting better every day, but we don’t have time to wait for computers to read and understand websites as well as humans. Here and now, we need to find vulnerabilities, and scanners can cover a site more efficiently than a human. Unfortunately, false positives are an unavoidable part of that.
Someone has to sort them out. When everything is working right, that part of our service is invisible, just like the people hand-correcting Google Maps.