In the application security space, customers and prospects tell the same story time and time again:
“We set up an automated application security testing product, we got our findings from it, we brought them to our developers, and we convinced them to prioritize fixing these vulnerabilities.
But the first finding they worked on was a false positive. Then the second was too. Now, our engineering team no longer takes our reports seriously.
So tell me, why are these false positives even there? Why can’t these be suppressed automatically?”
While thoroughly vetting vulnerabilities adds some time, delivering results fast but riddled with false positives is counter-productive to an efficient and successful application security program. The time engineers waste triaging false positives far outweighs the benefit of the speedier results.
Yet that’s the approach that many automated application security solutions take.
I’ve previously explored benchmarking the accuracy of automated AppSec testing, but it’s time to dig into why false positives alone do not fully measure accuracy.
False positives alone are not a full measure of accuracy
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity measures the proportion of actual positives that are correctly identified as such. Specificity measures the proportion of actual negatives that are correctly identified as such. In other words, sensitivity quantifies the avoidance of false negatives, and specificity does the same for false positives.
A perfect test would be described as 100 percent sensitive and 100 percent specific. In reality, however, any non-deterministic predictor has a minimum error bound, known as the Bayes error rate. And as Bayes' theorem implies, when true positives are rare, even the best possible test will still mostly report false positives. Put simply: if you're looking for a needle in a haystack, nearly everything you pull out will be hay.
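This base-rate effect is easy to verify with Bayes' theorem. The sketch below uses made-up numbers (a 0.1 percent prevalence of real vulnerabilities and a test that is 99 percent sensitive and 99 percent specific) purely for illustration; they are not taken from any real scanner:

```python
# Hypothetical illustration of the "needle in a haystack" effect via Bayes' theorem.
# All numbers are assumed for illustration, not drawn from any real product.

def posterior_true_positive(prevalence, sensitivity, specificity):
    """P(actually vulnerable | flagged) for a binary test."""
    p_flag_given_vuln = sensitivity           # true positive rate
    p_flag_given_clean = 1.0 - specificity    # false positive rate
    p_flag = (p_flag_given_vuln * prevalence
              + p_flag_given_clean * (1.0 - prevalence))
    return (p_flag_given_vuln * prevalence) / p_flag

# Even a 99%-sensitive, 99%-specific test mostly raises false alarms
# when only 1 in 1,000 tested code paths is truly vulnerable.
p = posterior_true_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"{p:.1%}")  # only about 9% of flagged findings are real
```

Under these assumptions, roughly nine out of ten alerts are false positives even though the test itself is 99 percent accurate in both directions.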
According to Wikipedia, “For any test, there is usually a trade-off between the measures – for instance, in airport security, since testing of passengers is for potential threats to safety, scanners may be set to trigger alarms on low-risk items like belt buckles and keys (low specificity) in order to increase the probability of identifying dangerous objects and minimize the risk of missing objects that do pose a threat (high sensitivity).”
Similarly, in application security testing, false positives alone don’t determine the full accuracy. False positives are just one of the four aspects that determine its accuracy – the other three being ‘true positives,’ ‘true negatives,’ and ‘false negatives.’
- False Positives (FP): Test cases with fake vulnerabilities that were incorrectly reported as vulnerable
- True Positives (TP): Test cases with real vulnerabilities that were correctly reported as vulnerable
- False Negatives (FN): Test cases with real vulnerabilities that were incorrectly not reported as vulnerable
- True Negatives (TN): Test cases with fake vulnerabilities that were correctly not reported as vulnerable
Therefore, the true positive rate (TPR) is the rate at which real vulnerabilities are correctly reported, and the false positive rate (FPR) is the rate at which fake vulnerabilities are incorrectly reported as real.
- True Positive Rate (TPR) = TP / ( TP + FN )
- False Positive Rate (FPR) = FP / ( FP + TN )
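These rates are straightforward to compute from the four counts, and the OWASP Benchmark score for a category is simply TPR minus FPR. The counts below are hypothetical, chosen only to show the arithmetic:

```python
# Computing TPR, FPR, and an OWASP Benchmark-style score (TPR - FPR)
# from the four confusion-matrix counts. Counts are made up for illustration.

def rates(tp, fp, tn, fn):
    tpr = tp / (tp + fn)  # share of real vulnerabilities correctly reported
    fpr = fp / (fp + tn)  # share of fake vulnerabilities incorrectly reported
    return tpr, fpr

# Hypothetical results from scanning 200 Benchmark-style test cases.
tp, fp, tn, fn = 80, 15, 85, 20
tpr, fpr = rates(tp, fp, tn, fn)
score = tpr - fpr
print(f"TPR={tpr:.0%}  FPR={fpr:.0%}  score={score:.0%}")  # TPR=80%  FPR=15%  score=65%
```

Note that a scanner could trivially reach a 100 percent TPR by flagging everything, which is why the score penalizes FPR as well.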
TPR can’t be 100 percent & FPR can’t be 0 percent in automated AppSec testing
When developers and QA engineers write their unit tests and functional tests, they write them specifically for their applications. If they’re testing a+b=c, then all their tests are written specifically to verify that, which keeps the accuracy of those tests high, i.e., no false positives and no false negatives. Developers and QA engineers do not run their unit and functional tests using “generic” tests; in other words, they “curve-fit” their tests to match their application exactly. If they didn’t, some tests would fail incorrectly, producing false positives and false negatives.

Similarly, if they wrote their own security tests customized specifically for their application, those would be highly accurate as well. But automated application security testing solutions such as WhiteHat Sentinel use fairly “generic” security tests, written to be able to scan any and all applications.
With that generality come both false positives and false negatives. False positives can easily be reduced by running fewer tests, but that also increases the false negatives and, as a result, reduces the coverage of these tests. Therefore, it is almost impossible to eliminate both false positives and false negatives from these “generic” automated security tests. In other words, TPR can never be 100 percent, and FPR can never be 0 percent. Thus, the OWASP Benchmark score can never be 100 percent.
So we always encourage you to consider all four measures when analyzing the accuracy of automated application security solutions. You can use the OWASP Benchmark Project, a vendor-neutral and well-respected indicator of accuracy, to compare solutions.
We will be exploring additional details on our findings from running OWASP Benchmark in our next blog post. Stay tuned!
In the meantime, it’s important to point out that both of WhiteHat’s Sentinel Source Editions offer industry-leading coverage, accuracy, and speed, while most SAST solutions only provide one or two of these critical components. Learn more here: https://www.whitehatsec.com/products/static-application-security-testing/.