That assertion is false. The model can't confidently "tell" for any given photograph. Rather, what Stanford's model can actually do 91 percent of the time is much less remarkable: it can identify which of two males is gay when it's already been established that one is and one is not.
This "pairing test" tells a seductive story, but it's a deceptive one. It translates to low performance outside the research lab, where there's no contrived scenario presenting such pairings. Employing the model in the real world would require a tough trade-off. You could tune the model to correctly identify, say, two thirds of all gay individuals, but that would come at a price: When it predicted someone to be gay, it would be wrong more than half of the time—a high false positive rate. And if you configure its settings so that it correctly identifies even more than two thirds, the model will exhibit an even higher false positive rate.
The reason is that one of the two categories is infrequent: in this case, gay individuals, who make up about 7 percent of males, according to the Stanford report. When one category is that rare, it is intrinsically more challenging to predict reliably.
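To see why, consider a quick back-of-the-envelope simulation. The sketch below does not use the Stanford model or its data; it assumes a hypothetical score model (two overlapping normal distributions calibrated so the AUC comes out near 0.91) and a 7 percent base rate, then checks what happens when the decision threshold is set to catch two thirds of gay individuals.

```python
import numpy as np

# A minimal sketch of the base-rate arithmetic, NOT the Stanford model itself.
# Assumptions (not from the paper): scores follow a "binormal" model with the two
# groups separated just enough to give AUC ~ 0.91, and a 7 percent base rate.

rng = np.random.default_rng(0)

n = 1_000_000
base_rate = 0.07                       # ~7 percent of males, per the Stanford report
is_gay = rng.random(n) < base_rate

# Separation d chosen so that AUC = Phi(d / sqrt(2)) ~ 0.91  ->  d ~ 1.9
d = 1.9
scores = rng.normal(0.0, 1.0, n) + d * is_gay

# Pick the threshold that "catches" two thirds of gay individuals.
threshold = np.quantile(scores[is_gay], 1 - 2 / 3)
flagged = scores >= threshold

recall = np.mean(flagged[is_gay])      # ~0.67 by construction
precision = np.mean(is_gay[flagged])   # share of flagged individuals who are actually gay

print(f"recall:    {recall:.2f}")      # ~0.67
print(f"precision: {precision:.2f}")   # ~0.4: most positive predictions are wrong
```

Under those assumptions, roughly six out of every ten people the model flags are straight, despite the respectable-sounding AUC. That is the trade-off described above.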
Now, the researchers did report a viable measure of performance, called AUC, albeit mislabeled in their report as "accuracy." AUC (area under the receiver operating characteristic curve) indicates the range of performance trade-offs a predictive model makes available: the higher the AUC, the better the trade-off options the model offers.

In the field of machine learning, accuracy means something simpler: how often the predictive model is correct, that is, the percentage of cases it gets right. When researchers use the word to mean anything else, they are at best exhibiting willful ignorance and at worst consciously laying a trap to ensnare the media.
But researchers face two publicity challenges: how do you make something as technical as AUC sexy, and how do you sell your predictive model's performance? No problem. As it turns out, the AUC is mathematically equal to the result you get from running the pairing test. And so a 91 percent AUC can be dressed up as a story about distinguishing between pairs, one that sounds to many journalists like "high accuracy," especially when the researchers commit the cardinal sin of simply, and falsely, calling it "accuracy." Voilà! Both the journalists and their readers believe the model can "tell" whether you're gay.
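For readers who want to verify that equivalence, here is a small sketch using made-up scores rather than the study's data: scikit-learn's roc_auc_score on one side, and a brute-force tally over every positive-negative pair on the other. The two numbers coincide because the area under the ROC curve equals the probability that a randomly chosen positive case outranks a randomly chosen negative one, which is exactly what the pairing test measures.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy data only: reuse the hypothetical binormal score model from the earlier sketch.
rng = np.random.default_rng(1)
y = rng.random(5000) < 0.07                      # ~7 percent positives
scores = rng.normal(0.0, 1.0, 5000) + 1.9 * y

auc = roc_auc_score(y, scores)

# Pairing test: for every (positive, negative) pair, how often does the
# positive outrank the negative? Ties count as half a win.
pos, neg = scores[y], scores[~y]
wins = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(f"AUC:          {auc:.3f}")
print(f"pairing test: {wins:.3f}")   # the same number
```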
This “accuracy fallacy” scheme is applied far and wide, with overblown claims about machine learning accurately predicting, among other things, psychosis, criminality, death, suicide, bestselling books, fraudulent dating profiles, banana crop diseases and various medical conditions. For an addendum to this article that covers 20 more examples, click here.
In some of these cases, researchers perpetrate a variation on the accuracy fallacy scheme: they report the accuracy you would get if half the cases were positive, that is, if the common and rare categories occurred equally often. Mathematically, the number this produces usually comes out a bit lower than the AUC, but it's a similar maneuver and overstates real-world performance in much the same way.
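As an illustration, and again using the hypothetical score model from the earlier sketch rather than any study's actual data, here is what that balanced-classes "accuracy" looks like next to the 91 percent AUC.

```python
import numpy as np

# Illustrative numbers only: pretend the two categories occur equally often
# (a 50/50 split) and measure accuracy at the best single threshold.
rng = np.random.default_rng(2)
n_per_class = 50_000
neg = rng.normal(0.0, 1.0, n_per_class)   # the common category
pos = rng.normal(1.9, 1.0, n_per_class)   # the rare category, oversampled to 50 percent

# For a balanced mix of these two assumed distributions, the best threshold
# sits halfway between their means.
threshold = 1.9 / 2
balanced_accuracy = 0.5 * np.mean(pos >= threshold) + 0.5 * np.mean(neg < threshold)

print(f"balanced 'accuracy': {balanced_accuracy:.2f}")   # ~0.83: a bit below the 0.91 AUC,
                                                         # yet silent about what happens
                                                         # at a 7 percent base rate
```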
In popular culture, "gaydar" refers to an unattainable form of human clairvoyance. We shouldn't expect machine learning to attain supernatural abilities either. Many human behaviors defy reliable prediction; it's like forecasting the weather many weeks in advance. There's no achieving high certainty. There's no magic crystal ball. Readers at large must hone a certain vigilance: be wary of claims of "high accuracy" in machine learning. If it sounds too good to be true, it probably is.