A Blog by Jonathan Low


Jul 26, 2016

Turing Tests Indicate That If We Want Digital Assistants To Really Understand Us, We Have A Long Way To Go

The challenge, as is usual with many human-device interface issues, is training machines to understand what people think of as common sense.

The problem is that the cultural parsing we take for granted thanks to years of experience often sounds ambiguous - or ridiculous - to the software we have programmed to assist us. JL

Will Knight reports in MIT Technology Review:

Programs were little better than random at choosing the correct meaning of sentences. The best were correct 48 percent of the time, compared to 45 percent if the answers are chosen at random. Common-sense reasoning will become more important as smart appliances or wearable gadgets become more common.
User: Siri, call me an ambulance.
Siri: Okay, from now on I’ll call you “an ambulance.”
Apple fixed this error shortly after its virtual assistant was first released in 2011. But a new contest shows that computers still lack the common sense required to avoid such embarrassing mix-ups.
The results of the contest were presented at an academic conference in New York, and they provide some measure of how much work needs to be done to make computers truly intelligent.
The Winograd Schema Challenge asks computers to make sense of sentences that are ambiguous but usually simple for humans to parse. Disambiguating Winograd Schema sentences requires some common-sense understanding. In the sentence “The city councilmen refused the demonstrators a permit because they feared violence,” it is logically unclear who the word “they” refers to, although humans understand because of the broader context.
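To make the shape of such a question concrete, here is a minimal Python sketch. The `WinogradSchema` class and its field names are hypothetical and are not the contest's actual data format. The "advocated" variant is the twin sentence from Terry Winograd's original formulation of this example: changing one word flips the answer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WinogradSchema:
    """One ambiguous sentence plus what is needed to grade an answer.

    Hypothetical layout for illustration only; not the contest's format.
    """
    sentence: str
    pronoun: str
    candidates: Tuple[str, ...]
    answer: str  # the candidate a human reader would pick

    def is_correct(self, guess: str) -> bool:
        """Return True when the guessed referent matches the human answer."""
        return guess == self.answer


CANDIDATES = ("the city councilmen", "the demonstrators")

feared = WinogradSchema(
    sentence="The city councilmen refused the demonstrators a permit "
             "because they feared violence.",
    pronoun="they",
    candidates=CANDIDATES,
    answer="the city councilmen",
)

# Same sentence with one verb swapped: the referent of "they" flips,
# which is why grammar alone cannot settle the question.
advocated = WinogradSchema(
    sentence="The city councilmen refused the demonstrators a permit "
             "because they advocated violence.",
    pronoun="they",
    candidates=CANDIDATES,
    answer="the demonstrators",
)
```

Both sentences share the same grammar and the same two candidates, so a program that always picks the grammatical subject gets exactly one of them right.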
The programs entered into the challenge were only slightly better than random at choosing the correct meaning of sentences. The best two entrants were correct 48 percent of the time, compared to 45 percent for answers chosen at random. To be eligible to claim the grand prize of $25,000, entrants would need to achieve at least 90 percent accuracy. The joint best entries came from Quan Liu, a researcher at the University of Science and Technology of China, and Nicos Isaak, a researcher from the Open University of Cyprus.
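The comparison against chance can be sketched in a few lines of Python. The function names and the sample question set below are made up for illustration, and the article does not say how the 45 percent baseline was computed. One assumption that would give a baseline below 50 percent is that some questions offered more than two candidate answers. The sketch only demonstrates that effect and does not reproduce the contest's figures.

```python
from fractions import Fraction


def chance_accuracy(candidate_counts):
    """Expected accuracy of uniform random guessing.

    candidate_counts: number of answer choices offered by each question.
    """
    expected_hits = sum(Fraction(1, n) for n in candidate_counts)
    return expected_hits / len(candidate_counts)


def accuracy(guesses, answers):
    """Fraction of questions answered correctly."""
    hits = sum(g == a for g, a in zip(guesses, answers))
    return Fraction(hits, len(answers))


# From the article: at least 90 percent is needed for the grand prize.
PRIZE_THRESHOLD = Fraction(90, 100)

# Made-up question set (an assumption, not contest data):
# eight two-way questions and two three-way questions.
counts = [2] * 8 + [3] * 2
baseline = chance_accuracy(counts)

print(f"chance baseline: {float(baseline):.1%}")
print(f"48 percent clears the prize bar: {Fraction(48, 100) >= PRIZE_THRESHOLD}")
```

Under this assumption a few three-way questions are enough to pull the chance baseline under one half, so a 48 percent score is a smaller margin over guessing than it would first appear.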
“It’s unsurprising that machines were barely better than chance,” says Gary Marcus, a research psychologist at New York University and an advisor to the contest. That’s because giving computers common-sense knowledge is notoriously difficult. Hand-coding knowledge is impossibly time-consuming, and it isn’t simple for computers to learn about the real world by performing statistical analysis of text. Most of the entrants in the Winograd Schema Challenge try to use some combination of hand-coded grammar understanding and a knowledge base of facts.
Marcus, who is also the cofounder of a new AI startup, Geometric Intelligence, says it’s notable that Google and Facebook did not take part in the event, even though researchers at these companies have suggested they are making major progress in natural language understanding. “It could’ve been that those guys waltzed into this room and got a hundred percent and said ‘hah!’” he says. “But that would’ve astounded me.”
The contest does not only serve as a measure of progress in AI. It also shows how hard it will be to build more intuitive and graceful chatbots, and to train computers to extract more information from written text.
Researchers at Google, Facebook, Amazon, and Microsoft are turning their attention to language. They are using the latest machine learning techniques, especially “deep learning” neural networks, to develop smarter, more intuitive chatbots and personal assistants (see “Teaching Machines to Understand Us”). As a matter of fact, with chatbots and voice assistants becoming more common, and with dramatic progress in areas like image and speech recognition, you might think that machines were getting pretty good at understanding language.
One of the two first-place entries did, in fact, use a cutting-edge machine learning approach. Liu’s group, which included researchers from York University in Toronto and the National Research Council of Canada, used deep learning to train a computer to recognize the relationship between different events, such as “playing basketball” and “winning” or “getting injured,” from thousands of texts.
“I was delighted to see deep learning used,” says Leora Morgenstern, a senior scientist at Leidos Corporation, a technology consulting firm, and one of the organizers of the challenge.
Liu’s team claims that after fixing a problem with the way its system parsed the contest’s questions, it is almost 60 percent accurate. Morgenstern cautions, however, that even if these claims were confirmed, the accuracy would still be far worse than a human's.
Winograd Schema sentences were first highlighted as a way to gauge machine comprehension by Hector Levesque, an artificial-intelligence researcher at the University of Toronto. They are named after Terry Winograd, a pioneer in the field and a professor at Stanford University who built one of the first conversational computer programs.
The challenge was proposed in 2014 as an improvement on the Turing Test. Alan Turing, a forefather of computing and artificial intelligence who in the 1950s pondered whether machines might one day think as humans do, suggested a simple way of testing a machine’s intelligence. His idea was for a machine to try to fool a person, in a text conversation, into thinking that they were conversing with another human.
The problem with the Turing Test is that it’s often easy for a program to fool a person using simple tricks and evasions. But a program cannot parse Winograd Schema or other ambiguous sentences without some form of general knowledge.
The contest could have significant practical implications. “It’s going to come up when you start to support dialogues,” says Charlie Ortiz, a senior principal researcher at Nuance, a company that makes voice recognition and voice interface software, which sponsored the Winograd Schema Challenge. Ortiz says common-sense reasoning will be required for even simple conversations with computers. “In shopping, if I say, ‘I want to get a case for my guitar; it should be strong.’ So does ‘it’ refer to the case or the guitar?”
Marcus adds that common-sense reasoning will become more important as devices such as smart appliances or wearable gadgets become more common. “When you want to ask a query of your watch you don’t get to scroll through 50 choices,” he says. “When you start talking to your car or your watch, and you get rid of the typing modality and want to have a connected set of sentences—this conversational discourse—people just naturally refer back to things, and you need to solve these problems to make it work.”


Anonymous said...

Is part of the problem that English has "no grammar"? How does AI fare in other more complete languages like Latin or Swahili?

Anonymous said...

Turing didn't exactly "ponder", in 1950 he released the founding description of the field of AI, and was convinced we would achieve machines thinking at the level of a human by the close of the century (earlier even, based on his estimates of chip density improvements). As usual, this article contains a basic misapprehension of the Turing test, which consists of an observer determining of TWO participants which is the human and which is the machine. Both participants have an interest in tricking the observer, and the problem of assignment is far more robust than the classification of a single stream of interaction. Turing considers various pitfalls and variations (including ESP!), and decided that assessing the ability of the machine to trick the observer should be measured over multiple attempts (he believed that machines would win more than 70% of the time within 50 years). As such, no machine has currently satisfied the test as defined, despite some notable charlatans claiming otherwise, or stating that the test is outdated. In fact, it provides just as solid a framework today as it did in 1950.
