Artificial intelligence can write news dispatches and riff somewhat coherently on prompts, but can it learn to navigate a fantasy text adventure game? That’s what scientists at Facebook AI Research, the Lorraine Research Laboratory in Computer Science and its Applications (LORIA), and University College London set out to discover in a recent study, described in a paper published this week on the preprint server arXiv.org (“Learning to Speak and Act in a Fantasy Text Adventure Game”).
The researchers specifically investigated the impact of grounding, the body of mutual knowledge, beliefs, and assumptions essential for communication between two people, on AI agents’ understanding of the virtual world around them. Toward that end, they built a research environment in the form of a large-scale, crowdsourced text adventure called LIGHT (Learning in Interactive Games with Humans and Text), within which AI systems and humans interact as player characters.
“[T]he current state of the art uses only the statistical regularities of language data, without explicit understanding of the world that the language describes,” the paper’s authors wrote. “[O]ur framework allows learning from both actions and dialogue, [and our] hope is that LIGHT can be fun for humans to interact with, enabling future engagement with our models. All utterances in LIGHT are produced by human annotators, thus inheriting properties of natural language such as ambiguity and coreference, making it a challenging platform for grounded learning of language and actions.”
Above: The chat interface for the crowdsourcing task.
Human annotators were tasked with creating location names (“frozen tundra,” “city in the clouds”), backstories (“bright white stone was all the fad for funeral architecture, once upon a time”), and character categories (“gravedigger”), in addition to a list of characters (“wizards,” “knights,” “village clerk”) with descriptions, personas, and sets of belongings. The researchers then separately crowdsourced objects with accompanying descriptions, as well as a range of actions (“get,” “drop,” “put,” “give”) and emotes (“applaud,” “blush,” “cry,” “frown”).
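The article doesn’t spell out how these crowdsourced elements are stored, but roughly speaking they amount to structured records like the following sketch, in which every class, field, and value is a hypothetical illustration rather than LIGHT’s actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical records for the crowdsourced world elements; the actual
# LIGHT schema isn't given in the article, so the field names here are
# illustrative assumptions.

@dataclass
class Location:
    name: str                                  # e.g. "frozen tundra"
    backstory: str                             # annotator-written history of the place
    objects: list[str] = field(default_factory=list)
    characters: list[str] = field(default_factory=list)

@dataclass
class Character:
    name: str                                  # e.g. "gravedigger"
    persona: str                               # first-person self-description
    belongings: list[str] = field(default_factory=list)

ACTIONS = ["get", "drop", "put", "give"]       # physical actions
EMOTES = ["applaud", "blush", "cry", "frown"]  # emotional expressions

graveyard = Location(
    name="graveyard",
    backstory="Bright white stone was all the fad for funeral architecture, once upon a time.",
    objects=["shovel", "headstone"],
    characters=["gravedigger"],
)
```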
Thanks to those efforts, LIGHT now comprises natural language descriptions of 663 locations drawn from a set of regions and biomes (such as “countryside,” “forest,” and “graveyard”), along with 3,462 objects and 1,755 characters.
With the boundaries of the game world established, the team set about compiling a dataset of “character-driven” interactions. They placed two human-controlled characters in a random location, each with objects assigned to the location and to their persons, and had them take turns, during which each could perform one action and say one thing. In total, the researchers recorded 10,777 such episodes of actions, emotes, and dialogue, which they used to train several AI models.
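An episode, then, is essentially a setting, two personas, and a sequence of turns. Here is a minimal sketch of that record, with all field names assumed rather than taken from the released dataset:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative shape of one recorded episode; the dataset's real format
# is not specified in the article, so these fields are assumptions.

@dataclass
class Turn:
    speaker: str            # which of the two characters is acting
    utterance: str          # the one thing said on this turn
    action: Optional[str]   # the one action taken, e.g. "give sword to knight"
    emote: Optional[str]    # optional emotional expression, e.g. "frown"

@dataclass
class Episode:
    location: str             # the randomly assigned setting
    personas: dict[str, str]  # character name -> persona description
    turns: list[Turn]         # alternating turns between the two characters
```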
Above: A dialogue sample from LIGHT.
Using Facebook’s PyTorch machine learning framework within ParlAI, a platform for dialogue AI research, the authors first devised a ranking model that produces a separate representation for each sentence of grounding information (setting, persona, objects) and combines them into a context embedding used to score candidate responses. They next tapped Google’s Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art natural language processing technique that draws on context from both the left and the right of each word, to build two systems: a bi-ranker, which they describe as a “fast” and “practical” model that encodes context and candidate responses independently, and a cross-ranker, a slower model that encodes context and response together, allowing more interplay between the two. Lastly, they used another set of models to encode context features (such as dialogue, persona, and setting) and generate actions.
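To make the bi-ranker idea concrete, here is a minimal sketch of that style of scoring, assuming an off-the-shelf bert-base-uncased encoder from the Hugging Face transformers library; the actual LIGHT models were built in ParlAI, and the mean pooling, dot-product scoring, and input formatting below are illustrative assumptions, with the encoder fine-tuned on LIGHT episodes in practice.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Bi-ranker sketch: encode the context and each candidate reply
# separately, then score candidates by dot product with the context
# embedding. In practice the encoder would be fine-tuned on LIGHT
# episodes; this untuned version just shows the wiring.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool BERT's final hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)     # (batch, tokens, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Grounding information (persona, setting, dialogue history) flattened
# into one context string; the exact formatting is an assumption.
context = (
    "persona: I am a gravedigger. I keep the graveyard tidy. "
    "setting: a moonlit graveyard at the edge of the village. "
    "partner: Hello there, what are you digging?"
)
candidates = [
    "A grave, of course. The old wizard passed last night.",
    "I love sailing on sunny days.",
    "Mind the shovels; the ground here is frozen solid.",
]

context_vec = embed([context])            # (1, 768)
candidate_vecs = embed(candidates)        # (3, 768)
scores = candidate_vecs @ context_vec.T   # (3, 1) dot-product scores
print(candidates[scores.argmax().item()])
```

Because candidates are encoded independently of the context, a bi-ranker can precompute and cache candidate embeddings, which is what makes it fast at retrieval time; a cross-ranker instead feeds context and candidate through the network together so attention can flow between them, which tends to score more accurately but requires a full forward pass for every candidate.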
So how’d the AI players fare? Pretty well, actually. They had a knack for drawing on past dialogue and for adjusting their predictions in light of the game world’s changing state, and grounding dialogue in the details of the local environment (its description, objects, and characters) enabled the AI-controlled agents to better predict behavior. None of the models matched human performance, the researchers note, but models given more grounding information (such as past actions, personas, or setting descriptions) performed measurably better. In fact, on tasks like dialogue prediction, the AI produced outputs appropriate for a given setting even when the dialogue and characters didn’t change, suggesting it had gained the ability to contextualize.
“We hope that this work can enable future research in grounded language learning and further the ability of agents to model a holistic world, complete with other agents within it,” the researchers wrote.