Amazon Alexa has tens of thousands of voice apps (skills, in Amazon's vernacular) contributed by third-party developers. Building those skills usually requires supplying examples of customer requests (e.g., "Order my usual") together with the actions to which those requests should map; the examples are used to train the AI system that processes real requests in production. That's labor-intensive, which is why scientists at Amazon are exploring techniques for pooling sample requests from similar skills at training time.
In a paper presented last week at the Association for Computational Linguistics conference in Florence, the coauthors write that the additional data improved performance by "plugging holes" in lists of example requests. Evaluating their approach on two public corpora and an internal corpus, they report that training an AI system simultaneously on multiple skills yielded better results than training it separately for each skill.
As the researchers note, multitask training runs the risk of causing a model to lose focus on task-specific structures. To avoid this, they forced their models to learn three different representations of all incoming data: a general representation, which encoded information shared across all tasks; a group-level representation, which captured commonalities among utterances in a given skill category; and a task-specific representation.
The AI systems the team used were of the encoder-decoder variety, meaning that they learned fixed-size representations (or encodings) of input data and used those as the basis for predictions (decoding). Each comprised separate encoder modules for individual tasks and for groups of tasks, along with a "switch" that controlled which encoders processed a given utterance.
There were four architectures in all. The first was a parallel system: it passed input utterances through a general encoder, a group-level encoder, and a task-specific encoder simultaneously, then combined the resulting representations and passed them to a task-specific decoder. The other three were serial, meaning the outputs of one bank of encoders passed to a second bank before moving on to the decoders.
During the training phase, the group-specific encoders learned how to best encode utterances characteristic of their groups, and the skill-specific encoders learned to encode utterances characteristic of their skills. As a result, the decoders — which always made task-specific predictions — were able to take advantage of three different representations of the input, ranging from general to specific.
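For a concrete picture, the parallel variant can be sketched roughly as follows in PyTorch. The layer types, dimensions, and module names below are illustrative assumptions for clarity, not the paper's actual implementation.

```python
# Illustrative sketch of the parallel variant; GRU encoders, sizes, and names
# are assumptions, not the paper's implementation details.
import torch
import torch.nn as nn

class ParallelMultiTaskModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, groups, tasks,
                 task_to_group, intents_per_task):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One shared ("general") encoder, one encoder per skill group,
        # and one encoder per individual skill (task).
        self.general_enc = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.group_encs = nn.ModuleDict(
            {g: nn.GRU(emb_dim, hid_dim, batch_first=True) for g in groups})
        self.task_encs = nn.ModuleDict(
            {t: nn.GRU(emb_dim, hid_dim, batch_first=True) for t in tasks})
        # The "switch": routes each task to its group-level encoder.
        self.task_to_group = task_to_group
        # Task-specific decoder heads (intent classification shown here;
        # a slot-tagging head would be added the same way).
        self.intent_heads = nn.ModuleDict(
            {t: nn.Linear(3 * hid_dim, intents_per_task[t]) for t in tasks})

    def forward(self, token_ids, task):
        x = self.embed(token_ids)          # (batch, seq_len, emb_dim)
        # All three encoders see the same utterance in parallel.
        _, h_general = self.general_enc(x)
        _, h_group = self.group_encs[self.task_to_group[task]](x)
        _, h_task = self.task_encs[task](x)
        # Concatenate general, group-level, and task-specific representations
        # and hand them to the task-specific decoder.
        combined = torch.cat([h_general[-1], h_group[-1], h_task[-1]], dim=-1)
        return self.intent_heads[task](combined)
```

In the serial variants, the output of one bank of encoders would feed into the next bank rather than being concatenated in a single step.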
The architectures were all tested on joint intent classification (i.e., figuring out the actions the AI system was supposed to take) and slot filling. In this context, slots are the data items on which the intent acts; in a request like "Play jazz in the kitchen," for instance, the music genre and the device location would fill slots. During training, the systems were rewarded when they accurately classified slots and intents but penalized when their group-level and general ("universe"-level) encodings made it easy to predict which skill an utterance belonged to. They were further rewarded if the task-specific encoders and the shared encoders captured different information.
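A common way to realize objectives like these is to combine standard cross-entropy losses for intents and slots with an adversarial skill-prediction penalty on the shared encodings and an orthogonality ("difference") term between shared and task-specific encodings. The sketch below, including the loss forms and weights, is an assumption about how such a setup is typically implemented rather than the paper's exact recipe.

```python
# Sketch of a combined training objective under common conventions; the exact
# loss forms and weights are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def multitask_loss(intent_logits, intent_labels,
                   slot_logits, slot_labels,
                   skill_logits_from_shared, skill_labels,
                   h_shared, h_task,
                   adv_weight=0.1, diff_weight=0.1):
    # Reward accurate intent and slot predictions (standard cross-entropy).
    intent_loss = F.cross_entropy(intent_logits, intent_labels)
    slot_loss = F.cross_entropy(slot_logits.flatten(0, 1), slot_labels.flatten())
    # Adversarial term: penalize shared (general/group) encodings that make the
    # source skill easy to identify. A discriminator predicts the skill from the
    # shared encoding; flipping the sign of its loss (as in gradient-reversal
    # training) pushes the encoder toward skill-invariant representations.
    adv_loss = -F.cross_entropy(skill_logits_from_shared, skill_labels)
    # Difference term: encourage the task-specific and shared encodings to carry
    # different information by penalizing their overlap (squared Frobenius norm
    # of the cross-correlation).
    diff_loss = torch.norm(h_shared.t() @ h_task, p="fro") ** 2
    return intent_loss + slot_loss + adv_weight * adv_loss + diff_weight * diff_loss
```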
In experiments involving three different data sets, two of the serial models yielded significantly better performance on mean intent accuracy and slot F1 (which factors in both false-negative and false-positive rates) than the baseline systems. On every test, one of the multitask systems was the best performer, with improvements of up to 9% over baseline.