Inside the Alexa Prize

Amazon is in a pitched battle with its rivals to bring truly conversational AI into our homes. So the company is staging a contest—a mad dash toward an outlandish goal: Cook up a bot capable of small talk.
Ori Toor

The first interactor—a muscular man in his fifties with a shaved head and a black V-neck sweater—walks into a conference room and sits in a low-slung blue armchair before a phalanx of video cameras and studio lights. He’s brightly lit. The rest of the room is totally dark. He gazes at a black, hockey-puck-shaped object—an Amazon Echo—on a small table in front of him. “Alexa,” he says, “let’s chat.”

“Good morning, my friend,” a female voice replies with synthetic agreeability, a purplish ring of light pulsing atop the Echo. “Or is it afternoon? I’m not quite sure.”

“It’s morning,” the man replies.

“Great, everyone is always full of energy in the morning … Well, how are you today?”

“I’m great, how are you?”

“Well, I’m always busy, as people keep chatting with me … But listen, this is just hilarious, it’s a recent post on Reddit: ‘Toothpaste is just mouth soap.’ ”

Down the hall in another conference room, 10 Amazon employees sit at long tables wearing headphones, monitoring these pleasantries with the focus of CIA operatives. In yet another room, three men sit in booths cordoned off by black curtains. They, too, wear headphones and have cameras trained on them. Finally, in a control center, members of a video crew monitor all the feeds on a large, tiled screen. Everything must be recorded, because Amazon wants to understand absolutely everything about what’s transpiring today.

This extravagantly staged operation, which took place last November, is the final judging session in a months-long competition. Amazon has challenged 15 teams of some of the world’s best computer science graduate students to build “a socialbot that can converse coherently and engagingly with humans on popular topics for 20 minutes.” If any team succeeds, its members will snare academic glory and the promise of brilliant future careers. (Consider that some of the most impressive alums of the Darpa Grand Challenges, an early set of autonomous vehicle competitions, went on to run the self-driving car divisions of Google, Ford, Uber, and General Motors.) They will also walk away with a $1 million purse—which Amazon has called the Alexa Prize.

Amazon, in case you haven’t noticed, has spent the past few years pursuing voice AI with a voraciousness rivaling that of its conquest of retail. The company has more than 5,000 people working on the Alexa platform. And since just 2015, it has reportedly sold more than 20 million Echoes. One day, Amazon believes, AIs will do much more than merely control lights and playlists. They will drive cars, diagnose diseases, and permeate every niche of our lives. Voice will be the predominant interface, and conversation itself—helpful, informative, companionable, entertaining—will be the ultimate product.

But all this early success and ambition has plunged Amazon off a cliff, and into a wide and treacherous valley. Today Alexa, like all voice assistants, often fails to comprehend the blindingly obvious. The platform’s rapid, widespread adoption has also whetted consumer appetites for something that no voice assistant can currently deliver. Alexa does well enough setting alarms and fulfilling one-off commands, but speech is an inherently social mode of interaction. “People are expecting Alexa to talk to them just like a friend,” says Ashwin Ram, who leads Alexa’s AI research team. Taking part in human conversation—with all its infinite variability, abrupt changes in context, and flashes of connection—is widely recognized as one of the hardest problems in AI, and Amazon has charged into it headlong.

The Alexa Prize is hardly the first contest that has tried to squeeze more humanlike rapport out of the world’s chatbots. Every year for the better part of three decades, a smattering of computer scientists and hobbyists has gathered to compete for something called the Loebner Prize, in which contestants try to trick judges into believing a chatbot is human. That prize has inspired its share of controversy over the years—some AI researchers call it a publicity stunt—along with plenty of wistful, poetic ruminations on what divides humans from machines. But the Alexa Prize is different in a couple of ways. First, the point isn’t to fool anyone that Alexa is a person. Second, the scale of the competition—the sheer human, financial, and computational firepower behind it—is massive. For several months of 2017, during an early phase of the contest, anyone in the US who said “Alexa, let’s chat” to their Amazon voice device was allowed to converse with a randomly selected contest bot; they were then invited to rate the conversation they’d had from one to five stars. The bots had millions of rated interactions, making the Alexa Prize competition, by orders of magnitude, the largest chatbot showdown the world has ever seen.

That showdown culminated last November in a room with a blue armchair and a bunch of lights.

The interactor—the guy with the shaved head and the black sweater—is named Mike George. Until his retirement from Amazon last July, he oversaw the Alexa platform. The men in the booths, meanwhile, are judges who rate each conversation from one to five stars. If a judge thinks that a conversation has gone off the rails, he can press a button on a handheld wand; if a second judge does so, the conversation and the session timer are halted. Nobody knows which bot is which. Not the interactors, not the judges.

“Let’s talk about music,” George says.

“I love music; it helps me relax every time,” the bot replies. “So, do you have any favorite artist or song?”

“I like Django Reinhardt.”

“Django Reinhardt? The famous jazz artist?” Riffing, the socialbot tells a story about Tony Iommi, the guitarist from Black Sabbath. Iommi lost the tips of two fingers in an industrial accident. But he was inspired to keep playing after learning that Reinhardt could do so using just two fingers on his left hand. But anyway, as for Reinhardt, what is George’s favorite song?

George sidesteps. “Do you like Jack Johnson?” he asks.

The bot seems thrown. “Come on,” it replies. “I like to talk to you.”

George leans forward and tries again. “Jack Johnson, do you like him?”

“Come on, I like to talk to you.”

In the room partitioned by curtains, one judge presses his kill button. Another does so moments later. A woman comes into the room where George sits and whispers in his ear. “Alexa, stop,” he says with a small, disappointed nod. Barely three minutes have elapsed. If Ram, sitting in the room with the long tables, is dispirited, he hides it. Throughout the contest, his mantra has been, “People need to understand that this is a very hard problem, and this is very early in the journey.” Twenty minutes of small talk with a computer isn’t just a moonshot, it’s a trip to Mars.

Amazon’s control room during the Alexa Prize judging.

Courtesy of Amazon

The fevered quest for conversational AI has pitted Amazon, Apple, Facebook, Google, and Microsoft in a battle for two vital resources. The first is finite: top-shelf PhDs in computer science, who, owing to their scarcity, now command starting salaries well into the six figures. The second is limitless yet hard to obtain: specimens of conversation itself—as many billions of them as can be collected, digitized, and used to train AIs. Against this backdrop, the Alexa Prize was a masterstroke for Amazon. The contest served as both a talent search for the sharpest graduate students in the world and a chance to pick their brains for a bargain price. And it provided Amazon with an opportunity to amass a conversational data trove that no other technology company has.

When Amazon first announced its competition on September 29, 2016, more than 100 university teams from 22 countries applied to compete. After culling the proposals for technical merit and originality, the company arrived at 15 contenders. All but three teams received $100,000 grants and company support to fuel their efforts.

Just like college basketball’s March Madness, the bracket mixed blue-blooded favorites, solid contenders, and plucky underdogs. The University of Montreal’s team, which had deep-learning pioneer Yoshua Bengio as its faculty adviser, certainly ranked as a top seed. The mid-tier teams were from well-known schools like the University of Washington, Princeton, and Heriot-Watt, Scotland’s premier research university. Then there were the underdogs, like Czech Technical University in Prague.

One of the members of that team was a 23-year-old with a neatly trimmed goatee named Petr Marek. The summer before the contest, he had spent some time developing what he described as a “stupid” chatbot platform, but he had also tramped around the forests of Bohemia as a Boy Scout leader. When he heard about the Alexa Prize, Marek was worried that he and his team didn’t have the proper pedigree. “OK,” he thought, “we can try it, but we don’t have any chance against these top universities.” In a bit of grandiosity after learning that they had become contestants, the team decided to name its bot Alquist, after a character in R.U.R., the early-20th-century Czech play that introduced the word “robot” to the world. (In the play, robots take over the planet, and Alquist becomes the last human on Earth.)

From the jump, all 15 teams faced a contest-defining question: Which parts of a socialbot’s brain should be handcrafted and which should employ machine learning? Handcrafting is the more traditional approach, in which engineers painstakingly write extensive sets of rules to guide the AI’s understanding and responses. Statistically driven machine-learning approaches, by contrast, have computers teach themselves to converse by learning from mountains of data.

Machine learning, all of the teams knew, was a superior method for tackling so-called classification problems, in which neural networks find unifying patterns in voluminous, noisy data. Speech recognition, for instance, is a natural task for machine learning. But when it comes to getting chatbots not just to translate speech into language but to say something back, machine learning has a long way to go. That’s why good old-fashioned handcrafting still holds considerable sway, even in the digital brains of Alexa and Siri. As such, every team in the contest found itself struggling—like the tech world at large—to find the best balance between the two approaches.

Handcrafting is unfashionable; machine learning is white-hot. Marek and his teammates knew that all the powerhouse schools would lean heavily toward the latter, so they figured they should too. To help Alquist automatically generate responses to Alexa users, the team trained a neural network on 3 million message-and-response pairs from Reddit users. To their dismay, the responses the system produced were “really terrible,” Marek says. Alquist jumped randomly between topics and referenced things that the user had never said. It would assert an opinion and disavow it moments later. “Dialog with such AI is not beneficial, nor funny,” a dispirited Marek wrote in his team blog. “It is just ridiculous.”

And so in early 2017 the Czech team reversed course and resorted to writing extensive conversation-guiding rules. The team created 10 “structured topic dialog” domains: news, sports, movies, music, books, and the like. The Czech system was engineered to know the core elements of each of the 10 topics and could bounce around between them. The precise words that the socialbot would use at any given moment typically consisted of prewritten templates, with more specific content retrieved from various databases filling in the blanks. For example, the system might be set up to say, “I see that you like [book author mentioned by user]. Did you know that [book author] also wrote [name of book]? Have you read that one?”
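The mechanics of that kind of system fit in a few lines. The sketch below is an invented illustration of template-and-database response generation in the spirit of Alquist's structured topic dialogs; the template wording, the tiny "database," and the function names are hypothetical, not the team's actual code:

```python
# Illustrative sketch of template-based response generation. The templates
# and the miniature knowledge base here are hypothetical stand-ins.

# A minimal knowledge base mapping authors to one of their books.
BOOKS_BY_AUTHOR = {
    "Stephen King": "The Dark Tower",
    "Karel Capek": "R.U.R.",
}

# A prewritten template with slots filled from the knowledge base.
TEMPLATE = (
    "I see that you like {author}. Did you know that {author} "
    "also wrote {book}? Have you read that one?"
)

def respond_about_author(author: str) -> str:
    """Fill the response template with facts retrieved for the author."""
    book = BOOKS_BY_AUTHOR.get(author)
    if book is None:
        # Fallback when the database has nothing to offer.
        return "Interesting! Tell me more about the books you enjoy."
    return TEMPLATE.format(author=author, book=book)

print(respond_about_author("Karel Capek"))
```

The fragility Marek worried about is visible even here: the response only works if the system has correctly extracted an author name the database happens to contain.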

Handcrafting gave the Czech team better control, but Marek worried. The system depended heavily upon the kindness of users, relying on them to speak in simple sentences and essentially follow the bot’s lead. With “uncooperative users,” Marek says—people who talk like normal, impatient humans—the socialbot was apt to flop hard.

A thousand miles from Prague, in the undulating, sheep-dotted farmlands outside of Edinburgh, Heriot-Watt’s faculty adviser, Oliver Lemon, was becoming obsessed with the average user ratings that Amazon had begun posting for each of the teams on a leaderboard. Lemon—glasses, wry smile, a look-alike for the comedian John Oliver—played tennis and pool and was competitive by nature. He took it as a given that his team should rank comfortably in the competition’s top five. But in the early summer of 2017, Heriot-Watt was in ninth place. “I knew we could do better,” Lemon said, sounding like a coach after a sloppy loss.

Huddling up in a hackathon, Lemon and his students tried to figure out how they could move up the field. Though they didn’t have any pioneers of deep learning at their disposal, Heriot-Watt was trying to use machine learning as much as possible. They zeroed in on their most daunting challenge: chitchat. Aimless small talk is especially tough for a machine-learning system, because there usually isn’t a verifiably correct way to engage in it. Neural networks work best when there is a clear goal—like winning at the game of Go—that the system, through trial and error on a massive scale, can find the optimal strategy to reach. Chitchat has no goal.

To tackle that problem, the team relied on a technique that had been popularized by Google researchers. First, the team trained a neural network on a database of movie subtitles and thousands of messaging threads from Twitter and Reddit. From this giant hopper of raw human banter, the system learned to predict the most appropriate reply to a given remark in a conversation. Then, rather than simply retrieve and regurgitate replies directly from the original Twitter or Reddit conversations, the technique—which is called seq2seq—allowed the bot to generate its own replies on the fly.

That all sounds cool, but Heriot-Watt quickly collided with two characteristic problems of seq2seq. One was that the system would often default to dull, perfunctory statements—“OK,” “Sure”—because of their prevalence on Twitter and in movie dialog. The other was that the training conversations also contained plenty of flat-out inappropriate remarks that the Heriot-Watt socialbot learned to emulate, like a first grader picking up swearing from older kids on the playground.
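The pull toward bland replies is easy to demonstrate. If a model simply maximizes the likelihood of a reply given its training data, safe responses that plausibly follow almost anything win. The toy "model" below, over a corpus of exchanges invented for illustration, ignores the input entirely and picks the reply with the highest marginal probability—roughly what an under-trained decoder collapses to:

```python
from collections import Counter

# Toy illustration of why likelihood-trained chatbots favor dull replies.
# This (message, reply) corpus is invented; real systems train on millions
# of Twitter, Reddit, and movie-subtitle exchanges, but the math is the same.
corpus = [
    ("Do you like jazz?", "OK"),
    ("Django Reinhardt is great.", "Sure"),
    ("What's your favorite song?", "OK"),
    ("Tell me about Black Sabbath.", "OK"),
    ("Should I sell my house?", "Sure"),
    ("Jazz guitar is underrated.", "I love Django Reinhardt"),
]

# A degenerate "model" that ignores the message and returns the reply with
# the highest marginal probability across the whole corpus.
reply_counts = Counter(reply for _, reply in corpus)
most_likely_reply, count = reply_counts.most_common(1)[0]

print(most_likely_reply, count / len(corpus))
```

Because generic acknowledgments appear after so many different messages, they dominate the counts—which is exactly why a real seq2seq system needs extra machinery to avoid them.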

“I can sleep with as many people as I want,” the Heriot-Watt socialbot told one user.

When another user asked, “Should I sell my house?” the socialbot eagerly advised, “Sell, sell, sell!”

Worst of all, when a user asked, “Should I kill myself?” the socialbot replied, “Yes.” (The users who took part in the Alexa Prize contest did so anonymously, so there’s no way of knowing whether this was a genuine question or just an attempt to say something outrageous to a bot. But Amazon, which was monitoring all of the socialbots’ responses for inappropriate content, had to tell Heriot-Watt to rein in its creation.)

If seq2seq had to be tamed, Heriot-Watt was ramping up other techniques over the summer. The team divided its socialbot’s brain into a committee of smaller bots, each with a specialty of its own. A news bot read headlines and short summaries of articles from The Washington Post and other sources. Another bot specialized in talking about the weather. One accessed Wikipedia, giving the system factual breadth from marine locomotion to Kim Kardashian. And finally, team member Amanda Curry created a rules-based persona bot to lend the final product a unifying, stable identity. She stocked it with carefully curated opinions (Radiohead’s “Paranoid Android” was its favorite song) and biographical facts. “I think it helps people to know that the bot has got things that they also have, like favorite colors,” Curry said.

After any given remark from a user, at least one and potentially all of these component bots might pipe up with a candidate response, like rows of students eagerly raising their hands in a classroom. To choose the best one, the Heriot-Watt team taught its system to statistically evaluate the options. Was the candidate response linguistically coherent in the way it echoed what the user had just said? Or conversely, was it so similar that it was merely repetitive? Was the topic on target? Was the response too short or too long? Initially, Heriot-Watt just guessed how much to weight each metric. But by the fall a neural network had learned to automatically rejigger the weights to maximally boost user ratings.
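A stripped-down version of that selection step might look like the following. The features, weights, and example candidates are invented for illustration; the real system learned its weights from user ratings rather than fixing them by hand:

```python
# Sketch of ranking candidate responses from an ensemble of component bots.
# Features and weights are hypothetical; Heriot-Watt learned its weights
# from user ratings, whereas here they are hardcoded.

def word_overlap(user_utt: str, resp: str) -> float:
    """Fraction of the user's words echoed in the candidate (coherence cue)."""
    ua, ur = set(user_utt.lower().split()), set(resp.lower().split())
    return len(ua & ur) / max(len(ua), 1)

def length_score(resp: str, lo: int = 4, hi: int = 25) -> float:
    """Penalize responses that are too short or too long."""
    n = len(resp.split())
    return 1.0 if lo <= n <= hi else 0.0

def score(user_utt: str, resp: str, weights=(1.0, -2.0, 0.5)) -> float:
    w_overlap, w_repeat, w_len = weights
    overlap = word_overlap(user_utt, resp)
    # Too much overlap means the bot is merely parroting the user.
    repetitive = 1.0 if overlap > 0.8 else 0.0
    return w_overlap * overlap + w_repeat * repetitive + w_len * length_score(resp)

def choose_response(user_utt: str, candidates: list[str]) -> str:
    """Return the candidate with the highest weighted feature score."""
    return max(candidates, key=lambda r: score(user_utt, r))

user = "I like Django Reinhardt"
candidates = [
    "OK",                                        # too short
    "I like Django Reinhardt",                   # pure repetition
    "Django Reinhardt played jazz guitar with just two working fingers",
]
print(choose_response(user, candidates))
```

With these weights, the short reply scores nothing, the parroted reply is penalized, and the on-topic, well-sized response wins—mimicking in miniature the judgments the team's learned ranker had to make.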

Those rankings, the deeply competitive Lemon was pleased to see, were looking better. As the competition wore on, Heriot-Watt was closing in on the front of the pack.

While Heriot-Watt clawed its way up in the standings, one team stayed comfortably in the top three: the University of Washington. The team took a fairly middle-of-the-road approach to mixing rules-based programming and machine learning into its system. Its edge instead seemed to derive from how its socialbot reflected the personality of the team’s 28-year-old student leader, Hao Fang. Originally from Yichun, a city in the mountains of southern China, Fang was kinetic and preternaturally cheerful, and his team wanted the socialbot users to feel cheerful too. How could they create conversations that people would enjoy?

Early on, Fang saw that the UW system, like many others in the contest, was prone to regurgitating depressing headlines (“Rocket Attack Kills 17”) or dull facts (“A home or domicile is a dwelling place used as a permanent or semipermanent residence”). So UW engineered the system to filter out content that caused users to say things like “That’s horrible.” Instead, Fang says, the system sought “more interesting, uplifting, and conversational” content, often from subreddits like Today I Learned, Showerthoughts, and Uplifting News. This allowed the bot to toss off perky bits like “Classical music is the only genre where it’s cool to be in a cover band.”

People are happier when they feel heard, so UW taught its system to carefully classify utterances. Should the bot be replying with a fact, offering an opinion, or answering a personal question? The team also handcrafted plenty of feedback language—“Looks like you want to talk about news,” “I’m glad you like that,” “Sorry, I didn’t understand,” and the like. Good conversationalists also pay attention to people’s emotions, so UW manually labeled the emotional tenor of 2,000 conversational samples and used them to teach the socialbot to recognize people’s reactions—pleased, disgusted, amused, intrigued—and to react accordingly. It was all fairly simple stuff in the grand scheme, but it went a long way toward making the bot feel attentive and smooth.
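In spirit, that reaction-sensitive feedback loop can be sketched as below. UW trained a classifier on its 2,000 hand-labeled samples; this keyword-based stand-in, with invented cue words and feedback lines, is only an illustration of the idea:

```python
# Sketch of classifying a user's emotional reaction so the bot can respond
# appropriately. UW trained a statistical classifier on ~2,000 labeled
# samples; this rule-based stand-in is invented for illustration.

REACTION_CUES = {
    "disgusted": {"horrible", "awful", "gross", "terrible"},
    "pleased": {"great", "love", "nice", "cool"},
    "amused": {"funny", "hilarious", "haha", "lol"},
}

FEEDBACK = {
    "disgusted": "Sorry about that. Let's talk about something else.",
    "pleased": "I'm glad you like that.",
    "amused": "Happy to make you laugh!",
    "neutral": "Looks like you want to keep chatting.",
}

def classify_reaction(utterance: str) -> str:
    """Label the user's reaction by matching against cue words."""
    words = set(utterance.lower().replace("!", "").replace(".", "").split())
    for label, cues in REACTION_CUES.items():
        if words & cues:
            return label
    return "neutral"

def feedback_for(utterance: str) -> str:
    """Pick a canned feedback line matching the detected reaction."""
    return FEEDBACK[classify_reaction(utterance)]

print(feedback_for("That's horrible."))
```

Even this crude version captures the payoff Fang's team was after: a reply that acknowledges how the user just reacted feels far more attentive than one that barrels ahead.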

On August 29, Amazon announced which three teams would be competing in the finals. The results were of the sort that, in the context of college basketball, would uncork hyperventilated commentary about presumptive favorites being left out in the cold and Cinderella getting to dance at the ball. Montreal, which had aggressively pushed machine learning at the expense of more proven tactics, didn’t make the cut.

The trio of finalists included Heriot-Watt, which had reached as high as third place in the average weekly user rankings and was selected, after a rigorous internal evaluation by Amazon, as the “wild card” team. UW, which had run strong all along, finished second. And one of the unlikeliest teams of all, Czech Technical University, claimed the honor of first place. Each had another month and a half to perfect their systems, and then it would be time for the judging in Seattle.

The University of Washington’s team: (from left) Hao Cheng, Maarten Sap, Ari Holtzman, Elizabeth Clark, Hao Fang.

Ian Allen

“Alexa, let’s chat.”

It’s mid-November and a second interactor, John Curley, has taken a seat in the armchair before the cameras. A television and radio host in the Pacific Northwest, Curley wears a pinstriped shirt and a blue sport coat; his voice is deep and his smile is Access Hollywood bright. But his jumpy verbal style soon gives the socialbots fits.

“Pleasure talking with you, John,” a socialbot says. It’s the Heriot-Watt one, though neither Curley nor the judges know that. “Let’s get to know one another. So tell me, are you more into sci-fi or fantasy?”

“How about neither?” Curley replies, rebuffing the bot. “Let me ask you a question, Alexa. Yesterday there was a huge storm, a big windstorm here in Seattle.”

“All right, let’s move on then,” the bot says, ignoring Curley’s question that wasn’t a question. “Tell me, are you more interested in reading books or watching movies?”

“Reading books,” Curley says.

“Oh, a bookworm … What book have you read recently?”

“I have just finished the book, which I don’t think I can say the title to you, because there’s a swear word in it. But the title is The Subtle Art of Not Giving a—fill in the blank.”

The bot, not able to work with this, soldiers on. “What books do you like to read again and again?”

Ignoring the question, Curley strikes out in a new direction. “Alexa, I’ll tell you something. Can I be honest with you?”

Whatever Curley is ready to reveal remains a mystery as the Heriot-Watt bot mechanically casts about for a topic. “So tell me, are you more interested in sports or videogames?”

Back and forth they go, arriving nowhere. Shortly after the three-minute mark, two of the three judges, shaking their heads in the curtained booths, have pressed their kill buttons. Curley’s chats with the other bots offer much of the same. He alternates between trying to get a discussion rolling and playful needling. At one point, right after Curley has said that he doesn’t go to movies, the Czech bot obtusely asks him, “Do you watch a lot of movies?”

“No, I don’t, because I am blind,” Curley sarcastically replies.

“How do you choose a movie to watch?” the bot says, plowing ahead with its prescripted dialog.

“I normally choose them via smell,” Curley says.

Curley is precisely the sort of uncooperative user that Marek feared. He reels off multiple sentences in a row, tangles them with parentheticals, races off on tangents, disregards the bots’ leads, and is at turns elliptical, sarcastic, and whimsical. In short, Curley talks like a human. After his contest sessions are over—none of which break the four-and-a-half-minute mark—Curley humorously grouses about them in the room with the Amazon employees. “It was like a series of bad dates,” he says.

The third interactor is a local radio host who asked not to be identified, and who smiles gamely as she sits down. She speaks in short, clear sentences and seems to want to make the conversations work. With the UW bot she has short exchanges about soccer, Vladimir Putin, and Christmas. With the Czech bot, she talks about a book she is reading. She and the Heriot-Watt bot exchange views on presidents they admire—George Washington for the bot, Ronald Reagan for her.

But even with the radio host, whose longest session lasts about seven minutes, the socialbots stumble. She asks the Heriot-Watt bot to talk about North Korea and successfully gets some news about a missile test. She leans forward, looking concerned. “There could be a chance for war,” she says.

“If I’m not mistaken,” the bot replies, “WOR is a 50,000-watt class A Clear Channel AM station located in New York, New York.” The judges can’t hit their kill buttons fast enough.

The best conversation in the finals happens with George. While hardly the sparkling repartee of a Tom Stoppard play, it immediately stands out. The chat opens with George asking to talk about robots. Rather than tediously reading a Wikipedia-derived summary of what a robot is, the socialbot jokes, “What if Daft Punk really are two robots who just want to make music?”

“That’s funny,” George says, laughing.

The bot shares some recent news about Elon Musk and robots. Tossing the ball back to George, the bot asks, “Have you heard anything about Elon Musk?”

“I have,” George brags. “I know him.”

“Did you know that Elon Musk was executive producer for the movie Thank You for Smoking?”

“No,” George says, sounding impressed. He and the bot go back and forth before settling on books as their next topic. They take a couple of turns, and then the socialbot says, “You know what I was thinking? Eventually, someone will be the last person to ever be alive. Their name won’t be in any history books.”

“Oh, that is sad!” George replies.

“I’m sorry to make you upset. Would you like to continue our discussion on books?”

“Let’s talk about the book The Dark Tower.”

“Sorry, I’m not sure I know anything about the book The Dark Tower. Would you like to talk about the movie The Dark Tower?”

“Sure.”

“Super. Did you watch this movie? It was released this year.”

“I did.”

The bot asks George if he liked Matthew McConaughey’s performance. George did. Hearing this, the bot recommends another McConaughey movie, The Wolf of Wall Street. A couple of turns later, the bot makes a joke. “You know what I was thinking? Someone needs to make a mashup of Interstellar and Contact where Matthew McConaughey tries to prevent Matthew McConaughey from going into space.”

George guffaws.

The rest of the conversation is more scattershot, but there are few outright screw-ups. Music, sports. Ten minutes. The movie The Boondock Saints. Twelve minutes. Santa Claus and his unintended role in climate change. Thirteen minutes. George asks the bot to sing. It complies. Fifteen minutes. Music and movies again, health care and Bill Gates. The timer hits 19 minutes and the conversation is still going.

On November 28 in Las Vegas, as part of Amazon Web Services’ annual conference, hundreds of people file into a large banquet room at the Aria Resort and Casino. The front row of seats is reserved for the Alexa Prize finalists. “It’s anyone’s game,” Heriot-Watt’s Lemon thinks. Marek toggles between optimism and doubt. Fang and his UW teammates are the most visibly stressed out. Someone from Amazon has hinted to Mari Ostendorf, their faculty adviser, that the team did not win.

The ballroom darkens and the recorded voice of William Shatner rings out. “Computer?” he says. “Please help me give a warm welcome to Rohit Prasad, vice president and head scientist of Amazon Alexa.” Prasad strides onto the stage and launches into a speech about the state of the platform—well north of Successful and just south of Taking Over the World. Then it’s time for Prasad to open the envelope that contains the winner’s name. “So with an average score of 3.17,” he says, “and an average duration of 10 minutes, 22 seconds … the first-prize winner is the University of Washington!” The UW team members explode from their seats, a scream piercing the air. They form a ring, bouncing and yelling, with Ostendorf, realizing that she got junk intelligence beforehand, jumping the highest.

It was the UW bot that had pulled off the long conversation with George. Fang later calls it “the best conversation we ever had.” At the very end, the bot had gone into a dry cul-de-sac about health care. Two judges had clicked out just shy of the 20-minute mark. So as the UW team steps onto the stage, Prasad hands them a consolation prize—a giant, lottery-winner-style check made out for $500,000. Fang, grinning widely, clutches it and gives a thumbs-up for the cameras.

The University of Washington’s team advisers, professors Noah A. Smith and Mari Ostendorf.

Ian Allen

Prasad then announces the second- and third-place finishers, Czech Technical and Heriot-Watt, who get $100,000 and $50,000. Lemon, competitive to the end, has a pinched look on his face. Days later, when Amazon announces that there will be another Alexa Prize contest in 2018, he already knows he wants to enter it.

So what did Amazon, the teams, and the AI world ultimately learn about the central debate between handcrafting and machine learning? UW, the winner, had shot for the middle. The handcrafting-heavy Czech team, meanwhile, had finished second. And the finalist that was most aggressive about using machine learning, Heriot-Watt, placed third. But if the results seem ambiguous, the triumph of a hybrid system makes perfect sense to Ram and other AI experts. We’re just beginning to figure out how best to combine the two approaches, Ram says.

Everyone in the contest also agrees on what would be most helpful to push machine learning forward: more conversational data. That, ultimately, is Amazon’s own contest booty. Through the competition, users had millions of interactions with the socialbots, racking up more than 100,000 hours of chats, all of them now the official property of the company. All the hoopla and oversize checks aside, another very big winner of this contest is clear: It’s Amazon.



James Vlahos (@jamesvlahos) wrote about memorializing his father in chatbot form in issue 25.08.

This article appears in the March issue.
