On Evaluating and Comparing Conversational Agents

November 27th, 2017 | AI, NLP, Machine Learning

The subjectivity inherent in evaluating conversations is a key challenge in building non-goal-oriented dialogue systems. This paper proposes a comprehensive evaluation strategy built on multiple metrics, chosen because they correlate well with human judgement, in order to reduce that subjectivity. To our knowledge, this is the largest evaluation setting to date, covering millions of conversations and hundreds of thousands of user ratings. We believe this work is a step towards an automatic evaluation process for conversational AI.
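
To give a rough sense of the metric-selection idea described above, here is a minimal sketch (not the paper's actual method; metric names, data, and the threshold are all hypothetical) that scores candidate automatic metrics by their Spearman correlation with human ratings and keeps the well-correlated ones:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-conversation scores; in practice these would come from
# logged conversations and user ratings (names and values are illustrative).
human_ratings = np.array([4.5, 2.0, 3.5, 5.0, 1.5, 4.0, 2.5, 3.0])
candidate_metrics = {
    "response_length":  np.array([12, 5, 9, 15, 4, 11, 6, 8]),
    "topic_coherence":  np.array([0.8, 0.3, 0.6, 0.9, 0.2, 0.7, 0.4, 0.5]),
    "engagement_turns": np.array([20, 6, 14, 25, 5, 18, 9, 12]),
}

# Keep only metrics whose rank correlation with human judgement exceeds
# an arbitrary threshold (0.5 here, purely for illustration).
THRESHOLD = 0.5
selected = {}
for name, scores in candidate_metrics.items():
    rho, p_value = spearmanr(scores, human_ratings)
    print(f"{name}: rho={rho:.2f}, p={p_value:.3f}")
    if rho > THRESHOLD:
        selected[name] = rho

print("Selected metrics:", sorted(selected, key=selected.get, reverse=True))
```

The surviving metrics could then be combined into a single automatic score, reducing the reliance on subjective per-conversation judgements.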

Click Here to Download Paper