{"id":76861,"date":"2024-03-21T09:00:46","date_gmt":"2024-03-21T00:00:46","guid":{"rendered":"https:\/\/www.waseda.jp\/inst\/research\/?p=76861"},"modified":"2024-03-22T13:55:38","modified_gmt":"2024-03-22T04:55:38","slug":"a-novel-visual-cue-based-multi-modal-turn-taking-model-for-speaking-systems","status":"publish","type":"post","link":"https:\/\/www.waseda.jp\/inst\/research\/news-en\/76861","title":{"rendered":"A Novel Visual Cue-Based Multi-Modal Turn-Taking Model for Speaking Systems"},"content":{"rendered":"<h1><a href=\"https:\/\/doi.org\/10.21437\/Interspeech.2023-578\">A Novel Visual Cue-Based Multi-Modal Turn-Taking Model for Speaking Systems<\/a><\/h1>\n<p><a href=\"https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-76862 size-large\" src=\"https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1-940x529.png\" alt=\"\" width=\"940\" height=\"529\" srcset=\"https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1-940x529.png 940w, https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1-610x343.png 610w, https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1-768x432.png 768w, https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1-1536x864.png 1536w, https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/WASEU_135_Infographic_Draft_1.png 1921w\" sizes=\"auto, (max-width: 940px) 100vw, 940px\" \/><\/a><\/p>\n<p>Traditional speaking systems that incorporate turn-taking behaviour during interviews typically rely on language and sound to decide when a person&#8217;s turn ends, and the other person\u2019s turn begins. 
While these cues are effective, such models still suffer from interruptions and long delays between turns. Many studies have suggested that visual cues, such as eye gaze, mouth and head movements, and gestures, can improve the accuracy of turn-taking models.<\/p>\n<p>To test this idea, a team of researchers led by <a href=\"https:\/\/www.yoichimatsuyama.com\/about\/\">Associate Research Professor Yoichi Matsuyama of Waseda University<\/a> investigated how specific visual cues could enhance turn-taking models in conversations, especially during interviews. The team then used these findings to develop a novel turn-taking model for end-of-utterance prediction during interviews.<\/p>\n<p>They first developed a turn-taking model that combines a two-dimensional convolutional neural network (CNN) with a long short-term memory (LSTM) network, incorporating gaze, mouth, and head movements, and conducted an ablation study, a set of experiments in which components of a system are removed or replaced to understand their impact on the system\u2019s performance.<\/p>\n<p>To evaluate the model, the team used data from online interviews, each consisting of 10 minutes of dialogue between a Japanese English learner and an English teacher. The results revealed that the model with all three visual cues demonstrated the best performance, and that gaze had the most significant impact, followed by mouth movements.<\/p>\n<p>Next, to capture these visual cues more accurately, they developed a more advanced end-to-end visual extraction model based on a three-dimensional CNN called X3D. 
Owing to its higher accuracy in capturing visual information, this model outperformed the earlier LSTM-based model.<\/p>\n<p>Finally, they used this improved visual extraction model to develop a multi-modal turn-taking model for speaking systems, called the Intelligent Language Learning Assistant (InteLLA). InteLLA utilizes the wav2vec model for acoustic cues, X3D for visual cues, and the BERT language model for linguistic cues. Each of these models can also be used independently to detect turn-taking cues.<\/p>\n<p>Further, the team compared the performance of InteLLA across all combinations of acoustic, linguistic, and visual features. The results revealed that using all three features resulted in a significant improvement in performance compared to using only acoustic and linguistic cues.<\/p>\n<p>This innovative system, with its focus on visual cues, has the potential to achieve conversations as natural as those with human interviewers. It has applications in various fields, from language proficiency courses to individual self-learning.<\/p>\n<p>This study received the prestigious \u201cISCA Award for Best Student Paper\u201d at Interspeech, the world\u2019s largest international conference on spoken language processing, in August 2023.<\/p>\n<p><strong>Link to the original article<\/strong>: <a href=\"https:\/\/www.isca-speech.org\/archive\/interspeech_2023\/kurata23_interspeech.html\">https:\/\/www.isca-speech.org\/archive\/interspeech_2023\/kurata23_interspeech.html<\/a><\/p>\n<h4>About the author<\/h4>\n<p><a href=\"https:\/\/www.yoichimatsuyama.com\/about\/\">Dr. Yoichi Matsuyama<\/a> is the Founder and CEO of Equmenopolis, Inc. and is currently an Associate Research Professor at the Perceptual Computing Laboratory, Waseda University, Tokyo. Prior to this, he worked as a Postdoctoral Fellow (Special Faculty) at the ArticuLab in the School of Computer Science, Carnegie Mellon University. He received a B.A. 
in cognitive psychology and media studies, and an M.E. and Ph.D. in computer science from Waseda University in 2005, 2008, and 2015, respectively. His research interests lie in computational models of human conversation, combining artificial intelligence, social science, and human-computer\/robot interaction.<\/p>\n<h4 style=\"text-align: center;\">LANGX Speaking, a conversation-based English proficiency assessment system<\/h4>\n<div id=\"attachment_76863\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-76863 size-full\" src=\"https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/7f259f38d72b947ec47beefb1b586de0.png\" alt=\"\" width=\"624\" height=\"283\" srcset=\"https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/7f259f38d72b947ec47beefb1b586de0.png 624w, https:\/\/www.waseda.jp\/inst\/research\/assets\/uploads\/2024\/03\/7f259f38d72b947ec47beefb1b586de0-610x277.png 610w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><p class=\"wp-caption-text\">A team of researchers led by Yoichi Matsuyama developed a conversational AI-based system to assess a learner&#8217;s English conversation and communication skills. 
It has been officially adopted for Tutorial English, a regular Waseda University course, starting from the 2023 academic year.<\/p><\/div>\n<h5>Title of the paper: <a href=\"https:\/\/doi.org\/10.21437\/Interspeech.2023-578\">Multimodal Turn-Taking Model Using Visual Cues for End-of-Utterance Prediction in Spoken Dialogue Systems<\/a><br \/>\nConference: <a href=\"https:\/\/www.isca-archive.org\/interspeech_2023\/index.html\"><span class=\"w3-text\">INTERSPEECH 2023<\/span><\/a><br \/>\nAuthors: Fuma Kurata, Mao Saeki, Shinya Fujie, and <a href=\"https:\/\/www.yoichimatsuyama.com\/about\/\">Yoichi Matsuyama<\/a><br \/>\nDOI: <a href=\"https:\/\/doi.org\/10.21437\/Interspeech.2023-578\">10.21437\/Interspeech.2023-578<\/a><\/h5>\n<p><iframe loading=\"lazy\" title=\"Innovators: Research Recap 2024, Waseda University \/\u65e9\u7a32\u7530\u5927\u5b66\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/EmWmsEdC2Qw?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A Novel Visual Cue-Based Multi-Modal Turn-Taking Model for Speaking Systems Traditional speaking systems that  
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":76864,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[95],"tags":[73,179],"class_list":["post-76861","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news-en","tag-general-en","tag-research-en"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/posts\/76861","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/comments?post=76861"}],"version-history":[{"count":1,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/posts\/76861\/revisions"}],"predecessor-version":[{"id":76939,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/posts\/76861\/revisions\/76939"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/media\/76864"}],"wp:attachment":[{"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/media?parent=76861"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/categories?post=76861"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.waseda.jp\/inst\/research\/wp-json\/wp\/v2\/tags?post=76861"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}