LNCS 13815
Asbjørn Følstad · Theo Araujo · Symeon Papadopoulos · Effie L.-C. Law · Ewa Luger · Morten Goodwin · Petter Bae Brandtzaeg (Eds.)
Chatbot Research and Design: 6th International Workshop, CONVERSATIONS 2022, Amsterdam, The Netherlands, November 22–23, 2022, Revised Selected Papers
Lecture Notes in Computer Science. Founding Editors: Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany; Juris Hartmanis, Cornell University, Ithaca, NY, USA
Editorial Board Members: Elisa Bertino, Purdue University, West Lafayette, IN, USA; Wen Gao, Peking University, Beijing, China; Bernhard Steffen, TU Dortmund University, Dortmund, Germany; Moti Yung, Columbia University, New York, NY, USA
13815
Learn more about this series at https://link.springer.com/bookseries/558
Asbjørn Følstad · Theo Araujo · Symeon Papadopoulos · Effie L.-C. Law · Ewa Luger · Morten Goodwin · Petter Bae Brandtzaeg (Eds.)
Chatbot Research and Design: 6th International Workshop, CONVERSATIONS 2022, Amsterdam, The Netherlands, November 22–23, 2022, Revised Selected Papers
Editors Asbjørn Følstad SINTEF Oslo, Norway
Theo Araujo University of Amsterdam Amsterdam, Netherlands
Symeon Papadopoulos CERTH-ITI Thessaloniki, Greece
Effie L.-C. Law Durham University Durham, UK
Ewa Luger University of Edinburgh Edinburgh, UK
Morten Goodwin University of Agder Grimstad, Norway
Petter Bae Brandtzaeg University of Oslo Oslo, Norway
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-031-25580-9 ISBN 978-3-031-25581-6 (eBook) https://doi.org/10.1007/978-3-031-25581-6 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Introduction Chatbots are a topic of substantial research and industry interest. With chatbots becoming increasingly relevant to a broad range of users and application areas, the need for research is considerable. As the research presented in this volume demonstrates, useful chatbot applications range from health and wellbeing to digital government and customer service. Research areas span user perceptions and experiences as well as the design and technological foundations of conversational interaction. To offer researchers and practitioners with a professional interest in chatbots a meeting point and to strengthen the chatbot research community, the international workshop series CONVERSATIONS has been organized annually since 2017. It is distinctly interdisciplinary, with contributions from computer science, the social sciences and humanities, management and communication research, as well as design and human-computer interaction research. The CONVERSATIONS workshops are intended as open and inclusive venues for exchange and discussion on chatbot research and design, and have been held as two-day events since 2019. Due to the COVID-19 pandemic, the 2020 and 2021 workshops were held online, but the workshop returned to a physical format in 2022 to increase the opportunity for networking and interaction among workshop participants. CONVERSATIONS 2022 was the sixth workshop in the series. While a physical event, about a third of the sessions were run in hybrid mode to broaden access. The workshop was held November 22–23, 2022, hosted by the University of Amsterdam in collaboration with SINTEF, the University of Oslo and its Centre for Research on Media Innovations (CRMI), the Centre for Research and Technology Hellas, Durham University, the University of Edinburgh, and the University of Agder. A total of 111 participants from 25 countries registered for the workshop: 56 on site, the rest online.
Paper Invitation, Review, and Revision The workshop call for papers outlined an interest in chatbot research, design, and applications within six key areas, building on promising directions for future chatbot research identified in previous editions of CONVERSATIONS. In particular, we solicited submissions concerning chatbot users and user experience, chatbot frameworks and platforms, chatbots for collaboration and engagement, chatbot ethics and privacy, and leveraging advances in AI technology and large language models. The call for papers was distributed through relevant mailing lists. We also sent the call to the network of researchers and practitioners associated with the CONVERSATIONS workshop series. Four types of submissions were encouraged: full papers, position papers, project presentations, and proposals for group activities. The project presentation category was new to this year's edition of the workshop, intended to let researchers promote relevant and interesting projects to the CONVERSATIONS audience. We received 39 submissions: 27 full papers, four position papers, six project presentations, and two group activity proposals. Full papers, position papers, and project presentations underwent double-blind review. Each contribution was reviewed by two to three independent program committee members, with one of the seven workshop organizers serving as review leader and being responsible for the meta-review. Program committee members reviewed between two and four submissions each. The organizers each led the review process for five or six submissions. Acceptance decisions were made in an organizers' meeting after all reviews had been submitted by the program committee members and summarized by the review leaders. Submissions with conflicts of interest were handled without the involvement of the conflicted parties. One full paper submission was desk rejected as out of scope for the workshop. The two group activity proposals were assessed by a panel consisting of the organizers. In total, 12 full papers were accepted, six following minor revisions and six following major revisions. Revised papers were accepted only after a compliance check by the responsible review leader and, where needed, one or two additional rounds of review. In one case, the final decision also required additional reviewer input. The acceptance rate for full papers was 44%.
Workshop Program and Outcomes The two-day workshop program included a keynote, an invited talk, two group activities, six paper sessions, and a panel discussion. The keynote speaker was Catherine Pelachaud, Director of Research at the ISIR lab, Sorbonne University. Pelachaud presented work towards a socially interactive agent, including work on multimodal interaction and the adaptation of conversational strategies to strengthen user engagement. In the invited talk, Sandro Pezzelle of the University of Amsterdam shared research experiences on adapting state-of-the-art conversational systems, based on pre-trained language models, to different age groups. Workshop participants could choose between two group activities. Sviatlana Höhn, Bettina Migge, Doris Dippold, and Britta Schneider organized a group activity on attitudes, biases, and practices in conversational AI design. Jan de Wit and Anouck Braggaar ran a group activity on platforms for chatbot research and development. The topic of the panel discussion was the evaluation of chatbots in research and practice, in particular, how to conduct useful and reliable evaluations. The panelists represented academia and industry, and included Michael McTear (Ulster University), Nena van As (boost.ai), Margot van der Goot (University of Amsterdam), and Elayne Ruane (UCD). The panel provided insight into the complexity of the evaluation criteria relevant to chatbots, the need to flexibly adapt evaluation designs to the specifics of a use context, and the need to transfer and aggregate insight across evaluations.
The paper sessions included presentations of accepted full papers, position papers, and project presentations organized in six thematic groups; three of these were run onsite only and three in hybrid mode, both onsite and online. In these proceedings, the accepted full papers are grouped under two general themes, each theme containing six papers. Within the first theme, Chatbot Users and User Experience, the papers provided new insight and findings from a diverse set of areas. Leeuwis and He presented an investigation of a smoking cessation chatbot and the factors affecting the chatbot user experience. Nordberg and Guribye addressed the user experience of engaging with messages through voice user interfaces. Liebrecht, Kamoen, and Aerts examined the use and preference of different implementations of conversational agents for election advice. Silva, De Cicco, Levi, and Hammerschmidt provided new insight into the impact of gamification in chatbots for brand communication. Abbas, Følstad, and Bjørkli presented a study of users' perceptions of chatbots for digital government service provision. Finally, Henkel, Linn, and van der Goot summarized findings from a study of the intention to use chatbots for mental health among LGBTQIA+ users. The range of application domains examined provides valuable new insight into a range of specific chatbot use cases, but also serves to shed light on issues that cut across these domains, for example, concerning the factors underpinning user intention and user experience. The second theme, Chatbot Design and Applications, included papers providing new insight into the design of chatbot interactions and dialogues, as well as knowledge of specific implementations. Abbo, Crovari, and Garzotto presented a promising approach to conversational troubleshooting of applications.
Van Hooijdonk, Martijn, and Liebrecht provided a new framework for analyzing an aspect of chatbot design of substantial practical interest, namely the chatbot's initial self-presentation. Chira, Mathioudis, and their colleagues presented a multimodal chatbot to facilitate data collection from specific patient groups. Angenius and Ghajargar provided an in-depth reflection on conversation as a design material, grounded in a study of a journaling chatbot. Stolwijk and Kunneman presented their work on integrating chatbot responses to general knowledge and a specific task through an engaging study of a kitchen assistant. Finally, Niederer, Schloss, and Christensen provided insight into their work on a chatbot to support users in dialogue-based product configuration. Three of the accepted full papers were nominated for the CONVERSATIONS best paper award. The nominated papers were those with the highest average review scores. A jury consisting of the seven workshop organizers selected the winner among the three nominees. The CONVERSATIONS 2022 best paper award was won by Charlotte van Hooijdonk, Gabriëlla Martijn, and Christine Liebrecht for their paper A Framework and Content Analysis of Social Cues in the Introductions of Customer Service Chatbots. The two runners-up were Lotte Leeuwis and Linwei He, with the paper Hello, I Am Cecil(y), the Quit-Smoking Chatbot: The Effectiveness of Motivational Interviewing and Confrontational Counseling Chatbots and the Moderating Role of Need for Autonomy and Self-Efficacy, and Giulio Antonio Abbo, Pietro Crovari, and Franca Garzotto, for their paper Improving Conversational Troubleshooting with Multimodality: Design and Implementation.
Concluding the successful sixth edition of the CONVERSATIONS workshop, we thank all authors, program committee members, presenters, participants, and supporters, all of whom helped make the workshop a great venue for exchanging and discussing chatbot research and design. We are happy that the workshop series serves to strengthen the chatbot research community, and we already look forward to the next edition of CONVERSATIONS. November 2022
Asbjørn Følstad Theo Araujo Symeon Papadopoulos Effie L.-C. Law Ewa Luger Morten Goodwin Petter Bae Brandtzaeg
Organization
General Chairs/Workshop Organizers Asbjørn Følstad Theo Araujo Symeon Papadopoulos Effie L.-C. Law Ewa Luger Morten Goodwin Petter Bae Brandtzaeg
SINTEF, Norway University of Amsterdam, Netherlands Centre for Research and Technology Hellas, Greece Durham University, UK University of Edinburgh, UK University of Agder, Norway University of Oslo and SINTEF, Norway
Program Committee Alexander Mädche Ana Paula Chaves Caroline Ischen Charlotte van Hooijdonk Christian Löw Christine Liebrecht David Kuboň Despoina Chatzakou Elayne Ruane Eleni Metheniti Eren Yildiz Fabio Catania Frank Dignum Frode Guribye Guy Laban Jo Dugstad Wake Jo Herstad Klopfenstein
Karlsruhe Institute of Technology, Germany Northern Arizona University, USA University of Amsterdam, Netherlands Utrecht University, Netherlands University of Vienna, Austria Tilburg University, Netherlands Charles University, Prague, Czech Republic Centre for Research and Technology Hellas, Greece University College Dublin, Ireland CLLE-CNRS | IRIT-CNRS, France Umeå University, Sweden Politecnico di Milano, Italy Umeå University, Sweden University of Bergen, Norway University of Glasgow, UK NORCE & University of Bergen, Norway University of Oslo, Norway University of the Basque Country / Euskal Herriko Unibertsitatea, Spain SINTEF, Norway University of Bamberg, Germany Swansea University, UK University of Urbino “Carlo Bo”, Italy
Marcos Baez Margot van der Goot Marita Skjuve Matthias Kraus Minha Lee Roberta De Cicco Sebastian Hobert Soomin Kim Stefan Morana Stefan Schaffer Sviatlana Höhn Ulrich Gnewuch Yi-Chieh Lee Zia Uddin
Bielefeld University of Applied Sciences, Germany University of Amsterdam, Netherlands SINTEF, Norway University of Augsburg, Germany Eindhoven University of Technology, Netherlands University of Chieti-Pescara, Italy University of Göttingen, Germany Samsung, South Korea Saarland University, Germany DFKI – German Research Center for Artificial Intelligence, Germany University of Luxembourg, Luxembourg Karlsruhe Institute of Technology, Germany National University of Singapore, Singapore SINTEF, Norway
Contents
Chatbot Users and User Experience
Hello, I Am Cecil(y), the Quit-Smoking Chatbot: The Effectiveness of Motivational Interviewing and Confrontational Counseling Chatbots and the Moderating Role of Need for Autonomy and Self-Efficacy . . . . . 3
Lotte Leeuwis and Linwei He
Interacting with Messages Through Voice User Interfaces . . . . . 18
Oda Elise Nordberg and Frode Guribye
Speak Your Mind! Young Voters' Use and Perceptions of a Text-Based, Voice-Based, and Text-Voice Combined Conversational Agent Voting Advice Application (CAVAA) . . . . . 34
Christine Liebrecht, Naomi Kamoen, and Celine Aerts
Value Creation in Gamified Chatbot Interactions and Its Impact on Brand Engagement . . . . . 50
Susana C. Silva, Roberta De Cicco, Maria Levi, and Maik Hammerschmidt
Chatbots as Part of Digital Government Service Provision: A User Perspective . . . . . 66
Nadia Abbas, Asbjørn Følstad, and Cato A. Bjørkli
Understanding the Intention to Use Mental Health Chatbots Among LGBTQIA+ Individuals: Testing and Extending the UTAUT . . . . . 83
Tanja Henkel, Annemiek J. Linn, and Margot J. van der Goot
Chatbot Design and Applications
Improving Conversational Troubleshooting with Multimodality: Design and Implementation . . . . . 103
Giulio Antonio Abbo, Pietro Crovari, and Franca Garzotto
A Framework and Content Analysis of Social Cues in the Introductions of Customer Service Chatbots . . . . . 118
Charlotte van Hooijdonk, Gabriëlla Martijn, and Christine Liebrecht
A Multimodal Affective Conversational Agent for Non-intrusive Data Collection from Patients with Brain Diseases . . . . . 134
Chloe Chira, Evangelos Mathioudis, Christina Michailidou, Pantelis Agathangelou, Georgia Christodoulou, Ioannis Katakis, Efstratios Kontopoulos, and Konstantinos Avgerinakis
AI Interactive Journal: Probing Words and Language as Interaction Design Materials . . . . . 150
Max Angenius and Maliheh Ghajargar
Expanding the Range of Clarification Responses for a Kitchen Assistant . . . . . 171
Gina E. M. Stolwijk and Florian A. Kunneman
Designing Context-Aware Chatbots for Product Configuration . . . . . 190
Tom Niederer, Daniel Schloss, and Noemi Christensen
Author Index . . . . . 211
Chatbot Users and User Experience
Hello, I Am Cecil(y), the Quit-Smoking Chatbot: The Effectiveness of Motivational Interviewing and Confrontational Counseling Chatbots and the Moderating Role of Need for Autonomy and Self-Efficacy
Lotte Leeuwis and Linwei He(B)
Department of Communication and Cognition, Tilburg School of Humanities and Digital Sciences, Tilburg University, Tilburg, The Netherlands
[Email protected]
Abstract. This study aimed to investigate whether and how chatbots can increase smokers' intention to quit, specifically examining the effectiveness of two communication styles (i.e., motivational interviewing (MI) and confrontational counseling (CC)) and the moderating role of individual differences (i.e., need for autonomy and perceived self-efficacy) that may affect smokers' experience with the chatbot. In an online between-subjects experiment (N = 233), smoking participants were assigned to interact with either an MI chatbot (n = 121) or a CC chatbot (n = 112) in an 8-minute session. Need for autonomy, perceived self-efficacy, user satisfaction, and intention to quit smoking were measured pre- and post-interaction. No significant effects of the different communication styles on the outcomes were found, nor did need for autonomy moderate these outcomes. However, the effect of MI on user satisfaction was stronger for smokers with higher self-efficacy, and there was a positive effect of self-efficacy on user satisfaction. Furthermore, interacting with the chatbots about their own smoking behavior increased participants' intention to quit, regardless of communication style. The study thereby sheds light on the potential of counseling chatbots for smoking cessation interventions, as well as avenues for future research. Keywords: Conversational agents · Smoking cessation · Motivational interviewing · Need for autonomy · Self-efficacy
1 Introduction
Governments and organizations worldwide have long attempted to increase smoking cessation rates through various interventions [1]. Many of these interventions focus on communicating the risks associated with smoking, often using fear and/or emotional appeals [2]. While such forceful messages may be effective for some smokers, others may respond defensively, leading to reactance and rejection of the message [3]. Because of these individual differences in emotional and behavioral responses to anti-smoking messages, tobacco use remains a prominent public health issue [4]. For this reason, the development of individualized means of communication for smoking cessation has been one of the recent research priorities. To this end, conversational agents such as chatbots may be suitable, as they reduce the pressure on healthcare professionals by providing accessible, 24/7 individual counseling for the large proportion of the population wishing to quit smoking [5, 6]. Chatbots can be defined as “artificial intelligence programs designed to simulate human conversations” [6] and have seen a rapid increase in their capacity to support health counseling over the years [5]. Initial effectiveness of and receptivity to chatbots for health counseling has been demonstrated in contexts such as mental health and physical activity [6]. For individualized smoking cessation counseling, current chatbots can employ different communication styles, such as confrontational counseling (CC) and motivational interviewing (MI). CC focuses on confronting smokers with the consequences of their behavior, counteracting self-exempting beliefs and increasing risk perception through direct advice and health-related information [7]. In contrast, MI aims to strengthen the client's own motivation and belief that behavior change is needed, through self-persuasion [8]. Despite their differences, both MI and CC are client-centered, directive, and effective in facilitating behavior change [7, 9]. The scalability of chatbots and their dialogical nature allow for individualized conversations, and it is important to gain insight into which communication style fits which individual needs and preferences. Catering to these preferences will likely lead to greater user satisfaction with chatbots and, in turn, a higher success rate for the targeted behavior [10].
(Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 3–17, 2023. https://doi.org/10.1007/978-3-031-25581-6_1
In this context, two factors may be particularly relevant: need for autonomy, as described in self-determination theory (SDT) [11], and perceived self-efficacy, as described in social cognitive theory (SCT) [12]. Specifically, a higher need for autonomy may call for autonomy-supportive intervention (i.e., MI), whereas a lower need for autonomy may signal a need for explicit instructions from an expert (i.e., CC). Therefore, to satisfy the client's need for autonomy, the chatbot may need to employ different communication styles. Likewise, smokers with higher perceived self-efficacy for behavior change may be able to handle CC, whereas smokers with lower self-efficacy may need the motivational boost found in MI to feel empowered to change their behavior [3, 7]. While there is a large body of research on tailoring techniques for health interventions, little research has directly compared MI and CC while considering the moderating role of individual differences, particularly for chatbot-delivered health interventions. To fill this gap in the literature and shed light on the use of chatbots in smoking cessation interventions, this study examines the optimal communication style (MI vs. CC) for a chatbot in terms of user satisfaction and intention to quit smoking, and the moderating roles of need for autonomy and self-efficacy.
2 Conceptual Framework
2.1 Communication Styles for Smoking Cessation Chatbots
Confrontation is a frequently mentioned practice in the addiction counseling literature [13]. Traditionally, CC aims to break through defenses such as denial and minimization of problems by directly addressing the client's resistance to change [13]. In the smoking cessation domain, CC counselors use techniques such as direct education, challenging the client's perception of the problem, and prescribing a treatment plan [14]. Contrary to the popular notion that CC leads to conflict between counselor and client, research has shown that this style positively predicts client engagement, and that CC can even be perceived as genuine and authentic when delivered empathetically [15]. However, if the client feels threatened by the counselor's push for behavior change, resistance may increase, resulting in lower client satisfaction and, in turn, lower intention to change the behavior [8]. In response to the observed potential for client resistance under CC, Miller and Rollnick [8] developed MI: a counseling style for behavior change that elicits the client's own motivation rather than imposing it [8, 13]. The MI counselor gently guides the client through this process, expressing empathy, rolling with resistance, encouraging self-reflection, and acknowledging and supporting the client's self-efficacy [8]. Thus, whereas CC relies on the direction and expert opinion of a professional, in MI the client is seen as the expert and the main driver of behavior change [9]. Over recent decades, research on the effects of MI and CC on smoking cessation and other health behaviors has yielded mixed and inconclusive results [16]. Some researchers argue that most smokers do not suffer from a lack of knowledge about the negative consequences of smoking and require not information-based expert advice (i.e., CC) but rather motivation-based intervention (i.e., MI).
However, there is also evidence that some people found MI too patronizing and therefore preferred a more direct CC approach [17]. CC may thus be effective for smokers who expect the counselor, rather than themselves, to be the expert [18]. There is similar ambivalence regarding MI in smoking cessation interventions. For example, an early meta-analysis showed that MI is effective for a variety of problem behaviors (e.g., alcohol abuse and exercise) but not for smoking cessation. More recent meta-analyses found greater abstinence among smokers who received MI interventions, although the effects were small [9]. Overall, these results indicate that both CC and MI appear to have the potential to motivate behavior change. However, despite methodological differences across this line of research, most of the literature appears to support the notion that MI evokes less resistance and is therefore more likely to change behavior effectively, which is particularly relevant for addictions, where high resistance is common [16]. Moreover, meta-analyses have shown that CC may work for a small subset of the population, whereas MI works for a larger group of people, making it the more effective approach overall [19]. Finally, in the context of chatbot-delivered interventions, CC is more likely to succeed when the sender is regarded as a legitimate and experienced professional, and chatbots may not be suited to this communication style, as chatbots are generally not regarded as medical authorities [5]. More generally, it has been suggested that people prefer chatbots that provide emotional support (i.e., MI) over purely informational chatbots (i.e., CC) [20]. Since this is likely to lead to higher user satisfaction and higher intention to continue the intervention, we hypothesize that a chatbot employing MI leads to higher intention to quit (H1) and higher user satisfaction (H2) among smokers, compared to a chatbot employing CC.
2.2 The Moderating Role of Need for Autonomy
Although there is increasing evidence supporting the effectiveness of MI, most studies in this area have been conducted in face-to-face clinical settings. The effectiveness of automated MI as delivered by chatbots therefore remains inconclusive. In a previous study [21], we compared an MI-style chatbot with a neutral-style chatbot and found no significant effects of MI. Similarly, recent reviews have acknowledged the inconclusive evidence regarding the effectiveness of automated MI and have therefore called for further investigation of the human factors that may affect chatbot interventions [22]. Human factors play an important role in interactions with chatbots, and accounting for individual differences when designing a chatbot can improve user satisfaction and adherence [23]. In behavior change interventions in particular, need for autonomy is a crucial characteristic that shapes users' experience of the intervention. When this need is satisfied, the individual feels that performing the target behavior is intrinsically rewarding [24]. Conversely, when the need for autonomy is not met, the individual feels pressured to think or act in a way dictated by external parties, such as professionals or interventions (e.g., chatbots), and may engage in defensive reactions. Although the need for autonomy is universal, the level of this need differs between individuals [24].
For example, “autonomists” are individuals with a higher need for autonomy who are less inclined to seek external control, while “expert-dependents” are individuals with a lower need for autonomy who rely more on external guidance [24]. Therefore, to maximize effectiveness and user satisfaction with a smoking cessation chatbot, its communication style may need to be tailored to the client's preference, ranging from autonomy-supportive communication (i.e., MI) to more directive-confrontational communication (i.e., CC). Although CC can evoke resistance, recent research [25] suggests that this resistance may be caused not by the communication style itself but by clients' differing needs for autonomy. For example, people with a higher need for autonomy feel more satisfied with autonomy-supportive communication (i.e., MI), which in turn leads to better health behavior outcomes (i.e., intention to quit). Likewise, clients with a lower need for autonomy are more inclined to seek expert advice [24]. It seems plausible that clients with a lower need for autonomy would benefit more from the advice and direction provided in CC, whereas clients with a higher need for autonomy would benefit more from the strengthened intrinsic motivation achieved through MI. Coleman et al. [18] provided some evidence for this claim, observing that smokers with a higher need for autonomy prefer a counselor who respects their autonomy (i.e., MI) over a counselor offering confrontational counseling (i.e., CC). In particular, smokers with a higher need for autonomy may prefer to act independently, which MI facilitates, whereas smokers with a lower need for autonomy may prefer concrete advice, which CC tends to provide. Thus, it is hypothesized that need for autonomy moderates the relationship between the chatbot's communication style and smokers' intention to quit and user satisfaction, such that the positive effect of MI (vs. CC) on intention to quit (H3) and user satisfaction (H4) is stronger for smokers with a higher need for autonomy than for smokers with a lower need for autonomy.
2.3 The Moderating Role of Perceived Self-Efficacy
In addition to need for autonomy, self-efficacy is another factor to take into account when designing a smoking cessation chatbot. Since quitting is often characterized by difficulties and relapse, smokers need at least some confidence in their ability to quit before attempting to do so. Bandura's SCT [12] holds that the higher one's perceived self-efficacy, the more confidence and commitment one feels towards achieving a goal. In smoking cessation research, many studies suggest that smokers with higher self-efficacy generally have a higher intention to quit than those with lower perceived self-efficacy (for a review, see [26]). There is evidence that levels of self-efficacy can affect how clients respond to an intervention. Gaume et al. [27] found that counselors with more CC-like skills were effective in changing the behavior of clients who expressed high confidence in their ability to change. Likewise, Colby et al. [28] found that MI did not lead to increased smoking cessation or quitting self-efficacy, possibly because participants already had relatively high levels of self-efficacy at baseline. In sum, previous research suggests a possible ceiling effect, meaning that smokers with high levels of perceived self-efficacy may not benefit from an MI chatbot specifically aimed at increasing those levels.
Furthermore, smokers with high self-efficacy may believe that they can quit whenever they want and that they are not vulnerable to the risks of smoking, thus reducing the pressure to quit and the effectiveness of MI [7]. To combat such self-exempting beliefs, Kotz et al. [7] found that CC is useful for increasing risk perception and making a quit attempt more likely. Therefore, CC may be more beneficial for smokers with high self-efficacy, while MI may be more beneficial for smokers with lower self-efficacy. Thus, it is hypothesized that perceived self-efficacy moderates the relationship between the chatbot's communication style and smokers' intention to quit and user satisfaction, such that the positive effect of MI (vs. CC) on intention to quit (H5) and user satisfaction (H6) is stronger in smokers with lower perceived self-efficacy than in smokers with higher perceived self-efficacy. The conceptual model is shown in Fig. 1.
L. Leeuwis and L. He
Fig. 1. Conceptual model of hypotheses
3 Methods

3.1 Participants. An a priori power analysis using G*Power revealed that a sample size of 158 participants was required to detect small-to-moderate effects (effect size f = 0.25, power = 0.8), consistent with previous meta-analyses of the impact of MI on smoking cessation [9]. To participate in the experiment, participants had to be at least 18 years old, be fluent in English, and have smoked at least one cigarette in the week prior to participation. From November 23, 2021 to December 3, 2021, a total of 270 eligible participants were recruited, agreed to the terms of the study, and completed the survey. Participants who did not complete the chatbot conversation (n = 37) were removed from the analysis, leaving a final sample of 233 participants. Of these, 147 identified as female (63.1%), 83 as male (35.6%), and three as non-binary or preferred not to say (1.3%). Most participants reported being between 18 and 25 years old (81.5%, n = 190) or between 26 and 30 years old (11.2%, n = 26); the remaining participants reported being 31 or older (7.3%, n = 17).

3.2 Experimental Design and Procedure. The online between-subjects experiment was conducted using Qualtrics for the survey part of the experiment and Flow.ai, a chatbot creation platform widely used by companies and researchers, for the chatbot part. Ethical approval was obtained from the Tilburg School of Humanities and Digital Sciences before data collection began. In addition, a pre-test was carried out to check whether the two conditions differed significantly on the Client Evaluation of Motivational Interviewing (CEMI) scale [29]. This was the case for the pre-tested materials (M_MI = 5.1, SD_MI = 0.7, M_CC = 3.9, SD_CC = 0.8, t(22) = 4.00, p < 0.001), and in the main experiment participants in the MI condition perceived the chatbot as more MI-like (M = 5.1, SD = 0.9), while participants in the CC condition perceived the chatbot as more CC-like (M = 3.6, SD = 0.8, t(231) = 3.20, p < 0.001). The manipulation was therefore considered successful. At the start of the experiment, participants first entered demographic data and information about their smoking status, after which they were randomly assigned to the MI (n = 121) or CC (n = 112) condition. Participants' perceived need for autonomy and self-efficacy were then measured, as well as their baseline intention to quit smoking. Participants were then taken to the Flow.ai environment, where they were invited to take part in an 8-minute conversation with Cecily, the MI chatbot, or Cecil, the CC chatbot. After the conversation, intention to quit smoking and satisfaction with the conversation were measured. Finally, participants were debriefed and thanked for their participation.

3.3 Operationalization. MI chatbot condition. MI works through a relational component and a technical component, the former centered on acceptance, collaboration, evocation, and compassion, i.e., the "spirit of MI" [8]. In chatbots, the use of natural, person-like speech and of emoticons can reinforce this MI spirit, and both were integrated into the MI chatbot's dialogue. The technical component was operationalized through open questions, reflections, and affirmations [8].
During the conversation, the chatbot emphasized that the participants, not the chatbot, were the experts; involved participants in jointly setting the agenda to develop a goal for the conversation; encouraged participants to express their thoughts and feelings about changing their behavior; and helped participants identify goals and barriers to change and create a quit plan if they so desired. CC chatbot condition. The main purpose of the CC chatbot was to make participants confront the problems their smoking behavior may cause by providing confrontational health information and unsolicited feedback on the participant's behavior [7, 8]. During the conversation, the chatbot asked participants to talk about their current smoking behavior, after which they were told that their behavior was quite concerning. The chatbot gave advice (without asking permission) on how to combat nicotine addiction, and evidence was used to counter minimization of, or resistance to, the problem [7]. Finally, the chatbot encouraged readiness and quitting by providing advice on how to manage withdrawal symptoms and encouraging the participant to seek help from a doctor. Examples of statements from both chatbots can be found in Table 1. The interaction lasted approximately the same amount of time in both conditions. Apart from the manipulations described above, the content of the dialogues (e.g., the information provided) was comparable across the two conditions.
Table 1. Examples of chatbot statements of the two conditions.
MI chatbot statements:
- "In this conversation, you are the expert on your own situation, your smoking behavior and your thoughts." (collaboration)
- "Can you think of how smoking might stop you from achieving that goal?" (open question)
- "Okay, so if I understand correctly, your goal is [...], but smoking can prevent you from achieving this goal: '[...]'" (reflection)
- "You should definitely be proud of this achievement! ☺" (affirmation)

CC chatbot statements:
- "You might feel good right now, but think about your future self for a second."
- "Before you start your quit attempt, you need to feel 100% ready. That way, your chances of a successful attempt are greater." (readiness advice)
- "This is very concerning" (critical)
- "I advise you to distract yourself by thinking about your reasons for quitting and why you are doing all this" (direct advice)
3.4 Measurements. Nicotine dependence. The Fagerström Test for Nicotine Dependence (FTND) [30] was administered prior to experimental exposure. This measure included six questions, such as "How many cigarettes do you smoke per day?". The total score participants received based on their responses ranged from 2 to 12, with a higher total indicating greater nicotine dependence (M = 3.6, SD = 2.1, Cronbach's α = 0.72). Need for autonomy. The Help-Seeking Scale [31] was used to measure participants' need for autonomy. This scale consisted of fourteen items; an example statement is "Rather than facing a problem alone, I prefer to rely on someone who knows more than I do." Items were rated on a Likert scale ranging from 1 (completely disagree) to 7 (completely agree). A higher mean indicated a greater need for autonomy (M = 4.5, SD = 0.7, Cronbach's α = 0.71). Perceived self-efficacy. The Smoking Abstinence Self-Efficacy Questionnaire (SASEQ) [32] was administered to assess the extent to which participants felt sufficient self-efficacy to change their behavior. This measure consisted of six items, e.g., "You feel agitated or tense. Are you confident that you will not smoke?". Participants' perceived self-efficacy was rated on a Likert scale ranging from 1 (certainly not) to 5 (certainly). A higher mean indicated greater perceived self-efficacy (M = 2.7, SD = 0.8, Cronbach's α = 0.76). Intention to quit. Participants' intention to quit smoking was measured using the Contemplation Ladder [33], a one-item instrument that asked participants to rate their readiness to quit smoking on a scale from 0 (no thought of quitting) to 10 (taking action). This question was asked before the experimental exposure to establish a baseline (M = 6.4, SD = 2.9) and after exposure (M = 7.2, SD = 2.5) to assess whether the conversation led to a greater intention to quit.
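The internal-consistency coefficients reported for these scales (Cronbach's α) can be computed directly from an items-by-respondents score matrix. A minimal sketch (the data below are made up for illustration, not the study's data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Illustrative 7-point Likert responses: 5 respondents x 4 items
scores = np.array([
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [6, 5, 6, 6],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
])
print(round(cronbach_alpha(scores), 2))  # high alpha: the toy items covary strongly
```

The same function applied to the study's item-level data would reproduce the α values reported above.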
User satisfaction with the conversation. Finally, user satisfaction with the conversation was measured using the Client Satisfaction Questionnaire (CSQ-8) [34]. The CSQ-8 contains eight items, such as "To what extent did the chatbot conversation meet your needs?", measured on a 1-4 scale (M = 2.6, SD = 0.6, Cronbach's α = 0.93). Perception of motivational interviewing. Manipulation checks were performed with an abbreviated version of the CEMI scale [29]. The scale included statements such as "The chatbot told you what to do" and "The chatbot showed that it believes in your ability to change your behavior". Scores were rated on a Likert scale ranging from 1 (not at all) to 7 (a lot). A high overall mean indicated that the participant perceived the chatbot as MI-like; a low overall mean indicated that the participant perceived the chatbot as CC-like (M = 4.3, SD = 1.1, Cronbach's α = 0.76).
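A manipulation check of this kind amounts to comparing mean CEMI scores between the two conditions with an independent-samples t-test. A sketch on simulated scores (the generated values are loosely modeled on the reported condition means and SDs, but are illustrative, not the study's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Simulated CEMI scores: MI condition (n = 121) vs. CC condition (n = 112)
cemi_mi = rng.normal(5.1, 0.9, 121)
cemi_cc = rng.normal(3.6, 0.8, 112)

# Welch's t-test (does not assume equal variances across conditions)
t, p = stats.ttest_ind(cemi_mi, cemi_cc, equal_var=False)
print(f"t = {t:.2f}, p = {p:.2g}")
```

A large, significant t statistic in the expected direction is what justifies calling the manipulation successful.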
3.5 Statistical Analysis. To test the effect of MI (vs. CC) on intention to quit and user satisfaction, two one-way ANCOVAs were performed, accounting for the possible influence of age, gender, and nicotine dependence [35]. To test the hypothesized moderating roles of the need for autonomy and self-efficacy, Hayes' PROCESS model 1 with covariates was used. All analyses were performed in SPSS 27.
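Hayes' PROCESS model 1 is, at its core, an OLS regression with a condition × moderator interaction term plus covariates. A sketch of the equivalent analysis on synthetic data (all variable names and generated values below are illustrative, not taken from the study):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 233
df = pd.DataFrame({
    "condition": rng.integers(0, 2, n),       # 0 = CC, 1 = MI
    "autonomy": rng.normal(4.5, 0.7, n),      # moderator, centered below
    "age": rng.integers(18, 60, n),
    "dependence": rng.normal(3.6, 2.1, n),
})
df["autonomy_c"] = df["autonomy"] - df["autonomy"].mean()
df["intention"] = 6.4 + 0.2 * df["condition"] + rng.normal(0, 2.5, n)

# PROCESS model 1 equivalent: focal predictor x mean-centered moderator,
# with age and nicotine dependence as covariates.
model = smf.ols(
    "intention ~ condition * autonomy_c + age + dependence", data=df
).fit()
print(model.params["condition:autonomy_c"])  # the moderation coefficient b
```

The coefficient on the interaction term corresponds to the b value PROCESS reports for the moderation effect.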
4 Results

4.1 Main Analyses. Main effects. H1 postulated that an MI chatbot would result in a higher intention to quit compared to a CC chatbot. A one-way ANCOVA showed that the covariates age (F(1, 228) = 1.37, p = 0.243), gender (F(1, 228) = 1.38, p = 0.241), and nicotine dependence (F(1, 228) = 3.52, p = 0.062) were not significantly associated with intention to quit, although nicotine dependence was only marginally non-significant. No significant effect of chatbot communication style on intention to quit was found, F(1, 228) = 0.97, p = 0.325. Therefore, H1 was rejected. H2 postulated that an MI chatbot would result in greater user satisfaction with the conversation compared to a CC chatbot. As for H1, the covariates age (F(1, 228) = 0.26, p = 0.613), gender (F(1, 228) = 0.15, p = 0.700), and nicotine dependence (F(1, 228) = 0.01, p = 0.253) were not significantly associated with user satisfaction. No significant effect of chatbot communication style on user satisfaction was found, F(1, 228) = 1.31, p = 0.253. Therefore, H2 was also rejected. Moderating role of the need for autonomy. H3 took the analyses a step further, postulating that smokers' need for autonomy would moderate the relationship between the chatbot's communication style and intention to quit, such that the positive effect of MI (vs. CC) on intention to quit is stronger in smokers with a greater need for autonomy than in smokers
with a lower need for autonomy. Results from Hayes' PROCESS model 1 showed that the need for autonomy did not moderate the relationship between chatbot communication style and intention to quit (b = −0.15, t = −0.32, p = 0.746). The main effect of the need for autonomy on intention to quit was also not significant (b = −0.11, t = −0.16, p = 0.876). Furthermore, none of the covariates (i.e., age, gender, nicotine dependence) was significant. Therefore, H3 was rejected. Likewise, H4 posited a moderating role of the need for autonomy in the relationship between the chatbot's communication style and user satisfaction with the conversation. The results showed that neither the interaction (b = −0.03, t = −0.25, p = 0.806), nor the main effect of the need for autonomy on user satisfaction (b = 0.12, t = 0.64, p = 0.520), nor any of the covariates was significant. H4 was also rejected. Moderating role of perceived self-efficacy. Similar procedures using PROCESS model 1 were followed to test the proposed moderating effect of perceived self-efficacy. H5 postulated that the positive effect of MI (vs. CC) on intention to quit would be stronger in people with lower self-efficacy. The moderation was not significant (b = −0.09, t = −0.23, p = 0.822), nor was the main effect of perceived self-efficacy on intention to quit (b = 0.81, t = 1.25, p = 0.213). Furthermore, none of the covariates (i.e., age, gender, nicotine dependence) was significant. Therefore, H5 was rejected. Finally, H6 proposed a moderating role of self-efficacy in the relationship between chatbot communication style and user satisfaction. The results showed a significant moderation (b = −0.21, t = −2.09, p = 0.038). However, a simple slopes analysis showed that the effect of MI was stronger in individuals with higher self-efficacy, contrary to the hypothesis.
In addition, a main effect of self-efficacy emerged (b = 0.40, t = 2.42, p = 0.016), such that greater self-efficacy predicted greater user satisfaction with the chatbot. No significant covariates (i.e., age, gender, nicotine dependence) were found. Therefore, H6 was rejected. The results of the hypothesis tests are shown in Fig. 2.
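A simple slopes analysis probes a significant interaction by evaluating the conditional effect of condition at low and high values of the moderator, typically ±1 SD from its mean. A minimal sketch (all coefficient values below are invented for illustration, not the study's regression output):

```python
# Simple-slopes probe of a condition x moderator interaction:
# conditional effect of MI (vs. CC) at +/- 1 SD of the moderator.
b_condition = 0.10    # effect of condition at the moderator's mean (hypothetical)
b_interaction = 0.21  # condition x self-efficacy interaction (hypothetical)
sd_moderator = 0.8    # SD of the moderator (hypothetical)

slopes = {}
for label, m in [("-1 SD", -sd_moderator), ("+1 SD", +sd_moderator)]:
    slopes[label] = b_condition + b_interaction * m
    print(f"conditional effect of MI vs. CC at {label}: {slopes[label]:+.3f}")
```

With a positive interaction coefficient, the conditional effect grows with the moderator, i.e., the kind of pattern described above where MI mattered more for participants higher in self-efficacy.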
Fig. 2. Test results integrated into the conceptual model. *p < 0.05
4.2 Secondary Analysis. A repeated-measures ANCOVA was performed to test whether conversing with the chatbots increased participants' intention to quit smoking. The covariates age (F(1, 229) = 1.45, p = 0.229), gender (F(1, 229) = 2.34, p = 0.136), and nicotine dependence (F(1, 229) = 2.98, p = 0.086) were not significantly associated with participants' post-test intention to quit. No significant effect of chatbot communication style on intention to quit was found, F(1, 228) = 0.78, p = 0.379. However, a large significant effect of the overall chatbot interaction on intention to quit was found, F(1, 232) = 59.20, p < 0.001, ηp² = 0.20. Smokers' intention to quit after exposure was significantly higher (M = 7.2, SE = 0.2) than at baseline (M = 6.4, SE = 0.2; M difference = 0.8, SE = 0.1, p < 0.001). Therefore, talking to a chatbot about smoking, regardless of the communication style used, appears to be effective in increasing the intention to quit.
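The pre-post increase reported here is, at its core, a within-subjects comparison. A minimal sketch on simulated Contemplation Ladder scores, using a paired t-test in place of the repeated-measures ANCOVA (generated values are loosely modeled on the reported means but are purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 233
# Ladder scores are bounded 0-10; simulate a mean gain of roughly 0.8 points
baseline = np.clip(rng.normal(6.4, 2.9, n), 0, 10)
post = np.clip(baseline + rng.normal(0.8, 1.5, n), 0, 10)

t, p = stats.ttest_rel(post, baseline)
print(f"mean change = {np.mean(post - baseline):.2f}, t = {t:.1f}, p = {p:.3g}")
```

With this sample size, even a sub-point average gain yields a clearly significant paired test, mirroring the overall pre-post effect in the paper.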
5 Discussion

5.1 Main Results. This study aimed to explore the potential of chatbots to motivate smoking cessation. Specifically, the study examined the effectiveness of chatbot-delivered MI and CC in increasing intention to quit, which communication style resulted in greater user satisfaction with the conversation, and the extent to which these outcomes were moderated by individuals' need for autonomy and perceived self-efficacy. Overall, the results show that a quit-smoking chatbot interaction, regardless of the chatbot's communication style, leads to a significant increase in intention to quit. However, intention to quit and user satisfaction with the conversation did not differ between participants in the MI condition and participants in the CC condition, nor were these outcomes influenced by participants' need for autonomy or self-efficacy. Therefore, all hypotheses were rejected. It was hypothesized that MI (vs. CC) as a communication style would lead to a higher intention to quit and greater user satisfaction among smokers. Participants' intention to quit increased significantly after interacting with the chatbot, regardless of the assigned condition. An explanation for the non-significant results may lie in the operationalization of MI and CC in this study. An analysis of the manipulation check subscales revealed that the technical skills of the MI chatbot (M = 4.8, SD = 1.3) and the CC chatbot (M = 4.3, SD = 1.4) were both perceived as quite MI-like. For example, in the CC condition, participants may have felt that, despite its confrontational approach, the chatbot helped them feel confident in their ability to quit when talking about the need for behavior change. This similarity did not occur on the relational subscale, where the MI chatbot (M = 5.3, SD = 1.3) was perceived as more MI-like than the CC chatbot (M = 2.9, SD = 1.1).
This reflects previous research showing that people value empathetic communication from a chatbot compared to purely informational chatbots [20]. Furthermore, Lundahl et al. [16] found that the effect of MI is subject to a dose effect, meaning that a longer treatment time with MI can lead to better results. As this study consisted of a single 8-minute session, it may be that the duration of exposure was too short for the expected effects of MI to occur.
Future research should investigate whether multiple sessions and/or longer exposure to an MI chatbot might have a greater impact on intervention outcomes. Furthermore, future research could explore whether a clearly CC-based approach, without the MI-related elements seen in this study, can be effective in chatbot counseling at all, as chatbots may never possess the legitimate expertise of a physician that makes clients receptive to professional health advice [5]. We also sought to investigate whether the main effect of MI could be explained by taking participants' need for autonomy into account. No significant results were found in this regard. However, the results showed, albeit non-significantly, that participants with a higher need for autonomy were less inclined to quit than participants with a lower need for autonomy, regardless of the chatbot's communication style. This finding may illustrate that smokers with a greater need for autonomy may benefit less from quit-smoking chatbots precisely because they want to make decisions without interference from other people or persuasive technologies [24]. It is therefore important that the chatbot is presented in a non-intrusive way, so that people can interact with it at their own pace and by their own choice. Research on persuasive chatbots has shown that perceived intrusiveness negatively predicts people's perceptions of the chatbot and its persuasiveness [36]. In this study, given the relatively short interaction time, it is likely that the purpose of the conversation (i.e., motivation to quit) was introduced early without sufficient initial engagement, posing a threat to autonomy.
As these results suggest that the ability to act autonomously is relatively important to most people, future (qualitative) research could shed light on how people with differently pronounced autonomy needs might benefit from health-related chatbots, and whether people with a high need for autonomy can benefit from such interventions at all. Such results could indicate how interventions should be designed to be need-supportive and effective in facilitating behavior change. Self-efficacy was hypothesized as a moderator of the relationship between chatbot communication style and intention to quit and user satisfaction. We found that smokers with higher self-efficacy were more satisfied with the interaction and that the effect of MI on user satisfaction was stronger for those with higher self-efficacy. Smokers with low self-efficacy may prefer to receive information on how to sustain a quit attempt (i.e., CC), whereas MI may work best for smokers who already feel self-efficacious, encouraging them to reflect on their own beliefs and facilitating the planning of quit actions. Addiction researchers have categorized interventions on a continuum ranging from self-help approaches (e.g., a mobile app or booklet) to more intensive care (e.g., nicotine replacement therapy) [37]. It could be that an MI chatbot falls on the self-help side and is better suited to people with higher self-efficacy. However, this study examined self-efficacy as a trait and did not test whether an MI chatbot could effectively increase perceived self-efficacy during the interaction. A suggestion for future research is to examine whether an increase in perceived self-efficacy after speaking with a quit-smoking chatbot could mediate, rather than moderate, the relationship between chatbot communication style and intervention outcomes.
5.2 Limitations and Suggestions for Future Research. The present study sheds some light on the use of chatbots for smoking cessation; however, several limitations must be taken into account when interpreting the results. First, this study did not collect information about participants' race or education, although previous research has found that such factors can influence communication preferences and experiences with digital interventions such as chatbots [38]. Future research considering these characteristics may provide additional insight into the potential and effectiveness of chatbots in healthcare counseling, for example how chatbots can convey greater trustworthiness or legitimacy and thus lead to greater user satisfaction and better intervention outcomes. Second, we did not measure participants' perceptions of the quality of the interaction and the chatbot, although such perceptions may have influenced user satisfaction and intention to quit. To retain experimental control and ensure that the chatbot delivered content in a structured and consistent way in both conditions, we created the chatbots with limited capabilities: participant input was mostly limited to standard keywords, and the chatbots used little natural language processing and generation. While constrained chatbots are commonly used in healthcare to ensure control and prevent unintended harm [39], this can limit the variability and quality of the interaction and affect the user experience. Future research is recommended to explore the potential of more advanced chatbots and to find the balance between controllability and user experience.
6 Conclusion. This study examined the effectiveness of quit-smoking chatbots using two different communication styles, MI and CC, on intention to quit smoking and user satisfaction with the conversation, as well as the moderating roles of smokers' need for autonomy and perceived self-efficacy. There were no significant effects of MI (vs. CC) on intention to quit or user satisfaction, nor were these effects moderated by the need for autonomy or perceived self-efficacy. However, the results showed that greater perceived self-efficacy leads to greater user satisfaction, as well as modest evidence that chatbot interaction is effective in increasing intention to quit. A practical implication of this study is that healthcare professionals might consider deploying chatbots to keep up with clients' demand for smoking cessation support and to alleviate pressure on healthcare systems. However, more research is needed on what exactly smokers want from counseling chatbots to increase the success of such tools. This study is intended as a springboard for further exploration of this topic, which is now more important than ever.
References
1. Zhu, S.H., Lee, M., Zhuang, Y.L., Gamst, A., Wolfson, T.: Interventions to increase smoking cessation at the population level: how much progress has been made in the last two decades? Tob. Control 21, 110–118 (2012)
2. Thompson, L.E., Barnett, J.R., Pearce, J.R.: Scared straight? Fear-appeal anti-smoking campaigns, risk, self-efficacy and addiction. Health Risk Soc. 11, 181–196 (2009)
3. Witte, K., Meyer, G., Martell, D.: Effective Health Risk Messages: A Step-by-Step Guide. Sage, Thousand Oaks (2001)
4. World Health Organization: WHO Report on the Global Tobacco Epidemic, 2021: Addressing New and Emerging Products. World Health Organization, Geneva (2021)
5. Palanica, A., Flaschner, P., Thommandram, A., Li, M., Fossat, Y.: Physicians' perceptions of chatbots in health care: cross-sectional web-based survey. J. Med. Internet Res. 21, e12887 (2019)
6. Milne-Ives, M., et al.: The effectiveness of artificial intelligence conversational agents in health care: systematic review. J. Med. Internet Res. 22, e20346 (2020)
7. Kotz, D., Huibers, M.J., West, R.J., Wesseling, G., van Schayck, O.C.: What mediates the effect of confrontational counselling on smoking cessation in smokers with COPD? Patient Educ. Couns. 76, 16–24 (2009)
8. Miller, W.R., Rollnick, S.: Motivational Interviewing: Helping People Change. Guilford Press, New York (2012)
9. Heckman, C.J., Egleston, B.L., Hofmann, M.T.: Efficacy of motivational interviewing for smoking cessation: a systematic review and meta-analysis. Tob. Control 19, 410–416 (2010)
10. Strecher, V.J., Shiffman, S., West, R.: Randomized controlled trial of a web-based computer-tailored smoking cessation program as a supplement to nicotine patch therapy. Addiction 100, 682–688 (2005)
11. Deci, E.L., Ryan, R.M.: Self-determination theory: a macrotheory of human motivation, development, and health. Can. Psychol. 49, 182 (2008)
12. Bandura, A.: Regulation of cognitive processes through perceived self-efficacy. Dev. Psychol. 25, 729 (1989)
13. Schneider, R.J., Casey, J., Kohn, R.: Motivational versus confrontational interviewing: a comparison of substance abuse assessment practices at employee assistance programs. J. Behav. Health Serv. Res. 27, 60–74 (2000)
14.
Francis, N., Rollnick, S., McCambridge, J., Butler, C., Lane, C., Hood, K.: When smokers are resistant to change: experimental analysis of the effect of patient resistance on practitioner behaviour. Addiction 100, 1175–1182 (2005)
15. Moyers, T.B., Miller, W.R., Hendrickson, S.M.: How does motivational interviewing work? Therapist interpersonal skill predicts client involvement within motivational interviewing sessions. J. Consult. Clin. Psychol. 73, 590 (2005)
16. Lundahl, B.W., Kunz, C., Brownell, C., Tollefson, D., Burke, B.L.: A meta-analysis of motivational interviewing: twenty-five years of empirical studies. Res. Soc. Work Pract. 20, 137–160 (2010)
17. Ahluwalia, J.S., et al.: The effects of nicotine gum and counseling among African American light smokers: a 2 × 2 factorial design. Addiction 101, 883–891 (2006)
18. Coleman, T., Cheater, F., Murphy, E.: Qualitative study investigating the process of giving anti-smoking advice in general practice. Patient Educ. Couns. 52, 159–163 (2004)
19. Miller, W.R., Rollnick, S.: Meeting in the middle: motivational interviewing and self-determination theory. Int. J. Behav. Nutr. Phys. Act. 9, 1–2 (2012)
20. Liu, B., Sundar, S.S.: Should machines express sympathy and empathy? Experiments with a health advice chatbot. Cyberpsychol. Behav. Soc. Netw. 21, 625–636 (2018)
21. He, L., Basar, E., Wiers, R.W., Antheunis, M.L., Krahmer, E.: Can chatbots help to motivate smoking cessation? A study on the effectiveness of motivational interviewing on engagement and therapeutic alliance. BMC Public Health 22, 1–14 (2022)
22. Shingleton, R.M., Palfai, T.P.: Technology-delivered adaptations of motivational interviewing for health-related behaviors: a systematic review of the current research. Patient Educ. Couns. 99, 17–35 (2016)
23. De Vreede, T., Raghavan, M., De Vreede, G.J.: Design foundations for AI-assisted decision making: a self-determination theory approach (2021)
24. Smit, E.S., Bol, N.: From self-reliers to expert-dependents: identifying classes based on health-related need for autonomy and need for external control among mobile users. Media Psychol. 23, 391–414 (2020)
25. Altendorf, M.B., van Weert, J.C., Hoving, C., Smit, E.S.: Should or could? Testing the use of autonomy-supportive language and the provision of choice in online computer-tailored alcohol reduction communication. Digit. Health 5, 2055207619832767 (2019)
26. Hyde, J., Hankins, M., Deale, A., Marteau, T.M.: Interventions to increase self-efficacy in the context of addiction behaviours: a systematic literature review. J. Health Psychol. 13, 607–623 (2008)
27. Gaume, J., Gmel, G., Faouzi, M., Daeppen, J.-B.: Counselor skill influences outcomes of brief motivational interventions. J. Subst. Abuse Treat. 37, 151–159 (2009)
28. Colby, S.M., et al.: Enhanced motivational interviewing versus brief advice for adolescent smoking cessation: results from a randomized clinical trial. Addict. Behav. 37, 817–823 (2012)
29. Madson, M.B., et al.: Measuring client perceptions of motivational interviewing: factor analysis of the Client Evaluation of Motivational Interviewing scale. J. Subst. Abuse Treat. 44, 330–335 (2013)
30. Heatherton, T.F., Kozlowski, L.T., Frecker, R.C., Fagerström, K.O.: The Fagerström Test for Nicotine Dependence: a revision of the Fagerström Tolerance Questionnaire. Br. J. Addict. 86, 1119–1127 (1991)
31. Komissarouk, S., Harpaz, G., Nadler, A.: Dispositional differences in seeking autonomy- or dependency-oriented help: conceptual development and scale validation. Pers. Individ. Differ. 108, 103–112 (2017)
32. Spek, V., Lemmens, F., Chatrou, M., van Kempen, S., Pouwer, F., Pop, V.: Development of a smoking abstinence self-efficacy questionnaire. Int. J. Behav. Med.
20, 444–449 (2013)
33. Biener, L., Abrams, D.B.: The Contemplation Ladder: validation of a measure of readiness to consider smoking cessation. Health Psychol. 10, 360 (1991)
34. Larsen, D.L., Attkisson, C.C., Hargreaves, W.A., Nguyen, T.D.: Assessment of client/patient satisfaction: development of a general scale. Eval. Program Plann. 2, 197–207 (1979)
35. Shiffman, S., Brockwell, S.E., Pillitteri, J.L., Gitchell, J.G.: Individual differences in adoption of treatment for smoking cessation: demographics and smoking history. Drug Alcohol Depend. 93, 121–131 (2008)
36. Van den Broeck, E., Zarouali, B., Poels, K.: Chatbot advertising effectiveness: when does the message get through? Comput. Hum. Behav. 98, 150–157 (2019)
37. Secades-Villa, R., Alonso-Pérez, F., García-Rodríguez, O., Fernández-Hermida, J.R.: Effectiveness of three intensities of smoking cessation treatment in primary care. Psychol. Rep. 105, 747–758 (2009)
38. Kontos, E., Blake, K.D., Chou, W.-Y.S., Prestin, A.: Predictors of eHealth usage: insights on the digital divide from the Health Information National Trends Survey 2012. J. Med. Internet Res. 16, e3117 (2014)
39. Safi, Z., Abd-Alrazaq, A., Khalifa, M., Househ, M.: Technical aspects of developing chatbots for medical applications: scoping review. J. Med. Internet Res. 22, e19127 (2020)
Interacting with News Through Voice User Interfaces Oda Elise Nordberg(B)
and Frode Guribye
University of Bergen, 5007 Bergen, Norway {oda.nordberg,frode.guribye}@uib.no
Abstract. An important application area for voice user interfaces (VUIs) is news consumption, and there is a need to study how such interactions unfold in practice. This article presents a study examining how inexperienced users navigate news on VUIs and which interactions indicate user expectations beyond the current capabilities of VUIs. A field trial was carried out with seven participants in five households, followed by in-depth interviews with six of the participants. The analysis provides a detailed picture of how these interactions play out in practice and offers insights into participants' perspectives on their experiences. Participants had some expectations of interacting with news on VUIs, but many of them were not met. This article identifies three main challenges: 1) lack of guidance for interaction in the news context, 2) limited capabilities for navigating news sources, and 3) insufficient support for deep news exploration, with accompanying design implications for more engaging interactions with VUIs for news purposes. Keywords: Voice interactions · Voice user interfaces (VUIs) · News · Design
1 Introduction

The way people consume news has changed in recent decades. We have gone from reading print newspapers, listening to news on the radio, and watching television to reading online and engaging with news on social media. With these changes, news audience engagement has become an important part of news companies' business models [1]. The incorporation of chatbots is one strategy used to create engagement with news. Chatbots are technologies that give users access to data and services through natural language dialogue [9, 13]. In principle, interactions with chatbots can be written or spoken dialogues, but the term is usually associated with written interactions [10]. In this article, we focus on spoken-dialogue chatbots and use the term voice user interfaces (VUIs) to emphasize this aspect. While both text- and voice-based conversational technologies can be classified as chatbots, it is important to recognize that they entail different types of interactions, opportunities, and challenges. Much research has been done on text-based chatbots for news purposes, in areas such as design and implementation [17, 19], chatbot personality and relationships with news audiences [8], user experiences and expectations [34], news recommendations [4], and perceptions of chatbots and the credibility of information [33].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 18–33, 2023. https://doi.org/10.1007/978-3-031-25581-6_2
There have been some studies on VUIs and news, focusing on news information quality [6], the potential benefits of a self-disclosing VUI for news consumption [26], and interactive news [16]. However, the field is still little explored. Recent advances in natural language processing have led to the widespread adoption of VUIs. VUIs such as Amazon Alexa, Apple's Siri, and Google Assistant are now present in many people's lives, either as services on their smartphones or as physical smart speakers in their homes. VUIs are increasingly being used for news consumption [22], and so-called digital news leaders argue that audio is becoming increasingly important [8] and will change the way audiences discover media online in the coming years [23, p. 35]. Advances in natural language processing offer new ways of interacting with news, in which users can take a more active role. Still, current VUIs have some obvious limitations, and there is a mismatch between user expectations and technological capabilities [18]. There needs to be a better understanding of how users interact with VUIs and what expectations they have of such encounters. With a few exceptions [e.g., 25, 31], previous studies of VUI use have been conducted in laboratories and not in the natural environment of users. In this study, we examine how inexperienced users interact with news using VUIs and which interactions indicate expectations that go beyond the current capabilities of VUIs. This study builds on a tradition within human-computer interaction (HCI) of studying and analyzing naturally occurring interactions with technology [3, 7, 24] that has also been applied to VUIs [25]. Drawing inspiration from interaction analysis [15], we hold that to understand the current practice of using VUIs for news consumption and to gain insight into user expectations, it is valuable to study the interactions between VUIs and people in their daily lives.
The social and environmental aspects of everyday life provide realistic and detailed data that can be analyzed to better understand these interactions and expectations. We present the results of a seven-day field trial in five households, in which participants were asked to regularly use a VUI to consume news content. These interactions were recorded and then analyzed, inspired by interaction analysis [15]. We also conducted interviews with participants to assess their experiences and learn about their perspectives. Our study contributes to chatbot research by examining how people interact with VUIs, i.e., spoken-dialogue chatbots, for news purposes. It aims to provide a better understanding of the limitations of VUIs for news purposes and to make suggestions on how these interactions can be improved. Our work provides the following contributions to the field: 1) We present a detailed analysis of six types of interaction attempts specific to the VUI news domain: initiating news interactions, navigating between different news sources, navigating within a news source, accessing specific news genres, searching for additional information and reports, and requesting clarification on items in the news. 2) We describe three main challenges in interacting with news content via VUIs, with accompanying design implications: lack of interaction guidance in news contexts, limited navigation capabilities within news sources, and insufficient support for deep exploration of news.
2 Background

In this section, we present work related to VUIs and conversational interactions with news.

2.1 Voice User Interfaces

VUIs are technologies in which spoken words are the primary input and output. Other terms used to describe similar voice technologies include personal (smart) assistant, smart speaker, conversational agent, and conversational user interface. The latter term has been debated, as it is argued that such interactions are not truly conversational [5, 25, 27, 28]. On the one hand, the use of speech as the main form of interaction offers many possibilities: it is natural and intuitive and makes interaction easier and possibly more inclusive [9]. On the other hand, people with hearing or speech impairments may be excluded from VUIs. Compared to graphical user interfaces (GUIs) and text-based user interfaces, VUIs can reduce the cognitive effort of tasks [30]. Speaking is often faster than typing, which can improve task efficiency and user satisfaction. VUIs are hands-free technologies that give users the ability to multitask and can be beneficial for people with certain physical disabilities. Text-based interactions allow users to quickly navigate through content and focus on reading the relevant parts. VUIs can be less efficient in this regard, as users need to listen to the full response to get an overview of the content [30]. In some VUI systems, such as many smart speakers, speech is the only input and output. The lack of visual cues and affordances can make it difficult to appreciate the full scope of VUI capabilities [32]. Driven by the notion of an AI-powered future, there has been some hype around the use of VUIs, which are portrayed as naturally integrated into everyday conversations. Empirical studies of this phenomenon paint a different picture and point to the natural and normal troubles of such interactions [25].
Rather than being full conversations, VUI interactions are often transactional in purpose, meaning they are goal-oriented and usually involve users gathering the information necessary to complete a task or achieve a goal [18]. In daily use, VUIs are often embedded in complex multi-party settings, and users frequently encounter problems [29]. Myers et al. [21] find four categories of obstacles that users face when interacting with unfamiliar VUIs: unknown intent (the person utters a command that the VUI does not support or formulates the query in a way that the VUI cannot parse), natural language processing error (the VUI misinterprets what the person is saying), failed feedback (the VUI gives feedback that may be ambiguous or misinterpreted by the person), and system error (technical errors). Users respond to these obstacles in different ways, such as simplifying the utterance or adding more information, making a different request, restarting the interaction, settling for a suboptimal result, or quitting [21]. Luger and Sellen [18] find that there is a serious gulf between users' expectations about the capabilities of conversational agents and their actual capabilities. They argue that users have difficulty imagining how the conversational agent works, and that the agent reinforces this opacity by not revealing necessary information about its capabilities and (lack of) intelligence. Sciuto et al. [31] point out that it is difficult for users to discover new features, as they usually explore the VUI's functionality only the first few times they interact with it. To address the problem of VUIs acting as black boxes, Porcheron et al. [25] suggest that designers should follow the advice of Dourish and Button [7] on observable-reportable abstractions, presenting information and cues about what the system is doing, why it is doing it, and what it is likely to do next.

2.2 Conversational Interactions with News

Traditional journalism can be described as a form of lecture or monologue. In contrast, conversational journalism aims to present the news in a conversational manner [20]. News organizations have explored the use of text-based chatbots to engage news audiences by offering a more conversational form of journalism. Operating on a popular messaging platform with a chatty and informal tone, one such chatbot secured the engagement of specific audiences that were not previously involved in news consumption [8]. In a study of users of popular international news chatbots, Zhang et al. [34] identify advantages and challenges of today's news chatbots. On the one hand, they make it easy to find news information, and interactive and personal content can be engaging. On the other hand, current news chatbots still have limitations in terms of effectiveness, information content, efficiency, and humanlike characteristics. The most notable contribution on VUIs in the news domain is the study by Dambanemuya and Diakopoulos [6], who examined smart speakers with respect to information quality in news queries. They found that Alexa had a high response rate of 90%, of which 70% were deemed relevant to the questions asked. The way users phrased their questions influenced the results: directed questions (e.g., "can you tell me about...?" and "what happened...?") led to more relevant responses than open-ended ones (e.g., "what's new...?"), as the latter refer to evolving events. Most responses came from general knowledge sources, usually Wikipedia pages, and only 1.4% of responses came from news sources.
3 Methods

In this section, we describe the methods used in this study. First, we describe how we conducted a field trial to examine how participants interacted with news via VUIs. Second, we explain how we used interviews to better understand participants' experiences and expectations. Finally, we describe our analysis process.

3.1 Field Trial

We started this project by conducting a field trial [3] in April 2021 to investigate how novice users interact with VUIs in their daily lives to stay informed about the news. Field trials are often used in HCI to explore new technologies and use cases in a more natural setting than laboratory experiments. By carrying out the field trial 'in the wild', we were able to obtain information about actual use of the system, including the possibility of 'unintended use' and more detailed data on what works and what does not [3].
Participants. In the study, we recruited first-time users who had no experience with VUIs for news purposes, as our goal was to examine participants' initial expectations and experiences. Based on convenience sampling and our personal network, we recruited a total of 7 novices (4 females and 3 males) from 5 households. Their ages ranged from 26 to 66 years. See Table 1 for an overview of households and participants. All participants had some familiarity with VUIs; for example, they had tried a personal assistant on their smartphones. However, none of the participants had used one to follow the news. Six of the participants were Norwegian speakers and interacted with the VUIs in Norwegian, while one participant, Ben, interacted with his chosen VUI, Amazon Alexa, in English.

Table 1. Overview of participants in the field trial. The names are pseudonyms. GA = Google Assistant; Alexa = Amazon Alexa
Household   Name   Age   Gender   VUI
1           Amy    62    Female   GA
1           Alf    66    Male     GA
2           Ben    27    Male     Alexa
3           Cat    26    Female   GA
3           Cal    26    Male     GA
4           Deb    51    Female   GA
5           Eva    32    Female   GA
Setup. Participants freely explored and interacted with a physical VUI in their homes, but were instructed to use it to get news twice a day for one week. Participants who did not have a physical VUI (all except Ben) were provided with a Google Home Mini device for the duration of the study. We helped participants set up the VUI, and they chose their preferred news sources. They recorded their interactions with the news using a voice recording app on their phones and sent these recordings to us via an encrypted file-transfer service. In total, 70 voice recordings were made, lasting between 13 s and 10 min 14 s, with a total duration of 6.3 h. All recordings were transcribed. Prior to conducting the field trial, we conducted a one-day pilot study with one participant to ensure that the instructions were clear and that we were collecting usable data. The research project was reviewed and approved by the local research ethics committee.
3.2 Interviews

After completing the field trial, all participants were asked to participate in a follow-up interview about their experiences. A semi-structured interview guide, which had been tried out in the pilot study, was followed. In households with two members, both were invited to participate in the same interview. In the case of Cal and Cat, Cal was unable to attend. The interviews lasted between 21 min and 1 h 15 min, with a total duration of 3.2 h. The interviews covered four main topics: the participants' news habits before the field trial (e.g., how they typically kept informed about the news, news preferences), how they interacted with the VUI to keep informed about the news (e.g., a typical scenario; their goals, demands, expectations), how they experienced these interactions (likes and dislikes, problems encountered, overall experience), and speculations about how these interactions might evolve in the future (how the interactions could be improved, possible future roles of VUIs).

3.3 Analysis

Our analysis was inspired by interaction analysis [15], which examines how people interact with each other and with artifacts and tools, in our case VUIs. In interaction analysis, data are found in the details of everyday social interactions that naturally occur between human and technological actors and their environment. This includes conversations, non-verbal interactions, and the use of artifacts and technology [15]. Our analysis was based on the audio recordings from the field trial, which provided rich data from natural, everyday interactions with the VUI, while the interviews provided additional descriptions and explanations. The first author coded the field trial and interview materials. From these initial codes, we inductively identified the themes presented in this article. The recordings of interactions with the VUI were examined to identify routine practices as well as challenges and the workarounds that followed. The interviews served to complement these findings and to better understand the participants' accounts. During the analysis, we paid special attention to instances that point beyond the current capabilities of VUIs.
4 Results

In this section, we present the results of the field trial and interviews. Subsection 4.1 provides an overview, while Subsects. 4.2, 4.3, 4.4, 4.5, 4.6, and 4.7 each focus on one of six different interaction attempts specific to the VUI news domain.

4.1 Overview

Participants got news updates via the VUI twice daily, as instructed. They mostly did this once in the morning and once in the evening. Most participants performed this task alone, but participants who lived together sometimes did it together or with the other person present. Sometimes participants were doing other things while using the VUI for news, such as eating or cleaning. Based on the analysis, the current way of getting news updates through a VUI is quite simple and limited. The process begins when a person says the VUI wake word, usually followed by a news command or question. The VUI then finds
the news sources that the person follows and presents them one by one. These sources are specified in the Google Home application or when the user requests the news for the first time (Alexa). Each source has a pre-recorded audio file, similar or identical to a radio news broadcast, which is played from beginning to end. These audio files do not appear to afford interactions beyond those associated with traditional audio files, such as pause and play. During the interviews, all participants indicated that it was easy to ask the VUI to bring them news and that they felt up to date, but none wanted to continue using the VUI for news purposes. Participants tried to interact with the news in various ways, which often resulted in obstacles matching the descriptions by Myers et al. [21]. Participants expressed uncertainty about the capabilities of the VUI. They explained that it is difficult to know what features are available in the VUI and what information the VUI bases its actions on. Even when a command was successful, the VUI often provided limited information, confusing the user: "I once asked it to turn on the podcast, and then it just turned on a random podcast, and I didn't know... It probably ended up saying it was the Aftenposten podcast, but I don't know where it found it." (Eva) All participants said that the technology and interaction needed to be improved, but believed it had potential for the future. As shown in the following subsections, participants attempted to interact with the news in six different ways specific to the VUI news domain: initiating news interactions, navigating between different news sources, navigating within a news source, accessing specific news genres, searching for additional information and reports, and requesting clarification on items in the news. As we will show, most of these interaction attempts were beyond the capabilities of the two VUIs.
4.2 Initiating News Interactions

The VUIs used in this study were smart speakers with various functions, of which news presentation is only one. News interactions usually started shortly after VUI activation. These initiations of news interactions were generally successful, except for a few cases of system error [21]. What participants said when they wanted news updates varied, but it was usually expressed as a request or brief instruction, as shown in the excerpts below. The following excerpt shows a typical successful interaction where a person wants to be informed about the news.
(GA = Google Assistant) Alf: (00:07) OK Google, what's new? GA: (00:10) This is the latest news. NRK Nyheter at 22:03 today.
As shown here, Alf requests updates. The VUI introduces the news source NRK Nyheter, and at 00:18 the pre-recorded NRK Nyheter broadcast begins. While Alf politely asks for updates, his wife Amy uses a command:
Amy: (00:14) OK Google, the news.
4.3 Navigating Between News Sources

All participants followed multiple news sources on their VUI. Navigating between different sources was one of the most successful types of interaction in this study. Participants could state which source they wanted to hear, and the VUI would locate the source mentioned. They could also navigate between sources by telling the VUI to "skip", which would make the VUI play the next source in line, as shown below:
(00:11) [VG Nyheter starts] Cat: (00:51) OK Google, next. GA: (00:56) Finansavisen, today at five. (00:59) [Finansavisen news starts]
4.4 Navigating Within a News Source

All participants, with the exception of Deb, who did not interact much with the news, attempted to navigate within news sources. The VUIs in our study lacked features for such navigation, making this a problem of unknown intent [21]. When a user told the VUI to skip the current story, Google Assistant skipped the entire news feed from that source. Several commands, such as "next" and "skip", were tried without success.
Alexa: (00:28) From CNN. [CNN news begins] [Story about activism and the president (US)] Ben: (00:59) Alexa, uh, skip. [Alexa plays a neutral "dang-dang" sound] Alexa: (01:04) This update is currently unavailable.
In the excerpt above, Ben tries to skip a specific story, but Alexa tells him that this action is not available. Unlike Google Assistant, Alexa does not skip the entire news feed. Based on how Alexa responds, "skip" is recognized as an action, but it is still considered a problem of unknown intent [21] because the command is not supported. Cat tried several different commands, but Google Assistant did not recognize any of them. The following excerpt shows how Google Assistant ties the skip command to music streaming services. 47 seconds pass from Cat's first attempt to skip a story until she is back at the same story.
Cat: (00:28) OK Google, [VG Nyheter pauses] can I hear the next story? GA: (00:32) Sorry, I don't understand. (00:34) [VG Nyheter continues] Cat: (00:40) OK Google, continue. Cat: (00:45) OK Google... [VG Nyheter pauses] Skip forward. GA: (00:49) Song selection is only available for YouTube Music Premium subscriptions, but you might like this station. (00:58) [Music starts playing] Cat: (01:00) [Laughs a little] Hey Google... [Music stops] Can you go back to the news? GA: (01:06) To hear the news, just say "Give me the latest news"... Cat: (01:09) Give me the latest news. GA: ...You can also request news from... Cat: (01:11) OK Google. GA: ...specific sources or... Cat: (01:13) Give me the latest news. GA: (01:15) This is the latest news. VG Nyheter at 22:17... [VG Nyheter resumes]
Alf explained that he had tried many different ways to navigate within a source, but he never knew what to say. This experience was shared by most participants. In the interview, Alf described his experience: "What I found annoying was when... let's say you were on NRK Nyheter and you didn't want to hear more about corona. You want to hear the next story from NRK. But when you said 'next', you were done with NRK. It jumped directly to the next news source."

4.5 Accessing Specific News Genres

Sometimes participants wanted a specific news genre, for example local, international, or culture-related news. When asked for specific genres of news, the VUIs were not able to provide a result, leading to problems of unknown intent [21].
Alf: (05:51) OK Google, update me on NRK sports. GA: (05:57) I can't play news about NRK sports yet.
As shown in the excerpt above, Alf's command is somewhat complex, as he wants a specific genre of news from a specific source. The VUI recognizes sports as a genre but states that it cannot perform this action. The indication that this is not possible yet suggests that the feature may become available in the future.
4.6 Searching for Additional Information and Reports

Occasionally, participants would interrupt the news feed to ask the VUI for more information about the news presented or to hear what other sources said about the subject. The VUIs did not understand these requests, as shown below:
[Story about the Kjærvik murder] Alf: (02:51) OK Google, update me on the Kjærvik murder. GA: (02:56) Sorry, I did not understand. (02:58) [News continues] Alf: (03:06) OK Google, can you tell me what VG says about the Kjærvik murder? GA: (03:12) Sorry, I don't understand what you mean. (03:16) [News continues]
The example above shows that Alf stops the news feed because he wants more information about the story being presented. Google Assistant does not understand his request (unknown intent) [21] and continues. Alf interrupts again, asking specifically what another news source reports on the case in question. Google Assistant still does not understand and again continues with the news feed.

4.7 Requesting Clarification on Items in the News

The news sometimes contained information or topics that users wanted to know more about. This often involved topics or items that the participants were not very familiar with, making it difficult to follow the actual story. Following up required pausing the news feed and asking the VUI a question. The VUI often struggled to answer these questions, but sometimes it found a relevant answer:
[Story about truck controls at the border] Cat: (03:11) OK Google. [NRK Nyheter pauses] What are the entry requirements for Norway? GA: (03:16) Regjeringen.no states: "Only foreigners residing in Norway can now enter Norway. [...]" (03:38) [NRK Nyheter continues]
The excerpt above shows a successful interaction where the VUI answers the participant's question. However, there were also failed interactions where the VUI responded but misinterpreted the context, as shown in the following excerpt:
[Story about the new European football league, the "Super League"] Eva: (01:33) OK Google... [NRK Nyheter pauses] What is the Super League? GA: (01:38) Here is what Wikipedia says: "The Superliga is the highest level of Danish football. The winner becomes the Danish champion. [...]" Eva: (01:49) OK Google... GA: "...the championship is decided..."
Eva realizes that the answer is wrong and tries to be more specific:
Eva: (01:51) What is the Norwegian Super League? GA: (01:55) Sorry, I don't understand what you mean.
In Eva's case, her husband later told her what the Super League was. A consequence of the VUI not taking context into account and providing incorrect information is that users may accept it and unknowingly spread incorrect information.
5 Discussion

Engagement is considered fundamental in the competitive news market [1]. There is an expectation within the industry [22, 23] and among researchers [8, 14] that conversational interfaces such as VUIs will influence future news practices and potentially improve engagement with news [12]. For these expectations to become realistic, interaction with VUIs in news settings must be improved. We argue that it is necessary to better understand how users interact with VUIs in news contexts, as well as to identify their expectations, needs, and preferences. Our research is a step in that direction. The results of our study showed that participants had specific expectations about the capabilities of VUIs in news settings, but these were not met. Participants expected to be able to navigate between and within news sources, access specific genres of news, get clarification on items in the news, and obtain additional information and reports. In addition to their unmet expectations, participants encountered problems consistent with those reported in previous VUI research [18, 21, 25]. The conversational format of VUIs offers new ways to interact with content, and our analysis shows that people have high expectations of it. In news contexts, the VUIs in our field trial used traditional radio concepts and formats, in which users are passive listeners. The role of the VUIs in interacting with the news was mainly limited to finding the various sources and presenting them, while the actual presentations were made by the news sources via pre-recorded audio files. When the technology worked correctly, the process was easy. When users wanted to listen to a sequence of several radio-like news segments (often with news repeated across different sources), this worked well. However, if users wanted to interact by navigating the content or seeking deeper knowledge by asking questions, they would likely run into problems. When users initiated these types of actions, the VUI had to pause the news feed and try to identify the user's goal. However, the current VUIs lacked the ability to consider the context from which the user's interaction originated (for example, a topic featured in the news that prompted them to ask a question). In our field trial, participants repeatedly tried, and mostly failed, to interact with the news through the VUI, showing that the gulf between users' expectations and VUI capabilities [18] also applies in this application area. The fact that the news was easy to access and that participants felt informed was not enough to make them want to continue using VUIs for news after the study. All participants agreed that the technology and interactions need to be improved. Text-based news chatbots give news a more prominent role than the VUIs examined in our research do. These news chatbots are intentionally designed to provide news information in a conversational way [8, 17, 33], while the VUIs examined in our study are intelligent assistants with multiple functions, of which news delivery is only one. As with news chatbots [34], it is easy to use VUIs to find news, but the VUIs lack the interactive and personal content that news chatbot users engage with. In the future, it may be worth exploring how the capabilities of text-based news chatbots can be transferred to VUIs. In line with previous research on information quality in VUI news queries [6], our research also shows the relevance of how people formulate their requests. Research on natural interactions with VUIs has highlighted the normal and natural troubles that people encounter when interacting with VUIs [21, 25]. Our study identified specific issues related to the VUI news domain. We argue that not all VUI problems can be solved at a general level and that difficulties encountered in specific VUI domains, such as news, need to be examined.
Based on our results, we argue that the capabilities of VUIs in news settings are not fully utilized. We identified three main issues with using VUIs for news, along with design implications for improving the experience.

5.1 Interaction Guidance in News Contexts

The first issue we want to highlight is user uncertainty about the capabilities of VUIs for interacting with news. Today, users are used to playing a passive role in news consumption, such as reading online newspapers or listening to news podcasts. VUI technology offers new ways to interact with news, but users may find the experience unfamiliar and have a limited understanding of VUI capabilities. Participants in our study wanted to engage with the news by asking questions and trying to navigate the content, but they had difficulty doing so. They tried, failed, and used various tactics to resolve these problems, consistent with the results reported by Myers et al. [21]. Users wanted to interact with the news but did not know how. Even after a week, participants were unsure about the capabilities of the VUIs. This gap between user expectations and the capabilities of VUIs could be reduced if VUIs guided users toward potentially relevant interactions. We see a need to develop tailored guidance and prompts, at both a general and a domain-specific level. Both Google Assistant and Alexa provided some interaction guidance for the news domain, but it appeared only sporadically and was too general.
In future work, research should focus on how best to convey the relevant capabilities of VUIs in different typical application settings. A VUI could explain its capabilities in a specific application area and provide users with suggestions, for example on how to navigate news content, in line with Porcheron and colleagues' suggestion [25] to follow the advice on observable-reportable abstractions [7]. By providing more feedback and guidance, the VUI can help users explore and understand its existing features. Additionally, users can gain more insight into the VUI's processing of successful and failed interactions, which can give them a more realistic understanding of how the VUI works.

5.2 Navigating Between and Within News Sources

One of the main problems participants faced concerns navigation through news content. With the VUIs, users could navigate between different sources, but not within a specific source. There is a need to establish a vocabulary that facilitates navigation through news content. Future research should examine how people refer to news presented through VUIs and what types of utterances they use when navigating news content. As the study by Dambanemuya and Diakopoulos [6] shows, the way people formulate their VUI requests influences the relevance of the information they receive. For example, some standard phrases could be defined, such as "give me the next story", "skip this story", or "what does another source say about this topic?", to align user expectations with VUI capabilities. This may require existing news audio files to be segmented and tagged with metadata so that the VUI can differentiate individual stories. This would require work from news media to prepare their content for access via VUIs. Another solution could be for the VUI to use text-to-speech technology and read online newspapers or other sources, such as Wikipedia, directly. Using text-to-speech would, however, affect the listening experience.
There would be a trade-off between a professional journalist reading the news - using the prosody and intonation typical of the "radio talk" genre [11] - and the limitations in that respect when a story is read aloud by a machine. 5.3 Deep News Exploration with VUIs As Ford and Hutchinson [8] discussed for news chatbots, VUIs can deliver news in a conversational format where users can have more advanced interactions and ask questions, if this is considered in the design process of such systems. The VUI format offers the opportunity to interact with news in a more conversational way than more traditional media such as newspapers or radio programs. We argue that VUIs have untapped interactive potential in news contexts and that our results encourage the development of more advanced interactions. Participants in our study often tried to interact with news through the VUI in advanced ways, but were rarely successful. They also sought to educate themselves and explore the topics presented in the news in more depth, asking related questions and trying to navigate the content. When they asked the VUI questions
Interacting with News Through Voice User Interfaces
about the news, they rarely received a good answer. The VUI should know the topics of the presented news, including knowledge of the main entities (e.g., key persons, places, and phenomena). With such awareness, the VUI is more likely to respond with relevant information when asked questions. Similar to the design implications mentioned above, this could be achieved by adding tags and metadata to the audio files.
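To make the metadata suggestion concrete, the sketch below shows how tagged news stories could support the standard navigation phrases and simple entity questions discussed in this section. This is a hypothetical illustration, not part of the study: the data format, the `NewsSession` class, and the keyword matching are all assumptions.

```python
# Hypothetical sketch: news stories tagged with metadata so a VUI
# could support navigation commands and simple entity questions.

NEWS_PLAYLIST = [
    {"source": "NRK", "genre": "politics", "title": "Budget negotiations stall",
     "entities": {"who": "the finance minister", "where": "Oslo"}},
    {"source": "BBC", "genre": "sports", "title": "Cup final preview",
     "entities": {"who": "the national team", "where": "Wembley"}},
]

class NewsSession:
    def __init__(self, playlist):
        self.playlist = playlist
        self.index = 0

    def current(self):
        return self.playlist[self.index]

    def handle(self, utterance: str) -> str:
        text = utterance.lower()
        if "next story" in text or "ignore this story" in text:
            # Navigate within the news content: move to the next story.
            self.index = min(self.index + 1, len(self.playlist) - 1)
            return f"Next up: {self.current()['title']}"
        if "another source" in text:
            # Navigate between sources: jump to a story from a different outlet.
            for i, story in enumerate(self.playlist):
                if story["source"] != self.current()["source"]:
                    self.index = i
                    return f"From {story['source']}: {story['title']}"
        if text.startswith("who"):
            # Answer a simple entity question from the story's metadata tags.
            return self.current()["entities"].get("who", "I don't know.")
        # Fallback that doubles as interaction guidance (cf. Sect. 5.1).
        return ("I can skip stories, switch sources, "
                "or answer who/where questions about this story.")
```

With metadata of this kind, a follow-up question such as "Who is this story about?" could be answered from the `entities` field rather than failing silently, and the fallback response itself provides the kind of interaction guidance called for above.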
6 Conclusion In this article, we present our results from a seven-day field trial, followed by interviews, in which we examined how inexperienced users interacted with news through VUIs and what their expectations were. Some of our findings are consistent with previous research on news chatbots and VUIs, while others are specific to the VUI news domain. We show how users have specific unmet expectations of the news capabilities of VUIs in terms of navigation and vocabulary. Our study also has some limitations; most notable is the limited context: it was conducted with a small sample of participants in Norway and primarily in Norwegian (one participant completed the task in English with US news sources). Only two types of VUI were used: Google Assistant and Amazon Alexa. Therefore, findings about VUI functionality may not be applicable to other contexts, VUIs, or languages. As all participants were novice users, it may be beneficial to also conduct studies with users who are experienced in using VUIs for news purposes. As this was a field trial focused on natural settings, participants were free to choose which news sources to interact with. Therefore, our results may be relevant for some VUI news sources, but not all. Participants recorded their own interactions with their phones, which may have made the interactions less natural, and it is also possible that not all interactions were recorded. However, we found the collected data rich enough to provide valuable insights into how novice users interact with news through VUIs and to form a clearer picture of their expectations. Our study showed how unsatisfactory interactions with VUIs for news purposes were and how participants had a number of unmet expectations.
In the analysis, we presented six different types of interaction attempts specific to the VUI news domain: initiating news interactions, navigating between different news sources, navigating within a news source, accessing specific news genres, requesting explanations of news items, and seeking additional information and reporting. Based on these interaction attempts, we highlighted three main issues with associated design implications: lack of interaction guidance in news contexts, limited navigation capabilities across and within news sources, and insufficient support for in-depth news exploration. The design implications show how a better experience of interacting with news through VUIs can be provided and the interactive potential utilized. We encourage future research to focus on VUIs for news purposes, specifically on establishing a vocabulary for news interactions and on exploring more advanced interaction design.
References 1. Batsell, J.: Engaged Journalism. Columbia University Press, New York (2015) 2. Brown, B., McGregor, M., Laurier, E.: iPhone in vivo. In: Proceedings of the 2013 Conference on Human Factors in Computing Systems, CHI 2013, pp. 1031–1040. ACM Press, New York (2013). https://doi.org/10.1145/2470654.2466132 3. Brown, B., Reeves, S., Sherwood, S.: Into the wild: challenges and opportunities for field trial methods. In: Proceedings of the 2011 Conference on Human Factors in Computing Systems, CHI 2011, pp. 1–12. ACM Press, New York (2011). https://doi.org/10.1145/1978942.1979185 4. Chen, L., et al.: A pilot study for understanding users' attitudes towards a conversational agent for news recommendation. In: Proceedings of the 4th Conference on Conversational User Interfaces, CUI 2022, pp. 1–6. ACM Press, New York (2022). https://doi.org/10.1145/3543829.3544530 5. Clark, L., et al.: What makes a good conversation? Challenges in designing truly conversational agents. In: Proceedings of the 2019 Conference on Human Factors in Computing Systems, CHI 2019, pp. 1–12. ACM Press, New York (2019). https://doi.org/10.1145/3290605.3300705 6. Dambanemuya, H.K., Diakopoulos, N.: "Alexa, what about the impeachment?" Evaluating smart speakers for news quality. In: Computation + Journalism Symposium, pp. 1–4. The Brown Institute for Media Innovation, New York (2020) 7. Dourish, P., Button, G.: On "technomethodology": foundational relationships between ethnomethodology and system design. Hum.-Comput. Interact. 13(4), 395–432 (1998). https://doi.org/10.1207/s15327051hci1304_2 8. Ford, H., Hutchinson, J.: Newsbots that mediate journalist and audience relationships. Digit. Journal. 7(8), 1013–1031 (2019). https://doi.org/10.1080/21670811.2019.1626752 9. Følstad, A., Brandtzaeg, P.B.: Chatbots and the new world of HCI. Interactions 24(4), 38–42 (2017). https://doi.org/10.1145/3085558 10.
Følstad, A., Skjuve, M., Brandtzaeg, P.B.: Different chatbots for different purposes: towards a typology of chatbots to understand interaction design. In: Bodrunova, S.S., et al. (eds.) INSCI 2018. LNCS, vol. 11551, pp. 145–156. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-17705-8_13 11. Goffman, E.: Forms of Talk. University of Pennsylvania Press, Philadelphia (1981) 12. Gómez-Zará, D., Diakopoulos, N.: Characterizing communication patterns between audiences and newsbots. Digit. Journal. 8(9), 1093–1113 (2020). https://doi.org/10.1080/21670811.2020.181648 13. Gorwa, R., Guilbeault, D.: Unpacking the social media bot: a typology to guide research and policy. Policy Internet 12(3), 1–30 (2018). https://doi.org/10.1002/poi3.184 14. Jones, B., Jones, R.: Public service chatbots: automating conversation with BBC News. Digit. Journal. 7(8), 1032–1053 (2019). https://doi.org/10.1080/21670811.2019.160937 15. Jordan, B., Henderson, A.: Interaction analysis: foundations and practice. J. Learn. Sci. 4(1), 39–103 (1995). https://doi.org/10.1207/s15327809jls0401_ 16. Jung, H., et al.: Tell me more: understanding user interaction of smart speaker news powered by conversational search. In: Proceedings of the 2019 Conference on Human Factors in Computing Systems, CHI 2019, pp. 1–6. ACM Press, New York (2019). https://doi.org/10.1145/3290607.3312979 17. Laban, P., Canny, J., Hearst, M.A.: What's the latest? A question-driven news chatbot. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 380–387 (2020). https://doi.org/10.18653/v1/2020.acl-demos.4 18. Luger, E., Sellen, A.: "Like having a really bad PA": the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 Conference on Human Factors in Computing Systems, CHI 2016, pp. 5286–5297. ACM Press, New York (2016). https://doi.org/10.1145/2858036.2858288
19. Maniou, T.A., Veglis, A.: Employing a chatbot for news dissemination during crisis: design, implementation and evaluation. Future Internet 12(12), 1–14 (2020). https://doi.org/10.3390/FI1207010 20. Marchionni, D.: Conversational journalism in practice: a case study of The Seattle Times' 2010 Pulitzer Prize for breaking news reporting. Digit. Journal. 1(2), 252–269 (2013). https://doi.org/10.1080/21670811.2012.748513 21. Myers, C., et al.: Patterns for how users overcome obstacles in voice user interfaces. In: Proceedings of the 2018 Conference on Human Factors in Computing Systems, CHI 2018, pp. 1–7. ACM Press, New York (2018). https://doi.org/10.1145/3173574.3173580 22. Newman, N.: Journalism, Media and Technology Trends and Predictions 2018. Digital News Report, Reuters Institute (2018) 23. Newman, N.: Journalism, Media and Technology Trends and Predictions 2019. Digital News Report, Reuters Institute (2019) 24. Pizza, S., et al.: Smartwatch in vivo. In: Proceedings of the 2016 Conference on Human Factors in Computing Systems, CHI 2016, pp. 5456–5469. ACM Press, New York (2016). https://doi.org/10.1145/2858036.2858522 25. Porcheron, M., et al.: Voice interfaces in everyday life. In: Proceedings of the 2018 Conference on Human Factors in Computing Systems, CHI 2018, pp. 1–12. ACM Press, New York (2018). https://doi.org/10.1145/3173574.3174214 26. Rao, S., et al.: Ethical self-disclosing voice user interfaces for delivery of news. In: Proceedings of the 4th Conference on Conversational User Interfaces, CUI 2022, pp. 1–4. ACM Press, New York (2022) 27. Reeves, S.: Some conversational challenges of talking with machines. In: Companion of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW 2017, pp. 431–436. ACM Press, New York (2017) 28. Reeves, S.: Conversation considered harmful? In: Proceedings of the 1st International Conference on Conversational User Interfaces, CUI 2019, pp. 1–3. ACM Press, New York (2019). https://doi.org/10.1145/3342775.3342796 29.
Reeves, S., Porcheron, M., Fischer, J.: "This is not what we wanted": designing for conversation with voice interfaces. Interactions 26(1), 46–51 (2018). https://doi.org/10.1145/3296699 30. Rzepka, C., Berger, B., Hess, T.: Voice assistant vs. chatbot - examining the fit of conversational agents' interaction modes and information search tasks. Inf. Syst. Front. 24, 1–18 (2021). https://doi.org/10.1007/s10796-021-10226-5 31. Sciuto, A., et al.: "Hey Alexa, what's up?" In: Proceedings of the 2018 Designing Interactive Systems Conference, DIS 2018, pp. 857–868. ACM Press, New York (2018). https://doi.org/10.1145/3196709.3196772 32. Valério, F.A.M., et al.: Here's what I can do: chatbots' strategies to convey their features to users. In: Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, IHC 2017, pp. 1–10. ACM Press, New York (2017) 33. Zarouali, B., et al.: Overcoming polarization with chatbot news? Investigating the impact of news content containing opposing views on agreement and credibility. Eur. J. Commun. 36(1), 53–68 (2021). https://doi.org/10.1177/026732312094090 34. Zhang, Z., Zhang, X., Chen, L.: Informing the design of a news chatbot. In: Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, IVA 2021, pp. 224–231. ACM Press, New York (2021). https://doi.org/10.1145/3472306.347835
Voice Your Opinion! Young Voters' Usage and Perception of a Text-Based, Voice-Based, and Combined Text-Voice Conversational Agent Voting Advice Application (CAVAA) Christine Liebrecht (B)
, Naomi Kamoen and Céline Aerts
Tilburg School of Humanities and Digital Sciences, Department of Communication and Cognition, Tilburg University, Tilburg, Netherlands {C.C.Liebrecht,N.Kamoen}@tilburguniversity.edu
Abstract. Conversational Agent Voting Advice Applications (CAVAAs) are chatbot-based information retrieval systems for citizens who want to learn about political issues in times of elections. Previous studies examining these relatively new tools have focused primarily on the effects of CAVAAs with a text-based chatbot. To further optimize their design, the current study compared the effects of CAVAAs with a text-based, a voice-based, and a combined text-voice chatbot. In an experimental lab study among young voters (N = 60), these three modalities were compared on usage measures (the amount of information retrieved from the chatbot and miscommunications), evaluation measures (ease of use, usefulness, and enjoyment), and political measures (perceived and actual political knowledge). The results show that the three CAVAA modalities performed equally well on the political knowledge measures and on perceived enjoyment. At the same time, the text-based and combined CAVAAs outperformed the voice-based CAVAA in several respects: the voice CAVAA received lower scores for ease of use and usefulness, and respondents requested less additional information and experienced more miscommunications when interacting with the voice chatbot. Analysis of the usage data also indicates that users in the combined condition hardly used the voice option and instead relied almost exclusively on text functionality, such as clicking suggestion buttons. This seems to indicate that voice is not a convenient modality for CAVAA users; we therefore recommend using text-based chatbots in this specific context of use. Keywords: Voting Advice Applications · Conversational Agents · Chatbot Modality · Usefulness · Ease of Use · Enjoyment · Political Knowledge
1 Introduction Cava is a Spanish sparkling wine; CAVAA is the abbreviation of Conversational Agent Voting Advice Application. As in regular Voting Advice Applications (VAAs), which are highly popular in times of elections [10], such as the Dutch StemWijzer, the German Wahl-O-Mat, and the Swedish Valkompassen, CAVAA users respond to attitude statements © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 34–49, 2023. https://doi.org/10.1007/978-3-031-25581-6_3
on relevant political issues (e.g., "The housing tax should be raised") and receive voting advice based on their responses [6, 19]. Unlike a regular VAA, however, a CAVAA has built-in chatbot functionality that users can turn to if they have trouble understanding or responding to the political attitude statements. The chatbot in a CAVAA is trained to provide relevant information to resolve common comprehension problems (see also [13]); for example, it can provide information on the definition of political terms ("What is the housing tax?") or on the state of affairs in relation to the political issue ("How high is the housing tax at the moment?"). Therefore, unlike other chatbots in the political sphere, which provide subjective information about the opinions of an individual candidate running for election [17], a CAVAA aims to provide objective information that voters can use to form a well-considered response to the political attitude statements, which should lead to valid voting advice. Research shows that CAVAAs are valuable information retrieval systems for citizens, as these tools achieve better user experience scores than regular VAAs without chatbot functionality [14]. Furthermore, citizens score higher on measures of political knowledge after working with a CAVAA than after working with a regular VAA [14]. Given these positive findings, now is the time to further optimize the chatbot design in a CAVAA. As young people, who form the main group of (CA)VAA users, increasingly use voice technology in their daily lives [7], this study compares the impact of different chatbot modalities on young voters (18-25): a text-based CAVAA, a voice-based CAVAA, and a CAVAA that combines these two modalities. The three versions of the CAVAA are compared in terms of three types of outcome measures.
First, we compare the tools on a set of usage measures (the number of questions asked to the system and the number of miscommunications encountered) to get a sense of how modality affects the extent to which users feel invited to ask the chatbot questions (cf. [14]). Second, we include subjective tool evaluation measures (ease of use, usefulness, enjoyment), which are commonly used in chatbot research to evaluate the design features of a chatbot (cf. [32]). Finally, we compare the three versions on measures of political knowledge (perceived and actual political knowledge), which are common outcome measures in VAA studies (cf. [15]), as the ultimate goal of these tools is to increase users' political knowledge and thereby the likelihood of casting a vote [11]. By comparing these three chatbot modalities on a set of dependent variables in a specific goal-oriented context of use and for a homogeneous group of young voters, the present study contributes not only to research on (CA)VAAs, but also to chatbot research in general. This is because chatbot research has called for studies of specific chatbot design features in specific domains and for specific user groups [8], as well as for investigations of modality effects across contexts of use [27]. A final reason why the current research adds value to chatbot research is that not only a fully text-based version and a voice-based version are compared, but also a third version that combines text and voice; such a version was missing in previous studies of chatbot modality [18, 27].
C. Liebrecht et al.
1.1 Modality Text-based and voice-based chatbots each have their own distinct characteristics. For example, when typing a question to a text-based chatbot, users can verify that their input is correct before sending the message to the system [2, 18]. Furthermore, a chatbot's written response can be read at the user's own pace [18], which can increase the user's perception of control over the interaction [30]. Finally, in situations where users cannot or do not want to use speech, a text-based tool is more accessible [24]. On the other hand, voice-based chatbots are easier to use in situations where users cannot type messages or read the chatbot's written responses, such as when using a chatbot while cooking or driving [24]. Furthermore, speech-based technology is generally more intuitive, and speaking is also faster than typing [2, 26]. However, this does not mean that using a voice bot is more efficient than using a text-based bot: in a voice context the user is forced to listen to the full output of the chatbot, whereas with a text-based chatbot the user can skim or even skip information efficiently [26]. A final benefit of voice-based chatbots is that voice interaction promotes a sense of humanness, which can lead to greater enjoyment [25, 33]. Although both text and voice have their distinguishing qualities, or perhaps precisely because both modalities have their own distinguishing qualities, neither can be considered superior. Several studies show that users' perception of a chatbot's modality depends both on the characteristics of the task and on the characteristics of the user working with the application. This is shown, for example, in [5], who found that voice interaction leads to more positive attitudes than text interaction, but only for utilitarian tasks (e.g., "When is Father's Day in 2017?") and not for hedonic tasks (e.g., "Tell me a bedtime story").
Furthermore, [27] compared a text-based and a voice-based chatbot in two task contexts: a goal-directed information search task (i.e., looking for a restaurant with predefined instructions) and an exploratory search task (i.e., looking for a restaurant without predefined instructions). Users experienced a greater cognitive workload and more enjoyment in the goal-directed task only, although perceived efficiency was lower. Finally, [26] found correlations between certain user characteristics and a preference for a text or voice modality, expanding on previous studies that focus on demographic characteristics such as age [28]. Task-technology fit theory (TTF; [12]) can be used to explain these differences in the use and perception of text-based and voice-based chatbots, depending on task and user characteristics. The theory posits that the task, the individual user, and the technology's capabilities must fit together to produce positive performance outcomes. When people perceive a strong match between task and technology, they perceive the technology as more effective and efficient; when there are discrepancies between user, task, and technology, the system is evaluated less positively [12]. In terms of TTF, it is relevant to examine how user, task, and technology interact in the specific context of a CAVAA. This context can be seen as both utilitarian and goal-directed, as the user wants to gather political information and ultimately wants to receive application-specific voting advice. Compared to previous chatbot modality studies, which were carried out mainly in (fictional) customer service contexts (e.g., [20, 26, 27]), the CAVAA context can therefore be considered more cognitively demanding,
as users try to resolve their (real) comprehension problems concerning political issues by asking the chatbot questions. In terms of user characteristics, it is known that CAVAA users put minimal effort into collecting the necessary information before responding to the political statements (see the results of [13, 14]). Therefore, our study can be seen as a study of modality effects in a cognitively demanding context in which users exert minimal effort. It is difficult to predict how young voters will use a text-based, a voice-based, and a combined chatbot in the specific context of using a CAVAA. On the one hand, based on TTF, one would expect the combined CAVAA to result in the most intensive usage, and hence in users asking more questions to the system. This is expected because users are free to choose which modality to use and can even switch between modalities during the interaction. On the other hand, since (CA)VAA users have been shown to exert minimal effort when working with the tool [13, 14], it could also be that switching between modalities takes too much effort, so that users stick to just one of the two modalities in the combined version. For the single-modality CAVAAs (voice vs. text), it can be argued that the voice-based chatbot will generate more requests for information, since this modality is more intuitive to use and speaking is faster than typing [2, 26]. On the other hand, correctly interpreting speech output is more difficult than interpreting text output, which the user can process at their own pace [18, 26]. Given these different possible scenarios, it is difficult to formulate a concrete hypothesis about the effect of modality on the usage measures. Since users' evaluation of the tool, as well as how much political knowledge they gain, is likely to depend on their actual use of the tool, we also refrain from formulating concrete hypotheses about these dependent variables.
Instead, we explore the differences between the three CAVAA modalities on the usage measures, the evaluation measures, and the political knowledge measures.
2 Method 2.1 Design In a between-subjects experimental study, we compared a CAVAA with a text-based chatbot, a voice-based chatbot, and a combined chatbot on multiple outcome measures. In the experiment, the CAVAAs were administered to a homogeneous group of eligible young Dutch voters in a laboratory setting. Each participant worked with only one of the three versions of the CAVAA and then completed a survey measuring the evaluation measures (perceived ease of use, usefulness, enjoyment) and the political measures (actual and perceived political knowledge). In addition, actual use of the chatbot modalities was measured by analyzing participants' CAVAA conversation logs with respect to the information requested, the use of predefined buttons or free input to retrieve information, and the occurrence of problems in communicating with the chatbot. On December 16, 2021, the research project received ethics approval from the Ethics Committee of Tilburg University (TSHD_RP174).1 1 Data collection started after consulting the Ethics Committee, before final approval.
2.2 Participants We recruited a sample of 60 young Dutch voters aged between 18 and 24 years (Mage = 20.3 years; SD = 1.88) from our university's participant pool. Of these, 13 participants (21.7%) identified as male, 46 (76.7%) as female, and 1 participant (1.7%) identified as outside the gender binary. All participants were native speakers of Dutch and registered in a Dutch municipality, and therefore entitled to vote. Of the participants, 11 (18.3%) had never voted, 37 (61.7%) had voted in one previous election, and 12 (20%) had voted in several previous elections. We compared participants in the text (N = 19), voice (N = 20), and combined (N = 21) conditions on these demographic characteristics and found no differences in prior voting experience (χ2(4) = 2.02, p = 0.73), gender (χ2(4) = 3.52, p = 0.48), or age (F(2, 57) = 3.70, p = 0.36). This implies that there is no reason to assume a priori differences between participants in the text, voice, and combined conditions. 2.3 Materials Development Process. All three versions of the CAVAA were developed in collaboration with chatbot developer Genius Voice (geniusvoice.nl). This company designed the look and feel of the chatbots and trained them to improve intent recognition. To test functionality, we pretested the three CAVAA versions among nine participants (three per version). Based on these pretests, several improvements were made to the CAVAAs. Below we describe the materials used in the final experiment. All experimental materials can be found on the Dataverse (https://doi.org/10.34894/MNMLAT). Modality. In the text-based condition, users could interact with the chatbot by clicking suggestion buttons or typing their own messages into an open chat window. Thus, they always typed or clicked to query the chatbot and always received a written text response in return.
In the voice-based condition, the user and the chatbot communicated by voice. Similar to the text condition, users in the voice condition were shown written suggestion buttons specifying the types of information they could request, but these buttons were not clickable; instead, users had to read the suggestion buttons aloud to activate them. In addition, they could also ask questions by freely formulating them, comparable to the open chat function in the text condition. When the user asked a question in the voice condition, the chatbot displayed the answer as text on the screen and also read the answer aloud. In the combined condition, users could communicate with the chatbot through both modalities and switch between text and voice during the conversation. This means they could activate suggestion buttons by clicking them or use the voice function to activate content. Furthermore, the chatbot's responses were visible on the screen, and when users activated the sound button, the responses were also read aloud. Only in the combined condition was it possible to switch
between the two modalities during the CAVAA. Figure 1 shows an example of the combined condition and describes how this condition relates to the other two conditions.
Fig. 1. Screenshot of the combined CAVAA showing the first statement, three answer options, and four suggestion buttons. Below the suggestion buttons are two icons (indicated by a dotted line in the figure) that turn the sound (for output) and the microphone (for input) on and off. There is also an open text field for typing a written question (marked with a dashed line in the figure). In the voice condition, the open text field was absent and the sound and microphone icons were always on, as voice was the only modality that could be used to operate the CAVAA in this condition; in the text-based condition, the open text field was visible and the sound and microphone icons were not displayed, as this CAVAA could only be operated with text.
Content. The content and conversational flow of the CAVAAs were based on the experimental materials of [15], who developed a CAVAA for the 2021 Dutch national elections. A total of 16 policy statements from that study were also included in the current research, because the issues were still under discussion when we developed the materials for this study. We added two new statements, for a total of 18 political attitude statements, which is the minimum number of VAA statements identified in a corpus study analyzing VAAs in national elections [31]. An example (translated) of a statement is "There should be a binding referendum, allowing citizens to stop the implementation of laws". Users could indicate their opinion on each statement as "agree", "neutral", or "disagree". These response options were visualized with a green ("agree"), gray ("neutral"), and red ("disagree") button below each statement (see Fig. 1). After responding to all attitude statements, the CAVAA provided the user with personalized voting advice, in which the user's positions were compared to the positions of the eight most prominent political parties in the Netherlands (similar to [15]). Types of Information. The chatbots were developed using the Rasa conversational framework (rasa.com) and trained to recognize user intents based on an extensive list of possible questions users might ask per attitude statement, including synonyms (e.g., 'disadvantages', 'cons') and abbreviations (e.g., 'Partij van de Arbeid',
'PvdA'). This training data resulted in an NLU model; combinations of intents and entities determined the chatbot's output to the user. Based on user input, the chatbots provided users with four types of information for each attitude statement in the tool; these information types were based on the types of questions users ask when responding to political attitude statements [13]. First, the chatbots were trained to provide semantic information, meaning that they could explain the meaning of a difficult word in the statement (e.g., "What does a binding referendum mean?"). Second, the chatbots were trained to provide pragmatic information about the current state of affairs related to the political issue in the statement (e.g., "What is the current state of referendums in the Netherlands?"). In addition to semantic and pragmatic information, the chatbots were also able to provide information about the pros and cons of the political issue (e.g., "What is an advantage of holding binding referendums?") and about party positions (e.g., "What is the PvdA's position on binding referendums?"). Below the statement, the four types of information were accessible via four suggestion buttons, but users could also access the information by asking questions themselves. The information the chatbot provided in response to user questions was pre-formulated by the researchers and was always based on reliable sources such as government websites, online dictionaries, news articles, and existing voting aids (similar to [14, 15]). Conversation Flow. The conversation between the CAVAA and the user started with the chatbot greeting the user, after which the first statement was shown. The user could choose to respond directly to the statement or to request additional information first. To reinforce the dialogical character of the chatbot, we added conversational phrases in three ways.
First, the response to an information request was always introduced with a conversational phrase (e.g., "Thanks for your question", "I checked this for you"). Second, after displaying the additional information, the chatbot repeated the statement, preceded by a conversational phrase (e.g., "So, the statement was...", "Is there anything else you would like to know before responding to the statement?"). Third, the transition between statements was marked with a conversational phrase (e.g., "I've recorded your answer, let's move on to the next statement"). In all three versions, the chatbot randomly selected conversational phrases from a list, so that all three experimental conditions contained the same variation of conversational elements. The chatbots were also equipped with a set of error responses in case they did not understand the user's input or could not find a suitable response. These responses consisted of an error message (e.g., "Sorry, I don't understand your question") and a repair strategy (e.g., "Could you rephrase it?"). When misunderstandings arose, the chatbot's error response indicated the four types of information the user could ask for (e.g., "Sorry, I can't answer this question. I can provide information about..."), which has been shown to be a successful repair strategy in chatbot conversations [3, 4].
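The conversational-phrase and error-handling behavior described above can be sketched as follows. This is a simplified stand-in, not the study's implementation: the actual CAVAAs used a trained Rasa NLU model for intent recognition, whereas this hypothetical sketch matches keywords, and all phrases and answers here are illustrative.

```python
import random

CONVERSATIONAL_PHRASES = ["Thanks for your question!", "I checked this for you:"]

# Hypothetical keyword-based intent matcher standing in for the NLU model.
KEYWORDS = {
    "mean": "semantic", "definition": "semantic",
    "current": "pragmatic", "state of affairs": "pragmatic",
    "advantage": "pros_cons", "disadvantage": "pros_cons",
    "position": "party_positions",
}

# Pre-formulated answers per statement would be stored per information type.
ANSWERS = {
    "semantic": "A binding referendum means ...",
    "pragmatic": "At the moment, referendums in the Netherlands ...",
    "pros_cons": "A possible advantage is ...",
    "party_positions": "The PvdA's position is ...",
}

def respond(user_input: str) -> str:
    text = user_input.lower()
    for keyword, intent in KEYWORDS.items():
        if keyword in text:
            # A randomly chosen conversational phrase introduces the answer.
            return f"{random.choice(CONVERSATIONAL_PHRASES)} {ANSWERS[intent]}"
    # Fallback: error message plus a repair strategy listing the info types.
    return ("Sorry, I can't answer this question. I can provide information "
            "about word meanings, the current state of affairs, pros and cons, "
            "and party positions.")
```

The design choice in the fallback mirrors the repair strategy from [3, 4]: rather than only admitting failure, the error response re-advertises what the chatbot can do.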
Express your opinion!
2.4 Usage measures

For the usage measures, we analyzed a sample of 60 (participants) × 18 (statements) = 1,080 respondent–item combinations. This sample was coded for the types of information the participants requested and for whether any miscommunication occurred between the user and the chatbot. For the combined condition, we also coded which modality participants used to request information. A second coder coded a random subsample of 17 chatbot conversations (28%), spread across the three CAVAA versions. Intercoder reliability was acceptable throughout (semantic information κ = 0.97, pragmatic information κ = 0.96, party positions κ = 1.00, (dis)advantages κ = 0.95, miscommunication κ = 0.76).

2.5 Evaluation measures and political measures

In an online survey, the evaluation measures were assessed first, followed by the political measures. With the exception of the factual knowledge questions, all survey questions were answered on a seven-point scale from "strongly disagree" to "strongly agree".

Enjoyment. The questionnaire began with three statements measuring participants' enjoyment of using the CAVAA. The three items were taken from a survey in a previous study by [21] and adapted to the context of the current study (e.g., "I found using the chatbot a pleasant experience"). The three items formed a reliable scale (α = 0.88).

Ease of use. The chatbot's ease of use was measured with five items taken from [1] (e.g., "I found this chatbot easy to use"). The five items formed a reliable scale (α = 0.81).

Usefulness. Usefulness was measured with four items based on [1] and adjusted to the context of the current study (e.g., "I could respond better to the statements with this chatbot than with a regular voting advice application"). The four items provided a reliable measure (α = 0.73).

Perceived political knowledge.
Participants' perceived political knowledge after using the CAVAA was measured by adapting four statements from a study by [29] (e.g., "By using this chatbot, I gained more knowledge about the political landscape"). The four items formed a sufficiently reliable scale (α = 0.69).

Factual political knowledge. Eight true/false statements were presented on topics related to the political attitude statements in the CAVAA. Participants responded to these statements with "true", "false", or "don't know". The last answer option was included to prevent participants from guessing, which could affect reliability. For the data analysis, the answers to the eight knowledge questions were recoded: correct answers were coded as 1, and incorrect answers and "don't know" answers were coded as 0. This resulted in a factual political knowledge score between 0 and 8 for each participant.
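The recoding scheme just described is simple to express in code. The sketch below is only an illustration of the scoring rule (the answer pairs are hypothetical, not the study's actual data):

```python
def knowledge_score(answers):
    """Recode eight true/false knowledge items into a 0-8 score.

    `answers` is a list of (given_answer, correct_answer) pairs, where a
    given answer is "true", "false", or "don't know". Correct answers
    count as 1; incorrect and "don't know" answers count as 0.
    """
    return sum(1 for given, correct in answers if given == correct)

# Hypothetical participant: 5 correct, 2 incorrect, 1 "don't know"
responses = [
    ("true", "true"), ("false", "false"), ("true", "true"),
    ("false", "false"), ("true", "true"), ("false", "true"),
    ("true", "false"), ("don't know", "true"),
]
print(knowledge_score(responses))  # -> 5
```

Treating "don't know" the same as an incorrect answer, as in the comparison above, implements the guessing-prevention rationale mentioned in the text.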
C. Liebrecht et al.
2.6 Procedure

The study was carried out in December 2021, about three months before the local elections in the Netherlands. All participants were recruited through our university's human subject pool and took part in a laboratory experiment (observing the coronavirus measures in place at the time). Before the experiment started in one of the soundproof booths, participants received a short briefing on what voting advice applications are and how the CAVAA would be used in the present study. They then started the study by giving informed consent for the use of their data. It was emphasized that participation was completely voluntary and that participants could stop at any time. After giving informed consent, participants answered several questions about demographic variables. Participants could then click on a link that took them to one of the three versions of the CAVAA, which opened in a new window. After all 18 political statements in the CAVAA had been responded to, a voting recommendation was given. The participant was then directed to the online survey, which included the evaluation measures and the political measures. The questionnaire ended with a debriefing in which participants were informed about the purpose of the study. The experiment took about 20 minutes in total, and all participants received partial course credit in return.
3 Results

3.1 Usage measures

Mean values and standard deviations for the usage measures are presented in Table 1. For the four types of information a respondent could ask for, there was a difference between the chatbot conditions: in both the text condition and the combined condition, respondents requested at least one of the four types of information more often than in the voice condition (text vs. voice: χ² = 4.52, p = 0.03; combined vs. voice: χ² = 6.76, p = 0.009). There was no difference between the text and the combined condition (χ² = 0.03, p = 0.86). Splitting this analysis by type of information shows that less information about the pros and cons of a given policy was requested in the voice condition than in the combined condition (χ² = 7.28, p = 0.007), and that less information about party positions was requested in the voice condition than in the text condition (χ² = 5.76, p = 0.02). All other contrasts did not reach significance. Likewise, for requests for semantic and pragmatic information, no significant differences were found between the three CAVAA conditions (in all cases: χ² < 3.55, p > 0.06), although the trends toward differences (p values between 0.06 and 0.1) suggest that, with a larger sample, more information requests might be found for both the text condition and the combined condition than for the voice condition.
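The contrasts above were estimated within multilevel loglinear models (see [14]); as a much simpler illustration of the underlying logic of comparing two request rates, a plain Pearson chi-square on a 2×2 table of aggregated counts can be computed by hand. This sketch ignores the nesting of statements within respondents, and the counts are hypothetical:

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 contingency table
    [[a, b], [c, d]], e.g. requested / not requested x condition."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts: information requested vs. not, text vs. voice condition
chi2 = chi_square_2x2(202, 140, 147, 195)
# 3.84 is the critical chi-square value for df = 1 at alpha = .05
print(round(chi2, 2), "significant" if chi2 > 3.84 else "n.s.")
```

A multilevel model is still preferable with this kind of data, because the 1,080 observations come from only 60 participants and are therefore not independent.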
Table 1. Percentage of times a type of information was requested (logit and SE in parentheses) and associated variances (in logits) for each experimental condition.

              Semantic              Pragmatic             (Dis)advantages       Party positions        Total
Text          20.8% (-1.34; 0.33)   19.6% (-1.41; 0.25)   29.7% (-0.86; 0.24)   19.0%* (-1.45; 0.32)   59.2%* (0.38; 0.30)
Voice         15.3% (-1.71; 0.33)   12.2% (-1.97; 0.26)   21.7% (-1.29; 0.23)    7.2% (-2.55; 0.33)    43.1% (-2.76; 0.27)
Combined      21.4% (-1.30; 0.34)   18.8% (-1.46; 0.19)   40.1%* (-0.40; 0.25)  11.4% (-2.05; 0.32)    60.6%* (0.43; 0.26)
S² Text        0.12                  0.46 (0.27)           0.86 (0.34)           1.67 (0.64)            0.83 (0.35)
S² Voice       0.04                  0.50 (0.32)           0.78 (0.32)           1.45 (0.68)            0.54 (0.25)
S² Combined    0.29                  0.07 (0.13)           1.05 (0.37)           1.67 (0.66)            0.51 (0.23)
S² Items       1.51                  0.31 (0.14)           0 (0)                 0 (0)                  0.57 (0.22)

* indicates a significant difference (p < 0.05) with the voice condition.
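The percentages and logits reported in Table 1 are related through the inverse logit (logistic) function. A quick sketch verifying this for the text row (e.g., a logit of -1.34 corresponds to roughly 20.8%):

```python
import math

def inv_logit(x):
    """Inverse logit (logistic function): maps a logit to a proportion."""
    return 1.0 / (1.0 + math.exp(-x))

# Logits from the Text row of Table 1
for label, logit in [("semantic", -1.34), ("pragmatic", -1.41),
                     ("(dis)advantages", -0.86), ("party positions", -1.45)]:
    print(f"{label}: {inv_logit(logit):.1%}")
```

Running this reproduces the text-condition percentages of Table 1 up to rounding (20.8%, 19.6%, 29.7%, 19.0%).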
Table 2. Percentage of miscommunications (logit and SE in parentheses) and associated variances (in logits) for each experimental condition.

              Miscommunication
Text           1.4%* (-4.21; 0.69)
Voice         12.8% (-1.92; 0.18)
Combined       0 (0)
S² Text        5.49
S² Voice       0.20
S² Combined    —
S² Items       0

* indicates a significant difference (p < 0.05) with the voice condition.
We also ran a multilevel loglinear model, similar to [14], to compare how often a respondent experienced miscommunication. No miscommunications were observed in the combined condition (0%), and very few in the text condition (1.4%), while in the voice condition some form of miscommunication occurred in about 12.8% of the respondent–item combinations. The difference between the text and voice conditions turned out to be significant (χ² = 3.20, p < 0.001;
see Table 2). The combined condition could not be included in this analysis due to a lack of variance. Given that there were no miscommunications in the combined condition and many in the voice condition, a relevant question is how often users actually used the voice option in the combined condition. The combined condition had 21 participants who each responded to 18 political statements, yielding 378 respondent–item combinations. In only 12 of these cases (3.2%) did respondents use their voice to request information. This means that the voice feature was rarely used: respondents relied on the text option even though they could control the chatbot with their voice.

3.2 Political and evaluation measures

In the survey, participants rated the enjoyment, ease of use, and usefulness of the CAVAA. In addition, actual and perceived political knowledge were measured. Means and standard deviations of all dependent variables are shown in Table 3.

Table 3. Means (M) and standard deviations (SD, in parentheses) per dependent variable and experimental condition.

                    Enjoyment     Ease of use    Usefulness    Perceived knowledge   Factual knowledge
Text (N = 19)       5.60 (1.09)   6.13 (0.68)*   6.12 (0.80)*  4.91 (0.85)           5.11 (1.79)
Voice (N = 20)      5.40 (1.12)   5.21 (1.30)    5.34 (1.11)   5.14 (0.93)           5.25 (2.00)
Combined (N = 21)   5.63 (0.49)   6.18 (0.53)*   6.01 (0.70)   5.11 (0.86)           5.57 (1.40)

* indicates a significant difference (p < 0.05) with the voice condition.
An ANOVA was performed for each dependent variable to examine whether that variable depended on the CAVAA modality. There was a modality effect for ease of use (F(2, 57) = 7.38, p = 0.01). A post hoc test (Bonferroni) showed that both the text and the combined condition were rated as easier to use than the voice condition (p = 0.007 and p = 0.003, respectively). A modality effect on usefulness was also found (F(2, 57) = 4.55, p = 0.02). A post hoc test (Bonferroni) showed that the CAVAA was rated as more useful in the text condition than in the voice condition (p = 0.02), and that there was also a trend (p = 0.05) for the combined condition to be rated as more useful than the voice condition. No modality effect was observed for enjoyment or the two measures of political knowledge (enjoyment: F(2, 57) = 0.37, p = 0.70; perceived political knowledge: F(2, 57) = 0.39, p = 0.68; factual political knowledge: F(2, 57) = 0.38, p = 0.69).
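For readers who want to reproduce this kind of analysis, the one-way ANOVA F statistic can be computed directly from the group scores: the between-group mean square divided by the within-group mean square. The sketch below uses made-up ratings, not the study's data (the paper's F(2, 57) values come from its own sample); a Bonferroni post hoc correction would then simply multiply each pairwise p value by the number of comparisons.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical ease-of-use ratings per condition (7-point scale)
text = [6, 7, 6, 6, 5, 7, 6]
voice = [5, 4, 6, 5, 5, 4, 5]
combined = [6, 6, 7, 6, 5, 7, 6]
f = one_way_anova_f([text, voice, combined])
print(round(f, 2))  # -> 8.1, above the ~3.55 critical value for F(2, 18)
```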
4 Discussion

This study examined the impact of chatbot modality (text, voice, or combined) in the specific context of Conversational Agent Voting Advice Applications (CAVAAs). In contrast to previous chatbot studies on modality effects in customer service contexts (e.g., [20, 26, 27]), the current context of use, in which users tried to understand political attitude statements, can be considered more cognitively demanding and goal-directed. In our study, we focused on a homogeneous group of young voters (18–25 years old), who are known to show satisficing behavior when working with a CAVAA, meaning that they are willing to invest only minimal effort in obtaining information [13, 14]. The results show that users perceived the ease of use and usefulness of the tools differently: both the text and the combined condition performed better than the voice condition on these measures. Based on the results of our content analysis, two possible explanations for these findings can be formulated. First, users experienced more misunderstandings when using voice input. This miscommunication sometimes occurred when the user tried to request additional information, for example when the chatbot misheard a question such as "wait, what are the benefits" when the user probably meant "what are the benefits". However, most misunderstandings occurred when the user tried to respond "Agree", "Neutral", or "Disagree" to the political attitude statements. For example, the voice bot sometimes thought the user said "Eend" (Dutch for "duck") or "Aids" ("AIDS") when the user probably meant "Eens" (Dutch for "agree"). Similarly, the chatbot sometimes understood "Centraal" ("central") when the user probably meant "Neutraal" ("neutral"), or "Online" when the user meant "Oneens" ("disagree").
These forms of miscommunication likely made users feel that they had less control over the voice tool [30], which may have resulted in lower ease-of-use and usefulness ratings. Second, in the voice condition, users felt less invited to ask questions to the chatbot, which may also have diminished perceptions of ease of use and especially usefulness. The relevant question is why users made fewer information requests in the voice condition. One explanation could be that users feared being misunderstood. However, since misunderstandings were more common when users responded to the political statements than when they asked the system a question, an alternative explanation could be that users, who are known to invest minimal effort in requesting information [13, 14], felt that using their voice took too much effort. In our view, this explanation is quite plausible, especially considering that in the combined condition users mainly used text (clicking) rather than voice to request information. The occurrence of misunderstandings cannot serve as an alternative explanation for the reliance on text in the combined condition, as we did not identify any misunderstandings in that condition. In our view, it is likely that a text-based mode of communication simply fits better with the specific context of using a CAVAA. However, to pinpoint the cause of the lower number of information requests in the voice condition, and hence the lower scores on the usability measures, it would be worthwhile to conduct a replication study with an improved version of the voice-based chatbot. This chatbot should be better able to recognize respondents' answers to the political attitude statements.
Furthermore, it would be useful to combine an experimental study with cognitive interviews, in which users are asked to state explicitly how much effort they felt it took to ask a question. Another result of the present study is that no differences were found between the three versions of the CAVAA in perceived enjoyment. This result contrasts with previous studies showing that users of voice-based chatbots generally enjoy the interaction [25, 33]. It appears that users in the current study enjoyed working with all three versions of the CAVAA, as reflected in the relatively high average enjoyment scores (around 5) and in the open comments users made at the end of the survey, such as: "I really enjoyed using the chatbot!", "The chatbot helped me understand the issues in the voting advice application; I really enjoyed using a chatbot in a voting advice application", and "It felt natural to talk to the chatbot. It's much nicer and more personal than simply answering questions." A novelty effect may have occurred in all three versions, as experiences of enjoyment and novelty are closely related [23]. Another finding is that no differences were observed between the three modalities in perceived and actual knowledge. As we expected effects on the political measures to result from the actual use and evaluation of the tool, and because we did find modality effects for these latter measures, the lack of differences on the political measures may seem unexpected. In all three conditions, relatively high scores were achieved for both perceived knowledge (average of 5 on a 7-point scale) and actual knowledge (average of 5 on an 8-point scale). Apparently, responding to political attitude statements in a CAVAA, regardless of modality, leads to relatively high scores on these political measures. To better understand this result, it would therefore be interesting for a future study to include not only a post-CAVAA measure of perceived and actual knowledge, but also a pre-CAVAA measure. In this way, the difference between the two measurements can be calculated and used as a more fine-grained measure of actual and perceived knowledge.
5 Conclusion

The aim of this study was to investigate how people use and perceive chatbots with different modalities in the cognitively demanding context of conversational agent voting advice applications. Participants' scores on perceived and actual political knowledge, as well as their enjoyment scores, did not differ across the chatbot conditions. However, differences were found in usefulness and ease of use: the voice-based CAVAA was rated as less easy to use and less useful than the other two modalities. A content analysis of the chat logs revealed that users requested more information in both the text condition and the combined condition than in the voice condition. Furthermore, there were more misunderstandings between the tool and the user in the voice condition than in the other two conditions. Finally, the results showed that in the combined condition users hardly used the option to control the chatbot by voice and relied on tapping and clicking in most cases. Taken together, these results suggest
that the combined condition was similar in practice to the text condition, and that these two conditions outperformed the voice condition in several respects. To achieve an optimal user–task–technology fit, as articulated in TTF theory, chatbot developers in the context of political CAVAAs may be better off developing text-based chatbots, as such chatbots not only prevent misunderstandings but also make it easy for users to request more information. In this way, CAVAAs can help citizens find political information. Ultimately, this should lead to more voters voting in real elections and to a strengthening of democracy.

Acknowledgments. The authors thank the Tilburg University Fund (project number ESF2021-2) for financial support in the development of the CAVAAs. A summary of the results of this study was also published in the Dutch professional journal Tekstblad [22].
References

1. Ahn, T., Ryu, S., Han, I.: The impact of Web quality and playfulness on user acceptance of online retailing. Inf. Manag. 44(3), 263–275 (2007). https://doi.org/10.1016/j.im.2006.12.008
2. Angga, P.A., Fachri, W.E., Elevanita, A., Agushinta, R.D.: Design of chatbot with 3D avatar, voice interface, and facial expression. In: 2015 International Conference on Science in Information Technology (ICSITech), pp. 326–330. IEEE (2015)
3. Ashktorab, Z., Jain, M., Liao, Q.V., Weisz, J.D.: Resilient chatbots: repair strategy preferences for conversational breakdowns. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019)
4. Bohus, D., Rudnicky, A.: Sorry, I didn't catch that! An investigation of non-understanding errors and recovery strategies. In: Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, pp. 128–143 (2005)
5. Cho, E., Molina, M.D., Wang, J.: The effects of modality, device, and task differences on perceived human likeness of voice-activated virtual assistants. Cyberpsychol. Behav. Soc. Netw. 22(8), 515–520 (2019)
6. De Graaf, J.: The irresistible rise of Stemwijzer. In: Cedroni, L., Garzia, D. (eds.) Voting Advice Applications in Europe: The State of the Art, pp. 35–46. Scriptaweb, Napoli (2010)
7. DirectResearch: De Nationale Voice Monitor 2021 (2021). https://www.directresearch.nl/blogs/de-nationale-voice-monitor-2021/. Accessed 8 Feb 2022
8. Følstad, A., et al.: Future directions for chatbot research: an interdisciplinary research agenda. Computing 103(12), 2915–2942 (2021). https://doi.org/10.1007/s00607-021-01016-7
9. Fryer, L.K., Ainley, M., Thompson, A., Gibson, A., Sherlock, Z.: Stimulating and sustaining interest in a language course: an experimental comparison of chatbot and human task partners. Comput. Hum. Behav. 75, 461–468 (2017)
10. Garzia, D., Marschall, S.: Voting advice applications under review: the state of research. Int. J. Electron. Gov. 5(3–4), 203–222 (2012)
11. Gemenis, K., Rosema, M.: Voting advice applications and electoral turnout. Elect. Stud. 36, 281–289 (2014)
12. Goodhue, D.L., Thompson, R.L.: Task-technology fit and individual performance. MIS Q. 19(2), 213 (1995). https://doi.org/10.2307/249689
13. Kamoen, N., Holleman, B.: I don't get it. Response difficulties in answering political attitude statements in voting advice applications. Surv. Res. Methods 11(2), 125–140 (2017). https://doi.org/10.18148/srm/2017.v11i2.6728
14. Kamoen, N., Liebrecht, C.: I need a CAVAA: how conversational agent voting advice applications (CAVAAs) affect users' political knowledge and tool experience. Front. Artif. Intell. 5, 835505 (2022). https://doi.org/10.3389/frai.2022.835505
15. Kamoen, N., McCartan, T., Liebrecht, C.: Conversational agent voting advice applications: a comparison between a structured, semi-structured, and non-structured chatbot design for communicating with voters about political issues. In: Følstad, A., et al. (eds.) Chatbot Research and Design: 5th International Workshop, CONVERSATIONS 2021, Virtual Event, November 23–24, 2021, Revised Selected Papers, pp. 160–175. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-94890-0_10
16. Kanda, T., Hirano, T., Eaton, D., Ishiguro, H.: Interactive robots as social partners and peer tutors for children: a field trial. Hum.-Comput. Interact. 19(1–2), 61–84 (2004)
17. Kim, Y., Lee, H.: The rise of chatbots in political campaigns: the effects of conversational agents on voting intention. Int. J. Hum.-Comput. Interact. 1–12 (2022). https://doi.org/10.1080/10447318.2022.2108669
18. Kocielnik, R., Avrahami, D., Marlow, J., Lu, D., Hsieh, G.: Designing for workplace reflection: a chat and voice-based conversational agent. In: Proceedings of the 2018 Designing Interactive Systems Conference, pp. 881–894 (2018)
19. Krouwel, A., Vitiello, T., Wall, M.: The practicalities of issuing vote advice: a new methodology for profiling and matching. Int. J. Electron. Gov. 5(3–4), 223–243 (2012)
20. Le Bigot, L., Jamet, E., Rouet, J.-F., Amiel, V.: Mode and modal transfer effects on performance and discourse organization with an information retrieval dialogue system in natural language. Comput. Hum. Behav. 22, 467–500 (2006)
21. Lee, M.K., Cheung, C.M., Chen, Z.: Acceptance of Internet-based learning medium: the role of extrinsic and intrinsic motivation. Inf. Manag. 42(8), 1095–1104 (2005)
22. Liebrecht, C., Kamoen, N.: "Hey Siri, wat is hondenbelasting?": Voicebots en tekstbots in een politieke context. Tekstblad 27(1), 22–24 (2022)
23. McLean, G., Osei-Frimpong, K.: Hey Alexa… examine the variables influencing the use of artificial intelligent in-home voice assistants. Comput. Hum. Behav. 99, 28–37 (2019)
24. Nass, C.I., Brave, S.: Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship. MIT Press, Cambridge (2005)
25. Pal, D., Arpnikanondt, C., Funilkul, S., Chutimaskul, W.: The adoption analysis of voice-based smart IoT products. IEEE Internet Things J. 7(11), 10852–10867 (2020)
26. Riefle, L., Brand, A., Mietz, J., Rombach, L., Szekat, C., Benz, C.: What fits Tim might not fit Tom: exploring the impact of user characteristics on users' experience with conversational interaction modalities. In: Wirtschaftsinformatik 2022 Proceedings, vol. 13 (2022)
27. Rzepka, C., Berger, B., Hess, T.: Voice assistant vs. chatbot: examining the fit between conversational agents' interaction modalities and information search tasks. Inf. Syst. Front. 24(3), 839–856 (2022)
28. Schroeder, J., Schroeder, M.: Trusting in machines: how mode of interaction affects willingness to share personal information with machines. In: Proceedings of the 51st Hawaii International Conference on System Sciences, Hawaii, USA (2018)
29. Shulman, H.C., Sweitzer, M.D.: Advancing framing theory: designing an equivalency frame to improve political information processing. Hum. Commun. Res. 44(2), 155–175 (2018). https://doi.org/10.1093/hcr/hqx006
30. Sundar, S.S.: The MAIN model: a heuristic approach to understanding technology effects on credibility. In: Metzger, M.J., Flanagin, A.J. (eds.) Digital Media, Youth, and Credibility, pp. 73–100. The MIT Press, Cambridge (2008). https://doi.org/10.1162/dmal.9780262562324.073
31. Van Camp, K., Lefevere, J., Walgrave, S.: The content and formulation of statements in voting advice applications. In: Garzia, D., Marschall, S. (eds.) Matching Voters with Parties and Candidates. Voting Advice Applications in a Comparative Perspective, pp. 11–32. ECPR Press, Colchester (2014)
32. Xu, J.D., Benbasat, I., Cenfetelli, R.T.: The nature and consequences of trade-off transparency in the context of recommendation agents. MIS Q. 38, 379–406 (2014)
33. Yang, H., Lee, H.: Understanding user behavior of virtual personal assistant devices. Inf. Syst. E-Bus. Manag. 17(1), 65–87 (2018). https://doi.org/10.1007/s10257-018-0375-1
Creating Value in Gamified Chatbot Interactions and Its Impact on Brand Engagement

Susana C. Silva1, Roberta De Cicco2,3,4(B), Maria Levi1, and Maik Hammerschmidt5

1 Católica Porto Business School, Universidade Católica Portuguesa, Porto, Portugal
2 Department of Neuroscience, Imaging and Clinical Sciences, University of Chieti-Pescara,
Chieti, Italy
3 CAST, Center for Advanced Studies and Technology, Chieti, Italy
4 Department of Economics, University of Molise, Campobasso, Italy
5 Faculty of Economics and Business, Smart Retail Group, Georg-August-Universität Göttingen, Göttingen, Germany
Abstract. Gamification is a powerful tool for motivating consumers to deepen their relationship with a brand. Despite this potential, the impact of enriching chatbot interactions with game experiences on brand engagement has not been studied. To fill this gap, this study seeks to understand how gamification contributes to customer value creation in a gamified conversational context and how this value creation relates to brand engagement. In particular, we examine whether, and to what extent, the hedonic and utilitarian values provided in interactions with gamified chatbots influence cognitive, emotional, and behavioral brand engagement. Based on an empirical study with a fully functional gamified chatbot, we show that perceived hedonic value has a strong impact on all three dimensions of brand engagement, especially the cognitive dimension. Utilitarian value, while unrelated to the cognitive dimension of brand engagement, significantly reinforces the emotional and behavioral dimensions of engagement. These results point to the potential of extrinsic, utilitarian motivations to increase brand engagement, even in entertainment-oriented environments such as gamified chatbot interactions.

Keywords: Chatbots · Gamification · Utilitarian value · Hedonic value · Consumer brand engagement
1 Introduction

Chatbots are disembodied conversational agents that communicate with humans via text-based chat or voice commands to respond to a variety of customer requests and needs [2, 22]. So far, chatbots have mostly been used as non-human versions of frontline service workers, solving simple, standardized tasks in a highly efficient manner. Consequently, previous research has focused on examining user interactions with the conversational agent, such as satisfaction with the chatbot interaction [37], the intention to use the chatbot again [12], or the likelihood of recommending the chatbot to others [68]. What has been largely overlooked is whether conversational agents have the potential to engage customers with

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 50–65, 2023. https://doi.org/10.1007/978-3-031-25581-6_4
brands [38]. This expanded view of conversational agents as a means of driving customer engagement beyond solving individual service issues has frequently been called for in the current literature [35]. In the new digital society, understanding how to engage consumers with brands and create strong connections between them has never been more important, as a new "experience economy" takes hold, consumers begin to appreciate dematerialized interactions [42], and engagement strategies based purely on monetary rewards lose their appeal [15]. Since people play games for intrinsic, almost addictive reasons [23], the use of game design elements as an approach to stimulating engagement is an emerging trend in marketing (e.g., [16, 61]). This engagement-stimulating practice, known as "gamification" [36], in which game elements are applied in non-game contexts, has been extended to the sphere of marketing, where the use of game design elements in non-game contexts is becoming increasingly popular [66]. Augmenting chatbots with game elements is becoming more feasible as major technological advances allow computer agents to interact with users in a much wider variety of contexts [1, 40]. However, to make customers engage with the brand, chatbots must integrate elements that have a unique power to release consumers' motivational energy to interact with a brand [42]. We argue that bringing gamification to chatbot interactions can be one such approach to turning chatbots into "engagement machines". Integrating game elements into chatbots may have the potential to evoke experiences similar in strength to those conveyed by games, and may therefore effectively motivate engagement responses [16]. Playful interaction detaches people from their environment and immerses them in the experience, creating a sense of natural flow [9].
Therefore, harnessing the immersive and motivational power of gamification in human–chatbot interactions can be a strategy to revolutionize the way brands interact with consumers. In this study, we use a real-world chatbot interaction to empirically examine how interacting with a chatbot with a gamified design affects consumer brand engagement. We argue that injecting a gamified experience into a chatbot generates not only hedonic but also utilitarian value for users, which in turn strengthens cognitive, emotional, and behavioral attachment to the brand. The empirical results contribute to the recently emerging research at the intersection of gamification and conversational technologies. We shed light on the mechanisms that allow gamified chatbots to effectively increase brand engagement by identifying two distinct psychological processes (utilitarian versus hedonic value). We show that integrating gamification into a given system is effective when the game elements can influence target users in terms of what they personally value [19, 62].
2 Theoretical framework and hypothesis development

2.1 Chatbots and gamification

Owing to the explosive expansion and technological advancement of artificial agents, chatbots have become the dominant interface for many companies when it comes to communicating with consumers. Not surprisingly, companies also regard chatbots as a potential way to improve company and brand engagement. Enterprises are investing heavily in conversational agents to better engage their customers, and the deployment of these agents is anticipated to grow
by up to 241% in travel and hospitality and 187% in consumer products [39]. Gamification is an emerging technological process that makes it possible to imitate the entertainment experiences of games by using game design elements while establishing rules, objectives, and feedback systems [36]. The literature on gamification applications in marketing emphasizes that brand engagement arises mainly from interactive and challenging experiences and that game elements can facilitate such experiences [6, 36]. There are many types of game components, such as avatars, points, badges, levels, gifts, tiers, and leaderboards [59], that are likely to be recognized by users and can be well integrated into chatbots [44]. This makes integrating these gamification elements into chatbots an obvious option for driving engagement. Despite this, no studies have examined how gamified chatbots improve brand engagement. Our proposal to combine chatbots and gamification principles, based on Werbach and Hunter's dynamics–mechanics–components pyramid (DMC pyramid) [59], responds to studies noting that many gamification applications fail to work as intended [40], and to the resulting need for practitioners and researchers to identify suitable technological systems into which game elements can be integrated to make them (more) effective.

2.2 Consumer engagement

Consumer engagement is a positive motivational state that a consumer can experience when interacting with a specific object [33], influenced by context and expressed through cognitive, emotional, or behavioral manifestations [14]. Consumer brand engagement is a crucial construct to study, as it is a key aspect of brand equity [64]: people who are more engaged with a brand are happier with it and show greater loyalty [60]. With much of the focus in the marketing space on how to make consumers more aware of and engaged with the brand, marketers have recently started looking for new solutions to overcome obstacles to consumer engagement.
Games research results [29, 36, 61]. Since consumer brand engagement arises from interactive and co-created customer experiences with a brand, brand engagement is expected to increase when a gamified service is used repeatedly [30]. While some researchers see engagement as unidimensional, that is, purely behavioral [54], others regard it as a complex state with multiple dimensions that deserve closer study [33], arguing that a unidimensional concept does not reflect its complex scope [32]. On this basis, the present study considers the three main dimensions, i.e., the cognitive, emotional, and behavioral dimensions, to give a broader and more detailed perspective. Cognitive engagement is the psychological investment or level of interest [58] in interacting with a brand. Emotional engagement relates to the development of emotional bonds and affection for a brand [33]. Behavioral engagement represents the level of consumer involvement and positive participation in the experience [21]. Examining these three facets of brand engagement is important for several reasons. Although they often coexist in a given consumption context [14], one facet may take precedence over the others.
Creating value in gamified chatbot interactions and their impact
53
Depending on the context, these different facets can play distinct and varying roles in the formation of relevant consumer relationship outcomes, through the underlying psychological processes that each may activate.

2.3 Utilitarian and Hedonic Values

Högberg et al. [30] suggest that consumer brand engagement is positively reinforced by the values consumers perceive in a gamified experience. This is consistent with [55], who argue that knowledge of consumer values is essential for designing information systems, as values strongly influence Internet users' perceptions [45] and drive behavior [48]. According to Woodruff and Gardial's means-end model [62], one of the most prominent value models in the literature, consumers act to obtain desired benefits and to avoid negative consequences. Consumer engagement with a given object should therefore be influenced by personal values and beliefs. In other words, consumers assess the correspondence between what they value and the value the experience offers them, and if the experience supports their values, engagement behavior can be expected [13]. Personal values thus act as antecedents of consumer engagement [69]: when focusing on engagement outcomes, managers need to understand how consumers perceive value so that they can design experiences that outperform competitors' [63]. In general, different types of value are considered and, depending on what is expected from an experience, consumers are viewed either as problem solvers or as individuals seeking emotional stimulation [28]. Babin et al. [3] emphasize the relevance of both utilitarian values, covering economic and functional aspects, and hedonic values, covering emotional and social aspects, which reflect Holbrook's extrinsic and intrinsic values [31], respectively. In previous studies, Carpenter et al.
[7] demonstrated that hedonic and utilitarian elements are both crucial in the retail experience; rather than one form of motivation dominating universally, one may be more prominent than the other depending on the context. There is thus reason to believe that both types of value should ideally be considered when building customer engagement, especially in the case of new technologies. Perceived utilitarian value refers to the useful outcome of a conscious pursuit of an intended consequence [3], and has been defined as an assessment of the functional and economic benefits consumers receive when choosing a product or service [45]. Utilitarian value is generally considered to be driven by the desire to satisfy a basic need or perform a functional task [11]. When considering utility, consumers assess the perceived value of an experience from a more cognitive perspective [52], which is recognized as a determinant of consumer engagement as well as of behavioral intention [57]; such assessments reflect evaluations of functionality and time-saving convenience, related to a more rational and task-oriented form of evaluation [47]. Utilitarian value has a clear impact on Internet users' attitudes [18] and shows direct positive effects on consumer satisfaction and word of mouth [4].
Based on the above, we propose that in a gamified human-chatbot interaction:

H1: Utilitarian value is positively related to cognitive brand engagement.
H2: Utilitarian value is positively related to behavioral brand engagement.
H3: Utilitarian value is positively related to emotional brand engagement.

Overby and Lee [45] define perceived hedonic value as an assessment of the experiential benefits of choosing a particular product or service; an activity that provides such hedonic elements should motivate users to take part in the experience [55]. Since hedonic value is intrinsically motivating, it should lead to greater engagement in the interaction and likewise reflect a positive experience of the activity [30]. In general, when considering hedonic value, consumers evaluate the perceived value of an experience through emotional and affective factors [34]. From this perspective, if the experience provides a hedonic benefit relevant to consumers, as gamification should by its inherent nature, it should lead to continued engagement behaviors (e.g., [13]) and intentions [57]. Like utilitarian value, hedonic value has shown a positive effect on preference, satisfaction, and behavioral intention [56]. As in Chan et al. [8], it increases the positive emotions and level of satisfaction conveyed by the experience, which in turn encourages further interactions and behavioral intentions towards the brand. Following this reasoning, we expect hedonic value, like utilitarian value, to be positively associated with all three dimensions of consumer engagement when interacting with the gamified chatbot. According to Cronin et al. [10], value judgments have a positive impact on preference, that is, on the buyer's propensity to prefer a particular retailer. According to Batra et al. [5], a brand must be highly valued for a person to experience engagement.
Since for more hedonic users pleasure and fun are the main benefits, while for more pragmatic users utilitarian benefits, such as achieving a goal, carry more weight, brand engagement can be influenced by hedonic elements as well as by extrinsic attributes in any context where a utilitarian benefit, such as a discount, can be obtained. Therefore, consistent with the logic above, we propose that in a gamified human-chatbot interaction:

H4: Hedonic value is positively related to cognitive brand engagement.
H5: Hedonic value is positively related to behavioral brand engagement.
H6: Hedonic value is positively related to emotional brand engagement.

The proposed research model is shown in Fig. 1.
3 Research Method

3.1 Study Design

The study used real chatbot interactions to improve its external validity, which is harder to achieve when scenarios or screenshots are used, as is the case in most chatbot studies.
Fig. 1. Research model
A pilot study with 60 participants was conducted to identify which gamification elements and strategies produced the most relevant results when applied to human-chatbot interaction. The game elements were drawn from Werbach and Hunter's DMC pyramid [59] and covered three categories: (1) game dynamics, in the form of emotions, narrative, progression, and constraints; (2) game mechanics, in the form of challenge, rewards, and a feedback system; and (3) game components, in the form of points and badges. Based on the results of the GAMEX scale developed by Eppmann et al. [17] and qualitative feedback from participants, the gamified activity chosen for the main study was a challenge consisting of a four-question quiz about the brand and its products and a 'Can you spot the differences?' game to obtain a discount. Each question offered three alternative answers and was associated with a different level of difficulty to give participants a sense of progression and challenge. Advancing to the next level of the game was rewarded with a 10% discount, which could be spent on any branded product, and a symbolic badge to visually recognize the achievement. Badges were used as key game elements of the interaction, as in Hamari [25]. Participants received instant feedback on whether they had answered the questions correctly and on their progress in the game. A correct answer earned the participant a recognition badge and allowed them to advance to the next level towards the highest possible discount. An incorrect answer ended the game with the discount code associated with the level reached. The conversation was designed around a fictitious brand page created on Facebook, called "Rainbow Packing", and the chatbot was built with Chatfuel (https://chatfuel.com/), which enables the development of rule-based chatbots with tree-shaped flows and predefined blocks to assist users with their requests.
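The quiz flow described above (questions of increasing difficulty, a badge and a tiered discount per level, and the game ending on a wrong answer) can be sketched as a small rule-based state machine. This is a hypothetical reimplementation for illustration only, not the actual Chatfuel bot; the questions, answer keys, badge names, and discount tiers are invented.

```python
# Hypothetical sketch of the gamified quiz flow: each correct answer awards
# a badge and unlocks the next level; a wrong answer ends the game with the
# discount earned so far. All content below is invented for illustration.
QUIZ = [
    {"question": "In which year was Rainbow Packing founded?", "answer": "a"},
    {"question": "Which material are our bags made of?",       "answer": "b"},
    {"question": "Which city hosts our flagship store?",       "answer": "c"},
    {"question": "Spot the difference: which image differs?",  "answer": "a"},
]
DISCOUNTS = [0, 10, 20, 30, 40]  # percent discount per level reached (invented tiers)

def play(answers):
    """Run the quiz against a list of user answers ('a'/'b'/'c')."""
    badges, level = [], 0
    for step, user_answer in zip(QUIZ, answers):
        if user_answer != step["answer"]:
            break  # wrong answer: keep the discount code earned so far
        level += 1
        badges.append(f"badge-level-{level}")  # instant feedback plus a badge
    return {"level": level, "discount": DISCOUNTS[level], "badges": badges}

result = play(["a", "b", "x"])  # third answer is wrong, so the game stops at level 2
```

A tree-based builder such as Chatfuel expresses the same logic as branching blocks rather than code; the sketch only makes the reward structure explicit.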
Figure 2 shows an excerpt of the gamified chatbot design. The products selected for the study were bags, as a neutral product without strong cultural connotations was needed to avoid biased results (Fig. 2).
Fig. 2. Excerpt from the gamified chatbot design
3.2 Main Study: Measurement Development and Sampling

The questionnaire was developed using Google Forms and shared online through social networks; participants were invited to share the survey with their contacts. Data collection took place in December 2020. The sample consisted of young Portuguese consumers who had a Facebook Messenger account (necessary to interact with the chatbot). The first part of the survey included demographic questions on age, gender, and occupation, along with questions on participants' past chatbot use, online shopping habits, and product preferences and needs. Subsequently, the following constructs of interest were measured: (1) hedonic value, (2) utilitarian value, (3) cognitive engagement, (4) emotional engagement, and (5) behavioral engagement. To measure both hedonic and utilitarian value (see Table 1 below), we used adapted scales from Babin et al. [3]. Consumer brand engagement, comprising the cognitive, emotional, and behavioral dimensions, was assessed with Dessart et al.'s [14] brand engagement scale. This scale was chosen for the breadth of engagement it covers, making it a more consistent measure of consumer engagement, and for its ability to shed light on the dimensionality of engagement: it recognizes consumer engagement as a three-dimensional concept and adds detail to its conceptualization. While the behavioral dimension relates to the level of effort and time invested [33], the cognitive aspect relates to the level of absorption a user can reach, and the emotional dimension relates to the level of pleasure and excitement [58].
4 Results

4.1 Demographic Information

The data from the demographic questions were analyzed with SPSS Statistics, version 26.0. Most participants reached the final stages of the interaction. The 38 participants who did not reach the final stage were excluded from the analysis, resulting in a final sample of 165 participants (74 men, 91 women). Most participants were between 18 and 23 years old (66.7%) or between 24 and 39 years old (23%). 43.6% of the participants held an undergraduate degree, 21.25% a master's degree, and 32.1% a high-school diploma. More than half of the participants stated that they had already interacted with a chatbot (55.8%), and 57% stated that they regularly shop online. Almost all participants liked the products presented by the chatbot (92.1%), and 72.1% indicated a need for the products (bags).

4.2 PLS Analysis and Measurement Model

Partial least squares structural equation modeling (PLS-SEM) [24], as implemented in the SmartPLS 3.2.8 software [46], was used to estimate the measurement and structural models. Three items (HV1, BED5, and EED2) were removed from the analysis due to low factor loadings, and two items of the hedonic scale (HV3 and HV7) were removed due to high variance inflation factor (VIF) values, to avoid the collinearity problems that arise when the VIF exceeds 5.00 [24]. The final model consisted of 5 constructs and 21 items. All VIF values and factor loadings of the remaining constructs are presented in Table 1. All indicator loadings exceeded the established threshold of 0.6. Likewise, the minimum thresholds established by Fornell and Larcker [20] of 0.7 for composite reliability, 0.6 for Cronbach's alpha, and 0.5 for AVE were exceeded.
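The collinearity screening described above, dropping indicators whose VIF exceeds 5.00, can be reproduced with a few lines of linear algebra: each indicator is regressed on the remaining indicators, and VIF = 1/(1 − R²). A minimal sketch on synthetic data, not the study's dataset; the item scores below are simulated.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_obs x n_items):
    regress column j on the other columns and return 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
items = base + rng.normal(scale=0.8, size=(200, 4))   # moderately correlated items: fine
dup = items[:, 0] * 0.98 + rng.normal(scale=0.05, size=200)
collinear = np.column_stack([items, dup])             # near-duplicate item: VIF explodes
```

Running `vif` on `items` stays well below 5, while the near-duplicate column in `collinear` pushes the maximum VIF far above the 5.00 cutoff, which is exactly the condition used to drop HV3 and HV7.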
Reliability and convergent validity were assessed using composite reliability (CR) and average variance extracted (AVE), respectively, and the square root of the AVE for each construct was compared with the correlations between constructs [20]. In all cases, the AVE values exceeded the corresponding squared inter-construct correlations. Discriminant validity was also assessed using the heterotrait-monotrait ratio of correlations (HTMT), as it performs better than more traditional methods [27]; all values were below the 0.90 threshold [26]. Standardized loadings were used to assess indicator reliability. The thresholds for CR, AVE, and the loadings were 0.7, 0.5, and 0.6, respectively.

4.3 Structural Model

The hypotheses were tested with 5000 bootstrap resamples. As shown in Table 2, the results indicate that utilitarian value is not related to the cognitive dimension of consumer engagement, as the relationship between these variables is not significant (γ = −0.045, p = 0.563), which does not support hypothesis H1. However, utilitarian value is positively related to the behavioral and emotional engagement dimensions, in line with H2 and H3 (each γ = 0.195, p
> 40                         12 (3.8%)
Gender identity
  Male                       67 (22.0%)
  Female                     133 (43.6%)
  Non-binary                 54 (17.7%)
  Transgender                13 (4.3%)
  Intersex                   0 (0%)
  Other                      33 (10.8%)
Level of education
  No completed education     3 (1.0%)
  Primary                    31 (10.2%)
(continued)
96
T. Henkel et al.

Table 3. (continued)
Characteristics                        n (%)
  Secondary school                     106 (34.8%)
  Professional qualification           13 (4.3%)
  Degree or equivalent                 96 (31.5%)
  Master's or equivalent               36 (11.8%)
  Doctorate or equivalent              6 (2.0%)
  Other                                9 (3.0%)
Mental health issues
  Yes, seeking professional help       94 (30.8%)
  Yes, not seeking professional help   121 (39.7%)
  No                                   39 (12.8%)
  Not sure                             51 (16.7%)
Appendix 4

Table 4. Frequency distribution of past experiences with chatbots (N = 305)

Type of chatbot             Never        Almost never  Sometimes    Often      Very often
Customer service chatbots   82 (26.9%)   102 (33.4%)   97 (31.8%)   17 (5.6%)  7 (2.3%)
Health chatbots             224 (73.4%)  58 (19.0%)    15 (4.9%)    6 (2.0%)   2 (0.7%)
Social messaging chatbots   152 (49.8%)  84 (27.5%)    46 (15.1%)   12 (3.9%)  11 (3.6%)
Understanding the intent of using mental health chatbots
97
Appendix 5

Table 5. Comparison of regression models predicting behavioral intention (BI) to use mental health chatbots. Dependent variable: behavioral intention to use mental health chatbots.

                               UTAUT model   Extended model
Constant                       3.36***       3.35***
Performance expectancy         0.75***       0.67***
Effort expectancy              −0.08         −0.14*
Social influence               0.25***       0.18**
Willingness to self-disclose   –             0.19**
Perceived loss of privacy      –             0.04
Trust                          –             −0.01
R²                             0.67          0.70
F                              35.91***      20.17***

Note: * p < 0.05, ** p < 0.01, *** p < 0.001
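The comparison in Table 5, a baseline UTAUT regression versus an extended model with additional predictors, evaluated via R², can be reproduced generically: fit both models by least squares and check how much the added predictors raise the explained variance. A sketch on synthetic data; the coefficients and predictor effects below are invented, not the study's.

```python
import numpy as np

def fit_r2(X, y):
    """Ordinary least squares with intercept; return the in-sample R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 305                                          # sample size matching the study
perf, effort, social = rng.normal(size=(3, n))   # UTAUT predictors (simulated)
disclose = rng.normal(size=n)                    # extra predictor in the extended model
# Simulated behavioral intention; effect sizes are invented for illustration.
bi = 0.7 * perf + 0.2 * social + 0.2 * disclose + rng.normal(scale=0.5, size=n)

r2_utaut = fit_r2(np.column_stack([perf, effort, social]), bi)
r2_ext = fit_r2(np.column_stack([perf, effort, social, disclose]), bi)
assert r2_ext >= r2_utaut  # nested OLS: adding predictors never lowers in-sample R^2
```

Because the models are nested, in-sample R² can only increase with extra predictors; whether the increase is meaningful is what the F statistics in Table 5 address.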
References

1. Abd-Alrazaq, A.A., Alajlani, M., Alalwan, A.A., Bewick, B.M., Gardner, P., Househ, M.: An overview of the features of chatbots in mental health: a scoping review. Int. J. Med. Inform. 132, 103978 (2019)
2. Ajzen, I.: The theory of planned behavior. Organ. Behav. Hum. Decis. Process. 50(2), 179–211 (1991)
3. Almahri, F.A.J., Bell, D., Merhi, M.: Understanding student acceptance and use of chatbots in the United Kingdom universities: a structural equation modelling approach. In: 2020 6th International Conference on Information Management (ICIM) (2020)
4. Barak, A., Gluck-Ofri, O.: Degree and reciprocity of self-disclosure in online forums. Cyberpsychol. Behav. 10(3), 407–417 (2007)
5. Bellman, S., Johnson, E.J., Kobrin, S.J., Lohse, G.L.: International differences in information privacy concerns: a global survey of consumers. Inf. Soc. 20(5), 313–324 (2004)
6. Bergström, A.: Online privacy concerns: a broad approach to understanding the concerns of different groups for different uses. Comput. Hum. Behav. 53, 419–426 (2015)
7. Brandtzaeg, P.B., Følstad, A.: Why people use chatbots. In: Kompatsiaris, I., et al. (eds.) INSCI 2017. LNCS, vol. 10673, pp. 377–392. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70284-1_30
8. Carpenter, M.: The human rights of intersex people: addressing harmful practices and rhetoric of change. Reprod. Health Matters 24(47), 74–84 (2016)
9. Carpenter, C.S., Eppink, S.T., Gonzales, G.: Transgender status, gender identity, and socioeconomic outcomes in the United States. ILR Rev. 73(3), 573–599 (2020)
10. Chaix, B., et al.: When chatbots meet patients: one-year prospective study of conversations between patients with breast cancer and a chatbot. JMIR Cancer 5(1), e12856 (2019)
11. Chang, I.C., Hwang, H.G., Hung, W.F., Li, Y.C.: Physicians' acceptance of pharmacokinetics-based clinical decision support systems. Expert Syst. Appl. 33(2), 296–303 (2007)
12. Chazin, D., Klugman, S.: Clinical considerations in working with clients in the coming out process. Pragmat. Case Stud. Psychother. 10(2), 132–146 (2014)
13. Cheung, A.S., et al.: Non-binary and binary gender identity in Australian trans and gender diverse individuals. Arch. Sex. Behav. 49(7), 2673–2681 (2020)
14. Chocarro, R., Cortiñas, M., Marcos-Matás, G.: Teachers' attitudes towards chatbots in education: a technology acceptance model approach considering the effect of social language, bot proactiveness, and users' characteristics. Educ. Stud. 1–19 (2021)
15. Croes, E.A.J., Antheunis, M.L.: 36 questions to loving a chatbot: are people willing to self-disclose to a chatbot? In: Chatbot Research and Design, pp. 81–95 (2021)
16. D'Alfonso, S.: AI in mental health. Curr. Opin. Psychol. 36, 112–117 (2020)
17. Davis, F.D.: Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS Q. 13(3), 319–339 (1989)
18. Fish, J.N., et al.: "Q Chat Space": assessing the feasibility and acceptability of an internet-based support program for LGBTQ youth. Prev. Sci. 23, 130–141 (2021)
19. Fishbein, M., Ajzen, I.: Belief, Attitude, Intention, and Behavior: An Introduction to Theory and Research. Addison-Wesley, Reading (1975)
20. Fitzpatrick, K.K., Darcy, A., Vierhile, M.: Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): a randomized controlled trial. JMIR Mental Health 4(2), e19 (2017)
21. Fulmer, R., Joerin, A., Gentile, B., Lakerink, L., Rauws, M.: Using psychological artificial intelligence (Tess) to relieve symptoms of depression and anxiety: randomized controlled trial. JMIR Mental Health 5(4), e64 (2018)
22. GIF | Definition, meaning and facts. Encyclopedia Britannica (n.d.). https://www.britannica.com/technology/GIF
23. Goklani, B.: Chatbots in healthcare: key benefits, risks, and challenges you should be aware of. MindInventory, 15 September 2021. https://www.mindinventory.com/blog/chatbots-in-healthcare/
24. Guo, X., Zhang, X., Sun, Y.: The privacy-personalization paradox in mHealth services acceptance of different age groups. Electron. Commer. Res. Appl. 16, 55–65 (2016)
25. Ho, A., Hancock, J., Miner, A.S.: Psychological, relational, and emotional effects of self-disclosure after conversations with a chatbot. J. Commun. 68(4), 712–733 (2018)
26. Hoofnagle, C.J., King, J., Li, S., Turow, J.: How different are young adults from older adults when it comes to information privacy attitudes and policies? (SSRN Scholarly Paper ID 1589864). Social Science Research Network, Rochester, NY (2010)
27. Hoy, M.G., Milne, G.: Gender differences in privacy-related measures for young adult Facebook users. J. Interact. Advert. 10(2), 28–45 (2010)
28. Inkster, B., Sarda, S., Subramanian, V.: An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: real-world data evaluation mixed-methods study. JMIR mHealth uHealth 6(11), e12106 (2018)
29. Isaias, P., Reis, F., Coutinho, C., Lencastre, J.A.: Empathic technologies for distance/mobile learning. Interact. Technol. Smart Educ. 14(2), 159–180 (2017)
30. Jackson, S.D.: "Connection is the antidote": psychological distress, emotional processing, and virtual community building among LGBTQ students after the Orlando shooting. Psychol. Sex. Orientat. Gend. Divers. 4(2), 160–168 (2017)
31. Kretzschmar, K., Tyroll, H., Pavarini, G., Manzini, A., Singh, I.: Can your phone be your therapist? Young people's ethical perspectives on the use of fully automated conversational agents (chatbots) in mental health support. Biomed. Inform. Insights 11, 1–9 (2019)
32. Lee, Y.C., Yamashita, N., Huang, Y.: Designing a chatbot as a mediator for promoting deep self-disclosure to a real mental health professional. Proc. ACM Hum.-Comput. Interact. 4(CSCW1), 1–27 (2020)
33. Lee, Y.C., Yamashita, N., Huang, Y., Fu, W.: "I hear you, I feel you": encouraging deep self-disclosure through a chatbot. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (2020)
34. Lee, J.D., See, K.A.: Trust in automation: designing for appropriate reliance. Hum. Factors 46(1), 50–80 (2004)
35. Lipschitz, J., et al.: Adoption of mobile apps for depression and anxiety: cross-sectional survey study on patient interest and barriers to engagement. JMIR Mental Health 6(1), e11334 (2019)
36. Liu, K., Tao, D.: The roles of trust, personalization, loss of privacy, and anthropomorphism in public acceptance of smart healthcare services. Comput. Hum. Behav. 127, 107026 (2022)
37. Lucas, G.M., Gratch, J., King, A., Morency, L.P.: It's only a computer: virtual humans increase willingness to disclose. Comput. Hum. Behav. 37, 94–100 (2014)
38. Ly, K.H., Ly, A.M., Andersson, G.: A fully automated conversational agent for promoting mental well-being: a pilot RCT using mixed methods. Internet Interv. 10, 39–46 (2017)
39. Magsamen-Conrad, K., Upadhyaya, S., Joa, C.Y., Dowd, J.: Bridging the divide: using UTAUT to predict multigenerational tablet adoption practices. Comput. Hum. Behav. 50, 186–196 (2015)
40. Mandal, D., McQueen, R.J.: Extending UTAUT to explain social media adoption by microbusinesses. Int. J. Manag. Inf. Technol. (IJMIT) 4(4), 1–11 (2012)
41. McInroy, L.B., Craig, S.L., Leung, V.W.Y.: Platforms and patterns for practice: LGBTQ+ youths' use of information and communication technologies. Child Adolesc. Soc. Work J. 36(5), 507–520 (2018)
42. Melián-González, S., Gutiérrez-Taño, D., Bulchand-Gidumal, J.: Predicting the intentions to use chatbots for travel and tourism. Curr. Issues Tour. 24(2), 192–210 (2019)
43. Mostafa, R.B., Kasamani, T.: Antecedents and consequences of chatbot initial trust. Eur. J. Mark. 56, 1748–1771 (2021)
44. Nadarzynski, T., Miles, O., Cowie, A., Ridge, D.: Acceptability of artificial intelligence (AI)-led chatbot services in healthcare: a mixed-methods study. Digit. Health 5, 2055207619871808 (2019)
45. Neufeld, D.J., Dong, L., Higgins, C.: Charismatic leadership and user acceptance of information technology. Eur. J. Inf. Syst. 16(4), 494–510 (2007)
46. Pennebaker, J.W.: Emotion, disclosure, and health: an overview. In: Emotion, Disclosure, and Health, pp. 3–10 (1995)
47. Powell, J.: Trust me, I'm a chatbot: how artificial intelligence in health care fails the Turing test. J. Med. Internet Res. 21(10), e16222 (2019)
48. Prakash, A.V., Das, S.: Intelligent conversational agents in mental healthcare services: a thematic analysis of user perceptions. Pac. Asia J. Assoc. Inf. Syst. 12(2), 1–34 (2020)
49. Rogers, E.: Diffusion of Innovations. Free Press, New York (1995)
50. Russell, S.T., Fish, J.N.: Mental health in lesbian, gay, bisexual, and transgender (LGBT) youth. Annu. Rev. Clin. Psychol. 12(1), 465–487 (2016)
51. Schroeder, J., Schroeder, M.: Trusting in machines: how mode of interaction affects willingness to share personal information with machines. In: Proceedings of the 51st Hawaii International Conference on System Sciences, Hawaii (2018)
52. Schueller, S.M., Neary, M., O'Loughlin, K., Adkins, E.C.: Discovery of and interest in health apps among those with mental health needs: survey and focus group study. J. Med. Internet Res. 20(6), e10141 (2018)
53. Sheehan, K.B.: Toward a typology of internet users and online privacy concerns. Inf. Soc. 18(1), 21–32 (2002)
54. Steele, L.S., et al.: LGBT identity, untreated depression, and unmet need for mental health services by sexual minority women and trans-identified people. J. Womens Health 26(2), 116–127 (2017)
55. Taddicken, M.: The 'privacy paradox' in the social web: the impact of privacy concerns, individual characteristics, and the perceived social relevance on different forms of self-disclosure. J. Comput.-Mediat. Commun. 19(2), 248–273 (2013)
56. Tarhini, A., El-Masri, M., Ali, M., Serrano, A.: Extending the UTAUT model to understand the customers' acceptance and use of internet banking in Lebanon. Inf. Technol. People 29(4), 830–849 (2016)
57. Toch, E., Wang, Y., Cranor, L.F.: Personalization and privacy: a survey of privacy risks and remedies in personalization-based systems. User Model. User-Adapt. Interact. 22(1–2), 203–220 (2012)
58. van Wezel, M.M.C., Croes, E.A.J., Antheunis, M.L.: "I'm here for you": can social chatbots truly support their users? A literature review. In: Følstad, A., et al. (eds.) Chatbot Research and Design: Fourth International Workshop, CONVERSATIONS 2020, pp. 96–113 (2021)
59. Venkatesh, V., Morris, M.G., Ackerman, P.L.: A longitudinal field investigation of gender differences in individual technology adoption decision-making processes. Organ. Behav. Hum. Decis. Process. 83(1), 33–60 (2000)
60. Venkatesh, V., Morris, M.G., Davis, G.B., Davis, F.D.: User acceptance of information technology: toward a unified view. MIS Q. 27(3), 425–478 (2003)
61. Venkatesh, V., Sykes, T.A., Zhang, X.: "Just what the doctor ordered": a revised UTAUT for EMR system adoption and use by doctors. In: 2011 44th Hawaii International Conference on System Sciences (2011)
62. Venkatesh, V., Thong, J.Y.L., Xu, X.: Consumer acceptance and use of information technology: extending the unified theory of acceptance and use of technology. MIS Q. 36(1), 157–178 (2012)
63. Woebot Health: Relational agent for mental health, 12 January 2022. https://woebothealth.com/. Accessed 31 Jan 2022
64. Wysa: Mental health support for everyone (2021). https://wysa.io/. Accessed 27 Jan 2022
65. Yarns, B.C., Abrams, J.M., Meeks, T.W., Sewell, D.D.: The mental health of older LGBT adults. Curr. Psychiatry Rep. 18(6), 1–11 (2016)
66. Yi, M.Y., Jackson, J.D., Park, J.S., Probst, J.C.: Understanding information technology acceptance by individual professionals: toward an integrative view. Inf. Manag. 43(3), 350–363 (2006)
Chatbot Design and Applications
Improving Conversational Troubleshooting with Multimodality: Design and Implementation

Giulio Antonio Abbo, Pietro Crovari, and Franca Garzotto

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy

Abstract. Conversational troubleshooting is an increasingly popular technology in which dialogue agents assist the users of a system through a conversational question-and-answer process, usually via a chatbot. Despite their widespread use, current troubleshooting technologies lack full integration with the applications they support, relying on text alone to help users. In this paper, we introduce TINI, a multimodal and multilingual open-source conversational troubleshooting toolkit based on simple configuration files and ready to use in web applications. Users can ask the conversational agent questions and explain the problem they encountered; the system analyzes these together with the context of the interaction to identify the root of the problem. Finally, it proposes a solution that engages the user multimodally: with text in the chat and hints in the graphical interface. A table-based configuration improves the system's maintainability and allows dialogue designers and domain experts to work on the conversation without programming knowledge.
Keywords: Multimodality · Conversational agent · Troubleshooting

1 Introduction
Supporting the users of software systems when they encounter a problem has always been of paramount importance [30]. The diagnostic process of identifying the likely causes of a problem and finding an action that might resolve it is commonly known as troubleshooting [18]. In user support, troubleshooting is usually carried out through help desks [8]: physical or virtual answering services where the customers of a service can request information and advice. Thanks to this approach, users do not need to read manuals or lists of frequently asked questions; instead, they enlist the help of a specialist in the system [3]. In recent years, conversational agents have been deployed to automate troubleshooting services [5,22], thanks to their availability and low cost compared to their human counterparts. These properties make chatbot-based support ideal for websites, and especially for e-commerce applications [20,31], where 24/7 support is considered a key factor of a competitive offering [32]. A wide range of chatbot tools are available today to automate troubleshooting procedures [14], but they share a fundamental weakness: the definition of the conversation and the set of solvable problems are tightly coupled, and the tools only allow chat-based interaction [5]. Although intuitive in the configuration phase, this choice brings three major drawbacks. First, current technologies are not context-aware: they are not designed to complement the information extracted from the conversation with contextual data obtained from the interaction with the graphical interface [21,25]. Consequently, to receive good support, users must describe the details of the problem and the context in which it occurs, at the cost of additional interaction effort. Second, chat-based troubleshooting can help users only through textual responses, leaving them the added effort of translating the suggestions they receive into actions in the user interface [10,39]. A multimodal response would allow the troubleshooting system to provide users with actionable solutions, for example by highlighting the relevant element in the user interface, further improving the quality of the response [9,11,27,35]. Third, since the conversation data is merged with the conversational agent's configuration, it is impossible to edit them separately: a conversation designer or domain expert cannot change the user experience without at least minimal coding knowledge, and developers who want to modify the functionality of the troubleshooting application must have good knowledge of the application domain, as they will inevitably interfere with the structure of the conversation. To address these problems, we propose TINI, a conversational troubleshooting toolkit that is natively context-aware and multimodal. Its table-based configuration allows conversation designers and developers to work independently on configuring the experience.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 103–117, 2023. https://doi.org/10.1007/978-3-031-25581-6_7
TINI is an open-source plugin ready to be integrated into any new or existing web application. When TINI is integrated into a website, users can describe their issue in a dedicated chat panel within the application's interface. TINI analyzes the user's messages together with the context of the interaction to identify the root of the problem, and proposes a coherent solution multimodally, describing it in the chat and highlighting the relevant elements in the graphical interface. In this article, we improve and extend the work on MCTK published as a poster [1]. Following the potential shown by that solution and the feedback collected, we completely redesigned the system's architecture, turning it into a single plug-and-play component (lighter, faster, and easier to configure) and introducing multilingual support. Furthermore, we formalize the problem with a mathematical formulation and describe a complete case study to clarify its use.
Improved conversation troubleshooting with multimodality
2 Related Work

2.1 Dialogue Management in Conversational Agents
Conversational agents are typically structured around a dialogue manager [33] that handles the user's intents and the dialogue context to direct the agent's subsequent actions, modeling the dialogue state, goals, and policies [29]. Dialogue management can be performed with two main types of techniques [17]: handcrafted and probabilistic. In handcrafted dialogue managers, developers must define the rules to respond to each user intent. These approaches are realized mainly through slot-filling techniques [15], where the system collects information from user messages and requests the missing data, or through finite-state machines [28,41], where the status of the dialogue is represented as states in a finite-state machine [43]. Handcrafted dialogue managers give developers full control over the interaction, at the cost of manually specifying every possible interaction. Probabilistic strategies for managing dialogues consist of training an algorithm on large datasets of example conversations [42]. These dialogue managers require a large base of conversational examples to train on, but they can respond to unexpected user prompts, not considered at design time, by generalizing from the provided examples. Some solutions take a mixed approach, for example learning from a corpus combined with an external knowledge base or ontology [2,26]. In general, many of these solutions are tightly bound to the conversation that the user will have [4] and make it difficult to separate the conversation from the context of the application. This is evident when learning from a corpus, where the samples inevitably mix up the content and flow of the conversation, but it is also true for state-machine approaches, since the conversation steps are modeled in a graph and are inseparable from the process in which the conversation unfolds. Consequently, software behavior is forcibly tied to the dialogue, making it difficult to support additional features such as multiple interaction modes and multiple languages.
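To make the handcrafted approach concrete, a finite-state dialogue manager can be reduced to a transition table; the states, intents, and transitions below are invented for illustration and are not taken from any of the cited systems.

```javascript
// Minimal hand-crafted dialogue manager sketched as a finite-state machine.
// Each state maps recognized intents to the next state; anything else
// leaves the state unchanged (the agent would simply re-prompt).
const transitions = {
  start:           { greet: 'askProblem' },
  askProblem:      { describeProblem: 'proposeSolution' },
  proposeSolution: { confirm: 'done', reject: 'askProblem' },
};

function step(state, intent) {
  const next = transitions[state] && transitions[state][intent];
  return next || state;
}

// A possible run of the machine:
let state = 'start';
state = step(state, 'greet');            // -> 'askProblem'
state = step(state, 'describeProblem');  // -> 'proposeSolution'
state = step(state, 'confirm');          // -> 'done'
```

The sketch also makes the drawback visible: the conversation flow lives entirely inside the transition table, so changing the process means rewriting the graph.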
These approaches are largely unimodal: the user can interact only through the conversation, whether written or spoken, and other external components are needed to achieve multimodal interaction [24]. However, multimodality has been shown to play an important role in interactions with conversational agents [27]; as reported by Oviatt et al. [35], people tend to interact more multimodally as tasks grow more complex. Multimodal solutions have been developed for different fields, from theatre [6] to automotive systems [23,36], in most cases as custom solutions. However, multimodality in chatbots that help users solve problems remains largely unexplored.

2.2 Conversational Troubleshooting
Guided troubleshooting is the process of assisting users in a specific domain and helping them solve specific problems; typically, the user asks a question, which is analyzed to find the root cause and resolve the problem [18].
Creating programs that autonomously guide users through the problem-solving process requires extracting the intent and specific information from the question, finding an appropriate answer, and taking the necessary action. There are many approaches to this problem [38], which, as with dialogue managers, can be divided into two groups: those that require manual configuration of rules and those that learn from data. Rule-based approaches, such as taxonomic case-based reasoning [16] or sophisticated slot-filling techniques [44], allow for greater customization of behavior than machine learning, for example through the integration of external services to improve performance. However, large sets of rules are difficult to manage, and the scope of these troubleshooting frameworks is often very narrow: they tend to be custom solutions for a specific application in a single domain [37]. Subramaniam et al. [37] try to overcome this limitation with a multi-bot system covering several application domains, at the price of a more complex configuration. The automated approach avoids the manual construction of a complex knowledge base by using machine learning techniques [40]. The problem, however, is that domain-specific training datasets are difficult to create, and the quality and consistency of the agent's responses can vary significantly [13].
3 Preliminaries
TINI is a framework that provides users with multimodal conversational troubleshooting in web applications. Embedding it in a new or existing website allows users to get suggestions for solving the problems they encounter.

3.1 Design Principles
Before proceeding with the description of the system, we present the principles we applied to its design, namely multimodality, separation of concerns, and extensibility, and illustrate the rationale behind them and how they are pursued in TINI.

Multimodality. A system is multimodal if it supports more than one means of interaction: text, graphics, gestures, and more [34]. Several research studies show that the response to multimodal stimuli is better than to unimodal ones [27,35]. Furthermore, as the main locus of interaction of a web application is its graphical interface, we want to intervene and guide the user in this environment as much as possible, instead of asking them to switch to a completely separate conversational environment [7]. Tight integration across multiple modalities also supports the interpretation of user requests. Just as a human domain expert requests details about the system status to answer a question, an automated solution must use the context of the request to gain a better understanding of it.
Additional information, such as the user's past actions in the GUI, can help provide a customized solution for each specific instance. Unlike most chatbot frameworks in use, TINI natively supports multimodality. Indeed, the configuration is based on a configuration table that combines the context, here defined as what is displayed on the screen, with the problem-solution pairs, allowing the interface elements relevant to the solution to be matched with the answer. In this way, different views can generate different tips adapted to what is happening on the screen.

Separation of Concerns. In software engineering, "separation of concerns" is the practice of separating the main algorithm from the parts that deal with specific tasks [19]. We can extend this concept beyond software modules and apply it to the professionals involved in different aspects of a system. We want to allow the conversation designer and the domain expert to customize TINI's behavior independently of the developer. To achieve this, we put the domain-specific data in one place where these experts can work directly and modify the conversation as they see fit. Furthermore, the structure that contains this data must be accessible to people with little programming knowledge. By working on the configuration table, using any CSV editor, domain experts can enter new problem types, change the wording of ineffective utterances, or change the solution to an existing problem.

Extensibility. An extensible solution lets you add new behaviors and modify existing ones with minimal disruption to the internal structure. Indeed, when domain-specific data is embedded in the application, introducing changes as the knowledge base grows requires interventions in various places and a reassessment of all existing rules to understand whether each of them is affected by the change; expanding the features quickly becomes impractical.
Table-based configuration meets this requirement: all domain-specific knowledge and associated data are separated from the code that implements the logic.

3.2 Usage Scenario
To explain how the proposed system works, we illustrate its use with a concrete example. Andy is a user of a web application developed by Barb. Barb's app is a simple online image editor with many tools and controls organized in panels. In her system, Barb uses TINI to add a troubleshooting agent to help her users. If Andy needs help, he can click a button added by TINI to chat with the conversational agent. Andy visits the web app and loads the image he wants to edit, then uses the available tools to crop and rotate it. At this point Andy wants to darken the image but does not know how. He clicks
Fig. 1. A photo editing app enhanced with the TINI tool. On the right, the chat panel shows the conversation with the chatbot; on the left, the available tools are displayed in a panel, with the tools relevant to the solution highlighted in red. The developer can customize every aspect of the user interface.
the TINI button and writes in the chat: "How do I darken the image?". The system processes the message together with the available information about the active modules and responds: "You can use the highlight tool to control the brightness of the image." At the same time, in the user interface, the system highlights the corresponding tool and the panel where it is located, as shown in Fig. 1. Andy reads the answer, finds the appropriate tool, and can continue editing the image.

3.3 Problem Formalization
From a high-level point of view, the proposed system receives as input the question provided by the user and the context information, consults a knowledge base, and provides an answer indicating the relevant elements to be communicated to the user. Definition of modules and parameters. To represent the context, we consider the options available to the user and group them into modules. Modules may or may not be active depending on whether they are available to the user. For example, in a photo editing application, the modules could be different tools such as the brush tool, the crop tool, or the stamp tool. In addition, each module can have one or more parameters that represent that module's configuration. In the image editor, these can be the brush tool's size and transparency levels
and the proportions of the cropping area in the crop tool. The system then obtains the information about which modules are active, as the inactive modules are not available to the user and cannot be used to solve the problem the user is facing. Mathematical formulation. The problem can be formalized as follows. Let M be the set of all system modules and P the set of parameters; we assume that each parameter is associated with a single module. Let q be the user's question. Given a set of active modules A ⊆ M, we describe the system as the relation σ(A, q) = ⟨M′, P′, r⟩ which, given the question and the active modules as input, returns the set of highlighted modules M′ ⊆ A, the highlighted parameters P′ ⊆ P, and the response r, which explains the actions to be performed. A description of the implementation is provided in Section 4.3.
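The sets M, P, and A above can be sketched as plain data. In the sketch below the module and parameter names are invented for illustration, and the body of σ is a hard-coded stand-in for the intent extraction and table lookup that the real system performs.

```javascript
// Modules M, each with its associated parameters (illustrative names only).
const M = [
  { name: 'brushTool', parameters: ['size', 'transparency'] },
  { name: 'cropTool',  parameters: ['aspectRatio'] },
];

// σ(A, q): given active modules A ⊆ M and a question q, return the
// highlighted modules M' ⊆ A, the highlighted parameters P' ⊆ P,
// and a textual response r.
function sigma(A, q) {
  // Stand-in for intent extraction + configuration-table lookup.
  if (/darker|brightness/i.test(q) && A.includes('brushTool')) {
    return {
      modules: ['brushTool'],
      parameters: ['transparency'],
      response: 'Lower the transparency of the brush to darken your strokes.',
    };
  }
  return { modules: [], parameters: [], response: 'No matching solution found.' };
}

const out = sigma(['brushTool'], 'How do I make the image darker?');
```

Note that inactive modules never appear in the result: the function only ever returns elements of A, mirroring the constraint M′ ⊆ A.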
4 System Overview
The central component of TINI is the configuration table, a data structure that represents the connections between problems and solutions in a specific application domain; its structure is discussed in Section 4.1. TINI's operation can be described in four steps, as shown in Fig. 2. The system receives a problem with the interface, described by the user in the chat (for example: "How do I darken the image?"), together with the modules active on the screen; in the example, these are the active tools. Intent extraction is applied to the sentence to identify the problem. The identified problem is then looked up in the configuration table to obtain the list of possible solutions among the active modules. These solutions are communicated to the user through a chat utterance that guides them through the solution and through visual cues, such as the tools to use and the parameters to edit, that suggest where to act in the interface.
Fig. 2. Sequence of operations performed by the system.
4.1 Configuration

Before presenting the structure of the system in detail, we give a general description of the data models and configuration files. TINI can be configured with three files: the training file, the configuration table, and the utterances file.
training.json. This file contains sets of example user utterances for each problem type; it is used to train an NLU engine to extract the intent, which represents the problem type, from user queries. The format follows the NLP.js¹ specification, as shown below. Multiple languages can be supported by providing multiple files named after the locale, e.g., training-de, training-it.

{
  "name": "training",
  "locale": "en-US",
  "data": [
    {
      "intent": "alignment",
      "utterances": [
        "Image is distorted",
        "Image is not aligned",
        ...
      ]
    },
    ...
  ]
}
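In the real system, intent extraction is delegated to NLP.js. As a rough illustration of what the training file feeds it, the following toy bag-of-words matcher scores a query against example utterances; the intents and utterances here are invented stand-ins, and a real NLU engine would do proper tokenization and statistical classification.

```javascript
// Toy intent matcher: pick the intent whose example utterance shares the
// most words with the query. (Stand-in for the NLP.js engine, for
// illustration only.)
const training = {
  alignment: ['image is distorted', 'image is not aligned'],
  darkness:  ['the picture is too bright', 'make the image darker'],
};

function classify(query) {
  const words = new Set(query.toLowerCase().split(/\W+/));
  let best = { intent: null, score: 0 };
  for (const [intent, utterances] of Object.entries(training)) {
    for (const u of utterances) {
      const score = u.split(/\W+/).filter(w => words.has(w)).length;
      if (score > best.score) best = { intent, score };
    }
  }
  return best.intent;
}
```

For example, `classify('my image is distorted')` would resolve to the `alignment` intent, because that query shares the most words with the alignment examples.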
configuration-table.csv. This table contains the connections between problem types, modules, parameters, and responses, as shown in Table 1. This structure balances maintainability, which matters when reading and updating the table, with extensibility, which matters when adding new problem types and solutions to the table.

utterances.json. A file that maps response identifiers to their actual text. It is a simple JSON file with the following structure. As before, files named after the locale can be added to support multiple languages: for example, en-utterances, it-utterances.

{
  "align.rotate": "To correct the image alignment...",
  "focus.blur": "If you want a smoother look...",
  "focus.sharp": "To sharpen the image...",
  ...
}

The proposed system meets the required extensibility. Adding a module or parameter is as easy as adding a row to the table; adding a new
¹ https://github.com/axa-group/nlp.js
Table 1. Configuration table structure. Each row corresponds to a module parameter, and each column to a problem type that TINI can address. The presence of an utterance identifier in a cell means that the corresponding parameter is a possible solution to the column's problem. The conversation designer can edit this table directly to add problems and utterances and to create new connections between problems and parameters.

Module  Parameter   Composition  Style          Focus
Tool    Crop        comp.crop
Filter  Grayscale                style.generic
Filter  Sepia                    style.generic
Filter  Vintage                  style.generic
Filter  Blur                                    focus.blur
Filter  Sharpen                                 focus.sharp
Filter  Pixelate                 style.generic  focus.pixel
problem type can be done by adding a column. Adding or changing a connection between problem types and parameters is done by modifying an utterance identifier in the corresponding cell; using identifiers instead of entering the utterances themselves in the cells allows the same utterance to be reused multiple times without duplication and supports multiple languages. These operations require no intervention on the code, only on the data structures and the NLU configuration, thus also satisfying the separation of concerns. These three files are provided when the frontend component is initialized. In this phase, the developer can also customize the appearance of the interface and declare the available active modules, as described in Section 4.4.

4.2 Architecture
As shown in Fig. 3, the system is divided into five components. It is implemented as a standalone Vue.js² library, available as open source.

Interface component. This component is exposed to the developer and can be integrated into the web application. It displays a circular overlay button in the lower-right corner of the screen. The button toggles the visibility of a chat panel that handles the communication with the conversational agent. When the user sends a message, the interface forwards it to the core component, and when it receives a response, it displays it in the chat. In addition, it updates the information about which modules and parameters should be highlighted in the application. The component is implemented in Vue.js; when used, it accepts various configuration parameters, including the configuration file URLs and the list of active modules, and emits an event when the highlighted items change.
² https://www.vuejs.org
Fig. 3. System components and configuration files.
Core. The core implements the system logic. It is initialized with information about which modules are currently active. Upon receiving the user's question, it uses the NLU adapter and the configuration handler to compute the answer, which consists of a response and a list of items to be highlighted in the interface. The details of the algorithm are given in Section 4.3. The result is returned to the interface component. This part is implemented in plain JavaScript and is interface agnostic: more interfaces may be added in the future, extending the library to other frontend frameworks.

NLU adapter and embedded NLU engine. The NLU adapter is responsible for extracting from the user's question the intent that represents the problem the user encountered. To achieve this, it uses an embedded NLU engine, an instance of NLP.js, a JavaScript library that implements machine learning models for natural language understanding. We chose NLP.js because it runs in the browser and natively supports 41 languages, and 104 with the BERT extension [12]. When the web application is loaded in the client's browser, the engine is trained on the examples from the training file and is then ready to recognize the intent of the user's sentences. The NLU adapter also detects the language of the user's query. Alternatively, the adapter can be configured with a pre-trained external NLU service: in this second case, the training configuration file can be omitted.

Configuration handler. Finally, the configuration handler retrieves and parses the utterances and configuration table documents and provides their contents to the core component. The documents can be provided via a URL or directly as strings.

4.3 Runtime Behavior
The behavior of the core component can be represented using the mathematical notation introduced in the previous section. Given the user's query q, the NLU adapter extracts the intent representing the problem type t. For example, from the sentence "How do I make the image darker?" the NLU adapter would extract
the intent representing the image.dark problem type. The system can extract only one intent from a given question q. For this reason, if conversation designers want to support compound questions (for example, "I want to crop and enlarge the image"), they must create specific intents for them. As before, let A be the set of active modules. The algorithm searches the configuration table, whose rows contain the parameters pi (with the corresponding modules mi), whose columns contain the problem types tj, and whose cells are labeled with utterance identifiers uij, and finds all triples ⟨mi, pi, uij⟩ such that tj = t and mi ∈ A; that is, it finds all the parameters of the active modules related to the problem, together with the corresponding utterances. The language of the input (e.g., English) is detected from the user's sentence and is used to retrieve from the utterances document the actual utterances that make up the response. These are then concatenated to produce a single final answer r.
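The lookup just described can be sketched as a filter over the table rows. The rows, problem-type identifiers, and utterance texts below are illustrative, not the actual configuration of the case study.

```javascript
// Illustrative configuration table: one row per module parameter.
const table = [
  { module: 'Filter', parameter: 'Blur',    problem: 'focus.soft',  utteranceId: 'focus.blur'  },
  { module: 'Filter', parameter: 'Sharpen', problem: 'focus.crisp', utteranceId: 'focus.sharp' },
  { module: 'Tool',   parameter: 'Crop',    problem: 'comp.crop',   utteranceId: 'comp.crop'   },
];

// Illustrative utterances document (identifier -> response text).
const utterances = {
  'focus.blur':  'If you want a smoother look, use the Blur filter.',
  'focus.sharp': 'To sharpen the image, use the Sharpen filter.',
  'comp.crop':   'Use the Crop tool to change the composition.',
};

// Find all <module, parameter, utterance> triples for problem type t
// restricted to the active modules A, then concatenate the utterances
// into the final answer r.
function findSolution(t, A) {
  const hits = table.filter(row => row.problem === t && A.includes(row.module));
  return {
    modules: [...new Set(hits.map(h => h.module))],
    parameters: hits.map(h => h.parameter),
    response: hits.map(h => utterances[h.utteranceId]).join(' '),
  };
}

const result = findSolution('focus.soft', ['Filter']);
```

With the Filter module active, the `focus.soft` problem type resolves to the Blur parameter and its associated utterance; with only the Tool module active, the same problem type yields an empty result, since inactive modules cannot contribute solutions.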
Fig. 4. The behavior of the system; the configuration handler and the embedded NLU engine are not shown for simplicity.
In practice, the sequence of user interactions with the system can be modeled as in Fig. 4, which represents its behavior from the moment the application is started. In the initial phase, the embedded NLU engine, if used, is quickly trained. The user can then interact with the system and ask a question. The system interprets the query, consults the active modules,
and generates a response. When the operation completes, an event is emitted to inform the application that the highlighted items need to be updated.

4.4 Deployment
To use TINI, it suffices to import the publicly available package into a new or existing Vue application. When adding the provided component to the application, the developer can specify the URLs of the configuration files, provide a list of active modules, which is kept up to date at runtime, and use the provided hooks to customize the system's behavior. Special CSS classes can be used to refine the graphical appearance. To exploit the multimodal response, the developer listens to a specific event that provides the list of modules relevant to the response: this can be used to update CSS classes in other components to make them stand out.
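As a sketch of the integration described above, a host Vue application might use the plugin roughly as follows. The component name, prop names, and event name here are hypothetical, since the text does not spell out the actual API.

```html
<template>
  <!-- Host application markup ... -->
  <tini-assistant
    :config-urls="['/tini/training.json', '/tini/table.csv', '/tini/utterances.json']"
    :active-modules="activeModules"
    @highlight="onHighlight"
  />
</template>

<script>
export default {
  data() {
    return { activeModules: ['Filter', 'Tool'], highlighted: null };
  },
  methods: {
    // Receive the modules/parameters relevant to the answer and
    // use them, e.g., to toggle a CSS class on the matching UI elements.
    onHighlight({ modules, parameters }) {
      this.highlighted = { modules, parameters };
    },
  },
};
</script>
```

The key design point is that the plugin never manipulates the host interface directly: it only emits the highlight event, and the host decides how to render the visual cues.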
5 Conclusion
We have presented TINI, a new open-source plugin for conversational troubleshooting with chatbots. The system is multimodal: it points out the interface elements relevant to the solution of the problem, along with a textual response. In addition, it is context aware: when responding, it considers the context of the application along with the user's question. TINI builds on our previous research results and improves on their usability, as it is now a standalone interface plugin rather than requiring multiple external services, while maintaining simplicity of configuration and extensibility. The separation of concerns ensures that a person with no programming experience can change the configuration autonomously. Finally, the tool now supports multiple languages. In the future, we want to expand TINI's capabilities in two directions: including multi-turn interactions to guide the user when the problem is not immediately clear, and taking into account, when computing the solution, not only which modules are active but also the available parameter values. Additionally, the plugin could be ported to other frontend frameworks to make it more widely available. Finally, we plan to test several alternatives to the configuration table, including graph-based approaches and a custom interface for editing the configuration.
References

1. Abbo, G.A., Crovari, P., Pidò, S., Pinoli, P., Garzotto, F.: MCTK: a multimodal conversational troubleshooting kit for assisting users of web applications. In: Proceedings of the 2022 International Conference on Advanced Visual Interfaces, AVI 2022, pp. 1–3. Association for Computing Machinery, New York (2022)
2. Agarwal, S., Dušek, O., Konstas, I., Rieser, V.: A knowledge-grounded multimodal search-based conversational agent. arXiv:1810.11954 (2018)
3. Al-Hawari, F., Barham, H.: A machine learning based help desk system for IT service management. J. King Saud Univ. Comput. Inf. Sci. 33(6), 702–718 (2021)
4. Baez, M., Daniel, F., Casati, F.: Conversational web interaction: proposal of a dialog-based natural language interaction paradigm for the web. In: Følstad, A., et al. (eds.) CONVERSATIONS 2019. LNCS, vol. 11970, pp. 94–110. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39540-7_7
5. Bavaresco, R., et al.: Conversational agents in business: a systematic literature review and future research directions. Comput. Sci. Rev. 36, 100239 (2020)
6. Bhushan, R., et al.: ODO: design of multimodal chatbot for an experiential media system. Multimodal Technol. Interact. 4(4), 68 (2020)
7. Boshrabadi, A.M., Biria, R.: The efficacy of multimodal vs. print-based texts for teaching reading comprehension skills to Iranian high school third graders. Int. J. Lang. Learn. Appl. Linguist. World 5, 17 (2014)
8. Corea, C., Delfmann, P., Nagel, S.: Towards intelligent chatbots for customer care - practice-based requirements for a research agenda (2020)
9. Crovari, P., Pidò, S., Garzotto, F., Ceri, S.: Show, don't tell: design considerations for multimodal conversational interfaces. In: Følstad, A., et al. (eds.) CONVERSATIONS 2020. LNCS, vol. 12604, pp. 64–77. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68288-0_5
10. Crovari, P., et al.: GeCoAgent: a conversational agent for empowering genomic data extraction and analysis. ACM Trans. Comput. Healthc. 3(1), 1–29 (2021)
11. De Cicco, R., Iacobucci, S., Aquino, A., Romana Alparone, F., Palumbo, R.: Understanding users' acceptance of chatbots: an extended TAM approach. In: Følstad, A., et al. (eds.) Chatbot Research and Design. LNCS, pp. 3–22. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-94890-0_1
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
13. Følstad, A., Nordheim, C.B., Bjørkli, C.A.: What makes users trust a chatbot for customer service? An exploratory interview study. In: Bodrunova, S.S. (ed.) INSCI 2018. LNCS, vol. 11193, pp. 194–208. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01437-7_16
14. Følstad, A., Taylor, C.: Conversational repair in chatbots for customer service: the effect of expressing uncertainty and suggesting alternatives. In: Følstad, A., et al. (eds.) CONVERSATIONS 2019. LNCS, vol. 11970, pp. 201–214. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39540-7_14
15. Goddeau, D., Meng, H., Polifroni, J., Seneff, S., Busayapongchai, S.: A form-based dialogue manager for spoken language applications. In: Proceedings of the Fourth International Conference on Spoken Language Processing, ICSLP 1996, vol. 2, pp. 701–704 (1996)
16. Gupta, K.M.: Taxonomic conversational case-based reasoning. In: Aha, D.W., Watson, I. (eds.) ICCBR 2001. LNCS (LNAI), vol. 2080, pp. 219–233. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44593-5_16
17. Harms, J.G., Kucherbaev, P., Bozzon, A., Houben, G.J.: Approaches for dialog management in conversational agents. IEEE Internet Comput. 23(2), 13–22 (2019)
18. Heckerman, D., Breese, J.S., Rommelse, K.: Decision-theoretic troubleshooting. Commun. ACM 38(3), 49–57 (1995)
19. Hürsch, W.L., Lopes, C.V.: Separation of concerns. Technical report (1995)
20. Janssen, A., Rodríguez Cardona, D., Breitner, M.H.: More than FAQ! Chatbot taxonomy for business-to-business customer services. In: Følstad, A., et al. (eds.) CONVERSATIONS 2020. LNCS, vol. 12604, pp. 175–189. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68288-0_12
21. John, R.J.L., Potti, N., Patel, J.M.: Ava: from data to insights through conversation. In: CIDR (2017)
22. Kvale, K., Freddi, E., Hodnebrog, S., Sell, O.A., Følstad, A.: Understanding the user experience of customer service chatbots: what can we learn from customer satisfaction surveys? In: Følstad, A., et al. (eds.) CONVERSATIONS 2020. LNCS, vol. 12604, pp. 205–218. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68288-0_14
23. Lemon, O., Georgila, K., Henderson, J., Stuttle, M.: An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In: Demonstrations (2006)
24. Liao, L., Ma, Y., He, X., Hong, R., Chua, T.S.: Knowledge-aware multimodal dialogue systems. In: Proceedings of the 26th ACM International Conference on Multimedia, MM 2018, pp. 801–809. Association for Computing Machinery, New York (2018)
25. Liu, W., Li, X., Huang, D.: A survey on context awareness. In: 2011 International Conference on Computer Science and Service System (CSSS), pp. 144–147 (2011)
26. Madotto, A., Wu, C.S., Fung, P.: Mem2Seq: effectively incorporating knowledge bases into end-to-end task-oriented dialog systems. arXiv:1804.08217 (2018)
27. Massaro, D.W.: A framework for evaluating multimodal integration by humans and a role for embodied conversational agents. In: Proceedings of the 6th International Conference on Multimodal Interfaces, ICMI 2004, State College, PA, USA, p. 24. ACM Press (2004)
28. Mathur, V., Singh, A.: The rapidly changing landscape of conversational agents. arXiv:1803.08419 (2018)
29. McTear, M., Callejas, Z., Griol, D.: The Conversational Interface. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32967-3
30. Morana, S., Pfeiffer, J., Adam, M.T.P.: User assistance for intelligent systems. Bus. Inf. Syst. Eng. 62(3), 189–192 (2020)
31. Moriuchi, E., Landers, V.M., Colton, D., Hair, N.: Engagement with chatbots versus augmented reality interactive technology in e-commerce. J. Strateg. Mark. 29(5), 375–389 (2021)
32. Nursetyo, A., Setiadi, D.R.I.M., Subhiyakto, E.R.: Smart chatbot system for e-commerce assistance based on AIML. In: 2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), pp. 641–645 (2018)
33. O'Shea, J., Bandar, Z., Crockett, K.: Systems engineering and conversational agents. In: Kacprzyk, J., Tolk, A., Jain, L.C. (eds.) Intelligence-Based Systems Engineering, vol. 10, pp. 201–232. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-17931-0_8
34. Oviatt, S.: Multimodal interfaces. In: The Human-Computer Interaction Handbook, 2nd edn. CRC Press, Boca Raton (2007)
35. Oviatt, S., Coulston, R., Lunsford, R.: When do we interact multimodally? Cognitive load and multimodal communication patterns. In: Proceedings of the 6th International Conference on Multimodal Interfaces, p. 8 (2004)
36. Pieraccini, R., et al.: Multimodal conversational systems for automobiles. Commun. ACM 47(1), 47–49 (2004)
37. Subramaniam, S., Aggarwal, P., Dasgupta, G.B., Paradkar, A.: COBOTS - a cognitive multi-bot conversational framework for technical support. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Richland, SC, pp. 597–604. International Foundation for Autonomous Agents and Multiagent Systems (2018)
38. Thorne, C.: Chatbots for troubleshooting: a survey. Lang. Linguist. Compass 11(10), e12253 (2017)
39. Turk, M.: Multimodal interaction: a review. Pattern Recognit. Lett. 36, 189–195 (2014)
40. Vinyals, O., Le, Q.: A neural conversational model (2015)
41. Wilks, Y., Catizone, R., Worgan, S., Turunen, M.: Some background on dialogue management and conversational speech for dialogue systems. Comput. Speech Lang. 25(2), 128–139 (2011)
42. Williams, J.D., Asadi, K., Zweig, G.: Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning (2017)
43. Zeigler, B., Bazor, B.: Dialog design for a speech-interactive automation system. In: Proceedings of the 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, pp. 113–116 (1994)
44. Zhao, G., et al.: MOLI: smart conversation agent for mobile customer service. Information 10(2), 63 (2019)
Structure and Content Analysis of Social Cues in the Introductions of Customer Service Chatbots Charlotte van Hooijdonk1(B)
, Gabriela Martín1
e Christine Liebrecht2
1 Utrecht University, Trans 10, 3512 JK Utrecht, Holland
{C.M.J.vanHooijdonk,G.N.Martijn}@uu.nl
2 Tilburg University, PO Box 90153, 5000 LE Tilburg, Holanda
[Email protected]
Abstract. Organizations increasingly use chatbots to answer customer questions, yet customers still have unsatisfactory encounters with them. To deploy customer service chatbots successfully, it is important for organizations and designers to understand how to introduce chatbots to customers. Arguably, the way a chatbot presents itself and its services influences customers' perceptions of the chatbot. We therefore developed a framework for annotating social cues in chatbot introductions. To validate the framework, we conducted a content analysis of customer service chatbot introductions (n = 88). The results showed that the framework is a reliable identification instrument. Furthermore, the most frequent social cue in chatbot introductions was a human-like avatar, whereas communication cues indicating the chatbot's functionality were rare. The paper concludes with implications for the design of chatbot introductions and avenues for future research.

Keywords: Self-presentation · Customer service chatbots · Social cues · Anthropomorphism · Content analysis
1 Introduction

Organizations increasingly use online conversational agents, such as chatbots, for customer service purposes to increase productivity and reduce costs [16]. Chatbots are systems designed to communicate with customers in natural language, i.e., text or voice [16]. Today, chatbots handle around 80% of customer queries [20]. Chatbot market revenue currently amounts to 106.6 million and is expected to grow to 454.8 million by 2027 [39]. While the number of customer service chatbots is rising, customers have unsatisfactory encounters with them. For example, customers perceive chatbots as lacking empathy and chatbot conversations as impersonal [8, 31]. Moreover, chatbots cannot always provide adequate answers to customer queries due to poor intent recognition [12, 14, 23]. These unsatisfactory encounters point to a gap between customer expectations and chatbot performance [30], leading to resistance toward chatbots.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 118–133, 2023. https://doi.org/10.1007/978-3-031-25581-6_8
To deploy customer service chatbots successfully, it is important for organizations and designers to understand how to introduce them to customers [1]. Both the chatbot's appearance and its first messages can influence customers' perceptions of the chatbot and the organization. Kull et al. [22] show that brand engagement increases when the first messages emphasize the chatbot's helpfulness. Previous research has identified various cues in chatbot introductions and how they affect user perceptions, but it lacks a well-defined classification of the possible cues in chatbot introductions that would allow valid comparisons across studies. It is therefore valuable to obtain an overview of the cues in chatbot introductions. Such a framework is useful for designers who develop customer service chatbots, and for researchers who want to systematically investigate how different cues in chatbot introductions affect customers' perceptions of the chatbot and the organization behind it. Furthermore, little is known about how organizations tailor these introductions to selectively disclose information about the chatbot. Our study expands on the role of chatbot introductions and proposes a framework of social cues based on previous research. Using content analysis, we examine which social cues occur in the introductions of customer service chatbots. In sum, the following research questions were examined:
RQ 1: To what extent can social cues in the introductions of customer service chatbots be reliably identified?
RQ 2: To what extent do the social cues identified in our framework occur in the introductions of customer service chatbots?
2 Theoretical Framework

2.1 Managing Customer Expectations of Customer Service Chatbots Through Chatbot Introductions

Organizations increasingly use chatbots to handle customer service requests. Chatbots are deployed on websites, in applications, and on social media channels, offering a new form of human-computer interaction [11]: customers can request the desired information through a dialogue screen and receive it in natural language [41]. Customers use customer service chatbots primarily for efficiency reasons, i.e., they want to obtain the requested information quickly and conveniently [3, 13, 42]. However, chatbots often do not meet customer expectations [4, 19, 30]. For example, they sometimes understand customer requests incorrectly, only partially address customer requests in their responses, and communicate in a rather impersonal manner [8, 13, 23]. One way to manage customer expectations is at the first stage of the customer's communication journey with the chatbot, i.e., the chatbot introduction. Figure 1 shows an example of a chatbot introduction: the first screen of a food delivery chatbot. This introduction consists of three parts: 1) a header with an avatar representing a chef's hat, 2) three welcome messages in which the chatbot discloses its artificial identity, and 3) customer response options consisting of buttons and a text field.
Fig. 1. Introduction of a chatbot of an (anonymized) food package delivery service.
When customers engage in a conversation with a customer service chatbot, they have three conversational goals [30, 37]. The first goal concerns the task customers want to accomplish with the chatbot. For example, in the chatbot introduction in Fig. 1, the response buttons indicate that customers can view upcoming recipes. The second goal concerns how customers can have a smooth conversation with the chatbot. For example, the introduction shows that customers can communicate with the chatbot by clicking the response buttons or typing their question in the text field (see Fig. 1). The third goal relates to the relationship between the customer and the chatbot. Customers prefer personal interaction and a "human touch" in service encounters, even when talking to a chatbot [24, 36]. This can be achieved through anthropomorphic cues, such as an avatar and a conversational communication style [1, 17, 26]. The chatbot in Fig. 1 introduces itself as "Support-Sous-Chef" and uses personal pronouns (e.g., "I", "my", "me"). Customers' expectations about achieving their goals with the chatbot can thus be managed through the cues contained in the introduction.

2.2 Social Cues in Chatbot Introductions

The Computers Are Social Actors (CASA) paradigm shows that people mindlessly apply the social scripts of human-human interaction when using computers [33, 34]. Recently, extensions of the paradigm have been proposed to account for technological advances and changes in how people interact with technology [15, 28]. Lombard and Xu [28] propose the Media Are Social Actors (MASA) paradigm, which considers the effects of social cues and the psychological mechanisms (i.e., mindlessness and anthropomorphism) underlying social responses. Social cues can be defined as "biologically and physically determined features salient to observers because of their potential as channels of useful information" [10:2]. Examples of social cues include a human-like avatar or
informal language use. If a medium presents social cues, users will likely perceive it as a social entity rather than as a medium [27]. Thus, if a chatbot includes social cues, such as a human-like avatar, users will perceive it as a social entity and react to it as in human-human interaction [27, 28]. The effects of social cues in chatbots have been investigated in several experimental studies [1, 6, 17]. In these experiments, the manipulated social cues occur at various stages of the users' communication journey with the chatbot. For example, Go and Sundar [17:308] created two versions of a chatbot that differed, among other things, in the disclosure in the introduction (human-like: "Hello! I'm Alex, a sales representative" vs. machine-like: "Hello! I'm Alex, an automated chatbot"). They also manipulated message interactivity: the chatbot was less responsive (i.e., simple back-and-forth exchanges) or more responsive (i.e., contingent message exchanges) to participants' messages. An interaction effect between the identity disclosure and message interactivity was observed: participants rated the chatbot more positively when it presented itself as a human and engaged in a highly interactive conversation than when it engaged in a less interactive conversation. While experimental studies provide insight into the impact of a specific social cue in chatbot introductions, several social cues are present simultaneously in existing customer service chatbot introductions, as shown in Fig. 1. Social cues do not occur in isolation and should be considered jointly, as combinations of cues can increase or decrease their impact [9]. For example, the chatbot introduction in Fig. 1 includes visual and verbal identity cues (i.e., the avatar is a chef's hat, and the words "automated sous-chef" and "I am not a human" are used) that mark the chatbot's artificial identity.
It is therefore important to identify the social cues in chatbot introductions from the existing literature. Based on a previous classification of social cues for chatbots [9] and research on social cues in chatbot introductions, we developed a framework (see Table 1) in which the identified cues are grouped into four main categories: identity cues, competence cues, conversational cues, and communication cues.

Identity Cues. Is the interlocutor appearing on the customer's screen a service employee or a chatbot? Several cues can mark or mask the chatbot's artificial identity. The first is a disclosure (e.g., "I am a chatbot", "I am a virtual assistant"). De Cicco et al. [7] examined the impact of a disclosure in chatbot introductions. They found that the presence of a disclosure in a chatbot introduction resulted in lower social presence, trust, and brand attitude than no disclosure. Similar results were reported by Luo et al. [31], who conducted a field study in which the chatbot either revealed its artificial identity or not. The results showed that disclosure early in the conversation reduced purchase rates, and customers perceived the chatbot as less knowledgeable and less empathetic. Other cues can also mask or mark the chatbot's artificial identity. One aspect of customer service chatbot design that has received much attention in research and practice is anthropomorphism [2, 5, 13, 16]. This refers to humanizing chatbots by adding social cues such as a name and an avatar. Several scholars have conducted experimental studies, often comparing so-called human-like chatbots with machine-like chatbots. The operationalization of human-like chatbots differed across studies. For example, Araujo [1] used communication style and name to distinguish between a human-like and a machine-like chatbot. Participants interacted with a human-like chatbot named
Emma, which used informal language, or with a machine-like chatbot named ChatBotX, which used formal language. In contrast, Go and Sundar [17:308] manipulated the disclosure and the avatar in the chatbot introduction. They created four versions of a chatbot introduction ranging from machine-like (disclosure: "Hello! I'm Alex, an automated chatbot"; avatar: a speech bubble character) to human-like (disclosure: "Hello! I'm Alex, a sales representative"; avatar: a profile picture of a real person). An expectancy violation effect was found: attitudinal and behavioral scores were lower when the chatbot was introduced with a human-like avatar and a machine-like disclosure compared to the introduction with a machine-like avatar and a machine-like disclosure [17].

Competence Cues. One way to increase customer trust in customer service chatbots is to highlight the chatbot's competence [35]. For example, Mozafari et al. [32] studied the effects of conveying expertise, in combination with a disclosure, in the chatbot introduction on participants' trust. The results showed that a disclosure alone negatively affected trust, whereas a disclosure combined with a statement about the chatbot's expertise (e.g., "Due to my high efficiency, I can find the best offer for you") or its weaknesses (e.g., "Note that I have only been working in this field for one year and I am still learning") had a positive impact on trust [32:2919]. Another study examined the effects of conveying the chatbot's competence through metaphors [21]. A Wizard of Oz study was conducted in which participants received a prompt explaining the competence of the chatbot agent using a metaphor (e.g., "The bot you are going to interact with is inspired by a tour guide") [21:9]. The results showed that metaphors communicating low chatbot competence (e.g., "young student") were evaluated and ranked higher than metaphors communicating high chatbot competence (e.g., "trained professional"). Kull et al. [22:844] compared chatbot welcome messages in which either competence (e.g., "Years of experience in the travel industry allow me to answer any travel question") or warmth (e.g., "I'll take care of you and will try to answer any travel questions you may have") was emphasized. Participants' brand engagement increased when the chatbot started the conversation with the warm welcome message.

Conversational Cues. How can the chatbot display conversational etiquette? Three verbal cues affecting the chatbot's conversational etiquette have been identified. Because chatbots mimic human-to-human communication [29], customers expect chatbots to utter a greeting such as "hello" [5, 16, 19]. Another cue is a self-introduction (e.g., "My name is..."). In addition, chatbots should adhere to turn-taking protocols [5]. After the chatbot introduces itself, it should hand the turn to the customer (e.g., "How can I help you?").

Communication Cues. How can the chatbot communicate what its functionalities are and what types of messages it can process? Several cues have been identified that can clarify the chatbot's communication capabilities [5, 40]. Several scholars argue that the chatbot's purpose should be clarified in the introductory messages (i.e., what is the chatbot able to do?). It should also be explained how to communicate effectively with the chatbot, for example by typing keywords or clicking the response buttons [19].
Table 1. Framework of social cues in chatbot introductions

Identity cues | Competence cues | Conversational cues | Communication cues
Disclosure | Statement emphasizing expertise | Greeting | Explanation of purpose
Avatar type | Statement emphasizing weakness | Self-introduction | Explanation of interaction
Name type | | Turn-taking |
To validate our framework, we conducted a content analysis of customer service chatbot introductions. This analysis allowed us to investigate whether our framework is a reliable instrument for analyzing social cues in chatbot introductions. In addition, we gained insight into which social cues organizations use in chatbot introductions and whether they correspond to the cues in our framework.
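As an illustration, the framework in Table 1 can be written down as a machine-readable annotation schema. The following Python sketch is ours and not part of the original study; the category and cue identifiers paraphrase Table 1:

```python
# Hypothetical machine-readable version of the social-cue framework (Table 1).
# Category and cue names paraphrase the framework; this is an illustrative
# sketch, not the instrument used in the study.
FRAMEWORK = {
    "identity_cues": ["disclosure", "avatar_type", "name_type"],
    "competence_cues": ["statement_expertise", "statement_weakness"],
    "conversational_cues": ["greeting", "self_introduction", "turn_taking"],
    "communication_cues": ["explanation_purpose", "explanation_interaction"],
}

def validate_annotation(annotation):
    """Check that an annotation record only uses cues defined in the framework."""
    known_cues = {cue for cues in FRAMEWORK.values() for cue in cues}
    return set(annotation) <= known_cues
```

Such a schema would let coders record, per chatbot introduction, which cues are present, while rejecting codes that fall outside the framework.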
3 Method

3.1 Sample of Chatbot Introductions

Customer service chatbot introductions of different organizations in the Netherlands were manually searched in 2021. The search was limited to Dutch organizations to ensure a valid cross-industry comparison without confounding factors such as differences in language and culture. The sampling strategy consisted of three stages. First, various industries and organizations belonging to those industries were identified. Second, for each industry, we looked for an organization and its competitors. For example, in the financial sector, we listed several banks. Third, the organizations' websites were visited and searched for the presence of a customer service chatbot, usually located on the homepage, the contact page, or domain-specific pages. Once a customer service chatbot was found, a screenshot of its first screen was taken. In ten cases, the chatbot introduction started with a call to action for the user, i.e., the chatbot had to be activated by clicking or typing. In all other cases, the screenshot of the chatbot's first screen was taken without any interaction with the bot. In seven cases, the introduction did not contain any cue (i.e., a disclosure and/or avatar) of the agent's artificial identity. We interacted with the agent to determine whether it was a chatbot. Data saturation was reached at 88 chatbot introductions belonging to 78 Dutch organizations: government or non-profit organizations (n = 8), employment agencies (n = 1), education (n = 1), electronics (n = 5), finance (n = 6), furniture (n = 4), insurance (n = 14), logistics and mail (n = 3), retail (n = 15), telecommunications (n = 6), travel (n = 2), utilities (n = 7), other (n = 6). The sample can be found on OSF (https://osf.io/8wut9/).

3.2 Codebook

All screenshots were analyzed with a codebook to identify the social cues in the chatbot introductions. The codebook was structured along the three parts of a chatbot
introduction: 1) the header, 2) the welcome messages, and 3) the customer response options. Social cues were assigned to (each of) these three parts. For the header, five subcategories were coded. The presence of an avatar was coded (yes/no) and, if present, its appearance (i.e., brand logo, robot-like, human-like, object, or other). The presence of a chatbot name (yes/no) and, if present, the type of name (i.e., brand name, bot-like name, human-like name, other) were also coded. It was also coded whether the chatbot was identified with the label "chatbot" (yes/no). Finally, the codebook contained an open category for other elements in the header (e.g., an indication of expertise). In sum, the header subcategories comprised identity cues. For the chatbot welcome messages, nine subcategories were distinguished. First, the number of chatbot messages was coded. Next, the presence of an avatar (yes/no) was coded and, if present, its appearance (i.e., brand logo, robot-like, human-like, object, or other). The presence of a greeting (e.g., "Hello"; yes/no) and a self-introduction (e.g., "My name is..."; yes/no) was also coded. In addition, it was coded whether the chatbot's competence and/or purpose was communicated and whether instructions were given to ensure a smooth interaction (all subcategories: yes/no). Finally, it was coded whether the chatbot handed the turn to the customer at the end of the introduction (e.g., "How can I help you?"). Thus, in the welcome messages two identity cues, three conversational cues, one competence cue, and two communication cues were identified. For the response options, coders coded the type of response options (buttons, open text field, or both). If the chatbot allowed customers to communicate through buttons, the number of buttons was coded. If an open text field was present, it was coded whether the chatbot indicated the number of characters customers could use to compose a typed message (yes/no).
Finally, the codebook included an open field in which the coder could describe other observations regarding the chatbot introduction, for example whether the avatar next to the first chatbot messages differed from the avatar in the header. The codebook can be found on OSF (https://osf.io/8wut9/).

3.3 Coding Procedure

Before coding the sample, a training session was held in which chatbot introductions were shown, the codebook was discussed, and examples were given. During the discussion, two new cues were identified and added to the codebook, and one cue was slightly adjusted:
• Proactive cues. The discussed chatbot introductions often included information that the chatbot provided proactively. Two proactive cues were identified: 1) the presence of a privacy statement (yes/no) and 2) the presence of information about promotions, offers, corona measures, or other topics (yes/no).
• Communication cues. The discussed chatbot introductions often contained messages explaining the possibility of being transferred to a human agent (i.e., a handover [23]). This cue was added as a subcategory of the communication cues because it explains what happens when the chatbot is unable to process a message.
• Competence cues. The discussed chatbot introductions did not contain statements about the chatbot's expertise; rather, statements about the chatbot's weaknesses appeared. We therefore decided to code whether the chatbot's incompetence was communicated.
Next, a training set (n = 10) of chatbot introductions was composed. The training set was coded individually by six coders. The codings were then compared and discussed, resulting in final agreement and minor revisions of the codebook. The sample was then divided among the six trained coders. To calculate intercoder reliability, a seventh, independent coder double-coded a subset of 25 chatbot introductions.

3.4 Data Analysis

To investigate the reliability of our framework, intercoder reliability scores (Krippendorff's α) were calculated in SPSS v. 27 using the KALPHA macro [18]. Descriptive statistics (frequencies and percentages) were then calculated to determine the extent to which the cues identified in our framework occur in chatbot introductions. Finally, the sample was analyzed qualitatively to describe how organizations tailor chatbot introductions by means of various social cues.
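Krippendorff's α for the nominal (yes/no) cues can also be computed outside SPSS. The following is a minimal illustrative sketch for nominal data and any number of coders; it is our own implementation of the standard coincidence-matrix formulation, not the KALPHA macro used in the study:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal codes.

    units: list of lists; each inner list holds the codes the available
    coders assigned to one unit (e.g., one chatbot introduction).
    """
    coincidence = Counter()            # coincidence matrix o[(c, k)]
    for codes in units:
        m = len(codes)
        if m < 2:                      # units coded by fewer than two coders are unpairable
            continue
        for c, k in permutations(codes, 2):
            coincidence[(c, k)] += 1 / (m - 1)
    marginals = Counter()
    for (c, _k), o in coincidence.items():
        marginals[c] += o
    n = sum(marginals.values())        # total number of pairable values
    observed = sum(o for (c, k), o in coincidence.items() if c != k) / n
    expected = sum(marginals[c] * marginals[k]
                   for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1.0 - observed / expected
```

For example, perfect agreement across two binary units yields α = 1.0, while disagreement on one of several units lowers α accordingly.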
4 Results

4.1 Reliability of Social Cues in Chatbot Introductions

Table 2 shows the reliability scores of the double-coded social cues in the chatbot introductions. The scores for the cues in the header were fair to satisfactory. In two cases, the coders disagreed on the type of chatbot name. Name types could relate to the brand (e.g., "Tracy" for a postal service) but could also belong to other subcategories (e.g., human-like names). The scores for the cues in the welcome messages were also satisfactory. However, the greeting subcategory showed insufficient reliability, although the percentage agreement was high, which indicates the prevalence of the category [38]. Finally, the reliability of the response options was acceptable (Krippendorff's α: 0.65; percentage agreement: 80.0). An explanation for this relatively low reliability score is that in some cases the buttons were less noticeable because they were embedded in the chatbot message, whereas in other cases the text field was less noticeable compared to the buttons. The reliability scores of the other two response option cues were satisfactory to perfect (number of buttons: Krippendorff's α: 0.92, percentage agreement: 90.91; number of characters in the text field: Krippendorff's α: 1.00, percentage agreement: 100).

Table 2. Intercoder reliability scores (Krippendorff's α and percentage agreement) and use of social cues (frequencies and percentages, n = 88)

Cue | Krippendorff's α | % agreement | Frequency | %
Header
Avatar | 1.00 | 100.00 | 50 | 56.8
Avatar type | 0.91 | 93.33 | |
- Human-like | | | 16 | 32.0
- Robot-like | | | 14 | 28.0
- Brand | | | 14 | 28.0
- Object | | | 3 | 6.0
- Other | | | 3 | 6.0
Name | 0.92 | 96.00 | 39 | 44.3
Name type | 0.65 | 80.00 | |
- Human-like | | | 26 | 66.7
- Robot-like | | | 8 | 20.5
- Other | | | 5 | 12.8
Disclosure "chatbot" | 0.87 | 96.00 | 26 | 29.5
Welcome messages
Number of messages | 1.00 | 100.00 | |
Avatar | 0.92 | 96.00 | 57 | 64.8
Avatar type | 1.00 | 100.00 | |
Greeting | 0.47 | 92.00 | 73 | 83.0
Self-introduction | 0.89 | 96.00 | 56 | 63.6
Turn-taking | 0.92 | 96.00 | 61 | 69.3
Statement of incompetence | 1.00 | 100.00 | 6 | 6.8
Explanation of purpose | 0.87 | 96.00 | 11 | 12.5
Explanation of interaction | 0.81 | 92.00 | 29 | 33.0
Handover statement | 1.00 | 100.00 | 16 | 18.2
Proactive information | 0.71 | 92.00 | 8 | 9.1
Privacy statement | 1.00 | 100.00 | 5 | 5.7

4.2 Use of Social Cues in Chatbot Introductions

Table 2 also shows the social cues identified in the introductions of customer service chatbots in the Netherlands. Our content analysis revealed that half of the introduction headers contained an avatar, most often a human-like one (16: 32.0%). Robot-like avatars and brand-related avatars were equally present (14: 28.0% each), as were avatars representing an object or something else (3: 6.0% each). Names occurred in 44.3% of the introduction headers and were mostly human-like (26: 66.7%). Robot-like names (8: 20.5%) and other names (5: 12.8%) occurred less often. Nearly three out of ten headers contained the label "chatbot". In addition, other disclosure labels were found, such as "virtual assistant", "digital assistant", and "assistance bot". The mean number of welcome messages was 1.77 (SD = 0.85). In almost two-thirds of the cases, an avatar accompanied these messages. Human-like avatars were most common (21: 36.8%), followed by a brand logo (18: 31.6%) and robot-like avatars (14: 24.6%). Avatars representing an object (1: 1.8%) or something else (3: 5.3%) were least common. The welcome messages often contained conversational cues. Greetings were most frequent (73: 83.0%), followed by turn-taking (61: 69.3%) and self-introductions (56: 63.6%). Statements of incompetence, such as "I am new, but I will do my best to help you" or "Chatbot in training", occurred rarely (6: 6.8%). Also, the welcome messages often did not include communication cues. Explanations of the interaction (e.g., "Keep your question short and simple") were most common (29: 33.0%), followed by the possibility of being transferred to a human agent (16: 18.2%) and explanations of the chatbot's purpose (11: 12.5%; e.g., "Can I help you with a business question?"). Finally, the welcome messages hardly contained proactive information (8: 9.1%) or privacy statements (5: 5.7%). Most response options consisted of a text field only (38: 43.2%). Buttons only and the combination of a text field and buttons were equally present (25: 28.4% each). The number of characters customers could use to compose a message was indicated in nearly two out of ten text fields (12: 19.0%). The mean number of buttons in the chatbot introductions was 3.88 (SD = 2.50), ranging from 1 to 12 buttons.

4.3 Qualitative Analysis of Social Cues in Chatbot Introductions

This section discusses five examples from our sample. These examples were chosen because they represent atypical and prototypical chatbot introductions.
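The frequencies and percentages reported above boil down to simple tallies over the coded introductions. A hypothetical sketch (the annotation records are invented for illustration, not the study's data):

```python
from collections import Counter

# Invented annotation records: one dict per chatbot introduction,
# marking which cues were coded as present.
annotations = [
    {"avatar": True, "greeting": True, "self_introduction": False},
    {"avatar": True, "greeting": True, "self_introduction": True},
    {"avatar": False, "greeting": True, "self_introduction": True},
]

def cue_frequencies(annotations):
    """Frequency and percentage of each cue across all coded introductions."""
    counts = Counter()
    for record in annotations:
        counts.update(cue for cue, present in record.items() if present)
    n = len(annotations)
    return {cue: (freq, round(100 * freq / n, 1)) for cue, freq in counts.items()}
```

With the three invented records above, a greeting occurs in all introductions (100.0%) and an avatar in two of three (66.7%).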
Fig. 2. Anonymized atypical chatbot introductions of two telecom providers (left and center) and an insurance company (right).
The chatbot introductions of the two telecom providers are shown in Fig. 2. The screenshot on the left shows an atypical chatbot introduction with some peculiarities. The header contains a human-like avatar and a human-like name ("Sam"). There is also a green dot, which mimics a convention used in computer-mediated communication indicating that a person is online and available. The chatbot's first message contains a greeting ("Hello") and a self-introduction ("I'm Sam, e-customer specialist for [brand]"). Interestingly, the chatbot's second message contains mixed cues: on the one hand it describes the chatbot's capabilities, on the other hand it gives instructions on how to ask questions: "I already know quite a lot. Please describe your question or problem as briefly as possible and everything will be fine :-)". The latter is also emphasized by the character limit of the text field (i.e., 255 characters). Notably, the words "Talk to Sam" are displayed in the text field to invite the customer into the conversation. The screenshot in the middle also shows an atypical introduction, as it contains mixed cues about the chatbot's identity. The header contains only the brand logo. Furthermore, the customer is greeted informally in the first message ("Hello!" and an emoji), presented in an irregular font and font size. The chatbot's identity is only disclosed in its second message (i.e., "[brand]'s chatbot"). Finally, in the third message the chatbot communicates the possibility of a handover, but with the information that human employees will only be available from 8 a.m. the next day. The words "Virtual Assistant" appear below the messages. The final example of an atypical chatbot introduction comes from an insurance company (Fig. 2, right). The header contains an avatar (a smiley in a square) resembling the brand logo. The words "Chat with [brand]" do not make clear that the customer will be talking to a chatbot, and neither do the welcome message or the response buttons. It is also interesting that the chatbot asks the user an open question ("Which product would you like to talk about?"), while the response buttons show the topics the chatbot is trained on.
Fig. 3. Anonymized prototypical chatbot introductions of a municipality (left) and a theme park (right).
An example of a prototypical chatbot introduction comes from a Dutch municipality, shown on the left in Fig. 3. The header contains a robot-like avatar, a human-like name ("Guus"), a chatbot disclosure ("Chatbot"), and an indication of the chatbot's incompetence ("In training"). Notably, the avatar next to the chatbot messages differs from the avatar in the header. Also, after the greeting ("Hello") and the self-introduction ("I'm Guus"), the chatbot refers to itself with the unusual words "Digital Q&A". Although the chatbot's incompetence is communicated in the header, the presence of eleven response buttons and an open text field suggests that the chatbot can answer questions on a wide variety of topics. Finally, the screenshot on the right shows the introduction of a theme park chatbot. The header contains the avatar and name of an object ("pity feather"). The first welcome message contains a "digital assistant" disclosure. Interestingly, the second message contains proactive information about the extended opening hours until September 6, with a call to book by clicking the hyperlink. Unlike the other introductions, the chatbot's name and avatar appear above the messages rather than next to them. Response buttons, a text field, and FAQs give users many options to respond.
5 Conclusion and Discussion

The aim of this study was to determine (1) the extent to which social cues in the introductions of customer service chatbots can be reliably analyzed and (2) the extent to which the social cues identified in our framework occur in the introductions of customer service chatbots. We identified five main categories of social cues that can occur in chatbot introductions (see Table 3). The first main category concerns identity cues that mask or mark the chatbot's artificial identity. Three subcategories were identified: a chatbot disclosure, the avatar type, and the name type. The second main category consists of competence cues that communicate the chatbot's weaknesses. Furthermore, conversational cues belonging to the chatbot's conversational etiquette were identified, i.e., the presence of a greeting, a self-introduction, and explicitly handing the turn to the customer after the chatbot's welcome messages. The fourth main category comprises communication cues with which the chatbot conveys its functionality. Three subcategories were identified: an explanation of the chatbot's purpose, an explanation of how to interact with it, and the possibility of being transferred to a human agent. The last main category comprises proactive cues, such as the presence of a privacy statement and information about offers, corona measures, or other topics.

Table 3. Updated framework of social cues in chatbot introductions

Identity cues | Competence cues | Conversational cues | Communication cues | Proactive cues
Disclosure | Statement emphasizing weakness | Greeting | Explanation of purpose | Privacy statement
Avatar type | | Self-introduction | Explanation of interaction | Proactive information
Name type | | Turn-taking | Handover |
To validate our framework, we performed a content analysis of the introductions of 88 Dutch customer service chatbots. For each part of the chatbot introduction (i.e., the header, the welcome messages, and the response options), the presence and/or type of social cues was coded. A subset of the sample was coded twice. The results of the intercoder reliability analysis showed that agreement on the identity cues in the chatbot header (i.e., disclosure, avatar, and name) was sufficient.

130

C. van Hooijdonk et al.

Agreement on most social cues in the welcome messages was also satisfactory. Only the reliability of the 'greeting' subcategory was low, owing to the category's prevalence [38]: it was difficult to obtain a sufficiently high reliability score because a greeting was present in nearly all chatbot introductions. Finally, the agreement between coders on the response options was acceptable to satisfactory. One explanation for the relatively low agreement on the response options is that the buttons were sometimes less visible because they were embedded in the chatbot message, while in other cases the text field was less visible due to the large number of buttons. Our framework has thus proven to be a reliable instrument for identifying social cues in chatbot introductions. Nevertheless, we believe improvements can be made. For example, we noticed that a self-introduction often included a name as well as a disclosure. Regarding the chatbot disclosure, only the 'chatbot' label was coded, but other labels such as 'virtual assistant', 'intelligent assistant', and 'digital agent' were also used. These labels should also be included in the framework, as the clarity of the disclosure affects customer expectations [7]. Furthermore, the framework could be extended by coding whether a competence cue is present and, if so, whether it emphasizes the chatbot's competence or incompetence, as competence cues can increase users' perceptions of trust [32]. Finally, the use of informal language could be added to the main category of identity cues, as elements such as emoji and interjections mimic human-human interaction [25] and can thereby obscure the chatbot's identity. Regarding the use of social cues in chatbot introductions, we found that the most common social cue in the header of chatbot introductions is a human-like avatar. A human-like name was also often present in the header. Compared to these anthropomorphic identity cues, a chatbot disclosure in the header was less common.
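As an aside, the prevalence problem behind the low reliability of the 'greeting' subcategory [38] can be made concrete with a small sketch of Krippendorff's alpha for two coders on nominal data; the codings below are invented for illustration and are not our actual data.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for two coders, nominal data, no missing values.
    `pairs` is a list of (coder1_label, coder2_label), one tuple per coded unit."""
    coincidences = Counter()
    for a, b in pairs:
        for x, y in permutations((a, b)):   # each unit contributes both orderings
            coincidences[(x, y)] += 1
    n = 2 * len(pairs)
    totals = Counter()
    for (x, _), count in coincidences.items():
        totals[x] += count
    observed = sum(c for (x, y), c in coincidences.items() if x != y) / n
    expected = sum(totals[x] * totals[y]
                   for x, y in permutations(totals, 2)) / (n * (n - 1))
    return 1 - observed / expected

# A highly prevalent category: 90% raw agreement, yet alpha is near (even below) zero.
codings = [("yes", "yes")] * 18 + [("yes", "no"), ("no", "yes")]
```

This illustrates why simple percentage agreement overstates reliability for a near-universal cue such as a greeting: chance agreement is already very high.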
Organizations thus seem to combine social cues that mask and social cues that mark the chatbot's artificial identity. Regarding the social cues in the welcome messages, our results show that a human-like avatar and the three conversational cues are commonly used. This may indicate that organizations try to give customers the impression that they are communicating with a human-like interlocutor rather than an artificial entity, and that they make an effort to give customers a warm welcome. In contrast, while competence cues have received considerable attention in previous research [21, 22, 32], we found that mainly cues emphasizing incompetence occurred in the chatbot introductions. Taking the results of Khadpe et al. [21] and Kull et al. [22] into account, it may be beneficial for organizations to focus on cues that suggest warmth rather than competence. Considering that chatbots do not always provide adequate responses to customer queries due to poor intent recognition [13, 23], it is noteworthy that organizations often offer customers the possibility to ask their questions in a text field without a character limit. This response option invites customers to formulate their messages in their own words, which increases the risk of misrecognized intents and subsequent miscommunication. In only a few cases did the open text field have a character limit to nudge customers into sending short messages containing only keywords. Likewise, it is noteworthy that communication cues rarely appeared in the chatbot introductions, although they can steer customer expectations and behavior towards the chatbot [5, 16, 19]. The framework in the current study was tested on a relatively small sample of chatbots of Dutch organizations. To validate its reliability and
A Framework and Content Analysis of Social Cues
131
to improve the generalizability of our results, a content analysis should be performed on a larger sample that also includes customer service chatbots from other countries and languages. Additionally, a follow-up study would allow us to consider factors that could affect the presence of social cues in chatbot introductions, such as the type of organization (for-profit vs. non-profit). Furthermore, it would be valuable to examine how and why conversational designers intentionally incorporate social cues into their chatbot designs. Their considerations could be compared to customers' perceptions and expectations of social cues in chatbot introductions. Finally, the framework enables experimental research in which different chatbot introductions are manipulated and compared more systematically. As several social cues are present in chatbot introductions, future research should examine whether certain combinations of social cues have a stronger impact on customer expectations than other combinations. For example, the chatbot's perceived competence may be shaped not only by a statement about its competence, but also by an explanation of how customers should interact with it and which response options they can use. In this way, chatbot introductions that manage user expectations upfront can bridge the gap between companies looking to successfully use chatbots in their customer service and customers who are hesitant to use them. Acknowledgements. The authors would like to thank Louise Braat, Boet Bruijniks, Myrthe Jagers, Marco Krijthe and Sammie Smaak for the sample collection and coding. This research is part of the NWO-funded project "Smooth Operators: Development and Effects of Personalized Conversational AI", grant number: KIVI.2019.009.
References 1. Araujo, T.: Living up to the chatbot hype: the influence of anthropomorphic design cues and communicative agency framing on conversational agent and company perceptions. Comput. Hum. Behav. 85, 183-189 (2018) 2. Brackeen, B.: How to humanize artificial intelligence with emotion (2017). https://medium.com/@BrianBrackeen/how-to-humanize-artificial-intelligence-with-emotion-19f981b1314a. Accessed 21 Sept 2022 3. Brandtzaeg, P.B., Følstad, A.: Why people use chatbots. In: Kompatsiaris, I., et al. (eds.) INSCI 2017. LNCS, vol. 10673, pp. 377-392. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-70284-1_30 4. Brandtzaeg, P.B., Følstad, A.: Chatbots: changing user needs and motivations. Interactions 25(5), 38-43 (2018). https://doi.org/10.1145/3236669 5. Chaves, A.P., Gerosa, M.A.: How should my chatbot interact? A survey on social characteristics in human-chatbot interaction design. Int. J. Hum.-Comput. Interact. 37(8), 729-758 (2021) 6. Crolic, C., Thomaz, F., Hadi, R., Stephen, A.T.: Blame the bot: anthropomorphism and anger in customer-chatbot interactions. J. Mark. 86(1), 132-148 (2022) 7. De Cicco, R., da Costa e Silva, S.C.L., Palumbo, R.: Should a chatbot disclose itself? Implications for an online conversational retailer. In: Følstad, A., et al. (eds.) CONVERSATIONS 2020. LNCS, vol. 12604, pp. 3-15. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68288-0_1
8. Drift: The 2018 State of Chatbots Report (2018). https://www.drift.com/wp-content/uploads/2018/01/2018-state-of-chatbots-report.pdf. Accessed 20 Sept 2022 9. Feine, J., Gnewuch, U., Morana, S., Maedche, A.: A taxonomy of social cues for conversational agents. Int. J. Hum.-Comput. Stud. 132, 138-161 (2019) 10. Fiore, S.M., Wiltshire, T.J., Lobato, E.J.C., Jentsch, F.G., Huang, W.H., Axelrod, B.: Toward understanding social cues and signals in human-robot interaction: effects of robot gaze and proxemic behavior. Front. Psychol. 4, 859 (2013) 11. Følstad, A., Brandtzæg, P.B.: Chatbots and the new world of HCI. Interactions 24(4), 38-42 (2017) 12. Følstad, A., Nordheim, C.B., Bjørkli, C.A.: What makes users trust a chatbot for customer service? An exploratory interview study. In: Bodrunova, S.S. (ed.) INSCI 2018. LNCS, vol. 11193, pp. 194-208. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01437-7_16 13. Følstad, A., Skjuve, M.: Chatbots for customer service: user experience and motivation. In: Proceedings of the 1st International Conference on Conversational User Interfaces, pp. 1-9 (2019) 14. Følstad, A., Taylor, C.: Investigating the user experience of customer service chatbot interaction: a framework for qualitative analysis of chatbot dialogues. Qual. User Exp. 6(1), 1-17 (2021) 15. Gambino, A., Fox, J., Ratan, R.A.: Building a stronger CASA: extending the computers are social actors paradigm. Hum.-Mach. Commun. 1, 71-85 (2020) 16. Gnewuch, U., Morana, S., Maedche, A.: Towards designing cooperative and social conversational agents for customer service. In: Proceedings of the International Conference on Information Systems (ICIS) (2017) 17. Go, E., Sundar, S.S.: Humanizing chatbots: the effects of visual, identity and conversational cues on humanness perceptions. Comput. Hum. Behav. 97, 304-316 (2019) 18. Hayes, A.F., Krippendorff, K.: Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 1(1), 77-89 (2007) 19. Jain, M., Kumar, P., Kota, R., Patel, S.N.: Evaluating and informing the design of chatbots. In: Proceedings of the 2018 Designing Interactive Systems Conference, pp. 895-906. ACM (2018) 20. Jovic, D.: The future is now: 37 fascinating chatbot statistics (2022). https://www.smallbizgenius.net/by-the-numbers/chatbot-statistics. Accessed 20 Sept 2022 21. Khadpe, P., Krishna, R., Fei-Fei, L., Hancock, J.T., Bernstein, M.S.: Conceptual metaphors impact perceptions of human-AI collaboration. Proc. ACM Hum.-Comput. Interact. 4(CSCW2), 1-26 (2020) 22. Kull, A.J., Romero, M., Monahan, L.: How may I help you? Driving brand engagement through the warmth of an initial chatbot message. J. Bus. Res. 135, 840-850 (2021) 23. Kvale, K., Sell, O.A., Hodnebrog, S., Følstad, A.: Improving conversations: lessons learnt from manual analysis of chatbot dialogues. In: Følstad, A., et al. (eds.) CONVERSATIONS 2019. LNCS, vol. 11970, pp. 187-200. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39540-7_13 24. Laban, G., Araujo, T.: Working together with conversational agents: the relationship of perceived cooperation with service performance evaluations. In: Følstad, A., et al. (eds.) CONVERSATIONS 2019. LNCS, vol. 11970, pp. 215-228. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39540-7_15 25. Liebrecht, C., Tsaousi, C., van Hooijdonk, C.: Linguistic elements of conversational human voice in online brand communication: manipulations and perceptions. J. Bus. Res. 132, 124-135 (2021)
26. Liebrecht, C., van der Weegen, E.: Menselijke chatbots: een zegen voor online klantcontact? Het effect van conversational human voice door chatbots op social presence en merkattitude. Tijdschrift voor Communicatiewetenschap 47(3) (2019) 27. Lombard, M., Ditton, T.: At the heart of it all: the concept of presence. J. Comput.-Mediat. Commun. 3(2) (1997) 28. Lombard, M., Xu, K.: Social responses to media technologies in the 21st century: the media are social actors paradigm. Hum.-Mach. Commun. 2, 29-55 (2021) 29. Luff, P., Gilbert, N.G., Frohlich, D. (eds.): Computers and Conversation. Academic Press, Cambridge (1990) 30. Luger, E., Sellen, A.: "Like having a really bad PA": the gulf between user expectation and experience of conversational agents. In: Proceedings of CHI 2016, pp. 5286-5297. ACM, New York (2016) 31. Luo, X., Tong, S., Fang, Z., Qu, Z.: Frontiers: machines vs. humans: the impact of artificial intelligence chatbot disclosure on customer purchases. Mark. Sci. 38(6), 937-947 (2019) 32. Mozafari, N., Weiger, W.H., Hammerschmidt, M.: Resolving the chatbot disclosure dilemma: leveraging selective self-presentation to mitigate the negative effect of chatbot disclosure. In: Proceedings of the 54th Hawaii International Conference on System Sciences, p. 2916 (2021) 33. Nass, C., Moon, Y.: Machines and mindlessness: social responses to computers. J. Soc. Issues 56(1), 81-103 (2000) 34. Nass, C., Steuer, J., Tauber, E.R.: Computers are social actors. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 72-78 (1994) 35. Nordheim, C.B., Følstad, A., Bjørkli, C.A.: An initial model of trust in chatbots for customer service: findings from a questionnaire study. Interact. Comput. 31(3), 317-335 (2019) 36. Paluch, S.: Remote Service Technology Perception and its Impact on Customer-Provider Relationships: An Empirical Exploratory Study in a B-to-B Setting. Gabler, Wiesbaden (2011). https://doi.org/10.1007/978-3-8349-6936-1 37. Shechtman, N., Horowitz, L.M.: Media inequality in conversation: how people behave differently when interacting with computers and people. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 281-288 (2003) 38. Spooren, W., Degand, L.: Coding coherence relations: reliability and validity. Corpus Linguist. Linguist. Theory 6(2), 241-266 (2010) 39. Thormundsson, B.: Chatbot market revenue worldwide from 2018 to 2027 (2022). https://www.statista.com/statistics/1007392/worldwide-chatbot-market-size/. Accessed 20 Sept 2022 40. Valério, F.A., Guimarães, T.G., Prates, R.O., Candello, H.: Here's what I can do: chatbots' strategies to convey their features to users. In: Proceedings of the XVI Brazilian Symposium on Human Factors in Computing Systems, pp. 1-10 (2017) 41. van der Goot, M.J., Hafkamp, L., Dankfort, Z.: Customer service chatbots: a qualitative interview study into the communication journey of customers. In: Følstad, A., et al. (eds.) CONVERSATIONS 2020. LNCS, vol. 12604, pp. 190-204. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-68288-0_13 42. van der Goot, M.J., Pilgrim, T.: Exploring age differences in motivations for and acceptance of chatbot communication in a customer service context. In: Følstad, A., et al. (eds.) CONVERSATIONS 2019. LNCS, vol. 11970, pp. 173-186. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-39540-7_12
A Multimodal Conversational Affective Agent for Non-intrusive Collection of Brain Disease Data

Chloe Chira1, Evangelos Mathioudis1(B), Christina Michailidou2, Pantelis Agathangelou1, Georgia Christodoulou2, Ioannis Katakis1, Efstratios Kontopoulos2, and Konstantinos Avgerinakis2

1 Department of Computer Science, School of Science and Engineering, University of Nicosia, 2417 Nicosia, Cyprus
{chira.c,katakis.i}@unic.ac.cy, {mathioudis.e,agathangelou.p}@live.unic.ac.cy
2 CataLink Limited, Nicosia, Cyprus
{cmichailidou,georgian,e.kontopoulos,coafgeri}@catalink.eu
Abstract. This paper introduces Zenon, a multimodal affective chatbot designed for patients with brain diseases such as multiple sclerosis and stroke. Zenon discreetly collects patient information and records user feedback using two different modalities: text and video. A user-friendly interface was designed to meet the users' needs and achieve an efficient conversation flow. What makes Zenon unique is its support for multiple languages, the combination of two data sources for tracking sentiment, and the use of a semantic knowledge graph that ensures machine-interpretable data exchange.

Keywords: Conversational agents · Chatbot · Brain disease · e-Health · Sentiment analysis · Knowledge graph
1 Introduction
Parkinson's, multiple sclerosis, and stroke are three of the most common chronic neurological diseases, leading to significant cognitive and motor impairments. Neurodegenerative diseases such as Parkinson's show a growing trend as the world's population ages. Furthermore, the number of stroke patients in the European Union is expected to increase by 27% between 2017 and 2047 [29], mainly due to an aging population and improved survival rates. A conversational agent (chatbot) is designed to simulate verbal conversations. Conversational agents have become popular in healthcare because of their potential to improve patient care by interacting with patients through instant-messaging applications.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 134-149, 2023. https://doi.org/10.1007/978-3-031-25581-6_9

Despite their popularity, many conversational agents
A multimodal affective conversational agent
135
fail to meet user expectations, and promoting a positive user experience remains a difficult task [13,30]. Zenon is part of the EU-funded project ALAMEDA1 and was first introduced in [25]. ALAMEDA's goal is to provide patients with Parkinson's disease, multiple sclerosis, and stroke with personalized rehabilitation treatment assessments through the use of artificial intelligence and health support systems. In addition to the conversational agent, ALAMEDA uses several technologies, such as smart mattresses, smart bands, smart insoles, and smart belts. In this context, Zenon is responsible for collecting information that complements the other sensors. It performs various tasks, such as sentiment analysis on open "chit-chat" user input, completing medical questionnaires on a schedule, and collecting data on certain lifestyle attributes suggested by the medical partners. The main motivations for implementing a conversational agent in a project like ALAMEDA are: a) the use of a familiar dialogue-like interface to collect information that would normally need to be collected at the hospital or doctor's office, or to monitor the patient's mood through free-text conversations, b) the ability to use multiple channels of information (text messages and facial expressions) to extract knowledge about the patient's mood, c) the fact that patients can quickly ask questions about their health status and last recorded measurements. The contributions of this paper are as follows: - To our knowledge, Zenon is the first conversational agent specifically designed to help patients suffering from Parkinson's, MS, and stroke. Zenon's conversation manager models complex scheduled health questionnaires as conversations, analyzes the patient's mood through free-text conversations about their psychological or social state, and provides quick-reference options for the patient's health status (see Sections 3.2, 3.5).
– A language model developed for text-based sentiment analysis that supports four languages (English, Greek, Italian, Romanian) (see Section 3.2) and combines local feature extraction and global feature usage in a single entity. – A component for analyzing sentiment from facial expressions (see Section 3.3). The service monitors the patient's facial expressions via the phone's camera and provides estimates of their mood. We extend the capabilities of state-of-the-art CNN facial emotion recognition models designed to provide real-time inference on edge devices. After making the appropriate architectural design changes and training the model on our dataset, the model achieved acceptable accuracy (75%) with real-time performance. – A sentiment aggregation component that collects information from two modalities (text, video) and points out where they agree or disagree. It is beneficial for the system to merge this information, taking into account the labels of both components and the confidence of both. In addition, future research will benefit from the generated multimodal sentiment analysis dataset, as such datasets are currently not available (see Section 3.4).
1 https://alamedaproject.eu/.
136
C. Chira et al.
– A data integration infrastructure based on W3C-compliant semantic technologies that allows the agent to retrieve information from and send information to the ALAMEDA platform (e.g., questionnaire responses, sentiment analysis results) in a unified, efficient, and scalable way (see Section 3.5).
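The sentiment aggregation contribution listed above can be sketched as follows; the confidence-based rule is our illustrative reading of the description, with hypothetical function names, not the paper's exact method.

```python
# Illustrative sketch (not the paper's exact rule): fuse text- and video-based
# sentiment labels, preferring the more confident modality when they disagree.
def fuse_sentiment(text_label, text_conf, video_label, video_conf):
    """Each label is 'negative' | 'neutral' | 'positive'; conf is in [0, 1].
    Returns (fused_label, confidence, modalities_agree)."""
    if text_label == video_label:
        # Modalities agree: keep the label and report the stronger confidence.
        return text_label, max(text_conf, video_conf), True
    # Modalities disagree: fall back to the more confident modality.
    if text_conf >= video_conf:
        return text_label, text_conf, False
    return video_label, video_conf, False
```

The third return value records whether the two modalities coincided, which is the kind of agreement information the aggregation component is said to expose.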
2 Related Work
We divide the related work into two parts: a) conversational agents used in healthcare applications and b) sentiment analysis in systems based on machine learning, assuming the availability of relevant training data.

2.1 Conversational Agents in Healthcare
Several directions were identified regarding the role of conversational agents in healthcare: treatment and monitoring, support of health services, health education, addressing the various health risk factors that can be modified through lifestyle, and diagnosis and data collection of many kinds. Asthma patients can use the Puffbot chatbot [28] to support their treatment. Puffbot integrates a conversational ontology and natural language processing (NLP) to engage in conversations with patients. The chatbot EVA [4] also relies on NLP methods to extract entities from user input and classify user intent. EVA supports people in managing their diabetes by providing education and advice about their health issues. HOLMeS [3], complemented by NLP and text-mining techniques, acts like a human doctor and helps patients choose their path to disease prevention by holding autonomous conversations with them. In [26], the authors developed a counseling chatbot enhanced with multimodal emotion analysis to provide conversational services for mental health. MoSCHA [18], a mobile health assistant, also improves doctor-patient communication and helps patients manage their chronic illnesses. The system integrates multiple wearables and other sensors to collect data from each patient. [24] reports that the use of agent-based dialog management in healthcare is rare. In general, there are significant efforts to use chatbots in healthcare. However, work on brain diseases that focuses on gathering information and on using sentiment analysis to influence the flow of the conversation is very limited. With this work we aim to fill that gap.

2.2 Sentiment Analysis on Text and Video
Analysis of user-generated text in medical documents and online social networks can provide valuable insights into public opinion trends and into specific aspects of health information spaces. In [23], a framework is described for automatically extracting COVID-19-related topics from social media and applying long short-term memory networks for the sentiment classification of COVID-19-related comments.
The team applies NLP methods to detect COVID-19-related topics in public opinion. Furthermore, they apply LDA-augmented recurrent LSTM neural networks to analyze COVID-19 comments based on their sentiment. The presented approach performed better than other machine learning algorithms for sentiment classification. Another study [2], designed to help patients research and search for specific information, extracted anonymous posts related to Lyme disease from online medical forums to derive non-content, content-specific, and meta-level features. These features are fed as input to a multiclass neural network. The approach outperformed other classification algorithms in ranking the sentiment of posts about Lyme disease in several relevant forums. In [5], the authors introduce multi-input RIMs, a novel extension of recurrent independent mechanisms that uses a modular system to access various readily available knowledge sources, such as part-of-speech information and gazetteer lists. Experiments on three sentiment analysis tasks and two health-related tweet classification tasks showed improvements without increasing the number of parameters. Facial expressions and certain muscle movements can be strong indicators of how a person is feeling, especially when combined with other techniques. The most recent and most widely used technique for facial expression recognition is the training of convolutional neural networks (CNNs). For example, [15] presents a real-time computer vision system for predicting emotions based on a pre-trained CNN, while [7] trains a CNN (ExpNet) to estimate 3D facial expression coefficients, which are then used for facial emotion recognition. In [6,14], a combination of a convolutional neural network and a recurrent neural network (RNN) is used for video-based emotion recognition, while in [20] a 2D-CNN and a 3D-CNN are used for audio and visual emotion recognition, and the features of the two models are merged to arrive at a final prediction. In healthcare, [19] developed a 3D-CNN model to capture and analyze home video footage of patients and inform clinicians about the detected emotional state of the patient. In our work, we not only redesign, integrate, train, and fine-tune state-of-the-art models for the specific purposes of the chatbot, but also provide the ability to aggregate sentiment information from both modalities (text and video).
3 The Zenon Conversational Agent
Requirements Collection. Zenon was developed after careful consideration of patients' needs and extensive discussions with medical experts. To this end, the research team set up local community groups to elicit and prioritize requirements. The local community groups were formed by representatives of the categories most affected by each use case, as well as those involved in their care, such as health professionals other than their medical specialists (e.g., physiotherapists, psychologists), relatives, and informal caregivers of patients. For each brain disease studied in the project, different questionnaires were developed to collect input from the aforementioned groups. The challenge in ALAMEDA is the need to adapt to three different brain diseases, in three use cases in three different countries, with patients speaking three different languages: a) stroke in Romania, b) multiple sclerosis in Italy, and c) Parkinson's in Greece. One of Zenon's main functions is to help the user submit information for the medical questionnaires that are usually filled in at the doctor's office or hospital. The questionnaires for the stroke case should be modeled as a conversation to capture the complexity of their question-and-answer follow-up structure. Additionally, patients with chronic illnesses such as stroke or Parkinson's considered a chatbot interface helpful for inquiring about their health status. For the multiple sclerosis case, the local community group identified a chatbot interface as suitable for allowing multiple sclerosis patients to report symptoms that indicate an increased likelihood of disease relapse. Finally, the medical partners suggested the possibility of free-text conversations about the patients' psychological or social situation, with such conversations serving as an additional input modality for assessing weekly emotional status. The functional requirements can be summarized in the following points:
1. Collect patient-reported outcomes on health status, lifestyle and well-being, and psychological factors.
2. Enable multimodal sentiment estimation by analyzing free-text conversations and facial expressions.
3. Provide access to patient health information through simple user queries.
General Description. Following these requirements, the system was designed according to the architecture shown in Fig. 1.
Fig. 1. Zenon Overview
The user starts the chatbot and enters a message. Sentiment analysis on facial expressions runs in the background while the user interacts with the app
and makes predictions in real time. The results of this analysis and the original text are sent to the ChatAgent hosted on a dedicated server for analysis. The conversation manager sends each user input directly to the sentiment analysis model to extract the sentiment, which influences the flow of the conversation. After the agent has processed the input, the dialogue and the results are stored in the semantic knowledge graph (SemKG).
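As an illustration of what storing a result in SemKG might look like at the data level, the sketch below serializes one sentiment analysis result as RDF triples in N-Triples syntax; the namespace and property names are hypothetical and invented for this sketch, not ALAMEDA's actual ontology.

```python
# Hypothetical example: serializing one analysis result as RDF triples
# (N-Triples syntax). Namespace and property names are invented for
# illustration; the actual ALAMEDA ontology is not shown in this excerpt.
EX = "https://example.org/alameda#"

def result_to_triples(patient_id, utterance_id, sentiment, confidence):
    subject = f"<{EX}result/{patient_id}/{utterance_id}>"
    return [
        f'{subject} <{EX}patient> <{EX}patient/{patient_id}> .',
        f'{subject} <{EX}sentiment> "{sentiment}" .',
        f'{subject} <{EX}confidence> "{confidence}" .',
    ]

triples = result_to_triples("p01", "u42", "positive", 0.83)
```

Representing results as triples is what makes the exchanged data machine-interpretable and queryable (e.g., via SPARQL) in a W3C-compliant store.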
3.1 Conversation Manager
Fig. 2. Main conversation flow
Carefully crafted conversation flows are the foundation of any efficient conversational agent. Zenon's flow was built following design guidelines for task-oriented chatbots in healthcare and other domains [9,17,22]. There are two ways to start the conversation. In the first case, Zenon initiates the discussion when the user taps a notification; Zenon's initial message then varies
depending on the context of the notification. The second way is when the user initiates the discussion. In both cases, the chatbot eventually asks the user how they feel and whether they want to report something. Based on the user requirements, the agent should exhibit empathetic qualities in such discussions. This creates a more natural relationship between the user and the chatbot, increasing engagement and ensuring better-quality responses. To achieve an empathetic personality, we integrated a sentiment analysis model that infers the sentiment of the user's text input in real time (see Section 3.2). Paraphrases were incorporated for the agent's questions and answers. The benefits of such variability are twofold. Making conversations less repetitive and more natural increases user engagement. In addition, the lexical variations in the questions allow us to obtain different information from the user, even though the questions semantically ask the same thing. The central conversation flow is shown in Fig. 2. Given Zenon's medical purpose, filling out questionnaires is the most used function. Respondent fatigue occurs when respondents become tired during the task, resulting in a decrease in the quality of their responses. To avoid overburdening users, we offer the possibility to pause a questionnaire and continue at a more convenient time without losing progress. Due to the medical nature of the questions and their frequency within the study, each questionnaire must be completed within the time frame specified by the medical partners.
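The pause-and-resume behaviour just described can be sketched as follows; the class, field names, and JSON persistence format are illustrative assumptions, not Zenon's actual implementation.

```python
import json

# Sketch: a questionnaire session tracks which questions are already answered,
# so a paused survey can be persisted and continued later without losing
# progress. Names are illustrative, not Zenon's real code.
class QuestionnaireSession:
    def __init__(self, questions, answers=None):
        self.questions = questions
        self.answers = answers or {}           # question id -> answer

    def next_question(self):
        for q in self.questions:
            if q not in self.answers:
                return q
        return None                            # questionnaire completed

    def record(self, question, answer):
        self.answers[question] = answer

    def save(self):
        # Serialized state, persisted (e.g., on the device) when the user pauses.
        return json.dumps({"questions": self.questions, "answers": self.answers})

    @classmethod
    def resume(cls, state):
        data = json.loads(state)
        return cls(data["questions"], data["answers"])
```

A deadline check against the time frame set by the medical partners could then simply refuse to resume sessions whose completion window has expired.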
3.2 Sentiment Analysis on User-Generated Text
Sentiment analysis is the process of using natural language processing to infer the polarity of the emotions expressed in a text. The polarity can be positive, negative, or neutral. Zenon is developed using the Rasa2 framework. To endow Zenon with empathetic features, we added a custom component to the Rasa NLU pipeline that performs real-time sentiment analysis of the user input. The analysis results are used to determine the flow of the conversation (see Fig. 2). We used the Holistic Cumulative Sentiment Classification (HolC) model proposed in [1]. HolC introduces a tunable hyperparameter called the "balance factor" that strikes a balance between holistic and cumulative sentiment classification. That is, the model takes into account both the overall sentiment of the entire passage and the average sentiment of each sentence in the text to make its final prediction. The general architecture of HolC is shown in Fig. 3. The model was trained on reviews from the Amazon3 repository from different domains, such as health and personal care, books, etc. As accessibility was one of the main requirements of this project, we extended the original HolC model to be language-independent and support multiple languages.
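The balance-factor idea can be illustrated with a small sketch; the linear interpolation below is our illustrative reading of the description of [1], not HolC's exact formulation, and all values are made up.

```python
# Illustrative sketch of a "balance factor": blend a document-level
# ("holistic") prediction with the average of sentence-level ("cumulative")
# predictions. Not HolC's exact formulation.
def combined_sentiment(doc_probs, sentence_probs, balance=0.5):
    """doc_probs: class -> probability for the whole passage.
    sentence_probs: list of per-sentence class -> probability dicts.
    balance in [0, 1]: 1.0 = purely holistic, 0.0 = purely cumulative."""
    classes = doc_probs.keys()
    n = len(sentence_probs)
    avg = {c: sum(p[c] for p in sentence_probs) / n for c in classes}
    blended = {c: balance * doc_probs[c] + (1 - balance) * avg[c]
               for c in classes}
    return max(blended, key=blended.get)
```

With `balance=1.0` only the passage-level signal matters; with `balance=0.0` only the per-sentence average does, and intermediate values trade the two off.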
2 https://rasa.com/docs/rasa/2.x.
3 https://jmcauley.ucsd.edu/data/amazon/.
Fig. 3. HolC architecture, taken from [1]
The training datasets were machine-translated into Romanian, Italian, and Greek using Google Translate. The model was then trained on a corpus containing tokens from all of these languages. Combining multiple languages in a single model offers more benefits than one model per language. First, the predictions are more robust. Second, the model is language-independent, i.e., it can handle combinations of these languages, or all of them together in a single sentence, without losing content or context.

3.3 Sentiment Analysis on Facial Expressions
The purpose of this module is to monitor the patient's face during interaction with the chatbot. The analysis is performed frame by frame and is based on the design, implementation and training of a machine learning model that receives video frames as input and produces an assessment of the user's emotional state. The overall application workflow is shown in Fig. 4. As with textual sentiment analysis, the results of facial expression analysis can also influence the course of the conversation (see Fig. 2). The first step is to locate the user's face within the video frame. This is done using ML Kit4, which can detect a user's face and draw a bounding box around it. The region of interest is then cropped and fed to the computer vision emotion recognition model, which generates the prediction of the user's emotional state. Each frame is classified into one of the following classes: "negative", "positive" or "neutral". Our module builds on and refines state-of-the-art CNN models specifically designed for mobile and embedded vision applications.
4 https://developers.google.com/ml-kit.
Fig. 4. MEAA Service Workflow
When designing our deep learning model for emotion detection, we experimented with different computer vision models such as MobileNetV1 [21], MobileNetV2 [27] and miniXCEPTION [15]. Several architectures and combinations of hyperparameters were tested and evaluated in terms of prediction accuracy and latency for real-time inference. After this comparison, we concluded that miniXCEPTION had both the best prediction performance (about 75% prediction accuracy under real-world conditions) and the lowest latency. Large amounts of image data are needed to train an emotion recognition model. However, as only a limited number of facial image datasets are freely available, the model was trained using transfer learning. With this technique, a previously trained model is reused for a new problem. For our training, we used as a starting point the miniXCEPTION network pretrained on the ImageNet [12] dataset. The fully connected layers were then removed and new ones added, while all convolutional layers remained frozen (i.e. set to untrainable). The model was then trained again, this time on the emotion detection dataset. Our training data were carefully selected to represent the population well. Specifically, we collected facial images of people with different facial expressions, of different ages, with and without accessories (such as glasses), and under different lighting conditions. In this way, we improved the generalization of our model. The emotion detection dataset is a combination of data collected from various sources, such as Kaggle (the FER 2013 dataset5 and the Jafar Hussain Human Emotions dataset6) and other open sources such as Unsplash7, Pexels8 and Pixabay9. After training, the model was converted to a TFLite10 version suitable for running on handheld devices with limited resources. User privacy and transparency concerns were also considered and addressed.
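The freeze-and-replace step of the transfer-learning procedure can be shown abstractly. The layer objects below are plain records standing in for real Keras/TensorFlow layers; the layer names and the three-layer backbone are illustrative assumptions, not the actual miniXCEPTION topology.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kind: str            # "conv" or "dense"
    trainable: bool = True

# Pretrained miniXCEPTION-like stack (schematic): conv backbone + ImageNet head.
pretrained = [Layer("conv1", "conv"), Layer("conv2", "conv"),
              Layer("conv3", "conv"), Layer("fc", "dense")]

# 1) Drop the fully connected head trained on ImageNet.
backbone = [l for l in pretrained if l.kind == "conv"]

# 2) Freeze all convolutional layers (set them to untrainable).
for l in backbone:
    l.trainable = False

# 3) Attach a fresh head for the 3-class emotion task; only it will be trained.
model = backbone + [Layer("emotion_head", "dense", trainable=True)]

print([(l.name, l.trainable) for l in model])
```

In a real framework the same three steps map to removing the top layers, setting `trainable = False` on the remaining ones, and appending a new dense classifier before retraining on the emotion dataset.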
For data protection reasons, the analysis is performed on the device only.
5 https://www.kaggle.com/datasets/msambare/fer2013.
6 https://www.kaggle.com/jafarhussain786/datasets.
7 https://unsplash.com/.
8 https://www.pexels.com/search/face/.
9 https://pixabay.com/vectors/.
10 https://www.tensorflow.org/lite.
The extracted video frames are not stored on an external server but are discarded once the emotion detection model has recognized the emotion. As for transparency, the user is made aware of the background video-recording service, as the app shows an "open camera" notification. While the application is open, a notification appears at the top of the screen alerting the user that recording is in progress. In addition, the first time the application is opened, the user is asked for permission to access the camera.

3.4 Sentiment Aggregation
The two models described above use different modalities to infer a user's mood at a given moment: one uses the text entered by the user and the other uses the user's facial expressions. The data frequency of these models varies greatly. Text is only collected when the user sends a message, while video frames are captured continuously during the user's interaction with the app. Moreover, facial expressions are usually not static over the course of an emotion; they are dynamic events that unfold over time. Leveraging both components is therefore very important for capturing the user's true feeling. The information from the two models is integrated as follows. For each text message, we take the results of the text sentiment analysis. The timestamp of the message is used to obtain the corresponding results of the frames processed by emotion detection. Given the dynamic nature of facial expressions, instead of considering only the single frame that matches this timestamp, we consider the interval covering all frames recorded 3 s before the message was sent and 3 s after. The average mood over this period is calculated from the frames (see Fig. 5). Finally, the output of both components, together with their (dis)agreement, is recorded.
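The ±3 s windowing and averaging step can be written down directly. The timestamps, labels and score mapping below are made-up illustrative data, not taken from the system.

```python
# Map frame labels to scores so the window can be averaged.
SCORE = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}

def to_label(score):
    if score > 0.33:
        return "positive"
    if score < -0.33:
        return "negative"
    return "neutral"

def window_sentiment(frames, msg_ts, window=3.0):
    """Average the facial-emotion labels of all frames within +/- window seconds
    of the message timestamp (timestamps in seconds)."""
    scores = [SCORE[label] for ts, label in frames if abs(ts - msg_ts) <= window]
    if not scores:
        return "neutral"   # no frames fell inside the interval
    return to_label(sum(scores) / len(scores))

# Illustrative frame stream: (timestamp, predicted class)
frames = [(9.0, "neutral"), (10.5, "positive"), (11.0, "positive"),
          (14.9, "positive"), (20.0, "negative")]
text_sentiment = "positive"                       # from the text model, same message
video_sentiment = window_sentiment(frames, msg_ts=12.0)
print(video_sentiment, video_sentiment == text_sentiment)  # positive True
```

Recording whether the two labels agree, as in the last line, is what populates the (dis)agreement statistics reported later in Table 3.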
3.5 The Semantic Knowledge Graph (SemKG)
The results of the sentiment analysis of the two modalities, together with accompanying data (dialogue, timestamp), populate an underlying semantic model (i.e. the ALAMEDA ontology) in SemKG. SemKG is an "umbrella" framework that comprises the populated semantic model, the repository holding it, and a RESTful API for interacting with the information residing in the repository. The motivation behind SemKG is that we wanted to build a scalable semantic infrastructure to handle heterogeneous inputs and interactions from different components, and semantic technologies are well suited for this [8]. Furthermore, the adopted structure follows open standards recommended by the W3C; specifically RDF [11] and OWL [16] for knowledge representation and SPARQL [10] for querying the stored knowledge. The corresponding workflow can be seen in Fig. 6: an HTTP request (GET or POST) is sent to the appropriate endpoint of the SemKG REST server, which forwards it to the internal query engine, which converts the request into SPARQL queries.
Fig. 5. Sentiment aggregation
The queries are forwarded to the RDF triple store, i.e. the repository where the semantic model resides. Once the triple store responds, the query engine converts the SPARQL result set into a JSON response, which is routed back to the requesting entity through the REST server. SemKG's RESTful API serves as a single point of interaction for submitting and retrieving ontology data. In the particular context of this work, SemKG provides valuable holistic information about patients' mood swings, derived from both input sources (text and video), analyzed and presented in a consistent manner. Examples include retrieving the average sentiment for a specific time period, evaluating the high and low sentiment scores for a specific day/week/month, and more.
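The kind of request SemKG answers can be illustrated by sketching the SPARQL its query engine might generate for "average sentiment in a time period". The prefix, property names and patient identifier below are assumptions for illustration only; the actual ALAMEDA ontology terms may differ.

```python
def avg_sentiment_query(patient_id, start, end):
    """Build a SPARQL query for a patient's average sentiment score in a time
    window. Prefix and property names are hypothetical, for illustration."""
    return f"""
PREFIX alameda: <http://example.org/alameda#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT (AVG(?score) AS ?avgSentiment) WHERE {{
  ?obs alameda:patient        alameda:{patient_id} ;
       alameda:sentimentScore ?score ;
       alameda:timestamp      ?ts .
  FILTER (?ts >= "{start}"^^xsd:dateTime && ?ts <= "{end}"^^xsd:dateTime)
}}
"""

q = avg_sentiment_query("patient42", "2022-06-01T00:00:00", "2022-06-07T23:59:59")
print("AVG(?score)" in q and "FILTER" in q)  # True
```

Following the workflow in Fig. 6, the REST server would execute such a query against the triple store and serialize the SPARQL result set as JSON for the requesting component.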
3.6 User Interface: Android App
As part of ALAMEDA, we designed and developed an Android application that serves as an interface to the server-side chat module described in
Fig. 6. SemKG interaction workflow with other components
previous sections. The chatbot app has a minimal user interface similar to traditional messaging apps and offers a familiar user experience for patients of different backgrounds and age groups. Patients need little time to learn and adapt to this interface, which shifts the focus to the conversation itself and increases the app's usability. The chatbot app can handle a variety of input methods. Users can interact with the application by typing their answers directly into the dedicated input field at the bottom of the screen, by speaking, or by tapping buttons where appropriate. Examples of these input methods are shown in Fig. 7.
Fig. 7. The Android app: a) left: a multiple-choice question from the Dizziness and Balance questionnaire; b) middle: questions with buttons and a conditional question that accepts free-text answers; c) right: a demo of the facial analysis (note that this actually runs in the background).
One thing to keep in mind regarding facial expression analysis is that if patients feel monitored, they may not express their real feelings. Therefore, this module is implemented as a background service and does not provide a visible user interface. It is worth noting that while the service is not visible, the user is made fully aware that it is running via a dedicated "toast" notification on the screen.
4 Assessment, Feedback and Lessons Learned
Since the system consists of multiple components, we present a preliminary evaluation of these components, as well as the first feedback we received from our end users.
Table 1. Sentiment analysis model parameters.

Embeddings (#dimensions): 300
Vocabulary (#terms): 310211
Sentence length (#tokens): 150
Languages (#n): 4

Table 2. Performance scores of the sentiment analysis model.

Class     Precision  Recall  F1 score
Negative  69.49      72.68   71.05
Neutral   54.57      56.77   55.65
Positive  79.69      73.45   76.44
A broader evaluation involving users is underway in collaboration with our local community groups and medical partners.
Sentiment analysis on text. We evaluated the performance of the HolC model on the sentiment analysis task over the eight combined datasets, with the model's hyperparameters configured as shown in Table 1. One can observe the good overall performance of the model, mainly on the positive and negative classes. On the neutral class, many sentiment analysis models suffer [1] due to the limited number of distinctive words in the examples of this class.
Sentiment analysis on facial expressions. Regarding facial expression analysis, an initial validation was carried out on a group of ten subjects, men and women, aged between 22 and 60 and with different characteristics (for example, glasses, short/long beard, etc.). The model achieved a prediction accuracy of 75%. A second validation is planned through the pilots of the ALAMEDA project, where real patients will interact with the application. Based on the feedback we receive, improvements and adjustments will be made to the model.
Sentiment aggregation. Regarding the joint performance of the two models, an initial validation was performed on a dataset created from a user's interaction with the system. The sentiment of each text message was compared to the aggregated sentiment of the facial analysis (as described in Section 3.4). Table 3 shows the percentage of agreement/disagreement between the predictions of the two models. During the evaluation, we realized that the two components partially complement each other. In cases where text sentiment analysis has difficulty deciding between two polarities, such as neutral and positive, emotion detection can be used to infer the patient's mood. Conversely, if emotion detection runs into problems, textual sentiment analysis may provide a more reliable estimate of the overall sentiment. Given the limited data in this evaluation and the challenging nature of the task
Table 3. Summary of agreement/disagreement between the two sentiment analysis models.

                     Video output class
Text output class    Positive   Neutral   Negative
Positive             23.33%     13.33%    0%
Neutral              13.33%     13.33%    10%
Negative             6.66%      10%       10%
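From Table 3, the overall agreement rate between the two models is the sum of the diagonal cells; a quick check:

```python
# Confusion matrix from Table 3: rows = text model, cols = video model (percent).
classes = ["positive", "neutral", "negative"]
table3 = [
    [23.33, 13.33,  0.00],   # text: positive
    [13.33, 13.33, 10.00],   # text: neutral
    [ 6.66, 10.00, 10.00],   # text: negative
]

agreement = sum(table3[i][i] for i in range(len(classes)))
disagreement = sum(map(sum, table3)) - agreement
print(round(agreement, 2), round(disagreement, 2))  # 46.66 53.32
```

The two models thus agree on roughly 47% of messages; the cells sum to 99.98% rather than 100% because of rounding in the reported percentages.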
(i.e. patients recovering from brain disorders may have difficulty expressing their emotions through facial expressions), further evaluation is required. As the pilots collect more data, an end-to-end analysis of the combined capabilities of the two components will be performed for each disease, and any necessary adjustments will be made.
User feedback. A preliminary version of the conversational agent was made available to medical partners and local community groups for initial evaluation. Both groups confirmed the accessibility and usability of the mobile application's graphical interface. They also agreed that the chatbot shows great promise as a tool for collecting the medical questionnaire data needed for the project. From the initial assessments, the medical partners identified additional features that the agent should include to improve its usability and therefore the quality of the collected data. These features include the ability to enter responses of specific types, such as dates and numbers, in a variety of formats. To realize these entity recognition capabilities, semantic mapping was built into the agent. Furthermore, extending the agent to provide health status updates based on the data collected in SemKG was discussed. The medical partners agreed on which information about their medication could be presented to patients without compromising the study. Based on the agreed information, we added this functionality to the agent.
5 Conclusions and Next Steps
This article provided an overview of a conversational agent that supports the treatment of brain diseases. The elements that make Zenon unique are the variety of conversation flows, the two different modalities for sentiment analysis (text and video), multilingual support, and an underlying semantic knowledge graph framework. We presented an overview of the main intelligent components, as well as a preliminary assessment of each component and of the chatbot as a whole. Naturally, a more thorough user evaluation of the integrated system, as well as of the individual components, is high on our research agenda. Although the system proposed in this work is specifically designed for brain diseases, many of the components (and in some cases their integration) can be used in other domains or use cases. For example, the multimodal sentiment analysis functionality could be useful in other areas of healthcare or in applications where tracking user emotion is important.
On top of that, we are currently working on an extension of Rasa that will result in a development framework greatly simplifying the implementation of questionnaires in a conversational agent in terms of configuration, deployment and maintenance. This would, of course, benefit many applications.
Acknowledgements. This research received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. GA 101017558 (ALAMEDA).
References
1. Agathangelou, P., Katakis, I.: Balancing between holistic and cumulative sentiment classification. Online Soc. Netw. Media 29, 100199 (2022)
2. Alnashwan, R., Sorensen, H., O'Riordan, A., Hoare, C.: Multiclass sentiment classification of online health forums using both domain-independent and domain-specific features. In: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, Austin, TX, USA, pp. 75–83. ACM (2017)
3. Amato, F., Marrone, S., Moscato, V., Piantadosi, G., Picariello, A., Sansone, C.: Chatbots meet eHealth: automatizing healthcare. In: AI*IA (2017)
4. Anastasiadou, M., Alexiadis, A., Polychronidou, E., Votis, K., Tzovaras, D.: A prototype educational virtual assistant for diabetes management. In: 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE), Cincinnati, OH, USA, pp. 999–1004. IEEE (2020)
5. Bagherzadeh, P., Bergler, S.: Multi-input recurrent independent mechanisms for leveraging knowledge sources: case studies on sentiment analysis and health text mining. In: Proceedings of Deep Learning Inside Out (DeeLIO): The Second Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 108–118. Association for Computational Linguistics (2021)
6. Cai, Y., Zheng, W., Zhang, T., Li, Q., Cui, Z., Ye, J.: Video-based emotion recognition using CNN and BRNN. In: Tan, T., Li, X., Chen, X., Zhou, J., Yang, J., Cheng, H. (eds.) CCPR 2016. CCIS, vol. 663, pp. 679–691. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-3005-5_56
7. Chang, F.J., Tran, A.T., Hassner, T., Masi, I., Nevatia, R., Medioni, G.: ExpNet: landmark-free, deep, 3D facial expressions (2018)
8. Cheatham, M., Pesquita, C.: Semantic data integration. In: Zomaya, A.Y., Sakr, S. (eds.) Handbook of Big Data Technologies, pp. 263–305. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49340-4_8
9. Clark, L., et al.: What makes a good conversation? Challenges in designing truly conversational agents. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2019)
10. World Wide Web Consortium: SPARQL 1.1 overview (2013)
11. World Wide Web Consortium: RDF 1.1 concepts and abstract syntax (2014)
12. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database, pp. 248–255 (2009)
13. Diederich, S., Brendel, A., Morana, S., Kolbe, L.: On the design of and interaction with conversational agents: an organizing and assessing review of human-computer interaction research. J. Assoc. Inf. Syst. 23(1), 96–138 (2022)
14. Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN-RNN and C3D hybrid networks. In: Proceedings of the 18th ACM International Conference on Multimodal Interaction, pp. 445–450 (2016)
15. Gogate, U., Parate, A., Sah, S., Narayanan, S.: Real time emotion recognition and gender classification, pp. 138–143 (2020)
16. W3C OWL Working Group: OWL 2 Web Ontology Language document overview: W3C recommendation 27 October 2009 (2009)
17. Sun, Z., et al.: Design for trust: a set of design principles to increase trust in chatbots. CCF Trans. Pervasive Comput. Interact. 1–8 (2022)
18. Hommersom, A., et al.: MoSHCA: my smart mobile healthcare assistant. In: 2013 IEEE 15th International Conference on e-Health Networking, Applications and Services (Healthcom 2013), Lisbon, Portugal, pp. 188–192. IEEE (2013)
19. Hossain, M.S., Muhammad, G.: An audio-visual emotion recognition system using deep learning fusion for a cognitive wireless framework. IEEE Wirel. Commun. 26(3), 62–68 (2019)
20. Hossain, M.S., Muhammad, G.: Emotion recognition using deep learning approach from audio-visual emotional big data. Inf. Fusion 49, 69–78 (2019)
21. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv (2017)
22. Hu, Y., Qu, Y., Maus, A., Mutlu, B.: Polite or direct? Conversation design of a smart display for older adults based on politeness theory. In: CHI Conference on Human Factors in Computing Systems, pp. 1–15 (2022)
23. Jelodar, H., Wang, Y., Orji, R., Huang, S.: Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach. IEEE J. Biomed. Health Inform. 24(10), 2733–2742 (2020)
24. Laranjo, L., et al.: Conversational agents in healthcare: a systematic review. J. Am. Med. Inform. Assoc. 25(9), 1248–1258 (2018)
25. Maga-Nteve, C., et al.: A semantic technologies toolkit for bridging early diagnosis and treatment in brain diseases: report from the ongoing EU-funded research project ALAMEDA. In: Garoufallou, E., Ovalle-Perandones, M.-A., Vlachidis, A. (eds.) MTSR 2021. CCIS, vol. 1537, pp. 349–354. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-98876-0_30
26. Oh, K.J., Lee, D., Ko, B., Choi, H.J.: A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation. In: 2017 18th IEEE International Conference on Mobile Data Management (MDM), Daejeon, South Korea, pp. 371–375. IEEE (2017)
27. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR, pp. 4510–4520. Computer Vision Foundation/IEEE Computer Society (2018)
28. Teixeira, M.S., Maran, V., Dragoni, M.: The interplay of a conversational ontology and AI planning for health dialogue management. In: Proceedings of the 36th Annual ACM Symposium on Applied Computing, Virtual Event, Republic of Korea, pp. 611–619. ACM (2021)
29. Wafa, H., Wolfe, C., Emmett, E., Roth, G., Johnson, C., Wang, Y.: Burden of stroke in Europe: thirty-year projections of incidence, prevalence, deaths, and disability-adjusted life years. Stroke 51, 2418–2427 (2020)
30. Walker, T., et al.: Developing an intelligent virtual agent to stratify people with cognitive complaints: a comparison of human-patient and intelligent virtual agent-patient interaction. Dementia 19, 1173–1188 (2018)
Interactive Journaling with AI: Exploring Words and Language as Interaction Design Materials

Max Angenius1(B) and Maliheh Ghajargar1,2

1 School of Arts and Communication, Malmö University, Malmö, Sweden
2 Research Centre for the Internet of Things and People, Malmö University, Malmö, Sweden
Abstract. Conversational agents (CAs) make human-computer interaction more collaborative and conversational through the use of natural language. The HCI and interaction design communities have experimented with and explored the domain of conversational interaction design. At the same time, interaction designers may need to acquire new skills for designing, prototyping and evaluating artifacts that incorporate AI technologies in general and CAs in particular. This paper builds on an earlier study on principles for designing interactive journaling experiences with CAs and examines the practice of designing such experiences using words, language, and conversation as design materials. We present a prototype of an interactive reflective journaling interaction with a CA and the results of a Wizard of Oz experiment. Our results suggest that CA interaction design challenges designers to work with materials that have inherently different properties and qualities. Despite this challenge, words seem to have unique properties that help designers externalize and iterate on ideas, for example, tone and intent. Therefore, we propose considering words, language, and conversation as primary design materials, and AI predictability, adaptability, and agency as secondary materials, when designing human interactions with conversational agents.

Keywords: Reflective journaling · Conversational agents · Human-AI interaction · Design methods · Design materials · User experience · Interaction modality
1 Introduction

The daily practice of recording and relating personal experiences, thoughts, and emotions is a reflective practice that helps individuals and groups discover meaning, gain other people's perspectives, and develop critical thinking and affective skills, among other things [1]. Computer-supported reflective learning is an area of research within HCI in which journaling is a tool for learning and reflection [2]. Previous research confirms that interacting with others, talking, and giving and receiving feedback have positive effects on reflective activities such as studying or journaling [3, 4]. Conversational interfaces, such as conversational agents (CAs), as a form of artificial intelligence (AI), use language as the primary interaction modality [5] to interact

© The author(s), under exclusive license from Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 150–170, 2023. https://doi.org/10.1007/978-3-031-25581-6_10
Interactive diary with AI
151
with users in a conversation or collaboration [6]. CAs are an emerging AI technology, and interaction with CAs has been the subject of interaction design studies [2, 6, 8-10]. Furthermore, there is debate in the HCI and interaction design communities about what can be considered design materials and how interaction designers can shape these materials in their practices, for example AI or data as design materials [7, 11, 12]. However, when designing interactions with CAs similar to GPT-3 [13], there is another challenge we face: how to design interactions in which human language is the main modality of interaction [5], and how to design realistic services using AI in a way that can be prototyped and evaluated during the development phase [14]. This article explores this area by designing and prototyping a reflective and interactive journaling experience using a CA. It aims to answer the research question: how can we design an interactive journaling experience using words, language and conversation as the main design materials? In this article, we present and discuss the process of prototyping and testing a conversational agent and provide insights into using words as a material for interaction design.
2 Background

2.1 Journaling and Reflection

Reflection is a meaning-making process that supports the discovery and understanding of relationships and connections between experiences and actions in order to develop new perspectives [2, 15, 16]. Reflection is influenced by internal and individual activities as well as external and collective components, e.g. relationships with artifacts, activities, places and people [15]. Reflective journaling is an activity in which a person documents their personal experiences, writing down thoughts and emotions and connecting them, with the goal of reflection and personal growth [1, 16, 17]. Journaling as a method of reflection allows people to organize, clarify and connect knowledge in a concrete way and thus process old and new knowledge [4]. Furthermore, the journal writer is actively involved in a learning process [1] as they create meaning and context from their experiences [18]. Some examples of different journaling methods are the double-entry journal [4], the dialogue journal and the personal journal [3]. The double-entry and dialogue journaling methods are collaborative: two or more people engage with one person's written reflections, either with peers or with a mentor [3, 4]. The personal journal, on the other hand, is a solitary method in which only the individual journal writer participates in the writing and reflection processes [3]. Although the individual journal has value as a tool for reflection, it also carries the risk of repeating personal reflections and beliefs, as they are not challenged or questioned [3]. A more collaborative approach facilitates personal growth by providing opportunities for feedback from others, which reinforces critical self-assessment [3], and learning occurs when knowledge moves from "an inquiry-based state to a socially constructed state".
Transforming the journal into a multi-person dialogue raises ethical considerations, particularly regarding the writer's perception of trust in the journal reader, the clarity of expectations, and the quantity and quality of feedback [3]. These ethical considerations must be taken into account when designing conversations about journal content [3]. Another, more crafts-oriented journaling style is the bullet journal, which appeals to journal writers who value an open and
organic way of using materials, with the freedom to express themselves and present their information in a personalized way [19]. In HCI and interaction design, journaling has been studied in different areas and applications. For example, MEMEory is a mobile journaling app that uses memes as a means of reflection [17], and Eat4Thought is a food journaling app for identifying eating behaviors [20]. In another study, blogs were used as a computer-based journaling tool in a classroom to stimulate discussion and reflection [18]. Finally, Robota is a CA that supports self-learning in the workplace through written and verbal interaction [8].

2.2 Human-AI Interaction

AI as a design material supports interaction designers in designing experiences and functionalities of computational artifacts, such as prediction, adaptability and agency [11, 14]. Designing with adaptability is, among other things, about considering how the system will adapt and change, and how it will perform over time according to the specific and varying needs of users [14, 21]. An adaptive system such as AI needs to store information and remember previous user interactions to help it make connections between data collected over time [14]. Human-AI interaction can be designed in different ways, attributing different levels of agency to the user and the system [7]. Intermittent human-AI interaction is a turn-taking paradigm in which the user retains all agency and initiates the interaction, whereas continuous human-AI interaction is a paradigm in which agency is shared between human and system, and the system listens to an uninterrupted stream of user input and responds throughout the interaction [22]. Proactive human-AI interaction is where the system has the most agency and can initiate and complete tasks on its own [22]. Each human-AI interaction paradigm has different levels of complexity and use cases that designers can explore in their design process [22].
Consequently, the HCI and design communities have developed various design principles for human-AI interaction in different domains, such as Amershi et al.'s [7] guidelines for human-AI interaction, Kulesza et al.'s [23] principles of explanatory debugging for personalizing interactive machine learning, Cronholm and Göbel's [24] design principles for human-centered AI, and design principles for an interactive reflective journal with a conversational agent [25].

2.3 Conversation as an Interaction Modality

Conversation can be defined as a cascade of behaviors and signals that unfolds between two receptive speakers [5] and is considered one of the oldest interfaces [26]. Conversation is fundamentally cooperative [5] and goal-oriented [26]. Likewise, a dialogue is a conversation between two entities and includes features such as turn-taking, overlap, interruption, cues and repair [5]. In a design situation such as human-CA interaction design, where speech and conversation are the main modes of human-computer interaction, words can be considered one of the design materials, as they are building blocks for the user experience and a key part of the interaction design as a whole [26].
Human history has seen different conversational cultures that evolved over time: oral culture, literary culture, and secondary orality [26]. Oral culture was time- and context-specific, and knowledge and meaning existed only as long as they remained compelling to the individuals who remembered and repeated them; literary culture, thanks to writing, was portable and reproducible, and stimulated reflection and the creation of knowledge; secondary orality, enabled by technology, made writing immediate, group-oriented, collaborative, conversational, and intertextual [26]. In conversational design, the design process begins with an interaction or dialogue [26], and words and language are the main modalities for giving information and receiving responses; the interaction is turn-based, imitating how a human would interact with another human [5], that is, the system is endowed with human-like behavior and characteristics [9]. A conversational system, such as a conversational agent (CA), uses language models and collaborates with the user and other artifacts to carry out conversations [6, 27]. Context creates meaning in conversations, and people interact with context based on what they hear and see and on previous experiences during the conversation [5]. Longitudinal context is when a computer system collects information and creates records from various conversations and meetings over time [5]. Longitudinal context allows the CA to store data from previous conversations with the user and retrieve it when needed, giving the CA personality and helping it form an emotional connection with the user [5]. The personality of the artifact in a conversational interface encompasses character traits, behaviors, word choice and intonation, and is a central part of the user experience [5]. Personality is also linked to user expectations, as the user decides how to interact with the artifact based on how the artifact's personality is described and perceived during interactions [5, 10, 14, 26].
In a conversational interface, information is often not presented through a conventional GUI [28]. Therefore, users must familiarize themselves with the CA's form of interaction [5], which emphasizes the importance of first impressions and onboarding [5, 9, 26]. Onboarding takes place over several sessions and can be designed as a process of learning the system [5]. The initial interaction between the user and the system is where the stakes are highest [5, 26], and the goal of the initial conversation is not to explain every potential interaction, but rather to encourage the user to engage with the interaction, with more to be learned gradually over time [5], also known as progressive disclosure [26].
3 Methodology In this article, we use an exploratory and qualitative research-through-design approach that employs a variety of methods and tools [30]. Building on a previous study, we defined and revised a set of design principles for designing reflective and interactive journaling experiences [25, 30]. The results provided an opportunity to explore CAs specifically for a reflective and interactive journaling experience. Furthermore, previous research suggests that interaction designers need methodological knowledge and material skills
M. Angenius and M. Ghajargar
to design with conversations, words, and language [5, 31] and with AI [7, 14], and that there is a need for more research specifically on the use of CA technology [6, 7]. In this work, we were interested in prototyping a reflective and interactive journaling experience based on our previously developed design principles (Table 1) [25]. We started the process with a creative exploration of this space. We reviewed existing research on the subject, experimented with different journaling methods, and interviewed experienced journal writers. Participants recruited for this study were PhD students from the Interaction Design program at Malmö University, aged between 20 and 30, experienced journal writers, and interested in technology. Five of the six participants identified as women and one as a man. Furthermore, the participants had different nationalities and different approaches to journaling. Five 30-minute semi-structured interviews were conducted and recorded with permission [30]. The results were coded using affinity diagramming to find patterns and insights in the data. The design principles were used to frame the design requirements and user needs of the study's design space and to generate ideas. The design process explored various methods for designing with words and language as design material, including personality design, agent behavior and mood, and dialogue, to name a few. The ideation process resulted in a final prototype of an interactive journaling experience with a conversational agent, which was tested using the Wizard of Oz method with three participants, all of whom had previously participated in the study (Fig. 1) [30].
Fig. 1. Summary of methodological steps and design process.

Table 1. Revised version of the design principles for interactive and reflective journaling with conversational agents.

01. The system acts as a confidant
User need: The journal writer attributes a patient, listening personality to the artifact and needs to trust its characteristics and generate synergy with it through interactions.
Artifact requirement: The system must be intentionally designed to be nonjudgmental, and a close relationship between the journal writer and the artifact needs to be carefully cultivated.

02. The system expresses a personality
User need: The journal writer needs the personality of the interactive artifact to set expectations for the interaction.
Artifact requirement: The personality sets expectations and affects the relationship; it must sustain that relationship without humanizing the agent or creating a deceptive character.

03. The system supports personal expression and aesthetics
User need: The journal writer must be able to express personal feelings and thoughts and have the freedom to use different materials and tools.
Artifact requirement: The user must have the opportunity for personal expression, e.g., by writing and by talking to the CA. In addition, system features must support users with different journaling goals, depths of reflection, and modes of interaction.

04. The system encourages deeper thinking
User need: The journal writer needs encouragement and guidance to think more deeply, e.g., through prompts or questions.
Artifact requirement: The system must support reflection, e.g., by providing prompts, questions, or reflective dialogue, as part of the intermittent human-AI interaction paradigm [22].

05. The system expands human memory
User need: The journal writer needs to remember the previous events, thoughts, and feelings written in the journal. This helps the user to recognize patterns, reflections, and personal growth.
Artifact requirement: The system must collect data to support memory expansion, pattern recognition, and continuous adaptive use over time, collecting data ethically and explicitly [22].

06. The system updates and adapts
User need: The journal writer's needs, goals, and expectations change over time and across contexts; the journal writer needs a flexible journaling practice.
Artifact requirement: The system must be able to adapt to the user's needs and context. It must be designed for the longitudinal context and be able to collect data from multiple sessions over a long period of time.

07. The system encourages social interactions for reflection
User need: The journal writer needs social interaction to externalize thoughts and gain second opinions and perspectives, and needs to share experiences and knowledge with others to gain alternative perspectives.
Artifact requirement: The system must provide a platform for collaborative writing and sharing of journal entries. It should provide features such as Q&A, space for comments, or drawings for individual journal entries.

08. The system does just enough
User need: Journal writers consider writing time sacred, a meditative state that should not be disturbed.
Artifact requirement: The system can act proactively and collect data, but explicit interactions with the user must be intermittent, waiting for their turn (when the user has finished writing or needs the system to interact).

09. The system is explainable and transparent
User need: The journal writer must know the system's functionalities, its data collection methods, and its background activities.
Artifact requirement: The system has to explain what it can and cannot do; this is especially important for building a better user relationship.

10. The system onboards the user slowly
Artifact requirement: The system gets to know the user little by little. During the slow onboarding process, the first impression of the system (discussed above) is important for creating a more sustainable relationship.

11. The system allows the user to manage and control data and its use
User need: The journal writer must be able to leave the system at any time and must be able to control and manage data usage.
Artifact requirement: The system shall provide the user with the ability to stop data collection at any time and delete journal entries if necessary.
4 Designing Conversational Interactions In keeping with our exploratory, qualitative research-through-design approach [29], we began with a process of brainstorming and sketching. The sketching [32] was done in two rounds. The first round was a regular brainstorming activity using pencil and paper and resulted in 24 different sketches around the theme of interactive journaling with a CA. The topics that emerged were diverse, ranging from the use of specific technologies or tools to the nature of the conversation, the data to be collected, and more (Fig. 2).
Fig. 2. A selection of sketches.
For the second round of brainstorming, the first author of this article designed a brainstorming activity to challenge the comfort zone of divergent thinking [29]. The activity was named
the Interaction Design Wheels of Fortune and consisted of seven wheels based on different frameworks and concepts from our literature review [30] (Appendix B). These were: (1) interaction attributes [33], (2) interaction perspectives [34], (3) form of interaction [35], (4) perspectives on interaction [36], (5) conversational cultures [26], (6) levels of reflection [37], and (7) human-AI interaction paradigms [22]. These concepts were put into a random decision tool called Wheel Decide [38], the aim being to combine unexpected and new elements from the different wheels and invite designers to sketch unexpected and creative combinations. Each wheel was spun once, resulting in a list of seven elements for a sketch or idea. For example, sketch number 25 used: fast, tool, continuity, control, secondary orality, dialogic reflection, and continuous human-AI interaction. The method is reminiscent of cut-up, a Dada creativity method in which artists cut up and rearrange material to form new and unexpected creations [39]. This second round resulted in 15 sketches, each with its own unique set of elements (Appendix C), allowing us to think about and describe the different forms the interaction could take and how users could experience it. Some combinations worked better than others (e.g., combination 34, Fig. 3). The close interaction attribute is described as deeper analysis, allowing for variation, mastery of new ideas, and exploration, which is essential for journaling. The dialogue-partner interaction perspective and dialogic reflection are appropriate because we designed an artifact to facilitate a conversation between the user and the AI. The speed attribute of the form of interaction refers to the speed of the user's action and the system's reaction. Finally, intermittent human-AI interaction is closely related to the turn-based nature of conversation.
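The wheel-spinning procedure described above (spin each of the seven wheels once and combine the results into one sketch prompt) can be sketched in a few lines. The wheel names follow the list above; the items on each wheel are illustrative assumptions, since the paper does not enumerate them all:

```python
import random

# Seven wheels, named as in the study; the item lists here are
# illustrative assumptions (only a few appear verbatim in the paper).
WHEELS = {
    "interaction attribute": ["fast", "slow", "close", "random"],
    "interaction perspective": ["tool", "dialogue partner", "medium"],
    "form of interaction": ["speed", "continuity", "precision"],
    "perspective on interaction": ["control", "experience"],
    "conversational culture": ["oral", "literary", "secondary orality"],
    "level of reflection": ["descriptive", "dialogic", "critical"],
    "human-AI paradigm": ["intermittent", "continuous", "proactive"],
}

def spin_all(rng: random.Random) -> dict[str, str]:
    """Spin each wheel once, yielding one element per wheel for one sketch."""
    return {wheel: rng.choice(items) for wheel, items in WHEELS.items()}

combo = spin_all(random.Random(42))
print(combo)  # seven wheel -> element pairs, the prompt for one sketch
```

Seeding the generator makes a combination reproducible, which Wheel Decide itself does not guarantee; the design value lies in the unexpected juxtapositions, not the tool.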
Fig. 3. Combinations 34 and 37.
Some other combinations did not work very well (e.g., combination 37, Fig. 3). The random interaction attribute involves a low level of challenge, with no room for building competence or for refinement, which are clearly experiences we do not want to design for in a reflective journal. Examining the different sketches and combinations of elements allowed us to study which interaction elements were most appropriate given the design principles (Table 1). This creative exploration supported the prototyping of the CA's personality, conversation prompts, functionality, and how the interaction would work.
4.1 Personality Prototyping and CA Conversations Designing conversational interfaces often requires consideration of personality design, prompt design, intent definition, and flow documentation [5]. We applied Deibel and Evanhoe's [5] framework to design the CA's personality, which includes six elements: interaction goals, level of personification, power dynamics, character traits, tone, and key behaviors. To design a personality for an interactive journaling artifact, we used our design principles (Table 1) [25, 30]. The design of these six elements of the agent's personality was the foundation of the conversation and the interactive experience. For example, defining key behaviors included specifying behaviors and prompts for certain situations: in a situation where the CA makes an error when responding to a user question, it requests clarification of the requested task; an example prompt in that situation is "I'm sorry, I didn't understand that. Could you explain or repeat it?" (Appendix A). Using the agent's personality, sample scripts and audio mockups were designed (Table 2) to prototype the conversation. The sample script was written and iterated several times in a text document, while the audio mockups were created using Descript, an AI-enabled text-to-speech software. Listening to the scripts as a conversation between two AI-generated voices allowed improvements and iterations of the prototype conversation [30]. This iterative conversation process allowed for a more authentic experience during the Wizard of Oz test. 4.2 Testing the Journaling Experience with the CA Talkus AI-relius [30] is the research prototype that represents an example of a possible design configuration for an interactive and reflective journaling system with a CA, following the design principles (Table 1) and an exploratory research-through-design approach [29].
The research prototype (a "look and feel" prototype [40] or "experience prototype" [41]) consists of two connected objects: (1) a journaling app on a tablet with a smart pen and (2) a Bluetooth speaker that simulates a CA speaker through audio mockups (Fig. 4). We used 18 pre-recorded prompts to test the prototype, based on the audio mockups and key behaviors (Appendix A). Additionally, some extra prompts were added based on the assumption that the person testing the prototype would improvise, challenge the agent's skills, or request something unexpected, which was encouraged. The research prototype embodies our design principles, such as support for reflection, e.g., through reflective prompts, and support for personal and aesthetic expression through various input modalities, written and spoken [30]. We tested the prototype using a Wizard of Oz method called "pay no attention to the man behind the curtain," in which participants are aware of the presence of the person controlling the interactions [14]. The Wizard of Oz is a suitable method for testing AI-infused conversational interfaces [5, 14, 26], as it allows designers to simulate the behavior of the interactive artifact and to perform the system's tasks themselves [26]. Furthermore, the method works around
Table 2. Selection of sample scripts.

Agent: "Hi, my name is Ethan and I'm your journaling partner. What's your name?"
User: "Max."
Agent: "Nice to meet you, Max. Today is May 9, 2022. Would you like to journal today?"
User: "Yes, I want to journal today."
Agent: "OK, great. You can journal in two different ways. The first way is to write in the journal with the pen, and the second way is to speak to me with your voice. Are you ready to start our first journaling session together?"
User: "Yes."
Agent: "All right. One last thing. When you're done, put the pen back in the journal and I'll know it's time to end our journaling session for today."
*Journaling period, undisturbed*
Agent: "I noticed that you stopped journaling. Are you done for today?"
User: "I'm done."
Agent: "Good job today. I will save your entry in your journal. I hope to see you tomorrow for our next journaling session."
technological limitations [5, 26] and allows testing the core interaction of the prototype [26] and the intended behaviors and prompts [5], making it suitable for the purposes of this study. While the Wizard of Oz method is appropriate for this study, it also has limitations. First, human versus machine performance [14, 42]: the designer who simulates the artifact's behavior is limited in the number of triggers and responses they can control [42]. Second, the human simulating the artifact's behavior must be consistent with the interaction protocols and avoid improvised responses or interactions that are not part of the artifact's experience [14]. Three test sessions were conducted with participants who are avid journal writers. Each test session lasted 30 minutes and consisted of three main tasks and four optional tasks. The main tasks were: (1) journaling with pen and voice; (2) delving into personal experiences and customizing the journal entry; and (3) requesting a reflective prompt from the CA. The optional (and encouraged) tasks were: (1) asking something unrelated or irrelevant; (2) asking a difficult question that the AI supposedly cannot answer; (3) asking a personal question; and (4) asking the CA what it thinks. The wizard controlled the prototype, took notes during testing, and observed the interactions. The tests were followed by short semi-structured interviews in which we asked participants about their experiences.
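The scripted session tested here (see Table 2) can be read as a small turn-based state machine. The following sketch is our own framing, under the assumption of a fixed script; the states, events, and transitions are hypothetical, and the actual prototype used pre-recorded audio controlled by a human wizard:

```python
# A minimal state machine for the scripted journaling session in Table 2.
# States and transition rules are our own framing (an assumption), not the
# study's implementation.
TRANSITIONS = {
    ("greeting", "name_given"): "offer_session",
    ("offer_session", "yes"): "explain_modalities",
    ("explain_modalities", "ready"): "journaling",
    ("journaling", "pen_docked"): "confirm_done",
    ("confirm_done", "done"): "closing",
}

def step(state: str, event: str) -> str:
    """Advance the session; on an unknown event the agent waits in place."""
    return TRANSITIONS.get((state, event), state)

state = "greeting"
for event in ["name_given", "yes", "ready", "pen_docked", "done"]:
    state = step(state, event)
print(state)  # closing
```

Note how the "journaling" state absorbs all events except docking the pen, mirroring the intermittent, turn-based paradigm in which the agent stays silent during the undisturbed writing period.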
Fig. 4. Components for the Wizard of Oz test of the look-and-feel (experience) prototype.
5 Results In this section, we report the results of designing and testing the interactive journaling experience with the CA. We present the results in two sections, reflecting the different qualities of journaling experiences with a CA that emerged from our study and were confirmed by previous studies: (1) interactive and reflective journaling with AI and (2) conversational and collaborative journaling with AI. 5.1 Interactive and Reflective Journaling with AI There is an intriguing tension in supporting journaling with an interactive artifact. On the one hand, conversational agents (CAs) are considered nonjudgmental, while people are still learning to trust these nonjudgmental technologies [5]; on the other hand, journal writers fear being judged if someone else reads their journal, as also confirmed by Hubbs and Brand [3]. During the tests, we observed that two participants (P2 and P3) expressed a view of the CA as a nonjudgmental confidant with whom they would like to be open about their feelings and experiences. Furthermore, the prototype created a sense of security and freedom from the feeling of being judged, and participants felt empowered to control their actions and expressions during interactive journaling. Furthermore, P2 stated that they would never have a conversation with a paper diary, but would with Talkus AI-relius, suggesting that the interactivity of the CA creates a sense of presence and vitality. One of the challenges with
reflective thinking is acting on reflection [43], and we noted that P2 asked Talkus AI-relius for advice on how to act on their personal reflections; in this way, Talkus AI-relius was seen as a confidant. The user and the system began to form a relationship that participants valued. P2 mentioned during the test that interacting with the CA felt more engaging and joyful when processing negative emotions compared to a paper diary. P2 also pointed out that this could be due to the novelty of the experience, it being their first time using the prototype. More needs to be explored on how interactive technology can support reflection [2, 44–46], and one way we found in our prototype was through reflective prompts. All participants appreciated the reflective prompts in Talkus AI-relius and requested several additional reflective prompts to support their journaling. The value of these prompts was that their content was unexpected or not necessarily a topic that the participants themselves had thought about. This unexpected input enriched their reflective experiences, opening them up to thoughts and ideas they otherwise would not have had. One participant (P3) stated that it is easier to predict what a familiar person would say in a given situation than what Talkus AI-relius would say. While helpful, the reflective prompts needed to be repeated to allow participants to process and break them down for reflection. Using reflective prompts to create a conversation between the user and the CA encourages dialogic reflection, in which the user can build knowledge and enrich the meaning of the content [4]. In addition, collaborative journaling offers the journal writer the opportunity to receive feedback and practice critical thinking [3]. Power dynamics is another layer added to the nonjudgmental relationship between the user and the CA, a key element of Deibel and Evanhoe's framework [5] that we examined.
The most common approaches to dialogue design are prescriptive and persuasive: dialogues that tell users what to do, rather than inviting them to explore and reflect for themselves [2]. Our prototype did not lead participants to a specific level of reflection; rather, as expressed by P3, participants felt free to pursue their personal expression and visual aesthetics, and they felt free to ask for the agent's suggestions to deepen their reflective thinking. Interaction with a CA is generally seen as support that mediates other actions [9]. Our prototype was designed for intermittent human-AI interaction, where the user is in control and the artifact participates only when addressed [22]. During testing, we found that participants appreciated being in control of the conversation, but also wished that Talkus AI-relius had taken more initiative. For example, it could have been more proactive in offering reflective suggestions or even in starting the journaling activity, nudging the user toward journaling consistency. With Talkus AI-relius, interaction is mostly turn-based (intermittent), as the main interaction is the user's journaling activity, with Talkus AI-relius then providing support and answering questions. Some results suggest that even though writing time is sacred and should not be disturbed, more continuous human-AI interaction could benefit the journaling experience. In that case, Talkus AI-relius could suggest or remind the participant of previous journal entries on similar topics, e.g., while writing. These results suggest that the user values being in control in the context of reflective journaling with a CA, but is open to surprises and more prescriptive interactions, as suggested by Kocielnik et al.
[2], or continuous and proactive human-AI interaction, as suggested by van Berkel et al. [22]. Furthermore, the different types of human-AI interaction influenced user expectations about the conversation. P1 expressed uncertainty about what was expected of them when talking to the CA and was unsure whether it would be impolite to just write down their thoughts and not talk to Talkus AI-relius; they almost felt a moral obligation to talk about everything they wrote. Interaction design is increasingly about forms of interaction beyond visual and tactile perception [44], and designers must learn to design interactions where human language is the main form of information exchange [5]. An example we observed during testing was the combination of written and spoken input modalities. We found that combining written and spoken input is beneficial for achieving the design intent, but it needs to be explored further with users. Different reflective questions may be appropriate for different modalities [8], which we found to be true when using reflective prompts in our prototype. In particular, we observed a design challenge related to the input modality when a participant struggled to remember and understand the meaning of a reflective prompt just by hearing it. The use and combination of written and spoken modalities must be continually adapted and iterated. In addition, user preferences for interactive journaling may affect the combination or use of interaction modalities. One participant (P2) mentioned that they first wanted to write down their thoughts before saying anything to Talkus AI-relius. This observation opens another level of discussion about how the different human-AI interaction paradigms proposed by van Berkel et al. [22] should be approached and designed. Finally, another participant (P1) argued that the combination of written and spoken input would work if the conversation with the agent could be transcribed.
The design encouraged participants to engage in thoughtful social interactions, such as through Talkus AI-relius' feedback, comments, and additional perspectives. In addition, we observed that participants asked Talkus AI-relius for reflective tips to help them when they got stuck: "it was comforting to have the agent as a confidant to ask for help every time I got stuck" (P3). The design onboards the user slowly, which helped to build the relationship with Talkus AI-relius. We found during testing that while the prototype took time to learn about the user, participants wanted to explore the agent's capabilities and limitations further. Furthermore, one participant (P1) expressed that they wanted to discover all the options and choose from those available. 5.2 Conversational and Collaborative Journaling with AI Two challenges arose when performing the Wizard of Oz method with the conversational prototype. The first related to simulating the speed of interaction: the human wizard could not match the performance of a working conversational agent. Simulating the behavior of a CA with the Wizard of Oz method is a challenge [14, 42]. The challenge was apparent, for example, when users made unexpected requests to the agent that were not covered by the pre-recorded prompts, forcing the wizard to decide on the spot which pre-recorded prompt was most appropriate for the situation. The second challenge concerns consistency with the standard interaction protocols and the prototype's functionality during testing [14]. Throughout the tests, participants asked questions or expressed requests that the prototype was not designed for. In those cases, the wizard's default response was the agent's pre-recorded "sorry" prompt, apologizing for being unable to perform the action. Having to use the default apology message multiple times during the tests felt awkward and funny, but it was still critical to the success of the tests. Avoiding improvised reactions on the spot is essential during a Wizard of Oz test [14]. The Wizard of Oz is a viable method for testing prototypes that involve AI [14] and conversational design [5, 26] and, despite some awkward and funny moments, our Wizard of Oz test turned out to be entertaining and insightful.
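The wizard protocol described in this section, respond only with pre-recorded prompts and fall back to the default apology for unsupported requests, never improvising, can be expressed as a tiny lookup. The prompt IDs and texts below are hypothetical stand-ins, not the study's actual prompt set:

```python
# Sketch of the wizard's interaction protocol: the human operator may only
# trigger pre-recorded prompts; anything unsupported resolves to the default
# apology rather than an improvised reply. IDs and texts are hypothetical.
PRERECORDED = {
    "greeting": "Welcome back, nice to see you again.",
    "reflective_prompt": "What made today different from yesterday?",
    "apology": "My apologies, this is not an action I can perform.",
}

def wizard_play(requested_id: str) -> str:
    """Resolve a wizard selection to prompt text, enforcing the protocol."""
    return PRERECORDED.get(requested_id, PRERECORDED["apology"])

# An unexpected user request maps to the default apology:
print(wizard_play("tell_me_a_joke"))
```

Constraining the wizard to a closed prompt set is what keeps the simulated agent consistent across sessions, at the cost of the awkward repeated apologies observed in the tests.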
6 Limitations Our results offer contributions to support design researchers in using conversational technologies to design interactive journaling experiences. However, the study comes with limitations. It included a small group of participants for the initial interviews and prototype testing and was conducted over a period of ten weeks. Due to the relatively small sample size, the results may not be representative of a larger population. Furthermore, the interaction design was mostly limited to verbal interactions. We believe that this study can be expanded to include other interaction modalities, such as tactile, bodily, and auditory interactions. To strengthen the empirical results, we encourage further user research with a broader and more diverse group of participants.
7 Final Discussions We have presented part of a larger project on designing interactive and reflective journaling experiences with AI and, in particular, with conversational agents. We presented the design process, a prototype (Talkus AI-relius), and the results of the test sessions. Our results provide insight into four distinct qualities of CA journaling experiences: the interactive, reflective, conversational, and collaborative qualities. While some of these can be seen as universal qualities of successful human-CA interaction, the combination of all of them makes for a compelling journaling experience. These insights can help shape journaling experiences in a variety of contexts, including education, mindfulness, mental health, and food journaling, among others. They also sparked our interest in the qualities of design materials in the context of human-CA interactions. The ability to use words and verbal language as primary design materials when designing for the above qualities (interactive, reflective, conversational, and collaborative) is the focus of this discussion section.
First, we reflect on our roles and approaches as interaction designers and researchers when designing interactions and experiences between humans and CAs. The main challenge here is that the user interface consists mostly of words rather than traditional visual UI elements. Therefore, rather than designing visual user interfaces, a conversational interface requires ideating, prototyping, and testing user interfaces with words, language, and conversation as design materials. Returning to Schön's famous concept, interaction designers need to engage with words and dialogues and enter into reflective conversations with them as the materials of the situation [45, 46]. This reflective practice supports the construction of appropriate and meaningful interactions and experiences. Second, just as interaction designers typically scrutinize the graphical elements and visual communication of a user interface, CA interaction designers must examine and iterate on word choice, intents, utterances, and nuances to create dialogues and conversations. The practice of choosing the right words, dialogue, and conversation requires the designer to write out and listen to word choices and language. This practice helps interaction designers simulate the conversation and imagine how it would play out in the real world. We anticipate that this creative, thoughtful, and iterative process of engaging with words as design material can help design better, more transparent, and fairer CA systems by catching language bias early and in new ways. Third, this study gives us another perspective on how UI affordances relate to CA interaction. In general, it is recommended to design graphical user interfaces (GUIs) so that there is only one possible way of operating for a specific purpose, e.g., by using consistency as a design principle [10, 48–50].
However, designing interaction with a CA requires understanding that people can use words and conversations in many different ways: there are many ways to say something and express an intent. There are, in a sense, infinite ways for a user to approach interacting with a CA. While it is not possible to design for infinite ways of interacting, interaction designers need to know and collect the most common and appropriate words and language spoken or used in conversations in a given context, e.g., in interactive journaling. In terms of design methods, we found that sketching the conversational agent's personality and utterances with words was an appropriate approach to visualizing how the interaction would play out and how people would engage with the CA. Using words when sketching conversational interfaces was beneficial because the aim was to describe the conversation and the qualities of the interactions between the human and the CA. As a result, the sketches were less about visual images and more about written descriptions and verbal meanings (Figs. 1 and 2). Still, sketching the interaction and relationship between the human and the CA in words made it possible to externalize, visualize, iterate on, and improve the idea. Beyond words, it is beneficial for interaction designers to learn to design conversational interfaces using the material qualities of AI in general, such as predictability, adaptability, and agency. These qualities affect the interaction and relationship between the user and the agent; e.g., agency directly influences the power dynamics of the human-CA relationship, as it determines the personality, the level of participation, and the initiative. The real design challenge here is thus designing conversational agent interactions
through reflective conversations [47] with a variety of situational materialities: words, language, and dialogue as primary materials, and predictability, adaptability, and agency as secondary materials. Acknowledgements. We would like to express our gratitude to the people who participated in the user studies.
Appendix A. Examples of Key CA Behaviors and Prompts

Situation: Meeting someone for the first time
CA behavior: Greet the user, introduce yourself, and briefly explain what to expect
CA prompt: "Hello, my name is Talkus AI-relius and I'm your daily companion. What is your name?"

Situation: Talking to someone familiar
CA behavior: Greet the user by name and express happiness to see them
CA prompt: "Welcome back [name], nice to see you again. Today is May 10, 2022. Would you like to keep a journal?"

Situation: Being asked for help
CA behavior: Proceed to explain to the user what they can do, if needed
CA prompt: "Okay, great. You can keep a journal in two different ways. The first way is to write in the journal with the pen, and the second way is to speak to me with your voice. Are you ready to start our first journaling session together?"

Situation: Being asked to help with something it cannot do
CA behavior: Apologize, say it cannot do it, and suggest an alternative
CA prompt: "My apologies, this is not an action I can take. Would you like me to suggest an alternative course of action?"

Situation: Being interrupted
CA behavior: Stop what it is doing, wait and listen to what the user says, and act on the newly proposed task
CA prompt: "...should I run [task name recognized from user interaction]?"

Situation: Making an error
CA behavior: Apologize and ask for clarification or repetition
CA prompt: "Sorry, I don't understand. Could you explain or repeat that?"

M. Angenius and M. Ghajargar    166

Situation: Correcting someone
CA behavior: Ask the user whether they meant "x"
CA prompt: "Did you mean [correction]?"

Situation: Being asked a question it cannot answer
CA behavior: Apologize and tell the user
CA prompt: "Sorry, I don't know; that is a question I cannot answer."

Situation: Being asked a personal question
CA behavior: Explain that it has no opinion of its own, but can provide an outside perspective if the user so desires
CA prompt: "I don't have a personal opinion. I'm here to help you process your opinions, thoughts and desires. But if you'd like, I can bring in an outside perspective."

Situation: Being asked an inappropriate question
CA behavior: Explain that this is not something it discusses, and encourage the user to explore their thoughts, feelings, and points of view
CA prompt: "I don't discuss these types of topics, but I encourage you to explore your thoughts, feelings and perceptions in your journal."
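The behavior-prompt pairs above can be read as a simple dialogue policy. As an illustrative sketch only (the paper provides the design table, not an implementation; the dictionary structure, situation keys, and function below are our own assumptions), such a mapping could be encoded as:

```python
# Hypothetical encoding of the Appendix A situation/behavior/prompt table.
# Prompt texts follow the appendix; keys and structure are illustrative.

CA_POLICY = {
    "first_meeting": {
        "behavior": "Greet the user, introduce yourself, explain what to expect",
        "prompt": ("Hello, my name is Talkus AI-relius and I'm your daily "
                   "companion. What is your name?"),
    },
    "returning_user": {
        "behavior": "Greet the user by name and express happiness to see them",
        "prompt": ("Welcome back {name}, nice to see you again. "
                   "Would you like to keep a journal?"),
    },
    "error": {
        "behavior": "Apologize and ask for clarification or repetition",
        "prompt": "Sorry, I don't understand. Could you explain or repeat that?",
    },
}

def respond(situation: str, **slots) -> str:
    """Return the CA prompt for a situation, filling slots such as {name}.

    Unknown situations fall back to the error prompt, mirroring the
    'making an error' row of the table.
    """
    entry = CA_POLICY.get(situation)
    if entry is None:
        return CA_POLICY["error"]["prompt"]
    return entry["prompt"].format(**slots)
```

A fuller policy would cover all ten situations and select them from recognized user intents rather than explicit keys.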
Appendix B. The Interaction Design Wheels
Appendix C: Combinations of Interaction Elements

Sketch no.  Combination of elements
24  Fast, tool, continuity, control, secondary orality, R2 dialogic reflection, continuous human-AI interaction
25  Fluid, media, movement, tool, oral culture, R3 transformative reflection, continuous human-AI interaction
26  Directed, dialogue partners, state, embodiment, literate culture, R0 descriptive reflection, proactive human-AI interaction
27  Unified, aesthetic experience, rhythm, experience, secondary orality, R1 reflective description, proactive human-AI interaction
28  Constant, tool, connectivity, embodiment, oral culture, R0 description, proactive human-AI interaction
29  Rough, dialogue partners, movement, control, secondary orality, R4 critical reflection, intermittent human-AI interaction
30  Powerful, aesthetic experience, rhythm, control, literate culture, R2 careful description (forgot to include human-AI interaction)
31  Slow, aesthetic experience, proximity, transfer, literate culture, R0 description, proactive human-AI interaction
32  Immediate, aesthetic experience, continuity, ideal behavior, oral culture, R1 careful description, proactive human-AI interaction
33  Rough, dialogue partners, speed, control, secondary orality, R2 dialogic reflection, intermittent human-AI interaction
34  Volatile, system, movement, ideal behavior, literate culture, R0 description, intermittent human-AI interaction
35  Precise, system, connectivity, ideal behavior, oral culture, R2 dialogic reflection, proactive human-AI interaction
36  Random, system, state, control, oral culture, R2 dialogic reflection, intermittent human-AI interaction
37  Soft, media, depth of time, embodiment, literate culture, R4 critical reflection, proactive human-AI interaction
38  Late, media, directness, ideal behavior, literate culture, R0 description, continuous human-AI interaction
39  Fast, tool, continuity, control, secondary orality, R2 dialogic reflection, continuous human-AI interaction
References
1. Blake, T.K.: Journaling; an active learning technique. Int. J. Nurs. Educ. Scholarsh. 2 (2005). https://doi.org/10.2202/1548-923X.1116
2. Kocielnik, R., Xiao, L., Avrahami, D., Hsieh, G.: Reflection Companion: a conversational system for engaging users in reflection on physical activity. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2, 70:1-70:26 (2018). https://doi.org/10.1145/3214273
3. Hubbs, D.L., Brand, C.F.: The paper mirror: understanding reflective journaling. J. Exp. Educ. 28, 60-71 (2005). https://doi.org/10.1177/105382590502800107
4. Hughes, H.W., Kooy, M., Kanevsky, L.: Dialogic reflection and journaling. Clearing House: J. Educ. Strat. Issues Ideas 70, 187-190 (1997). https://doi.org/10.1080/00098655.1997.10544193
5. Deibel, D., Evanhoe, R.: Conversations with Things: UX Design for Chat and Voice. Rosenfeld Media (2021)
6. Yang, X., Aurisicchio, M., Baxter, W.: Understanding affective experiences with conversational agents. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-12. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290605.3300772
7. Amershi, S., et al.: Guidelines for human-AI interaction. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-13. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290605.3300233
8. Kocielnik, R., Avrahami, D., Marlow, J., Lu, D., Hsieh, G.: Designing for workplace reflection: a chat and voice-based conversational agent. In: Proceedings of the 2018 Designing Interactive Systems Conference, pp. 881-894 (2018)
9. Luger, E., Sellen, A.: "Like having a really bad PA": the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5286-5297 (2016)
10. Norman, D.A.: How might people interact with agents? Commun. ACM 37, 68-71 (1994)
11.
Rozendaal, M.C., Ghajargar, M., Pasman, G., Wiberg, M.: Giving form to smart objects: exploring intelligence as an interaction design material. In: Filimowicz, M., Tzankova, V. (eds.) New Directions in Third Wave Human-Computer Interaction: Volume 1 - Technologies, pp. 25-42. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73356-2_3
12. Wiberg, M.: The Materiality of Interaction: Notes on the Materials of Interaction Design. The MIT Press, Cambridge (2018)
13. Ghajargar, M., Bardzell, J., Lagerkvist, L.: A redhead walks into a bar: experiences of writing fiction with artificial intelligence. In: 25th International Academic Mindtrek Conference, pp. 230-241. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3569219.3569418
14. Wärnestål, P.: Design av AI-drivna tjänster. Studentlitteratur AB (2021)
15. Ghajargar, M., Wiberg, M., Stolterman, E.: Designing IoT systems that support reflective thinking: a relational approach. Int. J. Des. 12, 21-35 (2018)
16. Pirzadeh, A., He, L., Stolterman, E.: Personal informatics and reflection: a critical examination of the nature of reflection. In: CHI'13 Extended Abstracts on Human Factors in Computing Systems, pp. 1979-1988 (2013)
17. Terzimehić, N., Schött, S.Y., Bemmann, F., Buschek, D.: MEMEories: internet memes as means for daily journaling. In: Designing Interactive Systems Conference 2021, pp. 538-548. Association for Computing Machinery, New York (2021). https://doi.org/10.1145/3461778.3462080
18. Ong, L.T.R., Pemberton, R.: Enhancing classroom learning through computer-mediated reflective writing and peer feedback. J. Mod. Lang. 19, 99-120 (2009)
19. Tholander, J., Normark, M.: Crafting personal information: resistance, imperfection, and self-creation in bullet journaling. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-13. Association for Computing Machinery, New York (2020)
20. Zhang, Y., Parker, A.G.: Eat4Thought: a design of food journaling. In: Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-8. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3334480.3383044
21. Ghajargar, M., et al.: From "explainable AI" to "graspable AI". In: Proceedings of the Fifteenth International Conference on Tangible, Embedded, and Embodied Interaction, Salzburg, Austria, pp. 1-4. Association for Computing Machinery (2021). https://doi.org/10.1145/3430524.3442704
22. van Berkel, N., Skov, M.B., Kjeldskov, J.: Human-AI interaction: intermittent, continuous, and proactive. Interactions 28, 67-71 (2021). https://doi.org/10.1145/3486941
23. Kulesza, T., Burnett, M., Wong, W.-K., Stumpf, S.: Principles of explanatory debugging to personalize interactive machine learning. In: Proceedings of the 20th International Conference on Intelligent User Interfaces, pp. 126-137 (2015)
24. Cronholm, S., Göbel, H.: Design principles for human-centred AI. In: ECIS 2022 Research Papers (2022)
25. Angenius, M., Ghajargar, M.: Design principles for an interactive and reflective journal with AI. In: Computing Conference 2023. Springer, London (2023)
26. Hall, E.: Conversational Design. A Book Apart (2018)
27. Allen, J.F., Byron, D.K., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A.: Toward conversational human-computer interaction. AI Mag. 22, 27 (2001). https://doi.org/10.1609/aimag.v22i4.1590
28. Janlert, L.-E., Stolterman, E.: Faceless interaction: a conceptual examination of the notion of interface: past, present, and future. Hum.-Comput. Interact. 30, 507-539 (2015). https://doi.org/10.1080/07370024.2014.944313
29. Gaver, W.: What should we expect from research through design?
In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, Austin, Texas, USA, pp. 937-946. Association for Computing Machinery (2012). https://doi.org/10.1145/2207676.2208538
30. Angenius, M.: Talkus AI-relius: an interactive journaling artifact that supports reflection through conversation (2022)
31. Lee, M., Ackermans, S., van As, N., Chang, H., Lucas, E., IJsselsteijn, W.: Caring for Vincent: a chatbot for self-compassion. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-13. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3290605.3300932
32. Sanders, E., Stappers, P.: Convivial Toolbox: Generative Research for the Front End of Design. BIS, Amsterdam (2012)
33. Lenz, E., Diefenbach, S., Hassenzahl, M.: Exploring relationships between interaction attributes and experience. In: Proceedings of the 6th International Conference on Designing Pleasurable Products and Interfaces, pp. 126-135. Association for Computing Machinery, New York (2013). https://doi.org/10.1145/2513506.2513520
34. Petersen, M.G., Iversen, O.S., Krogh, P.G., Ludvigsen, M.: Aesthetic interaction: a pragmatist's aesthetics of interactive systems. In: Proceedings of the 5th Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, pp. 269-276 (2004)
35. Lim, Y., Stolterman, E., Jung, H., Donaldson, J.: Interaction gestalt and the design of aesthetic interactions. In: Proceedings of the 2007 Conference on Designing Pleasurable Products and Interfaces, pp. 239-254. Association for Computing Machinery, New York (2007). https://doi.org/10.1145/1314161.1314183
36. Hornbæk, K., Oulasvirta, A.: What is interaction? In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 5040-5052. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3025453.3025765
37. Fleck, R., Fitzpatrick, G.: Reflecting on reflection: framing a design landscape. In: Proceedings of the 22nd Conference of the Computer-Human Interaction Special Interest Group of Australia on Computer-Human Interaction, pp. 216-223. Association for Computing Machinery, New York (2010). https://doi.org/10.1145/1952222.1952269
38. Wheel Decide. https://wheeldecide.com/. Accessed 15 Aug 2022
39. Navas, E., Gallagher, O., Burrough, X.: Keywords in Remix Studies. Routledge, Milton Park (2017)
40. Houde, S., Hill, C.: What do prototypes prototype? In: Helander, M.G., Landauer, T.K., Prabhu, P.V. (eds.) Handbook of Human-Computer Interaction, 2nd edn., pp. 367-381. North-Holland, Amsterdam (1997). https://doi.org/10.1016/B978-044481862-1.50082-0
41. Buchenau, M., Suri, J.F.: Experience prototyping. In: Proceedings of the 3rd Conference on Designing Interactive Systems: Processes, Practices, Methods, and Techniques, pp. 424-433. Association for Computing Machinery, New York (2000). https://doi.org/10.1145/347642.347802
42. van Boeijen, A., Daalhuizen, J., Zijlstra, J.: Delft Design Guide: Perspectives, Models, Approaches, Methods. BIS Publishers, Amsterdam (2020)
43. Nakamura, K., Feng, H., Priss, S., Mei, H.: Designing for night-time reflection: how to support night-time reflection through non-digital means. In: 39th ACM International Conference on Design of Communication, pp. 386-388 (2021)
44. Ghajargar, M., Wiberg, M.: Thinking with interactive artifacts: reflection as a concept in design outcomes. Des. Issues 34, 48-63 (2018). https://doi.org/10.1162/DESI_a_00485
45. Ghajargar, M., De Marco, A., Montagna, F.: "Wise things": when smart objects make you think.
In: Proceedings of the 11th International Conference on Human-Computer Interfaces and Interaction, pp. 233-238 (2017)
46. Baumer, E.P.S., Khovanskaya, V., Matthews, M., Reynolds, L., Schwanda Sosik, V., Gay, G.: Reviewing reflection: on the use of reflection in interactive system design. In: Proceedings of the 2014 Conference on Designing Interactive Systems, pp. 93-102. Association for Computing Machinery, New York (2014). https://doi.org/10.1145/2598510.2598598
47. Schön, D.A.: The Reflective Practitioner: How Professionals Think in Action. Routledge, Milton Park (1983). https://doi.org/10.4324/9781315237473
48. Nielsen Norman Group: 10 usability heuristics for user interface design. https://www.nngroup.com/articles/ten-usability-heuristics/. Accessed 9 Sept 2022
49. Hutchins, E.L., Hollan, J.D., Norman, D.A.: Direct manipulation interfaces. Hum.-Comput. Interact. 1, 311-338 (1985). https://doi.org/10.1207/s15327051hci0104_2
50. Norman, D.: The Design of Everyday Things: Revised and Expanded Edition. Basic Books, New York (2013)
Expanding the Scope of Clarification Responses for a Kitchen Assistant

Gina E. M. Stolwijk and Florian A. Kunneman

Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands

Abstract. In conversational genres such as instruction-giving, the clarification questions a user asks may relate to the task at hand or to common sense about the task domain, whereas most conversational agents focus on only one of these types. To learn more about the best approach to, and feasibility of, integrating the two question types, we experimented with different approaches for modeling common-sense and task-specific questions, and for distinguishing between them, in the context of a kitchen assistant. We then integrated the best-performing approaches into a conversational agent, which we tested in a study with six users cooking a recipe. Although all three components performed well in isolation and all participants completed the recipe, the accuracy of the question answering was relatively low (66%). We conclude with a discussion of the aspects that need improvement to meet the diverse information needs of users of task-based conversational agents.

Keywords: Clarification processing · User study
· Kitchen assistant · Natural language

1 Introduction
Today's task-based conversational agents are primarily designed to generate responses to direct knowledge questions, where the context is represented by the relevant knowledge underlying these questions and is leveraged to determine user intents, fill slots, and respond to follow-up questions [25,27]. Typical examples of such task domains are restaurant booking and customer service agents. In contrast, information requests in conversational genres such as instruction-giving are usually grounded in the task, and both task-related concepts and the task state are important for a correct interpretation of the request. For example, an agent giving instructions in a virtual environment would need to consider the user's current location and view, as well as the types and characteristics of objects and their spatial relations [13]. In this paper, we examine the challenges of modeling information-seeking dialogue as part of cooking instructions, a genre in which both common-sense and task-related knowledge are required from the agent to best support the user.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 171-189, 2023. https://doi.org/10.1007/978-3-031-25581-6_11

To answer the many questions that
172    G. E. M. Stolwijk and F. A. Kunneman
may be asked during the cooking instruction dialogue, the agent needs to incorporate extensive knowledge about recipes and cooking, know which of the two knowledge types to consult when faced with an information request, and decide whether a question has an adequate answer at all. Such an application therefore carries a higher risk of confusing one query with another and giving a wrong answer to a question, which is why most task-based conversational agents limit the scope of the queries they handle to task-related questions only (e.g., [12,23]) or general domain questions only (e.g., [25]). Studies that deployed conversational agents with both task-specific and domain knowledge did not provide insight into the quality of responses to clarification questions when such a system converses with a user. [29] created a dataset consisting of annotated cooking recipes and cooking instruction dialogues based on these documents, and limited their evaluation to performance on this dataset. Participants in the recent Alexa TaskBot Challenge [1], which integrates a variety of knowledge sources, such as Howdy Y'all [2], GRILLBot [14], and Miutsu [20], focused their evaluation solely on user satisfaction. In contrast, in our study we propose detailed heuristics for consulting a specific knowledge source during the conversation, and we carry out a user study in which we evaluate the conversations to determine the accuracy of the answers and identify the types of errors made. We address the following research questions: RQ1) How can a task-based conversational agent distinguish task-specific questions from general domain questions?
1.
We developed and evaluated a set of approaches for modeling general domain knowledge and task-specific knowledge in the cooking domain, divided into: (a) answering cooking-related questions based on question-answer pairs from a community question-answering platform; (b) extracting knowledge from cooking recipes based on heuristics, and segmenting the recipe procedure into conversational steps; (c) classifying given questions as task-related or common-sense. 2. We integrated these approaches into a conversational agent and formulated a set of heuristics that allow this agent to consult the correct knowledge module during conversation. RQ2) To what extent can a task-based conversational agent differentiate between general domain questions and task-specific questions in a real-world setting? To answer this question, we evaluated the conversational agent through a user study in which users actually cooked while talking to the agent. All conversations were analyzed specifically to determine how well the agent responded to user queries.
2 Related Work

2.1 Modeling Knowledge for Conversational Agents
Surfacing the right knowledge at the right time is a key aspect of the success of many task-based conversational agents, where the nature of the task defines
Expanding the Scope of Clarification Responses for a Kitchen Assistant    173
the requirements on their knowledge component. For tasks that aim to fulfill a request (for example, bank transfers or restaurant bookings), the most common approach is to train a model on dialogues annotated for a predefined set of slots and values [25]. Another common task is conversational search, where the correct answer to a user query is matched in conversation with the user. As such agents must accommodate a wide range of questions, large datasets are typically used [33], with retrieval based on question-to-question similarity (i.e., example-based) [32]. For tasks that require the user or agent to complete a sequence of steps, such as navigation [13], document retrieval [12], or cooking [23,29], the agent needs a thorough representation of the important concepts and their relations in separate documents. Eliciting this required knowledge can be approached as a reading comprehension task [6] or by converting documents into a dedicated meaning representation for the task at hand [7,21]. In our approach, we use an example-based approach to model general domain knowledge, and parse recipes for a specific set of information elements to model task-based knowledge.
2.2 Distinguishing Between Different Knowledge Sources
The main challenge for conversational systems that must accommodate a wide range of questions is the large search space, which increases the probability that a question is confused with a similar one. A common approach is to use a module that first classifies a question by its domain [2] or type [31]. Taking a different approach, [28,29] restricted the common-sense knowledge of their kitchen agent to a set of predefined topics (e.g., substituting ingredients, using kitchen utensils), created background knowledge bases for each topic, and used a set of custom rules and actions to select the right knowledge source to answer a question. We apply question-type classification and a set of heuristics to consult the correct knowledge module (task-specific or common-sense) in our system. In contrast to [29], the common-sense database we use covers a broader spectrum of cooking-related questions. In addition, we carried out a user study focused on the quality of the answers to questions, which was not done in previous studies.
3 Expanding the Scope of Clarification Responses
We distinguish two general types of clarification questions that can be asked during cooking instructions: common-sense questions and task-specific questions. For both types, we wanted to draw on data to ensure broad coverage, and we investigated how well the two can be distinguished to reduce the likelihood of confusion during conversation. Below, we describe for each of these three subtasks the experimentation conducted to identify the best-performing method. We focus our study on the Dutch language, but most of the methods we use are applicable to other languages as well.
3.1 Answering Common-Sense Questions
The common-sense question answering task is formulated as finding, in a large database of QA pairs, the best answer to a general user question in the cooking domain. We experimented with two approaches to modeling question similarity. Approach 1) Word2vec. The first algorithm aimed to find the sentence(s) in the database with the highest similarity to the user query using a Word2vec model [22], implemented with the Dutch pipeline nl_core_news_lg in SpaCy [16].1 Separate token embeddings were computed for each sentence in the database using Word2vec and averaged to obtain a sentence embedding. A query received from the user was represented in the same way and compared to all questions in the database using cosine similarity, choosing the pair with the highest similarity as the best match. Approach 2) Sentence-BERT. The second algorithm finds the database sentence(s) closest to the user query using a Sentence-BERT model trained on Dutch data.2 BERT (Bidirectional Encoder Representations from Transformers) has been shown to provide state-of-the-art performance on a wide range of tasks [10]. Sentence-BERT computes the embedding of each sentence separately and then compares them using a similarity metric [26]. Each sentence was mapped onto a 768-dimensional vector space using mean pooling of the contextual token embeddings with an attention mask. The embedding of the user query was compared to the embedding of each question in the database using cosine distance, choosing the sentence with the smallest distance as the best match. Dataset and pre-processing. We chose a general online community question answering (CQA) platform for the purpose of covering common knowledge about cooking. We downloaded 10,000 questions from the Dutch community question-answering platform Goeievraag,3 categorized under the tag "food and drink". Each QA pair consisted of a user question and the most popular or first answer given.
We pre-processed the database questions by applying a CNN-based part-of-speech tagger using SpaCy.4 Stop words5 were also removed, after which each question was vectorized using one of the two approaches. Evaluation. We conducted a controlled experiment by manually selecting seventy questions from the database and testing how well the two
1 https://spacy.io/models/nl#nl_core_news_lg
2 Yegorkitskerkin/bert-base-dutch-cased-snli
3 https://www.startpagina.nl/v/eten-drinken/
4 https://spacy.io/api/tagger
5 Using the following stop word list: https://github.com/explosion/spaCy/blob/master/spacy/lang/nl/stopwords.py
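The retrieval scheme of Sect. 3.1, averaging token embeddings into a sentence embedding, comparing by cosine similarity, and declining to answer below a threshold, can be sketched as follows. The toy vectors and QA pairs are invented stand-ins; a real system would load the Dutch word2vec or Sentence-BERT embeddings named above.

```python
import numpy as np

# Hypothetical toy word vectors standing in for the Dutch embeddings
# used in the paper; the words and values are illustrative only.
VECTORS = {
    "bake": np.array([1.0, 0.1, 0.0]),
    "oven": np.array([0.9, 0.2, 0.1]),
    "boil": np.array([0.0, 1.0, 0.2]),
    "egg":  np.array([0.1, 0.9, 0.3]),
}

def embed(sentence: str) -> np.ndarray:
    """Average the token embeddings of a sentence (word2vec-style)."""
    tokens = [t for t in sentence.lower().split() if t in VECTORS]
    if not tokens:
        return np.zeros(3)
    return np.mean([VECTORS[t] for t in tokens], axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def best_match(query: str, qa_pairs, threshold: float = 0.28):
    """Return the answer of the most similar stored question,
    or None if the best similarity falls below the threshold."""
    scored = [(cosine(embed(query), embed(q)), a) for q, a in qa_pairs]
    score, answer = max(scored)
    return answer if score >= threshold else None
```

The 0.28 threshold mirrors the cut-off adopted later in this section; the Sentence-BERT variant would replace `embed` with the model's pooled sentence encoding.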
Table 1. Proportion of correct answers per algorithm and rank when retrieving the correct common-sense cooking QA pair.

Rank    Word2vec                  Sentence-BERT
1       0.47                      0.66
2       0.09 (of 53% = 0.05)      0.24 (of 34% = 0.08)
3       0.06 (of 48% = 0.03)      0.20 (of 26% = 0.05)
Total   0.55                      0.79
approaches were at retrieving each of these questions when presented with a paraphrased version. We varied the database size, up to a maximum of 10,000. The manual selection was carried out by one of the authors. To ensure that a variety of questions could be assessed, the selected questions were evenly distributed over seven question categories (see Appendix A for details). To generate seed queries for each of the 70 selected questions, we used a combination of two techniques: back-translation [11,18] with deep-translator [3] and paraphrasing [4] with the Parrot library [9]. The augmenter was set to produce a maximum of 10 paraphrases by taking the input utterance and its annotations (intents, slots, slot types) and augmenting them [9]. To ensure that the generated sentences had the same meaning as the original question but used different words, one of the authors manually checked each generated sentence and removed deviating paraphrases. In the end, an average of almost four sentences remained per original question (M = 3.97, SD = 1.91), 278 in total. Grammatically incorrect sentences were not removed, since end users may also pose poorly worded queries. The seed queries were pre-processed in the same way as the database questions. For both approaches (Word2vec and Sentence-BERT), we tested how well they could retrieve the correct database question at different database sizes. The approaches were presented with each of the 278 reformulated query variants. The proportion of correct retrievals (i.e., matching the exact original, non-paraphrased question) was calculated as the score. To measure how close an approach came, this score was calculated for the correct answer ranked 1, 2, or 3. Effects were measured by performing an ANOVA with a dichotomous (correct/incorrect) dependent variable, assessing the main effects of, and the interaction between, the independent variables "database size" and "algorithm".
The three highest ranks were combined: if the best match was incorrect, the second-best match was also considered; if the second-best match was incorrect, the third-best match was also considered. Results. The results per approach, database size, and rank are shown in Table 1. There was no significant interaction between database size and algorithm with respect to answer correctness (F(2) = 1.93, p = 0.15). A second ANOVA considering only the main effects showed that the best-performing algorithm was Sentence-BERT, F(1) = 177.56, p < 0.001, and that changing the database size did not significantly affect performance (F(2) = 0.62, p = 0.54).
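The rank-based scoring behind Table 1, where a retrieval counts as correct at rank k if the original (non-paraphrased) question appears among the top k candidates, can be sketched as a small helper. The data structures and names are our own assumption, not the authors' code.

```python
# Sketch of accuracy-at-rank scoring (assumed structure, not the paper's code):
# `rankings` maps a query id to its ranked list of retrieved questions,
# `gold` maps a query id to the original question it was paraphrased from.

def accuracy_at_rank(rankings: dict, gold: dict, k: int) -> float:
    """Fraction of queries whose gold question is within the top-k retrieval."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)
```

Applied to the 278 paraphrased variants, k = 1 gives the first row of Table 1 and k = 3 the combined "total" row.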
The Sentence-BERT approach, considering all three ranks, retrieves a correct question for about 80% of the queries. If the first retrieved QA pair is incorrect, the probability of finding a correct answer in second position drops considerably, to 0.24. In a conversational setting, these results are not reliable enough to present the user with an alternative answer (e.g., the second-ranked QA pair) when the originally retrieved result is unsatisfactory. Sentence-BERT showed the best results, with most incorrect responses associated with similarity scores below 0.28. This approach was implemented in the kitchen assistant to answer common-sense questions in the cooking domain, using a threshold of 0.28 below which no result is returned.

3.2 Answering Task-Specific Questions
A kitchen assistant needs a sufficient number of recipes to guide the user; these are abundantly available on web-based cooking platforms. The approach to modeling recipe-specific knowledge strongly depends on the specific recipe to be modeled and, in particular, on how the recipe is formatted. We identified the most important elements of a recipe: the recipe name, the number of persons it serves, the estimated cooking time, the ingredients and their quantities, the kitchen utensils, and the procedure. Part of the procedure we present below involves heuristics specific to the website from which we extracted recipes and their ingredients (Smulweb6). The extraction of the other types of information can be generalized to any cooking recipe website, namely separating quantity, unit, and ingredient, and dividing the recipe procedure into steps suitable for a conversational interface. Heuristics. Some of the heuristics were based on numerals and grammatical information, for which we used the SpaCy POS tagger and lemmatizer trained on the Dutch pipeline nl_core_news_lg.7 An example of the recipe layout and the type of information extracted is shown in Fig. 1.
Fig. 1. Breaking down recipe ingredients using part-of-speech tags and heuristics.
6 https://smulweb.nl/
7 https://spacy.io/models/nl#nl_core_news_lg
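A minimal sketch of the quantity/unit/ingredient split illustrated in Fig. 1. The unit list and the numeric pattern below are illustrative assumptions; the authors' actual heuristics also draw on POS tags rather than a fixed unit list.

```python
import re

# Illustrative Dutch/English unit abbreviations ("el"/"tl" are spoon units);
# a real system would use a fuller list plus POS information.
UNITS = {"g", "kg", "ml", "l", "el", "tl", "tbsp", "tsp"}

def parse_ingredient(line: str):
    """Split an ingredient line such as '200 g flour' into
    (quantity, unit, ingredient); missing parts come back as None."""
    tokens = line.strip().split()
    quantity = unit = None
    # Leading numeral (optionally with a decimal part) is the quantity.
    if tokens and re.fullmatch(r"\d+([.,]\d+)?", tokens[0]):
        quantity = tokens.pop(0)
    # A known unit abbreviation may follow the quantity.
    if tokens and tokens[0].lower() in UNITS:
        unit = tokens.pop(0)
    return quantity, unit, " ".join(tokens)
```

For ingredient lines without a unit ("2 eggs") the unit slot simply stays empty, matching the layout shown in Fig. 1.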
Extracting steps from the recipe. The task of extracting recipe steps can be formulated as a segmentation task, where each segment should be suitable for presentation as a single instruction in a conversational setting. Following [17], we define single-utterance steps as utterances in which a single action is performed. One difficulty is that a single sentence can contain multiple actions. To identify the cooking actions and accompanying information in a recipe instruction, it was first split into sentences.8 We then generated a dependency tree for each sentence using the NLTK library.9 In parallel, the complete sentence was tagged with the SpaCy POS tagger. Each step (subtree) was assumed to contain at least one verb (i.e., a cooking action) as its root, with the context (ingredients, quantities, utensils, etc.) represented by its children down to the leaves. Whenever a token appeared as the root of a subtree, we checked whether that token was a verb according to the tagged sentence. If both conditions were met (i.e., the token is the root of a subtree and the token is a verb), the subtree was treated as a step. The root of the main tree and the remaining leaves served as a separate step. Since POS tagging models are generally not trained on imperative clauses [7], which are common in cooking instructions, we retrained the tagger as a random forest classifier. The classifier uses various features to find the tag that best fits a token: the token itself, its preceding token, its following token, its prefix (first 3 letters), its suffix (last 3 letters), whether it is at the beginning or end of a sentence, whether it is a number, and whether it contains capital letters. For training, a dataset of 284 sentences (3,851 tokens) was extracted from a variety of Dutch cooking instructions published on Smulweb.
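The feature set listed above for the retrained tagger can be sketched as a feature-extraction function; the feature names and sentence-boundary markers are ours, as the paper specifies the features but not the code.

```python
# Sketch of per-token features for a random-forest POS tagger,
# following the feature list given in the text (names are illustrative).

def token_features(tokens: list, i: int) -> dict:
    tok = tokens[i]
    return {
        "token": tok,
        "prev": tokens[i - 1] if i > 0 else "<s>",     # preceding token
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
        "prefix": tok[:3],                             # first 3 letters
        "suffix": tok[-3:],                            # last 3 letters
        "is_first": i == 0,                            # sentence start
        "is_last": i == len(tokens) - 1,               # sentence end
        "is_number": tok.isdigit(),
        "has_capital": any(c.isupper() for c in tok),
    }
```

Such dictionaries can be one-hot encoded (e.g., with scikit-learn's `DictVectorizer`) and fed to a random forest classifier trained on the corrected tags.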
For this dataset, the SpaCy tagger output was used as a starting point, after which one of the authors manually corrected the tags where necessary. The focus of these corrections was on tokens that were incorrectly (not) tagged as verbs, as verbs anchor the segmentation of the procedure. Evaluation setup. To assess the quality of recipe segmentation with and without the newly trained POS tagger, we extracted ten cooking recipes from Smulweb. One of the authors manually annotated the ten selected recipes to establish a ground truth for evaluation. These were then fed into the various algorithms.10 In addition to recipe segmentation, the recipe information units were also annotated. We found that, using heuristics, this information can be identified with near-perfect accuracy for all ten recipes, which can be explained by the consistent formatting on the platform. We compared the two segmentation approaches against two baselines: a baseline that treated each sentence as a single step, and a semi-random baseline that segmented a sequence at each position with probability 1/k (where k is the average segment length in the ground truth). We represent the outcome of the four approaches by marking each token with either a 0 or a 1, the latter indicating the end of a step.
8 https://spacy.io/api/sentencizer. 9 https://www.nltk.org/_modules/nltk/tree.html. 10 See Appendix C for the annotation guidelines.
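The semi-random baseline and the token-level evaluation can be sketched as follows (a reconstruction of the described setup; the seed and the call parameters are illustrative):

```python
import random

def semi_random_baseline(n_tokens, k, seed=0):
    """Mark each token as a step end (1) with probability 1/k,
    where k is the average ground-truth segment length."""
    rng = random.Random(seed)
    return [1 if rng.random() < 1.0 / k else 0 for _ in range(n_tokens)]

def confusion(gold, pred):
    """Token-level confusion counts over 0/1 step-end labels:
    returns (tn, fp, fn, tp), the cells of a 2x2 confusion matrix."""
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    return tn, fp, fn, tp
```

The sentence baseline corresponds to labeling exactly the sentence-final tokens with 1; the same `confusion` function then produces one 2x2 matrix per approach, as reported in Table 2.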
178
GEM Stolwijk and FA Kunneman
Results. The best performing algorithm was dependency-tree-based segmentation with the regular POS tagger, closely followed by dependency-tree-based segmentation with the retrained tagger (Table 2). The sentence-baseline (Sentencizer) errors were almost all false negatives (i.e., tokens predicted not to be the end of a step when they actually were), while the tree-based segmentations also produced some false positives (i.e., tokens predicted to be the end of a step when they actually were not).

Table 2. Confusion matrices between each approach and the ground-truth segmentation.

Ground truth   Random baseline   Sentence baseline   Tree-based, regular POS   Tree-based, retrained POS
               0      1          0      1            0      1                  0      1
0              1058   159        1216   1            1200   17                 1195   22
1              152    18         63     107          25     145                27     143

3.3
Combining Question Answering Methods
To help a conversational agent distinguish between common-sense and task-specific questions in the cooking domain, we set out to train a machine learning classifier to distinguish between the two. Question type classification. We approach question type classification as a supervised machine learning task with the aim of recognizing the characteristics that distinguish the two types of questions, which then serve as the basis for classifying new instances. Two features were created: the number of tokens and the number of characters in a query. Next, the queries were preprocessed by stemming, stopword removal and lowercasing. Sentence embeddings were then calculated using SentenceBERT [26], yielding a total of 768 features. The five embedding positions with the highest scores on the training data were chosen using the ANOVA F-value between each feature and the label (i.e., indexes 2, 112, 284, 320, 420). Added to the two length-based features, this resulted in seven features. A random forest classifier with 100 estimators was used to classify sentences as belonging to either the general question type or the recipe-specific question type. A total of 359 queries (70% training and 30% evaluation) were selected and written to train and validate the classifier. These consisted of 223 general cooking questions selected from the Dutch food and drink question bank described above, and 136 recipe-specific questions handwritten on the basis of the question types the cooking assistant could answer. Ten splits were created so that average classifier performance could be calculated across different train/test sets. Classifier performance was evaluated using precision, recall and F1 score. We compared the random forest classifier with a majority baseline. Results. The random forest classifier outperformed the majority baseline on the classification task (Table 3). Classifier accuracy was approximately 85% (M = 0.85, SD = 0.01, across the ten train/test splits).
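The feature construction described above can be sketched as follows; we substitute randomly generated vectors for the SentenceBERT embeddings and implement the ANOVA F-value selection directly with NumPy, so this is a sketch of the technique rather than the authors' exact pipeline:

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F-value per feature for a label vector y:
    between-group variance divided by within-group variance."""
    groups = [X[y == c] for c in np.unique(y)]
    grand = X.mean(axis=0)
    k, n = len(groups), X.shape[0]
    between = sum(len(g) * (g.mean(axis=0) - grand) ** 2 for g in groups) / (k - 1)
    within = sum(((g - g.mean(axis=0)) ** 2).sum(axis=0) for g in groups) / (n - k)
    return between / within

def build_features(embeddings, n_tokens, n_chars, y, top_k=5):
    """Concatenate the two length features with the top_k embedding
    dimensions ranked by ANOVA F-value (7 features for top_k=5)."""
    idx = np.argsort(anova_f(embeddings, y))[::-1][:top_k]
    return np.column_stack([n_tokens, n_chars, embeddings[:, idx]]), idx
```

In the paper, the resulting seven-column feature matrix feeds a random forest with 100 estimators; any standard implementation of that classifier can be plugged in on top of `build_features`.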
Table 3. Classifier performance based on 108 test queries.

               General                    Specific                   Macro F1
               Precision  Recall  F1      Precision  Recall  F1
Baseline       0.60       1.00    0.75    0.00       0.00    0.00    0.37
Random forest  0.87       0.91    0.89    0.85       0.79    0.82    0.86
4
Analysis of the quality of clarification responses in real conversations with the cooking assistant
4.1
Conversational agent architecture
To test our QA models in action, we developed a conversational agent with a dialog management component based on the information state update paradigm [30], connected to the Google Dialogflow dialog design interface for natural language understanding and management of the user interface (Fig. 2). We developed Chefbot11 to plan and manage the cooking instruction dialog and to leverage the agent's knowledge of recipes and the cooking domain. A Django12 application was developed to connect Chefbot with Dialogflow.13 In Chefbot, agent moves are specified and linked to preconditions and effects, such as the preceding user intent and the position in the recipe. The information state is updated based on the effects of a move.
Fig. 2. Google Dialogflow interface.
The general flow of the conversation was modeled as follows: after an initial greeting, the agent asks the user which recipe they would like to prepare. Once a choice
11 https://github.com/fkunneman/Chefbot_NCF.git. 12 https://www.djangoproject.com/. 13 https://github.com/fkunneman/smoothbot.
has been made, the agent shows the ingredients and utensils on the screen so that the user can confirm that all preparations have been made. Recipe instruction then begins, where at each guided step the user has the opportunity to request details or clarification, as well as to ask common-sense cooking questions. When the recipe is finished, the agent performs a final check and closes the conversation. 4.2
Select answers for clarification
While answers to common-sense questions could be retrieved directly from the database, recipe-specific information was stored as part of the recipe in a JSON file from which Chefbot could extract it. Based on Section 3.2, the recipes were parsed and converted into a JSON file with the following items: recipe title, number of people, cooking time, ingredients (including unit and quantity), kitchen utensils and recipe steps (consisting of the step's descriptive text, the ingredients used in the step, the amounts of ingredients used in the step, an image and more detailed descriptive texts of the step). When a user query was classified as task-specific, it was matched to one of the predefined intents (Appendix B). Each of the intents required specific information extracted from the JSON file and/or the context of the conversation (e.g., previous steps, ingredients already used).
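A recipe entry in such a JSON file might look as follows; the field names and values here are our own illustration of the items listed above, not Chefbot's actual schema:

```python
import json

# Hypothetical recipe entry reflecting the described items; the real
# field names in Chefbot's JSON file may differ.
recipe = {
    "title": "Spicy Mexican casserole",
    "n_people": 4,
    "cooking_time_minutes": 45,
    "ingredients": [
        {"name": "rice", "quantity": 300, "unit": "gram"},
        {"name": "kidney beans", "quantity": 1, "unit": "can"},
    ],
    "utensils": ["oven dish", "frying pan"],
    "steps": [
        {
            "text": "Preheat the oven to 200 degrees.",
            "ingredients": [],
            "quantities": [],
            "image": None,
            "details": ["Use both top and bottom heat if available."],
        },
    ],
}

print(json.dumps(recipe, indent=2)[:80])
```

Intents such as "how much" or "ingredients not yet used" can then be answered by simple lookups into `ingredients` and `steps`, combined with the conversation context.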
Fig. 3. Decision tree to choose an answer to the user's query based on the context of the conversation, the question type classification and the confidence of the answer.
We defined a decision tree that the agent uses to decide which knowledge source (task-specific or common-sense) to use to answer a query (Fig. 3).
Questions asked before selecting a recipe were answered using the agent's knowledge of the available recipes and the agent's skills. After a recipe was chosen, the user query was evaluated by the question type classifier (see Section 3.3). If the query was classified as general and the distance value was below 0.28 (established as the optimal threshold in Section 3.1), the answer to the best-matching question in the general QA database (based on the SentenceBERT distance approach) was returned to the user. If the user query was classified as recipe-specific and the confidence of the best-matching Dialogflow intent was greater than 0.7 (empirically determined), the best-matching response was returned. If the confidence was too low or the distance value too high, the other answer type was tried, now against a threshold of 0.3. This threshold, lower than the 0.7 described above, was used because it was the value set for the first implemented version of the cooking assistant and it reduced the number of fallback responses. If the confidence was too low or the distance score too high in this case as well, a fallback intent was triggered. Otherwise, the best matching answer was returned. 4.3
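The routing logic can be sketched as follows; the thresholds are taken from the text, but the exact placement of the secondary 0.3 check is our reading of the description, so treat this as an interpretation rather than the authors' implementation:

```python
def choose_answer(state, qtype, general_distance, intent_confidence,
                  general_answer, specific_answer):
    """Sketch of the answer-selection decision tree.
    Lower distance = better general QA match; higher confidence =
    better Dialogflow intent match."""
    if state == "before_recipe_selection":
        # answered from the agent's knowledge of available recipes/skills
        return "recipe_knowledge"
    if qtype == "general":
        if general_distance < 0.28:       # optimal threshold (Sect. 3.1)
            return general_answer
        # general match too weak: fall back to the specific route;
        # the lower 0.3 threshold is applied here (our interpretation)
        if intent_confidence > 0.3:
            return specific_answer
    else:  # classified as recipe-specific
        if intent_confidence > 0.7:       # empirically determined
            return specific_answer
        # specific match too weak: fall back to the general route
        if general_distance < 0.28:
            return general_answer
    return "fallback_intent"
```

The function mirrors Fig. 3: context first, then the question type classifier, then the confidence/distance thresholds, with one cross-over attempt before the fallback intent.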
Experimental design and data analysis.
We conducted a user study to assess the quality of responses to questions asked during recipe instruction. Participants. Six participants prepared a recipe with the cooking assistant. Their mean self-reported cooking skill score was 63.5 (M = 63.50, SD = 17.40, [19]) out of a maximum score of 98. Participants had some prior knowledge of the purpose of the study: to assess the performance of the smart cooking assistant. As the number of participants was relatively small due to limited resources, they were encouraged to ask as many questions as possible. This allowed the small number of participants to nonetheless provide relevant quantitative and qualitative insights into the types of questions asked while cooking a recipe (i.e., in a natural setting), both in terms of task-specific knowledge and general domain knowledge. Procedure. Before using the cooking assistant, participants completed the cooking skills questionnaire, were informed about the main purpose of the study, and were given information about the ingredients and cooking time required to complete the dish. All participants cooked a spicy Mexican casserole, chosen for its linearity (i.e., no temporally overlapping steps) and limited complexity. Each participant cooked the dish in their own kitchen. The experiment was carried out via video call, in which participants were instructed to ask as many questions as possible while preparing the dish. The experimenter shared their screen and typed in all user queries, so participants did not have to install any required software. The cooking assistant's responses were returned to the participants. Right after completing the recipe, each participant was asked to complete the System Usability Scale [5].
Assessment. The quality of clarification responses during the six conversations was measured by manually checking their correctness, from which precision, recall and F1 scores were calculated.

Table 4. Classifier performance on the 293 queries made during recipe instruction.

Label             Precision  Recall  F1 score
General           0.53       0.70    0.61
Specific          0.92       0.85    0.88
Weighted average  0.85       0.82    0.83

4.4
Results
There were 297 queries in total (M = 49.50, SD = 16.93). Of these, 6 were manually marked as "before recipe selection", 57 as "general" and 234 as "specific". This is a relatively small dataset. However, it serves as a first indication of the types of questions users ask in a natural setting, complementing the more artificial setting discussed above. The results of the random forest classifier distinguishing between general and task-specific questions are shown in Table 4. The weighted mean F1 was 0.83 for the random forest classifier. General domain questions were more difficult to identify than task-specific questions. Of the misclassified queries (18%), the heuristic (i.e., maximum distance score or minimum confidence score) corrected 29%. This means that a total of 87% of queries were routed correctly. The impact of the different decision tree components on returning the correct answer is shown in Table 5. 66% of all questions were answered correctly, while 87% were routed correctly. Incorrect answers arose partly because a query was routed to the wrong question answering component (21% of questions were misclassified or failed the threshold-based heuristic) and partly because the algorithm retrieved an incorrect answer or no answer was available in the database. The latter largely applies to the general QA database, where only 56% of the questions had an answer available in the database (had they successfully traversed the decision tree to be retrieved by the SentenceBERT approach). The threshold-based heuristic caught 13 queries that incorrectly did not pass the threshold and 5 queries that were routed to the correct question answering component after incorrect classification. Of the incorrect answers, only 6% related to another part of the recipe and were later used by the participant. In 19% of the cases where the agent was unable to provide a correct answer, a fallback intent was triggered (e.g., "Can you rephrase your question?"). The remaining 74% of wrong answers caused confusion among participants, as an unexpected answer was returned. Only 25% of general cooking questions were answered correctly. Of the remaining 75%, most errors occurred because the question asked was not in the database. A considerably greater proportion, 76%, of recipe-specific questions was answered correctly.
Table 5. Influence of the different conversational agent components on the return of a correct answer.

                                                        General      Specific     Total
                                                        N    P       N    P       N    P
# Questions asked                                       57   1.00    234  1.00    291  1.00
# Questions available in the database                   32   0.56    205  0.88    237  0.81
Questions remaining after question type classification  40   0.70    199  0.85    239  0.82
Questions remaining after decision tree heuristics      37   0.65    194  0.83    231  0.79
Correct answers returned by the QA component            14   0.25    178  0.76    192  0.66
System usability. The average score on the System Usability Scale (0-100) was 82.08 (M = 82.08, SD = 4.85). Three participants rated it above 80, indicating that they liked the system and would likely recommend it to others. The other participants gave scores between 68 and 79, indicating that they felt the agent performed well but still needed improvement.
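The SUS scores reported here follow the standard scoring scheme of [5], which can be computed as:

```python
def sus_score(responses):
    """Standard System Usability Scale scoring: ten 1-5 Likert
    responses; odd-numbered items contribute (response - 1), even-
    numbered items (5 - response); the sum is scaled to 0-100 by
    multiplying by 2.5."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5
```

A participant answering every positively worded item with 5 and every negatively worded item with 1 thus scores 100; all-neutral answers (3) score 50.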
5
Discussion
The main finding of our study is that the approaches we took to answering general domain questions and task-specific questions perform well in a controlled experiment, while performance drops considerably when faced with questions posed by participants carrying out the actual task. Our heuristics for handling user queries, combined with our detailed analysis of the conversations, allow us to identify the two main causes of this performance gap: general QA performance and confusion between common-sense and task-specific questions. In the following, we provide explanations for these causes and discuss their implications for improving conversational systems that cover a knowledge space of similar breadth. Regarding the first cause, a considerable difference was found in performance on general cooking questions, where the F1 score dropped from 0.89 in the controlled experiment to 0.61 in the user study. This is mainly due to a mismatch between the source used for general domain questions, a CQA platform where at least some of the queries are asked out of curiosity, and the needs of users engaged in the act of cooking when asking their questions. An improvement could be achieved by filtering out the types of database queries that are not normally posed during cooking instruction and iteratively adding queries that are not in the database but were posed by users. Alternatively, the use of a CQA platform could be discarded entirely and replaced by a dialog-based dataset, as was done in [29], which limits the number of questions that can be answered.
The second cause, confusion between common-sense and task-specific questions during cooking conversations, resulted in a considerable proportion (21%) of incorrect responses. Part of this can be attributed to the question type classifier, although it performed consistently in both the controlled evaluation (macro F1 = 0.86) and the user study (weighted F1 = 0.83). The heuristic (cosine distance of 0.7) was another factor that resulted in mismatches in some cases. These failed question type categorizations were due in part to incomplete phrasing by the user, which could be remedied with slot filling. Furthermore, recipe-specific questions about the amount of an ingredient were often confused with general questions, which could be improved by omitting ingredient names as a feature for question type classification or by training specialized classifiers to specifically distinguish different queries about recipe ingredients. Finally, the conversational interface itself could be used to prompt the user for confirmation when certain question types are confused, or to rely on the user to perform conversational repair when an incorrect answer is given. [8] show that incorrect responses from an assistant can even help progress a conversation. The system itself, of course, must be adequately equipped to interpret a user's repair following an incorrect response. A final point of discussion is the option of not providing an answer when a question is out of scope. Currently, this option is only activated in the heuristic when a value falls below a threshold, but it is not built in as a core capability. A system that claims to cover a wider knowledge space in a domain should arguably also know what it has no answer to. One way to identify questions that cannot or should not be answered by the agent is to train a classifier on a set of unanswerable questions, as was done in [24].
An important limitation of our study is that we tested the system with only six participants, who were encouraged to ask the agent many questions. This was adequate as an initial investigation, but future work should aim for a larger number of participants in order to draw more significant conclusions, and participants should not be given any assignment beyond preparing the recipe, so as to obtain more insight into the performance of the system and into commonly asked questions. That said, a strength of the current setup is that participants were actually in their own kitchens while talking to the assistant, and the in-depth analysis of the conversations gave a clear indication of the system requirements and challenges involved in expanding the knowledge of the agent. One question the results raise is whether a task-based conversational system should cover the long tail of questions a user might ask, at the expense of confusing more questions with one another. The current study shows that these costs are currently too high, but the gains may be worth it if the rate of wrong answers can be reduced. Our study reveals a number of directions that can be explored further to this end.
6
Conclusion
Our goal was to increase the coverage of clarification answers for a cooking assistant, relying on a community question answering platform for
common-sense questions related to cooking and on a recipe platform to model recipe-specific knowledge. Approaches to answering cooking-related questions based on these data were tested in a controlled setting and in a user study as part of a cooking assistant. The user study results demonstrated that the need to distinguish between common-sense and task-specific questions can result in a considerable proportion of questions not being answered by the correct module. In addition, much can be gained by increasing the coverage of common-sense questions. Testing the quality of clarification responses in a user study was critical to gaining an empirical view of the information-seeking challenges of an instructive conversation across a broad knowledge domain.
A
General Question Answering: Question Types
Questions were sorted into seven categories: 1. Ingredient substitution: what can be substituted for a specific ingredient (e.g., what can I substitute for powdered brown sugar?). 2. Meaning of ingredients: what a certain ingredient is (e.g., what are goji berries?). 3. Difference between ingredients: what the difference is between two (similar) cooking ingredients (e.g., what is baking powder?). 4. Ingredient shelf life: how long an ingredient lasts before it goes bad (e.g., how long does opened organic coconut oil keep?). 5. Ingredient cooking time: how long it takes to cook a certain ingredient (e.g., how long does it take to bake a sweet potato in the oven?). 6. Health: whether an ingredient or its use is healthy (e.g., is it unhealthy to reuse a teabag?). 7. Performing cooking techniques: how to perform a specific cooking technique (e.g., how do I quickly boil water in a pot?).
Table 6. Predefined kitchen units of measure and English equivalents.

NL: gram, gr., g., g, kilogram, kilo, kg., kg, liter, l., l, milliliter, ml., ml, centiliter, cl., cl, deciliter, dl., dl, soup spoon, tablespoon, teaspoon, tl., tl, cup, stack, bag, piece, slice, stick, toss, hand, pinch, pound, ounce, pint, branch, tip
EN: gram, gr., g., g, kilogram, kilo, kg., kg, liter, l., l, milliliter, ml., ml, centiliter, cl., cl, deciliter, dl., dl, soup spoon, tablespoon, teaspoon, tsp., cup, pack, bag, morsel, slice, paste, dash, hand, snuff, pound, ounce, pint, twig, clove
B
Recipe-Specific Question Answering: Intent Types
What recipes. Ask what recipes are available for the user to cook. Choose recipe. Choose which recipe to prepare from the recipes available in Chefbot's database. Confirm recipe. Confirm that the selected recipe should be cooked. Number of people the recipe is intended for. The user can ask the agent how many people the recipe serves. Users can use this information to adjust ingredient amounts based on the number of people they plan to cook for. Estimated preparation time. The estimated time to prepare the recipe was added to the recipe context. This allows the user to ask the agent how long it should take to prepare a specific recipe. Recipe name. Allows the user to ask for the name of the current recipe. Elapsed/remaining time or steps. The user can ask how long they have been working on the recipe or how much time and/or how many steps are left. Such a question could lead to the answer "The preparation should take between 30 and 60 minutes, depending on the recipe. You have already been cooking for 27 minutes. You have completed 14/20 steps." Ingredients not yet used. The user can ask the agent which ingredients were not used in the previous steps. That way they can check whether they are still on track or whether there are ingredients that should have been used but have not been. Also, this feature can be useful when a recipe step says "add all remaining ingredients"; the user can then ask what the remaining ingredients are. Continue. Continue with the next step after completing the current one. Repeat. Repeat the current step. Previous step. Go back to the previous step. Update. Inform the agent that the current step has been completed. Ask for clarification if a step is not clear. The agent can only respond to this if more detailed descriptions of the step have been authored. Accept repair: show gratitude. Thank the agent for the clarification.
Accept repair: understood. Inform the agent that the clarification was understood. How much. Ask how much of an ingredient is needed in the current step. How. Ask for an explanation of how a particular cooking technique should be performed. For this, the technique must be explicitly explained in the recipe. Motivate. Ask why a certain step is necessary. Close recipe. Complete the recipe.
C
Recipe Annotation Guidelines: Ground Truth
Recipe name. The recipe name was the title of the recipe at the top of the page. Number of people. The number of people the recipe is intended for was indicated at the beginning of the list of ingredients as follows: "Ingredients for N people". Category. Found under the heading of an individual recipe on smulweb.nl [15]. Cooking time. Found among the categories of an individual recipe on smulweb.nl [15]. Ingredient list. The ingredients were divided into three parts (where applicable): the amount, the unit of measure and the name of the ingredient. Utensils. For cookware, the list was split according to its format. The most common format was a comma-separated list of utensil names, but other formats, such as bulleted lists, also occurred. Procedure. To structure the procedure, it was first divided into sentences. If a sentence contained more than one cooking action (e.g., boiling water and adding pasta), it was further split into two separate steps. When several ingredients require the same action (e.g., adding onion and garlic), they are kept together in one step.
References 1. Agichtein, E., Maarek, Y., Rokhlenko, O.: Alexa Prize TaskBot Challenge (2022) 2. Alfifi, M., et al.: Howdy Y'all: An Alexa TaskBot (2022) 3. Baccouri, N.: Deep Translator. https://github.com/deep-translator (2020)
4. Beddiar, D.R., Jahan, M.S., Oussalah, M.: Data expansion using back translation and paraphrasing for hate speech detection. Online Soc. Netw. Media 24, 100153 (2021). https://doi.org/10.1016/j.osnem.2021.100153 5. Brooke, J.: SUS: a "quick and dirty" usability scale. Usability Eval. Ind. 189 (1995) 6. Burges, C.J.C.: Towards the machine comprehension of text: an essay (2013) 7. Chang, M., Guillain, L.V., Jung, H., Hare, V., Kim, J., Agrawala, M.: RecipeScape: an interactive tool for analyzing cooking instructions at scale. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM (2018) 8. Cho, J., Rader, E.: The role of conversational grounding in supporting symbiosis between people and digital assistants. Proc. ACM Hum. Comput. Interact. 4(CSCW1), 1-28 (2020) 9. Damodaran, P.: Parrot: paraphrase generation for NLU (2021) 10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) 11. Edunov, S., Ott, M., Auli, M., Grangier, D.: Understanding back-translation at scale. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489-500 (2018). https://doi.org/10.18653/v1/D18-1045 12. Feng, S., Wan, H., Gunasekara, C., Patel, S., Joshi, S., Lastras, L.: doc2dial: a goal-oriented document-grounded dialogue dataset. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8118-8128. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.652 13. Gargett, A., Garoufi, K., Koller, A., Striegnitz, K.: The GIVE-2 corpus of giving instructions in virtual environments. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010) (2010) 14.
Gemmell, C., Fischer, S., Mackie, I., Owoicho, P., Rossetto, F., Dalton, J.: GRILLBot: a flexible conversational agent for solving complex real-world tasks. In: 1st Proceedings of the Alexa Prize TaskBot (2022) 15. Groep, J.: Gelegenheid recepten (2021). https://smulweb.nl/recepten/gelegenheid 16. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength natural language processing in Python (2020). https://doi.org/10.5281/zenodo.1212303 17. Jian, Y., Zaporojets, K., Deleu, J., Demeester, T., Develder, C.: Extracting structured data from recipes using conditional random fields. In: Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 821-826 (2020) 18. Lample, G., Conneau, A.: Cross-lingual language model pretraining. In: 33rd Conference on Neural Information Processing Systems (2019) 19. Lavelle, F., et al.: The development and validation of measures to assess cooking skills and food skills. Int. J. Behav. Nutr. Phys. Act. 14(1), 118 (2017) 20. Lin, Y.T., et al.: Miutsu: NTU's TaskBot for the Alexa Prize. arXiv preprint arXiv:2205.07446 (2022) 21. Maeta, H., Mori, S., Sasada, T.: A framework for interpreting recipe texts, pp. 553-558 (2014) 22. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
23. Neumann, N., Wachsmuth, S.: Recipe enrichment: knowledge needs for a cooking assistant (2021). https://doi.org/10.5220/0010250908220829 24. Rajpurkar, P., Jia, R., Liang, P.: Know what you don't know: unanswerable questions for SQuAD. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 784-789 (2018) 25. Rastogi, A., Zang, X., Sunkara, S., Gupta, R., Khaitan, P.: Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 5, pp. 8689-8696 (2020) 26. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. EMNLP (2019) 27. Stoyanchev, S., Keizer, S., Doddipatla, R.: Action state update approach to dialogue management. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7398-7402 (2021) 28. Strathearn, C., Gkatzia, D.: Chefbot: a novel framework for the generation of commonsense-enhanced responses for task-based dialogue systems. J. ACM 46-47 (2021) 29. Strathearn, C., Gkatzia, D.: Task2Dial: a novel task and dataset for commonsense-enhanced task-based dialogue grounded in documents (2022) 30. Traum, D.R., Larsson, S.: The information state approach to dialogue management. In: van Kuppevelt, J., Smith, R.W. (eds.) Current and New Directions in Discourse and Dialogue, pp. 325-353. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-010-0019-2_15 31. Van-Tu, N., Anh-Cuong, L.: Improving question classification by feature extraction and selection. Indian J. Sci. Technol. 9, 1-8 (2016) 32. Xue, X., Jeon, J., Croft, W.B.: Retrieval models for question and answer archives, pp. 475-482 (2008). https://doi.org/10.1145/1390334.1390416 33. Zaib, M., Zhang, W.E., Sheng, Q.Z., Mahmood, A., Zhang, Y.: Conversational question answering: a survey. Knowl. Inf. Syst. 1-45 (2022)
Context-Aware Chatbot Design for Product Configuration
Tom Niederer1, Daniel Schloss1(B), and Naomi Christensen2
1 Karlsruhe Institute of Technology, Kaiserstrasse 12, 76131 Karlsruhe, Germany
[Email protected]
2 CAS Software AG, CAS-Weg 1 - 5, 76131 Karlsruhe, Germany
Abstract. Product configurators provide an interface for customizing complex products. However, large form-based configurators overwhelm many end users and are often considered expert tools. This article therefore addresses the problem of complexity in today's product configurators. Because chatbots can respond flexibly to requests and offer a natural language interface, they have the potential to simplify the configuration process. In this paper, we present a product configuration chatbot that we developed following the Design Science Research approach in collaboration with an industry partner. From user interviews on configuration, we derive design principles for configuration chatbots, which relate in particular to the flexibility of the chatbot compared to a static process. These design principles were implemented in our chatbot artifact, which was tested in an online experiment (N=12) and compared to a benchmark chatbot with a strict configuration process. Our results indicate that the proposed design increased configuration reliability and performance and had an overall positive effect on participant engagement. This study thus contributes prescriptive insight into the design of context-aware chatbots for product configuration and a new artifact in the form of a context-aware configurator chatbot prototype. Keywords: Product configuration · Chatbots · Context awareness
1 Introduction

Product configurators emerged in the course of digitization, as demand for individualized products grew and "mass customization" became a phenomenon [1]. Product configurators offer companies a good means of organizing the growing demand for bespoke products. The industry partner of the research project presented in this article, CAS Software AG, offers product configurators for this purpose. While CAS has been successfully building Configure, Price, Quote (CPQ) solutions for medium to large customers for 35 years, product configurators pose a major design challenge due to the large number of configuration items and variants. Software-based product configurators are generally considered expert tools that are not easily usable by customers and sales representatives. The term "mass confusion" [2] describes a major problem that occurs with complex configuration tasks. Customers can be overwhelmed by the
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
A. Følstad et al. (Eds.): CONVERSATIONS 2022, LNCS 13815, pp. 190-210, 2023. https://doi.org/10.1007/978-3-031-25581-6_12
Context sensitive chatbot design for product setup
191
number of options available and by the complexity of the product structure, and information overload can prevent users from making decisions. Even small configuration problems often involve complex knowledge bases, for example when a product consists of components that in turn have sub-components. Interdependencies, constraints, and component rules add further complexity [3]. In such settings, users face a multi-step configuration process with many steps, many options to choose from, and product-domain restrictions that are difficult to understand. Researchers in the field of product configuration, as well as CAS Software AG, are therefore constantly looking for ways to reduce complexity for end users. Studies in various areas of information systems have shown that natural language interaction can reduce complexity for end users (e.g., [4, 5]). Natural language interfaces give even novice users the ability to ask questions in domains they have no knowledge of, and they can reduce the need for experts to translate queries into domain-specific technical terms. For these reasons, natural language interfaces are also interesting for product configuration, where mastering complexity is the central challenge. Chatbots, with their ability to communicate in natural language, may be a suitable tool to facilitate complex and tedious processes such as product configuration. However, while a configuration process usually follows a strict, linear script, a conversation does not: it can involve contextual questions, digressions, topic changes, and even uncooperative behavior. In a configuration dialog, several scenarios can occur in which the bot needs context to respond properly. A customer might ask "What are my options?" or "Why not green?" instead of specifying the desired color. Here, the proposed chatbot must be able to connect such follow-up questions with the given context.
End users may also ask about the overall configuration state or change their mind at any point during configuration: "I actually prefer red." Finally, modern chatbots can extract multiple pieces of information from a single request [6]. Thus, the process could be accelerated for users who know their preferences in advance: "I want a blue sports car with a high-quality hi-fi system." Given the described properties of product configuration and natural language communication, a chatbot acting as a product configurator must offer a high degree of flexibility. Context-aware chatbots are intelligent agents that can take contextual knowledge into account to facilitate human-machine interaction. This article views context-awareness as a means of providing the necessary adaptability and fit between task and information presentation. A context-aware chatbot has the potential to improve users' task performance [7, 8]. This article therefore investigates the design of a context-aware chatbot for product configuration. Our research project, carried out as a Design Science Research (DSR) project, aims to answer the following research questions: • RQ1: How should a context-aware product configuration chatbot be designed to facilitate the configuration process for the user? • RQ2: How does a context-aware chatbot compare to a simple Q&A chatbot in terms of usefulness, ease of use, and cognitive load?
192
T. Niederer et al.
2 Theoretical Background and Related Work

2.1 Product Configurators

As product configurators offer valuable opportunities for companies but also pose challenges, product configuration research has been dedicated to the problems of configurators and has proposed designs. To address the "mass confusion" problem [2], product configuration research suggests educating the customer about product attributes and how they map to design parameters [6]. It also emphasizes the importance of identifying customer needs: customers may not be interested in exhausting all possible alternatives [6]. Furthermore, customers differ in their knowledge of the product to be configured. For these reasons, the configuration process can lead to cognitive information overload. Since a very high cognitive load can negatively affect performance on the task at hand, here the configuration task [7], an adequate presentation format is crucial for cognitive fit and performance in a configuration process [2]. To date, few technologies have been tested to support the configuration process; most software-based configuration interfaces are offered to customers as web forms [3, 5, 8]. However, these come with problems such as limited flexibility and intuitiveness, as well as complexity, which are discussed in more detail in the next section. For this reason, we test a chatbot as a new technology for product configuration, one that reflects design principles both for chatbots and for configuration, as described in the following sections.

2.2 Conversational Interfaces to Reduce Complexity

Chatbots are a popular yet novel technology for implementing a configuration process [9]. Due to the intuitiveness and potential efficiency of natural language, chatbots have the potential to facilitate complex tasks such as product configuration.
Indeed, the technology has already proven itself in related areas of high complexity, such as e-commerce and recommendation, with a focus on assisting users in making purchase decisions. Several studies use chatbots to query product databases based on stated user preferences [10-12]. Natural language interfaces have also been applied to complex data scenarios, and conversational interfaces have been used in recommender systems to avoid information overload [5, 13]. Since it is very common to ask other people for recommendations (e.g., for restaurants or movies), users are practiced in formulating recommendation requests in natural language [13]. This also applies to configuration, because the most intuitive process would be a product configuration in the form of consulting a product specialist or sales representative.

2.3 Context-Awareness in Chatbots

A simple question-and-answer form would not be enough to facilitate configuration, as this approach merely shifts the cognitive effort from clicking to typing. Providing an intuitive configuration interface is achieved by equipping the chatbot with context-aware features. Context is needed whenever a statement refers not just to a single question but to the dialog as a whole. Context-awareness allows the conversation to flow more naturally,
Context sensitive chatbot design for product setup
193
which supports the call to integrate more social features into conversational agents [14]. Product configuration is contextual per se, as a single step is meaningless without reference to the others. In chatbot research, context has been defined as information extracted from the conversation between user and chatbot [15]. Other reference types are also known, such as when someone fails to provide the requested information and context is needed to respond appropriately: corrections, broad context references, and bounded context references [16]. For a configuration bot to provide context-awareness, it must be connected to a knowledge base and a memory. This is the case for our chatbot, which connects to the CAS configuration interface to map the hierarchical configuration steps and the properties involved. Additionally, modeled configuration constraints can be used to inform the user about available options. To answer follow-up questions, additionally modeled information about certain product properties can be requested. The bot can also offer to skip certain options and apply default settings. Finally, product experts can state multiple desired configuration properties in a single request, while less experienced users are guided step by step through the process.
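As an illustration of this idea, the following sketch (a hypothetical example of ours, not the authors' implementation; the product vocabulary and constraint table are invented) shows how a conversation memory plus a small constraint store can answer a contextual follow-up such as "What are my options?":

```python
# Hypothetical product knowledge: colour options allowed per car type.
ALLOWED_COLOURS = {
    "sports car": ["red", "blue", "black"],
    "van": ["white", "grey"],
}

class ConversationMemory:
    """Slots filled from the dialog; context for later turns."""
    def __init__(self):
        self.slots = {}

    def set(self, slot, value):
        self.slots[slot] = value

    def get(self, slot):
        return self.slots.get(slot)

def answer_options_question(memory):
    """Context-aware answer to 'What are my options?' at the colour step."""
    car_type = memory.get("car_type")
    if car_type is None:
        # No context yet: ask for the missing information instead.
        return "Which car type would you like to configure first?"
    colours = ALLOWED_COLOURS[car_type]
    return f"For a {car_type} you can choose: {', '.join(colours)}."

memory = ConversationMemory()
memory.set("car_type", "sports car")   # extracted from an earlier turn
print(answer_options_question(memory))
```

The same question thus yields different answers depending on what the dialog has already established, which is the essence of the context-dependent responses described above.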
3 Research Approach

Fig. 1. The Design Science Research approach of this project: two design cycles covering problem awareness (literature review / interviews), suggestion (design principles), development (prototype), and evaluation (individual user testing and its analysis in design cycle 1; refined design principles, refined prototype, final experiment, and evaluation analysis in design cycle 2), followed by the conclusion.
We find the DSR approach particularly suitable for gaining insight into the design of context-aware chatbots for product configuration, as it involves iterative development and testing phases that ensure relevance and validity [17]. Our DSR project follows the established five phases of problem awareness, suggestion, development, evaluation, and conclusion [18]. The following paragraphs give an overview of the main steps of the applied method, while Section 4 describes each step in detail. Our research project consisted of two cycles, comprising an intermediate evaluation with initial qualitative feedback and a more comprehensive final evaluation. The steps of the research project are shown in Fig. 1. As the intermediate evaluation showed that the initial problem was sufficiently understood, requiring only a small adjustment of the
design principles (DPs), the second iteration started with the suggestion phase after the intermediate evaluation.

Problem awareness: To deepen the understanding of the problem, company employees with access to customer feedback (e.g., sales representatives, product managers) were interviewed to identify the actual problems end users face when using product configuration interfaces. For this purpose, seven semi-structured interviews were carried out with seven employees (2 women, 5 men; mean age = 34.14 years, SD = 6.87; professional experience = 15.57 years, SD = 7.69; experience in product configuration = 5.71 years, SD = 2.77). A semi-structured interview guide was developed, consisting of questions on demographics and work experience as well as twelve prepared questions on product configuration. The interviews ranged from questions about the status quo of product configuration (e.g., "How often do which user groups use the product configurator?") to questions about product configuration problems (e.g., "Why do users override the settings?") to customer requirements (e.g., "What are frequently expressed customer requirements?"). The interviews were transcribed using Microsoft Teams live transcription and lasted 23.39 min on average (SD = 2.77). We labeled statements from the interviews and grouped them by topic; in discussions, agreement was reached on the final groups and the classification of the labeled statements. As Section 4.2 will show, the results of the expert interviews were consistent with the relevant product configuration literature focusing on requirements from the user's perspective.

Suggestion: Based on the interviews and their evaluation, the main goals, requirements, and end-user problems related to product configuration interfaces were identified. In addition, theoretical best practices and descriptive design knowledge from related areas and from chatbot research were reflected.
The results were translated into prescriptive DPs. The suggestion phase of design cycle 2 resulted in improvements to the scope and implementation of the DPs.

Development: The identified design principles were instantiated in a prototype. The second development phase involved measures to improve the chatbot's responses in terms of content and format; additionally, product-related information was added.

Evaluation: The intermediate evaluation took the form of an unmoderated remote usability test [19]. The final evaluation in design cycle 2 examined whether the chatbot could make the configuration process easier for end users compared to a benchmark chatbot that offered only an inflexible step-by-step configuration process. The main outcome measures for the final evaluation were usefulness, ease of use, and cognitive load. These were measured by observing users' performance on the configuration task and through a post-experiment survey.
4 Designing Context-Aware Chatbots for Product Configuration

4.1 Problem Awareness

The interview transcripts were analyzed to extract information on problems with existing product configurators as well as end-user goals and requirements.
The analysis of the interviews and the literature led to the identification of five main problem groups, described in the following paragraphs.

Limited flexibility: Several authors [3, 20] address the heterogeneity of consumers using product configuration interfaces. However, product configurators often provide a single standard form for the personalization experience. Users with extensive product knowledge go through the same process steps as inexperienced customers and are exposed to the same level of detail and information [6]. In addition, standard forms impose a strict configuration order, which can differ significantly from the user's actual preferences: "I am relatively bound to the given order, even though it does not necessarily correspond to the things that are important to me" (Interviewee 7). Each selected product property may restrict other selection options; strictly following the linear order of the configuration process can therefore easily lead to suboptimal results.

Insufficient information: Insufficient information manifests itself at various stages of a configuration process. First, many customers lack detailed technical knowledge of the product domain [3] and therefore cannot select the specific features that suit their needs [6]. This may be due to highly domain-specific terminology. Moreover, the biggest challenge in designing product configuration interfaces is often conflict resolution [21]. Configuration conflicts occur when components selected by the customer do not fit together. Interviewee 7 reported frustrated customers who "[...] couldn't click on what they wanted and also didn't understand why".

Complexity/confusion: Variant-rich products are inherently complex, and the main challenge of a product configuration interface is to reduce this complexity. At many car manufacturers, for example, often only a few product specialists have a complete overview of the product line [21].
Configuration interfaces that offer too many options at once overwhelm customers [2, 3]. Interviewee 5 likewise reports that "[...] you just don't understand how to do something [...]".

Limited intuitiveness/guidance: Several interviewees report limited intuitiveness because the configuration often "[...] is not self-explanatory" (Interviewee 4). Even where user guidance exists in the form of explanatory texts, it is still perceived as "problematic" (Interviewee 2). Users also "[...] do not click on the information to find out for themselves" (Interviewee 7) due to the limited intuitiveness of the information display.

Duration: Configuring a custom product is a "cognitively challenging task" [6]. Customers are often "not [...] interested in fully exploring the potential of individualization" [6]. Such users have a few preferences in mind, and once these are met, the additional options are not particularly relevant to them. Nevertheless, it often still takes "[...] a lot of clicks to get to a result" (Interviewee 5).
4.2 Suggestion

Deriving design requirements: Having gained a deeper understanding of the problems related to product configuration interfaces, the next step was to derive requirements that address these problems. First and foremost, the user interface should serve novice and experienced users alike. The chatbot must therefore offer a degree of flexibility that ensures the right level of efficiency and support for experts and beginners. Customers should be able to decide for themselves what level of detail they need and in which order to configure their product. Furthermore, economic and psychological studies show that human preferences change depending on the available alternatives [22]. A product configuration interface should therefore always support easy and flexible changes during the customization process. The resulting design requirement (DR) was stated as follows: A product configurator must provide flexibility (DR1) to support customers who vary widely in their goals, skills, and configuration procedures. Second, the interface must close the gap of missing information. It must support customers who use the configurator as a tool to explore what is possible with the offered product. The proposed design should therefore make it possible to learn more about the product as well as its features and properties. Serving this use case requires a high level of transparency: What happens to my configuration if I choose a specific option? How does this affect my final product with regard to my preferences? Can certain options be combined? In short, a product configurator must allow exploration (DR2). Third, a product configurator must address the problem of complexity and confusion. A key challenge is therefore to provide a sufficient amount of information in the right place at the right time, and not all at once.
The interviewees suggest reducing the required cognitive effort through intuitiveness and simplicity: the product configurator should be clear, easy to navigate, and as simple to understand as possible. Consequently, the proposed solution should be self-explanatory and as easy to learn as possible; it should reduce cognitive effort (DR3). As a general rule, customers do not want to spend hours configuring the final product but want to configure their desired product "as quickly as possible" (Interviewee 3). A proper configuration tool must therefore implement mechanisms for quick configuration (DR4).

Translating design requirements into design principles: To define guidelines on how to fulfill the DRs in a chatbot interface, they were translated into DPs. The mapping is explained in the following paragraphs and shown in Fig. 2. To increase flexibility (DR1) and exploration (DR2), the chatbot design must provide conversational flexibility (DP1). The design should reflect that natural language inputs are much richer than GUI inputs [23, 24]. The intended goal is to understand users' needs and how they are best served [24]. The chatbot should offer users the opportunity to state their preferences directly in natural language and to initiate a configuration process by asking for certain product properties themselves. Furthermore, the chatbot must remain flexible at all times during configuration: it must allow (contextual) follow-up questions, corrections, undoing previous steps, and deviations from a standard configuration procedure.
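The corrections and undo behavior required by DP1 can be sketched with a minimal configuration state that keeps a history of choices (our illustrative example, not the authors' code; feature names are invented):

```python
class ConfigurationState:
    """Holds the current choices and a history stack for undo/corrections."""
    def __init__(self):
        self.choices = {}   # feature -> selected property
        self.history = []   # stack of (feature, previous value)

    def select(self, feature, prop):
        # Remember the previous value so the step can be undone later.
        self.history.append((feature, self.choices.get(feature)))
        self.choices[feature] = prop

    def undo(self):
        if not self.history:
            return  # nothing to undo
        feature, previous = self.history.pop()
        if previous is None:
            del self.choices[feature]
        else:
            self.choices[feature] = previous

state = ConfigurationState()
state.select("colour", "green")
state.select("colour", "red")   # correction: "I actually prefer red."
state.undo()                    # take the correction back
print(state.choices)
```

Because every selection is just a reversible state transition, the dialog need not follow a strict linear order; the user can revise any step at any time.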
To increase the possibilities for exploration (DR2) and reduce the required cognitive effort (DR3), the chatbot must provide relevant information before, during, and after the configuration process (DP2). Relevant information includes general (static) information about the product, the process, the possibilities, and the limitations. During configuration, providing dynamic (context-dependent) information is particularly relevant, and the chatbot should exploit its NLU capabilities to help users with vague or unclear request phrasings. Context-dependent information is needed above all for conflicts: the chatbot must be able to name conflicting features and explain how a conflicting feature could still be selected and what effects that selection would have. Ideally, the chatbot can also explain why certain options cannot be combined. Relevant information in the form of intermediate states and transparency about configuration changes must also be available during configuration. To reduce the required cognitive effort (DR3) and speed up the configuration process (DR4), the chatbot must provide a clear structure (DP3). The design should reflect that the content and functions of a text-based interface are much more hidden from the user than those of a graphical interface [24]. Several authors point to the need to reveal the system's capabilities before and during the interaction in order to shape expectations and provide guidance [23-26]. Researchers have found that conversational guidance can be achieved by suggesting user responses [23], providing clickable buttons that generate text [27], and clarifying the conversation flow with instructional messages [28]. For the product configuration domain, the chatbot should provide suggestions on how customers can achieve their goals. Guidance can be further improved by announcing the next steps during the configuration process. Finally, a guided mode can be offered in which the chatbot asks, for example,
which property of a given feature the user wants to select. However, the implementation of DP3 must respect DP1: the clear structure must be an optional offer that does not force the user to proceed mechanically.

Fig. 2. Derivation of design principles from design requirements: DR1 (a configurator must offer flexibility) → DP1 (the bot offers flexibility regarding the configuration process and vocabulary); DR2 (a configurator must allow exploration) → DP2 (the bot provides relevant information); DR3 (a configurator should reduce cognitive effort) → DP3 (the bot offers a clear structure); DR4 (a configurator must enable quick configuration) → DP4 (the bot provides quick configuration options).
Acceleration mechanisms (DP4) can be simple, for example applying default values (e.g., always choosing the most frequently selected property among the available options). Configuration researchers are also investigating ways to include recommendations in product configuration [3, 29, 30]. Another suggestion from the interviews (Interviewees
5 and 7) was to optimize the final product according to a user-specified objective (e.g., price or performance).

4.3 Development

The RASA conversational AI framework was used to develop the chatbot. The main selection criterion was that the framework meet the technical requirements for instantiating the DPs. RASA is open source and compares well with other NLU services for chatbot development [31]. It includes NLU components for intent classification, entity extraction, and response retrieval, as well as a dialogue management component that decides the next action the chatbot should take. For the context-aware chatbot, the component pipeline comprised the WhitespaceTokenizer, RASA's built-in RegexFeaturizer, the LexicalSyntacticFeaturizer, the CountVectorsFeaturizer, the Dual Intent and Entity Transformer (DIET), and the RegexEntityExtractor. The RegexEntityExtractor was used to extract all explicitly defined features and properties, while the DIET entity extractor can identify entities not explicitly defined in the training data using machine learning techniques. Furthermore, RASA offers a scalable architecture with easy integration of APIs and databases. Figure 3 shows the basic architecture of the chatbot:
Fig. 3. Chatbot architecture overview
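The regex-based extraction step of such a pipeline can be approximated in plain Python (a simplified stand-in for RASA's RegexEntityExtractor, not RASA itself; the feature vocabulary is a hypothetical example), which also shows how several entities can be picked up from a single utterance:

```python
import re

# Hypothetical feature vocabulary: each feature maps to a pattern of its
# explicitly defined property values.
FEATURE_PATTERNS = {
    "colour": r"\b(red|green|blue)\b",
    "car_type": r"\b(sports car|van|sedan)\b",
    "hifi": r"\b(hi-fi|hifi)\b",
}

def extract_entities(utterance):
    """Match every defined feature pattern against the (lowercased) utterance."""
    entities = {}
    for feature, pattern in FEATURE_PATTERNS.items():
        match = re.search(pattern, utterance.lower())
        if match:
            entities[feature] = match.group(1)
    return entities

print(extract_entities("I want a blue sports car with a hi-fi system"))
```

In the real pipeline, the DIET classifier complements this step by recognizing entities that do not appear verbatim in the defined vocabulary.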
Two versions of the configuration chatbot were developed. Both versions were connected to the company's configuration backend, which determines viable product variants, i.e., which components can be combined and how. This means that all customer requests can be validated by querying the product domain database. Additionally, the backend can be queried for further information about the product, its features, and how they relate to each other. The basic chatbot relies on a simple, step-by-step question-and-answer mechanism, as also implemented in the standard web form. During the configuration process, this chatbot follows the same feature order, and users are prompted to select the desired properties feature by feature.
Users can select properties by typing the desired string or number. The basic chatbot can only provide context-independent information and does not implement the proposed design. With regard to references to previous turns, when users ask about specific features, the basic chatbot always responds with the same static answer listing the properties the feature contains. The proposed design, in contrast (shown on the right in Fig. 4), generates its response based on the previous steps (e.g., highlighting the properties that are still selectable given the configuration so far). The context-aware chatbot was built according to the proposed design principles. Context-awareness was realized through entity recognition that populates the chatbot's conversation memory. The conversation memory was used to provide dynamic, contextual responses based on the user's past utterances. The chatbot's memory consists of slots that store information about all properties of the selected product and about the stage of the configuration procedure. It was used to detect and communicate configuration conflicts, to provide feedback and support based on the configuration status, and to offer an auto-completion that leads to a valid configuration. In order to respond flexibly (DP1) to various dialog directions, the chatbot uses its NLU components to recognize different user intents at every step of the configuration process. The main user intents include configuring (i.e., selecting properties), requesting information, requesting additional guidance, and requesting auto-completion.
Fig. 4. Basic step-by-step configuration chatbot (left) and context-aware chatbot that recognizes two features at the same time (top right) or resolves conflicts (bottom right)
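The slot-based conflict handling described above can be sketched as follows (our simplified illustration; the incompatibility table is a hypothetical stand-in for the constraints held in the real configuration backend):

```python
# Hypothetical pairs of properties that cannot be combined.
INCOMPATIBLE = {
    ("convertible roof", "roof rack"),
}

def try_select(slots, feature, prop):
    """Check a choice against the current slots before storing it."""
    for other in slots.values():
        if (prop, other) in INCOMPATIBLE or (other, prop) in INCOMPATIBLE:
            # Name the conflict and its cause instead of silently failing.
            return f"Conflict: '{prop}' cannot be combined with '{other}'."
    slots[feature] = prop
    return f"'{prop}' selected."

slots = {}
print(try_select(slots, "roof", "convertible roof"))
print(try_select(slots, "carrier", "roof rack"))
```

Because the check runs against the conversation memory rather than a single question, the explanation can refer to choices made many turns earlier.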
The results of the intermediate evaluation mainly revealed deficiencies in the instantiation of relevant information (DP2) and clear structure (DP3). In the second development phase, the chatbot's responses were therefore enriched with product- and process-related information (for example, how many features still need to be configured and what the next configuration steps are).
To improve structure and clarity, emojis and Markdown formatting were used to set off information. Finally, to make the chatbot's capabilities more transparent, the chatbot provides example utterances for all of its functions. Figure 5 shows examples: the red flag emoji indicates a conflict, the light bulb icon highlights tips, the arrow icon announces next steps, quotation marks indicate suggested responses, and bold formatting highlights changes.
Fig. 5. Implementation of notifications for communication of configuration conflicts
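A conflict notification in the style of Fig. 5 could be assembled as in the following sketch (the wording and helper are our illustrative assumptions, not the authors' exact messages):

```python
def format_conflict(prop, conflicting, suggestion):
    """Compose a conflict message with emoji markers and Markdown emphasis."""
    return (
        f"\U0001F6A9 **{prop}** conflicts with your earlier choice "   # red flag
        f"*{conflicting}*.\n"
        f"\U0001F4A1 Tip: {suggestion}\n"                              # light bulb
        f"\u27A1 Next step: choose another property or type 'undo'."   # arrow
    )

print(format_conflict(
    "roof rack",
    "convertible roof",
    "deselect the convertible roof to enable the roof rack",
))
```

Keeping the markers consistent across all responses is what gives the user a stable visual grammar for conflicts, tips, and next steps.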
5 Evaluation

5.1 Experimental Design

To evaluate the proposed design, the context-aware chatbot was compared with the baseline design in an online experiment. The evaluation examined whether the articulated design principles could be instantiated in an artifact that makes the configuration process easier for end users; the artifact was assessed for its ability to solve the described problems and add value for end users. Based on established technology acceptance constructs [32], the first two hypotheses were formulated as follows: H1: The proposed design leads to greater perceived ease of use. H2: The proposed design leads to greater perceived usefulness. Based on cognitive load theory, the third hypothesis was: H3: The proposed design leads to lower perceived cognitive load. To test these hypotheses, a controlled online experiment with a between-subjects design was conducted. A control group of users performed a
product configuration task using the basic Q&A chatbot, while the treatment group worked on the same task using the context-aware chatbot that implemented the proposed design principles. Both groups answered a survey after the hands-on task. The tasks included retrieving product-related information and configuring the product according to desired attributes. During the configuration task, participants had to resolve conflicts and make corrections. Participants were primarily company employees from domains other than product configuration who may have used a configurator before (e.g., to configure a car); they are considered potential end users. Table 3 in the appendix provides an overview of participant demographics and controls.

5.2 Measuring Instruments

Perceived ease of use (PEOU) has been captured with a large number of subjective measures such as questionnaires [33-36]. Studies have shown that PEOU is related to the usability criteria of clarity and reliability [36]. Clarity captures the degree to which the product is easy to understand and to learn to use; a user who feels in control of the interaction experiences a high level of reliability. The User Experience Questionnaire (UEQ) was used to capture these dimensions due to its ease of use and proven informational value [37, 38]. The questionnaire consists of 26 items in the form of semantic differentials (i.e., adjectives with opposite meanings). It also assesses general attractiveness, efficiency, and hedonic attributes (stimulation and novelty) that are directly or indirectly associated with the behavioral intention to use a system [36]. Perceived usefulness (PU) evaluates an information system in terms of performance and output [34]. For the subjective assessment of usefulness, scales were defined for the relevant task domains of information retrieval and configuration [34, 35]. As reflected in the DRs, the main steps of a configuration process are information retrieval and product customization.
The questionnaire therefore contains two items in which participants rate the usefulness of the prototype for each step of the process. The applied user-related constructs are listed in Table 4. The subjective assessment is complemented by measuring users' task performance. Performance was assessed objectively by logging user behavior during the interaction and evaluating participants' answers in the information retrieval task. Table 5 in the appendix gives an overview of the collected measures. The duration and the number of messages indicate the depth of the interaction and allow quantitative conclusions about participants' actions and their engagement with the task. An information retrieval score was calculated from the answers, capturing the proportion of correctly answered questions. The performance criteria interaction time and number of messages sent to the chatbot were read from the logs. Because the configuration task uniquely specified several desired properties, a configuration performance score was calculated as the proportion of correctly selected properties among all uniquely specified ones. Consecutive errors (for example, selecting an incorrect property that prevented other properties from being selected) were excluded from the score. Finally, the logs allowed us to see whether users had configured wrong properties and how much time they spent on them.
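The configuration performance score described above can be written out as a short function (our reading of the scoring rule; the example data are invented):

```python
def configuration_score(specified, selected, consecutive_errors=()):
    """Share of correctly selected properties among all uniquely specified
    ones, excluding features made impossible by an earlier wrong choice."""
    scored = [f for f in specified if f not in consecutive_errors]
    correct = sum(1 for f in scored if selected.get(f) == specified[f])
    return correct / len(scored)

specified = {"colour": "blue", "engine": "hybrid", "rims": "19 inch"}
selected  = {"colour": "blue", "engine": "petrol", "rims": "17 inch"}
# Suppose the wrong engine made the specified rims unselectable:
print(configuration_score(specified, selected, consecutive_errors=("rims",)))
```

Excluding consecutive errors keeps the score from double-counting a single mistake that cascades through dependent configuration steps.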
T. Niederer et al.
This work hypothesizes that a context-sensitive chatbot achieves higher task performance, higher PU, and higher PEOU compared to a simple chatbot, due to a reduction in cognitive load and a better fit between the problem space and the task. Measures to assess PU, PEOU, and cognitive load were therefore defined and included in the experiment.

5.3 Evaluation Results

Users of the context-aware chatbot spent more time overall (on average, 163.5 s more). Participants in both treatment groups spent approximately half of the time on the configuration task. Approximately the same amount of time was invested in retrieving the desired information. For the configuration task, users spent an average of 111.5 s more with the context-aware chatbot, which is significant (df = 10, t = 1.855, p = 0.047). Corrections required much less time (about one minute on average) with the context-aware chatbot (df = 8, t = 2.845, p = 0.011). Two participants who used the baseline chatbot did not complete the correction and were excluded from the time comparison. The finding of increased interaction investment with the context-aware design is supported by the observed number of messages sent. On average, users of the baseline bot sent 37.5 messages (SD = 4.93) to the chatbot, while the treatment group sent an average of 44.5 messages (SD = 13.28).

Perceived ease of use: Participants reported greater clarity and reliability when using the context-sensitive chatbot (see Table 1). Despite the small group size (six each) in our exploratory experiment, a significant difference between the baseline and the context-sensitive chatbot was already identified for reliability.

Table 1. Descriptive statistics for clarity and reliability

Treatment          Baseline (n=6)   Context-aware (n=6)
Clarity      mean       1.00              1.67
             SD         1.11              0.96
Reliability  mean       1.08              1.58
             SD         0.52              0.41
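The reported degrees of freedom (df = 10 for two groups of six) suggest a pooled two-sample t-test, although the paper does not state the exact variant. A stdlib-only sketch computing the statistic from summary values like those in Table 1 (the p-value would additionally require a t-distribution and is therefore omitted):

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Reliability means/SDs from Table 1 (n = 6 per group)
t, df = pooled_t(1.58, 0.41, 6, 1.08, 0.52, 6)
print(round(t, 2), df)  # → 1.85 10
```

This is only an illustration of the test mechanics with the table's descriptive values; it is not a reproduction of the authors' reported significance tests.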
Perceived usefulness: The context-sensitive chatbot scored higher on average both for information retrieval (context-sensitive mean = 2.17, SD = 1.17, vs. baseline mean = 1.17, SD = 1.47) and for configuration (context-sensitive mean = 2.00, SD = 0.63, vs. baseline mean = 1.50, SD = 1.64). The results for task performance are shown in Table 2. For the information retrieval task, the differences in performance scores are marginal, while the configuration scores differ significantly. The deviations in both tasks allow behavior patterns to be observed that can be verified in the chat logs.
Context sensitive chatbot design for product setup
Table 2. Descriptive statistics for task performance

Treatment                     Baseline (n=6)   Context-sensitive (n=6)
Inf. retrieval score  mean        0.67              0.71
                      SD          0.13              0.10
Configuration score   mean        0.79              0.93
                      SD          0.52              0.41
Cognitive load: The cognitive load measures indicate that, on average, participants experienced a fairly low level of cognitive load, since both mean values are positive. Participants in the baseline design reported a lower overall cognitive load (context-sensitive mean = 0.82, SD = 1.97, vs. baseline mean = 1.33, SD = 0.82), with the context-sensitive chatbot's results showing high variation. The second scale, which focuses more on the means used to solve the task ("how easy or difficult was it to solve the tasks with the chatbot"), favors the proposed design (context-sensitive mean = 2.17, SD = 10.75, vs. baseline mean = 1.67, SD = 0.82).

Qualitative results: Participants in the baseline design missed flexibility: "It would also be nice if you didn't have to undo each step individually, but could just jump to the desired point" (Participant 5). Participant 3 found that "having to think about what's the next best step" got in the way. Five out of six participants who used the baseline design found that the step-by-step approach made it difficult to carry out corrections during configuration. Participants with the context-sensitive design criticized that not all of their questions could be answered satisfactorily and that "they had to rephrase the questions" (Participant 6). Here the chatbot can be improved in terms of natural language understanding and the information it can provide (e.g., additional information about certain features). Most respondents highlighted the easy and intuitive selection of features in both designs. With the baseline, feature selection was very fast due to the ability to enter just a number. However, this linear, mechanical configuration comes at the expense of flexibility and control. All participants who used the context-sensitive chatbot found the configuration easy. They recognized that typing "keywords" (Participant 4) yielded meaningful information or was sufficient to select features. Participant 10 also liked the autocomplete option.
6 Discussion

6.1 Contribution

This article presents a novel approach that addresses the well-known challenge of designing interfaces for product configuration. The work builds on prescriptive knowledge from product configuration and chatbot research and aims to facilitate the configuration process for
end users by providing an intuitive, easy-to-use interface. It draws on expert interviews and the existing literature to address the main problems of existing solutions, namely limited adaptability [6], information overload [2], and limited transparency [3, 6]. The formulated DRs are independent of a specific interface design and can guide the design of systems that overcome the limitations of current solutions. The derived DPs reflect the required capabilities of the chatbot, with the conversation as the object of design. Developers can use this prescriptive knowledge to design chatbots as product configuration interfaces. The DPs derived for a product configuration chatbot demonstrate the need for context awareness in this type of chatbot in order to provide flexibility (DP1), relevant information (DP2), structure (DP3), and quick configuration options (DP4). These principles can serve as the foundation of a design theory for configuration bots. From a practical point of view, the most important results are the two chatbot prototypes (baseline and context-sensitive) and the lessons learned from the evaluation. The evaluation results indicate that the proposed DPs had an impact on participants' configuration experience and on the configuration outcome. Although the sample was limited to 12 test subjects, participants experienced significantly greater reliability and performed significantly better in the configuration task with the context-aware chatbot. On average, participants with the proposed design spent more time and sent more messages to the chatbot. In doing so, they took advantage of the context-aware capabilities to resolve configuration conflicts, to flexibly select and reselect features, and to finalize their configuration. They configured according to their own preferences rather than in the order suggested by the bot.
Although the participants who configured with the baseline design knew that there were better alternatives and that conflicts were part of the task, they made no effort to revise features they had already selected. Qualitatively, participants indicated that the chatbot's step-by-step approach made it difficult to carry out corrections. The ability to make corrections flexibly was particularly valuable to the treatment group. Treatment group participants were more successful and faster in making changes to their configurations. The control group confirmed this observation, as five out of six participants wanted more flexibility for corrections. Self-reported overall mental effort was quite low for both designs and even lower for the baseline design. The chat logs and evaluation results suggest that this perceived low mental load is due to the baseline chatbot's unsuitability for the task at hand, as it did not prompt participants to reconsider their decisions. Consequently, a reduction in perceived cognitive load (H3) could not be demonstrated. However, the chat logs suggest that the proposed design leads to more engagement (e.g., more messages sent, more time spent configuring). The results show that the context-sensitive chatbot was perceived as easier to use than the baseline (H1), with significant results for the reliability dimension. Significant differences in task performance scores indicated that the proposed design was more useful than the baseline (H2), which was also supported by the reported perceived usefulness. No reduction in cognitive load due to the treatment could be demonstrated; rather, participants with the proposed design invested the mental effort required to successfully complete the task at hand. In summary, the context-aware chatbot is particularly useful in an iterative configuration process that involves changes and exploration.
The baseline design can serve as an alternative when users want to select features manually and quickly.

6.2 Limitations and Future Work

This research project has five main limitations that provide opportunities for further study of chatbots for product configuration. First, only company employees participated in the expert interviews. Future work may integrate a more diverse set of end users. However, the experts interviewed had access to customer feedback, some of which went back years. They were thus able to contribute generalizable insights from different product domains, whereas end-user feedback may depend on individual experiences and a single product domain. Second, the analysis was based on a sample of 12 participants. Participants were mainly recruited without much experience in product configuration and were considered potential end users. The evaluation results can be strengthened by increasing the sample size and targeting actual end users of the application domain. Third, the task that participants had to perform during the evaluation was designed according to the DRs. To objectively measure the proportion of correctly selected features, preferences were specified in the task scenario. Configuration conflicts were integrated into the task design, as both the interviewees and the literature confirm their presence in configuration processes [21]. More research is needed to examine the impact of the proposed design on the actual configuration experience of end users pursuing their own preferences. However, such an approach complicates the application of objective performance measures, since the results are then subject to individual preferences. Fourth, the measures of perceived usefulness and cognitive load consisted of two items each. The small sample size and the high variance of these measures do not allow general, quantitatively substantiated statements to be derived.
However, the measures allow us to observe trends that are reinforced by observing the participants' configuration process as well as their qualitative responses. Additional research can expand the assessment metrics and sample sizes. Fifth, in line with the research questions posed, the proposed design principles were evaluated against a baseline without context-sensitive features. A comparison with a classic web form was not the object of this project. Because of its ability to reduce complexity for end users and to provide flexibility in the configuration process, this article presumes value in the use of natural language, which was not assessed in a between-subjects comparison. Qualitatively, participants appreciated the simple selection and flexible revision of product properties. On the other hand, web forms offer more flexibility in presenting information than a chat interface. Further research can build on our contribution and compare different configuration modalities.
7 Conclusion

Chatbots can reduce complexity and make it easier for end users to express requests by allowing them to interact in natural language. Product configuration is an application
where customers can feel overwhelmed by the mass and complexity of product features in current solutions. Existing interfaces often do not differentiate between novice and experienced users and impose an order of configuration on end users. As we were able to show in a brief review, a context-sensitive chatbot for product configuration offers flexibility in terms of possible conversation paths and the vocabulary used. Based on expert interviews and the literature, DRs were compiled, from which general DPs were formulated. These can serve as a blueprint for developing chatbots or natural language interfaces for configuring products. The proposed design instantiation was compared to a baseline design in a between-subjects comparison. On average, users with the proposed design spent more time and sent more messages while configuring, and ultimately performed better in terms of task performance. In addition, participants attributed greater usefulness and ease of use to the proposed design. For both designs, the reported mental workload was quite low, indicating low perceived complexity. The instantiation of the DPs proved to be particularly advantageous for the rapid selection and revision of product attributes in an iterative process that is flexible in terms of the order and vocabulary used. In use cases with a larger number of variable product attributes, the visual possibilities of a chat window seem limited and can be complemented by a web form representation. Other researchers are invited to apply, test, and extend the proposed design and the design theory for a product configuration chatbot to facilitate the configuration process for users. Since the results of this project suggest a high potential for blended approaches, a combination of a chatbot and a classic configuration interface could be an idea on which future work can build.
Appendix

Semi-Structured Interview Questions

Status quo
• "How often is the product configurator used, and by which user groups?"
• "What are the customers' goals when configuring the product?"
• "What proportion of started configurations are completed?"

Problem identification
• "For what reasons do users misconfigure a product?"
• "What feedback is there on the product configurator?"
• "What complaints or negative comments are there about the configurator?"
• "What are the obstacles in the current configuration process?"

Requirements elicitation
• "What are frequently expressed customer requirements?"
• "What are the relevant properties that a configurator should implement?"
• "How can intuitive operation or a smooth process flow be achieved?"

Concrete requirements for the configuration chatbot
• "In your opinion, how should a text-based chatbot proceed to allow for a smooth configuration process?"
• "Which chatbot features would be desirable?"
• Optional: follow-up probes where comments were insufficient

Closing review
Table 3. Demographics and controls of the experimental groups.

Control (baseline design), n = 6:
  Age: mean = 29.5 (SD = 4.14); Gender: female = 4, male = 2;
  Product configuration experience*: non-expert = 4, expert = 2;
  Chatbot usage**: mean = 3.00 (SD = 1.41)

Context-sensitive (proposed design), n = 6:
  Age: mean = 30.17 (SD = 10.34); Gender: female = 2, male = 4;
  Product configuration experience*: non-expert = 4, expert = 2;
  Chatbot usage**: mean = 3.33 (SD = 1.21)

*Measured on a 5-point Likert scale. **Measured on a 7-point Likert scale.
Table 4. User-related constructs

Construct                      Reference                           Measurement            Items
User Experience Questionnaire  Laugwitz et al. (2008)              7-point Likert scale   26
Perceived usefulness           Lund (2001)                         7-point Likert scale   2
Cognitive load                 Paas (1992), Eysink et al. (2009)   7-point Likert scale   2
Table 5. Performance-related metrics

Measure                          Description
Interaction duration             Duration to complete all tasks
Information retrieval duration   Duration to complete task one
Configuration duration           Duration to complete task two
Correction duration              Duration to complete task three
Number of messages               Total number of messages sent to the chatbot
Information retrieval score      Performance score on the information retrieval task
Configuration score              Performance score on the configuration task
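The duration and message-count metrics in Table 5 can be derived from timestamped chat logs; a minimal sketch assuming a hypothetical log format (timestamps in seconds; the task labels are illustrative, not taken from the study):

```python
# Hypothetical chat log: (timestamp_s, sender, task) tuples in chronological order
log = [
    (0,   "user", "retrieval"),
    (40,  "bot",  "retrieval"),
    (55,  "user", "config"),
    (150, "bot",  "config"),
    (160, "user", "correction"),
    (200, "bot",  "correction"),
]

def task_duration(log, task):
    """Time span between the first and last message belonging to a task."""
    times = [t for t, _, tk in log if tk == task]
    return max(times) - min(times)

# Total interaction duration and number of user messages sent to the chatbot
interaction_duration = log[-1][0] - log[0][0]
messages_sent = sum(1 for _, sender, _ in log if sender == "user")

print(interaction_duration, messages_sent, task_duration(log, "config"))  # → 200 3 95
```

The per-task durations correspond to the rows of Table 5 (retrieval, configuration, correction); the information retrieval and configuration scores would come from answer checking as described in Sect. 5.2.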
References

1. Pine, B.J., II, Victor, B., Boynton, A.C.: Making mass customization work. Harv. Bus. Rev. 71, 108–111 (1993)
2. Huffman, C., Kahn, B.E.: Variety for sale: mass customization or mass confusion? J. Retail. 74, 491–513 (1998). https://doi.org/10.1016/S0022-4359(99)80105-5
3. Felfernig, A. (ed.): Knowledge-Based Configuration: From Research to Business Cases. Morgan Kaufmann, Amsterdam (2014)
4. Li, F., Jagadish, H.V.: Constructing an interactive natural language interface for relational databases. Proc. VLDB Endow. 8, 73–84 (2014). https://doi.org/10.14778/2735461.2735468
5. Sun, Y., Zhang, Y.: Conversational recommender system. In: The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York (2018)
6. Randall, T., Terwiesch, C., Ulrich, K.T.: Principles for user design of customized products. Calif. Manag. Rev. 47, 68–85 (2005). https://doi.org/10.2307/41166317
7. Sweller, J.: Cognitive load during problem solving: effects on learning. Cogn. Sci. 12, 257–285 (1988). https://doi.org/10.1016/0364-0213(88)90023-7
8. Blecker, T., Abdelkafi, N., Kreutler, G., et al.: Product configuration systems: state of the art, conceptualization and extensions (2004)
9. Dale, R.: The return of the chatbots. Nat. Lang. Eng. 22, 811–817 (2016). https://doi.org/10.1017/s1351324916000243
10. Gupta, S., Borkar, D., de Mello, C., et al.: An e-commerce website based chatbot. Int. J. Comput. Sci. Inf. Technol. 6, 1483–1485 (2015)
11. Cui, L., Huang, S., Wei, F., et al.: SuperAgent: a customer service chatbot for e-commerce websites. In: Proceedings of ACL 2017, System Demonstrations, Stroudsburg, PA, USA. Association for Computational Linguistics (2017)
12. Nica, I., Tazl, O.A., Wotawa, F.: Chatbot-based tourist recommendations using model-based reasoning. In: ConfWS, pp. 25–30 (2018)
13. Christakopoulou, K., Radlinski, F., Hofmann, K.: Towards conversational recommender systems. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York (2016)
14. Clark, L., Pantidi, N., Cooney, O., et al.: What makes a good conversation? Challenges in designing truly conversational agents. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (2019)
15. Jain, M., Kumar, P., Kota, R., et al.: Evaluating and informing the design of chatbots. In: Proceedings of the 2018 Designing Interactive Systems Conference. ACM, New York (2018)
16. Vlasov, V., Drissner-Schmid, A., Nichol, A.: Few-shot generalization across dialogue tasks (2018)
17. Hevner, A.R.: A three cycle view of design science research. Scand. J. Inf. Syst. 19, 4 (2007)
18. Kuechler, B., Vaishnavi, V.: On theory development in design science research: anatomy of a research project. Eur. J. Inf. Syst. 17, 489–504 (2008). https://doi.org/10.1057/ejis.2008.40
19. Barnum, C.M.: Usability Testing Essentials: Ready, Set...Test!, 2nd edn. Morgan Kaufmann, Amsterdam (2020)
20. Ardissono, L., Felfernig, A., Friedrich, G., et al.: A framework for the development of personalized, distributed web-based configuration systems. AI Mag. 24, 93 (2003). https://doi.org/10.1609/aimag.v24i3.1721
21. Drews, M.: Interaction patterns for product configurators. In: Mensch & Computer 2008, pp. 367–376. Oldenbourg Wissenschaftsverlag GmbH (2008)
22. Kahneman, D., Tversky, A.: Prospect theory: an analysis of decision under risk. In: Handbook of the Fundamentals of Financial Decision Making, pp. 99–127. World Scientific (2013)
23. Diederich, S., Brendel, A.B., Kolbe, L.M.: Designing anthropomorphic enterprise conversational agents. Bus. Inf. Syst. Eng. 62(3), 193–209 (2020). https://doi.org/10.1007/s12599-020-00639-y
24. Følstad, A., Brandtzæg, P.B.: Chatbots and the new world of HCI. Interactions 24, 38–42 (2017). https://doi.org/10.1145/3085558
25. Dzindolet, M.T., Peterson, S.A., Pomranky, R.A., et al.: The role of trust in automation reliance. Int. J. Hum. Comput. Stud. 58, 697–718 (2003). https://doi.org/10.1016/S1071-5819(03)00038-7
26. Luger, E., Sellen, A.: "Like having a really bad PA": the gulf between user expectation and experience of conversational agents. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 5286–5297 (2016)
27. Jain, M., Kota, R., Kumar, P., et al.: Convey: exploring the use of a context view for chatbots. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–6 (2018)
28. Gnewuch, U., Morana, S., Maedche, A.: Towards designing cooperative and social conversational agents for customer service (2017)
29. Cöster, R., Gustavsson, A., Olsson, T., et al.: Enhancing web-based configuration with recommendations and cluster-based help. In: 2002 Workshop on Recommendation and Personalization in E-Commerce (2002)
30. Tiihonen, J., Felfernig, A.: An introduction to personalization and mass customization. J. Intell. Inf. Syst. 49(1), 1–7 (2017). https://doi.org/10.1007/s10844-017-0465-4
31. Braun, D., Hernandez-Mendez, A., Matthes, F., et al.: Evaluating natural language understanding services for conversational question answering systems. In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Stroudsburg, PA, USA. Association for Computational Linguistics (2017)
32. Davis, F.D., Bagozzi, R.P., Warshaw, P.R.: User acceptance of computer technology: a comparison of two theoretical models. Manag. Sci. 35, 982–1003 (1989). https://doi.org/10.1287/mnsc.35.8.982
33. Aladwani, A.M.: The development of two tools for measuring the easiness and usefulness of transactional Web sites. Eur. J. Inf. Syst. 11, 223–234 (2002). https://doi.org/10.1057/palgrave.ejis.3000432
34. Freeman, L.A., Jessup, L.M.: The power and benefits of concept mapping: measuring use, usefulness, ease of use, and satisfaction. Int. J. Sci. Educ. 26, 151–169 (2004). https://doi.org/10.1080/0950069032000097361
35. Lund, A.: Measuring usability with the USE questionnaire. Usability Interface 8(2), 3–6 (2001)
36. Mlekus, L., Bentler, D., Paruzel, A., et al.: How to raise technology acceptance: user experience characteristics as technology-inherent determinants. Gr. Interakt. Org. 51, 273–283 (2020). https://doi.org/10.1007/s11612-020-00529-7
37. Laugwitz, B., Held, T., Schrepp, M.: Construction and evaluation of a user experience questionnaire. In: Holzinger, A. (ed.) USAB 2008. LNCS, vol. 5298, pp. 63–76. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-89350-9_6
38. Schrepp, M., Hinderks, A., Thomaschewski, J.: Construction of a benchmark for the User Experience Questionnaire (UEQ). Int. J. Interact. Multimed. Artif. Intell. 4, 40 (2017). https://doi.org/10.9781/ijimai.2017.445
Author Index

Abbas, Nadia 66
Abbo, Giulio Antonio 103
Aerts, Celine 34
Agathangelou, Pantelis 134
Angenius, Max 150
Avgerinakis, Konstantinos 134
Bjørkli, Cato A. 66
Chira, Chloe 134
Christensen, Noemi 190
Christodoulou, Georgia 134
Crovari, Pietro 103
De Cicco, Roberta 50
Følstad, Asbjørn 66
Garzotto, Franca 103
Ghajargar, Maliheh 150
Guribye, Frode 18
Hammerschmidt, Maik 50
He, Linwei 3
Henkel, Tanja 83
Kamoen, Naomi 34
Katakis, Ioannis 134
Kontopoulos, Efstratios 134
Kunneman, Florian A. 171
Leeuwis, Lotte 3
Levi, Maria 50
Liebrecht, Christine 34, 118
Linn, Annemiek J. 83
Martijn, Gabriëlla 118
Mathioudis, Evangelos 134
Michailidou, Christina 134
Niederer, Tom 190
Nordberg, Oda Elise 18
Schloss, Daniel 190
Silva, Susana C. 50
Stolwijk, Gina E.M. 171
van der Goot, Margot J. 83
van Hooijdonk, Charlotte 118