Pragmatic multimodality: Effects of nonverbal cues of focus and certainty in a virtual human



Introduction
When humans communicate naturally, a lot more is transferred than just the semantic content. The meaning of an utterance is enhanced by verbal pragmatic markers, but also by gestural and other nonverbal and paraverbal signals that help to classify the semantic content of the utterance [1]. Senders want to communicate their convictions, viewpoints, knowledge and attitudes, among others. These signals are not discourse related; they merely help the recipient to arrive at the interpretation intended by the sender. Recipients perceive those signals in addition to the semantic content and integrate everything into a congruent message. However, despite its prominence and importance, this meta-communication has not received much attention so far.
We define such signals as modal (pragmatic) functions (MPF), a sub-category of pragmatic functions. This notion is related to the modal functions of [3], which "seem to operate on a given unit of verbal discourse and show how it is to be interpreted" [p. 225] as, e.g., to "indicate what units are 'focal' for their arguments" [4, p. 276].
Few studies have looked at how modal functions get expressed in single modalities, most notably language, prosody and gesture. In language, lexical items like modal particles and discourse markers may mark "assumptions about the current speech event", "its evidential status", and "the speakers stance, attitude, emotional state" [5, p. 160]. However, different points of view exist on which markers highlight, understate or make an utterance uncertain. In the field of prosody, [6] examined various dimensions of pragmatic functions of prosodic features, among them duration, pitch height and loudness. This analysis forms the baseline for our modification in speech synthesis.
A recent summary of gestures [7] that take up pragmatic functions mentions various gestures with certain recurrent form features/gesture families and the corresponding pragmatic functions, e.g., away gestures [8, p. 1599] "to mark arguments, ideas, and actions as uninteresting and void". However, "no clear notion of pragmatic gesture is available, neither in the area of pragmatics nor in gesture studies" [7, p. 1536]. In previous work [9,10], we tried to address this shortcoming with a corpus-based approach to empirically obtain ratings of the "modifying functions" that people see in gestures.
In a first theoretical approach, [11, p. 540] tackled multimodal pragmatic markers and identified pragmatic events similar to the ones we will present, e.g. "calling attention", "uncertainty" and "uninterested". However, it is not clear how the different modalities jointly realise modal marking. In this work, we go one step further by selecting various features that we consider decisive, based on our natural human data and other findings, to jointly realise the intended modification. To our knowledge, no well-controlled study exists that analyses language, nonverbal and paraverbal cues in combination.
In other previous work, we have demonstrated that pragmatic marking, using speech and/or gesture, is also recognized in virtual humans (VH) [12]. This research is important for developing VH as easily accessible, acceptable, understandable and helpful communication partners, as well as for gaining more insights into natural human communication behaviour. With the present work, we build on this research and extend it in several ways. Our previous study showed that in most cases words indicating MPF (such as modal particles) together with gestures had a strong effect. Here we complement this by focusing on the contribution of other nonverbal and paraverbal cues in the expressive behaviour of the agent. That is, we aim to investigate the effects of gestures in combination with other nonverbal and paraverbal signals like facial expression and intonation. Additionally, two aspects ensured more natural and human-like stimuli: the enacted motion capture data itself was improved, and the post-processing of the motion capture data led to more unfiltered behaviour. Furthermore, based on experiences from our previous study, we made several improvements to the experimental design and procedure. The recognition of the MPF in gestural and nonverbal expressive behaviour, as well as the similarity with a prototypical gesture from our corpus data, were pre-tested (cf. Section 2: Nonverbal behaviour). Then, we developed a story with a well-defined information structure (cf. Section 3: Stimulus videos), ensuring that an MPF appears with the new information. And finally, we improved the MPF elicitation questions (cf. Section 4: MPF recognition) and the method for measuring content recall (cf. Section 4: Content recall).
In the present work, we ask how a combination of nonverbal and paraverbal cues alone (gesture, facial expression and prosody), that is, without using explicit keywords, can affect the listener's interpretation of a given utterance of a VH. All cues have previously been reported in studies on natural human-human interaction. We present a study investigating how this influences the uptake and recall of the information conveyed verbally by the VH, and how the overall perception of the VH is affected. We start by providing theoretical background and describe our experimental design in Section 2. In Section 3, we explain stimuli and procedure of the present study, before analysing and discussing the results in Section 4.

Theory and Experiment Design
We define MPF to have a focusing and an epistemic component, and possible attitudinal layers on top of these modifications. Qualitatively, focusing and epistemic functions can be either positive or negative:
- A positive focusing function (Foc+) puts importance and emphasis on a specific aspect of an utterance; it highlights or brings a piece of information to the addressee's attention.
- A negative focusing function (Foc-) marks unimportance, irrelevance and accessoriness; it moves a piece of information out of focus.
- A negative epistemic function (Epi-) indicates a speaker's uncertainty about the corresponding piece of information.
- A positive epistemic function (Epi+) corresponds to a competent speaker and is generally assumed to be the default.
In our previous work, we found evidence that speakers in particular use gestural or verbal cues to mark the first three functions (Foc+, Foc-, Epi-). We thus concentrate on them in the remainder of the paper. MPF are loosely related to the notion of information structure, which describes the way information is organized and distributed within a sentence's syntactic structure [13]. Information structure distinguishes between topic (or theme, what is talked about) and comment (or rheme, what is said about the topic). The notion of focus is used here to denote the grammatical means of indicating that some information is new or contrastive. Note that this differs from MPF, which are not defined in terms of utterance structure, but based on the speaker's mental state and her intentions to influence the recipient's interpretation of the utterance. In consequence, MPF are assumed to act more at the level of discourse units. For example, a Foc- function can well be placed on the utterance's rheme, e.g. when the speaker makes an utterance but wants to signal that its new information is not particularly important for the larger discourse. Likewise, verbal or nonverbal cues of the Epi- function could be added to the rheme to indicate uncertainty about this information. This analysis also hints at the importance of distinguishing different degrees of communicative intentionality in pragmatic marking. Adopting the notion of [14,15], we assume that Foc+ and Foc- functions are rather signalled or displayed, i.e. they are intentionally designed to be perceived by the recipient. On the contrary, the Epi- function is rather indicated or displayed, i.e. this function indicates the mental state of the sender and need not be designed specifically for the recipient. In the present study, we aim to test exactly this interpretation of nonverbal cues when produced by a VH.
Nonverbal behaviour. The gestural forms used for the respective MPF are: abstract deictic gestures [16][17][18] for Foc+, brushing gestures [19,20] for Foc- and Palm Up Open Hand gestures (PUOH) [21,3] for Epi-. This nonverbal behaviour has been extracted from previous analyses of natural human interaction data [9,12]. The gestures are accompanied by certain body, shoulder and head movements (as depicted in Table 1). These findings and this categorisation of features form the baseline for the motion capture recordings carried out specifically for this study. We put a lot of effort into the choice of gestures as well as the recording and post-processing of the motion capture data. The nonverbal behaviour was re-enacted and recorded with a sixteen-camera OptiTrack motion capture system. From a large corpus of gesture recordings, we chose nine recordings for each function, which we perceived to carry the respective function best and to fit the gestures in our natural human interaction data. The total of 27 motion capture recordings were tested by four participants (2 female, 2 male) in a pre-study. We asked two questions: whether a gesture fulfils the particular pragmatic function, and whether the re-enacted gesture fits the original human gesture, which the participants saw in a video at the beginning of the study. The results of the first question were weighted as most important. We used only those gestures whose function was recognised by at least three of the four participants and which at least two participants (in most cases three or four) matched with the natural gesture in the video. This resulted in six gestures for the Foc+ function, seven gestures for the Foc- function and three gestures for the Epi- function. Since we obtained only a few gestures carrying the Epi- function from the pre-study, we added one gesture to the final corpus of recordings, which was not rated beforehand. The post-processing steps included adding hand shapes designed with the MURML Keyframe editor [22] and merging them with the motion capture data, which was additionally filtered for errors. For proper agent animation, facial expressions, lip movements and blinking were added by modifying the blend shapes. In all conditions, the VH's behaviour was steered by AsapRealizer [23] and its speech was synthesised by the text-to-speech system CereProc with the female voice Gudrun.
Pragmatic multimodality. The aim of this study was to test the multimodality of non-speech pragmatic modification. That is, in addition to using nonverbal markers like gesture, body, head and shoulder movements, as well as facial expressions, we added synthetic paraverbal markers to the stimulus videos. These include variations in speech rate, pitch, loudness and added pauses (speech synthesis). In a corpus-based approach, [6] analysed prosodic pragmatic functions and found, e.g., that loudness correlates with importance and confidence, relevant for Foc+/Foc-, and that the duration of an utterance mirrors the amount of thought a sender needs to express it, indicating the degree of certainty, which is relevant for Epi-/Foc+. Our prosodic modifications were based on this analysis, as testing the paraverbal effects in separate conditions was not feasible within this work. Table 1 summarises all nonverbal and paraverbal features that we used to replicate the natural human behaviour with MPF. When interpreting the results, it should be kept in mind that the modifications of the stimuli are multimodal and do not solely consist of gestural behaviour. Few approaches have dealt with pragmatic multimodal modification. [24] designed 'believable and expressive' VH with tightly synchronised verbal and nonverbal signals (facial expressions) to mark certainty, topic and emotion, among others. Also, [25] worked on nonverbal behaviour in virtual humans and summarizes head movements and facial expressions which "mark uncertain statements" and "emphasise a particular conversation point." [26] looked at the correspondence between tilts and nods and prosodic features to create "intentions, attitudes and emotions" and summarizes that the "emphasis of a word often goes along with head nodding" and that a greater variation of facial parameters accompanies a focal accent. However, to our knowledge, a full approach to multimodal pragmatic marking in VH has not been presented so far.
Hypotheses. As dependent variables, we asked whether the VH was perceived to modify its narrative as emphasising, de-emphasising or as being uncertain, in order to check whether the participants recognised the underlying MPF. Secondly, we captured the content recall with a cloze text. And finally, we measured the VH's likeability, competence and human-likeness. Based on our own previous work, our hypotheses were that the VH is perceived as most emphasising and that the content recall would be best in the Foc+-condition. Additionally, we expected the Epi-- and Foc--conditions to yield the lowest recall, with the effect being stronger in the latter condition due to explicit de-emphasising.
Stimulus videos. We developed a story which the VH narrated to the participants during the study. The story consists of 28 sentences, is about the VH and its life at its research institute, and includes many technical terms; every sentence was designed to have a theme and a rheme. According to the definition above, we placed one multimodal modification on one or more words in the rheme, and thus on the new part of the sentence. Based on the definition of MPF, we were interested in three main conditions: Foc+, Foc- and Epi-. Additionally, we surveyed a neutral condition (N) and a mixed condition (M). The latter condition contains pragmatic cues of all three main functions and is designed to test whether the MPF can be used when they are collocated within a cohesive narrative. We were curious whether MPF are still perceived and understood independently or whether there are interdependences between the pragmatic cues. This results in five conditions. The N-condition included idle motion capture behaviour, which consists of subtle arm and head movements. In all other conditions, the four to seven recorded nonverbal behaviours from the pre-study (cf. Section 2: Nonverbal behaviour) were equally distributed across the narrative. The M-condition contained markers of all MPF, so we had to make sure that the story of the VH was coherent, i.e., that the VH emphasises, downtones or is uncertain about the same instances throughout the story. Since this requirement was decisive for which MPF was assigned to which sentence, we could not balance the order of MPF strictly.
All but the M-condition were carried out in a between-subjects design and, thus, participants were exposed to only one condition. Stills of the final stimulus videos are depicted in Figure 1. All five videos were about three minutes long.
Perception study. The study was carried out on seven days between April 7th and 21st, 2017. 56 university students (28 male, 28 female), on average 24.7 years old, took part in the study. They were randomly distributed across conditions. Twelve participants took part in the N-condition and eleven each in the other four conditions. The duration of the study was about 25 minutes and the students were reimbursed with 3 Euros and chocolate. They saw a video recording of a narration by our VH, followed by a questionnaire. The whole study was coded in a SoSci Survey [27] questionnaire. As we aimed at a high felt presence of the VH, the study was conducted on a vertically positioned screen of the size 143 x 81 cm (a diagonal of 164 cm) and the participants were seated on a chair about 120 cm in front of the screen. The VH had a size of 66 x 30 cm, which ensured good visibility. Figure 2 depicts the study setup. We conducted several analyses on the recognition of MPF, on the recalled content of the VH's short story and on the perception of the VH. Responses were captured using 5-point Likert scales (5="entirely the case" to 1="not at all the case") in all but the demographic questions. The results were calculated using SPSS and Microsoft Excel.
These MPF-items were merged onto three corresponding scales, justified by Cronbach's Alpha values being above 0.7 for each MPF-scale: Foc+: α=.74, Foc-: α=.85 and Epi-: α=.77. In a MANOVA using Pillai's trace, the overall effect of the stimulus video on the perceived MPF narrowly missed significance, V=0.36, F(12,153)=1.76, p=.06. Individual ANOVAs on each of the dependent variables, however, suggest that there is a significant effect for Foc+, F(4,51)=2.64, p=.044, but none for Foc- and Epi-. Pairwise comparisons for Foc+ with 95% confidence intervals (CI), using independent-samples t-tests for normally distributed data meeting equality of variances, narrowly missed significance after Bonferroni correction for four tests, both for Foc+ versus N (t(21)=2.502, p=.084, CI=[0.11;1.14]) and for Foc+ versus Epi- (t(20)=2.529, p=.080, CI=[0.09;0.97]). Since we found significant results on the Foc+-scale in the quantitative analysis, we turned to a descriptive analysis to shed more light on the underlying trends.
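The Bonferroni step can be illustrated with a minimal sketch. It assumes the simplest form of the correction (multiplying each uncorrected p-value by the number of tests and capping at 1). The uncorrected inputs below are not reported in the text; they are back-calculated by dividing the reported corrected values by four and should be read as an assumption. The function name `bonferroni_adjust` is hypothetical.

```python
def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Return the Bonferroni-adjusted p-value, capped at 1.0."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return min(1.0, p_value * n_tests)


# Assumed uncorrected p-values (back-calculated, not taken from the paper).
assumed_raw = {"Foc+ vs N": 0.021, "Foc+ vs Epi-": 0.020}

for comparison, p_raw in assumed_raw.items():
    p_adj = bonferroni_adjust(p_raw, n_tests=4)
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{comparison}: raw p={p_raw:.3f}, adjusted p={p_adj:.3f} ({verdict})")
```

Under these assumed inputs, both comparisons would pass the .05 threshold before correction and miss it afterwards, which matches the pattern described above.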
Medians and standard deviations of all five conditions are depicted for each scale of MPF-questions separately in Figure 3. In general, a strong perception of Foc+ is apparent: the VH has been rated as very emphasising in all conditions (m=4.2, σ=0.6). At the same time, the VH was rather not perceived as downtoning (Foc-: m=1.6, σ=0.7) and uncertain (Epi-: m=1.2, σ=0.5). Supporting our hypothesis, the VH was perceived as most emphasising in the Foc+-condition on the Foc+-scale (m=4.3, σ=0.5). This result is closely followed by Foc- and the mixed condition M. Since the comparison between the three main conditions is most meaningful, we note that there is a small difference between Foc+ and Foc-, and even more so between Foc+ and Epi-. Interestingly, the VH was not perceived as most downtoning in the Foc--condition on the Foc--scale (m=1.4, σ=1.0), compared to both the Epi-- and the M-conditions (m=1.8, σ=0.7). The result of this Epi--condition shows that the VH was perceived as rather uncertain than downtoning. Additionally, the VH was perceived as less emphasising (Foc+-condition: m=1.2, σ=0.4). Finally, the VH was perceived as most uncertain in the Epi--condition on the Epi--scale (m=1.4, σ=0.8). This result is followed by the Foc+- and the Foc--conditions (m=1.2, σ=0.5/0.4).
Content recall. We were interested in whether there is a difference between conditions regarding how much of the content of the VH's narrative the participants recalled. Since it has been shown that iconic gestures improve memory performance [28], we hypothesise that Foc+ pragmatic nonverbal behaviour can be helpful as well, whereas content marked with Foc- and Epi- will not benefit. We chose a cloze text to measure true recall, instead of the multiple choice questions that we used in our previous study. The participants were not told to memorise the story beforehand and the task turned out to be quite challenging for them. In the following, we present differences between the conditions. The cloze text asked for 35 individual answers, targeting exactly those items to which the multimodal cues had been added. There are two exceptions: one enumeration of seven items and one pair of items; in each case only the first word was accompanied by a cue. As mentioned before, the N-condition was not modified by any cues. The best recall was achieved in the conditions N with m=9.75/µ=9.29 and Foc+ with m=9.5/µ=8.82 correct answers (out of 35), followed by Epi- (m=7/µ=9.73), Foc- (m=7/µ=9.32) and M (m=7/µ=7.82). Since the N-condition ranks ahead of Foc+, we cannot verify our hypothesis from these results alone. However, taking a closer look at the data from the M-condition, we obtained interesting results. We analysed whether adding an MPF cue to specific items is beneficial for the recall of these items. MPF cues were added to specific items in, e.g., the M-condition, which was our starting point. In this condition, we divided the items that were to be recalled into three categories. That is, the items that received a Foc+ MPF in the mixed condition are called Foc+-items; Foc--items and Epi--items are defined analogously. We then compare the N-condition, which is our baseline, to the M-condition. For each of the conditions and for each participant separately, we considered the overall number of recalled items and determined the share of Foc+-, Foc-- and Epi--items. We then calculated the average over all participants. The procedure of calculating average recall shares and our recall study results are visualized in Figure 4 and Figure 5, respectively. For a clean analysis, the two exceptions mentioned above were left out.
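The share calculation described above can be sketched as follows. This is a minimal illustration, not the authors' analysis script: the per-participant counts are invented for the example, the function names are hypothetical, and skipping participants who recalled none of the categorised items is an assumption the text does not address.

```python
from statistics import mean

CATEGORIES = ("Foc+", "Foc-", "Epi-")


def recall_shares(recalled_counts):
    """Share of each item category among one participant's recalled items.

    Returns None if the participant recalled no categorised item at all
    (assumption: such participants are left out of the average).
    """
    total = sum(recalled_counts[c] for c in CATEGORIES)
    if total == 0:
        return None
    return {c: recalled_counts[c] / total for c in CATEGORIES}


def average_recall_shares(participants):
    """Average the per-participant shares over all participants of a condition."""
    shares = [s for s in map(recall_shares, participants) if s is not None]
    if not shares:
        raise ValueError("no participant recalled any item")
    return {c: mean(s[c] for s in shares) for c in CATEGORIES}


# Hypothetical counts of correctly recalled items per category.
condition_m = [
    {"Foc+": 2, "Foc-": 2, "Epi-": 0},
    {"Foc+": 1, "Foc-": 2, "Epi-": 1},
]

for category, share in average_recall_shares(condition_m).items():
    print(f"{category}: {share:.1%}")
```

Because each participant's shares sum to one, the averaged shares of a condition also sum to one, so a higher share for one category necessarily comes at the expense of the others.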
In the N-condition, the average recall share of Foc+-items was 36.2%, whereas this share was 39.5% for the M-condition. Hence, adding the Foc+ MPF resulted in a 9.0% increase of the average recall share of Foc+-items. In contrast, the average recall share of Foc--items was 39.7% for the N-condition and 33.2% for the M-condition. Adding the Foc- MPF hence yielded a 16.3% decrease of this share. Finally, the average recall share of Epi--items was 24.0% for the N-condition and 27.3% for the M-condition. Adding the Epi- MPF thus resulted in a 13.5% increase of this share.
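The increases and decreases above are relative changes with respect to the N-condition baseline. The sketch below recomputes them from the rounded shares given in the text; because the inputs are rounded to one decimal, the results (about 9.1%, -16.4% and 13.75%) deviate slightly from the reported 9.0%, 16.3% and 13.5%, which were presumably derived from unrounded values. The function name `relative_change` is hypothetical.

```python
def relative_change(baseline: float, treated: float) -> float:
    """Relative change from baseline to treated, in percent of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (treated - baseline) / baseline * 100.0


# Average recall shares in percent: (N-condition, M-condition), as reported.
reported_shares = {
    "Foc+": (36.2, 39.5),
    "Foc-": (39.7, 33.2),
    "Epi-": (24.0, 27.3),
}

for category, (share_n, share_m) in reported_shares.items():
    change = relative_change(share_n, share_m)
    print(f"{category}: {change:+.2f}% relative to the N-condition")
```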
Overall, these results give evidence that the Foc+ and Epi- MPF increased the recall and the Foc- MPF decreased the recall. That is, we can identify a tendency that items supported with a Foc- MPF were regarded as irrelevant and thus recalled less often. However, we could not directly confirm that items supported by a Foc+ MPF led to a better memorisation, partly because we got the same result for Epi--items, where we would not have expected a recall increase.
VH perception. In a third analysis, we evaluated whether the VH was perceived as likeable, competent and human-like. We adopted a design by [29], in that 18 adjectives were merged onto three scales, justified by Cronbach's Alpha values being above 0.7 for each VH perception-scale. Items for likeability (α=.87) are pleasant, sensitive, friendly, likeable, affable, approachable and sociable; items for competence (α=.78) are dedicated, trustworthy, thorough, helpful, intelligent, organized and expert; and items for human-likeness (α=.82) are active, humanlike, fun-loving and lively.
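For readers unfamiliar with the reliability criterion used for merging items onto scales, the following is a minimal sketch of the standard Cronbach's Alpha formula, α = k/(k-1) · (1 - Σ item variances / variance of the summed score), with k the number of items. The analyses in this paper were run in a statistics package; the function name and the ratings below are hypothetical and serve only to show the computation.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's Alpha for a scale.

    item_scores: one list per item, each holding one rating per participant
    (all lists in the same participant order).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(item_scores[0])
    if n < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # Summed scale score per participant.
    totals = [sum(ratings) for ratings in zip(*item_scores)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("summed scores have zero variance")

    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical ratings of five participants on three items of one scale.
example_scale = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 2, 4],
]
print(f"alpha = {cronbach_alpha(example_scale):.2f}")
```

Values above 0.7, as reported for all scales here, are conventionally taken to justify combining the items into a single score.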
The human-likeness-scale shows that the VH is perceived as much more human-like in the Foc--condition than in the others, particularly in contrast to the Epi--condition. This trend is clearly visible in the box plot. However, using an independent-samples t-test, we could not find significant differences between the conditions. The neutral condition is perceived as least human-like.

Conclusion
In this paper we have presented a study on how a VH can employ nonverbal cues in gesture, facial expression, head movement, or prosody to convey modal (pragmatic) information to mark focus and uncertainty in its utterances. In all conditions, the VH was perceived as very competent and as saying something important.
Concerning the different MPF, we can draw several conclusions: First, supporting our hypothesis, Foc+ was recognized best from the multimodal cues. This led to higher content recall (together with the N-condition). Second, cues of Epi- had similar but much weaker effects. Finally, Foc- was not well recognized but led to the lowest content recall, again supporting our hypothesis. Interestingly, this also led to the perception that the VH was most human-like and likeable (the latter together with the M-condition), suggesting that the participants perceive "downtoning" behaviour positively, possibly because it makes the VH appear less superior. The M-condition tested whether we can use several MPF within one story. We showed that when Foc- MPF were added to certain words in this condition, the content of these words was recalled less. Another anecdotal effect was that the VH was perceived as having a specific personality by producing various pragmatic cues and thus appeared more natural overall. Note, however, that the present study had three biases. First, the gestures were a critical factor: whenever a gesture and complementary nonverbal behaviour were carried out by the VH, there was a lot of movement, and the participants may have been induced to believe that the utterance is relevant, no matter which MPF was carried out by the agent. This could explain why the VH was rated as rather communicating something important (Foc+) across all conditions. And since machines are generally conceived as either functioning well or not at all, it may be difficult for participants to accept that a machine does not know something (Epi-). Second, the narrative was highly technical, possibly leading to the result that the VH has been perceived as very competent in all conditions, since it was so well-informed. And finally, the pre-test already indicated that the nonverbal behaviour in the Epi--condition was not as clear-cut as in the other conditions. In the final stimulus, the VH's Epi--gestures could be interpreted as presenting something (important), since the gestures were partly performed rather high in front of the body. This may have resulted in a focusing effect leading to better recall. These biases might limit the generalizability of the results, as participants may not have been as receptive to the nonverbal cues, which were at least partially quite subtle.
We introduced (lack of) focus and uncertainty into an expressive artificial system. It is a bigger question driving our research whether the impression that a VH possesses and affects such mental qualities in communication affects human-VH interaction. The present work contributes to this by showing that pragmatic multimodality can be used in VH and that MPF do have an effect. Since the nonverbal behaviour is elicited from natural human data, these results are not only helpful for the field of designing virtual agents, but also provide insights into human behaviour. The next step will be to combine the nonverbal and paraverbal cues explored here with explicit verbal markers (e.g., modal particles) in a systematic model to express pragmatic information gradually and autonomously.

Fig. 1. Stills of the VH carrying out the three MPF in the various conditions. In the N-condition, the VH showed only "idle" behaviour.

Fig. 2. The setup of the perception study. The photo is staged with a person that was not a participant.

Fig. 3. Three scales with items measuring whether the MPF of the respective condition was recognized.

Fig. 4. An example of how we calculated average recall shares: Each box represents (a) recalled word(s) in a sentence. In condition M a nonverbal modification was placed on (a) particular word(s) and in the N-condition there was no modification.

Fig. 5. Recall shares of recalled items over all participants.

Fig. 6. Three scales of items measuring the perception of the VH.

Table 1. Nonverbal and paraverbal features modified in our VH for the stimulus videos in the three main conditions.
MPF recognition. To measure whether participants would recognise what kind of pragmatic functions the VH nonverbally and paraverbally expresses during its narrative, we developed a set of questions which we expect to capture the meaning of the pragmatic functions. Foc+ questions surveyed whether the VH wanted to put into focus what it said, to underline, emphasize and stress what it said, to express that what it said is important for the participant, and whether the VH was confident. Foc- items accounted for contrary impressions, namely, whether the VH wanted to put what it said out of focus, discount what it said, and express that what it said is unimportant, irrelevant and negligible. Finally, the negative Epi- items asked whether the VH was unsure about what it said and vague in its expressions. This was done in conjunction with questions formulated in a negative manner (the VH knows what it is talking about and it knows the topic well), whose results we converted into the opposite value. A separate document with the exact wording of the MPF-items (in German and English) can be accessed online.
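Converting a rating into its opposite value on a 5-point scale can be sketched as below. This is an illustration under assumptions: the item keys are hypothetical shorthand for the questions described above, and averaging the items into a scale score is one common way of merging them that is not confirmed by the text.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # 1 = "not at all the case", 5 = "entirely the case"


def reverse_code(rating: int) -> int:
    """Map a rating onto its opposite value (5 -> 1, 4 -> 2, 3 -> 3, ...)."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating must lie between {SCALE_MIN} and {SCALE_MAX}")
    return SCALE_MIN + SCALE_MAX - rating


def scale_score(ratings, reversed_items):
    """Mean rating of one participant after reverse-coding the listed items."""
    adjusted = [
        reverse_code(value) if item in reversed_items else value
        for item, value in ratings.items()
    ]
    return sum(adjusted) / len(adjusted)


# Hypothetical answers of one participant to the Epi- items.
epi_ratings = {
    "unsure": 1,
    "vague": 2,
    "knows_what_it_talks_about": 5,  # negatively formulated with respect to Epi-
    "knows_topic": 4,                # negatively formulated with respect to Epi-
}
epi_reversed = {"knows_what_it_talks_about", "knows_topic"}

print(scale_score(epi_ratings, epi_reversed))
```

After conversion, a high value on every item points in the same direction (here: more perceived uncertainty), which is a precondition for merging the items onto one scale.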