Article Review Assignment


PSYC512

Article Review Grading Rubric

Criteria and Levels of Achievement

Content (70%)

Content

· Advanced (14 points): The paper meets or exceeds the content requirements. The review contains: Title Page: title and authors of the article; Purpose: why the article was written (introduction) and what it attempts to find or answer (hypothesis section); Method: how the article answers the question or questions it proposes (method section); Results/Discussion: what the article found (results) and what the results actually mean (discussion).

· Proficient (12 to 13 points): The paper meets most of the content requirements listed above.

· Developing (1 to 11 points): The paper meets some of the content requirements listed above.

· Not present (0 points): Not present.

Structure (30%)

Format and Word Count

· Advanced (6 points): The paper meets or exceeds the structure requirements: proper spelling and grammar are used, and the summary is at least 350 words.

· Proficient (5 points): The paper meets most of the structure requirements.

· Developing (1 to 4 points): The paper meets some of the structure requirements.

· Not present (0 points): Not present.

PSYC 512

Article Review Assignment Instructions

Overview

Reading and understanding original research is an important skill for working in the field of psychology. Understanding research methodology and the sections of a journal article is critical for success in our field. This Article Review Assignment will help you learn to objectively evaluate research, to find scholarly sources of information, and to use them as a source of knowledge. This Article Review Assignment can also help you in your professional development.

These Article Review Assignments are designed to help you remember the most important aspects of each article. By the end, you will have five article summaries on social psychological research that can help you both in this course and in future research and coursework.

Instructions

Over several modules, you will complete five Article Review Assignments that relate to the following topics:

· Social perception

· Stereotypes, prejudice, and discrimination

· Group processes

· Close relationships

· Aggression

In each Article Review Assignment, you will find and learn about research that relates to one of these topics. To find these articles, you can search Google Scholar, one of the library’s psychology databases (e.g., PsycINFO), or look in a specific journal (e.g., Journal of Applied Psychology).

Note: Do not use the journal articles in the Learn sections for this assignment.

Once you have chosen an article that relates to the topic, summarize the article in at least 350 words.

Your Article Review Assignments should include the following components:

· Introduction: Include general information about the article, including a very brief overview of the previous literature on the topic and the gap in the literature that demonstrates the need for this article.

· Hypothesis Section: What the article attempts to find out or answer.

· Method Section: How the article answers the question or questions it proposes.

· Results Section: What the article found.

· Practical Significance/Discussion: What the results actually mean.

· References Page: Title and authors of the article in current APA format.

Be careful to ensure that your answers to the above make sense to you. You want to develop the skill of making complex academic information easy for non-academic readers to understand. Explain any complex ideas in plain language, and do not assume the reader already knows what you are talking about. Summarize these articles succinctly yet thoroughly.

Refer to the Article Review Template for guidance on this Article Review Assignment.

Make sure to check the Article Review Grading Rubric before beginning this Article Review Assignment.

Note: Your assignment will be checked for originality via the Turnitin plagiarism tool.


Journal Article Summary

Social Psychology Article

Stu D. Name

Department of Psychology, Liberty University

PSY 512: Social Psychology

Dr. Wood

July 16, 2020

Journal Article Summary
Social Psychology Article

Introduction

List the article introduction information here.

Purpose

State the purpose for which this article was written.

Hypothesis

What question or questions does this paper try to answer, and what contribution does it make?

Methodology

Sample

Describe the sample of this study.

Measures

Describe the measures that were used in this study.

Procedures

Describe how this study was done.

Results

What did this study find? You can include both stats and an explanation of the stats.

Practical Significance

Why is this study relevant/meaningful?

References

Haney, C., Banks, C., & Zimbardo, P. (1973). A study of prisoners and guards in a simulated prison. Naval Research Reviews, 9(1), 1-17.


Friend Networking Sites and Their Relationship to Adolescents’ Well-Being and Social Self-Esteem

PATTI M. VALKENBURG, Ph.D., JOCHEN PETER, Ph.D., and ALEXANDER P. SCHOUTEN, M.A.

Amsterdam School of Communications Research (ASCoR), University of Amsterdam, Amsterdam, The Netherlands

CyberPsychology & Behavior, Volume 9, Number 5, 2006. DOI: 10.1089/cpb.2006.9.584. © Mary Ann Liebert, Inc.

ABSTRACT

The aim of this study was to investigate the consequences of friend networking sites (e.g., Friendster, MySpace) for adolescents’ self-esteem and well-being. We conducted a survey among 881 adolescents (10–19-year-olds) who had an online profile on a Dutch friend networking site. Using structural equation modeling, we found that the frequency with which adolescents used the site had an indirect effect on their social self-esteem and well-being. The use of the friend networking site stimulated the number of relationships formed on the site, the frequency with which adolescents received feedback on their profiles, and the tone (i.e., positive vs. negative) of this feedback. Positive feedback on the profiles enhanced adolescents’ social self-esteem and well-being, whereas negative feedback decreased their self-esteem and well-being.

INTRODUCTION

The opportunities for adolescents to form and maintain relationships on the Internet have multiplied in the past few years. Social networking sites have rapidly gained prominence as venues to relationship formation. Social networking sites vary in the types of relationships they focus on. There are dating sites, such as Match.com, whose primary aim is to help people find a partner. There are common interest networking sites, such as Bookcrossing.com, whose aim is to bring people with similar interests together. And there are friend networking sites, such as Friendster and MySpace, whose primary aim is to encourage members to establish and maintain a network of friends.

The goal of this study is to investigate the consequences of friend networking sites for adolescents’ social self-esteem and well-being. Given the recent worldwide proliferation of such sites and the ever-expanding numbers of adolescents joining up, these sites presumably play an integral role in adolescent life. Friend networking sites are usually open or semi-open systems. Everyone is welcome to join, but new members have to register, and sometimes the sites only allow members if they are invited by existing members. Members of the sites present themselves to others through an online profile, which contains self-descriptions (e.g., demographics, interests) and one or more pictures. Members organize their contacts by giving and receiving feedback on one another’s profiles.

Although friend networking sites have become tremendously popular among adolescents, there is as yet no research that specifically focuses on the uses and consequences of such sites. This is remarkable because friend networking sites lend themselves exceptionally well to the investigation of the social consequences of Internet communication. After all, peer acceptance and interpersonal feedback on the self, both important features of friend network sites, are vital predictors of social self-esteem and well-being in adolescence.1 Therefore, if the Internet has the potential to influence adolescents’ social self-esteem and well-being, it is likely to occur via their use of friend networking sites.

There is no period in which evaluations regarding the self are as likely to affect self-esteem and well-being as in adolescence.1 Especially early and middle adolescence is characterized by an increased focus on the self. Adolescents often engage in what has been referred to as “imaginative audience behavior”2: they tend to overestimate the extent to which others are watching and evaluating and, as a result, can be extremely preoccupied with how they appear in the eyes of others. On friend networking sites, interpersonal feedback is often publicly available to all other members of the site. Such public evaluations are particularly likely to affect the development of adolescents’ social self-esteem.1 In this study, social self-esteem is defined as adolescents’ evaluation of their self-worth or satisfaction with three dimensions of their selves: physical appearance, romantic attractiveness, and the ability to form and maintain close friendships. Well-being refers to a judgment of one’s satisfaction with life as a whole.3

Our study is conducted in the Netherlands where, since April 2000, a friend networking site exists that is primarily used by adolescents. In May 2006, this website, named CU2 (“See You Too”), contained 415,000 profiles of 10–19-year-olds. Considering that the Netherlands counts about 1.9 million adolescents in this age group, approximately 22% of Dutch adolescents use this website to form and maintain their social network.

Internet use, well-being, and self-esteem

Ever since Internet use became common as a leisure activity, researchers have been interested in investigating its consequences for well-being and self-esteem. For both well-being and self-esteem, the literature has yielded mixed results. Some studies reported negative relationships with various types of Internet use,4,5 other studies found positive relationships,6 and yet other studies found no significant relationships.7,8

Two reasons may account for the inconsistent findings on the relationships between Internet use, self-esteem, and well-being. First, many studies have treated the independent variable ‘Internet use’ as a one-dimensional construct. Some studies did investigate the differential effects of types of Internet use, but the selection of these types usually did not follow from a theoretical anticipation of their consequences for self-esteem and well-being. In our view, at least a distinction between social and non-social Internet use is required to adequately investigate Internet effects on self-esteem and well-being. We believe that social self-esteem and well-being are more likely to be affected if the Internet is used for communication than for information seeking. After all, feedback on the self and peer involvement, both important precursors of self-esteem and well-being, are more likely to occur during online communication than during online information seeking.

A second shortcoming in earlier studies is that many authors did not specify how Internet use could be related to self-esteem and well-being. Most research has focused on main effects of Internet use on either self-esteem or well-being. None of these studies have considered models in which the influence of Internet use on self-esteem and well-being is considered simultaneously. By modeling the relationships of Internet use with both self-esteem and well-being, a more comprehensive set of hypotheses can be evaluated, which may clarify some of the contradictory findings in previous studies.

Our research hypotheses modeled

It has repeatedly been shown that adolescents’ self-esteem is strongly related to their well-being. Although the literature has not clearly established causation, most self-esteem theorists believe that self-esteem is the cause and well-being the effect.9 Based on these theories, we hypothesize that social self-esteem will predict well-being, and by doing so, it may act as a mediator between the use of friend networking sites and well-being. After all, if the goal of friend networking sites is to encourage participants to form relationships and to comment on one another’s appearance and personality, it is likely that the use of such sites will affect the dimensions of self-esteem that are related to these activities. The hypothesis that adolescents’ social self-esteem predicts their well-being is modeled in Figure 1 by means of path H1.

We also hypothesize that the use of friend networking sites will increase the chance that adolescents (a) form relationships on those sites (path H2a), and (b) receive reactions on their profiles (path H3a). After all, if the aim of using friend networking sites is to meet new people and to give and receive feedback, it is plausible that the more these sites are used, the more friends and feedback a member gets. As Figure 1 shows, we do not hypothesize that the use of friend networking sites will directly influence the tone of reactions to the profiles because the mere use of such a site cannot be assumed to influence the tone of reactions to the profiles. However, we do hypothesize an indirect relationship between use of friend network sites and the tone of the reactions via the frequency of reactions that adolescents receive (paths H3a and H5). In a recent study on the use of dating sites, members of the site often modified their profile based on the feedback they received. By means of a process of trial and error, they were able to optimize their profile, and, by doing so, optimize the feedback they received.10 We therefore assume that the more reactions adolescents receive to their profiles, the more positive these reactions will become (path H5). We also assume that the more reactions adolescents receive the more relationships they will form (path H6).

We not only assume that adolescents’ social self-esteem mediates the relationship between the use of friend networking sites and their well-being; we also hypothesize that the relationships between the use of friend networking sites and adolescents’ social self-esteem will be mediated by three types of reinforcement processes that are common on friend network sites and that have been shown to affect adolescents’ social self-esteem.1 These reinforcement processes are: (a) the number of relationships formed through the friend network site, (b) the frequency of feedback that adolescents receive on their profiles (e.g., on their appearance and self-descriptions), and (c) the tone (i.e., positive vs. negative) of this feedback. Our hypotheses about these mediated influences are modeled by means of paths H2a-b, H3a-b, and H4 in Figure 1.

We expect that for most adolescents the use of friend networking sites will be positively related to their social self-esteem. We base this view on theories of self-esteem, which assume that human beings have a universal desire to protect and enhance their self-esteem.11 Following these theories, we believe that adolescents would avoid friend networking sites if these sites were to negatively impact their social self-esteem. Friend networking sites provide adolescents with more opportunities than face-to-face situations to enhance their social self-esteem. These sites provide a great deal of freedom to choose interactions. In comparison to face-to-face situations, participants can usually more easily eliminate undesirable encounters or feedback and focus entirely on the positive experiences, thereby enhancing their social self-esteem.

However, if, by contrast, an adolescent for any reason is mostly involved in negative interactions on these sites, an adverse influence on his or her social self-esteem seems plausible. Especially because reactions to the profiles are made public to other members of the site, negative reactions are likely to have a negative influence on adolescents’ social self-esteem. We therefore hypothesize that a positive tone of reactions will positively predict social self-esteem, whereas a negative tone will negatively predict social self-esteem.

FIG. 1. Hypothesized model on the relationships among use of friend networking site, social self-esteem, and well-being. (The figure shows paths H1–H6 linking use of site, frequency of reactions, tone of reactions, relationships formed, social self-esteem, and well-being.)

METHODS

Sample and procedure

We conducted an online survey among 881 Dutch adolescents between 10 and 19 years of age who had a profile on the friend networking site CU2 (“See You Too”); 45% were boys and 55% were girls (M age = 14.8; SD = 2.7). A profile on CU2 includes demographic information, a description of the user and his or her interests, and one or more pictures. Reactions of other CU2 users to the profiles are listed at the bottom of each profile (for more information, see www.cu2.nl).

Upon accessing their profile, members of the site received a pop-up screen with an invitation to participate in an online survey. The pop-up screen stated that the University of Amsterdam conducted the survey in collaboration with CU2. The adolescents were informed that their participation would be voluntary, that they could stop with the questionnaire whenever they wished, and that their responses would be anonymous.

Measures

Use of friend networking site. We used three items measuring the frequency, rate, and intensity of the use of the friend networking site: (a) “How many days per week do you usually visit the CU2 site?”, (b) “On a typical day, how many times do you visit the CU2 site?”, and (c) “If you visit CU2, how long do you usually stay on the site?” The first two items required open-ended responses. Response categories for the third item ranged from 1 (about 10 min) to 7 (more than an hour). Responses to the three items were standardized. The standardized items resulted in a Cronbach’s alpha of 0.61.

Frequency of reactions to profiles. The number of reactions to the profiles was measured by two items: “How often do you get reactions to your profile from unknown persons,” and “How often do you get reactions to your profile from people you only know through the Internet?” Response categories to the items ranged from 1 (never) to 5 (very often). Responses to these two items were averaged, and resulted in a Cronbach’s alpha of 0.72.

Tone of reactions to profiles. The tone of the reactions to the profiles was measured with the following two questions: “The reactions that I receive on my profile are . . .” and “The reactions that I receive on what I tell about my friends are . . .” Response categories ranged from 1 (always negative) to 5 (always positive). Cronbach’s alpha was 0.87.

Relationships established through CU2. We asked respondents how often they had established (a) a friendship and (b) a romantic relationship through CU2. Response options were 0 (never), 1 (once), and 2 (more than once). The correlation between the two items was r = 0.34.

Social self-esteem. We used three subscales of Harter’s self-perception profile for adolescents12: the physical appearance subscale, the close friendship subscale, and the romantic appeal subscale. From each subscale we selected the four items with the highest factor loadings. Response categories for the items ranged from 1 (agree entirely) to 5 (disagree entirely). Cronbach’s alpha values were 0.91 for the physical appearance scale, 0.85 for the close friendship scale, and 0.81 for the romantic appeal scale.

Well-being. We used the five-item satisfaction with life scale developed by Diener et al.3 Response categories ranged from 1 (agree entirely) to 5 (disagree entirely). Cronbach’s alpha for the scale was 0.89.

Statistical analysis

The hypotheses in our study were investigated with the Structural Equation Modeling software AMOS 5.0.13
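Each scale above is summarized with Cronbach’s alpha. As a reference point, here is a minimal sketch of the standard alpha computation, α = (k/(k−1))·(1 − Σσ²_item/σ²_total); the helper function and the sample responses are hypothetical, not the study’s data or code:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) response matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)       # variance of each single item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses to two 1-5 Likert items (one row per respondent)
responses = np.array([[2, 3], [4, 4], [1, 2], [5, 4], [3, 3], [2, 2]])
print(round(cronbach_alpha(responses), 2))
```

For the site-use scale, the items would first be z-standardized (as the authors describe) before being passed to a function like this.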

RESULTS

Descriptive statistics

Adolescents visited the friend networking site on average three days a week (M = 3.09, SD = 2.07). When they visited the website, they stayed on the site for approximately a half hour. The average number of reactions that adolescents had received on their profiles was 25.31 (SD = 50.00), with a range from 0 to 350 reactions. The tone of the reactions varied significantly among adolescents. Of the adolescents who reported having received reactions to their profiles (n = 592), 5.6% indicated that these reactions had always been negative; 1.6% that they had predominantly been negative; 10.1% that they had sometimes been negative and sometimes positive; 49.3% that they had been predominantly positive; and 28.4% that they had always been positive. Thirty-five percent of the adolescents reported having established a friendship, and 8.4% reported having formed a romantic relationship through the friend networking site.

Zero-order correlations

Before testing our hypothesized model, we present a matrix showing the Pearson product-moment correlations between the variables included in the model (Table 1).

TABLE 1. PEARSON PRODUCT-MOMENT CORRELATIONS

Variables                                      1        2        3        4        5        6        7        8
1. Use of friend networking site
2. Frequency of reactions to profiles       0.16***
3. Tone of reactions to profiles            0.10*    0.24***
4. Close friends established via site       0.18***  0.31***  0.01
5. Romantic relations established via site  0.12***  0.12***  −0.13**  0.34***
6. Physical appearance self-esteem          0.04     0.05     0.29***  −0.00    −0.00
7. Close friendship self-esteem             0.12***  0.13***  0.40***  0.06     −0.05    0.61***
8. Romantic attractiveness self-esteem      0.06     0.16***  0.38***  0.08*    −0.00    0.68***  0.72***
9. Well-being                               0.06     0.07*    0.37***  −0.03    −0.01    0.59***  0.54***  0.45***

*p < 0.05. **p < 0.01. ***p < 0.001.
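A zero-order correlation matrix like Table 1 is straightforward to reproduce with standard tools. A minimal sketch, assuming construct scores have already been computed per respondent (the column names and values below are hypothetical, not the authors’ data):

```python
import pandas as pd
from scipy import stats

# Hypothetical aggregated construct scores, one row per respondent.
df = pd.DataFrame({
    "site_use": [0.2, -1.1, 0.5, 1.3, -0.4, 0.9],
    "reaction_freq": [3.0, 1.5, 3.5, 4.0, 2.0, 3.0],
    "reaction_tone": [4.0, 3.0, 4.5, 5.0, 3.5, 4.0],
    "well_being": [3.2, 2.5, 3.8, 4.1, 3.0, 3.6],
})

print(df.corr(method="pearson"))  # full Pearson correlation matrix

# Significance test for a single pair, as flagged by the asterisks in Table 1
r, p = stats.pearsonr(df["reaction_tone"], df["well_being"])
print(f"r = {r:.2f}, p = {p:.3f}")
```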

Testing the hypothesized model

The variables in our model were all modeled as latent constructs. The construct reflecting the use of the friend networking site was measured by three items and well-being by five items. The frequency of reactions to profiles, the tone of the reactions to profiles, and the number of relationships established by the site were each measured by two items. The latent construct social self-esteem was formed by the three subscales measuring physical appearance self-esteem, close friendship self-esteem, and romantic appeal self-esteem. For reasons of clarity, we do not present the measurement model (i.e., the factor-analytic models) in our graphical presentation of the results. However, all factor-analytic models led to adequate descriptions of the data. The factor loadings were all above 0.44.

To investigate our hypotheses, we proceeded in two steps. First, we tested whether the hypothesized model in Figure 1 fit the data. Then, we checked whether we could improve the model’s fit by adding or removing theoretically meaningful paths from the hypothesized model. We used three indices to evaluate the fit of our models: the χ²/df ratio, the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). An acceptable model fit is expressed in a χ²/df ratio of <3.0, a CFI value of >0.95, and an RMSEA value of <0.06.14,15

Our hypothesized model fit the data satisfactorily well: χ²/df ratio = 2.5; CFI = 0.96; RMSEA = 0.05. However, the results indicated that two paths assumed in our hypothesized model were not significant: path H2b from the number of relationships formed on the friend networking site to self-esteem, and path H3b from the frequency of reactions to the profile to self-esteem.

After removal of the two nonsignificant paths, we subjected our model to a final test. The modified model fit the data well: χ²/df ratio = 2.5; CFI = 0.98; RMSEA = 0.05. We therefore accepted the model as an adequate description of the data. Our final model indicates that all of our research hypotheses (i.e., those visualized by paths H1, H2a, H3a, H4, H5, and H6) were confirmed by the data. Figure 2 visualizes the observed final model. The reported coefficients are standardized betas.

The model controlled for age and gender

To test whether our final model also holds when age and gender are controlled for, we tested a model in which we allowed paths between age and gender and all of the remaining independent, mediating, and dependent variables in the model. This model again led to a satisfactory fit: χ²/df ratio = 2.6; CFI = 0.95; RMSEA = 0.05.
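Because the cutoffs quoted above are explicit (χ²/df < 3.0, CFI > 0.95, RMSEA < 0.06), checking a reported solution against them is mechanical. A small illustrative helper, plugging in the paper’s reported values; the function is just a sketch of the decision rule, not part of the authors’ AMOS analysis:

```python
def acceptable_fit(chi2_df: float, cfi: float, rmsea: float) -> bool:
    """Apply the fit cutoffs cited from Byrne (2001) and Kline (1998)."""
    return chi2_df < 3.0 and cfi > 0.95 and rmsea < 0.06

# Final (modified) model reported in the text
print(acceptable_fit(chi2_df=2.5, cfi=0.98, rmsea=0.05))  # True
```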

DISCUSSION

Our study was the first to show the conse-
quences of adolescents’ use of friend networking
sites for their social self-esteem and well-being.
Adolescents’ self-esteem was affected solely by the
tone of the feedback that adolescents received on
their profiles: Positive feedback enhanced adoles-
cents’ self-esteem, and negative feedback de-
creased their self-esteem. Most adolescents (78%)
always or predominantly received positive feed-
back on their profiles. For these adolescents, the
use of friend networking sites may be an effective
vehicle for enhancing their self-esteem.

However, a small percentage of adolescents (7%)
did predominantly or always receive negative feed-
back on their profiles. For those adolescents, the use
of friend networking sites resulted in aversive ef-
fects on their self-esteem. Follow-up research should
attempt to profile these adolescents. Earlier research
suggests that users of social networking sites are
quite able to learn how to optimize their self-presen-
tation through their profiles.10 Adolescents who pre-
dominantly receive negative feedback on their
profiles may especially be in need of mediation on
how to optimize their online self-presentation.

No less than 35% of the respondents reported
having established one or more friendships
through the site, and 8% one or more romantic rela-

tionships. However, as discussed, the number of
friendships and romantic relationship formed via
the site did not affect adolescents’ social self-
esteem. Obviously, it is not the sheer number of re-
lationships formed on the site that affect
adolescents’ social self-esteem. Research on adoles-
cent friendships suggests that the quality of friend-
ships and romantic relationships may be a stronger
predictor of social adjustment than the sheer num-
ber of such relationships.16 Therefore, future re-
search on friend networking sites should include
measures on the quality of the relationships formed
through friend networking sites.

Our study focused on a new and pervasive phe-
nomenon among adolescents: friend networking
sites. In the Netherlands, about one quarter of ado-
lescents is currently a member of one or more of
such sites. The Netherlands is at present at the fore-
front of Internet-based communication technologies
(e.g., 96% of Dutch 10–19-year olds have home ac-
cess to the Internet, and 90% use Instant Messaging).
Therefore, it is a unique spot to start investigating
the social consequences of such technologies. How-
ever, friend networking sites are a worldwide phe-
nomenon that attracts ever younger adolescents.
Such sites can no longer be ignored, neither by com-
munication researchers nor by educators.

REFERENCES

1. Harter, S. (1999). The construction of the self: a develop-
mental perspective. New York: Guilford Press.

FRIEND NETWORKING SITES, WELL-BEING, AND SELF-ESTEEM 589

Frequency
of reactions

Use of
site

Social
self-esteem Well-being.78

Relationships
formed

Tone of
reactions

.19
n.s.

.28 n.s.

.48

.29

.30

FIG. 2. Structural equations model of the relationships among use of friend networking site, social self-esteem, and
well-being. The ellipses represent latent constructs estimated from at least two observed variables; coefficients repre-
sent standardized betas significant at least at p < 0.01.

14337c11.pgs 10/10/06 2:44 PM Page 589

2. Elkind, D., & Bowen, R. (1979). Imaginary audience behavior in children and adolescents. Developmental Psychology 15:38–44.

3. Diener, E., Emmons, R.A., Larsen, R.J., et al. (1985). The satisfaction with life scale. Journal of Personality Assessment 49:71–75.

4. Kraut, R., Patterson, M., Lundmark, V., et al. (1998). Internet paradox: a social technology that reduces social involvement and psychological well-being? American Psychologist 53:1017–1031.

5. Rohall, D.E., & Cotton, S.R. (2002). Internet use and the self-concept: linking specific issues to global self-esteem. Current Research in Social Psychology 8:1–19.

6. Kraut, R., Kiesler, S., Boneva, B., et al. (2002). Internet paradox revisited. Journal of Social Issues 58:49–74.

7. Gross, E.F., Juvonen, J., & Gable, S.L. (2002). Internet use and well-being in adolescence. Journal of Social Issues 58:75–90.

8. Harman, J.P., Hansen, C.E., Cochran, M.E., et al. (2005). Liar, liar: Internet faking but not frequency of use affects social skills, self-esteem, social anxiety, and aggression. CyberPsychology & Behavior 8:1–6.

9. Baumeister, R.F., Campbell, J.D., Krueger, J.I., et al. (2003). Does high self-esteem cause better performance, interpersonal success, happiness, or healthier lifestyles? Psychological Science in the Public Interest 4:1–44.

10. Ellison, N.B., Heino, R., & Gibbs, J.L. (2006). Managing impressions online: self-presentation processes in the online dating environment. Journal of Computer-Mediated Communication 11(2): http://jcmc.indiana.edu/vol11/issue2/ellison.html

11. Rosenberg, M., Schooler, C., & Schoenbach, C. (1989). Self-esteem and adolescent problems: modeling reciprocal effects. American Sociological Review 54:1004–1018.

12. Harter, S. (1988). Manual for the self-perception profile for adolescents. Denver, CO: Department of Psychology, University of Denver.

13. Arbuckle, J.L. (2003). Amos 5.0 [computer software]. Chicago, IL: SmallWaters.

14. Byrne, B.M. (2001). Structural equation modeling with AMOS: basic concepts, applications and programming. Mahwah, NJ: Erlbaum.

15. Kline, R.B. (1998). Principles and practice of structural equation modeling. New York: Guilford Press.

16. Larson, R.W., Clore, G.L., & Wood, G.A. (1999). The emotions of romantic relationships. In: Furman, W., Brown, B.B., & Feiring, C. (eds.), The development of romantic relationships in adolescence. Cambridge, UK: Cambridge University Press, pp. 19–49.

Address reprint requests to:
Dr. Patti M. Valkenburg
Amsterdam School of Communications Research (ASCoR)
University of Amsterdam
Kloveniersburgwal 48
1012 CX Amsterdam, The Netherlands
E-mail: p.m.valkenburg@uva.nl

JOURNAL OF CONSUMER PSYCHOLOGY, 11(1), 57–73
Copyright © 2001, Lawrence Erlbaum Associates, Inc.

Consumers’ Responses to Negative Word-of-Mouth Communication: An Attribution Theory Perspective

Russell N. Laczniak, Thomas E. DeCarlo, and Sridhar N. Ramaswami
Department of Marketing

Iowa State University

Research on negative word-of-mouth communication (WOMC) in general, and the process by which negative WOMC affects consumers’ brand evaluations in particular, has been limited. This study uses attribution theory to explain consumers’ responses to negative WOMC. Experimental results suggest that (a) causal attributions mediate the negative WOMC-brand evaluation relation, (b) receivers’ attributions depend on the manner in which the negative WOMC is conveyed, and (c) brand name affects attributions. Results also suggest that when receivers attribute the negativity of the WOMC message to the brand, brand evaluations decrease; however, if receivers attribute the negativity to the communicator, brand evaluations increase.

Word-of-mouth communication (WOMC) is an important marketplace phenomenon by which consumers receive information relating to organizations and their offerings. Because WOMC usually occurs through sources that consumers view as being credible (e.g., peer reference groups; Brooks, 1957; Richins, 1983), it is thought to have a more powerful influence on consumers’ evaluations than information received through commercial sources (i.e., advertising and even neutral print sources such as Consumer Reports; Herr, Kardes, & Kim, 1991). In addition, this influence appears to be asymmetrical because previous research suggests that negative WOMC has a stronger influence on customers’ brand evaluations than positive WOMC (Arndt, 1967; Mizerski, 1982; Wright, 1974). Given the strength of negative, as opposed to positive WOMC, the study presented here focuses on the former type of information.

Our research develops and tests, using multiple studies, a set of hypotheses that describes consumers’ attributional and evaluative responses to different types of negative-WOMC messages. The hypotheses posit that consumers will generate predictable patterns of attributional responses to negative-WOMC messages that are systematically varied in terms of information content. Furthermore, they predict that attributional responses will mediate the negative WOMC-brand evaluation relation. Finally, and similar to recent studies (cf. Herr et al., 1991), the hypotheses suggest consumer responses to negative WOMC are likely to be influenced by the strength of the targeted brand’s name.

Requests for reprints should be sent to Russell N. Laczniak, Iowa State University, Department of Marketing, 300 Carver Hall, Ames, IA 50011-2065. E-mail: LACZNIAK@IASTATE.EDU

This study extends research on negative WOMC in two important ways. First, whereas previous studies have typically examined receivers’ responses to a summary statement of a focal brand’s performance (cf. Bone, 1995; Herr et al., 1991), it is likely that the information contained in negative-WOMC messages is more complex than this. In this study, focal messages are manipulated to include three components of information besides the communicator’s summary evaluation (Richins, 1984). Messages include information about the (a) consensus of others’ views of the brand (besides the communicator), (b) consistency of the communicator’s experiences with the brand over time, and (c) distinctiveness of the communicator’s opinions of the focal brand versus other brands in the category. Interestingly, these types of information correspond to the information dimensions examined in Kelley’s (1967) seminal work dealing with attribution theory. It is also important to note that although others have used this work to model individual responses to another’s actions (e.g., observing someone’s inability to dance), this study is the first that empirically extends Kelley’s research into a context in which consumers interpret a conversation about a brand.

Second, whereas other studies have posited the existence of a direct relation between negative WOMC and postexposure brand evaluations (e.g., Arndt, 1967; Haywood, 1989; Katz & Lazarsfeld, 1955; Morin, 1983), our investigation examines the attributional process that explains this association. This approach is consistent with the thinking of several researchers (i.e., Bone, 1995; Herr et al., 1991; Smith & Vogt, 1995) who posited that cognitive mechanisms are important, as they can more fully explain the negative WOMC-brand evaluation linkage. Furthermore, this research is consistent with other studies that suggest (but do not test the notion) that receivers’ cognitive processing of negative WOMC involves causal attributional reasoning (cf. Folkes, 1988; Mizerski, Golden, & Kernan, 1979).

THEORY AND HYPOTHESES

Negative WOMC

Negative WOMC is defined as interpersonal communication concerning a marketing organization or product that denigrates the object of the communication (Richins, 1984; Weinberger, Allen, & Dillon, 1981). Negative WOMC potentially has a more powerful influence on consumer behavior than print sources, such as Consumer Reports, because individuals find it to be more accessible and diagnostic (Herr et al., 1991). In fact, research has suggested that negative WOMC has the power to influence consumers’ attitudes (Engel, Kegerreis, & Blackwell, 1969) and behaviors (e.g., Arndt, 1967; Haywood, 1989; Katz & Lazarsfeld, 1955).

Attributions as Responses to Negative WOMC

Because the transmission of negative WOMC involves interpersonal and informal processes, attribution theory appears to be particularly helpful in understanding a receiver’s interpretation of a sender’s motives for communicating such information (Hilton, 1995). The central theme underlying attribution theory is that causal analysis is inherent in an individual’s need to understand social events, such as why another person would communicate negative information about a brand (Heider, 1958; Jones & Davis, 1965; Kelley, 1967). For this study, causal attribution is defined as the cognition a receiver generates to infer the cause of a communicator’s generation of negative information (Calder & Burnkrant, 1977).

Figure 1 illustrates the proposed process consumers use to deal with negative WOMC. Specifically, it proposes two important influences on receivers’ attributional responses to negative-WOMC communication. First, the information conveyed by the sender in a negative-WOMC message is posited to influence receivers’ causal attributions. Second, brand-name strength of the focal brand is also thought to directly affect receivers’ causal attributions. These attributional responses, in turn, are expected to affect receivers’ brand evaluations. Therefore, this study suggests that attributions mediate the presupposed negative-WOMC-brand evaluation relation. Such a model is consistent with theoretical frameworks of interpersonal communication that suggest that attributions mediate an interpersonal message’s effect on a receiver’s evaluation of the focal object (e.g., Hilton, 1995).

FIGURE 1 Attributional process model for receivers of negative word-of-mouth communication.

There is additional support for the mediational role played by attributions in influencing individuals’ brand evaluations. For example, studies in the advertising literature have suggested that receivers generate causal attributions that in turn affect their evaluations of the advertised brand (e.g., Wiener & Mowen, 1986). In the performance evaluation literature, studies indicate that sales manager attributions of salesperson performance shape their reactions toward a salesperson (e.g., DeCarlo & Leigh, 1996). Thus, the following is proposed for receivers of negative WOMC:

H1: Causal attributions will mediate the effects of negative WOMC on brand evaluations.
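H1 is a mediation hypothesis. As an illustration of how such a claim can be probed, the sketch below runs Baron and Kenny-style regressions on simulated data; the variable names, the data, and the method are hypothetical stand-ins, not the authors’ procedure:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
womc = rng.integers(0, 2, n)                     # 0/1: type of negative-WOMC message
attrib = 0.5 * womc + rng.normal(size=n)         # simulated brand-attribution strength
brand_eval = -0.8 * attrib + rng.normal(size=n)  # simulated brand evaluation

# Step 1: message type -> attributions
m1 = sm.OLS(attrib, sm.add_constant(womc)).fit()
# Step 2: attributions -> brand evaluation, controlling for message type
X = sm.add_constant(np.column_stack([womc, attrib]))
m2 = sm.OLS(brand_eval, X).fit()
print(m1.params, m2.params)  # mediation: womc's direct effect should shrink in m2
```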

Information Type and Causal Attributions

According to research in classical attribution theory (Kelley, 1967, 1973), the categories of causal attributions that people generate in response to information include: stimulus (i.e., brand, in this case), person (i.e., communicator, in this case), circumstance, or a combination of these three.¹ The specific type of attributions generated by individuals, however, depends on the manner in which information is conveyed. According to attribution theory (Kelley, 1967) and other studies dealing with WOMC (e.g., Richins, 1984), a receiver is likely to use three important information dimensions to generate causal attributions: consensus, distinctiveness, and consistency. In a negative-WOMC context, the consensus dimension refers to the degree to which others are likely to agree with the negative views of the communicator. The distinctiveness dimension encapsulates the extent to which the communicator associates the negative information with a particular brand but not other brands. Finally, the consistency di-

¹Although attribution theory suggests that individuals have the potential to generate multiple and interactive attributional responses, this study focuses only on those attributions that are thought to have a significant impact on brand evaluations in the negative-WOMC context (i.e., brand and communicator attributions).
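Under the standard textbook reading of Kelley’s covariation principle, the three dimensions jointly point to an attribution target: high consensus, high distinctiveness, and high consistency favor a stimulus (brand) attribution, while low consensus and low distinctiveness with high consistency favor a person (communicator) attribution. The sketch below encodes that mapping as a lookup; it is an illustration of the theory, not code or materials from this study:

```python
def kelley_attribution(consensus: str, distinctiveness: str, consistency: str) -> str:
    """Predicted causal attribution under Kelley's covariation model.

    Each argument is 'high' or 'low'. Low consistency generally yields a
    circumstance attribution regardless of the other two dimensions.
    """
    if consistency == "low":
        return "circumstance"
    if consensus == "high" and distinctiveness == "high":
        return "stimulus (brand)"
    if consensus == "low" and distinctiveness == "low":
        return "person (communicator)"
    return "mixed/ambiguous"

# Everyone dislikes this brand, only this brand, every time -> blame the brand
print(kelley_attribution("high", "high", "high"))
```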


Pattern Recognition Letters 118 (2019) 3–13

Contents lists available at ScienceDirect. Pattern Recognition Letters. Journal homepage: www.elsevier.com/locate/patrec

Perceiving the person and their interactions with the others for social robotics – A review

Adriana Tapus a, Antonio Bandera b,∗, Ricardo Vazquez-Martin c, Luis V. Calderita b

a Autonomous Systems and Robotics Lab, Computer Science and System Engineering Department (U2IS), ENSTA ParisTech, 828 Blv des Maréchaux, Palaiseau 91120, France
b AVISPA Group, Department of Electronic Technology, Universidad de Málaga, Málaga, 29071, Spain
c Robotics and Mechatronics Lab., Department of System Engineering and Automation, Universidad de Málaga, Málaga, 29071, Spain

Article info

Article history: Available online 6 March 2018

Keywords: Social robots; Human perception; Human–robot interaction; Social interactions; Proxemics

Abstract

Social robots need to understand human activities, dynamics, and the intentions behind their behaviors. Most of the time, this implies the modeling of the whole scene. The recognition of the activities and intentions of a person are inferred from the perception of the individual, but also from their interactions with the rest of the environment (i.e., objects and/or people). Centering on the social nature of the person, robots need to understand human social cues, which include verbal but also nonverbal behavioral signals such as actions, gestures, body postures, facial emotions, and proxemics. The correct understanding of these signals helps these robots to anticipate the needs and expectations of people. It also avoids abrupt changes on the human–robot interaction, as the temporal dynamics of interactions are anchored and driven by a major repertoire of social landmarks. Within the general framework of interaction of robots with their human counterparts, this paper reviews recent approaches for recognizing human activities, but also for perceiving social signals emanated from a person or a group of people during an interaction. The perception of visual and/or audio signals allow them to correctly localize themselves with respect to humans from the environment while also navigating and/or interacting with a person or a group of people.

© 2018 Elsevier B.V. All rights reserved.

1

a

o

a

i

w

f

c

s

t

t

i
i
i

b

t
o
a
a

r

e

o

c

n

o
i

r

t
o
t

R

e
s

h

0

. Introduction

One of the basic skills allowing people to interact in a safe

nd comfortable way is their ability to understand intuitively each

ther’s role and activities. Everyday, people observe one another

nd, through these observations, they recognize what they are do-

ng and also infer their intentions. In addition, this is addresse

d

ithout remarkable effort. It is clear that this ordinary and ef-

ortless ability is not only the result of having at our disposal

a

omplex multimodal perception system, and those other complex

ystems, related to learning and planning, are also involved. Ac-

ivities that have not been seen before cannot be recognized. In

he same way, intentions, which do not respond to, or cannot be

ncluded within, a normal course of actions will not be correctly

nferred. The recognition of activities and intentions is therefore

ntimately tied to the existence of a specific, shared socio-cultural

ackground, which is continuously acquired and improved within

he framework of the interaction with the others. The importance

∗ Corresponding author.
E-mail address: ajbandera@uma.es (A. Bandera).

a

d

s

ttps://doi.org/10.1016/j.patrec.2018.03.006

167-8655/© 2018 Elsevier B.V. All rights reserved.

f the observation and interpretation of various social cues em-

nating from their social interaction with the others is therefore

lso crucial for our acquisition of the correct collection of social

ules.

Now that robots are moving from automatized factories into our

veryday environments, it is natural to endow them with some

f the aforementioned skills (e.g., based on a set of social rules)

entered on the challenge of interacting with humans. In this sce-

ario, it is fundamental to have a robot perception system capable

f reading the social signals emerged from the interaction. The aim

s to produce a socially correct and smooth interaction between the

obot and the humans in its surroundings, based on the predic-

ion of their behaviors [76] . Anticipating which activities people in

ur surroundings will do next (and why they will do so) can hel

p

he robot to plan in advance its next responses and behaviors [94] .

obots need to understand verbal and nonverbal social cues from

ach individual person and from the dynamics of their relation-

hips. Signals such as body postures, gestures, and facial emotions,

re relevant for estimating the internal state of the humans. Un-

erstanding the dynamics of a group of people and identifying the

ocial role of each member of the group help the robot to exhibit a

https://doi.org/10.1016/j.patrec.2018.03.00

6

http://www.ScienceDirect.com

http://www.elsevier.com/locate/patre

c

http://crossmark.crossref.org/dialog/?doi=10.1016/j.patrec.2018.03.006&domain=pd

f

mailto:ajbandera@uma.e

s

https://doi.org/10.1016/j.patrec.2018.03.006

4 A. Tapus et al. / Pattern Recognition Letters 118 (2019) 3–13

Fig. 1. Robots interact with people in human-centred environments. In the figure,

the Gualzru robot trying to convince a woman to follow it to an interactive adver-

tising panel [62] .

[

m

f
a

[

t
r
r
t

T

c

p

s
t
d
s
t

fi

c

g

s
n
a
a
t
c

h

s
n
t
d
d
a
b
t
t
f

p
t

correct behavior from a social perspective. All this knowledge can

only be acquired from the observation and modeling of the human

and from their social interactions with other people.

This paper focuses on reviewing recent approaches and relevan

t

topics related to the perception and modeling of the human, as an

isolated individual, but also as part of a group of people. Restricted

to the ability of identifying the signals that can help having a so-

cial interaction, the acquisition of this complex skill requires the

robot to be equipped with hardware and software modules that

allows it (i) to perceive humans and their static and dynamic at-

tributes; and (ii) to match the obtained features with a specific,

memorized or on-line captured state (social knowledge) for mod-

eling them. It is important to note that the static role, as a passive

observer, that we are assuming here for the robot is not the real

situation. Our robots are situated agents that perceive but also act

in this outer world. The Theory of Event Coding [32] proposes that

stimulus representations underlying perception are encoded using

the same format that sensorimotor representations underlying ac-

tion. This is a significant difference with respect to the analysis of

video sequences captured from static cameras. Although, we do not

include within this contribution the importance of topics such as

affordance or goal directedness, we must consider that the situ-

atedness of the robot within the whole context plays a significant

role on its ability to recognize the behaviors and social interactions

of the humans in its surroundings.

The rest of the paper is organized as follows:

Section 2 overviews the problem tackled in this work, the model-

ing of the activities and social behavior of individuals, and their

social interactions. Among the most important requirements are

extraction and classification of hand-crafted or learned features,

and modeling and internalizing of the social relationships. Both

topics are described in Sections 3 and 4 , respectively. Section 3 is

divided up into two main sections, which review the typical

parameters of the perception system designed for a dyadic inter-

action ( Section 3.1 ) and for the interaction with a group of people

( Section 3.2 ). It is important to note that this strict separation

between feature extraction algorithms and classifiers does not

always exist and that both processes can be encoded together

within the same solution. A general discussion follows this study

in Section 5 . Finally, our conclusions are drawn in Section 6 .

2. Understanding a scene populated by humans

In this last decade, there has been a growing interest on the

design, methodology, and theory of human–robot interaction [29] .

This is justified by the fact that robots are expected to share our

same environments and cooperating with us to a greater or lesser

extent in our daily activities. Hence, autonomous robots used for

specific tasks with a very limited interaction with humans is not

a viable solution. The restriction of the human–robot interaction

(HRI) scenario to a dyadic interaction, where the robot interacts

with only one human is not true most of the time. Robots are more

and more part of teams (robots or humans), for instance, work-

ing closely alongside humans in industrial settings [66] or help-

ing physiotherapists to evaluate how a patient performs a motion-

based test in a hospital room [84] . This understanding of a situation

forces the robot to perceive details from the whole scene, captur-

ing not only the human but also its interaction with the surround-

ing objects and, especially, its social interaction with other people

( Fig. 1 ). Focusing on only one person could lead to the omission of

important information, and this can conduct to wrong decisions.

The recognition of human activities and social interactions is a complex task for robots, which requires the design and interaction of several modules. Detailing the scheme stated in Section 1, these systems typically include modules for (i) extracting significant unary and pairwise-interaction human-related features [74,89] from the scene; (ii) obtaining meaningful, semantic information (gender, gestures, …) [67] from these descriptors; and (iii) using the information coming from several sources for modeling and internalizing the scene (usually employing a graphical model [34,43]). The internalization of the perceived information can help to fuse multimodal cues or to deal with the subsequent intention recognition problem. Fig. 2 summarizes this approach. As in other related approaches, the classification algorithms need to have at their disposal datasets (knowledge) for comparison and matching. Thus, although it is not drawn on the figure, the scheme must incorporate the learning mechanisms for updating this knowledge.

The modules in charge of extracting features (unary- and pairwise-interaction features, including objects) and the ones responsible for returning semantic concepts from these features must try to build a model of the scene. This allows the robot to understand the behaviors of the people and even get the gist of the scene (e.g., catalog the event as a birthday celebration, an award function, etc. [59]). The parameters of these modules are tuned by the final use case or application: it is not the same to encourage a child to perform an exercise within a rehabilitation session as to guide a group of people through a museum. Moreover, the sensors, features and recognition needs are not the same either. The need to fuse perceptions coming from different modalities (e.g., audio and video for emotion recognition) could be a reason for adding a new module, the so-called 'Internal representation', to the scheme in Fig. 2. In some cases, the internal representation can include part of this knowledge (e.g., a priori known models of human bodies or faces) and then also be used as an additional source of information for action recognition [6] or emotion recognition [17]. For instance, the hierarchical recognition approaches that are built over primitive sub-actions or sub-activities do not directly deal with the raw data for activity recognition [1]. An additional advantage of working over an inner representation is that the approaches designed for performing the recognition processes can be partially decoupled from the hardware resources available on the robot [45]. Finally, the inference module in Fig. 2 encodes the processes in charge of extending the model with data obtained from the outcomes of the classifiers.
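As an illustration of this modular scheme, the following is a minimal Python sketch of the pipeline in Fig. 2. The module names and interfaces are hypothetical and are chosen only to mirror the feature-extraction, classification, internal-representation and inference stages described above; they do not reproduce any of the cited architectures.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class InternalRepresentation:
    """Stores fused percepts and a priori knowledge (e.g., body models)."""
    facts: dict = field(default_factory=dict)

    def update(self, key, value):
        # Inference step: extend the model with classifier outcomes.
        self.facts[key] = value

def extract_features(raw_frame):
    # (i) Unary and pairwise-interaction features; a real system would
    # compute e.g. skeletons, HOG/HOF descriptors or learned embeddings.
    return {"unary": raw_frame.mean(), "pairwise": raw_frame.std()}

def classify(features, knowledge):
    # (ii) Map descriptors to semantic labels by matching against stored
    # knowledge (here a trivial threshold rule as a placeholder).
    return "waving" if features["unary"] > knowledge["threshold"] else "idle"

def perceive(raw_frame, representation, knowledge):
    # (iii) Fuse the semantic outcome into the scene model.
    label = classify(extract_features(raw_frame), knowledge)
    representation.update("current_activity", label)
    return label

rep = InternalRepresentation()
print(perceive(np.random.rand(480, 640), rep, {"threshold": 0.5}))
```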

Within this paper, we conduct a survey on the solutions proposed for allowing a robot to perceive and internalize the activities and social interactions of a group of people. Thus, this review


Fig. 2. Major modules of a system in charge of modeling human behaviors and interactions. It typically includes feature extraction and classification, internal representation and inference mechanisms. The pipeline scheme is only partially true: the internal representation can store raw data from the feature extractors and help in modeling the whole scene (see text for details).


covers the perception and modeling of the activities of a group of people that share the environment with the robot. The term activity takes, in this context, a significance that exceeds the simple execution of certain movements. Following the terminology given by Turaga et al. [79], we distinguish between action and activity. An action refers to a simple motion pattern, executed by a single person and usually with a short duration in time. Activities are complex sequences of actions performed by one or several people, in a scenario that is typically driven by social cues.

3. Perceiving and modeling people and their interactions

3.1. Modeling the human

As aforementioned, there exists a large number of signals that can be captured for modeling a person: speech, facial expression, gaze, gestures, and any sort of measurement that a robot can record from the environment related to social interaction. The robot probably needs dedicated hardware and software resources for dealing with each one of them. In the simplest scenario, concerned only with activity recognition, the robot typically requires at least visual information for extracting motion information and characterizing the dynamics of the scene. It concentrates all resources on the interaction with one human counterpart: an action is in any case a sequence of body movements, and it usually involves several body parts concurrently.

Fig. 3 provides some snapshots of human–robot dyadic interaction. On the right, the ARMAR-III from the Karlsruhe Institute of Technology in Germany [80] is shown. It focuses on detecting and tracking the gestures of a human teacher [26]. The whole system allows the transfer of motion based on predefined gestures and force interaction. Initially, a dynamic movement primitive (DMP) [37] is learned from a human wiping movement. Given the color of the wiping tool, the robot tracks the movements of the tool using a stereo camera system. For the subsequent force-based adaptation of the learned DMPs, it relies on the readings of the force-torque sensor installed on the wrist of the robot. In Fig. 3(b), the Loki robot plays a simple game with a person. It is able to detect the presence of a person and recognize verbal commands. Thus, when the human introduces themselves and asks it to play the game, Loki uses color and distance information for tracking a yellow ball. For doing this, it has an RGB-D sensor placed on the forehead. It continuously fixates its gaze upon the ball. After a verbal indication, it reaches the ball with its hand and waits for a new interaction. Loki tracks the object and accepts new speech commands during the whole span of the game, representing the whole scene using an undirected graph [6]. Fig. 3(c) shows the Nao robot coaching a child during a rehabilitation session [58]. An external Kinect sensor from Microsoft is employed for capturing the skeleton of the human user, and threshold values are used for determining the correct execution of certain exercises. The same kind of interaction between a Nao robot and children with autism in an imitation task is also described in [14]. These examples show how different modalities, features and classifiers are used for modeling the human and their interaction with the robot. If we analyzed the details of the hardware and software architectures behind these experiments, we could also note the complexity of the perception and actuation systems. As it is probably not possible to summarize all perceptual possibilities within one paper, here we provide a brief description of the relevant issues, which are classified in Fig. 4.
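To make the Kinect-based coaching scheme concrete, below is a minimal sketch of threshold-based exercise checking on skeleton data, in the spirit of [58]. The joint names, target angle and tolerance are illustrative assumptions, not values from the cited system.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by 3D points a-b-c of a skeleton."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def elbow_flexion_ok(skeleton, target=90.0, tol=15.0):
    """True if the left-elbow angle is within +/- tol degrees of target.

    `skeleton` is assumed to map joint names to 3D positions, as a
    Kinect-style tracker would provide.
    """
    angle = joint_angle(skeleton["left_shoulder"],
                        skeleton["left_elbow"],
                        skeleton["left_wrist"])
    return abs(angle - target) <= tol

# Example with synthetic joint positions (meters):
skel = {"left_shoulder": np.array([0.0, 1.4, 2.0]),
        "left_elbow":    np.array([0.0, 1.1, 2.0]),
        "left_wrist":    np.array([0.3, 1.1, 2.0])}
print(elbow_flexion_ok(skel))  # 90 degrees -> True
```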

3.1.1. Feature extraction

In a dyadic scenario, feature extraction aims to transform the raw information captured by the sensors into feature vectors for the subsequent modeling of the human. Robots usually employ vision, audio, and/or range sensors. Table 1 summarizes the features and techniques for semantic understanding employed by several social robots. Typical tasks include human tracking, face and/or speech recognition, and scale up to action and activity recognition. It is however noticeable that social robots are not usually endowed with the ability to recognize intentions. In fact, it is not common that they consider the activity recognition task in the sense that we briefly stated in Section 2.

With respect to the features employed, they usually depend on the task to solve. We can group them into three major classes according to the temporal dimension.


Fig. 3. Human–robot dyadic interaction: ARMAR-III interacting with a human teacher [26]; Loki playing with one person [6]; and the Nao robot coaching a child in a rehabilitation session [58].

Table 1
Representative perception modalities on social robots.

Social robot   Task                            Features                    Algorithms for semantic understanding
PaPero         Face detection and recognition  Shape, 3D model             Template matching
               Speech recognition              Filter banks                Hidden Markov model
i-Cub          Human detection                 Motion-based                Machine learning [82]
               Human/face tracking             Color                       Hierarchical temporal memory [41]
               Sound localization              ITD, ILD, and notches       Active mapping [33]
Maggie         Emotion recognition             Voice and face expression   [3]
               Pose recognition                Skeleton                    Template matching [28]
               Speech processing                                           Grammar-based [4]
ARMAR-III      Human tracking                  Haar-like, color…           Particle filters [54]
               Human tracking                  Time-delay                  Particle filters [54]
               Face recognition                DCT-based                   Nearest neighbor
               Gesture recognition             Intensity, color            Neural network + hidden Markov model
               Head pose estimation            Intensity, shape            Neural network
               Sound recognition               ICA-transformed features    Hidden Markov model [75]
               Speech recognition              MFCC                        RTN [75]
Loki           Face detection and tracking     Color, depth                Active appearance model
               Human motion capture            Skeleton                    Template matching [9]
               Speech recognition                                          CNN-BLSTM
               Emotion recognition             Candide model               DBN [17]
NaoTherapist   Human motion capture            Skeleton                    Machine learning for body-part

Fig. 4. Taxonomy of the methods and approaches covered in this survey.


On the one hand, we have tasks such as emotion detection from facial features or the recognition of a specific verbal command. Although we can incorporate time to improve the classification results, these tasks put the emphasis on the current instant of time: an image for facial expression, or a word for verbal command recognition. Within each observation, these approaches employ static data such as the brightness or color values of images. These raw data are usually provided as input to modules that obtain feature vectors such as the Local Binary Patterns (LBP) or the Haar-like features. Both features have been successfully employed for face detection [83] or for gender and age estimation [52]. Other popular descriptors for characterizing static images are the scale-invariant feature transform (SIFT) or the speeded up robust features (SURF). In audio perception, significant features are the inter-aural level difference (ILD) and the inter-aural time difference (ITD) [33]. But the most commonly used feature extraction method in automatic speech recognition is probably the Mel-Frequency Cepstral Coefficients (MFCC) [75]. Contrary to static approaches, sequential algorithms consider the scene as an ordered collection of individual observations. However, within each observation, they deal with static features. The matching of these features within the sequence of images allows, for example, tracking human body or face parts [31]. In these approaches, the feature extraction can be supported by inner models of the human [5,6]. For instance, the Candide model has been successfully employed for human face tracking [78] or for emotion recognition through the definition of action unit features [17] (Fig. 5). The tracking of the joints (head, left shoulder, center shoulder, right shoulder, left elbow, right elbow, left wrist, etc.) composing the three-dimensional (3D) representation of the human body as a skeleton is also widely used for action recognition in robots equipped with RGB-D sensors [28,58]. Both schemes show the advantages of tying together internal representation and perception. Finally, space-time approaches treat the space and time dimensions equally, and work in a 3D space. There exist 3D versions of typical image-based descriptors, such as SIFT3D [69] or SURF3D [93]. Unfortunately, they inherit from their predecessors the limitations in performance generalization [47]. Many efforts have been made to design features based on other principles: representing actions by a temporally integrated spatial response (TISR descriptor) that extracts bag-of-words features


Fig. 5. Recognizing emotions from facial features using the Candide model [17] .


[99]; trajectories described using histograms of oriented gradients (HOG), histograms of optical flow (HOF) and motion boundary histograms (MBH) around interest points (iDT descriptors) [86], etc. The plethora of descriptors allows researchers to fuse them and obtain successful recognition schemes, as we briefly describe in Section 3.1.2.
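As a flavor of these static, hand-crafted descriptors, the following is a minimal numpy sketch of the basic 8-neighbor LBP code computed for the inner pixels of a grayscale image. It is a didactic simplification (no uniform patterns, no multi-scale sampling), not the production variants used in the cited systems.

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 Local Binary Pattern codes for the inner pixels of `img`.

    Each of the 8 neighbors contributes one bit: 1 if it is >= the
    center pixel, 0 otherwise. Returns an array of codes in [0, 255].
    """
    c = img[1:-1, 1:-1]  # center pixels
    # Neighbor offsets in a fixed clockwise order starting top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy:img.shape[0] - 1 + dy,
                    1 + dx:img.shape[1] - 1 + dx]
        codes |= ((neigh >= c).astype(np.uint8) << bit)
    return codes

# A feature vector is then the normalized histogram of the codes:
img = np.random.randint(0, 256, (64, 64)).astype(np.int16)
hist, _ = np.histogram(lbp_codes(img), bins=256, range=(0, 256))
feature = hist / hist.sum()
```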

3.1.2. Feature vector classification

Feature vectors can be classified for solving tasks (see Table 1) using a large variety of approaches. Using skin color and image disparity, Nickel and Stiefelhagen [54] used a k-means clustering approach for face detection. Stiefelhagen et al. [75] proposed to solve face recognition by computing the distances between the input images and a collection of training images. A Min–Max normalization approach and a sum rule that normalizes and fuses scores are applied. Then, the face is classified according to the highest score and a predefined threshold value. However, the most popular strategy for detecting faces was the combination of the Haar-like features with an AdaBoost classifier, originally proposed by Viola and Jones [83]. The approach was extended for dealing with rotated faces, and for performing face recognition using the Eigenfaces approach [73]. Another boosting approach, the so-called GentleBoost, was used for recognizing children's emotions [63]. When the input data is represented as a sequence of ordered observations, the problem is how to compare the incoming stream with the stored template. Previous approaches used dynamic time warping (DTW) or a simple matching of coefficients obtained from the activities by principal component analysis (PCA). Lin et al. [48] described the activity as a hierarchical prototype tree, which is matched to the trees in the dataset for recognition. Hidden Markov Models (HMM) were applied for speech recognition [49]. HMMs or extensions thereof have also been widely applied to human activity recognition, and novel versions are still being proposed [42]. Surveys such as the one by Cheng et al. [12] (for activity recognition) or Mishra et al. [51] (for face emotion recognition) provide information about databases and approaches. New schemes are continuously proposed, and it is now possible to adopt one of these state-of-the-art algorithms in our robotics architecture and obtain good results in a short time. The use of closed solutions for solving human-perception tasks is widely employed [4,6].
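To illustrate the template-matching side of this toolbox, below is a compact numpy implementation of classic dynamic time warping between two feature sequences. It sketches the standard dynamic-programming recurrence with a Euclidean local cost, not the particular variants used in the works cited above.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors.

    seq_a: (n, d) array, seq_b: (m, d) array. Classic O(n*m) recurrence
    with Euclidean local cost and unit steps (match/insert/delete).
    """
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]

# Nearest-template classification of an observed gesture sequence:
templates = {"wave": np.random.rand(30, 4), "point": np.random.rand(25, 4)}
observed = np.random.rand(28, 4)
label = min(templates, key=lambda k: dtw_distance(observed, templates[k]))
```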

3.1.3. Convolutional Neural Networks (CNNs)

Instead of defining handcrafted features and training traditional machine learning methods, another option is to learn these descriptors directly from the raw data. Deep Convolutional Neural Networks (CNNs) are currently the state-of-the-art solution for several computer vision problems such as object detection [55] and classification [27,57]. In a CNN, cells act as local filters over the input space, exploiting the strong spatially local correlation, which is the main reason behind their success in computer vision applications. Combined with multi-layered recurrent networks (long short-term memory, LSTM) used for learning temporal series, CNNs are also the state-of-the-art solution for speech recognition [95]. In general, it can be considered that CNNs and their extensions are currently the strategy for dealing with the challenge of perceiving the human.

With respect to the pipeline strategy followed by most of the approaches described in Section 2, CNNs can be trained for linking raw information with class labels. This end-to-end training is performed in a supervised way [35], the traditional major drawback being that a good training requires a vast number of labeled training patterns [38]. Fortunately, we now have readily available image-based models trained using millions of labeled patterns [39]. It has also been demonstrated that a model trained from a large dataset can be transferred to other visual recognition tasks with limited training data [21,55]. Recently, Zhang et al. [97] proposed a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. Combined with a multi-signal CNN (MSCNN), the resulting deep evolutional spatial-temporal network effectively boosts the performance of facial expression recognition.
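A minimal sketch of this transfer-learning recipe is shown below, assuming PyTorch and torchvision are available: a ResNet pretrained on ImageNet is reused as a frozen feature extractor and only a new classification head is trained on a small target dataset. The number of classes and the batch shown here are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (weights argument per recent
# torchvision versions; older releases use pretrained=True instead).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the pretrained features: with limited labeled data we only
# adapt the final layer, as in the transfer settings of [21,55].
for param in backbone.parameters():
    param.requires_grad = False

num_classes = 6  # e.g., a small set of robot-relevant activity labels
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One (dummy) training step on a batch of 224x224 RGB crops:
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(backbone(images), labels)
loss.backward()
optimizer.step()
```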

The work of Zhang et al. [97] captures the dynamic variation of the facial physical structure from a sequence of images. Similarly, to be useful for recognizing human activities, the CNN needs to be extended from the bi-dimensional domain of the image to the three-dimensional, spatio-temporal domain of the video sequence. The solutions for taking the temporal cue into account can be grouped into three major clusters: (i) three-dimensional (3D) CNNs; (ii) motion-based CNNs; and (iii) fusion approaches. The first cluster includes those approaches that perform 3D convolutions on the video sequence. The second one groups the methods that adopt the scene information related to motion as an input for the CNN. The third cluster proposes to fuse the information in the temporal domain. These approaches are complementary, and it is typical that CNN-based approaches merge techniques from different clusters for activity recognition. The better results are typically provided by those approaches that adopt the two-stream model [71]. Basically, the idea is to characterize the sequence of images using two different convolutional network (ConvNet) streams: a temporal stream of motion-based features and a second, spatial stream of appearance-based features. Fig. 6 provides a graphical illustration of the two-stream proposal by Wang et al. [90]. As Fig. 6 shows, a fusion process combines the obtained results and delivers the final decision. Wang et al. [90] proposed a temporal segment network (TSN) to recognize actions. The approach consists of three steps. First, the input video is divided up into K segments and a short portion (fragment) is randomly selected from each segment. Second, the class scores of the different fragments are fused by a segmental consensus function to yield a video-level prediction. Third, predictions from the spatial and temporal streams are then fused to produce the final prediction. The second step of the previous scheme was modified in the sequential segment network (SSN) [11]. The aim is to concatenate the outputs of the different segment portions as the video-level representation. This representation is fed into the fully-connected layer. Feichtenhofer et al. [24,25] proposed to generalize the residual networks (ResNets) to the spatio-temporal domain by introducing residual connections within the two-stream model. Specifically, Feichtenhofer et al. [24] injected residual connections between the appearance and temporal streams. Moreover, they transformed pre-trained image ConvNets into spatio-temporal networks by equipping them with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. Feichtenhofer et al. [25] fused the two streams by motion gating and injected identity mapping ker-


Fig. 6. Temporal segment network [90] .

Table 2
Activity recognition results on the UCF101 and HMDB51 databases.

Approach                    CNN scheme       Features          UCF101 – mAP (%)   HMDB51 – mAP (%)
Wang et al. [86]            –                iDT               85.9               57.2
Wang and Schmid [87]        –                iDT               87.9               61.1
Wang et al. [90]            BN-Inception     CNN               94.2               69.4
Chen and Zhang [11]         BN-Inception     CNN               94.8               73.8
Feichtenhofer et al. [24]   ST-ResNet        CNN               93.4               66.4
                                             CNN + iDT         94.6               70.3
Feichtenhofer et al. [25]   ResNet-50        CNN               94.2               68.9
                                             CNN + iDT         94.9               72.2
Wang et al. [91]            BN-Inception     CNN               94.6               68.9
Duta et al. [22]            VGG-16, VGG-19   CNN               93.6               69.5
                                             CNN + HMG         94.0               70.3
                                             CNN + HMG + iDT   94.3               73.1


nels as temporal filters to learn long-term temporal information. Wang et al. [91] provided a pyramid two-stream model for merging the spatial and temporal information. The goal is to make both streams reinforce each other. Duta et al. [22] added to the spatial and temporal streams a third, spatio-temporal stream built with the C3D architecture [77]. Spatio-Temporal Vectors of Locally Max Pooled Features (ST-VLMPF) are proposed to build the action representation over the entire video. Table 2 shows the classification accuracy of these approaches on the UCF101 and HMDB51 databases. The UCF101 database consists of 13,320 videos with 101 action classes [72]. It is characterized by a large diversity in terms of variations in background, camera motion, illumination and viewpoint, as well as in object scale, appearance or pose. The HMDB51 dataset consists of 6766 videos [44]. It shows a smaller repertoire of classes (51 action classes), but it is typically considered more challenging than UCF101 due to the even wider variations in which the actions are performed [24]. Both datasets provide an evaluation protocol. The evaluation metric is the mean of the Average Precision (mAP) [23].
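The following numpy sketch illustrates the two fusion stages of the TSN scheme described above: a segmental consensus (here, simple averaging) over the per-fragment class scores of each stream, followed by a weighted late fusion of the spatial and temporal streams. The fusion weight and the use of averaging are illustrative choices; the original work explores several consensus functions.

```python
import numpy as np

def segmental_consensus(fragment_scores):
    """Fuse per-fragment class scores (K, n_classes) into one video-level
    score vector, here by averaging (one possible consensus function)."""
    return fragment_scores.mean(axis=0)

def two_stream_prediction(spatial_scores, temporal_scores, w_temporal=1.5):
    """Weighted late fusion of the two streams' video-level scores.

    The temporal (motion) stream is often weighted higher; the exact
    weight here is an illustrative assumption.
    """
    fused = spatial_scores + w_temporal * temporal_scores
    return int(np.argmax(fused))

# K = 3 fragments, 101 action classes (a UCF101-sized output):
K, n_classes = 3, 101
spatial = segmental_consensus(np.random.rand(K, n_classes))
temporal = segmental_consensus(np.random.rand(K, n_classes))
print(two_stream_prediction(spatial, temporal))
```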

3.2. Modeling a group of people

Understanding the activities and social interactions in a group of people is a challenging topic that is starting to gain increasing attention from researchers. Several works pursue the determination of social networks from appearance- and motion-based parameters characterizing the people in the scene. For instance, Yu et al. [96] estimated the social network encoding the interactions among people by combining face recognition and motion similarities between the tracks of people on the ground plane. The association problem of mapping faces to tracks was solved using a novel graph-cut based algorithm. In the proposal by Ding and Yilmaz [20], this social network was extracted from the video sequence by analyzing the relationships among visual concepts. A probabilistic graphical model

(PGM) with temporal smoothing was employed for analyzing social relations among actors and for detecting communities. The approach assumes that the relations remain constant throughout the video sequence. RoleNet is a model for describing social relationships within a group of people [92]. It is built as a weighted graph, where nodes are people, arcs represent relationships, and a third set of weights encodes the strength of the arcs (relationships). Using co-occurrence matrices and recognizing people by face recognition, the social interaction is driven by the actors and not by audiovisual features. The method determines roles (leading roles and supporting roles) and divides up the sequence into scenes according to the context of roles [92]. As a major disadvantage, all these approaches do not extrapolate generic social events or situations (birthday, wedding…) from one video sequence to another. The grouping of the people is local to each sequence, and social roles within an event (e.g., priest, groom, bride…) are not recognized. Some authors have addressed the problem of detecting groups of interacting people using the concept of F-formations [40,50]. F-formations are defined as a geometric arrangement encoding the position and orientation information of the people standing in the formation (Fig. 7). The estimation of these F-formations can be inferred from body poses and/or head orientations. Vascon et al. [81] associated each person with a frustum, which was computed from the position and orientation information. They designed a game-theoretic framework where the concept of the F-formation was embedded, but also the biological constraints of social attention. Orientation was the main cue for Ricci et al. [60]. A joint learning approach was suggested for estimating the pose and F-formation for groups of people. Zhang and Hung [98] also employed the frustum of attention. But, contrary to Vascon et al. [81], they used this frustum to obtain features from people. These features labeled people as associates, singletons and members of F-formations. Using the Group Interaction Zone (GIZ), Cho et al. [15] also addressed the problem of detecting meaningful groups by


Fig. 7. Two-people formations [40] and a three-people formation [61] .

Fig. 8. Sample frames from a ‘wedding’ event from two films with manual role annotations.

Table 3
Group recognition results on the NUS-HGA and BEHAVE databases.

Approach                    NUS-HGA Accuracy (%)   BEHAVE Accuracy (%)
Cheng et al. [13]           96.20                  92.93
Cho et al. [15]             96.03                  93.74
Al-Raziqi and Denzler [2]   81.94                  79.35
Zhuang et al. [100]         99.25                  94.63


modeling proxemics. They described the group activity in a GIZ using attraction and repulsion properties, which considered an interaction in terms of “getting close”, “moving away”, and “keeping the same distance together”.
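As a toy illustration of the frustum-based reasoning behind F-formation detection, the numpy sketch below lets each person vote for an o-space center (a fixed stride along their gaze direction) and groups people whose votes fall close together. The stride and distance threshold are illustrative parameters; the real systems [81,98] use considerably richer models.

```python
import numpy as np

def o_space_votes(positions, orientations, stride=0.75):
    """Each person votes for an o-space center `stride` meters along
    their gaze direction. positions: (N, 2); orientations: (N,) rad."""
    gaze = np.stack([np.cos(orientations), np.sin(orientations)], axis=1)
    return positions + stride * gaze

def same_f_formation(positions, orientations, i, j, max_dist=0.6):
    """Heuristic: two people share an F-formation if their o-space
    votes lie within `max_dist` meters of each other."""
    votes = o_space_votes(positions, orientations)
    return np.linalg.norm(votes[i] - votes[j]) <= max_dist

# Two people standing 1.5 m apart, facing each other (vis-a-vis):
pos = np.array([[0.0, 0.0], [1.5, 0.0]])
ori = np.array([0.0, np.pi])  # person 0 faces +x, person 1 faces -x
print(same_f_formation(pos, ori, 0, 1))  # votes coincide -> True
```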

Other works try to capture the social interactions to help in the recognition of joint activities. Facial features were modeled for recognizing activities such as hand-shaking [56]. The relation history image (RHI) descriptor was proposed by Gori et al. [30] for discriminating activities and interactions that happen at the same time. The RHI is built as the temporal variation of the relational information between every pair of local subparts belonging to one or a pair of people. Choi and Savarese [16] proposed a model that unifies the tracking of multiple people, the recognition of individual actions, and the identification of interactions and collective activities. It is assumed that there exists a strong correlation between the individual activity of each person and the activities of the other people. Cheng et al. [13] proposed a layered model. They first extracted various motion and appearance features from the video and trajectory data. Then, features were randomly sampled from the training features to generate codebooks of visual words using K-means clustering. All features are quantized by assigning them to their nearest visual words with the Euclidean distance. The resulting normalized histograms of visual word occurrences formed the final representations, one feature type per group action instance. A multi-class Support Vector Machine (SVM) was used to build the classifier and make the recognition decisions. Al-Raziqi and Denzler [2] proposed to divide the video sequence into clips using an unsupervised clustering approach. Within the clips, significant groups of objects were detected using a bottom-up hierarchical clustering and then tracked over time. Furthermore, the mutual effect between objects based on motion and appearance features was computed. Finally, the Hierarchical Dirichlet Process (HDP) was employed to cluster the clips.
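The codebook pipeline of Cheng et al. [13] follows the standard bag-of-visual-words recipe, which the scikit-learn sketch below reproduces in miniature; the descriptor dimensionality, codebook size and the use of a linear kernel are illustrative assumptions, not the settings of the cited work.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy local descriptors (e.g., motion/appearance features), 16-D each.
train_descriptors = rng.normal(size=(2000, 16))

# 1) Learn a codebook of visual words with K-means.
codebook = KMeans(n_clusters=64, n_init=10,
                  random_state=0).fit(train_descriptors)

def bow_histogram(descriptors):
    """Quantize descriptors to their nearest visual word (Euclidean)
    and return the normalized histogram of word occurrences."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()

# 2) One histogram per group-action instance, then a multi-class SVM.
X = np.stack([bow_histogram(rng.normal(size=(150, 16))) for _ in range(40)])
y = rng.integers(0, 4, size=40)  # four toy group-activity classes
classifier = SVC(kernel="linear").fit(X, y)
print(classifier.predict(X[:3]))
```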

The recognition of social roles and their importance for predicting group activities has been explored by Ramanathan et al. [59] and Lan et al. [46]. The aim is to identify events and roles, being able to label people (Fig. 8). The first proposal addressed the identification of social roles in a weakly supervised framework, while the second one works in a fully supervised scenario. Ramanathan et al. [59] tackled the problem from the perspective of recognizing social roles, which emerge from the interactions among people and between people and objects. They proposed to model the inter-role interactions using a Conditional Random Field (CRF) under a weakly supervised setting. Unary component representations included HOG3D spatio-temporal features, object interaction features (restricted to two objects per event) and social role features (clothing and gender of the person). These features were refined in a subsequent layer consisting of pairwise spatio-temporal interaction features. The parameters of the CRF-based model and the role labels were learned by adapting a joint variational inference procedure. Focused on group activities, a hierarchical classifier was proposed by Lan et al. [46]. Using an undirected graphical model, the hierarchy encoded individual actions, role-based unary components, pairwise roles, and group activities. Thus, at a low level, the classifier recognizes single activities. At a mid level, it infers social roles. The parameters of the model are learned using a structured support vector machine (SVM). It works in a completely supervised setting.

3.2.1. Convolutional Neural Networks (CNNs)

Similarly to the approaches described in Section 3.1.3, there are proposals that deal with the problem of recognizing the activity of a group of people by using a layered model where both motion and appearance information are employed. For instance, Zhuang et al. [100] proposed the Differential Recurrent Convolutional Neural Network (DRCNN). As Fig. 9 shows, the DRCNN combines layers of convolutional networks, max-pooling, fully-connected layers, differential long short-term memory (DLSTM) networks and soft-max. Contrary to Cheng et al. [13] and Cho et al. [15], this method does not need the previous detection of the people in the images. For assessing the performance of the approaches for group activity recognition, two popular public video datasets are used: BEHAVE and NUS-HGA. The NUS-HGA dataset consists of 476 video clips, which cover six group activity classes (Fight, Gather, Ignore, RunInGroup, StandTalk and WalkInGroup). The BEHAVE dataset consists of 7 long video sequences. As these video sequences include different classes of group activities, video clips containing group activity instances have been extracted from the sequences. These video clips cover ten group activity classes, but it is typical to use only six of them (Approach, Fighting, InGroup, RunTogether, Split, and WalkTogether), because the rest only contain a few short sequences. Table 3 shows the group recognition results provided by several approaches on these datasets.

Other approaches represent activities and interactions within a hierarchical representation. Taking into consideration scene classification and group activity recognition, Deng et al. [19] proposed a hierarchical model that predicts scores for individual actions, obtained from bounding boxes around each person, and the group


Fig. 9. Differential Recurrent Convolutional Neural Networks [100] .

Fig. 10. Overview of the software architecture for human perception proposed by Lallée et al. [45].


activity, from the whole scene. The obtained labels were refined by applying a belief propagation-like neural network. The dependencies between the individual actions and the group activity are taken into account in the network. The model learns the message-passing parameters and performs inference and learning in a unified framework using back-propagation. While this approach uses neural network-based graphical representations, Ibrahim et al. [36] leveraged LSTM-based temporal modeling to learn discriminative information from time-varying sports activity data.

4. Internalizing the information

The integration of the isolated feature descriptors provided by individual perceptual units into a whole view of the scene can be achieved by internalizing all this information into a unique representation. This scheme has been widely employed in robotics, especially when robots are expected to deploy cognitive functionalities. If cognition is the ability that allows us to internally deal with information about ourselves and the external world, this ability is subject to the existence of an internal active representation handling all this information. For instance, Fig. 10 shows an overview of the architecture proposed by Lallée et al. [45] for the i-Cub robot. In the figure, we can note the presence of a module for storing the spatial knowledge of the scene, which receives inputs from the 3D perception module. The presence of this module, in the 'Platform independent' part of the software architecture, allows the system to decouple sensing and perception. This module is a geometric memory in Lallée et al. [45], the so-called EgoSphere. In the proposal by Romero-Garcés et al. [62], the knowledge is stored in a graphical representation that merges symbolic and metric information.
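A minimal sketch of such a hybrid symbolic-metric representation is given below as a plain Python graph of typed nodes and labeled edges. The node attributes and relation names are invented for illustration and do not follow the concrete schemas of [45] or [62].

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    """A symbol (person, object, room, ...) plus metric attributes."""
    symbol: str
    pose: tuple = (0.0, 0.0, 0.0)   # metric part: x, y, theta
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, relation, dst)

    def add(self, node):
        self.nodes[node.symbol] = node

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def related(self, symbol):
        return [(r, d) for s, r, d in self.edges if s == symbol]

# Fusing percepts from several modules into one representation:
g = SceneGraph()
g.add(SceneNode("person_1", pose=(1.2, 0.4, 3.1),
                attributes={"activity": "waving"}))
g.add(SceneNode("mug_1", pose=(1.0, 0.6, 0.0)))
g.relate("person_1", "holds", "mug_1")
print(g.related("person_1"))  # [('holds', 'mug_1')]
```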

The use of an internal representation can be a good solution for encoding the complexity of a scene populated by several people. As shown in the previous sections, rich semantic relations are important for understanding these events. If these relations can be useful for understanding the activities of an individual, building relationships among the people sharing a common task will be basic for recognizing group activity. In Ramanathan et al. [59], they encode relationships among people, but also between people and objects. Some of the state-of-the-art approaches presented above successfully label the perceived sequence of images, but they are unable to provide fine details about the individual role or activity of each person in the scene. Hierarchical approaches recognize the activities of each individual person and of the group of people, but it is typical that they do not encode all the richness of the interactions. Graphical models emerge as a solution to encode components of the visual appearance and their relations and interactions [6]. Chen et al. [10] combined graphical models and deep neural networks, feeding the outcomes of the final layer of a deep network to a CRF model. Schwing and Urtasun [68] designed an iterative process for training a CRF model and, expanding this approach, Deng et al. [18] used an iterative approach for employing the actions of the other people in the scene in the disambiguation of the action of each individual. They accomplished this with a recurrent neural network, refined by repeatedly passing messages with estimates of each individual person's action. The inner representation of the scene is configurable, using trainable gating functions for turning on and off the arcs between individual people in the scene.
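The gated message-passing idea behind [18] can be caricatured in a few lines of numpy: each person's action belief is iteratively refined with the gated, aggregated beliefs of the others. In this sketch the pairwise scores and the update rule are illustrative stand-ins for the trained recurrent network of the cited work.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_beliefs(unary, pair_scores, steps=3):
    """Iteratively refine per-person action beliefs.

    unary: (N, C) per-person action scores; pair_scores: (N, N) values
    whose sigmoid gates how much person j's belief influences person i
    (stand-ins for the trained gating functions of [18]).
    """
    beliefs = softmax(unary)
    gates = 1.0 / (1.0 + np.exp(-pair_scores))
    np.fill_diagonal(gates, 0.0)  # no self-messages
    for _ in range(steps):
        messages = gates @ beliefs           # aggregate gated neighbor beliefs
        beliefs = softmax(unary + messages)  # re-estimate each person's action
    return beliefs

N, C = 4, 5  # four people, five candidate actions
rng = np.random.default_rng(1)
print(refine_beliefs(rng.normal(size=(N, C)), rng.normal(size=(N, N))))
```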

5. Discussion

The previous sections review the state-of-the-art approaches for perceiving people and their social interactions. Although the advances in accuracy are really surprising, some doubts appear when these algorithms must be translated to robotics. One of the major difficulties is related to the response time of these algorithms. The robots illustrated in Fig. 3 need to interact with a person at human interaction rates. The hardware and software complexities underlying some of the architectures in this review are really relevant, and their integration within a robot would increase its price. This is a significant issue: how much would a social robot cost? As Blackman [7] pointed out for care robots, there is a serious lack of robust evidence of cost-effectiveness. Even if we solve the technological challenge of endowing a robot with the abilities for understanding


Fig. 11. Samples from a video sequence capturing a ‘Handshaking with the observer’ task: (top) from a 3rd-person viewpoint; and (bottom) from a first-person viewpoint.

our activities and intentions, it will be difficult to bridge the gap between the research or academic domain and the market shelf. Another significant problem is that most of the approaches focus on recognition from a 3rd-person perspective (i.e., viewpoint). In these cases, the camera is typically far away from the people, and the algorithms recognize what people are doing to each other without getting involved in the activities (e.g., two people walking together). This paradigm is insufficient when the observer itself is involved in the interactions [65].

5.1. Networked robotics: the strength of being part of an ecology

To address both problems, recent proposals suggest embedding intelligent networked robotic devices in our everyday environments (homes, offices, public buildings…). Similar to ubiquitous computing, the robot is now one element within an ecology of connected devices. In fact, extending the definition of robot to any embedded device with computing, communication, and sensing or actuation capabilities [8], we can refer to this as an 'ecology of robots'. Within these approaches, the perceptual and social abilities of each robot are augmented by adding the ones provided by the rest of the robots. Each robot is in charge of solving a specific task, and the human activity understanding can be solved by using wearable sensors [70], or external cameras that provide the visual 3rd-person perspective. Moreover, the robot can share the acquired knowledge by uploading it to a distributed database [85].

5.2. Approaches for first-person activity recognition

First-person cameras or microphones are the correct input devices for providing the researchers with the information that will allow them to endow the robot with the situation awareness that we stated at the end of Section 1. In this egocentric scenario, the observer wearing the camera is involved in the ongoing activities. It must be noted that these videos display very different visual properties when compared to video captured from a conventional, 3rd-person viewpoint. As an example, Fig. 11 shows some samples of the task 'Handshaking with the observer' captured from the two viewpoints.

The research area of first-person activity recognition or scene understanding has been gaining an increasing amount of attention these last years. There are works on the recognition of activities of daily living [53], on early recognition [64], etc., and it is expected that new datasets and approaches will appear in the next years.

6. Conclusions

This review provides a summary of the approaches that have been applied to characterize and recognize the behaviors of an individual or a group of people. Specifically, the understanding of the interaction with a group of people has been receiving significant attention from the research community in recent years. Similarly, a large set of concepts and different approaches have emerged recently. This paper summarizes some of these advances for modeling the social setting where the robot is involved and for extracting the relevant information during the interaction. Deep neural networks (CNN and LSTM) represent promising techniques for the detection and classification tasks in the interaction of a social robot. As discussed above, these techniques require a vast number of labeled training patterns, but this is not a problem thanks to the availability of large labeled datasets and trained networks. These approaches have shown impressive results in the recognition of human activity in the field of computer vision. While achieving these results is a significant achievement, researchers still have many challenges to deal with, such as reproducing the achieved recognition rates on egocentric videos, dealing with the noise due to the dynamics associated with the robot's motion, etc. The whole problem should be approached from the robotics point of view, and the algorithms should work with low memory and low computational time. Recently, the development of new methods on Application-Specific Integrated Circuits (ASIC) or Field-Programmable Gate Arrays (FPGA) was discussed in [88]. A transversal effort will require joint expertise, combining embedded vision with traditional teams of robotics, software engineering and computer vision researchers. Furthermore, the work on activity recognition should be extended to deal with early recognition, where pre-activity observations and context awareness are basic concepts.

Acknowledgments

The research work of A. Bandera, R. Vazquez-Martin and L.V. Calderita within this scope has been partially funded by the EU ECHORD++ project (FP7-ICT-601116) and the TIN2015-65686-C5-1-R project (Gobierno de España and FEDER funds).


References

[1] J. Aggarwal , M. Ryoo , Human activity analysis: a survey, ACM Comput. Surv.

43 (2011) 1–43 .

[2] A. Al-Raziqi, J. Denzler, Unsupervised Group Activity Detection by Hierarchi-
cal Dirichlet Processes, Springer International Publishing, Cham, pp. 399–407.

doi: 10.1007/978-3-319-59876-5_44.
[3] F. Alonso-Martín, M. Malfaz, J. Sequeira, J. Gorostiza, M. Salichs, A multimodal

emotion detection system during human-robot interaction, Sensors (Basel) 13
(11) (2013) 15549–15581, doi: 10.3390/s131115549 .

[4] F. Alonso-Martín, M.A. Salichs, Integration of a voice recognition system in a

social robot, Cybern. Syst. 42 (4) (2011) 215–245, doi: 10.1080/01969722.2011.
583593 .

[5] A. Aly, A. Tapus, Multimodal adapted robot behavior synthesis within a nar-
rative human-robot interaction, in: Proceedings of the 2015 IEEE/RSJ Interna-

tional Conference on Intelligent Robots and Systems (IROS), 2015, pp. 2986–
2993, doi: 10.1109/IROS.2015.7353789 .

[6] A. Bandera, P. Bustos, Toward the development of cognitive robots, in:
L. Grandinetti, T. Lippert, N. Petkov (Eds.), Proceedings of the International

Workshop on Brain-Inspired Computing, BrainComp 2013, Springer Interna-

tional Publishing, Cham, 2014, pp. 88–99, doi: 10.1007/978-3-319-12084-3_8.
Cetraro, Italy.

[7] T. Blackman , Care robots for the supermarket shelf: a product gap in assistive
technologies, Ageing Soc. 33 (5) (2013) 763–781 .

[8] M. Bordignon, M.J. Rashid, M. Broxvall, A. Saffiotti, Seamless integration of
robots and tiny embedded devices in a PIES-Ecology, in: Proceedings of the

2007 IEEE/RSJ International Conference on Intelligent Robots and Systems,

Sheraton Hotel and Marina, San Diego, California, USA, 2007, pp. 3101–3106 .
October 29–November 2, 2007. 10.1109/IROS.2007.4399282 .

[9] L.V. Calderita, J.P. Bandera, P. Bustos, A. Skiadopoulos, Model-based reinforce-
ment of kinect depth data for human motion capture applications, Sensors 13

(7) (2013) 8835–8855, doi: 10.3390/s130708835 .
[10] L. Chen , G. Papandreou , I. Kokkinos , K. Murphy , A.L. Yuille , Semantic image

segmentation with deep convolutional nets and fully connected CRFs, CoRR

(2014) .
[11] Q. Chen , Y. Zhang , Sequential segment networks for action recognition, IEEE

Signal Process. Lett. 24 (5) (2017) 712–716 .
[12] G. Cheng , Y. Wan , A.N. Saudagar , K. Namuduri , B.P. Buckles , Advances in hu-

man action recognition: a survey, CoRR (2015) . abs/1501.05964.
[13] Z. Cheng , L. Qin , Q. Huang , S. Yan , Q. Tian , Recognizing human group action

by layered model with multiple cues, Neurocomputing 136 (2014) 124–135 .

[14] P. Chevalier, J. Martin, B. Isableu, C. Bazile, A. Tapus, Impact of sensory prefer-
ences of children with ASD on imitation with a robot, in: Proceedings of the

2017 IEEE International Conference on Human–Robot Interaction (HRI), 2017,
doi: 10.1145/2909824.3020234 .

[15] N.-G. Cho , Y.-J. Kim , U. Park , J.-S. Park , S.-W. Lee , Group activity recognition
with group interaction zone based on relative distance between human ob-

jects, Int. J. Pattern Recognit. Artif. Intell. 29 (5) (2015) 1555007 .

[16] W. Choi , S. Savarese , A unified framework for multi-target tracking and col-
lective activity recognition, in: Proceedings of the 2012 European Conference

on Computer Vision (ECCV), 2012, pp. 215–230 .
[17] F. Cid, J. Moreno, P. Bustos, P. Núñez, Muecas: a multi-sensor robotic head for

affective human robot interaction and imitation, Sensors 14 (5) (2014) 7711–
7737, doi: 10.3390/s140507711 .

[18] Z. Deng, A. Vahdat, H. Hu, G. Mori, Structure inference machines: recurrent

neural networks for analyzing relations in group activity recognition, in: Pro-
ceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recog-

nition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 4772–4781 . June 27–30,
2016. 10.1109/CVPR.2016.516 .

[19] Z. Deng , M. Zhai , L. Chen , Y. Liu , S. Muralidharan , M. Roshtkhari , G. Mori , Deep
structured models for group activity recognition, in: Proceedings of the 2015

British Machine Vision Conference (BMVC), 2015 .
[20] L. Ding, A. Yilmaz, Inferring social relations from visual concepts, in: Proceed-

ings of the 2011 International Conference on Computer Vision, 2011, pp. 699–

706, doi: 10.1109/ICCV.2011.6126306 .
[21] J. Donahue , Y. Jia , O. Vinyals , J. Hoffman , N. Zhang , E. Tzeng , T. Darrell , Decaf:

a deep convolutional activation feature for generic visual recognition, in: Pro-
ceedings of the 2015 International Conference on Machine Learning (ICML),

32, 2014, pp. 1–9 .
[22] I. Duta , B. Ionescu , K. Aizawa , N. Sebe , Spatio-temporal vector of locally max

pooled features for action recognition in videos, in: Proceedings of the IEEE

Conference on Computer Vision and Pattern Recognition (CVPR), 2017 .
[23] M. Everingham, L. Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual

object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338,
doi: 10.1007/s11263-009-0275-4.

[24] C. Feichtenhofer , A. Pinz , R. Wildes , Spatiotemporal residual networks for
video action recognition, in: Proceedings of the Conference on Neural Infor-

mation Processing Systems (NIPS), 2016 .

[25] C. Feichtenhofer , A. Pinz , R. Wildes , Spatiotemporal multiplier networks for
video action recognition, in: Proceedings of the IEEE Conference on Computer

Vision and Pattern Recognition (CVPR), 2017 .
[26] A. Gams , T. Petric , M. Do , B. Nemec , J. Morimoto , T. Asfour , A. Ude , Adapta-

tion and coaching of periodic motion primitives through physical and visual
interaction, Robot. Auton. Syst. 75 (2016) 340–351 .

[27] R. Girshick , J. Donahue , T. Darrell , J. Malik , Rich feature hierarchies for ac-

curate object detection and semantic segmentation, in: Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014,
pp. 580–587 .

[28] V. Gonzalez-Pacheco, M. Malfaz, F. Fernandez, M.A. Salichs, Teaching human
poses interactively to a social robot, Sensors 13 (9) (2013) 12406–12430,

doi: 10.3390/s130912406 .
[29] M. Goodrich , A. Schultz , Human-robot interaction: a survey, Found. Trends

Hum.-Comput. Interact. 1 (2007) 203–275 .
[30] I. Gori , J. Aggarwal , L. Matthies , M. Ryoo , Multitype activity recognition in

robot-centric scenarios, IEEE Robot. Autom. Lett. 1 (1) (2016) 593–600 .

[31] A.M. Gupta, B.S. Garg, C.S. Kumar, D.L. Behera, An on-line visual human track-
ing algorithm using surf-based dynamic object model, in: Proceedings of the

2013 IEEE International Conference on Image Processing, 2013, pp. 3875–
3879, doi: 10.1109/ICIP.2013.6738798 .

[32] B. Hommel , J. Müsseler , G. Aschersleben , W. Prinz , The theory of event coding
(TEC): a framework for perception and action planning, Behav. Brain Sci. 24

(5) (2001) 849–937 .

[33] J. Hornstein, M. Lopes, J. Santos-Victor, F. Lacerda, Sound localization for hu-
manoid robots – building audio-motor maps based on the HRTF, in: Proceed-

ings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and
Systems, 2006, pp. 1170–1176, doi: 10.1109/IROS.2006.281849 .

[34] N. Hu , G. Englebienne , Z. Lou , B. Krose , Learning latent structure for activity
recognition, in: Proceedings of the IEEE Conference Robotics and Automaton

(ICRA), 2014, pp. 1048–1053 .

[35] F. Husain , B. Dellen , C. Torras , Action recognition based on efficient deep fea-
ture learning in the spatio-temporal domain, IEEE Robot. Autom. Lett. 1 (2)

(2016) 984–991 .
[36] M.S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, G. Mori, A hierarchical

deep temporal model for group activity recognition, in: Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),

2016, pp. 1971–1980, doi: 10.1109/CVPR.2016.217 .

[37] A. Ijspeert , J. Nakanishi , P. Pastor , H. Hoffmann , S. Schaal , Dynamical move-
ment primitives: learning attractor models for motor behaviors, Neural Com-

put. 25 (2) (2013) 328–373 .
[38] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and ar-

tificial neural networks for natural scene text recognition, arXiv: 1406.2227
(2014).

[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-

rama, T. Darrell, Caffe: convolutional architecture for fast feature embedding,
arXiv: 1408.5093 (2014).

[40] A. Kendon , Conducting Interaction: Patterns of Behavior in Focused Encoun-
ters, Studies in Interactional Socio, Cambridge University Press, 1990 .

[41] M. Kirtay, E. Falotico, A. Ambrosano, U. Albanese, L. Vannucci, C. Laschi, Vi-
sual Target Sequence Prediction via Hierarchical Temporal Memory Imple-

mented on the iCub Robot, Springer International Publishing, Cham, pp. 119–

130. doi: 10.1007/978-3-319-42417-0_12.
[42] M.H. Kolekar, D.P. Dash, Hidden Markov model based human activity recog-

nition using shape and optical flow based features, in: Proceedings of the
2016 IEEE Region 10 Conference (TENCON), 2016, pp. 393–397, doi: 10.1109/

TENCON.2016.7848028 .
[43] H. Koppula , R. Gupta , A. Saxena , Learning human activities and object affor-

dances from RGB-D videos, Int. J. Robot. Res. 32 (8) (2013) 951–970 .
[44] H. Kuhne , H. Jhuang , E. Garrote , T. Poggio , T. Serre , HMDB: A large video

database for human motion recognition, in: Proceedings of the IEEE Inter-

national Conference on Computer Vision (ICCV), 2011 .
[45] S. Lallée , S. Lemaignan , A. Lenz , C. Melhuish , L. Natale , S. Skachek , T. van der

Zant , F. Warneken , P.F. Dominey , Towards a platform-independent coopera-
tive human–robot interaction system: I. Perception, in: Proceedings of the In-

ternational Conference on Intelligent Robots and Systems (IROS), IEEE, 2010,
pp. 4444–4451.

[46] T. Lan , L. Sigal , G. Mori , Social roles in hierarchical models for human activity

recognition, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2012, pp. 1354–1361 .

[47] Q. Le , W. Zou , S. Yeung , A. Ng , Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis, in: Pro-

ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2011, pp. 3361–3368 .

[48] Z. Lin , Z. Jiang , L. Davis , Recognizing actions by shape-motion prototype trees,

in: Proceedings of the IEEE International Conference on Computer Vision,
2009, pp. 444–451.

[49] C.Y. Liu, T.H. Hung, K.C. Cheng, T.H.S. Li, HMM and BPNN based speech
recognition system for home service robot, in: Proceedings of the 2013 In-

ternational Conference on Advanced Robotics and Intelligent Systems, 2013,
pp. 38–43, doi: 10.1109/ARIS.2013.6573531 .

[50] P. Marshall, Y. Rogers, N. Pantidi, Using F-formations to analyse spatial pat-

terns of interaction in physical environments, in: Proceedings of the ACM
2011 Conference on Computer Supported Cooperative Work, CSCW ’11, ACM,

New York, NY, USA, 2011, pp. 445–454, doi: 10.1145/1958824.1958893 .
[51] B. Mishra, S.L. Fernandes, K. Abhishek, A. Alva, C. Shetty, C.V. Ajila, D. Shetty,

H. Rao, P. Shetty, Facial expression recognition using feature based techniques
and model based techniques: A survey, in: Proceedings of the Second Interna-

tional Conference on Electronics and Communication Systems (ICECS), 2015,

pp. 589–594, doi: 10.1109/ECS.2015.7124976 .
[52] D. Nguyen, S. Cho, K. Shin, J. Bang, K. Park, Comparative study of human age

estimation with or without preclassification of gender and facial expression,
Sci. World J. 2014 (2014) 905269, doi: 10.1155/2014/905269 . 15 pages

[53] T.-H.-C. Nguyen, J.-C. Nebel, F. Florez-Revuelta, Recognition of activities of


http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0040

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0041

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0042

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0042

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0042

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0042

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0043

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0043

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0043

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0043

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0043

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0044

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0044

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0044

http://refhub.elsevier.com/S0167-8655(18)30077-1/sbref0044

https://doi.org/10.1109/ARIS.2013.6573531

https://doi.org/10.1145/1958824.1958893

https://doi.org/10.1109/ECS.2015.7124976

https://doi.org/10.1155/2014/905269

Perceiving the person and their interactions with the others for social robotics – A review

Contents
1 Introduction
2 Understanding a scene populated by humans
3 Perceiving and modeling people and their interactions
   3.1 Modeling the human
      3.1.1 Feature extraction
      3.1.2 Feature vectors classification
      3.1.3 Convolutional Neural Networks (CNNs)
   3.2 Modeling a group of people
      3.2.1 Convolutional Neural Networks (CNNs)
4 Internalizing the information
5 Discussion
   5.1 Networked robotics: the strength of being part of an ecology
   5.2 Approaches for first-person activity recognition
6 Conclusions
Acknowledgments
References
