Article Review Assignment
Note: Work will be checked for originality and adherence to the instructions; assignments that do not follow them will be canceled. Thanks!
PSYC 512
Article Review Grading Rubric

Criteria: Content (70%)
Advanced (14 points): The paper meets or exceeds the content requirements. The review contains: Title page (title and authors of the article); Purpose (why the article was written (introduction) and what it attempts to find or answer (hypothesis section)); Method (how it answers the question or questions it proposes (method section)); Results/Discussion (what the article found (results) and what the results actually mean (discussion)).
Proficient (12 to 13 points): The paper meets most of the content requirements.
Developing (1 to 11 points): The paper meets some of the content requirements.
Not Present (0 points): Not present.

Criteria: Structure (30%)
Format and Word Count
Advanced (6 points): The paper meets or exceeds the structure requirements: proper spelling and grammar are used, and the summary is at least 350 words.
Proficient (5 points): The paper meets most of the structure requirements.
Developing (1 to 4 points): The paper meets some of the structure requirements.
Not Present (0 points): Not present.
PSYC 512
Article Review Assignment Instructions
Overview
Reading and understanding original research is an important skill for working in the field of psychology. Understanding research methodology and the sections of a journal article is critical for success in our field. This Article Review Assignment will help you learn to objectively evaluate research, to find scholarly sources of information, and to use them as a source of knowledge. This Article Review Assignment can also help you in your professional development.
These Article Review Assignments are to help you to remember the most important aspects of each article. By the end, you will have five article summaries on social psychological research that can help you both in this course and in future research and coursework.
Instructions
Over several modules, you will complete five Article Review Assignments that relate to the following topics:
· Social perception
· Stereotypes, prejudice and discrimination
· Group processes
· Close relationships
· Aggression
In each Article Review Assignment, you will find and learn about research that relates to one of these topics. To find these articles, you can search Google Scholar, one of the library’s psychology databases (e.g., PsycINFO), or look in a specific journal (e.g., the Journal of Applied Psychology).
Note: do not use the journal articles in the Learn Sections for this.
Once you have chosen an article that relates to the topic, summarize the article in at least 350 words.
Your Article Review Assignments should include the following components:
· Introduction: Include general information about the article in the introduction, including a very brief overview of the previous literature on the topic and identifying the gap in the literature that demonstrates the need for this article.
· Hypothesis Section: what the article attempts to find out or answer
· Method Section: how the article answers the question or questions it proposes
· Results Section: what the article found
· Practical Significance/Discussion: What the results actually mean
· References page: Title and authors of the article in current APA format
Be careful to ensure that your answers to the above make sense to you. You want to develop the skill of making complex academic information easy for non-academic readers to understand. Explain any complex ideas in plain language, and do not assume the reader already knows what you are talking about. Summarize these articles succinctly yet thoroughly.
Refer to the Article Review Template for guidance on this Article Review Assignment.
Make sure to check the Article Review Grading Rubric before beginning this Article Review Assignment.
Note: Your assignment will be checked for originality via the Turnitin plagiarism tool.
Journal Article Summary
Social Psychology Article
Stu D. Name
Department of Psychology, Liberty University
PSY 512: Social Psychology
Dr. Wood
July 16, 2020
Journal Article Summary
Social Psychology Article
Introduction
List the article introduction information here.
Purpose
State the purpose for which this article was written.
Hypothesis
What contribution does this paper make, or what question(s) is it trying to answer?
Methodology
Sample
Describe the sample of this study.
Measures
Describe the measures that were used in this study.
Procedures
Describe how this study was done.
Results
What did this study find? You can include both stats and an explanation of the stats.
Practical Significance
Why is this study relevant/meaningful?
References
Haney, C., Banks, C., & Zimbardo, P. (1973). A study of prisoners and guards in a simulated prison. Naval Research Reviews, 9(1), 1-17.
Source: Valkenburg, P. M., Peter, J., & Schouten, A. P. (2006). Friend networking sites and their relationship to adolescents’ well-being and social self-esteem. CyberPsychology & Behavior, 9(5), 584–590. DOI: 10.1089/cpb.2006.9.584
Friend Networking Sites and Their Relationship to
Adolescents’ Well-Being and Social Self-Esteem
PATTI M. VALKENBURG, Ph.D., JOCHEN PETER, Ph.D., and ALEXANDER P. SCHOUTEN, M.A.
ABSTRACT
The aim of this study was to investigate the consequences of friend networking sites (e.g.,
Friendster, MySpace) for adolescents’ self-esteem and well-being. We conducted a survey
among 881 adolescents (10–19-year-olds) who had an online profile on a Dutch friend net-
working site. Using structural equation modeling, we found that the frequency with which
adolescents used the site had an indirect effect on their social self-esteem and well-being.
The use of the friend networking site stimulated the number of relationships formed on the
site, the frequency with which adolescents received feedback on their profiles, and the tone
(i.e., positive vs. negative) of this feedback. Positive feedback on the profiles enhanced ado-
lescents’ social self-esteem and well-being, whereas negative feedback decreased their self-
esteem and well-being.
INTRODUCTION
THE OPPORTUNITIES for adolescents to form and maintain relationships on the Internet have
multiplied in the past few years. Social networking
sites have rapidly gained prominence as venues to
relationship formation. Social networking sites
vary in the types of relationships they focus on.
There are dating sites, such as Match.com, whose
primary aim is to help people find a partner. There
are common interest networking sites, such as
Bookcrossing.com, whose aim is to bring people
with similar interests together. And there are friend
networking sites, such as Friendster and MySpace,
whose primary aim is to encourage members to es-
tablish and maintain a network of friends.
The goal of this study is to investigate the conse-
quences of friend networking sites for adolescents’
social self-esteem and well-being. Given the recent
worldwide proliferation of such sites and the ever-
expanding numbers of adolescents joining up,
these sites presumably play an integral role in ado-
lescent life. Friend networking sites are usually
open or semi-open systems. Everyone is welcome
to join, but new members have to register, and
sometimes the sites only allow members if they are
invited by existing members. Members of the sites
present themselves to others through an online
profile, which contains self-descriptions (e.g., de-
mographics, interests) and one or more pictures.
Members organize their contacts by giving and re-
ceiving feedback on one another’s profiles.
Although friend networking sites have become
tremendously popular among adolescents, there is
as yet no research that specifically focuses on the
uses and consequences of such sites. This is re-
markable because friend networking sites lend
themselves exceptionally well to the investigation
of the social consequences of Internet communica-
tion. After all, peer acceptance and interpersonal
feedback on the self, both important features of
friend network sites, are vital predictors of social
self-esteem and well-being in adolescence.1 There-
fore, if the Internet has the potential to influence
CYBERPSYCHOLOGY & BEHAVIOR
Volume 9, Number 5, 2006
© Mary Ann Liebert, Inc.
Amsterdam School of Communications Research (ASCoR), University of Amsterdam, Amsterdam, The Netherlands.
adolescents’ social self-esteem and well-being, it is
likely to occur via their use of friend networking
sites.
There is no period in which evaluations regard-
ing the self are as likely to affect self-esteem and
well-being as in adolescence.1 Especially early and
middle adolescence is characterized by an in-
creased focus on the self. Adolescents often engage
in what has been referred to as “imaginative audi-
ence behavior”2: they tend to overestimate the ex-
tent to which others are watching and evaluating
and, as a result, can be extremely preoccupied with
how they appear in the eyes of others. On friend
networking sites, interpersonal feedback is often
publicly available to all other members of the site.
Such public evaluations are particularly likely to
affect the development of adolescents’ social self-
esteem.1 In this study, social self-esteem is defined
as adolescents’ evaluation of their self-worth or sat-
isfaction with three dimensions of their selves:
physical appearance, romantic attractiveness, and
the ability to form and maintain close friendships.
Well-being refers to a judgment of one’s satisfaction
with life as a whole.3
Our study is conducted in the Netherlands
where, since April 2000, a friend networking site
exists that is primarily used by adolescents. In May
2006, this website, named CU2 (“See You Too”),
contained 415,000 profiles of 10–19-year-olds. Con-
sidering that the Netherlands counts about 1.9 mil-
lion adolescents in this age group, approximately
22% of Dutch adolescents use this website to form
and maintain their social network.
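As a quick check of this prevalence estimate (my arithmetic, not a calculation shown in the article): 415,000 profiles / 1,900,000 adolescents ≈ 0.218, or about 22%.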
Internet use, well-being, and self-esteem
Ever since Internet use became common as a
leisure activity, researchers have been interested in
investigating its consequences for well-being and
self-esteem. For both well-being and self-esteem,
the literature has yielded mixed results. Some stud-
ies reported negative relationships with various
types of Internet use,4,5 other studies found positive
relationships,6 and yet other studies found no sig-
nificant relationships.7,8
Two reasons may account for the inconsistent
findings on the relationships between Internet use,
self-esteem, and well-being. First, many studies
have treated the independent variable ‘Internet
use’ as a one-dimensional construct. Some studies
did investigate the differential effects of types of In-
ternet use, but the selection of these types usually
did not follow from a theoretical anticipation of
their consequences for self-esteem and well-being.
In our view, at least a distinction between social
and non-social Internet use is required to ade-
quately investigate Internet effects on self-esteem
and well-being. We believe that social self-esteem
and well-being are more likely to be affected if the
Internet is used for communication than for infor-
mation seeking. After all, feedback on the self and
peer involvement, both important precursors of
self-esteem and well-being, are more likely to occur
during online communication than during online
information seeking.
A second shortcoming in earlier studies is that
many authors did not specify how Internet use
could be related to self-esteem and well-being.
Most research has focused on main effects of Inter-
net use on either self-esteem or well-being. None of
these studies have considered models in which the
influence of Internet use on self-esteem and well-
being is considered simultaneously. By modeling
the relationships of Internet use with both self-
esteem and well-being, a more comprehensive set
of hypotheses can be evaluated, which may clarify
some of the contradictory findings in previous
studies.
Our research hypotheses modeled
It has repeatedly been shown that adolescents’
self-esteem is strongly related to their well-being.
Although the literature has not clearly established
causation, most self-esteem theorists believe that
self-esteem is the cause and well-being the effect.9
Based on these theories, we hypothesize that social
self-esteem will predict well-being, and by doing
so, it may act as a mediator between the use of
friend networking sites and well-being. After all, if
the goal of friend networking sites is to encourage
participants to form relationships and to comment
on one another’s appearance and personality, it is
likely that the use of such sites will affect the di-
mensions of self-esteem that are related to these ac-
tivities. The hypothesis that adolescents’ social
self-esteem predicts their well-being is modeled in
Figure 1 by means of path H1.
We also hypothesize that the use of friend net-
working sites will increase the chance that adoles-
cents (a) form relationships on those sites (path H2a),
and (b) receive reactions on their profiles (path
H3a). After all, if the aim of using friend networking
sites is to meet new people and to give and receive
feedback, it is plausible that the more these sites are
used, the more friends and feedback a member gets.
As Figure 1 shows, we do not hypothesize that the
use of friend networking sites will directly influence
the tone of reactions to the profiles because the mere
use of such a site cannot be assumed to influence
the tone of reactions to the profiles. However, we do
hypothesize an indirect relationship between use of
friend network sites and the tone of the reactions
via the frequency of reactions that adolescents re-
ceive (paths H3a and H5). In a recent study on the
use of dating sites, members of the site often modi-
fied their profile based on the feedback they re-
ceived. By means of a process of trial and error, they
were able to optimize their profile, and, by doing so,
optimize the feedback they received.10 We therefore
assume that the more reactions adolescents receive
to their profiles, the more positive these reactions
will become (path H5). We also assume that the
more reactions adolescents receive the more rela-
tionships they will form (path H6).
We not only assume that adolescents’ social self-
esteem mediates the relationship between the use
of friend networking sites and their well-being; we
also hypothesize that the relationships between the
use of friend networking sites and adolescents’ so-
cial self-esteem will be mediated by three types of
reinforcement processes that are common on friend
network sites and that have been shown to affect
adolescents’ social self-esteem.1 These reinforce-
ment processes are: (a) the number of relationships
formed through the friend network site, (b) the fre-
quency of feedback that adolescents receive on
their profiles (e.g., on their appearance and self-
descriptions), and (c) the tone (i.e., positive vs. neg-
ative) of this feedback. Our hypotheses about these
mediated influences are modeled by means of paths
H2a-b, H3a-b, and H4 in Figure 1.
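For readers tracing these mediated paths: in a standardized path model, an indirect effect is the product of the coefficients along the chain (a standard SEM identity; the notation below is mine, not the article’s). For the fully mediated route from site use through feedback to well-being in Figure 1,

\[ \beta^{\text{indirect}}_{\text{use}\to\text{well-being}} = \beta_{H3a}\,\beta_{H5}\,\beta_{H4}\,\beta_{H1} \]

so site use can influence well-being even though no direct path between the two is modeled.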
We expect that for most adolescents the use of
friend networking sites will be positively related to
their social self-esteem. We base this view on theo-
ries of self-esteem, which assume that human be-
ings have a universal desire to protect and enhance
their self-esteem.11 Following these theories, we be-
lieve that adolescents would avoid friend network-
ing sites if these sites were to negatively impact
their social self-esteem. Friend networking sites
provide adolescents with more opportunities than
face-to-face situations to enhance their social self-
esteem. These sites provide a great deal of freedom
to choose interactions. In comparison to face-to face
situations, participants can usually more easily
eliminate undesirable encounters or feedback and
focus entirely on the positive experiences, thereby
enhancing their social self-esteem.
However, if, by contrast, an adolescent for any
reason is mostly involved in negative interactions
on these sites, an adverse influence on his or her so-
cial self-esteem seems plausible. Especially because
reactions to the profiles are made public to other
members of the site, negative reactions are likely to
have a negative influence on adolescents’ social
self-esteem. We therefore hypothesize that a posi-
tive tone of reactions will positively predict social
self-esteem, whereas a negative tone will nega-
tively predict social self-esteem.
METHODS
Sample and procedure
We conducted an online survey among 881
Dutch adolescents between 10 and 19 years of age
[FIG. 1. Hypothesized model on the relationships among use of friend networking site, social self-esteem, and well-being. The path diagram links use of site, frequency of reactions, tone of reactions, relationships formed, social self-esteem, and well-being via paths H1, H2a–b, H3a–b, H4, H5, and H6.]
who had a profile on the friend networking site
CU2 (“See You Too”); 45% were boys and 55% were
girls (M age = 14.8; SD = 2.7). A profile on CU2 in-
cludes demographic information, a description of
the user and his or her interests, and one or more
pictures. Reactions of other CU2 users to the pro-
files are listed at the bottom of each profile (for
more information, see www.cu2.nl).
Upon accessing their profile, members of the site
received a pop-up screen with an invitation to par-
ticipate in an online survey. The pop-up screen
stated that the University of Amsterdam conducted
the survey in collaboration with CU2. The adoles-
cents were informed that their participation would
be voluntary, that they could stop with the ques-
tionnaire whenever they wished, and that their re-
sponses would be anonymous.
Measures
Use of friend networking site. We used three items
measuring the frequency, rate, and intensity of the
use of the friend networking site: (a) “How many
days per week do you usually visit the CU2 site?”,
(b) “On a typical day, how many times do you visit
the CU2 site?”, and (c) “If you visit CU2, how long
do you usually stay on the site?” The first two
items required open-ended responses. Response
categories for the third item ranged from 1 (about 10
min) to 7 (more than an hour). Responses to the three
items were standardized. The standardized items
resulted in a Cronbach’s alpha of 0.61.
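For readers who want to reproduce this kind of reliability figure, Cronbach’s alpha is computed from the item variances and the variance of the summed scale; a minimal Python sketch (illustrative only — the function and the placeholder data below are mine, not the authors’):

import numpy as np

def cronbach_alpha(items):
    # items: (respondents x items) array of scores for one scale
    k = items.shape[1]                          # number of items
    item_var = items.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_var / total_var)

# Usage with three standardized site-use items (fabricated placeholder data):
scores = np.random.default_rng(0).normal(size=(881, 3))
print(round(cronbach_alpha(scores), 2))

An alpha of 0.61 for a three-item frequency index, as reported here, is modest but not unusual for short behavioral scales.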
Frequency of reactions to profiles. The number of
reactions to the profiles was measured by two
items: “How often do you get reactions to your pro-
file from unknown persons,” and “How often do
you get reactions to your profile from people you
only know through the Internet?” Response cate-
gories to the items ranged from 1 (never) to 5 (very
often). Responses to these two items were averaged,
and resulted in a Cronbach’s alpha of 0.72.
Tone of reactions to profiles. The tone of the reac-
tions to the profiles was measured with the follow-
ing two questions: “The reactions that I receive on
my profile are . . .” and “The reactions that I receive
on what I tell about my friends are . . .” Response
categories ranged from 1 (always negative) to 5 (al-
ways positive). Cronbach’s alpha was 0.87.
Relationships established through CU2. We asked
respondents how often they had established (a) a
friendship and (b) a romantic relationship through
CU2. Response options were 0 (never), 1 (once), and
2 (more than once). The correlation between the two
items was r = 0.34.
Social self-esteem. We used three subscales of
Harter’s self-perception profile for adolescents12:
the physical appearance subscale, the close friend-
ship subscale, and the romantic appeal subscale.
From each subscale we selected the four items with
the highest factor loadings. Response categories for
the items ranged from 1 (agree entirely) to 5 (disagree
entirely). Cronbach’s alpha values were 0.91 for
physical appearance scale, 0.85 for the close friend-
ship scale, and 0.81 for the romantic appeal scale.
Well-being. We used the five-item satisfaction
with life scale developed by Diener et al.3 Response
categories ranged from 1 (agree entirely) to 5 (dis-
agree entirely). Cronbach’s alpha for the scale was
0.89.
Statistical analysis
The hypotheses in our study were investigated
with the Structural Equation Modeling software
AMOS 5.0.13
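The article ran this model in AMOS 5.0. Purely as an illustration of how such a latent-variable model is specified (a sketch in the open-source Python package semopy, with invented indicator names; this is not the authors’ AMOS setup):

import semopy  # pip install semopy

MODEL = """
# measurement part: latent =~ observed indicators
use       =~ use1 + use2 + use3
freq      =~ freq1 + freq2
tone      =~ tone1 + tone2
rels      =~ rel_friend + rel_romance
esteem    =~ appearance + close_friend + romantic
wellbeing =~ swls1 + swls2 + swls3 + swls4 + swls5

# structural part: the hypothesized paths of Figure 1
freq      ~ use                 # H3a
rels      ~ use + freq          # H2a, H6
tone      ~ freq                # H5
esteem    ~ tone + rels + freq  # H4, H2b, H3b
wellbeing ~ esteem              # H1
"""

model = semopy.Model(MODEL)
model.fit(df)            # df: a pandas DataFrame holding the observed item scores (assumed)
print(model.inspect())   # parameter estimates, standard errors, p-values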
RESULTS
Descriptive statistics
Adolescents visited the friend networking site on
average three days a week (M = 3.09, SD = 2.07).
When they visited the website, they stayed on the
site for approximately a half hour. The average
number of reactions that adolescents had received
on their profiles was 25.31 (SD = 50.00), with a
range from 0 to 350 reactions. The tone of the reac-
tions varied significantly among adolescents. Of
the adolescents who reported having received reac-
tions to their profiles (n = 592), 5.6% indicated that
these reactions had always been negative; 1.6%
that they had predominantly been negative; 10.1%
that they had sometimes been negative and some-
times positive; 49.3% that they had been predomi-
nantly positive; and 28.4% that they had always
been positive. Thirty-five percent of the adoles-
cents reported having established a friendship, and
8.4% reported having formed a romantic relation-
ship through the friend networking site.
Zero-order correlations
Before testing our hypothesized model, we pres-
ent a matrix showing the Pearson product-moment
correlations between the variables included in the
model (Table 1).
Testing the hypothesized model
The variables in our model were all modeled as
latent constructs. The construct reflecting the use of
the friend networking site was measured by three
items and well-being by five items. The frequency
of reactions to profiles, the tone of the reactions to
profiles, and the number of relationships estab-
lished by the site were each measured by two
items. The latent construct social self-esteem was
formed by the three subscales measuring physical
appearance self-esteem, close friendship self-
esteem, and romantic appeal self-esteem. For rea-
sons of clarity, we do not present the measurement
model (i.e., the factor-analytic models) in our
graphical presentation of the results. However, all
factor-analytic models led to adequate descriptions
of the data. The factor loadings were all above 0.44.
To investigate our hypotheses, we proceeded in
two steps. First, we tested whether the hypothesized
model in Figure 1 fit the data. Then, we checked
whether we could improve the model’s fit by adding
or removing theoretically meaningful paths from the
hypothesized model. We used three indices to evaluate the fit of our models: the χ2/df ratio, the comparative fit index (CFI), and the root mean square error of approximation (RMSEA). An acceptable model fit is expressed in a χ2/df ratio of <3.0, a CFI value of >0.95, and a RMSEA value of <0.06.14,15
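For reference, these two approximate-fit indices are defined from the model chi-square, its degrees of freedom, and the sample size N (standard definitions, not formulas given in the article):

\[ \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\,0)}{df\,(N-1)}} \qquad \mathrm{CFI} = 1 - \frac{\max(\chi^2_{\text{model}} - df_{\text{model}},\,0)}{\max(\chi^2_{\text{baseline}} - df_{\text{baseline}},\,0)} \]

where the baseline is the independence model. With N = 881, a χ2/df ratio of about 2.5 implies RMSEA ≈ √(1.5/880) ≈ 0.04, in line with the values reported next.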
Our hypothesized model fit the data satisfactorily: χ2/df ratio = 2.5; CFI = 0.96; RMSEA =
0.05. However, the results indicated that two paths
assumed in our hypothesized model were not sig-
nificant: path H2b from the number of relation-
ships formed on the friend networking site to
self-esteem, and path H3b from the frequency of re-
actions to the profile to self-esteem.
After removal of the two nonsignificant paths,
we subjected our model to a final test. The modi-
fied model fit the data well, χ2/df ratio = 2.5; CFI =
0.98; RMSEA = 0.05. We therefore accepted the
model as an adequate description of the data. Our
final model indicates that all of our research hy-
potheses (i.e., those visualized by paths H1, H2a,
H3a, H4, H5, and H6) were confirmed by the data.
Figure 2 visualizes the observed final model. The
reported coefficients are standardized betas.
The model controlled for age and gender
To test whether our final model also holds when
age and gender are controlled for, we tested a
model in which we allowed paths between age and
gender and all of the remaining independent, me-
diating, and dependent variables in the model. This
TABLE 1. PEARSON PRODUCT-MOMENT CORRELATIONS

Variables                                     1        2        3        4        5        6        7        8
1. Use of friend networking site
2. Frequency of reactions to profiles         0.16***
3. Tone of reactions to profiles              0.10*    0.24***
4. Close friends established via site         0.18***  0.31***  0.01
5. Romantic relations established via site    0.12***  0.12***  −0.13**  0.34***
6. Physical appearance self-esteem            0.04     0.05     0.29***  −0.00    −0.00
7. Close friendship self-esteem               0.12***  0.13***  0.40***  0.06     −0.05    0.61***
8. Romantic attractiveness self-esteem        0.06     0.16***  0.38***  0.08*    −0.00    0.68***  0.72***
9. Well-being                                 0.06     0.07*    0.37***  −0.03    −0.01    0.59***  0.54***  0.45***

*p < 0.05. **p < 0.01. ***p < 0.001.
model again led to a satisfactory fit: χ2/df ratio =
2.6; CFI = 0.95; RMSEA = 0.05.
DISCUSSION
Our study was the first to show the conse-
quences of adolescents’ use of friend networking
sites for their social self-esteem and well-being.
Adolescents’ self-esteem was affected solely by the
tone of the feedback that adolescents received on
their profiles: Positive feedback enhanced adoles-
cents’ self-esteem, and negative feedback de-
creased their self-esteem. Most adolescents (78%)
always or predominantly received positive feed-
back on their profiles. For these adolescents, the
use of friend networking sites may be an effective
vehicle for enhancing their self-esteem.
However, a small percentage of adolescents (7%)
did predominantly or always receive negative feed-
back on their profiles. For those adolescents, the use
of friend networking sites resulted in aversive ef-
fects on their self-esteem. Follow-up research should
attempt to profile these adolescents. Earlier research
suggests that users of social networking sites are
quite able to learn how to optimize their self-presen-
tation through their profiles.10 Adolescents who pre-
dominantly receive negative feedback on their
profiles may especially be in need of mediation on
how to optimize their online self-presentation.
No less than 35% of the respondents reported
having established one or more friendships
through the site, and 8% one or more romantic rela-
tionships. However, as discussed, the number of
friendships and romantic relationship formed via
the site did not affect adolescents’ social self-
esteem. Obviously, it is not the sheer number of re-
lationships formed on the site that affects
adolescents’ social self-esteem. Research on adoles-
cent friendships suggests that the quality of friend-
ships and romantic relationships may be a stronger
predictor of social adjustment than the sheer num-
ber of such relationships.16 Therefore, future re-
search on friend networking sites should include
measures on the quality of the relationships formed
through friend networking sites.
Our study focused on a new and pervasive phe-
nomenon among adolescents: friend networking
sites. In the Netherlands, about one quarter of ado-
lescents is currently a member of one or more of
such sites. The Netherlands is at present at the fore-
front of Internet-based communication technologies
(e.g., 96% of Dutch 10–19-year olds have home ac-
cess to the Internet, and 90% use Instant Messaging).
Therefore, it is a unique spot to start investigating
the social consequences of such technologies. How-
ever, friend networking sites are a worldwide phe-
nomenon that attracts ever younger adolescents.
Such sites can no longer be ignored, neither by com-
munication researchers nor by educators.
REFERENCES
1. Harter, S. (1999). The construction of the self: a develop-
mental perspective. New York: Guilford Press.
[FIG. 2. Structural equations model of the relationships among use of friend networking site, social self-esteem, and well-being. The ellipses represent latent constructs estimated from at least two observed variables; coefficients represent standardized betas significant at least at p < 0.01. The figure reports a coefficient of .78 from social self-esteem to well-being; the remaining significant paths carry coefficients of .19, .28, .29, .30, and .48, and two paths are marked n.s.]
2. Elkind, D., & Bowen, R. (1979). Imaginary audience
behavior in children and adolescents. Developmental
Psychology 15:38–44.
3. Diener, E., Emmons, R.A., Larsen, R.J., et al. (1985).
The satisfaction with life scale. Journal of Personality
Assessment 49:71–75.
4. Kraut, R., Patterson, M., Lundmark, V., et al. (1998).
Internet paradox: a social technology that reduces
social involvement and psychological well being?
American Psychologist 53:1017–1031.
5. Rohall, D.E., & Cotton, S.R. (2002). Internet use and
the self-concept: linking specific issues to global
self-esteem. Current Research in Social Psychology
8:1–19.
6. Kraut, R., Kiesler, S., Boneva, B., et al. (2002). Internet
paradox revisited. Journal of Social Issues 58:49–74.
7. Gross, E.F., Juvonen, J., & Gable, S.L. (2002). Internet
use and well-being in adolescence. Journal of Social Is-
sues 58:75–90.
8. Harman, J.P., Hansen, C.E., Cochran, M.E., et al.
(2005). Liar, liar: Internet faking but not frequency of
use affects social skills, self-esteem, social anxiety,
and aggression. CyberPsychology & Behavior 8:1–6.
9. Baumeister, R.F., Campbell, J.D., Krueger, J.I., et al.
(2003). Does high self-esteem cause better perfor-
mance, interpersonal success, happiness, or healthier
lifestyles? Psychological Science in the Public Interest 4:1–44.
10. Ellison, N.B., Heino, R., & Gibbs, J.L. (2006). Managing
impressions online: Self-presentation processes in the
online dating environment. Journal of Computer-Mediated
Communication, 11(2): http://jcmc.indiana.edu/vol11/
issue2/ellison.html
11. Rosenberg, M., Schooler, C., & Schoenbach, C. (1989).
Self-esteem and adolescent problems: modeling recip-
rocal effects. American Sociological Review 54:1004–1018.
12. Harter, S. (1988). Manual for the self-perception profile
for adolescents. Denver, CO: Department of Psychol-
ogy, University of Denver.
13. Arbuckle, J.L. (2003). Amos 5.0 [computer software].
Chicago, IL: SmallWaters.
14. Byrne, B.M. (2001). Structural equation modeling with
AMOS: basic concepts, applications and programming.
Mahwah, NJ: Erlbaum.
15. Kline, R.B. (1998). Principles and practice of structural
equation modeling. New York: Guilford Press.
16. Larson, R.W., Clore, G.L., & Wood, G.A. (1999). The
emotions of romantic relationships. In: Furman, W.,
Brown, B.B., Feiring, C. (eds.), The development of ro-
mantic relationships in adolescence. Cambridge, UK:
Cambridge University Press, pp. 19–49.
Address reprint requests to:
Dr. Patti M. Valkenburg (ASCoR)
University of Amsterdam
Kloveniersburgwal 48
1012 CX Amsterdam, The Netherlands
E-mail: p.m.valkenburg@uva.nl
JOURNAL OF CONSUMER PSYCHOLOGY, 11(1), 57–73
Copyright © 2001, Lawrence Erlbaum Associates, Inc.
Consumers’ Responses to Negative Word-of-Mouth
Communication: An Attribution Theory Perspective
Russell N. Laczniak, Thomas E. DeCarlo, and Sridhar N. Ramaswami
Department of Marketing
Iowa State University
Research on negative word-of-mouth communication (WOMC) in general, and the process by
which negative WOMC affects consumers’ brand evaluations in particular, has been limited.
This study uses attribution theory to explain consumers’ responses to negative WOMC. Experi-
mental results suggest that (a) causal attributions mediate the negative WOMC-brand evalua-
tion relation, (b) receivers’ attributions depend on the manner in which the negative WOMC is
conveyed, and (c) brand name affects attributions. Results also suggest that when receivers at-
tribute the negativity of the WOMC message to the brand, brand evaluations decrease; however,
if receivers attribute the negativity to the communicator, brand evaluations increase.
Word-of-mouth communication (WOMC) is an important
marketplace phenomenon by which consumers receive infor-
mation relating to organizations and their offerings. Because
WOMC usually occurs through sources that consumers view
as being credible (e.g., peer reference groups; Brooks, 1957;
Richins, 1983), it is thought to have a more powerful influ-
ence on consumers’ evaluations than information received
through commercial sources (i.e., advertising and even neu-
tral print sources such as Consumer Reports; Herr, Kardes, &
Kim, 1991). In addition, this influence appears to be asym-
metrical because previous research suggests that negative
WOMC has a stronger influence on customers’ brand evalua-
tions than positive WOMC (Arndt, 1967; Mizerski, 1982;
Wright, 1974). Given the strength of negative, as opposed to
positive WOMC, the study presented here focuses on the for-
mer type of information.
Our research develops and tests, using multiple studies, a
set of hypotheses that describes consumers’ attributional and
evaluative responses to different types of negative-WOMC
messages. The hypotheses posit that consumers will generate
predictable patterns of attributional responses to nega-
tive-WOMC messages that are systematically varied in terms
of information content. Furthermore, they predict that
attributional responses will mediate the negative
WOMC-brand evaluation relation. Finally, and similar to re-
cent studies (cf. Herr et al., 1991), the hypotheses suggest
Requests for reprints should be sent to Russell N. Laczniak, Iowa
State University, Department of Marketing, 300 Carver Hall, Ames,
IA 50011-2065. E-mail: LACZNIAK@IASTATE.EDU
consumer responses to negative WOMC are likely to be
influenced by strength of the targeted brand’s name.
This study extends research on negative WOMC in two im-
portant ways. First, whereas previous studies have typically
examined receivers’ responses to a summary statement of a fo-
cal brand’s performance (cf. Bone, 1995; Herr et al., 1991), it is
likely that the information contained in negative-WOMC mes-
sages is more complex than this. In this study, focal messages
are manipulated to include three components of information
besides the communicator’s summary evaluation (Richins,
1984). Messages include information about the (a) consensus
of others’ views of the brand (besides the communicator), (b)
consistency of the communicator’s experiences with the brand
over time, and (c) distinctiveness of the communicator’s opin-
ions of the focal brand versus other brands in the category. In-
terestingly, these types of information correspond to the
information dimensions examined in Kelley’s (1967) seminal
work dealing with attribution theory. It is also important to note
that although others have used this work to model individual
responses to another’s actions (e.g., observing someone’s in-
ability to dance), this study is the first that empirically extends
Kelley’s research into a context in which consumers interpret a
conversation about a brand.
Second, whereas other studies have posited the existence
of a direct relation between negative WOMC and
postexposure brand evaluations (e.g., Arndt, 1967;
Haywood, 1989; Katz & Lazarsfeld, 1955; Morin, 1983), our
investigation examines the attributional process that explains
this association. This approach is consistent with the thinking
of several researchers (i.e., Bone, 1995; Herr et al., 1991;
Smith & Vogt, 1995) who posited that cognitive mechanisms
are important, as they can more fully explain the negative
WOMC-brand evaluation linkage. Furthermore, this re-
search is consistent with other studies that suggest (but do not
test the notion) that receivers’ cognitive processing of nega-
tive WOMC involves causal attributional reasoning (cf.
Folkes, 1988; Mizerski, Golden, & Kernan, 1979).
THEORY AND HYPOTHESES
Negative WOMC
Negative WOMC is defined as interpersonal communication
concerning a marketing organization or product that deni-
grates the object of the communication (Richins, 1984;
Weinberger, Allen, & Dillon, 1981). Negative WOMC po-
tentially has a more powerful influence on consumer behav-
ior than print sources, such as Consumer Reports, because in-
dividuals find it to be more accessible and diagnostic (Herr et
al., 1991). In fact, research has suggested that negative
WOMC has the power to influence consumers’ attitudes
(Engel, Kegerreis, & Blackwell, 1969) and behaviors (e.g.,
Arndt, 1967; Haywood, 1989; Katz & Lazarsfeld, 1955).
Attributions as Responses to
Negative WOMC
Because the transmission of negative WOMC involves inter-
personal and informal processes, attribution theory appears to
be particularly helpful in understanding a receiver’s interpre-
tation of a sender’s motives for communicating such informa-
tion (Hilton, 1995). The central theme underlying attribution
theory is that causal analysis is inherent in an individual’s
need to understand social events, such as why another person
would communicate negative information about a brand
(Heider, 1958; Jones & Davis, 1965; Kelley, 1967). For this
study, causal attribution is defined as the cognition a receiver
generates to infer the cause of a communicator’s generation
of negative information (Calder & Burnkrant, 1977).
Figure 1 illustrates the proposed process consumers use to
deal with negative WOMC. Specifically, it proposes two im-
portant influences on receivers’ attributional responses to
negative-WOMC communication. First, the information con-
veyed by the sender in a negative-WOMC message is posited
to influence receivers’ causal attributions. Second,
brand-name strength of the focal brand is also thought to di-
rectly affect receivers’ causal attributions. These attributional
responses, in turn, are expected to affect receivers’ brand
evaluations. Therefore, this study suggests that attributions
mediate the presupposed negative-WOMC-brand evaluation
relation. Such a model is consistent with theoretical frame-
works of interpersonal communication that suggest that attri-
butions mediate an interpersonal message’s effect on a
receiver’s evaluation of the focal object (e.g., Hilton, 1995).
FIGURE 1 Attributional process model for receivers of negative
word-of-mouth communication.
There is additional support for the mediational role played
by attributions in influencing individuals’ brand evaluations.
For example, studies in the advertising literature have sug-
gested that receivers generate causal attributions that in turn
affect their evaluations of the advertised brand (e.g., Wiener
& Mowen, 1986). In the performance evaluation literature,
studies indicate that sales manager attributions of salesperson
performance shape their reactions toward a salesperson (e.g.,
DeCarlo & Leigh, 1996). Thus, the following is proposed for
receivers of negative WOMC:
H1: Causal attributions will mediate the effects of
negative WOMC on brand evaluations.
Information Type and Causal
Attributions
According to research in classical attribution theory
(Kelley, 1967, 1973), the categories of causal attributions that
people generate in response to information include: stimulus
(i.e., brand, in this case), person (i.e., communicator, in this
case), circumstance, or a combination of these three.1 The
specific type of attributions generated by individuals, how-
ever, depends on the manner in which information is con-
veyed. According to attribution theory (Kelley, 1967) and
other studies dealing with WOMC (e.g., Richins, 1984), a re-
ceiver is likely to use three important information dimensions
to generate causal attributions: consensus, distinctiveness,
and consistency. In a negative-WOMC context, the consen-
sus dimension refers to the degree to which others are likely to
agree with the negative views of the communicator. The dis-
tinctiveness dimension encapsulates the extent to which the
communicator associates the negative information with a par-
ticular brand but not other brands. Finally, the consistency dimension reflects the extent to which the communicator’s negative experiences with the brand have been consistent over time.
1Although attribution theory suggests that individuals have the potential
to generate multiple and interactive attributional responses, this study fo-
cuses only on those attributions that are thought to have a significant impact
on brand evaluations in the negative-WOMC context (i.e., brand and com-
municator attributions).
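Kelley’s covariation patterns can be made concrete: high consensus, high distinctiveness, and high consistency jointly implicate the stimulus (the brand); low consensus and low distinctiveness with high consistency implicate the person (the communicator); and low consistency points to circumstances. A toy Python sketch of this decision rule (my simplification of Kelley, 1967 — not code or materials from this study):

def kelley_attribution(consensus, distinctiveness, consistency):
    # Inputs are booleans: True = high on that information dimension.
    if not consistency:
        return "circumstance"            # unstable experience -> situational cause
    if consensus and distinctiveness:
        return "stimulus (brand)"        # others agree, and only this brand
    if not consensus and not distinctiveness:
        return "person (communicator)"   # only this sender, across many brands
    return "combination"                 # mixed patterns -> mixed attributions

# A message framed as "everyone has this problem, only with this brand,
# every single time" should push receivers toward a brand attribution:
print(kelley_attribution(consensus=True, distinctiveness=True, consistency=True))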
Pattern Recognition Letters 118 (2019) 3–13
Contents lists available at ScienceDirect
Pattern Recognition Letters
journal homepage: www.elsevier.com/locate/patrec
Perceiving the person and their interactions with the others for social
robotics – A review
Adriana Tapus a , Antonio Bandera b , ∗, Ricardo Vazquez-Martin c , Luis V. Calderita b
a Autonomous Systems and Robotics Lab, Computer Science and System Engineering Department (U2IS), ENSTA ParisTech, 828 Blv des MArechaux, Palaiseau
91120, France
b AVISPA Group, Department of Electronic Technology, Universidad de Málaga, Málaga, 29071, Spain
c Robotics and Mechatronics Lab., Department of System Engineering and Automation, Universidad de Málaga, Málaga, 29071, Spain
ARTICLE INFO
Article history:
Available online 6 March 2018
Keywords:
Social robots
Human perception
Human–robot interaction
Social interactions
Proxemics
ABSTRACT
Social robots need to understand human activities, dynamics, and the intentions behind their behaviors.
Most of the time, this implies the modeling of the whole scene. The recognition of the activities and
intentions of a person are inferred from the perception of the individual, but also from their interactions
with the rest of the environment (i.e., objects and/or people). Centering on the social nature of the per-
son, robots need to understand human social cues, which include verbal but also nonverbal behavioral
signals such as actions, gestures, body postures, facial emotions, and proxemics. The correct understand-
ing of these signals helps these robots to anticipate the needs and expectations of people. It also avoids
abrupt changes on the human–robot interaction, as the temporal dynamics of interactions are anchored
and driven by a major repertoire of social landmarks . Within the general framework of interaction of
robots with their human counterparts, this paper reviews recent approaches for recognizing human ac-
tivities, but also for perceiving social signals emanated from a person or a group of people during an
interaction. The perception of visual and/or audio signals allow them to correctly localize themselves
with respect to humans from the environment while also navigating and/or interacting with a person or
a group of people.
© 2018 Elsevier B.V. All rights reserved.
∗ Corresponding author. E-mail address: ajbandera@uma.es (A. Bandera).

1. Introduction

One of the basic skills allowing people to interact in a safe and comfortable way is their ability to understand intuitively each other’s role and activities. Everyday, people observe one another and, through these observations, they recognize what they are doing and also infer their intentions. In addition, this is addressed without remarkable effort. It is clear that this ordinary and effortless ability is not only the result of having at our disposal a complex multimodal perception system, and those other complex systems, related to learning and planning, are also involved. Activities that have not been seen before cannot be recognized. In the same way, intentions, which do not respond to, or cannot be included within, a normal course of actions will not be correctly inferred. The recognition of activities and intentions is therefore intimately tied to the existence of a specific, shared socio-cultural background, which is continuously acquired and improved within the framework of the interaction with the others. The importance of the observation and interpretation of various social cues emanating from their social interaction with the others is therefore also crucial for our acquisition of the correct collection of social rules.

Now that robots are moving from automatized factories into our everyday environments, it is natural to endow them with some of the aforementioned skills (e.g., based on a set of social rules) centered on the challenge of interacting with humans. In this scenario, it is fundamental to have a robot perception system capable of reading the social signals emerged from the interaction. The aim is to produce a socially correct and smooth interaction between the robot and the humans in its surroundings, based on the prediction of their behaviors [76]. Anticipating which activities people in our surroundings will do next (and why they will do so) can help the robot to plan in advance its next responses and behaviors [94]. Robots need to understand verbal and nonverbal social cues from each individual person and from the dynamics of their relationships. Signals such as body postures, gestures, and facial emotions, are relevant for estimating the internal state of the humans. Understanding the dynamics of a group of people and identifying the social role of each member of the group help the robot to exhibit a
[Fig. 1. Robots interact with people in human-centred environments. In the figure, the Gualzru robot trying to convince a woman to follow it to an interactive advertising panel [62].]
correct behavior from a social perspective. All this knowledge can
only be acquired from the observation and modeling of the human
and from their social interactions with other people.
This paper focuses on reviewing recent approaches and relevant
topics related to the perception and modeling of the human, as an
isolated individual, but also as part of a group of people. Restricted
to the ability of identifying the signals that can help having a so-
cial interaction, the acquisition of this complex skill requires the
robot to be equipped with hardware and software modules that
allows it (i) to perceive humans and their static and dynamic at-
tributes; and (ii) to match the obtained features with a specific,
memorized or on-line captured state (social knowledge) for mod-
eling them. It is important to note that the static role, as a passive
observer, that we are assuming here for the robot is not the real
situation. Our robots are situated agents that perceive but also act
in this outer world. The Theory of Event Coding [32] proposes that
stimulus representations underlying perception are encoded using
the same format that sensorimotor representations underlying ac-
tion. This is a significant difference with respect to the analysis of
video sequences captured from static cameras. Although we do not
include within this contribution the importance of topics such as
affordance or goal directedness, we must consider that the situ-
atedness of the robot within the whole context plays a significant
role on its ability to recognize the behaviors and social interactions
of the humans in its surroundings.
The rest of the paper is organized as follows:
Section 2 overviews the problem tackled in this work, the model-
ing of the activities and social behavior of individuals, and their
social interactions. Among the most important requirements are
extraction and classification of hand-crafted or learned features,
and modeling and internalizing of the social relationships. Both
topics are described in Sections 3 and 4 , respectively. Section 3 is
divided up into two main sections, which review the typical
parameters of the perception system designed for a dyadic inter-
action ( Section 3.1 ) and for the interaction with a group of people
( Section 3.2 ). It is important to note that this strict separation
between feature extraction algorithms and classifiers does not
always exist and that both processes can be encoded together
within the same solution. A general discussion follows this study
in Section 5 . Finally, our conclusions are drawn in Section 6 .
2. Understanding a scene populated by humans
In this last decade, there has been a growing interest on the
design, methodology, and theory of human–robot interaction [29] .
This is justified by the fact that robots are expected to share our
same environments and to cooperate with us to a greater or lesser
extent in our daily activities. Hence, autonomous robots used for
specific tasks with a very limited interaction with humans is not
a viable solution. The restriction of the human–robot interaction
(HRI) scenario to a dyadic interaction, where the robot interacts
with only one human is not true most of the time. Robots are more
and more part of teams (robots or humans), for instance, work-
ing closely alongside humans in industrial settings [66] or help-
ing physiotherapists to evaluate how a patient performs a motion-
based test in a hospital room [84] . This understanding of a situation
forces the robot to perceive details from the whole scene, captur-
ing not only the human but also its interaction with the surround-
ing objects and, especially, its social interaction with other people
( Fig. 1 ). Focusing on only one person could lead to the omission of
important information, and this can conduct to wrong decisions.
The recognition of human activities and social interactions is a complex task for robots, which require the design and interaction of several modules. Detailing the scheme stated in Section 1, these systems typically include modules for (i) extracting significant unary and pairwise-interaction human-related features [74,89] from the scene; (ii) obtaining meaningful, semantic information (gender, gestures, …) [67] from these descriptors; and (iii) fusing the information coming from several sources for modeling and internalizing the scene (usually employing a graphical model [34,43]). The internalization of the perceived information can help to fuse multimodal cues or to deal with the subsequent intention recognition problem. Fig. 2 summarizes this approach. As other related approaches, the classification algorithms need to have at their disposal datasets (knowledge) for comparison and matching. Thus, although it is not drawn on the figure, the scheme must incorporate the learning mechanisms for updating this knowledge.
The modules in charge of extracting features (unary- and pairwise-interaction features including objects) and the ones responsible of returning semantic concepts from these features must try to build a model of the scene. This allows the robot to understand the behaviors of the people and even get the gist of this scene (e.g., catalog the event as birthday celebration, award function, etc. [59]). The parameters of these modules are tuned by the final use case or application: it is not the same to encourage a child to perform an exercise within a rehabilitation session than guiding a group of people through a museum. Moreover, the sensors, features and recognition needs are not the same either. The need of fusing perceptions coming from different modalities (e.g., audio and video for emotion recognition) could be a reason for adding a new module, the so-called ‘Internal representation’, on the scheme on Fig. 2. In some cases, the internal representation can include part of this knowledge (e.g., a priori known models of human bodies or faces) and then to be used also as an additional source of information for action recognition [6] or emotion recognition [17]. For instance, the hierarchical recognition approaches that are built over primitive sub-actions or sub-activities do not directly deal with the raw data for activity recognition [1]. An additional advantage of working over an inner representation is that approaches designed for performing the recognition processes can be partially decoupled from the hardware resources available on the robot [45]. Finally, the inference module on the Fig. 2 encodes the processes in charge of extending the model with data obtained from the outcomes of the classifiers.
Fig. 2. Major modules of a system in charge of modeling human behaviors and interactions. It typically includes feature extraction and classification, internal representation and inference mechanisms. The pipeline scheme is only partially true: the internal representation can store raw data from the feature extractors and help in modeling the whole scene (see text for details).

Within this paper, we conduct a survey of the solutions proposed for allowing a robot to perceive and internalize the activities and social interactions of a group of people. Thus, this review
covers the perception and modeling of the activities of a group of people that share the environment with the robot. The term activity takes in this context a significance that exceeds the simple execution of certain movements. Following the terminology given by Turaga et al. [79], we distinguish between action and activity. An action refers to a simple motion pattern, executed by a single person and usually with a short duration. Activities are complex sequences of actions performed by one or several people, in a scenario that is typically driven by social cues.
3. Perceiving and modeling people and their interactions

3.1. Modeling the human

As aforementioned, there exist a large number of signals that can be captured for modeling a person: speech, face expression, gaze, gestures, and any sort of measurements that a robot can record from the environment related to social interaction. The robot probably needs dedicated hardware and software resources for dealing with each one of them. In the simplest scenario, concerned only with activity recognition, the robot typically requires at least visual information for extracting motion information and characterizing the dynamics of the scene. It concentrates all resources on the interaction with one human counterpart: an action is in any case a sequence of body movements, and it usually involves several body parts concurrently.
Fig. 3 provides some snapshots of human–robot dyadic interaction. On the right, the ARMAR-III from the Karlsruhe Institute of Technology in Germany [80] is shown. It focuses on detecting and tracking the gestures of a human teacher [26]. The whole system allows the transfer of motion based on predefined gestures and force interaction. Initially, a dynamic movement primitive (DMP) [37] is learned from a human wiping movement. Given the color of the wiping tool, the robot tracks the movements of the tool using a stereo camera system. For the subsequent force-based adaptation of the learned DMPs, it relies on the readings of the force-torque sensor installed on the wrist of the robot. In Fig. 3(b), the Loki robot plays a simple game with a person. It is able to detect the presence of a person and recognize verbal commands. Thus, when the human introduces themselves and asks it to play the game, Loki uses color and distance information for tracking a yellow ball. For doing this, it has an RGB-D sensor placed on the forehead. It continuously fixates its gaze upon the ball. After a verbal indication, it reaches the ball with its hand and waits for a new interaction. Loki tracks the object and accepts new speech commands during the whole span of the game, representing the whole scene using an undirected graph [6]. Fig. 3(c) shows the Nao robot coaching a child during a rehabilitation session [58]. An external Kinect sensor from Microsoft is employed for capturing the skeleton of the human user, and threshold values are used for determining the correct execution of certain exercises. The same kind of interaction between a Nao robot and children with autism in an imitation task is also described in [14]. These examples show how different modalities, features and classifiers are used for modeling the human and its interaction with the robot. If we analyze the details of the hardware and software architectures behind these experiments, we can also note the complexity of the perception and actuation systems. As it is probably not possible to summarize all perceptual possibilities within one paper, here we provide a brief description of relevant issues, which are classified in Fig. 4.

Fig. 3. Human–robot interaction as a dyadic human–robot interaction: ARMAR-III interacting with a human teacher [26]; Loki playing with one person [6]; and the Nao robot coaching a child in a rehabilitation session [58].

Fig. 4. Taxonomy of the methods and approaches covered in this survey.
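For reference, the DMP formulation of Ijspeert et al. [37] used in the ARMAR-III example above can be summarized as follows (our transcription of the standard transformation and canonical systems; $\alpha_z$, $\beta_z$ and $\alpha_x$ are fixed gains):

$$\tau\dot{z} = \alpha_z\left(\beta_z\,(g - y) - z\right) + f(x), \qquad \tau\dot{y} = z, \qquad \tau\dot{x} = -\alpha_x\,x,$$

where $y$ is the motion variable, $g$ the goal, $x$ a phase variable, and $f(x)$ the forcing term learned from the demonstrated (e.g., wiping) movement.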
3.1.1. Feature extraction

In a dyadic scenario, feature extraction aims to transform the raw information captured by sensors into feature vectors for the subsequent modeling of the human. Robots usually employ vision, audio, and/or range sensors. Table 1 summarizes the features and techniques for semantic understanding employed by several social robots. Typical tasks include human tracking, face and/or speech recognition, and scale up to action and activity recognition. It is however noticeable that social robots are not usually endowed with the ability to recognize intentions. In fact, it is not common that they consider the activity recognition task, in the sense that we briefly state it in Section 2.
Table 1
Representative perception modalities on social robots.

Social robot | Task | Features | Algorithms for semantic understanding
PaPero | Face detection and recognition | Shape, 3D model | Template matching
PaPero | Speech recognition | Filter banks | Hidden Markov model
i-Cub | Human detection | Motion-based | Machine learning [82]
i-Cub | Human/face tracking | Color | Hierarchical temporal memory [41]
i-Cub | Sound localization | ITD, ILD, and notches | Active mapping [33]
Maggie | Emotion recognition | Voice and face expression | [3]
Maggie | Pose recognition | Skeleton | Template matching [28]
Maggie | Speech processing | | Grammar-based [4]
ARMAR-III | Human tracking | Haar-like, color... | Particle filters [54]
ARMAR-III | Human tracking | Time-delay | Particle filters [54]
ARMAR-III | Face recognition | DCT-based | Nearest neighbor
ARMAR-III | Gesture recognition | Intensity, color | Neural network + hidden Markov model
ARMAR-III | Head pose estimation | Intensity, shape | Neural network
ARMAR-III | Sound recognition | ICA-transformed features | Hidden Markov model [75]
ARMAR-III | Speech recognition | MFCC | RTN [75]
Loki | Face detection and tracking | Color, depth | Active appearance model
Loki | Human motion capture | Skeleton | Template matching [9]
Loki | Speech recognition | | CNN-BLSTM
Loki | Emotion recognition | Candide model | DBN [17]
NaoTherapist | Human motion capture | Skeleton | Machine learning for body-part

With respect to the features employed, they usually depend on the task to solve. We can group them into three major classes according to the temporal dimension. On one hand, we have tasks
such as emotion detection from facial features or the recognition of a specific verbal command. Although we can incorporate time for improving the classification results, these tasks put the emphasis on the current instant of time: an image for facial expression, or a word for verbal command recognition. Within each observation, these approaches employ static data such as the brightness or color values of images. These raw data are usually provided as input to modules that obtain feature vectors such as the Local Binary Patterns (LBP) or the Haar-like features. Both features have been successfully employed for face detection [83] or for gender and age estimation [52]. Other popular descriptors for characterizing static images are the scale-invariant feature transform (SIFT) or the speeded-up robust features (SURF). In audio perception, significant features are the inter-aural level difference (ILD) and the inter-aural time difference (ITD) [33]. But the most commonly used feature extraction method in automatic speech recognition is probably the Mel-frequency cepstral coefficients (MFCC) [75]. Contrary to static approaches, sequential algorithms consider the scene as an ordered collection of individual observations. However, within each observation, they deal with static features. The matching of these features within the sequence of images allows, for example, tracking human body or face parts [31]. In these approaches, the feature extraction can be supported by inner models of the human [5,6]. For instance, the Candide model has been successfully employed for human face tracking [78] or emotion recognition through the definition of the action unit features [17] (Fig. 5). The tracking of the joints (head, left shoulder, center shoulder, right shoulder, left elbow, right elbow, left wrist, etc.) composing the three-dimensional (3D) representation of the human body as a skeleton is also widely used for action recognition in robots equipped with RGB-D sensors [28,58]. Both schemes show the advantages of tying together internal representation and perception. Finally, space-time approaches treat the space and time dimensions equally, and work in a 3D space. There exist 3D versions of typical image-based descriptors, such as SIFT3D [69] or SURF3D [93]. Unfortunately, they inherit from their predecessors the limitations in performance generalization [47]. Many efforts have been made to design features based on other principles: representing actions by a temporally integrated spatial response (TISR descriptor) that extracts bag-of-words features
[99]; trajectories described using histograms of oriented gradients (HOG), histograms of optical flow (HOF) and motion boundary histograms (MBH) around interest points (iDT descriptors) [86]; etc. The plethora of descriptors allows researchers to fuse them and obtain successful schemes for recognition, as we briefly describe in Section 3.1.2.

Fig. 5. Recognizing emotions from facial features using the Candide model [17].
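As a concrete illustration of two of the static descriptors named above, the sketch below (a minimal example assuming the scikit-image and librosa libraries; parameter values are arbitrary) computes an LBP histogram for a grayscale image and MFCCs for an audio file:

import numpy as np
import librosa                                      # audio front end (MFCC)
from skimage.feature import local_binary_pattern    # image texture (LBP)

def lbp_histogram(gray_image: np.ndarray, p: int = 8, r: float = 1.0) -> np.ndarray:
    # Uniform LBP codes take values in [0, p + 1]; summarize them
    # as a normalized occurrence histogram.
    codes = local_binary_pattern(gray_image, P=p, R=r, method="uniform")
    hist, _ = np.histogram(codes, bins=p + 2, range=(0, p + 2), density=True)
    return hist

def mfcc_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    # Mel-frequency cepstral coefficients per frame, the common
    # front end for automatic speech recognition mentioned above [75].
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)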
3.1.2. Feature vectors classification

Feature vectors can be classified for solving tasks (see Table 1) using a large variety of approaches. Using skin color and image disparity, Nickel and Stiefelhagen [54] used a k-means clustering approach for face detection. Stiefelhagen et al. [75] proposed to solve face recognition by computing the distances between the input images and a collection of training images. A Min–Max normalization approach and a sum rule that normalizes and fuses scores are applied. Then, the face is classified according to the highest score and a predefined threshold value. However, the most popular strategy for detecting faces was the combination of the Haar-like features with an AdaBoost classifier, originally proposed by Viola and Jones [83]. The approach was extended for dealing with rotated faces, and for performing face recognition using the Eigenfaces approach [73]. Another boosting approach, the so-called GentleBoost, was used for recognizing children's emotions [63]. When input data is represented as a sequence of ordered observations, the problem is how to compare the incoming stream with the stored template. Previous approaches used dynamic time warping (DTW) or a simple matching of coefficients obtained from the activities by principal component analysis (PCA). Lin et al. [48] described the activity as a hierarchical prototype tree, which is matched to the trees in the dataset for recognition. Hidden Markov models (HMM) were applied for speech recognition [49]. HMMs or extensions have also been widely applied in human activity recognition, and novel versions are still proposed [42]. Surveys such as the one by Cheng et al. [12] (for activity recognition) or Mishra et al. [51] (for face emotion recognition) provide information about databases and approaches. New schemes are continuously proposed, and it is now possible to adopt one of these state-of-the-art algorithms in our robotics architecture and obtain good results in a short time. The use of such closed solutions for solving human-perception tasks is widespread [4,6].
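Because the Viola–Jones detector [83] mentioned above ships with OpenCV, a minimal face detection example can be written in a few lines (a sketch assuming the opencv-python package and its stock frontal-face cascade):

import cv2

def detect_faces(image_path: str):
    # Load the pretrained Haar cascade (AdaBoost over Haar-like features)
    # bundled with OpenCV, and run it over a grayscale image.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    # Returns one (x, y, w, h) bounding box per detected face.
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)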
3.1.3. Convolutional Neural Networks (CNNs)

Instead of handcrafting features and training traditional machine learning methods, another option is to learn these descriptors directly from the raw data. Deep convolutional neural networks (CNNs) are currently the state-of-the-art solution for several computer vision problems such as object detection [55] and classification [27,57]. In a CNN, cells act as local filters over the input space, exploiting the strong spatially local correlation; this is the main reason behind their success in computer vision applications. Combined with multi-layered recurrent networks (long short-term memory, LSTM) used for learning temporal series, CNNs are also the state-of-the-art solution for speech recognition [95]. In general, it can be considered that CNNs and their extensions are currently the strategy for dealing with the challenge of perceiving the human.

With respect to the pipeline strategy followed by most of the approaches described in Section 2, CNNs can be trained for linking raw information with class labels. This end-to-end training is performed in a supervised way [35], and its traditional major drawback is that good training requires a vast number of labeled training patterns [38]. Fortunately, image-based models trained using millions of labeled patterns are now readily available [39]. It has also been demonstrated that a model trained on a large dataset can be transferred to other visual recognition tasks with limited training data [21,55]. Recently, Zhang et al. [97] proposed a part-based hierarchical bidirectional recurrent neural network (PHRNN) to analyze the facial expression information of temporal sequences. Combined with a multi-signal CNN (MSCNN), the resulting deep evolutional spatial-temporal network effectively boosts the performance of facial expression recognition.

This last work captures the dynamic variation of facial physical structure from a sequence of images. Similarly, to be useful for recognizing human activities, the CNN needs to be extended from the bi-dimensional domain of the image to the three-dimensional, spatio-temporal domain of the video sequence. The solutions for taking the temporal cue into account can be grouped into three major clusters: (i) three-dimensional (3D) CNNs; (ii) motion-based CNNs; and (iii) fusion approaches. The first cluster includes those approaches that perform 3D convolutions on the video sequence. The second one groups the methods that adopt the scene information related to motion as an input for the CNN. The third cluster proposes to fuse the information in temporal domains. These approaches are complementary, and it is typical that CNN-based approaches merge techniques from different clusters for activity recognition. The best results are typically provided by those approaches that adopt the two-stream model [71]. Basically, the idea is to characterize the sequence of images using two different convolutional network (ConvNet) streams: a temporal stream of motion-based features and a second spatial stream of appearance-based features. Fig. 6 provides a graphical illustration of the two-stream proposal by Wang et al. [90]. As Fig. 6 shows, a fusion process combines the obtained results and delivers the final decision. Wang et al. [90] proposed a temporal segment network (TSN) to recognize actions. The approach consists of three steps. First, the input video is divided into K segments and a short portion (fragment) is randomly selected from each segment. Second, the class scores of the different fragments are fused by a segmental consensus function to yield a video-level prediction. Third, predictions from the spatial and temporal streams are then fused to produce the final prediction. The second step of the previous scheme was modified in the sequential segment network (SSN) [11]. The aim is to concatenate the outputs of the different segment portions as the video-level representation. This representation is fed into the fully-connected layer. Feichtenhofer et al. [24,25] proposed to generalize the residual networks (ResNets) to the spatio-temporal domain by introducing residual connections within the two-stream model. Specifically, Feichtenhofer et al. [24] injected residual connections between the appearance and temporal streams. Moreover, they transformed pre-trained image ConvNets into spatio-temporal networks by equipping them with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. Feichtenhofer et al. [25] fused the two streams by motion gating and injected identity mapping
kernels as temporal filters to learn long-term temporal information. Wang et al. [91] provided a pyramid two-stream model for merging the spatial and temporal information. The goal is to make both streams reinforce each other. Duta et al. [22] added to the spatial and temporal streams a third spatio-temporal stream built with the C3D architecture [77]. The Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF) is proposed to build the action representation over the entire video. Table 2 shows the classification accuracy of these approaches on the UCF101 and HMDB51 databases. The UCF101 database consists of 13,320 videos with 101 action classes [72]. It is characterized by a large diversity in terms of variations in background, camera motion, illumination and viewpoint, as well as object scale, appearance or pose. The HMDB51 dataset consists of 6766 videos [44]. It shows a smaller repertoire of classes (51 action classes), but it is typically considered more challenging than the UCF101 due to the even wider variations in which actions are performed [24]. Both datasets provide an evaluation protocol. The evaluation metric is the mean of the Average Precision (mAP) [23].

Fig. 6. Temporal segment network [90].

Table 2
Activity recognition results on the UCF101 and HMDB51 databases.

Approach | CNN scheme | Features | UCF101 – mAP (%) | HMDB51 – mAP (%)
Wang et al. [86] | – | iDT | 85.9 | 57.2
Wang and Schmid [87] | – | iDT | 87.9 | 61.1
Wang et al. [90] | BN-Inception | CNN | 94.2 | 69.4
Chen and Zhang [11] | BN-Inception | CNN | 94.8 | 73.8
Feichtenhofer et al. [24] | ST-ResNet | CNN | 93.4 | 66.4
Feichtenhofer et al. [24] | ST-ResNet | CNN + iDT | 94.6 | 70.3
Feichtenhofer et al. [25] | ResNet-50 | CNN | 94.2 | 68.9
Feichtenhofer et al. [25] | ResNet-50 | CNN + iDT | 94.9 | 72.2
Wang et al. [91] | BN-Inception | CNN | 94.6 | 68.9
Duta et al. [22] | VGG-16, VGG-19 | CNN | 93.6 | 69.5
Duta et al. [22] | VGG-16, VGG-19 | CNN + HMG | 94.0 | 70.3
Duta et al. [22] | VGG-16, VGG-19 | CNN + HMG + iDT | 94.3 | 73.1
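The core of the two-stream and TSN ideas reduces to a few lines. The following PyTorch-style sketch (our simplification; spatial_net and temporal_net stand for hypothetical pretrained classifiers over RGB frames and stacked optical flow) shows segmental consensus followed by late score fusion:

import torch

def tsn_scores(net, snippets):
    # Segmental consensus: average the class scores of one snippet
    # sampled from each of the K video segments [90].
    return torch.stack([net(s.unsqueeze(0)) for s in snippets]).mean(dim=0)

def two_stream_predict(spatial_net, temporal_net, rgb_snips, flow_snips, w=0.5):
    # Late fusion: weighted average of the per-stream class probabilities.
    with torch.no_grad():
        p_rgb = torch.softmax(tsn_scores(spatial_net, rgb_snips), dim=1)
        p_flow = torch.softmax(tsn_scores(temporal_net, flow_snips), dim=1)
    return w * p_rgb + (1.0 - w) * p_flow   # fused video-level prediction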
3.2. Modeling a group of people

Understanding the activities and social interactions in a group of people is a challenging topic that is starting to gain increasing attention from researchers. Several works pursue the determination of social networks from appearance- and motion-based parameters characterizing the people in the scene. For instance, Yu et al. [96] estimated the social network encoding the interactions among people by combining face recognition and motion similarities between tracks of people on the ground plane. The association problem of mapping faces and tracks was solved using a novel graph-cut based algorithm. In the proposal by Ding and Yilmaz [20], this social network was extracted from the video sequence by analyzing the relationships among visual concepts. A probabilistic graphical model
(PGM) with temporal smoothing was employed for analyzing social relations among actors and for detecting communities. The approach assumes that the relations remain constant throughout the video sequence. RoleNet is a model for describing social relationships within a group of people [92]. It is built as a weighted graph, where nodes are people, arcs represent relationships, and a third set of weights encodes the strength of the arcs (relationships). Using co-occurrence matrices and recognizing people by face recognition, the social interaction is driven by the actors and not by audiovisual features. The method determines roles (leading roles and supporting roles) and divides the sequence into scenes according to the context of roles [92]. As a major disadvantage, all these approaches do not extrapolate generic social events or situations (birthday, wedding...) from one video sequence to another. The grouping of the people is local to each sequence, and social roles within an event (e.g. priest, groom, bride...) are not recognized. Some authors have addressed the problem of detecting groups of interacting people using the concept of F-formations [40,50]. F-formations are defined as a geometric arrangement encoding the position and orientation information of people standing in the formation (Fig. 7). The estimation of these F-formations can be inferred from body poses and/or head orientations. Vascon et al. [81] associated each person with a frustum, which was computed from the position and orientation information. They designed a game-theoretic framework where the concept of the F-formation was embedded, but also the biological constraints of social attention. Orientation was the main cue for Ricci et al. [60]. A joint learning approach was suggested for estimating the pose and F-formation for groups of people. Zhang and Hung [98] also employed the frustum of attention. But, contrary to Vascon et al. [81], they used this frustum to obtain features from people. These features labeled people as associates, singletons and members of F-formations. Using the Group Interaction Zone (GIZ), Cho et al. [15] also addressed the problem of detecting meaningful groups by
modeling proxemics. They described the group activity in a GIZ using attraction and repulsion properties, which considered an interaction in terms of “getting close”, “away”, and “keeping the same distance together”.

Fig. 7. Two-people formations [40] and a three-people formation [61].
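A toy version of the frustum-based F-formation detection discussed above can be sketched as follows (our simplification, assuming only 2D positions and heading angles; the actual systems [81,98] add visibility constraints and learning machinery on top of this geometric core):

import numpy as np

def f_formation_pairs(pos: np.ndarray, theta: np.ndarray,
                      stride: float = 1.0, tol: float = 0.6):
    # Each person 'votes' for an o-space centre located one stride
    # ahead along their heading; two people are tentatively grouped
    # when their votes (nearly) coincide.
    centres = pos + stride * np.stack([np.cos(theta), np.sin(theta)], axis=1)
    pairs = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            if np.linalg.norm(centres[i] - centres[j]) < tol:
                pairs.append((i, j))
    return pairs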
Other works try to capture the social interactions to help in the recognition of joint activities. Facial features were modeled for recognizing activities such as hand-shaking [56]. The relation history image (RHI) descriptor was proposed by Gori et al. [30] for discriminating activities and interactions that happen at the same time. The RHI is built as the temporal variation of relational information between every pair of local subparts belonging to one or a pair of people. Choi and Savarese [16] proposed a model that unifies the tracking of multiple people, the recognition of individual actions, and the identification of the interactions and collective activities. It is assumed that there exists a strong correlation between the individual activity of each person and the activities of the other people. Cheng et al. [13] proposed a layered model. They first extracted various motion and appearance features from the video and trajectory data. Then, features were randomly sampled from the training features to generate codebooks of visual words using K-means clustering. All features are quantized by assigning them to their nearest visual words using the Euclidean distance. The resulting normalized histograms of visual word occurrences formed the final representations, one feature type per group action instance. A multi-class support vector machine (SVM) was used to build the classifier and make the recognition decisions. Al-Raziqi and Denzler [2] proposed to divide the video sequence into clips using an unsupervised clustering approach. Within the clips, significant groups of objects were detected using a bottom-up hierarchical clustering and then tracked over time. Furthermore, the mutual effect between objects based on motion and appearance features was computed. Finally, the Hierarchical Dirichlet Process (HDP) was employed to cluster the clips.
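The codebook stage of the layered model by Cheng et al. [13] follows the classic bag-of-visual-words recipe, which can be sketched with scikit-learn as follows (our illustration; k and the classifier choice are arbitrary):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bow_histograms(descriptor_sets, k=100):
    # Learn a codebook of k visual words from all local descriptors,
    # then represent each clip as a normalized word-occurrence histogram
    # (nearest-word assignment uses the Euclidean distance).
    codebook = KMeans(n_clusters=k, n_init=10).fit(np.vstack(descriptor_sets))
    hists = []
    for descriptors in descriptor_sets:
        words = codebook.predict(descriptors)
        h = np.bincount(words, minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))
    return np.array(hists), codebook

# A multi-class SVM is then trained on the histograms, e.g.:
# clf = LinearSVC().fit(train_hists, train_labels)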
The recognition of social roles and its importance for predicting group activities has been explored by Ramanathan et al. [59] and Lan et al. [46]. The aim is to identify events and roles, being able to label people (Fig. 8). The first proposal addressed the identification of social roles in a weakly supervised framework, while the second one works in a fully supervised scenario. Ramanathan et al. [59] tackled the problem from the perspective of recognizing social roles, which emerge from the interactions among people and between people and objects. They proposed to model the inter-role interactions using a Conditional Random Field (CRF) under a weakly supervised setting. Unary component representations included HOG3D, spatio-temporal features, object interaction features (restricted to two objects per event) and social role features (clothing and gender of the person). These features were refined in a subsequent layer consisting of pairwise spatio-temporal interaction features. The parameters of the CRF-based model and the role labels were learned by adapting a joint variational inference procedure. Focused on group activities, a hierarchical classifier was proposed by Lan et al. [46]. Using an undirected graphical model, the hierarchy encoded individual actions, role-based unary components, pairwise roles, and group activities. Thus, at a low level, the classifier recognizes single activities. At a mid level, it infers social roles. The parameters of the model are learned using a structured support vector machine (SVM). It works under a completely supervised setting.

Fig. 8. Sample frames from a 'wedding' event from two films with manual role annotations.
3.2.1. Convolutional Neural Networks (CNNs)

Similarly to the approaches described in Section 3.1.3, there are proposals that deal with the problem of recognizing the activity of a group of people by using a layered model where both motion and appearance information are employed. For instance, Zhuang et al. [100] proposed the Differential Recurrent Convolutional Neural Network (DRCNN). As Fig. 9 shows, the DRCNN combines layers of convolutional networks, max-pooling, fully-connected, differential long short-term memory (DLSTM) networks and soft-max. Contrary to Cheng et al. [13] and Cho et al. [15], this method does not need the previous detection of the people in the images. For assessing the performance of the approaches for group activity recognition, two popular public video datasets are used: BEHAVE and NUS-HGA. The NUS-HGA dataset consists of 476 video clips, which cover six group activity classes (Fight, Gather, Ignore, RunInGroup, StandTalk and WalkInGroup). The BEHAVE dataset consists of 7 long video sequences. As these video sequences include different classes of group activities, video clips containing group activity instances have been extracted from the sequences. These video clips cover ten group activity classes, but it is typical to use only six of them (Approach, Fighting, InGroup, RunTogether, Split, and WalkTogether), because the rest only contain a few short sequences. Table 3 shows the group recognition results provided by several approaches on these datasets.

Fig. 9. Differential Recurrent Convolutional Neural Networks [100].

Table 3
Group recognition results on the NUS-HGA and BEHAVE databases.

Approach | NUS-HGA accuracy (%) | BEHAVE accuracy (%)
Cheng et al. [13] | 96.20 | 92.93
Cho et al. [15] | 96.03 | 93.74
Al-Raziqi and Denzler [2] | 81.94 | 79.35
Zhuang et al. [100] | 99.25 | 94.63
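The overall shape of such layered motion-plus-appearance models can be illustrated with a skeletal CNN + LSTM classifier (a PyTorch sketch far simpler than the actual DRCNN [100]; all layer sizes are arbitrary):

import torch
import torch.nn as nn

class ClipClassifier(nn.Module):
    # Per-frame CNN features -> LSTM over time -> group-activity scores.
    def __init__(self, n_classes: int, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                 # tiny per-frame encoder
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, feat_dim))
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t = clip.shape[:2]                     # clip: (B, T, 3, H, W)
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])              # scores from last time step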
Other approaches represent activities and interactions within a hierarchical representation. Taking into consideration scene classification and group activity recognition, Deng et al. [19] proposed a hierarchical model that predicts scores for individual actions, obtained from bounding-boxes around each person, and the group
activity, from the whole scene. The obtained labels were refined by applying a belief propagation-like neural network. The dependencies between individual actions and the group activity are taken into account in the network. The model learns the message-passing parameters and performs inference and learning in a unified framework using back-propagation. While this approach uses neural network-based graphical representations, Ibrahim et al. [36] leveraged LSTM-based temporal modeling to learn discriminative information from time-varying sports activity data.
4. Internalizing the information

The integration of the isolated feature descriptors provided by individual perceptual units into a whole view of the scene can be achieved by internalizing all this information into a unique representation. This scheme has been widely employed in robotics, especially when robots are expected to deploy cognitive functionalities. If cognition is the ability that allows us to internally deal with the information about ourselves and the external world, this ability is subject to the existence of an internal active representation handling all this information. For instance, Fig. 10 shows an overview of the architecture proposed by Lallée et al. [45] for the i-Cub robot. In the figure, we can note the presence of a module for storing the spatial knowledge of the scene, which receives inputs from the 3D perception module. The presence of this first module, in the 'Platform independent' part of the software architecture, allows the system to decouple sensing and perception. This module is a geometric memory in Lallée et al. [45], the so-called EgoSphere. In the proposal by Romero-Garcés et al. [62], the knowledge is stored in a graphical representation that merges symbolic and metric information.

Fig. 10. Overview of the software architecture for human perception proposed by Lallée et al. [45].
The use of an internal representation can be a good solution for encoding the complexity of a scene populated by several people. As shown in the previous sections, rich semantic relations are important for understanding these events. If these relations can be useful for understanding the activities of an individual, building relationships among the people sharing a common task will be basic for recognizing group activity. Ramanathan et al. [59] encode relationships among people but also between people and objects. Some of the state-of-the-art approaches presented above successfully label the perceived sequence of images, but they are unable to provide fine details about the individual role or activity of each person in the scene. Hierarchical approaches recognize the activities of each individual person and of the group of people, but it is typical that they do not encode all the richness of the interactions. Graphical models emerge as a solution to encode components of visual appearance and their relations and interactions [6]. Chen et al. [10] combined graphical models and deep neural networks, feeding the outcomes of the final layer of a deep network to a CRF model. Schwing and Urtasun [68] designed an iterative process for the training of a CRF model and, expanding this approach, Deng et al. [18] used an iterative approach for employing the actions of the other people in the scene to disambiguate the action of each individual. They accomplished this with a recurrent neural network, refined by repeatedly passing messages with estimates of each individual person's action. The inner representation of the scene is configurable, using trainable gating functions for turning on and off arcs between individual people in the scene.
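A minimal flavor of such an internal representation, mixing symbolic labels with metric attributes (a toy structure loosely inspired by the graph-based proposals [6,62], not their actual data structures), could look like this:

class SceneGraph:
    # Nodes carry symbolic and metric attributes; edges encode the
    # relations between people and objects used by later inference.
    def __init__(self):
        self.nodes = {}   # id -> attribute dictionary
        self.edges = []   # (id_a, relation, id_b) triples

    def add(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def relate(self, a, relation, b):
        self.edges.append((a, relation, b))

g = SceneGraph()
g.add("person_1", type="human", pose=(1.0, 2.0), activity="talking")
g.add("person_2", type="human", pose=(1.4, 2.1), activity="talking")
g.relate("person_1", "interacts_with", "person_2")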
5. Discussion

The previous sections review the state-of-the-art approaches for perceiving people and their social interactions. Although the advances in accuracy are really impressive, some doubts appear when these algorithms must be translated to robotics. One of the major difficulties is related to the response time of these algorithms. The robots illustrated in Fig. 3 need to interact with a person at human interaction rates. The hardware and software complexities underlying some of the architectures in this review are really relevant, and their integration within a robot would increase its price. This is a significant issue: how much would a social robot cost? As Blackman [7] pointed out for care robots, there is a serious lack of robust evidence of cost-effectiveness. Even if we solve the technological challenge of endowing a robot with the abilities for understanding
our activities and intentions, it will be difficult to bridge the gap between the research or academic domain and the market shelf. Another significant problem is that most of the approaches focus on recognition from a 3rd-person perspective (i.e., viewpoint). In these cases, the camera is typically far away from the people, and the algorithms recognize what people are doing to each other without getting involved in the activities (e.g., two people walking together). This paradigm is insufficient when the observer itself is involved in the interactions [65].
5.1. Networked robotics: the strength of being part of an ecology

For addressing both problems, recent proposals suggest embedding intelligent networked robotic devices in our everyday environments (homes, offices, public buildings...). Similar to ubiquitous computing, the robot is now one element within an ecology of connected devices. In fact, extending the definition of a robot to any embedded device with computing, communication, and sensing or actuation capabilities [8], we can refer to this as an 'ecology of robots'. Within these approaches, the perceptual and social abilities of each robot are augmented by adding the ones provided by the rest of the robots. Each robot is in charge of solving a specific task, and the human activity understanding can be solved by using wearable sensors [70], or external cameras that provide the usual 3rd-person perspective. Moreover, the robot can share the acquired knowledge by uploading it to a distributed database [85].
5.2. Approaches for first-person activity recognition

First-person cameras or microphones are the correct input devices for providing the researchers with the information that will allow endowing the robot with the situation awareness that we state at the end of Section 1. In this egocentric scenario, the observer wearing the camera is involved in the ongoing activities. It must be noted that these videos will visually display very different properties when compared to video captured from a conventional, 3rd-person viewpoint. As an example, Fig. 11 shows some samples of the task 'Handshaking with the observer' captured from the two viewpoints.

Fig. 11. Samples from a video sequence capturing a 'Handshaking with the observer': (top) from a 3rd-person viewpoint; and (bottom) from a first-person viewpoint.

The research area of first-person activity recognition or scene understanding has been gaining an increasing amount of attention these last years. There are works on the recognition of activities of daily living [53], early recognition [64], etc. It is expected that new datasets and approaches will appear in the next years.
6. Conclusions

This review provides a summary of approaches that have been applied to characterize and recognize the behaviors of an individual or a group of people. Specifically, the understanding of the interaction with a group of people has been receiving significant attention from the research community in recent years. Similarly, a large set of concepts and different approaches have emerged recently. This paper summarizes some of these advances for modeling the social setting where the robot is involved and for extracting the relevant information during the interaction. Deep neural networks (CNNs and LSTMs) represent promising techniques for the detection and classification tasks involved in the interaction of a social robot. As discussed above, these techniques require a vast number of labeled training patterns, but this is not a problem thanks to the availability of large labeled datasets and trained networks. These approaches have shown impressive results in the recognition of human activity in the field of computer vision. While achieving these results is a significant achievement, researchers still have many challenges to deal with, such as repeating the achieved recognition rates in egocentric videos, dealing with noise due to the dynamics associated with the robot's motion, etc. The whole problem should be approached from the robotics point of view, and the algorithms should work with low memory and less computational time. Recently, the development of new methods on Application-Specific Integrated Circuits (ASIC) or Field-Programmable Gate Arrays (FPGA) was discussed in [88]. A transversal effort will require joint expertise in embedded vision and traditional teams of robotics, software engineering and computer vision researchers. Furthermore, the work on activity recognition should be extended to deal with early recognition, where the pre-activity observations and the context awareness are basic concepts.
Acknowledgments

The research work of A. Bandera, R. Vazquez-Martin and L.V. Calderita within this scope has been partially funded by the EU ECHORD++ project (FP7-ICT-601116) and the TIN2015-65686-C5-1-R (Gobierno de España and FEDER funds).
References
[1] J. Aggarwal , M. Ryoo , Human activity analysis: a survey, ACM Comput. Surv.
43 (2011) 1–43 .
[2] A. Al-Raziqi, J. Denzler, Unsupervised Group Activity Detection by Hierarchi-
cal Dirichlet Processes, Springer International Publishing, Cham, pp. 399–407.
doi: 10.1007/978-3-319-59876-5_44.
[3] F. Alonso-Martín, M. Malfaz, J. Sequeira, J. Gorostiza, M. Salichs, A multimodal
emotion detection system during human-robot interaction, Sensors (Basel) 13
(11) (2013) 15549–15581, doi: 10.3390/s131115549 .
[4] F. Alonso-Martín, M.A. Salichs, Integration of a voice recognition system in a
social robot, Cybern. Syst. 42 (4) (2011) 215–245, doi: 10.1080/01969722.2011.
583593 .
[5] A . Aly, A . Tapus, Multimodal adapted robot behavior synthesis within a nar-
rative human-robot interaction, in: Proceedings of the 2015 IEEE/RSJ Interna-
tional Conference on Intelligent Robots and Systems (IROS), 2015, pp. 2986–
2993, doi: 10.1109/IROS.2015.7353789 .
[6] A. Bandera, P. Bustos, Toward the development of cognitive robots, in:
L. Grandinetti, T. Lippert, N. Petkov (Eds.), Proceedings of the International
Workshop on Brain-Inspired Computing, BrainComp 2013, Springer Interna-
tional Publishing, Cham, 2014, pp. 88–99, doi: 10.1007/978-3-319-12084-3_8.
Cetraro, Italy.
[7] T. Blackman , Care robots for the supermarket shelf: a product gap in assistive
technologies, Ageing Soc. 33 (5) (2013) 763–781 .
[8] M. Bordignon, M.J. Rashid, M. Broxvall, A. Saffiotti, Seamless integration of
robots and tiny embedded devices in a PIES-Ecology, in: Proceedings of the
2007 IEEE/RSJ International Conference on Intelligent Robots and Systems,
Sheraton Hotel and Marina, San Diego, California, USA, 2007, pp. 3101–3106 .
October 29–November 2, 2007. 10.1109/IROS.2007.4399282 .
[9] L.V. Calderita, J.P. Bandera, P. Bustos, A. Skiadopoulos, Model-based reinforce-
ment of kinect depth data for human motion capture applications, Sensors 13
(7) (2013) 8835–8855, doi: 10.3390/s130708835 .
[10] L. Chen , G. Papandreou , I. Kokkinos , K. Murphy , A.L. Yuille , Semantic image
segmentation with deep convolutional nets and fully connected CRFs, CoRR
(2014) .
[11] Q. Chen , Y. Zhang , Sequential segment networks for action recognition, IEEE
Signal Process. Lett. 24 (5) (2017) 712–716 .
[12] G. Cheng , Y. Wan , A.N. Saudagar , K. Namuduri , B.P. Buckles , Advances in hu-
man action recognition: a survey, CoRR (2015) . abs/1501.05964.
[13] Z. Cheng , L. Qin , Q. Huang , S. Yan , Q. Tian , Recognizing human group action
by layered model with multiple cues, Neurocomputing 136 (2014) 124–135 .
[14] P. Chevalier, J. Martin, B. Isableu, C. Bazile, A. Tapus, Impact of sensory prefer-
ences of children with ASD on imitation with a robot, in: Proceedings of the
2017 IEEE International Conference on Human–Robot Interaction (HRI), 2017,
doi: 10.1145/2909824.3020234 .
[15] N.-G. Cho , Y.-J. Kim , U. Park , J.-S. Park , S.-W. Lee , Group activity recognition
with group interaction zone based on relative distance between human ob-
jects, Int. J. Pattern Recognit. Artif. Intell. 29 (5) (2015) 1555007 .
[16] W. Choi , S. Savarese , A unified framework for multi-target tracking and col-
lective activity recognition, in: Proceedings of the 2012 European Conference
on Computer Vision (ECCV), 2012, pp. 215–230 .
[17] F. Cid, J. Moreno, P. Bustos, P. Núñez, Muecas: a multi-sensor robotic head for
affective human robot interaction and imitation, Sensors 14 (5) (2014) 7711–
7737, doi: 10.3390/s140507711 .
[18] Z. Deng, A. Vahdat, H. Hu, G. Mori, Structure inference machines: recurrent
neural networks for analyzing relations in group activity recognition, in: Pro-
ceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recog-
nition, CVPR 2016, Las Vegas, NV, USA, 2016, pp. 4772–4781 . June 27–30,
2016. 10.1109/CVPR.2016.516 .
[19] Z. Deng , M. Zhai , L. Chen , Y. Liu , S. Muralidharan , M. Roshtkhari , G. Mori , Deep
structured models for group activity recognition, in: Proceedings of the 2015
British Machine Vision Conference (BMVC), 2015 .
[20] L. Ding, A. Yilmaz, Inferring social relations from visual concepts, in: Proceed-
ings of the 2011 International Conference on Computer Vision, 2011, pp. 699–
706, doi: 10.1109/ICCV.2011.6126306 .
[21] J. Donahue , Y. Jia , O. Vinyals , J. Hoffman , N. Zhang , E. Tzeng , T. Darrell , Decaf:
a deep convolutional activation feature for generic visual recognition, in: Pro-
ceedings of the 2015 International Conference on Machine Learning (ICML),
32, 2014, pp. 1–9 .
[22] I. Duta , B. Ionescu , K. Aizawa , N. Sebe , Spatio-temporal vector of locally max
pooled features for action recognition in videos, in: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2017 .
[23] M. Everingham, L. Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual
object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338,
doi: 10.1007/s11263-009-0275-4.
[24] C. Feichtenhofer , A. Pinz , R. Wildes , Spatiotemporal residual networks for
video action recognition, in: Proceedings of the Conference on Neural Infor-
mation Processing Systems (NIPS), 2016 .
[25] C. Feichtenhofer , A. Pinz , R. Wildes , Spatiotemporal multiplier networks for
video action recognition, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2017 .
[26] A. Gams , T. Petric , M. Do , B. Nemec , J. Morimoto , T. Asfour , A. Ude , Adapta-
tion and coaching of periodic motion primitives through physical and visual
interaction, Robot. Auton. Syst. 75 (2016) 340–351 .
[27] R. Girshick , J. Donahue , T. Darrell , J. Malik , Rich feature hierarchies for ac-
curate object detection and semantic segmentation, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014,
pp. 580–587 .
[28] V. Gonzalez-Pacheco, M. Malfaz, F. Fernandez, M.A. Salichs, Teaching human
poses interactively to a social robot, Sensors 13 (9) (2013) 12406–12430,
doi: 10.3390/s130912406 .
[29] M. Goodrich , A. Schultz , Human-robot interaction: a survey, Found. Trends
Hum.-Comput. Interact. 1 (2007) 203–275 .
[30] I. Gori , J. Aggarwal , L. Matthies , M. Ryoo , Multitype activity recognition in
robot-centric scenarios, IEEE Robot. Autom. Lett. 1 (1) (2016) 593–600 .
[31] A.M. Gupta, B.S. Garg, C.S. Kumar, D.L. Behera, An on-line visual human track-
ing algorithm using surf-based dynamic object model, in: Proceedings of the
2013 IEEE International Conference on Image Processing, 2013, pp. 3875–
3879, doi: 10.1109/ICIP.2013.6738798 .
[32] B. Hommel , J. Müsseler , G. Aschersleben , W. Prinz , The theory of event coding
(TEC): a framework for perception and action planning, Behav. Brain Sci. 24
(5) (2001) 849–937 .
[33] J. Hornstein, M. Lopes, J. Santos-Victor, F. Lacerda, Sound localization for hu-
manoid robots – building audio-motor maps based on the HRTF, in: Proceed-
ings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and
Systems, 2006, pp. 1170–1176, doi: 10.1109/IROS.2006.281849 .
[34] N. Hu , G. Englebienne , Z. Lou , B. Krose , Learning latent structure for activity
recognition, in: Proceedings of the IEEE Conference Robotics and Automaton
(ICRA), 2014, pp. 1048–1053 .
[35] F. Husain , B. Dellen , C. Torras , Action recognition based on efficient deep fea-
ture learning in the spatio-temporal domain, IEEE Robot. Autom. Lett. 1 (2)
(2016) 984–991 .
[36] M.S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, G. Mori, A hierarchical
deep temporal model for group activity recognition, in: Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016, pp. 1971–1980, doi: 10.1109/CVPR.2016.217 .
[37] A. Ijspeert , J. Nakanishi , P. Pastor , H. Hoffmann , S. Schaal , Dynamical move-
ment primitives: learning attractor models for motor behaviors, Neural Com-
put. 25 (2) (2013) 328–373 .
[38] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and ar-
tificial neural networks for natural scene text recognition, arXiv: 1406.2227
(2014).
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-
rama, T. Darrell, Caffe: convolutional architecture for fast feature embedding,
arXiv: 1408.5093 (2014).
[40] A. Kendon , Conducting Interaction: Patterns of Behavior in Focused Encoun-
ters, Studies in Interactional Socio, Cambridge University Press, 1990 .
[41] M. Kirtay, E. Falotico, A. Ambrosano, U. Albanese, L. Vannucci, C. Laschi, Vi-
sual Target Sequence Prediction via Hierarchical Temporal Memory Imple-
mented on the iCub Robot, Springer International Publishing, Cham, pp. 119–
130. doi: 10.1007/978-3-319-42417-0_12.
[42] M.H. Kolekar, D.P. Dash, Hidden Markov model based human activity recog-
nition using shape and optical flow based features, in: Proceedings of the
2016 IEEE Region 10 Conference (TENCON), 2016, pp. 393–397, doi: 10.1109/
TENCON.2016.7848028 .
[43] H. Koppula , R. Gupta , A. Saxena , Learning human activities and object affor-
dances from RGB-D videos, Int. J. Robot. Res. 32 (8) (2013) 951–970 .
[44] H. Kuhne , H. Jhuang , E. Garrote , T. Poggio , T. Serre , HMDB: A large video
database for human motion recognition, in: Proceedings of the IEEE Inter-
national Conference on Computer Vision (ICCV), 2011 .
[45] S. Lallée , S. Lemaignan , A. Lenz , C. Melhuish , L. Natale , S. Skachek , T. van der
Zant , F. Warneken , P.F. Dominey , Towards a platform-independent coopera-
tive human–robot interaction system: I. Perception, in: Proceedings of the In-
ternational Conference on Intelligent Robots and Systems (IROS), IEEE, 2010,
pp. 4444–4451.
[46] T. Lan , L. Sigal , G. Mori , Social roles in hierarchical models for human activity
recognition, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), 2012, pp. 1354–1361 .
[47] Q. Le , W. Zou , S. Yeung , A. Ng , Learning hierarchical invariant spatio-temporal
features for action recognition with independent subspace analysis, in: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2011, pp. 3361–3368 .
[48] Z. Lin , Z. Jiang , L. Davis , Recognizing actions by shape-motion prototype trees,
in: Proceedings of the IEEE International Conference on Computer Vision,
2009, pp. 444–451.
[49] C.Y. Liu, T.H. Hung, K.C. Cheng, T.H.S. Li, HMM and BPNN based speech
recognition system for home service robot, in: Proceedings of the 2013 In-
ternational Conference on Advanced Robotics and Intelligent Systems, 2013,
pp. 38–43, doi: 10.1109/ARIS.2013.6573531 .
[50] P. Marshall, Y. Rogers, N. Pantidi, Using F-formations to analyse spatial pat-
terns of interaction in physical environments, in: Proceedings of the ACM
2011 Conference on Computer Supported Cooperative Work, CSCW ’11, ACM,
New York, NY, USA, 2011, pp. 445–454, doi: 10.1145/1958824.1958893 .
[51] B. Mishra, S.L. Fernandes, K. Abhishek, A. Alva, C. Shetty, C.V. Ajila, D. Shetty,
H. Rao, P. Shetty, Facial expression recognition using feature based techniques
and model based techniques: A survey, in: Proceedings of the Second Interna-
tional Conference on Electronics and Communication Systems (ICECS), 2015,
pp. 589–594, doi: 10.1109/ECS.2015.7124976 .
[52] D. Nguyen, S. Cho, K. Shin, J. Bang, K. Park, Comparative study of human age
estimation with or without preclassification of gender and facial expression,
Sci. World J. 2014 (2014) 905269, doi: 10.1155/2014/905269 . 15 pages
[53] T.-H.-C. Nguyen, J.-C. Nebel, F. Florez-Revuelta, Recognition of activities of
A. Tapus et al. / Pattern Recognition Letters 118 (2019) 3–13 13
Perceiving the person and their interactions with the others for social robotics – A review
1 Introduction
2 Understanding a scene populated by humans
3 Perceiving and modeling people and their interactions
3.1 Modeling the human
3.1.1 Feature extraction
3.1.2 Feature vectors classification
3.1.3 Convolutional Neural Networks (CNNs)
3.2 Modeling a group of people
3.2.1 Convolutional Neural Networks (CNNs)
4 Internalizing the information
5 Discussion
5.1 Networked robotics: the strength of being part of an ecology
5.2 Approaches for first-person activity recognition
6 Conclusions
Acknowledgments
References