Governance of Cloud-hosted Applications Through Policy Enforcement – Methodology
I want 3 pages in APA format on the different governance methods used in cloud-hosted applications, which fall under the Methodology section of my major research paper. I just need you to write this section. I do not want you to write the intro or abstract, just the methodology part, which starts on page 21, but you need to refer to other papers as well and reference them in this paper.
Note: No plagiarism please; everything should be in your own words.
Time: I need this in 3 hours (by 1 pm CST).
I have already attached
University of California
Santa Barbara
Governance of Cloud-hosted Web
Applications
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy
in
Computer Science
by
Hiranya K. Jayathilaka
(Iyagalle Gedara)
Committee in charge:
Professor Chandra Krintz, Chair
Professor Rich Wolski
Professor Tevfik Bultan
December 2016
The Dissertation of Hiranya K. Jayathilaka is approved.
Professor Rich Wolski
Professor Tevfik Bultan
Professor Chandra Krintz, Committee Chair
December 2016
Governance of
Cloud-hosted Web Applications
Copyright © 2016
by
Hiranya K. Jayathilaka
Acknowledgements
My sincere gratitude goes out to my advisors Chandra Krintz, Rich Wolski and Tevfik
Bultan, whose guidance has been immensely helpful to me during my time at graduate
school. They are some of the most well informed and innovative people I know, and I
consider it both an honor and a privilege to have had the opportunity to work under
their tutelage.
I also thank my wonderful parents, whose love and continued support have been the
foundation of my life. Thank you for the values you have instilled in me, and giving me
courage to face all sorts of intellectual and emotional challenges.
I am also grateful to all my teachers and mentors from the past, who generously
shared their knowledge and experiences with me. My special thanks go out to Sanjiva
Weerawarana whose intelligence and leadership skills inspire me to be better at my work
every day.
Finally, I thank the amazing faculty, staff and the community at UC Santa Barbara,
whose smiles and positive attitude have made my life so much easier and exciting. I
would especially like to mention my colleagues at the RACELab, both present and past, for all
the intellectual stimulation as well as their warm sense of friendship.
Thank you, and Ayubowan!
Curriculum Vitæ
Hiranya K. Jayathilaka
Education
2016 Ph.D. in Computer Science (Expected),
University of California, Santa Barbara, United States.
2009 B.Sc. Engineering (Hons) Degree,
University of Moratuwa, Sri Lanka.
Publications
Service-Level Agreement Durability for Web Service Response Time
H. Jayathilaka, C. Krintz, R. Wolski
International Conference on Cloud Computing Technology and Science (CloudCom), 2015.
Response time service level agreements for cloud-hosted web applications
H. Jayathilaka, C. Krintz, and R. Wolski
ACM Symposium on Cloud Computing (SoCC), 2015.
EAGER: Deployment-Time API Governance for Modern PaaS Clouds
H. Jayathilaka, C. Krintz, and R. Wolski
IC2E Workshop on the Future of PaaS, 2015.
Using Syntactic and Semantic Similarity of Web APIs to Estimate Porting Effort
H. Jayathilaka, A. Pucher, C. Krintz, and R. Wolski
International Journal of Services Computing (IJSC), 2014.
Towards Automatically Estimating Porting Effort between Web Service APIs
H. Jayathilaka, C. Krintz, and R. Wolski
International Conference on Services Computing (SCC), 2014.
Cloud Platform Support for API Governance
C. Krintz, H. Jayathilaka, S. Dimopoulos, A. Pucher, R. Wolski, and T. Bultan
IC2E Workshop on the Future of PaaS, 2014.
Service-driven Computing with APIs: Concepts, Frameworks and Emerging Trends
H. Jayathilaka, C. Krintz, and R. Wolski
IGI Global Handbook of Research on Architectural Trends in Service-driven Computing,
2014.
Improved Server Architecture for Highly Efficient Message Mediation
H. Jayathilaka, P. Fernando, P. Fremantle, K. Indrasiri, D. Abeyruwan, S. Kamburugamuwa,
S. Jayasumana, S. Weerawarana and S. Perera
International Conference on Information Integration and Web-based Applications and
Services (IIWAS), 2013.
Extending Modern PaaS Clouds with BSP to Execute Legacy MPI Applications
H. Jayathilaka and M. Agun
ACM Symposium on Cloud Computing (SoCC), 2013.
Governance of Cloud-hosted Web Applications
by
Hiranya K. Jayathilaka
Cloud computing has revolutionized the way developers implement and deploy applications.
By running applications on large-scale compute infrastructures and programming platforms
that are remotely accessible as utility services, cloud computing provides scalability,
high availability, and increased user productivity.
Despite the advantages inherent to the cloud computing model, it has also given rise
to several software management and maintenance issues. Specifically, cloud platforms
do not enforce developer best practices or other administrative requirements when
applications are deployed. Cloud platforms also do not facilitate establishing service level
objectives (SLOs) on application performance, which are necessary to ensure reliable and
consistent operation of applications. Moreover, cloud platforms do not provide adequate
support for monitoring the performance of deployed applications, or for conducting root
cause analysis when an application exhibits a performance anomaly.
We employ governance as a methodology to address these issues, which are prevalent
in cloud platforms. We devise novel governance solutions that achieve administrative
conformance, developer best practices, and performance SLOs in the cloud via policy
enforcement, SLO prediction, performance anomaly detection and root cause analysis. The
proposed solutions are fully automated and built into the cloud platforms as cloud-native
features, sparing application developers from having to implement similar features
themselves. We evaluate our methodology using real-world cloud platforms,
and show that our solutions are highly effective and efficient.
Contents
Curriculum Vitæ
Abstract
1 Introduction
2 Background
2.1 Cloud Computing
2.2 Platform-as-a-Service Clouds
2.2.1 PaaS Architecture
2.2.2 PaaS Usage Model
2.3 Governance
2.3.1 IT and SOA Governance
2.3.2 Governance for Cloud-hosted Applications
2.3.3 API Governance
3 Governance of Cloud-hosted Applications Through Policy Enforcement
3.1 Enforcing API Governance in Cloud Settings
3.2 EAGER
3.2.1 Metadata Manager
3.2.2 API Deployment Coordinator
3.2.3 EAGER Policy Language and Examples
3.2.4 API Discovery Portal
3.2.5 API Gateway
3.3 Prototype Implementation
3.3.1 Auto-generation of API Specifications
3.3.2 Implementing the Prototype
3.4 Experimental Results
3.4.1 Baseline EAGER Overhead by Application
3.4.2 Impact of Number of APIs and Dependencies
3.4.3 Impact of Number of Policies
3.4.4 Scalability
3.4.5 Experimental Results with a Real-World Dataset
3.5 Related Work
3.6 Conclusions and Future Work
4 Response Time Service Level Objectives for Cloud-hosted Web Applications
4.1 Domain Characteristics and Assumptions
4.2 Cerebro
4.2.1 Static Analysis
4.2.2 PaaS Monitoring Agent
4.2.3 Making SLO Predictions
4.2.4 Example Cerebro Workflow
4.2.5 SLO Durability
4.2.6 SLO Reassessment
4.3 Experimental Results
4.3.1 Correctness of Predictions
4.3.2 Tightness of Predictions
4.3.3 SLO Validity Duration
4.3.4 Long-term SLO Durability and Change Frequency
4.3.5 Effectiveness of QBETS
4.3.6 Learning Duration
4.4 Related Work
4.5 Conclusions and Future Work
5 Performance Anomaly Detection and Root Cause Analysis for Cloud-hosted Web Applications
5.1 Performance Debugging Cloud Applications
5.2 Roots
5.2.1 Data Collection and Correlation
5.2.2 Data Storage
5.2.3 Data Analysis
5.2.4 Roots Process Management
5.3 Prototype Implementation
5.3.1 SLO-violating Anomalies
5.3.2 Path Distribution Analysis
5.3.3 Workload Change Analyzer
5.3.4 Bottleneck Identification
5.4 Results
5.4.1 Anomaly Detection: Accuracy and Speed
5.4.2 Path Distribution Analyzer: Accuracy and Speed
5.4.3 Workload Change Analyzer Accuracy
5.4.4 Bottleneck Identification Accuracy
5.4.5 Multiple Applications in a Clustered Setting
5.4.6 Results Summary
5.4.7 Roots Performance and Scalability
5.5 Related Work
5.6 Conclusions and Future Work
6 Conclusion
Chapter 1
Introduction
Cloud computing turns compute infrastructures, programming platforms and software
systems into online utility services that can be easily shared among many users [1, 2]. It
enables processing and storing data on large, managed infrastructures and programming
platforms that can be accessed remotely via the Internet. This provides an alternative to
running applications on local servers, personal computers, and mobile devices, all of which
have strict resource constraints. Today, cloud computing technologies can be obtained
from a large and growing number of providers. Some of these providers offer hosted cloud
platforms that can be used via the web to deploy applications without installing any
physical hardware (e.g. Amazon AWS [3], Google App Engine [4], Microsoft Azure [5]).
Others provide cloud technologies as downloadable software, which users can install on
their computers or data centers to set up their own private clouds (e.g. Eucalyptus [6],
AppScale [7], OpenShift [8]).
The cloud computing model provides high scalability, high availability and enhanced levels
of user productivity. Cloud platforms run on large resource pools, typically in one or more
data centers managed by the platform provider. Cloud platforms therefore have access to
a vast amount of hardware and software resources. This enables cloud-hosted applications
to scale to varying load conditions and maintain high availability. Moreover, by offering
resources as utility services, cloud computing facilitates a cost-effective, on-demand
resource provisioning model that greatly enhances user productivity.
Over the last decade, cloud computing technologies have enjoyed explosive growth
and near-universal adoption due to their many benefits and promises [9, 10]. Industry
analysts project that the cloud computing market value will exceed $150 billion by the
year 2020 [11]. A large number of organizations run their entire business as a cloud-based
operation (e.g. Netflix, Snapchat). For startups and academic researchers who do
not have a large IT budget or staff, the cost-effective, on-demand resource provisioning
model of the cloud has proved indispensable. The growing number of academic
conferences and journals dedicated to cloud computing is further evidence that
the cloud has become an essential branch of computer science.
Despite its many benefits, cloud computing has also given rise to several application
development and maintenance challenges that have gone unaddressed for many years.
As the number of applications deployed in cloud platforms continues to increase, these
shortcomings are rapidly becoming conspicuous. We highlight three such issues.
Firstly, cloud platforms lack the ability to enforce developer best practices and
administrative conformance on deployed user applications. Developer best practices
are the result of decades of software engineering research, and include code reuse, proper
versioning of software artifacts, dependency management between application components,
and backward-compatible software updates. Administrative conformance refers
to complying with the development and maintenance standards that an organization
may wish to impose on all of its production software. Cloud platforms do not provide
any facilities that enforce such developer practices or administrative standards. Instead,
cloud platforms make it trivial and quick to deploy new applications or update
existing applications (i.e. roll out new versions). The resulting speed-up of development
cycles, combined with the lack of oversight and verification, makes it extremely
difficult for IT personnel to manage large volumes of cloud-hosted applications.
Secondly, today’s cloud platforms do not provide support for establishing service level
objectives (SLOs) regarding the performance of deployed applications. A performance
SLO specifies a bound on an application’s response time (latency). Such bounds are vital
for developers who implement downstream systems that consume the cloud-hosted
applications, and for cloud administrators who wish to maintain a consistent quality of
service. However, when an application is implemented for a cloud platform, one must
subject it to extensive performance testing in order to comprehend its performance bounds,
a process that is both tedious and time consuming. The difficulty in understanding the
performance bounds of cloud-hosted applications is primarily due to the very high level
of abstraction provided by cloud platforms. These abstractions hide many details
concerning the application runtime, and without visibility into such low-level
execution details it is impossible to build a robust performance model for a cloud-hosted
application. For this reason, it is not possible to stipulate SLOs on the performance
of cloud-hosted applications. Consequently, existing cloud platforms only offer SLOs
regarding service availability.
Thirdly, cloud platforms do not provide adequate support for monitoring application
performance, or for running diagnostics when an application fails to meet its performance
SLOs. Most cloud platforms only provide the simplest monitoring and logging features,
and do not provide any mechanisms for detecting performance anomalies or identifying
bottlenecks in the application code or the underlying cloud platform. This limitation has
given rise to a new class of third-party service providers that specialize in monitoring
cloud applications (e.g. New Relic [12], Dynatrace [13], Datadog [14]). But these
third-party solutions are expensive. They also require code instrumentation, which, if not
done correctly, leads to incorrect diagnoses. The perturbation introduced by the
instrumentation also changes and degrades application performance. Furthermore, the extrinsic
monitoring systems have a restricted view of the cloud platform, due to the high level
of abstraction provided by cloud platform software. They therefore cannot observe the
complexity of the cloud platform in full, and hence cannot pinpoint the component that
might be responsible for a perceived application performance anomaly.
In order to make the cloud computing model more dependable, maintainable and
convenient for the users as well as the cloud service providers, the above limitations need
to be addressed satisfactorily. Doing so will greatly simplify the tasks of developing cloud
applications, and maintaining them in the long run. Developers will be able to specify
SLOs on the performance of their cloud-hosted applications, and offer competitive service
level agreements (SLAs) to the end users that consume those applications. Developers
as well as cloud administrators will be able to detect performance anomalies promptly,
and take corrective actions before the issues escalate to major outages or other crises.
Our research focuses on addressing the above issues in cloud environments using
governance. We define governance as the mechanism by which the acceptable operational
parameters are specified and maintained in a software system [15, 16]. This involves
multiple steps:
• Specifying the acceptable operational parameters
• Enforcing the specified parameters
• Monitoring the system to detect deviations from the acceptable behavior
To assess the feasibility and efficacy of applying governance techniques in a cloud
platform, we propose and explore the following thesis question: Can we efficiently enforce
governance for cloud-hosted web applications to achieve administrative conformance,
developer best practices, and performance SLOs through automated analysis and diagnostics?
For governance to be useful within the context of cloud computing, it must be both
efficient and automated. Cloud platforms comprise many components that have
different life cycles and maintenance requirements. They also serve a very large number
of users who deploy applications in the cloud. Governance systems designed
for the cloud should therefore scale to handle a large number of applications and related
software components without introducing significant runtime overhead. They must also
be fully automated, since it is not practical for a human administrator to be involved in
the governance process given the scale of cloud platforms.
Automated governance for software systems is a well-researched area, especially in
connection with classic web services and service-oriented architecture (SOA) applications
[16, 17, 18, 19, 20]. We adapt the methodologies outlined in the existing SOA
governance research corpus so they can be applied to cloud computing systems. These
methodologies enable specifying acceptable behavior via machine-readable policies, which
are then automatically enforced by a policy enforcement agent. Monitoring agents watch
the system to detect any deviations from the acceptable behavior (i.e. policy violations),
and alert users or follow predefined corrective procedures. We can envision similar
facilities being implemented in a cloud platform to achieve administrative conformance,
developer best practices and performance SLOs. The operational parameters in this case
may include coding and deployment conventions for the cloud-hosted applications, and
their expected performance levels.
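The specify, enforce and monitor steps described above can be illustrated with a minimal Python sketch. It is not the policy language or enforcement agent developed in this work; the `Policy` class, the state keys (`response_time_ms`, `instances`) and the threshold values are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]  # hypothetical snapshot of observable system parameters


@dataclass(frozen=True)
class Policy:
    """A machine-readable rule: a name plus a predicate over a state snapshot."""
    name: str
    is_satisfied: Callable[[State], bool]


def find_violations(policies: List[Policy], state: State) -> List[str]:
    """Return the names of all policies that the given snapshot violates."""
    return [p.name for p in policies if not p.is_satisfied(state)]


def monitor(policies: List[Policy], snapshots: List[State]) -> List[Tuple[int, str]]:
    """Check each snapshot in order and collect (snapshot index, policy name) alerts."""
    alerts = []
    for index, state in enumerate(snapshots):
        for name in find_violations(policies, state):
            alerts.append((index, name))
    return alerts


# Step 1: specify acceptable operational parameters (thresholds are made up).
POLICIES = [
    Policy("max-response-time-ms", lambda s: s["response_time_ms"] <= 200),
    Policy("min-instances", lambda s: s["instances"] >= 2),
]

# Steps 2 and 3: evaluate the policies against a stream of observations.
SNAPSHOTS = [
    {"response_time_ms": 120, "instances": 3},
    {"response_time_ms": 250, "instances": 3},
    {"response_time_ms": 180, "instances": 1},
]

if __name__ == "__main__":
    for index, name in monitor(POLICIES, SNAPSHOTS):
        print(f"snapshot {index}: policy '{name}' violated")
```

A real monitoring agent would presumably alert users or trigger a corrective procedure instead of printing; the sketch only shows how policies, once machine-readable, can be checked without human involvement.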
In order to answer the above thesis question by developing efficient, automated gov-
ernance systems, we take the following three-step approach.
• Design and implement a scalable, low-overhead governance framework for cloud
platforms, complete with a policy specification language and a policy enforcer. The
governance framework should be built into the cloud platforms, and must keep the
runtime overhead of the user applications to a minimum while enforcing developer
best practices and administrative conformance.
• Design and implement a methodology for formulating performance SLOs (bounds)
for cloud-hosted web applications, without subjecting them to extensive perfor-
mance testing or instrumentation. The formulated SLOs must be correct, tight
and durable in the face of changing conditions of the cloud.
• Design and implement a scalable cloud application performance monitoring (APM)
framework for detecting violations of performance SLOs. For each violation de-
tected, the framework should be able to run diagnostics, and identify the potential
root cause. It should support collecting data from the cloud platform without
instrumenting user code, and without introducing significant runtime overheads.
To achieve administrative conformance and developer best practices with minimal
overhead, we perform governance policy enforcement when an application is deployed, a
technique that we term deployment-time policy enforcement. We explore the trade-off
between what policies can be enforced and when they can be enforced with respect to
the life cycle of a cloud-hosted application. We show that not all policies are enforceable
at deployment time, and therefore some support for run-time policy enforcement is also
required in the cloud. However, we find that deployment-time policy enforcement is
efficient, and a governance framework that performs most, if not all, enforcement tasks
at deployment time can scale to thousands of applications and policies.
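The distinction between deployment-time and run-time enforcement can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the framework presented in later chapters: the `phase` labels, the metadata keys (`version`, `dependencies`) and the three example policies are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

AppMetadata = Dict[str, object]  # hypothetical description of an uploaded application


@dataclass(frozen=True)
class Policy:
    name: str
    phase: str  # "deploy" or "runtime" (labels are assumptions for this sketch)
    check: Callable[[AppMetadata], bool]


def deploy(app: AppMetadata, policies: List[Policy]) -> Tuple[bool, List[str], List[str]]:
    """Evaluate deploy-phase policies once, before the application goes live.

    Returns (accepted, violated policy names, policy names deferred to run time).
    """
    violated = [p.name for p in policies if p.phase == "deploy" and not p.check(app)]
    deferred = [p.name for p in policies if p.phase == "runtime"]
    return (not violated, violated, deferred)


POLICIES = [
    # Decidable from deployment metadata alone.
    Policy("declares-version", "deploy", lambda app: bool(app.get("version"))),
    Policy("approved-dependencies-only", "deploy",
           lambda app: set(app.get("dependencies", [])) <= {"datastore", "memcache"}),
    # Depends on live traffic, so it cannot be decided when the app is deployed.
    Policy("per-client-rate-limit", "runtime", lambda app: True),
]

if __name__ == "__main__":
    good = {"name": "shop", "version": "1.2.0", "dependencies": ["datastore"]}
    bad = {"name": "shop", "dependencies": ["datastore", "legacy-ftp"]}
    print(deploy(good, POLICIES))
    print(deploy(bad, POLICIES))
```

Because deploy-phase checks run once per deployment instead of once per request, their cost does not grow with application traffic, which is the intuition behind the scalability claim above.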
We combine static analysis with platform monitoring to establish performance SLOs
for cloud-hosted applications. Static analysis extracts the sequence of critical operations
(cloud services) invoked by a given application. Platform monitoring facilitates
constructing a historical performance model for the individual operations. We then employ a
time series analysis method to combine these results, and calculate statistical bounds for
application response time. The performance bounds calculated in this manner are associated
with a specific correctness probability, and hence can be used as SLOs. We also
devise a statistical framework to evaluate the validity period of calculated performance
bounds.
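To make the idea concrete, the following Python sketch combines a statically extracted operation sequence with per-operation latency histories and derives an upper bound on response time. It substitutes a plain empirical nearest-rank quantile for the time series analysis method used in this work, so it should be read as a simplified stand-in; the operation names and latency values are invented.

```python
import math
from typing import Dict, List


def aggregate_series(sequence: List[str], history: Dict[str, List[float]]) -> List[float]:
    """Sum the per-operation latencies at each time step for the given call sequence."""
    length = min(len(history[op]) for op in sequence)
    return [sum(history[op][t] for op in sequence) for t in range(length)]


def empirical_upper_bound(samples: List[float], quantile: float) -> float:
    """Return the smallest sample that is >= at least `quantile` of all samples."""
    if not samples or not 0.0 < quantile <= 1.0:
        raise ValueError("need at least one sample and a quantile in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(quantile * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]


# Hypothetical output of static analysis: operations the application invokes, in order.
SEQUENCE = ["datastore.get", "memcache.set"]

# Hypothetical monitoring data: latency (ms) of each operation at aligned time steps.
HISTORY = {
    "datastore.get": [10.0, 12.0, 11.0, 30.0, 10.0],
    "memcache.set": [2.0, 3.0, 2.0, 2.0, 4.0],
}

if __name__ == "__main__":
    series = aggregate_series(SEQUENCE, HISTORY)
    print("aggregated series:", series)
    print("median-level bound:", empirical_upper_bound(series, 0.5))
```

A bound taken at a high quantile of the aggregated series plays the role of the SLO: a larger quantile gives a more conservative (looser) bound, which reflects the tension between correctness and tightness mentioned above.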
In order to detect and diagnose performance SLO violations, we monitor various
performance events that occur in the cloud platform, correlate them, and employ statistical
analysis to identify anomalous patterns. Any given statistical method is only sensitive
to a certain class of anomalies. Therefore, to be able to diagnose a wide range of
performance anomalies, we devise an algorithm that combines linear regression, change point
detection and quantile analysis. Our approach detects performance SLO violations in
near real time, and identifies the root cause of each event as a workload change or a
performance bottleneck in the cloud platform. In the case of performance bottlenecks, our
approach also correctly identifies the exact cloud platform component in which
the bottleneck manifested.
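One ingredient of such a diagnosis, change point detection followed by attribution to a platform component, can be sketched in Python as below. This is a deliberately simple mean-shift heuristic and not the combined algorithm devised in this work; the component names, latency values and the `min_shift` threshold are assumptions made for the example.

```python
from statistics import mean
from typing import Dict, List, Optional


def find_change_point(series: List[float], min_shift: float) -> Optional[int]:
    """Return the split index with the largest mean shift, or None if below min_shift."""
    best_index, best_shift = None, 0.0
    for i in range(1, len(series)):
        shift = abs(mean(series[i:]) - mean(series[:i]))
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index if best_shift >= min_shift else None


def likely_bottleneck(components: Dict[str, List[float]], split: int) -> str:
    """Name the component whose mean latency increased most after the split."""
    def increase(values: List[float]) -> float:
        return mean(values[split:]) - mean(values[:split])
    return max(components, key=lambda name: increase(components[name]))


# Hypothetical per-request latency (ms) broken down by where the time was spent.
COMPONENTS = {
    "datastore": [20, 21, 19, 20, 60, 61, 60, 59],
    "memcache": [5, 5, 6, 5, 5, 6, 5, 5],
    "user-code": [25, 26, 24, 26, 25, 25, 26, 25],
}
# Total response time per request is the sum across components.
RESPONSE_TIMES = [sum(values) for values in zip(*COMPONENTS.values())]

if __name__ == "__main__":
    split = find_change_point(RESPONSE_TIMES, min_shift=10.0)
    if split is not None:
        print("change detected at request", split)
        print("suspected component:", likely_bottleneck(COMPONENTS, split))
```

A single detector of this kind only reacts to sustained level shifts, which illustrates why the text argues for combining several statistical methods to cover a wider range of anomalies.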
Our contributions push the state of the art in cloud computing significantly towards
achieving administrative conformance, developer best practices and performance SLOs.
Moreover, our work addresses all the major steps associated with software system
governance – specification, enforcement and monitoring. We show that this approach can
significantly improve cloud platforms in terms of their reliability, developer-friendliness
and ease of management. We also demonstrate that the governance capabilities proposed
in our work can be built into existing cloud platforms, without having to implement them
from scratch.
Chapter 2
Background
2.1 Cloud Computing
Cloud computing is a form of distributed computing that turns compute infrastructure,
programming platforms and software systems into scalable utility services [1, 2].
By exposing various compute and programming resources as utility services, cloud
computing promotes resource sharing at scale via the Internet. The cloud model spares
users from having to set up their own hardware and, in some cases, their own software.
Instead, users can simply acquire resources “in the cloud” via the Internet, and
relinquish them when they are no longer needed. The cloud model also does
not require users to spend any start-up capital. Users only pay for the
resources they acquire, usually under a pay-per-use billing model. Due to these
benefits, many developers and organizations use the cloud
as their preferred means of developing and deploying software applications [9, 10, 11].
Depending on the type of resources offered as services, cloud computing platforms
fall into three main categories [2].
Infrastructure-as-a-Service (IaaS) clouds: Offer low-level compute, storage and networking
resources as a service. Compute resources are typically provided in the form
of on-demand virtual machines (VMs) with specific CPU, memory and disk configurations
(e.g. Amazon EC2 [21], Google Compute Engine [22], Eucalyptus [23]).
The provisioned VMs usually come with a base operating system installed. Users
must install all the application software necessary to use them.
Platform-as-a-Service (PaaS) clouds: Offer a programming platform as a service
that can be used to develop and deploy applications at scale (e.g. Google App
Engine [4], AppScale [7], Heroku [24], Amazon Elastic Beanstalk [25]). The programming
platform consists of several scalable services that can be used to obtain
certain application features such as data storage, caching and authentication.
Software-as-a-Service (SaaS) clouds: Offer a collection of software applications and
tools as a service that can be directly consumed by application end users (e.g.
Salesforce [26], Workday [27], Citrix GoToMeeting [28]). This can be thought of as
a new way of delivering software to end users. Instead of prompting users to
download and install software, SaaS enables them to consume software via
the Internet.
Cloud-hosted applications expose one or more web application programming interfaces
(web APIs) through which client programs can remotely interact with the applications.
That is, clients send HTTP/S requests to the API, and receive machine-readable
responses (e.g. HTML, JSON, XML, Protocol Buffers [29]) in return. Web-accessible,
cloud-hosted applications of this type tend to be highly interactive, and clients have strict
expectations on the application response time [30].
A cloud-hosted application may also consume web APIs exposed by other cloud-hosted
applications. Thus, cloud-hosted applications form an intricate graph of
inter-dependencies, where each application can service a set of client applications
while being dependent on a set of other applications. In general, however, each
cloud-hosted application directly depends on the core services offered by the underlying cloud
platform for compute power, storage, network connectivity and scalability.
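The dependency graph described above can be represented with a simple adjacency mapping. The Python sketch below is illustrative only: the application names are invented, and the two helper functions are hypothetical utilities for answering "what does this application rely on?" and "which applications would be affected if this one changed?".

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: application -> applications whose web APIs it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "storefront": ["catalog", "payments"],
    "catalog": ["search"],
    "payments": [],
    "search": [],
}


def transitive_dependencies(app: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect every application reachable from `app` by following API calls."""
    seen: Set[str] = set()
    stack = list(graph.get(app, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def dependents(app: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Collect every application that directly or indirectly depends on `app`."""
    return {other for other in graph if app in transitive_dependencies(other, graph)}


if __name__ == "__main__":
    print(sorted(transitive_dependencies("storefront", DEPENDS_ON)))
    print(sorted(dependents("search", DEPENDS_ON)))
```

The second query is the one a governance system would likely care about most, since an incompatible update to one API can break every application upstream of it.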
In the next section we take a closer look at a specific type of cloud platform –
Platform-as-a-Service clouds. We use PaaS clouds as a case study and a testbed in a
number of our explorations.
2.2 Platform-as-a-Service Clouds
PaaS clouds, which have been growing in popularity [31, 32], typically host web-accessible
(HTTP/S) applications, to which they provide high levels of scalability, availability,
and sandboxed execution. PaaS clouds provide scalability by automatically allocating
resources for applications on the fly (auto scaling), and provide availability through
the execution of multiple instances of the application. Applications deployed on a PaaS
cloud depend on a number of scalable services intrinsic to the cloud platform. We refer
to these services as kernel services.
PaaS clouds, through their kernel services, provide a high level of abstraction to the
application developer that effectively hides all the infrastructure-level details such as
physical resource allocation (CPU, memory, disk etc.), operating system, and network
configuration. Moreover, PaaS clouds do not require developers to set up any utility
services their applications might require, such as a database or a distributed cache.
Everything an application requires is provisioned and managed by the PaaS cloud. This
enables application developers to focus solely on the programming aspects of their
applications, without having to be concerned about deployment issues. On the other hand,
the software abstractions provided by PaaS clouds obscure runtime details of applications,
making it difficult to reason about application performance and diagnose performance
issues.
Figure 2.1: PaaS system organization.
PaaS clouds facilitate deploying and running applications that are directly consumed
by human users and other client applications. As a result all the problems outlined
in the previous chapter, such as poor development practices, lack of performance SLOs,
and lack of performance debugging support directly impact PaaS clouds. Therefore PaaS
clouds are ideal candidates for implementing the type of governance systems proposed in
this work.
2.2.1 PaaS Architecture
Figure 2.1 shows the key layers of a typical PaaS cloud. Arrows indicate the flow
of data and control in response to application requests. At the lowest level of a PaaS
cloud is an infrastructure that consists of the necessary compute, storage and networking
resources. How this infrastructure is set up may vary from a simple cluster of physical
machines to a comprehensive Infrastructure-as-a-Service (IaaS) cloud. In large-scale PaaS
clouds, this layer typically consists of many virtual machines and/or containers with the
ability to acquire more resources on the fly.
On top of the infrastructure layer lies the PaaS kernel – a collection of managed, scalable
services that high-level application developers can compose into their applications.
The provided kernel services may include database services, caching services, queuing
services and more. The implementations of the kernel services are highly scalable, highly
available (have SLOs associated with them), and automatically managed by the plat-
form while being completely opaque to the application developers. Some PaaS clouds
also provide a managed set of programming APIs (a “software development kit” or SDK)
for the application developer to access these kernel services. In that case all interactions
between the applications and the PaaS kernel must take place through the cloud provider
specified SDK (e.g. Google App Engine [4], Microsoft Azure [33]).
One level above the PaaS kernel reside the application servers that are used to deploy
and run applications. Application servers provide the necessary integration (linkage)
between application code and the PaaS kernel services, while sandboxing application code
for secure, multi-tenant execution. They also enable horizontal scaling of applications by
running the same application on multiple application server instances.
The front-end and load balancing layer resides on top of the application servers layer.
This layer is responsible for receiving all application requests, filtering them, and routing
them to an appropriate application server instance for further execution. The front-end server
is therefore the entry point to PaaS-hosted applications for all application clients.
Each of the above layers can span multiple processes, running over multiple physical
or virtual machines. Therefore processing a single application request typically involves
cooperation of multiple distributed processes and/or machines.
Figure 2.2: Applications deployed in a PaaS cloud: (a) An external client making
requests to an application via the web API; (b) A PaaS-hosted application invoking
another in the same cloud.
2.2.2 PaaS Usage Model
Three types of users interact with PaaS clouds.
Cloud administrators: These are the personnel responsible for installing and maintaining
the cloud platform software. They are always affiliated with the cloud platform
provider.
Application developers: These are the users who develop applications, and deploy
them in the PaaS cloud.
Application clients: These are the users that consume the applications deployed in a
PaaS cloud. These include human users as well as other client applications that
programmatically access PaaS-hosted applications.
Depending on how a particular PaaS cloud is set up (e.g. private or public cloud), the
above three user groups may belong to the same or multiple organizations.
Figure 2.2 illustrates how the application developers interact with PaaS clouds. The
cloud platform provides a set of kernel services. The PaaS SDK provides well defined
interfaces (entry points) for these kernel services. The application developer uses the
kernel services via the SDK to implement his/her application logic, and packages it as a
web application. Developers then upload their applications to the cloud for deployment.
Once deployed, the applications and any web APIs exported by them can be accessed
via HTTP/S requests by external or co-located clients.
PaaS-hosted applications are typically developed and tested outside the cloud (on a
developer’s workstation), and then later uploaded to the cloud. Therefore PaaS-hosted
applications typically undergo three phases during their life-cycle:
Development-time: The application is being developed and tested on a developer’s
workstation.
Deployment-time: The finished application is being uploaded to the PaaS cloud for
deployment.
Run-time: The application is running, and processing user requests.
We explore ways to use these different phases to our advantage in order to minimize the
governance overhead on running applications.
We use PaaS clouds in our research extensively both as case studies and experimental
platforms. Specifically, we use Google App Engine and AppScale as test environments
to experiment with our new governance systems. App Engine is a highly scalable public
PaaS cloud hosted and managed by Google in their data centers. While it is open for
anyone to deploy and run web applications, it is not open source software, and its internal
deployment details are not commonly known. AppScale is open source software that can
be used to set up a private cloud platform on one’s own physical or virtual hardware.
AppScale is API compatible with App Engine (i.e. it supports the same cloud SDK),
and hence any web application developed for App Engine can be deployed on AppScale
without any code changes. In our experiments, we typically deploy AppScale over a small
cluster of physical machines, or over a set of virtual machines provided by an IaaS cloud
such as Eucalyptus.
By experimenting with real world PaaS clouds we demonstrate the practical feasibil-
ity and the effectiveness of the systems we design and implement. Furthermore, there are
currently over a million applications deployed in App Engine, with a significant propor-
tion of them being open source applications. Therefore we have access to a large number
of real world PaaS applications to experiment with.
2.3 Governance
2.3.1 IT and SOA Governance
Traditionally, information and technology (IT) governance [15] has been a branch of
corporate governance, focused on improving performance and managing the risks associated
with the use of IT. A number of frameworks, models and even certification systems
have emerged over time to help organizations implement IT governance [34, 35]. The
primary goals of IT governance are threefold.
• Assure that the use of IT generates business value
• Oversee performance of IT usage and management
• Mitigate the risks of using IT
When the software engineering community started gravitating towards web services
and service-oriented computing (SOC) [36, 37, 38], a new type of digital asset rose to
prominence within corporate IT infrastructures – “services”. A service is a self-contained
entity that logically represents a business activity (a functionality; e.g. user authenti-
cation, billing, VM management) while hiding its internal implementation details from
the consumers [37]. Compositions of loosely-coupled, reusable, modular services soon
replaced large monolithic software installations.
Services required new forms of governance for managing their performance and risks,
and hence the notion of service-oriented architecture (SOA) governance came into exis-
tence [16, 17]. Multiple definitions of SOA governance are in circulation, but most of
them agree that the purpose of SOA governance is to exercise control over services and
associated processes (service development, testing, monitoring etc). A commonly used
definition of SOA governance is ensuring and validating that service artifacts within the
architecture are operating as expected, and maintaining a certain level of quality [16].
Consequently, a number of tools that help organizations implement SOA governance have
also evolved [18, 20, 39, 19]. Since web services are the most widely used form of services
in SOA-driven systems, most of these SOA governance tools have a strong focus on
controlling web services [40].
Policies play a crucial role in all forms of governance. A policy is a specification
of the acceptable behavior and the life cycle of some entity. The entity could be a
department, a software system, a service or a human process such as developing a new
application. In SOA governance, policies state how services should be developed, how
they are to be deployed, how to secure them, and what level of quality of service to
maintain while a service is in operation. SOA governance tools enable administrators
to specify acceptable service behavior and life cycle as policies, and a software policy
enforcement agent automatically enacts those policies to control various aspects of the
services [41, 42, 43].
2.3.2 Governance for Cloud-hosted Applications
Cloud computing can be thought of as a heightened version of service-oriented com-
puting. While classic SOC strives to offer data and application functionality as services,
cloud computing offers a variety of computing resources as services, including hardware
infrastructure (compute power, storage space and networking) and programming plat-
forms. Moreover, the applications deployed on cloud platforms typically behave like
services with separate implementation and interface components. Much like classic ser-
vices, each cloud-hosted application can be a dependency for another co-located cloud
application, or a client application running elsewhere (e.g. a mobile app).
Due to this resemblance, we argue that many concepts related to SOA governance
are directly applicable to cloud platforms and cloud-hosted applications. We extend
the definition of SOA governance, and define governance for cloud-hosted applications
as the process of ensuring that the cloud-hosted applications operate as expected while
maintaining a certain quality of service level.
Governance is a broad topic that allows room for many potential avenues of research.
In our work we explore three specific features of governance as they apply to cloud-hosted
applications.
Policy enforcement: Policy enforcement refers to ensuring that all applications deployed
in a cloud platform adhere to a set of policies specified by a cloud administrator.
Some of these policies include specific dependency management practices,
naming and packaging standards for software artifacts, software versioning require-
ments, and practices that enable software artifacts to evolve while maintaining
backward compatibility. Others specify run-time constraints, which need to be
enforced per application request.
Formulating performance SLOs: This refers to automatic formulation of statistical
bounds on the performance of cloud-hosted web applications. A service level objective
(SLO) specifies a system’s minimum quality of service (QoS) level in a
measurable and controllable manner [44]. SLOs may cover various QoS parameters
such as availability, response time (latency), and throughput. A performance SLO
specifies an upper bound on the application’s response time, and the likelihood that
bound is valid. Cloud administrators and application developers use performance
SLOs to negotiate service level agreements (SLAs) with clients, and monitor appli-
cations for consistent operation. Clients use them to reason about the performance
of downstream applications that depend on cloud-hosted applications.
Application performance monitoring: Application performance monitoring (APM)
refers to continuously monitoring cloud-hosted applications to detect violations of
performance SLOs and other performance anomalies. It also includes diagnosing
the root cause of each detected anomaly, thereby expediting remediation. This
feature is useful for cloud administrators, application developers and clients alike.
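To make the notion of a performance SLO concrete (an upper bound on response time together with the likelihood that the bound holds), the following Python sketch derives an empirical bound from a set of measured latencies. The function names, the nearest-rank percentile rule, and the sample data are assumptions chosen for illustration; this is not the SLO formulation technique developed in this work.

```python
import math

def empirical_slo(latencies_ms, probability):
    """Return a response-time bound that held for at least `probability`
    of the observed samples (nearest-rank percentile)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(probability * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def violations(latencies_ms, bound_ms):
    """Return the samples that exceed the SLO bound (what an APM
    component would flag)."""
    return [x for x in latencies_ms if x > bound_ms]

# Hypothetical response times in milliseconds.
samples = [12, 15, 11, 40, 13, 14, 90, 12, 16, 13]
bound = empirical_slo(samples, 0.9)
slow_requests = violations(samples, bound)
```

A monitoring component could recompute such a bound periodically and treat requests slower than the bound as candidates for anomaly diagnosis.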
None of the above features are implemented satisfactorily in the cloud technologies
available today. In order to fill the gaps caused by these limitations, many third-party
governance solutions that operate as external services have come into existence. For ex-
ample, services like 3Scale [45], Apigee [46] and Layer7 [47] provide a wide range of access
control and API management features for web applications served from cloud platforms.
Similarly, services like New Relic [12], Dynatrace [13] and Datadog [14] provide monitor-
ing support for cloud-hosted applications. But these services are expensive, and require
additional programming and/or configuration. Some of them also require changes to
applications in the form of code instrumentation. Moreover, since these services operate
outside the cloud platforms they govern, they have limited visibility and control over the
applications and related components residing in the cloud. A goal of our research is to
facilitate governance from within the cloud, as an automated, cloud-native feature. We
show that such built-in governance capabilities are more robust, effective and easy to use
than external third-party solutions that overlay governance on top of the cloud.
2.3.3 API Governance
A cloud-hosted application is comprised of two parts – implementation and interface.
The implementation contains the functionality of the application. It primarily consists
of code that implements various application features. The interface, which abstracts
and modularizes the implementation details of an application while making it network-
accessible, is often referred to as a web API (or API in short). The API enables remote
users and client applications to interact with the application by sending HTTP/S re-
quests. The responses generated by an API could be based on HTML (for display on
a web browser), or they could be based on a data format such as XML or JSON (for
machine-to-machine interaction). Regardless of the technology used to implement an
API, it is the part of the application that is visible to the remote clients.
Developers today increasingly depend on the functionality of already existing web
applications in the cloud, which are accessible through their interfaces (APIs). Thus, a
modern application often combines local program logic with calls to remote web APIs.
This model significantly reduces both the programming and the maintenance workload
associated with applications. In theory, because the APIs interface to software that is
curated by cloud providers, the client application leverages greater scalability, perfor-
mance, and availability in the implementations it calls upon through these APIs, than
it would if those implementations were local to the client application (e.g. as locally
available software libraries). Moreover, by accessing shared web applications, developers
avoid “re-inventing the wheel” each time they need a commonly available application
feature. The scale at which clouds operate ensures that the APIs can support the large
volume of requests generated by the ever-growing client population.
As a result, web-accessible APIs and the software applications to which they provide
access are rapidly proliferating. At the time of this writing, ProgrammableWeb [48], a
popular web API index, lists more than 15,000 publicly available web APIs, and a nearly
100% annual growth rate [49]. These APIs increasingly employ the REST (Representational
State Transfer) architectural style [50], and many of them target commercial
applications (e.g. advertising, shopping, travel, etc.). However, several non-commercial
entities have also recently published web APIs, e.g. IEEE [51], UC Berkeley [52], and
the US White House [53].
This proliferation of web APIs in the cloud demands new techniques that automate
the maintenance and evolution of APIs as a first-class software resource – a notion that we
refer to as API governance [54]. API management in the form of run-time mechanisms to
implement access control is not new, and many good commercial offerings exist today [45,
46, 47]. However, API governance – consistent, generalized, policy implementation across
multiple APIs in an administrative domain – is a new area of research made pressing by
the emergence of cloud computing.
We design governance systems targeting the APIs exposed by the cloud-hosted web
applications. We facilitate configuring and enforcing policies at the granularity of APIs.
Similarly, we design systems that stipulate performance SLOs for individual APIs, and
monitor them as separate independent entities.
Chapter 3
Governance of Cloud-hosted Applications Through Policy Enforcement
In this chapter we discuss implementing scalable, automated API governance through
policy enforcement for cloud-hosted web applications. A lack of API governance can lead
to many problems including security breaches, poor code reuse, violation of service-level
objectives (SLOs), naming and branding issues, and abuse of digital assets by the API
consumers. Unfortunately, most existing cloud platforms within which web APIs are
hosted provide only minimal governance support; e.g. authentication and authorization.
These features are important to policy implementation since governance often requires
enforcement of access control on APIs. However, developers are still responsible for im-
plementing governance policies that combine features such as API versioning, dependency
management, and SLO enforcement as part of their respective applications.
Moreover, today’s cloud platforms require that each application implements its own
governance. There is no common, built-in system that enables cloud administrators to
specify policies, which are automatically enforced on applications and their APIs. As a
result, the application developers must be concerned with development issues (correct and
efficient programming of application logic), as well as governance issues (administrative
control and management) when implementing applications for the cloud.
Existing API management solutions [45, 46, 47] typically operate as external stand-
alone services that are not integrated with the cloud. They do attempt to address gov-
ernance concerns beyond mere access control. However, because they are not integrated
within the cloud platform, their function is advisory and documentarian. That is, they
do not possess the ability to implement full enforcement, and instead, alert operators to
potential issues without preventing non-compliant behavior. They are also costly, and
they can fail independently of the cloud, thereby affecting the scalability and availability
of the software that they govern. Finally, it is not possible for them to implement policy
enforcement at deployment-time – the phase of the software lifecycle during which an
API change or a new API is being put into service. Because of the scale at which clouds
operate, deployment-time enforcement is critical since it permits policy violations to be
remediated before the changes are put into production (i.e. before run-time).
Thus, our thesis is that governance must be implemented as a built-in, native cloud
service to overcome these shortcomings. That is, instead of an API management approach
that layers governance features on top of the cloud, we propose to provide API governance
as a fundamental service of the cloud platform. Cloud-native governance capabilities
• enable both deployment-time and run-time enforcement of governance policies as
part of the cloud platform’s core functionality,
• avoid inconsistencies and failure modes caused by integration and configuration of
governance services that are not end-to-end integrated within the cloud fabric itself,
• leverage already-present cloud functionality such as fault tolerance, high availability
and elasticity to facilitate governance, and
• unify a vast diversity of API governance features across all stages of the API lifecycle
(development, deployment, deprecation, retirement).
As a cloud-native functionality, such an approach also simplifies and automates the enforcement
of API governance in the cloud. This in turn enables separation of governance
concerns from development concerns for both cloud administrators and cloud application
developers. The cloud administrators simply specify the policies, and trust
the cloud platform to enforce them automatically on the applications. The application
developers do not have to program any governance features into their applications, and
instead rely on the cloud platform to perform the necessary governance checks either
when the application is uploaded to the cloud, or when the application is being executed.
To explore the efficacy of cloud-integrated API governance, we have developed an
experimental cloud platform that supports governance policy specification, and enforce-
ment for the applications it hosts. EAGER – Enforced API Governance Engine for
REST – is a model and an architecture that is designed to be integrated within ex-
isting cloud platforms in order to facilitate API governance as a cloud-native feature.
EAGER enforces proper versioning of APIs and supports dependency management and
comprehensive policy enforcement at API deployment-time.
Using EAGER, we investigate the trade-offs between deployment-time policy enforce-
ment and run-time policy enforcement. Deployment-time enforcement is attractive for
several reasons. First, if only run-time API governance is implemented, policy violations
will go undetected until the offending APIs are used, possibly in a deep stack or call path
in an application. As a result, it may be difficult or time consuming to pinpoint the spe-
cific API and policy that are being violated (especially in a heavily loaded web service). In
these settings, multiple deployments and rollbacks may occur before a policy violation is
triggered, making it difficult or impossible to determine the root cause of the violation. By
enforcing governance as much as possible at deployment-time, EAGER implements a “fail
fast” approach in which violations are detected immediately, making diagnosis and remediation less
complex. Further, from a maintenance perspective, the overall system is prevented from
entering a non-compliant state, which aids in the certification of regulatory compliance.
In addition, run-time governance typically implies that each API call will be intercepted
by a policy-checking engine that uses admission control and an enforcement mechanism,
creating scalability concerns. Because deployment events occur before the application
is executed, traffic need not be intercepted and checked “in flight”, thus improving the
scaling properties of governed APIs. However, not all governance policies can be imple-
mented strictly at deployment-time. As such, EAGER includes run-time enforcement
facilities as well. The goal of our research is to identify how to implement enforced API
governance most efficiently by combining deployment-time enforcement where possible,
and run-time enforcement where necessary.
EAGER implements policies governing the APIs that are deployed within a single
administrative domain (i.e. a single cloud platform). It treats APIs as first-class software
assets due to the following
reasons.
• APIs are often longer lived than the individual clients that use them or the imple-
mentations of the services that they represent.
• APIs represent the “gateway” between software functionality consumption (API
clients and users) and service production (web service implementation).
EAGER acknowledges the crucial role APIs play by separating the API life cycle
management from that of the service implementations and the client users. It facilitates
policy definition and enforcement at the API level, thereby permitting the service and
client implementations to change independently without the loss of governance control.
EAGER further enhances software maintainability by guaranteeing that developers reuse
existing APIs when possible to create new software artifacts (to prevent API redundancy
and unverified API use). At the same time, it tracks changes made by developers to
already deployed web APIs to prevent any backwards-incompatible API changes from
being put into production.
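The idea of rejecting backwards-incompatible API changes can be sketched in Python as follows. The API description format used here (a dictionary mapping an operation name to the set of parameters it requires) and the two compatibility rules are simplifying assumptions made for illustration; they are not EAGER's actual specification format or comparison algorithm.

```python
def incompatible_changes(old_api, new_api):
    """Compare two API descriptions and list changes that could break
    existing clients. Each description maps an operation name, such as
    "GET /users", to the set of parameters that operation requires."""
    problems = []
    for operation, old_required in old_api.items():
        if operation not in new_api:
            # Clients that call this operation would start failing.
            problems.append(f"removed operation: {operation}")
            continue
        # A newly required parameter breaks clients that do not send it.
        added = new_api[operation] - old_required
        for param in sorted(added):
            problems.append(f"new required parameter '{param}' in {operation}")
    return problems

# Hypothetical versions of the same API.
old = {"GET /users": {"id"}, "POST /users": {"name"}}
new = {"GET /users": {"id", "tenant"}}
report = incompatible_changes(old, new)
```

Under these rules, adding a new operation is treated as compatible, while removing an operation or requiring an additional parameter is reported, so a deployment could be refused whenever the report is non-empty.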
EAGER includes a language for specifying API governance policies. The EAGER language
is distinct from existing policy languages like WS-Policy [55, 56] in that it avoids
the complexities of XML, and it incorporates a developer-friendly Python programming
language syntax for specifying complex policy statements in a simple and intuitive man-
ner. Moreover, we ensure that specifying the required policies is the only additional
activity that API providers should perform in order to use EAGER. All other API gov-
ernance related verification and enforcement work is carried out by the cloud platform
automatically.
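To give a feel for what a policy written in Python syntax might look like, the sketch below expresses three rules as plain assert statements and evaluates them against an application descriptor. The descriptor fields (name, apis, version), the rules themselves, and the check_policy helper are hypothetical; they illustrate the general idea only and do not reproduce the EAGER policy language.

```python
# A policy is ordinary Python source text. Every name and rule in it is
# hypothetical and chosen only for illustration.
POLICY = '''
assert app["name"].islower(), "application names must be lower case"
assert len(app["apis"]) > 0, "an application must export at least one API"
for api in app["apis"]:
    assert api["version"].count(".") == 2, "API versions must look like 1.0.0"
'''

def check_policy(policy_source, app):
    """Run a policy against an application descriptor.
    Returns None on success, or the violation message on failure."""
    namespace = {"app": app}
    try:
        exec(compile(policy_source, "<policy>", "exec"), namespace)
    except AssertionError as violation:
        return str(violation)
    return None

good_app = {"name": "shop", "apis": [{"version": "1.0.0"}]}
bad_app = {"name": "Shop", "apis": [{"version": "1.0.0"}]}
```

Note that exec offers no sandboxing and that assert statements are skipped when Python runs with the -O flag, so a real policy engine would need a restricted evaluation environment rather than this direct approach.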
To evaluate the feasibility and performance of the proposed architecture, we prototype
the EAGER concepts in an implementation that extends AppScale [57], an open
source cloud platform that emulates Google App Engine [4]. We describe the implemen-
tation and integration as an investigation of the generality of the approach. By focusing
on deployment actions and run-time message checking, we believe that the integration
methodology will translate to other extant cloud platforms.
We further show that EAGER API governance and policy enforcement impose a
negligible overhead on the application deployment process, and the overhead is linear in
the number of APIs in the applications being validated. Finally, we show that EAGER
is able to scale to tens of thousands of deployed web APIs and hundreds of governance
policies.
In the sections that follow, we present some background on cloud-hosted APIs, and
overview the design and implementation of EAGER. We then empirically evaluate EAGER
using a wide range of APIs and experiments. Finally, we discuss related work, and
conclude the chapter.
3.1 Enforcing API Governance in Cloud Settings
Software engineering best practices separate the service implementation from the API,
both during development and maintenance. The service implementation and API are
integrated via a “web service stack” that implements functionality common to all web
services (message routing, request authentication, etc.). Because the API is visible to
external parties (i.e. clients of the services), any changes to the API impact users and
client applications not under the immediate administrative control of the API provider.
For this reason, API features usually undergo long periods of “deprecation” so that in-
dependent clients of the services can have ample time to “migrate” to newer versions of
an API. On the other hand, technological innovations often prompt service reimplemen-
tation and/or upgrade to achieve greater cost efficiencies, performance levels, etc. Thus,
APIs typically have a more slowly evolving and longer lasting lifecycle than the service
implementations to which they provide access.
Modern computing clouds, especially clouds implementing some form of Platform-as-
a-Service (PaaS) [58], have accelerated the proliferation of web APIs and their use. Most
PaaS clouds [57, 59, 8] include features designed to ease the development and hosting of
web APIs for scalable use over the Internet. This phenomenon is making API governance
an absolute necessity in cloud environments.
In particular, API governance promotes code reuse among developers since each API
must be treated as a tracked and controlled software entity. It also ensures that software
users benefit from change control since the APIs they depend on change in a controlled
and non-disruptive manner. From a maintenance perspective, API governance makes it
possible to enforce best-practice coding procedures, naming conventions, and deployment
procedures uniformly. API governance is also critical to API lifecycle management – the
management of deployed APIs in response to new feature requests, bug fixes, and orga-
nizational priorities. API “churn” that results from lifecycle management is a common
phenomenon and a growing problem for web-based applications [60]. Without proper
governance systems to manage the constant evolution of APIs, API providers run the
risk of making their APIs unreliable while potentially breaking downstream applications
that depend on the APIs.
Unfortunately, most web technologies used to develop and host web APIs do not
provide API governance facilities. This missing functionality is especially glaring for
cloud platforms that are focused on rapid deployment of APIs at scale. Commercial
pressures frequently prioritize deployment speed and scale over longer-term maintenance
considerations only to generate unanticipated future costs.
As a partial countermeasure, developers of cloud-hosted applications often undertake
additional tasks associated with implementing custom ad hoc governance solutions using
either locally developed mechanisms or loosely integrated third-party API management
services. These add-on governance approaches often fall short in terms of their consis-
tency and enforcement capabilities since by definition they have to operate outside the
cloud (either external to it or as another cloud-hosted application). As such, they do not
have the end-to-end access to all the metadata and cloud-internal control mechanisms
that are necessary to implement strong governance at scale.
In a cloud setting, enforcement of governance policies on APIs is a tradeoff between
what can be enforced, and when it is enforced. Performing policy enforcement at
application run-time provides full control over what can be enforced, since the policy en-
gine can intercept and control all operations and instructions executed by the application.
However, this approach is highly intrusive, which introduces complexity and performance
overhead. Alternatively, attempting to enforce policies prior to the application’s execution
is attractive in terms of performance, but it necessarily limits what can be enforced. For
example, ensuring that an application does not connect to a specific network address
and/or port requires run-time traffic interception, typically by a firewall that is inter-
posed between the application and the offending network. Enforcing such a policy can
only be performed during run-time.
For policy implementation, often the additional complexities and overhead introduced
by run-time enforcement outweigh its benefits. For example, in an application that con-
sists of API calls to services that, in turn, make calls to other services, run-time policy
enforcement can make violations difficult to resolve, especially when the interaction be-
tween services is non-deterministic. When a specific violation occurs, it may be “buried”
in a lattice of API invocations that is complex to traverse, especially if the application
itself is designed to handle large-scale request traffic loads.
Ideally, then, enforcement takes place as non-intrusively as possible before the ap-
plication begins executing. In this way, a violation can be detected and resolved before
the API is used, thereby avoiding possible degradations in user-experience that run-time
checks and violations may introduce. The drawback of attempting to enforce all gov-
ernance before the application begins executing is that policies that express restrictions
only resolvable at run time cannot be implemented. Thus, for scalable applications that
use API calls internally in a cloud setting, an API governance approach should attempt
to implement as much as possible no later than deployment time, but must also include
some form of run-time enforcement.
Note that the most effective approach to implementing a specific policy may not
always be clear. For example, user authentication is usually implemented as a run-
time policy check for web services since users enter and leave the system dynamically.
However it is possible to check statically, at deployment time, whether the application
is consulting with a specific identity management service (accessed by a versioned API)
thereby enabling some deployment-time enforcement.
Thus, any efficient API governance solution for clouds must include the following
functionalities.
• Policy Specification Language – The system must include a way to specify
policies that can be enforced either at deployment-time (or sooner) or, ultimately,
at run-time.
• API Specification Language – Policies must be able to refer to API functional-
ities to be able to express governance edicts for specific APIs or classes of APIs.
• Deployment-time Control – The system must be able to check policies no later
than the time that an application is deployed.
• Run-time Control – For policies that cannot be enforced before runtime, the
system must be able to intervene dynamically.
In addition, a good solution should automate as much of the implementation of API
governance as possible. Automation in a cloud context serves two purposes. First, it
enables scale by allowing potentially complex optimizations to be implemented reliably by
the system, and not by manual intervention. Secondly, automation improves repeatability
and auditability thereby ensuring greater system integrity.
3.2 EAGER
To experiment with API governance in cloud environments, we devise EAGER, an
architecture for implementing governance that is suitable for integration as a
cloud-native feature. EAGER leverages existing SOA governance techniques and best practices,
and adapts them to make them suitable for cloud platform-level integration. In this
section, we give an overview of the high-level design of EAGER, its main components, and
the policy language. Our design is motivated by two objectives. First, we wish to verify that
the integration among policy specification, API specification, deployment-time control,
and run-time control is feasible in a cloud setting. Second, we wish to use the design
as the basis for a prototype implementation that we can use to evaluate the impact of
API governance empirically.
Figure 3.1: EAGER Architecture
EAGER is designed to be integrated with PaaS clouds. PaaS clouds accept code that
is then deployed within the platform so that it may make calls to kernel services offered
by the cloud platform, or other applications already deployed in the cloud platform via
their APIs. EAGER intercepts all events related to application deployment within the
cloud, and enforces governance checks at deployment-time. When a policy verification
check fails, EAGER aborts the deployment of the application, and logs the information
necessary to perform remediation. EAGER assumes that it is integrated with the cloud,
and that the cloud initiates in a policy compliant state (i.e. there are no policy violations
when the cloud is first launched before any applications are deployed). We use the term
“initiates” to differentiate the first clean launch of the cloud, from a platform restart.
EAGER must be able to maintain compliance across restarts, but it assumes that when
the cloud is first installed and suitably tested, it is in a policy compliant state. Moreover,
it maintains the cloud in a policy compliant state at all times. That is, with EAGER
active, the cloud is automatically prevented from transitioning out of policy compliance
due to a change in the applications it hosts.
Figure 3.1 illustrates the main components of EAGER (in blue), and their interac-
tions. Solid arrows represent the interactions that take place during application deployment-
time, before an application has been validated for deployment. Short-dashed arrows in-
dicate the interactions that take place during deployment-time, after an application has
been successfully validated. Long-dashed arrows indicate interactions at run-time. The
diagram also outlines the components of EAGER that are used to provide deployment-
time control and run-time control. Note that some components participate in interactions
related to both deployment and run-time control (e.g. metadata manager).
EAGER is invoked by the cloud whenever a user attempts to deploy an application
in the cloud. The cloud’s application deployment mechanisms must be altered so that
each deployment request is intercepted by EAGER, which then performs the required
governance checks. If a governance check fails, EAGER preempts the application deploy-
ment, logs relevant data pertaining to the event for later analysis, and returns an error.
Otherwise, it proceeds with the application deployment by activating the deployment
mechanisms on the user’s behalf.
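The interception flow described above can be sketched as follows. This is only a minimal illustration under our own assumptions, not the EAGER implementation: the names `intercept_deployment` and `GovernanceError`, the logger name, and the shape of the request are all hypothetical.

```python
import logging

logger = logging.getLogger("eager.sketch")  # hypothetical logger name


class GovernanceError(Exception):
    """Raised by a governance check when a deployment request violates it."""


def intercept_deployment(request, checks, deploy):
    """Run every governance check and deploy only if all of them pass.

    `request` is a dict describing the deployment, `checks` is a list of
    callables that raise GovernanceError on a violation, and `deploy` stands
    in for the platform's own deployment mechanism (all assumptions).
    """
    for check in checks:
        try:
            check(request)
        except GovernanceError as err:
            # Preempt the deployment and keep a record for later analysis.
            logger.error("deployment of %s rejected: %s",
                         request.get("name"), err)
            return {"status": "error", "reason": str(err)}
    # All checks passed: activate the deployment on the user's behalf.
    deploy(request)
    return {"status": "deployed"}
```

The checks are passed in as plain callables so that the dependency check, the compatibility check, and policy validation can each be plugged in without changing the interception logic.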
Architecturally, the deployment action requires three inputs: the policy specification
governing the deployment, the application code to be deployed, and a specification of
the APIs that the application exports. EAGER assumes that cloud administrators have
developed and installed policies (stored in the metadata manager) that are to be checked
against all deployments. API specifications for the application must also be available to
the governance framework. Because the API specifications are to be derived from the
code (and are, thus, under developer control and not administrator control) our design
assumes that automated tools are available to perform analysis on the application, and
generate API specifications in a suitable API specification language. These specifications
must be present when the deployment request is considered by the platform. In the
prototype implementation described in section 3.3, the API specifications are generated
as part of the application development process (e.g. by the build system). They may also
be offered as a trusted service hosted in the cloud. In this case, developers will submit
their source code to this service, which will generate the necessary API specifications in
the cloud, and trigger the application deployment process via EAGER.
The proposed architecture does not require major changes to the existing components
of the cloud, since its deployment mechanisms are likely to be web service based. However,
EAGER does require integration at the platform level. That is, it must be a trusted
component in the cloud platform.
3.2.1 Metadata Manager
The metadata manager stores all the API metadata in EAGER. This metadata in-
cludes policy specifications, API names, versions, specifications and dependencies. It
uses the dependency information to compute the dependency tree among all deployed
APIs and applications. Additionally, the metadata manager also keeps track of develop-
ers, their subscriptions to various APIs, and the access credentials (API keys) issued to
them. For these purposes, the metadata manager must logically include both a database,
and an identity management system.
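To illustrate how dependency information can yield a dependency tree, the following sketch keeps an in-memory index and walks it transitively. It is a toy stand-in, not the metadata manager itself: the class name `DependencyIndex`, its methods, and the use of (name, version) tuples are assumptions made for this example.

```python
from collections import defaultdict


class DependencyIndex:
    """Toy in-memory stand-in for the dependency data in the metadata manager."""

    def __init__(self):
        # Maps an API (name, version) to the set of apps/APIs that consume it.
        self._dependents = defaultdict(set)

    def record(self, consumer, api):
        """Record that `consumer` depends on `api`."""
        self._dependents[api].add(consumer)

    def transitive_dependents(self, api):
        """Return everything that directly or indirectly depends on `api`."""
        seen = set()
        stack = [api]
        while stack:
            current = stack.pop()
            # .get() avoids creating empty entries in the defaultdict.
            for consumer in self._dependents.get(current, ()):
                if consumer not in seen:
                    seen.add(consumer)
                    stack.append(consumer)
        return seen
```

A query such as `transitive_dependents(("Geo", "3.0"))` would then list every application that could be affected by a change to that API.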
The metadata manager is exposed to other components through a well defined web
service interface. This interface allows querying existing API metadata and updating
them. In the proposed model, the stored metadata is updated occasionally – only when a
new application is deployed or when a developer subscribes to a published API. Therefore
the Metadata Manager does not need to support a very high write throughput. This
performance characteristic allows the Metadata Manager to be implemented with strong
transactional semantics, which reduces the development overhead of other components
that rely on the metadata manager. Availability can be improved via simple replication
methods.
3.2.2 API Deployment Coordinator
The API Deployment Coordinator (ADC) intercepts all application deployment re-
quests, and determines whether they are suitable for deployment, based on a set of policies
specified by the cloud administrators. It receives application deployment requests via a
web service interface. At a high level, the ADC is the most important entity in EAGER’s
deployment-time control strategy.
An application deployment request contains the name of the application, version
number, names and versions of the APIs exported by the application, detailed API
specifications, and other API dependencies as declared by the developer. Application
developers only need to specify explicitly the name and version of the application and
the list of dependencies (i.e. APIs consumed by the application). All other metadata can
be computed automatically by performing introspection on the application source code.
The API specifications used to describe the web APIs should state the operations
and the schema of their inputs and outputs. Any standard API description language
can be used for this purpose, as long as it clearly describes the schema of the requests
and responses. For describing REST interfaces, we can use Web Application Description
Language (WADL) [61], Swagger [62], RESTful API Modeling Language (RAML) or any
other language that provides similar functionality.
When a new deployment request is received, the ADC checks whether the application
declares any API dependencies. If so, it queries the metadata manager to make sure
that all the declared dependencies are already available in the cloud. Then it inspects
the enclosed application metadata to see if the current application exports any web
APIs. If the application exports at least one API, the ADC makes another call to
the metadata manager, and retrieves any existing metadata related to that API. If the
metadata manager cannot locate any data related to the API in question, ADC assumes
it to be a brand new API (i.e. no previous version of that API has been deployed in the
cloud), and proceeds to the next step of the governance check, which is policy validation.
However, if any metadata regarding the API is found, then the ADC is dealing with an
API update. In this case, the ADC compares the old API specifications with the latest
ones provided in the application deployment request to see if they are compatible.
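The decision sequence in this paragraph can be summarized with a short sketch. It assumes, purely for illustration, that a request carries a list of `(name, version)` dependencies and a mapping of exported API names to specifications; the function name `plan_checks` and the returned action labels are hypothetical.

```python
def plan_checks(request, available_apis, known_specs):
    """Decide the next governance step for a deployment request.

    `available_apis` is the set of (name, version) pairs already deployed and
    `known_specs` maps API names to previously stored specifications. Both
    stand in for queries to the metadata manager (an assumption).
    """
    # 1. Every declared dependency must already be available in the cloud.
    missing = [dep for dep in request.get("dependencies", [])
               if dep not in available_apis]
    if missing:
        return ("reject", missing)
    # 2. Exported APIs with stored metadata are updates and need a
    #    compatibility comparison against the old specification.
    updates = [name for name in request.get("apis", {}) if name in known_specs]
    if updates:
        return ("check_compatibility", updates)
    # 3. Brand new APIs (or no exported APIs) go straight to policy validation.
    return ("validate_policies", [])
```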
To perform this API compatibility verification, the ADC checks to see whether the
latest specification of an API contains all the operations available in the old specification.
If the latest API specification is missing at least one operation that it had previously, the
ADC reports this to the user and aborts the deployment. If all the past operations are
present in the latest specification, the ADC performs a type check to make sure that all
past and present operations are type compatible. This is done by performing recursive
introspection on the input and output data types declared in the API specifications.
EAGER looks for type compatibility based on the following rules inspired by Hoare
logic [63], and the rules of type inheritance from object oriented programming.
• A new version of an input type is compatible with the old version of that input type
if the new version contains all or fewer of the attributes of the old version, and any
attributes that are unique to the new version are optional.
• A new version of an output type is compatible with the old version of that output type
if the new version contains all of the attributes of the old version, and possibly more.
In addition to the type checks, ADC may also compare other parameters declared in
the API specifications such as HTTP methods, mime types and URL patterns. We have
also explored and published results on using a combination of syntactic and semantic
comparison to determine the compatibility between APIs [60, 64]. Once the API specifications
have been successfully compared without error, and the compatibility established,
the ADC initiates policy validation.
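The operation check and the two type rules above can be written down compactly. The sketch below is a simplification under stated assumptions: types are treated as flat attribute sets (the check described in the text recurses into nested types), and the function names and data shapes are ours, not EAGER's.

```python
def operations_preserved(old_ops, new_ops):
    """Every operation in the old specification must still exist."""
    return set(old_ops) <= set(new_ops)


def input_compatible(old_attrs, new_attrs):
    """Input rule: attributes that are new in this version must be optional.

    Each argument maps an attribute name to True if it is required.
    Dropping attributes of the old version is allowed.
    """
    added = set(new_attrs) - set(old_attrs)
    return all(not new_attrs[name] for name in added)


def output_compatible(old_attrs, new_attrs):
    """Output rule: the new version keeps every old attribute and may add more."""
    return set(old_attrs) <= set(new_attrs)
```

The asymmetry mirrors the intuition that a new version may demand less from existing callers and return more to them, but never the reverse.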
3.2.3 EAGER Policy Language and Examples
Policies are specified by cloud or organizational administrators using a subset of
the popular Python programming language. This design choice is motivated by several
reasons.
• A high-level programming language such as Python is easier to learn and use for
policy implementors.
• Platform implementors can use existing Python interpreters to parse and execute
policy files. Similarly, policy implementors can use existing Python development
tools to write and test policies.
• In comparison to declarative policy languages (e.g. WS-Policy), a programming
language like Python offers more flexibility and expressive power. For example, a
policy may perform some local computation, and use the results in its enforcement
clauses. The control flow tools of the language (e.g. conditionals, loops) facilitate
specifying complex policies.
• The expressive power of the language can be closely regulated by controlling the
set of allowed built-in modules and functions.
We restrict the language to prevent state from being preserved across policy valida-
tions. In particular, the EAGER policy interpreter disables file and network operations,
third party library calls, and other language features that allow state to persist across
invocations. In addition, EAGER processes each policy independently of others (i.e. each
policy must be self-contained and access no external state). All other language constructs
and language features can be used to specify policies in EAGER.
To accommodate built-in language APIs that the administrators trust by fiat, all
module and function restrictions of the EAGER policy language are enforced through a
configurable white-list. The policy engine evaluates each module and function reference
found in policy specifications against this white-list to determine whether they are allowed
in the context of EAGER. Cloud administrators have the freedom to expand the set of
allowed built-in and third party modules by making changes to this white-list.
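One way to picture the white-list check is a static pass over the policy source. The sketch below uses Python's standard `ast` module; whether EAGER performs the check statically in this way is our assumption, and the white-list contents and the function name `find_violations` are hypothetical. It only inspects imports and calls to plain names, so method calls such as `app.owner.endswith(...)` are deliberately not examined.

```python
import ast

# Hypothetical white-list entries; an administrator would configure these.
ALLOWED_MODULES = {"re", "math"}
ALLOWED_FUNCTIONS = {"len", "list", "filter", "map", "assert_true",
                     "assert_false", "compare_versions"}


def find_violations(policy_source):
    """Return a sorted list of module/function references not on the white-list."""
    violations = set()
    for node in ast.walk(ast.parse(policy_source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_MODULES:
                    violations.add("import " + alias.name)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] not in ALLOWED_MODULES:
                violations.add("import " + module)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in ALLOWED_FUNCTIONS:
                violations.add("call " + node.func.id)
    return sorted(violations)
```

Expanding the set of allowed modules then amounts to editing the two sets, which matches the configurable nature of the white-list described above.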
As part of the policy language, EAGER defines a set of assertions that policy writers
can use to specify various checks to perform on the applications. Listing 3.1 shows the
assertions currently supported by EAGER.
Listing 3.1: Assertions supported by the EAGER policy language.
assert_true(condition, optional_error_msg)
assert_false(condition, optional_error_msg)
assert_app_dependency(app, d_name, d_version)
assert_not_app_dependency(app, d_name, d_version)
assert_app_dependency_in_range(app, name, \
    lower, upper, exclude_lower, exclude_upper)
In addition to these assertions, EAGER adds a function called “compare_versions”
to the list of available built-in functions. Policy implementors can use this function to
compare version number strings associated with applications and APIs.
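The text does not give the implementation of this function, so the following is only a plausible sketch. It assumes dotted numeric version strings and a return value of -1, 0 or 1, a convention inferred from the `>= 0` comparisons used in the example policies.

```python
def compare_versions(v1, v2):
    """Return -1, 0 or 1 as v1 is lower than, equal to, or higher than v2.

    Assumes dotted numeric version strings such as '3.0' or '1.2.10'.
    """
    a = [int(part) for part in v1.split(".")]
    b = [int(part) for part in v2.split(".")]
    # Pad the shorter list with zeros so that '3' and '3.0' compare equal.
    width = max(len(a), len(b))
    a += [0] * (width - len(a))
    b += [0] * (width - len(b))
    return (a > b) - (a < b)
```

Comparing integer components rather than raw strings matters for versions such as 3.10 and 3.9, which a plain string comparison would order incorrectly.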
In the remainder of this section we illustrate the use of the policy language through
several examples. The first example policy, shown in listing 3.2, mandates that any
application or mash-up that uses both Geo and Direction APIs must adhere to certain
versioning rules. More specifically, if the application uses Geo 3.0 or higher, it must
use Direction 4.0 or higher. Note that the version numbers are compared using the
“compare_versions” function described earlier.
Listing 3.2: Enforcing API version comparison
# list() keeps the truth test and indexing valid where filter() returns an iterator
g = list(filter(lambda dep: dep.name == 'Geo',
                app.dependencies))
d = list(filter(lambda dep: dep.name == 'Direction', app.dependencies))
if g and d:
    g_api, d_api = g[0], d[0]
    if compare_versions(g_api.version, '3.0') >= 0:
        assert_true(compare_versions(d_api.version, '4.0') >= 0)
In listing 3.2, app is a special immutable logical variable available to all policy files.
This variable allows policies to access information pertaining to the current application
deployment request. The assert_true and assert_false functions allow testing for arbitrary
conditions, thus greatly improving the expressive power of the policy language.
Listing 3.3 shows a policy file that mandates that all applications deployed by the
“admin@test.com” user must have role-based authentication enabled, so that only users
in the “manager” role can access them. To carry out this check the policy accesses the
security configuration specified in the application descriptor (e.g. the web.xml for a Java
application).
Listing 3.3: Enforcing role-based authorization.
if app.owner == 'admin@test.com':
    roles = app.web_xml['security-role']
    constraints = app.web_xml['security-constraint']
    assert_true(roles and constraints)
    assert_true(len(roles) == 1)
    assert_true('manager' == roles[0]['role-name'])
Listing 3.4 shows an example policy, which mandates that all deployed APIs must
explicitly declare an operation which is accessible through the HTTP OPTIONS method.
This policy further ensures that these operations return a description of the API in the
Swagger [62] machine-readable API description language.
Listing 3.4: Enforcing APIs to publish a description.
# list() keeps the truth test and indexing valid where filter() returns an iterator
options = list(filter(lambda op: op.method == 'OPTIONS',
                      api.operations))
assert_true(options, 'API does not support OPTIONS')
assert_true(options[0].type == 'swagger.API',
            'Does not return a Swagger description')
Returning machine-readable API descriptions from web APIs makes it easier to au-
tomate the API discovery and consumption processes. Several other research efforts
confirm the need for such descriptions [65, 66]. A policy such as this can help enforce
such practices, thus resulting in a high-quality API ecosystem in the target cloud.
The policy above also shows the use of the second and optional string argument to
the assert_true function (the same is supported by assert_false as well). This argument
can be used to specify a custom error message that will be returned to the application
developer, if his/her application violates the assertion in question.
The next example policy prevents developers from introducing dependencies on dep-
recated web APIs. Deprecated APIs are those that have been flagged by their respective
authors for removal in the near future. Therefore introducing dependencies on such APIs
is not recommended. The policy in listing 3.5 enforces this condition in the cloud.
Listing 3.5: Preventing dependencies on deprecated APIs.
# list() makes the result falsy when empty, even where filter() returns an iterator
deprecated = list(filter(
    lambda dep: dep.status == 'DEPRECATED',
    app.dependencies))
assert_false(deprecated,
             'Must not use a deprecated dependency')
Our next example presents a policy that enforces governance rules in a user-aware (i.e.
tenant-aware) manner. Assume a multi-tenant private PaaS cloud that is being used by
members of the development team and the sales team of a company. The primary goal in
this case is to ensure that applications deployed by both teams log their activities using
a set of preexisting logging APIs. However, we further want to ensure that applications
deployed by the sales team log their activities using a special analytics API. A policy
such as the one in listing 3.6 can enforce these conditions.
Listing 3.6: Tenant-aware policy enforcement.
if app.owner.endswith('@engineering.test.com'):
    assert_app_dependency(app, 'Log', '1.0')
elif app.owner.endswith('@sales.test.com'):
    assert_app_dependency(app, 'AnalyticsLog', '1.0')
else:
    assert_app_dependency(app, 'GenericLog', '1.0')
The example in listing 3.7 shows a policy, which mandates that all HTTP GET
operations exposed by APIs must support paging. APIs that do so define two input
parameters named “start” and “count” to the GET call.
Listing 3.7: Enforcement of paging functionality in APIs.
for api in app.api_list:
    get = filter(lambda op: op.method == 'GET',
                 api.operations)
    for op in get:
        # list() allows two membership tests where map() returns an iterator
        param_names = list(map(lambda p: p.name,
                               op.parameters))
        assert_true('start' in param_names and
                    'count' in param_names)
This policy accesses the metadata of API operations that is available in the API de-
scriptions. Since API descriptions are auto-generated from the source code of the APIs,
this policy indirectly references information pertaining to the actual API implementa-
tions.
Finally, we present an example for the HTTP POST method. The policy in listing 3.8
mandates that all POST operations exposed by an API are secured with OAuth version
2.0.
Listing 3.8: Enforcement of OAuth-based authentication for APIs.
for api in app.api_list:
    post = filter(lambda op: op.method == 'POST',
                  api.operations)
    for op in post:
        assert_true(op.authorizations.get('oauth2'))
EAGER places no restrictions on how many policy files are specified by adminis-
trators. Applications are validated against each policy file. Failure of any assertion in
any policy file causes the ADC to abort application deployment. Once an application
is checked against all applicable policies, ADC persists the latest application and API
metadata into the Metadata Manager. At this point, the ADC may report success to
the user, and proceed with application deployment. In a PaaS setting this deployment
activity typically involves three steps:
1. Deploy the application in the cloud application run-time (application server).
2. Publish the APIs enclosed in the application and their specifications to the API
Discovery Portal or catalog.
3. Publish the APIs enclosed in the application to an API Gateway server.
Step 1 is required to complete the application deployment in the cloud even without
EAGER. We explain the significance of steps 2 and 3 in the following subsections.
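The validation loop that precedes these deployment steps can be sketched as follows. This is a simplified illustration, not the prototype's mechanism: the prototype uses a modified Python interpreter, whereas this sketch uses `exec` with a reduced set of built-ins, which is not a secure sandbox. The names `PolicyViolation` and `validate` are hypothetical, and `app` is modeled as a plain dict.

```python
class PolicyViolation(Exception):
    """Raised when an assertion in a policy file fails."""


def assert_true(condition, msg="policy assertion failed"):
    if not condition:
        raise PolicyViolation(msg)


def assert_false(condition, msg="policy assertion failed"):
    assert_true(not condition, msg)


def validate(app, policy_sources):
    """Return None if `app` passes every policy, else the first error message."""
    for source in policy_sources:
        # A fresh namespace per policy, so no state carries over between
        # policies or between validations.
        namespace = {
            "__builtins__": {"len": len, "list": list,
                             "filter": filter, "map": map},
            "app": app,
            "assert_true": assert_true,
            "assert_false": assert_false,
        }
        try:
            exec(source, namespace)
        except PolicyViolation as err:
            # Any failed assertion aborts the deployment.
            return str(err)
    return None
```

Only when `validate` returns `None` for every policy would the metadata be persisted and the three deployment steps above be carried out.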
3.2.4 API Discovery Portal
The API Discovery Portal (ADP) is an online catalog where developers can browse
available web APIs. Whenever the ADC approves and deploys a new application, it
registers all the APIs exported by the application in ADP. EAGER mandates that any
developer interested in using an API, first subscribe to that API and obtain the proper
credentials (API keys) from the ADP. The API keys issued by the ADP can consist of an
OAuth [67] access token (as is typical of many commercial REST-based web services) or a
similar authorization credential, which can be used to identify the developer/application
that is invoking the API. This credential validation process is used for auditing, and
run-time governance in EAGER.
The API keys issued by the ADP are stored in the metadata manager. When a
programmer develops a new application using one or more API dependencies, we require
the developer to declare its dependencies along with the API keys obtained from the
ADP. The ADC verifies this information against the metadata manager as a part of
its dependency check, and ensures that the declared dependencies are correct and the
specified API keys are valid.
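As a rough sketch of this dependency check, the function below compares a developer's declaration with stored subscription data. The data shapes and the name `verify_dependencies` are assumptions made for this example; in EAGER the lookup would go through the metadata manager's web service interface.

```python
def verify_dependencies(declared, subscriptions):
    """Return a list of problems found in a dependency declaration.

    `declared` maps (api_name, version) to the API key supplied by the
    developer. `subscriptions` maps (api_name, version) to the set of keys
    issued for that API, standing in for the metadata manager's records.
    """
    problems = []
    for api, key in declared.items():
        if api not in subscriptions:
            problems.append("unknown API: %s %s" % api)
        elif key not in subscriptions[api]:
            problems.append("invalid key for: %s %s" % api)
    return problems
```

An empty result means that every declared dependency exists and carries a valid key, so the dependency check passes.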
Deployment-time governance policies may further incentivize the declaration of API
dependencies explicitly by making it impossible to call an API without first declaring
it as a dependency along with the proper API keys. These types of policies can be
implemented with minor changes to the application run-time in the cloud so that it
loads the API credentials from the dependency declaration provided by the application
developer.
In addition to API discovery, the ADP also provides a user interface for API authors
to select their own APIs and deprecate them or retire them. Deprecated APIs will be
removed from the API search results of the portal, and application developers will no
longer be able to subscribe to them. However, already existing subscriptions and API keys
will continue to work until the API is eventually retired. The deprecation is considered
a courtesy notice for application developers who have developed applications using the
API, to migrate their code to a newer version of the API. Once retired, any applications
that have still not been migrated to the latest version of the API will cease to operate.
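The lifecycle described here can be captured as three states with different permissions. The enum and helper names below are hypothetical and only restate the rules in the paragraph above.

```python
from enum import Enum


class ApiState(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


def is_listed(state):
    """Only active APIs appear in the portal's search results."""
    return state is ApiState.ACTIVE


def can_subscribe(state):
    """New subscriptions are accepted only for active APIs."""
    return state is ApiState.ACTIVE


def can_invoke(state):
    """Existing subscriptions and keys keep working until retirement."""
    return state is not ApiState.RETIRED
```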
3.2.5 API Gateway
Systems that perform run-time governance of web services, such as Synapse [68], make use
of an API “proxy” or gateway. The EAGER API gateway does so to intercept API calls and
validate the API keys contained within them. EAGER intercepts requests by blocking
direct access to the APIs in the application run-time (app servers), and publishing the
API Gateway address as the API endpoint in the ADP. We do so via firewall rules that
prevent the cloud app servers from receiving any API traffic from a source other than
the API gateway. Once the API gateway validates an API call, it routes the message to
the application server in the cloud platform that hosts the API.
The API gateway can be implemented via one or more load-balanced servers. In
addition to API key validation, the API gateway can perform other functions such as
monitoring, throttling (rate limiting), and run-time policy validation.
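To make the gateway's role concrete, the sketch below combines key validation, a simple sliding-window rate limit, and routing. It is not how the prototype's gateway works internally; the class name `GatewaySketch`, its parameters, and the choice of HTTP status codes (401, 429, 404) are assumptions for illustration.

```python
import time


class GatewaySketch:
    def __init__(self, valid_keys, routes, max_calls, window_seconds,
                 clock=time.monotonic):
        self.valid_keys = valid_keys   # keys that would come from the metadata manager
        self.routes = routes           # API name -> app server address
        self.max_calls = max_calls     # calls allowed per key per window
        self.window = window_seconds
        self.clock = clock             # injectable for testing
        self.calls = {}                # key -> timestamps of recent calls

    def handle(self, api_name, api_key):
        """Return (status, backend address) for an incoming API call."""
        if api_key not in self.valid_keys:
            return (401, None)         # key validation failed
        now = self.clock()
        recent = [t for t in self.calls.get(api_key, [])
                  if now - t < self.window]
        if len(recent) >= self.max_calls:
            self.calls[api_key] = recent
            return (429, None)         # throttled
        recent.append(now)
        self.calls[api_key] = recent
        if api_name not in self.routes:
            return (404, None)         # no such API published
        # Validated: route the message to the app server hosting the API.
        return (200, self.routes[api_name])
```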
3.3 Prototype Implementation
We implement a prototype of EAGER by extending AppScale [57], an open source
PaaS cloud that is functionally equivalent to Google App Engine (GAE). AppScale sup-
ports web applications written in Python, Java, Go and PHP. Our prototype implements
governance for all applications and APIs hosted in an AppScale cloud.
As described in subsection 3.2.3, EAGER’s policy specification language is based on
Python. This allows the API deployment coordinator (also written in Python) to execute
the policies directly using a modified Python interpreter to implement the restrictions
previously discussed.
The prototype relies on a separate tool chain (i.e. one not hosted as a service in
the cloud) to automatically generate API specifications and other metadata (cf. Section 3.2.2),
which currently supports only the Java language. Developers must document
the APIs manually for web applications implemented in languages other than Java.
Like most PaaS technologies, AppScale includes an application deployment service
that distributes, launches and exports an application as a web-accessible service. EAGER
controls this deployment process according to the policies that the platform administrator
specifies.
3.3.1 Auto-generation of API Specifications
To auto-generate API specifications, the build process of an application must include
an analysis phase that generates specifications from the source code. Our prototype
includes two stand-alone tools for implementing this “build-and-analyze” function.
1. An Apache Maven archetype that is used to initialize a Java web application project,
and
2. A Java doclet that is used to auto-generate API specifications from web APIs
implemented in Java
Developers invoke the Maven archetype from the command-line to initialize a new
Java web application project. Our archetype sets up projects with the required AppScale
(GAE) libraries, Java JAX-RS [69] (Java API for RESTful Web Services) libraries, and
a build configuration.
Once the developer creates a new project using the archetype, he/she can develop
web APIs using the popular JAX-RS library. When the code is developed, it can be built
using our auto-generated Maven build configuration, which introspects the project source
code to generate specifications for all enclosed web APIs using the Swagger [70] API
description language. It then packages the compiled code, required libraries, generated
API specifications, and the dependency declaration file into a single, deployable artifact.
Finally, the developer submits the generated artifact for deployment to the cloud
platform, which in our prototype is done via AppScale developer tools. To enable this,
we modify the tools so that they send the application deployment request to the EAGER
ADC and delegate the application deployment process to EAGER. This change required
just under 50 additional lines of code in AppScale.
EAGER Component              Implementation Technology
---------------------------  ----------------------------
Metadata Manager             MySQL
API Deployment Coordinator   Native Python implementation
API Discovery Portal         WSO2 API Manager [71]
API Gateway                  WSO2 API Manager
Table 3.1: Implementation technologies used to implement the EAGER prototype
3.3.2 Implementing the Prototype
Table 3.1 lists the key technologies that we use to implement various EAGER func-
tionalities described in section 3.2 as services within AppScale. For example, AppScale
controls the lifecycle of the MySQL database as it would any of its other constituent
services. EAGER incorporates the WSO2 API Manager [72] for use as an API discovery
mechanism, and to implement any run-time policy enforcement. In the prototype, the
API gateway does not share policies expressed in the policy language with the ADC.
This integration is left to be implemented in the future.
Also, according to the architecture of EAGER, the metadata manager is the most suitable
location for storing all policy files. The ADC may retrieve the policies from the
metadata manager through its web service interface. However, for simplicity, our current
prototype stores the policy files in a file system that the ADC can read directly.
In a more sophisticated future implementation of EAGER, we will move all policy files
to the metadata manager where they can be better managed.
3.4 Experimental Results
In this section, we describe our empirical evaluation of the EAGER prototype, and
evaluate its overhead and scaling characteristics. To do so, we populate the EAGER
database (metadata manager) with a set of APIs, and then examine the overhead as-
sociated with governing a set of sample AppScale applications (shown in Table 3.2) for
varying degrees of policy specifications and dependencies. In the first set of results we use
randomly generated APIs, so that we may vary different parameters that may affect per-
formance. We then follow with a similar analysis using a large set of API specifications
“scraped” from the ProgrammableWeb [48] public API registry.
Note that all the figures included in this section present the average values calculated
over three sample runs. The error bars cover an interval of two standard deviations
centered at the calculated sample average.
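For clarity, the error-bar convention just described can be computed as in the sketch below. Whether the figures use the sample or the population standard deviation is not stated, so the use of the sample standard deviation here is an assumption, as is the function name `error_bar`.

```python
import statistics


def error_bar(samples):
    """Return (mean, lower, upper) for a bar spanning two standard deviations.

    The bar is centered at the mean, so it extends one standard deviation
    below and one above it.
    """
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)  # sample standard deviation (assumption)
    return mean, mean - sd, mean + sd
```

With three runs per data point, as used here, such bars give a rough indication of run-to-run variation rather than a formal confidence interval.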
We start by presenting the time required for AppScale application deployment without
EAGER, as it is this process on which we piggyback EAGER support. These measure-
ments are conservative since they are taken from a single node deployment of AppScale
where there is no network communication overhead. Our test AppScale cloud is deployed
on an Ubuntu 12.04 Linux virtual machine with a 2.7 GHz CPU, and 4 GB of memory. In
practice AppScale is deployed over multiple hosts in a distributed manner where different
components of the cloud platform must communicate via the network.
Table 3.2 lists a number of App Engine applications that we consider, their artifact
size, and their average deployment times across three runs, on AppScale without EA-
GER. We also identify the number of APIs and dependencies for each application in
the Description column. These applications represent a wide range of programming
languages, application sizes, and business domains.
On average, deployment without EAGER takes 34.5 seconds, and this time is corre-
lated with application artifact size. The total time consists of network transfer time of
the application to the cloud (which in this case is via localhost networking), and disk
copy time to the application servers. For actual deployments, both components are likely
to increase due to network latency, available bandwidth, contention, and large numbers
of distributed application servers.
Application      | Description                                                                        | Size (MB) | Deployment Time (s)
guestbook-py     | A simple Python web application that allows users to post comments and view them   | 0.16      | 22.13
guestbook-java   | A Java clone of the guestbook-python app                                           | 52        | 24.18
appinventor      | A popular open source web application that enables creating mobile apps            | 198       | 111.47
coursebuilder    | A popular open source web application used to facilitate teaching online courses   | 37        | 23.75
hawkeye          | A sample Java application used to test AppScale                                    | 35        | 23.37
simple-jaxrs-app | A sample JAXRS app that exports 2 web APIs                                         | 34        | 23.45
dep-jaxrs-app    | A sample JAXRS app that exports a web API and has one dependency                   | 34        | 23.72
dep-jaxrs-app-v2 | A sample JAXRS app that exports 2 web APIs and has one dependency                  | 34        | 23.95
Table 3.2: Sample AppScale applications
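The 34.5 second average quoted for deployment without EAGER can be reproduced from the values in Table 3.2. The short sketch below does that arithmetic and also picks out the largest artifact and the slowest deployment; the dictionary name is just a label chosen for this illustration.

```python
from statistics import mean

# (artifact size in MB, deployment time in seconds), copied from Table 3.2.
TABLE_3_2 = {
    "guestbook-py": (0.16, 22.13),
    "guestbook-java": (52, 24.18),
    "appinventor": (198, 111.47),
    "coursebuilder": (37, 23.75),
    "hawkeye": (35, 23.37),
    "simple-jaxrs-app": (34, 23.45),
    "dep-jaxrs-app": (34, 23.72),
    "dep-jaxrs-app-v2": (34, 23.95),
}

average_deploy_time = mean(t for _, t in TABLE_3_2.values())
largest = max(TABLE_3_2, key=lambda name: TABLE_3_2[name][0])
slowest = max(TABLE_3_2, key=lambda name: TABLE_3_2[name][1])

print(f"average deployment time: {average_deploy_time:.1f} s")
print(f"largest artifact: {largest}; slowest deployment: {slowest}")
```

The largest artifact is also the slowest to deploy, which is in line with the stated relationship between artifact size and deployment time.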
3.4.1 Baseline EAGER Overhead by Application
Figure 3.2 shows the average time in seconds taken by EAGER to validate and verify
each application. We record these results on an EAGER deployment without any policies
deployed, and without any prior metadata recorded in the metadata manager (that is, an
unpopulated database of APIs). We present the values as absolute measurements (here
and henceforth) because of the significant difference between them and deployment times
on AppScale without EAGER (100’s of milliseconds compared to 10’s of seconds). We
can alternatively observe this overhead as a percentage of AppScale deployment time by
dividing these times by those shown in Table 3.2.
Figure 3.2: Absolute mean overhead of EAGER by application. Each data point
averages three executions, the error bars are two standard deviations, and the units
are seconds.
Note that some applications do not export any web APIs. For these, EAGER overhead
is negligibly small (approximately 0.1s). This result indicates that EAGER does
not impact deployment time of applications that do not require API governance. For
applications that do export web APIs, the recorded overhead measurements include the
time to retrieve old API specifications from the metadata manager, the time to compare
the new API specifications with the old ones, the time to update the API specifications
and other metadata in the Metadata Manager, and the time to publish the updated APIs
to the cloud. The worst case observed overhead for governed APIs (simple-jaxrs-app in
Figure 3.2) is 2.8%.
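To relate the percentage figure to absolute time, the following sketch converts between the two using the deployment times from Table 3.2. It assumes that the 2.8% figure is expressed relative to the deployment time of simple-jaxrs-app itself; the helper names are introduced only for this illustration.

```python
def overhead_percent(overhead_s, deploy_s):
    """Overhead as a percentage of the deployment time without EAGER."""
    return 100.0 * overhead_s / deploy_s


def overhead_seconds(percent, deploy_s):
    """Inverse conversion: percentage overhead back to seconds."""
    return percent / 100.0 * deploy_s


SIMPLE_JAXRS_DEPLOY_S = 23.45  # from Table 3.2
GUESTBOOK_PY_DEPLOY_S = 22.13  # from Table 3.2

# 2.8% of the simple-jaxrs-app deployment time, in seconds.
worst_case_s = overhead_seconds(2.8, SIMPLE_JAXRS_DEPLOY_S)
# The approximate 0.1 s overhead of an app without APIs, as a percentage.
no_api_pct = overhead_percent(0.1, GUESTBOOK_PY_DEPLOY_S)

print(f"worst case: {worst_case_s:.2f} s; app without APIs: {no_api_pct:.2f}%")
```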
3.4.2 Impact of Number of APIs and Dependencies
Figure 3.3 shows that EAGER overhead grows linearly with the number of APIs
exported by an application. This scaling occurs because the current prototype imple-
mentation iterates through the APIs in the application sequentially, and records the API
metadata in the metadata manager. Then EAGER publishes each API to the ADP and
API Gateway. This sequencing of individual EAGER events, each of which generates a
separate web service call, represents an optimization opportunity via parallelization in
future implementations.
Figure 3.3: Average EAGER overhead vs. number of APIs exported by the application.
Each data point averages three executions, the error bars are two standard
deviations, and the units are seconds.
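The contrast between sequential publishing and the suggested parallelization can be sketched as follows. This is not the prototype's code: `publish_api` is a hypothetical stand-in that merely sleeps to imitate the latency of one web service call, and the thread pool size is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def publish_api(api_name, latency_s=0.01):
    """Hypothetical stand-in for one web service call per API."""
    time.sleep(latency_s)
    return f"published:{api_name}"


def publish_sequential(apis):
    # One call after another: total time grows linearly with len(apis).
    return [publish_api(api) for api in apis]


def publish_parallel(apis, workers=8):
    # Independent calls overlap; map() still returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(publish_api, apis))


apis = [f"api-{i}" for i in range(16)]

start = time.perf_counter()
seq = publish_sequential(apis)
seq_time = time.perf_counter() - start

start = time.perf_counter()
par = publish_parallel(apis)
par_time = time.perf_counter() - start

print(f"sequential: {seq_time:.3f} s, parallel: {par_time:.3f} s")
```

Whether such parallelization is safe in practice depends on whether the per-API calls are truly independent, which this sketch simply assumes.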
At present we expect most applications deployed in cloud to have a small to mod-
erate number of APIs (10 or fewer). With this API density EAGER’s current scaling
is adequate. Even in the unlikely case that a single application exports as many as 100
APIs, the average total time for EAGER is under 20 seconds.
Next, we analyze EAGER overhead as the number of dependencies declared in an
application grows. For this experiment, we first populate the EAGER metadata manager
with metadata for 100 randomly generated APIs. To generate random APIs we use the
API specification auto-generation tool to generate fictitious APIs with randomly varying
numbers of input/output parameters. Then we deploy an application on EAGER which
exports a single API, and declares artificial dependencies on the set of fictitious APIs
that are already stored in the Metadata Manager. We vary the number of declared
dependencies and observe the EAGER overhead.
Figure 3.4 shows the results of these experiments. EAGER overhead does not appear
to be significantly influenced by the number of dependencies declared in an application.
In this case, the EAGER implementation processes all dependency-related information
via batch operations. As a result, the number of web service calls and database queries
that originate due to varying number of dependencies remains constant.
Figure 3.4: Average EAGER overhead vs. number of dependencies declared in the application.
Each data point averages three executions, the error bars are two standard
deviations, and the units are seconds.
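One way to picture why batching keeps the query count constant is a lookup that resolves all declared dependencies with a single `IN (...)` query. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite database as a stand-in for the metadata manager's database, and the `ApiStore` class, table layout, and method names are hypothetical.

```python
import sqlite3


class ApiStore:
    """Hypothetical metadata store that counts the queries it issues."""

    def __init__(self, api_names):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE apis (name TEXT PRIMARY KEY, version TEXT)"
        )
        self.conn.executemany(
            "INSERT INTO apis (name, version) VALUES (?, ?)",
            [(name, "1.0") for name in api_names],
        )
        self.query_count = 0

    def lookup_batch(self, names):
        """Resolve every declared dependency with one query."""
        names = list(names)
        if not names:
            return {}
        placeholders = ", ".join("?" for _ in names)
        sql = f"SELECT name, version FROM apis WHERE name IN ({placeholders})"
        self.query_count += 1
        return dict(self.conn.execute(sql, names).fetchall())


store = ApiStore(f"api-{i}" for i in range(100))
for dependency_count in (10, 20, 50):
    declared = [f"api-{i}" for i in range(dependency_count)]
    found = store.lookup_batch(declared)
    print(dependency_count, "dependencies resolved:", len(found))
print("queries issued:", store.query_count)
```

Each deployment issues one lookup regardless of how many dependencies it declares, so the count of round trips stays flat as the dependency list grows.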
3.4.3 Impact of Number of Policies
So far we have conducted all our experiments without any active governance policies
in the system. In this section, we report how EAGER overhead is influenced by the
number of policies.
The overhead of policy validation is largely dependent on the actual policy content
which is implemented as Python code. Since users may include any Python code (as
long as it falls in the accepted subset) in a policy file, evaluating a given policy can take
an arbitrary amount of time. Therefore, in this experiment, our goal is to evaluate the
overhead incurred by simply having many policy files to execute. We keep the content
of the policies small and trivial. We create a policy file that runs the following assertions:
1. Application name must start with an upper case letter
2. Application must be owned by a specific user
3. All API names must start with upper case letters
We create many copies of this initial policy file to vary the number of policies deployed.
Then we evaluate the overhead of policy validation on two of our sample applications –
guestbook-py and simple-jaxrs-app.
Figure 3.5: Average EAGER overhead vs. number of policies. Each data point
averages three executions, the error bars are two standard deviations, and the units
are seconds. Note that some of the error bars for guestbook-py are smaller than the
graph features at this scale, and are thus obscured.
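The three assertions in the test policy might look roughly like the sketch below. It is written in plain Python rather than the actual EAGER policy syntax: the `App` and `Api` classes, the `assert_true` helper, the `PolicyViolation` exception, and the owner value are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Api:
    name: str


@dataclass
class App:
    name: str
    owner: str
    apis: list = field(default_factory=list)


class PolicyViolation(Exception):
    """Raised when an application fails a policy assertion."""


def assert_true(condition, message):
    if not condition:
        raise PolicyViolation(message)


REQUIRED_OWNER = "admin@example.com"  # hypothetical owner


def sample_policy(app):
    # 1. Application name must start with an upper case letter.
    assert_true(app.name[:1].isupper(), "application name must be capitalized")
    # 2. Application must be owned by a specific user.
    assert_true(app.owner == REQUIRED_OWNER, "unexpected application owner")
    # 3. All API names must start with upper case letters.
    for api in app.apis:
        assert_true(api.name[:1].isupper(), f"API name not capitalized: {api.name}")


sample_policy(App("Guestbook", REQUIRED_OWNER))  # no APIs, so check 3 does no work
sample_policy(App("Orders", REQUIRED_OWNER, [Api("Inventory"), Api("Billing")]))
```

For an application that exports no APIs the loop in the third check has nothing to iterate over, so only the first two assertions do any work.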
Figure 3.5 shows how the number of active policies impacts EAGER overhead. We see
that even large numbers of policies do not impact EAGER overhead significantly. It is
only when the active policy count approaches 1000 that we can notice a small increase
in the overhead. Even then, the increase in deployment time is under 0.1 seconds.
This result is due to the fact that EAGER loads policy content into memory at system
startup, or when a new policy is deployed, and executes them from memory each time an
application is deployed. Since policy files are typically small (at most a few kilobytes),
this is a viable option. The overhead of validating the simple-jaxrs-app is higher than
that of the guestbook-py because simple-jaxrs-app exports web APIs. This means the
third assertion in the policy set is executed for this app, and not for guestbook-py. Also,
additional interactions with the metadata manager are needed in the case of simple-jaxrs-app
in order to persist the API metadata for future use.
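The idea of loading policies into memory once and executing them from memory on every deployment can be sketched as below. This is a simplified illustration, not the prototype's implementation: `PolicyCache` is a hypothetical class, applications are modeled as plain dictionaries, and the sketch performs no restriction of the policy code to an accepted Python subset.

```python
class PolicyCache:
    """Hypothetical in-memory store of pre-compiled policy files."""

    def __init__(self):
        self._compiled = {}

    def load(self, name, source):
        # Compile once, at startup or when a new policy is deployed.
        self._compiled[name] = compile(source, name, "exec")

    def validate(self, app):
        # Run every cached policy against the application being deployed.
        for code in self._compiled.values():
            exec(code, {"app": app})

    def __len__(self):
        return len(self._compiled)


POLICY_SOURCE = (
    "if not app['name'][:1].isupper():\n"
    "    raise ValueError('application name must start with an upper case letter')\n"
)

cache = PolicyCache()
for i in range(1000):
    # Many copies of the same small policy, as in the experiment.
    cache.load(f"policy-{i}.py", POLICY_SOURCE)

cache.validate({"name": "Guestbook"})
print(len(cache), "policies validated from memory")
```

Because no file is read at validation time, the per-deployment cost in this sketch is limited to executing the already compiled code objects.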
Figure 3.6: Average EAGER overhead vs. number of APIs in metadata manager.
Each data point averages three executions, the error bars are two standard deviations,
and the units are seconds. Note that some of the error bars for guestbook-py are
smaller than the graph features at this scale and are thus obscured.
Our results indicate that EAGER scales well to hundreds of policies. That is, there
is no significant overhead associated with simply having a large number of policy files.
However, as mentioned earlier, the content of a policy may influence the overhead of
policy validation, and will be specific to the policy and application EAGER analyzes.
3.4.4 Scalability
Next, we evaluate how EAGER scales when a large number of APIs are deployed in
the cloud. In this experiment, we populate the EAGER metadata manager with a varying
number of random APIs. We then attempt to deploy various sample applications. We
also create random dependencies among the APIs recorded in the metadata manager to
make the experimental setting more realistic.
Figure 3.6 shows that the deployment overhead of the guestbook-py application is
not impacted by the growth of metadata in the cloud. Recall that guestbook-py does not
export any APIs nor does it declare any dependencies. Therefore the deployment process
of the guestbook-py application has minimal interactions with the metadata manager.
Based on this result we conclude that applications that do not export web APIs are not
significantly affected by the accumulation of metadata in EAGER.
Both simple-jaxrs-app and dep-jaxrs-app are affected by the volume of data stored
in metadata manager. Since these applications export web APIs that must be recorded
and validated by EAGER, the growth of metadata has an increasingly higher impact
on them. The degradation of performance as a function of the number of APIs in the
metadata manager database is due to the slowing of query performance of the RDBMS
engine (MySQL) as the database size grows. Note that the simple-jaxrs-app is affected
more by this performance drop, because it exports two APIs compared to the single API
exported by dep-jaxrs-app. However, the growth in overhead is linear in the number of
APIs deployed in the cloud, presumably indicating a linear scaling factor in the installation
of MySQL that EAGER used in these experiments. Also, even after deploying 10000
APIs, the overhead on simple-jaxrs-app increases by only 0.5 seconds.
Another interesting characteristic in Figure 3.6 is the increase in overhead variance
as the number of APIs in the cloud grows. We believe that this is due to the increasing
variability of database query performance and the data transfer performance as the size
of the database increases.
In summary, the current EAGER prototype scales well to 1000’s of APIs. If further
scalability is required, we can employ parallelization and database query optimization.
3.4.5 Experimental Results with a Real-World Dataset
Finally, we explore how EAGER operates with a real-world dataset with API meta-
data and dependency information. For this, we crawl the ProgrammableWeb API reg-
istry, and extract metadata regarding all registered APIs and mash-ups. At the time
of the experiment, we managed to collect 11095 APIs and 7227 mash-ups, where each
mash-up depends on one or more APIs.
Figure 3.7: Average EAGER overhead over three experiments when deploying on
ProgrammableWeb Dataset. The error bars are two standard deviations, and the
units are seconds.
We auto-generated API specifications for each API and mash-up, and populated the
EAGER metadata manager with them. We then used the mashup-API dependency in-
formation gathered from ProgrammableWeb to register dependencies among the APIs in
EAGER. This resulted in a dependency graph of 18322 APIs in total with 33615 dependencies.
We then deploy a subset of our applications, and measure EAGER overhead.
Figure 3.7 shows the results for three applications. The guestbook-py app (without
any web APIs) is not significantly impacted by the large dependency database. Ap-
plications that export web APIs show a slightly higher deployment overhead due to the
database scaling properties previously discussed. However, the highest overhead observed
is under 2 seconds for simple-jaxrs-app, which is an acceptably small percentage of the
23.45 second deployment time as shown in Table 3.2.
The applications in this experiment do not declare dependencies on any of the APIs
in the ProgrammableWeb dataset. The dep-jaxrs-app does declare a dependency, but
that is on an API exported by simple-jaxrs-app. To see how the deployment time is
impacted when applications become dependent on other APIs already registered in EA-
GER, we deploy a test application that declares random fictitious dependencies on APIs
from the ProgrammableWeb corpus registered in EAGER. We consider 10, 20, and 50
declared dependencies, and deploy each application three times. We present the results
in Figure 3.8. For the “random” datasets, we run a deployment script that randomly
modifies the declared dependencies at each redeployment. For the “fixed” datasets the
declared dependencies remain the same across redeployments.
Figure 3.8: EAGER overhead when deploying on ProgrammableWeb dataset with
dependencies. The suffix value indicates the number of dependencies; the prefix indicates
whether these dependencies are randomized upon redeployment. Each data
point averages three executions, the error bars are two standard deviations, and
the units are seconds.
We observe that the dependency count does not have a significant impact on the
overhead. The largest overhead observed is under 1.2 seconds for 50 randomly varied
dependencies. In addition, when the dependency declaration is fixed, the overhead is
slightly smaller. This is because our prototype caches the edges of its internally generated
dependency tree, which expedites redeployments.
In summary, EAGER adds a very small overhead to the application deployment pro-
cess, and this overhead increases linearly with the number of APIs exported by the
applications, and the number of APIs deployed in the cloud. Interestingly, the number
of deployed policies and declared dependencies have little impact on the EAGER gover-
nance overhead. Finally, our results indicate that EAGER scales well to 1000’s of APIs
and adds less than 2 seconds latency with over 18,000 “real-world” deployed APIs in its
database. Based on this analysis we conclude that enforced deployment-time API gov-
ernance can be implemented in modern PaaS clouds with negligible overhead and high
scalability. Further, deployment-time API governance can be made an intrinsic compo-
nent of the PaaS cloud itself, thus alleviating the need for weakly integrated third-party
API management solutions.
3.5 Related Work
Our research builds upon advances in the areas of SOA governance and service management.
Guan et al. introduced FASWSM [73], a web service management framework
for application servers. FASWSM uses an adaptation technique that wraps web services
so that they can be managed by the underlying application server platform. Wu
et al. introduced DART-Man [74], a web service management system based on semantic
web concepts. Zhu and Wang proposed a model that uses Hadoop and HBase to
store web service metadata, and process them to implement a variety of management
functions [75]. Our work is different from these past approaches in that EAGER targets
policy enforcement, and we focus on doing so by extending extant cloud platforms (e.g.
PaaS) to provide an integrated and scalable governance solution.
Lin et al. proposed a service management system for clouds that monitors all service
interactions via special “hooks” that are connected to the cloud-hosted services [76].
These hooks monitor and record service invocations, and also provide an interface so
that the individual service artifacts can be managed remotely. However, this system only
supports run-time service management and provides no support for deployment-time
policy checking and enforcement. Kikuchi and Aoki [77] proposed a technique based on
model checking to evaluate the operational vulnerabilities and fault propagation patterns
in cloud services. However, this system provides no active monitoring or enforcement
functionality. Sun et al. proposed a reference architecture for monitoring and managing
cloud services [78]. This too lacks deployment-time governance, policy validation support,
and the ability to intercept and act upon API calls, which limits its use as a comprehensive
governance solution for clouds.
Other researchers have shown that policies can be used to perform a wide range
of governance tasks for SOA such as access control [79, 80], fault diagnosis [81],
customization [82], composition [83, 84] and management [85, 86, 87]. We build upon the
foundation of these past efforts, and use policies to govern RESTful web APIs deployed in
cloud settings. Our work is also different in that it defines an executable policy language
(implemented as a subset of Python in the EAGER prototype) that employs a simple,
developer-friendly syntax based upon the Python language (vs XML), which is capable
of capturing a wide range of governance requirements.
Peng, Lui and Chen showed that the major concerns associated with SOA governance
involve retaining the high reliability of services, recording how many services are avail-
able on the platform to serve, and making sure all the available services are operating
within an acceptable service level [20]. EAGER attempts to satisfy similar requirements
for modern RESTful web APIs deployed in cloud environments. EAGER’s metadata
manager and ADP record and keep track of all deployed APIs in a simple, extensible,
and comprehensive manner. Moreover, EAGER’s policy validation, dependency manage-
ment, and API change management features “fail fast” to detect violations immediately
making diagnosis and remediation less complex, and prevent the system from ever enter-
ing a non-compliant state.
API management has been a popular topic in the industry over the last few years, re-
sulting in many commercial and open source API management solutions [72, 46, 47, 88].
These products facilitate API lifecycle management, traffic shaping, access control, mon-
itoring and a variety of other important API-related functionality. However, these tools
do not support deep integration with cloud environments in which many web applications
and APIs are deployed today. EAGER is also different in that it combines deployment-
time and run-time enforcement. Previous systems either work exclusively at run-time or
do not include an enforcement capability (i.e. they are advisory).
3.6 Conclusions and Future Work
In this chapter, we describe EAGER, a model and a software architecture that fa-
cilitates API governance as a cloud-native feature. EAGER supports comprehensive
policy enforcement, dependency management, and a variety of other deployment-time
API governance features. It promotes many software development and maintenance best
practices including versioning, code reuse, and API backwards compatibility retention.
EAGER also includes a language based on Python that enables creating, debugging, and
maintaining API governance policies in a simple and intuitive manner. EAGER can be
built into cloud platforms that are used to host APIs to automate governance tasks that
otherwise require custom code or developer intervention.
Our empirical results, gathered using a prototype of EAGER developed for AppScale,
show that EAGER adds negligibly small overhead to the cloud application deployment
process, and the overhead grows linearly with the number of APIs deployed. We also
show that EAGER scales well to handle tens of thousands of APIs and hundreds of
policies. Based on our results we conclude that efficient and automated policy enforce-
ment is feasible in cloud environments. Furthermore, we find that policy enforcement at
deployment-time can help cloud administrators and application developers achieve ad-
ministrative conformance and developer best practices with respect to cloud-hosted web
applications.
As part of our future work, we plan to investigate the degree to which deployment-
time governance can be expanded. Run-time API governance imposes a number of new
scalability and reliability challenges. By offloading as much of the governance overhead
to deployment-time as possible, EAGER ensures that the impact of run-time governance
is minimized.
We also plan to investigate the specific language features that are essential to EA-
GER’s combined deployment-time and run-time approach. The use of Python in the
prototype proved convenient from a programmer productivity perspective. It is not yet
clear, however, whether the full set of language features that we have left unrestricted
are necessary. By minimizing the policy language specification we hope to make its im-
plementation more efficient, less error prone to develop and debug, and more amenable
to automatic analysis.
Another future research direction is the integration of policy language and run-time
API governance. We wish to explore the possibility of using the same Python-based
policy language for specifying policies that are enforced on APIs at run-time (i.e. on
individual API calls). Since API calls are far more frequent than API deployment events,
we should evaluate the performance aspects of the policy engine to make this integration
practically useful.
Chapter 4
Response Time Service Level Objectives for Cloud-hosted Web Applications
In the previous chapter we discussed how to implement API governance in cloud en-
vironments via policy enforcement. This chapter focuses on stipulating bounds on the
performance of cloud-hosted web applications. The ability to understand the performance
bounds of an application is vital in several governance use cases such as performance-
aware policy enforcement, and application performance monitoring.
Cloud-hosted web applications are deployed and used as web services. They enable
a level of service reuse that both expedites and simplifies the development of new client
applications. Despite the many benefits, reusing existing services also has pitfalls. In par-
ticular, new client applications become dependent on the services they compose. These
dependencies impact correctness, performance, and availability of the composite appli-
cations, for which the “top level” developer is often held accountable. Compounding the
situation, the underlying services can and do change over time while their APIs remain
stable, unbeknownst to the developers that programmatically access them. Unfortu-
nately, there is a dearth of tools that help developers reason about these dependencies
throughout an application’s life cycle (i.e. development, deployment, and run-time).
Without such tools, programmers must adopt extensive, continuous, and costly testing
and profiling methods to understand the performance impact on their applications that
results from the increasingly complex collection of services that they depend on.
We present Cerebro to address this requirement without subjecting applications to
extensive testing or instrumentation. Cerebro is a new approach that predicts bounds on
the response time performance of web APIs exported by applications that are hosted in
a PaaS cloud. The goal of Cerebro is to allow a PaaS administrator to determine what
response time service level objective (SLO) can be fulfilled by each web API operation
exported by the applications hosted in the PaaS.
An “SLO” specifies the minimum service level promised by the service provider re-
garding some non-functional property of the service such as its availability or performance
(response time). Such SLOs are explicitly stated by the service provider, and are typi-
cally associated with a correctness probability, which can be described as the likelihood
the service will meet the promised minimum service level. A typical availability SLO
takes the form: “the service will be available p% of the time”. Here the value p% is
the correctness probability of the SLO. Similarly, a response time SLO would take the
form of the statement: “the service will respond under Q milliseconds, p% of the time”.
Naturally, p should be a value close to 100 for this type of SLO to be useful in practice.
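A response time SLO of this form can be checked against a set of observed response times by counting how many fall under Q. The sketch below is only an illustration of the definition: the function name and the sample values are hypothetical, and it treats "under Q" as a strict inequality.

```python
def meets_slo(response_times_ms, q_ms, p_percent):
    """Return True if at least p_percent of the samples are under q_ms."""
    if not response_times_ms:
        raise ValueError("at least one sample is required")
    within = sum(1 for t in response_times_ms if t < q_ms)
    return 100.0 * within / len(response_times_ms) >= p_percent


# Hypothetical response times in milliseconds (illustrative only).
samples = [12, 15, 11, 14, 13, 90, 12, 16, 13, 14]

# Nine of the ten samples are under 20 ms, so a 90% target holds
# while a 95% target does not.
print(meets_slo(samples, q_ms=20, p_percent=90))
print(meets_slo(samples, q_ms=20, p_percent=95))
```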
In a corporate setting, SLOs are used to form service level agreements (SLAs), formal
contracts that govern the service provider-consumer relationship [89]. They consist of
SLOs, and the clauses that describe what happens if the service fails to meet those SLOs
(for example, if the service is only available p′% of the time, where p′ < p). This typically
boils down to service provider paying some penalty (a refund), or providing some form
of free service credits for the users. We do not consider such legal and social obligations
of an SLA in this work, and simply focus on the minimum service levels (i.e. SLOs), and
the associated correctness probabilities, since those are the parameters that matter from
a performance and capacity planning point of view of an application.
Currently, cloud computing systems such as Amazon Web Services (AWS) [3] and
Google App Engine (GAE) [4] advertise SLOs specifying the fraction of availability
over a fixed time period (i.e. uptime) for their services. However, they do not pro-
vide SLOs that state minimum levels of performance. In contrast, Cerebro facilitates
auto-generating performance SLOs for cloud-hosted web APIs in a way that is scalable.
Cerebro uses a combination of static analysis of the hosted web APIs, and runtime mon-
itoring of the PaaS kernel services to determine what minimum statistical guarantee can
be made regarding an API’s response time, with a target probability specified by a PaaS
administrator. These calculated SLOs enable developers to reason about the perfor-
mance of the client applications that consume the cloud-hosted web APIs. They can also
be used to negotiate SLAs concerning the performance of cloud-hosted web applications.
Moreover, predicted SLOs are useful as baselines or thresholds when monitoring APIs for
consistent performance – a feature that is useful for both API providers and consumers.
Collectively, Cerebro and the SLOs predicted by it enable implementing a number of au-
tomated governance scenarios involving policy enforcement and application performance
monitoring, in ways that were not possible before.
Statically reasoning about the execution time of arbitrary programs is challenging
if not unsolvable. Therefore we scale the problem down by restricting our analysis to
cloud-hosted web applications. Specifically, Cerebro generates response time SLOs for
APIs exported by a web application developed using the kernel services available within
a PaaS cloud. For brevity, in this work we will use the term web API to refer to a web-
accessible API exported by an application hosted on a PaaS platform. Further, we will
use the term kernel services to refer to the services that are maintained as part of the
PaaS and available to all hosted applications. This terminology enables us to differentiate
the internal services of the PaaS from the APIs exported by the deployed applications.
For example, an application hosted in Google App Engine might export one or more web
APIs to its users while leveraging the internal datastore kernel service that is available
as part of the Google App Engine PaaS.
Cerebro uses static analysis to identify the PaaS kernel invocations that dominate the
response time of web APIs. By surveying a collection of web applications developed for
a PaaS cloud, we show that such applications indeed spend the majority of their execution
time on PaaS kernel invocations. Further, they do not have many branches and loops,
which makes them amenable to static analysis (Section 4.1). Independently, Cerebro also
maintains a running history of response time performance for PaaS kernel services. It uses
QBETS [90] – a forecasting methodology we have developed in prior work for predicting
bounds on “ill behaved” univariate time series – to predict response time bounds on
each PaaS kernel invocation made by the application. It combines these predictions
dynamically for each static program path through a web API operation, and returns the
“worst-case” upper bound on the time necessary to complete the operation.
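The step of combining per-invocation predictions along each program path and reporting the worst case can be pictured with the simplified sketch below. It assumes that the bound for a path is the plain sum of the predicted bounds of the kernel invocations on it, which is a simplification chosen for illustration; the operation names and bound values are hypothetical, and no QBETS forecasting is performed here.

```python
def path_bound(path, predicted_bounds_ms):
    """Bound for one path: sum of the bounds of its kernel invocations."""
    return sum(predicted_bounds_ms[op] for op in path)


def worst_case_bound(paths, predicted_bounds_ms):
    """Largest bound over all static paths through a web API operation."""
    return max(path_bound(path, predicted_bounds_ms) for path in paths)


# Hypothetical predicted upper bounds per kernel invocation (milliseconds).
predicted = {
    "datastore.get": 30.0,
    "datastore.put": 45.0,
    "memcache.get": 5.0,
}

# Hypothetical program paths found by static analysis of one operation.
paths = [
    ["memcache.get"],                    # cache hit
    ["memcache.get", "datastore.get"],   # cache miss, then read
    ["datastore.get", "datastore.put"],  # read followed by update
]

print(worst_case_bound(paths, predicted), "ms")
```

In this toy input the read-then-update path dominates, so it determines the bound reported for the whole operation.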
Because service implementations and platform behavior under load change over time,
Cerebro’s predictions necessarily have a lifetime. That is, the predicted SLOs may become
invalid after some time. As part of this chapter, we develop a model for detecting such
SLO invalidations. We use this model to investigate the effective lifetime of Cerebro
predictions. When such changes occur, Cerebro can be reinvoked to establish new SLOs
for any deployed web API.
We have implemented Cerebro for both the Google App Engine public PaaS, and
the AppScale private PaaS. Given its modular design and this experience, we believe
that Cerebro can be easily integrated into any PaaS system. We use our prototype
implementation to evaluate the accuracy of Cerebro, as well as the tightness of the bounds
it predicts (i.e. the difference between the predictions and the actual API execution
times). To this end, we carry out a range of experiments using App Engine applications
that are available as open source.
We also detail the duration over which Cerebro predictions hold in both GAE and
AppScale. We find that Cerebro generates correct SLOs (predictions that meet or exceed
their probabilistic guarantees), and that these SLOs are valid over time periods ranging
from 1.4 hours to several weeks. We also find that the high variability of performance in
public PaaS clouds due to their multi-tenancy and massive scale requires that Cerebro
be more conservative in its predictions to achieve the desired level of correctness. In
comparison, Cerebro is able to make much tighter SLO predictions for web APIs hosted
in private, single tenant clouds.
Because Cerebro provides this analysis statically it imposes no run-time overhead
on the applications themselves. It requires no run-time instrumentation of application
code, and it does not require any performance testing of the web APIs. Furthermore,
because the PaaS is scalable and platform monitoring data is shared across all Cerebro
executions, the continuous monitoring of the kernel services generates no discernible load
on the cloud platform. Thus we believe Cerebro is suitable for highly scalable cloud
settings.
Finally, we have developed Cerebro for use with EAGER (Enforced API Governance
Engine for REST) [91] – an API governance system for PaaS clouds. EAGER attempts
to enforce governance policies at the deployment-time of cloud applications. These gover-
nance policies are specified by cloud administrators to ensure the reliable operation of the
cloud and the deployed applications. PaaS platforms include an application deployment
phase during which the platform provisions resources for the application, installs the ap-
plication components, and configures them to use the kernel services. EAGER injects a
policy checking and enforcement step into this deployment workflow, so that only appli-
cations that are compliant with respect to site-specific policies are successfully deployed.
Cerebro allows PaaS administrators to define performance-aware EAGER policies that
allow an application to be deployed only when its web APIs meet a pre-determined SLO,
and developers to be notified by the platform when such SLOs require revision.
We structure the rest of this chapter as follows. We first characterize the domain
of PaaS-hosted web APIs for GAE and AppScale in Section 4.1. We then present the
design of Cerebro in Section 4.2, along with an overview of our software architecture and
prototype implementation. Next, we present our empirical evaluation of Cerebro in Section 4.3.
Finally, we discuss related work (Section 4.4) and conclude (Section 4.5).
4.1 Domain Characteristics and Assumptions
The goal of our work is to analyze a web API statically and, from this analysis alone, without
deploying or running the web API, accurately predict an upper bound on its response
time. With such a prediction, developers and cloud administrators can provide perfor-
mance SLOs to the API consumers (human or programmatic), to help them reason about
the performance implications of using APIs – something that is not possible today. For
general purpose applications, such worst-case execution time analysis has been shown by
numerous researchers to be challenging to achieve for all but simple programs or specific
application domains. To overcome these challenges, we take inspiration from the latter,
and exploit the application domain of PaaS-hosted web APIs to achieve our goal. In this
chapter, we focus on the popular Google App Engine (GAE) public PaaS, and AppScale
private PaaS, which support the same applications, development and deployment model,
and platform services.
The first characteristic of PaaS systems that we exploit to facilitate our analysis is
their predefined programming interfaces through which they export various kernel ser-
vices. Herein we refer to these programming interfaces as the cloud software development
kit or the cloud SDK. The cloud SDK is comprised of several interfaces, each of which
plays the role of a client stub for some kernel service offered by the cloud platform.
We refer to the individual member interfaces of the cloud SDK as cloud SDK inter-
faces, and to their constituent operations as cloud SDK operations. These interfaces
export scalable functionality that is commonly used to implement web APIs: key-value
datastores, caching, task scheduling, security and authentication, etc. In an applica-
tion, each cloud SDK call represents an invocation of a PaaS kernel service. Therefore,
we use the terms cloud SDK calls and PaaS kernel invocations interchangeably in the
remainder of this chapter. The App Engine and AppScale cloud SDK is detailed in
https://cloud.google.com/appengine/docs/java/javadoc/.
With PaaS clouds, developers implement their application code as a combination of
calls to the cloud SDK, and their own code. Developers then upload their applications to
the cloud for deployment. Once deployed, the applications and any web APIs exported
by them can be accessed via HTTP/S requests by external or co-located clients.
Typically, PaaS-hosted web APIs make one or more cloud SDK calls. The reason for
this is two-fold. First, kernel services that underpin the cloud SDK provide web APIs
with much of the functionality that they require. Second, PaaS clouds “sandbox” web
APIs to enforce quotas, to enable billing, and to restrict certain functionality that can
lead to security holes, platform instability, or scaling issues [92]. For example, GAE and
AppScale cloud platforms restrict the application code from accessing the local file sys-
tem, accessing shared memory, using certain libraries, and arbitrarily spawning threads.
Therefore developers must use the provided cloud SDK operations to implement program
logic equivalent to the restricted features. For example, the datastore interface can be
used to read and write persistent data instead of using the local file system, and the
memcache interface can be used in lieu of global shared memory.
Furthermore, the only way for a web API to execute is in response to an HTTP/S
request or as a background task. Therefore, the execution of every web API operation starts and
ends at well-defined program points, and we are able to infer this structure from common
software patterns. Also, concurrency is restricted by capping the number of threads and
requiring that a thread cannot outlive the request that creates it. Finally, PaaS clouds
enforce quotas and limits on kernel service (cloud SDK) use [93, 94, 92]. App Engine,
for example, requires that all web API requests complete under 60 seconds. Otherwise
they are terminated by the platform. Such enforcement places a strict upper bound on
the execution of a web API operation.
To understand the specific characteristics of PaaS-hosted web APIs, and the potential
of this restricted domain to facilitate efficient static analysis and response time prediction,
we next summarize results from static analysis (using the Soot framework [95]) of 35 real
world App Engine web APIs. These web APIs are open source (available via GitHub [96]),
written in Java, and run over Google App Engine or AppScale without modification. We
selected them based on availability of documentation, and the ability to compile and run
them without errors.
Our analysis detected a total of 1458 Java methods in the analyzed programs. Figure 4.1
shows the cumulative distribution of static program paths in these methods. Approximately
97% of the methods considered in the analysis have 10 or fewer static program
paths through them, and 99% of the methods have 36 or fewer paths. However, the
distribution is heavy-tailed: the largest path count observed for a single method is 34992.
We truncate the graph at 100 paths for clarity.
Thus only a very small number of methods contain a large number of paths.
Fortunately, over 65% of the methods have exactly 1 path (i.e. there are no branches).
Figure 4.1: CDF of the number of static paths through methods in the surveyed web APIs.
Next, we consider the looping behavior of web APIs. 1286 of the methods (88%)
considered in the study do not have any loops. 172 methods (12%) contain loops. We
believe that this characteristic is due to the fact that the PaaS SDK and the platform
restrictions like quotas and response time limits discourage looping.
Approximately 29% of all the loops in the analyzed programs do not contain any cloud
SDK calls. A majority of the loops (61%) however, are used to iterate over a dataset
that is returned from the datastore cloud SDK interface of App Engine (i.e. iterating on
the result set returned by a datastore query). We refer to this particular type of loop
as iterative datastore reads.
Table 4.1 lists the number of times each cloud SDK interface is called across all paths
and methods in the analyzed programs. The datastore API is the most commonly used
interface. This is because data management is fundamental to most web APIs, and the
PaaS disallows using the local filesystem to do so for scalability and portability reasons.
Next, we explore the number of cloud SDK calls made along different paths of execution
in the web APIs. For this study we consider all paths of execution through the
methods (64780 total paths). Figure 4.2 shows the cumulative distribution of the number
of SDK calls within paths. Approximately 98% of the paths have 1 cloud SDK call or
fewer. The probability of finding an execution path with more than 5 cloud SDK calls is
smaller than 1%.

Table 4.1: Static cloud SDK calls in surveyed web APIs
Cloud SDK Interface    No. of Invocations
blobstore              7
channel                1
datastore              735
files                  4
images                 3
memcache               12
search                 6
taskqueue              24
tools                  2
urlfetch               8
users                  44
xmpp                   3

Figure 4.2: CDF of cloud SDK call counts in paths of execution.
Finally, our experience with App Engine web APIs indicates that a significant portion
of the total time of a method (web API operation) is spent in cloud SDK calls. Confirming
this hypothesis requires careful instrumentation (i.e. difficult to automate) of the web
API codes. We performed such a test by hand on two representative applications, and
found that the time spent in code other than cloud SDK calls accounts for 0-6% of the
total time (0-3ms for a 30-50ms web API operation).
This study of various characteristics typical of PaaS-hosted web APIs indicates that
there may be opportunities to exploit the specific aspects of this application domain to
simplify analysis, and to facilitate performance SLO prediction. In particular, operations
in these applications are short, have a small number of paths to analyze, implement few
loops, and invoke a small number of cloud SDK calls. Moreover, most of the time spent
executing these operations results from cloud SDK invocations. In the next section,
we describe our design and implementation of Cerebro that takes advantage of these
characteristics and assumptions. We then use a Cerebro prototype to experimentally
evaluate its efficacy for estimating the worst-case response time for applications from
this domain.
4.2 Cerebro
Given the restricted application domain of PaaS-hosted web APIs, we believe that
it is possible to design a system that predicts response time SLOs for them using only
static information from the web API code itself. To enable this, we design Cerebro with
three primary components:
• A static analysis tool that extracts sequences of cloud SDK calls for each path
through a method (web API operation),
• A monitoring agent that runs in the target PaaS, and efficiently monitors the
performance of the underlying cloud SDK operations, and
• An SLO predictor that uses the outputs of these two components to accurately
predict an upper bound on the response time of the web API.
We overview each of these components in the subsections that follow, and then discuss
the Cerebro workflow with an example.
4.2.1 Static Analysis
This component analyzes the source code of the web API (or some intermediate repre-
sentation of it), and extracts a sequence of cloud SDK calls. We implement our analysis
for Java bytecode programs using the Soot framework [95]. Currently, our prototype
analyzer treats the following Java classes as exposed web APIs:
• classes that extend the javax.servlet.http.HttpServlet class (i.e. Java servlet implementations),
• classes that contain JAX-RS @Path annotations, and
• any other classes explicitly specified by the developer in a special configuration file.
Cerebro constructs a control flow graph (CFG) [97, 98, 99, 100] for each web API operation
and performs a simple inter-procedural static analysis over it. The algorithm extracts all
cloud SDK calls along each path through the methods. Cerebro analyzes other functions
that the method calls, recursively. Cerebro caches cloud SDK details for each function
once analyzed so that it can be reused efficiently for other call sites to the same function.
Cerebro does not analyze third-party library calls, if any, which in our experience typically
do not contain cloud SDK calls. Cerebro encodes each cloud SDK call sequence for each
path in a lookup table. We identify cloud SDK calls by their Java package name (e.g.
com.google.appengine.api).
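The path-wise extraction described above can be illustrated with a minimal Python sketch. It is not the Soot-based analyzer: the toy CFG encoding, the function name `sdk_call_sequences`, and the SDK call labels are all hypothetical, and loops are assumed to be handled separately (the walk never revisits a node).

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical toy CFG encoding: node -> (cloud SDK call made at the node or None, successors).
Node = Tuple[Optional[str], List[str]]


def sdk_call_sequences(cfg: Dict[str, Node], entry: str) -> Set[Tuple[str, ...]]:
    """Collect the sequence of SDK calls along every acyclic path from the entry node."""
    sequences: Set[Tuple[str, ...]] = set()

    def walk(node: str, calls: Tuple[str, ...], visited: FrozenSet[str]) -> None:
        call, successors = cfg[node]
        if call is not None:
            calls = calls + (call,)
        unvisited = [s for s in successors if s not in visited]
        if not unvisited:  # exit node reached (back edges are ignored)
            sequences.add(calls)
            return
        for successor in unvisited:
            walk(successor, calls, visited | {successor})

    walk(entry, (), frozenset({entry}))
    return sequences


# Illustrative operation: look up the user, then read from the cache or the datastore.
example_cfg: Dict[str, Node] = {
    "entry": (None, ["check"]),
    "check": ("users.getCurrentUser", ["cached", "query"]),
    "cached": ("memcache.get", ["exit"]),
    "query": ("datastore.query", ["exit"]),
    "exit": (None, []),
}
paths = sdk_call_sequences(example_cfg, "entry")
```

Storing the sequences in a set means that two program paths with the same SDK calls collapse into a single entry.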
To handle loops, we first extract them from the CFG and annotate all cloud SDK
calls that occur within them. We then annotate each such SDK call with an estimate
of the number of times the loop is likely to execute in the worst case. We estimate loop
bounds using a loop bound prediction algorithm based on abstract interpretation [101].
As shown in the previous section, loops in these programs are rare, and when they
do occur, they are used to iterate over a dataset returned from a database. For such
data-dependent loops, we estimate the bounds if specified in the cloud SDK call (e.g.
the maximum number of entities to return [102]). If our analysis is unable to estimate
the bounds for these loops, Cerebro prompts the developer for an estimate of the likely
dataset size and/or loop bounds.
4.2.2 PaaS Monitoring Agent
Cerebro monitors and records the response time of individual cloud SDK operations
within a running PaaS system. Such support can be implemented as a PaaS-native
feature or as a PaaS application (web API); we use the latter in our prototype. The
monitoring agent runs in the background with, but separate from, other PaaS-hosted web
APIs. The agent invokes cloud SDK operations periodically on synthetic datasets, and
records timestamped response times in the PaaS datastore for each cloud SDK operation.
The agent also periodically reclaims old measurement data to eliminate unnecessary
storage. The Cerebro monitoring and reclamation rates are configurable, and monitoring
benchmarks can be added and customized easily to capture common PaaS-hosted web
API coding patterns.
In our prototype, the agent monitors the datastore and memcache SDK interfaces
every 60 seconds. In addition, it benchmarks loop iteration over datastore entities to
capture the performance of iterative datastore reads for datastore result set sizes of 10,
100, and 1000. We limit ourselves to these values because the PaaS requires that all
operations complete (respond) within 60 seconds – so the data sizes (i.e. number of data
entities) returned are typically small. Sizing the datastore in powers of 10
mirrors the typical approach taken by DevOps personnel to approximate the size of a
database. If necessary, our prototype allows adding iterative datastore read benchmarks
for other result set sizes easily.
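The behavior of the monitoring agent can be sketched as follows. This is a simplified illustration under stated assumptions, not the prototype: the class name `MonitoringAgent` and its methods are hypothetical, measurements are kept in memory rather than in the PaaS datastore, and the caller is expected to invoke `benchmark` on a timer (for example every 60 seconds).

```python
import time
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class MonitoringAgent:
    """Records timestamped response times per SDK operation and reclaims old data."""

    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self.retention_seconds = retention_seconds
        self.clock = clock  # injectable so that the sketch can be tested deterministically
        self.series: Dict[str, List[Tuple[float, float]]] = defaultdict(list)

    def benchmark(self, name: str, operation: Callable[[], object]) -> None:
        """Invoke one (synthetic) SDK operation and record its response time in ms."""
        start = time.perf_counter()
        operation()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.series[name].append((self.clock(), elapsed_ms))

    def reclaim(self) -> None:
        """Drop measurements older than the retention window."""
        cutoff = self.clock() - self.retention_seconds
        for name in self.series:
            self.series[name] = [(t, v) for (t, v) in self.series[name] if t >= cutoff]


# Usage with a fake clock: three measurements taken 60 seconds apart.
now = [0.0]
agent = MonitoringAgent(retention_seconds=120, clock=lambda: now[0])
for _ in range(3):
    agent.benchmark("datastore.put", lambda: sum(range(1000)))  # stand-in workload
    now[0] += 60
agent.reclaim()  # the clock reads 180, so the measurement taken at time 0 is dropped
```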
4.2.3 Making SLO Predictions
To make SLO predictions, Cerebro uses Queue Bounds Estimation from Time Series
(QBETS) [90], a non-parametric time series analysis method that we developed in prior
work. We originally designed QBETS for predicting the scheduling delays for the batch
queue systems used in high performance computing environments, but it has proved
effective in other settings where forecasts from arbitrary time series are needed [103,
104, 105]. In particular, it is non-parametric and automatically adapts to changes
in the underlying time series dynamics, making it useful in settings where forecasts are
required from arbitrary data with widely varying characteristics. We adapt it herein for
use “as-a-service” in PaaS systems to predict the execution time of web APIs.
A QBETS analysis requires three inputs:
1. A time series of data generated by a continuous experiment.
2. The percentile for which an upper bound should be predicted (p ∈ [1..99]).
3. The upper confidence level of the prediction (c ∈ (0, 1)).
QBETS uses this information to predict an upper bound for the pth percentile of the
time series. It does so by treating each observation in the time series as a Bernoulli
trial with probability 0.01p of success. Let q = 0.01p. If there are n observations,
the probability of there being exactly k successes is described by a Binomial distribution
(assuming observation independence) having parameters n and q. If Q is the pth percentile
of the distribution from which the observations have been drawn, the equation
$$1 - \sum_{j=0}^{k} \binom{n}{j} \cdot (1 - q)^{j} \cdot q^{n-j} \tag{4.1}$$
gives the probability that more than k observations are greater than Q. As a result, the
kth largest value in a sorted list of n observations gives an upper c confidence bound on
Q when k is the smallest integer value for which Equation 4.1 is larger than c.
More succinctly, QBETS sorts observations in a history of observations, and com-
putes the value of k that constitutes an index into this sorted list that is the upper c
confidence bound on the pth percentile. The methodology assumes that the time series
of observations is ergodic so that, in the long run, the confidence bounds are accurate.
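The order-statistic idea behind this bound can be sketched in Python. This is an illustration, not the QBETS implementation: it omits change-point detection, the function name `percentile_upper_bound` is hypothetical, and it assumes that the returned value should exceed the true pth percentile with probability at least 1 − c. That reading is consistent with the choice of c = 0.01 and the 90-observation minimum history quoted in this chapter.

```python
from math import comb
from typing import Sequence


def percentile_upper_bound(observations: Sequence[float], p: int, c: float) -> float:
    """Smallest order statistic that is >= the p-th percentile with probability >= 1 - c.

    Assumes independent observations, so the number of observations that fall
    below the true percentile follows a Binomial(n, q) distribution with q = 0.01 * p.
    """
    n = len(observations)
    q = 0.01 * p
    ordered = sorted(observations)
    cumulative = 0.0  # P(at most i - 1 of the n observations fall below the percentile)
    for i in range(1, n + 1):
        cumulative += comb(n, i - 1) * q ** (i - 1) * (1 - q) ** (n - i + 1)
        if cumulative >= 1 - c:
            return ordered[i - 1]  # the i-th smallest observation
    raise ValueError("history too short for the requested percentile and confidence")
```

With 90 observations, p = 95 and c = 0.01, the function returns the largest observation; with 89 observations it raises an error because no order statistic reaches the requested confidence.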
QBETS also attempts to detect change points in the time series of observations so
that it can apply this inference technique to only the most recent segment of the series
that appears to be stationary. To do so, it compares percentile bound predictions with
observations throughout the series, and determines where the series is likely to have
undergone a change. It then discards observations from the series prior to this change
point and continues. As a result, when QBETS starts, it must “learn” the series by
scanning it in time series order to determine the change points. We report Cerebro
learning time in our empirical evaluation in subsection 4.3.6.
Note that c is an upper confidence level on the pth percentile, which makes the QBETS
bound estimates conservative. That is, the value returned by QBETS as a bound predic-
tion is larger than the true pth percentile with probability 1 − c under the assumptions
of the QBETS model. In this study, we use the 95th percentile with c = 0.01.
Note that the algorithm itself can be implemented efficiently so that it is suitable for
on-line use. Details of this implementation as well as a fuller accounting of the statistical
properties and assumptions are available in [90, 103, 104, 106].
QBETS requires a sufficiently large number of data points in the input time series
before it can make an accurate prediction. Specifically, the largest value in a sorted
list of n observations is greater than the pth percentile with confidence c when
n ≥ log(c)/log(0.01p).
For example, predicting the 95th percentile of the API execution time, with an upper
confidence of 0.01 requires at least 90 observations. We use this limit as a lower bound
for the length of the history to keep. There is no upper bound for the history length that
QBETS can process. But in Cerebro’s case, several thousand data points in the history
(i.e. 1-3 days of monitoring data) provides a good balance between results accuracy and
computation overhead.
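The minimum history length follows directly from the formula n ≥ log(c)/log(0.01p). A short Python sketch (the function name `min_history_length` is ours) reproduces the figure of 90 observations for the 95th percentile with c = 0.01:

```python
import math


def min_history_length(p: int, c: float) -> int:
    """Smallest n satisfying n >= log(c) / log(0.01 * p)."""
    return math.ceil(math.log(c) / math.log(0.01 * p))


n_95 = min_history_length(95, 0.01)  # log(0.01) / log(0.95) is about 89.8
n_99 = min_history_length(99, 0.01)  # higher percentiles need a longer history
```

Raising the target percentile to the 99th increases the requirement to 459 observations, which shows how quickly the required history grows as p approaches 100.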
The minimum history length also provides a bound on the variability of the time
series that can be tolerated by QBETS. In general, each time series must be approximately
ergodic, meaning that its mean and variance should not change abruptly. More
specifically, if the values in the time series change too fast for QBETS to gather a sta-
tionary dataset at least as long as the minimum history length, its prediction accuracy
may suffer.
4.2.4 Example Cerebro Workflow
Figure 4.3 illustrates how the Cerebro components interact with each other during
the prediction making process. Cerebro can be invoked when a web API is deployed to
a PaaS cloud, or at any time during the development process to give developers insight
into the worst-case response time of their applications.
Figure 4.3: Cerebro architecture and component interactions.
When invoked with the code of a web API, Cerebro performs its static analysis on
all operations in the API. For each analyzed operation it produces a list of annotated
cloud SDK invocation sequences – one sequence per program path. Cerebro then prunes
this list to eliminate duplicates. Duplicates occur when a web API operation has multiple
program paths with the same sequence of cloud SDK invocations. Next, for each pruned
list Cerebro performs the following operations:
1. Retrieve (possibly compressed) benchmarking data from the monitoring agent for
all SDK operations in each sequence. The agent returns ordered time series data
(one time series per cloud SDK operation).
2. Align retrieved time series across operations in time, and sum the aligned values
to form a single joint time series of the summed values for the sequence of cloud
SDK operations.
3. Run QBETS on the joint time series with the desired p and c values to predict an
upper bound.
Cerebro uses the largest predicted value (across path sequences) as its SLO prediction
for a web API operation. The exhaustive approach by which Cerebro predicts SLOs for
all possible program paths ensures that the final SLO remains valid regardless of which
path gets executed at runtime. This SLO prediction process can be implemented as a
co-located service in the PaaS cloud or as a standalone utility. We do the latter in our
prototype.
As an example, suppose that the static analysis results in the cloud SDK invocation
sequence < op1,op2,op3 > for some operation in a web API. Assume that the monitoring
agent has collected the following time series for the three SDK operations:
• op1: [t0 : 5, t1 : 4, t2 : 6, …., tn : 5]
• op2: [t0 : 22, t1 : 20, t2 : 21, …., tn : 21]
• op3: [t0 : 7, t1 : 7, t2 : 8, …., tn : 7]
Here tm is the time at which the mth measurement is taken. Cerebro aligns the three
time series according to timestamps, and sums the values to obtain the following joint
time series: [t0 : 34, t1 : 31, t2 : 35, …., tn : 33]
If any operation is tagged as being inside a loop, where the loop bounds have been
estimated, Cerebro multiplies the time series data corresponding to that operation by
the loop bound estimate before aggregating. In cases where the operation is inside a
data-dependent loop, we request the time series data from the monitoring agent for its
iterative datastore read benchmark for a number of entities that is equal to or larger than
the annotation, and include it in the joint time series.
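The alignment and aggregation step can be sketched in Python using the example values above. The sketch makes two assumptions that the text does not specify: each time series is a mapping from timestamp to response time, and alignment keeps only the timestamps present in every series. The function name `joint_time_series` and the `loop_bounds` parameter are hypothetical.

```python
from typing import Dict, List, Optional


def joint_time_series(series: List[Dict[int, float]],
                      loop_bounds: Optional[List[int]] = None) -> Dict[int, float]:
    """Sum per-operation time series over their common timestamps.

    loop_bounds[i] multiplies the values of series[i]; use 1 for operations
    that are not inside a loop.
    """
    if loop_bounds is None:
        loop_bounds = [1] * len(series)
    common = set(series[0])
    for s in series[1:]:
        common &= set(s)  # keep only timestamps measured for every operation
    return {
        t: sum(bound * s[t] for s, bound in zip(series, loop_bounds))
        for t in sorted(common)
    }


# First three measurements of op1, op2, and op3 from the example.
op1 = {0: 5, 1: 4, 2: 6}
op2 = {0: 22, 1: 20, 2: 21}
op3 = {0: 7, 1: 7, 2: 8}
joint = joint_time_series([op1, op2, op3])
# Hypothetical variant: op3 sits inside a loop with an estimated bound of 10.
joint_with_loop = joint_time_series([op1, op2, op3], loop_bounds=[1, 1, 10])
```

The first call reproduces the joint values 34, 31, and 35 for t0, t1, and t2.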
Cerebro passes the final joint time series for each sequence of operations to QBETS,
which returns the worst-case upper bound response time it predicts. If the QBETS
predicted value is Q milliseconds, Cerebro forms the SLO as “the web API will respond
in under Q milliseconds, p% of the time”. When the web API has multiple operations,
Cerebro estimates multiple SLOs for the API. If a single value is needed for the entire
API regardless of operation, Cerebro returns the largest predicted value as the final SLO
(i.e. the worst-case SLO for the API).
4.2.5 SLO Durability
For a given web API, Cerebro predicts an initial response time SLO at the API’s
deployment-time (following the above workflow). It then consults an on-line API bench-
marking service to continuously verify the predicted response time SLO to determine
if and when it has been violated. SLO violations occur when conditions in the PaaS
change in ways that adversely impact the performance of the cloud SDK operations.
Such changes can result from congestion (multi-tenancy), component failures, and mod-
ifications to PaaS service implementations. The continuous tracking of SLO violations is
necessary to notify the affected API consumers promptly.
Cerebro also periodically recomputes the SLOs for the APIs over time. Cerebro is
able to perform fast, online prediction of time series percentiles via QBETS as more SDK
benchmarking data becomes available from the cloud SDK monitor. This periodic re-
computation of SLOs is important because changes in the PaaS can occur that make new
SLOs available that are better and tighter than the previously predicted ones. Cerebro
must detect when such changes occur so that API consumers can be notified.
To determine SLO durability, we extend Cerebro with a statistical model for detecting
when a Cerebro-generated SLO becomes invalid. Suppose at time t Cerebro predicts value
Q as the p-th percentile of some API’s execution time. If Q is a correct prediction, the
probability of the API’s next measured response time being greater than Q is 1 − 0.01p. If
the time series consists of independent measurements, then the probability of seeing n
consecutive values greater than Q (due to random chance) is (1 − 0.01p)^n. For example,
using the 95th percentile, the probability of seeing 3 values in a row larger than the
predicted percentile due to random chance is (0.05)^3 ≈ 0.00012.
This calculation is conservative with respect to autocorrelation. That is, if the time
series is stationary but autocorrelated, then the number of consecutive values above the
95th percentile that correspond to a probability of 0.00012 is larger than 3. For example,
in previous work [90] using an artificially generated AR(1) series, we observed that 5
consecutive values above the 95th percentile occurred with probability 0.00012 when the
first autocorrelation was 0.5, and 14 when the first autocorrelation was 0.85. QBETS uses
a look-up table of these values to determine the number of consecutive measurements
above Q that constitute a “rare event” indicating a possible change in conditions.
Each time Cerebro makes a new prediction, it computes the current autocorrelation,
and uses the QBETS rare-event look-up table to determine Cw: the number of consecutive
values that constitute a rare event. We measure the time from when Cerebro makes the
prediction until we observe Cw consecutive values above that prediction as being the time
duration over which the prediction is valid. We refer to this duration as the SLO validity
duration.
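A minimal Python sketch of this measurement is shown below. It does not reproduce the QBETS rare-event look-up table, so Cw is passed in as a parameter; the function names and the (timestamp, value) representation of measurements are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def chance_of_run(p: int, n: int) -> float:
    """Probability of n consecutive independent values above a correct p-th percentile."""
    return (1 - 0.01 * p) ** n


def slo_validity_duration(measurements: Sequence[Tuple[float, float]],
                          predicted_at: float, bound: float, cw: int) -> Optional[float]:
    """Time from the prediction until cw consecutive measurements exceed the bound.

    measurements are (timestamp, response time) pairs in time order. Returns
    None if the prediction is never invalidated within the measurements.
    """
    run = 0
    for timestamp, value in measurements:
        if timestamp < predicted_at:
            continue  # only measurements taken after the prediction count
        run = run + 1 if value > bound else 0
        if run == cw:
            return timestamp - predicted_at
    return None


# Illustrative data: one measurement per minute against a 35 ms prediction.
samples = [(0, 30), (60, 40), (120, 31), (180, 41), (240, 42), (300, 43), (360, 30)]
duration = slo_validity_duration(samples, predicted_at=0, bound=35, cw=3)
```

In this example the isolated exceedance at 60 seconds resets the count, and the prediction is considered invalid only after the third consecutive exceedance at 300 seconds.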
4.2.6 SLO Reassessment
We extend Cerebro with an SLO reassessment process that invalidates SLOs at the
end of the SLO validity duration, and provides a new SLO for the API consumer. API
consumers receive an initial SLO for a web API hosted by a Cerebro-equipped PaaS as
part of the API subscription process (i.e. when obtaining API keys). This initial SLO may
be issued in the form of an SLA that is negotiated between the API provider and the con-
sumer. At this point Cerebro records the tuple < Consumer,API,Timestamp,SLO >.
When Cerebro detects consecutive violations of one of its predictions, it considers the
corresponding SLO to be invalid, and provides the affected API consumers with a new
SLO. Upon this SLO change, Cerebro updates the Timestamp and SLO entries in the
appropriate data tuple for future reference.
There is also a second way that an API consumer may encounter an SLO change.
When recomputing SLOs periodically, Cerebro might come across situations where the
latest SLO is smaller than some previously issued SLO (i.e. a tighter SLO is available).
Cerebro can notify the API consumer about this prospect. If the API consumer consents
to the SLO change, Cerebro may update the data tuple, and treat the new SLO as in
effect.
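The bookkeeping behind the two kinds of SLO change can be sketched as follows. The record mirrors the tuple < Consumer, API, Timestamp, SLO >; the class and function names are hypothetical, and consumer consent is modeled as a plain boolean.

```python
from dataclasses import dataclass


@dataclass
class SloRecord:
    consumer: str
    api: str
    timestamp: float  # time at which the current SLO was issued
    slo_ms: float     # response time bound currently in effect


def replace_after_violation(record: SloRecord, new_slo_ms: float, now: float) -> None:
    """Case 1: consecutive violations invalidated the SLO, so a new one is issued."""
    record.timestamp = now
    record.slo_ms = new_slo_ms


def offer_tighter_slo(record: SloRecord, latest_slo_ms: float, now: float,
                      consumer_consents: bool) -> bool:
    """Case 2: a tighter SLO is available and takes effect only with consent."""
    if latest_slo_ms < record.slo_ms and consumer_consents:
        record.timestamp = now
        record.slo_ms = latest_slo_ms
        return True
    return False


# Illustrative values only.
record = SloRecord(consumer="client-1", api="StudentInfo", timestamp=0.0, slo_ms=40.0)
replace_after_violation(record, new_slo_ms=55.0, now=3600.0)
accepted = offer_tighter_slo(record, latest_slo_ms=45.0, now=7200.0, consumer_consents=True)
```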
We next use empirical testing and simulations to explore the feasibility of the Cerebro
SLO reassessment process, and evaluate how SLO validity duration and invalidation
impact API consumers over time.
4.3 Experimental Results
To empirically evaluate Cerebro, we conduct experiments using five open source,
Google App Engine applications.
• StudentInfo: RESTful (JAX-RS) application for managing students of a class (adding,
removing, and listing student information).
• ServerHealth: Monitors, computes, and reports statistics for server uptime for a given
web URL.
• SocialMapper: A simple social networking application with APIs for adding users and
comments.
• StockTrader: A stock trading application that provides APIs for adding users, registering
companies, buying and selling stocks among users.
• Rooms: A hotel booking application with APIs for registering hotels and querying available
rooms.
These web APIs use the datastore cloud SDK interface extensively. The Rooms web
API also uses the memcache interface. We focus on these two interfaces exclusively in
this study. We execute these applications in the Google App Engine public cloud (SDK
v1.9.17) and in an AppScale (v2.0) private cloud. We instrument the programs to collect
execution time statistics for verification purposes only (the instrumentation data is not
used to predict the SLOs). The AppScale private cloud used for testing was hosted using
four “m3.2xlarge” virtual machines running on a private Eucalyptus [6] cloud.
We first report the time required for Cerebro to perform its analysis and SLO predic-
tion. Across web APIs, Cerebro takes 10.00 seconds on average, with a maximum time
of 14.25 seconds for the StudentInfo application. These times include the time taken
by the static analyzer to analyze all the web API operations, and the time taken by
QBETS to make predictions. For these results, the length of the time series collected by
the PaaS monitoring agent is 1528 data points (25.5 hours of monitoring data). Since the
QBETS analysis time depends on the length of the input time series, we also measured
the time for 2 weeks of monitoring data (19322 data points) to provide some insight into
the overhead of SLO prediction. Even in this case, Cerebro requires only 574.05 seconds
(9.6 minutes) on average.
4.3.1 Correctness of Predictions
We first evaluate the correctness of Cerebro predictions. A set of predictions is correct
if the fraction of measured response time values that fall below the Cerebro prediction
is greater than or equal to the SLO success probability. For example, if the SLO success
probability is 0.95 (i.e. p = 95 in QBETS) for a specific web API, then the Cerebro
predictions are correct if at least 95% of the response times measured for the web API
are smaller than their corresponding Cerebro predictions.
We benchmark each web API for a period of 15 to 20 hours. During this time we
run a remote HTTP client that makes requests to the web APIs once every minute. The
application instrumentation measures and records the response time of the API operation
for each request (i.e. within the application). Concurrently, and within the same PaaS
system, we execute the Cerebro PaaS monitoring agent, which is an independently hosted
application within the cloud that benchmarks each SDK operation once every minute.
Our test request rate (1 request/minute) is not sufficient to put the backend cloud
servers under any stress. However, cloud platforms like Google App Engine and AppScale
are highly scalable. When the load increases, they automatically spin up new backend
servers, and keep the average response time of deployed web APIs steady. This
enables us to measure and evaluate the correctness of the Cerebro predictions under
light load conditions. Note that our cloud SDK benchmarking rate at the cloud SDK
monitor is also 1 request per minute. We assume that the time series of cloud SDK
performance is ergodic (i.e. stationary over a long period). Under that assumption,
QBETS is insensitive to the measurement frequency, and a higher benchmarking rate
would not significantly change the results.
Cerebro predicts the web API execution times using only the cloud SDK benchmarking
data collected by Cerebro’s PaaS monitoring agent. We configure Cerebro to predict
an upper bound for the 95th percentile of the web API response time, with an upper
confidence of 0.01.
QBETS generates a prediction for every value in the input time series (one per
minute). Cerebro reports the last one as the SLO prediction to the user or PaaS administrator
in production. However, having per-minute predictions enables us to compare
these predictions against actual web API execution times measured during the same time
period to evaluate Cerebro correctness. More specifically, we associate with each measurement
the prediction from the prediction time series that most nearly precedes it in time.
The correctness fraction is computed from a sample of 1000 prediction-measurement
pairs.
Figure 4.4: Cerebro correctness percentage in Google App Engine and AppScale cloud
platforms.
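The pairing rule and the correctness fraction described above can be sketched as follows. This is a minimal illustration, not Cerebro's actual code: the function name, the (timestamp, value) tuple layout, and the choice to treat a prediction made at exactly the measurement time as "preceding" it are all assumptions.

```python
from bisect import bisect_right


def correctness_fraction(predictions, measurements):
    """Fraction of measurements that fall below their paired prediction.

    predictions: list of (timestamp, predicted_ms), sorted by timestamp.
    measurements: list of (timestamp, measured_ms), in any order.
    Each measurement is paired with the latest prediction made at or
    before its timestamp (assumed reading of "most nearly precedes").
    """
    pred_times = [t for t, _ in predictions]
    paired = 0
    correct = 0
    for t, measured in measurements:
        i = bisect_right(pred_times, t) - 1
        if i < 0:
            continue  # no prediction precedes this measurement
        paired += 1
        if measured < predictions[i][1]:
            correct += 1
    return correct / paired if paired else float("nan")
```

A set of predictions would then be judged correct for an SLO success probability of 0.95 when the returned fraction is at least 0.95.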
Figure 4.4 shows the final results of this experiment. Each of the columns in Figure 4.4
corresponds to a single web API operation in one of the sample applications. The
columns are labeled in the form of ApplicationName#OperationName, a convention we
will continue to use in the rest of the section. To maintain clarity in the figures we do
not illustrate the results for all web API operations in the sample applications. Instead
we present the results for a selected set of web API operations covering all five sample
applications. We note that other web API operations we tested also produce very similar
results.
Since we are using Cerebro to predict the 95th percentile of the API response times,
Cerebro’s predictions are correct when at least 95% of the measured response times are
less than their corresponding predicted upper bounds. According to Figure 4.4, Cerebro
achieves this goal, or comes within a fraction of a percentage point of it, for all the
applications in both cloud environments. The lowest correctness percentage observed in
our tests is 94.6% (in the case of StockTrader#buy on AppScale), which is very close to
the target of 95%. Such minor lapses below 95% are acceptable, since we expect the
correctness percentage to fluctuate gently around some average value over time (a
phenomenon that will be explained in our later results). Overall, this result shows us that
Cerebro produces highly accurate SLO predictions for a variety of applications running on
two very different cloud platforms.
The web API operations illustrated in Figure 4.4 cover a wide spectrum of scenarios
that may be encountered in the real world. StudentInfo#getStudent and StudentInfo#addStudent
are by far the simplest operations in the mix. They invoke a single cloud SDK operation
each, and perform a simple datastore read and a simple datastore write respectively. As
per our survey results, these alone cover a significant portion of the web APIs developed
for the App Engine and AppScale cloud platforms (1 path through the code, and 1 cloud
SDK call). The StudentInfo#deleteStudent operation invokes two cloud SDK operations
in sequence, whereas StudentInfo#getAllStudents performs an iterative datastore read.
In our experiment with StudentInfo#getAllStudents, we had the datastore preloaded
with 1000 student records, and Cerebro was configured to use a maximum entity count
of 1000 when making predictions.
ServerHealth#info invokes the same cloud SDK operation three times in sequence.
Both StockTrader#buy and StockTrader#sell have multiple paths through the application
(due to branching), thus causing Cerebro to make multiple sequences of predictions
– one sequence per path. The results shown in Figure 4.4 are for the longest paths
which consist of seven cloud SDK invocations each. According to our survey, 99.8% of
the execution paths found in Google App Engine applications have seven or fewer cloud
SDK calls in them. Therefore we believe that the StockTrader web API represents an
important upper bound case.
Rooms#getRoomByName invokes two different cloud SDK interfaces, namely datastore
and memcache. Rooms#getAllRooms is another operation that consists of an
iterative datastore read. In this case, we had the datastore preloaded with 10 entities,
and Cerebro was configured to use a maximum entity count of 10.
4.3.2 Tightness of Predictions
In this section we discuss the tightness of the predictions generated by Cerebro.
Tightness is a measure of how closely the predictions bound the actual response times of
the web APIs. Note that it is possible to perfectly achieve the correctness goal by simply
predicting overly large values for web API response times. For example, if Cerebro were
to predict a response time of several years for exactly 95% of the web API invocations
and zero for the others, it would likely achieve a correctness percentage of 95%. From a
practical perspective, however, such an extreme upper bound is not useful as an SLO.
Figure 4.5 depicts the average difference between predicted response time bounds and
actual response times for our sample web APIs when running in the App Engine and
AppScale clouds. These results were obtained considering a sequence of 1000 consecutive
predictions (of 95th percentile), and the averages are computed only for correct predictions
(i.e. ones above their corresponding measurements).
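As a rough illustration of the tightness metric, the sketch below averages the gap between prediction and measurement over correct predictions only. The function name and the (predicted, measured) pair layout are assumptions made for this example.

```python
def average_tightness(pairs):
    """Average of (prediction - measurement) over correct predictions.

    pairs: iterable of (predicted_ms, measured_ms).
    Only pairs whose prediction lies above the measurement contribute,
    mirroring the "correct predictions only" rule in the text.
    """
    diffs = [predicted - measured for predicted, measured in pairs
             if predicted > measured]
    return sum(diffs) / len(diffs) if diffs else float("nan")
```

Smaller values indicate tighter bounds. Incorrect predictions are excluded so that they do not pull the average down artificially.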
According to Figure 4.5, Cerebro generates fairly tight SLO predictions for most web
API operations considered in the experiments. In fact, 14 out of the 20 cases illustrated
in the figure show average difference values less than 65ms. In a few cases, however, the
bounds differ from the average measurement substantially:
• StudentInfo#getAllStudents on both cloud platforms
• ServerHealth#info, SocialMapper#addComment, StockTrader#buy and StockTrader#sell
on App Engine
Figure 4.5: Average difference between predictions and actual response times in
Google App Engine and AppScale. The y-axis is in log scale.
To understand why Cerebro generates conservative predictions for some operations, we
further investigate their performance characteristics. We take the StudentInfo#getAllStudents
operation on App Engine as a case study, and analyze its execution time measurements
in depth. This is the case which exhibits the largest average difference between predicted
and actual execution times.
Figure 4.6 shows the empirical cumulative distribution function (CDF) of measured
execution times for the StudentInfo#getAllStudents operation on Google App Engine. This
distribution was obtained by considering the application’s instrumentation results gathered
within a window of 1000 minutes. The average of this sample is 3431.79ms, and the 95th
percentile from the CDF is 4739ms. Thus, taken as a distribution, the “spread” between
the average and the 95th percentile is more than 1300ms.
Figure 4.6: CDF of measured execution times of the StudentInfo#getAllStudents
operation on App Engine.
From this, it becomes evident that StudentInfo#getAllStudents records very high
execution times frequently. In order to incorporate such high outliers, Cerebro must be
conservative and predict large values for the 95th percentile. This is a required feature
to ensure that 95% or more API invocations have execution times under the predicted
SLO. But as a consequence, the average distance between the measurements and the
predictions increases significantly.
We omit a similar analysis of the other cases in the interest of brevity but summarize
the tightness results as indicating that Cerebro achieves a bound that is “tight” with
respect to the percentiles observed by sampling the series for long periods.
Another interesting observation we can make regarding the tightness of predictions is
that the predictions made in the AppScale cloud platform are significantly tighter than
the ones made in Google App Engine (Figure 4.5). For nine out of the ten operations
tested, Cerebro has generated tighter predictions in the AppScale environment. This
is because web API performance on AppScale is far more stable and predictable thus
resulting in fewer measurements that occur far from the average.
AppScale’s performance is more stable over time because it is
deployed on a closely controlled and monitored cluster of virtual machines (VMs)
that use a private Infrastructure-as-a-Service (IaaS) cloud to implement isolation. In
particular, the VMs assigned to AppScale do not share nodes with “noisy neighbors” in
our test environment. In contrast, Google App Engine does not expose the performance
characteristics of its multi-tenancy. While it operates at vastly greater scale, our test
applications also exhibit wider variance of web API response time when using it. Cerebro,
however, is able to predict correct and tight SLOs for applications running on either
platform: the lower-variance private AppScale PaaS, and the extreme-scale but more
variable Google App Engine PaaS.
4.3.3 SLO Validity Duration
To be of practical value to PaaS administration, the duration over which a Cerebro
prediction remains valid must be long enough to allow appropriate remedial action when
load conditions change, and the SLO is in danger of being violated. In particular, SLOs
must remain correct for at least the time necessary to allow human responses to changing
conditions such as the commitment of more resources to web APIs that are in violation or
alerts to support staff that customers may be calling to claim SLO breach (which likely
resulted in a higher level SLA violation). Ideally, each prediction should persist as correct
for several hours or more to match staff response time to potential SLO violations.
However, determining when a Cerebro-predicted SLO becomes invalid is potentially
complex. For example, given the definition of correctness described in subsection 4.3.1,
it is possible to report an SLO violation when the running tabulation of correctness
percentage falls below the target probability (when expressed as a percentage). However, if
this metric is used, and Cerebro is correct for many consecutive measurements, a sudden
change in conditions that causes the response time to persist at a higher level will
not immediately trigger a violation. For example, Cerebro might be correct for several
consecutive months and then incorrect for several consecutive days before the overall
correctness percentage drops below 95%, and a violation is detected. If the SLO is measured
over a year, such time scales may be acceptable but we believe that PaaS administrators
would consider such a long period of time where the SLOs were continuously in violation
unacceptable. Thus, to measure the duration over which a prediction remains valid, we adopt
the more conservative approach described in section 4.2.5 rather than simply measuring the
time until the correctness percentage drops below the SLO-specified value. Tables 4.2
and 4.3 present these durations for Cerebro predictions in Google App Engine and AppScale
respectively. These results were calculated by analyzing a trace of data collected
over 7 days.
Table 4.2: Prediction validity period distributions of different operations in App Engine.
Validity durations were computed by observing 3 consecutive SLO violations.
5th and 95th columns represent the 5th and 95th percentiles of the distributions
respectively. All values are in hours.
Operation                    5th     Average   95th
StudentInfo#getStudent       7.15    70.72     134.43
StudentInfo#deleteStudent    2.55    37.97     94.37
StudentInfo#addStudent       1.45    26.8      64.78
ServerHealth#info            1.41    39.22     117.71
Rooms#getRoomByName          7.24    70.47     133.36
Rooms#getRoomsInCity         2.08    30.12     82.58
From Table 4.2 the average validity duration for all 6 operations considered in App
Engine is longer than 24 hours. The lowest average value observed is 26.8 hours, and
that is for the StudentInfo#addStudent operation. If we just consider the 5th percentiles
of the distributions, they are also longer than 1 hour. The smallest 5th percentile value of
1.41 hours is given by the ServerHealth#info operation. This result implies that, based
on our conservative model for detecting SLO violations, Cerebro predictions made on
Google App Engine would be valid for at least 1.41 hours, at least 95% of the
time.
Table 4.3: Prediction validity period distributions of different operations in AppScale.
Validity periods were computed by observing 3 consecutive SLO violations. 5th and
95th columns represent the 5th and 95th percentiles of the distributions respectively.
All values are in hours.
Operation                    5th     Average   95th
StudentInfo#getStudent       6.1     60.67     115.24
StudentInfo#deleteStudent    6.08    60.21     114.32
StudentInfo#addStudent       6.1     60.67     115.24
ServerHealth#info            6.29    54.53     108.14
Rooms#getRoomByName          6.07    59.18     112.28
Rooms#getRoomsInCity         1.95    33.77     84.63
By comparing the distributions for different operations we can conclude that API
operations that perform a single basic datastore or memcache read tend to have longer
validity durations. In other words, those cloud SDK operations have fairly stable performance
characteristics in Google App Engine. This is reflected in the 5th percentiles of
StudentInfo#getStudent and Rooms#getRoomByName. In contrast, operations that
execute writes, iterative datastore reads or long sequences of cloud SDK operations have
shorter prediction validity durations.
For AppScale, the smallest average validity duration of 33.77 hours is observed from
the Rooms#getRoomsInCity operation. All other operations tested in AppScale have
average prediction validity durations greater than 54 hours. The lowest 5th percentile
value in the distributions, which is 1.95 hours, is also shown by Rooms#getRoomsInCity.
This means that the SLOs predicted for AppScale would hold correct for at least 1.95 hours,
at least 95% of the time. The relatively smaller validity duration values computed
for the Rooms#getRoomsInCity operation indicate that the performance of iterative
datastore reads is subject to some variability in AppScale.
4.3.4 Long-term SLO Durability and Change Frequency
In this section we further analyze how the Cerebro-predicted SLOs change over long
periods of time (e.g. several months). Our goal is to understand the frequency with
which Cerebro’s auto-generated SLOs get updated due to the changes that occur in
the cloud platform, and the time duration between these update events. That is, we
assess the number of times an API consumer is prompted with an updated SLO, thereby
potentially initiating SLA renegotiations.
To enable this, we deploy Cerebro’s cloud SDK monitoring agent in the Google App
Engine cloud, and benchmark the cloud SDK operations every 60 seconds for 112 days.
We then use Cerebro to make SLO predictions (95th percentile) for a set of test web
applications. Note that we conduct this long-term experiment only on App Engine, which
according to our previous results gives shorter SLO validity durations than AppScale.
Cerebro analyzes the benchmarking results collected by the cloud SDK monitor, and
generates sequences of SLO predictions for the web APIs of each application. Each
prediction sequence is a time series that spans the duration in which the cloud SDK
monitor was active in the cloud. Each prediction is timestamped. Therefore given any
timestamp that falls within the 112 day period of the experiment, we can find an SLO
prediction that is closest to it. Further, we associate each prediction with an integer value
(Cw) which indicates the consecutive number of SLO violations that should be observed,
before we may consider the prediction to be invalid.
We also estimate the actual web API response times for the test applications. This is
done by simply summing up the benchmarking data gathered by the cloud SDK monitor.
Again, we assume that the time spent on non cloud SDK operations is negligible. For
example, consider a web API that executes the cloud SDK operations O1, O2 and O1 in
that order. Now suppose the cloud SDK monitor has gathered following benchmarking
results for O1 and O2:
• O1: [t1 : x1, t2 : x2, t3 : x3…]
• O2: [t1 : y1, t2 : y2, t3 : y3…]
Here ti are timestamps at which the benchmark operations were performed. xi and
yi are execution times of the two SDK operations measured in milliseconds. Given this
benchmarking data, we can calculate the time series of actual response time of the API
as follows:
[t1 : 2x1 + y1, t2 : 2x2 + y2, t3 : 2x3 + y3…]
The coefficient 2 that appears with each xi term accounts for the fact that our web
API invokes O1 twice. In this manner, we can combine the static analysis results of
Cerebro with the cloud SDK benchmarking data to obtain a time series of estimated
actual response times for all web APIs in our sample applications.
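The summation described above can be sketched in a few lines. This is an illustrative example only: the function name, the use of a dictionary keyed by timestamp for each operation's benchmark series, and the decision to keep only timestamps present in every series are assumptions.

```python
from collections import Counter


def estimate_response_times(call_sequence, benchmarks):
    """Estimate a web API's response time series from SDK benchmarks.

    call_sequence: SDK operations invoked by the API, e.g. ["O1", "O2", "O1"].
    benchmarks: dict mapping operation name -> {timestamp: execution_ms}.
    Time spent outside cloud SDK calls is assumed negligible, as in the text.
    """
    counts = Counter(call_sequence)  # e.g. {"O1": 2, "O2": 1}
    if not counts:
        return {}
    # Only timestamps benchmarked for every invoked operation are usable.
    common = set.intersection(*(set(benchmarks[op]) for op in counts))
    return {t: sum(n * benchmarks[op][t] for op, n in counts.items())
            for t in sorted(common)}
```

For the O1, O2, O1 example, each output entry equals twice the O1 benchmark plus the O2 benchmark at that timestamp.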
Having obtained a time series of SLO predictions (Tp) and a time series of actual
response times (Ta) for each web API, we perform the following computation. From Tp
we pick a pair < s0, t0 >, where s0 is a predicted SLO value and t0 is the timestamp
associated with it. Then starting from t0, we scan the time series Ta to detect the earliest
point in time at which we can consider the predicted SLO value s0 as invalid. This is done
by comparing s0 against each entry in Ta that has a timestamp greater than or equal to
t0, until we see Cw consecutive entries that are larger than s0. Here Cw is the rare event
threshold computed by Cerebro when making SLO predictions. Having found such an
SLO invalidation event at time t′, we record the duration t′ − t0 (i.e. the SLO validity
duration), and increment the counter invalidations, which starts from 0. Then we pick
the pair < s1, t1 > from Tp where t1 is the smallest timestamp greater than or equal to
t′, and s1 is the predicted SLO value at that timestamp. Then we scan Ta starting from
t1, until we detect the next SLO invalidation (for s1). We repeat this process until we
exhaust either Tp or Ta. At the end of this computation we have a distribution of SLO
validity periods, and the counter invalidations indicates the number of SLO invalidations
we encountered in the process.
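The scan described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the experiments: the function name and tuple layouts are hypothetical, the invalidation time t′ is taken to be the timestamp of the Cw-th consecutive violation, and a small guard forces progress if t′ happens to coincide with the current starting point.

```python
from bisect import bisect_left


def simulate_consumer(tp, ta, start=0):
    """Simulate the SLO changes seen by one API consumer.

    tp: list of (timestamp, slo, cw) sorted by timestamp (series Tp, with
        the rare event threshold Cw attached to each prediction).
    ta: list of (timestamp, actual) sorted by timestamp (series Ta).
    start: index in tp of the first SLO given to the consumer.
    Returns (validity_periods, invalidations).
    """
    tp_times = [t for t, _, _ in tp]
    ta_times = [t for t, _ in ta]
    validity_periods = []
    invalidations = 0
    i = start
    while i < len(tp):
        t0, s0, cw = tp[i]
        run = 0  # current streak of consecutive violations
        t_invalid = None
        for j in range(bisect_left(ta_times, t0), len(ta)):
            t, actual = ta[j]
            run = run + 1 if actual > s0 else 0
            if run >= cw:
                t_invalid = t  # assumed definition of t'
                break
        if t_invalid is None:
            break  # Ta exhausted without another invalidation
        validity_periods.append(t_invalid - t0)
        invalidations += 1
        nxt = bisect_left(tp_times, t_invalid)  # smallest timestamp >= t'
        i = nxt if nxt > i else i + 1  # guard: always move forward
    return validity_periods, invalidations
```

Calling the function once per starting index, for example `[simulate_consumer(tp, ta, k) for k in range(len(tp) - 1)]`, would correspond to treating each entry of Tp as the subscription point of a different API consumer.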
This experimental process simulates how a single API consumer encounters SLO
changes. Selecting the first pair of values < s0, t0 > represents the API consumer receiving
an SLO for the first time (i.e. at API subscription). When this SLO becomes invalid,
the API consumer receives a new SLO, which is represented by the selection of the pair
< s1, t1 >. Therefore, when the simulation reaches the end of the time series, we can
determine how many times the API consumer observed changes to the SLO (given by
invalidations). The recorded SLO validity periods give an indication of the time between
these SLO change events.
For a given web API we perform the above simulation many times, using each entry
in Tp as a starting point. That is, in each run we change our selection of < s0, t0 > to be
a different entry in Tp. This way, for a time series comprised of n entries, we can run the
simulation n−1 times, discarding the last entry. We can assume that each simulation run
corresponds to a different API consumer. Therefore, at the end of a complete execution
of the experiment we have the number of SLO changes for many different API consumers,
and the empirical SLO validity period distributions for each of them.
The smallest n we encountered in all our experiments was 125805. That is, we repeatedly
simulated each web API SLO trace for at least 125804 API consumers. Similarly,
the largest number of API consumers we performed the simulation for is 145130.
We now present the experimental results obtained using this methodology. We analyze
the number of SLO changes observed by each API consumer during the 112 day
period of the experiment, and calculate a set of cumulative distribution functions (CDF).
These CDFs describe the probability of finding an API consumer that experienced at most
a given number of SLO change events. Figure 4.7 presents the CDFs. We use the convention
ApplicationName#Operation to label individual web API operations.
Figure 4.7: CDF of the number of SLO change events faced by API consumers.
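An empirical CDF of this kind can be tabulated directly from the per-consumer change counts. The sketch below is illustrative only, and the function name and input format are assumptions.

```python
from collections import Counter


def change_count_cdf(counts):
    """Empirical CDF of the number of SLO changes per API consumer.

    counts: list with one integer per simulated consumer.
    Returns {k: fraction of consumers with at most k SLO changes}.
    """
    n = len(counts)
    freq = Counter(counts)
    cdf = {}
    cumulative = 0
    for k in sorted(freq):
        cumulative += freq[k]
        cdf[k] = cumulative / n
    return cdf
```

Reading the result at k = 4, for example, gives the fraction of consumers that saw no more than 4 SLO changes.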
According to Figure 4.7, the largest number of SLO changes experienced by any user
is 6. This is with regard to the StudentInfo#addStudent operation. Across all web APIs,
at least 96% of the API consumers experience no more than 4 SLO changes during the
period of 112 days. Further, at least 76% of the API consumers see no more than 3
SLO changes. These statistics indicate that SLOs predicted by Cerebro for Google App
Engine are stable over time, and reassessment is required only rarely. From an API
consumer’s perspective this is a highly desirable property, since it reduces the frequency
of SLO changes, which reduces the potential SLA renegotiation overhead.
Next we analyze the time duration between SLO change events. For this we combine
the SLO validity periods computed for different API consumers into a single statistical
distribution. Table 4.4 shows the 5th percentile, mean, and 95th percentile of these
combined distributions.
Operation                    5th     Mean      95th
StudentInfo#getStudent       12.97   631.24    1911.19
StudentInfo#deleteStudent    7.65    472.07    2031.59
StudentInfo#addStudent       0.05    458.24    1711.08
ServerHealth#info            12.96   630.01    1911.19
Rooms#getRoomByName          8.48    345.13    1096.53
Rooms#getRoomsInCity         20.56   296.44    1143.45
Stocks#buy                   8.46    411.75    815.5
Table 4.4: Prediction validity period distributions (in hours). 5th and 95th columns
represent the 5th and 95th percentiles of the distributions respectively.
The smallest mean SLO validity period observed in our experiments is 296.44 hours
(12.35 days). This value is given by the Rooms#getRoomsInCity operation. This implies
that on average, API consumers do not see a change in Cerebro-predicted SLOs for at
least 12.35 days. Similarly, we observed the largest mean SLO validity period of 26.3 days
with the StudentInfo#getStudent operation. The smallest 5th percentile value of 0.05
hours is shown by the StudentInfo#addStudent operation, but this appears to be a special
case compared to the other web API operations. The second smallest 5th percentile value
of 7.65 hours is shown by the StudentInfo#deleteStudent operation. Therefore, ignoring
the StudentInfo#addStudent operation, API consumers observe SLO validity periods
longer than 7.65 hours at least 95% of the time. That is, the time between SLO changes
is greater than 7.65 hours at least 95% of the time.
To reduce the number of SLO changes further, we observe that we can exploit the
SLO change events in which the difference between an invalidated SLO and a new SLO
is small. In such cases, it is of little use to provide a new SLO, and API consumers
may be content to continue with the old SLO. To incorporate this behavior into Cerebro
(and our simulation process), we introduce a threshold value slo delta threshold into the
process. This parameter takes a percentage value that represents the minimum acceptable
percentage difference between the old and new SLO values before renegotiation. If the
percentage difference between the two SLO values is below this threshold, we do not
record the SLO validity period, nor increment the count of the SLO invalidations. That
is, we do not consider such cases as SLO change events. We simply carry on with the
old SLO value until we come across an invalidation event with a percentage difference
that exceeds the threshold. Note that our previous experiments are a special case of
thresholding for which slo delta threshold is 0.
Figure 4.8: CDF of the number of SLO change events faced by API consumers, when
slo delta threshold = 10%
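The thresholding rule can be expressed as a small predicate. This is a sketch under assumptions: the function name is hypothetical, the percentage difference is measured relative to the old SLO value, and a difference exactly equal to the threshold is counted as a change so that a threshold of 0 reproduces the unthresholded experiments.

```python
def is_slo_change(old_slo, new_slo, slo_delta_threshold=0.0):
    """Decide whether an invalidation counts as an SLO change event.

    old_slo, new_slo: SLO values (e.g. in milliseconds); old_slo must be > 0.
    slo_delta_threshold: minimum percentage difference (0 to 100) required
        before the consumer is prompted with the new SLO.
    """
    if old_slo <= 0:
        raise ValueError("old_slo must be positive")
    pct_diff = abs(new_slo - old_slo) * 100.0 / old_slo
    return pct_diff >= slo_delta_threshold
```

When the predicate returns False, the simulation would keep the old SLO value and continue scanning for the next invalidation event.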
Next we evaluate the sensitivity of our results to slo delta threshold. Figure 4.8 shows
the resulting CDFs of per-user renegotiation count when the threshold is 10%. That is,
Cerebro does not prompt the API consumer with an SLO change unless the new SLO
differs from the old one by at least 10%. In this case, the maximum number of SLO change
events drops from 6 to 5. Also, most of the probabilities shift slightly upwards. For
instance, now more than 82% of the users see 3 or fewer renegotiation events (as opposed
to 76%).
Operation                    5th     Mean      95th
StudentInfo#getStudent       19.93   644.58    1911.19
StudentInfo#deleteStudent    7.93    512.52    2031.59
StudentInfo#addStudent       0.05    491.68    1711.08
ServerHealth#info            19.91   643.33    1911.19
Rooms#getRoomByName          8.48    392.01    1096.53
Rooms#getRoomsInCity         21.82   304.97    1143.45
Stocks#buy                   7.41    510.31    1277.7
Table 4.5: Prediction validity period distributions (in hours) when slo delta threshold
= 10%. 5th and 95th columns represent the 5th and 95th percentiles of the distributions
respectively.
Table 4.5 shows the SLO validity period distributions computed when slo delta threshold
is 10%. Here, as expected, most of the mean and 5th percentile values have increased
slightly from their original values. The smallest mean value recorded in the table is
304.97 hours. We have also considered a slo delta threshold value of 20%. This change
introduces only small shifts in the probability values of the CDFs (more than 84% of the
users see 3 or fewer renegotiations), and the maximum number of renegotiations remains
at 5.
In summary, we find that the performance SLOs predicted by Cerebro for the Google
App Engine cloud environment are stable over time. That is, the predictions are valid
for long periods of time, and API consumers do not observe SLO changes often. In our
experiment spanning a period of 112 days, the maximum number of SLO changes a
user had to undergo was 6. More than 76% of the users experienced only 3 or fewer changes.
We can further reduce the number of SLO changes per API consumer by introducing a
threshold for the minimum applicable percentage SLO change. This helps to eliminate
the cases where an old SLO has been marked as invalid by our statistical model for
detecting SLO invalidations, but the new SLO predicted by Cerebro is not very different
from the old one. However, the effect of this parameter starts to diminish as we increase
its value. In our experiments, we observe the best results for a threshold of 10%. Using
a value of 20% does not achieve significantly better results.
4.3.5 Effectiveness of QBETS
In order to gauge the effectiveness of QBETS, we compare it to a “naïve” approach
that simply uses the running empirical percentile tabulation of a given joint time series
as a prediction. This simple predictor retains a sorted list of previous observations, and
predicts the p-th percentile to be the value that is larger than p% of the values in the
observation history. Whenever a new observation is available, it is added to the history
and each prediction uses the full history.
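A predictor of this kind might look like the sketch below. It is an illustration rather than the implementation used in the experiments: the class name is hypothetical, and the nearest-rank rule used to pick the percentile from the sorted history is an assumption, since the text does not specify one.

```python
from bisect import insort


class SimplePercentilePredictor:
    """Running empirical percentile over the full observation history."""

    def __init__(self, p=95):
        self.p = p          # integer percentile, e.g. 95
        self.history = []   # kept sorted at all times

    def observe(self, value):
        """Add a new observation to the sorted history."""
        insort(self.history, value)

    def predict(self):
        """Return the p-th percentile of the history (nearest-rank rule)."""
        n = len(self.history)
        if n == 0:
            return None
        rank = (self.p * n + 99) // 100  # integer ceil(p * n / 100)
        return self.history[max(rank, 1) - 1]
```

Unlike QBETS, this predictor attaches no confidence bound to the estimate and does not react to change points in the series, since every prediction uses the full history.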
Figure 4.9 shows the correctness measurements for the simple predictor using the
same cloud SDK monitoring data and application benchmarking data that was used in
Subsection 4.3.1. That is, we keep the rest of Cerebro unchanged, swap QBETS out for
the simple predictor, and run the same set of experiments using the logged observations.
Thus the results in figure 4.9 are directly comparable to figure 4.4 where Cerebro uses
QBETS as a forecaster.
For the simple predictor, Figure 4.9 shows lower correctness percentages compared to
Figure 4.4 for QBETS (i.e. the simple predictor is less conservative). However, in several
cases the simple predictor falls well short of the target correctness of 95% necessary for
the SLO. That is, it is unable to furnish a prediction correctness that can be used as
the basis of an SLO in all of the test cases. This indicates that QBETS, although more
conservative, is a better approach for making SLO predictions than simply calculating
the percentiles on cloud SDK monitoring data.
To illustrate why the simple predictor fails to meet the desired correctness level,
Figure 4.10 shows the time series of observations, simple predictor forecasts, and QBETS
forecasts for the Rooms#getRoomsInCity operation on Google App Engine (the case in
Figure 4.9 that shows lowest correctness percentage).
Figure 4.9: Cerebro correctness percentage resulting from the simple predictor (without
QBETS).
In this experiment, there are a significant number of response time measurements that
violate the SLO given by the simple predictor (i.e. are larger than the predicted percentile),
but are below the corresponding QBETS prediction made for the same observation. Notice
also that while QBETS is more conservative (its predictions are generally larger than
those made by the simple predictor), in this case the predictions are typically only 10%
larger. That is, while the simple predictor shows the 95th percentile to be approximately
40ms, the QBETS predictions vary between 42ms and 48ms, except at the beginning
where QBETS is “learning” the series. This difference in prediction, however, results in
a large difference in correctness percentage. For QBETS, the correctness percentage is
97.4% (Figure 4.4) compared to 75.5% for the simple predictor (Figure 4.9).
Figure 4.10: Comparison of predicted and actual response times of
Rooms#getRoomsInCity on Google App Engine.
Figure 4.11: Running tabulation of correctness percentage for predictions made on
App Engine for a period of 1000 minutes, one prediction per minute.
Figure 4.12: Running tabulation of correctness percentage for predictions made on
AppScale for a period of 1000 minutes, one prediction per minute.
4.3.6 Learning Duration
As described in subsection 4.2.3, QBETS uses a form of supervised learning internally
to determine each of its bound predictions. Each time a new prediction is presented, it
updates its internal state with respect to autocorrelation and change-point detection.
As a result, the correctness percentage may require some number of state updates to
converge to a stable value.
Figure 4.11 shows a running tabulation of correctness percentage for Cerebro predictions
made in Google App Engine during the first 1000 minutes of operation (one
prediction is generated each minute). Similarly, Figure 4.12 shows a running tabulation
of correctness percentage for Cerebro predictions made in AppScale during the first
1000 minutes of operation (again, one prediction generated per minute).
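The running tabulation plotted in these figures can be illustrated with a short sketch. It assumes that a prediction counts as correct when the measured response time does not exceed the predicted bound; the function name and the sample values are hypothetical and are not taken from the Cerebro prototype.

```python
from typing import Iterable, List, Tuple


def running_correctness(pairs: Iterable[Tuple[float, float]]) -> List[float]:
    """Return the running correctness percentage after each observation.

    Each pair is (predicted_upper_bound_ms, measured_response_time_ms).
    Assumption: a prediction is correct when the measurement is at or
    below the predicted bound.
    """
    correct = 0
    percentages = []
    for count, (predicted, measured) in enumerate(pairs, start=1):
        if measured <= predicted:
            correct += 1
        percentages.append(100.0 * correct / count)
    return percentages


# Hypothetical data: one (prediction, measurement) pair per minute.
samples = [(45.0, 40.0), (45.0, 50.0), (46.0, 41.0), (46.0, 44.0)]
trace = running_correctness(samples)
```

Plotting such a trace against time gives a curve of the kind described above: it fluctuates while few observations are available and settles as the count grows.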
For clarity we do not show results for all tested operations. Instead, we only show
data for the operation that reaches stability in the shortest amount of time, and the
operation that takes the longest to converge. Results for other operations fall between
these two extremes.
In the worst case, Cerebro takes up to 200 minutes to achieve a correctness percentage
above 95% in Google App Engine (for StudentInfo#getAllStudents). In contrast, the
longest time until Cerebro has “learned” the series in AppScale is approximately 40
minutes.
Summarizing these results, the learning time for Cerebro may be several hours (up to
200 minutes in case of Google App Engine), before it produces trustworthy and correct
SLO predictions. The predictions made during this learning period are not necessarily
incorrect. It is just not possible to gauge their correctness quantitatively before the series
has been learned. We envision Cerebro as a continuous monitoring process in PaaS clouds
for which “startup time” is not an issue.
4.4 Related Work
Our research leverages a number of mature research areas in computer science and
mathematics. These areas include static program analysis, cloud computing, time series
analysis, and SOA governance.
The problem of predicting response time SLOs of web APIs is similar to worst-case
execution time (WCET) analysis [107, 108, 109, 100, 110]. The objective of WCET
analysis is to determine the maximum execution time of a software component in a given
hardware platform. It is typically discussed in the context of real-time systems, where the
developers should be able to document and enforce precise hard real-time constraints on
the execution time of programs. In order to save time, manpower and hardware resources,
WCET analysis solutions are generally designed favoring static analysis methods over
software testing. We share similar concerns with regard to cloud platforms, and strive to
eliminate software testing in favor of static analysis.
Ermedahl et al. describe SWEET [108], a WCET analysis tool that makes use of program
slicing [109], abstract interpretation [111], and invariant analysis [100] to determine
the loop bounds and worst-case execution time of a program. Program slicing used in
this prior work to limit the amount of code being analyzed is similar in its goal to our
focus on cloud SDK invocations. SWEET uses abstract interpretation in interval and
congruence domains to identify the set of values that can be assigned to key control
variables of a program. These sets are then used to calculate exact loop bounds for most
data-independent loops in the code. Invariant analysis is used to detect variables that
do not change during the course of a loop iteration, and remove them from the analysis,
thus further simplifying the loop bound estimation. Lokuciejewski et al. propose a
similar WCET analysis using program slicing and abstract interpretation [112]. They
additionally use a technique called polytope models to speed up the analysis.
The corpus of research that covers the use of static analysis methods to estimate
the execution time of software applications is extensive. Gulwani, Jain and Koskinen
used two techniques named control-flow refinement and progress invariants to estimate
the bounds for procedures with nested and multi-path loops [113]. Gulwani, Mehra and
Chilimbi proposed SPEED [114], a system that computes symbolic bounds for programs.
This system makes use of user-defined quantitative functions to predict the bounds for
loops iterating over data structures like lists, trees and vectors. Our idea of using user-
defined values to bound data-dependent loops (e.g. iterative datastore reads) is partly
inspired by this concept. Bygde [101] proposed a set of algorithms for predicting data-
independent loops using abstract interpretation and element counting (a technique that
was partly used in [108]). Cerebro incorporates minor variations of these algorithms
successfully due to their simplicity.
Cerebro makes use of and is similar to many of the execution time analysis systems
discussed above. However, there are also several key differences. For instance, Cerebro is
focused on solving the execution time prediction problem for PaaS-hosted web APIs. As
we show in our characterization survey, such applications have a set of unique properties
that can be used to greatly simplify static analysis. Also, Cerebro is designed to only work
with web API code. This makes designing a solution much simpler, but less general.
To handle the highly variable and evolving nature of cloud platforms, Cerebro combines
static analysis with runtime monitoring of cloud platforms at the level of SDK operations.
No other system provides such a hybrid approach to the best of our knowledge. Finally,
we use time series analysis [90] to predict API execution time upper bounds with specific
confidence levels.
SLA management on service-oriented systems and cloud systems has been thoroughly
researched over the years. However, a lot of the existing work has focused on issues
such as SLA monitoring [115, 116, 117, 118], SLA negotiation [119, 120, 121], and SLA
modeling [122, 123, 124]. Some work has looked at incorporating a given SLA into the
design of a system, and then monitoring it at runtime to ensure SLA-compliant
behavior [125]. Our research takes a different approach from such works, whereby it
attempts to predict the performance SLOs for a given web API, which in turn can be
used to formulate performance SLAs between API providers and consumers. To the best
of our knowledge, Cerebro is the first system to predict performance SLOs for web APIs
developed for PaaS clouds.
A work that is similar to ours has been proposed by Ardagna, Damiani and Sagbo
in [126]. The authors develop a system for early estimation of service performance based
on simulations. Given a Symbolic Transition System (STS) model of a service, their
system is able to generate a simulation script, which can be used to assess the perfor-
mance of the service. STS models are a type of finite state automata. Further, they
use probabilistic distributions with fixed parameters to represent the delays incurred by
various operations in the service. Cerebro is easier to use than this system because we do
not require API developers to construct any models of the web APIs. They only need to
provide the source code of the API implementations. Also, instead of using probabilistic
distributions with fixed parameters, Cerebro uses actual historical performance metrics
of cloud SDK operations. This enables Cerebro to generate more accurate results that
reflect the dynamic nature of the cloud platform.
In PROSDIN [119], a proactive service discovery and negotiation framework, the SLA
negotiation occurs during the service discovery phase. This is similar to how Cerebro
provides an initial SLO to an API consumer when the consumer subscribes to an API.
PROSDIN also establishes a fixed SLA validity period upon negotiation, and triggers
an SLA renegotiation when this time period has elapsed. Cerebro on the other hand
continuously monitors the cloud platform, and periodically re-evaluates the response time
SLOs of web APIs to determine when a reassessment is needed. Similarly, researchers
have investigated the notions of SLA brokering [121], and the automatic SLA negotiation
between intelligent agents [120], ideas that can complement the simple SLO provisioning
model of Cerebro to make it more powerful and flexible.
Meryn [127] is an SLA-driven PaaS system that attempts to maximize cloud provider
profit, while providing the best possible quality of service to the cloud users. It sup-
ports SLA negotiation at application deployment, and SLA monitoring to detect viola-
tions. However, it does not automatically determine what SLAs are feasible or address
SLA renegotiation, and employs a policy-based mechanism coupled with a penalty cost
charged against the cloud provider to handle SLA violations. Also, Meryn formulates
SLAs in terms of the computing resources (CPU, memory, storage etc.) allocated to
applications. It assumes a batch processing environment where the execution time of an
application is approximated based on a detailed description of the application provided
by the developer. In contrast, Cerebro handles SLOs for interactive web applications. It
predicts the response time of applications using static analysis, without any input from
the application developer. Cerebro also supports automatic SLO reassessment, with
possible room for economic incentives.
Iosup et al. showed via empirical analysis that production cloud platforms like Google
App Engine and AWS regularly undergo performance variations, thus impacting the re-
sponse time of the applications deployed in such cloud platforms [128]. Some of these
cloud platforms even exhibit temporal patterns in their performance variations (weekly,
monthly, annual or seasonal). Cerebro and the associated API performance forecasting
model acknowledge this fact, and periodically reassess the predicted response time up-
per bounds. It detects when a previously predicted upper bound becomes invalid, and
prompts the API clients to update their SLOs accordingly. Indeed, one of Cerebro’s
strengths is its ability to detect change points in the input time series data (periodically
collected cloud SDK benchmark results), and generate up-to-date predictions that are
not affected by obsolete observations that were gathered prior to a change point.
There has also been prior work in the area of predicting SLO violations [129, 130, 131].
These systems take an existing SLO and historical performance data of a service, and
predict when the service might violate the given SLO in the future. Cerebro’s notion of
SLO validity period has some relation to this line of research. However, Cerebro’s main
goal is to make SLO predictions for web APIs before they are deployed and executed. We
believe that some of these existing SLO violation predictors can complement our work by
providing API developers and cloud administrators insights on when a Cerebro-predicted
SLO will be violated.
4.5 Conclusions and Future Work
Stipulating SLOs (bounds) on the response time of web APIs is crucial for implement-
ing several features related to automated governance. To this end we present Cerebro,
a system that predicts response time SLOs for web APIs deployed in PaaS clouds. The
SLOs predicted by Cerebro can be used to enforce policies regarding the performance
level expected from cloud-hosted web applications. They can be used to negotiate SLAs
with API clients. They can also be used as thresholds when implementing application
performance monitoring (APM), the subject of the next chapter. Cerebro is intended for
use during development and deployment phases of a web API, and precludes the need
for continuous performance testing of the API code. Further, it does not interfere with
run-time operation (i.e. it requires no application instrumentation) making it scalable.
Cerebro uses static analysis to extract the sequence of cloud SDK calls (i.e. PaaS
kernel invocations) made by a given web API code, and combines that with the historical
performance measurements of individual cloud SDK calls. Cerebro employs QBETS, a
non-parametric time series analysis and forecasting method, to analyze cloud SDK per-
formance data, and predict bounds on API response time that can be used as statistical
“guarantees” with associated guarantee probabilities.
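To make the idea of a percentile bound with an associated confidence level concrete, the following is a minimal sketch of a non-parametric upper bound computed from order statistics and the binomial distribution. It is a simplified illustration under the assumption of independent samples; it omits the change-point detection and autocorrelation handling that QBETS performs, and the function name and parameters are hypothetical.

```python
from math import comb
from typing import Optional, Sequence


def quantile_upper_bound(history: Sequence[float], q: float = 0.95,
                         confidence: float = 0.95) -> Optional[float]:
    """Upper confidence bound on the q-th quantile of i.i.d. samples.

    Returns the smallest order statistic x_(k) for which
    P(Binomial(n, q) <= k - 1) >= confidence, or None when the history
    is too short to support the requested confidence. Intended for
    modest history lengths; very large n can overflow the float product.
    """
    ordered = sorted(history)
    n = len(ordered)
    cdf = 0.0
    for k in range(1, n + 1):
        # Add P(exactly k - 1 of the n samples fall below the true quantile).
        cdf += comb(n, k - 1) * q ** (k - 1) * (1 - q) ** (n - k + 1)
        if cdf >= confidence:
            return ordered[k - 1]
    return None


# Hypothetical history: 100 response time measurements of 1..100 ms.
bound = quantile_upper_bound(list(range(1, 101)))
too_short = quantile_upper_bound(list(range(1, 11)))
```

With only ten samples no order statistic can bound the 95th percentile at 95% confidence, so the sketch returns None. This mirrors, in a simplified way, why a predictor of this kind needs a sufficiently long history before its bounds become meaningful.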
We have implemented a prototype of Cerebro for Google App Engine public PaaS,
and AppScale private PaaS. We evaluate it using a set of representative and open source
web applications developed by others. Our findings indicate that the prototype can
determine response time SLOs with target accuracy levels specified by an administrator.
Specifically, we use Cerebro to predict the 95th percentile of the API response time. We
find that:
• Cerebro achieves the desired correctness goal of 95% for all the applications in both
cloud environments.
• Cerebro generates tight predictions (i.e. the predictions are similar to measured
values) for most web APIs. Because some operations and PaaS systems exhibit
more variability in cloud SDK response time, Cerebro must be conservative in some
cases, and produce predictions that are less tight to meet its correctness guarantees.
• Cerebro requires a “warm up” period of up to 200 minutes to produce trustworthy
predictions. Since PaaS systems are designed to run continuously, this is not an
issue in practice.
• We can use a simple yet administratively useful model to identify when an SLO
becomes invalid, and thereby compute prediction validity durations for Cerebro. The average
duration of a valid Cerebro prediction is between 24 and 72 hours, and 95% of
the time this duration is at least 1.41 hours for App Engine and 1.95 hours for
AppScale.
We also find that, when using Cerebro to establish SLOs, the API consumers do not
experience SLO changes often, and the maximum number of times an API consumer
encounters an SLO change over a period of 112 days is six. Overall, this work shows that
automatic stipulation of response-time SLOs for web APIs is practically viable in
real-world cloud settings and API consumer timeframes.
In the current design, Cerebro’s cloud SDK monitoring agent only monitors a prede-
fined set of cloud SDK operations. In our future work we wish to explore the possibility
of making this component more dynamic, so that it automatically learns what operations
to benchmark from the web APIs deployed in the cloud. This also includes learning the
size and the form of the datasets that cloud SDK invocations operate on, so that Cerebro
can acquire more realistic benchmarking data. We also plan to investigate further how
to better handle data-dependent loops (iterative datastore reads) for different workloads.
We are interested in exploring the ways in which we can handle API codes with unpre-
dictable execution patterns (e.g. loops based on a random number), even though such
cases are quite rare in the applications we have looked at so far. Further, we plan to
integrate Cerebro with EAGER, our API governance system and policy engine for PaaS
clouds, so that PaaS administrators can enforce SLO-related policies on web APIs at
deployment-time. Such a system will make it possible to prevent any API that does not
adhere to the organizational performance standards from being deployed in the produc-
tion cloud environment. It can also enforce policies that prevent applications from taking
dependencies on APIs that are not up to the expected performance standards.
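A deployment-time check of this kind could look roughly like the sketch below. It is not the EAGER policy language or API; the `ApiDescriptor` record, its field names, and the rule that an unknown dependency counts as a violation are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ApiDescriptor:
    """Hypothetical deployment record; field names are illustrative."""
    name: str
    predicted_slo_ms: float             # predicted response time upper bound
    dependencies: Tuple[str, ...] = ()  # names of APIs this API invokes


def policy_violations(api: ApiDescriptor, max_response_ms: float,
                      deployed: Dict[str, ApiDescriptor]) -> List[str]:
    """Return reasons to reject the deployment; an empty list means accept."""
    violations = []
    if api.predicted_slo_ms > max_response_ms:
        violations.append(f"{api.name}: predicted SLO {api.predicted_slo_ms} ms "
                          f"exceeds limit {max_response_ms} ms")
    for dep_name in api.dependencies:
        dep = deployed.get(dep_name)
        if dep is None:
            # Assumption: an unknown dependency is treated as a violation.
            violations.append(f"{api.name}: unknown dependency {dep_name}")
        elif dep.predicted_slo_ms > max_response_ms:
            violations.append(f"{api.name}: dependency {dep_name} exceeds limit")
    return violations


deployed = {"rooms": ApiDescriptor("rooms", 48.0),
            "reports": ApiDescriptor("reports", 900.0)}
ok = policy_violations(ApiDescriptor("booking", 60.0, ("rooms",)), 100.0, deployed)
bad = policy_violations(ApiDescriptor("summary", 150.0, ("reports",)), 100.0, deployed)
```

Here the hypothetical `booking` API would be admitted, while `summary` would be rejected both for its own predicted bound and for depending on an API that is above the limit.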
Chapter 5
Performance Anomaly Detection and Root Cause Analysis for Cloud-hosted Web Applications
In the previous chapter we presented a methodology for stipulating performance SLOs
for cloud-hosted web applications. In this chapter we discuss detecting performance
SLO violations, and conducting root cause analysis. Timely detection of performance
problems, and the ability to diagnose the root causes of such issues are critical elements
of governance.
The widespread adoption of cloud computing, particularly for deploying web applications,
is facilitated by ever-deepening software abstractions. These abstractions elide
the complexity necessary to enable scale, while making application development easier
and faster. But they also obscure the runtime details of cloud applications, making
the diagnosis of performance problems challenging. Therefore, the rapid expansion of
cloud technologies combined with their increasing opacity has intensified the need for
new techniques to monitor applications deployed in cloud platforms [132].
Application developers and cloud administrators generally wish to monitor applica-
tion performance, detect anomalies, and identify bottlenecks. To obtain this level of
operational insight into cloud-hosted applications, and facilitate governance, the cloud
platforms must support data gathering and analysis capabilities that span the entire soft-
ware stack of the cloud. However, most cloud technologies available today do not provide
adequate application monitoring support. Cloud administrators must therefore trust the
application developers to implement necessary instrumentation at the application level.
This typically entails using third party, external monitoring software [12, 13, 14], which
significantly increases the effort and financial cost of maintaining applications. Develop-
ers must also ensure that their instrumentation is both correct, and does not degrade
application performance. Nevertheless, since the applications depend on extant cloud
services (e.g. scalable database services, scalable in-memory caching, etc.) that are per-
formance opaque, it is often difficult, if not impossible to diagnose the root cause of a
performance problem using such extrinsic forms of monitoring.
Further compounding the performance diagnosis problem, today’s cloud platforms are
very large and complex [132, 133]. They are comprised of many layers, where each layer
may consist of many interacting components. Therefore when a performance anomaly
manifests in a user application, it is often challenging to determine the exact layer or the
component of the cloud platform that may be responsible for it. Facilitating this level of
comprehensive root cause analysis requires both data collection at different layers of the
cloud, and mechanisms for correlating the events recorded at different layers.
Moreover, performance monitoring for cloud applications needs to be highly cus-
tomizable. Different applications have different monitoring requirements in terms of data
gathering frequency (sampling rate), length of the history to consider when performing
statistical analysis (sample size), and the performance SLOs (service level objectives [89])
and policies that govern the application. Cloud monitoring should be able to facilitate
these diverse requirements on a per-application basis. Designing such customizable and
extensible performance monitoring frameworks that are built into the cloud platforms is
a novel and challenging undertaking.
To address these needs, we present a full-stack application performance monitor
(APM) called Roots that can be integrated with a variety of cloud Platform-as-a-Service
(PaaS) technologies. PaaS clouds provide a set of managed services, which develop-
ers compose into applications. To be able to correlate application activity with cloud
platform events, we design Roots as another managed service built into the PaaS cloud.
Therefore it operates at the same level as the other services offered by the cloud platform.
This way Roots can collect data directly from the internal service implementations of the
cloud platform, thus gaining full visibility into all the inner workings of an application.
It also enables Roots to operate fully automatically in the background, without requiring
instrumentation of application code.
Previous work has outlined several key requirements that need to be considered when
designing a cloud monitoring system [132, 133]. We incorporate many of these features
into our design:
Scalability: Roots is lightweight, and does not cause any noticeable overhead in
application performance. It puts strict upper bounds on the data kept in memory. The
persistent data is accessed on demand, and can be removed after its usefulness
has expired.
Multitenancy: Roots facilitates configuring monitoring policies at the granularity of
individual applications. Users can employ different statistical analysis methods to
process the monitoring data in ways that are most suitable for their applications.
Complex application architecture: Roots collects data from the entire cloud stack
(load balancers, app servers, built-in PaaS services etc.). It correlates data gathered
from different parts of the cloud platform, and performs systemwide bottleneck
identification.
Dynamic resource management: Cloud platforms are dynamic in terms of their
magnitude and topology. Roots captures performance events of applications by
augmenting the key components of the cloud platform. When new processes/components
become active in the cloud platform, they inherit the same augmentations,
and start reporting to Roots automatically.
Autonomy: Roots detects performance anomalies online without manual intervention.
When Roots detects a problem, it attempts to automatically identify the root cause
by analyzing available workload and service invocation data.
Roots collects most of the data it requires by direct integration with various inter-
nal components of the cloud platform. In addition to high-level metrics like request
throughput and latency, Roots also records the internal PaaS service invocations made
by applications, and the latency of those calls. It uses batch operations and asynchronous
communication to record events in a manner that does not substantively increase request
latency.
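The combination of batching and asynchronous communication can be sketched as follows. This is an illustrative pattern only, not the Roots implementation: the `BatchingRecorder` class, its `sink` callback, and the batch size are hypothetical names and values. The request path only enqueues an event, and a background thread forwards events to storage in batches.

```python
import queue
import threading
from typing import Callable, List


class BatchingRecorder:
    """Buffers events and hands them to a sink in batches off the request path."""

    def __init__(self, sink: Callable[[List[dict]], None], batch_size: int = 100):
        self._sink = sink
        self._batch_size = batch_size
        self._queue: "queue.Queue" = queue.Queue()  # unbounded, so put() does not block
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, event: dict) -> None:
        """Called while serving a request; only enqueues the event."""
        self._queue.put(event)

    def close(self) -> None:
        """Flush any buffered events and stop the background thread."""
        self._queue.put(None)  # sentinel marking the end of the stream
        self._worker.join()

    def _drain(self) -> None:
        batch: List[dict] = []
        while True:
            event = self._queue.get()
            if event is None:
                break
            batch.append(event)
            if len(batch) >= self._batch_size:
                self._sink(batch)
                batch = []
        if batch:
            self._sink(batch)  # flush the final partial batch


stored: List[List[dict]] = []
recorder = BatchingRecorder(stored.append, batch_size=2)
for i in range(5):
    recorder.record({"request_id": "R", "seq": i})
recorder.close()
```

In this sketch the cost added to a request is a single queue insertion, while the potentially slow write to the data store happens in the background.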
The previous two chapters present systems that perform the specification (policies
and SLOs) and enforcement functions of governance. Roots also performs an important
function associated with automated governance – monitoring. It is designed to monitor
cloud-hosted web applications for SLO violations, and any other deviations from specified
or expected behavior. Roots flags such issues as anomalies, and notifies cloud admin-
istrators in near real time. Also, when Roots detects an anomaly in an application, it
attempts to uncover the root cause of the anomaly by analyzing the workload data, and
the performance of the internal PaaS services the application depends on. Roots can de-
termine if the detected anomaly was caused by a change in the application workload (e.g.
a sudden spike in the number of client requests), or an internal bottleneck in the cloud
platform (e.g. a slow database query). To this end we propose a statistical bottleneck
identification method for PaaS clouds. It uses a combination of quantile analysis, change
point detection and linear regression to perform root cause analysis.
Using Roots we also devise a mechanism to identify different paths of execution in an
application – i.e. different paths in the application’s control flow graph. Our approach
does not require static analysis, and instead uses the runtime data collected by Roots.
This mechanism also calculates the proportion of user requests processed by each path,
which is used to characterize the workload of an application (e.g. read-heavy vs write-
heavy workload in a data management application). Based on that, Roots monitors for
characteristic changes in the application workload.
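One way to picture this mechanism is sketched below. It assumes that a path can be identified by the ordered sequence of PaaS service operations invoked while processing a request; that representation, the function name, and the sample operation names are assumptions for illustration rather than details of the Roots implementation.

```python
from collections import Counter
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def path_proportions(requests: Dict[str, List[str]]) -> Dict[Path, float]:
    """Map each distinct invocation sequence to its share of all requests.

    `requests` maps a request identifier to the ordered list of service
    operations recorded for that request.
    """
    counts = Counter(tuple(calls) for calls in requests.values())
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


# Hypothetical runtime data for four requests to one application.
observed = {
    "r1": ["datastore.get"],
    "r2": ["datastore.get"],
    "r3": ["datastore.put", "memcache.set"],
    "r4": ["datastore.get"],
}
proportions = path_proportions(observed)
```

A shift in these proportions over time (for example, the write path growing from a quarter of requests to a half) would be one signal of a characteristic change in the workload.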
We build a working prototype of Roots using the AppScale [7] open source PaaS.
We evaluate the feasibility and the efficacy of Roots by conducting a series of empirical
trials using our prototype. We show that our approach for identifying performance
bottlenecks in PaaS clouds produces accurate results nearly 100% of the time. We also
demonstrate that Roots does not add a significant performance overhead to the applica-
tions, and that it scales well to monitor tens of thousands of applications concurrently.
We discuss the following contributions in this chapter:
• We describe the architecture of Roots as an intrinsic PaaS service, which works
automatically without requiring or depending upon application instrumentation.
• We describe a statistical methodology for determining when an application is ex-
periencing a performance anomaly, and identifying the workload change or the
application component that is responsible for the anomaly.
• We present a mechanism for identifying the execution paths of an application via
the runtime data gathered from it, and characterizing the application workload by
computing the proportion of requests handled by each path.
• We demonstrate the effectiveness of the approach using a working PaaS prototype.
5.1 Performance Debugging Cloud Applications
By providing most of the functionality that applications require via kernel services, the
PaaS model significantly increases programmer productivity. However, a downside of this
approach is that these features also hide the performance details of PaaS applications.
Since the applications spend most of their time executing kernel services [134], it is
challenging for the developers to debug performance issues given the opacity of the cloud
platform’s internal implementation.
One way to circumvent this problem is to instrument application code [12, 14, 13],
and continuously monitor the time taken by various parts of the application. But such
application-level instrumentation is tedious and error-prone, and mistakes in it can mislead those
attempting to diagnose problems. Moreover, the instrumentation code may slow down
or alter the application’s performance. In contrast, implementing data collection and
analysis as a kernel service built into the PaaS cloud allows performance diagnosis to be
a “curated” service that is reliably managed by the cloud platform.
In order to maintain a satisfactory level of user experience and adhere to any previ-
ously agreed upon performance SLOs, application developers and cloud administrators
wish to detect performance anomalies as soon as they occur. When detected, they must
perform root cause analysis to identify the cause of the anomaly, and take some corrective
and/or preventive action. This diagnosis usually occurs as a two step process. First, one
must determine whether the anomaly was caused by a change in the workload (e.g. a
sudden increase in the number of client requests). If that is the case, the resolution typ-
ically involves allocating more resources to the application or spawning more instances
of the application for load balancing purposes. If the anomaly cannot be attributed to a
workload change, one must go another step to find the bottleneck component that has
given rise to the issue at hand.
5.2 Roots
Roots is a holistic system for application performance monitoring (APM), perfor-
mance anomaly detection, and bottleneck identification. The key intuition behind the
system is that, as an intrinsic PaaS service, Roots has visibility into all activities of the
PaaS cloud, across layers. Moreover, since the PaaS applications we have observed spend
most of their time in PaaS kernel services [134], we hypothesize that we can reason about
application performance from observations of how the application uses the platform, i.e.
by monitoring the time spent in PaaS kernel services. If we are able to do so, then we
can avoid application instrumentation and its downsides while detecting performance
anomalies, and identifying their root cause in near real time with low overhead.
The PaaS model that we assume with Roots is one in which the clients of a web
application engage in a “service level agreement” (SLA) [89] with the “owner” or operator
of the application that is hosted in a PaaS cloud. The SLA stipulates a response-time
“service level objective” (SLO) that, if violated, constitutes a breach of the agreement. If
the performance of an application deteriorates to the point that at least one of its SLOs
is violated, we treat it as an anomaly. Moreover, we refer to the process of diagnosing
the reason for an anomaly as root cause analysis. For a given anomaly, the root cause
could be a change in the application workload or a bottleneck in the application runtime.
Bottlenecks may occur in the application code, or in the PaaS kernel services that the
application depends on.
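As a minimal sketch of how an SLO violation might be flagged, suppose the SLO is expressed as a response time bound together with the fraction of requests that must meet it. That formulation, the function name, and the sample values are assumptions for illustration; the detectors actually used by Roots may differ.

```python
from typing import Sequence


def violates_slo(response_times_ms: Sequence[float], bound_ms: float,
                 target_fraction: float) -> bool:
    """Return True when too few requests in the window met the bound.

    Assumption: the SLO requires at least `target_fraction` of requests
    to complete within `bound_ms`. An empty window is not flagged.
    """
    if not response_times_ms:
        return False
    within = sum(1 for t in response_times_ms if t <= bound_ms)
    return within / len(response_times_ms) < target_fraction


# Hypothetical windows of measured response times (ms).
degraded = violates_slo([20, 25, 30, 120, 22], bound_ms=50, target_fraction=0.95)
healthy = violates_slo([20] * 19 + [120], bound_ms=50, target_fraction=0.95)
```

In the first window only 80% of requests meet the 50 ms bound, so it would be treated as an anomaly; in the second window exactly 95% do, which satisfies the assumed SLO.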
Roots collects performance data across the cloud platform stack, and aggregates it
based on request/response. It uses this data to infer application performance, and to
identify SLO violations (performance anomalies). Roots can further handle different
types of anomalies in different ways. We give an overview of each of these functions in the
remainder of this section.
5.2.1 Data Collection and Correlation
We must address two issues when designing a monitoring framework for a system as
complex as a PaaS cloud.
1. Collecting data from multiple different layers.
2. Correlating data collected from different layers.
Each layer of the cloud platform is only able to collect data regarding the state
changes that are local to it. A layer cannot monitor state changes in other layers due
to the level of encapsulation provided by layers. However, processing an application
request involves cooperation of multiple layers. To facilitate system-wide monitoring and
bottleneck identification, we must gather data from all the different layers involved in
processing a request. To combine the information across layers, we correlate the data,
and link events related to the same client request together.
To enable this, we augment the front-end server of the cloud platform. Specifically, we
have it tag incoming application requests with unique identifiers. This request identifier
is added to the HTTP request as a header, which is visible to all internal components
of the PaaS cloud. Next, we configure data collecting agents within the platform to
record the request identifiers along with any events they capture. This way we record
the relationship between application requests, and the resulting local state changes in
different layers of the cloud, without breaking the existing level of abstraction in the
cloud architecture. This approach is also scalable, since the events are recorded in a
distributed manner without having to maintain any state at the data collecting agents.
Roots aggregates the recorded events by request identifier to efficiently group the related
events as required during analysis.
Figure 5.1: Roots APM architecture.
Figure 5.1 illustrates the high-level architecture of Roots, and how it fits into the PaaS
stack. APM components are shown in grey. The small grey boxes attached to the PaaS
components represent the agents used to instrument the cloud platform. In the diagram,
a user request is tagged with the identifier value R at the front-end server. This identifier
is passed down to the lower layers of the cloud along with the request. Events that occur
in the lower layers while processing this request are recorded with the request identifier
R, so Roots can correlate them later. For example, in the data analysis component we
can run a filter query to select all the events related to a particular request (as shown in
the pseudo query in the diagram). Similarly, Roots can run a “group by” query to select
all events, and aggregate them by the request identifier.
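The filter and "group by" queries described above can be sketched in a few lines of Python. This is only an illustration over an in-memory list: the event fields (`request_id`, `layer`, `event`) and the function names are assumptions made for the example, not the schema or API used by Roots.

```python
from collections import defaultdict

# Hypothetical event records; field names are illustrative, not the Roots schema.
events = [
    {"request_id": "R1", "layer": "frontend", "event": "request_received"},
    {"request_id": "R1", "layer": "kernel", "event": "datastore.get"},
    {"request_id": "R2", "layer": "frontend", "event": "request_received"},
    {"request_id": "R1", "layer": "app_server", "event": "response_sent"},
]


def filter_by_request(events, request_id):
    """Select every event recorded for one client request."""
    return [e for e in events if e["request_id"] == request_id]


def group_by_request(events):
    """Aggregate all events by the request identifier they carry."""
    groups = defaultdict(list)
    for e in events:
        groups[e["request_id"]].append(e)
    return dict(groups)
```

Because each agent only attaches the identifier it received, the grouping can be done entirely at analysis time.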
Figure 5.1 also depicts Roots data collection across all layers in the PaaS stack (i.e.
full stack monitoring). From the front-end server layer we gather information related
to incoming application requests. This involves scraping the HTTP server access logs,
which are readily available in most technologies used as front-end servers (e.g. Apache
HTTPD, Nginx).
From the application server layer, we collect application logs and metrics from the
application runtime that are easily accessible, e.g. process level metrics indicating re-
source usage of the individual application instances. Additionally, Roots employs a set
of per-application benchmarking processes that periodically probes different applications
to measure their performance. These are lightweight, stateless processes managed by the
Roots framework. Data collected by these processes is sent to the data storage compo-
nent, and is available for analysis as per-application time series data.
At the PaaS kernel layer we collect information regarding all kernel invocations made
by the applications. This requires intercepting the PaaS kernel invocations at runtime.
This must be done carefully so as to not introduce significant overhead to application
execution. For each PaaS kernel invocation, we capture the following parameters.
• Source application making the kernel invocation
• Timestamp
• A sequence number indicating the order of PaaS kernel invocations within an ap-
plication request
• Target kernel service and operation
• Execution time of the invocation
• Request size, hash and other parameters
Collecting PaaS kernel invocation details enables tracing the execution of application
requests without requiring that the application code be instrumented.
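One way to picture the captured parameters is as a small record type. The sketch below is an assumption-laden illustration: the class name `KernelInvocation`, its field names, and the `trace` helper are hypothetical and are not taken from the Roots implementation.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class KernelInvocation:
    application: str   # source application making the call
    timestamp_ms: int  # when the invocation was made
    sequence: int      # order of the call within its application request
    service: str       # target kernel service, e.g. "datastore"
    operation: str     # operation on that service, e.g. "get"
    elapsed_ms: float  # execution time of the invocation
    request_id: str    # identifier assigned by the front-end server
    extras: Dict[str, str] = field(default_factory=dict)  # request size, hash, etc.


def trace(invocations, request_id):
    """Return the kernel calls of one request in execution order."""
    calls = [i for i in invocations if i.request_id == request_id]
    return sorted(calls, key=lambda i: i.sequence)
```

Sorting by the sequence number reconstructs the order of kernel calls for a request without touching the application code.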
Finally, at the lowest level we can collect information related to virtual machines,
containers and their resource usage. We gather metrics on network usage by individual
components which is useful for traffic engineering use cases. We also scrape hypervisor
and container manager logs to learn how resources are allocated and released over time.
To avoid introducing delays to the application request processing flow, we implement
Roots data collecting agents as asynchronous tasks. That is, none of them suspend
application request processing to report data to the data storage components. To enable
this, we collect data into log files or memory buffers that are local to the components
being monitored. This locally collected (or buffered) data is periodically sent to the data
storage components of Roots using separate background tasks and batch communication
operations. We also isolate the activities in the cloud platform from potential failures in
the Roots data collection or storage components.
5.2.2 Data Storage
The Roots data storage is a database that supports persistently storing monitoring
data, and running queries on them. Most data retrieval queries executed by Roots use
application and time intervals as indices. Therefore a database that can index monitoring
data by application and timestamp will greatly improve the query performance. It is also
acceptable to remove old monitoring data to make room for more recent events, since
Roots performs anomaly detection using the most recent data in near realtime.
5.2.3 Data Analysis
Roots data analysis components use two basic abstractions: anomaly detectors and
anomaly handlers. Anomaly detectors are processes that periodically analyze the data
collected for each deployed application. Roots supports multiple detector implementa-
tions, where each implementation uses a different statistical method to look for per-
formance anomalies. Detectors are configured per-application, making it possible for
different applications to use different anomaly detectors. Roots also supports multiple
concurrent anomaly detectors on the same application, which can be used to evaluate
the efficiency of different detection strategies for any given application. Each anomaly
detector has an execution schedule (e.g. run every 60 seconds), and a sliding window
(e.g. from 10 minutes ago to now) associated with it. The boundaries of the window
determine the time range of the data processed by the detector at any round of execution.
The window is updated after each round of execution.
When an anomaly detector finds an anomaly in application performance, it sends
an event to a collection of anomaly handlers. The event encapsulates a unique anomaly
identifier, timestamp, application identifier and the source detector’s sliding window that
corresponds to the anomaly. Anomaly handlers are configured globally (i.e. each handler
receives events from all detectors), but each handler can be programmed to handle
only certain types of events. Furthermore, they can fire their own events, which are also
delivered to all the listening anomaly handlers. Similar to detectors, Roots supports
multiple anomaly handler implementations – one for logging anomalies, one for sending
alert emails, one for updating a dashboard etc. Additionally, Roots provides two special
anomaly handler implementations: a workload change analyzer, and a bottleneck iden-
tifier. We implement the communication between detectors and handlers using shared
memory.
The ability of anomaly handlers to fire their own events, coupled with their support
for responding to a filtered subset of incoming events enables constructing elaborate event
flows with sophisticated logic. For example, the workload change analyzer can run some
analysis upon receiving an anomaly event from any anomaly detector. If an anomaly
cannot be associated with a workload change, it can fire a different type of event. The
Figure 5.2: Anatomy of a Roots pod. The diagram shows 2 application benchmarking
processes (B), 3 anomaly detectors (D), and 2 handlers (H). Processes communicate
via a shared memory communication bus local to the pod.
bottleneck identifier can be programmed to only execute its analysis upon receiving
this second type of event. This way we perform the workload change analysis first, and
perform the systemwide bottleneck identification only when it is necessary.
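The chained event flow described above can be sketched as a tiny publish/subscribe bus. Everything here is hypothetical: the `EventBus` class, the event type names, and the `workload_changed` flag (which stands in for the result of the real workload analysis) are assumptions made for illustration only.

```python
class EventBus:
    """Delivers each event to every handler that accepts its type."""

    def __init__(self):
        self.handlers = []  # list of (accepted event types, callback)

    def register(self, accepted_types, callback):
        self.handlers.append((set(accepted_types), callback))

    def fire(self, event_type, payload):
        for accepted, callback in self.handlers:
            if event_type in accepted:
                callback(self, event_type, payload)


log = []  # records which analyses ran, in order


def workload_change_analyzer(bus, event_type, payload):
    log.append("workload analysis")
    # Placeholder for the real analysis: the flag is supplied by the caller.
    if not payload.get("workload_changed", False):
        bus.fire("no_workload_change", payload)


def bottleneck_identifier(bus, event_type, payload):
    log.append("bottleneck identification")


bus = EventBus()
bus.register({"anomaly"}, workload_change_analyzer)
bus.register({"no_workload_change"}, bottleneck_identifier)
```

Firing an `anomaly` event always runs the workload analysis, and the more expensive bottleneck identification runs only when the second event type is fired.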
Both the anomaly detectors and anomaly handlers work with fixed-sized sliding win-
dows. Therefore the amount of state these entities must keep in memory has a strict
upper bound. The extensibility of Roots is primarily achieved through the abstractions
of anomaly detectors and handlers. Roots makes it simple to implement new detectors
and handlers, and plug them into the system. Both the detectors and the handlers are
executed as lightweight processes that do not interfere with the rest of the processes in
the cloud platform.
5.2.4 Roots Process Management
Most data collection activities in Roots can be treated as passive – i.e. they take place
automatically as the applications receive and process requests in the cloud platform. They
do not require explicit scheduling or management. In contrast, application benchmarking
and data analysis are active processes that require explicit scheduling and management.
This is achieved by grouping benchmarking and data analysis processes into units called
Roots pods.
Each Roots pod is responsible for starting and maintaining a preconfigured set of
benchmarkers and data analysis processes (i.e. anomaly detectors and handlers). These
processes are lightweight enough that a large number of them fit into a single pod. Pods
are self-contained entities, and there is no inter-communication between pods. Processes
in a pod can efficiently communicate with each other using shared memory, and call out
to the central Roots data storage to retrieve collected performance data for analysis. This
enables starting and stopping Roots pods with minimal impact on the overall monitoring
system. Furthermore, pods can be replicated for high availability, and application load
can be distributed among multiple pods for scalability.
Figure 5.2 illustrates a Roots pod monitoring two applications. It consists of two
benchmarking processes, three anomaly detectors and two anomaly handlers. The anomaly
detectors and handlers are shown communicating via an internal shared memory com-
munication bus.
5.3 Prototype Implementation
To investigate the efficacy of Roots as an approach to implementing performance
diagnostics as a PaaS service, we have developed a working prototype, and a set of
algorithms that uses it to automatically identify SLO-violating performance anomalies.
For anomalies not caused by workload changes (HTTP request rate), Roots performs
further analysis to identify the bottleneck component that is responsible for the issue.
We implement our prototype in AppScale [7], an open source PaaS cloud that is API
compatible with Google App Engine (GAE) [4]. This compatibility enables us to evaluate
our approach using real applications developed by others since GAE applications run on
Figure 5.3: Roots prototype implementation for AppScale PaaS.
AppScale without modification. Because AppScale is open source, we were able to modify
its implementation minimally to integrate Roots.
Figure 5.3 shows an overview of our prototype implementation. Roots components are
shown in grey, while the PaaS components are shown in blue. We use ElasticSearch [135]
as the data storage component of our prototype. ElasticSearch is ideal for storing large
volumes of structured and semi-structured data [136]. ElasticSearch continuously
organizes and indexes data, making the information available for fast and efficient querying.
Additionally, it also provides powerful data filtering and aggregation features, which
greatly simplify the implementations of high-level data analysis algorithms.
We configure AppScale’s front-end server (based on Nginx) to tag all incoming ap-
plication requests with a unique identifier. This identifier is attached to the incoming
request as a custom HTTP header. All data collecting agents in the cloud extract this
identifier, and include it as an attribute in all the events reported to ElasticSearch.
We implement a number of data collecting agents in AppScale to gather runtime
information from all major components. These agents buffer data locally, and store
them in ElasticSearch in batches. Events are buffered until the buffer accumulates 1MB
of data, subject to a hard time limit of 15 seconds. This ensures that the events are
promptly reported to the Roots data storage while keeping the memory footprint of
the data collecting agents small and bounded. For scraping server logs, and storing the
extracted entries in ElasticSearch, we use the Logstash tool [137]. To capture the PaaS
kernel invocation data, we augment AppScale’s PaaS kernel implementation, which is
derived from the GAE PaaS SDK. More specifically we implement an agent that records
all PaaS SDK calls, and reports them to ElasticSearch asynchronously.
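The buffering policy of the data collecting agents (flush at 1MB or after 15 seconds, whichever comes first) might look roughly like the sketch below. The `BufferedAgent` class, the `sink` callable, and the injectable `clock` are assumptions for illustration. For brevity the flush check runs inline when an event is recorded, whereas the agents described here report data from background tasks.

```python
import time


class BufferedAgent:
    def __init__(self, sink, max_bytes=1024 * 1024, max_age_s=15.0,
                 clock=time.monotonic):
        self.sink = sink            # callable that receives one batch (a list)
        self.max_bytes = max_bytes  # size threshold (1MB by default)
        self.max_age_s = max_age_s  # hard time limit (15 seconds by default)
        self.clock = clock
        self.buffer = []
        self.buffered_bytes = 0
        self.oldest = None          # arrival time of the oldest buffered event

    def record(self, event):
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer.append(event)
        self.buffered_bytes += len(event.encode("utf-8"))
        self.flush_if_due()

    def flush_if_due(self):
        """Send the batch if it is large enough or old enough."""
        if not self.buffer:
            return
        too_big = self.buffered_bytes >= self.max_bytes
        too_old = self.clock() - self.oldest >= self.max_age_s
        if too_big or too_old:
            self.sink(self.buffer)
            self.buffer, self.buffered_bytes, self.oldest = [], 0, None
```

Both thresholds bound the agent's state: the size limit caps memory use, and the time limit caps how stale a reported event can be.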
We implement Roots pods as standalone Java server processes. Threads are used to
run benchmarkers, anomaly detectors and handlers concurrently within each pod. Pods
communicate with ElasticSearch via a web API, and many of the data analysis tasks
such as filtering and aggregation are performed in ElasticSearch itself. This way, our
Roots implementation offloads heavy computations to ElasticSearch which is specifically
designed for high-performance query processing and analytics. Some of the more sophis-
ticated statistical analysis tasks (e.g. change point detection and linear regression as
described below) are implemented in the R language, and the Roots pods integrate with
R using the Rserve protocol [138].
5.3.1 SLO-violating Anomalies
As described previously, Roots defines anomalies as performance events that trigger
SLO violations. Thus, we devise a detector to automatically identify when an SLO violation
has occurred. This anomaly detector allows application developers to specify simple
performance SLOs for deployed applications. A performance SLO consists of an upper
bound on the application response time (T), and the probability (p) that the application
response time falls under the specified upper bound. A general performance SLO can be
stated as: “application responds under T milliseconds p% of the time”.
When enabled for a given application, the SLO-based anomaly detector starts a
benchmarking process that periodically measures the response time of the target ap-
plication. Probes made by the benchmarking process are several seconds apart in time
(sampling rate), so as to not strain the application with load. The detector then periodi-
cally analyzes the collected response time measurements to check if the application meets
the specified performance SLO. Whenever it detects that the application has failed to
meet the SLO, it triggers an anomaly event. The SLO-based anomaly detector supports the
following configuration parameters:
• Performance SLO: Response time upper bound (T), and the probability (p).
• Sampling rate: Rate at which the target application is benchmarked.
• Analysis rate: Rate at which the anomaly detector checks whether the application
has failed to meet the SLO.
• Minimum samples: Minimum number of samples to collect before checking for SLO
violations.
• Window size: Length of the sliding window (in time) to consider when checking for
SLO violations. This imposes a limit on the number of samples to keep in memory.
Once the anomaly detector identifies an SLO violation, it will continue to detect the
same violation until the historical data which contains the anomaly drops off from the
sliding window. In order to prevent the detector from needlessly reporting the same
anomaly multiple times, we purge all the data from the anomaly detector’s sliding window
whenever it detects an SLO violation. Therefore, the detector cannot check for further
SLO violations until it repopulates the sliding window with the minimum number of
samples. This implies that each anomaly is followed by a “warm up” period. For instance,
with a sampling rate of 15 seconds, and a minimum samples count of 100, the warm up
period can last up to 25 minutes.
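The detector behaviour described in this section (sliding window, minimum sample count, purge on violation) can be sketched as follows. The `SloDetector` class and its method names are hypothetical, and treating a response as compliant when it is at or below the bound is an assumption; the sketch is not the Roots implementation.

```python
from collections import deque


class SloDetector:
    def __init__(self, limit_ms, probability, min_samples, window_s):
        self.limit_ms = limit_ms        # response time upper bound T
        self.probability = probability  # required fraction p, e.g. 0.95
        self.min_samples = min_samples
        self.window_s = window_s
        self.samples = deque()          # (timestamp_s, response_ms)

    def add_sample(self, timestamp_s, response_ms):
        self.samples.append((timestamp_s, response_ms))

    def check(self, now_s):
        """Return True if the SLO is violated at time now_s."""
        # Drop samples that have slid out of the window.
        while self.samples and self.samples[0][0] < now_s - self.window_s:
            self.samples.popleft()
        if len(self.samples) < self.min_samples:
            return False  # still warming up
        within = sum(1 for _, rt in self.samples if rt <= self.limit_ms)
        if within / len(self.samples) >= self.probability:
            return False
        self.samples.clear()  # purge so the same anomaly is not reported again
        return True


def warm_up_seconds(sampling_interval_s, min_samples):
    """Longest wait before a purged detector can check the SLO again."""
    return sampling_interval_s * min_samples
```

With one probe every 15 seconds and a minimum of 100 samples, `warm_up_seconds(15, 100)` gives 1500 seconds, which matches the 25 minute warm up period stated above.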
5.3.2 Path Distribution Analysis
We have implemented a path distribution analyzer in Roots whose function is to
identify recurring sequences of PaaS kernel invocations made by an application. Each
identified sequence corresponds to a path of execution through the application code (i.e.
a path through the control flow graph of the application). This detector is able to
determine the frequency with which each path is executed over time. Then, using this
information which we term a “path distribution,” it reports an anomaly event when the
distribution of execution paths changes.
For each application, a path distribution is comprised of the set of execution paths
available in that application, along with the proportion of requests that executed each
path. It is an indicator of the type of request workload handled by an application. For
example, consider a data management application that has a read-only execution path,
and a read-write execution path. If 90% of the requests execute the read-only path, and
the remaining 10% of the requests execute the read-write path, we may characterize the
request workload as read-heavy.
The Roots path distribution analyzer computes the path distribution for each
application with no static analysis, by only analyzing the runtime data gathered from the
applications. It periodically computes the path distribution for a given application. If it
detects that the latest path distribution is significantly different from the distributions
seen in the past, it triggers an event. This is done by computing the mean request
proportion for each path (over a sliding window of historical data), and then comparing
the latest request proportion values against the means. If the latest proportion is off
by more than n standard deviations from its mean, the detector considers it to be an
anomaly. The sensitivity of the detector can be configured by changing the value of n,
which defaults to 2.
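The comparison against the historical means can be sketched as below. The function name `anomalous_paths`, the dictionary representation of a path distribution, and the use of the population standard deviation are assumptions for illustration. The sketch also assumes a non-empty history.

```python
from statistics import mean, pstdev


def anomalous_paths(history, latest, n=2.0):
    """history: past distributions, each a dict of path -> request proportion.
    latest: the most recent distribution. Returns the paths flagged as anomalous."""
    all_paths = set(latest) | {p for dist in history for p in dist}
    flagged = []
    for path in sorted(all_paths):
        past = [dist.get(path, 0.0) for dist in history]
        mu, sigma = mean(past), pstdev(past)
        # Flag the path if its latest proportion is more than n standard
        # deviations away from its historical mean.
        if abs(latest.get(path, 0.0) - mu) > n * sigma:
            flagged.append(path)
    return flagged
```

A path that never appeared before has a historical mean and deviation of zero, so any nonzero proportion for it is flagged.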
The path distribution analyzer lets developers know when the nature of their application
request workload changes. For example, in the previous data management application,
if suddenly 90% of the requests start executing the read-write path, the Roots
path distribution analyzer will detect the change. Similarly it is also able to detect when
new paths of execution are being invoked by requests (a form of novelty detection).
5.3.3 Workload Change Analyzer
Performance anomalies can arise either due to bottlenecks in the cloud platform or
changes in the application workload. When Roots detects a performance anomaly (i.e.
an application failing to meet its performance SLO), it needs to be able to determine
whether the failure is due to an increase in workload or a bottleneck that has suddenly
manifested. To check if the workload of an application has changed recently, Roots uses a
workload change analyzer. This Roots component is implemented as an anomaly handler,
which gets executed every time an anomaly detector identifies a performance anomaly.
Note that this is different from the path distribution analyzer, which is implemented as
an anomaly detector. While the path distribution analyzer looks for changes in the type
of the workload, the workload change analyzer looks for changes in the workload size or
rate. In other words, it determines if the target application has received more requests
than usual, which may have caused a performance degradation.
The workload change analyzer uses change point detection algorithms to analyze the
historical trend of the application workload. We use the “number of requests per unit
time” as the metric of workload size. Our implementation of Roots supports a number
of well known change point detection algorithms (PELT [139], binary segmentation and
CL method [140]), any of which can be used to detect level shifts in the workload size.
Algorithms like PELT favor long lasting shifts (plateaus) in the workload trend, over
momentary spikes. We expect momentary spikes to be fairly common in workload data.
But it is the plateaus that cause request buffers to fill up, and consume server-side
resources for extended periods of time, thus causing noticeable performance anomalies.
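To make the idea of a level shift concrete, the sketch below finds the single split point that best separates a request-rate series into two segments with different means. It is a deliberately simplified stand-in (a single least-squares split, the basic step that binary segmentation repeats), not PELT or the CL method, and the function name and `min_segment` parameter are assumptions. Requiring a minimum segment length is one simple way to ignore momentary spikes.

```python
def best_level_shift(series, min_segment=3):
    """Return (index, gain) for the split that most reduces the squared error
    when each side is modeled by its own mean, or None if the series is too short."""

    def sse(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    if len(series) < 2 * min_segment:
        return None
    total = sse(series)
    best = None
    for i in range(min_segment, len(series) - min_segment + 1):
        gain = total - (sse(series[:i]) + sse(series[i:]))
        if best is None or gain > best[1]:
            best = (i, gain)
    return best
```

For a series such as `[10, 11, 10, 9, 10, 30, 31, 29, 30, 30]` requests per unit time, the best split is at index 5, where the workload moves to a new plateau.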
5.3.4 Bottleneck Identification
Applications running in the cloud consist of user code executed in the application
server, and remote service calls to various PaaS kernel services. An AppScale cloud con-
sists of the same kernel services present in the Google App Engine public cloud (datastore,
memcache, urlfetch, blobstore, user management etc.). We consider each PaaS kernel in-
vocation, and the code running on the application server as separate components. Each
application request causes one or more components to execute, and any one of the com-
ponents can become a bottleneck to cause performance anomalies. The purpose of bot-
tleneck identification is to find, out of all the components executed by an application, the
one component that is most likely to have caused application performance to deteriorate.
Suppose an application makes $n$ PaaS kernel invocations ($X_1, X_2, \ldots, X_n$) for each
request. For any given application request, Roots captures the time spent on each kernel
invocation ($T_{X_1}, T_{X_2}, \ldots, T_{X_n}$), and the total response time ($T_{total}$) of the request. These
time values are related by the formula $T_{total} = T_{X_1} + T_{X_2} + \ldots + T_{X_n} + r$, where $r$ is
all the time spent in the resident application server executing user code (i.e. the time
spent not executing PaaS kernel services). $r$ is not directly measured in Roots, since that
requires code instrumentation. However, in previous work [134] we showed that typical
PaaS-hosted web applications spend most of their time invoking PaaS kernel services. We
make use of these findings, and assert that for typical, well-designed PaaS applications
$r \ll T_{X_1} + T_{X_2} + \ldots + T_{X_n}$.
The Roots bottleneck identification mechanism first selects up to four components as
possible candidates for the bottleneck. These candidates are then further evaluated by a
weighting algorithm to determine the actual bottleneck in the cloud platform.
Relative Importance of PaaS Kernel Invocations
The purpose of this metric is to find the component that is contributing the most
towards the variance in the total response time. We select a window W in time which
includes a sufficient number of application requests, and ending at the point when the
performance anomaly was detected. Note that for each application request in $W$, we can
fetch the total response time ($T_{total}$), and the time spent on individual PaaS kernel services
($T_{X_n}$) from the Roots data storage. We take all these $T_{total}$ values and the corresponding
$T_{X_n}$ values in $W$, and fit a linear model of the form $T_{total} = T_{X_1} + T_{X_2} + \ldots + T_{X_n}$ using
linear regression. Here we leave $r$ out deliberately, since it is typically and ideally small.
Occasionally in AppScale, we observe a request where $r$ is large relative to $T_{X_n}$. Often
these rare events are correlated with large $T_{X_n}$ values as well leading us to suspect that
the effect may be due to an issue with the AppScale infrastructure (e.g. a major garbage
collection event in the PaaS software). Overall, Roots detects these events, and identifies
them correctly (as explained below), but they perturb the linear regression model. To
prevent that, we filter out requests where the $r$ value is too high. This is done by
computing the mean ($\mu_r$) and standard deviation ($\sigma_r$) of $r$ over the selected window, and
removing any requests where $r > \mu_r + 1.65\sigma_r$.
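The filtering step can be sketched as follows. Here $r$ is derived for each request as the total response time minus the sum of its kernel invocation times, following the formula for $T_{total}$. The function name, the tuple layout of a request, and the choice of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def filter_large_r(requests, z=1.65):
    """requests: list of (total_ms, [kernel invocation times in ms]).
    Drops requests whose residual r exceeds mean(r) + z * stdev(r)."""
    residuals = [total - sum(calls) for total, calls in requests]
    cutoff = mean(residuals) + z * pstdev(residuals)
    return [req for req, r in zip(requests, residuals) if r <= cutoff]
```

Only requests with an unusually large residual are removed, so the regression is still fitted on the bulk of the window.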
Once the regression model has been computed, we run a relative importance
algorithm [141] to rank each of the regressors (i.e. $T_{X_n}$ values) based on their contribution to
the variance of $T_{total}$. We use the LMG method [142] which is resistant to
multicollinearity, and provides a break down of the $R^2$ value of the regression according to how strongly
each regressor influences the variance of the dependent variable. The relative importance
values of the regressors add up to the $R^2$ of the linear regression. We consider $1 - R^2$
(the portion of variance in $T_{total}$ not explained by the PaaS kernel invocations) as the
relative importance of r. The component associated with the highest ranked regressor
(i.e. highest relative importance) is chosen as a bottleneck candidate. Statistically, this
is the component that causes the application response time to vary the most.
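For reference, the LMG measure is commonly defined as the average, over all orderings of the regressors, of the increase in $R^2$ obtained when a regressor is added to the ones that precede it. The notation below ($\Pi_n$ for the set of orderings of the $n$ regressors, $S_k(\pi)$ for the set of regressors that precede $T_{X_k}$ in ordering $\pi$) is introduced here for illustration and does not appear in the original text.

```latex
\mathrm{LMG}(T_{X_k}) = \frac{1}{n!} \sum_{\pi \in \Pi_n}
  \left[ R^2\bigl(S_k(\pi) \cup \{T_{X_k}\}\bigr) - R^2\bigl(S_k(\pi)\bigr) \right],
\qquad
\sum_{k=1}^{n} \mathrm{LMG}(T_{X_k}) = R^2
```

The second identity is the property used above: the shares of the regressors sum to $R^2$, which leaves $1 - R^2$ to be attributed to $r$.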
Changes in Relative Importance
Next we divide the time window W into equal-sized segments, and compute the
relative importance metrics for regressors within each segment. We also compute the
relative importance of r within each segment. This way we obtain a time series of
relative importance values for each regressor and r. These time series represent how the
relative importance of each component has changed over time.
We subject each relative importance time series to change point analysis to detect
if the relative importance of any particular variable has increased recently. If such a
variable can be found, then the component associated with that variable is also a po-
tential candidate for the bottleneck. The candidate selected by this method represents
a component whose performance has been stable in the past, and has become variable
recently.
High Quantiles
Next we analyze the individual distributions of $T_{X_n}$ and $r$. Recall that for each PaaS
kernel invocation $X_k$, we have a distribution of $T_{X_k}$ values in the window $W$. Similarly we
can also extract a distribution of $r$ values from $W$. Out of all the available distributions we
find the one whose quantile values are the largest. Specifically, we compute a high quantile
(e.g. 0.99 quantile) for each distribution. The component whose distribution contains
the largest quantile value is chosen as another potential candidate for the bottleneck.
This component can be considered to have high latency in general.
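A minimal sketch of the high quantile selection follows. The nearest-rank quantile definition, the function names, and the dictionary layout (component name mapped to the latencies observed in the window) are assumptions for illustration, since the quantile estimator used by Roots is not specified here.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile: the smallest observation such that at least
    a fraction q of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def high_quantile_candidate(latencies, q=0.99):
    """latencies: dict of component -> list of times observed in the window."""
    return max(latencies, key=lambda c: quantile(latencies[c], q))
```

The residual $r$ can be included as just another entry in the dictionary, so that application code competes with the kernel services on equal terms.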
Tail End Values
Finally, Roots analyzes each $T_{X_k}$ and $r$ distribution to identify the one with the
largest tail values with respect to a particular high quantile. For each maximum (tail
end) latency value $t$, we compute the metric $P^q_t$ as the percentage difference between
$t$ and a target quantile $q$ of the corresponding distribution. We set $q$ to 0.99 in our
experiments. Roots selects the component with the distribution that has the largest
$P^q_t$ as another potential bottleneck candidate. This method identifies candidates that
contain rare, high-valued outliers (point anomalies) in their distributions.
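The tail end metric can be sketched as below. Two details are assumptions made for the example: the percentage difference is taken relative to the quantile value, and the quantile is computed with the nearest-rank rule. The function names are hypothetical, and latencies are assumed to be positive.

```python
import math


def tail_metric(values, q=0.99):
    """Percentage by which the maximum exceeds the q quantile of the values."""
    ordered = sorted(values)
    target = ordered[max(1, math.ceil(q * len(ordered))) - 1]  # nearest-rank quantile
    return 100.0 * (ordered[-1] - target) / target


def tail_candidate(latencies, q=0.99):
    """latencies: dict of component -> list of times observed in the window."""
    return max(latencies, key=lambda c: tail_metric(latencies[c], q))
```

With the nearest-rank rule and $q = 0.99$, the maximum can only differ from the quantile when a distribution has more than 100 samples, so this metric is informative only for reasonably large windows.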
Selecting Among the Candidates
The above four methods may select up to four candidate components for the bot-
tleneck. We designate the candidate chosen by a majority of methods as the actual
bottleneck. Ties are broken by assigning more priority to the candidate chosen by the
relative importance method.
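The final selection step can be sketched as a simple vote. The function name and argument names are hypothetical, and interpreting "majority" as the candidate with the most votes is an assumption made for this example.

```python
from collections import Counter


def select_bottleneck(relative_importance, importance_change, high_quantile, tail_end):
    """Each argument is the candidate chosen by one method, or None if that
    method did not produce a candidate."""
    picks = (relative_importance, importance_change, high_quantile, tail_end)
    votes = Counter(c for c in picks if c is not None)
    if not votes:
        return None
    top = max(votes.values())
    leaders = [c for c, count in votes.items() if count == top]
    if len(leaders) == 1:
        return leaders[0]
    # Tie: give priority to the candidate chosen by the relative importance method.
    if relative_importance in leaders:
        return relative_importance
    return leaders[0]
```

For example, if two methods pick the datastore and two pick memcache, the result is whichever of the two the relative importance method chose.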
5.4 Results
We evaluate the efficacy of Roots as a performance monitoring and root cause analysis
system for PaaS applications. To do so, we consider its ability to identify and characterize
SLO violations. For violations that are not caused by a change in workload, we evaluate
Roots’ ability to identify the PaaS component that is the cause of the performance
anomaly. We also evaluate the Roots path distribution analyzer, and its ability to identify
execution paths along with changes in path distributions. Finally, we investigate the
performance and scalability of the Roots prototype.
Faulty Service     L1 (30ms)   L2 (35ms)   L3 (45ms)
datastore          18          11          10
user management    19          15          10
Table 5.1: Number of anomalies detected in guestbook app under different SLOs (L1,
L2 and L3) when injecting faults into two different PaaS kernel services.
5.4.1 Anomaly Detection: Accuracy and Speed
To begin the evaluation of the Roots prototype we experiment with the SLO-based
anomaly detector, using a simple HTML-producing Java web application called “guestbook”.
This application allows users to log in, and post comments. It uses the AppScale
datastore service to save the posted comments, and the AppScale user management ser-
vice to handle authentication. Each request processed by guestbook results in two PaaS
kernel invocations – one to check if the user is logged in, and another to retrieve the
existing comments from the datastore. We conduct all our experiments on a single node
AppScale cloud except where specified. The node itself is an Ubuntu 14.04 VM with 4
virtual CPU cores (clocked at 2.4GHz), and 4GB of memory.
We run the SLO-based anomaly detector on guestbook with a sampling rate of 15
seconds, an analysis rate of 60 seconds, and a window size of 1 hour. We set the minimum
sample count to 100, and run a series of experiments with different SLOs on the guestbook
application. Specifically, we fix the SLO success probability at 95%, and set the response
time upper bound to $\mu_g + n\sigma_g$, where $\mu_g$ and $\sigma_g$ represent the mean and standard deviation of
the guestbook’s response time. We learn these two parameters a priori by benchmarking
the application. Then we obtain three different upper bound values for the guestbook’s
response time by setting n to 2, 3 and 5. We denote the resulting three SLOs L1, L2 and
L3 respectively.
We also inject performance faults into AppScale by modifying its code to cause the
datastore service to be slow to respond. This fault injection logic activates once every
hour, and slows down all datastore invocations by 45ms over a period of 3 minutes. We
chose 45ms because it is equal to $\mu_g + 5\sigma_g$ for the guestbook deployment under test.
Therefore this delay is sufficient to violate all three SLOs used in our experiments. We
run a similar set of experiments where we inject faults into the user management service
of AppScale. Each experiment is run for a period of 10 hours.
Table 5.1 shows how the number of anomalies detected by Roots in a 10 hour period
varies when the SLO is changed. The number of anomalies drops noticeably when the
response time upper bound is increased. When the L3 SLO (45ms) is used, the only
anomalies detected are the ones caused by our hourly fault injection mechanism. As the
SLO is tightened by lowering the upper bound, Roots detects additional anomalies. These
additional anomalies result from a combination of injected faults, and other naturally
occurring faults in the system. That is, Roots detected some naturally occurring faults
(temporary spikes in application latency), when a number of injected faults were still in
the sliding window of the anomaly detector. Together these two types of faults caused
SLO violations, usually several minutes after the fault injection period has expired.
Next we analyze how fast Roots can detect anomalies in an application. We first
consider the performance of guestbook under the L1 SLO while injecting faults into the
datastore service. Figure 5.4 shows anomalies detected by Roots as events on a time
line. The horizontal axis represents passage of time. The red arrows indicate the start
of a fault injection period, where each period lasts up to 3 minutes. The blue arrows
indicate the Roots anomaly detection events. Note that every fault injection period is
immediately followed by an anomaly detection event, implying near real time reaction
from Roots, except in the case of the fault injection window at 20:00 hours. Roots detected
a naturally occurring anomaly (i.e. one that we did not explicitly inject, but nonetheless
caused an SLO violation) at 19:52 hours, which caused the anomaly detector to go into
the warm up mode. Therefore Roots did not immediately react to the faults injected at
Figure 5.4: Anomaly detection in guestbook application during a period of 10 hours
(timeline from 13:00 to 23:00). Red arrows indicate fault injection at the datastore
service. Blue arrows indicate all anomalies detected by Roots during the experimental run.
Figure 5.5: Anomaly detection in guestbook application during a period of 10 hours
(timeline from 01:00 to 11:00). Red arrows indicate fault injection at the user management
service. Blue arrows indicate all anomalies detected by Roots during the experimental run.
20:00 hours. But as soon as the detector became active again at 20:17, it detected the
anomaly.
Figure 5.5 shows the anomaly detection time line for the same application and SLO,
while faults are being injected into the user management service. Here too we see that
Roots detects anomalies immediately following each fault injection window.
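The detection behavior discussed above (an SLO check over a sliding window, followed by a warm up phase after each detection) can be illustrated with a minimal sketch. This is not the Roots implementation: the class name, the parameters, and the choice to discard the window after a detection are assumptions made for illustration only.

```python
from collections import deque


class SloAnomalyDetector:
    """Toy SLO-based detector with a sliding window and a warm up phase.

    Hypothetical sketch: names and the reset-on-detection policy are assumed,
    not taken from the Roots code base.
    """

    def __init__(self, latency_bound_ms, success_rate, window_size, min_samples):
        self.latency_bound_ms = latency_bound_ms
        self.success_rate = success_rate
        self.min_samples = min_samples
        self.window = deque(maxlen=window_size)

    def observe(self, latency_ms):
        """Record one latency sample; return True if an anomaly is flagged."""
        self.window.append(latency_ms)
        if len(self.window) < self.min_samples:
            return False  # warm up: not enough samples to judge the SLO
        within_bound = sum(1 for x in self.window if x <= self.latency_bound_ms)
        if within_bound / len(self.window) >= self.success_rate:
            return False
        # SLO violated: report it, then drop history and warm up again.
        self.window.clear()
        return True
```

With a 177ms bound and a 95% success rate, a detector of this kind stays silent while it refills its window, which mirrors why faults injected during a warm up period are only reported once the detector becomes active again.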
5.4.2 Path Distribution Analyzer: Accuracy and Speed
Next we evaluate the effectiveness and accuracy of the path distribution analyzer.
For this we employ two different applications.
key-value store: This application provides the functionality of an online key-value store.
It allows users to store data objects in the cloud where each object is assigned a
unique key. The objects can then be retrieved, updated or deleted using their
keys. Different operations (create, retrieve, update and delete) are implemented as
separate paths of execution in the application.
cached key-value store: This is a simple extension of the regular key-value store, which
adds caching to the read operation using AppScale’s memcache service. The
application contains separate paths of execution for cache hits and cache misses.
[Figure 5.6 image omitted. Horizontal axis: Time (HH:mm), 14:00 to 00:00. Legend: Anomalous workload injection, Anomaly detection.]
Figure 5.6: Anomaly detection in key-value store application during a period of 10
hours. Steady-state traffic is read-heavy. Red arrows indicate injection of write-heavy
bursts. Blue arrows indicate all the anomalies detected by the path distribution
analyzer.
[Figure 5.7 image omitted. Horizontal axis: Time (HH:mm), 01:00 to 11:00. Legend: Anomalous workload injection, Anomaly detection.]
Figure 5.7: Anomaly detection in cached key-value store application during a period
of 10 hours. Steady-state traffic is mostly served from the cache. Red arrows indicate
injection of cache-miss bursts. Blue arrows indicate all the anomalies detected by the
path distribution analyzer.
We first deploy the key-value store on AppScale, and populate it with a number of
data objects. Then we run a test client against it which generates a read-heavy workload.
On average this workload consists of 90% read requests and 10% write requests. The
test client is also programmed to randomly send bursts of write-heavy workloads. These
bursts consist of 90% write requests on average, and each burst lasts up to 2 minutes.
Figure 5.6 shows the write-heavy bursts as events on a time line (indicated by red arrows).
Note that almost every burst is immediately followed by an anomaly detection event
(indicated by blue arrows). The only time we do not see an anomaly detection event
is when multiple bursts are clustered together in time (e.g. 3 bursts between 17:04 and
17:24 hours). In this case Roots detects the very first burst, and then goes into the warm
up mode to collect more data. Between 20:30 and 21:00 hours we also had two instances
where the read request proportion dropped from 90% to 80% due to random chance.
This is because our test client randomizes the read request proportion around the 90%
mark. Roots identified these two incidents also as anomalous.
We conduct a similar experiment using the cached key-value store. Here, we run a
test client that generates a workload that is mostly served from the cache. This is done
by repeatedly executing read requests on a small selected set of object keys. However,
the client randomly sends bursts of traffic requesting keys that are not likely to be in the
application cache, thus resulting in many cache misses. Each burst lasts up to 2 minutes.
As shown in figure 5.7, Roots path distribution analyzer correctly detects the change in
the workload (from many cache hits to many cache misses), nearly every time the test
client injects a burst of traffic that triggers the cache miss path of the application. The
only exception is when multiple bursts are clumped together, in which case only the first
raises an alarm in Roots.
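The idea of comparing the share of requests on each execution path against a steady-state baseline can be sketched as follows. This is an illustrative stand-in, not the analyzer used by Roots: the function names, the use of total variation distance, and the 0.05 threshold are all assumptions.

```python
from collections import Counter


def path_distribution(paths):
    """Return the fraction of requests observed on each execution path."""
    counts = Counter(paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


def distribution_shift(baseline, current):
    """Total variation distance between two path distributions (0 to 1)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


def is_path_anomaly(baseline, current, threshold=0.05):
    """Flag a change when the shift exceeds a (hypothetical) threshold."""
    return distribution_shift(baseline, current) > threshold


# Steady state: 90% reads, 10% writes; burst: 90% writes.
baseline = path_distribution(["read"] * 90 + ["write"] * 10)
burst = path_distribution(["read"] * 10 + ["write"] * 90)
drift = path_distribution(["read"] * 80 + ["write"] * 20)
steady = path_distribution(["read"] * 89 + ["write"] * 11)
```

Under these assumed settings both the write-heavy burst and a drift from 90% to 80% reads are flagged, while a small fluctuation around the 90% mark is not.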
5.4.3 Workload Change Analyzer Accuracy
Next we evaluate the Roots workload change analyzer. In this experiment we run a
varying workload against the key-value store application for 10 hours. The load generating
client is programmed to maintain a mean workload level of 500 requests per minute.
However, the client is also programmed to randomly send large bursts of traffic at times of
its choosing. During these bursts the client may send more than 1000 requests a minute,
thus impacting the performance of the application server that hosts the key-value store.
Figure 5.8 shows how the application workload has changed over time. The workload
generator has produced 6 large bursts of traffic during the period of the experiment,
which appear as tall spikes in the plot. Note that each burst is immediately followed by
a Roots anomaly detection event (shown by red dashed lines). In each of these 6 cases,
the increase in workload caused a violation of the application performance SLO. Roots
detected the corresponding anomalies, and determined them to be caused by changes in
the workload size. As a result, bottleneck identification was not triggered for any of these
anomalies. Even though the bursts of traffic appear to be momentary spikes, each burst
lasts for 4 to 5 minutes thereby causing a lasting impact on the application performance.
[Figure 5.8 image omitted. Horizontal axis: Time (hh:mm), 13:00 to 21:00. Vertical axis: Requests per minute, 500 to 2000.]
Figure 5.8: Workload size over time for the key-value store application. The test client
randomly sends large bursts of traffic causing the spikes in the plot. Roots anomaly
detection events are shown in red dashed lines.
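To make the workload check concrete, the sketch below flags a shift in the per-minute request rate. It uses a simple mean-shift test as a stand-in for change point analysis; the function name, the 5 minute tail and the threshold of 3 standard deviations are assumptions, not parameters taken from Roots.

```python
from statistics import mean, stdev


def workload_changed(rates, recent=5, z_threshold=3.0):
    """Return True when the mean of the last `recent` per-minute request rates
    lies more than `z_threshold` baseline standard deviations away from the
    mean of the earlier samples."""
    if len(rates) < recent + 2:
        raise ValueError("need at least recent + 2 samples")
    baseline, tail = rates[:-recent], rates[-recent:]
    spread = stdev(baseline) or 1.0  # guard against a perfectly flat baseline
    return abs(mean(tail) - mean(baseline)) / spread > z_threshold


# About 500 requests per minute, then a burst above 1000 lasting 5 minutes.
steady = [500, 510, 490, 505, 495] * 6
burst = steady[:-5] + [1100, 1200, 1150, 1050, 1100]
```

Here `workload_changed(steady)` is False and `workload_changed(burst)` is True, so an anomaly raised during the burst would be attributed to the workload rather than to a bottleneck.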
5.4.4 Bottleneck Identification Accuracy
Next we evaluate the bottleneck identification capability of Roots. We first discuss
the results obtained using the guestbook application, and follow with results obtained
using a more complex application. In the experimental run illustrated in figure 5.4, Roots
determined that all the detected anomalies except for one were caused by the AppScale
datastore service. This is consistent with our expectations since in this experiment we
artificially inject faults into the datastore. The only anomaly that is not traced back to
the datastore service is the one that was detected at 14:32 hours. This is indicated by
the blue arrow with a small square marker at the top. For this anomaly, Roots concluded
that the bottleneck is the local execution at the application server (r). We have verified
this result by manually inspecting the AppScale logs, and traces of data collected
by Roots. As it turns out, between 14:19 and 14:22 the application server hosting the
guestbook application experienced some problems, which caused request latency to increase
significantly. Therefore we can conclude that Roots has correctly identified the
root causes of all 18 anomalies in this experimental run including one that we did not
inject explicitly.
[Figure 5.9 image omitted. Horizontal axis: Time (hh:mm), 00:00 to 10:00. Legend: Fault injection, Anomaly detection.]
Figure 5.9: Anomaly detection in stock-trader application during a period of 10 hours.
Red arrows indicate fault injection at the 1st datastore query. Blue arrows indicate
all anomalies detected by Roots during the experimental run.
Similarly, in the experiment shown in figure 5.5, Roots determined that all the anomalies
are caused by the user management service, except in one instance. This is again
in line with our expectations since in this experiment we inject faults into the user management
service. For the anomaly detected at 04:30 hours, Roots determined that local
execution time is the primary bottleneck. As before, we have manually verified this
diagnosis to be accurate. In this case too the server hosting the guestbook application
became slow during the 04:23 – 04:25 time window, and Roots correctly identified the
bottleneck as the local application server.
In order to evaluate how the bottleneck identification performs when an application
makes more than 2 PaaS kernel invocations, we conduct another experiment using an
application called “stock-trader”. This application allows setting up organizations, and
simulating trading of stocks between the organizations. The two main operations in this
application are buy and sell. Each of these operations makes 8 calls to the AppScale
datastore. According to our previous work [134], 8 kernel invocations in the same path of
execution is very rare in web applications developed for a PaaS cloud. The probability of
finding an execution path with more than 5 kernel invocations in a sample of PaaS-hosted
applications is less than 1%. Therefore the stock-trader application is a good extreme
case example to test the Roots bottleneck identification support. We execute a number
of experimental runs using this application, and here we present the results from two of
them. In all experiments we configure the anomaly detector to check for the response
time SLO of 177ms with 95% success probability.
[Figure 5.10 image omitted. Horizontal axis: Time (HH:mm), 03:00 to 13:00. Legend: Fault injection, Anomaly detection.]
Figure 5.10: Anomaly detection in stock-trader application during a period of 10
hours. Red arrows indicate fault injection at the 2nd datastore query. Blue arrows
indicate all anomalies detected by Roots during the experimental run.
In one of our experimental runs we inject faults into the first datastore query executed
by the buy operation of stock-trader. The fault injection logic runs every two hours, and
lasts for 3 minutes. The duration of the full experiment is 10 hours. Figure 5.9 shows the
resulting event sequence. Note that every fault injection event is immediately followed
by a Roots anomaly detection event. There are also four additional anomalies in the time
line which were SLO violations caused by a combination of injected faults, and naturally
occurring faults in the system. For all the anomalies detected in this test, Roots correctly
selected the first datastore call in the application code as the bottleneck. The additional
four anomalies occurred when a large number of injected faults were still in the sliding
window of the detector. Therefore, it is accurate to attribute those anomalies also to the
first datastore query of the application.
Figure 5.10 shows the results from a similar experiment where we inject faults into
the second datastore query executed by the buy operation. Here also Roots detects all the
artificially induced anomalies along with a few extras. All the anomalies, except for one,
are determined to be caused by the second datastore query of the buy operation. The
anomaly detected at 08:56 (marked with a square on top of the blue arrow) is attributed
to the fourth datastore query executed by the application. We have manually verified
this diagnosis to be accurate. Since 08:27, when the previous anomaly was detected, the
fourth datastore query has frequently taken a long time to execute (again, on its own),
which resulted in an SLO violation at 08:56 hours.
In the experiments illustrated in figures 5.4, 5.5, 5.9, and 5.10 we maintain the application
request rate steady throughout the 10 hour periods. Therefore, the workload
change analyzer of Roots did not detect any significant shifts in the workload level. Consequently,
none of the anomalies detected in these 4 experiments were attributed to a
workload change. The bottleneck identification was therefore triggered for each anomaly.
To evaluate the agreement level among the four bottleneck candidate selection methods,
we analyze 407 anomalies detected by Roots over a period of 3 weeks. We report
that except on 13 instances, in all the remaining cases 2 or more candidate selection
methods agreed on the final bottleneck component chosen. This implies that most of the
time (96.8%) Roots identifies bottlenecks with high confidence.
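The agreement figure can be reproduced with a short calculation, and the notion of 2 or more methods agreeing can be sketched as a simple vote. The voting function below is a hypothetical illustration; how Roots actually combines its four candidate selection methods is not assumed here, and the component names are only examples.

```python
from collections import Counter


def select_bottleneck(candidates):
    """Return (component, votes, agreed) for the candidates proposed by the
    selection methods; `agreed` is True when 2 or more methods concur."""
    component, votes = Counter(candidates).most_common(1)[0]
    return component, votes, votes >= 2


# Agreement rate reported in the text: all but 13 of 407 anomalies.
total_anomalies = 407
disagreements = 13
agreement_pct = round(100 * (total_anomalies - disagreements) / total_anomalies, 1)
```

For example, `select_bottleneck(["datastore", "datastore", "user management", "local"])` picks the datastore with 2 votes, and `agreement_pct` evaluates to 96.8.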
[Figure 5.11 image omitted. Horizontal axis: Time (hh:mm), 13:00 to 23:00. Legend: G4 anomaly, G6 anomaly, G7 anomaly, Fault injection.]
Figure 5.11: Anomaly detection in 8 applications deployed in a clustered AppScale
cloud. Red arrows indicate fault injection at the datastore service for queries generated
from a specific host. Cross marks indicate all the anomalies detected by Roots during
the experiment.
5.4.5 Multiple Applications in a Clustered Setting
To demonstrate how Roots can be used in a multi-node environment, we set up an
AppScale cloud on a cluster of 10 virtual machines (VMs). VMs are provisioned by a
Eucalyptus (IaaS) cloud, and each VM has 2 CPU cores and 2GB of memory.
Then we proceed to deploy 8 instances of the guestbook application on AppScale. We use
the multitenant support in AppScale to register each instance of guestbook as a different
application (named G1 through G8). Each instance is hosted on a separate application
server instance, has its own private namespace on the AppScale datastore, and can be
accessed via a unique URL. We disable auto-scaling support in the AppScale cloud, and
inject faults into the datastore service of AppScale in such a way that queries issued
from a particular VM are processed with a 100ms delay. We identify this VM by its
IP address in our test environment, and shall refer to it as Vf in the discussion. We
trigger the fault injection every 2 hours, and when activated it lasts for up to 5 minutes.
Then we monitor the applications using Roots for a period of 10 hours. Each anomaly
detector is configured to check for the 75ms response time SLO with 95% success rate.
ElasticSearch, Logstash and the Roots pod are deployed on a separate VM.
Figure 5.11 shows the resulting event sequence. Note that we detect anomalies in
3 applications (G4, G6 and G7) immediately after each fault injection. Inspecting the
topology of our AppScale cloud revealed that these were the only 3 applications that
were hosted on Vf . As a result, the bi-hourly fault injection caused their SLOs to get
violated. Other applications did not exhibit any SLO violations since we are monitoring
against a very high response time upper bound.
In each case Roots detected the SLO violations 2-3 minutes into the fault injection
period. As soon as that happened, the anomaly detectors of G4, G6 and G7 entered
the warm up mode. But our fault injection logic kept injecting faults for at least 2 more
minutes. Therefore when the anomaly detectors reactivated after 25 minutes (time to
collect the minimum sample count), they each detected another SLO violation. As a
result, we see another set of detection events approximately half an hour after the fault
injection events.
5.4.6 Results Summary
We conclude our discussion of Roots efficacy with a summary of our results. Table 5.2
provides an overview of all the results presented so far, broken down into four features
that we wish to see in an anomaly detection and bottleneck identification system.
Feature                       Results Observed in Roots
Detecting anomalies           All the artificially induced anomalies were detected, except
                              when multiple anomalies are clustered together in time. In
                              that case only the first anomaly was detected. Roots also
                              detected several anomalies that occurred due to a combination
                              of injected faults, and natural faults.
Characterizing anomalies as   When anomalies were induced by varying the application
being due to workload         workload, Roots correctly determined that the anomalies were
changes or bottlenecks        caused by workload changes. In all other cases we kept the
                              workload steady, and hence the anomalies were attributed to a
                              system bottleneck.
Identifying correct           In all the cases where bottleneck identification was performed,
bottleneck                    Roots correctly identified the bottleneck component.
Reaction time                 All the artificially induced anomalies (SLO violations) were
                              detected as soon as enough samples of the fault were taken by
                              the benchmarking process (2-5 minutes from the start of the
                              fault injection period).
Path distribution             All the artificially induced changes to the path distribution
                              were detected.
Table 5.2: Summary of Roots efficacy results.
5.4.7 Roots Performance and Scalability
Next we evaluate the performance overhead incurred by Roots on the applications
deployed in the cloud platform. We are particularly interested in understanding the
overhead of recording the PaaS kernel invocations made by each application, since this
feature requires some changes to the PaaS kernel implementation. We deploy a number
of applications on a vanilla AppScale cloud (with no Roots), and measure their request
latencies. We use the popular Apache Bench tool to measure the request latency under
a varying number of concurrent clients. We then take the same measurements on an
AppScale cloud with Roots, and compare the results against the ones obtained from the
vanilla AppScale cloud. In both environments we disable the auto-scaling support of
AppScale, so that all client requests are served from a single application server instance.
In our prototype implementation of Roots, the kernel invocation events get buffered in
the application server before they are sent to the Roots data storage. We wish to explore
how this feature performs when the application server is under heavy load.
                        Without Roots          With Roots
App./Concurrency        Mean (ms)   SD         Mean (ms)   SD
guestbook/1             12          3.9        12          3.7
guestbook/50            375         51.4       374         53
stock-trader/1          151         13         145         13.7
stock-trader/50         3631        690.8      3552        667.7
kv store/1              7           1.5        8           2.2
kv store/50             169         26.7       150         25.4
cached kv store/1       3           2.8        2           3.3
cached kv store/50      101         24.8       97          35.1
Table 5.3: Latency comparison of applications when running on a vanilla AppScale
cloud vs when running on a Roots-enabled AppScale cloud.
Table 5.3 shows the comparison of request latencies. We discover that Roots does
not add a significant overhead to the request latency in any of the scenarios considered.
In all the cases, the mean request latency when Roots is in use, is within one standard
deviation from the mean request latency when Roots is not in use. The request latency
increases when the number of concurrent clients is increased from 1 to 50 (since all
requests are handled by a single application server), but still there is no sign of any
detrimental overhead from Roots even under load.
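The within-one-standard-deviation claim can be checked directly against the figures in Table 5.3. The snippet below is only a verification aid; the dictionary layout and function name are our own, and the standard deviation used is the one measured without Roots, as the text states.

```python
# (mean ms, SD) without Roots and (mean ms, SD) with Roots, copied from Table 5.3.
LATENCY_MS = {
    "guestbook/1": ((12, 3.9), (12, 3.7)),
    "guestbook/50": ((375, 51.4), (374, 53)),
    "stock-trader/1": ((151, 13), (145, 13.7)),
    "stock-trader/50": ((3631, 690.8), (3552, 667.7)),
    "kv store/1": ((7, 1.5), (8, 2.2)),
    "kv store/50": ((169, 26.7), (150, 25.4)),
    "cached kv store/1": ((3, 2.8), (2, 3.3)),
    "cached kv store/50": ((101, 24.8), (97, 35.1)),
}


def within_one_sd(without_roots, with_roots):
    """True when the mean with Roots is within one SD of the mean without Roots."""
    (mean_without, sd_without), (mean_with, _) = without_roots, with_roots
    return abs(mean_with - mean_without) <= sd_without


results = {app: within_one_sd(*row) for app, row in LATENCY_MS.items()}
```

Every entry of `results` is True; the largest relative gap is kv store/50, where the means differ by 19 ms against a standard deviation of 26.7 ms.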
Finally, to demonstrate how lightweight and scalable Roots is, we deploy a Roots
pod on a virtual machine with 4 CPU cores and 4GB memory. To simulate monitoring
multiple applications, we run multiple concurrent anomaly detectors in the pod. Each
detector is configured with a 1 hour sliding window. We vary the number of concurrent
detectors between 100 and 10000, and run each configuration for 2 hours. We track the
memory and CPU usage of the pod during each of these runs using the jstat and pidstat
tools.
[Figure 5.12 image omitted. Horizontal axis: Number of Detectors (100, 1000, 10000). Vertical axes: Max Memory Usage (MB), 0 to 800; Max CPU Usage (%), 0 to 250. Series: Memory, CPU.]
Figure 5.12: Resource utilization of a Roots pod.
Figure 5.12 illustrates the maximum resource utilization of the Roots pod for different
counts of concurrent anomaly detectors. We see that with 10000 concurrent detectors,
the maximum CPU usage is 238%, where 400% is the available limit for 4 CPU cores.
The maximum memory usage in this case is only 778 MB. Since each anomaly detector
operates with a fixed-sized window, and they bring additional data into memory only
when required, the memory usage of the Roots pod generally stays low. We also experimented
with larger concurrent detector counts, and we were able to pack up to 40000
detectors into the pod before getting constrained by the CPU capacity of our VM. This
result implies that we can monitor tens of thousands of applications using a single pod,
thereby scaling up to a very large number of applications using only a handful of pods.
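As a rough sizing aid, the observed limit of 40000 detectors per pod translates into a pod count as sketched below. The function name and the assumption of one detector per application on a VM comparable to the one tested (4 CPU cores, 4GB memory) are ours; actual capacity will depend on the hardware and detector configuration.

```python
import math

# Upper limit observed on the 4-core, 4GB VM used in the experiment.
DETECTORS_PER_POD = 40000


def pods_needed(applications, detectors_per_app=1):
    """Estimate how many Roots pods are needed (hypothetical helper)."""
    return math.ceil(applications * detectors_per_app / DETECTORS_PER_POD)
```

Under these assumptions, `pods_needed(100000)` evaluates to 3.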
5.5 Related Work
Roots falls into the category of performance anomaly detection and bottleneck identification
(PADBI) systems. A PADBI system is an entity that observes, in real time,
the performance behaviors of a running system or application, while collecting vital
measurements at discrete time intervals to create baseline models of typical system behaviors
[133]. Such systems play a crucial role in achieving guaranteed service reliability,
performance and quality of service by detecting performance issues in a timely manner
before they escalate into major outages or SLO/SLA violations [143]. PADBI systems
are thoroughly researched, and well understood in the context of traditional standalone
and network applications. Many system administrators are familiar with frameworks like
Nagios [144], OpenNMS [145] and Zabbix [146] which can be used to collect data from
a wide range of applications and devices.
However, the paradigm of cloud computing, being relatively new, is yet to be fully
penetrated by PADBI systems research. The size, complexity and the dynamic nature of
cloud platforms make performance monitoring a particularly challenging problem. The
existing technologies like Amazon CloudWatch [147], New Relic [12] and DataDog [14]
facilitate monitoring cloud applications by instrumenting low level cloud resources (e.g.
virtual machines), and application code. But such technologies are either impracticable
or insufficient in PaaS clouds where the low level cloud resources are hidden under layers
of managed services, and the application code is executed in a sandboxed environment
that is not always amenable to instrumentation. When code instrumentation is possible,
it tends to be burdensome, error prone, and detrimental to the application’s performance.
Roots on the other hand is built into the fabric of the PaaS cloud giving it full visibility
into all the activities that take place in the entire software stack, and it does not require
application-level instrumentation.
Our work borrows heavily from the past literature [132, 133] that details the key
features of cloud APMs. Consequently, we strive to incorporate requirements like scalability,
autonomy and dynamic resource management into our design. Ibidunmoye et al.
highlight the importance of multilevel bottleneck identification as an open research question
[133]. This is the ability to identify bottlenecks from a set of top-level application
service components, and further down through the virtualization layer to system resource
bottlenecks. Our plan for Roots is highly in sync with this vision. We currently support
identifying bottlenecks from a set of kernel services provided by the PaaS cloud. As a
part of our future work, we plan to extend this support towards the virtualization layer
and the physical resources of the cloud platform.
Cherkasova et al. developed an online performance modeling technique to detect
anomalies in traditional transaction processing systems [148]. They divide time into
contiguous segments, such that within each segment the application workload (volume
and type of transactions) and resource usage (CPU) can be fit to a linear regression
model. Segments for which a model cannot be found are considered anomalous. Then
they remove anomalous segments from the history, and perform model reconciliation to
differentiate between workload changes and application problems. While this method is
powerful, it requires instrumenting application code to detect different external calls (e.g.
database queries) executed by the application. Since the model uses different transaction
types as parameters, some prior knowledge regarding the transactions also needs to be
fed into the system. The algorithm is also very compute intensive, due to continuous
segmentation and model fitting. In contrast, we use a very lightweight SLO monitoring
method in Roots to detect performance anomalies, and only perform heavy computations
for bottleneck identification.
Dean et al. implemented PerfCompass [149], an anomaly detection and localization
method for IaaS clouds. They instrument the VM operating system kernels to capture the
system calls made by user applications. Anomalies are detected by looking for unusual
increases in system call execution time. They group system calls into execution units
(processes, threads etc.), and analyze how many units are affected by any given anomaly.
Based on this metric they conclude if the problem was caused by a workload change or an
application level issue. We take a similar approach in Roots, in that we capture the PaaS
kernel invocations made by user applications. We use application response time (latency)
as an indicator of anomalies, and group PaaS kernel invocations into application requests
to perform bottleneck identification.
Nguyen et al. presented PAL, another anomaly detection and localization mechanism
targeting distributed applications deployed on IaaS clouds [150]. Similar to Roots, they
also use an SLO monitoring approach to detect application performance anomalies. When
an anomaly is detected, they perform change point analysis on gathered resource usage
data (CPU, memory and network) to identify the anomaly onset time. Having detected
one or more anomaly onset events in different components of the distributed application,
they sort the events by time to determine the propagation pattern of the anomaly.
Magalhaes and Silva have made significant contributions in the area of anomaly detection
and root cause analysis in web applications [151, 152]. They compute the correlation
between application workload and latency. If the level of correlation drops significantly,
they consider it to be an anomaly. A similar correlation analysis between workload and
other local system metrics (e.g. CPU and memory usage) is used to identify the system
resource that is responsible for a given anomaly. They also use an aspect-oriented
programming model in their target applications, which allows them to easily instrument
application code, and gather metrics regarding various remote services (e.g. database)
invoked by the application. This data is subjected to a series of simple linear regressions
to perform root cause analysis. This approach assumes that remote services are independent
of each other. However, in a cloud platform where kernel services are deployed in the
same shared infrastructure, this assumption might not hold true. Therefore we improve
on their methodology, and use multiple linear regression with relative importance to
identify cloud platform bottlenecks. Relative importance is resistant to multicollinearity,
and therefore does not require the independence assumption.
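To illustrate what relative importance means in a multiple linear regression, the sketch below computes one common definition: each predictor's gain in R^2 averaged over all orderings of the predictors (often called the LMG metric). Whether Roots uses exactly this variant is an assumption, as are the function names and the synthetic data; the example requires NumPy.

```python
from itertools import permutations

import numpy as np


def r_squared(X, y, cols):
    """R^2 of an ordinary least squares fit of y on the given columns of X."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


def relative_importance(X, y):
    """Average R^2 gain of each predictor over all orderings of the predictors."""
    k = X.shape[1]
    scores = np.zeros(k)
    orderings = list(permutations(range(k)))
    for order in orderings:
        used = []
        for col in order:
            before = r_squared(X, y, used)
            used.append(col)
            scores[col] += r_squared(X, y, used) - before
    return scores / len(orderings)


# Synthetic example: response time as the sum of three component latencies,
# where the first component (say, a datastore call) varies the most.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50, 20, 200),
    rng.normal(10, 1, 200),
    rng.normal(5, 1, 200),
])
y = X.sum(axis=1)
scores = relative_importance(X, y)
```

The scores sum to the R^2 of the full model, and the component with the largest score is the bottleneck candidate. Because every ordering is enumerated, the cost grows factorially with the number of predictors, which is acceptable only for a small number of kernel invocations per request.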
Anomaly detection is a general problem not restricted to performance analysis. Researchers
have studied anomaly detection from many different points of view, and as a
result many viable algorithms and solutions have emerged over time [153]. Prior work
in performance anomaly detection and root cause analysis can be classified as statistical
methods (e.g. [154, 155, 152, 150]) and machine learning methods (e.g. [156, 157, 158]).
While we use many statistical methods in our work (change point analysis, relative importance,
quantile analysis), Roots is not tied to any of these techniques. Rather, we
provide a framework on top of which new anomaly detectors and anomaly handlers can
be built.
5.6 Conclusions and Future Work
Uncovering performance bottlenecks in a timely manner, and resolving them urgently
is a key requirement for implementing governance in cloud environments. Application
developers and cloud administrators wish to detect performance anomalies in cloud applications,
and perform root cause analysis to diagnose problems. However, the high level of
abstraction provided by cloud platforms, coupled with their scale and complexity, makes
performance diagnosis a daunting problem. This situation is particularly apparent in
PaaS clouds, where the application runtime details are hidden beneath a layer of kernel
services. The existing cloud monitoring solutions do not have the necessary penetrative
power to monitor all the different layers of cloud platforms, and consequently, their
diagnosis capabilities are severely limited.
We present Roots, a near real time monitoring framework for applications deployed
in a PaaS cloud. Roots is designed to function as a curated service built into the cloud
platform, as opposed to an external monitoring system. It relieves the application developers
from having to configure their own monitoring solutions, or having to instrument
the application code in any way. Roots captures runtime data from all the different layers
involved in processing application requests. It can correlate events across different layers,
and identify bottlenecks deep within the kernel services of the PaaS.
Roots monitors applications for SLO compliance, and detects anomalies via SLO violations.
When Roots detects an anomaly, it analyzes workload data and other application
runtime data to perform root cause analysis. Roots is able to determine whether a particular
anomaly was caused by a change in the application workload, or by a bottleneck
in the cloud platform. To this end we also devise a bottleneck identification algorithm
that uses a combination of linear regression, quantile analysis and change point detection.
We also present an analysis method by which Roots can identify different paths of
execution in an application. Our method does not require static analysis, and we use it
to detect changes in an application’s workload characteristics.
We evaluate Roots using a prototype built for the AppScale open source PaaS. Our
results indicate that Roots is effective at detecting performance anomalies in near real
time. We also show that our bottleneck identification algorithm produces accurate results
nearly 100% of the time, pinpointing the exact PaaS kernel service or the application
component responsible for each anomaly. Our empirical trials further reveal that Roots
does not add a significant overhead to the applications deployed on the cloud platform.
Finally, we show that Roots is very lightweight, and scales well to handle large populations
of applications.
In our future work we plan to expand the data gathering capabilities of Roots into
the low level virtual machines, and containers that host various services of the cloud platform.
We intend to tap into the hypervisors and container managers to harvest runtime
data regarding the resource usage (CPU, memory, disk etc.) of PaaS services and other
application components. With that we expect to extend the root cause analysis support
of Roots so that it can not only pinpoint the bottlenecked application components, but
also the low level hosts and system resources that constitute each bottleneck.
Chapter 6
Conclusion
Cloud computing delivers IT infrastructure resources, programming platforms, and soft-
ware applications as shared utility services. Enterprises and developers increasingly de-
ploy applications on cloud platforms due to their scalability, high availability and many
other productivity enhancing features. Cloud-hosted applications depend on the core
services provided by the cloud platform for compute, storage and network resources. In
some cases they use the services provided by the cloud to implement most of the appli-
cation functionality as well (e.g. PaaS-hosted applications). Cloud-hosted applications
are typically accessed over the Internet, via the web APIs exposed by the applications.
As the applications hosted in cloud platforms continue to increase in number, the
need for enforcing governance on them becomes more pressing. We define governance as
the mechanism by which the acceptable operational parameters are specified and main-
tained for a cloud-hosted application. Governance enables specifying the acceptable
development standards and runtime parameters (performance, availability, security re-
quirements etc.) for cloud-hosted applications as policies. Such policies can then be
enforced automatically at various stages of the application life-cycle. Governance also
entails monitoring cloud-hosted applications to ensure that they operate at a certain level
of quality, and taking corrective action when deviations are detected. Through the steps
of specification, enforcement, monitoring and correction, governance facilitates resolv-
ing a number of prevalent issues in today’s cloud platforms. These issues include lack
of good software engineering practices (code reuse, dependency management, versioning
etc.), lack of performance SLOs for cloud-hosted applications, and lack of performance
debugging support.
We explore the feasibility of efficiently enforcing governance on cloud-hosted applica-
tions, and evaluate the effectiveness of governance as a means of achieving administrative
conformance, developer best practices and performance SLOs in the cloud. Considering
the scale of today’s cloud platforms in terms of the number of users and applica-
tions, we strive to automate many of the governance tasks through automated analysis
and diagnostics. To achieve efficiency, we put more emphasis on deployment-time policy
enforcement, static analysis of performance bounds, and non-invasive passive monitoring
of cloud platforms, thereby keeping the governance overhead to a minimum. We avoid
run-time enforcement and invasive instrumentation of cloud applications as much as pos-
sible. We also focus on building governance systems that are deeply integrated with the
cloud platforms themselves. This enables using the existing scalability and high avail-
ability features of the cloud to provide an efficient governance solution that can control
all application events in a fine-grained manner. Furthermore, such integrated solutions
relieve the users from having to maintain and pay for additional, external governance
and monitoring solutions.
In order to explore the feasibility of implementing efficient, automated governance
systems in cloud environments, and evaluate the efficacy of such systems, we follow a
three-step research plan.
1. Design and implement a scalable, low-overhead policy enforcement system for cloud
platforms.
2. Design and implement a methodology for formulating performance SLOs for cloud-
hosted applications.
3. Design and implement a scalable application performance monitoring system for
detecting and diagnosing performance anomalies in cloud platforms.
We design and implement EAGER [54, 91] – a lightweight governance policy enforce-
ment framework built into PaaS clouds. It supports defining policies using a simple syntax
based on the popular Python programming language. EAGER promotes deployment-
time policy enforcement, where policies are enforced on user applications (and APIs)
every time an application is uploaded to the cloud. By carrying out policy validations
at application deployment-time, and refusing to deploy applications that violate policies,
we provide fail-fast semantics, which ensure that deployed applications are fully policy
compliant. The EAGER architecture also provides the necessary provisions for facilitating
run-time policy enforcement (through an API gateway proxy) when necessary. This is
required, since not all policy requirements are enforceable at deployment-time; e.g. a
policy that prevents an application from making connections to a specific network ad-
dress. Our experimental results show that EAGER validation and policy enforcement
overhead is negligibly small, and it scales well to handle thousands of user applications
and policies. Overall, we show that integrated governance for cloud-hosted applications
is not only feasible, but can also be implemented with very little overhead and effort.
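The fail-fast, deployment-time enforcement described above can be sketched as follows. This is an illustration under stated assumptions, not EAGER's actual policy language or API: the `PolicyViolation` exception, the two example policies and the dictionary that stands in for an uploaded application are all hypothetical.

```python
class PolicyViolation(Exception):
    """Raised by a policy when an application breaks a rule."""


def require_version(app):
    # Hypothetical policy: every application must declare a version.
    if not app.get("version"):
        raise PolicyViolation("application must declare a version")


def limit_dependencies(app):
    # Hypothetical policy: at most 3 API dependencies per application.
    if len(app.get("dependencies", [])) > 3:
        raise PolicyViolation("too many API dependencies")


def deploy(app, policies):
    """Validate `app` against every policy before deploying it.

    Fail-fast: the first violation aborts the deployment, so only
    fully compliant applications are ever reported as deployed.
    """
    for policy in policies:
        try:
            policy(app)
        except PolicyViolation as err:
            return f"rejected: {err}"
    return "deployed"


policies = [require_version, limit_dependencies]
print(deploy({"name": "shop", "version": "1.0", "dependencies": ["auth"]}, policies))
print(deploy({"name": "blog", "dependencies": []}, policies))
```

Because validation happens before the application is accepted, a rejected upload never reaches the running platform. This is what keeps the enforcement cost off the request processing path.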
To facilitate formulating performance SLOs, we design and implement Cerebro [134]
– a system that predicts bounds on the response time of web applications developed for
PaaS clouds. Cerebro is able to analyze a given web application, and determine a bound
on its response time without subjecting the application to any testing or runtime instru-
mentation. This is achieved by a mechanism that combines static analysis of application
source code with runtime monitoring of the underlying cloud platform (PaaS SDK to
be specific). Our approach is limited to interactive web applications developed using a
PaaS SDK. We show that such applications have very few branches and loops, and they
spend most of their execution time invoking PaaS SDK operations. These properties
make the applications amenable to both static analysis, and statistical treatment of their
performance limits.
Cerebro is fast, can be invoked at the deployment-time of an application, and does not
require any human input or intervention. The bounds predicted by Cerebro can be used as
statistical guarantees (with well defined correctness probabilities) to form performance
SLOs. These SLOs in turn can be used in SLAs that are negotiated with the users
of the web applications. Cerebro’s SLO prediction capability, coupled with a policy
enforcement framework such as EAGER, can facilitate specification and enforcement of
performance-related policies for cloud-hosted applications. We implement Cerebro for
the Google App Engine public cloud and the AppScale private cloud. Our experiments with real
world PaaS applications show that Cerebro is able to determine accurate performance
SLOs that closely reflect the actual response time of the applications. Furthermore, we
show that Cerebro-predicted SLOs are not easily affected by the dynamic nature of the
cloud platform, and they remain valid for long durations. More specifically, Cerebro
predictions remain correct for more than 12 days on average [159].
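The idea of combining a statically extracted list of PaaS SDK calls with monitored SDK latencies can be sketched as follows. This is a simplified illustration only. The nearest-rank empirical quantile is a stand-in for the statistical forecasting method, which this section does not specify, and the operation names and sample values are hypothetical.

```python
import math


def empirical_quantile(samples, q):
    """Return the nearest-rank q-th quantile of the samples."""
    ordered = sorted(samples)
    # Nearest-rank method: rank is 1-based, clamp to at least 1.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def response_time_bound(op_series, path, q):
    """Estimate an upper bound on response time for one execution path.

    `op_series` maps an SDK operation name to latencies (ms) sampled at
    the same instants; `path` lists the operations the code invokes,
    as static analysis would report them.
    """
    n = min(len(op_series[op]) for op in path)
    # Aggregate series: total time of the path at each sampling instant.
    totals = [sum(op_series[op][i] for op in path) for i in range(n)]
    return empirical_quantile(totals, q)


# Hypothetical monitoring data for two SDK operations.
series = {
    "datastore.get": [4, 5, 6, 5, 30, 5, 4, 6, 5, 5],
    "memcache.get": [1, 1, 2, 1, 1, 1, 2, 1, 1, 1],
}
path = ["datastore.get", "datastore.get", "memcache.get"]
bound = response_time_bound(series, path, 0.9)
print(bound)
```

The sketch never runs the application itself. It needs only the operation list and the platform-level latency history, which mirrors the property that a bound can be produced at deployment-time without testing or instrumenting the application.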
Finally, we design and implement Roots – a performance anomaly detection and
bottleneck identification system built into PaaS clouds. It collects data from all the
different layers of the PaaS stack, from load balancers to low level PaaS kernel service
implementations. However, it does so without instrumenting user code, and without
introducing a significant overhead to the application request processing flow. Roots uses
the metadata (request identifiers) injected by the load balancers to correlate the events
observed in different layers, thereby enabling tracing of application requests through
the PaaS stack. Roots is also extensible in the sense that any number of statistical
analysis methods can be incorporated into Roots for performance anomaly detection
and diagnosis. Furthermore, it facilitates configuring monitoring requirements at the
granularity of user applications, which allows different applications to be monitored and
analyzed differently.
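Correlating events by a shared request identifier can be sketched as follows. The event layout `(request_id, layer, elapsed_ms)` and the layer names are assumptions made for this example. The text above states only that load balancers inject request identifiers that later layers can be matched on.

```python
from collections import defaultdict


def correlate(events):
    """Group events from all layers by the request identifier they carry.

    Each event is a (request_id, layer, elapsed_ms) tuple; the field
    layout is an assumption made for this example.
    """
    traces = defaultdict(dict)
    for request_id, layer, elapsed_ms in events:
        traces[request_id][layer] = elapsed_ms
    return dict(traces)


# Hypothetical events as they might be collected from three layers.
events = [
    ("req-1", "load_balancer", 42),
    ("req-2", "load_balancer", 18),
    ("req-1", "app_server", 40),
    ("req-1", "datastore", 31),
    ("req-2", "app_server", 15),
]
for request_id, trace in correlate(events).items():
    print(request_id, trace)
```

Each resulting trace shows how much of a request's time was spent in each layer. This per-request breakdown is the kind of input a bottleneck analysis needs, and it is obtained without touching user code.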
Roots detects performance anomalies by monitoring applications for performance SLO
violations. When an anomaly (i.e. an SLO violation) is detected, Roots determines if
the anomaly was caused by a change in the application workload or by a performance
bottleneck in one of the underlying PaaS kernel services. If the SLO violation was caused
by a performance bottleneck in the cloud, Roots needs to be able to locate the exact PaaS
kernel service in which the bottleneck manifested. To this end we present a root cause
analysis method that uses a combination of linear regression, change point detection and
quantile analysis. We show that our combined methodology makes accurate diagnoses
nearly 100% of the time. Moreover, we also present a path distribution analyzer that can
identify different paths of execution in an application, via the run-time data gathered from
the cloud platform. We show that this mechanism is capable of detecting characteristic
changes in application workload as a special type of anomaly.
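A path distribution analysis of this kind can be sketched as follows, assuming each request is summarized by the ordered sequence of PaaS kernel calls it made. The call names, the two monitoring windows and the 0.2 threshold are hypothetical, and the comparison rule is a simple illustration rather than the method used by Roots.

```python
from collections import Counter


def path_distribution(requests):
    """Map each observed execution path to its share of the requests.

    A path is the ordered sequence of PaaS kernel calls one request made.
    """
    counts = Counter(tuple(calls) for calls in requests)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


def changed_paths(before, after, threshold):
    """Return paths whose share moved by more than `threshold`."""
    paths = set(before) | set(after)
    return sorted(
        p for p in paths
        if abs(after.get(p, 0.0) - before.get(p, 0.0)) > threshold
    )


# Hypothetical kernel-call sequences seen in two monitoring windows.
earlier = [["cache.get"]] * 8 + [["cache.get", "db.query"]] * 2
later = [["cache.get"]] * 3 + [["cache.get", "db.query"]] * 7
shifted = changed_paths(path_distribution(earlier), path_distribution(later), 0.2)
print(shifted)
```

In this example the share of requests that fall through to the database grows from 20% to 70%, so both paths are reported as shifted. A change of this sort reflects different workload characteristics rather than a slower platform service.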
Our results demonstrate that efficient and automated governance in cloud environ-
ments is not only feasible, but also highly effective. We did not have to implement a cloud
platform from scratch to implement the governance systems designed as a part of this
work. Rather, we were able to implement the proposed governance systems for existing
cloud platforms like Google App Engine and AppScale, often with minimal changes to
the cloud platform software. Our policy enforcement and monitoring systems are inte-
grated with the cloud platform (i.e. they operate from within the cloud platform), and
hence spare the cloud platform users from having to set up or implement their own
external governance solutions that provide API management or application monitoring
functionality. Our governance systems are also efficient, in the sense that they do not add a
significant overhead to the applications deployed in the cloud platform, and they scale
well to handle a very large number of applications and governance policies.
Our research is aimed at providing increased levels of oversight, control and automa-
tion to cloud platforms. Therefore it has the potential to increase the value offered by the
cloud platforms to the application developers and the application clients. More specif-
ically, our research can greatly enhance the use of PaaS clouds. Much of our work is
directly applicable to popular PaaS clouds such as Google App Engine and AppScale,
and the respective developer communities can greatly benefit from our findings.
Our research paves the way to making cloud platforms more dependable and main-
tainable for administrators, application developers and clients alike. It brings automated
policy enforcement – a governance technique that has been successfully applied in classic
SOA systems in the past – to modern cloud environments. Policy enforcement solves a
variety of issues related to poor application coding practices, and lack of administrative
control. We also enable stipulating performance SLOs for cloud-hosted applications, a
feature that is not supported in existing cloud platforms to the best of our knowledge.
Our research also supports full-stack monitoring of cloud platforms for detecting perfor-
mance SLO violations, and determining the root causes of such violations. When taken
together, our research addresses all three components of governance (specification, en-
forcement and monitoring) both efficiently and automatically, as cloud-native features.
The systems we propose ensure that cloud-hosted applications always operate in a policy
compliant state, and any performance anomalies are detected and diagnosed quickly. In
conclusion, our governance systems facilitate achieving developer best practices, admin-
istrative conformance and performance SLOs for cloud-hosted applications in ways that
were not possible before.
Bibliography
[1] Q. Hassan, Demystifying cloud computing, The Journal of Defense Software
Engineering (2011) 16–21.
[2] P. M. Mell and T. Grance, Sp 800-145. the nist definition of cloud computing,
tech. rep., Gaithersburg, MD, United States, 2011.
[3] Amazon Web Services home page, 2015. http://aws.amazon.com/ [Accessed
March 2015].
[4] “App Engine – Run your applications on a fully managed PaaS.”
“https://cloud.google.com/appengine”
[Accessed March 2015].
[5] “Microsoft windows azure.” “http://www.microsoft.com/windowsazure/”
[Accessed March 2015].
[6] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and
D. Zagorodnov, The Eucalyptus open-source cloud-computing system, in
IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009.
[7] C. Krintz, The appscale cloud platform: Enabling portable, scalable web
application deployment, Internet Computing, IEEE 17 (March, 2013) 72–75.
[8] “OpenShift by RedHat.” “https://www.openshift.com”.
[9] N. Antonopoulos and L. Gillam, Cloud Computing: Principles, Systems and
Applications. Springer Publishing Company, Incorporated, 1st ed., 2010.
[10] P. Pinheiro, M. Aparicio, and C. Costa, Adoption of cloud computing systems, in
Proceedings of the International Conference on Information Systems and Design
of Communication, 2014.
[11] “Roundup of Cloud Computing Forecasts and Market Estimates 2015.”
http://www.forbes.com/sites/louiscolumbus/2015/01/24/
roundup-of-cloud-computing-forecasts-and-market-estimates-2015
[Accessed May 2016].
[12] “Application Performance Monitoring and Management – New Relic.”
http://www.newrelic.com [Accessed April 2016].
[13] “Application Performance Monitoring and Management – Dynatrace.”
http://www.dynatrace.com [Accessed April 2016].
[14] “Datadog – Cloud-scale Performance Monitoring.” http://www.datadoghq.com
[Accessed April 2016].
[15] Brown, Allen E and Grant, Gerald G, Framing the frameworks: A review of IT
governance research, Communications of the Association for Information Systems
15 (2005), no. 1 38.
[16] “Gartner, Magic Quadrant for Integrated SOA Governance Technology Sets,
2007.” https://www.gartner.com/doc/572713/
magic-quadrant-integrated-soa-governance [Accessed April 2016].
[17] “SOA Governance.”
http://www.opengroup.org/soa/source-book/gov/gov.htm. [Online; accessed
14-October-2013].
[18] T. G. J. Schepers, M. E. Iacob, and P. A. T. Van Eck, A Lifecycle Approach to
SOA Governance, in Proceedings of the 2008 ACM Symposium on Applied
Computing, 2008.
[19] F. Hojaji and M. R. A. Shirazi, A Comprehensive SOA Governance Framework
Based on COBIT, in 2010 6th World Congress on Services, 2010.
[20] K. Y. Peng, S. C. Lui, and M. T. Chen, A Study of Design and Implementation
on SOA Governance: A Service Oriented Monitoring and Alarming Perspective,
in Service-Oriented System Engineering, 2008. SOSE ’08. IEEE International
Symposium on, 2008.
[21] “Amazon Elastic Compute Cloud (Amazon EC2).”
http://aws.amazon.com/ec2/.
[22] “Google Compute Engine IaaS.” https://cloud.google.com/compute/.
[23] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and
D. Zagorodnov, Eucalyptus: A technical report on an elastic utility computing
architecture linking your programs to useful systems, in UCSB Technical Report
ID: 2008-10, 2008.
[24] “Heroku Cloud Application Platform.” http://www.heroku.com.
[25] “Amazon Elastic Beanstalk.” https://aws.amazon.com/elasticbeanstalk/.
[26] “Salesforce – What is SaaS?.” https://www.salesforce.com/saas/.
[27] “Workday – Alternative to ERP for HR and Financial Management.”
http://www.workday.com/.
[28] “GoToMeeting – Easy Online Conferencing.” http://www.gotomeeting.com.
[29] Protocol buffers, 2016. https://developers.google.com/protocol-buffers
[Accessed Sep 2016].
[30] 2009. http://highscalability.com/
latency-everywhere-and-it-costs-you-sales-how-crush-it [Accessed Sep
2016].
[31] SearchCloudComputing, 2015. http://searchcloudcomputing.techtarget.
com/feature/Experts-forecast-the-2015-cloud-computing-market
[Accessed March 2015].
[32] Forbes, 2016. http://www.forbes.com/sites/louiscolumbus/2016/03/13/
roundup-of-cloud-computing-forecasts-and-market-estimates-2016
[Accessed Sep 2016].
[33] “Microsoft windows azure.” “http://www.microsoft.com/windowsazure/”.
[34] G. Ataya, Information security, risk governance and management frameworks:
An overview of cobit 5, in Proceedings of the 6th International Conference on
Security of Information and Networks, SIN ’13, (New York, NY, USA), pp. 3–5,
ACM, 2013.
[35] 2007. http://www.isaca.org/certification/
cgeit-certified-in-the-governance-of-enterprise-it/pages/default.
aspx [Accessed Sep 2016].
[36] M. P. Papazoglou, Service-oriented computing: concepts, characteristics and
directions, in Web Information Systems Engineering, 2003. WISE 2003.
Proceedings of the Fourth International Conference on, 2003.
[37] “What is SOA?.” http://www.opengroup.org/soa/source-book/soa/soa.htm
[Accessed April 2016].
[38] M. N. Haines and M. A. Rothenberger, How a service-oriented architecture may
change the software development process, Commun. ACM 53 (Aug., 2010)
135–140.
[39] C. Xian-Peng, L. Bi-Ying, and M. Rui-Fang, An ITIL v3-Based Solution to SOA
Governance, in Services Computing Conference (APSCC), 2012 IEEE
Asia-Pacific, 2012.
[40] F. Belqasmi, R. Glitho, and C. Fu, Restful web services for service provisioning in
next-generation networks: a survey, Communications Magazine, IEEE 49
(December, 2011) 66–73.
[41] A. M. Gutierrez, J. A. Parejo, P. Fernandez, and A. Ruiz-Cortes, WS-Governance
Tooling: SOA Governance Policies Analysis and Authoring, in Policies for
Distributed Systems and Networks (POLICY), 2011 IEEE International
Symposium on, 2011.
[42] T. Phan, J. Han, J. G. Schneider, T. Ebringer, and T. Rogers, A Survey of
Policy-Based Management Approaches for Service Oriented Systems, in 19th
Australian Conference on Software Engineering (aswec 2008), 2008.
[43] Y. C. Zhou, X. P. Liu, E. Kahan, X. N. Wang, L. Xue, and K. X. Zhou, Context
Aware Service Policy Orchestration, in IEEE International Conference on Web
Services (ICWS 2007), 2007.
[44] R. Sturm, W. Morris, and M. Jander, Foundations of Service Level Management.
Pearson, 2000.
[45] “Free and Enterprise API Management Platform and Infrastructure by 3scale –
http://www.3scale.net.”
[46] “Enterprise API Management and API Strategy – http://apigee.com/about/.”
[47] “Enterprise API Management – Layer 7 Technologies –
http://www.layer7tech.com.”
[48] “ProgrammableWeb.” http://www.programmableweb.com [Accessed March
2015].
[49] “ProgrammableWeb Blog – http://blog.programmableweb.com/2013/04/30/
9000-apis-mobile-gets-serious/.”
[50] R. T. Fielding, Architectural Styles and the Design of Network-based Software
Architectures. PhD thesis, University of California, Irvine, 2000. AAI9980887.
[51] IEEE Xplore Search Gateway, 2015. http://ieeexplore.ieee.org/gateway/
[Accessed March 2015].
[52] Berkeley API Central, 2015. https://developer.berkeley.edu [Accessed
March 2015].
[53] Agency Application Programming Interfaces, 2015.
http://www.whitehouse.gov/digitalgov/apis [Accessed March 2015].
[54] C. Krintz, H. Jayathilaka, S. Dimopoulos, A. Pucher, R. Wolski, and T. Bultan,
Cloud platform support for api governance, in Cloud Engineering (IC2E), 2014
IEEE International Conference on, 2014.
[55] A. S. Vedamuthu, D. Orchard, F. Hirsch, M. Hondo, P. Yendluri, T. Boubez, and
U. Yalcinalp, Web services policy framework (wspolicy), September, 2007.
[56] “SOA Governance Technical Standard –
http://www.opengroup.org/soa/source-book/gov/intro.htm.”
[57] C. Krintz, The AppScale Cloud Platform: Enabling Portable, Scalable Web
Application Deployment, IEEE Internet Computing Mar/Apr (2013).
[58] G. Lawton, Developing software online with platform-as-a-service technology,
Computer 41 (June, 2008) 13–15.
[59] “Platform as a Service – Pivotal CF.”
“http://www.gopivotal.com/platform-as-a-service/pivotal-cf”.
[60] H. Jayathilaka, C. Krintz, and R. Wolski, Towards Automatically Estimating
Porting Effort between Web Service APIs, in Services Computing (SCC), 2014
IEEE International Conference on, 2014.
[61] “Web Application Description Language.”
http://www.w3.org/Submission/wadl/, 2013. [Online; accessed
27-September-2013].
[62] “Swagger: A simple, open standard for describing REST APIs with JSON.”
https://developers.helloreverb.com/swagger/. [Online; accessed
05-August-2013].
[63] C. A. R. Hoare, An axiomatic basis for computer programming, Commun. ACM
12 (Oct., 1969) 576–580.
[64] H. Jayathilaka, A. Pucher, C. Krintz, and R. Wolski, Using syntactic and
semantic similarity of Web APIs to estimate porting effort, International Journal
of Services Computing 2 (2014), no. 4.
[65] R. Verborgh, T. Steiner, D. Van Deursen, S. Coppens, J. G. Vallés, and R. Van de
Walle, Functional descriptions as the bridge between hypermedia APIs and the
Semantic Web, in International Workshop on RESTful Design, 2012.
[66] T. Steiner and J. Algermissen, Fulfilling the hypermedia constraint via http
options, the http vocabulary in rdf, and link headers, in Proceedings of the Second
International Workshop on RESTful Design, WS-REST ’11, (New York, NY,
USA), pp. 11–14, ACM, 2011.
[67] “OAuth 2.0 – http://oauth.net/2/.”
[68] “Apache Synapse.” https://synapse.apache.org/. [Online; accessed
25-March-2014].
[69] “JSR311 – The Java API for RESTful Web Services –
https://jcp.org/aboutJava/communityprocess/final/jsr311/.”
[70] “Swagger – A simple, open standard for describing REST APIs with JSON –
https://helloreverb.com/developers/swagger.”
[71] “WSO2 API Manager.” http://wso2.com/products/api-manager/, 2013.
[Online; accessed 27-September-2013].
[72] “WSO2 API Manager – http://wso2.com/products/api-manager/.”
[73] H. Guan, B. Jin, J. Wei, W. Xu, and N. Chen, A framework for application server
based web services management, in Software Engineering Conference, 2005.
APSEC ’05. 12th Asia-Pacific, pp. 8 pp.–, Dec, 2005.
[74] J. Wu and Z. Wu, Dart-man: a management platform for web services based on
semantic web technologies, in Computer Supported Cooperative Work in Design,
2005. Proceedings of the Ninth International Conference on, vol. 2, pp. 1199–1204
Vol. 2, May, 2005.
[75] X. Zhu and B. Wang, Web service management based on hadoop, in Service
Systems and Service Management (ICSSSM), 2011 8th International Conference
on, pp. 1–6, June, 2011.
[76] C.-F. Lin, R.-S. Wu, S.-M. Yuan, and C.-T. Tsai, A web services status
monitoring technology for distributed system management in the cloud, in
Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2010
International Conference on, pp. 502–505, Oct, 2010.
[77] S. Kikuchi and T. Aoki, Evaluation of operational vulnerability in cloud service
management using model checking, in Service Oriented System Engineering
(SOSE), 2013 IEEE 7th International Symposium on, pp. 37–48, March, 2013.
[78] Y. Sun, Z. Xiao, D. Bao, and J. Zhao, An architecture model of management and
monitoring on cloud services resources, in Advanced Computer Theory and
Engineering (ICACTE), vol. 3, pp. V3–207–V3–211, Aug, 2010.
[79] R. Bhatti, D. Sanz, E. Bertino, and A. Ghafoor, A policy-based authorization
framework for web services: Integrating xgtrbac and ws-policy, in Web Services,
2007. ICWS 2007. IEEE International Conference on, pp. 447–454, July, 2007.
[80] S.-C. Chou and J.-Y. Jhu, Access control policy embedded composition algorithm
for web services, in Advanced Information Management and Service (IMS), 2010
6th International Conference on, pp. 54–59, Nov, 2010.
[81] L. Li, K. Xiaohui, L. Yuanling, X. Fei, Z. Tao, and C. YiMin, Policy-based fault
diagnosis technology for web service, in Instrumentation, Measurement,
Computer, Communication and Control, 2011 First International Conference on,
pp. 827–831, Oct, 2011.
[82] H. Liang, W. Sun, X. Zhang, and Z. Jiang, A policy framework for collaborative
web service customization, in Service-Oriented System Engineering, 2006. SOSE
’06. Second IEEE International Workshop, pp. 197–204, Oct, 2006.
[83] A. Erradi, P. Maheshwari, and S. Padmanabhuni, Towards a policy-driven
framework for adaptive web services composition, in Next Generation Web
Services Practices, 2005. NWeSP 2005. International Conference on, pp. 6 pp.–,
Aug, 2005.
[84] A. Erradi, P. Maheshwari, and V. Tosic, Policy-driven middleware for
self-adaptation of web services compositions, in International Conference on
Middleware, 2006.
[85] B. Suleiman and V. Tosic, Integration of uml modeling and policy-driven
management of web service systems, in ICSE Workshop on Principles of
Engineering Service Oriented Systems, 2009.
[86] M. Thirumaran, D. Ponnurangam, K. Rajakumari, and G. Nandhini, Evaluation
model for web service change management based on business policy enforcement,
in Cloud and Services Computing (ISCOS), 2012 International Symposium on,
pp. 63–69, Dec, 2012.
[87] F. Zhang, J. Gao, and B.-S. Liao, Policy-driven model for autonomic management
of web services using mas, in Machine Learning and Cybernetics, 2006
International Conference on, pp. 34–39, Aug, 2006.
[88] “Mashery – http://www.mashery.com.”
[89] A. Keller and H. Ludwig, The WSLA Framework: Specifying and Monitoring
Service Level Agreements for Web Services, J. Netw. Syst. Manage. 11 (Mar.,
2003).
[90] D. Nurmi, J. Brevik, and R. Wolski, QBETS: Queue Bounds Estimation from
Time Series, in International Conference on Job Scheduling Strategies for Parallel
Processing, 2008.
[91] H. Jayathilaka, C. Krintz, and R. Wolski, EAGER: Deployment-time API
Governance for Modern PaaS Clouds, in IC2E Workshop on the Future of PaaS,
2015.
[92] Google App Engine Java Sandbox, 2015.
“https://cloud.google.com/appengine/docs/java/#Java The sandbox” [Accessed
March 2015].
[93] “Microsoft Azure Cloud SDK Service Quotas and Limits.”
http://azure.microsoft.com/en-us/documentation/articles/
azure-subscription-service-limits/#cloud-service-limits [Accessed
March 2015].
[94] “Google Cloud SDK Service Quotas and Limits.”
https://cloud.google.com/appengine/docs/quotas [Accessed March 2015].
[95] R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan, Soot:
A Java Bytecode Optimization Framework, in CASCON First Decade High Impact
Papers, 2010.
[96] Github – build software better, together, 2015. “https://github.com” [Accessed
March 2015].
[97] F. E. Allen, Control Flow Analysis, in Symposium on Compiler Optimization,
1970.
[98] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and
Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.
[99] R. Morgan, Building an Optimizing Compiler. Digital Press, Newton, MA, USA,
1998.
[100] S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan
Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
[101] S. Bygde, Static WCET analysis based on abstract interpretation and counting of
elements. PhD thesis, Mälardalen University, 2010.
[102] https://cloud.google.com/appengine/docs/java/javadoc/com/google/
appengine/api/datastore/FetchOptions [Accessed March 2015].
[103] D. Nurmi, J. Brevik, and R. Wolski, Modeling Machine Availability in Enterprise
and Wide-area Distributed Computing Environments, in Proceedings of Europar
2005, 2005.
[104] J. Brevik, D. Nurmi, and R. Wolski, Quantifying Machine Availability in
Networked and Desktop Grid Systems, in Proceedings of CCGrid04, April, 2004.
[105] R. Wolski and J. Brevik, QPRED: Using Quantile Predictions to Improve Power
Usage for Private Clouds, Tech. Rep. UCSB-CS-2014-06, Computer Science
Department of the University of California, Santa Barbara, Santa Barbara, CA
93106, September, 2014.
[106] D. Nurmi, R. Wolski, and J. Brevik, Model-Based Checkpoint Scheduling for
Volatile Resource Environments, in Proceedings of Cluster 2005, 2004.
[107] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley,
G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut,
P. Puschner, J. Staschulat, and P. Stenström, The Worst-case Execution-time
Problem – Overview of Methods and Survey of Tools, ACM Trans. Embed.
Comput. Syst. 7 (May, 2008).
[108] A. Ermedahl, C. Sandberg, J. Gustafsson, S. Bygde, and B. Lisper, Loop Bound
Analysis based on a Combination of Program Slicing, Abstract Interpretation, and
Invariant Analysis., in WCET, 2007.
[109] C. Sandberg, A. Ermedahl, J. Gustafsson, and B. Lisper, Faster WCET Flow
Analysis by Program Slicing, in ACM SIGPLAN/SIGBED Conference on
Language, Compilers, and Tool Support for Embedded Systems, 2006.
[110] C. Frost, C. S. Jensen, K. S. Luckow, and B. Thomsen, WCET Analysis of Java
Bytecode Featuring Common Execution Environments, in International Workshop
on Java Technologies for Real-Time and Embedded Systems, 2011.
[111] P. Cousot and R. Cousot, Abstract Interpretation: A Unified Lattice Model for
Static Analysis of Programs by Construction or Approximation of Fixpoints, in
ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages,
1977.
[112] P. Lokuciejewski, D. Cordes, H. Falk, and P. Marwedel, A Fast and Precise Static
Loop Analysis Based on Abstract Interpretation, Program Slicing and Polytope
Models, in IEEE/ACM International Symposium on Code Generation and
Optimization, 2009.
[113] S. Gulwani, S. Jain, and E. Koskinen, Control-flow Refinement and Progress
Invariants for Bound Analysis, in ACM SIGPLAN Conference on Programming
Language Design and Implementation, 2009.
[114] S. Gulwani, K. K. Mehra, and T. Chilimbi, SPEED: Precise and Efficient Static
Estimation of Program Computational Complexity, in ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, 2009.
[115] A. Michlmayr, F. Rosenberg, P. Leitner, and S. Dustdar, Comprehensive QoS
Monitoring of Web Services and Event-based SLA Violation Detection, in
International Workshop on Middleware for Service Oriented Computing, 2009.
[116] A. K. Tripathy and M. R. Patra, Modeling and Monitoring SLA for Service Based
Systems, in International Conference on Intelligent Semantic Web-Services and
Applications, 2011.
[117] F. Raimondi, J. Skene, and W. Emmerich, Efficient Online Monitoring of
Web-service SLAs, in ACM SIGSOFT International Symposium on Foundations
of Software Engineering, 2008.
[118] A. Bertolino, G. De Angelis, A. Sabetta, and S. Elbaum, Scaling Up SLA
Monitoring in Pervasive Environments, in Workshop on Engineering of Software
Services for Pervasive Environments, 2007.
[119] K. Mahbub and G. Spanoudakis, Proactive SLA Negotiation for Service Based
Systems: Initial Implementation and Evaluation Experience, in IEEE
International Conference on Services Computing, 2011.
[120] E. Yaqub, R. Yahyapour, P. Wieder, C. Kotsokalis, K. Lu, and A. I. Jehangiri,
Optimal negotiation of service level agreements for cloud-based services through
autonomous agents, in IEEE International Conference on Services Computing,
2014.
[121] L. Wu, S. Garg, R. Buyya, C. Chen, and S. Versteeg, Automated SLA Negotiation
Framework for Cloud Computing, in IEEE/ACM International Symposium on
Cluster, Cloud and Grid Computing, 2013.
[122] T. Chau, V. Muthusamy, H.-A. Jacobsen, E. Litani, A. Chan, and P. Coulthard,
Automating SLA Modeling, in Conference of the Center for Advanced Studies on
Collaborative Research: Meeting of Minds, 2008.
[123] K. Stamou, V. Kantere, J.-H. Morin, and M. Georgiou, A SLA Graph Model for
Data Services, in International Workshop on Cloud Data Management, 2013.
[124] J. Skene, D. D. Lamanna, and W. Emmerich, Precise Service Level Agreements,
in International Conference on Software Engineering, 2004.
[125] H. He, Z. Ma, H. Chen, and W. Shao, Towards an SLA-Driven Cache Adjustment
Approach for Applications on PaaS, in Asia-Pacific Symposium on Internetware,
2013.
[126] C. Ardagna, E. Damiani, and K. Sagbo, Early Assessment of Service
Performance Based on Simulation, in IEEE International Conference on Services
Computing (SCC), 2013.
[127] D. Dib, N. Parlavantzas, and C. Morin, Meryn: Open, SLA-driven, Cloud
Bursting PaaS, in Proceedings of the First ACM Workshop on Optimization
Techniques for Resources Management in Clouds, 2013.
[128] A. Iosup, N. Yigitbasi, and D. Epema, On the Performance Variability of
Production Cloud Services, in Cluster, Cloud and Grid Computing (CCGrid),
2011 11th IEEE/ACM International Symposium on, 2011.
[129] P. Leitner, B. Wetzstein, F. Rosenberg, A. Michlmayr, S. Dustdar, and
F. Leymann, Runtime Prediction of Service Level Agreement Violations for
Composite Services, in Service-Oriented Computing. ICSOC/ServiceWave 2009
Workshops (A. Dan, F. Gittler, and F. Toumani, eds.), vol. 6275 of Lecture Notes
in Computer Science, pp. 176–186. Springer Berlin Heidelberg, 2010.
[130] B. Tang and M. Tang, Bayesian Model-Based Prediction of Service Level
Agreement Violations for Cloud Services, in Theoretical Aspects of Software
Engineering Conference (TASE), 2014.
[131] S. Duan and S. Babu, Proactive Identification of Performance Problems, in ACM
SIGMOD International Conference on Management of Data, 2006.
[132] G. Da Cunha Rodrigues, R. N. Calheiros, V. T. Guimaraes, G. L. d. Santos,
M. B. de Carvalho, L. Z. Granville, L. M. R. Tarouco, and R. Buyya, Monitoring
of cloud computing environments: Concepts, solutions, trends, and future
directions, in Proceedings of the 31st Annual ACM Symposium on Applied
Computing, SAC ’16, (New York, NY, USA), pp. 378–383, ACM, 2016.
[133] O. Ibidunmoye, F. Hernández-Rodriguez, and E. Elmroth, Performance anomaly
detection and bottleneck identification, ACM Comput. Surv. 48 (2015), no. 1.
[134] H. Jayathilaka, C. Krintz, and R. Wolski, Response Time Service Level
Agreements for Cloud-hosted Web Applications, in Proceedings of the Sixth ACM
Symposium on Cloud Computing, 2015.
[135] Elasticsearch – search and analyze data in real time, 2016.
“https://www.elastic.co/products/elasticsearch” [Accessed Sep 2016].
[136] O. Kononenko, O. Baysal, R. Holmes, and M. W. Godfrey, Mining modern
repositories with Elasticsearch, in Proceedings of the 11th Working Conference on
Mining Software Repositories, MSR 2014, (New York, NY, USA), pp. 328–331,
ACM, 2014.
[137] Logstash – collect, enrich and transport data, 2016.
“https://www.elastic.co/products/logstash” [Accessed Sep 2016].
[138] S. Urbanek, Rserve – a fast way to provide R functionality to applications, in
Proc. of the 3rd international workshop on Distributed Statistical Computing
(DSC 2003), 2003.
[139] R. Killick, P. Fearnhead, and I. A. Eckley, Optimal detection of changepoints with
a linear computational cost, Journal of the American Statistical Association 107
(2012), no. 500 1590–1598.
[140] C. Chen and L.-M. Liu, Joint estimation of model parameters and outlier effects
in time series, Journal of the American Statistical Association 88 (1993), no. 421
284–297.
[141] U. Groemping, Relative importance for linear regression in R: The package
relaimpo, Journal of Statistical Software 17 (2006), no. 1.
[142] R. H. Lindeman, P. F. Merenda, and R. Z. Gold, Introduction to Bivariate and Multivariate
Analysis. Scott, Foresman, Glenview, IL, 1980.
[143] Q. Guan, Z. Zhang, and S. Fu, Proactive failure management by integrated
unsupervised and semi-supervised learning for dependable cloud systems, in
Availability, Reliability and Security (ARES), 2011 Sixth International
Conference on, pp. 83–90, Aug, 2011.
[144] R. C. Harlan, Network management with Nagios, Linux J. 2003 (July, 2003) 3–.
[145] “The OpenNMS Project.” http://www.opennms.org [Accessed April 2016].
[146] P. Tader, Server monitoring with Zabbix, Linux J. 2010 (July, 2010).
[147] Amazon CloudWatch, 2016. https://aws.amazon.com/cloudwatch [Accessed
Sep 2016].
[148] L. Cherkasova, K. Ozonat, N. Mi, J. Symons, and E. Smirni, Anomaly?
application change? or workload change? towards automated detection of
application performance anomaly and change, in 2008 IEEE International
Conference on Dependable Systems and Networks With FTCS and DCC (DSN),
pp. 452–461, June, 2008.
[149] D. J. Dean, H. Nguyen, P. Wang, and X. Gu, PerfCompass: Toward runtime
performance anomaly fault localization for infrastructure-as-a-service clouds, in
Proceedings of the 6th USENIX Conference on Hot Topics in Cloud Computing,
HotCloud’14, (Berkeley, CA, USA), pp. 16–16, USENIX Association, 2014.
[150] H. Nguyen, Y. Tan, and X. Gu, PAL: Propagation-aware anomaly localization for
cloud hosted distributed applications, in Managing Large-scale Systems via the
Analysis of System Logs and the Application of Machine Learning Techniques,
SLAML ’11, (New York, NY, USA), pp. 1:1–1:8, ACM, 2011.
[151] J. P. Magalhaes and L. M. Silva, Detection of performance anomalies in
web-based applications, in Proceedings of the 2010 Ninth IEEE International
Symposium on Network Computing and Applications, NCA ’10, (Washington, DC,
USA), pp. 60–67, IEEE Computer Society, 2010.
[152] J. P. Magalhães and L. M. Silva, Root-cause analysis of performance anomalies
in web-based applications, in Proceedings of the 2011 ACM Symposium on Applied
Computing, 2011.
[153] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM
Comput. Surv. 41 (July, 2009) 15:1–15:58.
[154] G. Casale, N. Mi, L. Cherkasova, and E. Smirni, Dealing with burstiness in
multi-tier applications: Models and their parameterization, IEEE Transactions on
Software Engineering 38 (Sept, 2012) 1040–1053.
[155] S. Malkowski, M. Hedwig, J. Parekh, C. Pu, and A. Sahai, Bottleneck detection
using statistical intervention analysis, in Proceedings of the Distributed Systems:
Operations and Management 18th IFIP/IEEE International Conference on
Managing Virtualization of Networks and Services, DSOM’07, (Berlin,
Heidelberg), pp. 122–134, Springer-Verlag, 2007.
[156] I. Cohen, M. Goldszmidt, T. Kelly, J. Symons, and J. S. Chase, Correlating
instrumentation data to system states: A building block for automated diagnosis
and control, in Proceedings of the 6th Conference on Symposium on Operating
Systems Design & Implementation – Volume 6, OSDI’04, (Berkeley, CA, USA),
pp. 16–16, USENIX Association, 2004.
[157] L. Yu and Z. Lan, A scalable, non-parametric anomaly detection framework for
Hadoop, in Proceedings of the 2013 ACM Cloud and Autonomic Computing
Conference, CAC ’13, (New York, NY, USA), pp. 22:1–22:2, ACM, 2013.
[158] K. Bhaduri, K. Das, and B. L. Matthews, Detecting abnormal machine
characteristics in cloud infrastructures, in 2011 IEEE 11th International
Conference on Data Mining Workshops, pp. 137–144, IEEE, 2011.
[159] H. Jayathilaka, C. Krintz, and R. Wolski, Service-level agreement durability for
web service response time, in 2015 IEEE 7th International Conference on Cloud
Computing Technology and Science (CloudCom), 2015.