cover story: new tools
Can New Security Tools Keep Your Network Clean? | July 2015

New, Smart Tools Advance Monitoring, Protect Network

Learn the latest about ways to employ advanced network security monitoring technologies.

By Lisa Phifer

In the high-stakes, cat-and-mouse game of cybersecurity, the only real constant is change. The number of new
threats is escalating, and the attack surface is growing,
too. Businesses today rely more than ever on Internet-connected devices, services and data—from machine-to-machine communication and the Internet of Things
(IoT), to bring your own devices (BYODs) and bring your
own cloud (BYOC) applications. One thing this tidal
wave of new targets has in common: a 24/7 exposure to
network-borne threats. From Heartbleed to FREAK, criminals continually exploit low-hanging fruit by finding new
bugs in widely deployed software and old gaps in new
technologies.
Effectively spotting and stopping these evolving network threats requires not just vigilance, but new approaches. It’s unrealistic to expect enterprise defenses to
block all attacks or eliminate all vulnerabilities. Furthermore, manual threat assessment and intervention simply
cannot scale to meet these challenges. Network security
monitoring that is more pervasive, automated and intelligent is critical to improve awareness and response time.
The Importance of Network Threat Visibility
According to the Ponemon Institute’s “2014 Cost of
Cyber Crime: United States,” the most costly cybercrimes
are those caused by denial of service attacks, malicious
insiders and malicious code, which together account for 55% of all costs
associated with cyberattacks. Not surprisingly, costs escalate when attacks are not resolved quickly. Participants
in Ponemon’s study reported the average time to resolve
a cyberattack in 2014 was 45 days, at an average cost of
$1,593,627—a 33% increase over 2013's average cost and 32-day resolution time. Worse, study participants reported that malicious insider attacks took on average more than 65 days to contain.
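A quick back-of-the-envelope check makes the Ponemon trend concrete: if the 2014 average of $1,593,627 represents a 33% increase, the implied 2013 average was roughly $1.2 million. This is just arithmetic on the figures cited above, not data from the study itself:

```python
# Undo the reported 33% year-over-year increase to recover
# the implied 2013 average cost of resolving a cyberattack.
cost_2014 = 1_593_627
cost_2013 = cost_2014 / 1.33

print(f"Implied 2013 average cost: ${cost_2013:,.0f}")  # roughly $1.2 million
```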
The increasing frequency, diversity and complexity
of network-borne attacks is impeding threat resolution.
Cisco’s 2015 Annual Security Report found that criminals
are getting better at using security gaps to conceal malicious activity: for example, moving beyond recently fixed
Java bugs to use new Flash malware and snowshoe IP distribution techniques (increasing spam by 250%), exploiting the 56% of OpenSSL installations still vulnerable to Heartbleed, or enlisting end users as cybercrime accomplices.
In this era of BYOD, BYOC, IoT and more, achieving real-world security for business-essential connectivity requires more visibility into network traffic, assets and
patterns. “By understanding how security technologies
operate,” Cisco’s report concluded, “and what is normal
(and not normal) in the IT environment, security teams
can reduce their administrative workload while becoming
more dynamic and accurate in identifying and responding
to threats and adapting defenses.”
Depth, Breadth and Intelligence
According to Gartner analyst Earl Perkins, speaking at
the Gartner Security & Risk Management Summit in June
2015, advanced threat defense combines near-real-time
monitoring, detection and analysis of network traffic,
payload and endpoint behavior with network and endpoint forensics. More effective threat response begins
with advanced security monitoring. This includes awareness of user activities and the business resources they
access, both on-site and off. However, security professionals are also experiencing information overload. Advanced
visibility therefore comes from more intelligent use of
information through prioritization, baselining, analytics
and more.
Perkins recommends deploying network security
monitoring technologies based on risk. At a minimum,
every enterprise should take fundamental steps, including properly segmenting networks and defending business assets with traditional network firewalls, intrusion
prevention systems (IPS), secure Web gateways and endpoint protection tools. These defenses serve as sentries—
armed guards stationed at key entrances to ward off basic
threats and sound the alarm at the first sign of attack. For
threat-tolerant businesses with low risk, these fundamentals may be sufficient.
However, most organizations at risk will want to consider more advanced monitoring tools and capabilities
such as next-generation and application firewalls, network access control (NAC), enterprise mobility management (EMM), and security information and event
management (SIEM). These technologies go deeper by
examining more traffic content or endpoint characteristics. They broaden visibility by monitoring more network
elements, including mobile devices and activities. Ultimately, they can produce more actionable intelligence
by knitting together disparate events into more cohesive
threat alerts—especially for advanced persistent threats
that might otherwise be missed entirely.
Finally, risk-intolerant organizations may wish to
go even further, using network and endpoint forensics
to routinely record all activity, enabling look-back analysis of traffic, payload and behavior. Unlike real-time
monitoring technologies, forensics tools focus on identifying past compromises—but this can be important to
spot, for example, those long-running insider attacks.
Forensics can also help enterprises identify gaps in their
defenses, enabling them to adapt and to better prevent future attacks.
Putting Advanced Monitoring to Work
To take advantage of new advanced network monitoring
technologies, it can help to get a handle on industry advances and why new tools and capabilities have emerged.
Let’s start with that staple of network monitoring,
the traditional network firewall. Single-function firewalls long ago morphed into unified threat management
(UTM) platforms, which combine firewall, IPS, VPN,
Web gateway, and antimalware capabilities. However,
even UTMs tend to focus on network traffic inspection.
When application payload is examined, it’s for a specific
reason such as blocking a blacklisted URL, content type
or recognized malware.
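That narrower style of payload inspection can be sketched in a few lines. The blacklist entries and content types below are hypothetical, and no vendor's actual inspection engine works this simply—the point is only that the check fires for a specific, predefined reason:

```python
# UTM-style verdict: payload is examined only against fixed block
# reasons (blacklisted URL host, blocked content type), nothing more.
BLACKLISTED_URLS = {"badsite.example.com"}            # hypothetical list
BLOCKED_CONTENT_TYPES = {"application/x-msdownload"}  # hypothetical list

def utm_verdict(url_host: str, content_type: str) -> str:
    """Return 'block' or 'allow' based on simple blacklist checks."""
    if url_host in BLACKLISTED_URLS:
        return "block"
    if content_type in BLOCKED_CONTENT_TYPES:
        return "block"
    return "allow"

print(utm_verdict("badsite.example.com", "text/html"))  # block
print(utm_verdict("news.example.com", "text/html"))     # allow
```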
In contrast, next-generation firewalls are application-aware. That is, they attempt to identify the application
riding over a given traffic stream—even an SSL-encrypted
session—and apply policies specific to that application
and perhaps to the users, groups or roles. For example, a
next-generation firewall isn’t limited to blocking all traffic to Facebook. It can allow only marketing employees to
post to Facebook, but not to play Facebook games. Or it
can simply monitor how workers interact with Facebook
and generate alerts when activity deviates from that baseline. This granularity is only possible because the firewall
can identify applications and their features—including
new applications it will learn about in the future. Increasingly, next-generation firewalls are learning through
machine-readable feeds that not only deliver new threat
signatures but intelligence about new attacks and IPs, devices or users with bad reputations. This ability to adapt
and learn is key to keeping up with new cyberthreats.
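The Facebook example above can be sketched as a rule table that matches on application, feature and user group rather than just a destination. This is an illustration of the policy model, not any vendor's rule syntax; all names are invented:

```python
# Application-aware policy: first matching rule wins; None is a wildcard.
RULES = [
    # (app, feature, group, action)
    ("facebook", "post", "marketing", "allow"),
    ("facebook", "games", "marketing", "deny"),
    ("facebook", None, None, "deny"),  # default for all other Facebook use
]

def evaluate(app, feature, group):
    for r_app, r_feature, r_group, action in RULES:
        if r_app == app and r_feature in (None, feature) and r_group in (None, group):
            return action
    return "allow"  # traffic not covered by any rule

print(evaluate("facebook", "post", "marketing"))    # allow
print(evaluate("facebook", "games", "marketing"))   # deny
print(evaluate("facebook", "post", "engineering"))  # deny
```

Because rules key on the identified application and feature, a newly learned application simply becomes another value the engine can match on.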
While intrusion prevention remains a cornerstone of
network monitoring, it has expanded in several dimensions. First, as enterprise networks move from wired to
wireless access, wireless IPS has become essential. At a
minimum, enterprises can use rogue detection built into
wireless LAN controllers. Risk-averse enterprises may invest in wireless IPS to scan the network 24/7 for threats,
including some otherwise hidden IoT and unauthorized
BYOD communication.
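At its core, the rogue detection built into wireless LAN controllers is a comparison of what a scan observes against an authorized inventory. The sketch below shows that idea only; the BSSIDs are made up and real controllers add signal analysis, containment and much more:

```python
# Flag any observed access point whose BSSID is not on the
# authorized list maintained by the WLAN controller.
AUTHORIZED_BSSIDS = {"aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02"}

def find_rogues(observed_bssids):
    """Return the set of observed APs that are not authorized."""
    return set(observed_bssids) - AUTHORIZED_BSSIDS

scan = ["aa:bb:cc:00:00:01", "de:ad:be:ef:00:99"]
print(find_rogues(scan))  # the unknown AP from the scan
```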
Second, intrusion prevention now extends beyond
the enterprise network to mobile devices. For example,
EMMs can be used to routinely assess mobile device integrity, alerting administrators to jailbroken, rooted or malware-infected devices and automatically protecting the enterprise by removing network connections or business applications from those devices. The ability to look
beyond the traditional enterprise network edge is key to
avoiding blind spots.
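The EMM policy just described boils down to a posture check followed by remediation actions. The sketch below invents a simple device record and action names for illustration; it is not any real EMM product's API:

```python
# Device-integrity policy: risky posture triggers alerting plus
# automatic removal of network access and business apps.
def assess_device(device: dict) -> list:
    """Return remediation actions for a device posture record."""
    risky = (device.get("jailbroken") or device.get("rooted")
             or device.get("malware_found"))
    if risky:
        return ["alert_admin", "revoke_network_access", "remove_business_apps"]
    return []

print(assess_device({"id": "phone-1", "rooted": True}))
print(assess_device({"id": "phone-2"}))  # healthy device: no actions
```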
SIEM technologies have also evolved from simply
aggregating and normalizing events produced by enterprise network-connected systems and applications; now they combine that data with contextual information about users, assets, threats and vulnerabilities to enable correlation
and analysis. According to Gartner, SIEM deployment is growing, with breach detection now overtaking compliance as the primary driver. As a result, SIEM vendors
have expanded capabilities that target breach detection,
such as threat intelligence, anomaly detection and network-based activity monitoring—for example, integrating NetFlow and packet capture analysis. SIEM not only
helps enterprises pull monitored data together, but now it
can intelligently sift through that haystack to pinpoint internal and external threats.
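The "knitting together" of disparate events can be sketched as grouping on shared context—here, the user—and alerting only when multiple independent feeds agree. The event records, feed names and threshold below are invented for illustration; real SIEM correlation rules are far richer:

```python
# Correlate events across feeds: raise an alert for any user seen
# in events from at least `min_sources` distinct sources.
from collections import defaultdict

events = [
    {"source": "vpn",     "user": "alice", "type": "login_foreign_country"},
    {"source": "ad",      "user": "alice", "type": "privilege_escalation"},
    {"source": "netflow", "user": "alice", "type": "large_outbound_transfer"},
    {"source": "vpn",     "user": "bob",   "type": "login_foreign_country"},
]

def correlate(events, min_sources=3):
    by_user = defaultdict(set)
    for e in events:
        by_user[e["user"]].add(e["source"])
    return [u for u, sources in by_user.items() if len(sources) >= min_sources]

print(correlate(events))  # only alice trips events in three feeds
```

No single event above is conclusive on its own; the alert emerges only from the combination, which is the point of correlation.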
A new market segment has started to emerge: breach
detection systems (BDS). These technologies are being driven by startups that are working to apply big data
analytics to monitored information, profiling user- and
device-behavior patterns to detect breaches and facilitate
interactive investigation. According to NSS Labs, a BDS
can identify pre-existing breaches as well as malware introduced through side-channel attacks—but should be
considered a “last line of defense against breaches that go
undetected by current security technologies, or are unknown by these technologies.” Risk-intolerant enterprises
that have tried other advanced monitoring technologies
but are plagued by advanced persistent threats may wish
to investigate this new tool.
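Stripped to its essence, the behavior-profiling idea behind breach detection is: learn what each device normally does, then flag what it has never done before. The device names and actions below are hypothetical, and real BDS analytics score probabilistically rather than on strict set membership:

```python
# Profile each device's normal actions, then flag deviations.
from collections import defaultdict

def build_profiles(history):
    """Learn the set of actions each device normally performs."""
    profiles = defaultdict(set)
    for device, action in history:
        profiles[device].add(action)
    return profiles

def detect(profiles, observed):
    """Return (device, action) pairs absent from the learned baseline."""
    return [(d, a) for d, a in observed if a not in profiles.get(d, set())]

history = [("printer-3", "dns_lookup"), ("printer-3", "print_job")]
observed = [("printer-3", "print_job"), ("printer-3", "outbound_ssh")]
print(detect(build_profiles(history), observed))  # the SSH session stands out
```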
When attacks inevitably break through enterprise network defenses and evade real-time detection, another advanced monitoring tool can be helpful: network forensics
appliances. Network forensics also analyzes monitored
data, but in a different way, for a different purpose. Like
a network DVR, these passive appliances record and catalog all ingress and egress traffic. By delivering exhaustive
full-packet replay, analysis and visualization quickly, network forensics appliances support cybercrime investigation, evidence gathering, impact assessment and cleanup.
Here, the idea is to avoid limitations associated with real-time monitoring—that is, having to spot everything important right when it happens. Network forensics makes
it possible to go back and take a second look, to find what
other monitoring systems might have missed.
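The "network DVR" metaphor can be sketched as a bounded buffer of timestamped records that supports querying a past time window. A real appliance writes full packets to disk at line rate; this toy version only shows the record-then-replay idea, with invented traffic summaries:

```python
# Record everything into a bounded ring buffer; replay any past window.
from collections import deque

class NetworkDVR:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest records fall off

    def record(self, timestamp, src, dst, summary):
        self.buffer.append((timestamp, src, dst, summary))

    def replay(self, start, end):
        """Look back: return all records captured in [start, end]."""
        return [r for r in self.buffer if start <= r[0] <= end]

dvr = NetworkDVR()
dvr.record(100, "10.0.0.5", "203.0.113.9", "TLS handshake")
dvr.record(250, "10.0.0.5", "203.0.113.9", "4 GB outbound transfer")
print(dvr.replay(200, 300))  # the suspicious transfer, found after the fact
```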
The Bottom Line
As we have seen, advanced network security monitoring cannot be accomplished through isolated static tools.
Rather, monitoring must occur at many locations and levels through the enterprise network and beyond, and create a comprehensive data set that an increasingly smart
and dynamic collection of analysis tools then scours. Only
in this way can we respond quickly and effectively to
emerging cyberthreats that have learned how to fly under
the traditional network radar.
Lisa Phifer owns Core Competence Inc., a consultancy specializing
in safe business use of emerging Internet technologies. Phifer is a
recognized industry expert on wireless, mobile and cyber security.
Copyright of Information Security is the property of TechTarget, Inc. and its content may not
be copied or emailed to multiple sites or posted to a listserv without the copyright holder's
express written permission. However, users may print, download, or email articles for
individual use.
FRONTLINE REPORT
How IT Pros Employ the Latest Security Monitoring Tech

Get the latest from the enterprise frontline on attack prevention and network protection.

By Steve Zurier

It takes only a cursory glance at the news to realize that
malware, data breaches and other information security
threats have expanded exponentially in the last three to
five years. Hardly a week goes by without another high-profile, multimillion-dollar hacking incident going public. Take your pick: Target, eBay, Home Depot, JPMorgan
Chase, Sony, even the U.S. Postal Service.
And while cyberattacks rarely target network devices,
vulnerable networks are regularly exploited by cybercriminals during these incidents to transport malicious traffic or stolen data. Behind the scenes are the IT pros who
must grapple with government-sponsored hacks—from
countries such as China, Iran, North Korea and Russia—
in addition to for-profit hackers with ties to organized
crime and, yes, people who decide to break into a network just because they can. It’s an environment that has
propelled networking and security teams to work closer
together so they can answer a seemingly basic question
with more certainty: Is the network safe?
Ron Grohman, a senior network engineer at Bush
Brothers & Co., makers of the popular Bush’s Best brand
of baked beans, says that, faced with all of the recent threats and attacks, he's not taking any chances.
NO PRODUCT IS PERFECT
Grohman doesn’t rely on just one security product to protect the company’s network. He uses a mix of Cisco ASA
5525-X firewalls with the Sourcefire URL filter, FireEye’s
Web Malware Protection System (MPS) 1310 to look for
suspicious malware and Symantec antivirus software as a
final backup.
Grohman says he uses the products from Cisco and
Sourcefire—which Cisco acquired in 2013—mainly as a
firewall and URL filter to manage Web traffic. The FireEye product was placed on the network to do anomaly-based detection. If the FireEye platform detects
suspicious malware, the software blocks the malware and
sends an alert to Grohman, who refers the incident to a
member of the help desk, where the malware is removed.
The Symantec software catches HTTPS traffic and serves
as a last line of defense before anything reaches the
endpoint.
“I double- and triple-up,” he says. “No one product
is perfect, and I’m OK with multiple systems checking
things, especially if it’s going to protect the network.”
At a Fortune 500 company, there might be dozens of
networking and security personnel who are cross-trained
in each other’s disciplines and work jointly on network
security. Not so at Bush Brothers, a private medium-sized
company in Knoxville, Tenn. Grohman works as one of
eight members of the infrastructure team—and he’s the
only one with security credentials. To augment his background in networking, he completed the CISSP course
from ISC2 last fall, and he earned the Cisco Certified Network Professional certification last spring. Grohman also
earned a degree in information security several years ago
from ITT Tech.
“All the security work falls on me,” he says. “People
would come to me and ask if the network was secure, and
all I could say was, ‘I think so,’ ” he says. “It bothered me
that I didn’t know for sure if everything was safe. So that’s
why we added all those extra layers.”
That kind of uneasiness is common among network
and security managers today. But it’s also an approach
that Frank Dickson, research director for information and network security at Frost & Sullivan, says is
understandable.
“There’s really no single silver bullet,” he says. “No
matter what the vendors say, malware will inevitably get
through. Remember, a zero-day attack, by definition, exploits an unknown vulnerability. Defending oneself from
the unknown is challenging.”
Dickson says there’s definitely been a shift in the security landscape away from solely relying on traditional antivirus software, which would look to create signatures of
previously undiscovered malware after a breach had been
detected. Today, products such as Sourcefire from Cisco,
FireEye, and Palo Alto Networks’ WildFire and Traps
tools take a much more proactive approach.
“The industry is moving to the use of more behavior-based approaches, such as testing the behavior of suspicious files in a quarantined, virtualized environment or
utilizing big data analytics to monitor network traffic to
establish a baseline and look for significant anomalies,”
Dickson explains.
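The baseline-and-anomaly approach Dickson describes can be sketched with basic statistics: establish normal traffic volume, then flag samples that deviate significantly. The sample data and three-sigma threshold below are illustrative, not from any product:

```python
# Establish a traffic baseline, then flag large statistical deviations.
from statistics import mean, stdev

baseline = [980, 1010, 1005, 995, 1002, 998, 1015, 990]  # e.g., KB/sec samples

def is_anomalous(sample, history, threshold=3.0):
    """Flag samples more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(history), stdev(history)
    return abs(sample - mu) > threshold * sigma

print(is_anomalous(1003, baseline))    # within the normal range
print(is_anomalous(25_000, baseline))  # a major spike worth investigating
```

Production analytics use far richer features (time of day, per-host profiles, seasonality), but the core idea—model normal, alert on deviation—is the same.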
PROTECTION VS. PREVENTION
Conventional wisdom and the sheer reality of today’s
threats may dictate an approach like the one used at Bush
Brothers. But Golan Ben-Oni, CISO at telecom, banking and energy company IDT Corporation in Newark, N.J., doesn't buy it. He says there has to be more of
a focus on stopping the bad guys, not merely responding
to attacks after the fact. The industry has been caught in a mode of believing that it's only a matter of when organizations will be hacked, he contends, as opposed to if they will be hacked. Ben-Oni says that's defeatist.
“If you say we’re giving up on prevention, you’re then
essentially saying you’ve given up,” he says.
IDT uses a combination of three products from Palo Alto Networks to protect its network: WildFire network detection software; Traps, which Palo Alto acquired from Israel-based Cyvera last year for endpoint protection; and GlobalProtect. The third tool allows IDT to extend the benefits of
WildFire and Traps to mobile devices and computers that leave the office.

Crime Doesn't Pay, But It Sure Costs

How bad is a data breach or leak for the bottom line? It depends on the underlying cause, which one study says is linked to the average cost per compromised record:

Malicious attack: $246
System glitches: $171
Employee error: $160

Source: "2014 Cost of Data Breach Study: United States," Ponemon Institute/IBM, May 2014
Here’s how they work in concert at IDT: Traps is always on the lookout for malware on the endpoint. If
Traps detects that a zero-day attack or some other anomaly has entered the network, it will communicate that
to WildFire, which will then run an analysis. Once it
confirms that the activity is in fact malware, WildFire
will block and remediate the malware. WildFire adds another level of protection in that, once it detects malware,
both the endpoint and the network (via the Palo Alto firewall) are protected. The network will not allow the malicious traffic to flow through, and if the file should be
introduced by some other means—via a USB flash drive
or local file copies, for example—Traps will block its
execution.
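The division of labor just described—endpoint agent flags, cloud service analyzes, endpoint and firewall both enforce—can be simulated in a few functions. This is a simplified illustration of the workflow, not Palo Alto Networks' actual APIs; all hashes and helper names are invented:

```python
# Simulated endpoint/cloud/firewall coordination for a suspicious file.
def endpoint_agent(file_hash, known_good):
    """Flag anything not already known-good for cloud analysis."""
    return None if file_hash in known_good else file_hash

def cloud_analysis(file_hash, malicious_hashes):
    """Stand-in for cloud detonation/analysis: return a verdict."""
    return "malware" if file_hash in malicious_hashes else "benign"

def enforce(verdict, blocklist, file_hash):
    """On a malware verdict, endpoint and firewall share one blocklist."""
    if verdict == "malware":
        blocklist.add(file_hash)
    return blocklist

blocklist = set()
suspect = endpoint_agent("abc123", known_good={"000aaa"})
if suspect:
    verdict = cloud_analysis(suspect, malicious_hashes={"abc123"})
    blocklist = enforce(verdict, blocklist, suspect)
print(blocklist)  # the hash is now blocked on endpoint and network alike
```

Because the verdict propagates automatically, no human has to shuttle the sample to a lab and wait for a signature—the point Ben-Oni makes below.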
In the past, Ben-Oni says, by the time the IT staffers detected malware, disconnected the computer from
the network and uploaded the file to the antivirus lab,
it could take 24 hours for them to write a signature. IT
teams don’t have that kind of time today.
“Traditionally, all of this was done manually and now
it happens in near real-time,” Ben-Oni says. By avoiding
the need to bring people into the process, the risk of lateral infection is greatly reduced, he explains. Hackers use
automation, so the only way for companies to level the
playing field is to also use automation, Ben-Oni adds.
AUTOMATE NOW
That’s a really important point, says Dan Polly, enterprise information security officer at First Financial Bank,
which operates more than 100 banking locations in Indiana, Kentucky and Ohio.
First Financial uses Cisco Advanced Malware Protection (AMP) for endpoints and the network, which lets the
company do rapid-fire malware analysis. If AMP detects
malware, it pushes the suspect file into a portal, which
acts as a sandbox in which the software runs an analysis
to determine the extent of the threat.
“The thing to remember is that before these tools
were available, you would need someone to analyze that
malware in-depth, which required a person with some extensive programming and security skills,” Polly explains.
“Now, we’re able to automate some of that, which saves
time and gives us the ability to block the threat much
faster.”
Much like IDT’s Ben-Oni, Polly sees a great benefit to
working with a single vendor that offers multiple capabilities. Along with the AMP product, the company’s security engineering team uses a combination of Cisco ASA
and Sourcefire next-generation firewalls, which he says
not only perform traditional firewall functions, but also
support intrusion detection and prevention and URL content filtering. Polly also likes that, through development
and acquisition, Cisco has invested in Talos, the company’s security intelligence and research group. Talos fields
a team of researchers who analyze threats and spend their
days looking to improve Cisco’s security products.
“With an extensible platform, it cuts the time we can
address any emerging threats,” Polly says. “There’s no
question that the security industry goes in phases; there
are times when best-of-breed was the only choice, [but]
now it seems to be going back the other way to single
source with multiple capabilities.”
For years, information security and networking teams
worked in different departments and, in some cases, competed with each other.
FireEye CTO Dave Merkel says the current threat
landscape has changed that dynamic.
“Today, security must be woven into the fabric of the
organization,” he says, adding that the security and networking teams must work in partnership.
Scott Harrell, vice president of product marketing in
the security business group at Cisco, agrees that collaboration between the two disciplines will be critical to combat more advanced threats.
“While the two groups can still have division of roles,
I think it will move to a point where the security group
is more involved in developing the network architecture,
and the networking staff will handle Tier 1 security calls
while the Tier 2 and Tier 3 alerts go to more experienced
security pros,” he says.
Golan Ben-Oni, CISO at IDT Corporation, says all the
teams within IT work together at IDT.
“At our company, everyone gets cross-trained in all
the different computing disciplines," he adds.
Steve Zurier is a freelance technology journalist based in
Columbia, Md., with more than 30 years of journalism and
publishing experience. Zurier previously worked as features editor
at Government Computer News and InternetWeek.
Computing (2015) 97:357–377
DOI 10.1007/s00607-014-0398-5
An overview of the commercial cloud monitoring tools:
research dimensions, design issues, and state-of-the-art
Khalid Alhamazani · Rajiv Ranjan · Karan Mitra ·
Fethi Rabhi · Prem Prakash Jayaraman ·
Samee Ullah Khan · Adnene Guabtni · Vasudha Bhatnagar
Received: 29 June 2013 / Accepted: 20 March 2014 / Published online: 16 April 2014
© Springer-Verlag Wien 2014
Abstract Cloud monitoring activity involves dynamically tracking the Quality of Service (QoS) parameters related to virtualized resources (e.g., VMs, storage, network, appliances, etc.), the physical resources they share, the applications running on them and the data hosted on them. Configuring applications and resources in a cloud computing environment is quite challenging, given the large number of heterogeneous cloud resources. Further, at any given point in time, the cloud resource configuration (number of VMs, types of VMs, number of appliance instances, etc.) may need to change to meet application QoS requirements under uncertainties (resource failure, resource overload, workload spikes, etc.). Hence, cloud monitoring tools can assist cloud providers and application developers in: (i) keeping their resources and applications operating at peak efficiency, (ii) detecting variations in resource and application performance, (iii) accounting for service level agreement violations of certain QoS parameters, and (iv) tracking the leave and join operations of cloud resources due to failures and other dynamic configuration changes. In this paper, we identify and discuss the major research dimensions and design issues related to engineering cloud monitoring tools. We further discuss how the aforementioned research dimensions and design issues are handled by current academic research as well as by commercial monitoring tools.

K. Alhamazani · F. Rabhi
University of New South Wales, Sydney, Australia

R. Ranjan (B) · P. P. Jayaraman
CSIRO, Canberra, Australia
e-mail: rajiv.ranjan@csiro.au

K. Mitra
Luleå University of Technology, Luleå, Sweden

S. U. Khan
North Dakota State University, Fargo, USA

A. Guabtni
NICTA, Sydney, Australia

V. Bhatnagar
University of Delhi, New Delhi, India
Keywords Cloud monitoring · Cloud application monitoring · Cloud resource
monitoring · Cloud application provisioning · Cloud monitoring metrics · Quality
of service parameters · Service level agreement
Mathematics Subject Classification 68U01
1 Introduction
According to the National Institute of Standards and Technology (NIST),1 cloud computing is a "model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (network, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [1]. Service models, hosting, deployment models, and roles are some of the important concepts and essential characteristics related to cloud computing technologies defined by NIST and elaborated in [1–6]. Commercial cloud providers
including Amazon Web Services (AWS), Microsoft Azure, Salesforce.com, Google
including Amazon Web Services (AWS), Microsoft Azure, Salesforce.com, Google
App Engine and others offer the cloud consumers options to deploy their applications
over a network of infinite resource pool with practically no capital investment and
with modest operating cost proportional to the actual use. For example, the Amazon EC2 cloud runs around half a million physical hosts, each of them hosting multiple virtual machines that can be dynamically invoked or removed [7].
Several papers in the literature discuss, explore and survey cloud monitoring in its different aspects [1–6,8–13]. To the best of our knowledge, no specific survey considers monitoring applications across the different cloud layers, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). Further, none of these papers has focused on predictive cloud monitoring, and none discusses the possibility of applying machine learning techniques to monitored data. In addition to the above factors, one emerging aspect of cloud computing is managing huge volumes of data (Big Data). At present, the term "Big Data" describes a phenomenon that refers to the practice of collecting and processing very large datasets and the associated systems and algorithms used to analyze these enormous data sets [14]. Three well-recognized characteristics of Big Data are the Variety, Volume and Velocity (3 V's) of data generation [15,16]. The steady growth of social media and smart mobile devices has led to an increase in the sources of outbound traffic, initiating a "data tsunami" phenomenon. For example, in [17–19], Wang et al. present a high-performance data-intensive computing solution for massive remote sensing data processing. This poses significant challenges in cloud computing.
1 http://www.nist.gov/itl/cloud/.
In [20], studies show that as more people join social media sites hosted on clouds, the data becomes more difficult, and almost impossible, to analyze. Another class of big data events can occur when, for example, services or processes of the infrastructure itself cause high load [21]. Another study [22,23] shows that VM migration, copying and saving of current state can affect the performance of data transfer within the cloud. Moreover, the different types of data originating from mobile devices make understanding composite data a challenging problem due to multi-modality, huge volume, dynamic nature, multiple sources, and unpredictable quality. Continuous monitoring of multi-modal data streams collected from heterogeneous data sources requires monitoring tools that can cope with managing big data floods (the data tsunami phenomenon).
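One common tactic for coping with such metric floods is to keep only a bounded window per resource and metric and report rolling statistics rather than storing everything. The sketch below illustrates that idea under invented resource and metric names; it is not drawn from any tool surveyed in this paper:

```python
# Bounded-window stream monitor: per-(resource, metric) rolling stats.
from collections import defaultdict, deque
from statistics import mean

class StreamMonitor:
    def __init__(self, window=1000):
        # Each key keeps only the most recent `window` samples.
        self.windows = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, resource, metric, value):
        self.windows[(resource, metric)].append(value)

    def rolling_mean(self, resource, metric):
        return mean(self.windows[(resource, metric)])

mon = StreamMonitor(window=3)
for v in [10, 12, 50, 52, 54]:  # only the last 3 samples are retained
    mon.ingest("vm-1", "cpu_percent", v)
print(mon.rolling_mean("vm-1", "cpu_percent"))  # 52.0
```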
In this paper, we identify and discuss three challenges of cloud monitoring: (1) how to determine layer-specific application monitoring requirements, i.e., how a cloud consumer can stipulate at what cloud layer his/her running application should be monitored; (2) how a cloud consumer can express what information he/she is interested in gaining while his/her application is being monitored; and (3) how a cloud consumer can predict application behavior in terms of performance.
1.1 Our contributions
The concrete contributions made by this paper are: (a) advancing the fundamental understanding of cloud resource and application provisioning and monitoring concepts, (b) identifying the main research dimensions and software engineering issues based on cloud resource and application types, and (c) presenting future research directions for novel cloud monitoring techniques.
The remainder of the paper is organized as follows: a discussion of key cloud resource provisioning concepts is presented in Sect. 2. Section 3 discusses the cloud application life cycle and, in detail, the components of cloud monitoring. Section 4 presents details on the major research dimensions and software engineering issues related to developing cloud monitoring tools. Section 5 discusses the mapping of research dimensions to existing cloud monitoring tools. The paper ends with a brief conclusion and an overview of future work.
2 Cloud resource provisioning
Cloud resource provisioning is a complex task [24]; it refers to the process of application deployment and management on cloud infrastructure. Current cloud providers, such as Amazon EC2, ElasticHosts, GoGrid, Terremark, and Flexiant, do not fully support automatic deployment and configuration of software resources [25]. Therefore, several companies, e.g., RightScale and Scalr, provide scalable managed services on top of cloud infrastructures that give cloud consumers automatic application deployment and configuration control [25]. The three main steps of cloud provisioning are [24,26]:
Virtual machine provisioning where suitable VMs are instantiated to match the required hardware and configuration of an application. To illustrate, Bitnami (http://bitnami.org/faq/cloud_amazon_ec2) supports consumers in provisioning a Bitnami stack that consists of a VM and appliances. On the other hand, Amazon EC2 consumers may first provision a VM on the cloud and then choose the appliances to provision on that VM.

Fig. 1 Provisioning and deployment sequence diagram
Resource provisioning is the process of mapping and scheduling the instantiated VMs onto the cloud's physical servers. This is handled by cloud-based hypervisors. For example, public clouds expose APIs to start/stop a resource but not to control which physical server within that region/datacenter will host the VM. Figure 1 illustrates the steps a cloud consumer takes to provision cloud resources on the Amazon EC2 platform. In step 1, the consumer views the available VMs provided by the cloud platform in the VM repository and selects the preferred VM instance type. In step 2, the consumer sets up his/her preferences/configurations on this VM. In steps 3 and 4, the consumer deploys this VM on the cloud platform. Subsequently, in steps 5 and 6, the consumer retrieves a list of available applications from the applications repository. In step 7, the consumer simply opts for his/her desired applications that
he/she would like to provision. Finally, in step 8, the cloud consumer deploys the
applications and the VM on the cloud platform.
Application provisioning is the process of deploying applications on VMs on the cloud infrastructure, for example, deploying a Tomcat server as an application on a VM hosted on the Amazon EC2 cloud. Application provisioning can be done in two ways. The first method deploys the applications together with the VM as it is being hosted. In the second method, the consumer may first deploy the VM and then, as a separate step, deploy the applications.
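The two provisioning styles above can be sketched as follows; CloudClient and its methods are hypothetical stand-ins, not a real cloud SDK:

```python
# Illustrative sketch of the two application-provisioning styles described
# above; CloudClient and its methods are invented, not a real SDK.
class CloudClient:
    def __init__(self):
        self.deployed = []

    def provision_vm(self, instance_type, apps=()):
        """Method 1: deploy the VM together with its applications."""
        vm = {"type": instance_type, "apps": list(apps)}
        self.deployed.append(vm)
        return vm

    def deploy_app(self, vm, app):
        """Method 2: deploy an application on an already-running VM."""
        vm["apps"].append(app)

client = CloudClient()
# Method 1: VM and application provisioned in one step (Bitnami-style stack).
stack = client.provision_vm("m1.small", apps=["tomcat"])
# Method 2: VM first, then the application as a separate step (EC2-style).
vm = client.provision_vm("m1.small")
client.deploy_app(vm, "tomcat")
print(stack["apps"], vm["apps"])  # ['tomcat'] ['tomcat']
```

Either path ends in the same deployed state; the difference is only in whether application deployment is bundled with VM instantiation or performed afterwards.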
After the provisioning stage, a cloud workflow instance might be composed of multiple cloud services, in some cases from a number of different service providers. Therefore, monitoring the quality of cloud instances across cloud providers becomes much more complex [27]. Further, at run time, the QoS of the running instance needs to be consistently monitored to guarantee the SLA and to avoid or handle abnormal system behaviour. Monitoring is the process of observing and tracking applications and resources at run time. It is the basis of control operations and corrective actions for systems running on clouds. Despite the existence of many commercial monitoring tools in the market, managing service level agreements (SLAs) that span multiple cloud providers still poses a major issue in clouds.
Cloud monitoring, SLAs, and dynamic configuration are correlated in the sense that each has an impact on the others. In other words, enhancing monitoring functionality will in turn assist in meeting SLAs as well as improving dynamic configuration operations at run time. Moreover, SLAs have to be met by cloud providers in order to reach the reliability level required by consumers. Also, auto-scaling and dynamic configuration are required for optimal use of cloud technology. Altogether, this leads us to conclude that cloud monitoring is a key element that has to be further studied and enhanced.
3 Cloud monitoring
In this section, we present the basic components, phases, and layers of application architecture on clouds. This section also presents the state of the art in cloud monitoring as well as how it is conceptually correlated with QoS and SLAs.
3.1 Application life cycle
The application architecture determines how, when, and which provisioning operations should be processed and applied to cloud resources. A high-level application (e.g., a multimedia application) architecture is multi-layered [28]. The layers may consist of clients/application consumers, load balancers, web servers, streaming servers, application servers, and a database system. Notably, each layer may instantiate multiple software resources as and when required, and such multiple instantiations can be allocated to one or more hardware resources. Technically, across those system layers, a number of provisioning operations take place at design time as well as at run time. These provisioning operations should ensure SLA compliance by achieving the QoS targets.
Resource selection is the process by which the system developer selects software resources (web server, multimedia server, database server, etc.) and hardware resources (CPU, storage, and network). This process encapsulates the allocation of hardware resources to the selected software resources.
Resource deployment During this process, the system administrator instantiates the selected software resources on the hardware resources and configures them for successful communication and inter-operation with the other software resources already running in the system.
Resource monitoring In order to ensure that the deployed software and hardware resources run at the level required to satisfy the SLA, a continuous resource monitoring process is desirable. This process involves detecting and gathering information about the running resources. If any abnormal system behavior is detected, the system orchestrator is notified so that policy-based corrective actions can be undertaken as a system remedy.
Resource control is the process of ensuring that the QoS terms stated in the SLA are met. This process is responsible for handling system uncertainties at run time, e.g., upgrading or downgrading a resource's type, capacity, or functionality.
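The four processes above can be sketched as a toy orchestration loop; all function names, the resource description, and the 90% CPU threshold are invented for illustration:

```python
# Toy loop mirroring the four lifecycle processes: selection, deployment,
# monitoring, and control. All names and thresholds are illustrative.
def select_resources():
    # Resource selection: pick software resources and their hardware needs.
    return {"web_server": {"cpu_cores": 2}}

def deploy(resources):
    # Resource deployment: instantiate each selected resource.
    return {name: {"cpu_util": 0.5, **cfg} for name, cfg in resources.items()}

def monitor(running):
    # Resource monitoring: detect abnormal behaviour (here, CPU overload).
    return [name for name, r in running.items() if r["cpu_util"] > 0.9]

def control(running, abnormal):
    # Resource control: policy-based corrective action (upgrade capacity).
    for name in abnormal:
        running[name]["cpu_cores"] *= 2

running = deploy(select_resources())
running["web_server"]["cpu_util"] = 0.95   # simulate a load spike
control(running, monitor(running))
print(running["web_server"]["cpu_cores"])  # 4
```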
3.2 Cloud monitoring
In clouds, monitoring is essential for maintaining high availability and performance, and it is important for both providers and consumers [8–10]. Primarily, monitoring is a key tool for (i) managing software and hardware resources, and (ii) providing continuous information about those resources as well as about consumer applications hosted on the cloud. Cloud activities like resource planning, resource management, data center management, SLA management, billing, troubleshooting, performance management, and security management essentially need monitoring for effective and smooth operation of the system [29]. Consequently, given the elastic nature of cloud computing, there is a strong need for monitoring [30].
In cloud computing, monitoring can be of two types: high-level and low-level. High-level monitoring is related to the status of the virtual platform [31]. Low-level monitoring is related to information collected about the status of the physical infrastructure [31,32]. A cloud monitoring system is a self-adjusting and typically multi-threaded system that supports monitoring functionality [33]. It comprehensively monitors pre-identified instances/resources on the cloud for abnormalities. On detecting abnormal behavior, the monitoring system attempts to auto-repair the instance/resource if the corresponding monitor has a tagged auto-heal action [33]. In case of auto-repair failure or the absence of an auto-heal action, a support team is notified. Technically, notifications can be sent by different means such as email or SMS [33].
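The detect/auto-heal/notify flow described in [33] can be sketched as follows; the function names and the in-memory notification are illustrative stand-ins, not the cited system's actual interface:

```python
# Sketch of the detect -> auto-heal -> notify flow: try the tagged auto-heal
# action first; on failure or absence, notify the support team.
def handle_abnormality(resource, auto_heal=None, notify=print):
    if auto_heal is not None:
        try:
            auto_heal(resource)
            return "healed"
        except Exception:
            pass  # auto-repair failed; fall through to notification
    notify(f"ALERT: {resource} abnormal, support team notified (email/SMS)")
    return "notified"

def restart(resource):       # a tagged auto-heal action that succeeds
    pass

def broken_heal(resource):   # an auto-heal action that itself fails
    raise RuntimeError("repair failed")

print(handle_abnormality("vm-17", auto_heal=restart))      # healed
print(handle_abnormality("vm-18", auto_heal=broken_heal))  # notified
print(handle_abnormality("vm-19"))                         # notified (no auto-heal tag)
```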
3.2.1 Monitoring, QoS, and SLA
As mentioned earlier, cloud monitoring is needed for continuous assessment of resources and applications on a cloud platform in terms of performance, reliability, power usage, ability to meet the SLA, security, etc. [34]. Fundamentally, monitoring tests can be computation-based and/or network-based. Computation-based tests are concerned with the status of the real or virtualized platforms running cloud applications. Data metrics considered in such tests include CPU speed, CPU utilization, disk throughput, VM acquisition/release time, and system up-time. Network-based tests focus on network-layer metrics such as jitter, round-trip time (RTT), packet loss, traffic volume, etc. [31,32,35].
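A minimal sketch of this classification, with invented metric names and invented SLA bounds:

```python
# Classify monitoring metrics into computation-based and network-based tests
# and check each sample against an SLA bound. Metric names and the bounds
# are invented for illustration.
COMPUTATION_METRICS = {"cpu_speed", "cpu_utilization", "disk_throughput",
                       "vm_acquisition_time", "system_uptime"}
NETWORK_METRICS = {"jitter", "rtt", "packet_loss", "traffic_volume"}

def classify(metric):
    if metric in COMPUTATION_METRICS:
        return "computation"
    if metric in NETWORK_METRICS:
        return "network"
    return "unknown"

def violates(metric, value, bounds):
    """True if a sampled value exceeds the upper bound agreed in the SLA."""
    return metric in bounds and value > bounds[metric]

bounds = {"cpu_utilization": 0.9, "rtt": 0.2}  # invented SLA limits
print(classify("rtt"))                         # network
print(violates("rtt", 0.35, bounds))           # True
```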
At run-time, a set of operations takes place in order to meet the QoS parameters specified in the SLA document, which guarantees the performance objectives required by cloud consumers. The availability, load, and throughput of hardware resources can vary in unpredictable ways, so ensuring that applications achieve QoS targets is not trivial. Being aware of the system's current software and hardware service status is imperative for handling such uncertainties and ensuring the fulfillment of QoS targets [8]. In addition, detecting exceptions and malfunctions while deploying software services on hardware resources is essential, e.g., showing the QoS delivered by each application component (a software service such as a web server or database server) hosted on each hardware resource. Uncertainties can be tackled through the development of efficient, scalable, interoperable monitoring tools with easy-to-use interfaces.
3.2.2 Monitoring across different applications and layers
As mentioned previously, application components (e.g., streaming server, web server, indexing server, compute service, storage service, and network) are distributed across cloud layers, including PaaS and IaaS. Thus, in order to guarantee the achievement of QoS targets for the application as a whole, QoS parameters should be monitored across all layers of the cloud stack, including Platform-as-a-Service (PaaS) (e.g., web server, streaming server, indexing server, etc.) and Infrastructure-as-a-Service (IaaS) (e.g., compute services, storage services, and network). Figure 2 illustrates how different components are distributed across the cloud platform layers. Table 1 shows the QoS parameters that a monitoring system should consider at each cloud layer.
Fig. 2 Components across cloud platform layers

Table 1 QoS parameters at each cloud platform layer

Cloud layer | Layer components | Targeted QoS parameters
SaaS | Appliances x, y, z, etc. | BytesRead, BytesWrite, Delay, Loss, Availability, Utilization
PaaS | Web Server, Streaming Server, Index Server, Apps Server, etc. | BytesRead, BytesWrite, SysUpTime, SysDesc, HrSystemMaxProcesses, HrSystemProcesses, SysServices
IaaS | Compute Service, Storage Service, Network, etc. | CPU parameters: Utilization, ClockSpeed, CurrentState; network parameters: Capacity, Bandwidth, Throughput, ResponseTime, OneWayDelay, RoundTripDelay, TcpConnState, TcpMaxConn

Typically, QoS targets vary across application types. For example, QoS targets for eResearch applications differ from those of static, single-tier web applications (e.g., a web site serving static content) or multi-tier applications (e.g., on-demand audio/video streaming). Based on the application type, there is always a need to negotiate different SLAs. Hence, the SLA document includes conditions and constraints that match the nature of the QoS requirements of each application type. For example, a genome analysis experiment on cloud services will only care about data transfer (upload and download), network latency, and processing latency. For multimedia applications, on the other hand, the quality of the data transferred over the network is more important, and other parameters gain priority. Failing to track QoS parameters will eventually lead to SLA violations. Consequently, monitoring is fundamental and is responsible for certifying SLA compliance [36]. Moreover, a multi-layer application monitoring approach can provide significant insights into application and system performance for both the consumer and the cloud administrator. This is essential for consumers, as they can identify and isolate application performance bottlenecks at specific layers. From a cloud administrator's point of view, QoS statistics on application performance across layers can help them maintain their SLAs, delivering better performance and higher consumer satisfaction.
4 Evaluation dimensions
In this section, we present the basic components that can be considered as dimensions for evaluating a cloud monitoring tool.
4.1 Monitoring architectures
In cloud monitoring, network- and system-related information, for example CPU utilization, network delay, and packet loss, is collected by the monitoring system. This information is then used by applications to determine actions, such as migrating data to the server closest to the user, that ensure SLA requirements are met. Typically, monitoring can be performed on centralized or decentralized network architectures.

Fig. 3 Centralized monitoring architecture
4.1.1 Centralized
In the centralized architecture shown in Fig. 3, the PaaS and IaaS resources report QoS status updates to a centralized monitoring server. In this scheme, the monitoring system continuously pulls information from the components via periodic probing messages. In [33], the authors show that a centralized cloud monitoring architecture allows better management of cloud applications. Nevertheless, the centralized approach has several design issues, including:
• It is prone to a single point of failure;
• It lacks scalability;
• Network communication costs are high at links leading to the information server (i.e., network bottleneck, congestion); and
• It may lack the computational power required to serve a large number of monitoring requests.
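The pull-based centralized scheme can be sketched as follows; the classes are in-memory stand-ins, whereas a real monitor would probe components over the network:

```python
# Toy version of the centralized pull model: one monitoring server
# periodically probes every PaaS/IaaS component for its QoS status.
class Component:
    def __init__(self, name, cpu_util):
        self.name, self.cpu_util = name, cpu_util

    def probe(self):                  # answers the server's probing message
        return {"name": self.name, "cpu_util": self.cpu_util}

class CentralMonitor:
    # The single server is also the design's single point of failure:
    # if it goes down, no component is monitored at all.
    def __init__(self, components):
        self.components = components

    def poll_once(self):
        return [c.probe() for c in self.components]

monitor = CentralMonitor([Component("web", 0.4), Component("db", 0.8)])
statuses = monitor.poll_once()
print([s["name"] for s in statuses])  # ['web', 'db']
```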
4.1.2 Decentralized
Recently, proposals for decentralized cloud monitoring tools have gained momentum. Figure 4 shows the broad schematic design of a decentralized cloud monitoring system. Decentralizing monitoring tools can overcome the issues of current centralized systems. A monitoring tool configuration is considered decentralized if no component in the system is more important than the others: if one of the components fails, it does not influence the operation of any other component in the system.

Fig. 4 Decentralized monitoring architecture
Structured peer-to-peer The desire for a network layout in which central authority is diffused has led to the development of structured peer-to-peer networks. In such a network overlay, the central point of failure is eliminated. Napster is a popular structured peer-to-peer system [37].
Unstructured peer-to-peer An unstructured peer-to-peer overlay is also a distributed overlay but, unlike a structured peer-to-peer overlay, its search directory is not centralized, so the directory does not constitute a single point of failure. Gnutella is one of the well-known unstructured peer-to-peer systems [37].
Hybrid peer-to-peer A hybrid is a combination of structured and unstructured peer-to-peer network systems. Super-peers can act as local search hubs in small portions of the network, whereas the network at large behaves as an unstructured peer-to-peer system. Kazaa is a hybrid of the centralized Napster and decentralized Gnutella network systems.
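The defining decentralization property, that no component matters more than another, can be sketched with a toy pairwise state exchange; the Peer class is purely illustrative and does not model any of the systems named above:

```python
# Minimal sketch of the decentralized principle: peers exchange monitoring
# state pairwise, with no central server, so losing one peer does not stop
# the others.
class Peer:
    def __init__(self, name):
        self.name = name
        self.known = {name}          # monitoring state seen so far

    def exchange(self, other):
        """Pairwise exchange of monitoring state (no central directory)."""
        merged = self.known | other.known
        self.known = set(merged)
        other.known = set(merged)

peers = [Peer(f"p{i}") for i in range(4)]
for a in peers:                      # a few rounds of pairwise exchanges
    for b in peers:
        if a is not b:
            a.exchange(b)

peers.pop()                          # one peer fails...
# ...but the remaining peers still hold the full monitoring state.
print(all(p.known == {"p0", "p1", "p2", "p3"} for p in peers))  # True
```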
4.2 Interoperability
The interoperability perspective in technology focuses on a system's technical capability to interface with other organizations and systems, and on the resulting compatibility or incompatibility between systems and data-collation partners. Modern business applications developed on clouds are often complicated and require interoperability. For example, an application owner can deploy a web server on the Amazon cloud while the database server is hosted on the Azure cloud. Unless data and applications are properly integrated across clouds, the results and benefits of cloud adoption cannot be achieved. Interoperability is also necessary to avoid cloud provider lock-in.

Fig. 5 Interoperability classification

Table 2 Monitoring tools and interoperability

Platform | Interoperability, Cloud-Agnostic (Multi-Cloud)
Monitis [38] | Yes
RevealCloud [39,40] | Yes
LogicMonitor [41] | Yes
Nimsoft [42] | Yes
Nagios [31,43] | Yes
SPAE [44,45] | Yes
CloudWatch [46] | No
OpenNebula [47] | No
CloudHarmony [48] | Yes
Azure FC [49,50] | No
This dimension refers to the ability of a cloud monitoring framework to monitor an application and its components when they are deployed across multiple cloud providers. While it is not difficult to implement a cloud-specific monitoring framework, designing a generic cloud monitoring framework that can work with multiple cloud providers remains a challenging problem. Next, we classify the interoperability (Fig. 5) of monitoring frameworks into the following categories:
Cloud dependent Currently, many public cloud providers supply their consumers with tools to monitor their application's CPU, storage, and network usage. Often these tools are tightly integrated with the cloud provider's existing tools. For example, CloudWatch, offered by Amazon, is a monitoring tool that enables consumers to manage and monitor their applications residing on AWS EC2 (CPU) services. However, this monitoring tool is not able to monitor an application component that resides on another cloud provider's infrastructure, such as GoGrid or Azure. Table 2 gives examples of cloud monitoring tools that are specific to a cloud provider as well as cloud-agnostic ones.
Cloud Agnostic In contrast to single-cloud monitoring, engineering cloud-agnostic monitoring tools is challenging. This is primarily because there is no common, unified application programming interface (API) for querying cloud computing services' runtime QoS statistics. Though recent developments in cloud programming APIs, including Simple Cloud, Delta Cloud, JClouds, and Dasein Cloud, simplify interaction with services (CPU, storage, and network) that may belong to multiple clouds,
they have limited or no ability to monitor their run-time QoS statistics and application
behaviors. In this scenario, monitoring tools are expected to be able to retrieve QoS
data of services and applications that may be part of multiple clouds. Cloud-agnostic monitoring tools are also required if one wants to realize a hybrid cloud architecture involving services from private and public clouds. The Monitis monitoring tool provides the ability to access different clouds, e.g., Amazon EC2, Rackspace, and GoGrid. It uses the concept of widgets, where consumers can view more than one widget on a page. In Monitis [38], consumers need to provide only their cloud account credentials to access monitoring data for their cloud applications running on different cloud providers' infrastructure. They can also specify which instance to monitor. Hence, a consumer can view two different cloud instances using two different widgets on one single page.
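The adapter pattern behind such cloud-agnostic tools can be sketched as follows; the provider classes and returned metric values are invented, since real adapters would call each provider's own monitoring API:

```python
# Sketch of a cloud-agnostic monitor: one common interface, one thin
# adapter per provider. All classes and metric values are illustrative.
class MonitoringAdapter:
    def qos(self, instance_id):
        raise NotImplementedError

class EC2Adapter(MonitoringAdapter):
    def qos(self, instance_id):
        # A real adapter would call Amazon's monitoring API here.
        return {"provider": "ec2", "instance": instance_id, "cpu_util": 0.42}

class RackspaceAdapter(MonitoringAdapter):
    def qos(self, instance_id):
        # A real adapter would call Rackspace's monitoring API here.
        return {"provider": "rackspace", "instance": instance_id, "cpu_util": 0.61}

class AgnosticMonitor:
    """One dashboard over many clouds, like per-cloud widgets on one page."""
    def __init__(self, adapters):
        self.adapters = adapters     # list of (adapter, instance_id) pairs

    def collect(self):
        return [a.qos(i) for a, i in self.adapters]

monitor = AgnosticMonitor([(EC2Adapter(), "i-123"), (RackspaceAdapter(), "rs-9")])
print([s["provider"] for s in monitor.collect()])  # ['ec2', 'rackspace']
```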
4.3 Quality of service (QoS) matrix
It is non-trivial for application developers to understand which QoS parameters and targets they need to specify and monitor at each layer of the cloud stack, including PaaS (e.g., web server, streaming server, indexing server, etc.) and IaaS (e.g., compute services, storage services, and network). As shown in Fig. 6, this can be done with one parameter or with a group of parameters.
4.3.1 Single parameter
In this scenario, a single parameter refers to a specific system QoS target. In each system, there are major atomic values that have to be tracked closely and continuously. For example, CPU utilization is expressed by only one parameter in the SLA. Such parameters can affect the whole system, and an SLA violation on one of them can lead to a serious system failure. Unlike composite parameters, where an individual parameter might not be a priority for the system administrator, single parameters in most cases gain high priority when monitoring SLA violations and QoS targets.
4.3.2 Composite parameters
In a composite parameter scenario, a group of different parameters is taken into consideration. In the cloud, a software application is composed of many cloud software services; thus, its performance quality is determined by the collective behavior of those software services [27]. By observing multiple parameters that estimate the functionality of one or more processes of concern, a single result can be obtained to evaluate the QoS. To illustrate, "loss" can be considered a composite parameter of the two single parameters "one-way loss" and "round-trip loss". Similarly, "delay" can be considered a composite parameter of the three single parameters "one-way delay", "RTT delay", and "delay variance". Table 3 lists some commercial cloud monitoring tools and illustrates which of them support monitoring multiple QoS parameters.

Fig. 6 QoS matrix classification

Table 3 Monitoring tools and layers' visibility

Platform | Visibility at multiple layers (composite QoS matrix)
Monitis [38] | Yes
RevealCloud [39,40] | Yes
LogicMonitor [41] | Yes
Nimsoft [42] | Yes
Nagios [31,43] | Yes
SPAE [44,45] | No
CloudWatch [46] | Yes
OpenNebula [47] | No
CloudHarmony [48] | No
Azure FC [49,50] | Yes
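A sketch of how a composite parameter could combine single parameters into one result; the weighting scheme is invented for illustration and is not prescribed by the paper:

```python
# Composite QoS parameters: "delay" combines three single parameters and
# "loss" combines two. The weights below are invented for illustration.
def composite_delay(one_way_delay, rtt_delay, delay_variance,
                    weights=(0.4, 0.4, 0.2)):
    """One result obtained from several single parameters."""
    parts = (one_way_delay, rtt_delay, delay_variance)
    return sum(w * p for w, p in zip(weights, parts))

def composite_loss(one_way_loss, round_trip_loss):
    return (one_way_loss + round_trip_loss) / 2

print(round(composite_delay(0.05, 0.11, 0.02), 3))  # 0.068
print(round(composite_loss(0.01, 0.03), 3))         # 0.02
```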
4.4 Cross-layer monitoring
As shown in Fig. 7, the application components (streaming server, web server, indexing server, compute service, storage service, and network) of a multimedia streaming application are distributed across cloud layers, including PaaS and IaaS. In order to guarantee the achievement of QoS targets for the multimedia application as a whole, it is critical to monitor QoS parameters across multiple layers [51]. Hence, the challenge here is to develop monitoring tools that can capture and reason about the QoS parameters of application components across the IaaS and PaaS layers. As demonstrated in Fig. 8, we categorize the visibility of commercial monitoring tools into the following categories:
Layer specific Cloud services are distributed among three layers, namely SaaS, PaaS, and IaaS. Monitoring tools were originally oriented towards monitoring services in only one of these layers. Most present-day commercial tools are designed to keep track of the performance of resources provisioned at the IaaS layer. For example, CloudWatch is not capable of monitoring load, availability, and throughput for each core of a CPU service and their effect on the QoS (e.g., latency, availability, etc.) delivered by the hosted PaaS services (e.g., a J2EE application server). Hence, there exists a considerable gap, and research challenges, in developing a monitoring tool that can monitor QoS statistics across multiple layers of the cloud stack.
Fig. 7 Components across cloud layers and QoS propagation

Fig. 8 Visibility categorization

Layer Agnostic In contrast to the previous scenario, monitoring at multiple layers enables consumers to gain insight into an application's performance across those layers; e.g., consumers can retrieve data for the same application from PaaS and IaaS at the same time (Table 3). This type of cloud monitoring is essential in all cases, but
obviously it is more effective for consumers requiring complete awareness about their
cloud applications.
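Cross-layer monitoring can be sketched as follows; the per-layer readings and SLA bounds are invented, and the point is that combining PaaS and IaaS samples lets a consumer locate the bottleneck layer:

```python
# Illustrative sketch of cross-layer monitoring: per-layer samples for one
# application are combined so a consumer can locate the bottleneck layer.
samples = {   # invented readings for a single multimedia application
    "PaaS": {"web_server_latency_ms": 120, "streaming_latency_ms": 340},
    "IaaS": {"cpu_utilization": 0.97, "network_rtt_ms": 45},
}

bounds = {"streaming_latency_ms": 250, "cpu_utilization": 0.9}  # invented SLA targets

def bottleneck_layers(samples, bounds):
    """Return the layers whose metrics violate their QoS targets."""
    bad = []
    for layer, metrics in samples.items():
        if any(m in bounds and v > bounds[m] for m, v in metrics.items()):
            bad.append(layer)
    return sorted(bad)

print(bottleneck_layers(samples, bounds))  # ['IaaS', 'PaaS']
```

A layer-specific tool sees only one of the two dictionaries above; a layer-agnostic tool sees both and can therefore attribute an SLA violation to the right layer.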
4.5 Programming interfaces
Programming interfaces allow the development of software systems that enable monitoring across different layers of the cloud stack. They involve several components, such as APIs, widgets, and the command line, that enable a consumer to monitor the many components of a complex cloud system in a unified manner.
4.5.1 Application programming interface
An application programming interface (API) is a particular set of rules (‘code’) and
specifications that software programs follow to communicate with each other (Fig. 9).
Fig. 9 Different types of programming interfaces

It serves as an interface between different software programs and facilitates their interaction, similar to the way the user interface facilitates interaction between humans and computers. In fact, most commercial monitoring tools, such as Rackspace, Nimsoft, RevealCloud, and LogicMonitor, provide their consumers with extensible open APIs enabling them to specify their own required system functionalities.
4.5.2 Command-line
A command line provides a means of communication between a consumer and a
computer that is based solely on textual input and output.
4.5.3 Widgets
In computer software, a widget is a software service available to consumers for running
and displaying applets via a graphical interface on the desktop. Monitis and RevealCloud are two popular commercial tools that provide performance data to consumers
on multiple customizable widgets.
4.5.4 Communication protocols
All commercial tools adopt communication protocols for data transfer, and these vary from one monitoring tool to another. For example, Monitis and Rackspace use the HTTPS and FTP protocols. Another example is LogicMonitor, which adopts encrypted Simple Network Management Protocol (SNMP).
5 Commercial monitoring tools
5.1 Monitis
Monitis [38], founded in 2005, has one unified dashboard where consumers can open multiple widgets for monitoring. A Monitis consumer needs to enter his/her credentials to access the hosting cloud account. In addition, a Monitis consumer can remotely monitor any website for uptime, or in-house servers for CPU load, memory, or disk I/O, by installing Monitis agents that retrieve data about the devices. A Monitis agent can also collect data from the networked devices of an entire network (behind a firewall); this technique is used instead of installing a Monitis agent on every single device. Widgets can also be emailed as read-only versions to share the monitored information. Moreover, Monitis provides rich features for reporting the status of instances, where
consumers can specify how a report should be viewed, e.g., as a chart or a graph. It also enables consumers to share reports publicly with others.
5.2 RevealCloud
CopperEgg [39,40] provides the RevealCloud monitoring tool. It was founded in 2010, and Rackspace is a main partner. RevealCloud enables its consumers to monitor across cloud layers, e.g., SaaS, PaaS, and IaaS. It is not dedicated to a single cloud resource provider; rather, it is generic, allowing a consumer to use it with the most popular cloud providers, e.g., AWS EC2, Rackspace, etc. RevealCloud is one of the very few monitoring tools that supports maintaining historical monitoring data. It can track up to 30 days of historical data, which is considered a prime feature that most commercial monitoring tools lack.
5.3 LogicMonitor
LogicMonitor [41] was founded in 2008 and partners with several third parties such as NetApp, VMware, Dell, and HP. Similar to RevealCloud, LogicMonitor enables its consumers to monitor across cloud layers, e.g., SaaS, PaaS, and IaaS. It also enables them to monitor application operations on multi-cloud resources. The protocol used for communication is SSL. Moreover, LogicMonitor uses SNMP as a method of retrieving data about distributed virtual and physical resources.
5.4 Nimsoft
Nimsoft [42] was founded in 2011. It supports multi-layer monitoring of both virtual and physical cloud resources. Moreover, Nimsoft enables its consumers to view and monitor their resources even when they are hosted on different cloud infrastructures; e.g., a Nimsoft consumer can view resources on Google Apps, Rackspace, Amazon, Salesforce.com, and others through a unified monitoring dashboard. Nimsoft also gives its consumers the ability to monitor both private and public clouds.
5.5 Nagios
Nagios [43] was founded in 2007 and supports multi-layer monitoring. It enables its consumers to monitor their resources on different cloud infrastructures as well as on in-house infrastructure. Nagios utilizes SNMP for monitoring networked resources. Moreover, Nagios has been extended with monitoring functionality for both virtual instances and storage services using a plugin-based architecture [31]. Typically, a Nagios server is required to collect the monitoring data, which makes it a centralized solution that the user needs to set up himself/herself. However, many possible configurations can help create multiple hierarchical Nagios servers to reduce the disadvantages of a centralized server.
5.6 SPAE by SHALB
SHALB [44] was founded in 2002 and provides a monitoring solution called the Security Performance Availability Engine (SPAE). SPAE is a typical network monitoring tool supporting a variety of network protocols such as HTTP, HTTPS, FTP, SSH, etc. It uses SNMP [45] to perform all of its monitoring processes and emphasizes security and vulnerability monitoring. However, SPAE does not support monitoring at different cloud layers (IaaS, PaaS, and SaaS). It enables its consumers to monitor networked resources, including cloud infrastructure.
5.7 CloudWatch
CloudWatch [46] is one of the most popular commercial tools for monitoring the cloud. It is provided by Amazon to enable its consumers to monitor their resources residing on EC2; hence, it does not support multi-cloud infrastructure monitoring. The technical approaches used in CloudWatch to collect data are implicit and not exposed to users. CloudWatch is limited in monitoring resources across cloud layers: an API is provided that lets users collect metrics at any cloud layer, but it requires the users to write additional code.
5.8 OpenNebula
OpenNebula [47] is an open-source monitoring system that provides management for data centers. It uses SSH as the protocol permitting consumers to gain access to and gather information about their resources. OpenNebula is mainly concerned with monitoring the physical infrastructure of data centers, such as private clouds.
5.9 CloudHarmony
CloudHarmony [48] started its monitoring services at the beginning of 2010. It provides a set of performance benchmarks for public clouds. It is mostly concerned with monitoring common operating system metrics related to CPU, disk, and memory. Moreover, CloudHarmony evaluates cloud-to-cloud network performance in terms of RTT and throughput.
5.10 Windows Azure FC
The Azure Fabric Controller (Azure FC) [49,50] adopts a centralized network architecture. It is a multi-layer monitoring system, but it does not support monitoring across different cloud infrastructures. Moreover, Azure FC utilizes SNMP to perform monitoring.
6 Classification and analysis of cloud monitoring tools based on taxonomy
With increasing cloud complexity, the effort needed to manage and monitor
cloud infrastructures multiplies. The size and scalability of clouds, compared
with traditional infrastructure, call for monitoring systems
that are more scalable, effective and fast. Technically, this means
there is a demand for real-time reporting of performance measurements while monitoring cloud resources and applications. Cloud monitoring systems therefore need
to be advanced and customized for the diversity, scalability and high dynamism of cloud
environments.
In Sect. 4, we analyzed in detail the main evaluation dimensions of monitoring. As
discussed, not all of those dimensions are adopted by monitoring systems, whether
open source or commercial. Although most of these dimensions, which relate
chiefly to performance, have been addressed by the research community
and have received some attention, considerably more effort is essential to bring
cloud monitoring systems to a higher level of maturity.
Decentralized approaches are gaining more trust than centralized approaches. In
contrast to unstructured P2P, structured P2P networks present a practical and more
efficient approach in terms of network architecture. However, considerable study is still
needed on decentralized networks with various degrees of centralization. Regarding interoperability, both cloud-dependent and cloud-agnostic monitoring
approaches are highly important, and both are currently supported by
several monitoring systems. Through our study, we found that cloud-dependent monitoring systems are mostly commercial, whereas cloud-agnostic monitoring systems
are typically open source.
We observe that quality-of-service metrics form the most important dimension
of a monitoring system, and we list the quality parameters that can be monitored along
with the related criteria. We also elaborate on how those quality parameters should be
monitored, detected and reported, and at which cloud layer a monitoring system should
operate its monitoring processes. Further, the aggregation of multiple parameters for
a consumer application is a critical aspect of monitoring. This means that a monitoring
system should be neither restricted to a single cloud layer nor entirely layer agnostic,
which in turn determines the visibility characteristic of a cloud monitoring system.
All of these monitoring issues need more study by the cloud community and still demand
further technical improvements. Table 4 summarizes our study of monitoring platforms against the evaluation dimensions explored in Sect. 4.
7 Conclusion and future research directions
This paper presented and discussed the state-of-the-art research in the area of cloud
monitoring. In doing so, it presented several design issues and research dimensions that
could be considered to evaluate a cloud computing system. It also presented several
cloud monitoring tools, their features and shortcomings. Finally, this paper presented
a taxonomy of current cloud monitoring tools with focus on future research directions
that should be considered in the development of efficient cloud monitoring systems.
An overview of the commercial cloud monitoring tools
Table 4 Monitoring platforms against evaluation dimensions

Evaluation dimensions (columns): network architecture (centralized); network architecture (decentralized); interoperability (multi-cloud); visibility (multi-layers); SNMP; extendable APIs.

Platforms (rows): Monitis [38], RevealCloud [39,40], LogicMonitor [41], Nimsoft [42], Nagios [31,43], SPAE [44,45], CloudWatch [46], OpenNebula [47], CloudHarmony [48], Azure FC [49,50]. Cell entries are Yes, No, or Not-stated (SaaS solution). [Per-cell values are not recoverable from this extraction.]
Since monitoring is becoming an essential component of the whole cloud infrastructure, its elasticity has to be given high priority. Based on this fact and on
the aforementioned monitoring aspects and approaches, we believe that considerable
effort is required to build more reliable cloud monitoring systems. Furthermore, we
found a lack of accessible standards on the procedures, formats and metrics for assessing
the development of cloud monitoring. Hence, we recommend more collaborative use of research facilities in which tools, lessons learned and best practices can be
shared among all interested researchers and professionals.
References
1. Mell P, Grance T (2011) The NIST definition of cloud computing (draft). NIST Spec Publ 800:145
2. Letaifa A, Haji A, Jebalia M, Tabbane S (2010) State of the art and research challenges of new services
architecture technologies: virtualization, SOA and cloud computing. Int J Grid Distrib Comput 3
3. Cong C, Liu J, Zhang Q, Chen H, Cong Z (2010) The characteristics of cloud computing. In: 39th
international conference on parallel processing workshops (ICPPW), pp 275–279
4. Zhang S, Zhang S, Chen X, Huo X (2010) Cloud computing research and development trend. In: 2nd
international conference on future networks, ICFN’10, pp 93–97
5. Ahmed M, Chowdhury ASMR, Ahmed M, Rafee MMH (2012) An advanced survey on cloud computing and state-of-the-art research issues. Int J Comput Sci Issues (IJCSI) 9
6. Atzori L, Granelli F, Pescapè A (2011) A network-oriented survey and open issues in cloud computing
7. Shin S, Gu G (2012) CloudWatcher: network security monitoring using openflow in dynamic cloud
networks (or: How to provide security monitoring as a service in clouds?). In: 2012 20th IEEE international conference on network protocols (ICNP), pp 1–6
8. De Chaves SA, Uriarte RB, Westphall CB (2011) Toward an architecture for monitoring private clouds.
IEEE Commun Mag 49:130–137
9. Grobauer B, Walloschek T, Stocker E (2011) Understanding cloud computing vulnerabilities. In: IEEE
security and privacy, vol 9, pp 50–57
10. Moses J, Iyer R, Illikkal R, Srinivasan S, Aisopos K (2011) Shared resource monitoring and throughput
optimization in cloud-computing datacenters. In: 2011 IEEE international parallel and distributed
processing symposium (IPDPS), pp 1024–1033
11. Wang L, Kunze M, Tao J, von Laszewski G (2011) Towards building a cloud for scientific applications.
Adv Eng Softw 42(9):714–722
12. Wang L, Chen D, Ma Y, Wang J (2013) Towards enabling cyberinfrastructure as a service in clouds.
Comput Electr Eng 39(1):3–14
13. Wang L, von Laszewski G, Younge AJ, He X, Kunze M, Tao J (2010) Cloud computing: a perspective
study. New Gener Comput 28(2):137–146
14. Begoli E, Horey J (2012) Design principles for effective knowledge discovery from big data. In:
Joint working IEEE/IFIP conference on software architecture (WICSA) and European conference on
software architecture (ECSA), pp 215–218
15. Bryant R, Katz RH, Lazowska ED (2008) Big-data computing: creating revolutionary breakthroughs
in commerce, science and society
16. Labrinidis A, Jagadish H (2012) Challenges and opportunities with big data. In: Proceedings of the
VLDB endowment, vol 5, pp 2032–2033
17. Ma Y, Wang L, Liu D, Yuan T, Liu P, Zhang W (2013) Distributed data structure templates for data-intensive remote sensing applications. Concurr Comput Pract Exp 25(12):1784–1797
18. Zhang W, Wang L, Liu D, Song W, Ma Y, Liu P, Chen Dan (2013) Towards building a multi-datacenter
infrastructure for massive remote sensing image processing. Concurr Comput Pract Exp 25(12):1798–
1812
19. Zhang W, Wang L, Ma Y, Liu D (2013) Design and implementation of task scheduling strategies for massive remote sensing data processing across multiple data centers. Softw Pract Exp. doi:10.1002/spe.2229
20. Twitter and Natural Disasters (2011) Crisis communication lessons from the Japan tsunami. http://
www.sciencedaily.com/releases/2011/04/110415154734.htm. Accessed 22 Feb 2014
21. Nita M-C, Chilipirea C, Dobre C, Pop F (2013) A SLA-based method for big-data transfers with multi-criteria optimization constraints for IaaS. In: 2013 11th RoEduNet international conference (RoEduNet), pp 1–6
22. Zhao M, Figueiredo RJ (2007) Experimental study of virtual machine migration in support of reservation of cluster resources. In: Proceedings of the 2nd international workshop on virtualization technology
in distributed computing, p 5
23. Wang L, Chen D, Zhao J, Tao J (2012) Resource management of distributed virtual machines. IJAHUC
10(2):96–111
24. Calheiros RN, Ranjan R, Buyya R (2011) Virtual machine provisioning based on analytical performance and QoS in cloud computing environments. In: International conference on parallel processing (ICPP), pp 295–304
25. Kirschnick J, Calero A, Edwards N (2010) Toward an architecture for the automated provisioning of
cloud services. IEEE Commun Mag 48:124–131
26. Ranjan R, Zhao L, Wu X, Liu A, Quiroz A, Parashar M (2010) Peer-to-peer cloud provisioning: service
discovery and load-balancing. In: Cloud computing, Springer, pp 195–217
27. Liu X, Yang Y, Yuan D, Zhang G, Li W, Cao D (2011) A generic QoS framework for cloud workflow
systems. In: 2011 IEEE ninth international conference on dependable, autonomic and secure computing
(DASC), pp 713–720
28. Ranjan R, Benatallah B Programming cloud resource orchestration framework: operations and research
challenges. In: Technical report. http://arxiv.org/abs/1204.2204. Accessed 22 Feb 2014
29. Aceto G, Botta A, de Donato W, Pescapè A (2013) Cloud monitoring: a survey. Comput Netw 57:2093–
2115
30. Shao J, Wei H, Wang Q, Mei H (2010) A runtime model based monitoring approach for cloud. In:
2010 IEEE 3rd international conference on cloud computing (CLOUD), pp 313–320
31. Caron E, Rodero-Merino L, Desprez F, Muresan A (2012) Auto-scaling, load balancing and monitoring
in commercial and open-source clouds
32. Spring J (2011) Monitoring cloud computing by layer, part 1. IEEE Secur Priv 9:66–68
33. Anand M (2012) Cloud monitor: monitoring applications in cloud. In: Cloud computing in emerging
markets (CCEM), 2012 IEEE international conference on communication, networking and broadcasting, pp 1–4
34. Kutare M, Eisenhauer G, Wang C, Schwan K, Talwar V, Wolf M (2010) Monalytics: online monitoring
and analytics for managing large scale data centers. In: Proceedings of the 7th international conference
on autonomic computing, pp 141–150
35. Sundaresan S, de Donato W, Feamster N, Teixeira R, Crawford S, Pescape A (2011) Broadband internet
performance: a view from the gateway. In: ACM SIGCOMM computer communication review, pp
134–145
36. Massonet P, Naqvi S, Ponsard C, Latanicki J, Rochwerger B, Villari M (2011) A monitoring and
audit logging architecture for data location compliance in federated cloud infrastructures. In: IEEE
international symposium on parallel and distributed processing workshops and PhD forum (IPDPSW),
pp 1510–1517
37. Davis C, Neville S, Fernandez J, Robert J-M, Mchugh J (2008) Structured peer-to-peer overlay networks: ideal botnets command and control infrastructures? In: Computer security—ESORICS 2008,
pp 461–480
38. Monitis (2014) http://portal.monitis.com/. Accessed 22 Feb 2014
39. RevealCloud (2014) http://copperegg.com/. Accessed 22 Feb 2014
40. RevealCloud (2014) http://sandhill.com/article. Accessed 22 Feb 2014
41. LogicMonitor (2014) http://www.logicmonitor.com/why-logicmonitor. Accessed 22 Feb 2014
42. Nimsoft (2014) http://www.nimsoft.com/solutions/nimsoft-monitor/cloud. Accessed 22 Feb 2014
43. Nagios (2014) http://www.nagios.com. Accessed 22 Feb 2014
44. SPAE (2014) http://shalb.com/en/spae/spae_features/. Accessed 22 Feb 2014
45. SPAE (2014) http://www.rackaid.com/resources/server-monitoring-cloud. Accessed 22 Feb 2014
46. CloudWatch (2014) http://awsdocs.s3.amazonaws.com/AmazonCloudWatch/latest/acw-dg.pdf.
Accessed 22 Feb 2014
47. OpenNebula (2014) http://opennebula.org/documentation:rel4.0. Accessed 22 Feb 2014
48. Cloudharmony (2014) http://cloudharmony.com/. Accessed 22 Feb 2014
49. Azure FC (2014) http://www.techopedia.com/definition/26433/azure-fabric-controller. Accessed 22
Feb 2014
50. Azure FC (2014) http://snarfed.org/windows_azure_details#Configuration_and_APIs. Accessed 22
Feb 2014
51. Nathuji R, Kansal A, Ghaffarkhah A (2010) Q-clouds: managing performance interference effects for
QoS-aware clouds. In: Proceedings of the 5th European conference on computer systems, pp 237–250
International Journal of Data and Network Science 4 (2020) 255–262
A practical approach to monitoring network redundancy
Richard Phillips (a), Kouroush Jenab (a,*) and Saeid Moslehpour (b)
(a) Department of Engineering and Technology Management, Morehead State University, Morehead, KY, USA
(b) College of Engineering, Hartford University, West Hartford, CT, USA
CHRONICLE
Article history:
Received: July 18, 2018
Received in revised format: July 28, 2019
Accepted: September 19, 2019
Available online: September 19, 2019
Keywords:
Practical Network Monitoring
SNMP
Network Monitoring
Network Performance
Network Alerts
Interface Redundancy Monitoring
ABSTRACT
Computer TCP/IP networks are becoming critical in all aspects of life. As computer networks
continue to improve, the levels of redundancy continue to increase. Modern network redundancy
features can be complex and expensive. This leads to misconfiguration of the redundancy features.
Monitoring everything is not always practical. Some redundancy features are easy to detect while
others are more difficult. It is common for redundancy features to fail or contribute to a failure
scenario. Incorrectly configured redundancy will lead to network downtime when the network is
supposed to be redundant. This presents a false sense of security to the network operators and
administrators. This research will present two scenarios that are commonly left unmonitored and
look at a practical way to deploy solutions to these two scenarios in such a way that the network
uptime can be improved. Implementing a practical approach to monitoring and mitigating these types
of failures lets the cost spent on redundancy actually increase uptime, and thus overall quality,
which is critical to a modern digital company.
© 2020 by the authors; licensee Growing Science, Canada.
1. Introduction
A report in 2015 stated that the average adult spends 8 hours and 21 minutes sleeping per day, while that
same adult spends 8 hours and 41 minutes on media devices (Davies, 2015). All these connected media
devices use many forms of networking. We spend more time on networked devices than we spend sleeping. Clearly, computer networks have become critical to many modern-day activities. As the level of
criticality has increased, so has the desire to build networks with higher levels of resiliency and redundancy (Bayrak & Brabowski, 2006). Network operators who consider the two words synonymous
typically apply the same types of monitoring to components regardless of their role. Both redundancy and resiliency features introduce scenarios that can be hard to monitor. This paper explains methods that establish
monitoring for common redundancy and resiliency features or components. The first study will look at
redundancy within a device and the second study will look at redundancy in the connectivity between devices. These are two common scenarios for both simple and complex networks. After establishing a
practical network monitoring strategy, the system can detect and possibly repair issues, leading to better
uptime metrics. It is common for an organization or company to invest in more redundancy and resiliency
features. These configurations can add exponential cost to the design, only for failures to continue and
uptime to turn into downtime. A more practical approach can keep alerts from becoming background noise and focus monitoring efforts so that they work in conjunction with the redundant and resilient designs.

* Corresponding author. Tel: 1(606)783-9339, Fax: 1(606)783-5030
E-mail address: k.jenab@moreheadstate.edu (K. Jenab)
doi: 10.5267/j.ijdns.2019.9.004
2. Monitoring overview
The most common form of network monitoring today uses the SNMP protocol. SNMP, or Simple Network Management Protocol, comes in both “push” and “pull” varieties. When a network monitoring system polls a device for specific information, it is “pulling” data from the device to the monitoring system.
The reverse, “pushing”, is when the device sends SNMP data to the monitoring system when a specific
event happens; this is typically used for problems where pulling data on a regular schedule would delay detection. The SNMP protocol uses a complex structure that must be understood. A MIB, or management information base, is the structure used in SNMP (Netak & Kiwelekar,
2006). A MIB is set up like a tree with a root and branches, and each branch has specific object identifiers
(OIDs). Most monitoring systems ship with commonly used MIB trees and are set up to detect the polled device, so the correct OIDs are used to poll a basic set of facts about that device. These MIBs/OIDs share
some common branches, but most are specific to the device manufacturer, which leads to large variability between
branches. To set up a network with 10 devices, it is easy to determine which manufacturer's MIB you
are using and select the correct OIDs for each device. This is optimal but not practical or even typical.
On large networks, it can be far more difficult to establish the correct MIBs/OIDs that need monitoring. Most
network monitoring systems ship with defaults to ease initial deployment, after which the network
operator must customize them to the environment. This customization becomes complex and time consuming
for large networks. The redundancy/resiliency features or components are often overlooked. It is not
difficult to monitor a single interface or chassis for things like throughput (in bps, or input bits per second),
link utilization percentage or errors. As stated previously, SNMP differs for almost every manufacturer. As a result, the OID for, say, a second redundant power supply can be hard to
derive, which in turn leads to a lack of native polling for that component. If the monitoring system does
not handle this out of the box and the network operator does not specifically configure a failure notification for the power supplies, then the network is at great risk of a failure that goes undetected.
Building on that scenario: a power supply that is purposed for failover fails itself and goes unnoticed,
so the next failure takes down the whole device and possibly the network. In summary, the problem
presents itself as a complete loss of power when, in fact, one of the power modules failed prior to the actual
significant event but went undetected.
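The MIB tree with its manufacturer-specific branches can be illustrated with a toy model. This is a sketch of the concept only: the tree below is abridged to a few standard MIB-2 branches, and the dictionary layout is our own, not part of any SNMP library.

```python
# Abridged MIB tree: 1.3.6.1.2.1 is the standard MIB-2 root; each node
# has a name and a dict of child branches keyed by sub-identifier.
MIB_TREE = {
    "1": {"name": "iso", "children": {
        "3": {"name": "org", "children": {
            "6": {"name": "dod", "children": {
                "1": {"name": "internet", "children": {
                    "2": {"name": "mgmt", "children": {
                        "1": {"name": "mib-2", "children": {
                            "2": {"name": "interfaces", "children": {}},
                        }},
                    }},
                }},
            }},
        }},
    }},
}

def resolve_oid(oid: str) -> list:
    """Translate a dotted OID into the named path through the MIB tree,
    as a monitoring system does when mapping OIDs to objects."""
    path, node = [], {"children": MIB_TREE}
    for part in oid.split("."):
        node = node["children"].get(part)
        if node is None:
            raise KeyError(f"OID {oid} leaves the known tree at {part}")
        path.append(node["name"])
    return path
```

Resolving "1.3.6.1.2.1.2" walks iso, org, dod, internet, mgmt, mib-2, interfaces; manufacturer-specific branches would hang off the tree in exactly the same way, which is why unknown vendor MIBs make OIDs hard to derive.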
Another method to monitor devices is using syslog messages, or system log messages (Gerhards,
2009). “Machine data” is another commonly used name for these log messages. Syslog files have been
around for years. This data comes in all formats and detail levels; it is ASCII text with no real
structure. Newer logging systems have become increasingly good at adding
structure to this unstructured machine data and producing valuable information that can be correlated across
multiple machines or services. Using machine data can be a powerful way to solve problems associated with traditional SNMP-polled monitoring solutions. Machine logs, when used to diagnose a system,
are records of system events that detail what the device is doing internally. With modern
machine learning and big-data analytics, mining and analyzing this unstructured data moves traditional
reactive support models toward more proactive models of support.
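As a minimal sketch of adding structure to unstructured machine data, the snippet below pulls power-supply events out of raw syslog lines with a regular expression. The facility-severity-mnemonic message format is modeled on common switch logs and is an assumption here; the mnemonics PWR and FRU_PS_ACCESS are borrowed from the case study later in this paper. A real deployment would rely on a log platform such as Splunk rather than hand-written regexes.

```python
import re

# Match "%FACILITY-SEVERITY-MNEMONIC: detail" style log messages, keeping
# only the power-supply-related mnemonics used in the case study.
PSU_FAIL = re.compile(
    r"%(?P<facility>[\w_]+)-(?P<severity>\d)-(?P<mnemonic>PWR|FRU_PS_ACCESS)\w*:"
    r"\s*(?P<detail>.*)")

def find_psu_events(lines):
    """Return (severity, mnemonic, detail) for each power-related log line."""
    events = []
    for line in lines:
        m = PSU_FAIL.search(line)
        if m:
            events.append((int(m.group("severity")),
                           m.group("mnemonic"),
                           m.group("detail").strip()))
    return events
```

Feeding in a mixed stream of switch messages, only the power-related lines come back structured as tuples, which is the kind of record a log-analytics alert can then be built on.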
SolarWinds Orion (SolarWinds Worldwide, LLC) is the tool that performed the SNMP polling for this
approach. This network monitoring system is multifunctional and has many features. The system can
make network device configuration changes as part of an alert remediation strategy, a feature that
is part of a “self-healing” process (Quattrociocchi et al., 2014). SolarWinds Orion will also collect machine syslog data for some event and alert notifications, but this approach used Splunk (Splunk Inc.) to
analyze the machine data used to build the algorithms for failure alerting and remediation. Network
alarms need to be evaluated and raised to a status that notifies the network operators. The sheer volume
of alarms that can exist on a large network can overwhelm a network operations center. Tuning the
thresholds that trigger alarms is part of any practical strategy for network management. The tuning
process involves looking at the lifecycle of the alarms and at how long an alarm stays in its
various states, such as active, cleared or ended. In a report released in 2009, 20% of network alarms automatically cleared in less than 5 minutes, while 47% automatically ended within 5 minutes (Wallin, 2009). These
portions of the lifecycle suggest filtering such alarms out. It is important to consider alarm management
when implementing any strategy for monitoring network redundancy. Alarm fatigue should be
avoided whenever possible.
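The lifecycle-based tuning described above can be sketched as a simple hold-down filter. The 5-minute window follows the figures cited from Wallin (2009); the alarm record shape is an assumption for illustration, not the format of any particular monitoring product.

```python
from datetime import datetime, timedelta

HOLD_DOWN = timedelta(minutes=5)  # mirrors the 5-minute lifecycle statistic

def actionable_alarms(alarms, now):
    """Keep alarms worth paging on: drop those that cleared themselves
    within the hold-down window, and those still too young to judge."""
    keep = []
    for a in alarms:
        cleared = a.get("cleared_at")
        if cleared is not None and cleared - a["raised_at"] < HOLD_DOWN:
            continue  # self-cleared noise: filter it out
        if cleared is None and now - a["raised_at"] < HOLD_DOWN:
            continue  # still inside the hold-down window, wait
        keep.append(a)
    return keep
```

A filter like this keeps the roughly half of alarms that resolve themselves within minutes from reaching the operations queue, which is the alarm-fatigue point made above.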
3. Monitoring redundancy and resiliency
Merriam-Webster defines redundancy as “serving as a duplicate for preventing failure of an entire system
upon failure of a single component”. Merriam-Webster also defines resiliency as “tending to recover
from or adjust easily to misfortune or change” (Merriam-Webster.com). Clearly the definitions are different, yet they are often thought to mean the same thing, which can lead to a lack of adequate alerting or
monitoring. A detailed study of each term, as it relates to computer networks, will help to better understand the approach and how each complements the other when maintaining a redundant and resilient
network design.
3.1 Redundant
A network designed for redundancy has multiple components that prevent the device from becoming
unavailable or down. This covers things like power supplies, additional CPUs, hard disks or memory
modules. A typical example of redundancy in a network device is a device that has multiple power supplies or power modules. Having multiple power supplies connected to multiple power sources prevents
outages if primary power is lost. While this does make the system resilient, the true definition is closer
to redundancy. When the redundant power supply fails while the device is running on primary power, the network operator must be notified of the event. If the failure goes unnoticed, a primary power supply failure
will create a network-down event for the entire device. When SNMP monitoring a network device for
basic things, like whether the whole device is up or down, there will typically be no down message, unless
configured specifically, until both power supplies fail and the unit becomes unreachable. When monitoring a
redundant system, we must make sure to raise alerts when a redundant device component fails or is in
an unusable or degraded state. Typically, machine data is a better way to predict or identify these
types of problems than simple SNMP polling, because it carries a greater level of detail that
can be leveraged in a modern log collection system. Some operators use SNMP traps, or “pushes”, for
this time-sensitive event notification.
3.2 Resilient
Resiliency in a network device means the device or circuit can stay online and working during a failure.
In some computer networks, this is the ability to route around problems using multiple paths to the same
destination, which could be considered a form of self-healing (Quattrociocchi et al., 2014). Another form
of resiliency is to build multiple paths to a logical device, made up of two physical devices, and bundle
them all together so they look like a single connection. Fig. 1 is a visual representation of port channel
technology with 2 or 4 member interfaces. Looking at Fig. 1, specifically at Po2 on the left
side, there are 2 circuits in this port channel. The extra circuit serves as a redundancy measure as well as
a resiliency measure. Larger port channels can have 4, 8 or even 16 circuits (as shown in Fig. 1,
right). If a member circuit develops problems and goes offline, the remaining circuits continue to pass
data. In the case of a larger port channel, this may not be noticeable to the users or the network operator from a
consumed-bandwidth perspective; therefore, alerting is critical. For example, if Te4/1 on
either side of Fig. 1 experiences problems but continues to pass data, the data may have errors or discards
on the transmit or receive side of the circuit. These types of issues create delays for customers or
slow communications far below expected rates. While the port channel is still operational and
will appear to be up and resilient, the intended design is not what the customer experiences. It would
seem logical that the other 7 circuits should handle the traffic and automatically exclude the defective
member, but traditional network devices do not have circuit error detection or self-healing features. Also,
there are times when errors are normal, such as during a denial-of-service attack or when a defective machine
floods the network with unintended packets of data. To properly fail over these circuits, the
defective member must be removed from the port channel. If the monitored interface only reports up or
down status, the network operator is not aware of the problem when it occurs. This scenario, where the
network operator does not know there are problems on the network yet the customers are experiencing
them, is the opposite of what resiliency is supposed to accomplish for a network.
Fig. 1. A visual representation of port channel technology
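A remediation script of the kind discussed above ultimately has to emit device configuration to detach the failed member. As a hedged sketch, the helper below generates Cisco IOS-style commands; the exact syntax varies by platform and is an assumption for illustration, not taken from the paper.

```python
def removal_commands(member: str, port_channel: int) -> list:
    """Config snippet that detaches `member` from the given port channel
    and shuts it down pending investigation (IOS-style syntax assumed)."""
    return [
        f"interface {member}",
        f"no channel-group {port_channel} mode active",
        "shutdown",
        f"description removed from Po{port_channel} by monitoring - pending investigation",
    ]
```

Generating the command list separately from pushing it (via whatever configuration-management channel the monitoring system offers) keeps the decision logic testable and leaves an audit trail of exactly what the self-healing action changed.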
4. Research
The first case study looked at a year's worth of log data from a large university network. The network
comprises over 2,500 physical devices, whose logs are captured in a log collection system. A list and
count of critical events was derived from a report on the log entries in the log collection
system, and a Pareto chart of the event data was used to evaluate the most common critical log
messages (see Fig. 2). Overall, there were 38 critical log messages for the period studied. The three
most common were fan failures and power supply issues. After investigation of the power supply issues,
nearly 46% (7 of 17) of the failures occurred on redundant power supplies (FRU_PS_ACCESS & PWR).
Fig. 2. A pareto chart with the event data
In this scenario, the SNMP-based network monitoring system did not detect a critical or
problematic event for the failed power-related components. The failures were reported by customers and added
to the incident management system, and technicians were dispatched to repair the failure and restore the
customer experience. In the case of the redundant power supply failures, an algorithm was added to the
Splunk analytics engine to process the log data and detect redundant power module failures. The system
was also configured to automatically open incident tickets for further technical investigation of detected
events. Since the process was implemented, the network devices can experience power supply failures that
would not normally be detected by SNMP polling but are still remediated using log data. This allows
for proper network operator alerting and problem remediation. In summary, this new workflow has allowed devices that experience failures of redundant components to be repaired without causing
downtime for the users. The 7 earlier failures that occurred on a redundant component more than likely created
downtime on a redundantly designed device; those failures could have been prevented, but they
largely went unnoticed until the system experienced a total power failure.
The second case study looked at the same university network, examining port channel usage on the campus
network. This university has over 4,000 fiber optic connections, and many of them are members of port
channels. Root causes of downtime events suggest that these interfaces go bad without notification, or the
notification is lost in an overwhelming volume of alerts. The issue was reported by the network operator
prior to the study, who suggested it warranted a long-term solution. The issue is that the most important
tiers of the network, core and distribution, have numerous port channels to allow for capacity and resiliency. These port channels lose members due to external issues like fiber trouble and begin to throw errors
when passing traffic. The individual interface generates errors, but the circuit never goes completely
down; it is just very slow for traffic flows using that port channel member, since the traffic is resent
several times due to errors. This creates a bad user experience. The study identified core- and distribution-level port channels. After identification, custom tags were implemented in the network monitoring system, attached to the specific interfaces, to categorize them for event detection. After tagging, an event
detection algorithm was created to capture this segment of interfaces if they generated any of the specific
errors in Table 1. These types of errors are indicative of a member that should be removed from a port
channel and investigated or repaired (Wallin, 2009).
Table 1
Specific errors

Error type         | Duration  | Typical causes
Input Errors (RX)  | Last hour | Fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Input Errors (RX)  | Today     | Fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Output Errors (TX) | Last hour | Flow control problem, fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Output Errors (TX) | Today     | Flow control problem, fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
The detected events generated both alarms and actions. The alarms logged the event in the event list,
while the actions used the device information to remove the member from the port channel. This was
logged in the event list, and an email was sent to the network operator queues to give notification of
the issue along with the remediation steps taken. The network operators noted over 300 interfaces that met
this criterion. After further review, the event thresholds were adjusted to look for the high-probability
offenders. It was noted that some interfaces generate errors when the switch buffers or forwarding queues fill up. This is not a defective interface; rather, the traffic is either too great or there
are too many packets per second to process, which requires traffic engineering to engage and pinpoint a
solution. The study excluded these types of errors. Likewise, if a fiber is damaged but not severed, the
error counts will grow every hour in large numbers, and after two polling periods the problem can be
detected by a process that is in control. This is the kind of case on which the study focused its approach.
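The two-polling-period detection logic described above can be sketched as follows. The threshold value is an illustrative assumption, not a figure from the study; the point is that a member is flagged only when its error counters keep growing across consecutive polls, so one-off bursts from full buffers or DoS noise are ignored.

```python
GROWTH_THRESHOLD = 1000  # errors per polling period considered abnormal (assumed)

def should_flag(samples):
    """samples: cumulative error-counter readings from the last three polls.
    Flag only if the counter grew beyond the threshold in both consecutive
    polling periods, i.e. a sustained, in-control signal."""
    if len(samples) < 3:
        return False  # not enough history for two deltas yet
    d1 = samples[-2] - samples[-3]
    d2 = samples[-1] - samples[-2]
    return d1 > GROWTH_THRESHOLD and d2 > GROWTH_THRESHOLD
```

Interfaces that pass this check are the high-probability offenders the tuned thresholds were meant to isolate; a transient spike in one poll never fires the alert.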
5. Methodology
These studies implemented specific solution sets using specific products, but the approach will work with
most enterprise-class monitoring software systems. There are 5 basic steps required in this approach. The
steps differ slightly when hardware replacement is the normal resolution of the problem (e.g., power
supply failures). The differing workflows are shown in Fig. 3.
Interface-error workflow:
1. Add the device to the monitoring system.
2. Identify the port channel and its member interfaces.
3. Build an alert algorithm that detects errors or discards.
4. When the alert fires, script the removal of the offending interface from the port channel.
5. Alert the network operator of the alert and response.

Hardware-failure workflow:
1. Add the device to the monitoring system.
2. Configure the device for logging to a log collector (Splunk).
3. Build an alert algorithm that detects power supply errors.
4. When the alert fires, script the opening of an incident ticket to have the device repaired by support.
5. Alert the network operator of the ticket creation and repair status.

Fig. 3. Monitor workflow
Step 1: add the device to the network management system. This step is straightforward. In adding
the device, it is critical that the interfaces and the port channel be polled for status and error counts. If
the network management system supports any kind of configuration management, including script
execution, remediation of the event is more than likely possible. If not, any form of auto-healing will be
difficult; a fallback option is to alert an operator and repeat the notification until the operator
acknowledges the issue, which is typically a standard feature of network management systems.
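The polling requirement in Step 1 amounts to collecting, per member interface, the fields the later detection logic depends on. The sketch below illustrates that minimum data set; the `Device` class, interface names, and counter values are hypothetical, not a real NMS API.

```python
# Minimal sketch of Step 1's polling requirement: whatever NMS is used,
# each port-channel member must be polled for operational status and
# error counters. The Device class and all values are hypothetical.

class Device:
    def __init__(self, name, interfaces):
        self.name = name
        # interfaces: {ifname: {"status": str, "in_errors": int, "out_errors": int}}
        self.interfaces = interfaces

def poll(device, required=("status", "in_errors", "out_errors")):
    """Collect the fields the detection algorithm depends on."""
    snapshot = {}
    for ifname, stats in device.interfaces.items():
        snapshot[ifname] = {field: stats[field] for field in required}
    return snapshot

sw1 = Device("sw1", {
    "Te1/0/1": {"status": "up", "in_errors": 12, "out_errors": 0},
    "Te1/0/2": {"status": "up", "in_errors": 0, "out_errors": 3},
})
print(poll(sw1)["Te1/0/1"]["in_errors"])  # -> 12
```

If any of these fields is missing from the polling configuration, the alert algorithm in Step 3 has nothing to evaluate, so verifying the poll scope up front avoids silent blind spots.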
Step 2: identify the member interfaces of the port channel and segment them out. The network
management system can accomplish this step; SolarWinds Orion, for example, allows custom properties
on all interfaces and devices (nodes). If that option is not available, renaming the interfaces on the device
may allow for segmentation that the detection algorithm can use. Either way, the member interfaces need
to be detectable and distinguishable from normal event messages. In the case of redundant hardware, the
relevant machine data needs to be identified from manufacturer literature or device testing. Once the log
messages are identified, they can be placed into the correct location of the algorithm.
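The fallback segmentation idea in Step 2 can be sketched with a naming convention. The tag string, interface names, and descriptions below are invented for illustration; the study itself used custom properties in SolarWinds Orion rather than this scheme.

```python
# Sketch of the Step 2 fallback: if custom properties are unavailable,
# a naming convention (here the hypothetical tag "PCMBR" embedded in
# the interface description) lets the alert search pick out
# port-channel members and ignore ordinary interfaces.

TAG = "PCMBR"

def port_channel_members(interfaces):
    """Return the names of interfaces whose description carries the member tag."""
    return [name for name, desc in interfaces.items() if TAG in desc]

interfaces = {
    "Te1/0/1": "PCMBR Po1 uplink to core",
    "Te1/0/2": "PCMBR Po1 uplink to core",
    "Gi1/0/10": "access port, floor 3",
}
print(port_channel_members(interfaces))  # -> ['Te1/0/1', 'Te1/0/2']
```

Whatever mechanism is used, the point is the same: the alert in Step 3 must be able to match member interfaces without also matching the event noise from every other interface.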
Step 3: build an alert algorithm that detects errors or discards. When configuring the event/alert, it is
critical that the search for member interfaces use the segmentation strategy from Step 2. This allows
custom notifications or actions that cut through any network alert noise. It is also important to examine
both inbound and outbound errors. It is not uncommon for a circuit between two monitored systems to
experience errors in only one direction. For example, an inbound error on device A (the receiving
interface) might seem to imply a matching output error on device B (the sending interface). This is not
the case, and if the algorithm detects only outbound errors, the process will miss all inbound errors.
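The directionality point in Step 3 can be made concrete with a short sketch: each device's RX and TX counters are evaluated independently, because an RX error on the receiver is not mirrored by a TX error on the sender. The threshold and data layout are illustrative assumptions, not the study's actual alert configuration.

```python
# Sketch of Step 3's direction check: inbound and outbound error
# counters are evaluated independently on each device, since an RX
# error on the receiving interface does not produce a matching TX
# error on the sending interface. Threshold and structure are invented.

def check_interface(stats, threshold=50):
    """Return the list of directions whose error count exceeds the threshold."""
    alerts = []
    if stats["in_errors"] > threshold:
        alerts.append("inbound")
    if stats["out_errors"] > threshold:
        alerts.append("outbound")
    return alerts

# Device A receives a damaged signal: only its RX counter climbs.
device_a = {"in_errors": 480, "out_errors": 0}
# Device B on the far end shows nothing unusual outbound.
device_b = {"in_errors": 0, "out_errors": 2}
print(check_interface(device_a))  # -> ['inbound']
print(check_interface(device_b))  # -> []
```

An algorithm that only checked `out_errors` would see device B as healthy and never fire, which is exactly the blind spot the text warns about.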
Step 4: when the alert fires, script the appropriate action or alert. In the case of a defective interface,
the next step may be the removal of...