Network Monitoring Plan Paper

Description

Description

Network Management/Monitoring (Network Operations Center)

Monitoring is a critical aspect of the daily management and support of the network infrastructure. It serves as the assessment and validation of the network requirements and answers questions such as 'Are we meeting our Service Level Agreement (SLA) for high availability?', 'Are we approaching capacity?', 'Are we vulnerable to security threats?', and 'How are we doing on performance?'. Providing transparency and visibility into these metrics is a must, as it not only surfaces early signs of problems but also instills trust in the process. Monitoring tools can provide the visibility senior leadership needs to better understand the existing state of the enterprise and to make progress toward its goals (McDowall, 2019).

Establishing a baseline that identifies the critical events to monitor, accompanied by key performance indicators (KPIs) to measure success, is essential and allows you to answer these questions. Key components to monitor include hardware such as routers, switches, and servers, to track device health and identify problems, and software metrics such as the average percentage of maximum throughput utilized (Masikisiki et al., 2017). "The whole network doesn't have to be down to have a negative impact" (Masikisiki et al., 2017). It is important to remember that software performance issues may cause the degradation of services even while the network itself remains up.
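To make the baseline idea concrete, the two metrics named above, availability and average percentage of maximum throughput utilized, can be computed directly from raw monitoring data. The following is a minimal Python sketch; the function names and all sample figures are illustrative assumptions, not values from any cited tool or study.

```python
# Minimal sketch: two common baseline KPIs from sample monitoring data.
# All numbers below are hypothetical examples.

def availability_pct(uptime_minutes, total_minutes):
    """Availability as a percentage of the measurement window."""
    return 100.0 * uptime_minutes / total_minutes

def avg_utilization_pct(throughput_samples_mbps, max_throughput_mbps):
    """Average percentage of maximum throughput utilized."""
    avg = sum(throughput_samples_mbps) / len(throughput_samples_mbps)
    return 100.0 * avg / max_throughput_mbps

# One 30-day month (43,200 minutes) with 20 minutes of downtime:
print(round(availability_pct(43_180, 43_200), 3))  # 99.954

# Hypothetical 1 Gbps link sampled at five intervals (Mbps):
print(avg_utilization_pct([220, 310, 450, 380, 290], 1_000))  # 33.0
```

Numbers like these are what let a NOC answer the SLA question with data rather than intuition: 99.954% availability, for example, falls short of a "four nines" (99.99%) target.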

Proactive monitoring has the potential to transform network infrastructure teams from being reactive, putting out fires every week, to being proactive, where less time is spent on critical incidents that could have been prevented. It allows network administrators to shift from incident management to problem management, ensuring the root cause is identified and resolved.

This week, you will learn about tools and techniques that can be deployed for network monitoring. You will prepare a network monitoring plan for the organization that would include tools and approaches that should be implemented.

References

McDowall, J. D. (2019). Complex enterprise architecture: A new adaptive systems approach. Apress.

Masikisiki, B., Dyakalashe, S., & Scott, M. S. (2017). Network monitoring system for network equipment availability and performance reporting. 2017 IST-Africa Week Conference (IST-Africa), 1–12. https://doi-org.proxy1.ncu.edu/10.23919/ISTAFRICA....

Assignment: Develop a Network Monitoring Plan

For this assignment, you will build on the work of Week 4 by adding network management and monitoring to your network design. You want to plan for real-time monitoring so that performance, availability, and security can be proactively tracked and potential issues addressed before they become critical and impactful to the organization. Assuming a range of tools and controls is available to the Network Operations Center (NOC), develop a comprehensive plan for how you will shift your NOC from a reactive to a proactive mode by monitoring performance, availability, and security threats, using data to increase visibility, gain insights, and prevent problems before they occur.

Your plan should include the following:

  • Network Performance Monitoring (NPM) tools and probes
  • Events to monitor and detect security, performance, and availability issues
  • Alerts and notifications
  • Key Performance Indicators (KPIs)
  • Visualization and reporting
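As one illustration of how the "events to monitor" and "alerts and notifications" components fit together, a monitoring plan typically specifies a warning and a critical threshold per metric, and the NOC is notified only when a sample is out of range. The sketch below is a hedged example; the metric names and threshold values are assumptions for illustration, not prescribed values.

```python
# Sketch of threshold-based alerting: each sampled metric is checked
# against a warning and a critical threshold, and a notification is
# produced for anything out of range. Thresholds are illustrative.

THRESHOLDS = {
    # metric: (warning, critical) -- higher values are worse
    "cpu_pct":         (75, 90),
    "link_util_pct":   (70, 85),
    "packet_loss_pct": (1, 5),
}

def evaluate(metric, value):
    warn, crit = THRESHOLDS[metric]
    if value >= crit:
        return ("CRITICAL", f"{metric}={value} breached critical threshold {crit}")
    if value >= warn:
        return ("WARNING", f"{metric}={value} breached warning threshold {warn}")
    return ("OK", None)

samples = {"cpu_pct": 93, "link_util_pct": 72, "packet_loss_pct": 0.2}
for metric, value in samples.items():
    severity, message = evaluate(metric, value)
    if severity != "OK":
        print(severity, "->", message)
```

A real NPM tool adds deduplication, escalation, and routing (email, SMS, ticketing) on top of this basic check, but the two-tier threshold idea is the same.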

Length: 5-7 pages, not including the title and reference pages

References: Include a minimum of 5 scholarly resources.

The paper should include at least one custom table and one diagram created for this specific paper. Diagrams may be created with Lucidchart for Students or any other tool you are familiar with. Include citations to the peer-reviewed articles you used in researching and developing this paper.

The completed assignment should address all of the assignment requirements, exhibit evidence of concept knowledge, and demonstrate thoughtful consideration of the content presented in the course.

References to use

  • Alhamazani, K., Ranjan, R., Mitra, K., Rabhi, F., Jayaraman, P., Khan, S., Guabtni, A., & Bhatnagar, V. (2015). An overview of the commercial cloud monitoring tools: Research dimensions, design issues, and state-of-the-art. Computing, 97(4), 357–377.
    This resource provides an overview of tools used for monitoring the cloud.
  • Phifer, L. (2015). New, smart tools advance monitoring, protect network (cover story). Information Security, 6, 3–7.
    This resource examines tools for security monitoring.
  • Preimesberger, C. (2020). Data types enterprises should use for high network visibility. eWeek, N.PAG.
    This resource provides some data points that can be used for monitoring networks.
  • Mitchell, D. (2018, April). Network monitoring software. PC Pro, 92–94, 96–98.
    This resource explores monitoring software used today.
  • Zurier, S. (2015). How IT pros employ the latest security monitoring tech. Information Security, 6, 8–12.
    This resource provides industry insights on the challenges and techniques used for monitoring security.
  • Phillips, R., Jenab, K., & Moslehpour, S. (2020). A practical approach to monitoring network redundancy. International Journal of Data and Network Science, 4(2), 255–262.
    This resource examines approaches for monitoring redundancy.

Unformatted Attachment Preview

New, Smart Tools Advance Monitoring, Protect Network
By Lisa Phifer. Information Security, July 2015 (cover story: "Can New Security Tools Keep Your Network Clean?")
Learn the latest about ways to employ advanced network security monitoring technologies.

In the high-stakes, cat-and-mouse game of cybersecurity, the only real constant is change. The number of new threats is escalating, and the attack surface is growing, too. Businesses today rely more than ever on Internet-connected devices, services and data: from machine-to-machine communication and the Internet of Things (IoT), to bring your own device (BYOD) and bring your own cloud (BYOC) applications. One thing this tidal wave of new targets has in common: a 24/7 exposure to network-borne threats. From Heartbleed to FREAK, criminals continually exploit low-hanging fruit by finding new bugs in widely deployed software and old gaps in new technologies. Effectively spotting and stopping these evolving network threats requires not just vigilance, but new approaches. It's unrealistic to expect enterprise defenses to block all attacks or eliminate all vulnerabilities. Furthermore, manual threat assessment and intervention simply cannot scale to meet these challenges. Network security monitoring that is more pervasive, automated and intelligent is critical to improve awareness and response time.

The Importance of Network Threat Visibility

According to the Ponemon Institute's "2014 Cost of Cyber Crime: United States," the most costly cybercrimes are those caused by denial of service attacks, malicious insiders and malicious code, accounting for 55% of all costs associated with cyberattacks. Not surprisingly, costs escalate when attacks are not resolved quickly.
Participants in Ponemon's study reported the average time to resolve a cyberattack in 2014 was 45 days, at an average cost of $1,593,627, a 33% increase over 2013's cost and 32-day resolution time. Worse, study participants reported that malicious insider attacks took on average more than 65 days to contain. The increasing frequency, diversity and complexity of network-borne attacks is impeding threat resolution. Cisco's 2015 Annual Security Report found that criminals are getting better at using security gaps to conceal malicious activity: for example, moving beyond recently fixed Java bugs to use new Flash malware and snowshoe IP distribution techniques (increasing spam by 250%), exploiting the 56% of OpenSSL installations still vulnerable to Heartbleed, or enlisting end users as cybercrime accomplices. In this era of BYOD, BYOC, IoT and more, achieving real-world security for business-essential connectivity requires more visibility into network traffic, assets and patterns. "By understanding how security technologies operate," Cisco's report concluded, "and what is normal (and not normal) in the IT environment, security teams can reduce their administrative workload while becoming more dynamic and accurate in identifying and responding to threats and adapting defenses."

Depth, Breadth and Intelligence

According to Gartner analyst Earl Perkins, speaking at the Gartner Security & Risk Management Summit in June 2015, advanced threat defense combines near-real-time monitoring, detection and analysis of network traffic, payload and endpoint behavior with network and endpoint forensics. More effective threat response begins with advanced security monitoring. This includes awareness of user activities and the business resources they access, both on-site and off. However, security professionals are also experiencing information overload.
Advanced visibility therefore comes from more intelligent use of information through prioritization, baselining, analytics and more. Perkins recommends deploying network security monitoring technologies based on risk. At a minimum, every enterprise should take fundamental steps, including properly segmenting networks and defending business assets with traditional network firewalls, intrusion prevention systems (IPS), secure Web gateways and endpoint protection tools. These defenses serve as sentries: armed guards stationed at key entrances to ward off basic threats and sound the alarm at the first sign of attack. For threat-tolerant businesses with low risk, these fundamentals may be sufficient.

However, most organizations at risk will want to consider more advanced monitoring tools and capabilities such as next-generation and application firewalls, network access control (NAC), enterprise mobility management (EMM), and security information and event management (SIEM). These technologies go deeper by examining more traffic content or endpoint characteristics. They broaden visibility by monitoring more network elements, including mobile devices and activities. Ultimately, they can produce more actionable intelligence by knitting together disparate events into more cohesive threat alerts, especially for advanced persistent threats that might otherwise be missed entirely.

Finally, risk-intolerant organizations may wish to go even further, using network and endpoint forensics to routinely record all activity, enabling look-back traffic, payload and behavior analysis. Unlike real-time monitoring technologies, forensics tools focus on identifying past compromises, but this can be important to spot, for example, those long-running insider attacks. Forensics can also help enterprises identify gaps in their defenses, enabling them to adapt and to better prevent future attacks.
Putting Advanced Monitoring to Work

To take advantage of new advanced network monitoring technologies, it can help to get a handle on industry advances and why new tools and capabilities have emerged. Let's start with that staple of network monitoring, the traditional network firewall. Single-function firewalls long ago morphed into unified threat management (UTM) platforms, which combine firewall, IPS, VPN, Web gateway, and antimalware capabilities. However, even UTMs tend to focus on network traffic inspection. When application payload is examined, it's for a specific reason such as blocking a blacklisted URL, content type or recognized malware.

In contrast, next-generation firewalls are application-aware. That is, they attempt to identify the application riding over a given traffic stream, even an SSL-encrypted session, and apply policies specific to that application and perhaps to the users, groups or roles. For example, a next-generation firewall isn't limited to blocking all traffic to Facebook. It can allow only marketing employees to post to Facebook, but not to play Facebook games. Or it can simply monitor how workers interact with Facebook and generate alerts when activity deviates from that baseline. This granularity is only possible because the firewall can identify applications and their features, including new applications it will learn about in the future. Increasingly, next-generation firewalls are learning through machine-readable feeds that not only deliver new threat signatures but intelligence about new attacks and IPs, devices or users with bad reputations. This ability to adapt and learn is key to keeping up with new cyberthreats.
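The baseline-deviation alerting described above can be sketched in a few lines. This is an illustrative simplification, not how any named firewall actually implements it; the traffic figures and the three-sigma limit are assumptions made for the example.

```python
# Sketch of baseline-deviation alerting: learn a simple statistical
# baseline of "normal" activity, then flag observations far outside it.
# All traffic figures below are hypothetical.

from statistics import mean, pstdev

def deviates(baseline, observation, z_limit=3.0):
    """True if observation lies more than z_limit standard deviations
    from the baseline mean."""
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        return observation != mu
    return abs(observation - mu) / sigma > z_limit

# Daily requests to an application over ten "normal" days:
baseline = [980, 1010, 995, 1020, 1005, 990, 1015, 1000, 985, 1025]
print(deviates(baseline, 1012))  # False -- within the normal range
print(deviates(baseline, 4200))  # True  -- an alert-worthy spike
```

Production tools use far richer models (per-user, per-application, time-of-day), but the core idea of alerting on departure from a learned baseline is the same.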
While intrusion prevention remains a cornerstone of network monitoring, it has expanded in several dimensions. First, as enterprise networks move from wired to wireless access, wireless IPS has become essential. At a minimum, enterprises can use rogue detection built into wireless LAN controllers. Risk-averse enterprises may invest in wireless IPS to scan the network 24/7 for threats, including some otherwise hidden IoT and unauthorized BYOD communication. Second, intrusion prevention now extends beyond the enterprise network to mobile devices. For example, EMMs can be used to routinely assess mobile device integrity, alerting administrators to jailbroken, rooted or malware-infected devices and automatically protecting the enterprise by removing network connections or business applications from those devices. The ability to look beyond the traditional enterprise network edge is key to avoiding blind spots.

SIEM technologies have also evolved from simply aggregating and normalizing events produced by enterprise network-connected systems and applications; now they comb that data with contextual information about users, assets, threats and vulnerabilities to enable correlation and analysis. According to Gartner, SIEM deployment is growing, with breach detection now overtaking compliance as the primary driver. As a result, SIEM vendors have expanded capabilities that target breach detection, such as threat intelligence, anomaly detection and network-based activity monitoring, for example, integrating NetFlow and packet capture analysis. SIEM not only helps enterprises pull monitored data together, but now it can intelligently sift through that haystack to pinpoint internal and external threats.
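The event-correlation idea behind SIEM, knitting individually unremarkable events into one cohesive, higher-severity alert, can be illustrated with a toy example. The event fields, window size, and failure limit below are assumptions for illustration only.

```python
# Toy SIEM-style correlation: single failed logins are noise, but five
# or more from the same source inside a five-minute window are
# correlated into one "possible brute-force" alert.

from collections import defaultdict

WINDOW_SECONDS = 300
FAILED_LOGIN_LIMIT = 5

events = [  # (timestamp_seconds, source_ip, event_type)
    (100, "203.0.113.7",  "failed_login"),
    (130, "203.0.113.7",  "failed_login"),
    (150, "198.51.100.2", "failed_login"),
    (160, "203.0.113.7",  "failed_login"),
    (200, "203.0.113.7",  "failed_login"),
    (240, "203.0.113.7",  "failed_login"),
]

def correlate(events):
    by_source = defaultdict(list)
    for ts, src, etype in events:
        if etype == "failed_login":
            by_source[src].append(ts)
    alerts = []
    for src, stamps in by_source.items():
        stamps.sort()
        # Sliding window: enough failures inside WINDOW_SECONDS?
        for start in stamps:
            in_window = [t for t in stamps if start <= t < start + WINDOW_SECONDS]
            if len(in_window) >= FAILED_LOGIN_LIMIT:
                alerts.append((src, "possible brute-force attempt"))
                break
    return alerts

print(correlate(events))  # [('203.0.113.7', 'possible brute-force attempt')]
```

Real SIEMs correlate across many event types and enrich alerts with asset and user context, but the pattern of aggregating low-level events into one actionable alert is the same.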
A new market segment has started to emerge: breach detection systems (BDS). These technologies are being driven by startups that are working to apply big data analytics to monitored information, profiling user- and device-behavior patterns to detect breaches and facilitate interactive investigation. According to NSS Labs, a BDS can identify pre-existing breaches as well as malware introduced through side-channel attacks, but should be considered a "last line of defense against breaches that go undetected by current security technologies, or are unknown by these technologies." Risk-intolerant enterprises that have tried other advanced monitoring technologies but are plagued by advanced, persistent threats may wish to investigate this new tool.

When attacks inevitably break through enterprise network defenses and evade real-time detection, another advanced monitoring tool can be helpful: network forensics appliances. Network forensics also analyzes monitored data, but in a different way, for a different purpose. Like a network DVR, these passive appliances record and catalog all ingress and egress traffic. By delivering exhaustive full-packet replay, analysis and visualization quickly, network forensics appliances support cybercrime investigation, evidence gathering, impact assessment and cleanup. Here, the idea is to avoid limitations associated with real-time monitoring, that is, having to spot everything important right when it happens. Network forensics makes it possible to go back and take a second look, to find what other monitoring systems might have missed.
The Bottom Line

As we have seen, advanced network security monitoring cannot be accomplished through isolated static tools. Rather, monitoring must occur at many locations and levels throughout the enterprise network and beyond, and create a comprehensive data set that an increasingly smart and dynamic collection of analysis tools then scours. Only in this way can we respond quickly and effectively to emerging cyberthreats that have learned how to fly under the traditional network radar.

Lisa Phifer owns Core Competence Inc., a consultancy specializing in safe business use of emerging Internet technologies. Phifer is a recognized industry expert on wireless, mobile and cyber security.

Copyright of Information Security is the property of TechTarget, Inc. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.

How IT Pros Employ the Latest Security Monitoring Tech
By Steve Zurier. Information Security, July 2015
Get the latest from the enterprise frontline on attack prevention and network protection.

It takes only a cursory glance at the news to realize that malware, data breaches and other information security threats have expanded exponentially in the last three to five years. Hardly a week goes by before another high-profile, multimillion-dollar hacking incident goes public. Take your pick: Target, eBay, Home Depot, JPMorgan Chase, Sony, even the U.S. Postal Service. And while cyberattacks rarely target network devices, vulnerable networks are regularly exploited by cybercriminals during these incidents to transport malicious traffic or stolen data.
Behind the scenes are the IT pros who must grapple with government-sponsored hacks, from countries such as China, Iran, North Korea and Russia, in addition to for-profit hackers with ties to organized crime and, yes, people who decide to break into a network just because they can. It's an environment that has propelled networking and security teams to work closer together so they can answer a seemingly basic question with more certainty: Is the network safe? Ron Grohman, a senior network engineer at Bush Brothers & Co., makers of the popular Bush's Best brand of baked beans, says that faced with all of the recent threats and attacks, he's not taking any chances.

NO PRODUCT IS PERFECT

Grohman doesn't rely on just one security product to protect the company's network. He uses a mix of Cisco ASA 5525-X firewalls with the Sourcefire URL filter, FireEye's Web Malware Protection System (MPS) 1310 to look for suspicious malware and Symantec antivirus software as a final backup. Grohman says he uses the products from Cisco and Sourcefire, which Cisco acquired in 2013, mainly as a firewall and URL filter to manage Web traffic. The FireEye product was placed on the network to do anomaly-based detections. If the FireEye platform detects suspicious malware, the software blocks the malware and sends an alert to Grohman, who refers the incident to a member of the help desk, where the malware is removed. The Symantec software catches HTTPS traffic and serves as a last line of defense before anything reaches the endpoint. "I double- and triple-up," he says. "No one product is perfect, and I'm OK with multiple systems checking things, especially if it's going to protect the network."

At a Fortune 500 company, there might be dozens of networking and security personnel who are cross-trained in each other's disciplines and work jointly on network security. Not so at Bush Brothers, a private medium-sized company in Knoxville, Tenn. Grohman works as one of eight members of the infrastructure team, and he's the only one with security credentials. To augment his background in networking, he completed the CISSP course from ISC2 last fall, and he earned the Cisco Certified Network Professional certification last spring. Grohman also earned a degree in information security several years ago from ITT Tech. "All the security work falls on me," he says. "People would come to me and ask if the network was secure, and all I could say was, 'I think so.' It bothered me that I didn't know for sure if everything was safe. So that's why we added all those extra layers."

That kind of uneasiness is common among network and security managers today. But it's also an approach that Frank Dickson, research director for information and network security at Frost & Sullivan, says is understandable. "There's really no single silver bullet," he says. "No matter what the vendors say, malware will inevitably get through. Remember, a zero-day attack, by definition, exploits an unknown vulnerability. Defending oneself from the unknown is challenging." Dickson says there's definitely been a shift in the security landscape away from solely relying on traditional antivirus software, which would look to create signatures of previously undiscovered malware after a breach had been detected. Today, products such as Sourcefire from Cisco, FireEye, and Palo Alto Networks' WildFire and Traps tools take a much more proactive approach. "The industry is moving to the use of more behavior-based approaches, such as testing the behavior of suspicious files in a quarantined, virtualized environment or utilizing big data analytics to monitor network traffic to establish a baseline and look for significant anomalies," Dickson explains.

PROTECTION VS. PREVENTION

Conventional wisdom and the sheer reality of today's threats may dictate an approach like the one used at Bush Brothers. But Golan Ben-Oni, CIO at telecom, banking and energy company IDT Corporation in Newark, N.J., doesn't buy it. He says there has to be more of a focus on stopping the bad guys, not merely responding to attacks after the fact. The industry has been caught in a mode of believing that it's only a matter of when they will be hacked, he contends, as opposed to if they will be hacked. Ben-Oni says that's defeatist. "If you say we're giving up on prevention, you're then essentially saying you've given up," he says.

IDT uses a combination of three products from Palo Alto to protect its network: WildFire network detection software; Traps, which Palo acquired from Israel-based Cyvera last year for endpoint protection; and GlobalProtect. The third tool allows IDT to extend the benefits of WildFire and Traps to mobile devices and computers that leave the office.

[Sidebar] Crime Doesn't Pay, But It Sure Costs. How bad is a data breach or leak for the bottom line? It depends on the underlying cause, which one study links to the average cost per compromised record:

  • Malicious attack: $246
  • System glitches: $171
  • Employee error: $160

Source: "2014 Cost of Data Breach Study: United States," Ponemon Institute/IBM, May 2014

Here's how they work in concert at IDT: Traps is always on the lookout for malware on the endpoint. If Traps detects that a zero-day attack or some other anomaly has entered the network, it will communicate that to WildFire, which will then run an analysis. Once it confirms that the activity is in fact malware, WildFire will block and remediate the malware. WildFire adds another level of protection in that, once it detects malware, both the endpoint and the network (via the Palo Alto firewall) are protected. The network will not allow the malicious traffic to flow through, and if the file should be introduced by some other means, via a USB flash drive or local file copies, for example, Traps will block its execution.

In the past, Ben-Oni says, by the time the IT staffers detected malware, disconnected the computer from the network and uploaded the file to the antivirus lab, it could take 24 hours for them to write a signature. IT teams don't have that kind of time today. "Traditionally, all of this was done manually and now it happens in near real-time," Ben-Oni says. By avoiding the need to bring people into the process, the risk of lateral infection is greatly reduced, he explains. Hackers use automation, so the only way for companies to level the playing field is to also use automation, Ben-Oni adds.

AUTOMATE NOW

That's a really important point, says Dan Polly, enterprise information security officer at First Financial Bank, which operates more than 100 banking locations in Indiana, Kentucky and Ohio. First Financial uses Cisco Advanced Malware Protection (AMP) for endpoints and the network, which lets the company do rapid-fire malware analysis. If AMP detects malware, it pushes the suspect file into a portal, which acts as a sandbox in which the software runs an analysis to determine the extent of the threat. "The thing to remember is that before these tools were available, you would need someone to analyze that malware in-depth, which required a person with some extensive programming and security skills," Polly explains. "Now, we're able to automate some of that, which saves time and gives us the ability to block the threat much faster."

Much like IDT's Ben-Oni, Polly sees a great benefit to working with a single vendor that offers multiple capabilities. Along with the AMP product, the company's security engineering team uses a combination of Cisco ASA and Sourcefire next-generation firewalls, which he says not only perform traditional firewall functions, but also support intrusion detection and prevention and URL content filtering. Polly also likes that, through development and acquisition, Cisco has invested in Talos, the company's security intelligence and research group. Talos fields a team of researchers who analyze threats and spend their days looking to improve Cisco's security products. "With an extensible platform, it cuts the time we can address any emerging threats," Polly says. "There's no question that the security industry goes in phases; there are times when best-of-breed was the only choice, [but] now it seems to be going back the other way to single source with multiple capabilities."

For years, information security and networking teams worked in different departments and, in some cases, competed with each other. FireEye CTO Dave Merkel says the current threat landscape has changed that dynamic. "Today, security must be woven into the fabric of the organization," he says, adding that the security and networking teams must work in partnership. Scott Harrell, vice president of product marketing in the security business group at Cisco, agrees that collaboration between the two disciplines will be critical to combat more advanced threats. "While the two groups can still have division of roles, I think it will move to a point where the security group is more involved in developing the network architecture, and the networking staff will handle Tier 1 security calls while the Tier 2 and Tier 3 alerts go to more experienced security pros," he says. Golan Ben-Oni, CISO at IDT Corporation, says all the teams within IT work together at IDT. "At our company, everyone gets cross-trained in all the different computing disciplines," he adds.

Steve Zurier is a freelance technology journalist based in Columbia, Md., with more than 30 years of journalism and publishing experience. Zurier previously worked as features editor at Government Computer News and InternetWeek.
An Overview of the Commercial Cloud Monitoring Tools: Research Dimensions, Design Issues, and State-of-the-Art
Khalid Alhamazani, Rajiv Ranjan, Karan Mitra, Fethi Rabhi, Prem Prakash Jayaraman, Samee Ullah Khan, Adnene Guabtni, Vasudha Bhatnagar
Computing (2015) 97:357–377. DOI 10.1007/s00607-014-0398-5. Received: 29 June 2013 / Accepted: 20 March 2014 / Published online: 16 April 2014. © Springer-Verlag Wien 2014
Author affiliations: University of New South Wales (Sydney, Australia); CSIRO (Canberra, Australia); Luleå University of Technology (Luleå, Sweden); North Dakota State University (Fargo, USA); NICTA (Sydney, Australia); University of Delhi (New Delhi, India).

Abstract. Cloud monitoring activity involves dynamically tracking the Quality of Service (QoS) parameters related to virtualized resources (e.g., VMs, storage, network, appliances), the physical resources they share, the applications running on them and the data hosted on them. Application and resource configuration in a cloud computing environment is quite challenging, considering the large number of heterogeneous cloud resources. Further, at any given point in time, there may be a need to change the cloud resource configuration (number of VMs, types of VMs, number of appliance instances, etc.) to meet application QoS requirements under uncertainties (resource failure, resource overload, workload spikes, etc.). Hence, cloud monitoring tools can assist cloud providers or application developers in: (i) keeping their resources and applications operating at peak efficiency, (ii) detecting variations in resource and application performance, (iii) accounting for the service level agreement violations of certain QoS parameters, and (iv) tracking the leave and join operations of cloud resources due to failures and other dynamic configuration changes.
In this paper, we identify and discuss the major research dimensions and design issues related to engineering cloud monitoring tools. We further discuss how these research dimensions and design issues are handled by current academic research as well as by commercial monitoring tools.

Keywords: Cloud monitoring · Cloud application monitoring · Cloud resource monitoring · Cloud application provisioning · Cloud monitoring metrics · Quality of service parameters · Service level agreement

Mathematics Subject Classification: 68U01

1 Introduction

According to the National Institute of Standards and Technology (NIST, http://www.nist.gov/itl/cloud/), cloud computing is a “model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (network, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [1]. Service models, hosting, deployment models, and roles are some of the important concepts and essential characteristics of cloud computing technologies defined by NIST and elaborated in [1–6]. Commercial cloud providers, including Amazon Web Services (AWS), Microsoft Azure, Salesforce.com, Google App Engine, and others, offer cloud consumers options to deploy their applications over a practically infinite resource pool with almost no capital investment and with modest operating costs proportional to actual use. For example, the Amazon EC2 cloud runs around half a million physical hosts, each of them hosting multiple virtual machines that can be dynamically invoked or removed [7]. Several papers in the literature discuss, explore, and survey cloud monitoring in different aspects [1–6,8–13]. To the best of our knowledge, no specific survey considers monitoring applications across the different cloud layers, namely Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
Further, none of these papers has focused on predictive cloud monitoring, and none discusses the possibility of applying machine learning techniques to monitored data. In addition to the above factors, one emerging aspect of cloud computing is managing huge volumes of data (Big Data). In the present environment, the term “Big Data” describes the practice of collecting and processing very large datasets and the associated systems and algorithms used to analyze them [14]. Three well-recognized characteristics of Big Data are the Variety, Volume, and Velocity (3 V’s) of data generation [15,16]. The steady growth of social media and smart mobile devices has increased the sources of outbound traffic, initiating a “data tsunami” phenomenon. For example, in [17–19], Wang et al. present a high-performance data-intensive computing solution for massive remote sensing data processing. This poses significant challenges in cloud computing. In [20], studies show that as more people join social media sites hosted on clouds, analysis of the data becomes increasingly difficult, to the point of being almost impossible. Big data events can also occur when, for example, services or processes of the infrastructure itself cause high load [21]. Other studies [22,23] show that VM migration, copying, and saving of current state can affect the performance of data transfer within the cloud. Moreover, the different types of data originating from mobile devices make understanding composite data a challenging problem due to multi-modality, huge volume, dynamic nature, multiple sources, and unpredictable quality. Continuous monitoring of multi-modal data streams collected from heterogeneous data sources requires monitoring tools that can cope with managing big data floods (the data tsunami phenomenon).
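Coping with such high-volume monitored streams usually means bounded-memory aggregation rather than storing every sample. The following Python sketch illustrates the idea under stated assumptions: the window size, the metric, and the class name are illustrative inventions, not anything defined in the paper.

```python
from collections import deque

class SlidingWindowMonitor:
    """Bounded-memory aggregator for a high-rate metric stream.

    Only the last `window` samples are kept, so memory use stays constant
    no matter how large the incoming data volume grows.
    """

    def __init__(self, window: int):
        # deque with maxlen evicts the oldest sample automatically
        self.samples = deque(maxlen=window)

    def record(self, value: float) -> None:
        self.samples.append(value)

    def average(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def peak(self) -> float:
        return max(self.samples) if self.samples else 0.0

# Example: a million CPU-utilization readings, but only the last 60 retained.
mon = SlidingWindowMonitor(window=60)
for i in range(1_000_000):
    mon.record(i % 100)
print(len(mon.samples))  # 60
```

The same pattern (fixed windows, streaming summaries) underlies most production metric pipelines, since retaining raw samples at cloud scale is exactly the "data flood" problem described above.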
In this paper, we identify and discuss three challenges of cloud monitoring: (1) how to determine layer-specific application monitoring requirements, i.e., how a cloud consumer can stipulate at which cloud layer his/her running application should be monitored; (2) how a cloud consumer can express what information he/she is interested in while his/her application is being monitored; and (3) how a cloud consumer can predict application behavior in terms of performance.

1.1 Our contributions

The concrete contributions made by this paper are: (a) advancing the fundamental understanding of cloud resource and application provisioning and monitoring concepts, (b) identifying the main research dimensions and software engineering issues based on cloud resource and application types, and (c) presenting future research directions for novel cloud monitoring techniques. The remainder of the paper is organized as follows: key cloud resource provisioning concepts are discussed in Sect. 2. Section 3 discusses the cloud application life cycle and the components of cloud monitoring in detail. Section 4 presents the major research dimensions and software engineering issues related to developing cloud monitoring tools. Section 5 discusses the mapping of research dimensions to existing cloud monitoring tools. The paper ends with a brief conclusion and an overview of future work.

2 Cloud resource provisioning

Cloud resource provisioning is a complex task [24] and refers to the process of application deployment and management on cloud infrastructure. Current cloud providers such as Amazon EC2, ElasticHosts, GoGrid, TerraMark, and Flexian do not offer complete support for automatic deployment and configuration of software resources [25]. Therefore, several companies, e.g.
RightScale and Scalr, provide scalable managed services on top of cloud infrastructures that allow cloud consumers to deploy and configure applications automatically [25]. The three main steps of cloud provisioning are [24,26]:

Virtual machine provisioning: suitable VMs are instantiated to match the required hardware and configuration of an application. To illustrate, Bitnami (http://bitnami.org/faq/cloud_amazon_ec2) supports consumers in provisioning a Bitnami stack that consists of a VM and appliances. On the other hand, Amazon EC2 consumers may first provision a VM on the cloud and then choose the appliances to provision on that VM.

Resource provisioning: the process of mapping and scheduling the instantiated VMs onto the cloud’s physical servers. This is handled by cloud-based hypervisors. For example, public clouds expose APIs to start/stop a resource but not to control which physical server within a region/datacenter will host the VM.

Fig. 1 Provisioning and deployment sequence diagram

Figure 1 illustrates the steps in which a cloud consumer provisions cloud resources on the Amazon EC2 platform. In step 1, the consumer views the available VMs in the VM repository provided by the cloud platform and selects the preferred VM instance type. In step 2, the consumer sets up his/her preferences/configurations on this VM. In steps 3 and 4, the consumer deploys this VM on the cloud platform. Subsequently, in steps 5 and 6, the consumer retrieves a list of available applications from the applications repository. In step 7, the consumer opts for his/her desired applications. Finally, in step 8, the cloud consumer deploys the applications and the VM on the cloud platform.

Application provisioning: the process of deploying applications on VMs on the cloud infrastructure.
For example, deploying a Tomcat server as an application on a VM hosted on the Amazon EC2 cloud. Application provisioning can be done in two ways. The first method deploys the applications together with the VM. In the second method, the consumer first deploys the VM and then, as a separate step, deploys the applications. After the provisioning stage, a cloud workflow instance might be composed of multiple cloud services, and in some cases services from a number of different providers. Therefore, monitoring the quality of cloud instances across cloud providers becomes much more complex [27]. Further, at run time, the QoS of the running instance needs to be consistently monitored to guarantee the SLA and to avoid or handle abnormal system behaviour.

Monitoring is the process of observing and tracking applications and resources at run time. It is the basis of control operations and corrective actions for systems running on clouds. Despite the existence of many commercial monitoring tools in the market, managing service level agreements (SLAs) between multiple cloud providers still poses a major issue in clouds. Cloud monitoring, SLAs, and dynamic configuration are correlated in the sense that each has an impact on the others: enhancing monitoring functionalities assists in meeting SLAs as well as in improving dynamic configuration operations at run time. Moreover, SLAs have to be met by cloud providers in order to reach the reliability level required by consumers, and auto-scaling and dynamic configuration are required for optimal use of cloud technology. Altogether, this leads us to conclude that cloud monitoring is a key element that has to be further studied and enhanced.

3 Cloud monitoring

In this section, we present the basic components, phases, and layers of application architecture on clouds.
This section also presents the state of the art in cloud monitoring and how it is conceptually correlated to QoS and SLAs.

3.1 Application life cycle

The application architecture determines how, when, and which provisioning operations should be processed and applied on cloud resources. A high-level application (e.g., a multimedia application) architecture is multi-layered [28]. These layers may consist of clients/application consumers, load balancers, web servers, streaming servers, application servers, and a database system. Notably, each layer may instantiate multiple software resources as and when required, and such multiple instantiations can be allocated to one or more hardware resources. Across these system layers, a number of provisioning operations take place at design time as well as at run time. These provisioning operations should ensure SLA compliance by achieving the QoS targets.

Resource selection: the process in which the system developer selects software resources (web server, multimedia server, database server, etc.) and hardware resources (CPU, storage, and network). This process encapsulates the allocation of hardware resources to the selected software resources.

Resource deployment: during this process, the system administrator instantiates the selected software resources on the hardware resources and configures them for successful communication and inter-operation with the other software resources already running in the system.

Resource monitoring: to ensure that the deployed software and hardware resources run at the level required to satisfy the SLA, a continuous resource monitoring process is desirable. This process involves detecting and gathering information about the running resources. On detection of any abnormal system behavior, the system orchestrator is notified so that policy-based corrective actions can be undertaken as a remedy.
Resource control: the process of ensuring that the QoS terms stated in the SLA are met. This process is responsible for handling system uncertainties at run time, e.g., upgrading or downgrading a resource's type, capacity, or functionality.

3.2 Cloud monitoring

In clouds, monitoring is essential to maintain high availability and performance of the system, and it is important for both providers and consumers [8–10]. Primarily, monitoring is a key tool for (i) managing software and hardware resources, and (ii) providing continuous information about those resources as well as about consumer applications hosted on the cloud. Cloud activities such as resource planning, resource management, data center management, SLA management, billing, troubleshooting, performance management, and security management all need monitoring for effective and smooth operation of the system [29]. Consequently, there is a strong need for monitoring given the elastic nature of cloud computing [30]. In cloud computing, monitoring can be of two types: high-level and low-level. High-level monitoring relates to the status of the virtual platform [31]; low-level monitoring relates to information collected about the status of the physical infrastructure [31,32].

A cloud monitoring system is a self-adjusting and typically multi-threaded system that supports the monitoring functionalities [33]. It comprehensively monitors pre-identified instances/resources on the cloud for abnormalities. On detecting abnormal behavior, the monitoring system attempts to auto-repair the instance/resource if the corresponding monitor has a tagged auto-heal action [33]. In case of auto-repair failure or the absence of an auto-heal action, a support team is notified. Technically, notifications can be sent by different means such as email or SMS [33].
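The check / auto-heal / notify flow just described can be sketched in a few lines of Python. The `Monitor` class, the probe functions, and the alert text below are hypothetical illustrations of the pattern, not the API of any tool cited in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Monitor:
    """One monitored instance/resource, optionally tagged with an auto-heal action."""
    name: str
    is_healthy: Callable[[], bool]                   # probe returning current status
    auto_heal: Optional[Callable[[], bool]] = None   # returns True if repair succeeded

def run_checks(monitors: List[Monitor], notify: Callable[[str], None]) -> None:
    """Check every monitor; try auto-repair first, escalate to support on failure."""
    for m in monitors:
        if m.is_healthy():
            continue
        if m.auto_heal is not None and m.auto_heal():
            continue  # abnormality repaired automatically, no escalation
        # Auto-repair failed, or no auto-heal action tagged: notify the support team
        notify(f"ALERT: {m.name} is abnormal and could not be auto-repaired")

# Example: a web server whose restart succeeds, and a disk with no heal action.
alerts = []
state = {"web_up": False}

def restart_web() -> bool:
    state["web_up"] = True
    return True

monitors = [
    Monitor("web-server", lambda: state["web_up"], restart_web),
    Monitor("disk-volume", lambda: False),  # no auto-heal action tagged
]
run_checks(monitors, alerts.append)
print(alerts)  # only the disk alert; the web server was auto-healed
```

In a real system `notify` would send the email or SMS mentioned in the text rather than append to a list.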
3.2.1 Monitoring, QoS, and SLA

As mentioned earlier, cloud monitoring is needed for continuous assessment of resources and applications on a cloud platform in terms of performance, reliability, power usage, ability to meet the SLA, security, etc. [34]. Fundamentally, monitoring tests can be computation based and/or network based. Computation-based tests are concerned with the status of the real or virtualized platforms running cloud applications; data metrics considered in such tests include CPU speed, CPU utilization, disk throughput, VM acquisition/release time, and system up-time. Network-based tests focus on network-layer metrics such as jitter, round-trip time (RTT), packet loss, and traffic volume [31,32,35]. At run time, a set of operations takes place in order to meet the QoS parameters specified in the SLA document, which guarantees the required performance objectives of the cloud consumers. The availability, load, and throughput of hardware resources can vary in unpredictable ways, so ensuring that applications achieve QoS targets is not trivial. Being aware of the system’s current software and hardware service status is imperative for handling such uncertainties and ensuring the fulfillment of QoS targets [8]. In addition, detecting exceptions and malfunctions while deploying software services on hardware resources is essential, e.g., showing the QoS delivered by each application component (a software service such as a web server or database server) hosted on each hardware resource. These uncertainties can be tackled through the development of efficient, scalable, interoperable monitoring tools with easy-to-use interfaces.

3.2.2 Monitoring across different applications and layers

As mentioned previously, application components (e.g., streaming server, web server, indexing server, compute service, storage service, and network) are distributed across cloud layers including PaaS and IaaS.
Thus, in order to guarantee the achievement of QoS targets for the application as a whole, QoS parameters should be monitored across all the layers of the cloud stack, including Platform-as-a-Service (PaaS) (e.g., web server, streaming server, indexing server, etc.) and Infrastructure-as-a-Service (IaaS) (e.g., compute services, storage services, and network). Figure 2 illustrates how different components are distributed across the cloud platform layers, and Table 1 shows the QoS parameters that a monitoring system should consider at each cloud layer. Typically, QoS targets vary across application types. For example, QoS targets for eResearch applications differ from those of static, single-tier web applications (e.g., a web site serving static content) or multi-tier applications (e.g., on-demand audio/video streaming). Based on application types, there is always a need to negotiate different SLAs. Hence, the SLA document includes conditions and constraints that match the nature of the QoS requirements of each application type. For example, a genome analysis experiment on cloud services will mainly care about data transfer (upload and download) network latency and processing latency, whereas for multimedia applications the quality of the data transferred over the network is more important, so other parameters gain priority. Failing to track QoS parameters will eventually lead to SLA violations. Consequently, monitoring is fundamental and is responsible for SLA compliance certification [36]. Moreover, a multi-layer application monitoring approach can provide significant insights into application and system performance to both the consumer and the cloud administrator. This is essential for consumers as they can identify and isolate application performance bottlenecks to
specific layers. From a cloud administrator’s point of view, the QoS statistics on application performance across layers can help them maintain their SLAs, delivering better performance and higher consumer satisfaction.

Fig. 2 Components across cloud platform layers

Table 1 QoS parameters at each cloud platform layer

Cloud layer | Layer components | Targeted QoS parameters
SaaS | Appliances x, y, z, etc. | BytesRead, BytesWrite, Delay, Loss, Availability, Utilization
PaaS | Web Server, Streaming Server, Index Server, Apps Server, etc. | BytesRead, BytesWrite, SysUpTime, SysDesc, HrSystemMaxProcesses, HrSystemProcesses, SysServices
IaaS | Compute Service, Storage Service, Network, etc. | CPU parameters: Utilization, ClockSpeed, CurrentState; network parameters: Capacity, Bandwidth, Throughput, ResponseTime, OneWayDelay, RoundTripDelay, TcpConnState, TcpMaxConn

4 Evaluation dimensions

In this section, we present the basic components that can be considered as dimensions for evaluating a cloud monitoring tool.

4.1 Monitoring architectures

In cloud monitoring, network- and system-related information, such as CPU utilization, network delay, and packet loss, is collected by the monitoring systems. This information is then used by applications to determine actions, such as migrating data to the server closest to the user, to ensure that SLA requirements are met. Typically, network monitoring can be performed on centralized and decentralized network architectures.

Fig. 3 Centralized monitoring architecture

4.1.1 Centralized

In the centralized architecture shown in Fig. 3, the PaaS and IaaS resources report QoS status to a centralized monitoring server; the monitoring techniques continuously pull this information from the components via periodic probing messages. In [33], the authors show that a centralized cloud monitoring architecture allows better management of cloud applications.
Nevertheless, the centralized approach has several design issues, including:

• it is prone to a single point of failure;
• it lacks scalability;
• it incurs high network communication cost on the links leading to the information server (i.e., network bottleneck and congestion); and
• it may lack the computational power required to serve a large number of monitoring requests.

4.1.2 Decentralized

Recently, proposals for decentralized cloud monitoring tools have gained momentum. Figure 4 shows the broad schematic design of a decentralized cloud monitoring system. Decentralizing monitoring tools can overcome the issues of current centralized systems. A monitoring tool configuration is considered decentralized if none of the components in the system is more important than the others: if one of the components fails, it does not affect the operation of any other component in the system.

Fig. 4 Decentralized monitoring architecture

Structured peer-to-peer: the desire for a network layout in which central authority is diffused has led to the development of structured peer-to-peer networks. In such a network overlay, the central point of failure is eliminated. Napster is a popular structured peer-to-peer system [37].

Unstructured peer-to-peer: an unstructured peer-to-peer overlay is likewise distributed, but its search directory is not centralized, unlike in a structured peer-to-peer overlay, where a centralized search directory leads to a single point of failure. Gnutella is one of the well-known unstructured peer-to-peer systems [37].

Hybrid peer-to-peer: a combination of structured and unstructured peer-to-peer network systems. Super-peers act as local search hubs in small portions of the network, while the network at large behaves as an unstructured peer-to-peer system. Kazaa is a hybrid of the centralized Napster and decentralized Gnutella network systems.
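The centralized, pull-based scheme of Sect. 4.1.1 (and the reason its server is a single point of failure) can be made concrete with a small sketch. The class name, probe functions, and QoS fields below are invented for illustration only.

```python
from typing import Callable, Dict

class CentralMonitoringServer:
    """Minimal centralized monitor: one server periodically pulls QoS
    status from every registered component via probing, as in Sect. 4.1.1.
    Every reading flows through this single server, which is exactly the
    single point of failure and bottleneck the text warns about.
    """

    def __init__(self):
        self.probes: Dict[str, Callable[[], dict]] = {}
        self.latest: Dict[str, dict] = {}

    def register(self, component: str, probe: Callable[[], dict]) -> None:
        self.probes[component] = probe

    def poll_once(self) -> None:
        # Pull-based: the server queries each component; components stay passive.
        for name, probe in self.probes.items():
            self.latest[name] = probe()

# Two fake components standing in for PaaS/IaaS resources.
server = CentralMonitoringServer()
server.register("vm-1", lambda: {"cpu_util": 0.42, "rtt_ms": 11})
server.register("vm-2", lambda: {"cpu_util": 0.91, "rtt_ms": 35})
server.poll_once()
overloaded = [n for n, qos in server.latest.items() if qos["cpu_util"] > 0.8]
print(overloaded)  # ['vm-2']
```

A decentralized design would instead spread `latest` across peers so that losing any one node does not blind the whole system.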
4.2 Interoperability

The interoperability perspective focuses on a system’s technical capability to interface with other organizations and systems, and on the resulting compatibility or incompatibility between systems and data-collation partners. Modern business applications developed on the cloud are often complicated and require interoperability. For example, an application owner can deploy a web server on the Amazon cloud while the database server is hosted in the Azure cloud. Unless data and applications are properly integrated across clouds, the results and benefits of cloud adoption cannot be achieved. Interoperability is also necessary to avoid cloud provider lock-in.

Fig. 5 Interoperability classification

Table 2 Monitoring tools and interoperability

Platform | Interoperable, cloud-agnostic (multi-cloud)
Monitis [38] | Yes
RevealCloud [39,40] | Yes
LogicMonitor [41] | Yes
Nimsoft [42] | Yes
Nagios [31,43] | Yes
SPAE [44,45] | Yes
CloudWatch [46] | No
OpenNebula [47] | No
CloudHarmony [48] | Yes
Azure FC [49,50] | No

This dimension refers to the ability of a cloud monitoring framework to monitor applications and components that may be deployed across multiple cloud providers. While it is not difficult to implement a cloud-specific monitoring framework, designing a generic cloud monitoring framework that works with multiple cloud providers remains a challenging problem. Next, we classify the interoperability of monitoring frameworks (Fig. 5) into the following categories:

Cloud dependent: currently, many public cloud providers give their consumers monitoring tools to monitor their applications’ CPU, storage, and network usage. Often these tools are tightly integrated with the cloud provider’s existing tools. For example, CloudWatch, offered by Amazon, is a monitoring tool that enables consumers to manage and monitor their applications residing on AWS EC2 services.
However, this monitoring tool cannot monitor an application component that resides on another cloud provider’s infrastructure, such as GoGrid or Azure. Table 2 lists examples of cloud monitoring tools that are specific to a cloud provider as well as cloud-agnostic ones.

Cloud agnostic: in contrast to single-cloud monitoring, engineering cloud-agnostic monitoring tools is challenging. This is primarily due to the fact that there is no common, unified application programming interface (API) for retrieving cloud services’ runtime QoS statistics. Although recent developments in cloud programming APIs, including Simple Cloud, Delta Cloud, JCloud, and Dasein Cloud, simplify interaction with services (CPU, storage, and network) that may belong to multiple clouds, they have limited or no ability to monitor run-time QoS statistics and application behavior. In this scenario, monitoring tools are expected to be able to retrieve QoS data of services and applications that may span multiple clouds. Cloud-agnostic monitoring tools are also required if one wants to realize a hybrid cloud architecture involving services from private and public clouds. The Monitis monitoring tool can access different clouds, e.g., Amazon EC2, Rackspace, and GoGrid. It utilizes the concept of widgets, where consumers can view more than one widget on a page. In Monitis [38], consumers need to provide only their cloud account credentials to access monitoring data of their cloud applications running on different cloud providers’ infrastructures, and they can also specify which instance to monitor. Hence, a consumer can view two different cloud instances using two different widgets on one single page.
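One common way to engineer the cloud-agnostic tools described above is an adapter layer that hides each provider's proprietary monitoring API behind a uniform interface, with one dashboard widget per adapter/instance pair. The sketch below uses fake in-memory adapters that merely stand in for real provider clients; the class names, instance IDs, and metric values are all invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple

class CloudMonitorAdapter(ABC):
    """Uniform interface hiding each provider's proprietary monitoring API."""

    @abstractmethod
    def fetch_qos(self, instance_id: str) -> Dict[str, float]:
        ...

class FakeEC2Adapter(CloudMonitorAdapter):
    # Stand-in for a provider-specific client; a real adapter would call the
    # provider's API using the consumer's account credentials.
    def fetch_qos(self, instance_id: str) -> Dict[str, float]:
        return {"cpu_util": 0.37}

class FakeRackspaceAdapter(CloudMonitorAdapter):
    def fetch_qos(self, instance_id: str) -> Dict[str, float]:
        return {"cpu_util": 0.58}

def dashboard(widgets: Dict[str, Tuple[CloudMonitorAdapter, str]]) -> Dict[str, Dict[str, float]]:
    """One page, many widgets: each widget pairs an adapter with an instance,
    so instances from different clouds appear side by side."""
    return {name: adapter.fetch_qos(iid) for name, (adapter, iid) in widgets.items()}

view = dashboard({
    "widget-ec2": (FakeEC2Adapter(), "i-123"),
    "widget-rackspace": (FakeRackspaceAdapter(), "srv-9"),
})
print(view)
```

Adding support for another provider then means writing one more adapter, not changing the dashboard, which is the essence of cloud-agnostic design.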
4.3 Quality of service (QoS) matrix

It is non-trivial for application developers to understand which QoS parameters and targets they need to specify and monitor across each layer of the cloud stack, including PaaS (e.g., web server, streaming server, indexing server, etc.) and IaaS (e.g., compute services, storage services, and network). As shown in Fig. 6, this can be done with a single parameter or a group of parameters.

Fig. 6 QoS matrix classification

4.3.1 Single parameter

In this scenario, a single parameter refers to one specific system QoS target. In each system, there are major atomic values that have to be tracked closely and continuously. For example, CPU utilization is expressed by a single parameter in the SLA. Such parameters can affect the whole system, and an SLA violation can lead to serious system failure. Unlike composite parameters, where an individual parameter might not be a priority for the system administrator, single parameters in most cases gain high priority when monitoring SLA violations and QoS targets.

4.3.2 Composite parameters

In a composite parameter scenario, a group of different parameters is taken into consideration. In the cloud, a software application is composed of many cloud software services; thus, the performance quality can be determined by the collective behavior of those software services [27]. After observing multiple parameters for estimating the functionality of one or more concerned processes, a single result can be obtained to evaluate the QoS.

Table 3 Monitoring tools and layers’ visibility

Platform | Multi-layer visibility (composite QoS matrix)
Monitis [38] | Yes
RevealCloud [39,40] | Yes
LogicMonitor [41] | Yes
Nimsoft [42] | Yes
Nagios [31,43] | Yes
SPAE [44,45] | No
CloudWatch [46] | Yes
OpenNebula [47] | No
CloudHarmony [48] | No
Azure FC [49,50] | Yes
To illustrate, “loss” can be considered a composite parameter of two single parameters, “one-way loss” and “round-trip loss”. Similarly, “delay” can be considered a composite parameter of three single parameters: “one-way delay”, “RTT delay”, and “delay variance”. Table 3 lists some commercial cloud monitoring tools and indicates which of them support monitoring multiple QoS parameters.

4.4 Cross-layer monitoring

As shown in Fig. 7, the application components (streaming server, web server, indexing server, compute service, storage service, and network) of a multimedia streaming application are distributed across cloud layers including PaaS and IaaS. In order to guarantee the achievement of QoS targets for the multimedia application as a whole, it is critical to monitor QoS parameters across multiple layers [51]. Hence, the challenge here is to develop monitoring tools that can capture and reason about the QoS parameters of application components across the IaaS and PaaS layers. As shown in Fig. 8, we categorize the visibility of commercial monitoring tools into the following categories:

Layer specific: cloud services are distributed among three layers, namely SaaS, PaaS, and IaaS. Layer-specific monitoring tools perform monitoring tasks over services in only one of these layers. Most present-day commercial tools are designed to track the performance of resources provisioned at the IaaS layer. For example, CloudWatch is not capable of monitoring the load, availability, and throughput of each core of the CPU services and their effect on the QoS (e.g., latency, availability, etc.) delivered by the hosted PaaS services (e.g., a J2EE application server). Hence, there is a considerable gap, and research challenges remain in developing a monitoring tool that can monitor QoS statistics across multiple layers of the cloud stack.
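The composite parameters of Sect. 4.3.2 amount to aggregating several single parameters into one QoS result. A small Python sketch of the "delay" and "loss" composites mentioned there (the sample values and function names are illustrative, not from the paper):

```python
from statistics import mean, pvariance

def composite_delay(one_way_delays_ms, rtt_delays_ms):
    """Composite 'delay' built from its single parameters:
    one-way delay, round-trip (RTT) delay, and delay variance."""
    return {
        "one_way_delay_ms": mean(one_way_delays_ms),
        "rtt_delay_ms": mean(rtt_delays_ms),
        "delay_variance": pvariance(one_way_delays_ms),
    }

def composite_loss(one_way_loss: float, round_trip_loss: float) -> dict:
    """Composite 'loss' from its two single parameters."""
    return {"one_way_loss": one_way_loss, "round_trip_loss": round_trip_loss}

# Illustrative probe samples in milliseconds.
delay = composite_delay([10.0, 12.0, 14.0], [21.0, 25.0])
loss = composite_loss(one_way_loss=0.01, round_trip_loss=0.02)
print(delay["one_way_delay_ms"])  # 12.0
```

An SLA check over a composite parameter would then compare each field of the returned dictionary against its own target, rather than a single threshold.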
Layer agnostic: in contrast to the previous scenario, monitoring at multiple layers enables consumers to gain insight into their applications’ performance across multiple layers; e.g., consumers can retrieve data at the same time from PaaS and IaaS for the same application (Table 3). This type of cloud monitoring is useful in all cases, but it is clearly most effective for consumers requiring complete awareness of their cloud applications.

Fig. 7 Components across cloud layers and QoS propagation

Fig. 8 Visibility categorization

4.5 Programming interfaces

Programming interfaces allow the development of software systems that enable monitoring across different layers of the cloud stack. They involve several components, such as APIs, widgets, and the command line, that enable a consumer to monitor the many components of complex cloud systems in a unified manner.

Fig. 9 Different types of programming interfaces

4.5.1 Application programming interface

An application programming interface (API) is a particular set of rules (‘code’) and specifications that software programs follow to communicate with each other (Fig. 9). It serves as an interface between different software programs and facilitates their interaction, similar to the way the user interface facilitates interaction between humans and computers. In fact, most commercial monitoring tools, such as Rackspace, Nimsoft, RevealCloud, and LogicMonitor, provide their consumers with extensible open APIs enabling them to specify their own required system functionalities.

4.5.2 Command line

A command line provides a means of communication between a consumer and a computer based solely on textual input and output.

4.5.3 Widgets

In computer software, a widget is a software service available to consumers for running and displaying applets via a graphical interface on the desktop.
Monitis and RevealCloud are two popular commercial tools that provide performance data to consumers on multiple customizable widgets.

4.5.4 Communication protocols

All commercial tools adopt communication protocols for data transfer, and these vary from one monitoring tool to another. For example, Monitis and Rackspace use the HTTPS and FTP protocols. Another example is LogicMonitor, which adopts the encrypted Simple Network Management Protocol (SNMP).

5 Commercial monitoring tools

5.1 Monitis

Monitis [38], founded in 2005, has a unified dashboard where consumers can open multiple widgets for monitoring. A Monitis consumer needs to enter his/her credentials to access the hosting cloud account. In addition, a Monitis consumer can remotely monitor any website for uptime, and in-house servers for CPU load, memory, or disk I/O, by installing Monitis agents to retrieve data about the devices. A Monitis agent can also collect data from the networked devices of an entire network (behind a firewall); this technique is used instead of installing a Monitis agent on each individual device. Widgets can also be emailed as a read-only version to share the monitored information. Moreover, Monitis provides rich features for reporting the status of instances: consumers can specify how a report should be viewed, e.g., as a chart or a graph, and can share the report publicly with others.

5.2 RevealCloud

CopperEgg [39,40], founded in 2010 with Rackspace as a main partner, provides the RevealCloud monitoring tool. RevealCloud enables its consumers to monitor across cloud layers, e.g., SaaS, PaaS, and IaaS. It is not dedicated to a single cloud provider; rather, it is generic and works with most popular cloud providers, e.g., AWS EC2, Rackspace, etc.
RevealCloud is one of the very few monitoring tools that supports maintaining monitored historical data. It can track up to 30 days of historical data, which is considered a prime feature that most commercial monitoring tools lack.

5.3 LogicMonitor

LogicMonitor [41] was founded in 2008 and partners with several third parties such as NetApp, VMware, Dell, and HP. Similar to RevealCloud, LogicMonitor enables its consumers to monitor across cloud layers, e.g., SaaS, PaaS, and IaaS. It also enables them to monitor application operations on multi-cloud resources. The protocol used for communications is SSL. Moreover, LogicMonitor uses SNMP as a method of retrieving data about distributed virtual and physical resources.

5.4 Nimsoft

Nimsoft [42] was founded in 2011. Nimsoft supports multi-layer monitoring of both virtual and physical cloud resources. Moreover, Nimsoft enables its consumers to view and monitor their resources even when they are hosted on different cloud infrastructures; e.g., a Nimsoft consumer can view resources on Google Apps, Rackspace, Amazon, Salesforce.com, and others through a unified monitoring dashboard. Nimsoft also gives its consumers the ability to monitor both private and public clouds.

5.5 Nagios

Nagios [43] was founded in 2007 and supports multi-layer monitoring. It enables its consumers to monitor their resources on different cloud infrastructures as well as in-house infrastructure. Nagios utilizes SNMP for monitoring networked resources. Moreover, Nagios has been extended with monitoring functionalities for both virtual instances and storage services using a plugin-based architecture [31]. Typically, a Nagios server is required to collect the monitoring data, which places it as a centralized solution that the user must set up. However, many possible configurations can help create multiple hierarchical Nagios servers to reduce the disadvantages of a centralized server.
5.6 SPAE by SHALB

SHALB [44] was founded in 2002 and provides a monitoring solution called the Security Performance Availability Engine (SPAE). SPAE is a typical network monitoring tool supporting a variety of network protocols such as HTTP, HTTPS, FTP, SSH, etc. It uses SNMP [45] to perform all of its monitoring processes and emphasizes security and vulnerability monitoring. However, SPAE does not support monitoring at different layers (IaaS, PaaS, and SaaS). It enables its consumers to monitor networked resources, including cloud infrastructure.

5.7 CloudWatch

CloudWatch [46] is one of the most popular commercial tools for monitoring the cloud. It is provided by Amazon to enable its consumers to monitor their resources residing on EC2; hence, it does not support multi-cloud infrastructure monitoring. The technical approaches used in CloudWatch to collect data are implicit and not exposed to users. CloudWatch is limited in monitoring resources across cloud layers; however, an API is provided that lets users collect metrics at any cloud layer, although it requires the users to write additional code.

5.8 OpenNebula

OpenNebula [47] is an open-source monitoring system that provides management for data centers. It uses SSH as the protocol permitting consumers to gain access and gather information about their resources. Mainly, OpenNebula is concerned with monitoring the physical infrastructures involved in data centers such as private clouds.

5.9 CloudHarmony

CloudHarmony [48] started monitoring services at the beginning of 2010. It provides a set of performance benchmarks of public clouds. It is mostly concerned with monitoring the common operating system metrics related to CPU, disk, and memory. Moreover, cloud-to-cloud network performance in CloudHarmony is evaluated in terms of RTT and throughput.

5.10 Windows Azure FC

The Azure Fabric Controller (Azure FC) [49,50] adopts a centralized network architecture.
It is a multi-layer monitoring system, but it does not support monitoring across different cloud infrastructures. Moreover, Azure FC utilizes SNMP to perform monitoring.

6 Classification and analysis of cloud monitoring tools based on taxonomy

With increasing cloud complexity, the effort needed for the management and monitoring of cloud infrastructures must be multiplied. The size and scalability of clouds, when compared to traditional infrastructure, demand more complex monitoring systems that have to be more scalable, effective, and fast. Technically, this means there is a demand for real-time reporting of performance measurements while monitoring cloud resources and applications. Therefore, cloud monitoring systems need to be advanced and customized to the diversity, scalability, and high dynamicity of cloud environments. In Sect. 4, we analyzed in detail the main evaluation dimensions of monitoring. As discussed, not all of those dimensions are adopted by monitoring systems in either the open-source or commercial domains. Though most of these dimensions, which are basically related to performance, have been addressed by the research community and have received some attention, considerably more effort is essential to achieve a higher level of maturity in monitoring cloud systems. Decentralized approaches are gaining more trust than centralized approaches. In contrast to unstructured P2P, structured P2P networks present a practical and more efficient approach in terms of network architecture. However, considerable study is still needed on decentralized networks with various degrees of centralization. Considering interoperability, both cloud-dependent and cloud-agnostic monitoring approaches are highly important, and both are currently supported by several monitoring systems.
Through our study, we found that cloud-dependent monitoring systems are mostly commercial, whereas cloud-agnostic monitoring systems are typically open source. We observe that quality-of-service metrics are the most important dimension of a monitoring system, and we list the quality parameters that can be monitored along with the related criteria. We also elaborate on how those quality parameters should be monitored, detected, and reported, and at which cloud layer a monitoring system should operate its monitoring processes. Further, the aggregation of multiple parameters for a consumer application is a critical aspect of monitoring; whether a monitoring system is cloud-layer specific or layer agnostic determines its visibility characteristic. All of these issues in monitoring need more study by the cloud community, and more technical improvements are still in demand. Table 4 summarizes our study of monitoring platforms against the evaluation dimensions explored in Sect. 4.

7 Conclusion and future research directions

This paper presented and discussed the state-of-the-art research in the area of cloud monitoring. In doing so, it presented several design issues and research dimensions that could be considered to evaluate a cloud computing system. It also presented several cloud monitoring tools, their features, and their shortcomings. Finally, this paper presented a taxonomy of current cloud monitoring tools with a focus on future research directions that should be considered in the development of efficient cloud monitoring systems.

[Table 4: Monitoring platforms against evaluation dimensions — network architecture (centralized/decentralized), interoperability (multi-cloud), visibility (multi-layers), SNMP, and extendable APIs — for Monitis [38], RevealCloud [39,40], LogicMonitor [41], Nimsoft [42], Nagios [31,43], SPAE [44,45], CloudWatch [46], OpenNebula [47], CloudHarmony [48], and Azure FC [49,50]]

Since monitoring has become an essential component of the whole cloud infrastructure, its elasticity has to be given high priority. Based on this fact and on the aforementioned monitoring aspects and approaches, we believe that considerable effort is required to build more reliable cloud monitoring systems. Furthermore, we found that there is a lack of reachable standards on procedure, format, and metrics to assess the development of cloud monitoring. Hence, we recommend more collaborative use of research facilities in which tools, lessons learned, and best practices can be shared among all interested researchers and professionals.

References

1. Mell P, Grance T (2011) The NIST definition of cloud computing (draft). NIST Spec Publ 800:145
2. Letaifa A, Haji A, Jebalia M, Tabbane S (2010) State of the art and research challenges of new services architecture technologies: virtualization, SOA and cloud computing. Int J Grid Distrib Comput 3
3. Cong C, Liu J, Zhang Q, Chen H, Cong Z (2010) The characteristics of cloud computing. In: 39th international conference on parallel processing workshops (ICPPW), pp 275–279
4.
Zhang S, Zhang S, Chen X, Huo X (2010) Cloud computing research and development trend. In: 2nd international conference on future networks, ICFN'10, pp 93–97
5. Ahmed M, Chowdhury ASMR, Ahmed M, Rafee MMH (2012) An advanced survey on cloud computing and state-of-the-art research issues. Int J Comput Sci Issues (IJCSI) 9
6. Atzori L, Granelli F, Pescapè A (2011) A network-oriented survey and open issues in cloud computing
7. Shin S, Gu G (2012) CloudWatcher: network security monitoring using openflow in dynamic cloud networks (or: How to provide security monitoring as a service in clouds?). In: 2012 20th IEEE international conference on network protocols (ICNP), pp 1–6
8. De Chaves SA, Uriarte RB, Westphall CB (2011) Toward an architecture for monitoring private clouds. IEEE Commun Mag 49:130–137
9. Grobauer B, Walloschek T, Stocker E (2011) Understanding cloud computing vulnerabilities. IEEE Secur Priv 9:50–57
10. Moses J, Iyer R, Illikkal R, Srinivasan S, Aisopos K (2011) Shared resource monitoring and throughput optimization in cloud-computing datacenters. In: 2011 IEEE international parallel and distributed processing symposium (IPDPS), pp 1024–1033
11. Wang L, Kunze M, Tao J, von Laszewski G (2011) Towards building a cloud for scientific applications. Adv Eng Softw 42(9):714–722
12. Wang L, Chen D, Ma Y, Wang J (2013) Towards enabling cyberinfrastructure as a service in clouds. Comput Electr Eng 39(1):3–14
13. Wang L, von Laszewski G, Younge AJ, He X, Kunze M, Tao J (2010) Cloud computing: a perspective study. New Gener Comput 28(2):137–146
14. Begoli E, Horey J (2012) Design principles for effective knowledge discovery from big data. In: Joint working IEEE/IFIP conference on software architecture (WICSA) and European conference on software architecture (ECSA), pp 215–218
15. Bryant R, Katz RH, Lazowska ED (2008) Big-data computing: creating revolutionary breakthroughs in commerce, science and society
16.
Labrinidis A, Jagadish H (2012) Challenges and opportunities with big data. In: Proceedings of the VLDB endowment, vol 5, pp 2032–2033
17. Ma Y, Wang L, Liu D, Yuan T, Liu P, Zhang W (2013) Distributed data structure templates for data-intensive remote sensing applications. Concurr Comput Pract Exp 25(12):1784–1797
18. Zhang W, Wang L, Liu D, Song W, Ma Y, Liu P, Chen D (2013) Towards building a multi-datacenter infrastructure for massive remote sensing image processing. Concurr Comput Pract Exp 25(12):1798–1812
19. Zhang W, Wang L, Ma Y, Liu D (2013) Design and implementation of task scheduling strategies for massive remote sensing data processing across multiple data centers. Softw Pract Exp. doi:10.1002/spe.2229
20. Twitter and Natural Disasters (2011) Crisis communication lessons from the Japan tsunami. http://www.sciencedaily.com/releases/2011/04/110415154734.htm. Accessed 22 Feb 2014
21. Nita M-C, Chilipirea C, Dobre C, Pop F (2013) A SLA-based method for big-data transfers with multicriteria optimization constraints for IaaS. In: 2013 11th RoEduNet international conference (RoEduNet), pp 1–6
22. Zhao M, Figueiredo RJ (2007) Experimental study of virtual machine migration in support of reservation of cluster resources. In: Proceedings of the 2nd international workshop on virtualization technology in distributed computing, p 5
23. Wang L, Chen D, Zhao J, Tao J (2012) Resource management of distributed virtual machines. IJAHUC 10(2):96–111
24. Calheiros RN, Ranjan R, Buyya R (2011) Virtual machine provisioning based on analytical performance and QoS in cloud computing environments. In: International conference on parallel processing (ICPP), pp 295–304
25. Kirschnick J, Calero A, Edwards N (2010) Toward an architecture for the automated provisioning of cloud services. IEEE Commun Mag 48:124–131
26. Ranjan R, Zhao L, Wu X, Liu A, Quiroz A, Parashar M (2010) Peer-to-peer cloud provisioning: service discovery and load-balancing.
In: Cloud computing, Springer, pp 195–217
27. Liu X, Yang Y, Yuan D, Zhang G, Li W, Cao D (2011) A generic QoS framework for cloud workflow systems. In: 2011 IEEE ninth international conference on dependable, autonomic and secure computing (DASC), pp 713–720
28. Ranjan R, Benatallah B. Programming cloud resource orchestration framework: operations and research challenges. Technical report. http://arxiv.org/abs/1204.2204. Accessed 22 Feb 2014
29. Aceto G, Botta A, de Donato W, Pescapè A (2013) Cloud monitoring: a survey. Comput Netw 57:2093–2115
30. Shao J, Wei H, Wang Q, Mei H (2010) A runtime model based monitoring approach for cloud. In: 2010 IEEE 3rd international conference on cloud computing (CLOUD), pp 313–320
31. Caron E, Rodero-Merino L, Desprez F, Muresan A (2012) Auto-scaling, load balancing and monitoring in commercial and open-source clouds
32. Spring J (2011) Monitoring cloud computing by layer, part 1. IEEE Secur Priv 9:66–68
33. Anand M (2012) Cloud monitor: monitoring applications in cloud. In: 2012 IEEE international conference on cloud computing in emerging markets (CCEM), pp 1–4
34. Kutare M, Eisenhauer G, Wang C, Schwan K, Talwar V, Wolf M (2010) Monalytics: online monitoring and analytics for managing large scale data centers. In: Proceedings of the 7th international conference on autonomic computing, pp 141–150
35. Sundaresan S, de Donato W, Feamster N, Teixeira R, Crawford S, Pescapè A (2011) Broadband internet performance: a view from the gateway. In: ACM SIGCOMM computer communication review, pp 134–145
36. Massonet P, Naqvi S, Ponsard C, Latanicki J, Rochwerger B, Villari M (2011) A monitoring and audit logging architecture for data location compliance in federated cloud infrastructures. In: IEEE international symposium on parallel and distributed processing workshops and PhD forum (IPDPSW), pp 1510–1517
37.
Davis C, Neville S, Fernandez J, Robert J-M, McHugh J (2008) Structured peer-to-peer overlay networks: ideal botnets command and control infrastructures? In: Computer security—ESORICS 2008, pp 461–480
38. Monitis (2014) http://portal.monitis.com/. Accessed 22 Feb 2014
39. RevealCloud (2014) http://copperegg.com/. Accessed 22 Feb 2014
40. RevealCloud (2014) http://sandhill.com/article. Accessed 22 Feb 2014
41. LogicMonitor (2014) http://www.logicmonitor.com/why-logicmonitor. Accessed 22 Feb 2014
42. Nimsoft (2014) http://www.nimsoft.com/solutions/nimsoft-monitor/cloud. Accessed 22 Feb 2014
43. Nagios (2014) http://www.nagios.com. Accessed 22 Feb 2014
44. SPAE (2014) http://shalb.com/en/spae/spae_features/. Accessed 22 Feb 2014
45. SPAE (2014) http://www.rackaid.com/resources/server-monitoring-cloud. Accessed 22 Feb 2014
46. CloudWatch (2014) http://awsdocs.s3.amazonaws.com/AmazonCloudWatch/latest/acw-dg.pdf. Accessed 22 Feb 2014
47. OpenNebula (2014) http://opennebula.org/documentation:rel4.0. Accessed 22 Feb 2014
48. CloudHarmony (2014) http://cloudharmony.com/. Accessed 22 Feb 2014
49. Azure FC (2014) http://www.techopedia.com/definition/26433/azure-fabric-controller. Accessed 22 Feb 2014
50. Azure FC (2014) http://snarfed.org/windows_azure_details#Configuration_and_APIs. Accessed 22 Feb 2014
51. Nathuji R, Kansal A, Ghaffarkhah A (2010) Q-clouds: managing performance interference effects for QoS-aware clouds. In: Proceedings of the 5th European conference on computer systems, pp 237–250

Copyright of Computing is the property of Springer Science & Business Media B.V. and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.
International Journal of Data and Network Science 4 (2020) 255–262
Contents lists available at GrowingScience
International Journal of Data and Network Science
homepage: www.GrowingScience.com/ijds

A practical approach to monitoring network redundancy

Richard Phillips (a), Kouroush Jenab (a,*) and Saeid Moslehpour (b)
(a) Department of Engineering and Technology Management, Morehead State University, Morehead, KY, USA
(b) College of Engineering, Hartford University, West Hartford, CT, USA

Article history: Received: July 18, 2018; Received in revised format: July 28, 2019; Accepted: September 19, 2019; Available online: September 19, 2019
Keywords: Practical Network Monitoring, SNMP Network Monitoring, Network Performance, Network Alerts, Interface Redundancy Monitoring

ABSTRACT

Computer TCP/IP networks are becoming critical in all aspects of life. As computer networks continue to improve, the levels of redundancy continue to increase. Modern network redundancy features can be complex and expensive, which leads to misconfiguration of the redundancy features. Monitoring everything is not always practical: some redundancy features are easy to detect while others are more difficult, and it is common for redundancy features to fail or contribute to a failure scenario. Incorrectly configured redundancy will lead to network downtime when the network is supposed to be redundant, presenting a false sense of security to the network operators and administrators. This research presents two scenarios that are commonly left unmonitored and looks at a practical way to deploy solutions to these two scenarios in such a way that network uptime can be improved. Implementing a practical approach to monitor and mitigate these types of failures allows the costs spent on redundancy to increase uptime, and thus increase the overall quality that is critical to a modern digital company.

© 2020 by the authors; licensee Growing Science, Canada.

1.
Introduction

A report in 2015 stated that the average adult spends 8 hours and 21 minutes sleeping per day, while that same adult spends 8 hours and 41 minutes on media devices (Davies, 2015). All these connected media devices use many forms of networking; we spend more time on networked devices than we spend sleeping. Clearly, computer networks have become critical to many modern-day activities. As the level of criticality has increased, so has the desire to build networks with higher levels of resiliency and redundancy (Bayrak & Brabowski, 2006). Computer network operators who consider the two words similar typically apply the same types of monitoring for components regardless of their role, and both approaches introduce scenarios that can be hard to monitor. This paper explains methods that establish monitoring for common redundancy and resiliency features or components. The first study looks at redundancy in the device, and the second study looks at redundancy in the connectivity between devices; these are two common scenarios for both simple and complex networks. After establishing a practical network monitoring strategy, the system can detect and possibly repair issues, leading to better uptime metrics. It is common for an organization or company to invest in more redundancy and resiliency features. These configurations can add exponential costs to the design, only to see failures continue and uptime turn into downtime. Using a more practical approach can keep the alerts from becoming background noise and focus monitoring efforts to work in conjunction with the redundant and resilient designs.

* Corresponding author. Tel: 1(606)783-9339, Fax: 1(606)783-5030. E-mail address: k.jenab@moreheadstate.edu (K. Jenab). doi: 10.5267/j.ijdns.2019.9.004

2. Monitoring overview

The most common form of network monitoring today entails the SNMP protocol.
SNMP, or Simple Network Management Protocol, comes in both "push" and "pull" varieties. When a network monitoring system polls a device for specific information, it is "pulling" data from the device to the monitoring system. The reverse, "pushing," is when the device sends SNMP data to the monitoring system when a specific event happens; this is typically associated with problems where pulling data on a regular schedule induces delays in detecting problems. The SNMP protocol uses a complex structure that must be understood. A MIB, or management information base, is the structure used in SNMP (Netak & Kiwelekar, 2006). A MIB is set up like a tree with a root and branches, and each branch has specific object identifiers (OIDs). Most monitoring systems ship with commonly used MIB trees and are set up to detect the polled device so that the correct OIDs are used for polling a basic set of facts about that device. These MIBs/OIDs have some common branches, but most are specific to the device manufacturer, which leads to large variability in the branches. For a network with 10 devices, it is easy to determine which manufacturer's MIB you are using and select the correct OIDs for each device; this is optimal but not practical or even typical. On large networks it can be far more difficult to establish the correct MIBs/OIDs that need monitoring. Most network monitoring systems have defaults established to ease initial deployment, and the network operator must then customize them to the environment. This customization becomes complex and time consuming for large networks. Often overlooked are the redundancy/resiliency features or components. It is not difficult to monitor a single interface or chassis for things like throughput (in bps, or bits per second), link utilization percentage, or errors. As stated previously, SNMP is mostly different for each manufacturer.
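The MIB tree can be pictured as a dictionary keyed by OID. The three OIDs below are standard MIB-2 objects; the lookup itself is only an illustration of how an instance OID resolves to its branch, not an SNMP client:

```python
# Standard MIB-2 branches (these OIDs are real; the tree is abridged).
MIB = {
    "1.3.6.1.2.1.1.1.0": "sysDescr",       # device description
    "1.3.6.1.2.1.1.3.0": "sysUpTime",      # time since last restart
    "1.3.6.1.2.1.2.2.1.10": "ifInOctets",  # per-interface input byte counter
}

def describe_oid(oid: str) -> str:
    """Resolve an OID to an object name, trimming trailing components so an
    instance such as 1.3.6.1.2.1.2.2.1.10.3 (ifInOctets, interface 3) still
    resolves to its branch."""
    parts = oid.split(".")
    while parts:
        name = MIB.get(".".join(parts))
        if name:
            return name
        parts.pop()
    return "unknown"

print(describe_oid("1.3.6.1.2.1.2.2.1.10.3"))  # ifInOctets
```

Manufacturer-specific branches live under the enterprises subtree (1.3.6.1.4.1), which is why the same kind of counter can sit at different OIDs on different vendors' devices.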
This manufacturer specificity creates the opportunity for an OID for a second, redundant power supply to be hard to derive, which in turn leads to a lack of native polling for that component. If the monitoring system does not handle this out of the box and the network operator does not specifically deal with a failure notification for the power supplies, then the network is at great risk of having a failure that goes undetected. Building on that scenario: a power supply that is purposed for failover fails itself and goes unnoticed, so the next failure takes down the whole device and possibly the network. In summary, the problem will present itself as a complete loss of power when, in fact, one of the power modules failed prior to the actual significant event but went undetected. Another method to monitor devices is using syslog messages, or system logged messages (Gerhards, 2009); machine data is also a commonly used name for these log messages. Syslog logfiles have been around for years. This data comes in all formats and detail levels; it is ASCII text and has no real structure. Newer logging systems have become increasingly good at creating intelligence that can add structure to this unstructured machine data and produce valuable information that can be correlated across multiple machines or services. Using this machine data can be a powerful way to solve problems associated with traditional SNMP-polled monitoring solutions. Machine logs, when used to diagnose a system, are records of system events that provide details of what the device is doing internally. With modern machine learning and big-data analytics, mining and analyzing this unstructured data moves traditional reactive support models to more proactive models of support. SolarWinds Orion (SolarWinds Worldwide, LLC) is the tool that performed the SNMP polling for this approach. This network monitoring system is multifunctional and has many features.
The system can make network device configuration changes as part of an alert remediation strategy, a feature that is part of a "self-healing" process (Quattrociocchi et al., 2014). SolarWinds Orion will also collect machine syslog data for some event and alert notifications, but this approach utilized Splunk (Splunk Inc.) to analyze the machine data used to build the algorithms for failure alerting and remediation. Network alarm systems need to evaluate events and raise them to a status that notifies the network operators. The sheer volume of alarms that can exist on a large network can overwhelm a network operations center, so tuning the thresholds that trigger alarms is part of any practical strategy for network management. The tuning process involves looking at the lifecycle of the alarms and the time an alarm stays in its various states, such as active, cleared, or ended. In a report released in 2009, 20% of network alarms automatically cleared in less than 5 minutes, while 47% automatically ended within 5 minutes (Wallin, 2009). These portions of the lifecycle suggest filtering these alarms out. It is important to consider alarm management when implementing any strategy for monitoring network redundancy; alarm fatigue should always be avoided if possible.

3. Monitoring redundancy and resiliency

Merriam-Webster defines redundancy as "serving as a duplicate for preventing failure of an entire system upon failure of a single component". Merriam-Webster also defines resiliency as "tending to recover from or adjust easily to misfortune or change" (Merriam-Webster.com). Clearly the definitions are different, yet they are often thought to mean the same thing, which can lead to a lack of adequate alerting or monitoring.
A detailed study of each term, as it relates to computer networks, helps to better understand the approach and how each complements the other when maintaining a redundant and resilient network design.

3.1 Redundant

A network designed for redundancy has multiple components that prevent a device from becoming unavailable or down. This covers things like power supplies, additional CPUs, hard disks, or memory modules. A typical example of redundancy in a network device is a device with multiple power supplies or power modules. Having multiple power supplies connected to multiple power sources eliminates outages if you lose primary power; while this does make the system resilient, the true definition is closer to redundancy. When the redundant power supply fails while the device is running on primary power, the network operator must be notified of the event. If the failure goes unnoticed, a primary power supply failure will create a network-down event for the entire device. When SNMP-monitoring a network device for basic things, like whether the whole device is up or down, there will typically be no down message, unless specifically configured, until both power supplies fail and the unit becomes unreachable. When monitoring a redundant system, we must make sure to raise alerts when a redundant device component fails or is in an unusable or degraded state. Typically, machine data is a better way to predict or identify these types of problems than simple SNMP polling, since the machine data has a greater level of detail that can be leveraged in a modern log collection system. Some operators utilize SNMP traps, or "pushes", for this time-sensitive event notification.

3.2 Resilient

Resiliency in a network device means the device or circuit can stay online and working during a failure. In some computer networks, this is the ability to route around problems using multiple paths to the same destination.
This could be defined as a form of self-healing (Quattrociocchi et al., 2014). Another form of resiliency is to build multiple paths to a logical device, made up of two physical devices, and bundle them all together so they look like a single connection. Fig. 1 is a visual representation of port channel technology with 2 or 4 member interfaces. Looking at Fig. 1, specifically Po2 on the left side, there are 2 circuits in this port channel; the extra circuit serves as a redundancy measure as well as a resiliency measure. On larger port channels there could be 4, 8, or even 16 circuits (as shown on the right of Fig. 1). If a member circuit develops problems and goes offline, the remaining circuits continue to pass data; in the case of a larger port channel, this may not be noticeable to the users or network operator from a consumed-bandwidth perspective. Therefore, alerting is critical. For example, hypothetically, if Te4/1 on either side of Fig. 1 experiences problems but continues to pass data, the data may have errors or discards on the transmit or receive side of the circuit. These types of issues create delays for customers, or they slow down the communications far beyond expected rates. While the port channel is still operational and will appear to be up and resilient, the intended design is not what the customer experiences. It would seem logical that the other 7 circuits should handle the traffic and automatically exclude the defective member, but traditional network devices do not have circuit error detection or self-healing features. Also, there are times when errors are normal, such as during a denial-of-service attack, or when a defective machine floods the network with unintended packets of data. To properly fail over these circuits, the defective member must be removed from the port channel. If the monitored interface is only detecting up or down status, the network operator is not aware of the problem when it occurs.
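The gap described here can be sketched as a simple health check over per-member state, with invented member data and an arbitrary error threshold: a member that is up yet accumulating errors looks healthy to an up/down poll but should be flagged.

```python
def port_channel_health(members: dict, error_threshold: int = 0) -> dict:
    """Classify the members of one port channel. Member names follow the
    Te4/1-style labels in Fig. 1; states and counters are illustrative."""
    suspect = [name for name, m in members.items()
               if m["up"] and m["errors"] > error_threshold]
    down = [name for name, m in members.items() if not m["up"]]
    return {"suspect": suspect, "down": down,
            "healthy": len(members) - len(suspect) - len(down)}

po2 = {
    "Te4/1": {"up": True, "errors": 5120},  # passing traffic, but erroring
    "Te4/2": {"up": True, "errors": 0},
}
print(port_channel_health(po2))  # {'suspect': ['Te4/1'], 'down': [], 'healthy': 1}
```

An up/down poll alone reports both members as healthy; the error counters are what expose Te4/1.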
This scenario, where the network operator does not know there are problems on the network yet the customers are experiencing them, is the opposite of what resiliency is supposed to accomplish for a network.

Fig. 1. A visual representation of port channel technology

4. Research

The first case study looked at a year's worth of log data from a large university network. The network comprises over 2,500 physical devices, and the logs are captured in a log collection system. A list and count of critical events was derived from a report obtained from the log entries in the log collection system, and a pareto chart of the event data was used to evaluate the most common critical log messages (see Fig. 2). Overall, there were 38 critical log messages for the time period studied. The three most common were fan failures and power supply issues. After investigation of the power supply issues, nearly 46% (7 of 17) of the failures occurred on redundant power supplies (FRU_PS_ACCESS & PWR).

Fig. 2. A pareto chart of the event data

In this scenario, the SNMP-based network monitoring system did not detect an event that was critical or problematic for the failed power-related components. The failures were reported by customers and added to the incident management system, and technicians were dispatched to repair the failure and restore the customer experience. In the case of the redundant power supply failures, an algorithm was added to the Splunk analytics engine to process the log data and detect redundant power module failures. The system was also configured to automatically open incident tickets for further technical investigation of detected events. Since the process was implemented, the network devices can experience power supply failures that would not normally be detected during SNMP polling but are still remediated using log data.
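The detection algorithm itself is not published in the study; the sketch below shows the general shape of such a rule, scanning invented log lines for the power-related mnemonics reported above (FRU_PS_ACCESS and PWR) and counting hits per device so a ticket can be opened automatically.

```python
from collections import Counter

PSU_TOKENS = ("FRU_PS_ACCESS", "PWR")

def redundant_psu_failures(log_lines):
    """Count power-supply-related events per device, assuming a simple
    'host message...' line layout (illustrative, not Splunk syntax)."""
    hits = Counter()
    for line in log_lines:
        if any(tok in line for tok in PSU_TOKENS):
            host = line.split()[0]  # first token taken as the device name
            hits[host] += 1
    return hits

# Invented example lines in the style of switch syslog output.
logs = [
    "dist-sw7 %PWR-3-FAIL: redundant power supply 2 output failure",
    "dist-sw7 %FRU_PS_ACCESS-3-NOACCESS: cannot access power supply 2",
    "core-rt1 %LINK-5-CHANGED: interface up",
]
print(redundant_psu_failures(logs))  # Counter({'dist-sw7': 2})
```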
This allows for proper network operator alerting and problem remediation. In summary, this new workflow has allowed devices that experience failures of redundant components to be repaired without causing downtime for the users. The 7 earlier failures that occurred on redundant components more than likely created downtime on redundantly designed devices; those failures could have been prevented, but they largely went unnoticed until the system experienced a total power failure.

The second case study looked at the same university network, examining port channel usage on the campus network. This university has over 4,000 fiber optic connections, and many of them are members of port channels. Root causes of downtime events suggest that these interfaces go bad without notification, or the notification is lost in an overwhelming volume of alerts. This issue was reported by the network operator prior to the study, who suggested it warranted a long-term solution. The issue is that the most important tiers of the network, core and distribution, have numerous port channels to allow for capacity and resiliency. These port channels lose members due to external issues like fiber trouble and begin to throw errors when passing traffic. The individual interface generates errors, but the circuit never goes completely down; traffic flows utilizing that port channel member are simply very slow because the traffic is resent several times due to errors. This creates a bad user experience.

The study identified core- and distribution-level port channels. After identification, custom tags were attached to the specific interfaces in the network monitoring system to categorize these interfaces for event detection. After tagging, an event detection algorithm was created to capture this segment of interfaces if they generated any of the specific errors in Table 1.
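The tagging step above can be sketched as a filter over an interface inventory that attaches a custom property to core/distribution port-channel members, so the detection algorithm only evaluates that population. The tag name, tier labels, and inventory layout below are illustrative assumptions, not the study's actual schema.

```python
# Sketch: segment core/distribution port-channel members with a custom tag
# (the study used custom properties in the monitoring system). Field names
# and the tag value are hypothetical.

inventory = [
    {"device": "core-sw-01", "interface": "Te4/1", "tier": "core", "port_channel": "Po2"},
    {"device": "dist-sw-07", "interface": "Te2/3", "tier": "distribution", "port_channel": "Po8"},
    {"device": "access-sw-19", "interface": "Gi1/0/4", "tier": "access", "port_channel": None},
]

def tag_monitored_members(interfaces):
    """Attach a 'pc-member-monitor' tag to core/distribution port-channel members."""
    for intf in interfaces:
        if intf["tier"] in ("core", "distribution") and intf["port_channel"]:
            intf["tag"] = "pc-member-monitor"
    return [i for i in interfaces if i.get("tag") == "pc-member-monitor"]

print([i["interface"] for i in tag_monitored_members(inventory)])  # ['Te4/1', 'Te2/3']
```

The access-tier interface is deliberately left untagged: its errors are handled by normal alerting, not by the port-channel remediation algorithm.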
These types of errors are indicative of a member that should be removed from a port channel and investigated or repaired (Wallin, 2009).

Table 1. Specific errors

Error type           Duration     Typical causes
Input Errors (RX)    Last hour    Fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Input Errors (RX)    Today        Fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Output Errors (TX)   Last hour    Flow control problems, fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding
Output Errors (TX)   Today        Flow control problems, fiber or cabling inconsistencies, malformed packets, speed or duplex mismatch, queue or buffer flooding

The detected events generated both alarms and actions. The alarms logged the event in the event list, while the actions used the device information to remove the member from the port channel. Each action was logged in the event list, and an email was sent to the network operator queues to give notice of the issue along with the remediation steps taken. The network operators noted over 300 interfaces that met this criterion. After further review, the event thresholds were adjusted to look only for the high-probability offenders. It was noted that some interfaces generate errors when the switch buffers or forwarding queues get full; this does not indicate a defective interface but rather traffic that is too great, or too many packets per second to process, and it requires traffic engineering to pinpoint a solution. The study excluded these types of errors. By contrast, if a fiber is damaged but not severed, the error counts grow by large numbers every hour, and after two polling periods the problem can be detected by a process that is in control. This is where the study focused as an acceptable approach.
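The "two polling periods" criterion above can be sketched as a check that the error counter keeps growing across consecutive polls, which separates the damaged-fiber pattern from a one-off buffer burst. The growth floor and sample values are illustrative assumptions.

```python
# Sketch: alert only when the error counter grows across two consecutive
# polling periods, mirroring the study's criterion for a process in control.
# One-off bursts (e.g. a transient queue overrun) are filtered out.

def steadily_erroring(samples, min_growth=50):
    """samples: error-counter readings from three consecutive polls.

    Returns True when both of the last two polling deltas show sustained
    growth, the signature of damaged (but not severed) fiber.
    """
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    return all(d >= min_growth for d in deltas[-2:])

print(steadily_erroring([100, 900, 1800]))  # True  -> damaged-fiber pattern
print(steadily_erroring([100, 900, 900]))   # False -> one-off burst, no alert
```

Tuning `min_growth` is how the study's threshold adjustment (from 300 matching interfaces down to the high-probability offenders) would be expressed in this sketch.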
5. Methodology

These studies implemented specific solution sets using specific products, but the approach will work with most enterprise-class monitoring software systems. There are 5 basic steps in this approach. The steps are slightly different when hardware replacement is the normal resolution of the problem (e.g., power supply failures). The two workflows are shown in Fig 3.

Port channel workflow: (1) add device to monitoring system; (2) identify the port channel and its member interfaces; (3) build an alert algorithm that detects errors or discards; (4) when the alert is fired, script the removal of the offending interface from the port channel; (5) alert the network operator of the alert and response.

Hardware replacement workflow: (1) add device to monitoring system; (2) configure device for logging to log collector (Splunk); (3) build an alert algorithm that detects power supply errors; (4) when the alert is fired, script the opening of an incident ticket to have the device repaired by support; (5) alert the network operator of the ticket creation and repair status.

Fig. 3. Monitor workflow

Step 1: add the device to the network management system. This step is straightforward. In adding the device, it is critical that the interfaces and the port channel be polled for status and error counts. If the network management system does any kind of configuration management, including the allowance of script execution, remediation of the event is more than likely possible; if not, any form of auto-healing will be difficult. A fallback option, typically a standard feature of network management systems, is to alert an operator and continue alerting until the network operator acknowledges the issue.

Step 2: identify the member interfaces of the port channel and segment them out. The network management system can accomplish this step; Solarwinds Orion, for example, allows custom properties for all interfaces and devices (nodes). If that option is not available, renaming the interfaces on the device may allow for segmentation that the detection algorithm can use.
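The "script the removal" action in the port channel workflow above can be sketched as a small command builder. The command syntax shown is Cisco IOS-style and purely illustrative; a real deployment would push the commands through the monitoring system's script-execution or configuration-management feature rather than print them.

```python
# Sketch of the workflow's remediation step: generate the CLI commands that
# pull a defective member out of its port channel. Cisco IOS-style syntax is
# assumed for illustration only.

def removal_commands(interface, channel_group):
    """Build the config commands to remove `interface` from its port channel."""
    return [
        f"interface {interface}",
        f"no channel-group {channel_group}",
        "shutdown",  # keep the bad member from flapping back into service
    ]

for cmd in removal_commands("Te4/1", 2):
    print(cmd)
```

Leaving the member administratively down until a technician clears the fault matches the workflow's final step: the operator is notified of both the alert and the action taken.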
Either way, the interfaces need to be detectable and distinguishable, so their events do not look like the normal event messages. In the case of redundant hardware, the machine data needs to be identified from manufacturer literature or device testing. Once the log messages are identified, they can be placed into the correct location of the algorithm.

Step 3: build an alert algorithm that detects errors or discards. When configuring the event/alert, it is critical that the search for member interfaces use the segmentation strategy from step 2. This allows for custom notifications or actions that will cut through any network alert noise. It is important to look at errors both inbound and outbound. It is not uncommon for a circuit between two monitored systems to experience errors in one direction only. For example, an inbound error on device A (receiving interface) might be expected to correspond to an output error on device B (sending interface); this is not the case, and if the algorithm only detects outbound errors, the process will miss any inbound errors.

Step 4: when the alert is fired, script the appropriate action for the alert. In the case of a defective interface the next step may be the removal of...
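The directional point in step 3 can be illustrated directly: evaluate RX and TX counters independently on each end of the circuit, since an inbound error on device A does not guarantee a matching outbound error on device B. The data layout below is a hypothetical stand-in for polled counter values.

```python
# Sketch of step 3: check RX and TX error counters independently on both
# ends of a circuit. Counter layout and threshold are illustrative.

def directions_with_errors(end_a, end_b, threshold=0):
    """Return the (device, direction) pairs whose counter exceeds the threshold."""
    findings = []
    for device, counters in (("A", end_a), ("B", end_b)):
        for direction in ("rx_errors", "tx_errors"):
            if counters[direction] > threshold:
                findings.append((device, direction))
    return findings

# Device A receives errors while device B reports a clean transmit side:
# an algorithm watching only TX counters would miss this circuit entirely.
print(directions_with_errors({"rx_errors": 420, "tx_errors": 0},
                             {"rx_errors": 0, "tx_errors": 0}))
# [('A', 'rx_errors')]
```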
Network Monitoring Plan

Name
Institutional Affiliation
Lecturer
Course
Date

Network Monitoring Plan
The network is changing daily due to the evolution of technology. Networks have evolved from flat topologies to environments where everything is connected, forming complex designs that incorporate technologies such as remote users, wireless, IoT, and VPNs. Although much about the network is changing, the need for network monitoring software has remained constant. Network monitoring plays an essential role in validating networks and ensuring the provision of quality network services (Abbasi et al., 2021). Rather than reacting to network issues after they happen, a network monitoring plan ensures that issues are addressed or controlled before they occur. Network monitoring planning ensures that network performance is monitored, possible security threats are identified, and insight is gained to prevent problems. The network monitoring plan entails the following:
Network Performance Monitoring Tools and Probes
Network performance monitoring tools and probes are the first step in the network monitoring plan, where network performance is monitored using specific tools and probes. Network performance monitoring (NPM) is the process of optimizing, visualizing, troubleshooting, reporting on, and tracking the health of the network. Network performance tools and probes collect data and monitor network activities. In some areas, network performance tools make use of telemetry (Rahman et al., 2019). Telemetry is the automated recording and transmission of data from remote or inaccessible sources to information technology systems for analysis and monitoring. There are different types of telemetry, including simple network management protocol (SNMP) data, network flow data, and packet data. SNMP is responsible for monitoring and managing network devices and their functions. It is the most widely supported and used protocol because it delivers critical diagnostic information about the devices, so network failures can be easily detected.
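One practical detail of SNMP-based monitoring is that interface errors are exposed as monotonically increasing counters (for example, IF-MIB's ifInErrors), so a monitor must diff successive polls and allow for 32-bit counter wrap. The sketch below shows only that delta logic; actual polling (e.g., with an SNMP library) is omitted and the sample values are simulated.

```python
# Sketch: compute errors accrued between two SNMP polls of a Counter32
# object such as IF-MIB ifInErrors (OID 1.3.6.1.2.1.2.2.1.14). Polling
# itself is omitted; inputs are simulated counter readings.

COUNTER32_MAX = 2**32

def counter_delta(previous, current):
    """Errors accrued between two polls, handling 32-bit counter wrap."""
    if current >= previous:
        return current - previous
    return COUNTER32_MAX - previous + current  # counter wrapped past 2^32

print(counter_delta(1000, 1400))              # 400 new errors this period
print(counter_delta(COUNTER32_MAX - 10, 30))  # 40: counter wrapped
```

Without the wrap handling, a monitor would report a huge negative delta the first poll after wrap and could fire a false alarm or miss a real one.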
The other type is network flow data, generated by network devices such as switches and routers. Flow data provides essential information about network users, traffic patterns, and peak usage times. Flow data can also be used in troubleshooting networks to offer a holistic information vi...
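The traffic-pattern analysis described above can be sketched as a simple aggregation of flow records into per-source byte totals ("top talkers"). The record fields and addresses below are illustrative stand-ins for exported NetFlow/IPFIX-style data.

```python
# Sketch: aggregate flow records into per-source byte totals to surface
# the top talkers on the network. Record layout is a hypothetical
# simplification of NetFlow/IPFIX export data.
from collections import Counter

flows = [
    {"src": "10.0.1.5", "dst": "10.0.9.9", "bytes": 120_000},
    {"src": "10.0.1.5", "dst": "10.0.9.7", "bytes": 80_000},
    {"src": "10.0.2.8", "dst": "10.0.9.9", "bytes": 30_000},
]

def top_talkers(records, n=2):
    """Return the n sources that moved the most bytes, largest first."""
    totals = Counter()
    for record in records:
        totals[record["src"]] += record["bytes"]
    return totals.most_common(n)

print(top_talkers(flows))  # [('10.0.1.5', 200000), ('10.0.2.8', 30000)]
```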

