SDN

Kubernetes: The Modern-day Kernel / November 15, 2019 by James Kelly

In the lead-up to KubeCon + CloudNativeCon North America 2019, we are posting a series of blogs covering what we expect to be some of the most prevalent topics at the event. In our first post, we walked through the journey from the monolithic to the microservices-based application, outlining the benefits of speed and scale that microservices bring, with Kubernetes as the orchestrator of it all.

Kubernetes may appear to be the new software-defined, a panacea like software-defined networking (SDN), famously personified at Juniper by JKitty—the rainbow-butterfly-unicorn kitten. But you know what they say about the butterfly effect. When the Kubernetes kitty flaps its wings…there’s a storm coming. Kubernetes is indeed amazing—but not yet amazingly easy. Along with a shift to a microservices architecture, Kubernetes may create as many challenges as it solves.

Breaking applications into microservices means security and networking are critical because the network is how microservices communicate with each other to integrate into a larger application. Get it wrong, and it’s a storm indeed.

This week at KubeCon, Juniper is showcasing solutions in the areas where we’re best known for engineering prowess: at-scale networking and security.

As luck would have it, these strengths line up perfectly with the challenges that Kubernetes creates. Let’s look at some of them a little more closely.

The Opportunities and Challenges of Microservices and Kubernetes

A well-architected microservices-based application can shrug off the loss of a single container or server node, as long as an orchestration platform is in place to ensure that enough instances of the right services are active to meet demand. Microservices-based applications are, after all, designed to add and remove individual service instances on an as-needed basis.
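The idea of keeping "enough instances of the right services" active can be sketched as a comparison between desired and running instance counts. This is a minimal illustration in Python, not the Kubernetes API or its controller code; the function name `reconcile` and the service names are hypothetical.

```python
def reconcile(desired: dict, running: dict) -> dict:
    """Return the per-service instance delta needed to reach the desired state.

    A positive value means "start this many instances"; a negative value
    means "stop this many". Services already at the desired count are omitted.
    """
    deltas = {}
    for name in sorted(desired.keys() | running.keys()):
        delta = desired.get(name, 0) - running.get(name, 0)
        if delta:
            deltas[name] = delta
    return deltas


if __name__ == "__main__":
    # Hypothetical state: one "web" instance was lost with a failed node,
    # and a "legacy" instance is running that nobody asked for.
    desired = {"web": 3, "cart": 2}
    running = {"web": 2, "cart": 2, "legacy": 1}
    print(reconcile(desired, running))
```

An orchestrator would run a comparison like this continuously and act on the deltas, which is why the loss of a single container or node need not affect the application as a whole.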

This sort of scale-out, fault-tolerant orchestration is what Kubernetes does, along with many additional constraint-based scheduling features. For this reason, Kubernetes is often called the operating system kernel for the cloud. In many ways, it’s a powerful process scheduler for a cluster of distributed microservices-based applications. But a kernel isn't everything that an application needs to function.

In true Juniper form of solving the hardest problems in our industry, we are engineering simplicity — tackling challenges in storage, security, networking and monitoring. We know from our customers and own experience that if you’re managing your own Kubernetes deployments, these challenges need to be squarely addressed in order to successfully manage this new IT platform.

One way to tackle problems is to outsource them. Kubernetes as a service is one approach to simplicity, but it also comes with trade-offs in cost and multicloud portability and uniformity. Therefore, operating Kubernetes clusters will be in the cards for most enterprises—and Juniper is here to help.

Don’t Limit Your Challenges, Challenge Your Limits

Kubernetes operators often deal with its challenges by limiting the size of clusters.

Smaller clusters restrict the security and reliability blast radius, and their pygmy-scaled demands on monitoring, storage and networking look easier to meet than those of one larger shared cluster. In some cases, each application team will deploy its own cluster. In other cases, each development lifecycle phase (dev, test, staging and production) has its own Kubernetes cluster. All variants aside, many Kubernetes operators are deploying small clusters; for example, 10-20 nodes and a few applications in each one. Yet Kubernetes can scale a couple of orders of magnitude beyond this.

The result of many small clusters already has a name: Kube sprawl.

While there may be benefits to small-cluster design, this approach quickly introduces new challenges, not containing complexity but merely shifting it around. Managing many clusters also means juggling more Kubernetes versions in flight, along with more upgrades and patching, not to mention more engineers to do all of this or the added task of building your own automation to do so.

Moreover, there is the obvious drawback that since each server or VM node can only belong to one Kubernetes cluster, efficiencies are lost that a larger shared cluster would afford. Resource efficiencies and economies of scale come when there is a great variety of applications, teams, batches and peak usage times.

If a small cluster is just running a few apps, it’s unlikely that the apps will be diverse enough to steadily use its whole pool of resources, so there will be times of waste. Operators can try to make the number of cluster nodes itself elastic like the containerized applications, but this orchestration is difficult to automate and build per the unique demands of applications inside each small cluster. Given that each Kubernetes cluster needs at least three server nodes for high availability, that overhead is also replicated across every cluster maintained.
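To put rough numbers on that replicated overhead, here is a minimal Python sketch. It assumes, purely for illustration, that the three high-availability nodes in each cluster are dedicated and run no application workloads; the fleet sizes are hypothetical.

```python
# Illustrative arithmetic only: the node counts below are hypothetical.
HA_NODES_PER_CLUSTER = 3  # minimum for high availability, per the text


def ha_overhead(total_nodes: int, clusters: int) -> float:
    """Fraction of all nodes consumed by per-cluster HA overhead.

    Assumes (for illustration) that the three HA nodes in each cluster
    are dedicated and run no application workloads.
    """
    if clusters < 1:
        raise ValueError("need at least one cluster")
    overhead_nodes = HA_NODES_PER_CLUSTER * clusters
    if overhead_nodes > total_nodes:
        raise ValueError("not enough nodes for that many HA clusters")
    return overhead_nodes / total_nodes


if __name__ == "__main__":
    # 300 nodes run as twenty 15-node clusters versus one shared cluster.
    print(f"20 small clusters: {ha_overhead(300, 20):.0%} overhead")
    print(f"1 shared cluster:  {ha_overhead(300, 1):.0%} overhead")
```

Under these assumptions, the same 300 nodes give up a fifth of their capacity to high-availability overhead when split into 15-node clusters, compared with one percent in a single shared cluster.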

Many clusters create new challenges for developers. While cloud-native tools, such as microservices tracing, exist to benefit developers, these and other middleware services are generally designed to work within, not across, clusters. Other new tools like service meshes can be more complex when federated across clusters.

Applications Span Edge Clouds and Multicloud, So Too Must Cloud-native Infrastructure

Kubernetes is good at managing a single cluster of servers and the containers that run on them. But some applications will run in multiple clusters to span multiple fault domains or availability zones (AZ). This isn’t about running multiple small clusters inside the same data center AZ, but rather about spanning data centers for additional availability, better global coverage and lower user latency.

Solving security, networking, monitoring and storage per cluster is a first step, but strong solutions should deal with the challenges of federating multiple clusters across AZs, regions, edge cloud and multicloud. Here again, Juniper has some market leadership to show off at KubeCon.

Security and networking become more complex in this scenario, as they play a more important role in global application architecture. The network must link together microservices, as well as multicloud. Defense in depth must protect one cluster, but defense in breadth is an additional requirement: policy must be enforced equally end-to-end as well as top-to-bottom.

You Can’t Always Leave the Legacy Behind

In the real world of enterprise, the services used by cloud-native applications will range from "that bit still runs on a mainframe" to "this bit can be properly called a microservice." So how can organizations provide a secure, performant and scalable software-defined infrastructure for an application made out of radically different types of services, each running on multiple different kinds of infrastructures and possibly from multiple providers located all around the world?

Here again, solving security, network, monitoring and storage for Kubernetes is all well and good, but what about managing the legacy in VMs and on bare metal? Many new-fangled tools for Kubernetes end where Kubernetes does, but applications don’t. Neither do operations, and yet another tool for ops teams to learn and deploy adds to the burden instead of reducing it. Technology and tooling today must solve new cloud-native problems, but the best tools will solve the hardest problem today: operations simplicity. This can only happen by spanning the boundaries of new and old, and maintaining evolvability for the future.

Solutions for Operational Simplicity

By this point, it’s easy to imagine what’s coming next. Juniper has been successfully building high-performance, scalable systems for a long time.

Stay tuned for our next blog that is set to explore how Juniper brings operational simplicity to Kubernetes users and beyond. In the meantime, find out more about Juniper’s cloud-native solutions at juniper.net/cloud-native.

Bringing Operational Simplicity to Data Centers with Contrail Insights / November 11, 2019 by James Kelly

Data centers are the epitome of infrastructure automation, and their modern manifestation, the cloud, provides an almost magical platform for its users. To construct clouds, separation of concerns into layers of abstraction, like network overlays and service API encapsulations, helps enable service agility and innovation. But do these layers curb complexity, or merely mask it?

The truth is, it’s a struggle to understand how all the magic happens behind the curtain of cloud infrastructure. Willfully blind reliability can be a house of cards, with applications stacked upon services, stacked upon a cloud platform, stacked upon data center infrastructure. If the foundation of the cloud architecture—the network—wobbles, or doesn’t live up to its SLA metrics, then issues reverberate all the way up the stack.

Demystifying this magic to identify root causes is a deeply complex problem faced by all data center operators. To thwart and unmask such complexity in the data center, we have engineered a solution that will shine a light on some of the most elusive troubleshooting and analytics issues faced today.

Introducing Contrail Insights

Juniper’s Contrail Insights simplifies multicloud operations with monitoring, troubleshooting and optimizing functions based on telemetry collection, policy rules, artificial intelligence and an intuitive user interface for analysis and observability. It works with VMware, OpenStack, Kubernetes and public cloud environments, as well as private cloud data center infrastructure, and it provides visibility across the network, servers and workloads.

Contrail Insights is available standalone or as part of Contrail Enterprise Multicloud, where it’s combined with Contrail Networking and Contrail Security. It’s available with Contrail Cloud for service providers. This next evolution of AppFormix has now been fully merged into the Contrail Command user interface and Contrail APIs for this trifecta of Contrail Networking, Security and Insights products.

In our new release of Contrail Insights, we have greatly expanded the analytics and observability features well beyond what AppFormix previously offered.

Seeing below the tip of the infrastructure iceberg

As revealed in Juniper’s 2019 State of Network Automation Report, monitoring is the most time-consuming day-to-day task for network and security operations teams. As teams automate more, monitoring is increasingly the cornerstone of operations since there are fewer changes to manually perform.

Contrail Insights is now doing for monitoring and troubleshooting what Contrail Networking and Contrail Security did for data center, cloud and cloud-native orchestration. Seeing through the layers of entangled automation makes monitoring and troubleshooting possible via the instrumentation of Contrail Insights.

Let’s take a closer look at some of the other new features:

An intuitive tour of topology

A good place to start is the topology view. Contrail Insights shows the fabric of spine and leaf switches and links, all the way down to the servers and their hosted workloads.

The topology viewer works well for data centers of any size. There are smart arrangement presets, as well as the ability to customize the display through an intuitive user interface. The user can drag and drop, select and move groups of nodes and links at once, and zoom in and out. With large topologies, this provides broad visibility and a quick way to focus that visibility as needed.

Visualize and analyze with a heatmap

Switch, link and server resources show up in the topology view with a configurable heat map. Heat maps can be based on switch resource usage, server resource usage or link usage. For example, the indicators show the heat map color scale based on link bytes, packets, or relative utilization.

The right panel provides further controls. Analysis can be done in a contextual way through the topology, with mouse-over tooltips and clicking on resources to present detailed and customizable analytics shown in the right panel with charts, graphs and tables. Contrail Insights provides statistics, both in real-time and on a historical basis. Using the calendar to navigate back to a certain period is immensely valuable for troubleshooting past issues like microbursts or intermittent hot spots.

Powerful querying and root-cause analysis with drill downs

The top-N view features a larger-scale bar chart or table in the main panel of the user interface, allowing operators to explore multi-parameter queries in more detail and to sort the top N results.

To build a query in this mode, the right panel is where the query filters are set. All query fields are populated with drop-down results, so that the user doesn’t have to guess or remember the resource names. This makes it easy to find traffic in a virtual network or between two points. This is also made easy by selecting points or links in the topology view and then clicking the top-N button from that view to enter the top-N interface with filters preset to what was selected in the topology.

Using the drill-down button in the table of results will recurse through the search results. This can aid in sifting through traffic volumes to find an exact flow or traffic group that may be an issue. For example, if a link is running hot, the user can query between the source and destination nodes and then drill down through the traffic volumes of culprit overlays, protocols, and flows.
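The drill-down idea can be sketched as repeated top-N aggregation, where each step keeps the previous filters and adds the row that was selected. The Python below is an illustrative sketch only: the flow records and field names are hypothetical and are not the Contrail Insights schema or API.

```python
from collections import defaultdict

# Hypothetical flow records; the field names are illustrative only.
FLOWS = [
    {"src": "leaf1", "dst": "leaf2", "overlay": "vn-blue", "protocol": "tcp", "bytes": 900},
    {"src": "leaf1", "dst": "leaf2", "overlay": "vn-blue", "protocol": "udp", "bytes": 100},
    {"src": "leaf1", "dst": "leaf2", "overlay": "vn-red", "protocol": "tcp", "bytes": 300},
    {"src": "leaf3", "dst": "leaf2", "overlay": "vn-red", "protocol": "tcp", "bytes": 5000},
]


def top_n(flows, group_by, n=10, **filters):
    """Sum bytes per value of `group_by` for flows that match every filter."""
    totals = defaultdict(int)
    for flow in flows:
        if all(flow[key] == value for key, value in filters.items()):
            totals[flow[group_by]] += flow["bytes"]
    # Largest totals first, truncated to the top N rows.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    hot_link = {"src": "leaf1", "dst": "leaf2"}
    overlays = top_n(FLOWS, "overlay", **hot_link)
    print(overlays)
    # Drill down: keep the link filter, add the top overlay, regroup by protocol.
    culprit = overlays[0][0]
    print(top_n(FLOWS, "protocol", overlay=culprit, **hot_link))
```

Each drill-down narrows the filter set by one more field, which mirrors moving from a hot link to the culprit overlay and then to its protocols and flows.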

Troubleshooting with path finder

For each search result row in the top-N view, there’s also a simple find-path button to jump into the Contrail Insights path finder tool. This is a simple way to get into the path finder interface with the right-panel filters all preset to match the context of the row in the previous view, but you can also build path finder queries from scratch.

The path finder tool is ideal for troubleshooting in a visual way, with the heatmap-contextual topology. It displays the path through the network topology for a specific flow or particular set of traffic parameters, and it presents an elegant solution to the problems of overlay-underlay correlation.

Traffic groups—for example a given 3-tuple of source IP, destination IP, and protocol—can be balanced across multiple paths in a data center fabric. Path finder highlights the breakdown of the amount of traffic per path. In the right panel, paths are broken down across a bar chart, showing their relative share and allowing selection of those bars for individual subgroups of the traffic taking one path through the network. Also in the right panel, there is a line graph showing the traffic bandwidth over the selected time window.
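The per-path breakdown for a traffic group amounts to summing bytes per path for one 3-tuple and dividing by the group total. Here is a minimal Python sketch, assuming hypothetical samples that already carry the fabric path each flow took; it does not reflect how Contrail Insights stores or computes this data.

```python
from collections import Counter

# Hypothetical samples: (src_ip, dst_ip, protocol, path, bytes).
SAMPLES = [
    ("10.0.0.1", "10.0.1.9", "tcp", ("leaf1", "spine1", "leaf2"), 600),
    ("10.0.0.1", "10.0.1.9", "tcp", ("leaf1", "spine2", "leaf2"), 300),
    ("10.0.0.1", "10.0.1.9", "tcp", ("leaf1", "spine1", "leaf2"), 100),
    ("10.0.0.7", "10.0.1.9", "udp", ("leaf1", "spine2", "leaf2"), 999),
]


def path_shares(samples, traffic_group):
    """Return {path: share of bytes} for one (src_ip, dst_ip, protocol) group."""
    per_path = Counter()
    for src, dst, proto, path, nbytes in samples:
        if (src, dst, proto) == traffic_group:
            per_path[path] += nbytes
    total = sum(per_path.values())
    if total == 0:
        return {}  # no traffic matched the group
    return {path: nbytes / total for path, nbytes in per_path.items()}


if __name__ == "__main__":
    group = ("10.0.0.1", "10.0.1.9", "tcp")
    for path, share in path_shares(SAMPLES, group).items():
        print(" -> ".join(path), f"{share:.0%}")
```

The resulting shares correspond to the bars in the right-panel chart: one bar per path, sized by that path's portion of the group's traffic.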

Overlay to underlay correlation

Imagine troubleshooting incidents when an application team is experiencing issues. In such cases, the network engineer only knows the overlay information, such as workload endpoints (i.e., source and destination IP addresses), and needs to find out the path in the underlay network. This requires correlating the overlay networks with the physical underlay data center networks.

Path finder shows the topology with the link path highlighted for the end-to-end path, workload to workload, traversing the server hosts and switching fabric. Because Contrail is overlay and underlay aware, it has all the context to filter on the appropriate domain, tenant, or virtual network, as well as the source and destination of the workload IPs. This is easily filtered in the path finder right panel to reveal the path through the fabric topology shown in the main panel. When the pressure is on for network engineers to show network innocence or find a problem, path finder is a leap forward in troubleshooting.

Underlay to overlay correlation

In the reverse scenario to the above, imagine the NetOps team must determine which applications are using the most bandwidth between two points in the physical network. The network engineer knows only the underlay physical switch IP addresses or interfaces and would like to know the top workloads whose overlay traffic is using the path between those two points.

From the top-N view, the user can select the overlay source and destination, along with other fields of interest, to present in the results. Then, in the right panel as query parameters, the user sets the filter to match the underlay source and destination switches or interfaces. The table view, or particularly its bar-chart view, shows the distribution of top overlay flows between the two switches. Now, to illustrate the result in a topology view, the user simply clicks the flow result’s find-path button in the given row. Presto! Contrail will render the path finder view for the end-to-end flow, clearly illustrating which part of the switch fabric the traffic is taking.

Resource consumption for a given tenant’s virtual network

In this use case, a data center operator wants to know the server and network resource consumption across a tenant or to drill down into more specific consumption at some points.

Starting in the topology view, the user can set up the heatmap configuration for a given time range and filter on just one tenant at the level of the source virtual network field. The topology heatmap highlights will activate for all links, servers and network nodes participating in that virtual network. Hovering the mouse over the highlighted resources shows a quick tooltip view of the resource consumption contextual to our single virtual network. To drill down further, simply click on any resource in the map and the right panel will present the default charts and tables that can be reconfigured to suit the search. For more detailed analysis the user can contextually launch into the top-N and path finder tools.

Incredible. Insightful.

By now this illustrative blog has given you a good taste of the power of Contrail Insights.

If you’re joining us at NXTWORK this week, be sure to check out the breakout session on “Insights and Operational Simplicity” and the demos in Enterprise Multicloud kiosks. You can also binge Contrail demos to your heart’s content in our YouTube playlist on Contrail Enterprise Multicloud. When you’re ready to judge for yourself, ask your Juniper account team or partner for a demo.

Software-Defined, a Decade Later / April 10, 2019 by James Kelly

Podcast of this blog on YouTube.

The first African-American president had been sworn in, the king of pop had just passed, @realDonaldTrump had taken to tweeting, and Avatar was about to smash box-office records, but what I remember about the summer of 2009 is the dawn of software-defined.

Looking forward to tapas and sunglasses, I landed in Barcelona to present at SIGCOMM on what was then a niche topic: network programmability.

I was feeling good about my presentation on Service Creation with the Junos SDK (now known as JET). I’d trained many people on it, so I knew everything from its use cases to APIs like the back of my hand. Gung-ho, as I shook hands with people in the room, I was greeted with enthusiastic faces.

A decade later, I don’t remember the talk. I do remember its Q&A. Raised hands eclipsed my talk with many questions comparing our SDK to a similar demo and presentation from the day before that I had missed. Well I’ll be damned. I had followed—you guessed it—OpenFlow.

For me that was SDN’s big bang moment. The ensuing unbridled enthusiasm for OpenFlow percolated up from this academic setting into a frenzy of new foundations and projects, controllers, APIs, and a smorgasbord of overlay and control protocols. As the networking industry was soon flush with SDN startups, many established players SDN-washed anything that resembled software, and soon that spread to a software-defined movement that impacted all things infrastructure.

Back then, “en primeur” SDN looked like a dicey proposition, but its preteen years have seen it mature in the data center and WAN. Today, software-defined is almost a norm, expected to be poured into all places in network, to keep up with the operational reliability and speed demanded of these times of cloud-charged disruption.

As multicloud infrastructure, the new IT platform, permits and needs greater automation, digital operations teams are impelled to a new status quo: software-defined and AI-driven. While we still hear of SDDC, SD-WAN and SD-Branch, expectations today are that all places in network are ruled with software. And soon, if not already, that software must be AI-fueled for smarter AIOps and reliability engineered like DevOps.

Putting aside expectations and buzzwords, let’s objectively look at how SDN is moving along. 2019’s litmus test is a far cry from the old standby of control and data plane separation.

How is SDN Evolving in the Open?

It’s impossible to write an account of SDN without giving meaningful consideration to openness.

In some enterprise engineering teams, closed and proprietary solutions are contraband, but when it comes to SDN, there are few open-source projects simple enough to operationalize right off the shelf of GitHub or DockerHub. Commercially available products still carry the day, but a predilection for open source hangs heavy in the air as it brings the benefits of a community for discussion and sharing of automated workflow tools, frameworks, tests and playbooks.

While open source can introduce economies of multi-vendor engineering, testing and co-creation with code-savvy customers, it’s not replacing open standards. If such a replacement is looming, it’s narrowly in the cloud-native space of securely connecting Kubernetes application clusters and service meshes. But where SDN meets hardware top to bottom, and where multiple SDNs meet to federate and span domain boundaries end to end, interoperability isn’t yet forged in open source. Multi-vendor interoperability and multi-SDN system federation hinges on open standards-based protocols, especially those truly proven and widespread standards.

Interoperability prevents technical debt and increases customer freedom. However, customer freedom is an unlikely strategy for the largest vendor incumbents, which is why we still see some rival SDNs bound to vendor hardware and vendor workload orchestration systems.

Interoperability should also not be confused with overlays, which are merely agnostic to the IP network on which they ride. Overlays provide separation of concerns, and like interoperability, they afford a way to insert into brownfield environments. But without interoperable network boundaries, overlay SDN solutions are islands eventually destined to become technical debt, the ball and chain to IT progress.

Security is Shifting Left

SDN openness may be in various shades of optional, but everyone understands security is a must because from the network engineer to the board of directors, people today are acutely aware of the specter of security breaches and attacks.

Whether it is multi-tenancy, micro-segmentation, mutual identity authentication, or secure SD-WAN based on next-gen firewalls, security has shifted left in both software development and network development, moving earlier on the timeline of project considerations. Infosec used to be a rubber stamp at the end of projects; now security is foundational and a first priority.

With the progression of SDN, instead of only traditional secure perimeters, network security is now getting measured out in coarse- and fine-grained means for multi-factor defense in depth. And in the race against advanced and automated threats, software-defined systems have also made it simpler to manage security policies and automatically enforce them by applying protection within the network faster than ever before.

How is SDN Evolving Automation?

In the aftermath of the early SDN hype, the industry experienced progression on the front of orchestrating operations and regression from the focus on automating them.

The evolution of controllers made automation turnkey, focusing on what instead of how. In data centers, the dynamic machinery inside of the network was automated in step with workflow orchestrators like vSphere, OpenStack, and today, more so Kubernetes and OpenShift. In the case of SD-WAN, orchestration meant dealing in zero-touch branch onboarding and choosing WAN policy levels for application traffic. Automation to regulate traffic steering across the hybrid WAN uplinks was baked in.

These are oversimplifications of SDN applications, but across SDN domains, the common thread is that (1) control is centralized and abstracted above the device level to span a distributed fleet of infrastructure; and (2) controllers automate workflows, orchestrating, reducing and simplifying steps. The industry attention spent adapting to software-defined has, until recently, been mostly about these benefits of moving from commanding device CLIs to orchestrating controller GUIs. But orchestration implies automated networks, and changes nothing about automated network-ing. Automated network operations (NetOps) is the next frontier.

While workflows are leveled up and centralized with SDN, so are workflow APIs, another key aspect of SDN systems. With this, there is also the possibility to automate NetOps: evolving to automated testing, troubleshooting, change controls, service-level indicators, and service-level reliability. This is an area of network automation focused less on the automation technology inside software-defined networks, and more on the processes and people that will software-define operations.

In this transformational movement that network engineers have dubbed network reliability engineering (NRE), a subject we at Juniper have promoted a good deal, technology also plays an important role. SDN systems are important so that engineers are building on SDN abstractions and workflows and doing so with centralized APIs instead of down at each device.

Analytics and AI Are Driving the Future of Automation

Big data analytics, machine learning and AI are certainly sensationalized, but that’s because they show promise. If they can be applied to customer trends, voter preferences, and many other fields with complex data points, they can also be applied to IT. A prerequisite to this is already found in SDN systems: a central point of management. And networks are teeming with data to be centrally collected and processed.

Many SDN systems already incorporate telemetry and analytics, and some of them leverage AI. If they don’t, you can bet it’s on the roadmap. And analytics and AI are important ingredients in automation, both within SDN systems and for improving the operations-contextual automated NetOps on top of SDN.

Since Juniper just acquired Mist, a timely example is the AI-driven Mist Cloud. Combining data points from Wi-Fi and BLE radio, wired, and wide-area networks, the Mist Cloud software uses AI to crunch location-addled degradations and dynamics. It’s able to stabilize network reliability and raise user experience problems, even outside of the WLAN. On the NetOps side of things, Mist’s AI-powered assistant, Marvis, smartens up the human touch and mitigates mistakes.

Another example is found in network reliability engineering. In the data-driven SRE and NRE culture, you can’t manage what you can’t measure. Many SDN systems have higher-order reporting, metrics and centralized telemetry APIs, aiding in the creation of service-level indicators (SLIs). SLIs are important because they map to objectives about the stakeholder-oriented service levels instead of SNMP or metrics that are meaningless to those outside of networking.
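As a concrete illustration of an SLI, one common formulation is the ratio of good events to valid events, compared against a stated objective. The Python below is a minimal sketch under that assumption; the latency samples, the 100 ms threshold and the 95% objective are all hypothetical, and real SLIs would be built from a system's own telemetry.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Service-level indicator as the ratio of good events to valid events."""
    if valid_events <= 0:
        raise ValueError("need at least one valid event")
    return good_events / valid_events


def latency_sli(latencies_ms, threshold_ms: float) -> float:
    """Share of requests served within `threshold_ms`."""
    latencies = list(latencies_ms)
    good = sum(1 for value in latencies if value <= threshold_ms)
    return sli(good, len(latencies))


if __name__ == "__main__":
    # Hypothetical samples and objective, for illustration only.
    samples = [12, 18, 25, 31, 40, 44, 52, 67, 95, 240]
    objective = 0.95  # "95% of requests complete within 100 ms"
    measured = latency_sli(samples, threshold_ms=100)
    print(f"SLI = {measured:.0%}, objective met: {measured >= objective}")
```

Framing the measurement this way is what makes it meaningful to stakeholders: "90% of requests finished within 100 ms" is a statement about service, whereas a raw interface counter is not.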

NREs also manage errors and outages by automating around them for continuous improvement. As SDN systems begin to incorporate AI-powered predictive analytics, NREs can use those signals to better uphold service levels with automated proactive steps to complement reactive remediation.

Clouds are in the Forecast

The cloud is replete with as-a-service IT offerings, but many SDN systems are still “shrink wrapped software” as downloads to run on premises or in private clouds.

Many networking teams would like to use or at least evaluate SDN, but jumping headlong into unboxed software like SDN is not accessible to smaller networking teams when they have the added work of learning to run and maintain SDN systems. For this reason, cloud-delivered SDN offerings are kindling more SDN adopters. This week, Juniper announced that it is taking its SD-WAN solution and launching the option for Contrail SD-WAN as a service.

Cloud-based SDN for all domains of networking seems interesting but won’t be right for all customers nor all use cases. Cloud-based data collection comes with mixed benefits and drawbacks.

For some, sending telemetry data to the cloud may be prohibited or prohibitively inefficient, especially in cases where certain telemetry requires a quick reaction. On the other hand, to the cloud’s benefit, if many networking teams don’t have the means to operate SDN, then they surely don’t have the means to run sophisticated analytics or AI systems on premises. A cloud service solves for simplicity. It also solves for quicker vendor innovation cycles and smarter processing because the economies of scale for data storage, analytics and now hardware-assisted AI processing based in the cloud are enormous and incomparable.

In summary, the first software-defined decade has been spectacular, and seasoned solutions are seeing a lot of production. Now, with so much contributing to SDN’s innovation and reinvention, I look forward to the next odyssey and seeing what will define software-defined. Drop a comment below to share your own anticipations.

image credit moritz320/pixabay