It’s the End of Network Automation as We Know It (and I Feel Fine) by James Kelly

This article was originally posted on July 5th on The New Stack at

Network automation does not an automated network make. Today’s network engineers are frequently guilty of two indulgences. First, random acts of automation hacking. Second, pursuing aspirational visions of networking grandeur — complete with their literary adornments like “self-driving” and “intent-driven” — without a plan or a healthy automation practice to take them there.

Can a Middle Way be found, enabling engineers to set achievable goals, while attaining the broader vision of automated networks as code? Taking some inspiration from our software engineering brethren doing DevOps, I believe so.

How Not to Automate

There’s a phrase going around in our business: “To err is human; to propagate errors massively at scale is automation.”

Automation is a great way for any organization to speed up bad practices and #fail bigger. Unfortunately, when your business is network ops, the desire to be a cool “Ops” kid with some “Dev” chops — as opposed to just a CLI jockey — will quickly lead you down the automation road. That road might not lead you to those aspirational goals, although it certainly could expand your blast radius for failures.

Before we further contemplate self-driving, intent-driven networking, and every other phrase that’s all the rage today (although I’m just as guilty of such contemplation as anyone else at Juniper), we should take the time to define what we mean by “proper” in the phrase, “building an automated network properly.”

If you haven’t guessed already, it’s not about writing Python scripts. Programming is all well and good, but twenty minutes of design often really does save about two weeks of coding. To start hacking at a problem right away is probably the wrong approach. We need to step back from our goals, think about what gives them meaning, apply those goals to the broader picture, and plan accordingly.

To see what is possible with automation, we should look at successful patterns of automation outside of networking and the reasons behind them, so we may avoid the known bad habits and anti-patterns, and sidestep avoidable pitfalls. For well-tested patterns of automation, we needn’t look any further than the wealth of knowledge and experience in the arena of DevOps.

It matters what we call things. For better or worse, a name focuses the mind. The overall IT strategy to improve the speed, resilience, quality and intelligence of our applications is not called automation or orchestration. While ITIL volumes make their steady march into museums, the new strategy to enable business speed and smarts is incontrovertible, and it’s called DevOps.

Initially that term may evoke a blank page or even a transformation conundrum. But you can learn what it means to practice successful DevOps culture, processes, design, tooling and coding. DevOps can define your approach to the network, which is why we ought not promote network automation (which could focus the mind on the wrong objectives) and should instead talk about DevNetOps: the application of DevOps patterns to networking.

Networks as Code

The idea of infrastructure-as-code (IaC) has been around for a while, but surprisingly has seldom been applied to networking. Juniper Networks (where I hang my hat) and other networking vendors like Apstra have made some efforts over the years to move folks in this direction, but there is still a lot of work to do. For example, Juniper has had virtual form factors of most series of hardware systems, projects like Junosphere for network modeling in the cloud (many of us now use Ravello), and impressive presentations on IaC and professional services consulting. Juniper’s senior marketing director Mike Bushong (formerly with Plexxi) wrote about the network as code back in 2014.

IaC is generally well applied to cloud infrastructure, but it’s way harder to apply to bare metal. For evidence of this, just look at Triple-O, Kubernetes on Kubernetes, or Kubernetes on OpenStack on Kubernetes! That bottom, metal layer is quite the predicament.

To me, this means in networking, we should be easily applying IaC to software defined networking (SDN). But can we apply it to our network devices and manage the physical network? I asked Siri, and she said it was a trick question. As an armchair architect myself, I don’t have all the answers.  But as I see it, here are some under-considered aspects for designing networks as code with DevNetOps:

1. Tooling

In tech, everyone loves shiny objects, so let’s start there. Few network operators, even those who have learned some programming, are knowledgeable about the ecosystem of DevOps tooling, and few consider applying those same tools to networking. Learning Python and Ansible is just scratching the surface. There is a vast swath of DevOps tools for CI/CD, site reliability engineering (SRE), and at-scale cloud-native ops.

2. Chain the tool chain: a pipeline as code

When we approach the network as code, we need to consider network elements and their configurations as building blocks created in a code-development pipeline of dev/test/staging/production phases. Stringing together this pipeline shouldn’t be a manual process; it should be mostly coded to be automatic.

As with software engineering, there are hardware and foundational software elements with network engineering, such as operating systems that the operator will not create themselves, but rather just configure and extend. These configurations and extensions, with their underlying dependencies, can be built together, versioned, tested, and delivered. Thinking about the network as an exercise in development, automation should start in the development infrastructure itself.
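As a rough illustration of stringing the phases together in code, here is a minimal Python sketch. The stage names come from the paragraph above; everything else (`ConfigArtifact`, the gate functions, the MTU bounds) is a hypothetical stand-in rather than any real tool's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names mirror the dev/test/staging/production phases in the text.
STAGES = ["dev", "test", "staging", "production"]


@dataclass
class ConfigArtifact:
    """A versioned network configuration moving through the pipeline (hypothetical)."""
    version: str
    config: dict
    promoted_to: List[str] = field(default_factory=list)


def has_hostname(artifact: ConfigArtifact) -> bool:
    # Illustrative lint-style gate: every device config needs a hostname.
    return bool(artifact.config.get("hostname"))


def mtu_in_range(artifact: ConfigArtifact) -> bool:
    # Illustrative test-style gate; the bounds are assumptions, not a standard.
    return 1280 <= artifact.config.get("mtu", 1500) <= 9216


Gate = Callable[[ConfigArtifact], bool]


def run_pipeline(artifact: ConfigArtifact, gates: Dict[str, List[Gate]]) -> ConfigArtifact:
    """Promote the artifact stage by stage, stopping at the first failed gate."""
    for stage in STAGES:
        if not all(gate(artifact) for gate in gates.get(stage, [])):
            break
        artifact.promoted_to.append(stage)
    return artifact
```

With gates such as `{"dev": [has_hostname], "test": [mtu_in_range]}`, a config with a bad MTU would be promoted to dev and then stop, never reaching staging or production. The point is only that promotion is decided by code, not by a person copying configs between environments.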

3. Immutable infrastructure

Virtualization and especially containers have made the concept of baking images very accessible, and immutable infrastructure popular. While there is still much work to do with network software disaggregation, containerization, and decoupling of services, there are many benefits of adopting immutable infrastructure that are equally applicable to networking. Today’s network devices are poster children for config drift, but to call them “snowflakes” would be an insult to actual snowflakes.

Applying principles of immutable infrastructure, I imagine a network where each device PXE-boots into a minimal OS and runs signed micro-service building blocks. Each device has declarative configs, decoupled secret management and rotation, and logging and other monitoring data with good overall auditability and traceability, all of which is geared to take the network off the box ASAP.
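To make the config-drift point concrete, here is a small Python sketch that compares a declared config against what a device reports. It assumes configs can be represented as plain dictionaries; the function names are hypothetical and no particular vendor format is implied.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Hash a config in a canonical form so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_drift(declared: dict, running: dict) -> dict:
    """Return {key: (declared_value, running_value)} for each top-level key that differs."""
    drift = {}
    for key in sorted(set(declared) | set(running)):
        if declared.get(key) != running.get(key):
            drift[key] = (declared.get(key), running.get(key))
    return drift
```

In an immutable model, a fingerprint mismatch would not be patched in place. The device would be rebuilt from the declared config, which is what keeps snowflakes from forming.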

Interestingly, practices such as SSH’ing into boxes would be rendered impossible, and practices that “savvy” network automators do today like running Ansible playbooks against an inventory of devices would be banished.

4. Upgrades

Upgrades to network software and even firmware/microcode on devices could be managed automatically, by means of canary tests and rolling upgrade patterns. To do this on a per-box or per-port basis, or at finer levels of flows or traffic-processing components, we need to be able to orchestrate traffic balancing and draining.

If that sounds complex, we can make things simpler. We could treat devices and their traffic like cattle instead of pets, and rely on their resilience. Killing and resurrecting a component would restart it with a new version. While this is suitable for some software applications, treating traffic as disposable is not yet desirable for all network applications and SLAs. Still, it would go a long way toward properly designing for resilience.
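The drain, upgrade, check, restore loop can be sketched in a few lines of Python. This is a simplified model under stated assumptions: devices are dictionaries, "draining" is just a flag standing in for real traffic steering, and `is_healthy` is a caller-supplied check. The first device in the list effectively acts as the canary.

```python
def rolling_upgrade(devices, new_version, is_healthy):
    """Upgrade devices one at a time and stop at the first failed health check.

    Returns the names of the devices that were upgraded successfully.
    """
    upgraded = []
    for device in devices:
        previous = device["version"]
        device["draining"] = True        # stand-in for steering traffic away
        device["version"] = new_version  # stand-in for the actual upgrade
        ok = is_healthy(device)
        if not ok:
            device["version"] = previous  # roll this device back
        device["draining"] = False       # stand-in for restoring traffic
        if not ok:
            break                         # leave the rest of the fleet untouched
        upgraded.append(device["name"])
    return upgraded
```

Because the loop halts on the first unhealthy device, a bad image is contained to one box rather than propagated "massively at scale".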

5. Resilience

One implication of all this is the presence of redundancy in the network paths. As with any networking component, that’s very important for resilience. Drawing inspiration from scale-out architectures in DevOps and the microservices application model, redundancy and scale would go hand-in-hand by means of instance replication. So redundancy would neither be 1:1 nor active-passive. We should always be skeptical of architectures that include those anti-patterns.

Good design would tolerate a veritable networking chaos monkey. Burning down network software would circuit-break to limit failures. Killing links and even boxes, we would quickly re-converge as we often do today, but dead boxes, dead SDN functions or dead virtual network functions would act like phoenix servers, rising back up or staying down in case of repeated failures or detected hardware failures.
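The "rise back up or stay down" behavior can be sketched as a small supervisor. This is a toy model with a hypothetical `PhoenixSupervisor` class and an assumed restart limit; real systems would also consider time windows and backoff.

```python
class PhoenixSupervisor:
    """Decide whether a failed component is resurrected or left down."""

    def __init__(self, max_restarts=3):
        self.max_restarts = max_restarts  # assumed limit, tune per service
        self.restarts = {}                # component name -> restarts so far
        self.down = set()                 # components deliberately kept down

    def on_failure(self, name, hardware_fault=False):
        """Return 'restart' or 'stay_down' for a component that just died."""
        if name in self.down or hardware_fault:
            self.down.add(name)
            return "stay_down"
        count = self.restarts.get(name, 0)
        if count >= self.max_restarts:
            self.down.add(name)           # repeated failures: stop resurrecting
            return "stay_down"
        self.restarts[name] = count + 1
        return "restart"
```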

The pattern for preventing black-swan event failures is to practice forcing these failures, and thus practice automating around them, so that the connectivity or other network service and its performance is tolerant and acceptably resilient on its own SLA measuring stick, whatever the meta-service in question may be.

Doing DevNetOps

In each one of these above topics lies much more complexity than I will dive head-first into here. By introducing them here, my aim has been to demonstrate there are interesting patterns we may draw from, and some operators are doing so already. If you’ve ever heard the old Zen Buddhist koan of the sound of one hand clapping, that’s the sound you’re likely to hear from your own forehead, once the obviousness of applying DevOps to DevNetOps hits you squarely in the face.

Just as the hardest part of adopting DevOps is often cited to be breaking off one manageable goal at a time and focusing on that, I think we’ll find the same is true of DevNetOps. Before we even get there in networking, I think we need to properly scope the transformation of applying DevNetOps to the challenges and design of networking, especially with issues of basic physical connectivity and transport.

While “network automation” leads the mind to jump to things like applying configuration management tooling and programming today’s manual tasks, DevNetOps should remind us that there is a larger scope than mere automation coding. That scope includes culture, processes, design and tools. Collectively, they may all lead us to a happier place.

Title image of a Bell System telephone switchboard, circa 1943, from the U.S. National Archives.

Are Service Meshes the Next-Gen SDN? by James Kelly

June 28, 2017 update: more awesome background on service meshes, proxies and Istio in particular on yet another new SE Daily podcast with Istio engineers from Google.

June 26, 2017 update: For great background on service meshes (a relatively new concept) check out today's podcast on SE Daily with the founder of Linkerd.

Whether you’re adopting a containers- or functions-as-a-Service stack, or both, in the new world of micro-services architectures, one thing that has grown in importance is the network because:

  1. Micro-service application building blocks are decoupled from one another over the network. They re-integrate over the network with APIs as remote procedure calls (RPC) that have evolved from the likes of CORBA and RMI, past web services with SOAP and REST, to new methods like Apache Thrift and the even-fresher gRPC: a new CNCF project donated by Google for secure and fast HTTP/2-based RPC. RPC has been around for a long time, but now the network is actually fast enough to handle it as a general means of communication between application components. That allows us to break down monoliths where service modules would previously have been bundled or coupled with tighter API communications based on package includes and libraries, or even the more esoteric IPC that some of us may remember.
  2. Each micro-service building block scales out by instance replication. The front-end of each micro-service is thus a load balancer which is itself a network component, but beyond that, the services need to discover their dependent services, which is generally done with DNS and service discovery.
  3. To boost processing scale out and engineer better reliability, each micro-service instance is often itself decoupled from application state and its storage. The state is saved over the network as well. For example, using an API into an object store, a database, a k/v-store, a streaming queue or a message queue. There is also good-ol’ disk, but such disk and accompanying file systems too, may be virtual network-mounted volumes. The API- and RPC-accessible variants of storing state are, themselves, systems that are micro-services too, and probably the best example of using disks in fact. They would also incorporate a lot of distributed storage magic to deliver on whatever promises they make, and that magic is often performed over the network.

Hopefully we’re all on the same page now as to why the network is important micro-services glue. If this was familiar to you, then maybe you already know about cloud-native SDN solutions and service meshes too.

The idea and implementation of a service mesh is fairly new. The topic is also garnering a lot of attention because service meshes handle the main networking challenges listed above (especially #1 and #2), and much more in the case of new projects like the CNCF’s Linkerd and the newly launched Istio.

Since I’ve written about SDN for container stacks before, namely OpenContrail for Kubernetes and OpenShift, I’m not going to cover it super deeply. Nor am I going to cover service meshes in general detail except to make comparisons. I will also put some references below and throughout. And I’ve tried to organize my blog by compartmentalizing the comparisons, so you can skip technical bits that you might not care for, and dive into the ones that matter most to you.

So on to the fun! Let’s look at some of the use cases and features of service meshes and compare them to SDN solutions, mostly OpenContrail, so we can answer the question posed in the title. Are service meshes the “Next-Generation” of SDN?

Automating SDN and Service Meshes

First, let’s have a look at 3 general aspects of automation in various contexts where SDN and service meshes are used: 1 - programmability, 2 - configuration and 3 - installation.

1.     Programmability
When it comes to automating everything, programmability is a must. Good SDNs are untethered from the hardware side of networking, and many, like OpenContrail, offer a logically centralized control plane with an API. The main two service meshes introduced above do this too, and they follow an architectural pattern similar to SDNs of centralized control plane with a distributed forwarding plane agent. While Istio has a centralized control plane API, Linkerd is more distributed but offers an API through its Namerd counterpart. Most people would probably say that the two service meshes’ gRPC API is more modern and advantageous than the OpenContrail RESTful API, but then again OpenContrail’s API is very well built-out and tested compared to Istio’s still-primordial API functions.

2.     Configuration
A bigger difference than the API is in how functionality can be accessed. The service meshes take in YAML to declare configuration intent that can be delivered through a CLI. I suppose most people would agree that’s an advantage over SDNs that don’t offer that (at least OpenContrail doesn’t today). In terms of a web-based interface, the service meshes do offer those, as do many SDNs. OpenContrail’s web interface is fairly sophisticated after 5 years of development, yet still modern-feeling and enterprise friendly.

Looking toward “network as code” trends however, CLI and YAML are codable and version-controllable more easily than, say, OpenContrail’s API calls. In an OpenStack environment OpenContrail can be configured with YAML-based Heat templates, but that’s less relevant for the container-based K8s and OpenShift world. In a K8s world, OpenContrail SDN configuration is annotated into K8s objects. It’s intentionally simple, so it’s just exposing a fraction of the OpenContrail functionality. It remains to be seen what will be done with K8s TPRs, ConfigMaps or through some OpenContrail interpreter of its own.

3.     Installation
When it comes to getting going with Linkerd, having a company behind it, Buoyant, means anyone can get support, but getting through day-one looks pretty straightforward on one’s own anyway. Deployed with Kubernetes in the model of a DaemonSet, it is straightforward to use out of the box.

Istio is brand new, but already has Helm charts to deploy it quickly with Kubernetes thanks to our friends at Deis (@LachlanEvenson has done some amazing demo videos already –links below). Using Istio, on the other hand, means bundling its Envoy proxy into every Kubernetes pod as a sidecar. It’s an extra step, but it looks fairly painless with the kube-inject command. Sidecar vs. DaemonSet considerations aside, this bundling is doing some magic, and it’s important to understand for debugging later.

When it comes to SDNs, they’re all different with respect to deployment. OpenContrail is working on a Juniper-supported Helm chart for simple deployment, but in the meantime there are Ansible playbooks and other comparable configuration management solutions offered by the community.

One thing OpenContrail has in common with the two service meshes is that it is deployed as containers. One difference is that OpenContrail’s forwarding agent on each node is both a user-space component and a kernel module (or DPDK-based or SmartNIC-based). Both are containerized, but the kernel module’s container is only there for installation purposes, to bootstrap the insmod installation. You may feel ambivalent towards kernel modules… The kernel module will obviously streamline performance and integration with the networking stack, but the resources it uses are not container-based, and thus not resource restricted, so resource management is different than for, say, a user-space sidecar process. Anyway, this is the same deal as using the kube-proxy or any IP tables-based networking, which the OpenContrail vRouter replaces.

SDN and Service Meshes: Considerations in DevOps

When reflecting on micro-services architectures, we must remember that the complexity doesn’t stop there. There is also the devops apparatus to manage the application through dev, test, staging and prod, and through continuous integration, delivery, and response. Let’s look at some of the considerations:

1.     Multi-tenancy / multi-environment
In a shared cluster, code shouldn’t focus on operational contexts like operator or application dev/test environments. To achieve this, we need isolation mechanisms. Kubernetes namespaces and RBAC help this, but there is still more to do. I’ll quickly recap my understanding of the routing in OpenContrail and service meshes to better dissect the considerations for context isolation.

OpenContrail for K8s recap: One common SDN approach to isolation is overlay networks. They allow us to create virtual networks that are separate from each other on the wire (different encapsulation markings) and often in the forwarding agent as well. This is indeed the case with OpenContrail, but OpenContrail also allows higher-level namespace-like wrappers called domains/tenants and projects. Domains are isolated from each other, projects within domains are isolated from each other, and virtual networks within projects are isolated from each other. This hierarchy maps nicely to isolate tenants and dev/test/staging/prod environments, and then we can use a virtual network to isolate every micro-service. To connect networks (optionally across domains and projects), a policy is created and applied to the networks that need connecting, and this policy can optionally specify direction, network names, ports, and service chains to insert (for example, a stateful firewall service).
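The isolation hierarchy and connecting policy described above can be modeled in a few lines of Python. This is a toy data model for illustration, not OpenContrail's API: `VirtualNetwork`, `Policy`, and `allowed` are hypothetical names, and service chains are left out.

```python
from typing import NamedTuple


class VirtualNetwork(NamedTuple):
    """A network identified by its place in the domain/project hierarchy."""
    domain: str
    project: str
    name: str


class Policy(NamedTuple):
    """Connects two otherwise isolated networks on specific ports."""
    src: VirtualNetwork
    dst: VirtualNetwork
    ports: frozenset
    bidirectional: bool = False


def allowed(src, dst, port, policies):
    """Traffic inside one network flows; across networks it needs a matching policy."""
    if src == dst:
        return True
    for policy in policies:
        forward = policy.src == src and policy.dst == dst
        reverse = policy.bidirectional and policy.src == dst and policy.dst == src
        if (forward or reverse) and port in policy.ports:
            return True
    return False
```

Because the domain and project are part of a network's identity, a "web" network in a staging project is a different network from "web" in prod, and it gains no access from a policy written for prod.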

The way these domains, projects, and networks are created for Kubernetes is based on annotations. OpenContrail maps namespaces to their own OpenContrail project or their own virtual network, so optionally micro-services can all be reachable to each other on one big network (similar to the default cluster behavior). There are security concerns there, and OpenContrail can also enforce ACL rules and automate their creation as a method of isolating micro-services for security based on K8s object annotations or implementing Kubernetes NetworkPolicy objects as OpenContrail security groups and rules. Another new kind of annotation on objects like K8s deployments, jobs or services would specify the whole OpenContrail domain, project, and virtual network of choice. Personally, I think the best approach is a hierarchy designed to match the structure of devops teams and environments, one that makes use of the OpenContrail model of segmentation by domain, project and network. This is (unfortunately) in contrast to the simpler yet more frequently used global default-deny rule and the ever-growing whitelist that ensues, which turns your cluster into Swiss cheese. Have fun managing that :/

The overlay for SDN is at layer 2, 3 and 4, meaning that when the packet is received on the node, the vRouter (in OpenContrail’s case) will receive the packet destined to it and look at the inner header (the VXLAN ID or MPLS LSP number) to determine the domain/tenant, project and network. Basically, the number identifies which routing table will be used as a lookup context for the inner destination address, and then (pending ACLs) the packet is handed off to the right container/pod interface (per CNI standards).
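The two-step lookup described above (encapsulation number selects a routing table, then the inner destination is looked up in that table) can be sketched as follows. This is a simplified model with hypothetical names and made-up IDs and interfaces; a real vRouter does this in the kernel with far more efficient data structures.

```python
import ipaddress


def lookup(routing_tables, encap_id, inner_dst):
    """Select a routing table by encapsulation ID, then longest-prefix-match inner_dst.

    routing_tables maps an encapsulation ID to {prefix: interface}.
    Returns the interface name, or None if nothing matches.
    """
    table = routing_tables.get(encap_id)
    if table is None:
        return None
    address = ipaddress.ip_address(inner_dst)
    best = None
    for prefix, interface in table.items():
        network = ipaddress.ip_network(prefix)
        if address in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, interface)
    return best[1] if best else None
```

Since the table is chosen before the address is examined, two tenants can reuse the same inner IP range without their traffic ever colliding.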

Service mesh background: The model of Istio’s Envoy and Linkerd insofar as they are used (which can be on a per-microservice basis), is that there is a layer-7 router and proxy in front of your microservices. All traffic is intercepted at this proxy, and tunneled between nodes. Basically, it is also an overlay at a higher layer.

The overlay at layer-7 is conceptually the same as SDN overlays except that the overlay protocol over the wire is generally HTTP or HTTP2, or TLS with one of those. In the DaemonSet deployment mode of Linkerd, there is one IP address for the host and Linkerd will proxy all traffic. It’s conceptually similar to the vRouter except in reality it is just handling HTTP traffic on certain ports, not all traffic. Traffic is routed and destinations are resolved using a delegation tables (dtabs) format inherited from Finagle. In the sidecar deployment model for Linkerd or for Istio’s Envoy (which is always a sidecar), the proxy is actually in the same container network context as each micro-service because it is in the same pod. There are some IP tables tricks they do to sit between your application and the network. In Istio Pilot (the control plane) and Envoy (the data plane), traffic routing and destination resolution is based primarily on the Kubernetes service name.
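To give a feel for delegation-table routing, here is a toy prefix-rewrite model in Python. It is only loosely inspired by dtabs: it assumes that later rules take precedence and that rewriting repeats until no rule matches, and it ignores wildcards, fallback between alternatives, and namers. The paths and the `delegate` function are hypothetical.

```python
def delegate(path, dtab, max_steps=16):
    """Rewrite a logical path using ordered (prefix, target) rules.

    Later rules win. Rewriting repeats until no rule matches, at which
    point the path is treated as resolved.
    """
    for _ in range(max_steps):
        for prefix, target in reversed(dtab):
            # Match whole path segments only, so /svc does not match /svcs.
            if path == prefix or path.startswith(prefix + "/"):
                path = target + path[len(prefix):]
                break
        else:
            return path
    raise RuntimeError("delegation did not terminate")
```

With rules like `("/svc", "/env/prod")` followed by a more specific `("/svc/canary", "/env/staging/canary")`, ordinary service names resolve into prod while anything under `/svc/canary` is steered to staging, which hints at how tenant and environment layers could be expressed.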

With that background, here are a few implications for multi-tenancy.

Let’s observe that in the SDN setup, the tenant, environment and application (network) classification happens in the kernel vRouter. In service mesh proxies, we still need a CNI solution to get the packets into the pod in the first place. In Linkerd, we need dtab routing rules that include tenant, environment and service. Dtabs seem to give a good way to break this down that is manageable. In the sidecar mode, more frequently used for Envoy, it’s likely that the pod in which traffic ends up already has a K8s namespace associated with it, and so we would map a tenant or environment outside of the Istio rules, and just focus on resolving the service name to a container and port when it comes to Envoy.

It seems that OpenContrail here has a good way to match the hierarchy of separate teams, and separate out those RBAC and routing contexts. Linkerd dtabs are probably a more flexible way to create as many layers of routing interpretation as you want, but it may need a stronger RBAC to allow the splitting of dtabs among team tenants for security and coordination. Istio doesn’t do much in the way of isolating tenants and environments at all. Maybe that is out of scope for it which seems reasonable since Envoy is always a sidecar container and you should have underlying multi-tenant networking anyway to get traffic into the sidecar’s pod.

One more point is that service discovery is baked into the service mesh solutions, but it is still important in the SDN world, and systems that include DNS (OpenContrail does) can help manage name resolution in a multi-tenant way as well as provide IP address management (like bring your own IPs) across the environments you carve up. This is out of scope for service meshes, but with respect to multiple team and dev/test/staging/prod environments, it may be desirable to have the same IP address management pools and subnets.


2.     Deployment and load balancing
When it comes to deployment and continuous delivery (CD), the fact that SDN is programmable helps, but service meshes have a clear advantage here because they’re designed with CD in mind.

To do blue-green deployments with SDN, it helps to have floating IP functionality. Basically, we can cut over to green (float a virtual IP to the new version of the micro-service) and safely float it back to blue if we needed to in case of an issue. As you continuously deliver or promote staging into the non-live deployment, you can still reach it with a different floating IP address. OpenContrail handles overlapping floating IPs to let you juggle this however you want to.

Service mesh routing rules can achieve the same thing, but based on routing switchovers at the HTTP level that point to, for example, a newer backend version. What service meshes further allow is a gradual traffic rollover, sending a small percentage of traffic to the new version at first and then all of it. That effectively gives you a canary deployment that is traffic load-oriented, as opposed to a Kubernetes rolling upgrade or the Kubernetes deployment canary strategy, which gives you a canary that is instance-count based and relies on the load balancing across instances to partition traffic.
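A traffic-percentage canary can be sketched independently of any mesh. The Python below is an illustration under assumptions, not how Istio or Linkerd implement it: it hashes a hypothetical request ID into one of 100 buckets, and the version labels `v1` and `v2` are made up.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of requests to v2, regardless of instance counts."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "v2" if bucket < canary_percent else "v1"
```

Raising `canary_percent` from 1 to 100 shifts load gradually, and because the bucket is derived from the request ID, a request that hit the canary at 10 percent still hits it at 50 percent. An instance-count canary cannot promise either property, since the split depends on how many replicas of each version happen to exist.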

This brings us to load balancing. By default, balancing traffic between the instances of a micro-service happens through the K8s kube-proxy controller, which programs IP tables. There is a bit of a performance and scale advantage here in using OpenContrail’s vRouter, which uses its own ECMP load balancing and NAT instead of the kernel’s IP tables.

Service meshes also handle such load balancing. They support wider-ranging features, both in terms of load balancing schemes like EWMA and in terms of conditions for ejecting an instance from the load balancing pool, such as when it is too slow.
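For readers unfamiliar with EWMA (exponentially weighted moving average), here is a bare-bones Python sketch of latency-aware balancing with ejection. It is a simplification with hypothetical names and an assumed smoothing factor; production proxies use more refined variants that also account for outstanding requests.

```python
class EwmaBalancer:
    """Pick the instance with the lowest smoothed latency, skipping slow ones."""

    def __init__(self, instances, alpha=0.3, eject_above_ms=None):
        self.alpha = alpha                    # weight given to the newest sample
        self.eject_above_ms = eject_above_ms  # optional ejection threshold
        self.latency = {name: 0.0 for name in instances}

    def observe(self, name, latency_ms):
        """Fold a new latency sample into the instance's moving average."""
        previous = self.latency[name]
        self.latency[name] = self.alpha * latency_ms + (1 - self.alpha) * previous

    def pool(self):
        """Instances currently eligible for traffic."""
        if self.eject_above_ms is None:
            return list(self.latency)
        return [n for n, ms in self.latency.items() if ms <= self.eject_above_ms]

    def pick(self):
        """Choose the eligible instance with the lowest average latency."""
        candidates = self.pool() or list(self.latency)  # never return nothing
        return min(candidates, key=lambda n: self.latency[n])
```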

Of course service meshes do also handle load balancing for ingress HTTP frontending. Linkerd and Istio integrate with the K8s Ingress as ingress controllers. While most SDNs don’t seem to offer this, OpenContrail does have a solution here that is based on haproxy, an open source TCP proxy project. One difference is that OpenContrail does not yet support SSL/TLS, but there are also K8s pluggable alternatives like nginx for pure software-defined load balancing.

3.     Reliability Engineering
Yes, I categorize SRE and continuous response under the DevOps umbrella. In this area, since service meshes are more application-aware, it’s no surprise that they do the most to further the cause of reliability.

When it comes to reliably optimizing and engineering performance, one point here from above is that EWMA and such advanced load balancing policies will assist in avoiding or ejecting slow instances, thus improving tail latency. A Buoyant article about performance addresses performance in terms of latency directly. Envoy and Linkerd are after all TCP proxies, and unpacking and repacking a TCP stream is seen as notoriously slow if you’re in the networking world (I can attest to this personally recalling one project I assisted with that did HTTP header injection for ad placement purposes). Anyway, processors have come far, and Envoy and Linkerd are probably some of the fastest TCP proxies you can get. That said, there are always the sensitive folks that balk at inserting such latency. I thought it was enlightening that in the test conducted in the article cited above, they’ve added more latency and steps, but because they’re also adding intelligence, they’re netting an overall latency speed up!

The consensus seems to be that service meshes solve more problems than they create (added latency being one they create). Are they right for your particular case? As somebody highly quoted once said, “it depends.” As is the case with DPI-based firewalls, these kinds of traffic processing applications can have great latency and throughput with a given feature set or load, but wildly different performance when certain features are turned on or under load. Not that it’s a fair comparison, but the lightweight stateless processing that an SDN forwarding agent does is always going to be way faster than such proxies, especially when, like for OpenContrail, there are smart NIC vendors implementing the vRouter in hardware.

Another area that needs more attention in terms of reliability is security. As soon as I think of a TCP proxy, my mind wonders about protecting against a DoS attack because so much state is created to track each session. One way that service meshes nicely solve this is through the use of TLS. While Linkerd can support this, Istio makes this even easier because of the Istio Auth controller for key management. This is a great step toward not only securing traffic over the wire (which SDNs could do too with IPsec etc.), but also providing strong identity-based AAA for each micro-service. It’s worth noting that these proxies can change the wire protocol to anything they can configure, regardless of whether it was initiated as such from the application. So an HTTP request could be sent as HTTP2 within TLS on the wire.

I’ll cap off this section by mentioning circuit breaking. I don’t know of any means that an SDN solution could do this very well without interpreting a lot of analytics and application tracing information and feeding that back into the API of the SDN. Even if that is possible in theory, service meshes already do this today as a built-in feature to gracefully handle failures instead of having them escalate.
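For anyone new to the pattern, a circuit breaker can be sketched in Python as below. This is a generic, minimal version with assumed thresholds and hypothetical names, not the implementation found in any particular proxy. Time is passed in explicitly to keep the example deterministic.

```python
class CircuitBreaker:
    """Stop calling a failing backend for a while instead of piling on."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self, now):
        """Return True if a request may be sent at time `now` (seconds)."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return now - self.opened_at >= self.reset_after_s

    def record(self, success, now):
        """Report the outcome of a request that was allowed through."""
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now  # open (or re-open) the circuit
```

The caller fails fast whenever `allow` returns False, which is what keeps one sick micro-service from dragging down everything that depends on it.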

4.     Testing and debugging
This is an important topic, but there’s not really an apples-to-apples comparison of features, so I’ll just hit prominent points on this topic separately.

Service meshes provide an application RPC-oriented view into the intercommunication in the mesh of micro-services. This information can be very useful for monitoring and ops visibility and during debugging by tracing the communication path across an application. Linkerd integrates with Zipkin for tracing and other tools for metrics, and works for applications written in any language unlike some language-specific tracing libraries.

Service meshes also provide per-request routing based on things like HTTP headers, which can be manipulated for testing. Additionally, Istio also provides fault injection to simulate blunders in the application.
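Both ideas are simple to picture in code. The Python sketch below is a generic illustration, not Istio's or Linkerd's configuration format: the header name, destinations, and the every-Nth-request fault rule are all assumptions made for the example.

```python
def route(headers, rules, default):
    """Return the destination of the first rule whose header value matches.

    rules is a list of (header_name, expected_value, destination) tuples.
    Header names are compared case-insensitively, as HTTP requires.
    """
    normalized = {name.lower(): value for name, value in headers.items()}
    for header, expected, destination in rules:
        if normalized.get(header.lower()) == expected:
            return destination
    return default


def with_fault_injection(handler, every_n, status=503):
    """Wrap a handler so that every Nth call returns an injected error status."""
    calls = {"count": 0}

    def wrapped(request):
        calls["count"] += 1
        if calls["count"] % every_n == 0:
            return status  # simulated failure, the real handler is not called
        return handler(request)

    return wrapped
```

A tester could send requests carrying a header such as `X-Test-Group: beta` to reach a new backend while everyone else stays on the default, and wrap a dependency with injected faults to see how callers cope.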

On the SDN side of things, solutions differ. OpenContrail is fairly mature in this space compared to the other choices one has with CNI providers. OpenContrail has the ability to run packet capture and sniffers like Wireshark on demand, and its comprehensive analytics engines and visibility tools expose flow records and other traffic stats. Aside from debugging (at more of a network level), there are interesting security applications for auditing ACL deny logs. Finally, OpenContrail can tell you the end-to-end path of your traffic if it’s run atop a physical network (not a cloud). All of this can potentially help debugging, but the kind of information is far more indirect vis-à-vis the applications, and is probably better suited for NetOps.

Legacy and Other Interconnection

Service meshes seem great in many ways, but one hitch to watch out for is how they can allow or block your micro-services connecting to your legacy services or any services that don’t have a proxy in front of them.

If you are storing state in S3 or making a call to a cloud service, that’s an external call. If you’re reaching back to a legacy application like an Oracle database, same deal. If you’re calling an RPC of another micro-service that isn’t on the service mesh (for example, it’s sitting in a virtual machine instead of a container), same again. If your micro-service is supposed to deal with traffic that isn’t TCP traffic, that too isn’t going to be handled through your service mesh (for example, DNS is UDP traffic, ping is ICMP).
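One way to plan for this is to audit your outbound calls against the mesh. The sketch below is a deliberately simplified planning aid, assuming a mesh that only carries TCP-based traffic between services that have a proxy; the function name, the protocol labels and the example service names are all made up for illustration.

```python
def mesh_coverage(calls, mesh_services):
    """Split (destination, protocol) pairs into calls the mesh would carry and calls it would not."""
    mesh_protocols = {"http", "grpc", "tcp"}
    covered, uncovered = [], []
    for dest, protocol in calls:
        if protocol in mesh_protocols and dest in mesh_services:
            covered.append((dest, protocol))
        else:
            # External, legacy, off-mesh, or non-TCP traffic needs another plan.
            uncovered.append((dest, protocol))
    return covered, uncovered


calls = [
    ("cart", "http"),          # micro-service on the mesh
    ("s3", "http"),            # external cloud service
    ("oracle-db", "tcp"),      # legacy application
    ("kube-dns", "udp"),       # DNS is UDP traffic
]
covered, uncovered = mesh_coverage(calls, mesh_services={"cart", "catalog"})
```

Everything that lands in `uncovered` is communication you still have to allow and secure by some other means.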

In the case of Istio, you can set up egress connectivity with a service alias, but that may require changes to the application, so a direct pass-thru is perhaps a simpler option. Also, there are a lot of variants of TCP traffic that are neither HTTP nor directly supported as higher-level protocols riding on HTTP. Common examples might be ssh and mail protocols.

There is also the question of how service meshes will handle multiple IPs per pod and multiple network interfaces per pod once CNI soon allows it.

You most certainly have some of this communication in your applications that doesn’t quite fit the mesh. In these cases you not only need to plan how to allow this communication, but also how to do it securely, probably with an underlying SDN solution like OpenContrail that can span Kubernetes as well as OpenStack, VMware and metal.

What do you think?

Going back to the original question in the title: Are Service Meshes the Next-Gen SDN?

On one hand: yes! because they’re eating a lot of the value that some SDNs provided by enabling micro-segmentation and security for RPC between micro-services. Service meshes are able to do this with improved TLS-based security and identity assignment to each micro-service. Also, service meshes are adding advanced application-aware load balancing and fault handling that is otherwise hard to achieve without application analytics and code refactoring.

On the other hand: no! because service meshes sit atop of CNI and container connectivity. They ride on top of SDN, so they’ll still need a solid foundation. Moreover, most teams will want multiple layers of security isolation when they can get micro-segmentation and multi-tenancy that comes with SDN solutions without any performance penalty. SDN solutions can also span connectivity across clusters, stacks and runtimes other than containers, and satisfy the latency obsessed.

Either way, service meshes are a new, cool and shiny networking toy. They offer a lot of value beyond the networking and security values that they subsume, and I think we’ll soon see them in just about every micro-services architecture and stack.

More questions…

Hopefully something in this never-ending blog makes you question SDN or the service meshes. Share your thoughts or questions. The anti-pattern of technology forecasting is thinking we’re done, so some open questions:

  1. Should we mash-up service meshes and fit Linkerd into the Istio framework as an alternative to Envoy? If so, why?
  2. Should we mash-up OpenContrail and service meshes and how?

Resources on Learning About Service Meshes

Serverless Can Smarten Up Your DevOps & Infrastructure Stack by James Kelly

As IT organizations and cultures reshape themselves for speed, we’ve seen a trend away from IT as a service provider, to IT as:

  • a collection of smaller micro-service provider teams,
  • each operating and measuring like autonomous (micro-)product teams,
  • each building and running technology systems as micro-services.

Now you’ve heard me cover the intersection of microservices and containers quite a bit, along with CaaS/Kubernetes or PaaS/OpenShift orchestration. In this blog, I’m going to change it up a bit and talk about the other kind of microservices building block: code functions.

Here I’m talking about microservices run atop of a serverless computing a.k.a. function-as-a-service (FaaS) stack. Serverless is a 2-years-new yet already well-covered topic. If you need a primer this one is the best I’ve found.

In my last blog, I examined avoiding cloud lock-in by bringing your own devops and software-defined infrastructure stack: bringing your own open-source/standard toolchain across public/private IaaS, you maintain portability and harmonized, if not unified, hybrid cloud policy. This software-defined infrastructure stack, underpinning the user- or business-facing applications, is also still just a bunch of software applications. FaaS is just one example of an increasingly popular app-developer technology that can be employed by infrastructure developers too, though this one hasn’t gotten much attention yet in that regard.

If you read the primer above, you won’t have learned that the popularity of FaaS systems like Lambda to reduce developer friction has led to a number of open source projects like OpenWhisk, and others that sit atop of a CaaS/PaaS (such as Fission or Kubeless). If a FaaS is likely to exist in your application stack, it may well make sense to standardize on incorporating one into your devops and infrastructure stack. Let’s see.


When it comes to I&O events, does your mind go to a pager waking somebody up in the middle of the night? Better to wake up a function and keep sleeping.

The main area of natural applicability for FaaS is in event processing. Such code functions after all are event handlers.

In the devops world, for development and application lifecycle events, FaaS presents a simple way to further automate. The scale is probably low for most CI/CD events, so the main benefit here of employing FaaS is the simplicity and agility in your pipelines as code. It also helps foster the change and process innovation to do pipelines as code in the first place. In the devops area of continuous response (CR), that Randy Bias coins here, there is almost certainly more scale and importance to respond to events, where the property of FaaS to handle on-demand bursty scale could also be a significant benefit.

In the software-defined infrastructure world, there are traditional events such as change logs, alarms, security alerts, and modern telemetry and analytics events. Of all the types, analytics, telemetry and security events strike me as the area where FaaS could present the biggest benefit and most interesting examination here. Similar to how for CR, analytics for an application (processing telemetry such as API calls, clicks, and other user events) needs scale, software-defined infrastructure is increasingly generating massive amounts of telemetry data that needs processing.

Systems like Juniper Networks AppFormix and Contrail Networking already do some telemetry collection and processing (see for yourself). AppFormix, a platform for smart operations management, in fact processes much of the telemetry with light machine learning on the edge node on which it’s generated, so that sending the big data for heavy-duty crunching and machine learning in the centralized control plane is more efficient. We at Juniper have written before about this distributed processing model and its benefits, such as realer real-time visibility monitoring to show you what’s happening because… the devops call, “You build it, you run it,” sounds great until you’re running in a multi-tenant/team/app environment and chaos ensues. AppFormix also has a smart policy engine that uses static thresholds as well as anomaly detection to generate... you guessed it... events!
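To show what "static thresholds as well as anomaly detection" can mean in the simplest terms, here is a small sketch. It is not AppFormix’s algorithm; the rolling z-score, the function name and every parameter are assumptions I chose to keep the example short.

```python
from statistics import mean, pstdev


def detect_events(samples, static_limit, z_limit=3.0, window=5):
    """Emit (index, kind, value) events from a list of metric samples."""
    events = []
    for i, value in enumerate(samples):
        # Static threshold: fires regardless of history.
        if value > static_limit:
            events.append((i, "threshold", value))
            continue
        # Anomaly: value is far from the mean of the previous `window` samples.
        history = samples[max(0, i - window):i]
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > z_limit:
                events.append((i, "anomaly", value))
    return events


cpu = [10, 11, 10, 11, 10, 30, 95]
events = detect_events(cpu, static_limit=90)
```

Here the jump to 30 stays under the static limit but is flagged as an anomaly against recent history, while 95 trips the threshold outright.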

Just like devops event hooks can trigger a FaaS function, so can these software-defined infrastructure events.
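As a sketch of what such a function might look like, here is a generic event handler in Python. The entry-point name `handle`, the event fields (`severity`, `resource`) and the action names are hypothetical; each FaaS system defines its own function signature and event format.

```python
import json


def handle(event):
    """Take one infrastructure alarm event (dict or JSON string) and return an action."""
    alarm = json.loads(event) if isinstance(event, str) else event
    severity = alarm.get("severity", "info")
    resource = alarm.get("resource")
    if severity == "critical":
        # Try automated remediation first so nobody gets paged at 3 a.m.
        return {"action": "restart_workload", "resource": resource}
    if severity == "warning":
        return {"action": "open_ticket", "resource": resource}
    return {"action": "log_only", "resource": resource}
```

In a real deployment the returned action would be carried out by a call to your orchestrator or ticketing system; the sketch stops at the decision so it stays self-contained.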

Obviously a system like AppFormix implements much of the CR machine-learning smart glue for the analytics of your stack and its workloads; smart-driving ops as I call it. Other systems like Contrail Networking with its analytics Kafka or API interface are rawer. The same is true of other SD-infrastructure components like raw Kubernetes watcher events with which some FaaS systems like Fission already integrate.

The more raw the events, I bet, the more frequent they are too, so you have to decide at what point the frequency becomes so high that it’s more efficient to run a full-time process instead of functions on demand. That depends on the scale with which you want to run your own FaaS tool, or use a public cloud FaaS service (of course that only makes sense if you’re running the stack on that public cloud, and want to trade off some lock-in until they mature as usable in a standard way…worth seeing Serverless Framework here).
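That break-even point can be estimated with back-of-the-envelope arithmetic. The sketch below assumes a very simple cost model (a flat hourly cost for the full-time process and a flat cost per function invocation); both function names and the example figures are illustrative, not real prices.

```python
def break_even_events_per_hour(process_cost_per_hour, cost_per_invocation):
    """Event rate at which on-demand functions cost the same as a full-time process."""
    return process_cost_per_hour / cost_per_invocation


def cheaper_option(events_per_hour, process_cost_per_hour, cost_per_invocation):
    """Compare hourly cost of functions on demand against a full-time process."""
    on_demand = events_per_hour * cost_per_invocation
    return "functions" if on_demand < process_cost_per_hour else "full-time process"


# Illustrative units only: a process costing 1.0 per hour, functions costing 0.25 per call.
rate = break_even_events_per_hour(1.0, 0.25)
```

Below the break-even rate the functions win; above it, the always-on process does. A real comparison would also need to account for latency, cold starts and the cost of operating your own FaaS tool.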


One of the ends that you could have in mind to run operational-based functions is greater operational intelligence. As we’ve discussed, AppFormix does this to a good extent, but it does it within a cluster. We can consider further smart automation use cases in-cluster, but using a FaaS external to your clusters, your infrastructure developers could scope your intelligence deeper or broader too.

For example, you could go deeper, processing telemetry in customized ways that AppFormix does not today. Imagine you’re running on GCP. Well, you could do deep learning on telemetry data, even with TPUs I might add. To borrow a metaphor from a common ML example of image recognition, where AppFormix might be able to tell you, “I see a bird,” a powerful deep learning engine with TPU speed-up may, in real-enough time, be able to tell you, “I see a peregrine falcon.” In more practical terms, you may be able to go from detecting anomalies or security alerts to operations and security situational awareness where your systems can run themselves better than humans. After all, they already can at lower levels thanks to systems like Kubernetes, Marathon and AppFormix.

Rather than deeper, broader examples could be use cases that extend beyond your cluster. For example, auto-scaling or auto-healing of the cluster by integrating with physical DCIM systems or underlying IaaS. Yes, it’s serverless managing your servers folks!

In closing, there is plenty more to be written in the future of serverless. I expect we’ll see something soon in the CNCF, and then we’ll definitely start hearing of more application use cases. But hopefully, this blog sparked an idea or two about its usefulness beyond the frequent http-request and API-callback use cases for high-level apps. (Please share your ideas). For intent-driven, self-driving, smart-driving infrastructure, I think a FaaS track is emerging.

How to Avoid the Cloud Trap by James Kelly


Do you ever want something so badly that you might just fall into a trap trying to get it? That cheese looks awfully yum…whack! Oops :(

When speed = haste 

When speed = haste we blind ourselves to potential pitfalls. One area today of much haste in many enterprise IT organizations is, you guessed it, the move to the cloud. I was visiting a customer today, and I was told a familiar story: Their application team has built a new application proof-of-concept on AWS, and the line of business is going to fund this app to move ahead. The familiar part of the story line is that this was all achieved in X weeks instead of X months. With impressive results, developers are soon pulling the rest of IT to do cloud with an executive push.

Embracing cloud native is a great thing, but doing so with this haste is running right into the palm of AWS’s hand. Not to pick on AWS, this can happen anywhere, but it’s most likely to happen at AWS because they’re the incumbent choice, and there are so many services to help developers deliver something cloud-native quickly.

To developers, these AWS services look like candy to Garfield on Halloween. Unsuspectingly, many will never see the trap. Why? Because the lock-in police within the IT organization are patrolling I&O, while the lock-in felons have cleverly gone to work over on the application developers.

The new ‘lock-in’ is from developers,
not infrastructure and operations

I often talk about the move to cloud being led by developers and the devops trend; however, I hadn’t put 2 & 2 together quite like I did today with respect to lock-in. I do often preach that 80-odd percent of enterprises are targeting the hybrid cloud model, and that to make hybrid cloud a true IT platform there must be portability across clouds, but I guess I didn’t see the inverse of that concern.

That brings us to discussing the solution: how to achieve portability and a hybrid cloud IT platform with less, but not absolutely no, lock-in.

Let’s just examine the main public cloud business models for their services.

  1. First, nearly everyone is very familiar with IaaS. It’s offered at almost all big clouds, certainly the main 3.
  2. Then, there is the managed service model, where the cloud provider will usually take the many popular open-source software projects and offer them as individual managed services.
  3. Finally, there are the services that are homegrown or customized by the provider and offered as a service.

This is a decent summary of the main service models, but certainly not the only things that cloud providers can offer. Some other interesting examples include: AWS offers Direct Connect; GCP differentiates with its own high-speed global network to interconnect regions; Azure has the broadest compliance coverage. Anyway, if you understand the 3 models above, you can probably easily grasp what’s coming next: developers who use customized, homegrown services are clearly locking themselves in the most.

Getting conscious about lock-in

So should you never use these customized cloud services? Of course you should if it is really worth it, but do it consciously. Obviously, services that are unique prevent apps that touch them from being very portable across your hybrid cloud platform.

The good news is that, for most cloud services today, there is an open source project that matches most of what the service can do. If not, there is still a good chance that you can achieve similar benefit with a combination of open source projects, or vendor products based on such projects.

Bring your own stack

Perhaps the main thing that enables IT to go from multi cloud to consciously building a universal hybrid cloud platform is a unified toolchain and unified policy. Some systems’ policy unification can be achieved with meta-orchestrators that work across multiple clouds like Red Hat CloudForms and RightScale, but they have their limitations. Using the many benefits of public and private cloud IaaS, you can still control the application and devops stack to achieve portability and mitigate lock-in by bringing your own full stack that sits atop of any IaaS.

Anyone can do multi cloud. Throw in a little bit of this, and a little bit of that. But the recipe for a happier hybrid cloud seems to be known to some organizations, but not to all, so let’s have a look at how to do cloud with less lock-in and more portability.

Embrace cloud IaaS as a base for your devops stack, but use other cloud services sparingly:

  • Bring your own IaaS automation and abstraction (such as Terraform and config management tools like Puppet, Chef, Ansible, etc.)
  • Lock in to cloud services consciously when they are unique and necessary for business advantage
  • For services that have open source tool equivalents, bring your own tool or at least use a managed service that has the generic API
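The "generic API" idea in the last bullet can be sketched in code: the application depends on a small interface, and each cloud or open source tool gets its own adapter behind it. The `ObjectStore` interface, the in-memory stand-in and `save_report` are all hypothetical names for illustration; a real adapter would wrap whichever storage client you bring.

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """The only storage surface the application is allowed to touch."""

    @abstractmethod
    def put(self, key, data):
        ...

    @abstractmethod
    def get(self, key):
        ...


class InMemoryStore(ObjectStore):
    """Stand-in adapter; a per-cloud adapter would implement the same two methods."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = bytes(data)

    def get(self, key):
        return self._blobs[key]


def save_report(store, name, text):
    """Application code: knows nothing about which cloud sits behind the store."""
    store.put(f"reports/{name}", text.encode("utf-8"))


store = InMemoryStore()
save_report(store, "q1.txt", "portable")
```

Moving clouds then means writing one new adapter, not touching every place the application reads or writes data.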

Start with 2 clouds instead of one. This will…

  • Prevent you from tethering yourself to just one partner for cloud innovation & economics
  • Force the application cluster / stack to be portable
  • Force the DevOps workflows to be portable
  • Force designing for resiliency and scale early on

Take the naïve out of cloud native

If your business case for cloud is that your developers’ new-found speed is knocking the socks off of your executives, then maybe have a closer look.

There is a better way to “do cloud” with portable apps, automation, and software-defined infrastructure atop of any IaaS, and even better, there is actually plenty of help. Check out the Cloud Native Computing Foundation, which Juniper and many other vendors recently joined, and for your unified toolchain check out some of Juniper’s cloud software portfolio that can fit equally well across your public/private/hybrid cloud venues, such as AppFormix, vSRX, vMX and Contrail Networking built from OpenContrail.


Juniper & Red Hat Serve Up an Open Double-Stack Cloud with an SDN Twist by James Kelly

Photo credit Tom Haverford

Juniper Networks Contrail Networking, developed in the OpenContrail open source project, has long been a part of Red Hat’s millinery. The partnership between Juniper and Red Hat goes back some years now. Collaborating on OpenStack cloud and NFV infrastructure has won these partners success in supporting large enterprises and communications service providers like Orange Business Services.

At the long list of open source festivities in Boston over the next 2 weeks, you will hear these partners in cloud building on their past successful OpenStack + Contrail integration and now putting the spotlight on new integrations to support cloud native. You’ve heard me blog about the OpenContrail integration with OpenShift back a year already (in its first alpha form that I demoed), and more recently for CloudNativeCon and DockerCon talking about how we evolved that work to make this integration enterprise-ready and up-to-date with all the innovation that’s happened in the fast-paced OpenShift releases.

But how do you get the best of OpenStack and OpenShift?

Red Hat has been helping customers to move faster with devops, continuous delivery, and containers using OpenShift for a long time. Naturally Red Hat often does this atop of Red Hat OpenStack Platform, where OpenStack creates clusters of virtual machine hosts for the Red Hat OpenShift Container Platform cluster.

One latent hitch in this double stack is the software-defined networking (SDN). Like OpenStack, OpenShift and Kubernetes (on which OpenShift is based) have their opportunities for improvement. One area that is frequently fortified and improved is the software-defined networking (SDN), and the importance of doing this doubles when you’re running the double stack of OpenShift on top of OpenStack.

Why can SDN be such a snag? Well, the network is a critical part of any cloud, and especially cloud-native, infrastructure because of the enormous volume of microservice-generated east-west traffic, along with load balancing, multi-tenancy, and security requirements; that’s just for starters. The good-enough SDN that is included but swappable in such open source cloud stacks is very often indeed good enough for small cloud setups, but it is common to see the SDN replaced with something more robust for clouds with more than 100 nodes or other advanced use cases like NFV.

Fast and Furious Clouds

I think of this 100-node edge like the 100mph edge of a car… As I make my way to Boston, I just took an Uber to SFO, and of course it was a Prius. Like most cars nowadays they’re very efficient, and great for A-to-B commuting. But also like most cars, when you approach or (hopefully don’t) pass the 100mph mark, the thing feels like it is going to disintegrate! I was a little uneasy today flying down the 101 at just 80-something mph. Eeek!

Now I love to drive, and drive fast, but I don’t do it in a bloody Prius! My speed-seeking readers will probably know, as I do, that if you go fast in a sportscar, it is a way better experience. It’s smooth at high speeds, and the power feels awesome. Now let’s say you also aren’t just driving in a straight line, but you’re on the Laguna Seca Raceway. To handle cornering agility, gas and maintenance pit stops, and defense/offense versus the other drivers, you need more than just smooth handling. The requirements add up. I think you get my drift…

My point is that, similarly, if you’re going to build a cloud, you need to consider that you’re going to need a lot more than good enough. Good enough might do you fine for some dev/test or basic scenarios, but if you need performance, elastic scale, resilience, F1-pit-crew speed and ease of maintenance, and a security defense/offense, then you need to invest in building the best. Juniper has been helping its customers build the best networks for over 20 years. It’s what Juniper is known for: high performance and innovation. When you (or say NSS) compare security solutions, again, Juniper is on top for performance, effectiveness (stopping the bad stuff) and value, I might add.

When Compounding Good-enough Isn’t Great

Imagine the challenges you can run into with good-enough networking… now imagine you stack two such solutions on top of each other. That’s what happens with OpenShift or Kubernetes on top of OpenStack.

In this kind of scenario, compounding the two stacks’ SDN, as you demand more from your cloud, you will double complexity and twice as quickly hit network disintegration!

Ludacris Mode for Your Cloud

If you want to drive your cloud fast and furious, and not crash, you need some racing readiness. OpenContrail is designed for this, and proven in some of the largest most-demanding clouds. No need to recount the awesomeness here. It’s already well documented. Before you green-light it though, there is one thing we’ve needed to iron out: How does an OpenContrail double stack drive?

SDN Inception

When OpenContrail developers were in their first throes of SDN integration for Kubernetes and OpenShift, we often ran it inside of an OpenStack cloud. And what was the OpenStack SDN? OpenContrail of course. Yep, that’s right, we have OpenContrail providing an IP overlay on top of the physical network for the OpenStack VM connectivity, and then inside of that overlay and those VMs we installed OpenShift (or Kubernetes) with another OpenContrail overlay. It turns out this SDN inception works just fine. There’s nothing special to it. OpenContrail just requires an IP network, and the OpenStack-level OpenContrail fits the bill perfectly.

In fact, SDN inception is pretty common, but not usually with the same SDN at both levels. The main place this happens in practice is when we run cloud-native CaaS/PaaS stacks like Kubernetes, Mesos, or OpenShift paired with OpenContrail on top of public clouds, where a public IaaS like AWS has its own underlying SDN. It provides the IP underlay that we need in those cases.

What about when we control the SDN at the IaaS AND the CaaS/PaaS layers? Even if 2 SDNs (the same or different solution) work well stacked atop each other, it’s not ideal because there is still double the complexity of managing them. If only there was a better way…

A New Hat Stack Trick

This is where the OpenContrail community was inspired to raise the bar, and the Red Hat stack of OpenShift on OpenStack is the perfect motivation. What’s now possible today is to unwind the SDN inception and use one single control and data plane for OpenShift or Kubernetes on top of OpenStack when you run OpenContrail. The way this is realized is by having the OpenStack layer work as usual, and using OpenContrail in a different way with OpenShift or Kubernetes. In that instance, the OpenContrail plugin for OpenShift/Kubernetes master will speak directly to the OpenContrail controller used at the OpenStack layer. To collapse the data plane, we have a CNI plugin passthru that will not require the OpenContrail vRouter to sit inside the host VM for each OpenShift/Kubernetes minion (compute) node. Instead the traffic will be channeled from the container to the underlying vRouter that is sitting on the OpenStack nova compute node. We’ll save further technicalities and performance boost analysis for an OpenContrail engineering blog another day.

Juniper and Red Hat’s work on this latest innovation of flattening the SDN stack is coming to fruition. It is available today in the OpenContrail community or Juniper Contrail Networking beta, and slated for Juniper’s next Contrail release. As to that, stay tuned. As to catching this in action, visit Juniper and Red Hat at the Red Hat Summit this week and the OpenStack Summit next week. We’ll see you there, and I hope you hear about this and more OpenContrail community innovations ahead and in deployment at the OCUG next week.

A Contrarian Viewpoint on Container Networking by James Kelly

With DockerCon in Austin happening this week, I’m reminded of last year’s DockerCon Seattle, and watching some announcements with utter fascination or utter disappointment. Let’s see if we can’t turn the disappointments into positives.

The first disappointment has a recent happy ending. It was a broadly shared observation in Seattle, the media, and discussion forums: Docker was overstepping when they bundled in Swarm with the core of Docker Engine in 1.12. This led to the trial balloon that forking Docker was a potential solution towards a lighter-weight Docker that could serve in Mesos and Kubernetes too. Last September I covered this in my blog sparing little disdain over the idea of forking Docker Engine simply because it had become too monolithic. There are other good options, and I’m happy to say Docker heeded the community’s outcry and cleanly broke out a component of Docker Engine called containerd which is the crux of the container runtime. This gets back to the elegant Unix-tool inspired modularization and composition, and I’m glad to see containerd and rkt have recently been accepted into the CNCF. Crisis #1 averted.

My next disappointment was not so widely shared, and in fact it is still a problem at large today: the viewpoint on container networking. Let’s have a look.

Is container networking special?

When it comes to containers, there’s a massive outpour of innovation from both mature vendors and startups alike. When it comes to SDN there’s no exception.

Many of these solutions you can discount because, as I said in my last blog, as shiny and interesting as they may be on the surface or in the community, their simplicity is quickly discovered as a double-edged sword. In other words, they may be easy to get going and wrap your head around, but they have serious performance issues, security issues, scale issues, and “experiential cliffs” to borrow a turn of phrase from the Kubernetes founders when they commented on the sometimes over-simplicity of many PaaS systems (iow. they hit a use case where the system just can’t do that experience/feature that is needed).

Back to DockerCon Seattle…

Let’s put aside the SDN startups that to various extents suffer from the over-simplicity or lack of soak and development time, leading to the issues above. The thing that really grinds my gears about last year’s DockerCon can be boiled down to Docker, a powerful voice in the community, really advocating that container networking was making serious strides, when at the same time they were using the most primitive of statements (and solution) possible, introducing “Multi-host networking.”

You may recall my social post/poke at the photo of this slide with my sarcastic caption.

Of course, Docker was talking about their overlay-based approach to networking that was launched as the (then) new default mode to enable networking in Swarm clusters. The problem is that most of the community are not SDN experts, and so they really don’t know any better than to believe this is an aww!-worthy contribution. A few of us that have long-worked in networking were less impressed.

Because of the attention that container projects get, Docker being the biggest, these kinds of SDN solutions are still seen today by the wider community of users as good networking solutions to go with because they easily work in the very basic CaaS use cases that most users start playing with. Just because they work for your cluster today, however, doesn’t make them a solid choice. In the future your netops team will ask about X, Y and Z (and yet more stuff down the road they won’t have the foresight to see today). Also in the future you’ll expand and mature your use cases and start to care about non-functional traits of the network, which often happens too late in production or when problems arise. I totally get it. Networking isn’t the first thing you want to think about in the cool new world of container stacks. It’s down in the weeds. It’s more exciting to contemplate the orchestration layer, and things we understand like our applications.

On top of the fact that many of these new SDN players offer primitive solutions with hidden pitfalls down the road that you won’t see until it’s too late, another less pardonable nuisance is the fact that most of them are perpetrating the myth that container networking is somehow special. I’ve heard this a lot in various verbiage over the ~7 years that SDN has arisen for cloud use cases. Just this week, I read a meetup description that started, “Containers require a new approach to networking.” Because of all the commotion in the container community with plenty of new SDN projects and companies having popped up, you may be duped into believing that, but it’s completely false. These players have a vested interest, though, in making you see it that way.

The truth about networking containers

The truth is that while workload connectivity to the network may change with containers (see CNM or CNI) or with the next new thing, the network itself doesn’t need to change to address the new endpoint type. Where networks did need some work, however, is on the side of plugging into the orchestration systems. This meant that networks needed better programmability and then integration to connect up workloads in lock-step with how the orchestration system created, deleted and moved workloads. This meant plugging into systems like vSphere, OpenStack, Kubernetes, etc. In dealing with that challenge, there were again two mindsets to making the network more programmable, automated, and agile: one camp created totally net-new solutions with entirely new protocols (OpenFlow, STT, VXLAN, VPP, etc.), and the other camp used existing protocols to build new more dynamic solutions that met the new needs.

Today the first camp solutions are falling by the wayside, and the camp that built based on existing open standards and with interoperability in mind is clearly winning. OpenContrail is the most successful of these solutions.

The truth about networks is that they are pervasive and they connect everything. Interoperability is key.

  1. Interoperability across networks: If you build a network that is an island of connectivity, it can’t be successful. If you build a network that requires special/new gateways, then it doesn’t connect quickly and easily to other networks using existing standards, and it won’t be successful.
  2. Interoperability across endpoint connections: If you build a network that is brilliant at connecting only containers, even if it’s interoperable with other networks, then you’ve still created an island. It’s an island of operational context because the ops team needs a different solution for connecting bare-metal nodes and virtual machines.
  3. Interoperability across infrastructure: If you have an SDN solution that requires a lot from the underlay/underlying infrastructure, it’s a failure. We’ve seen this with SDNs like NSX that required multicast in the underlay. We’ve seen this with ACI that requires Cisco switches to work. We’ve even seen great SDN solutions in the public cloud, but they’re specific to AWS or GCP. If your SDN solution isn’t portable anywhere, certainly to most places, then it’s still doomed.

If you want one unified network, you need one SDN solution

This aspect of interoperability and portability actually applies to many IT tools if you’re really going to realize a hybrid cloud and streamline ops, but perhaps nowhere is it more important than in the network because of its inherently pervasive nature.

If you’re at DockerCon this week, you’ll be happy to know that the best solution for container networking, OpenContrail, is also the best SDN for Kubernetes, Mesos, OpenStack, NFV, bare-metal node network automation, and VMware. While this is one SDN to rule and connect them all, and very feature rich in its 4th year of development, it’s also never been more approachable, both commercially turn-key and in open source. You can deploy it on top of any public cloud or atop of private clouds with OpenStack or VMware, or equally easily on bare-metal CaaS, especially with Kubernetes, thanks to Helm.

Please drop by the Juniper Networks booth and ask for an OpenContrail demo and a sticker for your laptop or phone! You can also visit the booths of Juniper partners that have Juniper Contrail Networking integrations: Red Hat, Mirantis, Canonical, and now Platform9, whom we’re happy to welcome to the party too. We at Juniper will be showcasing a joint demo with Platform9 that you can read more about on the Platform9 blog.

PS. If you’re running your CaaS atop OpenStack, that’s even more reason to stop by and get a sneak peek of what you’ll also hear more about at the upcoming Red Hat and OpenStack Summits in Boston.

The Best SDN for OpenStack, now for Kubernetes by James Kelly

In between two weeks of events as I write this, I got to stop at home and in the office for only a day, but I was excited by what I saw on Friday afternoon… The best SDN and network automation solution out there, OpenContrail, is now set to rule the Kubernetes stage too. With the title of “#1 SDN for OpenStack” secured for the last two years, the OpenContrail community is now ready for Kubernetes (K8s), the hottest stack out there, and a passion of mine for which I’ve been advocating more and more attention around the office at Juniper.

It couldn’t be better timing, heading into KubeCon / CloudNativeCon / OpenShift Commons in Berlin this week. For those that know me, you know I love the (b)leading edge of cloud and devops. Last year I built an AWS cluster for OpenContrail + OpenShift / Kubernetes and shared it in my Getting to #GIFEE blog, github repo, and demo video; those were early days for OpenContrail in the K8s space. It was work pioneered by an OpenContrail co-founder and a few star engineers, who helped me cut through the off-piste mank and uncharted territory. It was also inspired by my friends at tcp cloud (now Mirantis) who presented on Smart cities / IoT with K8s, OpenContrail, and Raspberry Pi at the last KubeCon EU. Today, OpenContrail, the all-capability open SDN, is ready to kill it in the Kubernetes and OpenShift spaces, elevating it to precisely what I recently demo shopped for in skis: a true all-mountain top performer (BTW my vote goes to the Rosi Exp. HD 88). Now you know what I was doing when not blogging recently… Skiing aside, our demos are ready for the KubeCon audience this week at our Juniper booth, and I’ll be there talking about bringing a ton of extra value to the Kubernetes v1.6 networking stack.

What sets OpenContrail apart from the other networking options for Kubernetes and OpenShift is that it brings an arsenal of networking and security features to bear on your stack, developed over years by the performance-obsessed folks at Juniper and many other engineers in the community. In this space OpenContrail often competes with smaller SDN solutions offered by startups. Their stated advantage is that they are nimbler and simpler. This was particularly true of their easy, integrated installation with K8s (coming for OpenContrail very soon), but on the whole, the advantage of the likes of Contiv, Calico, Flannel, Weave, etc. boils down to 2 things. Let’s have a look…

First, it is easy to be simpler when primitive. This isn’t a knock on the little guys. They are very good for some pointed use cases and indeed they are simple to use. They do have real limits, however. Simple operation can always come from fewer knobs and fewer capabilities, but I believe an important goal of SDN, self-driving infrastructure, and cognitive / AI in tech is abstraction; in other words, simplicity needn’t come from stripping away functionality. We need only start with a good model for excellent performance at scale and design for elegance and ease of use. It’s not easy to do, but have you noticed that Kubernetes itself is the perfect example of this? It was super easy 2 years ago and still is, but at the same time there’s a lot more to it today. It’s built on solid concepts and architecture, from a lot of wisdom at Google. Abstraction is the new black. It’s open, layered and transparent when you need to peel it back, and it is the solution to manage complexity that arises from features and concepts that aren’t for everyone. General arguments aside, if you look at something like K8s ingress or services networking / balancing, none of the SDN solutions cover that except for very pointed solutions like GKE or nginx, where you’d still need one or more smaller SDN tools. Furthermore, when you start to demand network performance on the control (protocol/API) and data (traffic feeds & speeds) planes, along with scale in many other K8s metrics, the benefits of OpenContrail that you get for free really stand apart from the de facto K8s networking components and these niche offerings. Better still, you get it all as one OpenContrail solution that is portable across public and private clouds.

Second, other solutions are often solely focused on K8s or CaaS stacks. OpenContrail is an SDN solution that isn’t just for Kubernetes or containers. It’s one ring to rule them all. It works on adjacent CaaS stack integrations with OpenShift, Mesos, etc., but it’s equally if not more time-tested in VMware and OpenStack stacks and even bare-metal or DIY orchestration of mixed runtimes. You might have heard of publicized customer cases for CaaS, VMware and OpenStack across hybrid clouds, continuing with the same OpenContrail networking irrespective of stacking PaaS and CaaS upon public/private IaaS or metal (here are a few 1, 2, 3). It’s one solution unifying your stack integration, network automation and policy needs across everything. In other words, the only-one-you-need all-mountain ski. The only competitor that comes close to this breadth is Nuage Networks, but they get knocked out quickly for not being open source (without tackling other important performance and scale details here). If the right ski for the conditions sounds more fun to you, like the right tool for the right job, then wait for my next blog on hybrid cloud…and what I spoke about at IBM Interconnect (sorry no recording). In short, while having lots of skis may appeal for different conditions, having lots of networking solutions is a recipe for IT disaster because the network foundation is so pervasive and needs to connect everything.

With OpenContrail developers now making it dead easy to deploy with K8s (think kubeadm, Helm, Ansible…) and seamless to operate with K8s, it’s going to do for Kubernetes networking what it did for OpenStack networking, and very quickly. I’m really excited to be a part of it.

You can visit the Juniper booth to find me and hear more. As you may know, Juniper is the leading contributor to OpenContrail, but we’re also putting K8s to work in the Juniper customer service workbench and CSO product (NFV/CPE offering). Moreover, and extremely complementary to Juniper’s Contrail Networking (our OpenContrail offering), Juniper acquired AppFormix just 4 months ago. AppFormix is smart operations management software that will make your OpenStack or K8s cluster ops akin to a self-driving Tesla (a must if your ops experience is about as much fun as a traffic jam). Juniper’s AppFormix will give you visibility into the state of workloads, both apps and software-defined infrastructure stack components alike, and it applies big data and cognitive machine learning to automate and adapt policies, alarms, chargebacks and more. AppFormix is my new jam (with Contrail PB)… it gathers your cloud ops data and turns it into insight… The better and faster you can do that, the better your competitive advantage. As you can tell, KubeCon is going to be fun this week!


+update 3/28
Thanks to the Juniper marketing folks for a list of the demos that we'll be sharing at KubeCon:

  1. Contrail Networking integrated with a Kubernetes cluster (fully v1.6 compatible):
    • Namespaces/RBAC/Network Policy improved security with OpenContrail virtual networks, VPC/projects and security groups (YouTube demo video)
    • Ingress and Services load balancing with OpenContrail HAProxy, vRouter ECMP, and floating IP addresses, improving performance (removing kube-proxy) and consolidating SDN implementation (YouTube demo video)
    • Connecting multiple hybrid clouds with simple network federation and unified policy
    • Consolidating stacked CaaS on IaaS SDN: optimized OpenContrail vRouter networking "passthru" for Kubernetes nodes on top of OpenStack VMs
  2. Contrail Networking integrated with an OpenShift cluster (YouTube demo video)
  3. Contrail Networking containerized and easy deployment in Kubernetes
  4. AppFormix monitoring and analytics for Kubernetes and workload re-placement automation

I'll light these up with links to YouTube videos as I get the URLs.

Fond Memories of 2016 by James Kelly

2016 was a year full of fun and adventure. To change up this blog installment, here is the video Linh and I shared on Facebook with our friends and families.

Some more quick fun facts about my 2016…

  • Linh and I got engaged on October 12 in Half Moon Bay
  • I have a new niece, Felicity, whom I adore. I got to visit her in Jersey and can’t wait to see her again soon
  • I posted 165 times on Facebook (I’m sure that’s a record for me, but that’s what you get when you hang around with a Facebook star like Linh)
  • Linh and I went to Canada together on an extended business/pleasure adventure. It was her first time in Canada, and this year she met some of my family for the first time, there and elsewhere. The fall colours ;) were beautiful
  • Along that theme, we went to a hockey game. Linh’s first, my umpteenth, but it was a good match really far along in the playoffs
  • We built a garden that we love at our apartment home in San Jose
  • Linh and I both re-started painting. Although Linh was more productive in making many pieces, it has been wonderful to get back into it for us both. It’s fun to paint pieces for our friends and home
  • We ate countless oysters, and in many fine establishments, including *** Per Se
  • We went horseback riding on the beach
  • I traveled to a new country for me, Korea, and countless other places, mostly with Linh… well, I can count quite a few. Let me see: New York, Las Vegas, Los Angeles, Portland and Dundee/wine country, Oregon, Marin, Napa (twice), Seattle (twice), Chicago (3 times), Austin, Atlanta, Miami and Orlando, Washington DC, Yosemite National Park and Tioga Pass, Lake Tahoe, Paso Robles, SLO, Morro Bay, Arizona, New Jersey, (old / UK) Jersey, England (London, Manchester, Cambridge, Oxford, Nantwich…), and on the trip to Canada just Niagara-on-the-Lake, Ontario wine country, Toronto, Montreal, Ottawa and Gatineau Park, Greater Vancouver and Roberts Creek on the BC Sunshine Coast.