Forging DevOps Culture with Hedge-fund Flair by James Kelly


People: your most important resource and your greatest impediment to DevOps potency.

When the DevOps consultants leave and you need to scale a pilot-project team’s savvy, how do you infuse the wider organization with DevOps principles?

Balancing the ingredients of this so-called mentality is trickier than revamping tools and processes. We all know to let process lead thy tooling, not tooling lead thy process. We know the approach is a rolling upgrade, not a mass reboot.

But in the plethora of chapter and verse on DevOps, cultural principles are still scarce—not another definition, nor “automate everything,” nor the trite “dev and ops working closely,” but real principles of cultural behaviors, their reasoning and an implementation track record.

When I was poring through the pages of Principles by the Steve Jobs of investing, Ray Dalio, I was expecting to learn about life, finance and business from this famed hedge-fund investment and business guru. I did. I also realized that Ray’s high-performing investment and management principles codify common aspects of the DevOps mentality with some new ideas and revisions. And he’s got the CEO and CIO track record to support it, only his c-level ‘I’ stands for investment.

In the spirit of the ‘S’ for sharing in DevOps’s CALMS, Ray has provided a principles manifesto in clear, practical terms. I won’t reveal them all—I encourage you to read the book for that—but here are five of his greatest principles, distilled and steeped with my own perspective for the DevOps anthology.

1. Expedite Evolution, Not Perfection

From the opening biography, we come to know Ray as a continual learner by trial and error. He’s always looking for lessons in failures to carry forward, to do it better next time. He doesn’t regret failures; he values them more than successes because they provide learning.

Ray tells how he wouldn’t be where he is today—one of TIME’s top-100 most influential people in the world—if he had not hit rock bottom, having to let go of all his employees and forced to borrow $4000 from his dad to pay household bills until his family could sell their second car.

Because Ray upcycles painful mistakes into lessons and principles, learning and efficiency compound. He embraces evolutionary cycles, and knows a thing or two about compounding. Our human intelligence allows us to falter and adapt in rapid cycles that compound wisdom, without waiting generations for the effects. This iterative, rather than intellectual, approach performs better, with the added benefit that, being experiential, you know it works.

If you’re a DevOps advocate, your Kaizen lightbulb may have lit. Kaizen is continuous learning: as I say, it’s the most important of all continuous practices in DevOps—and in life. Drawing from Ray’s rapid iteration of trial, error, reflect and learn, we see how he pairs Kaizen with Agile, values learning from failure, and takes many small quick steps for faster evolution.

To solidify the value behind this concept pairing, imagine a fixed savings interest rate, but change the cycle. What’s better: 12% annually or 1% monthly? “Periods Do Matter” in this Investopedia article will show you that shorter cycles are better than longer ones. That is the technical reasoning behind why failing faster leads to better evolution.
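The arithmetic is easy enough to check yourself; a quick sketch, with purely illustrative rates:

```python
# Compare compounding a fixed rate over different cycle lengths:
# 12% once per year vs. 1% every month for a year.
annual = 1.12          # one big step
monthly = 1.01 ** 12   # twelve small steps, compounding each time

print(f"12% annually: {annual:.4f}x")   # 1.1200x
print(f"1% monthly:   {monthly:.4f}x")  # ~1.1268x -- the shorter cycle wins
```

The gap widens as the horizon grows, which is exactly the point: many small compounding cycles beat fewer big ones.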

In another great read, 4 Seconds, Peter Bregman exemplifies how to manage learning and failure in business by telling the story of teaching his daughter to ride a bike without training wheels. Managing is knowing just the right time to step in and catch her. Too soon and she won’t learn to rebalance herself. Too late and...wipeout! He explains, “Learning to ride a bike, learning anything actually, isn't about doing it right: it's about doing it wrong and then adjusting. Learning isn't about being in balance, it's about recovering balance. And you can't recover balance if someone keeps you from losing balance in the first place.”

In summary, allow failure, cycle quickly and record the lessons. Deprive your people of the opportunity to fail, and you deprive them of the opportunity to succeed—and the opportunity to improve. Breed a culture of rapid feedback and experimentation with guardrails, allowing failure without fatality.

2. Triangulate and Be Actively Open Minded

DevOps aficionados are familiar with “fail fast,” Agile and Kaizen. What’s further interesting about Ray is how he allows for failure and equally reaches for high standards. Beyond technology, excellence is rarely discussed in DevOps circles.

Ray pursues life’s best. “You can have virtually anything you want, you just can’t have everything you want,” he says. Aside from his uncompromising principles in hiring and maintaining excellent people, Ray insists on excellent decision making to instill quality into evolution.

If failure doesn’t inform progress, “fail fast” falls flat. Just as machine learning uses new, quality data to improve, our cycle progress is proportional to the quality and newness of the abilities and information we use to pursue our goals.

The approach Ray hammers again and again is triangulation: exploring opinions different from his own or the first one offered up. Varying judgments can’t all be right, but understanding different viewpoints is like making a quantum leap in an evolutionary cycle compared to learning from one source.

Ray’s dramatic story of receiving a cancer diagnosis indelibly impresses the importance of triangulation.

Obviously shaken up, he began to estate plan and spend more time with family, but he also consulted three experts. The first two doctors had wildly different prognoses and proposals for treatment or surgery. So he got them speaking with one another; they were respectful in understanding each other’s take, and Ray learned a lot. Finally, the third doctor suggested a regimen of no treatment or surgery, but instead monitoring a biopsy of the cells every six months, because his data showed treatments and surgery didn’t necessarily extend life in cases of cancer of the esophagus.

The three specialists, Ray and his family doctor agreed that this final approach wouldn’t hurt. The learning value of this triangulation aside, the outcome of the story will floor you: Ray’s first biopsy showed that he didn’t have any cancerous cells.

Back to DevOps, the CALMS ‘S’ for sharing is brilliant, but we can push beyond sharing. Actively seeking, not only sharing, information is key to boosting the quality of our decision making and evolution. Companies like Google do this with a manic focus on data, and data is just one avenue of information that may or may not go against our own beliefs.

In general, DevOps leaders must advocate for a culture and habits of active open mindedness, seeking opinions of other believable people and data. Like Ray, assertively explain your own opinions, while maintaining poise and humility to change your mind.

3. Radical Truth and Transparency

At the heart of Ray’s high-performing company, Bridgewater, is a culture of radical truth and transparency. Their patriarch trusts in truth, and loves his people like family, but also equitably protects the whole more than any part. For the greater good he doesn’t hold back in accurate evaluation, root-cause analysis, and openly pointing out problems, even in people. “Love the people you shoot,” he writes, “Tough love is both the hardest and most important kind of love to give.”

The firm keeps an internally available “baseball card” on each employee’s strengths and weaknesses synthesized from evidence-based patterns and a collection of business tools with psychographic-data crunching backends. Weaknesses aren’t misconstrued for weak people, and employees aren’t pigeonholed; the transparency enables orchestrating employees’ best work and identifying their believability in decision making.

For decades, Bridgewater has been using data on people and their track records to do believability-weighted decision making with the help of computers. The company’s “Idea Meritocracy” tools like the Dot Collector Matrix collect data and help teams make believability-weighted decisions, even instantly in meetings. While this was pioneered for investment decisions, Bridgewater later adopted the system for management decisions. Ray also hints he’s working toward offering the system as a service.
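Bridgewater’s actual tools are proprietary, but the core idea of a believability-weighted decision can be sketched in a few lines; the voters, weights and votes below are entirely hypothetical:

```python
# Hypothetical sketch of a believability-weighted vote: each voter's opinion
# counts in proportion to their track record (believability) on this kind of
# decision, rather than one person, one vote.
votes = [
    # (voter, believability weight, vote: +1 = for, -1 = against)
    ("junior analyst", 0.2, +1),
    ("senior PM",      0.9, -1),
    ("quant lead",     0.7, -1),
    ("new hire",       0.1, +1),
]

simple = sum(vote for _, _, vote in votes)              # unweighted tally
score = sum(weight * vote for _, weight, vote in votes) # believability-weighted

print("simple majority:", "for" if simple > 0 else "against" if simple < 0 else "tie")
print("believability-weighted:", "for" if score > 0 else "against")
```

Here the unweighted vote is a 2–2 tie, but weighting by believability resolves it decisively against, which is the kind of tie-breaking the Dot Collector data makes possible in real time.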

This principle is about being ruthless in demanding integrity, honesty, accuracy and openness. Common workplace biases like loyalty, niceness, confidentiality and secrecy might seem safe or well-intentioned in small contexts, but are ultimately self-defeating for the big-picture success of the whole.

Every person and organization has a unique twist on values and workplace politics, but while Bridgewater’s success speaks volumes, its radically straightforward approach is also reported to be the preference of the techies and millennials who make up many DevOps-forward teams.

Embracing DevOps results in more than dev and ops working together—it’s working more closely with the business too. While DevOps leaders can’t control the culture in the wider organization, they can shape the sub-culture of their own teams. Not only is it more manageable at that scale, but this cultural principle and corresponding tools seem a natural fit for IT workers. And just maybe, as the role of IT grows in most businesses today, the culture will catch on.

4. Be Candid and Fearless, Rather Than Blameless

Blameless post-mortem or retrospective meetings are not uncommon in a DevOps culture.

You can probably guess how Ray might see this differently.

If your culture is blameless, there’s less accountability, so you’re more likely to miss lessons and chances for improvement. It’s not just about fixing the machine either; it’s about helping individuals. And if someone is truly not capable, you could fail to see it if you don’t dispassionately trace the blame.

Accuracy requires great diagnosis, and Ray’s method for root-cause analysis makes Toyota’s 5 Whys look skin deep.

Ray advocates keeping the people responsible for investigation reporting independently of where the diagnosis happens, so there’s no fear of recrimination. “Remember, people tend to be more defensive than self-critical. It is your job as a manager to get at truth and excellence, not to make people happy. Everyone’s objective must be to get the best answers, not the answers that will make the most people happy.”

Having said that, Bridgewater’s culture also pushes everyone to tell the truth without fear of adverse consequences from admission of mistake.

When an employee missed placing a trade for a client, it ended up costing millions for the younger, smaller, less-capitalized Bridgewater. With the whole company watching, so to speak, Ray decided not to fire this employee—he knew that would lead to a culture of people hiding their mistakes instead of bringing them to light as soon as possible.

With respect to handling missteps, this hedge funder would admonish blamelessness in favor of candor and staff fearlessness. It has the efficiency of earlier learning and earlier redesign for prevention. It also doesn’t eschew accountability, encouraging individual improvement that eventually lifts the whole team.

5. Management by Machine and Metrics

Techies will appreciate how Ray talks about his business as a machine.

If you have great principles that guide you from your values to your day-to-day decisions, but don’t have a way of making sure they’re systematically applied, you leave their usefulness to chance. We need to cement culture into habits and help others do so as well. Systematizing any cultural principle into a process, a tool, or both “typically takes twice as long,” Ray says, “but pays off many times over because the learning compounds.”

Bridgewater has always put investment principles into algorithms and expert systems, and has long since run the rest of its business by software machinery as well.

Is this just the well-worn “automate everything” DevOps call?

Automation advances scale, performance, correctness, consistency and instrumentation. But high-performing businesses like Bridgewater also manage by metrics: they compare outcomes and measurements to goals.

While data-driven decision making is prominent these days, data-driven measurement and accountability is less common. We have KPIs, QBRs and performance reviews, but how many teams are consistently managed by metrics? We more easily look forward than take an objective look in the mirror, even though it’s critical for evolution.

Goal-setting zealots argue that goals must be measurable, and Ray’s advice takes it one step further: don’t look at the numbers you have and adapt them to your needs. Instead, “start with the most important questions and come up with the metrics that will answer them,” he says. “Remember any single metric can mislead.” Furthermore, as with big-data analytics, data garbage in equals information garbage out.
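As a toy illustration of working from questions to metrics (the questions, goals and numbers below are hypothetical, not from Ray’s book):

```python
# Hypothetical sketch: start from the questions that matter, derive one
# metric per question, then compare measured outcomes to goals --
# management by metrics rather than by gut feel.
metrics = {
    # question -> (metric name, goal, measured outcome)
    "Are we shipping fast enough?": ("deploys per week", 10, 6),
    "Are we breaking production?":  ("change failure rate %", 15, 9),
    "Do we recover quickly?":       ("MTTR minutes", 60, 45),
}

for question, (name, goal, actual) in metrics.items():
    # Deploy frequency is better when higher; rates and recovery times
    # are better when lower.
    higher_is_better = name == "deploys per week"
    on_track = actual >= goal if higher_is_better else actual <= goal
    status = "on track" if on_track else "off track"
    print(f"{question} -> {name}: {actual} vs goal {goal} ({status})")
```

Note how the review spans several metrics at once, in the spirit of “any single metric can mislead.”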

Be the Change

Ray also says, “An organization is the opposite of a building: its foundation is at the top.”

But we all know stories of change percolating from all levels of organizations, communities and countries. If you’re not a CEO like Ray was, you can still make a meaningful difference bottom-up or managing your own team, leading by example.

You could simply publish your team’s principles, create a tool, or ignite behaviors you want to spread. Of the DevOps people-process-technology, people are your most important resource; so forge the principles of their operating systems: sharpen, tweak, prioritize and balance. With the transformation door open in your digital business and DevOps journey, there’s no better time to make an invaluable mark on culture—in IT and beyond.

image credit Jacob Lund/Shutterstock

Micro-services: Knock, Knock, Knockin’ on DevNetOps’ Door by James Kelly


A version of this article was published on October 13, 2017 at TheNewStack

“You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do ya, punk?” – Dirty Harry

Put this famous question to the sentiment around IT deployments, and it points to IT and even business performance with scary accuracy. High performers – “The most powerful guns in the world,” to borrow Harry’s words – pull the trigger on deployments with high confidence, while deployment dread is a surefire sign of lower performers, advises the State of DevOps report.

In his talks, Gene Kim, author of The DevOps Handbook, corroborates that deployment anxiety is associated with hapless businesses half as likely to exceed profitability, productivity and market share goals, and evidently with lower market-cap growth.

We may conclude high-performing teams wear the badge of confidence because they’re among the ranks in the academy of agile, deploying more often – orders of magnitude more often. Their speed is in taking lots of little steps, so they have regular experience with change.

And feeling luckier is also an effect of actually being luckier. Data show high performers break things less often; and when they do, the forensics are more clear-cut. They post better MTTF and MTTR than lower performers’ old-fashioned police work because the investigation and patching proceed quickly from the last small step, instead of digging for clues in bigger deliveries peppered with many modifications.

Less, more often, is better than more, less often

Imagine conducting IT on two technology axes: time and space. If it’s clearly healthier to automate faster, smaller steps with respect to the timeline or pipeline, then consider space or architecture. Optimizing architecture design and orchestration to recruit nimble pipeline outputs, what stands out in today’s line up of characters? Affirmative, ace: micro-services.

The principles of DevOps have been around a while and in emerging practice for more than a decade, but the pivotal technologies that cracked DevOps wide open were containers and micro-services orchestration systems like Kubernetes. Looking back, it’s not so surprising that smaller boundaries and enforced packaging from developers, preserved through the continuous integration and delivery pipeline, make more reliable cases for deployment.

A micro-services architecture isn’t foolproof, but it’s the best partner today for the speed and agility of frequent or continuous deployment.

Micro-services networks and networking as micro-services

In a technical trial of service meshes versus SDN, there are three key positions networking takes in today’s micro-services scene:

  1. In a micro-services design, the pieces become smaller and the intercellular space – the network – gets bigger, busier, and hence, vital. Also, beyond zero-trust-style protection of the micro-services themselves, it’s important to have this network locked down.
  2. Service discovery, service/API gateways, service advertising with DNS, and service scale-out or -in with load balancing are all players in networking’s jurisdiction.
  3. Beyond micro-services, any state replication, backup, or analytics over an API, a volume, or a disk, also rides on the network.

Given the importance of networking to the success of micro-services, it’s ironic that networking components are mostly monolithic. Worse, deployment anxiety is epidemic: network operators have lengthy change controls, infrequent maintenance windows, and new code versions are held for questioning for 6-18 months and several revisions after availability.

A primal piece on DevNetOps cites five things we can borrow from the department of DevOps to remedy network ops in time and space, starting with code, pipelines and architecture.

Small steps for DevNetOps

Starting into DevNetOps is possible today with Spinnaker-esque orchestration of operational stages: a network-as-code model and repository would feed into a CICD pipeline for all configuration, template, code and software-image artifacts. With new processes and skills training of networking teams – like coding and reviewing logistics as well as testing and staging simulation – we could foil small-time CLI-push-to-production joyrides and rehabilitate seriously automation-addled “masterminds” who might playbook-push-to-production bigger mistakes.
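As a hypothetical sketch of what such a pipeline’s gates might look like (the gate logic, names and thresholds here are invented for illustration, not any vendor’s tooling):

```python
# Hypothetical network-as-code pipeline: every change committed to the config
# repository must pass lint and lab-simulation gates before any production
# rollout -- no direct CLI-push-to-production changes.
def lint(change):
    # e.g. syntax-check templates and reject oversized diffs,
    # enforcing the "many small steps" discipline
    return change["lines_changed"] <= 50

def simulate(change):
    # e.g. replay the change against a virtual topology or staging lab
    return change["passes_lab"]

def pipeline(change):
    for gate in (lint, simulate):
        if not gate(change):
            return "rejected at " + gate.__name__
    return "promoted to production rollout"

print(pipeline({"lines_changed": 12, "passes_lab": True}))   # promoted
print(pipeline({"lines_changed": 400, "passes_lab": True}))  # rejected at lint
```

The orchestration framework matters less than the shape: small commits, automated gates, and a single path to production.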

The path to corrections and confidence on the technology time axis begins with automating the ops timeline as a pipeline with steps of micro modifications.

Old indicted ops practices can be reinvented by the user community with vendor and open-source help for tooling, but when it comes to architecture, the vendors need to lead. Vendors are the chief “Dev” partner in the DevNetOps force, and motives are clearer than ever to build a case to pursue micro-services.

Small pieces for DevNetOps: Micro-services

After years of bigger badder network devices – producing some monolithic proportions so colossal they don’t fit through doors – vendors can’t ignore the flashing lights and sirens of cloud, containers and micro-services.

It’s clear for DevNetOps, like for DevOps, micro-sized artifacts are perfectly sized bullets for the chamber of an agile pipeline. But while there’s evidence of progress in the networking industry, there’s a ways to go.

Some good leads toward a solution include Arista supporting patch packages separate from their main EOS delivery; the OCP popularizing software and hardware disaggregation in its networking project; and Juniper Networks building on disaggregation by supporting node splicing and universal chassis for finer-grained modularity and management boundaries. Furthermore, data center network designs of resilient scale-out Clos network fabrics with pizza-box-sized devices are gradually favored over large aggregation devices. And in software-defined networking, projects like OpenContrail are now dispatched as containers.

In the world of DevOps, we know that nothing does wonders for deployment quality like developers threatened with the prospect of a page at 2am. But for DevNetOps that poetic justice is missing, and the 24/7 support between the vendor-customer wall hardly subdues operator angst when committing a change to roll a deployment. Moreover, the longer the time between a flawed vendor code change and the time it’s caught, the more muddled it gets and the tougher it is to pin.

The best line of defense against these challenges is smaller, more-frequent vendor deliveries, user tests and deployments. Drawing inspiration from the success of how DevOps was bolstered by micro-services, imagine if while we salute DevNetOps continuous and agile operations today, we compel vendors and architectural commissioners to uphold designs for finer-grained felicitous micro-services, devices and networks for a luckier tomorrow.



For more information on defining DevNetOps and DevSecOps, see this article and my short slideshare:

New Heroes in the DevOps Saga: DevSecOps and DevNetOps by James Kelly


This article was originally published on September 26 at

The evolution of DevOps is by no means done, but it’s safe to say that there is enough agreement and acceptance to declare it a hero. DevOps has helped glorify IT to the point where it’s no longer the department of business prevention, nor merely a provider or a partner of the business.

Often IT is the business, or its vanguard for competitive disruption and differentiation.

Splintering off the success of this portmanteau hero, we now hear more and more of two trusty sidekicks: DevSecOps and DevNetOps. Still in their adolescence, these tots are frequently misunderstood, are still forming their identities, and still need a lot of development if they’re to enter the IT hall of fame like their forerunner.

Just as the terms look, DevSecOps and DevNetOps are often assumed to be about wrapping DevOps principles around security and networking: operators hope to assuage technical debt and drudgery by automating in proficiency and resiliency. For networking, I’ve covered how there is a lot more to that than coding, but to be sure, these sidekicks certainly espouse operators learning how to develop, while DevOps was equally, if not more, about developers learning to operate.

The Shift Left: SecDevOps and NetDevOps

As if it wasn’t hard enough to tell what DevSecOps and DevNetOps want to be when they grow up, we’ve gone and given them alter egos: SecDevOps (aka “rugged” DevOps) and NetDevOps. Think about them exactly as the words look – it’s about the shift to the left. Left of what?

Traditional DevOps practices focus on business-specific application development. The development timeline is known as concept to cash, and with all the superpowers of DevOps we try to reduce our enemy: the lead time of the repeatable processes between code and cash.

Security and building infrastructure – like networks – were supporting tasks, not revenue-generating nor competitive advantages. Thus, security and networking were far to the right on the timeline with concerns that deal with operational scale, performance and protection.

Today’s shift left propels security and infrastructure considerations earlier on the timeline, into coding, architecture and pre-production systems. It’s a palpable penny-drop amid daily news of security breaches and infrastructure outages causing technology-defined establishments to bleed money and brand equity.

Fill the bucket with cash, but don’t forget to forestall the leaks!

DevOps and Infrastructure: Challenge and Opportunity

Automation sparks have flown over the proverbial wall into the camp of I&O pros. Operators trading physical for virtual, macro for micro, converged for composed, and configuration for code is proof that the fire has caught security and networking. Controlling the burn now is key, so that healthier skills and structures arise in place of the I&O dogma and duff. Fortunately, this is precisely the destiny for our newfound heroes, DevSecOps and DevNetOps.

However, in doing DevSecOps and DevNetOps, embracing security and networks as code, we mustn’t be so credulous as to forget the formidable DevOps practices and patterns that need transforming along the ultimate automation journey. Testability, immutability, upgradability, traceability, auditability, reliability, and other -abilities are not straightforward to achieve.

Discounting “aaS” technology consumed as a service, a fundamental challenge to innovating SecOps and NetOps, compared to application ops, is that applications are crafted and built; security and networking solutions are mostly still bought and assembled.

Security and network infrastructure as code is something that needs to be co-created with the vendors. Other than in the cloud, it will take a while before security and networking systems are driven API-first, and are redesigned and broken down to offer simulation, composition and orchestration with scale and resilience.

While this will land first in software-defined infrastructure, there is still a ways to go to manage most software-defined security and networking systems with continuous practices of artifact integration, testing, and deployment. Hardware and embedded software will be even more challenging.

Finding Strength in Challenge

So on one hand, DevOps is evolving with security and networking shifting left. On the other hand, traditional security and networking ops are transforming with DevOps principles.

Is the ultimate innovation to squeeze out those traditional operations altogether? Does NetDevOps + DevNetOps = DevOps?

There is a parallel train of thought and debate, with success on both sides. Purist teams cut out operations with the “you build it, you run it” attitude. Other companies like Google have dedicated operations specialist teams of SREs. While the SRE reporting structure is isolated, SRE jobs are deeply integrated with those of development teams. It’s easy to imagine the purist approach subsuming security and networking into DevOps practices, but only if we assume the presence of cloud infrastructure and services as a platform. Even then, there is still substantiation for the SRE.

Layers below, however, somebody still needs to build the foundations of the cloud IaaS and data center hardware. As they say, “Even serverless computing is not actually serverless.”

Underpinning the clouds are data centers. And then there’s transport, IoT, mobile or other secure networks to and between clouds. In these areas, it’s obvious there is a niche for our two trusty sidekicks, DevSecOps and DevNetOps, to shake up ops culture and principles. These two heroes can rescue software-defined and physical infrastructure from the clutches of so many anti-pattern evils, like maintenance windows and change controls (ahem, it’s called a “commit”).

We may not require rapid experimentation in our infrastructure, but we would warmly welcome automated deployments, automated updates, failure and attack testing drills, and intent-driven continuous response. They will boost resiliency and optimization for the business and peace of mind for the builders.

Teams operating security, networks, and especially clouds, need to honor and elevate DevSecOps and DevNetOps, so that on the journey now afoot, our teams and our new heroes may realize their potential.


Good Habits to Make the Multi-Cloud Work For You – Part 2 of 2 by James Kelly


This article was originally published on September 12 at Data Center Knowledge

In my previous article, I talked about the state of infatuation with hybrid and multi-cloud environments. Would you be surprised that in the stresses and mania surrounding IT cloud strategy, some folks fixate more on the playing field than the game itself? You probably already know that you’ve got to get your head in the game in this unforgiving age, and a winning strategy for digitally speeding and feeding the business across the multi-cloud is not: taste the rainbow; it’s choosing and consuming cloud wisely.

Too bad that how you do so isn’t obvious, and as if it wasn’t difficult enough to anticipate technology turns ahead, there are so many captivating cloud services that might lead you down treacherous roads to traps and debt. But there are also well-known tactics emerging that you can model to ready and steady your organization for change and success. Like most, if your journey has already begun, you’re picking these up along the way and adjusting your habits as you go.

You know how bad habits are easy to form and hard to live with? Similarly, it’s very easy to jump into multi-cloud or unwittingly let it happen to you. At this precipice, the warning signs and early stories of cloud lock-in, overwhelming multiple-cloud context switches, runaway expenses, and situational blindness, are hopefully enough to grab your attention. Multi-cloud is inevitable; these fatalities are not.

A multi-cloud platform is a powerful environment, and it requires proper preparation so you can control it, instead of it controlling you. With that, here are four of the best preparations I’ve seen, like good habits that are hard to form, but easy to live with.

1. Unify Your Toolchain

In the eternal deluge and disruption of new tech tooling and systems, remember those good old-fashioned IT values of standardization and consolidation? Don’t throw those babies out with the ITIL bathwater.

As you embrace cloud and bimodal IT with new and improved tools, you might loosen the reins on your traditional values, using public cloud and building private cloud infrastructure alongside your physical and virtualized data centers. In loosening the reins or spinning out agile side projects, just watch out for the trap of hasty developers rolling their own stack or going stackless/serverless, only to get caught in a web of proprietary cloud services.

Don’t rush an obstinate knee-jerk to block this either. Think of a unified toolchain effort as one with the developers to rationalize a base DevOps pipeline, cluster, and middleware stack that could serve 80 percent of projects.

  • Your tools need to work on any cloud infrastructure, and if they can work with your legacy infrastructure, even better.
  • Free yourself from lock-in of cluster and pipeline orchestration tools and infrastructure-as-code lifecycle management: keep them untethered from any specific underlying IaaS, with portable shims like Terraform.
  • While you don’t want to throttle developers back from using services outside of your stack – they’ll go around you anyway – encourage managed open-source-based services. Then incorporate such services into your middleware toolchain as it matures. Tools like Helm make it easier than ever to manage services yourself.

If you’re a lean IT shop, let’s face it, following this to the letter may take you away from getting to market ASAP. Maybe you’re a startup or in that mode? You don’t just want, but need, to focus on developing your core competitive technology, not a portable multi-cloud toolchain.

How do you balance moving fast, employing low-hanging SaaS, with the concern of vendor and architectural lock-in?

If a tool is a competitive differentiator, then you should probably build it. Otherwise, remember there are a lot of open-source tools that are glued together with reference implementations of other open-source tools: large projects like Kubernetes and Spinnaker are easy to adopt with a bunch of pre-canned sensible defaults. Another option is to choose managed open-source services that are more easily insourced later or offered by multiple cloud vendors.

Finally, software design is probably the most important and challenging factor of all. Architecting for scale is obvious, but flexibility enables business agility; so consider not only today’s lock-in, but also getting locked out of a competitive advantage tomorrow. Assembling API-driven (capital ‘S’) Services from micro-services is a well-established pattern to do this, and I’d recommend software alchemists investigate evolutionary architecture from ThoughtWorks for more wisdom.

2. Connect Your Clouds

Connection was a given in the world of hybrid cloud. That still holds true. However, cloud bursting, the most bombastic of all use cases for hybrid cloud, is the least common. Multiple clouds need to be connected together for many more realistic and common use cases:

  • Imagine pipeline automation that includes environments or steps split across clouds. Dev/test can happen anywhere, but you may have higher requirements for staging/production.
  • Secure data replication for warehousing or distributed applications, and backups for disaster recovery and avoidance.
  • Split application tiers, where the various tiers have different non-functional requirements like sovereignty, security, scale and performance that must be met in different geographies or optimized with split economics. Some applications may also be split for functional reasons, because certain clouds have unique advantages that others can’t reproduce.

Such cloud interconnections demand higher security than using the internet, and often clouds simply require a secure connection back to your enterprise staff or users. Beyond security, unique routing and legacy layer-2 unicast or even multicast connectivity could be an application requirement. While there are cloud-vendor tools for basic security and networking, you can also search for your own software-defined and virtualized security and networking solutions that are agnostic to any cloud infrastructure, unifying this toolchain too and incorporating it into your infrastructure-as-code policies.
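To make that last point concrete, such an interconnect can itself be captured as code alongside the rest of your infrastructure. A minimal sketch, using a hypothetical schema (every name and field below is invented for illustration, not any vendor’s actual format):

```yaml
# Hypothetical interconnect-as-code declaration; names and fields are
# illustrative only. The idea: secure cloud-to-cloud links live in
# version control next to the rest of your infrastructure-as-code.
interconnects:
  - name: aws-to-gcp-replication
    type: ipsec-vpn          # secure tunnel rather than bare internet
    endpoints:
      - cloud: aws
        gateway: vgw-prod-east
      - cloud: gcp
        gateway: vpn-gw-us-east1
    routing: bgp             # dynamic route exchange between clouds
  - name: dc-to-aws-legacy
    type: l2-extension       # legacy layer-2 reach into the cloud
    vlan: 220
```

The point of the sketch is that the interconnect policy becomes reviewable, versioned and repeatable, just like any other infrastructure-as-code artifact.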

3. Harmonize, Unify and Simplify Policy

If you have software deployed or scaled across multiple cloud locations, the configuration, monitoring and automatic-response systems may get unwieldy unless you seek to elevate your orchestration across clouds. Of course, there are cloud management platforms for this. With or without them, you can also do some multi-cloud management with your own centrally harmonized configurations and management as code. A further step might unify configuration and management with global controllers, but with the track record of humans causing most errors, be careful with your blast radius for a fat-finger typo.
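As a sketch of “centrally harmonized configuration as code,” imagine a single policy file that your own tooling renders into each cloud’s native format (the schema here is hypothetical, purely to illustrate the shape of the idea):

```yaml
# Hypothetical harmonized multi-cloud policy, kept in one repo and
# rendered per cloud by your own tooling. Schema is illustrative.
defaults:
  tags:
    owner: platform-team
    cost-center: "4242"
  baseline: deny-all-ingress   # one shared security posture everywhere
clouds:
  aws:
    regions: [us-east-1, eu-west-1]
  gcp:
    regions: [us-east1]
  private-dc:
    regions: [nyc1]
```

Keeping the defaults in one place limits drift between clouds, while the per-cloud sections bound the blast radius of any single change.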

Another trend in provisioning models and APIs is abstraction, which can be at many levels like multi-cloud orchestration, individual stack, pipeline or application. By making things more intuitive and concise for humans and leaving the execution to your software machinery and machine learning, you’re likely to improve the lives of your operators, your applications and your application users.

4. Hold Up Before You Speed Up

Cloud will move you faster, and if that’s not enough cause for care, even with no IT strategy, you’ll still end up with multi-cloud in no time: multiple owners, vendors, regions, and availability zones. The increased danger is that multi-cloud can multiply messes and mistakes. Preparation in building a platform is key, and like many things that take a bit of time upfront, it’s worth the effort in the long run.

The discipline to hold up on isolated quick gains, the short-term one-offs that generally beget debt down the road, is the critical gambit that returns long-term payouts in adaptability and speed atop a united multi-cloud platform.

IT leaders know that digital transformation is a journey, not a destination. With continuous learning, the first of all healthy continuous IT practices, mastering the tactics and good habits for structuring your multi-cloud platform and using the ins and outs of devops atop it, can be fun and rewarding. It allows safe acceleration and agility for IT, and it’s essential to sustainably advance the speeds and smarts of your business.

Multi-cloud is A Reality, Not A Strategy - Part 1 of 2 by James Kelly


This article was originally posted on August 25th, 2017 on Data Center Knowledge.

So you’re doing cloud, and there is no sign of slowing down. Maybe your IT strategies are measured, maybe you’re following the wisdom of the crowd, maybe you’re under the gun, you’re impetuous or you’re oblivious. Maybe all of the above apply. In any case, like all businesses, you’ve realized the future belongs to the fast and that cloud is the vehicle for your newly dubbed software-defined enterprise: a definition carrying what I call onerous “daft pressures” for harder, better, faster, stronger IT.

You may as well be solving the climate-change crisis because to have a fighting chance today, it feels like you have to do everything all at once.

Enduring such exploding forces and pressures, most enterprises simultaneously devour cloud left, right and center; the name for said zeitgeist: multi-cloud. If you’ve not kept abreast, overwhelming evidence – like years of the State of the Cloud Report and other analyst research – shows that multi-cloud is the present direction for most enterprises. But the name “multi-cloud” would suggest that using multiple clouds is all there is to it. Thus, in the time it takes to read this article, any technocrat can have their own multi-cloud. It sounds too easy, and as they say, “when life looks like Easy Street, danger is at the door.”

Apparition du Jour

While some cloud pundits forecast that the future of IT infrastructure will shift from hybrid cloud to multi-cloud, or go round and round in sanctimonious circles, the perceptive have realized that multi-cloud is merely a catch-all moniker, rather than an IT pattern to model. As with many terms washed over too much, “multi-cloud” has quickly washed out. It’s delusional as a strategy, and isn’t particularly useful, other than as le mot du jour to attract attention. How’s that for hypocrisy?

Just like you wouldn’t last on only one food, it’s obvious that you choose multiple options at the cloud buffet. How you choose and how you consume is what will make the difference to whether you perish, survive or thrive. Choosing and consuming wisely and with discipline is a strategy.

Is the Strategy Hybrid Cloud?

Whether you want to renew your vows to hybrid cloud or divorce that term, somewhere along the way, it got confused with bimodal, hybrid IT and ops. Nonetheless, removing the public-plus-private qualification, there was real value in the model to unite multiple clouds for one purpose. It’s precisely that aspect that would also add civility to the barbarism of blatantly pursuing multi-cloud. However, precipitating hybrid clouds for a single application with, say, the firecracker whiz-bang of the lofty cloud bursting use case, is really a distraction. The greater goal of using hybrid or multiple clouds should be as a unified, elastic IT infrastructure platform, exploiting the best of many environments.

Opportunities for Ops

Whether public or private, true clouds (which are far from every data center—even those well virtualized) ubiquitously offer elastic, API-controllable and observable infrastructure atop which devops can flourish to enable the speeds and feeds of your business. An obvious opportunity and challenge for IT operations professionals is in building true cloud infrastructure, but if that’s not in the cards for your enterprise, there is still even greater opportunity in managing and executing the strategy of choosing and consuming cloud with wisdom and discipline.

In the new generation of IT, ops discipline doesn’t mean being the “no” person, it means shaping the path for developers, with developers. This is where devops teaches us that new fun exists to engineer, integrate and operate things. It’s just that they are stacks, pipelines and services, not servers. Of course, to ride this elevator up, traditional infrastructure and operations pros need to elevate their skills, practicing devops kung-fu across the multi-cloud and possibly applying it in building private clouds too.

Imagine what would happen if we exalted our trite multi-cloud environment with strategies and tactics to master it. We can indeed extract the most value from each cloud of choice, avoid cloud lock-in, and ensure evolvability, but you probably wouldn’t be surprised to learn that there are also shortcuts with new technical-debt spirals lurking ahead.

To secure the multi-cloud as a platform serving us, and not the other way around, some care, commitment, time and work is needed toward forming the right skills, habits, and using or building tooling, services, and middleware. There is indeed a maturing shared map leading to portability, efficiency, situational awareness and orchestration. In my next article, we’ll examine some of the journey and important strategies to use the multi-cloud for speed with sustainability. Stay tuned.


It’s the End of Network Automation as We Know It (and I Feel Fine) by James Kelly

This article was originally posted on July 5th on The New Stack.

Network automation does not an automated network make. Today’s network engineers are frequently guilty of two indulgences. First, random acts of automation hacking. Second, pursuing aspirational visions of networking grandeur — complete with their literary adornments like “self-driving” and “intent-driven” — without a plan or a healthy automation practice to take them there.

Can a Middle Way be found, enabling engineers to set achievable goals, while attaining the broader vision of automated networks as code? Taking some inspiration from our software engineering brethren doing DevOps, I believe so.

How Not to Automate

There’s a phrase going around in our business: “To err is human; to propagate errors massively at scale is automation.”

Automation is a great way for any organization to speed up bad practices and #fail bigger. Unfortunately, when your business is network ops, the desire to be a cool “Ops” kid with some “Dev” chops — as opposed to just a CLI jockey — will quickly lead you down the automation road. That road might not lead you to those aspirational goals, although it certainly could expand your blast radius for failures.

Before we further contemplate self-driving, intent-driven networking, and every other phrase that’s all the rage today (although I’m just as guilty of such contemplation as anyone else at Juniper), we should take the time to define what we mean by “proper” in the phrase, “building an automated network properly.”

If you haven’t guessed already, it’s not about writing Python scripts. Programming is all well and good, but twenty minutes of design often really does save about two weeks of coding. To start hacking at a problem right away is probably the wrong approach. We need to step back from our goals, think about what gives them meaning, apply those goals to the broader picture, and plan accordingly.

To see what is possible with automation, we should look at successful patterns of automation outside of networking and the reasons behind them, so we may avoid the known bad habits and anti-patterns, and sidestep avoidable pitfalls. For well-tested patterns of automation, we needn’t look any further than the wealth of knowledge and experience in the arena of DevOps.

It matters what we call things. For better or worse, a name focuses the mind. The overall IT strategy to improve the speed, resilience, quality and intelligence of our applications is not called automation or orchestration. While ITIL volumes make their steady march into museums, the new strategy to enable business speed and smarts is incontrovertible, and it’s called DevOps.

Initially that term may evoke a blank page or even a transformation conundrum. But you can learn what it means to practice successful DevOps culture, processes, design, tooling and coding. DevOps can define your approach to the network, and is why we ought not promote network automation (which could focus the mind on the wrong objectives) and instead talk about DevNetOps: the application of DevOps patterns to networking.

Networks as Code

The idea of infrastructure-as-code (IaC) has been around for a while, but surprisingly has seldom been applied to networking. Juniper Networks (where I hang my hat) and other networking vendors like Apstra have made some efforts over the years to move folks in this direction, but there is still a lot of work to do. For example, Juniper has had virtual form factors of most series of hardware systems, projects like Junosphere for network modeling in the cloud (many of us now use Ravello), and impressive presentations on IaC and professional services consulting. Juniper’s senior marketing director Mike Bushong (formerly with Plexxi) wrote about the network as code back in 2014.

IaC is generally well applied to cloud infrastructure, but it’s way harder to apply to bare metal. For evidence of this, just look at TripleO, Kubernetes on Kubernetes, or Kubernetes on OpenStack on Kubernetes! That bottom, metal layer is quite the predicament.

To me, this means in networking, we should be easily applying IaC to software defined networking (SDN). But can we apply it to our network devices and manage the physical network? I asked Siri, and she said it was a trick question. As an armchair architect myself, I don’t have all the answers.  But as I see it, here are some under-considered aspects for designing networks as code with DevNetOps:

1. Tooling

In tech, everyone loves shiny objects, so let’s start there. Few network operators — even those who have learned some programming — are knowledgeable about the ecosystem of DevOps tooling, and few consider applying those same tools to networking. Learning Python and Ansible is just scratching the surface. There is a vast swath of DevOps tools for CI/CD, site reliability engineering (SRE), and at-scale cloud-native ops.

2. Chain the tool chain: a pipeline as code

When we approach the network as code, we need to consider network elements and their configurations as building blocks created in a code-development pipeline of dev/test/staging/production phases. Stringing together this pipeline shouldn’t be a manual process; it should be mostly coded to be automatic.

As with software engineering, there are hardware and foundational software elements with network engineering, such as operating systems that the operator will not create themselves, but rather just configure and extend. These configurations and extensions, with their underlying dependencies, can be built together, versioned, tested, and delivered. Thinking about the network as an exercise in development, automation should start in the development infrastructure itself.
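A network-artifact pipeline of this kind might be declared in the style of common YAML-driven CI tools. The stage names and scripts below are assumptions, purely illustrative of the shape such a pipeline could take:

```yaml
# Illustrative pipeline-as-code for network building blocks, in the
# style of common YAML-driven CI tools. All scripts are hypothetical.
stages: [build, test, staging, production]

build-artifact:
  stage: build
  script:
    - ./bundle.sh vendor-os configs extensions  # version OS + configs together

validate:
  stage: test
  script:
    - ./lint-configs.sh                         # static checks on configs
    - ./unit-test-routing-policy.sh

virtual-topology:
  stage: staging
  script:
    - ./boot-virtual-network.sh && ./integration-tests.sh

rollout:
  stage: production
  when: manual                                  # human gate before production
  script:
    - ./canary-rollout.sh
```

The pipeline definition itself lives in version control next to the configurations it builds, so the automation infrastructure is as reproducible as the artifacts it produces.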

3. Immutable infrastructure

Virtualization and especially containers have made the concept of baking images very accessible, and immutable infrastructure popular. While there is still much work to do with network software disaggregation, containerization, and decoupling of services, there are many benefits of adopting immutable infrastructure that are equally applicable to networking. Today’s network devices are poster children for config drift, but to call them “snowflakes” would be an insult to actual snowflakes.

Applying principles of immutable infrastructure, I imagine a network where each device PXE-boots into a minimal OS and runs signed micro-service building blocks. Each device has declarative configs, decoupled secret management and rotation, and logging and other monitoring data with good overall auditability and traceability — all of which is geared to take the network off the box ASAP.
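Such a device might be described entirely declaratively, something like the hypothetical spec below (every field name is invented for illustration; no real NOS uses this schema):

```yaml
# Hypothetical declarative spec for an immutable network device.
# The device PXE-boots the referenced image; nothing is mutated in
# place -- a change means building and booting a new versioned image.
device: leaf-101
image:
  name: netos-minimal
  version: 4.2.1
  digest: "sha256:<verified-at-boot>"  # signed, immutable artifact
services:
  - name: routing-engine               # signed micro-service block
    version: 4.2.1
secrets:
  source: external-vault               # decoupled secret management/rotation
telemetry:
  export-to: collector.example.net     # logs and metrics shipped off-box
```

Because the spec is the source of truth, config drift is detectable by diffing the running device against the declaration, and remediation is a re-image rather than a hand-edit.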

Interestingly, practices such as SSH’ing into boxes would be rendered impossible, and practices that “savvy” network automators do today like running Ansible playbooks against an inventory of devices would be banished.

4. Upgrades

Upgrades to network software and even firmware/microcode on devices could be managed automatically, by means of canary tests and rolling upgrade patterns. To do this on a per-box or per-port basis, or at finer levels of flows or traffic-processing components, we need to be able to orchestrate traffic balancing and draining.

If that sounds complex, we can make things simpler. We could treat devices and their traffic like cattle instead of pets, and rely on their resilience. Killing and resurrecting a component would restart it with a new version. While this is suitable for some software applications, treating traffic as disposable is not yet desirable for all network applications and SLAs. Still, it would go a long way toward properly designing for resilience.
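As a thought experiment, an upgrade policy sketching the canary-and-drain pattern above could look like the hypothetical declaration below (the schema and thresholds are invented for illustration):

```yaml
# Hypothetical rolling-upgrade policy for network devices, sketching
# canary testing plus traffic draining. Not a real tool's schema.
upgrade:
  target-version: 4.2.2
  strategy: rolling
  canary:
    devices: 1                  # upgrade one box first
    soak-minutes: 30            # watch it before proceeding
    abort-if:
      - "packet-loss-pct > 0.01"
      - "bgp-flaps > 0"
  drain:
    method: graceful-shutdown   # steer traffic away before reboot
    timeout-minutes: 10
```

The interesting design choice is in the abort conditions: they turn the canary from a hopeful pause into an automated, measurable gate.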

5. Resilience

One implication of all this is the presence of redundancy in the network paths.  As with any networking component, that’s very important for resilience. Drawing inspiration from scale-out architectures in DevOps and the microservices application model, redundancy and scale would go hand-in-hand by means of instance replication. So redundancy would neither be 1:1 nor active-passive. We should always be skeptical of architectures that include those anti-patterns.

Good design would tolerate a veritable networking chaos monkey. Burning down network software would circuit-break to limit failures. Killing links and even boxes, we would quickly re-converge as we often do today, but dead boxes, dead SDN functions or dead virtual network functions would act like phoenix servers, rising back up or staying down in case of repeated failures or detected hardware failures.

The pattern for preventing black-swan event failures is to practice forcing these failures, and thus practice automating around them, so that the connectivity or other network service and its performance is tolerant and acceptably resilient on its own SLA measuring stick, whatever the meta-service in question may be.

Doing DevNetOps

In each one of these above topics lies much more complexity than I will dive head-first into here. By introducing them here, my aim has been to demonstrate there are interesting patterns we may draw from, and some operators are doing so already. If you’ve ever heard the old Zen Buddhist koan of the sound of one hand clapping, that’s the sound you’re likely to hear from your own forehead, once the obviousness of applying DevOps to DevNetOps hits you squarely in the face.

Just as the hardest part of adopting DevOps is often cited to be breaking off one manageable goal at a time and focusing on that, I think we’ll find the same is true of DevNetOps. Before we even get there in networking, I think we need to properly scope the transformation of applying DevNetOps to the challenges and design of networking, especially with issues of basic physical connectivity and transport.

While “network automation” leads the mind to jump to things like applying configuration management tooling and programming today’s manual tasks, DevNetOps should remind us that there is a larger scope than mere automation coding.  That scope includes culture, processes, design and tools. Collectively, they may all lead us to a happier place.


Title image of a Bell System telephone switchboard, circa 1943, from the U.S. National Archives.

Are Service Meshes the Next-Gen SDN? by James Kelly

June 28, 2017 update: more awesome background on service meshes, proxies and Istio in particular on yet another new SE Daily podcast with Istio engineers from Google.

June 26, 2017 update: For great background on service meshes (a relatively new concept) check out today's podcast on SE Daily with the founder of Linkerd.

Whether you’re adopting a containers- or functions-as-a-Service stack, or both, in the new world of micro-services architectures, one thing that has grown in importance is the network because:

  1. Micro-service application building blocks are decoupled from one another over the network. They re-integrate over the network with APIs as remote procedure calls (RPC) that have evolved from the likes of CORBA and RMI, past web services with SOAP and REST, to new methods like Apache Thrift and the even-fresher gRPC: a new CNCF project donated by Google for secure and fast HTTP/2-based RPC. RPC has been around for a long time, but now the network is actually fast enough to handle it as a general means of communication between application components, allowing us to break down monoliths where service modules would previously have been bundled or coupled with tighter API communications based on package includes and libraries (some of us may even remember more esoteric IPC).
  2. Each micro-service building block scales out by instance replication. The front-end of each micro-service is thus a load balancer which is itself a network component, but beyond that, the services need to discover their dependent services, which is generally done with DNS and service discovery.
  3. To boost processing scale-out and engineer better reliability, each micro-service instance is often itself decoupled from application state and its storage. The state is saved over the network as well: for example, using an API into an object store, a database, a k/v-store, a streaming queue or a message queue. There is also good-ol’ disk, but such disks and their accompanying file systems may themselves be network-mounted virtual volumes. The API- and RPC-accessible variants of storing state are themselves systems that are micro-services too, and probably the best example of actually using disks. They also incorporate a lot of distributed storage magic to deliver on whatever promises they make, and that magic is often performed over the network.

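The load-balancing front-end from point 2 is exactly what Kubernetes models as a Service. A minimal example (the application names here are illustrative):

```yaml
# A Kubernetes Service: the network front-end for a replicated
# micro-service. kube-proxy (or an SDN replacing it) balances traffic
# across all pods matching the selector, and cluster DNS provides
# service discovery by name ("catalog").
apiVersion: v1
kind: Service
metadata:
  name: catalog
spec:
  selector:
    app: catalog        # all replicas of the catalog micro-service
  ports:
    - port: 80          # port clients use on the service's virtual IP
      targetPort: 8080  # container port on each instance
```

This one small object bundles points 1 and 2 above: a stable name and virtual IP for RPC callers, plus load balancing across however many instances exist at the moment.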
Hopefully we’re all on the same page now as to why the network is important micro-services glue. If this was familiar to you, then maybe you already know about cloud-native SDN solutions and service meshes too.

The idea and implementation of a service mesh is fairly new. The topic is also garnering a lot of attention because service meshes handle the main networking challenges listed above (especially #1 and #2), and much more in the case of new projects like the CNCF project Linkerd and the newly launched Istio.

Since I’ve written about SDN for container stacks before, namely OpenContrail for Kubernetes and OpenShift, I’m not going to cover it super deeply. Nor am I going to cover service meshes in general detail except to make comparisons. I will also put some references below and throughout. And I’ve tried to organize my blog by compartmentalizing the comparisons, so you can skip technical bits that you might not care for, and dive into the ones that matter most to you.

So on to the fun! Let’s look at some of the use cases and features of service meshes and compare them to SDN solutions, mostly OpenContrail, so we can answer the question posed in the title: are service meshes the “next generation” of SDN?

Automating SDN and Service Meshes

First, let’s have a look at 3 general aspects of automation in various contexts where SDN and service meshes are used: 1 - programmability, 2 - configuration and 3 - installation.

1.     Programmability
When it comes to automating everything, programmability is a must. Good SDNs are untethered from the hardware side of networking, and many, like OpenContrail, offer a logically centralized control plane with an API. The main two service meshes introduced above do this too, and they follow an architectural pattern similar to SDNs: a centralized control plane with a distributed forwarding-plane agent. While Istio has a centralized control-plane API, Linkerd is more distributed but offers an API through its Namerd counterpart. Most people would probably say the two service meshes’ gRPC APIs are more modern and advantageous than the OpenContrail RESTful API, but then again OpenContrail’s API is very well built out and tested compared to Istio’s still-primordial API functions.

2.     Configuration
A bigger difference than the API is in how functionality can be accessed. The service meshes take in YAML to declare configuration intent, which can be delivered through a CLI. I suppose most people would agree that’s an advantage over SDNs that don’t offer that (at least OpenContrail doesn’t today). In terms of a web-based interface, the service meshes do offer those, as do many SDNs. OpenContrail’s web interface is fairly sophisticated after five years of development, yet still modern-feeling and enterprise-friendly.

Looking toward “network as code” trends, however, CLI and YAML are more easily coded and version-controlled than, say, OpenContrail’s API calls. In an OpenStack environment OpenContrail can be configured with YAML-based Heat templates, but that’s less relevant for the container-based K8s and OpenShift world. In a K8s world, OpenContrail SDN configuration is annotated into K8s objects. It’s intentionally simple, so it exposes just a fraction of the OpenContrail functionality. It remains to be seen what will be done with K8s TPRs, ConfigMaps or through some OpenContrail interpreter of its own.

3.     Installation
When it comes to getting going with Linkerd, having a company behind it, Buoyant, means anyone can get support, but getting through day one looks pretty simple on one’s own anyway. Deployed with Kubernetes as a DaemonSet, it is straightforward to use out of the box.

Istio is brand new, but already has Helm charts to deploy it quickly with Kubernetes thanks to our friends at Deis (@LachlanEvenson has done some amazing demo videos already – links below). Using Istio, on the other hand, means bundling its Envoy proxy into every Kubernetes pod as a sidecar. It’s an extra step, but it looks fairly painless with the kube-inject command. Sidecar vs. DaemonSet considerations aside, this bundling is doing some magic, and it’s important to understand for debugging later.

When it comes to SDNs, they’re all different with respect to deployment. OpenContrail is working on a Juniper-supported Helm chart for simple deployment, but in the meantime there are Ansible playbooks and other comparable configuration-management solutions offered by the community.

One thing OpenContrail has in common with the two service meshes is that it is deployed as containers. One difference is that OpenContrail’s forwarding agent on each node is both a user-space component and a kernel module (or DPDK-based or SmartNIC-based). They’re containerized, but the kernel module is only there for installation purposes, to bootstrap the insmod installation. You may feel ambivalent toward kernel modules. The kernel module will obviously streamline performance and integration with the networking stack, but the resources it uses are not container-based, and thus not resource-restricted, so resource management is different than, say, a user-space sidecar process. Anyway, this is the same deal as using kube-proxy or any iptables-based networking, which the OpenContrail vRouter replaces.

SDN and Service Meshes: Considerations in DevOps

When reflecting on micro-services architectures, we must remember that the complexity doesn’t stop there. There is also the devops apparatus to manage the application through dev, test, staging and prod, and through continuous integration, delivery, and response. Let’s look at some of the considerations:

1.     Multi-tenancy / multi-environment
In a shared cluster, teams shouldn’t trip over each other’s operational contexts, like operator or application dev/test environments. To achieve this, we need isolation mechanisms. Kubernetes namespaces and RBAC help, but there is still more to do. I’ll quickly recap my understanding of the routing in OpenContrail and service meshes to better dissect the considerations for context isolation.

OpenContrail for K8s recap: One common SDN approach to isolation is overlay networks. They allow us to create virtual networks that are separate from each other on the wire (different encapsulation markings) and often in the forwarding agent as well. This is indeed the case with OpenContrail, but OpenContrail also allows higher-level namespace-like wrappers called domains/tenants and projects. Domains are isolated from each other, projects within domains are isolated from each other, and virtual networks within projects are isolated from each other. This hierarchy maps nicely to isolate tenants and dev/test/staging/prod environments, and then we can use a virtual network to isolate every micro-service. To connect networks (optionally across domains and projects), a policy is created and applied to the networks that need connecting, and this policy can optionally specify direction, network names, ports, and service chains to insert (for example, a stateful firewall service).

The way these domains, projects, and networks are created for Kubernetes is based on annotations. OpenContrail maps namespaces to their own OpenContrail project or their own virtual network, so optionally micro-services can all be reachable to each other on one big network (similar to the default cluster behavior). There are security concerns there, and OpenContrail can also enforce ACL rules and automate their creation as a method of isolating micro-services for security, based on K8s object annotations, or implement Kubernetes NetworkPolicy objects as OpenContrail security groups and rules. Another kind of annotation on objects like K8s deployments, jobs or services could specify the whole OpenContrail domain, project, and virtual network of choice. Personally, I think the best approach is a hierarchy designed to match the structure of devops teams and environments, making use of the OpenContrail model of segmentation by domain, project and network. This is (unfortunately) in contrast to the simpler yet more frequently used global default-deny rule and the ever-growing whitelist that ensues, turning your cluster into Swiss cheese. Have fun managing that :/
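For reference, here is what one brick of that whitelist looks like as a standard Kubernetes NetworkPolicy (the application names are illustrative); implementations like OpenContrail translate these objects into security groups and ACL rules:

```yaml
# Kubernetes NetworkPolicy allowing only frontend pods to reach the
# catalog pods on port 8080. Once a policy selects the catalog pods,
# all other ingress to them is denied by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: catalog-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: catalog          # the pods this policy protects
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend # only these pods may connect
      ports:
        - protocol: TCP
          port: 8080
```

Each allowed flow is one more such object, which is exactly how the whitelist grows over time.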

The overlay for SDN is at layers 2, 3 and 4, meaning that when the packet is received on the node, the vRouter (in OpenContrail’s case) will receive the packet destined to it and look at the inner header (the VXLAN ID or MPLS label) to determine the domain/tenant, project and network. Basically, the number identifies which routing table will be used as a lookup context for the inner destination address, and then (pending ACLs) the packet is handed off to the right container/pod interface (per CNI standards).

Service mesh background: The model of Istio’s Envoy and Linkerd, insofar as they are used (which can be on a per-micro-service basis), is that a layer-7 router and proxy sits in front of your micro-services. All traffic is intercepted at this proxy and tunneled between nodes. Basically, it is also an overlay, at a higher layer.

The overlay at layer-7 is conceptually the same as SDN overlays except that the overlay protocol over the wire is generally HTTP or HTTP2, or TLS with one of those. In the DaemonSet deployment mode of Linkerd, there is one IP address for the host and Linkerd will proxy all traffic. It’s conceptually similar to the vRouter except in reality it is just handling HTTP traffic on certain ports, not all traffic. Traffic is routed and destinations are resolved using a delegation table (dtab) format inherited from Finagle. In the sidecar deployment model for Linkerd or for Istio’s Envoy (which is always a sidecar), the proxy is actually in the same container network context as each micro-service because it is in the same pod. There are some iptables tricks they do to sit between your application and the network. In Istio Pilot (the control plane) and Envoy (the data plane), traffic routing and destination resolution is based primarily on the Kubernetes service name.

With that background, here are a few implications for multi-tenancy.

Let’s observe that in the SDN setup, the tenant, environment and application (network) classification happens in the kernel vRouter. In service mesh proxies, we still need a CNI solution to get the packets into the pod in the first place. In Linkerd, we need dtab routing rules that include tenant, environment and service. Dtabs seem to give a good, manageable way to break this down. In the sidecar mode, more frequently used for Envoy, it’s likely that the pod in which traffic ends up already has a K8s namespace associated with it, and so we would map a tenant or environment outside of the Istio rules, and just focus on resolving the service name to a container and port when it comes to Envoy.

It seems that OpenContrail has a good way to match the hierarchy of separate teams and to separate out those RBAC and routing contexts. Linkerd dtabs are probably a more flexible way to create as many layers of routing interpretation as you want, but they may need stronger RBAC to allow splitting dtabs among team tenants for security and coordination. Istio doesn’t do much to isolate tenants and environments at all. Maybe that is out of scope for it, which seems reasonable since Envoy is always a sidecar container and you should have underlying multi-tenant networking anyway to get traffic into the sidecar’s pod.

One more point: service discovery is baked into the service mesh solutions, but it is still important in the SDN world, and systems that include DNS (OpenContrail does) can help manage name resolution in a multi-tenant way, as well as provide IP address management (like bring-your-own IPs) across the environments you carve up. This is out of scope for service meshes, but across multiple teams and dev/test/staging/prod environments, it may be desirable to have the same IP address management pools and subnets.


2.     Deployment and load balancing
When it comes to deployment and continuous delivery (CD), the fact that SDN is programmable helps, but service meshes have a clear advantage here because they’re designed with CD in mind.

To do blue-green deployments with SDN, it helps to have floating IP functionality. Basically, we cut over to green (float a virtual IP to the new version of the micro-service) and can safely float it back to blue in case of an issue. As you continuously deliver or promote staging into the non-live deployment, you can still reach it with a different floating IP address. OpenContrail handles overlapping floating IPs to let you juggle this however you want.

Service mesh routing rules can achieve the same thing, but based on routing switchovers at the HTTP level that point to, for example, a newer backend version. What service meshes further allow is gradual traffic rollover: sending a small percentage of traffic at first and then all of it, effectively giving you a canary deployment that is traffic-load-oriented. By contrast, a Kubernetes rolling upgrade or the Kubernetes canary deployment strategy gives you a canary that is instance-count-based and relies on the load balancing across instances to partition traffic.
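The traffic-split idea can be sketched in a few lines of Python; the version names and weights below are illustrative, not any mesh’s actual API:

```python
import random

def pick_backend(weights, rnd=random.random):
    """Choose a backend version according to traffic weights summing to 1.0."""
    r = rnd()
    cumulative = 0.0
    for version, weight in weights:
        cumulative += weight
        if r < cumulative:
            return version
    return weights[-1][0]  # guard against floating-point rounding

# Canary phase: 5% of requests go to v2, regardless of how many v2 pods run.
weights = [("reviews-v1", 0.95), ("reviews-v2", 0.05)]
random.seed(7)  # deterministic for the demo
counts = {"reviews-v1": 0, "reviews-v2": 0}
for _ in range(10_000):
    counts[pick_backend(weights)] += 1
print(counts)  # roughly a 9500 / 500 split
```

The point is that the mesh decouples the traffic percentage from the instance count, which is exactly what an instance-count-based Kubernetes canary cannot do.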

This brings us to load balancing. Balancing traffic between the instances of a micro-service happens by default with the K8s kube-proxy controller and its programming of iptables. There is a performance and scale advantage in using OpenContrail’s vRouter, which uses its own ECMP load balancing and NAT instead of the kernel’s iptables.

Service meshes also handle this load balancing. They support a wider range of features, both in terms of load-balancing schemes like EWMA and in terms of criteria for ejecting an instance from the load-balancing pool, such as when it is too slow.
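As a rough illustration of why an EWMA scheme helps tail latency, here is a toy Python sketch; the decay factor and pod names are made up, and real mesh implementations are considerably more sophisticated:

```python
class Instance:
    """A backend that tracks a decayed average of its observed latencies."""
    def __init__(self, name, alpha=0.3):
        self.name = name
        self.alpha = alpha    # weight given to the newest observation
        self.ewma_ms = 0.0

    def observe(self, latency_ms):
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick(instances):
    """Prefer the instance with the lowest smoothed latency."""
    return min(instances, key=lambda i: i.ewma_ms)

fast, slow = Instance("pod-a"), Instance("pod-b")
for _ in range(20):
    fast.observe(10)    # consistently ~10 ms
    slow.observe(200)   # consistently ~200 ms

print(pick([fast, slow]).name)  # pod-a: slow instances naturally shed traffic
```

Because the average decays, a slow instance that recovers will win traffic back within a few observations rather than being blacklisted forever.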

Of course, service meshes also handle load balancing for ingress HTTP frontending. Linkerd and Istio integrate with the K8s Ingress as ingress controllers. While most SDNs don’t seem to offer this, OpenContrail does have a solution here based on haproxy, an open source TCP/HTTP proxy project. One difference is that OpenContrail does not yet support SSL/TLS, but there are also K8s-pluggable alternatives like nginx for pure software-defined load balancing.

3.     Reliability Engineering
Yes, I categorize SRE and continuous response under the DevOps umbrella. Since service meshes are more application-aware, it’s no surprise that in this area they do the most to further the cause of reliability.

When it comes to reliably optimizing and engineering performance, one point from above is that EWMA and similarly advanced load-balancing policies help avoid or eject slow instances, thus improving tail latency. A Buoyant article addresses performance in terms of latency directly. Envoy and Linkerd are, after all, TCP proxies, and unpacking and repacking a TCP stream is seen as notoriously slow if you’re in the networking world (I can attest to this personally, recalling one project I assisted with that did HTTP header injection for ad placement). Still, processors have come far, and Envoy and Linkerd are probably some of the fastest TCP proxies you can get. That said, there are always sensitive folks who balk at inserting such latency. I found it enlightening that in the test conducted in the article cited above, they added more latency and steps, but because they also added intelligence, they netted an overall latency speed-up!

The consensus seems to be that service meshes solve more problems than they create, latency included. Are they right for your particular case? As somebody highly quoted once said, “it depends.” As with DPI-based firewalls, these kinds of traffic-processing applications can show great latency and throughput with a given feature set and load, but wildly different performance once you turn on certain features or increase the load. It’s not a fair comparison, but the lightweight stateless processing an SDN forwarding agent does will always be far faster than such proxies, especially when, as with OpenContrail, there are smart-NIC vendors implementing the vRouter in hardware.

Another area that needs more attention in terms of reliability is security. As soon as I think of a TCP proxy, my mind turns to protecting against DoS attacks, because so much state is created to track each session. One way service meshes nicely address this is through the use of TLS. While Linkerd can support this, Istio makes it even easier thanks to the Istio Auth controller for key management. This is a great step not only toward securing traffic over the wire (which SDNs could do too, with IPsec etc.), but also toward strong identity-based AAA for each micro-service. It’s worth noting that these proxies can change the wire protocol to anything they can be configured for, regardless of whether the application initiated it as such. So an HTTP request could be sent as HTTP/2 within TLS on the wire.

I’ll cap off this section by mentioning circuit breaking. I don’t know of any way an SDN solution could do this well without interpreting a lot of analytics and application-tracing information and feeding it back into the SDN’s API. Even if that is possible in theory, service meshes already do it today as a built-in feature, gracefully handling failures instead of letting them escalate.
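For readers who haven’t met the pattern, a minimal circuit breaker looks something like the Python sketch below; the thresholds and timings are illustrative, and mesh implementations add per-host tracking, richer half-open probing policies and more:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors instead of hammering a sick backend."""
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result
```

Once tripped, calls raise immediately, so a failing micro-service stops consuming its callers’ threads and timeouts, which is how meshes keep one failure from cascading into many.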

4.     Testing and debugging
This is an important topic, but there’s not really an apples-to-apples comparison of features, so I’ll just hit the prominent points separately.

Service meshes provide an application-RPC-oriented view into the intercommunication among the mesh of micro-services. This information can be very useful for monitoring and ops visibility, and during debugging, by tracing the communication path across an application. Linkerd integrates with Zipkin for tracing and with other tools for metrics, and it works for applications written in any language, unlike some language-specific tracing libraries.

Service meshes also provide per-request routing based on things like HTTP headers, which can be manipulated for testing. Additionally, Istio provides fault injection to simulate failures in the application.
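Conceptually, those rules boil down to header matches plus a probabilistic abort, as in this Python sketch (the header name, service versions and percentages are invented for illustration, not Istio’s actual config syntax):

```python
import random

# Ordered rules: first matching predicate wins; the last rule is a catch-all.
RULES = [
    (lambda headers: headers.get("x-debug") == "1", "ratings-v2"),
    (lambda headers: True,                          "ratings-v1"),
]

# Fault injection: abort a fraction of requests with an error status.
FAULT = {"abort_percent": 10, "abort_status": 503}

def route(headers, rnd=random.random):
    if rnd() * 100 < FAULT["abort_percent"]:
        return ("fault", FAULT["abort_status"])  # simulated backend failure
    for match, destination in RULES:
        if match(headers):
            return ("route", destination)

print(route({"x-debug": "1"}, rnd=lambda: 0.99))  # ('route', 'ratings-v2')
```

Testers can thus steer only their own tagged requests to a new version, while fault injection exercises the application’s error handling without touching any backend.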

On the SDN side of things, solutions differ. OpenContrail is fairly mature in this space compared to the other CNI provider choices. It can run packet captures and sniffers like Wireshark on demand, and its comprehensive analytics engines and visibility tools expose flow records and other traffic stats. Aside from debugging (at more of a network level), there are interesting security applications, such as auditing ACL deny logs. Finally, OpenContrail can tell you the end-to-end path of your traffic if it runs atop a physical network (not a cloud). All of this can help debugging, but the information is far more indirect vis-à-vis the applications and is probably better suited to NetOps.

5.     Legacy and Other Interconnection

Service meshes seem great in many ways, but one hitch to watch out for is whether they allow or block your micro-services from connecting to your legacy services, or to any services that don’t have a proxy in front of them.

If you are storing state in S3 or making a call to a cloud service, that’s an external call. If you’re reaching back to a legacy application like an Oracle database, same deal. If you’re calling an RPC of another micro-service that isn’t on the service mesh (for example, it’s sitting in a virtual machine instead of a container), same again. If your micro-service is supposed to deal with traffic that isn’t TCP, that too won’t be handled through your service mesh (for example, DNS is UDP traffic, and ping is ICMP).

In the case of Istio, you can set up egress connectivity with a service alias, but that may require changes to the application, so a direct pass-through is perhaps a simpler option. There are also many variants of TCP traffic that are neither HTTP nor directly supported as higher-level protocols riding on HTTP; common examples are ssh and the mail protocols.

There is also the question of how service meshes will handle multiple IPs per pod and multiple network interfaces per pod once CNI allows them, which should be soon.

You almost certainly have some communication in your applications that doesn’t quite fit the mesh. In these cases you need to plan not only how to allow this communication, but also how to do it securely, probably with an underlying SDN solution like OpenContrail that can span Kubernetes as well as OpenStack, VMware and metal.

What do you think?

Going back to the original question in the title: Are Service Meshes the Next-Gen SDN?

On one hand: yes! They’re eating a lot of the value some SDNs provided, enabling micro-segmentation and security for RPC between micro-services. Service meshes do this with improved TLS-based security and identity assignment to each micro-service. They also add advanced application-aware load balancing and fault handling that is otherwise hard to achieve without application analytics and code refactoring.

On the other hand: no! Service meshes sit atop CNI and container connectivity. They ride on top of SDN, so they still need a solid foundation. Moreover, most teams will want multiple layers of security isolation when they can get the micro-segmentation and multi-tenancy of SDN solutions without any performance penalty. SDN solutions can also span connectivity across clusters, stacks and runtimes other than containers, and satisfy the latency-obsessed.

Either way, service meshes are a new, cool and shiny networking toy. They offer a lot of value beyond the networking and security functions they subsume, and I think we’ll soon see them in just about every micro-services architecture and stack.

More questions…

Hopefully something in this never-ending blog makes you question SDN or service meshes. Share your thoughts or questions. The anti-pattern of technology forecasting is thinking we’re done, so here are some open questions:

  1. Should we mash up service meshes and fit Linkerd into the Istio framework as an alternative to Envoy? If so, why?
  2. Should we mash up OpenContrail and service meshes, and how?

Resources on Learning About Service Meshes

Serverless Can Smarten Up Your DevOps & Infrastructure Stack by James Kelly

As IT organizations and cultures reshape themselves for speed, we’ve seen a trend away from IT as a service provider, to IT as:

  • a collection of smaller micro-service provider teams,
  • each operating and measuring like autonomous (micro-)product teams,
  • each building and running technology systems as micro-services.

Now you’ve heard me cover the intersection of microservices and containers quite a bit, along with CaaS/Kubernetes or PaaS/OpenShift orchestration. In this blog, I’m going to change it up a bit and talk about the other kind of microservices building block: code functions.

Here I’m talking about microservices run atop a serverless computing, a.k.a. function-as-a-service (FaaS), stack. Serverless is a two-years-new yet already well-covered topic. If you need a primer, this one is the best I’ve found.

In my last blog, I examined avoiding cloud lock-in by bringing your own devops and software-defined infrastructure stack: by bringing your own open-source/standard toolchain across public/private IaaS, you maintain portability and harmonized, if not unified, hybrid-cloud policy. This software-defined infrastructure stack, underpinning the user- or business-facing applications, is still just a bunch of software applications. FaaS is just one example of an increasingly popular app-developer technology that can be employed by infrastructure developers too, though it hasn’t gotten much attention yet in that regard.

If you read the primer above, one thing you won’t have learned is that the popularity of FaaS systems like Lambda for reducing developer friction has led to a number of open source projects like OpenWhisk, and others that sit atop a CaaS/PaaS (such as Fission or Kubeless). If a FaaS is likely to exist in your application stack, it may well make sense to standardize on incorporating one into your devops and infrastructure stack. Let’s see.


When it comes to I&O events, does your mind go to a pager waking somebody up in the middle of the night? Better to wake up a function and keep sleeping.

The main area of natural applicability for FaaS is event processing. Such code functions are, after all, event handlers.

In the devops world, for development and application lifecycle events, FaaS presents a simple way to automate further. The scale is probably low for most CI/CD events, so the main benefit of employing FaaS here is the simplicity and agility of your pipelines as code. It also helps foster the change and process innovation needed to do pipelines as code in the first place. In the devops area of continuous response (CR), which Randy Bias coins here, there is almost certainly more scale and importance in responding to events, where the ability of FaaS to handle on-demand, bursty scale could be a significant benefit.

In the software-defined infrastructure world, there are traditional events such as change logs, alarms and security alerts, plus modern telemetry and analytics events. Of all the types, analytics, telemetry and security events strike me as the area where FaaS could present the biggest benefit and the most interesting examination. Just as CR analytics for an application (processing telemetry such as API calls, clicks and other user events) need scale, software-defined infrastructure is increasingly generating massive amounts of telemetry data that needs processing.

Systems like Juniper Networks AppFormix and Contrail Networking already do some telemetry collection and processing (see for yourself). AppFormix, a platform for smart operations management, in fact processes much of the telemetry with light machine learning on the edge node where it’s generated, so that sending the big data for centralized heavy-duty crunching and machine learning in the control plane is more efficient. We at Juniper have written before about this distributed processing model and its benefits, such as realer real-time visibility monitoring to show you what’s happening, because the devops call, “You build it, you run it,” sounds great until you’re running in a multi-tenant/team/app environment and chaos ensues. AppFormix also has a smart policy engine that uses static thresholds as well as anomaly detection to generate... you guessed it... events!

Just like devops event hooks can trigger a FaaS function, so can these software-defined infrastructure events.
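As a sketch of what such a trigger could look like, here is a hypothetical Python handler a FaaS runtime might invoke once per event; the event schema and the scale-out action are invented for illustration:

```python
import json

def handle_event(event_json):
    """Entry point a FaaS runtime would call once per infrastructure event."""
    event = json.loads(event_json)
    if event.get("type") == "anomaly" and event.get("metric") == "cpu":
        # React to a CPU anomaly by scaling out instead of paging a human.
        return {"action": "scale_out", "target": event["service"], "by": 1}
    return {"action": "ignore"}

alert = json.dumps({"type": "anomaly", "metric": "cpu", "service": "cart"})
print(handle_event(alert))  # {'action': 'scale_out', 'target': 'cart', 'by': 1}
```

The function holds no state and runs only when an event fires, which is precisely the "wake up a function and keep sleeping" model.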

Obviously a system like AppFormix implements much of the CR machine-learning smart glue for the analytics of your stack and its workloads; smart-driving ops, as I call it. Other systems, like Contrail Networking with its analytics Kafka or API interface, are rawer. The same is true of other SD-infrastructure components, like the raw Kubernetes watcher events with which some FaaS systems like Fission already integrate.

The rawer the events, I bet, the more frequent they are too, so you have to decide at what point the frequency becomes so high that it’s more efficient to run a full-time process instead of functions on demand. That depends on the scale at which you want to run your own FaaS tool, or whether you use a public cloud FaaS service (of course, that only makes sense if you’re running the stack on that public cloud and are willing to trade off some lock-in until FaaS offerings mature into standard usability; the Serverless Framework is worth seeing here).
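That break-even point is easy to estimate on the back of an envelope. Every price in this Python sketch is a made-up placeholder, so substitute your own provider’s numbers:

```python
# Hypothetical prices: substitute your provider's real numbers.
PRICE_PER_INVOCATION = 0.0000002     # $ per function invocation
PRICE_PER_GB_SECOND  = 0.0000166667  # $ per GB-second of function runtime
FUNCTION_GB          = 0.128         # memory allocated per invocation
FUNCTION_SECONDS     = 0.1           # average duration per invocation
ALWAYS_ON_MONTHLY    = 15.0          # $ per month for a small always-on VM

PER_EVENT = (PRICE_PER_INVOCATION
             + PRICE_PER_GB_SECOND * FUNCTION_GB * FUNCTION_SECONDS)
SECONDS_PER_MONTH = 60 * 60 * 24 * 30

def faas_monthly_cost(events_per_second):
    return events_per_second * SECONDS_PER_MONTH * PER_EVENT

def break_even_eps():
    """Event rate above which the always-on process is cheaper."""
    return ALWAYS_ON_MONTHLY / (PER_EVENT * SECONDS_PER_MONTH)

print(f"{break_even_eps():.1f} events/sec")  # ~14 events/sec with these numbers
```

With these placeholder prices, a steady stream of more than roughly a dozen events per second already favors a full-time process, while spiky or rare events favor FaaS.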


One end you could have in mind for running operations-oriented functions is greater operational intelligence. As we’ve discussed, AppFormix does this to a good extent, but it does it within a cluster. We can consider further smart automation use cases in-cluster, but by using a FaaS external to your clusters, your infrastructure developers could also scope your intelligence deeper or broader.

For example, you could go deeper, processing telemetry in customized ways that AppFormix does not today. Imagine you’re running on GCP: you could do deep learning on telemetry data, even with TPUs, I might add. To borrow a metaphor from the common ML example of image recognition, where AppFormix might be able to tell you, “I see a bird,” a powerful deep-learning engine with TPU speed-up may, in real-enough time, be able to tell you, “I see a peregrine falcon.” In more practical terms, you may be able to go from detecting anomalies or security alerts to operational and security situational awareness, where your systems can run themselves better than humans can. After all, they already can at lower levels, thanks to systems like Kubernetes, Marathon and AppFormix.

Rather than going deeper, broader examples could be use cases that extend beyond your cluster: for example, auto-scaling or auto-healing of the cluster by integrating with physical DCIM systems or the underlying IaaS. Yes, folks, it’s serverless managing your servers!

In closing, there is plenty more to be written about the future of serverless. I expect we’ll see something soon in the CNCF, and then we’ll definitely start hearing of more application use cases. But hopefully this blog sparked an idea or two about serverless’s usefulness beyond the frequent HTTP-request and API-callback use cases for high-level apps. (Please share your ideas.) For intent-driven, self-driving, smart-driving infrastructure, I think a FaaS track is emerging.