PS-04: Prasad Dorbala, Avesha / KubeSlice – Why is Kubernetes multi-tenancy still hard to achieve?

Problem Solvers — Podcast / vCast for June 3, 2022:

In this June 2022 Intellyx BrainWave Podcast, Problem Solvers Edition 4, we’re joined by Prasad Dorbala, co-founder and Chief Product Officer of Avesha Systems, creator of and contributor to KubeSlice, the open source cloud native multitenancy solution for multi-cloud, multi-cluster Kubernetes networking. We discuss the limitations of IP addressing, the complexities of maintaining continuity for users and application workloads across multiple cloud instances and K8s namespaces, and what we can do about it.
Guest: Prasad Dorbala, CPO, Avesha (@prasaddorbala)
Intellyx host: Jason ‘Jay-E’ English (@bluefug), Intellyx
Episode PS-04, June 2022. Show links:

Watch the YouTube version (Audio only) here: https://youtu.be/ZAs24bmr0OM 

 

Full transcript of the podcast:

Jay-E: Welcome to another Intellyx BrainWave podcast. This is our thought leaders problem-solvers episode where we talk about some common problems that exist in the cloud native ecosystem and what we can do about them.

So today I have Prasad Dorbala; he’s the co-founder and chief product officer for Avesha. You might know them better as the people who make the KubeSlice product, which is a multitenant, multi-cluster networking and orchestration solution. So good to have you on, Prasad.

Prasad: Thank you, Jason. Good to meet you.

Jay-E: Prasad, let’s just start off with the fundamental issue. It seems like Kubernetes itself and a lot of the tools surrounding it were supposed to solve the orchestration problems, and they do to a large extent. And the whole promise of cloud in general is really multitenancy.

Why has multitenancy for Kubernetes environments become such a hard problem to solve?

Prasad: Yeah, thanks Jason. Kubernetes has done a phenomenal job in orchestrating containers. Everybody is containerizing their monolithic applications into microservices, the de facto standard for orchestration is Kubernetes, and containers are growing at a 32% CAGR, massive growth. But Kubernetes was all driven from an enterprise viewpoint, right? Enterprises have certain behaviors; enterprises have users. And users in an enterprise have a trust relationship, right?

Most enterprise users are all employees of the same enterprise, so there is a mutual trust associated with that. But when you start to extend that Kubernetes deployment as a SaaS solution for a different set of users, then there are challenges in the way Kubernetes has been designed, right?

The control plane is common across a cluster. Now, you could argue: hey, why don’t you spin up multiple clusters per enterprise? Yes, you can do that, but that runs into the whole notion of kube sprawl. There are many, many Kubernetes clusters getting spun up per enterprise, which means you get the multitenancy for free by running multiple clusters.

But as the scale grows to tens and hundreds of clusters, managing them is a big challenge. And given that you have to spin up a cluster, or clusters, per tenant, you’re not efficiently using the resources, which is supposed to be the benefit of running Kubernetes in the first place.

Right? So that is the reason people are considering how to do multitenancy in a single cluster, and then extend that multitenancy across clusters, so that when demand comes in and you add more clusters, the tenancy defined in a single cluster extends rather than having to be reprovisioned.

And these clusters are ephemeral too, right? They come and go, so you don’t have much time for configuration challenges. And on top of all that, you also add in the visibility tools, right? The visibility tools are different for each tenant; all of that needs to be orchestrated, and it becomes a big challenge.

Jason, that is the reason the industry is now responding: the success of Kubernetes is driving some of these additional features that are needed into the foundations of Kubernetes.

Jay-E: Mmhmm. It’s almost like this mentality of using open source solutions, combined with what you have, rolling your own clusters for the needs of different customer groups, has kind of created the situation we’re in right now.

Right? I mean, it was a very positive thing to allow infrastructure to be responsive to the needs of the business and business users in the end, but it created this new problem. So how do you address this use case?

Prasad: Yeah, it’s a very important factor from a cost and efficiency standpoint.

Now, first of all, we need to understand what tenancy is. Tenancy comes in many grades. In an enterprise, as I mentioned, tenancy can be for an application: there are certain sets of applications that are very critical from a revenue generation standpoint and a security posture standpoint.

They need to be isolated from other sets of applications that are more generic in nature. So tenancy is a type of isolation. And isolation within an enterprise is different from isolation across multiple enterprises, and isolation is different again per application.

So in order to do that, there are many, many knobs you need to establish for all the different types of isolation. And fundamentally, Kubernetes has a control plane that is common to everything in a cluster, and some resources, like nodes and other things, are also common across the cluster, right?

When you talk about isolation, what is to say that a tenant brings their own container, and in some situation they have included third-party software in that container that has malicious content? They’re not doing it purposefully, but they have packaged it in there.

So now is there collateral damage for somebody else sitting on the same cluster? How do you isolate tenants from a pod point of view? At the construct called a namespace? Or from a control plane standpoint, so that when they use tools like kubectl and others, they are only seeing their part of the infrastructure rather than the whole thing?
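
For readers who want to see what that control-plane knob looks like in practice, here is a minimal sketch using the official Kubernetes Python client. This is plain Kubernetes RBAC, not KubeSlice itself, and all names (the tenant-a namespace, the user) are illustrative:

    # A minimal sketch of control-plane isolation with plain Kubernetes RBAC,
    # via the official Python client. Namespace "tenant-a" and the user name
    # are illustrative, and the namespace is assumed to already exist.
    from kubernetes import client, config

    config.load_kube_config()
    rbac = client.RbacAuthorizationV1Api()

    # A Role scoped to the tenant's namespace: read-only access to pods/services.
    role = client.V1Role(
        metadata=client.V1ObjectMeta(name="tenant-view", namespace="tenant-a"),
        rules=[client.V1PolicyRule(
            api_groups=[""],
            resources=["pods", "services"],
            verbs=["get", "list", "watch"],
        )],
    )

    # Bind the tenant's user to that Role; with no cluster-wide bindings,
    # their kubectl sees only this namespace, not the whole cluster.
    binding = client.V1RoleBinding(
        metadata=client.V1ObjectMeta(name="tenant-view-binding", namespace="tenant-a"),
        subjects=[client.RbacV1Subject(  # named V1Subject in older client versions
            kind="User", name="tenant-a-user",
            api_group="rbac.authorization.k8s.io",
        )],
        role_ref=client.V1RoleRef(
            kind="Role", name="tenant-view",
            api_group="rbac.authorization.k8s.io",
        ),
    )

    rbac.create_namespaced_role(namespace="tenant-a", body=role)
    rbac.create_namespaced_role_binding(namespace="tenant-a", body=binding)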

Right, so there are challenges there, and that is what we are trying to solve. Fundamentally, there is a construct called a namespace, which is unique per cluster: you cannot have the same namespace in a cluster for two different things. But tenants are each given access to the infrastructure independently.

One of them may create a namespace called Jason, and somebody else might also call their namespace Jason. If they happen to be on the same cluster, that is not allowed. There are lots of challenges like that we need to overcome in Kubernetes if you want true multitenancy.
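
As a concrete illustration of that uniqueness constraint, here is a minimal sketch with the Kubernetes Python client. The kubeconfig context name is a placeholder, and the behavior shown (a 409 Conflict on the duplicate create) is standard Kubernetes, independent of KubeSlice:

    # A minimal sketch of the per-cluster namespace uniqueness constraint,
    # using the official Kubernetes Python client.
    from kubernetes import client, config
    from kubernetes.client.rest import ApiException

    config.load_kube_config(context="cluster-a")  # illustrative context name
    v1 = client.CoreV1Api()

    ns = client.V1Namespace(metadata=client.V1ObjectMeta(name="jason"))
    v1.create_namespace(ns)  # first tenant: succeeds

    try:
        v1.create_namespace(ns)  # second tenant, same name, same cluster
    except ApiException as e:
        print(e.status)  # 409 Conflict: names are unique within one cluster
    # On a *different* cluster the same name would succeed, which is why a
    # cross-cluster tenancy layer has to reconcile namespace identity.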

Jay-E: Yeah. I mean, it seems like we were expecting, with the infrastructure being so ephemeral, containers and pods and everything coming and going, that that might solve the kind of noisy neighbor problem we usually have with conventional architectures.

And maybe that would have gone away, but actually even the consumption model for the resources is still constrained by the fact that different people are using it in different ways, and it’s probably not optimal, even beyond just the security concerns. How do you solve for some of that?

Prasad: Yeah, that’s an important factor, and you hit it on the head: the noisy neighbor problem.

Especially when you have tenants like Coke versus Pepsi, different enterprises running on the same cluster. How do you shard in such a way that a designated set of resources is allocated to each tenant? They may be running on the same node, or they may be segregated onto different nodes.

There are network resources that need to be isolated, compute resources that need to be isolated, and then there is the whole notion of storage components, persistent volumes (PVs), which are common to a cluster. How do you associate those with a tenant? Those are all fundamental challenges that need to be solved from a multitenancy standpoint.
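
One standard knob for fencing compute and storage per tenant is a ResourceQuota on the tenant’s namespace. The sketch below, again with the Kubernetes Python client, is plain Kubernetes rather than anything KubeSlice-specific, and the tenant name and limits are illustrative:

    # A minimal sketch of per-tenant resource fencing with a ResourceQuota,
    # via the official Kubernetes Python client.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    quota = client.V1ResourceQuota(
        metadata=client.V1ObjectMeta(name="tenant-quota"),
        spec=client.V1ResourceQuotaSpec(
            hard={
                "requests.cpu": "4",            # cap aggregate CPU requests
                "requests.memory": "8Gi",       # cap aggregate memory requests
                "persistentvolumeclaims": "5",  # cap storage claims (the PV side)
            }
        ),
    )
    # Applies to everything scheduled into this tenant's namespace, which is
    # one answer to the noisy-neighbor problem on shared nodes.
    v1.create_namespaced_resource_quota(namespace="tenant-coke", body=quota)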

Jay-E: Yeah, this is very interesting. So how do you solve the problem of workloads that need to cross multiple clusters, or even multiple namespaces in different environments? That concept is sort of your overarching goal, right? To have the infrastructure cross multiple namespaces, multiple instances, even different cloud services. How does that happen?

Prasad: Yeah, that is what KubeSlice is trying to solve. What we do is create an abstraction on top of the existing Kubernetes infrastructure.

We call it a slice. A slice is a collection of namespaces, wherever they are. And you rightly said, Jason, physical location is important too. With the advent of edge computing, you need to have workloads closer to the customer, so that the customer can have the best quality of experience.

Your hyperscaler may be in one location; now you have edge coming up where the customers are, and that’s another cluster. Combining these things, and making sure that tenancy is maintained across multiple clusters, is an important factor emerging in the industry.

That is what KubeSlice brings: true tenancy in a cluster, extended across multiple clusters in a seamless way, such that the people actually deploying applications are focused on the business logic rather than the infrastructure logic.

We take care of harmonizing tenancy across clusters. You see, one of the fundamental problems comes when you are crossing two administrative domains: you are in a hyperscaler, and then you want to have a footprint in an edge provider, so there are two different administrative domains.

Yes, containers have abstracted things and made it simple to deploy anywhere, but the infrastructure needs to be tied together so that you can distribute the applications and talk across these disparate infrastructures, which are administratively different. There is IP address planning that needs to happen: what needs to happen in the hyperscaler, what needs to happen in the edge clusters, and the edge clusters can be many in number.

And geographically, there are multiple regions in hyperscalers, right? Think about somebody actually doing on-prem Kubernetes and then tying it together with edges; that’s another vector we are addressing. Those addressing schemes are a big challenge, and we solve it seamlessly by providing an overlay network.

That way they don’t have to worry about the addressing; they just focus on business logic. Bringing all these clusters together, be it on-prem, in a hyperscaler, or at the edge, and making them look like one cohesive virtual cluster is what we bring.
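
To make the slice idea concrete: KubeSlice is driven by custom resources, and a slice spanning clusters is declared roughly along these lines. The group/version, field names, and cluster names below are an illustrative approximation rather than the authoritative schema; the real CRDs live in the repos at github.com/kubeslice:

    # A hedged sketch of declaring a slice that spans clusters. The schema
    # here approximates a KubeSlice SliceConfig; consult the KubeSlice docs
    # and repos for the real CRD. All names are placeholders.
    from kubernetes import client, config

    config.load_kube_config(context="controller-cluster")  # hypothetical context
    api = client.CustomObjectsApi()

    slice_config = {
        "apiVersion": "controller.kubeslice.io/v1alpha1",  # assumed group/version
        "kind": "SliceConfig",
        "metadata": {"name": "demo-slice"},
        "spec": {
            # Overlay subnet: workloads get slice addresses, so nobody has to
            # reconcile IP plans across hyperscaler, on-prem, and edge.
            "sliceSubnet": "10.1.0.0/16",
            "clusters": ["hyperscaler-east", "edge-boston"],  # illustrative names
            "namespaceIsolationProfile": {
                # The slice is "a collection of namespaces wherever they are."
                "applicationNamespaces": [
                    {"namespace": "checkout", "clusters": ["*"]},
                ],
            },
        },
    }

    api.create_namespaced_custom_object(
        group="controller.kubeslice.io",
        version="v1alpha1",
        namespace="kubeslice-demo",  # illustrative project namespace
        plural="sliceconfigs",
        body=slice_config,
    )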

Jay-E: Yeah. I don’t think we can wait for the hyperscalers, or telcos for that matter, to solve this problem.

I don’t see them coming to that kind of coordination quickly. I think some people might look at what you’re talking about and say, well, isn’t that just another service mesh? Those might exist in the landscape you’re describing, so how do you differentiate from some of those approaches?

Prasad: Yeah, service mesh solves a different type of problem. Service mesh definitely solves service-to-service communication, but it fundamentally assumes that there is a transport available. Service mesh does not distinguish well on the tenancy part of it; service mesh sits on top of Kubernetes.

Within a single cluster, there is no construct like tenancy inside the mesh. That is where we change a lot: from an API server standpoint, how the scheduler works, what needs to be scheduled where. Those are all different from the service mesh mission, right? The service mesh mission is about service-to-service communication, visibility, and mTLS.

Those three fundamental things they do very well. We sit a little lower than that, essentially at the Kubernetes layer, not at the service mesh layer.

Jay-E: Yeah. So as opposed to being purely about the integration, it’s really about making sure that all of these components can work together.

And so there could be multiple service meshes within these landscapes. They might be your mechanism for that service-to-service communication, but your solution doesn’t require one?

Prasad: Not necessarily. In fact, if you look at 5G and other technologies that are doing network slicing, with O-RAN getting distributed and all that stuff, it is all Kubernetes-centric.

We help because the network we’re bringing is not HTTP-only. They have all kinds of exotic protocols that need to be passed through from cluster to cluster, so they are layer 3 and above, rather than HTTP-centric, with the ingress and egress challenges that come with that. Some of the CNFs, what we call cloud native network functions, are not comfortable with NATting some of these protocols, so we need to give them a much easier transport across clusters, because NATs are a no-no for some protocols. That is what we bring to the 5Gs of the world: helping solve that problem more natively in the Kubernetes realm rather than at the service mesh level.

Jay-E: Very interesting. So, as far as your initial set of customers and users, what kind of roles are looking to take care of this problem?

Is it sort of at that architect level? What kind of initial use cases are you seeing out there when people approach you?

Prasad: I think there are multiple use cases we are seeing. There are people who are really interested in tenancy, isolation across applications.

And there are SaaS providers who are offering Kubernetes as a service, and they want to economize the number of clusters they have. From that standpoint, they want true tenancy, multiple enterprises running on the same cluster, so that they can have a clear separation of control plane as well as designated resources, solving the chatty neighbor problem.

And then there are enterprises in a single region who want multi-region support and to distribute workloads closer to their customers, be it in their own data center, which is essentially a hybrid scenario, or by consuming the new edge providers.

Those are all the use cases we are seeing. And the persona trying to solve all these problems is the new breed of platform services teams, right? They offer platform services to their business, to their application providers and application developers. These platform teams traditionally used to be called DevOps.

Now that has evolved into a platform team responsible for the health and lifecycle management of the infrastructure, and for making sure the quanta of capacity exist where the business needs them. Those are the personas we are addressing, to solve the distribution of their workloads in a more guardrailed, multitenant way.

Jay-E: Yeah, that makes sense. The first one that always popped into my mind was basically a CSP or MSP on steroids thinking, how do I enable this without having to roll my own Kubernetes version of this managed platform? But then when you think about it from the point of view of an enterprise, they’re constantly wrangling resources to build platforms.

So I could definitely see that being part of an internal enterprise org, too.

Prasad: Yeah, and everybody wants to be a SaaS provider now, right? All enterprises are becoming SaaS solutions. And think about it: when you are a SaaS solution, you pretty much have customers across the globe.

Now you want to make sure that your quanta of workload are available wherever the hotspots of customers are. You don’t want to backhaul all of their traffic to one centralized location; you want to distribute it.

And once you distribute it, do you distribute the full stack, or do you distribute the part of the stack necessary for it to be available to solve some basic business problem? When you split the stack and distribute it, then comes the whole notion of how you make sure inter-cluster traffic is harmonized.

You don’t want traffic going to a different administrative domain. When you have a security posture, you want to stay within that security landscape, rather than going into a different landscape and then having all this auditing and other work to make sure it is safe. How you reduce that headache is one of the key factors from an operational efficiency standpoint.

Jay-E: Yeah, well, fascinating. Prasad, thanks so much for joining me. How would customers approach you to get started with KubeSlice or working with Avesha?

Prasad: So Avesha has taken an approach; see, platform teams are pretty much consuming open source for a lot of different things. So we are open core. KubeSlice is open source, the slice concept is open source, so that at least people can try it out. You can go to github.com/kubeslice.

All the repos are out there; you can try it out. And once people try it out, they see the value. That drives us to bring the additional value of controls, of simplifying how you deploy it, simplifying the configuration, and some key capabilities we are bringing as augmentations to KubeSlice.

That’s how we see a product line from our standpoint, as well as some of the pain points people are seeing, especially in the retail industry, where they have different clusters in different places.

And then there are also regulation issues, right? From the GDPR standpoint, the EU doesn’t want PII information to leave its boundaries; data sovereignty is one big thing, and there are policies associated with that. Some of these things are driving applications to be distributed, and we see that day in, day out.

Jay-E: Yeah. So it sounds like you could probably use the open source to go and build in this functionality if you wanted to, but if you want to look at it more from an orchestration and policy perspective, and really have observability into what’s happening with this multi-tenant architecture, then you would talk to Avesha.

Prasad: Yep, exactly. That’s the key point, right? All of us have run large-scale operating environments.

One of the key things we always say is that team members should keep their sanity; they shouldn’t need to be waking up in the middle of the night. For that, observability is the key foundation. If you don’t know how to find the root cause, you will spend hours and hours trying to figure out where things happened and where they died.

One of the lessons we learned is that it’s easy to solve a problem once, but it is important to solve it forever, so that the same problem doesn’t happen again. Those are the things we have looked at and helped build from an operational efficiency standpoint.

Jay-E: Excellent. Well, thanks so much, Prasad, and thanks everyone for joining us on the Intellyx Problem Solvers podcast today. We’ll be seeing you.

Prasad: Thank you very much. Thanks for the opportunity.

 

©2022 Intellyx LLC. At the time of publishing, Avesha Systems is an Intellyx subscriber. All dialogue in this program represents the expressed opinions of the hosts and guests, and is not necessarily the official position of Intellyx or any company mentioned or included in this podcast audio or video.

