Why does Datadog use Kubernetes
- Dogfooding
- Migrating away from Chef
- Multiple cloud providers, so a shared API was attractive

It is "the hard way" because of Kelsey Hightower's tutorial, Kubernetes the Hard Way, which is public on GitHub.

What happens after you have finished the 101

1. Resilient and scalable Control Plane
- You need 3, but ideally 5, control plane instances.
- etcd is very intense on the network side.
- The other components are stateless.
- Move the etcd nodes out on their own.
- The API server, with the controllers in its group, is going to be hit extra hard.
- API server: memory and network
- Scheduler and controllers: CPU

2. Securing
- Certificates are everywhere, and they expire in a year.
- If you can, just renew them frequently.
- Use certificates in every case you can.
- Kubelet: TLS Bootstrap
- The signing controller / signing service should be monitored.
- Certificate rotation is not so simple.
- The etcd caches are repopulated when the API server restarts.
- A restart also causes a spike in the dependent or consuming services.
- Traffic and memory consumption are hit very hard because the caches are quite large.
- Clients should reconnect in an orderly fashion. Try to avoid having them all rush the bank, so to speak.

3. Efficient Networking
It is hard.
- Throughput
- Scale
- Latency
- Topology
|- Routes
-- Small clusters can handle static routes.
-- Overlays such as Calico or Flannel let you tunnel connections, but they are not very performant.
-- Native pod routing uses the regular routing of the network.
-- On premises, you can use BGP with Calico or kube-router.
-- Native pod routing has worked very well at large scale.
-- The topic is still evolving (Cilium introduced ENI support).
-- Lyft / Cilium are good partners.
|- Accessing Services
-- Virtual IPs that are load balanced on the host
|- IPVS is an alternative to iptables
-- Service IPs are mapped to virtual servers.
-- Pods are mapped to the real servers behind those virtual servers.
-- kube-proxy then talks to the virtual server.
-- IPVS: instead of waiting, garbage collection first, then kill the pod
|- Ingress Traffic (getting data into the cluster)
-- A, B, and C want to talk to each other.
-- Traditionally you would use a LoadBalancer Service (NodePort).
-- Registration with the LB is slow.
-- HTTP traffic

Some existing issues or problems that need resolution
- DNS
- Stateful apps
- DaemonSets
- Accidentally DDoS'ing a node
- Understanding the life cycle of a cluster

k8s.af

Datadog is hiring: datadoghq.com/careers
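
Sketch for the note on clients reconnecting in an orderly fashion after an API server restart: one common way to keep clients from all rushing back at once is exponential backoff with random jitter. This is an illustration only, not something shown in the talk; the function name `reconnect_delays` and the `base` and `cap` values are assumptions made for this example.

```python
import random


def reconnect_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized delay (in seconds) per reconnect attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that lost their connection at the same moment spread
    their retries out instead of hitting the server together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Hypothetical usage: print the wait before each of five reconnect attempts.
    for i, delay in enumerate(reconnect_delays(5)):
        print(f"attempt {i}: wait {delay:.2f}s")
```

The randomness matters more than the exact numbers: with a fixed backoff, every client that was disconnected by the same restart would retry at the same instants.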
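
Sketch for the note on certificates expiring in a year and renewing them frequently: a simple rule is to renew once a fixed fraction of the certificate's lifetime has passed, rather than waiting until shortly before expiry. The helper `should_renew` and the default `renew_fraction` of 0.5 are assumptions made for this example, not values from the talk.

```python
from datetime import datetime, timedelta, timezone


def should_renew(not_before, not_after, now, renew_fraction=0.5):
    """Return True once `renew_fraction` of the certificate lifetime has elapsed.

    `not_before` and `not_after` are the certificate validity bounds.
    `renew_fraction` is an assumed policy knob for this sketch.
    """
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * renew_fraction
    return now >= renew_at


if __name__ == "__main__":
    # Hypothetical one-year certificate issued on 2024-01-01 (UTC).
    issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expires = issued + timedelta(days=365)
    print(should_renew(issued, expires, datetime.now(timezone.utc)))
```

Renewing well before expiry leaves room for a failed rotation to be noticed and retried, which fits the note that the signing service should be monitored.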