This is for a bunch of internal tools, so if it goes down it's more of a nuisanc...

yebyen · on March 28, 2018

I have my group's internal Jenkins service hosted on a single node EC2 instance running Kubernetes (t2.medium) and I would echo all of the advice you're getting. Kubeadm, definitely. And moreover, don't call it production-ready.

A production-ready cluster has dedicated master(s), period. In order to get your single-node cluster to work (so you can schedule "worker" jobs on it) you're going to "remove the dedicated taint," which is signaling that this node is not reserved for "master" pods or kube-system pods only. That will mean that if you do your resource planning poorly with limits and requests, you will easily be able to swamp your "production" cluster and put it underwater, until a reboot.

(The default configuration of a master will ensure that worker pods don't get scheduled there, which makes it harder to accidentally swamp your cluster and break the Kube API, but it also means the master won't run anything beyond the basic Kubernetes API components.)

If things go south, you're going to be running `kubeadm reset` and `kubeadm init` again because it's 100% faster than any kind of debugging you might try to do, and you're losing money while you try to figure it out. That's not a production HA disaster readiness or recovery plan.

But it 100% works. Practice it well. Jenkins with the kubernetes-plugin is awesome, and if I have a backup copy of the configuration volume and its contents, I can start from scratch and be back to exactly where I was yesterday in about 15-20 minutes of work.

My 1.5.2 cluster's SSL certificate expired a few weeks ago, on the server's birthday. I spent several hours trying to reconcile the way SSL certificate management has changed, hunting for the proper documentation on how to replace the certificate in this ancient version, and weighing whether I should upgrade instead and what that would mean (read: figuring out how to configure or disable RBAC, at the very least). In the end I conceded that it was easy to implement the "DR-plan Lite" that we had discussed, went ahead and reinstalled over the same instance "from scratch" again with v1.5.2, and got back to work in short order.

I've spoken with at least half a dozen people that said administering Jenkins servers is an immeasurable pain in the behind. I don't know if that's what you intend to do, but I can tell you that if it's a Jenkins server you want, this is the best way to do it, and you will be well prepared for the day when you decide that it really needs more worker nodes. It was easy to deploy Jenkins from the stable Helm chart.

swozey · on March 28, 2018

I've done a number of 1.5 to 1.9 migrations. If you need help figuring out which API endpoints etc. have changed, I can give you some guidance if you ping me on the k8s Slack: mikej.

Once you get onto 1.8+ w/ CRDs you can manage your SSL certs automatically via Jetstack's cert-manager: https://github.com/jetstack/cert-manager/tree/master/contrib...

yebyen · on March 28, 2018

Thanks! I will check it out!

It just hasn't been a priority. I have no need for RBAC at this point, as I am the only cluster admin, and the whole network is fairly well isolated.

I couldn't really think of a good reason to not upgrade when it came time to kubeadm init again, but then I realized I could probably save ten minutes by not upgrading, it was down, and I didn't know what the immediate consequences of adding RBAC would be for my existing Jenkins deployment and jobs.

Chances are it would have worked.

swozey · on March 29, 2018

Honestly, for the situation you presented you'll find very few QOL improvements by upgrading. You could probably sit on 1.5 forever on that system (internal Jenkins).

yebyen · on March 30, 2018

The biggest driver is actually just to not be behind.

You can tell already from what little conversation we've had that "always be upgrading" is not a cultural practice here (yet.)

We have regular meetings about changing that! Had two just yesterday. Chuckle

outworlder · on March 28, 2018

I don't think so, provided it has the necessary resources to run everything in a single node. There are a few more moving parts which you won't really be using to any great extent.

cagenut · on March 28, 2018

more parts = more things that can go wrong

guslees · on March 28, 2018

That's not quite true. More parts == more things that can fail, but whether those failures result in the entire system failing depends on how you've combined the parts.

If you make each of the pieces a required part of the whole, then yes - adding more of them will increase the chance that the whole system fails. But in Kubernetes, the additional pieces (nodes) are all redundant parts of the whole, and can fail without affecting the availability of the whole system. The more nodes you add, the more redundancy you're adding, and the less chance that the system as a whole will be affected.

Mathematically:

If a component fails with probability F (a fraction between 0 and 1), then combining N of them "in series" (all of them need to work) means your whole system fails with probability 1-(1-F)^N. Iow, as N goes up, (1-F)^N shrinks toward 0 and the system approaches a 1-0 => 100% chance of failure.

Otoh, if you combine the parts "in parallel", and you only need any one[1] of the components to work in order for the whole system to work, then the system fails with probability F^N. As N goes up, this system approaches 0% chance of failure.
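
To make the two formulas concrete, here is a minimal Python sketch. The function names and the 1% per-component failure rate are made up for illustration, and it assumes the components fail independently of each other, which real nodes sharing a rack or an availability zone may not.

```python
def series_failure(f: float, n: int) -> float:
    """Chance the system fails when ALL n components must work.

    f is the per-component failure probability as a fraction (0.01 == 1%).
    Assumes independent failures.
    """
    return 1 - (1 - f) ** n


def parallel_failure(f: float, n: int) -> float:
    """Chance the system fails when ANY ONE working component is enough.

    Assumes independent failures.
    """
    return f ** n


if __name__ == "__main__":
    f = 0.01  # illustrative value, not a measured failure rate
    for n in (1, 2, 3, 5, 10):
        print(f"N={n:2d}  series={series_failure(f, n):.6f}  "
              f"parallel={parallel_failure(f, n):.3e}")
```

Running it shows the series column climbing toward 1 and the parallel column dropping toward 0 as N grows.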

[1] Kubernetes (etcd) isn't quite this redundant, since etcd needs a majority quorum to be functional, not just any single node. But the principle is similar and still gets more reliable as you add nodes.
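
The majority-quorum case from the footnote can be sketched the same way. This is a simplified binomial model, not anything taken from etcd itself: the function name is hypothetical, each member is assumed to be down independently with the same probability, and the cluster counts as failed whenever fewer than a majority of members are up.

```python
from math import comb


def quorum_failure(f: float, n: int) -> float:
    """Chance that an n-member cluster loses its majority quorum.

    f is the per-member failure probability as a fraction.
    Assumes independent failures (a simplification).
    """
    quorum = n // 2 + 1          # members that must be up
    max_tolerated = n - quorum   # members that may be down
    # Sum the binomial probabilities of every "too many down" outcome.
    return sum(
        comb(n, k) * f ** k * (1 - f) ** (n - k)
        for k in range(max_tolerated + 1, n + 1)
    )


if __name__ == "__main__":
    f = 0.1  # illustrative value only
    for n in (1, 2, 3, 4, 5, 7):
        print(f"members={n}  P(quorum lost)={quorum_failure(f, n):.5f}")
```

Under this model the trend holds for odd member counts (1, 3, 5, 7), while going from an odd count to the next even count does not help, because the majority threshold rises with it.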