The Groups.io Engineering Philosophy
By Mark Fletcher
- 5 minutes read - 919 wordsAsk my kids and they will tell you, I am the party of no. No dessert before you eat your vegetables. No skipping out on 3rd grade to day trade stonks. No feeding your gremlin after midnight. I apply this principle to Groups.io devops as well.
One thing to note before we begin is that Groups.io is a small company with steady, predictable growth. For reference, we have about 60 Linodes of various sizes in our production cluster. Traffic is of the order of 10s of millions of page views and emails a day. Some of the following is not applicable or appropriate for fast growing or large services (but I think it is appropriate for most services).
KISS Is Not Just A Band Although KISS Is Also A Band
My guiding principle is:
Eschew recrementitious convolution
That means I have a thesaurus. It also means, keep things as simple as possible. The fewer technologies we use, the fewer things there are to go wrong.
When I’m getting paged at 3am because something is wrong with the site, I am not going to be at my sharpest and I’m going to be under a lot of pressure to get things fixed quickly. The simpler the system is, the easier it is going to be for me to figure out what’s wrong, and to fix it.
For us, this policy manifests in several choices for our tech stack. And that starts with the programming language.
Compile The World And Copy It Everywhere
Everything is written in Go. I am on record as being a huge Go fanboy. I will not extoll all the virtues of Go in this post, but I want to point out three important aspects of the language. First, it is not complex. There is no magic. This is important when trying to figure out what a piece of code you wrote in the last decade is doing. The second important aspect is that it is a compiled language. I believe in compiling all the things. The third important aspect is that it produces binaries that are (mostly) statically linked. Which means that it’s super easy to just copy them around as needed, without needing any support scaffolding.
In Which Our Villain Says No To A Bunch Of Useful Tech
Groups.io does not use containers. And that means no Kubernetes. The way we do releases is that we create a tarball of new executables, copy that to all the machines, untar them into a new directory, and then move symlinks to point to them. After that is completed, we restart the services using the new executables. Our production instances typically have uptimes in the hundreds of days.
We don’t do any autoscaling. I realize that means that we’re not fully utilizing our paid-for production capacity at all times, and I’m ok with that. I’ll take that tradeoff, for less complexity. We have a utility that uses the Linode API to spin up new machine instances and configure them when we need them. But because we have predictable growth, we have plenty of warning and can add machines at our leisure.
We don’t use DNS to access our machines internally. We use an SSH config file with ProxyCommands for each machine. Again, this is to keep things simple.
We have a monorepo, hosted by Github. We do not do automatic deployments on commit. We deploy all the time (including Friday afternoons!), it’s just the idea of automatic deployments makes me twitchy.
Structured Logs And JSON Configuration Files
We keep all our log files on the machines they were produced on. The way that Linode prices instances, we end up having a lot of extra, unused disk space on each instance, so we’re able to keep copious logs (see also above where we don’t autoscale). We have a program, unimaginatively called research, that will grep through appropriate log files on each machine, when we need to debug or trace an event. This works really well for us.
Configuration data is kept in JSON files, and is copied to each machine. We do not have a centralized configuration database.
Distributed Messaging Is Great
I don’t always say no. One core piece of our tech stack is NSQ, a distributed messaging system. It has been rock solid for us. Our large but not-exactly-a-monolith web server will off-load tasks that can be done in the background or that take a long time, to various microservices via NSQ. Some of these include uploading files to Amazon S3, generating group or user exports, sending notifications and webhooks, and sending emails.
One thing I wish NSQ had was a request response system. When interacting with some of our microservices, we need a response. For those services, we currently use HTTP JSON RPC calls. It would be interesting if instead we could just use NSQ in a way where we could receive a response (currently NSQ is basically fire-and-forget). That way, we could take advantage of NSQ’s distributed nature, which would make managing those microservices easier. I believe NATS may have something like this, although I haven’t investigated it.
Groups.io is the best tool to get a bunch of people organized and sharing knowledge. Start a free trial group today.