TIL boltdb has no index. The store is a B+ tree, so retrieval is O(log n) for something you would think is O(1). It also sits on top of one large mmap.

Last time I checked, boltdb is what etcd uses for its storage I/O.
boltdb has the concept of key buckets; I am not sure if etcd uses them.
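To make the shape concrete, here is a minimal sketch of a read against boltdb (using the bbolt fork's Go API); the file name, bucket, and key below are made up for illustration:

    package main

    import (
        "fmt"
        "log"

        bolt "go.etcd.io/bbolt"
    )

    func main() {
        // One file, memory-mapped; there is no separate index structure.
        db, err := bolt.Open("data.db", 0600, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Reads walk the B+ tree inside a read-only transaction: O(log n), not O(1).
        err = db.View(func(tx *bolt.Tx) error {
            b := tx.Bucket([]byte("objects")) // buckets are nested key namespaces
            if b == nil {
                return fmt.Errorf("bucket not found")
            }
            v := b.Get([]byte("default/my-pod"))
            fmt.Printf("value: %s\n", v)
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }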

There is a linear relationship between how many objects you have in kubernetes and the latency of an HTTP GET API call. This relationship is not affected by concurrent reads because boltdb has a shared-read model.
With the increased use of kubernetes as a storage/control loop for arbitrary objects and collections of them via CRDs, these things must be understood.
Kubernetes itself does not really do anything complex storage-wise. It does not have cross-key transactions. Writes are always one object at a time. The key name is a URI-like string that carries name + type + namespace. What kubernetes is heavy on is compare-and-swap and watches.
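A rough sketch (not the apiserver's actual code) of those two operations against etcd's clientv3 API; the endpoint and key are made up, the key layout just mimics the apiserver's /registry/<resource>/<namespace>/<name> style:

    package main

    import (
        "context"
        "fmt"
        "log"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"localhost:2379"}})
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        ctx := context.Background()
        key := "/registry/pods/default/my-pod"

        // Read the current revision so the write can be conditional.
        get, err := cli.Get(ctx, key)
        if err != nil || len(get.Kvs) == 0 {
            log.Fatal("key not found")
        }
        rev := get.Kvs[0].ModRevision

        // Compare-and-swap: only write if nobody updated the object since we read it.
        txn, err := cli.Txn(ctx).
            If(clientv3.Compare(clientv3.ModRevision(key), "=", rev)).
            Then(clientv3.OpPut(key, "updated-object-bytes")).
            Commit()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("write succeeded:", txn.Succeeded)

        // Watch: controllers get notified of every change under a prefix.
        for wresp := range cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix()) {
            for _, ev := range wresp.Events {
                fmt.Printf("%s %s\n", ev.Type, ev.Kv.Key)
            }
        }
    }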
Also, the boltdb code is elegantly simple. The original author set out to build a simple embeddable persistent k/v store and I think he (I think he is a he) nailed it, making the hard choice of dropping features for the sake of simplicity.
The way it works: a B+ tree where the first few pages are metadata and a "free list" of pages. All writes are fsync-ed. A single file holds the data.
Finally, boltdb has no traditional DB thread pool; all ops run on the caller's thread. So scaling up reads is done via goroutines the hosting app must spin up. Writes go through a single writer behind a top-level mutex, so you really don't need to scale those up.
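A small sketch of that concurrency model, again with made-up names: reads run on whatever goroutine calls them (many at once is fine), while writes are serialized behind the library's single writer lock and fsync-ed on commit.

    package main

    import (
        "fmt"
        "log"
        "sync"

        bolt "go.etcd.io/bbolt"
    )

    func main() {
        db, err := bolt.Open("data.db", 0600, nil)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Writes: db.Update takes the writer lock, so only one runs at a time,
        // and each transaction is synced to the single data file on commit.
        if err := db.Update(func(tx *bolt.Tx) error {
            b, err := tx.CreateBucketIfNotExists([]byte("objects"))
            if err != nil {
                return err
            }
            return b.Put([]byte("k"), []byte("v"))
        }); err != nil {
            log.Fatal(err)
        }

        // Reads: no internal thread pool, so the hosting app scales reads
        // simply by issuing db.View from many goroutines.
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                _ = db.View(func(tx *bolt.Tx) error {
                    v := tx.Bucket([]byte("objects")).Get([]byte("k"))
                    fmt.Printf("reader %d saw %s\n", i, v)
                    return nil
                })
            }(i)
        }
        wg.Wait()
    }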
That is why too many events or too many writes really choke kubernetes: not just because of the raft DSM, but also because of how boltdb works.
Also, boltdb has no in-memory cache; all reads go straight to the memory-mapped file, so you are leaning on the OS page cache rather than any application-level cache. I didn't check if etcd adds one, but I doubt it does.
So how do you scale up in the face of too many controllers on the cluster doing too many not-so-smart things (looking at you, Prometheus operator)?

Easy/cheap: use a smaller etcd cluster; 3 members are better than 5 (YMMV availability-wise).
Easy/expensive: get the fastest disks you can put your hands on and put your WAL files on them, or one of those fancy new technologies like NVRAM.
Hard/expensive: split your storage over multiple etcd clusters. Keep events in a separate cluster. If you have a CRD that gets frequently updated (a few times per second, or once every few seconds), consider keeping it in a separate etcd as well.
By separate I mean: configure the API server to use multiple etcd clusters (it can, rtfm); see the sketch right below.
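Roughly what that looks like on the apiserver side, using the --etcd-servers-overrides flag (format is group/resource#servers, servers separated by semicolons); the endpoints here are made up, and a busy CRD could get its own override the same way:

    kube-apiserver \
      --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379,https://etcd-main-2:2379 \
      --etcd-servers-overrides="/events#https://etcd-events-0:2379;https://etcd-events-1:2379"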

Your cert rotation will be a nightmare (all etcd clusters will have to use the same cert, iirc), and your ops will be another nightmare, but your perf will be *chef's kiss 😘*
/fin