Do I read correctly that tiered storage is GA in @confluentinc platform 6.0?

I absolutely must try it if so. https://docs.confluent.io/current/kafka/tiered-storage.html
Confluent Platform 6.0 upgrade is baking in the oven...
so far, it's been amazingly unremarkable: we hardly even noticed from the consumers/producers that anything had changed (aside from rolling broker restarts).

A few alarming blips to the "requests total time" for fetch.consumer.mean, but other than that smooth sailing.
wishing desperately that our jolokia plugin passed along the version number so we could watch the rollout in realtime, but it was (1) not configured to pass it, and (2) telegraf/wavefront really don't like string typed values :(
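One way around the string-typed-value problem, sketched here as an assumption rather than anything we actually deployed, is to pack the version into a single integer so it can be shipped as an ordinary gauge. The function name `version_to_gauge` is hypothetical, and the input format ("6.0.0", possibly with a suffix such as "-ccs") is assumed.

```python
def version_to_gauge(version: str) -> int:
    """Pack 'major.minor.patch' into one integer, e.g. '6.0.0' -> 60000.

    Assumes minor and patch stay below 100 and that any build suffix
    is separated by a dash (an assumption about the version format).
    """
    numeric = version.split("-")[0]        # drop a suffix like '-ccs'
    parts = numeric.split(".") + ["0", "0"]  # pad so '6.0' still works
    major, minor, patch = (int(p) for p in parts[:3])
    return major * 10000 + minor * 100 + patch


if __name__ == "__main__":
    for v in ("5.5.1", "6.0.0", "6.0.0-ccs"):
        print(v, version_to_gauge(v))
```

Graphing that gauge per broker would show the number step up as each rolling restart lands.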
Tiered storage buckets added but not attached to topics yet :)

To do tomorrow instead, when I can watch it start archiving to s3 during working hours.

and then after that, we'll shrink from i3en.xlarge to m6gd.large or m6gd.xlarge for hot storage :)
Apparently the licensing has changed? even though it's in the confluent-kafka package, which is confluent community licensed, I'm hearing that the tiered storage feature (which is a one-way ratchet and can't be disabled once enabled for a topic) might be enterprise-only :(.
so now back to the holding pattern while we sort out commercial arrangements with Confluent sales.

I was excited to try flipping the switch tomorrow morning, too :(
aha, I have the answer. nothing will happen when I flip the switch tomorrow, because I have the community licensed build, so there's no bait and switch happening licensing-wise :) :phew: https://twitter.com/ijuma/status/1319133308425441280
Cut over, data immediately began appearing in the S3 buckets. However, I think our pre-existing monitoring script, which isn't yet aware of tiered storage, is going to start complaining that there's only an hour of local storage...
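A rough sketch of how a retention check could be made tier-aware. This is not our actual monitoring script. It assumes the topic config is available as a dict, and that the relevant keys are `confluent.tier.enable`, `confluent.tier.local.hotset.ms` and `retention.ms`; check those names against the Confluent docs for your version. The function names and the `slack` factor are hypothetical.

```python
def expected_local_retention_ms(topic_config: dict) -> int:
    """How much history we expect to find on local disk for a topic."""
    tiered = str(topic_config.get("confluent.tier.enable", "false")).lower() == "true"
    if tiered and "confluent.tier.local.hotset.ms" in topic_config:
        # Tiered topics only keep the hotset locally; the rest lives in S3.
        return int(topic_config["confluent.tier.local.hotset.ms"])
    # Untiered topics keep the full retention window on the broker.
    return int(topic_config["retention.ms"])


def should_alert(observed_local_ms: int, topic_config: dict, slack: float = 0.9) -> bool:
    """Alert only when local data is shorter than what the config implies."""
    return observed_local_ms < slack * expected_local_retention_ms(topic_config)


if __name__ == "__main__":
    hour, week = 3_600_000, 604_800_000
    tiered = {"confluent.tier.enable": "true",
              "confluent.tier.local.hotset.ms": str(hour),
              "retention.ms": str(week)}
    print(should_alert(hour, tiered))                       # False
    print(should_alert(hour, {"retention.ms": str(week)}))  # True
```

The sketch ignores special values such as an unlimited `retention.ms` of -1, which a real check would need to handle.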
Looking like it'll take about 2 hours to archive 1TB of no longer needed dogfood data to S3...
(which means 1TB of storage we're no longer needing i3en machines for :D)
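For a sense of scale, here is the back-of-envelope arithmetic behind that estimate, assuming 1TB means 10^12 bytes and the 2 hours is wall-clock time for the whole cluster:

```python
TB = 10**12                 # assumption: decimal terabyte
seconds = 2 * 60 * 60       # the ~2 hour estimate from above

bytes_per_second = TB / seconds
mb_per_second = bytes_per_second / 10**6
gbit_per_second = bytes_per_second * 8 / 10**9

print(f"{mb_per_second:.1f} MB/s, {gbit_per_second:.2f} Gbit/s")
# prints: 138.9 MB/s, 1.11 Gbit/s
```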
btw, yes, I did wind up getting a trial license key, after verifying that the feature did not in fact come enabled on community edition.
Fwoosh.
and that freed up enough disk space to do this! @_msw_
Auto-healing verified by kicking a few more over to m6gd and verifying that topics are re-replicated/re-balanced on demand. Pretty cool.

Now to get Confluent Control Center working, except... oops, it needs some kind of REST adapter?
Okay, that was easier than expected (although less documented than I'd like). Each Confluent broker already listens on port 8090 by default for REST RPCs, so I just had to write a little Terraform to add that port to our Kafka bootstrap NLB.
https://i.imgflip.com/4k21wc.jpg 

[caption: wait, nlb is for service discovery? -- always has been]
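A quick way to sanity-check that port 8090 is actually reachable, either through the NLB or on each broker directly. This is a minimal sketch using only the standard library; the hostname in the usage example is a placeholder, not our real endpoint.

```python
import socket


def port_open(host: str, port: int = 8090, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Placeholder hostname: substitute your own NLB or broker address.
    print(port_open("kafka-bootstrap.example.internal"))
```

It only proves that something accepts TCP connections on the port, not that the REST endpoint behind it is healthy.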
So I got Confluent Control Center working. I started working on shrinking the cluster and removing the 3 no longer needed brokers, bringing me from 6 to 3 brokers.

and then, friends, something somewhat catastrophic happened :(
I initially tried the CCC kafka-remove-broker method. It protested, claiming the rebalancing service wasn't available.

So I started manually juggling the partitions with kafka-reassign-partitions, moving them onto the 3 brokers I knew I'd be keeping.
Except the partition replicas _would not stay_ on the new brokers, even after I turned off the automatic rebalancing (which wasn't willing to perform broker removal for me anyway).
I thought I'd moved all the partitions entirely onto the new brokers, only to find that there was a partition that had reverted itself. :(
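In hindsight, a scripted plan plus a scripted re-check would have caught the reverted partition sooner. The sketch below builds a reassignment document in the `{"version": 1, "partitions": [...]}` JSON shape that kafka-reassign-partitions accepts, and lists any partition that still has a replica off the kept brokers. The function names and the input dict (which you would have to populate from a describe of the cluster) are hypothetical.

```python
import json


def build_reassignment(current: dict, keep: list) -> dict:
    """current: {(topic, partition): [broker ids]}; keep: broker ids to retain."""
    keep = sorted(keep)
    partitions = []
    for i, ((topic, partition), replicas) in enumerate(sorted(current.items())):
        rf = min(len(replicas), len(keep))  # cannot exceed the brokers we keep
        # Rotate the starting broker so preferred leaders spread out.
        new = [keep[(i + j) % len(keep)] for j in range(rf)]
        partitions.append({"topic": topic, "partition": partition, "replicas": new})
    return {"version": 1, "partitions": partitions}


def stragglers(current: dict, keep: list) -> list:
    """Partitions that still have at least one replica off the kept brokers."""
    kept = set(keep)
    return sorted(tp for tp, replicas in current.items() if not set(replicas) <= kept)


if __name__ == "__main__":
    current = {("events", 0): [1, 4, 5], ("events", 1): [2, 3, 6]}
    print(json.dumps(build_reassignment(current, [1, 2, 3]), indent=2))
    print(stragglers(current, [1, 2, 3]))
```

This naive plan ignores where replicas already live, so it may move more data than necessary. The useful part is running `stragglers` repeatedly after the move, including against internal topics, until it stays empty.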

and then, I made the second critical mistake: failing to understand how AWS ASG removal works. We use ASGs to replace brokers.
I needed to pin all of the brokers as do-not-scale-in, and then release one at a time. I failed to do that, and went from 6 to 3 all at once. And because of the magical broomsticks, some of the internal partitions were stuck entirely on old brokers that were removed :(
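What I should have done, roughly, sketched in Python. The `asg` argument is assumed to be a boto3 Auto Scaling client (`boto3.client("autoscaling")`), and the three calls used are `set_instance_protection`, `describe_auto_scaling_groups` and `set_desired_capacity`. The `safe_to_remove` callback is hypothetical: it stands in for whatever check confirms a broker holds no replicas and the previous removal has finished.

```python
def shrink_one_at_a_time(asg, group, all_ids, remove_ids, safe_to_remove):
    """Pin every instance, then release and scale in one broker at a time."""
    # 1. Protect everything so the ASG cannot pick its own victims.
    asg.set_instance_protection(
        AutoScalingGroupName=group,
        InstanceIds=list(all_ids),
        ProtectedFromScaleIn=True,
    )
    removed = []
    for instance_id in remove_ids:
        # 2. Stop immediately if the cluster is not ready to lose this broker.
        if not safe_to_remove(instance_id):
            break
        # 3. Release only this instance, leaving it the sole scale-in candidate.
        asg.set_instance_protection(
            AutoScalingGroupName=group,
            InstanceIds=[instance_id],
            ProtectedFromScaleIn=False,
        )
        # 4. Lower desired capacity by exactly one.
        described = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[group])
        desired = described["AutoScalingGroups"][0]["DesiredCapacity"]
        asg.set_desired_capacity(AutoScalingGroupName=group, DesiredCapacity=desired - 1)
        removed.append(instance_id)
    return removed
```

The sketch assumes the group's minimum size allows the lower desired capacity, and it relies on `safe_to_remove` to wait for the previous instance to be gone before approving the next one.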
so now we have stuck internal partitions. I figured most of them were confluent-internal and could be regenerated, forgetting that _confluent-tier-state is actually load-bearing regardless. so, yeah :(
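A pre-flight check along these lines would have flagged the problem before any broker went away. It is a sketch with a hypothetical function name, and the replica assignments in the example are made up for illustration; the real input would come from describing every topic, internal ones included.

```python
def orphaned_partitions(assignment: dict, removing: list) -> list:
    """Partitions whose every replica lives on a broker about to be removed.

    assignment: {(topic, partition): [broker ids]}
    """
    doomed = set(removing)
    return sorted(
        tp for tp, replicas in assignment.items()
        if replicas and set(replicas) <= doomed
    )


if __name__ == "__main__":
    # Made-up assignments, for illustration only.
    assignment = {
        ("_confluent-tier-state", 0): [4, 5, 6],
        ("events", 0): [1, 4, 5],
    }
    print(orphaned_partitions(assignment, [4, 5, 6]))
```

Anything this returns has no surviving copy once the listed brokers are gone, so the removal should not proceed until the list is empty.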

explosion resulted.

fortunately this is just dogfood, so we can rebuild.
but it's going to be ugly, and mars what would otherwise be a _very_ successful proof of concept.
You can follow @lizthegrey.