I just decommissioned my last recon k8s cluster.
A thread on some stuff I learned.
Ultimately it worked, but considering hunters are basically just running a bunch of python and golang programs - I feel there are much better ways to scale.
1/?
Kubernetes for recon is expensive. Nodes are always running, and scaling up and down was difficult to get right. I ran K8S on AWS and DO, and the costs were fairly significant on each.
2/?
99%+ of the K8S community seems to use it only for web apps (understandable, as that's what it's designed for), so as a bounty hunter it can be difficult to find help with all the edge cases webapp folks don't run into.
3/?
I had problems with pod health monitors. They're simple with a web app or service in a pod, but I found monitoring a python script to be tricky and I never had a good solution. As a result, scripts would sometimes die, but the pods didn't, creating a bottleneck.
4/?
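For anyone hitting the same wall, here's roughly the pattern I'd reach for now (a sketch, not what I actually ran): have the script touch a heartbeat file after each unit of work, and point an exec livenessProbe at that file's age. Paths, timings, and the probe command are assumptions.

```python
# Sketch of a heartbeat-based liveness check for a long-running worker.
# The script touches a file after every unit of work; a k8s exec livenessProbe
# (e.g. `find /tmp/heartbeat -mmin -2`) can then restart the pod when the file
# goes stale. File path and interval are placeholder assumptions.
import time
from pathlib import Path

HEARTBEAT = Path("/tmp/heartbeat")  # must match whatever the probe checks

def do_one_unit_of_work() -> None:
    time.sleep(1)  # placeholder for the real recon task

def main() -> None:
    while True:
        do_one_unit_of_work()
        HEARTBEAT.touch()  # if the loop hangs, the probe sees a stale file and restarts the pod

if __name__ == "__main__":
    main()
```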
That one is probably a solvable problem, but given my full-time work + hobbies, I didn't want to put in the time. I considered paying a consultant, but would the return on investment make it worthwhile? The clusters weren't earning anywhere near their cost to run, so no.
5/?
My clusters made heavy use of RabbitMQ internally for job queuing. As I was learning K8S, I went through countless iterations of spawning a cron job to kick off the automation process, which would overwhelm the RMQ pod, which would then die and the queue would be lost.
6/?
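In hindsight, part of the fix was probably on the publishing side: drip-feed jobs into the queue instead of dumping the whole batch at once. A minimal sketch using pika - host, queue name, and rates are placeholders, not my actual setup:

```python
# Sketch (not my original code): throttled publisher so a cron-triggered batch
# doesn't overwhelm a single small RabbitMQ pod. Host, queue name, batch size,
# and pause are placeholder assumptions.
import time
import pika

BATCH_SIZE = 500       # publish in chunks...
PAUSE_SECONDS = 2.0    # ...with a pause between chunks as crude back-pressure

def publish_targets(targets: list[str]) -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="recon-jobs", durable=True)
    for i, target in enumerate(targets, start=1):
        channel.basic_publish(exchange="", routing_key="recon-jobs", body=target)
        if i % BATCH_SIZE == 0:
            time.sleep(PAUSE_SECONDS)
    connection.close()
```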
So, the solution to that? RabbitMQ Cluster. This means more pods, spread across nodes, and more resources. This worked better, but ultimately it just moved the capacity problems elsewhere...
7/?
I wrote a pod autoscaler which would adjust the pod count for the services I was running (DNS Resolvers, HTTP scanners, screenshot generator etc) based on queue sizes.
Scaling had to be severely rate-limited; if pods scaled up too fast, resources would run out.
8/?
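Roughly the shape of that autoscaler, reconstructed as a sketch rather than the exact code I ran - queue/deployment names, namespace, credentials, and thresholds are all placeholders:

```python
# Sketch of a queue-depth-based pod autoscaler: read the RabbitMQ management API,
# then patch the Deployment's replica count, clamping the step size so a full
# queue can't spawn dozens of pods at once. All names/values are assumptions.
import math
import time
import requests
from kubernetes import client, config

RMQ_API = "http://rabbitmq:15672/api/queues/%2F/recon-jobs"  # default mgmt port
MSGS_PER_POD = 1000   # aim for roughly this many queued messages per worker pod
MAX_STEP = 2          # never add/remove more than this many pods per pass
MIN_PODS, MAX_PODS = 1, 20

def desired_replicas(queue_depth: int, current: int) -> int:
    target = max(MIN_PODS, min(MAX_PODS, math.ceil(queue_depth / MSGS_PER_POD)))
    return current + max(-MAX_STEP, min(MAX_STEP, target - current))

def main() -> None:
    config.load_incluster_config()
    apps = client.AppsV1Api()
    while True:
        depth = requests.get(RMQ_API, auth=("guest", "guest"), timeout=10).json()["messages"]
        scale = apps.read_namespaced_deployment_scale("dns-resolver", "recon")
        current = scale.spec.replicas
        wanted = desired_replicas(depth, current)
        if wanted != current:
            apps.patch_namespaced_deployment_scale(
                "dns-resolver", "recon", {"spec": {"replicas": wanted}}
            )
        time.sleep(30)

if __name__ == "__main__":
    main()
```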
What happened when resources ran out? Nodes died, taking the pods (and often most of the RMQ cluster!) with them, creating errors on other pods which would then never recover...
9/?
So, why not scale nodes too? On AWS especially, I noticed that load on the existing nodes would spike too quickly and at least one node would die before more could be provisioned.
Again, maybe a solvable problem?
10/?
Maybe we could just have more nodes running all the time? Yeah... more cost.
Maybe we could just have more powerful initial nodes?
Yeah... more cost.
Costs I couldn't justify, as the clusters weren't even close to paying their own way, let alone turning a profit!
11/?
Maybe I could have employed a k8s consultant for a week to resolve a bunch of these issues?
Given what I HAD found, I wasn't confident fixing the issues would find enough to justify the cost.
12/?
An external RMQ service was also on the list of things to try. That would have solved the lost-queues issue (I didn't have success persisting to disk on RMQ pods - disk storage is a WHOLE other issue in k8s!).
13/?
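For reference, the client-side persistence knobs look like this (pika sketch; host and queue name are placeholders). Even with these set, the broker itself still needs durable storage - a PersistentVolume, or an external/managed RabbitMQ - for the queue to survive a pod being rescheduled:

```python
# Sketch of marking queues/messages durable on the RabbitMQ client side.
# This alone doesn't save you if the broker has no durable disk behind it.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="recon-jobs", durable=True)      # queue survives a broker restart
channel.basic_publish(
    exchange="",
    routing_key="recon-jobs",
    body="example.com",
    properties=pika.BasicProperties(delivery_mode=2),        # message written to disk
)
connection.close()
```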
I learned quickly that the database needed to be external to the cluster, for the same reason as RMQ: persisting to disk in a pod was painful.
14/?
The screenshot service was another problem. Those pods would die randomly, and even with all the debugging and exception handling I could add, they would just 'freeze' - not a full lock-up, the pod was still accessible, just nothing actually working.
15/?
Those also added to the resourcing problems. Chromium in a pod is **heavy** - I put the freezes down to resource starvation, and it joined the growing list of issues I didn't have a resolution for.
16/?
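If I ran it again I'd wrap the browser in a hard timeout, so a frozen Chromium kills one job rather than wedging the whole worker. A sketch - the binary path and flags are assumptions for a containerised Chromium:

```python
# Sketch of a watchdog around headless Chromium: a hard per-job timeout so a
# hung browser fails that one screenshot instead of silently freezing the pod.
import subprocess

def screenshot(url: str, out_path: str, timeout_s: int = 30) -> bool:
    cmd = [
        "chromium",
        "--headless",
        "--disable-gpu",
        "--no-sandbox",              # commonly needed inside containers
        f"--screenshot={out_path}",
        "--window-size=1280,800",
        url,
    ]
    try:
        subprocess.run(cmd, check=True, timeout=timeout_s)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False  # requeue or log the target; don't let it hang the worker
```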
In the end - I think most of these issues ARE solvable, and that maybe when done properly, k8s could work for a bounty hunter.
I think my main issue was that I was falling back to treating pods as pets.
The core code in each pod was designed to maintain a connection to...
17/?
...RabbitMQ, and consume items constantly. In hindsight I think this was a mistake. The pods probably should have had an intentionally limited lifespan - say 100 iterations each - before exiting, letting a new pod spawn in their place to process the next 100 items from the queue.
18/?
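Something like this is what I have in mind - a sketch of a batch-limited consumer (host, queue name, and the 100-message budget are assumptions):

```python
# Sketch of the 'cattle, not pets' worker: process a fixed number of messages,
# then exit cleanly and let k8s spawn a fresh pod for the next batch.
import pika

MAX_MESSAGES = 100

def handle(body: bytes) -> None:
    pass  # placeholder for the actual recon task (resolve, probe, screenshot...)

def main() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.queue_declare(queue="recon-jobs", durable=True)
    channel.basic_qos(prefetch_count=1)  # only hold one unacked message at a time

    processed = 0
    for method, properties, body in channel.consume("recon-jobs", inactivity_timeout=30):
        if method is None:     # queue went quiet; exit and let the autoscaler decide
            break
        handle(body)
        channel.basic_ack(method.delivery_tag)
        processed += 1
        if processed >= MAX_MESSAGES:
            break

    channel.cancel()
    connection.close()

if __name__ == "__main__":
    main()
```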
I'm rebuilding, redesigning, and reassessing my whole recon automation approach.
It's no secret that recon isn't actually required to be successful (about 5% of my bounty income is pure recon findings) but I think it can be used smarter to find the things to go deeper on.
19/?
I'm looking closely at AWS Lambda for a lot of this, and breaking it all down to the smallest pieces possible.
Time will tell if this approach is better. I'm hoping it will be much more reliable and much cheaper to operate, and ultimately find more stuff to hack on.
20/20
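Purely as an illustration of 'smallest pieces possible', the kind of Lambda-sized unit I'm thinking of (the event shape and where results end up are assumptions, not a final design):

```python
# Illustrative sketch only: one tiny Lambda that resolves a small batch of
# hostnames and returns the results to whatever invoked it.
import socket

def handler(event, context):
    results = {}
    for hostname in event.get("hostnames", []):
        try:
            infos = socket.getaddrinfo(hostname, None)
            results[hostname] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[hostname] = []  # didn't resolve
    return {"resolved": results}
```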