Changes in klog and logr have made automatic bumps from Dependabot
problematic.
We also shouldn't need klog v1 anymore, so that dependency has been removed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
In error cases these goroutines never exit.
When trying to debug such cases we end up with a bunch of these goroutines
stuck, which makes troubleshooting difficult.
We could just use a buffered channel, but that would make it less clear,
when an error occurs, what is actually happening.
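A minimal sketch of the failure mode (the names here are illustrative, not
the actual virtual-kubelet code):

package main

import (
	"context"
	"errors"
	"fmt"
)

// doWork launches a worker that reports its result on an unbuffered
// channel. If the caller bails out on ctx.Done() first, the worker
// blocks on its send forever and the goroutine leaks. Making the
// channel buffered (make(chan error, 1)) avoids the leak, at the cost
// of hiding what is still in flight when an error occurs.
func doWork(ctx context.Context) error {
	result := make(chan error) // unbuffered: the send blocks until received
	go func() {
		result <- errors.New("boom") // stuck here if the select already returned
	}()
	select {
	case err := <-result:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // force the error path
	fmt.Println(doWork(ctx))
}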
This starts watching for events prior to the start of the controller.
This smells like a bug in the fakeclient bits, but it seems to fix
the problem.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
We had added an optimization to dedupe pod status updates from the
provider. It ignored two subfields that can be updated along with status.
Because the details of subresource updating are somewhat API-server-centric,
I wrote an envtest which checks for this behaviour.
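A sketch of the dedupe comparison after the fix. The choice of subfields
here (labels and annotations, which a status-subresource update can also
carry) is an assumption for illustration, not the verbatim fix:

package main

import (
	"fmt"
	"reflect"

	corev1 "k8s.io/api/core/v1"
)

// podStatusEqual reports whether two pods are equal for the purpose of
// deduping provider status updates. Comparing Status alone misses
// subfields that can change alongside it.
func podStatusEqual(a, b *corev1.Pod) bool {
	return reflect.DeepEqual(a.Status, b.Status) &&
		reflect.DeepEqual(a.Labels, b.Labels) &&
		reflect.DeepEqual(a.Annotations, b.Annotations)
}

func main() {
	a := &corev1.Pod{}
	b := a.DeepCopy()
	b.Labels = map[string]string{"updated": "true"}
	fmt.Println(podStatusEqual(a, b)) // false: a label change must not be deduped
}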
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This refactor is preparation for another commit: I want to add instrumentation
around our queues. Queue-handling code was spread throughout the code base,
which made adding such instrumentation complicated.
This centralizes the queue management logic in queue.go, and only requires
the user to provide a name, a handler, and, optionally, a custom rate
limiter, roughly as sketched below.
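A sketch of the resulting shape (not the exact queue.go signatures):

package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// Queue bundles a rate-limited workqueue with the handler that drains it,
// so construction and instrumentation live in one place.
type Queue struct {
	name    string
	wq      workqueue.RateLimitingInterface
	handler func(ctx context.Context, key string) error
}

// New builds a queue from an optional custom rate limiter, a name, and a
// handler invoked for every dequeued key.
func New(rl workqueue.RateLimiter, name string, handler func(ctx context.Context, key string) error) *Queue {
	if rl == nil {
		rl = workqueue.DefaultControllerRateLimiter()
	}
	return &Queue{
		name:    name,
		wq:      workqueue.NewNamedRateLimitingQueue(rl, name),
		handler: handler,
	}
}

func main() {
	q := New(nil, "pods", func(ctx context.Context, key string) error {
		fmt.Println("handling", key)
		return nil
	})
	q.wq.Add("default/mypod")
	fmt.Println(q.name, "len:", q.wq.Len())
}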
The lease code is moved into its own package to simplify testing: the
goroutine leak tester was triggering incorrectly when other tests were
running, because it was measuring leaks from those tests.
This also identified buggy behaviour:
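// repro against k8s.io/client-go/util/workqueue (plus "fmt"):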
wq := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultItemBasedRateLimiter(), "test")
wq.AddRateLimited("hi")
fmt.Printf("Added hi, len: %d\n", wq.Len())
wq.Forget("hi")
fmt.Printf("Forgot hi, len: %d\n", wq.Len())
wq.Done("hi")
fmt.Printf("Done hi, len: %d\n", wq.Len())
---
Prints all 0s, because even non-delayed items are delayed. If you call Add
directly instead, the last line prints a len of 2.
// Workqueue docs:
// Forget indicates that an item is finished being retried. Doesn't matter whether it's for perm failing
// or for success, we'll stop the rate limiter from tracking it. This only clears the `rateLimiter`, you
// still have to call `Done` on the queue.
^----- Even this seems untrue
If a pod is being gracefully deleted when the pod controller starts up,
it will not get deleted via the deleteDanglingPods code. This ensures
the normal deletion loop covers that case.
This moves from forcefully deleting pods to deleting them gracefully
via the API server: we wait for the pod to reach a terminal status
(as sketched below) before deleting the pod object from the API server.
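The terminal check is conceptually this (a sketch; the helper name is
illustrative):

package example

import (
	corev1 "k8s.io/api/core/v1"
)

// terminal reports whether a pod has reached a phase from which it will
// never run again, and so can safely be removed from the API server.
func terminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded ||
		pod.Status.Phase == corev1.PodFailed
}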
This removes the legacy sync provider interface. All new providers
are expected to implement the async NotifyPods interface.
The legacy sync provider interface creates complexities around
how the deletion flow works, and the mixed sync and async APIs
block us from evolving functionality.
This collapses the NotifyPods interface into the PodLifecycleHandler
interface.
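The combined interface is roughly of this shape (a sketch, not the
verbatim definition):

package node // illustrative package name

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// PodLifecycleHandler is the single interface a provider now implements;
// the async NotifyPods is part of it rather than a separate interface.
type PodLifecycleHandler interface {
	CreatePod(ctx context.Context, pod *corev1.Pod) error
	UpdatePod(ctx context.Context, pod *corev1.Pod) error
	DeletePod(ctx context.Context, pod *corev1.Pod) error
	GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)
	GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
	GetPods(ctx context.Context) ([]*corev1.Pod, error)

	// NotifyPods asynchronously delivers pod updates to the controller
	// via the supplied callback.
	NotifyPods(ctx context.Context, cb func(*corev1.Pod))
}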
Allows callers to wait for pod controller exit in addition to readiness.
This means the caller does not have to handle errors from the pod
controller running in a goroutine, since it can wait for exit via `Done()`
and check the error with `Err()`.
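Callers can then drive the controller like this (a sketch; the interface
below just captures the surface this commit adds):

package example

import (
	"context"
	"log"
)

type podController interface {
	Run(ctx context.Context, workers int) error
	Ready() <-chan struct{}
	Done() <-chan struct{}
	Err() error
}

func run(ctx context.Context, pc podController) error {
	go func() { _ = pc.Run(ctx, 10) }() // errors surface via Done()/Err(), not here

	select {
	case <-pc.Ready():
		log.Println("pod controller ready")
	case <-pc.Done():
		return pc.Err() // exited before ever becoming ready
	}

	<-pc.Done() // wait for exit instead of plumbing errors out of the goroutine
	return pc.Err()
}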
We poll legacy providers for their pods' status periodically, because we
have no way of knowing when a pod is updated. If a pod somehow goes
missing in the provider, that state must be handled: currently we update
the API server and mark the pod as failed, or ignore it.
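The polling loop is conceptually this (a sketch; the interval and the
markFailed callback are illustrative):

package example

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// legacyProvider is the poll-only surface assumed for this sketch.
type legacyProvider interface {
	GetPods(ctx context.Context) ([]*corev1.Pod, error)
	GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
}

// pollPodStatuses re-reads every pod's status on a fixed interval, since
// a legacy provider cannot notify us of changes.
func pollPodStatuses(ctx context.Context, p legacyProvider, markFailed func(*corev1.Pod)) {
	wait.UntilWithContext(ctx, func(ctx context.Context) {
		pods, err := p.GetPods(ctx)
		if err != nil {
			return
		}
		for _, pod := range pods {
			if _, err := p.GetPodStatus(ctx, pod.Namespace, pod.Name); err != nil {
				// The pod went missing in the provider: reflect that in
				// the API server as failed (or ignore it, per policy).
				markFailed(pod)
			}
		}
	}, 30*time.Second)
}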
If the informers are starting at the same time as createPods, we can get
into a situation where the pod seems to get "lost". Instead, we wait for
the informer to sync before the create-pod event (see the sketch below).
This also moves to one informer as a micro-optimization in the tests.
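The ordering fix looks roughly like this (a sketch using a fake clientset):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes/fake"
	"k8s.io/client-go/tools/cache"
)

func main() {
	client := fake.NewSimpleClientset()
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)

	// Block until the informer has synced before creating test pods;
	// otherwise a pod created during startup can look "lost".
	if !cache.WaitForCacheSync(stop, podInformer.HasSynced) {
		panic("informer failed to sync")
	}
	fmt.Println("informer synced; safe to create pods")
}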
The lifecycle test had a hot loop: it ran a never-yielding function while
processing was going on elsewhere. This inserts a sleep. A sleep is used
rather than a yield to be kind to people's battery life.
It turns out that running atomic loads in a tight loop breaks Go: the
goroutine never yields control to the scheduler (tight loops that make no
function calls are not preempted by older Go runtimes), so we ended up in
a situation where the test would get stuck forever. This moves to a
different model, with a condition variable instead of atomics in loops.
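A sketch of the pattern (illustrative names; not the test code itself):

package main

import (
	"fmt"
	"sync"
)

// state lets waiters block on a condition variable instead of spinning
// on an atomic load, so they never starve the scheduler.
type state struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
}

func newState() *state {
	s := &state{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

func (s *state) bump() {
	s.mu.Lock()
	s.count++
	s.mu.Unlock()
	s.cond.Broadcast()
}

// waitFor blocks, without burning CPU, until count reaches n.
func (s *state) waitFor(n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.count < n {
		s.cond.Wait()
	}
}

func main() {
	s := newState()
	go func() {
		for i := 0; i < 3; i++ {
			s.bump()
		}
	}()
	s.waitFor(3)
	fmt.Println("done:", s.count)
}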
* Fix the deletion test to actually test the pod is deleted
* Fix the update pods test to update a value which is allowed
to be updated
* Shut down watches after tests
* Do not delete pod statuses on DeletePod in mock_test
This intentionally leaks pod statuses, but it makes the situation
much less complicated when handling race conditions with the
GetPodStatus callback.
This makes sure the update function works correctly if the pod spec is
changed after the pod is running. While writing the test, I realized we
were accessing counters from outside the goroutines the test workers were
running in, with no locks. Therefore, I converted all of those counters
to atomics (the standard sync/atomic pattern, sketched below).
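A sketch of the pattern (counter names are illustrative):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var created, updated int64 // counters shared with worker goroutines

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&created, 1) // safe concurrent increment
			atomic.AddInt64(&updated, 1)
		}()
	}
	wg.Wait()

	// Reads from the test goroutine also go through atomic loads.
	fmt.Println(atomic.LoadInt64(&created), atomic.LoadInt64(&updated))
}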