Our current retry policy is naive and only does 20 retries. It is
also based entirely on the rate limiter. If the user is somewhat aggressive in
their rate limiting, but there is a temporary outage on the API server, they
may want to continue to delay.
In fact, K8s has a built-in function to suggest delays:
https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay
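Something like the following could use it to pick the retry delay (this is a
sketch; only SuggestsClientDelay is the real API, the rest is illustrative):

package queue

import (
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/util/workqueue"
)

// retryDelay prefers the delay suggested by the API server error (if any)
// over the rate limiter's backoff for the given item.
func retryDelay(limiter workqueue.RateLimiter, key interface{}, err error) time.Duration {
    if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
        return time.Duration(seconds) * time.Second
    }
    return limiter.When(key)
}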
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
We had added an optimization that made it so we dedupe pod status updates
from the provider. This ignored two subfields that could be updated along
with the status.
Because the details of subresource updating are a bit API-server-centric,
I wrote an envtest which checks for this behaviour.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This refactor is preparation for another commit: I want to add instrumentation
around our queues. The queue-handling code was spread throughout the
code base, and that made adding such instrumentation needlessly complicated.
This centralizes the queue management logic in queue.go, and only requires
the user to provide a name, a handler, and (optionally) a custom rate
limiter.
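Roughly, I expect the constructor to take this shape (names here are
illustrative, not necessarily the final API):

package queue

import (
    "context"

    "k8s.io/client-go/util/workqueue"
)

// ItemHandler processes a single queue item.
type ItemHandler func(ctx context.Context, key string) error

// Queue wraps a named rate-limiting workqueue plus the caller's handler,
// so instrumentation can live in one place.
type Queue struct {
    wq      workqueue.RateLimitingInterface
    handler ItemHandler
}

// New takes an optional custom rate limiter, a name, and a handler.
func New(rateLimiter workqueue.RateLimiter, name string, handler ItemHandler) *Queue {
    if rateLimiter == nil {
        rateLimiter = workqueue.DefaultControllerRateLimiter()
    }
    return &Queue{
        wq:      workqueue.NewNamedRateLimitingQueue(rateLimiter, name),
        handler: handler,
    }
}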
The lease code is moved into its own package to simplify testing, because
the goroutine leak tester was triggering incorrectly if other tests
were running, and it was measuring leaks from those tests.
This also identified buggy behaviour:
wq := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultItemBasedRateLimiter(), "test")
wq.AddRateLimited("hi")
fmt.Printf("Added hi, len: %d\n", wq.Len())
wq.Forget("hi")
fmt.Printf("Forgot hi, len: %d\n", wq.Len())
wq.Done("hi")
fmt.Printf("Done hi, len: %d\n", wq.Len())
---
Prints all 0s because even non-delayed items are delayed. If you call Add
directly, then the last line prints a len of 2.
// Workqueue docs:
// Forget indicates that an item is finished being retried. Doesn't matter whether it's for perm failing
// or for success, we'll stop the rate limiter from tracking it. This only clears the `rateLimiter`, you
// still have to call `Done` on the queue.
^----- Even this seems untrue
As described in the issue, if the following sequence happens, we fail to properly
update the pod status in api server:
1. Create pod in k8s
2. Provider creates the pod and syncs its status back
3. Pod in k8s ready/running, all fine
4. Virtual kubelet fails to update node status for some time for whatever reason (e.g. network connectivity issues)
5. Virtual node marked as NotReady with message: Kubelet stopped posting node status
6. kube-controller-manager of k8s goes and marks all pods as Ready = false
7. Virtual kubelet never syncs the status of the pod in the provider back to k8s
For example:
If the provider is a K8s provider, a pod created by a deployment would be evicted when the node is not ready.
If we do not delete the pod in K8s, the deployment would not create a new one.
Add some tests for updateStatus
This creates a new package -- podutils. The env var related code
doesn't really have any business being part of the node package,
and to create a separation of concerns, faster tests, and just
general code isolation and cleanliness, we can move the env
var related code into this package. This change is purely hygiene,
and not logic related.
For node, the package is under internal, because the constructor
references manager, which is an internal package.
This solves the race condition as described in
https://github.com/virtual-kubelet/virtual-kubelet/issues/836.
It does this by checking two conditions when the possible race condition
is detected.
If we receive a pod notification from the provider, and it is not
in our known pods list:
1. Is our cache in-sync?
2. Is it known to our pod lister?
The first case can happen because of the order we start the
provider and sync our caches. The second case can happen because
even if the cache returns synced, it does not mean all of the callbacks
on the informer have quiesced.
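A minimal sketch of the two checks (the function name and plumbing are made
up; only the client-go types are real):

package node

import (
    "k8s.io/apimachinery/pkg/api/errors"
    corev1listers "k8s.io/client-go/listers/core/v1"
    "k8s.io/client-go/tools/cache"
)

// podVanishedFromAPIServer reports whether a pod notified by the provider,
// but missing from our known pods list, can safely be treated as deleted.
// Both checks must pass; otherwise we may just be racing the informer.
func podVanishedFromAPIServer(synced cache.InformerSynced, lister corev1listers.PodLister, namespace, name string) (bool, error) {
    // 1. If our cache has not finished its initial sync, the pod may simply
    //    not have been observed yet.
    if !synced() {
        return false, nil
    }
    // 2. Even when the cache reports synced, informer callbacks may not have
    //    quiesced, so also ask the pod lister directly.
    if _, err := lister.Pods(namespace).Get(name); err == nil {
        return false, nil
    } else if !errors.IsNotFound(err) {
        return false, err
    }
    return true, nil
}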
This slightly changes the behaviour of notifyPods so that it can block
(especially at startup). We can solve this later by using something like
a fair (ticket?) lock.
This moves from forcefully deleting pods to deleting them gracefully from
the API server. It waits for the pod to reach a terminal status before
deleting it from the API server.
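A terminal-status check could look roughly like this (sketch only; the real
code may key off different fields):

package node

import corev1 "k8s.io/api/core/v1"

// podIsTerminal reports whether the pod has reached a state from which it
// will not run again, and can therefore be removed from the API server.
func podIsTerminal(pod *corev1.Pod) bool {
    return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}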
Pods can be updated outside of VK. Right now, if this happens, pod
status updates are dropped because the resourceVersion from the
provider will mismatch what's on the server, breaking pod
status updates.
Since we're the only ones writing to the pod status, we
can do a blind overwrite.
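Something like the following sketch (the helper name and plumbing are
assumptions, and it assumes a client-go recent enough to take a context in
UpdateStatus):

package node

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    corev1listers "k8s.io/client-go/listers/core/v1"
)

// overwritePodStatus fetches the freshest copy of the pod so the
// resourceVersion matches the API server, then blindly writes the status
// reported by the provider onto it.
func overwritePodStatus(ctx context.Context, client kubernetes.Interface, lister corev1listers.PodLister, fromProvider *corev1.Pod) error {
    latest, err := lister.Pods(fromProvider.Namespace).Get(fromProvider.Name)
    if err != nil {
        return err
    }
    updated := latest.DeepCopy()
    updated.Status = fromProvider.Status
    _, err = client.CoreV1().Pods(updated.Namespace).UpdateStatus(ctx, updated, metav1.UpdateOptions{})
    return err
}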
This removes the legacy sync provider interface. All new providers
are expected to implement the async NotifyPods interface.
The legacy sync provider interface creates complexities around
how the deletion flow works, and the mixed sync and async APIs
block us from evolving functionality.
This collapses the NotifyPods interface into the PodLifecycleHandler
interface.
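Roughly, the collapsed interface looks like this (signatures are
approximations, not the exact API):

package node

import (
    "context"

    corev1 "k8s.io/api/core/v1"
)

// PodLifecycleHandler sketch: the sync CRUD methods plus the async
// notification hook that used to live on a separate interface.
type PodLifecycleHandler interface {
    CreatePod(ctx context.Context, pod *corev1.Pod) error
    UpdatePod(ctx context.Context, pod *corev1.Pod) error
    DeletePod(ctx context.Context, pod *corev1.Pod) error
    GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)
    GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
    GetPods(ctx context.Context) ([]*corev1.Pod, error)

    // NotifyPods was previously its own interface; the provider invokes the
    // callback whenever a pod changes.
    NotifyPods(ctx context.Context, cb func(*corev1.Pod))
}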
We poll legacy providers for their pod statuses periodically, because
we have no way of knowing when a pod is updated. If the pod somehow goes
missing in the provider, that state must be handled. Currently, we update
the API server and mark the pod as failed, or ignore it.
We introduce a map that can be used to store the pod status. With it,
we do not need to call GetPodStatus immediately after NotifyPods
is called. Instead, we stash the pod passed via NotifyPods
in a map we can access later. In addition to this, for legacy
providers, the logic to merge the pod and the pod status is
hoisted up to the loop.
Leaks are prevented by deleting the entry in the map as soon
as the pod is deleted from k8s.
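A minimal sketch of that map, assuming a namespace/name key (the field and
method names are made up):

package node

import (
    "sync"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
)

// knownPodStatuses stashes the most recent pod seen via NotifyPods so we do
// not need to call GetPodStatus right away.
type knownPodStatuses struct {
    mu    sync.Mutex
    items map[types.NamespacedName]*corev1.Pod
}

func (k *knownPodStatuses) store(pod *corev1.Pod) {
    k.mu.Lock()
    defer k.mu.Unlock()
    k.items[types.NamespacedName{Namespace: pod.Namespace, Name: pod.Name}] = pod
}

// remove is called once the pod is deleted from the API server, so the map
// does not leak entries.
func (k *knownPodStatuses) remove(namespace, name string) {
    k.mu.Lock()
    defer k.mu.Unlock()
    delete(k.items, types.NamespacedName{Namespace: namespace, Name: name})
}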
This moves to a model where, any time pods are given to a
provider, it uses a DeepCopy as opposed to a reference. If the
provider mutates the pod, this prevents it from causing issues
with the informer cache.
The comparison has to use reflect instead of comparing hashes because
spew prints DeepCopy'd data structures ever so slightly differently.
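For illustration (names are made up; only reflect.DeepEqual is the real call):

package node

import (
    "reflect"

    corev1 "k8s.io/api/core/v1"
)

// statusChanged compares with reflect.DeepEqual rather than hashing a spew
// dump, which renders DeepCopy'd structures slightly differently.
func statusChanged(oldPod, newPod *corev1.Pod) bool {
    return !reflect.DeepEqual(oldPod.Status, newPod.Status)
}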
* Fix error handling for delete pod
- Error handling was looking for a k8s error from the provider, but
providers should be using errdefs.
- Error handling was returning early if pod was not found and deleting
from k8s in all other cases.
* Don't run unit tests twice
* Move tracing exporter registration
This doesn't belong in the library and should be configured by the
consumer of the opencensus package.
* Rename `vkubelet` package to `node`
`vkubelet` does not convey any information to the consumers of the
package.
Really it would be nice to move this package to the root of the repo,
but then you wind up with... interesting... import semantics due to the
repo name... and after thinking about it some, a subpackage is really
not so bad as long as it has a name that conveys some information.
`node` was chosen since this package deals with all the semantics of
operating a node in Kubernetes.