This drops another dependency on k8s.io/kubernetes.
This does have the unfortunate side effect that implementers will now
get a compile error until they update their code to use the new type.
Just as a note:
The stats types have moved to k8s.io/kubelet; however, they are only
available there as of v1.20.
We currently support versions older than v1.20, and our go.mod still
imports from v1.19.
For now we copy the types in. Later we can remove the type definitions and
change them to type aliases for the k8s.io/kubelet types (which avoids
another compile-time break).
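As a rough sketch of that later step (the exact package path here is an
assumption, not a committed plan), the swap from copied definitions to
aliases would look something like:

// Once we only need to support k8s >= v1.20, the copied definitions can
// become aliases to the upstream package (path assumed).
package stats

import (
	statsv1alpha1 "k8s.io/kubelet/pkg/apis/stats/v1alpha1"
)

// Summary becomes the identical type to the upstream one, so implementers
// compiled against either spelling keep working.
type Summary = statsv1alpha1.Summary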
Anything relying on type assertions to determine whether something implements
this method will, unfortunately, break, and the breakage will be hard to
notice until runtime. We need to make sure to call this out in the release
notes.
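To illustrate the failure mode (this is a sketch, with stats standing in for
whichever stats package the caller was compiled against):

// A capability check like this stops matching the moment GetStatsSummary's
// return type changes, and nothing complains at compile time.
func hasStatsSummary(p interface{}) bool {
	_, ok := p.(interface {
		GetStatsSummary(context.Context) (*stats.Summary, error)
	})
	return ok // silently false for providers built against the old type
}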
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Our current retry policy is naive and only does 20 retries. It is
also based on the rate limiter. If the user is somewhat aggressive in
rate limiting, but there is a temporary outage of the API server, they
may want to continue to delay.
In fact, K8s has a built-in function to suggest delays:
https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay
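A sketch of how that could be wired into a retry loop (doUpdate and
defaultDelay are placeholders; apierrors is k8s.io/apimachinery/pkg/api/errors):

// Retry the operation, preferring the delay the API server suggests over a
// fixed default, and stopping when the context is cancelled.
func retryWithSuggestedDelay(ctx context.Context, defaultDelay time.Duration, doUpdate func(context.Context) error) error {
	for {
		err := doUpdate(ctx)
		if err == nil {
			return nil
		}
		delay := defaultDelay
		if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
			delay = time.Duration(seconds) * time.Second
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}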
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This starts watching for events prior to the start of the controller.
This smells like a bug in the fakeclient bits, but it seems to fix
the problem.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
We had added an optimization to dedupe pod status updates from the provider.
It ignored two subfields that can be updated along with status.
Because the details of subresource updating are a bit API-server-centric,
I wrote an envtest which checks for this behaviour.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This is a fundamentally different API from that of the K8s workqueue,
and one better suited to our needs. Specifically, we need a simple
queue which doesn't have complex features like delayed adds that
sit on "external" goroutines.
In addition, we need deep introspection into the operations of the
workqueue. Although you can get this on top of the K8s workqueue
by implementing a custom rate limiter, the problem is that
the underlying rate limiter's behaviour is still somewhat
opaque.
This basically has 100% code coverage.
If you share a rate limiter between workqueues, it breaks:
WQ1: Starts processing item (When)
WQ1: Fails to process item (When)
WQ1: Fails to process item (When)
WQ1: Fails to process item (When)
--- At this point we've backed off a bit ---
WQ2: Starts processing item (with same key, When)
WQ2: Succeeds at processing item (Forget)
WQ1: Fails to process item (When) ---> THIS RESULTS IN AN ERROR
This results in an error because it "forgot" the previous
rate limit.
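The sequence above can be reproduced directly against a shared rate limiter
from client-go's workqueue package:

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	rl := workqueue.DefaultItemBasedRateLimiter() // shared by WQ1 and WQ2

	// WQ1 fails three times; the per-item backoff grows with each When call.
	rl.When("key")
	rl.When("key")
	fmt.Println("delay after WQ1's failures:", rl.When("key"))

	// WQ2 processes the same key successfully and calls Forget,
	// which resets the item's failure count.
	rl.Forget("key")

	// WQ1 fails again, but the suggested delay is back at the minimum.
	fmt.Println("delay after WQ2's Forget:", rl.When("key"))
}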
Copy/paste some more kubernetes code. This is to remove the dep on
kubernetes/kubernetes from within exec.go
See #940
Signed-off-by: Miek Gieben <miek@miek.nl>
This refactor is a preparation for another commit. I want to add instrumentation
around our queues. The queue-handling code was spread throughout the code base,
and that made adding such instrumentation more complicated than it needed to be.
This centralizes the queue management logic in queue.go, and only requires
the user to provide a name, a handler, and, if they want one, a custom
rate limiter.
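The shape of the resulting API is roughly the following (names are
illustrative, not the exact queue.go surface):

// ItemHandler is invoked once per dequeued key.
type ItemHandler func(ctx context.Context, key string) error

type Queue struct {
	name        string
	ratelimiter workqueue.RateLimiter
	handler     ItemHandler
}

// New is all a caller needs: a rate limiter (optional, with a default),
// a name, and the handler.
func New(ratelimiter workqueue.RateLimiter, name string, handler ItemHandler) *Queue {
	if ratelimiter == nil {
		ratelimiter = workqueue.DefaultItemBasedRateLimiter()
	}
	return &Queue{name: name, ratelimiter: ratelimiter, handler: handler}
}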
The lease code is moved into its own package to simplify testing, because
the goroutine leak tester was triggering incorrectly if other tests
were running, and it was measuring leaks from those tests.
This also identified buggy behaviour:
wq := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultItemBasedRateLimiter(), "test")
wq.AddRateLimited("hi")
fmt.Printf("Added hi, len: %d\n", wq.Len())
wq.Forget("hi")
fmt.Printf("Forgot hi, len: %d\n", wq.Len())
wq.Done("hi")
fmt.Printf("Done hi, len: %d\n", wq.Len())
---
Prints all 0s because even non-delayed items are delayed. If you call Add
directly, then the last line prints a len of 2.
// Workqueue docs:
// Forget indicates that an item is finished being retried. Doesn't matter whether it's for perm failing
// or for success, we'll stop the rate limiter from tracking it. This only clears the `rateLimiter`, you
// still have to call `Done` on the queue.
^----- Even this seems untrue
This refactors the v1 lease controller. It makes two functional differences
to the lease controller:
* It no longer ties lease updates to node pings or node status updates
* There is no fallback mechanism to status updates
This also moves vk_envtest, allowing for future brown-box testing of the
lease controller with envtest
As described in the issue, if the following sequence happens, we fail to properly
update the pod status in api server:
1. Create pod in k8s
2. Provider creates the pod and syncs its status back
3. Pod in k8s ready/running, all fine
4. Virtual kubelet fails to update node status for some time for whatever reason (e.g. network connectivity issues)
5. Virtual node marked as NotReady with message: Kubelet stopped posting node status
6. The k8s kube-controller-manager goes and marks all of the node's pods as Ready = false
7. Virtual Kubelet never syncs the pod's status in the provider back to k8s
We were having issues with golint not properly reporting declarations of functions
without proper documentation (comments). This is due to a configuration issue with golangci-lint.
See: https://github.com/golangci/golangci-lint/issues/456
This moves node ping controller to using the new internal lock
API.
The reason for this is twofold:
* The channel approach that was used to notify other
controllers of changes could only be used once (at startup),
and couldn't be used in the future to broadcast node
ping status. The idea here is that we could move
to a sync.Cond style API and only wake up other controllers
on change, as opposed to constantly polling each other
* The problem with sync.Cond is that it's not context-friendly.
If we want to do things like wait on a sync.Cond and use a context
or a timer or similar, it doesn't work, whereas this API allows
context cancellation on condition change.
The idea is that as we have more controllers that act as centralized
sources of authority, they can broadcast out their state.
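For reference, a generic, context-friendly stand-in for sync.Cond's
Broadcast/Wait looks something like this (a sketch, not the actual internal
lock API; imports context and sync):

type notifier struct {
	mu sync.Mutex
	ch chan struct{}
}

func newNotifier() *notifier { return &notifier{ch: make(chan struct{})} }

// Broadcast wakes all current waiters by closing the channel, then installs
// a fresh channel for future waiters.
func (n *notifier) Broadcast() {
	n.mu.Lock()
	close(n.ch)
	n.ch = make(chan struct{})
	n.mu.Unlock()
}

// Wait blocks until the next Broadcast or until ctx is cancelled, which is
// the part a bare sync.Cond cannot do.
func (n *notifier) Wait(ctx context.Context) error {
	n.mu.Lock()
	ch := n.ch
	n.mu.Unlock()
	select {
	case <-ch:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}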
For example:
If the provider is a K8s provider, a pod created by a deployment would be evicted when the node is not ready.
If we do not delete the pod in K8s, the deployment will not create a new one.
Add some tests for updateStatus
This creates a new package -- podutils. The env var related code
doesn't really have any business being part of the node package,
and to create a separation of concerns, faster tests, and just
general code isolation and cleanliness, we can move the env
var related code into this package. This change is purely hygiene,
and not logic related.
For node, the package is under internal, because the constructor
references manager, which is an internal package.
There were some (additional) bugs that were easy-ish to introduce
by interleaving the provider-provided node and the server-provided
updated node. This removes the chance of that confusion.
This allows the use of a built-in provider to do things like mark a node
as ready once all the controllers are spun up.
The e2e tests now use this instead of waiting for the pod that the vk
provider is deployed in to be marked ready (which waited on
/stats/summary to be serving, and was racy).
This fixes a small logic bug in the lease code's check for owner
references that are not set correctly, and makes it so that we properly
log when owner references are set, but not set to the node that
is "us".
Change the place where we set the defaults for the node ping
and node status intervals. This problem manifested itself
as the node ping interval being 0 when it was left at
the default.
This makes two changes:
1. Invalid ping values and ping timeouts will not
allow VK to start up
2. We set the default values very early on in creation
of the node controller -- where all the other values
are set.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This takes a somewhat ham-fisted approach to dealing with lease
conflicts. These can happen if "someone" changes the lease underneath
us. Again, this should happen rarely, but it can happen (and does
happen in production systems).
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This moves the job of pinging the node provider into its own
goroutine. If it takes a long time, it shouldn't slow down
leases, and vice-versa.
It also adds timeouts for node pings. One of the problems
is that we don't know how long a node ping will take --
there could be a bunch of network calls underneath us.
The point of the lease is to say whether or not the
Kubelet is unreachable, not whether or not the node
pings are "passing".
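Roughly, the ping loop now has this shape (ping and record are placeholders
for the provider call and the shared result the lease controller reads):

// Runs in its own goroutine; each ping gets its own timeout so a slow
// provider cannot stall lease renewal.
func runPingLoop(ctx context.Context, interval, timeout time.Duration, ping func(context.Context) error, record func(error)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			pingCtx, cancel := context.WithTimeout(ctx, timeout)
			record(ping(pingCtx)) // leases read the recorded result; they never block on the ping
			cancel()
		}
	}
}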
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This sets / updates the node lease owner reference to the current
node. Previously, we did not set this, which had the interesting
problem of leaking node leases on clusters with node churn.
This allows users who have a shared informer that is *not* filtering on
node name to supply a filter for event handlers to ensure events do not
fire for pods not scheduled to the node.
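Such a filter can be as simple as comparing spec.nodeName (a sketch; the
exact filter-func signature is assumed, with corev1 being k8s.io/api/core/v1):

// Only let events through for pods actually scheduled to this node.
func podsOnNode(nodeName string) func(ctx context.Context, pod *corev1.Pod) bool {
	return func(_ context.Context, pod *corev1.Pod) bool {
		return pod.Spec.NodeName == nodeName
	}
}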
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
If both the metrics routes and the pod routes are attached to the same
mux with the pattern "/", it will panic. Instead, add the stats handler
function to PodHandlerConfig and set up the route if it is not nil.
This solves the race condition as described in
https://github.com/virtual-kubelet/virtual-kubelet/issues/836.
It does this by checking two conditions when the possible race condition
is detected.
If we receive a pod notification from the provider, and it is not
in our known pods list:
1. Is our cache in-sync?
2. Is it known to our pod lister?
The first case can happen because of the order we start the
provider and sync our caches. The second case can happen because
even if the cache returns synced, it does not mean all of the
callbacks on the informer have quiesced.
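An illustrative shape of that double check (not the actual code; apierrors is
k8s.io/apimachinery/pkg/api/errors, cache is k8s.io/client-go/tools/cache, and
corev1listers is k8s.io/client-go/listers/core/v1):

// Returns true only when the pod really is unknown to the API server view.
func podTrulyUnknown(key string, synced func() bool, pods corev1listers.PodLister) (bool, error) {
	// 1. If our caches have not synced yet, we cannot trust the known-pods
	//    list, so do not treat the pod as unknown.
	if !synced() {
		return false, nil
	}
	ns, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return false, err
	}
	// 2. Even with synced caches, ask the lister: informer callbacks may not
	//    have quiesced, so the pod can exist without being "known" to us yet.
	if _, err := pods.Pods(ns).Get(name); err != nil {
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		return false, err
	}
	return false, nil
}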
This slightly changes the behaviour of notifyPods so that it
can block (especially at startup). We can solve this later
by using something like a fair (ticket?) lock.
This follows suit with the other handlers and returns a NotImplemented
http.HandlerFunc when the lister is nil.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>