161 Commits

Author SHA1 Message Date
zhenqxuMSFT
dd31d67a53 Add error handler to recreate virtual node when it's deleted (#1143) 2023-09-14 11:58:27 -07:00
Jose Fernandez
394128a0f2 feat: Add methods to get length of PodController queues (#1145) 2023-09-14 18:32:19 +00:00
Salvatore Cirone
59fd7fddb6 add portforwarding support to node/api (#1102)
Co-authored-by: Pablo Borrelli <pablo.borrelli0@gmail.com>
Co-authored-by: windyear <1280027646@qq.com>
2023-06-15 18:45:10 -07:00
Heba Elayoty
077ee93fa2 fix: Fix opentelemetry dependencies issues (#1122)
Signed-off-by: Heba Elayoty <hebaelayoty@gmail.com>
2023-06-15 18:23:16 -07:00
pigletfly
4a14603c56 bump golang version to 1.19
Signed-off-by: pigletfly <wangbing.adam@gmail.com>
2023-04-14 12:47:29 +01:00
fnuarnav
2c155accb7 Prometheus metrics are encoded as text, not JSON (#1101)
Co-authored-by: Sanchit Mehta <sanchit.mehta602@gmail.com>
2023-04-06 08:03:43 +01:00
Salvatore Cirone
9c32bfb0ae Add support for Attach API functionality (#1090)
Co-authored-by: Pablo Borrelli <pablo.borrelli0@gmail.com>
2023-03-31 08:51:50 -07:00
fnuarnav
a457d445a3 feat: Implement new metrics endpoint for k8s 1.24+ (#1082) 2023-03-28 13:01:37 +01:00
Heba Elayoty
a2070739bb fix: Fix missing Backoff property for WebHookAuth (#1089) 2023-03-16 02:24:23 +00:00
Pires
eb5d959215 replace deprecated pointer funcs 2023-03-13 10:56:38 +00:00
Brian Goff
6feafcf018 Remove klogv2 alias
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-10-07 23:21:47 +00:00
Brian Goff
5db1443e33 Fix apparent bad copy/pasta in test causing panic
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-10-07 23:21:47 +00:00
Brian Goff
2c4442b17f Fix linting issues and update make lint target.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-10-07 23:21:47 +00:00
Brian Goff
70848cfdae Bump k8s deps to v0.24
This requires dropping otel down to v0.20 because the apiserver package
is importing it and some packages moved around with otel v1.
Even k8s v0.25 still uses this old version of otel, so we are stuck for
a bit (v0.26 will, as of now, use a newer otel version).

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-10-07 23:21:47 +00:00
Brian Goff
c668ae6ab6 Bump problematic deps
Changes in klog and logr have made automatic bumps from dependabot
problematic.
We also shouldn't need klogv1 so removed that.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-10-07 23:21:47 +00:00
lubingtan
67be3c681d Add default client
Signed-off-by: lubingtan <bingtlu@ebay.com>
2022-09-30 09:47:22 +08:00
Brian Goff
008fe17b91 Merge pull request #1015 from cpuguy83/gh_actions
Add github actions
2022-08-31 11:00:53 -07:00
Brian Goff
f617ccebc5 Fixup some new lint issues
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2022-08-31 00:58:56 +00:00
Kyle Anderson
a8f253088c Removed deprecated node Clustername
With this one line, vk fails to build against k8s 1.24 libs.

The comment says:

    // Deprecated: ClusterName is a legacy field that was always cleared by
    // the system and never used; it will be removed completely in 1.25.

Seems to be removed in 1.24 though.
2022-08-25 19:39:11 -07:00
Brian Goff
c9c0d99064 Rename NewNodeFromClient to just NewNode
Since we now store the client on the config, we don't need to use a
custom client.
2021-09-14 17:10:17 +00:00
Brian Goff
4974e062d0 Add webhook and anon auth support
Auth is not automatically enabled because this requires some
bootstrapping to work.
I'll leave this for some future work.
In the meantime people can use the current code similar to how they used
the node-cli code to inject their own auth.
2021-09-14 17:10:17 +00:00
Brian Goff
e1342777d6 Add API config to node set
This moves API handling into the node object so now everything can be
done in one place.

TLS is required.
In the current form, auth must be set up by the caller.
2021-09-14 17:10:17 +00:00
Brian Goff
597e7dc281 Make ControllerManager more useful
This changes `ControllerManager` to `Node`.

`Node` is created from a client where the VK lib is responsible for
creating all the things except the client (unless client is nil, then we
use the env client).

This should be a good replacement for node-cli.  It offers a simpler
API.  *It only works with leases enabled* since this seems always
desired, however an option could be added to disable if needed.

The intent of this is to provide a simpler way to get a vk node up and
running while also being extensible. We can slowly add options, but
they should be focussed on a use-case rather than trying to support
every possible scenario... in which case the user can just use the
controllers directly.
2021-09-14 17:10:14 +00:00
Brian Goff
22f329fcf0 Add extra logging for pod status update skip 2021-09-03 18:02:35 +00:00
Brian Goff
09ad3fe644 Return early on ping error
Found that this caused a panic after many many test runs.
It seems like we should have returned early since the pingResult is nil.
We don't want to update a lease when ping fails.
2021-08-24 18:49:42 +00:00
Brian Goff
68347d4ed1 Merge pull request #967 from cpuguy83/controller_manager2
Move some boiler plate startup logic to nodeutil
2021-06-01 12:05:59 -07:00
Brian Goff
f63c23108f Move some boiler plate startup logic to nodeutil
This makes a controller that handles the startup for the node and pod
controller.
Later if we add an "api controller" it can also be added here.

This is just part of reducing some of the boiler plate code so it is
easier to get off of node-cli.
2021-05-25 17:54:53 +00:00
Brian Goff
0543245668 lifecycle test: timeout send goroutine on context
In error cases these goroutines never exit.
Trying to debug cases we end up with a bunch of these goroutines stuck
making it difficult to troubleshoot.

We could just make a buffered channel, however this will make it less
clear, in cases of an error, what all is happening.
2021-05-18 23:06:55 +00:00
Brian Goff
8437e237be Copy stats types from upstream.
This drops another dependency on k8s.io/kubernetes.
This does have the unfortunate side effect that implementers will now
get a compile error until they update their code to use the new type.

Just as a note:

The stats types have moved to k8s.io/kubelet, however the stats types
are only there as of v1.20.
Currently we support older versions than v1.20, and even our go.mod
imports from v1.19.

For now we copy the types in. Later we can remove the type defs and
change them to type aliases to the k8s.io/kubelet types (which prevents
another compile time issue).

Anything relying on type assertions to determine if something implements
this method will, unfortunately, be broken and it will be hard to notice
until runtime. We need to make sure to call this out in the release
notes.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2021-05-05 23:01:52 +00:00
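
The type-assertion hazard called out in 8437e237be can be illustrated with a small sketch. All type and method names below are hypothetical stand-ins (the real method signature is simplified away); the point is only that a copied type definition is a distinct type, so an interface assertion against the new type silently fails at runtime, whereas a type alias would remain identical to the original.

```go
package main

import "fmt"

// upstreamSummary stands in for a stats type owned by an upstream package.
type upstreamSummary struct{ NodeName string }

// copiedSummary is a type definition: same fields, but a distinct type.
type copiedSummary struct{ NodeName string }

// aliasSummary is a type alias: it is identical to upstreamSummary.
type aliasSummary = upstreamSummary

// Two interface shapes a caller might type-assert against.
type oldStatsProvider interface{ GetStatsSummary() *upstreamSummary }
type newStatsProvider interface{ GetStatsSummary() *copiedSummary }

// legacyProvider has not been updated and still returns the upstream type.
type legacyProvider struct{}

func (legacyProvider) GetStatsSummary() *upstreamSummary {
	return &upstreamSummary{NodeName: "vk"}
}

func implementsOld(p interface{}) bool { _, ok := p.(oldStatsProvider); return ok }
func implementsNew(p interface{}) bool { _, ok := p.(newStatsProvider); return ok }

func main() {
	p := legacyProvider{}
	// The assertion against the copied type fails without any compile error.
	fmt.Println(implementsOld(p), implementsNew(p))

	// With an alias, the old return value is assignable as-is.
	var a *aliasSummary = p.GetStatsSummary()
	fmt.Println(a.NodeName)
}
```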
Brian Goff
405d5d63b1 Don't import pod util package from k/k
These are all simple changes that will not change w/o breaking API
changes upstream anyway.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2021-05-04 23:55:30 +00:00
Sargun Dhillon
b259cb0548 Add the ability to dictate custom retries
Our current retry policy is naive and only does 20 retries. It is
also based off of the rate limiter. If the user is somewhat aggressive in
rate limiting, but they have a temporary outage on API server, they
may want to continue to delay.

In fact, K8s has a built-in function to suggest delays:
https://pkg.go.dev/k8s.io/apimachinery/pkg/api/errors#SuggestsClientDelay

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
2021-04-14 10:52:26 -07:00
Sargun Dhillon
e95023b76e Fix test
This starts watching for events prior to the start of the controller.
This smells like a bug in the fakeclient bits, but it seems to fix
the problem.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
2021-04-14 10:52:26 -07:00
Sargun Dhillon
c40a255eae Remove errant double queue
This seems to be a typo where we erroneously double-queue a deletion,
but one without the "key".
2021-03-24 10:21:27 -07:00
Sargun Dhillon
c4582ccfbc Allow providers to update pod statuses
We had added an optimization that made it so we dedupe pod status updates
from the provider. This ignored two subfields that could be updated along
with status.

Because the details of subresource updating are a bit API server centric,
I wrote an envtest which checks for this behaviour.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
2021-02-16 12:30:53 -08:00
Sargun Dhillon
7feb175720 Split up lifecycle test wireUpSystem function
This splits up the wireUpSystem function into a chunk that makes it
"client agnostic". It also removes the requirement that the client
is faked.
2021-02-16 12:30:51 -08:00
Sargun Dhillon
0e1cc1566e Create envtest wrapper
Lift up a little bit of the common envtest code into a common wrapper function.
2021-02-16 12:30:51 -08:00
wadecai
3ff1694252 Fix race between k8s and provider when deleting pod 2021-02-16 17:45:55 +08:00
Sargun Dhillon
3a361ebabd queue: Add tracing
This adds tracing throughout the queues, so we can determine what's going on.
2021-02-08 11:07:03 -08:00
Sargun Dhillon
ac9a1af564 Replace golang workqueue with our own
This is a fundamentally different API than that of the K8s workqueue
which is better suited for our needs. Specifically, we need a simple
queue which doesn't have complex features like delayed adds that
sit on "external" goroutines.

In addition, we need deep introspection into the operations of the
workqueue. Although you can get this on top of the K8s workqueue
by implementing a custom rate limiter, the problem is that
the underlying rate limiter's behaviour is still somewhat
opaque.

This basically has 100% code coverage.
2021-02-08 11:07:03 -08:00
Sargun Dhillon
82452a73a5 Split out rate limiter per workqueue
If you share a ratelimiter between workqueues, it breaks.

WQ1: Starts processing item (When)
WQ1: Fails to process item (When)
WQ1: Fails to process item (When)
WQ1: Fails to process item (When)
--- At this point we've backed off a bit ---
WQ2: Starts processing item (with same key, When)
WQ2: Succeeds at processing item (Forget)
WQ1: Fails to process item (When) ---> THIS RESULTS IN AN ERROR

This results in an error because it "forgot" the previous
rate limit.
2021-02-02 11:40:58 -08:00
Miek Gieben
c9969ee33d Import kubernetes/remotecommand
Copy/paste some more kubernetes code. This is to remove the dep on
kubernetes/kubernetes from within exec.go

See #940

Signed-off-by: Miek Gieben <miek@miek.nl>
2021-01-12 13:18:30 +01:00
Sargun Dhillon
1b8597647b Refactor queue code
This refactor is a preparation for another commit. I want to add instrumentation
around our queues. The code of how queues were handled was spread throughout
the code base, and that made adding such instrumentation nice and complicated.

This centralizes the queue management logic in queue.go, and only requires
the user to provide a (custom) rate limiter, if they want to, a name,
and a handler.

The lease code is moved into its own package to simplify testing, because
the goroutine leak tester was triggering incorrectly if other tests
were running, and it was measuring leaks from those tests.

This also identified buggy behaviour:

wq := workqueue.NewNamedRateLimitingQueue(workqueue.DefaultItemBasedRateLimiter(), "test")
wq.AddRateLimited("hi")
fmt.Printf("Added hi, len: %d\n", wq.Len())

wq.Forget("hi")
fmt.Printf("Forgot hi, len: %d\n", wq.Len())

wq.Done("hi")
fmt.Printf("Done hi, len: %d\n", wq.Len())

---
Prints all 0s because even non-delayed items are delayed. If you call Add
directly, then the last line prints a len of 2.

// Workqueue docs:
// Forget indicates that an item is finished being retried.  Doesn't matter whether it's for perm failing
// or for success, we'll stop the rate limiter from tracking it.  This only clears the `rateLimiter`, you
// still have to call `Done` on the queue.

^----- Even this seems untrue
2021-01-08 00:56:05 -08:00
Sargun Dhillon
735eb34829 This adds the v1 lease controller
This refactors the v1 lease controller. It makes two functional differences
to the lease controller:
* It no longer ties lease updates to node pings or node status updates
* There is no fallback mechanism to status updates

This also moves vk_envtest, allowing for future brown-box testing of the
lease controller with envtest
2021-01-05 11:40:44 -08:00
Sargun Dhillon
de7f7dd173 Fix issue #899: Pod status out of sync after being marked as not ready by controller manager
As described in the issue, if the following sequence happens, we fail to properly
update the pod status in api server:

1. Create pod in k8s
2. Provider creates the pod and syncs its status back
3. Pod in k8s ready/running, all fine
4. Virtual kubelet fails to update node status for some time for whatever reason (e.g. network connectivity issues)
5. Virtual node marked as NotReady with message: Kubelet stopped posting node status
6. kube-controller-manager of k8s goes and marks all pods as Ready = false
7. Virtual kubelet never syncs status of pod in provider back to k8s
2020-12-07 16:50:00 -08:00
Sargun Dhillon
0d1f6f1625 Add Stutter linter
This also adds a bunch of nolints for the node package which
has a ton of stuttering. Perhaps something to mitigate in another
iteration.
2020-12-07 08:51:57 -08:00
Sargun Dhillon
d29adf5ce3 Add Gocritic
This also fixes the issues laid out by gocritic
2020-12-06 13:20:03 -08:00
Sargun Dhillon
c0d5809285 Add nolintlint to warn us of extraneous nolint comments 2020-12-05 10:59:10 -08:00
Sargun Dhillon
bbe4551940 Fix linter exemptions in golint
We were having issues with golint not properly reporting declaration of functions
without proper documentation (comments). This is due to a config with golangci.

See: https://github.com/golangci/golangci-lint/issues/456
2020-12-05 10:59:10 -08:00
Brian Goff
4fd2b754b5 Merge pull request #923 from sargun/fix-linter
Enable all linters by default
2020-12-04 10:50:28 -08:00
Sargun Dhillon
11c63bca6f Refactor the way that node_ping_controller works
This moves node ping controller to using the new internal lock
API.

The reason for this is twofold:
* The channel approach that was used to notify other
  controllers of changes could only be used once (at startup),
  and couldn't be used in the future to broadcast node
  ping status. The idea here is that we could move
  to a sync.Cond style API and only wakeup other controllers
  on change, as opposed to constantly polling each other
* The problem with sync.Cond is that it's not context friendly.
  If we want to do stuff like wait on a sync.Cond and use a context
  or a timer or similar, it doesn't work, whereas this API allows
  context cancellations on condition change.

The idea is that as we have more controllers that act as centralized
sources of authority, they can broadcast out their state.
2020-12-03 11:40:01 -08:00