Commit Graph

133 Commits

Author SHA1 Message Date
Adrien Trouillaud
845b4cd409 upgrade k8s libs to 1.18.4 2020-07-07 21:00:56 -07:00
Sargun Dhillon
e805cb744a Introduce three-way patch for proper handling of out-of-band status updates
As described in the patch itself, if a node is updated out of band
(e.g. by node-problem-detector (https://github.com/kubernetes/node-problem-detector)),
our typical two-way strategic merge patch for node status updates will
overwrite that out-of-band change.

The standard kubelet can handle this because its flow goes:
apiserver->kubelet: Fetch current node
kubelet->kubelet: Update apiserver's snapshot with local state changes
kubelet->apiserver: patch

We don't have this luxury, as we rely on providers making a callback into us
in order to get the most recent node status. Providers have no way
to do that merge operation themselves, and a two-way merge doesn't
give us enough metadata.

In order to work around this, we perform a three-way merge on behalf of
the user. We do this by stashing the contents of the last applied update
on the node object itself. We then fetch that stashed status back and use
it as the original for the next update.

In the upgrade case, or the case where the node was created by
"someone else" rather than this VK instance, we do not know which
attributes were written by us, so we cannot generate a three-way patch.

In that case, we do our best never to delete attributes, only to
overwrite them: all current API server values are treated as if written
by "someone else" and are never removed. This is done
by treating the "old node" as empty.
2020-07-06 11:10:32 -07:00
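
In sketch form, the three-way merge described above can be built with apimachinery's strategic merge patch helpers; the helper name and wiring below are illustrative, assuming the stashed last-applied node has already been recovered:

```go
package node

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

// buildNodeStatusPatch builds a three-way strategic merge patch for a node
// status update. lastApplied is the stashed copy of what we last wrote,
// current is the object fetched from the API server, and desired carries the
// provider's newest status. Names are illustrative, not the repo's actual code.
func buildNodeStatusPatch(lastApplied, current, desired *corev1.Node) ([]byte, error) {
	originalJSON, err := json.Marshal(lastApplied)
	if err != nil {
		return nil, err
	}
	currentJSON, err := json.Marshal(current)
	if err != nil {
		return nil, err
	}
	desiredJSON, err := json.Marshal(desired)
	if err != nil {
		return nil, err
	}
	// A three-way merge only deletes fields we previously wrote that are now
	// absent, so out-of-band changes (e.g. conditions added by
	// node-problem-detector) survive the patch.
	return strategicpatch.CreateThreeWayMergePatch(originalJSON, desiredJSON, currentJSON, corev1.Node{}, false)
}
```
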
Brian Goff
5306173408 Merge pull request #846 from sargun/add-trace-to-updateStatus
Add instrumentation to node controller (tracing)
2020-07-01 12:53:27 -07:00
Sargun Dhillon
30aabe6fcb Add instrumentation to node controller (tracing)
This adds tracing to the node controller in several sections where
it was missing.
2020-07-01 12:40:09 -07:00
Sargun Dhillon
1e8c16877d Make node status updates non-blocking
There's a (somewhat) common case we can get into where the node
status update loop is busy while a provider is trying to send
a node status update. Right now, we block the provider from
creating a notification in this case.
2020-07-01 12:32:54 -07:00
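
A common way to make such a notification non-blocking is a capacity-one channel with a coalescing send; this is only a sketch of the idea, not the controller's actual code:

```go
package node

// statusNotifications has capacity 1, so a pending notification acts as a
// dirty flag for the node status update loop.
var statusNotifications = make(chan struct{}, 1)

// enqueueStatusUpdate records that a fresh node status is available without
// ever blocking the provider's callback; if an update is already pending,
// the new one coalesces with it.
func enqueueStatusUpdate() {
	select {
	case statusNotifications <- struct{}{}:
		// notification queued
	default:
		// an update is already pending; coalesce with it
	}
}
```
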
wadecai
ca417d5239 Expose the queue rate limiter 2020-06-26 10:45:41 +08:00
wadecai
fedffd6f2c Add parameters to support changing work queue QPS 2020-06-26 10:44:09 +08:00
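
These knobs presumably end up in client-go's workqueue rate limiter; a hedged example of wiring a configurable QPS/burst into a rate-limited queue (parameter names are illustrative):

```go
package queue

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// newRateLimitedQueue builds a work queue whose overall QPS and burst are
// configurable. The exponential limiter handles per-item retry backoff, while
// the bucket limiter caps the overall rate.
func newRateLimitedQueue(qps float64, burst int) workqueue.RateLimitingInterface {
	limiter := workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(qps), burst)},
	)
	return workqueue.NewRateLimitingQueue(limiter)
}
```
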
Weidong Cai
2398504d08 dedup in updatePodStatus (#830)
Co-authored-by: Brian Goff <cpuguy83@gmail.com>
2020-06-15 14:35:14 -07:00
wadecai
3db9ab97c6 Avoid enqueue when the status of k8s pods changes 2020-06-13 13:19:55 +08:00
Brian Goff
51b9a6c40d Fix stream timeout defaults
This was an unintentional breaking change in
0bdf742303

A timeout of 0 doesn't make any sense, so use the old value of 30s as the
default.
2020-06-03 10:01:34 -07:00
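
Restoring the old default could look roughly like this; the config struct and field names are placeholders, not the repository's actual types:

```go
package api

import "time"

const defaultStreamTimeout = 30 * time.Second

// streamConfig is a placeholder for the API config carrying the stream
// timeouts; the field names here are illustrative.
type streamConfig struct {
	StreamIdleTimeout     time.Duration
	StreamCreationTimeout time.Duration
}

// applyStreamTimeoutDefaults restores the old 30s default when the caller
// left a timeout at its zero value, since a timeout of 0 is meaningless.
func applyStreamTimeoutDefaults(cfg *streamConfig) {
	if cfg.StreamIdleTimeout == 0 {
		cfg.StreamIdleTimeout = defaultStreamTimeout
	}
	if cfg.StreamCreationTimeout == 0 {
		cfg.StreamCreationTimeout = defaultStreamTimeout
	}
}
```
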
Vilmos Nebehaj
3e0d03c833 Use errdefs.InvalidInputf() for formatting 2020-04-28 11:19:37 -07:00
Vilmos Nebehaj
7628c13aeb Add tests for parseLogOptions() 2020-04-28 11:19:37 -07:00
Vilmos Nebehaj
8308033eff Add support for v1.PodLogOptions 2020-04-28 11:19:37 -07:00
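
A hedged sketch of what parsing query parameters into a v1.PodLogOptions can look like (only a few fields shown; the repository's parseLogOptions validates more and wraps errors with errdefs):

```go
package api

import (
	"net/url"
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// parseLogOptionsSketch converts container-logs query parameters into a
// v1.PodLogOptions. Illustrative only.
func parseLogOptionsSketch(q url.Values) (*v1.PodLogOptions, error) {
	opts := &v1.PodLogOptions{
		Follow:     q.Get("follow") == "true",
		Timestamps: q.Get("timestamps") == "true",
		Previous:   q.Get("previous") == "true",
	}
	if tl := q.Get("tailLines"); tl != "" {
		n, err := strconv.ParseInt(tl, 10, 64)
		if err != nil {
			return nil, err
		}
		opts.TailLines = &n
	}
	return opts, nil
}
```
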
wadecai
30e31c0451 Check pod status equal before enqueue 2020-04-21 10:42:29 +08:00
Sargun Dhillon
5ad12cd476 Add /pods HTTP endpoint 2020-03-20 12:04:00 -07:00
guoliangshuai
554d30a0b1 Add 'GET' method to pod exec handler so it can support WebSocket 2020-03-09 14:16:49 +08:00
Vilmos Nebehaj
47112aa5d6 Use correct Flush() prototype from http.Flusher
When calling GetContainerLogs(), a type check is performed to see if the
http.ResponseWriter supports flushing. However, Flush() in http.Flusher
does not return an error, therefore the type check will always fail.

Fix the flushWriter helper interface so flushing the writer will work.
2020-01-20 13:27:36 -08:00
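
The fix boils down to asserting against the standard http.Flusher, whose Flush() returns nothing, instead of an interface whose Flush() returns an error; a sketch of such a flushWriter helper:

```go
package api

import (
	"io"
	"net/http"
)

// flushWriter writes to w and flushes after every write when the underlying
// writer supports it. Illustrative sketch of the helper described above.
type flushWriter struct {
	w io.Writer
	f http.Flusher
}

func newFlushWriter(w io.Writer) io.Writer {
	fw := &flushWriter{w: w}
	// http.Flusher's Flush() has no return value, so this assertion succeeds
	// for a normal http.ResponseWriter.
	if f, ok := w.(http.Flusher); ok {
		fw.f = f
	}
	return fw
}

func (fw *flushWriter) Write(p []byte) (int, error) {
	n, err := fw.w.Write(p)
	if fw.f != nil {
		fw.f.Flush()
	}
	return n, err
}
```
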
Weidong Cai
0bdf742303 Make exec timeout configurable (#803)
* make exec timeout configurable
2020-01-18 12:11:54 -08:00
wadecai
55f3f17ba0 add some event to pod 2019-11-29 14:33:00 +08:00
Brian Goff
6e33b0f084 [Sync Provider] Fix panic on not found pod status 2019-11-15 09:44:29 -08:00
Thomas Hartland
c258614d8f After handling status update, reset update timer with correct duration
If the ping timer is being used, it should be reset with the ping update
interval. If the status update interval is used instead, Ping stops being
called for long enough that Kubernetes marks the node as NotReady.
2019-11-11 14:29:52 +01:00
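
In sketch form, the fix is to re-arm each timer with its own interval after handling its event (illustrative, not the controller's actual loop):

```go
package node

import "time"

// updateLoopSketch resets each timer with its own interval after handling
// its event, instead of unconditionally using the status update interval.
func updateLoopSketch(done <-chan struct{}, pingInterval, statusInterval time.Duration) {
	pingTimer := time.NewTimer(pingInterval)
	statusTimer := time.NewTimer(statusInterval)
	defer pingTimer.Stop()
	defer statusTimer.Stop()

	for {
		select {
		case <-done:
			return
		case <-pingTimer.C:
			// ping the provider, then re-arm with the ping interval,
			// not the (much longer) status update interval
			pingTimer.Reset(pingInterval)
		case <-statusTimer.C:
			// push a full node status update
			statusTimer.Reset(statusInterval)
		}
	}
}
```
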
Thomas Hartland
3783a39b26 Add test for node ping interval 2019-11-11 14:29:52 +01:00
Brian Goff
0ccf5059e4 Put sync lifecycle tests behind the -short flag.
This lets you skip tests for the slower sync provider.
2019-10-29 15:05:35 -07:00
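
Gating a slow test behind -short is the standard testing.Short() pattern; the test name below is a placeholder:

```go
package node

import "testing"

// TestSyncProviderLifecycle is a placeholder name; the point is the standard
// -short gate for slow tests.
func TestSyncProviderLifecycle(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping slow sync-provider lifecycle test in -short mode")
	}
	// ... exercise the wrapped sync provider ...
}
```
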
Brian Goff
31c8fbaa41 Apply suggestions from code review
Typos and punctuation fixes.

Co-Authored-By: Pires <1752631+pires@users.noreply.github.com>
2019-10-24 09:23:33 -07:00
Brian Goff
4ee2c4d370 Re-add support for sync providers
This brings back support for sync providers by wrapping them in a
provider that handles async notifications.
2019-10-24 09:23:28 -07:00
Sargun Dhillon
c314045d60 Ensure dangling pods that are still deleting at startup get deleted (#784)
If a pod is being gracefully deleted at podcontroller startup,
it will not get deleted via the deletedanglingpods code. This
ensures the normal deletion loop covers that case.
2019-10-22 06:45:36 -04:00
Sargun Dhillon
d22265e5f5 Do not delete pods in a non-graceful manner
This moves from forcefully deleting pods to deleting them gracefully
from the API server. It waits for the pod to reach a terminal status
before deleting it from the API server.
2019-10-17 09:58:21 -07:00
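
A hedged sketch of the "wait for a terminal status" check (the helper is illustrative):

```go
package node

import corev1 "k8s.io/api/core/v1"

// podIsTerminal reports whether the pod has reached a terminal phase, meaning
// it is safe to remove it from the API server.
func podIsTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
}
```
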
Sargun Dhillon
871424368f Fix pod status updates for when pod is updated outside of VK
Pods can be updated outside of VK. Right now, if this happens, pod
status updates are dropped because the resourceVersion from the
provider no longer matches what's on the server, breaking
pod status updates.

Since we're the only ones writing to the pod status, we
can do a blind overwrite.
2019-10-11 16:32:48 -07:00
Sargun Dhillon
cdc261a08d Use go-cmp to compare pods to suppress duplicate updates
Rather than copying the pods, this uses go-cmp and filters out
the paths which should not be compared.
2019-10-10 13:25:27 -07:00
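
With go-cmp, suppressing duplicate updates amounts to comparing pods while filtering out paths that churn on every write; the ignored fields and comparers below are illustrative, not the repository's exact set:

```go
package node

import (
	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podsEffectivelyEqual compares two pods while ignoring metadata that changes
// on every write, so identical pods don't trigger duplicate updates.
func podsEffectivelyEqual(a, b *corev1.Pod) bool {
	return cmp.Equal(a, b,
		// these fields change on every write without meaning a real update
		cmpopts.IgnoreFields(metav1.ObjectMeta{}, "ResourceVersion", "Generation", "ManagedFields"),
		// compare quantities by value rather than by internal representation
		cmp.Comparer(func(x, y resource.Quantity) bool { return x.Cmp(y) == 0 }),
	)
}
```
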
Sargun Dhillon
4202b03cda Remove sync provider support
This removes the legacy sync provider interface. All new providers
are expected to implement the async NotifyPods interface.

The legacy sync provider interface creates complexities around
how the deletion flow works, and the mixed sync and async APIs
block us from evolving functionality.

This collapses the NotifyPods interface into the PodLifecycleHandler
interface.
2019-10-02 09:28:09 -07:00
toshi0607
bcfc2accf8 misspell 2019-09-26 20:52:06 +09:00
toshi0607
b712751c6d gofmt 2019-09-26 20:50:36 +09:00
Sargun Dhillon
82a430ccf7 Add unused code linter 2019-09-24 12:55:52 -07:00
Sargun Dhillon
ea8495c3a1 Wait for Workers to exit prior to returning from PodController.Run
This changes the behaviour slightly: rather than immediately exiting on
context cancellation, this calls shutdown and waits for the in-flight
items to finish being worked on before returning to the user.
2019-09-12 11:04:32 -07:00
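
The usual shape of "shut the queue down, then wait for the workers" uses a WaitGroup around the worker goroutines; a sketch under those assumptions:

```go
package node

import (
	"context"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// runWorkersSketch starts n workers and, on context cancellation, shuts the
// queue down and waits for every worker to finish its in-flight item before
// returning. Illustrative of the behaviour described above.
func runWorkersSketch(ctx context.Context, q workqueue.RateLimitingInterface, n int, process func(key string)) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := q.Get()
				if shutdown {
					return
				}
				process(item.(string))
				q.Done(item)
			}
		}()
	}

	<-ctx.Done()
	q.ShutDown() // lets workers drain and then return from Get()
	wg.Wait()    // do not return to the caller until workers have exited
}
```
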
Brian Goff
334baa73cf Merge pull request #743 from chewong/pod-status-nil-pointer
Add unit tests for #584
2019-09-11 14:49:55 -07:00
Brian Goff
bb9ff1adf3 Adds Done() and Err() to pod controller (#735)
Allows callers to wait for pod controller exit in addition to readiness.
This means the caller does not have to deal with handling errors from the
pod controller running in a goroutine, since it can wait for exit via
`Done()` and check the error with `Err()`.
2019-09-10 17:44:19 +01:00
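
Caller-side usage then looks roughly like the following; the Run/Ready signatures are assumed from the description above rather than quoted from the code:

```go
package example

import (
	"context"

	"github.com/virtual-kubelet/virtual-kubelet/node"
)

// waitForPodController starts the controller and blocks until it is ready,
// or returns the controller's error if it exits first. Illustrative usage of
// Done()/Err().
func waitForPodController(ctx context.Context, pc *node.PodController, workers int) error {
	// Errors from Run are also surfaced via Err() once Done() is closed,
	// so the goroutine does not need its own error handling.
	go pc.Run(ctx, workers)

	select {
	case <-pc.Ready():
		return nil
	case <-pc.Done():
		return pc.Err()
	}
}
```
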
Ernest Wong
fdb0c805f7 Add more unit test to #584 2019-09-05 10:48:35 -07:00
Ernest Wong
dc7ff44303 Add unit tests for #584 2019-09-05 09:49:41 -07:00
Sargun Dhillon
da57373abb Test pods going missing while they're running in legacy providers (#759)
We poll legacy providers for their pods' status periodically, because
we have no way of knowing when a pod is updated. If a pod somehow goes
missing in the provider, that state must be handled. Currently, we either
update the API server and mark the pod as failed, or we ignore it.
2019-09-04 22:16:14 +01:00
Sargun Dhillon
33df981904 Have NotifyPods store the pod status in a map (#751)
We introduce a map that can be used to store the pod status. With this,
we do not need to call GetPodStatus immediately after NotifyPods
is called. Instead, we stash the pod passed via NotifyPods
in a map we can access later. In addition, for legacy
providers, the logic to merge the pod and the pod status is
hoisted up to the loop.

It prevents leaks by deleting the entry in the map as soon
as the pod is deleted from k8s.
2019-09-04 20:14:34 +01:00
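
A hedged sketch of such a status map, keyed by namespace/name and pruned when the pod is deleted from k8s:

```go
package node

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// knownPods stashes the last pod (status) received via NotifyPods so the
// controller does not need to call GetPodStatus again. Illustrative only.
type knownPods struct {
	mu   sync.Mutex
	pods map[string]*corev1.Pod
}

func key(pod *corev1.Pod) string { return pod.Namespace + "/" + pod.Name }

// store records the latest pod handed to us by the provider.
func (k *knownPods) store(pod *corev1.Pod) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if k.pods == nil {
		k.pods = make(map[string]*corev1.Pod)
	}
	k.pods[key(pod)] = pod
}

// forget removes the entry once the pod is deleted from k8s, preventing leaks.
func (k *knownPods) forget(pod *corev1.Pod) {
	k.mu.Lock()
	defer k.mu.Unlock()
	delete(k.pods, key(pod))
}
```
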
Sargun Dhillon
7133a372d6 Mark current linting errors as non-errors
This is basically claiming linting bankruptcy. It marks all of the
issues we had up until this point as nolint.
2019-09-03 11:00:33 -07:00
Sargun Dhillon
5949e6279d Miscellaneous cleanup for linting 2019-09-03 11:00:33 -07:00
Sargun Dhillon
9cce8640a5 Fix linting errors in node/pod_test.go
This moves away from defining pods independently in each test; pod (spec)
generation now lives in a separate function.
2019-09-03 11:00:33 -07:00
Sargun Dhillon
7accddcaf4 Fix linting errors in node/podcontroller.go 2019-09-03 11:00:33 -07:00
Brian Goff
2507f57f97 Merge pull request #732 from sargun/move-around-reactor
Move location of eventhandler registration
2019-09-03 10:44:52 -07:00
Sargun Dhillon
43ee086360 Fix mock_test DeletePod to store updated pod status 2019-08-25 10:42:35 -07:00
Sargun Dhillon
ccb6713b86 Move location of eventhandler registration
This moves the event handler registration to after the cache
is in sync.

It makes it so we can use the log object from the context,
rather than having to use the global logger.

The race condition of the cache starting while the reactor
is being added won't exist, because we wait for the cache
to start up and go in sync prior to adding it.
2019-08-18 08:20:49 -07:00
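
The ordering described here follows the familiar informer pattern of syncing the cache before wiring handlers; a sketch with client-go (function and parameter names are illustrative):

```go
package node

import (
	"context"
	"fmt"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// startPodInformerSketch starts the informer factory, waits for the cache to
// sync, and only then registers the event handler, avoiding the race between
// cache startup and handler/reactor registration.
func startPodInformerSketch(ctx context.Context, factory informers.SharedInformerFactory, handler cache.ResourceEventHandler) error {
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(ctx.Done())

	if ok := cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced); !ok {
		return fmt.Errorf("failed to wait for pod cache to sync")
	}

	// Registering after sync avoids the race between cache startup and
	// handler registration described above.
	podInformer.AddEventHandler(handler)
	return nil
}
```
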
Sargun Dhillon
69f1186713 Do not mutate pods, nor hand off pod references to provider
This moves to a model where any time pods are given to a
provider, it uses a DeepCopy as opposed to a reference. If the
provider mutates the pod, this prevents the mutation from causing
issues with the informer cache.

It has to use reflect instead of comparing the hashes because
spew prints DeepCopy'd data structures ever so slightly differently.
2019-08-15 09:59:01 -07:00
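
The hand-off pattern is then simply a DeepCopy at the provider boundary; a minimal sketch, with the provider interface reduced to the one method needed here:

```go
package node

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// podCreator matches the provider's CreatePod surface for this sketch.
type podCreator interface {
	CreatePod(ctx context.Context, pod *corev1.Pod) error
}

// createPodInProvider hands the provider a deep copy so that any mutation it
// performs cannot corrupt the shared informer cache.
func createPodInProvider(ctx context.Context, p podCreator, pod *corev1.Pod) error {
	return p.CreatePod(ctx, pod.DeepCopy())
}
```
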
Sargun Dhillon
89d88a17ed Add a generic reactor to lifecycle_test to bump resource version (#733)
All updates in our tests should behave as closely as possible to what
the API server does.
2019-08-15 08:46:38 +01:00
Sargun Dhillon
bc2f6e0dc4 Wait for the informer to become in sync before starting tests
If the informers are starting at the same time as createPods,
we can get into a situation where the pod seems to get
"lost". Instead, we wait for the informer to sync
prior to the createPod event.

This also moves to one informer as a micro-optimization in
the tests.
2019-08-14 07:03:53 -07:00