108 Commits

Author SHA1 Message Date
Sargun Dhillon
c314045d60 Ensure that dangling pods which are still deleting at startup are deleted (#784)
If a pod is being gracefully deleted at podcontroller startup,
it will not get deleted via the deletedanglingpods code. This
ensures the normal deletion loop covers the case.
2019-10-22 06:45:36 -04:00
Sargun Dhillon
d22265e5f5 Do not delete pods in a non-graceful manner
This moves from forcefully deleting pods to deleting pods in a
graceful manner from the API Server. It waits for the pod to
get to a terminal status prior to deleting the pod from api
server.
2019-10-17 09:58:21 -07:00
Sargun Dhillon
871424368f Fix pod status updates for when pod is updated outside of VK
Pods can be updated outside of VK. Right now, if this happens, pod
status updates are dropped because the resourceversion from the
provider will mismatch with what's on the server, breaking
pod status updates.

Since we're the only ones writing to the pod status, we
can do a blind overwrite.
2019-10-11 16:32:48 -07:00
Sargun Dhillon
cdc261a08d Use go-cmp to compare pods to suppress duplicate updates
Rather than copying the pods, this uses go-cmp and filters out
the paths which should not be compared.
2019-10-10 13:25:27 -07:00
Sargun Dhillon
4202b03cda Remove sync provider support
This removes the legacy sync provider interface. All new providers
are expected to implement the async NotifyPods interface.

The legacy sync provider interface creates complexities around
how the deletion flow works, and the mixed sync and async APIs
block us from evolving functionality.

This collapses the NotifyPods interface into the PodLifecycleHandler
interface.
2019-10-02 09:28:09 -07:00
toshi0607
bcfc2accf8 misspell 2019-09-26 20:52:06 +09:00
toshi0607
b712751c6d gofmt 2019-09-26 20:50:36 +09:00
Sargun Dhillon
82a430ccf7 Add unused code linter 2019-09-24 12:55:52 -07:00
Sargun Dhillon
ea8495c3a1 Wait for Workers to exit prior to returning from PodController.Run
This changes the behaviour slightly, so rather than immediately exiting on
context cancellation, this calls shutdown, and waits for the current
items to finish being worked on before returning to the user.
2019-09-12 11:04:32 -07:00
Brian Goff
334baa73cf Merge pull request #743 from chewong/pod-status-nil-pointer
Add unit tests for #584
2019-09-11 14:49:55 -07:00
Brian Goff
bb9ff1adf3 Adds Done() and Err() to pod controller (#735)
Allows callers to wait for pod controller exit in addition to readiness.
This means the caller does not have to deal with handling errors from the pod
controller running in a goroutine since it can wait for exit via `Done()`
and check the error with `Err()`.
2019-09-10 17:44:19 +01:00
Ernest Wong
fdb0c805f7 Add more unit tests for #584 2019-09-05 10:48:35 -07:00
Ernest Wong
dc7ff44303 Add unit tests for #584 2019-09-05 09:49:41 -07:00
Sargun Dhillon
da57373abb Test pods going missing while they're running in legacy providers (#759)
We poll legacy providers for their pod(s) status periodically. This is because
we have no way of knowing when the pod is updated. If the pod somehow goes
missing in the provider, that state must be handled. Currently, we update
API server, and mark the pod as failed, or ignore it.
2019-09-04 22:16:14 +01:00
Sargun Dhillon
33df981904 Have NotifyPods store the pod status in a map (#751)
We introduce a map that can be used to store the pod status. With this,
we do not need to call GetPodStatus immediately after NotifyPods
is called. Instead, we stash the pod passed via NotifyPods
in a map we can access later. In addition to this, for legacy
providers, the logic to merge the pod and the pod status is
hoisted up to the loop.

It prevents leaks by deleting the entry in the map as soon
as the pod is deleted from k8s.
2019-09-04 20:14:34 +01:00
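The idea in 33df981904 of stashing the notified status in a map, and removing the entry when the pod is deleted, can be illustrated with the following sketch. statusStore, its methods, and the use of a plain string for the status are hypothetical simplifications, not the project's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// statusStore is a hypothetical cache keyed by "namespace/name".
type statusStore struct {
	mu sync.Mutex
	m  map[string]string // pod key -> last notified phase
}

func newStatusStore() *statusStore {
	return &statusStore{m: make(map[string]string)}
}

// Notify records the status delivered by the provider callback, so no
// follow-up status lookup is needed.
func (s *statusStore) Notify(key, phase string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = phase
}

// Get returns the last stashed status, if any.
func (s *statusStore) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	phase, ok := s.m[key]
	return phase, ok
}

// Forget drops the entry once the pod is deleted, so the map does not leak.
func (s *statusStore) Forget(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.m, key)
}

func main() {
	store := newStatusStore()
	store.Notify("default/web", "Running")

	phase, ok := store.Get("default/web")
	fmt.Println(phase, ok)

	store.Forget("default/web")
	_, ok = store.Get("default/web")
	fmt.Println("still present:", ok)
}
```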
Sargun Dhillon
7133a372d6 Mark current linting errors as non-errors
This is basically claiming linting bankruptcy. It marks all of the
issues we had up until this point as nolint.
2019-09-03 11:00:33 -07:00
Sargun Dhillon
5949e6279d Miscellaneous cleanup for linting 2019-09-03 11:00:33 -07:00
Sargun Dhillon
9cce8640a5 Fix linting errors in node/pod_test.go
This moves away from defining pods independently. It moves pod (spec)
generation to an independent function.
2019-09-03 11:00:33 -07:00
Sargun Dhillon
7accddcaf4 Fix linting errors in node/podcontroller.go 2019-09-03 11:00:33 -07:00
Brian Goff
2507f57f97 Merge pull request #732 from sargun/move-around-reactor
Move location of eventhandler registration
2019-09-03 10:44:52 -07:00
Sargun Dhillon
43ee086360 Fix mock_test DeletePod to store updated pod status 2019-08-25 10:42:35 -07:00
Sargun Dhillon
ccb6713b86 Move location of eventhandler registration
This moves the event handler registration until after the cache
is in-sync.

It makes it so we can use the log object from the context,
rather than having to use the global logger

The race condition of the cache starting while the reactor
is being added won't exist because we wait for the cache
to start up / go in sync prior to adding it.
2019-08-18 08:20:49 -07:00
Sargun Dhillon
69f1186713 Do not mutate pods, nor hand off pod references to provider
This moves to a model where any time that pods are given to a
provider, it uses a DeepCopy, as opposed to a reference. If the
provider mutates the pod, it prevents it from causing issues
with the informer cache.

It has to use reflect instead of comparing the hashes because
spew prints DeepCopy'd data structures ever so slightly differently.
2019-08-15 09:59:01 -07:00
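The copy-before-handoff model in 69f1186713 can be illustrated as follows. The pod type and its hand-written DeepCopy are simplified assumptions for this example (the real Kubernetes types have generated DeepCopy methods), and mutatingProvider is a hypothetical misbehaving provider.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type pod struct {
	Name   string
	Labels map[string]string
}

// DeepCopy returns a copy that shares no mutable state with the original.
func (p *pod) DeepCopy() *pod {
	if p == nil {
		return nil
	}
	out := &pod{Name: p.Name}
	if p.Labels != nil {
		out.Labels = make(map[string]string, len(p.Labels))
		for k, v := range p.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

// mutatingProvider simulates a provider that modifies the pod it is given.
func mutatingProvider(p *pod) {
	p.Labels["touched"] = "true"
}

func main() {
	// cached plays the role of the object held by the informer cache.
	cached := &pod{Name: "web", Labels: map[string]string{"app": "web"}}

	// Hand the provider a copy, never the cached reference.
	mutatingProvider(cached.DeepCopy())

	_, leaked := cached.Labels["touched"]
	fmt.Println("cache mutated:", leaked)
}
```

Passing cached directly would let the provider's write show up in the shared cache object, which is the class of problem the commit describes.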
Sargun Dhillon
89d88a17ed Add a generic reactor to lifecycle_test to bump resource version (#733)
All updates in our tests should have the behaviour that best
reflects what API server does.
2019-08-15 08:46:38 +01:00
Sargun Dhillon
bc2f6e0dc4 Wait for the informer to become in sync before starting tests
If the informers are starting at the same time as createPods,
then we can get into a situation where the pod seems to get
"lost". Instead, we wait for the informer to get into sync
prior to the createpod event.

This also moves to one informer as a microoptimization in
the tests.
2019-08-14 07:03:53 -07:00
Brian Goff
47f5aa45df Merge pull request #727 from ethan-daocloud/patch-2
cleanup: fix some typos in node.go
2019-08-13 12:00:43 -07:00
Brian Goff
569706f371 Merge branch 'master' into document-api 2019-08-13 11:47:04 -07:00
Guangming Wang
cb307df71e cleanup: fix some typos in node.go
Signed-off-by: Guangming Wang <guangming.wang@daocloud.io>
2019-08-13 11:39:00 -07:00
Sargun Dhillon
edc0991c0c Fix hotloop around scheduling in lifecycle_test
Lifecycle test had a hotloop, where it would run a never-yielding
function while processing was going on elsewhere. This inserts
a sleep. A sleep is used rather than a yield to be kind to
people's battery life.
2019-08-13 11:25:21 -07:00
Sargun Dhillon
fbed4ca702 Remove usage of atomics
It turns out that running atomic.Read(...) in a tight loop breaks
Golang. The goroutine would never yield control over the scheduler,
so we ended up getting into a situation where the test would get
stuck forever. This moves to a different model, in which
there is a condition var, instead of atomics in loops.
2019-08-13 11:25:21 -07:00
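The move in fbed4ca702 from polling an atomic in a loop to waiting on a condition variable can be sketched like this. The counter type is hypothetical and only shows the general sync.Cond pattern, not the test code itself.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical replacement for an atomically updated integer
// that a test would otherwise poll in a tight loop.
type counter struct {
	mu   sync.Mutex
	cond *sync.Cond
	n    int
}

func newCounter() *counter {
	c := &counter{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// Inc bumps the counter and wakes every waiter so it can re-check.
func (c *counter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
	c.cond.Broadcast()
}

// WaitFor blocks until the counter reaches target. The goroutine sleeps
// inside cond.Wait rather than spinning, so other goroutines can run.
func (c *counter) WaitFor(target int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.n < target {
		c.cond.Wait()
	}
}

// Value returns the current count.
func (c *counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	c := newCounter()
	for i := 0; i < 3; i++ {
		go c.Inc()
	}
	c.WaitFor(3)
	fmt.Println("count reached:", c.Value())
}
```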
Sargun Dhillon
9b27eb83fe Make mock_test follow the aforementioned documentation 2019-08-13 10:30:02 -07:00
Sargun Dhillon
3b3bf3ff20 Add documentation to the provider API about concurrency / mutability
This adds documentation around what is allowed to be mutated and
what may be accessed concurrently from the provider API. Previously,
the API was ambiguous, and that meant providers could return pods
and change them. This resulted in data races occurring.
2019-08-13 10:29:12 -07:00
Pires
f0a0e8cbfe Merge branch 'master' into upgrade-k8s-v2 2019-08-13 10:43:00 +01:00
Sargun Dhillon
5c2b682cdc Array of minor fixups to lifecycle tests
* Fix the deletion test to actually test the pod is deleted
* Fix the update pods test to update a value which is allowed
  to be updated
* Shut down watches after tests
* Do not delete pod statuses on DeletePod in mock_test

This intentionally leaks pod statuses, but it makes the situation
a lot less complicated around handling race conditions with
the GetPodStatus callback
2019-08-12 12:10:29 -07:00
Sargun Dhillon
5ac33e4b0a Fix race conditions in node_test 2019-08-12 11:33:48 -07:00
Brian Goff
10b291dba1 Merge branch 'master' into patch-1 2019-08-12 10:48:15 -07:00
Sargun Dhillon
ad6cd7d552 Upgrade K8s
* Upgrade k8s.io/api
  go get k8s.io/api@kubernetes-1.15.2
* Upgrade k8s.io/apimachinery
  go get k8s.io/apimachinery@kubernetes-1.15.2
* Upgrade k8s.io/client-go
  go get k8s.io/client-go@kubernetes-1.15.2
* Upgrade k8s.io/kubernetes to v1.15.2
  go get k8s.io/kubernetes@v1.15.2

This also locks the dependency for
github.com/prometheus/client_golang/prometheus due to a golang bug, and to
please the validation scripts.

The replaces were generated by:
go get k8s.io/kubernetes@v1.15.2 2> fail
for i in $(cat fail|grep unknown|cut -f1 -d@|cut -f2 -d" ")
  do echo "replace ${i} => ${i} kubernetes-1.15.2"
done
2019-08-12 10:29:19 -07:00
Sargun Dhillon
a28969355e Fix race condition around worker ID generation in podcontroller.go 2019-08-12 10:27:21 -07:00
ethan
75a1877d9f cleanup: fix misspelled words in error message
Signed-off-by: Guangming Wang <guangming.wang@daocloud.io>
2019-08-10 19:03:44 +08:00
Sargun Dhillon
3efc9229ba Add a little bit of documentation to NotifyPods
As far as I can tell, based on the implementation in MockProvider
NotifyPods is called with the mutated pod. This allows us to
take a copy of the Pod object in NotifyPods, and make it so
(eventually) we don't need to do a callback to GetPodStatus.
2019-08-06 20:20:59 -07:00
Sakura
7188238caa fix a to an in annotation (#715) 2019-08-05 20:13:40 +01:00
Sargun Dhillon
50bbc3d1d4 Add tests around updates
This makes sure the update function works correctly after the pod
is running if the podspec is changed. Upon writing the test, I realized
we were accessing the variables outside of the goroutine that the
workers with tests were running in, and we had no locks. Therefore,
I converted all of those numbers to use atomics.
2019-07-30 09:13:43 -07:00
Sargun Dhillon
bd8e39e3f9 Add a benchmark for pod creation
This adds a benchmark for pod creation and makes the mock_test
provider actually work correctly in concurrent situations.
2019-07-30 09:12:56 -07:00
Sargun Dhillon
ce38d72c0e Add additional lifecycle tests
* Don't schedule failed or succeeded pods
* Delete dangling pods
2019-07-30 06:56:54 -07:00
Sargun Dhillon
4a270fea08 Add a test which tests the e2e lifecycle of the pod controller
This uses the mock provider, so I moved the mock provider to a
location where the node test can use it.
2019-07-30 06:56:54 -07:00
Sargun Dhillon
4d60fc2049 Setup event handler at Pod Controller creation time
This seems to avoid a race condition where, at pod informer
startup time, the reactor doesn't properly get set up.

It also refactors the root command example to start up
the informers after everything is wired up.
2019-07-26 13:57:00 -07:00
Sargun Dhillon
ce60fb81d4 Make NewPodController function validate that provider is set
In NewPodController we validate that the rest of the config is
set to non-nil values. The provider must be non-nil as well.
2019-07-21 16:19:00 -07:00
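The kind of constructor check described in ce60fb81d4 could look like the sketch below. The provider interface, config struct, and error value are hypothetical and stand in for the project's real types, which are not shown in the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// provider and config are hypothetical stand-ins for the real types.
type provider interface {
	Name() string
}

type config struct {
	Provider provider
}

type podController struct {
	provider provider
}

var errNilProvider = errors.New("provider must not be nil")

// newPodController rejects a config with no provider up front, instead of
// letting a nil provider cause a panic later at call time.
func newPodController(cfg config) (*podController, error) {
	if cfg.Provider == nil {
		return nil, errNilProvider
	}
	return &podController{provider: cfg.Provider}, nil
}

func main() {
	_, err := newPodController(config{})
	fmt.Println("error:", err)
}
```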
jerryzhuang
0ba0200067 fix several typos
Signed-off-by: zhuangqh <zhuangqhc@gmail.com>
2019-07-17 10:36:17 +08:00
Brian Goff
8493cbb42a Unexport node update helper functions (#701)
Thinking these maybe should either not be exposed or in a separate
package.
For 1.0 let's unexport them and we may re-introduce later.
2019-07-05 19:24:46 +01:00
Brian Goff
f7fee27790 Move CLI related packages into internal (#697)
We don't want people to import these packages, so move these out into
private packages.
2019-07-04 10:14:38 +01:00