This project is mirrored from https://github.com/neondatabase/autoscaling.
- Jul 14, 2023
Em Sharnoff authored
- Jul 13, 2023
Em Sharnoff authored
This allows us to make scheduling decisions while acknowledging that the 'overprovisioning' paused pods should not *actually* have any resources reserved for them. This won't affect Filter requests, and so we should still have the desired behavior of rejecting pods when the total usage *including* any ignored namespaces is too high; we just don't want to start migrating VMs away from a node that's primarily filled with paused pods.
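For illustration, a minimal sketch of the accounting this implies (hypothetical types, not the plugin's actual code): pods in ignored namespaces still count toward the node's total usage, so Filter keeps rejecting when the node is genuinely full, but they contribute nothing to the "reserved" total that migration decisions look at.

```go
package plugin

// podUsage and nodeTotals are hypothetical types for this sketch only.
type podUsage struct {
	namespace string
	vCPU      uint64
}

type nodeTotals struct {
	totalVCPU    uint64 // everything on the node, ignored namespaces included
	reservedVCPU uint64 // only pods that actually have resources reserved
}

// sumNode tallies usage for one node, skipping ignored namespaces for the
// reserved total but not for the overall total.
func sumNode(pods []podUsage, ignoredNamespaces map[string]struct{}) nodeTotals {
	var totals nodeTotals
	for _, p := range pods {
		totals.totalVCPU += p.vCPU
		if _, ignored := ignoredNamespaces[p.namespace]; !ignored {
			totals.reservedVCPU += p.vCPU
		}
	}
	return totals
}
```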
Em Sharnoff authored
This was originally meant for tracking various system daemons, etc. But now that the plugin is actually just tracking those itself, we don't need it.
Em Sharnoff authored
... even the ones that never went through the scheduler. This should give us a more accurate view of cluster resource usage. It also requires being more lenient about how we calculate a pod's resource usage: not all pods' containers have resources.requests or resources.limits set, and we should *probably* match cluster-autoscaler's calculations more closely, which means looking only at the resource requests (and not the limits).
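As a rough sketch of the requests-only accounting described here (using the upstream Kubernetes API types; this helper is illustrative, not the plugin's actual code), containers that don't set resources.requests simply contribute zero:

```go
package plugin

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequests sums the resource requests of a pod's containers, tolerating
// containers that set no requests at all. Limits are ignored entirely.
func podRequests(pod *corev1.Pod) (cpu, mem resource.Quantity) {
	for _, container := range pod.Spec.Containers {
		if req, ok := container.Resources.Requests[corev1.ResourceCPU]; ok {
			cpu.Add(req)
		}
		if req, ok := container.Resources.Requests[corev1.ResourceMemory]; ok {
			mem.Add(req)
		}
	}
	return cpu, mem
}
```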
Em Sharnoff authored
Some alertable metrics are currently missing corresponding logs (e.g. PostFilter calls, or appropriate log levels for Filter rejections). In general, we'd like to be able to associate any increase in a metric with some log line; this should help with that.
Em Sharnoff authored
Felix Prasanna authored
- Jul 12, 2023
Em Sharnoff authored
Em Sharnoff authored
Em Sharnoff authored
ref #234. This affects pools, but things should be fine even without the fix, because the Runner will auto-restart and is generally ok after that; it's worth fixing anyways.
- Jul 11, 2023
Em Sharnoff authored
Em Sharnoff authored
Like all good things, this commit comes in three parts:

1. If the plugin decides to trigger a migration, it no longer returns a non-nil PluginResponse.Migrate if the VirtualMachineMigration already existed.
   - This had the potential to cause spurious failures where the autoscaler-agent permanently shuts off communication because *it* was told that the scheduler is going to migrate it, but actually the migration had already completed.
2. The plugin now automatically cleans up completed migrations.
   - To make this work, all migrations now have the 'autoscaling.neon.tech/created-by-scheduler' label.
3. The plugin now exposes metrics about migration creation and deletion. These are:
   - autoscaling_plugin_migrations_created_total
   - autoscaling_plugin_migrations_deleted_total
   - autoscaling_plugin_migration_create_fails_total
   - autoscaling_plugin_migration_delete_fails_total
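For part 3, a hedged sketch of how counters with these names might be defined with the Prometheus client library (variable names and help text are assumptions; only the metric names come from this commit):

```go
package plugin

import "github.com/prometheus/client_golang/prometheus"

var (
	migrationsCreated = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migrations_created_total",
		Help: "Number of VirtualMachineMigrations created by the scheduler plugin",
	})
	migrationsDeleted = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migrations_deleted_total",
		Help: "Number of completed VirtualMachineMigrations cleaned up by the plugin",
	})
	migrationCreateFails = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migration_create_fails_total",
		Help: "Number of failed attempts to create a VirtualMachineMigration",
	})
	migrationDeleteFails = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migration_delete_fails_total",
		Help: "Number of failed attempts to delete a VirtualMachineMigration",
	})
)

func init() {
	prometheus.MustRegister(
		migrationsCreated, migrationsDeleted, migrationCreateFails, migrationDeleteFails,
	)
}
```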
Em Sharnoff authored
One of the features we're sorely missing with our current node-level resource usage metrics is the ability to aggregate them by node group. Per-node-group information, rather than per-cluster, is more likely to be a useful signal (because one node group being over- or under-provisioned typically won't affect the others). This commit adds a new config field to set the node group label: 'k8sNodeGroupLabel'. For EKS, this label is 'eks.amazonaws.com/nodegroup'.
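As a sketch of how the new field might be used (the config struct and helper here are assumptions, not the plugin's actual code; only the 'k8sNodeGroupLabel' key and the EKS label come from this commit):

```go
package plugin

import corev1 "k8s.io/api/core/v1"

type config struct {
	// K8sNodeGroupLabel is the node label whose value identifies the node
	// group, e.g. "eks.amazonaws.com/nodegroup" on EKS.
	K8sNodeGroupLabel string `json:"k8sNodeGroupLabel"`
}

// nodeGroup returns the value to use for a per-node-group metric label, or
// "" if the label is unset or the feature isn't configured.
func (c *config) nodeGroup(node *corev1.Node) string {
	if c.K8sNodeGroupLabel == "" {
		return ""
	}
	return node.Labels[c.K8sNodeGroupLabel]
}
```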
Em Sharnoff authored
Basically, because we were recording the change from 'downscaled' to 'target', rather than 'current' to 'target', any time we sent a NeonVM request to downscale, we'd record the change as doing nothing.
Em Sharnoff authored
- Jul 10, 2023
Em Sharnoff authored
Em Sharnoff authored
This was... fascinating to debug. Here's a story:

For the past few days in prod, it's seemed like we've had more "autoscaling stuck" VMs than there should be - even after taking into account that VMs that are part of a pool will always be stuck. We finally got confirmation of that with the new metrics from v0.12.0, which unlocked *just* looking at the "autoscaling stuck" VMs that aren't in pools, and it turned out there were around 25 — much more than expected.

So, debugging: the autoscaler-agent's "state dump" feature proves useful, showing that these are *mostly* VMs from 2023-07-07 and after, with one from 07-04 and one from 07-05. So maybe a recent release did something? That wouldn't make sense though — we haven't had any major changes to the vm-informant, aside from the connection closing fix (see #367).

Looking at the logs for any of these VMs shows... nothing from the informant? Very curious. If we go into a VM, though, it shows that there's *one* vm-informant process running! ... uh oh, that means it's only *the parent* that's running. And `kill <pid>` doesn't work (because we're already trapping the signals), but `kill -9 <pid>` does, and fixes the issue, for one VM.

Now that we know that the parent process is stalled, we can look at the logs from the *start* of the VM to see when that happened. It turns out we don't have to look very far — *every single* VM that was affected by this bug has been affected in exactly the same way. The child process starts up, dies quickly (because postgres isn't alive yet), and then the parent process just sits there... waiting.

Thankfully, there are 17 more stuck VMs we can play around with for debugging! On the next one, let's `kill -6 <pid>` so we can get the stack traces. One entry sticks out like a sore thumb:

    goroutine 1 [chan receive, 6640 minutes]:
    runtime.gopark(0x0?, 0x0?, 0x80?, 0x4d?, 0xc00007c2e0?)
        /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0005777b0 sp=0xc000577790 pc=0x43dab6
    runtime.chanrecv(0xc000089920, 0x0, 0x1)
        /usr/local/go/src/runtime/chan.go:583 +0x49d fp=0xc000577840 sp=0xc0005777b0 pc=0x408c1d
    runtime.chanrecv1(0xc0000ff968?, 0xc0000ff8f8?)
        /usr/local/go/src/runtime/chan.go:442 +0x18 fp=0xc000577868 sp=0xc000577840 pc=0x408718
    main.runRestartOnFailure({0x194a8e0, 0xc0002fef30}, 0x6?, {0xc000340a80, 0x4, 0x4}, {0xc00047ac98, 0x1, 0x1})
        /workspace/cmd/vm-informant/main.go:221 +0x285 fp=0xc000577a20 sp=0xc000577868 pc=0x1429605
    main.main()
        /workspace/cmd/vm-informant/main.go:84 +0x1832 fp=0xc000577f80 sp=0xc000577a20 pc=0x1428cb2
    runtime.main()
        /usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc000577fe0 sp=0xc000577f80 pc=0x43d687
    runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000577fe8 sp=0xc000577fe0 pc=0x471d21

Specifically, it was stuck trying to receive on a channel — the one for the timer. It turns out that the typical advice of:

    if !timer.Stop() {
        <-timer.C
    }

... doesn't apply if you've *already received from the channel*. Because in that case, timer.Stop() returning false means that the timer already finished (which makes sense: you already received from it), and so it isn't going to put another item into the channel for you. In our case, we received from the channel on a previous iteration of the loop, and then tried to receive from it again when the child informant died for a *second* time.

So the fix: track whether you've received from the channel, and don't block on receiving from timer.C if there won't be anything there.
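A minimal, self-contained sketch of the fixed pattern (not the informant's actual code): remember whether timer.C has already been received from, and only drain it when Stop() reports a still-pending timer.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timer := time.NewTimer(10 * time.Millisecond)
	timerFired := false // have we already received from timer.C?

	for i := 0; i < 2; i++ {
		select {
		case <-timer.C:
			timerFired = true
			fmt.Println("timer fired")
		case <-time.After(50 * time.Millisecond):
			fmt.Println("something else happened first")
		}

		// Before reusing the timer, stop it, and only drain timer.C if a value
		// is still pending. If we already received from timer.C, Stop() returns
		// false *and* the channel is empty, so a blind `<-timer.C` would block
		// forever (exactly the bug described above).
		if !timer.Stop() && !timerFired {
			<-timer.C
		}
		timer.Reset(10 * time.Millisecond)
		timerFired = false
	}
}
```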
Em Sharnoff authored
Should have been handled by #282, but oh well.
Felix Prasanna authored
- Jul 07, 2023
Felix Prasanna authored
- Jul 06, 2023
dependabot[bot] authored
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.51.0 to 1.53.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.51.0...v1.53.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- Jul 04, 2023
Em Sharnoff authored
Em Sharnoff authored
Em Sharnoff authored
This should resolve some of the ongoing issues we've had with billing push requests timing out, because *currently* the push timeout must be short in order to produce correct data. This commit also removes pkg/billing.Batch: the agent now does its own batching, so pkg/billing.Batch is no longer required.

Billing config changes:
- Renamed pushTimeoutSeconds to pushRequestTimeoutSeconds
- Added accumulateEverySeconds (now distinct from pushing!)
- Added maxBatchSize

Billing metrics changes:
- Renamed autoscaling_agent_billing_batch_size to autoscaling_agent_billing_queue_size
- Added autoscaling_agent_billing_last_send_duration_seconds
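The renamed/added config fields above might map onto a struct roughly like this (a sketch; the Go-side names and types are assumptions, only the JSON keys come from this commit):

```go
package agent

// BillingConfig is an illustrative stand-in for the agent's actual billing
// config type.
type BillingConfig struct {
	// PushRequestTimeoutSeconds bounds a single push request
	// (renamed from pushTimeoutSeconds).
	PushRequestTimeoutSeconds uint `json:"pushRequestTimeoutSeconds"`
	// AccumulateEverySeconds controls how often events are accumulated into
	// the queue, now distinct from how often they're pushed.
	AccumulateEverySeconds uint `json:"accumulateEverySeconds"`
	// MaxBatchSize caps the number of events sent in one push request.
	MaxBatchSize uint `json:"maxBatchSize"`
}
```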
Em Sharnoff authored
In order to make this work on the agent's side, idempotency key generation needed to be *somewhat* exposed, so this commit adds the Enrich function, which handles the common "filling out the other fields" tasks for each event. By abstracting this interface, we also remove the need for separate (*Batch).AddAbsoluteEvent vs (*Batch).AddIncrementalEvent methods, so now we just have a single (*Batch).Add method that also calls Enrich.
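A rough sketch of the shape this gives the API (the actual pkg/billing types and signatures may differ; the Event interface here is hypothetical): Enrich fills in the fields common to every event, including the idempotency key, and a single Add method uses it in place of the old absolute/incremental variants.

```go
package billing

import (
	"time"

	"github.com/google/uuid"
)

// Event is a hypothetical minimal interface over billing events.
type Event interface {
	SetIdempotencyKey(key string)
	SetRecordedAt(t time.Time)
}

// Enrich fills out the fields shared by all events, so callers (like the
// autoscaler-agent) can produce complete events themselves.
func Enrich(e Event) Event {
	e.SetIdempotencyKey(uuid.NewString())
	e.SetRecordedAt(time.Now())
	return e
}

// Batch collects events; a single Add replaces the old AddAbsoluteEvent /
// AddIncrementalEvent pair.
type Batch struct {
	events []Event
}

func (b *Batch) Add(e Event) {
	b.events = append(b.events, Enrich(e))
}
```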
Felix Prasanna authored
This reverts commit 6af98b75.
Felix Prasanna authored
- Jul 03, 2023
Em Sharnoff authored
We have various sources of flakiness (some avoidable once bugs are fixed, some unavoidable), and given the success rate of the e2e tests, I suspect the default fail-fast behavior is probably using our CI runners less efficiently than turning it off would.
- Jun 30, 2023
Em Sharnoff authored
- Jun 29, 2023
Em Sharnoff authored
Em Sharnoff authored
This is a pre-req for adding per-node usage metrics, because of what's mentioned in #248:

> Looking at the state for prod-us-east-2-delta, there's currently 107
> nodes that the plugin is still tracking state for that don't exist

Unclear whether this will resolve #248 ("scheduler has a memory leak").

---

A lot of the extra work behind this PR is around using the new watch.Store instead of the preexisting calls to the K8s API when we need to fetch information about a Node. Realistically... this is probably more hassle than it's worth, but it's nice to have the consistency.
- Jun 27, 2023
Felix Prasanna authored
Em Sharnoff authored
Em Sharnoff authored
Pre-req for #361, directly copying everything from NameIndex. In theory, we could just reuse NameIndex and rely on non-namespaced objects (like Nodes) having Namespace = "". However, this could introduce counterintuitive failure modes, so it's best to do this properly here.
- Jun 26, 2023
Em Sharnoff authored
Em Sharnoff authored
The previous implementation doesn't *really* work, because it turns out PostFilter is only called if all Filter calls failed, so we need some other way to count the number of filter cycles.

This PR removes two metrics:
* autoscaling_plugin_filter_cycle_successes_total
* autoscaling_plugin_filter_cycle_rejections_total

The new recommended flow is:
* Get attempts with autoscaling_plugin_extension_calls_total{method="PreFilter"}
* Get rejections with autoscaling_plugin_extension_calls_total{method="PostFilter"}
* Successes can be calculated as the difference of the two.
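The counter these queries rely on might be defined roughly like this (a sketch, not the plugin's actual code; only the metric name and the "method" label come from this PR):

```go
package plugin

import "github.com/prometheus/client_golang/prometheus"

// extensionCalls counts calls to each scheduler framework extension point,
// labelled by method name (e.g. "PreFilter", "Filter", "PostFilter").
var extensionCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "autoscaling_plugin_extension_calls_total",
		Help: "Number of calls to scheduler plugin extension points",
	},
	[]string{"method"},
)

func init() {
	prometheus.MustRegister(extensionCalls)
}

// recordCall would be called at the top of each extension point; successes
// can then be computed in the metrics backend as the PreFilter count minus
// the PostFilter count.
func recordCall(method string) {
	extensionCalls.WithLabelValues(method).Inc()
}
```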
Em Sharnoff authored
- Jun 23, 2023
Em Sharnoff authored
Em Sharnoff authored
With upcoming compute pool changes, we're going to end up with a lot of VMs where the informant is unable to start up, because the initial file cache connection will fail until postgres is alive, which only happens once the pooled VM is bound to a particular endpoint. So on staging, we currently report a lot of "autoscaling stuck" VMs, when in reality these are just part of the pool. Having a separate value for the number of these stuck VMs that are actually running something will ensure our metrics continue to be useful. And since we're passing this information through so that we can make a metric out of it, it's also worth storing & logging the endpoint ID, so that the information is more easily available (without having to cross-reference the console DB).
Em Sharnoff authored