This project is mirrored from https://github.com/neondatabase/autoscaling.
- Jul 14, 2023
Em Sharnoff authored
- Jul 13, 2023
Em Sharnoff authored
This allows us to make scheduling decisions while acknowledging that the 'overprovisioning' paused pods should not *actually* have any resources reserved for them. This won't affect Filter requests, and so we should still have the desired behavior of rejecting pods when the total usage *including* any ignored namespaces is too high; we just don't want to start migrating VMs away from a node that's primarily filled with paused pods.
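For illustration, a minimal sketch of the accounting this implies (hypothetical types, not the plugin's actual code): pods in ignored namespaces still count toward the node's total usage, so Filter keeps rejecting when the node is genuinely full, but they contribute nothing to the "reserved" total that migration decisions look at.

```go
package plugin

// podUsage and nodeTotals are hypothetical types for this sketch only.
type podUsage struct {
	namespace string
	vCPU      uint64
}

type nodeTotals struct {
	totalVCPU    uint64 // everything on the node, ignored namespaces included
	reservedVCPU uint64 // only pods that actually have resources reserved
}

// sumNode tallies usage for one node, skipping ignored namespaces for the
// reserved total but not for the overall total.
func sumNode(pods []podUsage, ignoredNamespaces map[string]struct{}) nodeTotals {
	var totals nodeTotals
	for _, p := range pods {
		totals.totalVCPU += p.vCPU
		if _, ignored := ignoredNamespaces[p.namespace]; !ignored {
			totals.reservedVCPU += p.vCPU
		}
	}
	return totals
}
```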
Em Sharnoff authored
This was originally meant for tracking various system daemons, etc. But now that the plugin is actually just tracking those itself, we don't need it.
Em Sharnoff authored
... even the ones that never went through the scheduler. This should give us a more accurate view of cluster resource usage. It also requires being more lenient about how we calculate a pod's resource usage: not all pods' containers have resources.requests or resources.limits set, and we should *probably* match cluster-autoscaler's calculations more closely, which means looking only at the resource requests (and not the limits).
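As a rough sketch of the requests-only accounting described here (using the upstream Kubernetes API types; this helper is illustrative, not the plugin's actual code), containers that don't set resources.requests simply contribute zero:

```go
package plugin

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequests sums the resource requests of a pod's containers, tolerating
// containers that set no requests at all. Limits are ignored entirely.
func podRequests(pod *corev1.Pod) (cpu, mem resource.Quantity) {
	for _, container := range pod.Spec.Containers {
		if req, ok := container.Resources.Requests[corev1.ResourceCPU]; ok {
			cpu.Add(req)
		}
		if req, ok := container.Resources.Requests[corev1.ResourceMemory]; ok {
			mem.Add(req)
		}
	}
	return cpu, mem
}
```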
Em Sharnoff authored
Some alertable metrics are currently missing corresponding logs (e.g. PostFilter calls, or appropriate log levels for Filter rejections). In general, we'd like to be able to associate any increase in a metric with some log line; this should help with that.
Em Sharnoff authored
Felix Prasanna authored
- Jul 12, 2023
Em Sharnoff authored
Em Sharnoff authored
Em Sharnoff authored
ref #234. This affects pools, but things should be fine even without the fix, because the Runner will auto-restart and is generally ok after that; it's worth fixing anyways.
- Jul 11, 2023
Em Sharnoff authored
Em Sharnoff authored
Like all good things, this commit comes in three parts:

1. If the plugin decides to trigger a migration, it no longer returns a non-nil PluginResponse.Migrate if the VirtualMachineMigration already existed.
   - This had the potential to cause spurious failures where the autoscaler-agent permanently shuts off communication because *it* was told that the scheduler is going to migrate it, but actually the migration had already completed.
2. The plugin now automatically cleans up completed migrations.
   - To make this work, all migrations now have the 'autoscaling.neon.tech/created-by-scheduler' label.
3. The plugin now exposes metrics about migration creation and deletion. These are:
   - autoscaling_plugin_migrations_created_total
   - autoscaling_plugin_migrations_deleted_total
   - autoscaling_plugin_migration_create_fails_total
   - autoscaling_plugin_migration_delete_fails_total
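For part 3, a hedged sketch of how counters with these names might be defined with the Prometheus client library (variable names and help text are assumptions; only the metric names come from this commit):

```go
package plugin

import "github.com/prometheus/client_golang/prometheus"

var (
	migrationsCreated = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migrations_created_total",
		Help: "Number of VirtualMachineMigrations created by the scheduler plugin",
	})
	migrationsDeleted = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migrations_deleted_total",
		Help: "Number of completed VirtualMachineMigrations cleaned up by the plugin",
	})
	migrationCreateFails = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migration_create_fails_total",
		Help: "Number of failed attempts to create a VirtualMachineMigration",
	})
	migrationDeleteFails = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoscaling_plugin_migration_delete_fails_total",
		Help: "Number of failed attempts to delete a VirtualMachineMigration",
	})
)

func init() {
	prometheus.MustRegister(
		migrationsCreated, migrationsDeleted, migrationCreateFails, migrationDeleteFails,
	)
}
```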
Em Sharnoff authored
One of the features we're sorely missing with our current node-level resource usage metrics is the ability to aggregate them by node group. Per-node-group information, rather than per-cluster, is more likely to be a useful signal (because one node group being over- or under-provisioned typically won't affect the others). This commit adds a new config field to set the node group label: 'k8sNodeGroupLabel'. For EKS, this label is 'eks.amazonaws.com/nodegroup'.
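As a sketch of how the new field might be used (the config struct and helper here are assumptions, not the plugin's actual code; only the 'k8sNodeGroupLabel' key and the EKS label come from this commit):

```go
package plugin

import corev1 "k8s.io/api/core/v1"

type config struct {
	// K8sNodeGroupLabel is the node label whose value identifies the node
	// group, e.g. "eks.amazonaws.com/nodegroup" on EKS.
	K8sNodeGroupLabel string `json:"k8sNodeGroupLabel"`
}

// nodeGroup returns the value to use for a per-node-group metric label, or
// "" if the label is unset or the feature isn't configured.
func (c *config) nodeGroup(node *corev1.Node) string {
	if c.K8sNodeGroupLabel == "" {
		return ""
	}
	return node.Labels[c.K8sNodeGroupLabel]
}
```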
Em Sharnoff authored
Basically, because we were recording the change from 'downscaled' to 'target', rather than 'current' to 'target', any time we sent a NeonVM request to downscale, we'd record the change as doing nothing.
Em Sharnoff authored
- Jul 10, 2023
Em Sharnoff authored
Em Sharnoff authored
This was... fascinating to debug. Here's a story:

For the past few days in prod, it's seemed like we've had more "autoscaling stuck" VMs than there should be - even after taking into account that VMs that are part of a pool will always be stuck. We finally got confirmation of that with the new metrics from v0.12.0, which unlocked *just* looking at the "autoscaling stuck" VMs that aren't in pools, and it turned out there were around 25 — much more than expected.

So, debugging: the autoscaler-agent's "state dump" feature proves useful, showing that these are *mostly* VMs from 2023-07-07 and after, with one from 07-04 and one from 07-05. So maybe a recent release did something? That wouldn't make sense though — we haven't had any major changes to the vm-informant, aside from the connection closing fix (see #367).

Looking at the logs for any of these VMs shows... nothing from the informant? Very curious. If we go into a VM, though, it shows that there's *one* vm-informant process running! ... uh oh, that means it's only *the parent* that's running. And `kill <pid>` doesn't work (because we're already trapping the signals), but `kill -9 <pid>` does, and fixes the issue, for one VM.

Now that we know that the parent process is stalled, we can look at the logs from the *start* of the VM to see when that happened. It turns out we don't have to look very far — *every single* VM that was affected by this bug has been affected in exactly the same way. The child process starts up, dies quickly (because postgres isn't alive yet), and then the parent process just sits there... waiting.

Thankfully, there are 17 more stuck VMs we can play around with for debugging! On the next one, let's `kill -6 <pid>` so we can get the stack traces. One entry sticks out like a sore thumb:

    goroutine 1 [chan receive, 6640 minutes]:
    runtime.gopark(0x0?, 0x0?, 0x80?, 0x4d?, 0xc00007c2e0?)
        /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0005777b0 sp=0xc000577790 pc=0x43dab6
    runtime.chanrecv(0xc000089920, 0x0, 0x1)
        /usr/local/go/src/runtime/chan.go:583 +0x49d fp=0xc000577840 sp=0xc0005777b0 pc=0x408c1d
    runtime.chanrecv1(0xc0000ff968?, 0xc0000ff8f8?)
        /usr/local/go/src/runtime/chan.go:442 +0x18 fp=0xc000577868 sp=0xc000577840 pc=0x408718
    main.runRestartOnFailure({0x194a8e0, 0xc0002fef30}, 0x6?, {0xc000340a80, 0x4, 0x4}, {0xc00047ac98, 0x1, 0x1})
        /workspace/cmd/vm-informant/main.go:221 +0x285 fp=0xc000577a20 sp=0xc000577868 pc=0x1429605
    main.main()
        /workspace/cmd/vm-informant/main.go:84 +0x1832 fp=0xc000577f80 sp=0xc000577a20 pc=0x1428cb2
    runtime.main()
        /usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc000577fe0 sp=0xc000577f80 pc=0x43d687
    runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000577fe8 sp=0xc000577fe0 pc=0x471d21

Specifically, it was stuck trying to receive on a channel — the one for the timer. It turns out that the typical advice of:

    if !timer.Stop() {
        <-timer.C
    }

... doesn't apply if you've *already received from the channel*. Because in that case, timer.Stop() returning false means that the timer already finished (which makes sense: you already received from it), and so it isn't going to put another item into the channel for you. In our case, we received from the channel on a previous iteration of the loop, and then tried to receive from it again when the child informant died for a *second* time.

So the fix: track whether you've received from the channel, and don't block on receiving from timer.C if there won't be anything there.
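A minimal, self-contained sketch of the fixed pattern (not the informant's actual code): remember whether timer.C has already been received from, and only drain it when Stop() reports a still-pending timer.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	timer := time.NewTimer(10 * time.Millisecond)
	timerFired := false // have we already received from timer.C?

	for i := 0; i < 2; i++ {
		select {
		case <-timer.C:
			timerFired = true
			fmt.Println("timer fired")
		case <-time.After(50 * time.Millisecond):
			fmt.Println("something else happened first")
		}

		// Before reusing the timer, stop it, and only drain timer.C if a value
		// is still pending. If we already received from timer.C, Stop() returns
		// false *and* the channel is empty, so a blind `<-timer.C` would block
		// forever (exactly the bug described above).
		if !timer.Stop() && !timerFired {
			<-timer.C
		}
		timer.Reset(10 * time.Millisecond)
		timerFired = false
	}
}
```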
Em Sharnoff authored
Should have been handled by #282, but oh well.
Felix Prasanna authored
- Jul 07, 2023
Felix Prasanna authored
- Jul 06, 2023
dependabot[bot] authored
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.51.0 to 1.53.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.51.0...v1.53.0)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- Jul 04, 2023
Em Sharnoff authored
Em Sharnoff authored
Em Sharnoff authored
This should resolve some of the ongoing issues we've had with billing push requests timing out, because *currently* the push timeout must be short in order to produce correct data. This commit also removes pkg/billing.Batch: the agent now does its own batching, so pkg/billing.Batch is no longer required.

Billing config changes:
- Renamed pushTimeoutSeconds to pushRequestTimeoutSeconds
- Added accumulateEverySeconds (now distinct from pushing!)
- Added maxBatchSize

Billing metrics changes:
- Renamed autoscaling_agent_billing_batch_size to autoscaling_agent_billing_queue_size
- Added autoscaling_agent_billing_last_send_duration_seconds
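The renamed/added config fields above might map onto a struct roughly like this (a sketch; the Go-side names and types are assumptions, only the JSON keys come from this commit):

```go
package agent

// BillingConfig is an illustrative stand-in for the agent's actual billing
// config type.
type BillingConfig struct {
	// PushRequestTimeoutSeconds bounds a single push request
	// (renamed from pushTimeoutSeconds).
	PushRequestTimeoutSeconds uint `json:"pushRequestTimeoutSeconds"`
	// AccumulateEverySeconds controls how often events are accumulated into
	// the queue, now distinct from how often they're pushed.
	AccumulateEverySeconds uint `json:"accumulateEverySeconds"`
	// MaxBatchSize caps the number of events sent in one push request.
	MaxBatchSize uint `json:"maxBatchSize"`
}
```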
Em Sharnoff authored
In order to make this work on the agent's side, idempotency key generation needed to be *somewhat* exposed, so this commit adds the Enrich function, which handles the common "filling out the other fields" tasks for each event. By abstracting this interface, we also remove the need for separate (*Batch).AddAbsoluteEvent vs (*Batch).AddIncrementalEvent methods, so now we just have a single (*Batch).Add method that also calls Enrich.
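A rough sketch of the shape this gives the API (the actual pkg/billing types and signatures may differ; the Event interface here is hypothetical): Enrich fills in the fields common to every event, including the idempotency key, and a single Add method uses it in place of the old absolute/incremental variants.

```go
package billing

import (
	"time"

	"github.com/google/uuid"
)

// Event is a hypothetical minimal interface over billing events.
type Event interface {
	SetIdempotencyKey(key string)
	SetRecordedAt(t time.Time)
}

// Enrich fills out the fields shared by all events, so callers (like the
// autoscaler-agent) can produce complete events themselves.
func Enrich(e Event) Event {
	e.SetIdempotencyKey(uuid.NewString())
	e.SetRecordedAt(time.Now())
	return e
}

// Batch collects events; a single Add replaces the old AddAbsoluteEvent /
// AddIncrementalEvent pair.
type Batch struct {
	events []Event
}

func (b *Batch) Add(e Event) {
	b.events = append(b.events, Enrich(e))
}
```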
Felix Prasanna authored
This reverts commit 6af98b75.
Felix Prasanna authored
- Jul 03, 2023
Em Sharnoff authored
We have various sources of flakiness (some avoidable once bugs are fixed, some unavoidable), and given the success rate of the e2e tests, I suspect the default fail-fast behavior is probably using our CI runners less efficiently than turning it off would.
- Jun 30, 2023
Em Sharnoff authored
- Jun 29, 2023
Em Sharnoff authored
Em Sharnoff authored
This is a pre-req for adding per-node usage metrics, because of what's mentioned in #248:

> Looking at the state for prod-us-east-2-delta, there's currently 107
> nodes that the plugin is still tracking state for that don't exist

Unclear whether this will resolve #248 ("scheduler has a memory leak").

---

A lot of the extra work behind this PR is around using the new watch.Store instead of the preexisting calls to the K8s API when we need to fetch information about a Node. Realistically... this is probably more hassle than it's worth, but it's nice to have the consistency.
- Jun 27, 2023
Felix Prasanna authored
Em Sharnoff authored
Em Sharnoff authored
Pre-req for #361, directly copying everything from NameIndex. In theory, we could just reuse NameIndex and rely on non-namespaced objects (like Nodes) having Namespace = "". However, this could introduce counterintuitive failure modes, so it's best to do this properly here.
- Jun 26, 2023
Em Sharnoff authored
Em Sharnoff authored
The previous implementation doesn't *really* work, because it turns out PostFilter is only called if all Filter calls failed, so we need some other way to count the number of filter cycles.

This PR removes two metrics:
* autoscaling_plugin_filter_cycle_successes_total
* autoscaling_plugin_filter_cycle_rejections_total

The new recommended flow is:
* Get attempts with autoscaling_plugin_extension_calls_total{method="PreFilter"}
* Get rejections with autoscaling_plugin_extension_calls_total{method="PostFilter"}
* Successes can be calculated as the difference of the two.
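The counter these queries rely on might be defined roughly like this (a sketch, not the plugin's actual code; only the metric name and the "method" label come from this PR):

```go
package plugin

import "github.com/prometheus/client_golang/prometheus"

// extensionCalls counts calls to each scheduler framework extension point,
// labelled by method name (e.g. "PreFilter", "Filter", "PostFilter").
var extensionCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "autoscaling_plugin_extension_calls_total",
		Help: "Number of calls to scheduler plugin extension points",
	},
	[]string{"method"},
)

func init() {
	prometheus.MustRegister(extensionCalls)
}

// recordCall would be called at the top of each extension point; successes
// can then be computed in the metrics backend as the PreFilter count minus
// the PostFilter count.
func recordCall(method string) {
	extensionCalls.WithLabelValues(method).Inc()
}
```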
Em Sharnoff authored
- Jun 23, 2023
Em Sharnoff authored
Em Sharnoff authored
With upcoming compute pool changes, we're going to end up with a lot of VMs where the informant is unable to start up, because the initial file cache connection will fail until postgres is alive, which only happens once the pooled VM is bound to a particular endpoint. So on staging, we currently report a lot of "autoscaling stuck" VMs, when in reality these are just part of the pool. Having a separate value for the number of these stuck VMs that are actually running something will ensure our metrics continue to be useful. And since we're passing this information through so that we can make a metric out of it, it's also worth storing & logging the endpoint ID, so that the information is more easily available (without having to cross-reference the console DB).
Em Sharnoff authored