This project is mirrored from https://github.com/neondatabase/autoscaling.
  1. Jul 14, 2023
  2. Jul 13, 2023
    • plugin: Allow ignoring resource usage from namespace(s) (#399) · 5eca7aa6
      Em Sharnoff authored
      This allows us to make scheduling decisions while acknowledging that the
      'overprovisioning' paused pods should not *actually* have any resources
      reserved for them.
      
      This won't affect Filter requests, and so we should still have the
      desired behavior of rejecting pods when the total usage *including* any
      ignored namespaces is too high; we just don't want to start migrating
      VMs away from a node that's primarily filled with paused pods.
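
      As a rough illustration (the types and names below are made up, not
      the plugin's actual code), the accounting might look like this: pods
      in ignored namespaces still count toward the node's raw totals, but
      never add to the reserved amounts that drive migration decisions.

          // Illustrative sketch only.
          package plugin

          type nodeState struct {
              totalCPU, reservedCPU uint
          }

          // assumption: the namespace holding the paused 'overprovisioning' pods
          var ignoredNamespaces = map[string]bool{"overprovisioning": true}

          // addPod counts every pod toward the node's raw total, but only
          // non-ignored pods toward the reserved amount.
          func addPod(node *nodeState, namespace string, podCPU uint) {
              node.totalCPU += podCPU // still counts for "is the node simply full?"
              if ignoredNamespaces[namespace] {
                  return // never a reason to migrate VMs away from the node
              }
              node.reservedCPU += podCPU
          }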
      5eca7aa6
    • plugin: Remove 'System' reserved resources (#399) · ee827fc4
      Em Sharnoff authored
      This was originally meant for tracking various system daemons, etc. But
      now that the plugin is actually just tracking those itself, we don't
      need it.
      ee827fc4
    • plugin: Track all pods (#399) · 340e08cb
      Em Sharnoff authored
      ... even the ones that never went through the scheduler. This should
      give us a more accurate view of cluster resource usage.
      
      This also requires being more lenient about how we calculate a pod's
      resource usage - not all pods' containers have resources.requests or
      resources.limits set, and we should *probably* more closely match
      cluster-autoscaler's calculations, which requires only looking at the
      resource requests (and not limits).
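
      For example, a minimal version of that calculation (a sketch, not the
      plugin's actual implementation) sums only each container's
      resources.requests and treats missing requests as zero:

          // Sketch: sum a pod's requested CPU and memory, ignoring limits.
          package plugin

          import (
              corev1 "k8s.io/api/core/v1"
              "k8s.io/apimachinery/pkg/api/resource"
          )

          func podRequestedResources(pod *corev1.Pod) (cpu, mem resource.Quantity) {
              for _, c := range pod.Spec.Containers {
                  if q, ok := c.Resources.Requests[corev1.ResourceCPU]; ok {
                      cpu.Add(q)
                  }
                  if q, ok := c.Resources.Requests[corev1.ResourceMemory]; ok {
                      mem.Add(q)
                  }
              }
              return cpu, mem
          }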
      340e08cb
    • plugin: Improve plugin method logs (#405) · 928e003c
      Em Sharnoff authored
      There are some missing logs for alertable metrics (e.g. PostFilter calls,
      appropriate log levels for Filter rejections).
      
      In general, we'd like to be able to associate any increase in a metric
      with some log line. This should help with that.
      928e003c
    • edc65650
  3. Jul 12, 2023
  4. Jul 11, 2023
    • Bump version: v0.12.0 -> v0.12.1 · 129b29df
      Em Sharnoff authored
      v0.12.1
      129b29df
    • plugin: Migration handling reliability improvements (#387) · 63a6914d
      Em Sharnoff authored
      Like all good things, this commit comes in three parts:
      
      1. If the plugin decides to trigger a migration, it no longer returns a
         non-nil PluginResponse.Migrate if the VirtualMachineMigration
         already existed.
          - This had the potential to cause spurious failures where the
            autoscaler-agent permanently shuts off communication because *it*
            was told that the scheduler is going to migrate it, but actually
            the migration had already completed.
      2. The plugin now automatically cleans up completed migrations.
          - To make this work, all migrations now have the
            'autoscaling.neon.tech/created-by-scheduler' label.
      3. The plugin now exposes metrics about migration creation and deletion.
         These are:
          - autoscaling_plugin_migrations_created_total
          - autoscaling_plugin_migrations_deleted_total
          - autoscaling_plugin_migration_create_fails_total
          - autoscaling_plugin_migration_delete_fails_total
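
      For reference, a sketch of how counters like these are typically
      declared with prometheus/client_golang (the surrounding code is
      illustrative; only the metric names come from this commit):

          package plugin

          import "github.com/prometheus/client_golang/prometheus"

          var (
              migrationsCreated = prometheus.NewCounter(prometheus.CounterOpts{
                  Name: "autoscaling_plugin_migrations_created_total",
                  Help: "VirtualMachineMigrations created by the plugin",
              })
              migrationsDeleted = prometheus.NewCounter(prometheus.CounterOpts{
                  Name: "autoscaling_plugin_migrations_deleted_total",
                  Help: "Completed VirtualMachineMigrations cleaned up by the plugin",
              })
              migrationCreateFails = prometheus.NewCounter(prometheus.CounterOpts{
                  Name: "autoscaling_plugin_migration_create_fails_total",
                  Help: "Failed attempts to create a VirtualMachineMigration",
              })
              migrationDeleteFails = prometheus.NewCounter(prometheus.CounterOpts{
                  Name: "autoscaling_plugin_migration_delete_fails_total",
                  Help: "Failed attempts to delete a VirtualMachineMigration",
              })
          )

          func init() {
              prometheus.MustRegister(migrationsCreated, migrationsDeleted,
                  migrationCreateFails, migrationDeleteFails)
          }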
      63a6914d
    • plugin: Include node group in node resource metrics (#382) · a2862767
      Em Sharnoff authored
      One of the features we're actually sorely missing with our current
      node-level resource usage metrics is the ability to aggregate them by
      node group.
      
      Per-node group information, rather than per-cluster, is more likely to
      be a useful signal (because one node group being over/under-provisioned
      typically won't affect the others).
      
      This commit adds a new config field to set the node group label:
      'k8sNodeGroupLabel'. For EKS, this label is 'eks.amazonaws.com/nodegroup'.
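
      As an illustration of how the label might feed into the metrics (names
      other than the config field and the EKS label are hypothetical):

          package plugin

          import (
              "github.com/prometheus/client_golang/prometheus"
              corev1 "k8s.io/api/core/v1"
          )

          // hypothetical gauge; the real metric names may differ
          var nodeCPUReserved = prometheus.NewGaugeVec(
              prometheus.GaugeOpts{
                  Name: "autoscaling_plugin_node_cpu_reserved",
                  Help: "Reserved CPU per node, labeled by node group",
              },
              []string{"node", "node_group"},
          )

          func recordNodeCPU(node *corev1.Node, nodeGroupLabel string, reserved float64) {
              // nodeGroupLabel comes from the 'k8sNodeGroupLabel' config field,
              // e.g. "eks.amazonaws.com/nodegroup" on EKS.
              group := node.Labels[nodeGroupLabel]
              nodeCPUReserved.WithLabelValues(node.Name, group).Set(reserved)
          }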
      a2862767
    • agent: Fix NeonVM downscaling not showing up in metrics (#381) · 4967d423
      Em Sharnoff authored
      Basically, because we were recording the change from 'downscaled' to
      'target', rather than 'current' to 'target', any time we sent a NeonVM
      request to downscale, we'd record the change as doing nothing.
      4967d423
  5. Jul 10, 2023
    • eaa67fc6
    • informant: Fix parent process stall when child dies quickly (#389) · 696fe2fa
      Em Sharnoff authored
      This was... fascinating to debug. Here's a story:
      
      For the past few days in prod, it's seemed like we've had more
      "autoscaling stuck" VMs than there should be - even after taking into
      account that VMs that are part of a pool will always be stuck.
      
      We finally got confirmation of that with the new metrics from v0.12.0,
      which unlocked *just* looking at the "autoscaling stuck" VMs that aren't
      in pools, and it turned out there were around 25 — many more than
      expected.
      
      So, debugging: the autoscaler-agent's "state dump" feature proves
      useful, showing that these are *mostly* VMs from 2023-07-07 and after,
      with one from 07-04 and one from 07-05. So maybe a recent release did
      something? That wouldn't make sense though — we haven't had any major
      changes to the vm-informant, aside from the connection closing fix
      (see #367).
      
      Looking at the logs for any of these VMs shows... nothing from the
      informant? Very curious. If we go into a VM, though, it shows that
      there's only *one* vm-informant process running! ... uh oh, that means
      it's only *the parent* that's running. And `kill <pid>` doesn't work
      (because we're already trapping the signals), but `kill -9 <pid>` does,
      and fixes the issue for that one VM.
      
      Now that we know that the parent process is stalled, we can look at the
      logs from the *start* of the VM and see when that happened. It turns out we
      don't have to look very far — *every single* VM that was affected by
      this bug has been affected in exactly the same way. The child process
      starts up, dies quickly (because postgres isn't alive yet), and then the
      parent process just sits there... waiting.
      
      Thankfully, there are 17 more stuck VMs we can play around with for
      debugging! On the next one, let's `kill -6 <pid>` so we can get the
      stack traces.
      
      One entry sticks out like a sore thumb:
      
          goroutine 1 [chan receive, 6640 minutes]:
          runtime.gopark(0x0?, 0x0?, 0x80?, 0x4d?, 0xc00007c2e0?)
              /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0005777b0 sp=0xc000577790 pc=0x43dab6
          runtime.chanrecv(0xc000089920, 0x0, 0x1)
              /usr/local/go/src/runtime/chan.go:583 +0x49d fp=0xc000577840 sp=0xc0005777b0 pc=0x408c1d
          runtime.chanrecv1(0xc0000ff968?, 0xc0000ff8f8?)
              /usr/local/go/src/runtime/chan.go:442 +0x18 fp=0xc000577868 sp=0xc000577840 pc=0x408718
          main.runRestartOnFailure({0x194a8e0, 0xc0002fef30}, 0x6?, {0xc000340a80, 0x4, 0x4}, {0xc00047ac98, 0x1, 0x1})
              /workspace/cmd/vm-informant/main.go:221 +0x285 fp=0xc000577a20 sp=0xc000577868 pc=0x1429605
          main.main()
              /workspace/cmd/vm-informant/main.go:84 +0x1832 fp=0xc000577f80 sp=0xc000577a20 pc=0x1428cb2
          runtime.main()
              /usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc000577fe0 sp=0xc000577f80 pc=0x43d687
          runtime.goexit()
              /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000577fe8 sp=0xc000577fe0 pc=0x471d21
      
      Specifically, it was stuck trying to receive on a channel — the one for
      the timer.
      
      It turns out that the typical advice of:
      
          if !timer.Stop() {
              <-timer.C
          }
      
      ... doesn't apply if you've *already received from the channel*. In that
      case, timer.Stop() returning false just means that the timer already
      finished, which makes sense (you already received on it), and so it
      isn't going to put another item into the channel for you.
      
      In our case, we received from the channel on a previous iteration of the
      loop, and then tried to receive from it again when the child informant
      died for a *second* time.
      
      So the fix: Track if you've received from the channel; don't block on
      receiving from timer.C if there won't be anything there.
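
      The general shape of that fix, as a self-contained sketch (this is not
      the informant's exact code):

          package main

          import "time"

          func main() {
              timer := time.NewTimer(10 * time.Millisecond)
              received := false

              for i := 0; i < 2; i++ {
                  <-timer.C // e.g. waiting out the restart delay
                  received = true

                  // ... restart the child, watch it exit quickly ...

                  // The usual drain pattern, now guarded by `received`. Without
                  // the guard, Stop() returns false here (the timer already
                  // fired), but the value was consumed above, so `<-timer.C`
                  // would block forever.
                  if !timer.Stop() && !received {
                      <-timer.C
                  }
                  timer.Reset(10 * time.Millisecond)
                  received = false
              }
          }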
      696fe2fa
    • release workflow: Add vmscrape.yaml asset (#392) · 1bb7fa4c
      Em Sharnoff authored
      Should have been handled by #282, but oh well.
      1bb7fa4c
    • 4296eeeb
  6. Jul 07, 2023
  7. Jul 06, 2023
  8. Jul 04, 2023
    • Bump version: v0.11.0 -> v0.12.0 · 0cd3cdbb
      Em Sharnoff authored
      v0.12.0
      0cd3cdbb
    • 88d3fbae
    • agent/billing: Move push logic into separate thread (#368) · 44d0c69e
      Em Sharnoff authored
      This should resolve some of the ongoing issues we've had with billing
      push requests timing out, because *currently* the push timeout must be
      short in order to produce correct data.
      
      Also, this commit removes pkg/billing.Batch: the agent now does its own
      batching, so the type is no longer required.
      
      Billing config changes:
      
      - Renamed pushTimeoutSeconds to pushRequestTimeoutSeconds
      - Added accumulateEverySeconds (now distinct from pushing!)
      - Added maxBatchSize
      
      Billing metrics changes:
      
      - Renamed autoscaling_agent_billing_batch_size to autoscaling_agent_billing_queue_size
      - Added autoscaling_agent_billing_last_send_duration_seconds
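
      Roughly, the new shape looks something like this (a sketch with made-up
      names, not the agent's actual code): events are handed to a dedicated
      sender goroutine over a queue, and that goroutine batches and pushes
      them on its own schedule, so the push request timeout no longer has to
      be short for the data to stay correct.

          package billing

          import (
              "context"
              "time"
          )

          type event struct{ /* billing event fields */ }

          // runSender owns the queue: it batches incoming events and pushes a
          // batch whenever it reaches maxBatchSize.
          func runSender(ctx context.Context, queue <-chan event, maxBatchSize int, pushTimeout time.Duration) {
              batch := make([]event, 0, maxBatchSize)
              for {
                  select {
                  case <-ctx.Done():
                      return
                  case ev := <-queue:
                      batch = append(batch, ev)
                      if len(batch) >= maxBatchSize {
                          pushCtx, cancel := context.WithTimeout(ctx, pushTimeout)
                          push(pushCtx, batch) // hypothetical helper that sends to the billing API
                          cancel()
                          batch = batch[:0]
                      }
                  }
              }
          }

          func push(ctx context.Context, batch []event) { /* ... */ }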
      44d0c69e
    • billing, agent/billing: Log IdempotencyKey of events (#366) · 0eb5df9e
      Em Sharnoff authored
      In order to make this work on the agent's side, idempotency key
      generation needed to be *somewhat* exposed, so this commit adds the
      Enrich function, which handles the common "filling out the other fields"
      tasks for each event.
      
      By abstracting this interface, we also remove the need for separate
      (*Batch).AddAbsoluteEvent vs (*Batch).AddIncrementalEvent methods, so
      now we just have a single (*Batch).Add method that also calls Enrich.
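
      A rough sketch of what that shape might look like (the signatures here
      are illustrative, not pkg/billing's real API):

          package billing

          import "time"

          type Event struct {
              IdempotencyKey string
              Timestamp      time.Time
              // ... absolute or incremental payload ...
          }

          // Enrich fills in the fields shared by every event, generating the
          // idempotency key if the caller didn't set one, so the agent can log
          // the key before queueing the event.
          func Enrich(newKey func() string, ev *Event) *Event {
              ev.Timestamp = time.Now()
              if ev.IdempotencyKey == "" {
                  ev.IdempotencyKey = newKey() // e.g. hostname + timestamp + random suffix
              }
              return ev
          }

          // Add replaces the separate AddAbsoluteEvent/AddIncrementalEvent
          // methods: it enriches the event and appends it to the batch.
          type Batch struct {
              newKey func() string
              events []*Event
          }

          func (b *Batch) Add(ev *Event) {
              b.events = append(b.events, Enrich(b.newKey, ev))
          }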
      0eb5df9e
    • Revert "add port for monitor" · c71eff62
      Felix Prasanna authored
      This reverts commit 6af98b75.
      c71eff62
    • add port for monitor · 6af98b75
      Felix Prasanna authored
      6af98b75
  9. Jul 03, 2023
  10. Jun 30, 2023
  11. Jun 29, 2023
    • c88cb01d
    • plugin: Cleanup state for deleted Nodes (#361) · 1fdaeac1
      Em Sharnoff authored
      This is a pre-req for adding per-node usage metrics, because of what's
      mentioned in #248:
      
      > Looking at the state for prod-us-east-2-delta, there's currently 107
      > nodes that the plugin is still tracking state for that don't exist
      
      Unclear whether this will resolve #248 ("scheduler has a memory leak").
      
      ---
      
      A lot of the extra work behind this PR is around using the new
      watch.Store instead of the preexisting calls to the K8s API when we need
      to fetch information about a Node. Realistically... this is probably
      more hassle than it's worth, but it's nice to have the consistency.
      1fdaeac1
  12. Jun 27, 2023
  13. Jun 26, 2023
    • f0ad3bd9
    • plugin: Fix filter cycle metrics (#356) · d5cb880b
      Em Sharnoff authored
      The previous implementation doesn't *really* work because it turns out
      PostFilter is only called if all Filter calls failed - so we need some
      other way to count the number of filter cycles.
      
      This PR removes two metrics:
      
      * autoscaling_plugin_filter_cycle_successes_total
      * autoscaling_plugin_filter_cycle_rejections_total
      
      The new recommended flow is:
      
      * Get attempts with autoscaling_plugin_extension_calls_total{method="PreFilter"}
      * Get rejections with autoscaling_plugin_extension_calls_total{method="PostFilter"}
      * Successes can be calculated as the difference of the two.
      d5cb880b
    • b7b4a5da
  14. Jun 23, 2023
    • d574fbbb
    • agent: Record endpoints for `Runner`s (#353) · fb7a4506
      Em Sharnoff authored
      With upcoming compute pool changes, we're going to end up with a lot of
      VMs where the informant is unable to start up - because the initial file
      cache connection will fail until postgres is alive, which only happens
      once the pooled VM is bound to a particular endpoint.
      
      So on staging, we currently report a lot of "autoscaling stuck" VMs,
      when in reality these are just part of the pool. Having a separate
      value for the number of these stuck VMs that are actually running
      something will ensure our metrics continue to be useful.
      
      Also, while passing this through so that we can make a metric out of
      it, it's worth storing & logging the endpoint ID, so that the
      information is more easily available (without having to cross-reference
      the console DB).
      fb7a4506
    • util/watch: Add more logs (#351) · adb41cec
      Em Sharnoff authored
      adb41cec