  1. Mar 23, 2024
    • pageserver: use a single tokio runtime (#6555) · 3220f830
      Christian Schwarz authored
      Before this PR, each core had 3 executor threads from 3 different
      runtimes. With this PR, we just have one runtime, with one thread per
      core. Switching to a single tokio runtime should reduce that effective
      over-commit of CPU and in theory help with tail latencies -- iff all
      tokio tasks are well-behaved and yield to the runtime regularly.
      
      Are All Tasks Well-Behaved? Are We Ready?
      -----------------------------------------
      
      Sadly there doesn't seem to be good out-of-the-box tokio tooling to
      answer this question.
      
      We *believe* all tasks are well behaved in today's code base, as of the
      switch to `virtual_file_io_engine = "tokio-epoll-uring"` in production
      (https://github.com/neondatabase/aws/pull/1121).
      
      The only remaining executor-thread-blocking code is walredo and some
      filesystem namespace operations.
      
      Work on filesystem namespace operations is tracked in #6663 and is not
      considered likely to actually block at this time.
      
      Regarding walredo, it currently does a blocking `poll` for read/write to
      the pipe file descriptors we use for IPC with the walredo process.
      There is an ongoing experiment to make walredo async (#6628), but it
      needs more time because there are surprisingly tricky trade-offs that
      are articulated in that PR's description (which itself is still WIP).
      What's relevant for *this* PR is that
      1. walredo is always CPU-bound
      2. production tail latencies for walredo request-response
      (`pageserver_wal_redo_seconds_bucket`) are
        - p90: with few exceptions, low hundreds of microseconds
        - p95: except on very packed pageservers, below 1ms
        - p99: all below 50ms, vast majority below 1ms
        - p99.9: almost all around 50ms, rarely at >= 70ms
        - [Dashboard Link](https://neonprod.grafana.net/d/edgggcrmki3uof/2024-03-walredo-latency?orgId=1&var-ds=ZNX49CDVz&var-pXX_by_instance=0.9&var-pXX_by_instance=0.99&var-pXX_by_instance=0.95&var-adhoc=instance%7C%21%3D%7Cpageserver-30.us-west-2.aws.neon.tech&var-per_instance_pXX_max_seconds=0.0005&from=1711049688777&to=1711136088777)
      
      The ones below 1ms are below our current threshold for when we start
      thinking about yielding to the executor.
      The tens-of-milliseconds stalls aren't great but, not least because of
      the implicit overcommit of CPU by the three runtimes, we can't be sure
      whether these tens of milliseconds are inherently necessary to do the
      walredo work or whether we could be faster if there were less contention
      for CPU.
      
      On the first item (walredo always being CPU-bound work): it means that
      the walredo processes will always compete with the executor threads for CPU.
      We could yield, using async walredo, but then we hit the trade-offs
      explained in that PR.
      
      tl;dr: the risk of stalling executor threads through blocking walredo
      seems low, and switching to one runtime cleans up one potential source
      for higher-than-necessary stall times (explained in the previous
      paragraphs).
      
      
      Code Changes
      ------------
      
      - Remove the 3 different runtime definitions.
      - Add a new definition called `THE_RUNTIME`.
      - Use it in all places that previously used one of the 3 removed
      runtimes.
      - Remove the argument from `task_mgr`.
      - Fix failpoint usage where `pausable_failpoint!` should have been used.
      We encountered some actual failures because of this, e.g., hung
      `get_metric()` calls during test teardown that would hit the client
      timeout after 300s.
      
      As indicated by the comment above `THE_RUNTIME`, we could take this
      clean-up further.
      But before we create so much churn, let's first validate that there's no
      perf regression.
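      
      A rough sketch of the shape of this change (construction details such as
      `once_cell` are assumptions, not the actual `task_mgr` code): one shared
      multi-threaded runtime, defined once and referenced everywhere.
      
      ```rust
      use once_cell::sync::Lazy;
      use tokio::runtime::Runtime;
      
      // A single shared runtime; by default the multi-thread runtime starts one
      // worker thread per core, i.e. the "one runtime, one thread per core"
      // setup described above.
      pub static THE_RUNTIME: Lazy<Runtime> = Lazy::new(|| {
          tokio::runtime::Builder::new_multi_thread()
              .enable_all()
              .build()
              .expect("failed to create the pageserver runtime")
      });
      
      fn main() {
          // Callers block on / spawn onto this one runtime instead of choosing
          // between several per-subsystem runtimes.
          THE_RUNTIME.block_on(async {
              println!("running on the single shared runtime");
          });
      }
      ```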
      
      
      Performance
      -----------
      
      We will test this in staging using the various nightly benchmark runs.
      
      However, the worst-case impact of this change is likely compaction
      (=>image layer creation) competing with compute requests.
      Image layer creation work can't be easily generated & repeated quickly
      by pagebench.
      So, we'll simply watch getpage & basebackup tail latencies in staging.
      
      Additionally, I have done manual benchmarking using pagebench.
      Report:
      https://neondatabase.notion.site/2024-03-23-oneruntime-change-benchmarking-22a399c411e24399a73311115fb703ec?pvs=4
      Tail latencies and throughput are marginally better (no regression =
      good).
      Except in a workload with 128 clients against one tenant.
      There, the p99.9 and p99.99 getpage latency is about 2x worse (at
      slightly lower throughput).
      A dip in throughput every 20s (the `compaction_period`) is clearly
      visible, and probably responsible for that worse tail latency.
      This has potential to improve with async walredo, and is an edge case
      workload anyway.
      
      
      Future Work
      -----------
      
      1. Once this change has shown satisfactory results in production, change
      the codebase to use the ambient runtime instead of explicitly
      referencing `THE_RUNTIME`.
      2. Have a mode where we run with a single-threaded runtime, so we
      uncover executor stalls more quickly.
      3. Switch to, or write our own, async-native failpoints library:
      https://github.com/neondatabase/neon/issues/7216
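      
      A minimal sketch of what item 1 could look like, using plain tokio APIs
      (`Handle::current`, `tokio::spawn`); this is illustrative, not existing
      pageserver code.
      
      ```rust
      use tokio::runtime::Handle;
      
      // Inside async code the runtime is "ambient": tasks spawn onto the
      // runtime they are already running on instead of naming a global.
      async fn spawn_background_work() {
          // Either grab an explicit handle to the current runtime...
          let handle = Handle::current();
          handle.spawn(async { /* background work */ });
      
          // ...or rely on tokio::spawn, which uses the ambient runtime implicitly.
          tokio::spawn(async { /* background work */ });
      }
      
      #[tokio::main]
      async fn main() {
          spawn_background_work().await;
      }
      ```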
    • proxy: fix stack overflow in cancel publisher (#7212) · 72103d48
      Conrad Ludgate authored
      ## Problem
      
      stack overflow in blanket impl for `CancellationPublisher`
      
      ## Summary of changes
      
      Removes `async_trait` and fixes the impl order to make it non-recursive.
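      
      A minimal sketch of the failure mode with a hypothetical trait (not the
      actual `CancellationPublisher` code): a blanket impl over references can
      resolve a method call back to itself instead of reaching the underlying
      type, recursing until the stack overflows.
      
      ```rust
      trait Publish {
          fn publish(&self, msg: &str);
      }
      
      struct Stdout;
      
      impl Publish for Stdout {
          fn publish(&self, msg: &str) {
              println!("{msg}");
          }
      }
      
      impl<'a, T: Publish> Publish for &'a T {
          fn publish(&self, msg: &str) {
              // Here `self` is `&&'a T`. A plain `self.publish(msg)` would select
              // this same blanket impl again (for `&&'a T`) and recurse forever;
              // dispatching through the inner value reaches `T`'s impl instead.
              (**self).publish(msg)
          }
      }
      
      fn main() {
          let sink = Stdout;
          (&sink).publish("hello"); // blanket impl -> Stdout impl, no recursion
      }
      ```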
    • fixup(#7204 / postgres): revert `IsPrimaryAlive` checks (#7209) · 643683f4
      Alex Chi Z authored
      Fix #7204.
      
      https://github.com/neondatabase/postgres/pull/400
      https://github.com/neondatabase/postgres/pull/401
      https://github.com/neondatabase/postgres/pull/402
      
      These commits never made it into prod. A detailed investigation will be
      posted in another issue. Reverting the commits so that things can keep
      running in prod. This pull request adds a test that starts two replicas.
      It fails on the current main
      (https://github.com/neondatabase/neon/pull/7210) but passes in this pull
      request.
      
      ---------
      
      Signed-off-by: Alex Chi Z <chi@neon.tech>
  2. Mar 21, 2024
    • walredo benchmark: throughput-oriented rewrite (#7190) · fb60278e
      Christian Schwarz authored
      See the updated `bench_walredo.rs` module comment.
      
      tl;dr: we measure the average latency of single redo operations issued
      against a single redo manager from N tokio tasks.
      
      part of https://github.com/neondatabase/neon/issues/6628
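      
      The measurement idea, as a rough sketch rather than the actual
      `bench_walredo.rs`: N tokio tasks share one manager, each issues a fixed
      number of requests, and the result is total wall time divided by total
      requests.
      
      ```rust
      use std::sync::Arc;
      use std::time::Instant;
      
      struct RedoManager;
      
      impl RedoManager {
          async fn request_redo(&self) {
              // stand-in for a single redo request/response round trip
              tokio::task::yield_now().await;
          }
      }
      
      #[tokio::main]
      async fn main() {
          let mgr = Arc::new(RedoManager);
          let n_tasks: u32 = 8;
          let requests_per_task: u32 = 10_000;
      
          let start = Instant::now();
          let mut handles = Vec::new();
          for _ in 0..n_tasks {
              let mgr = Arc::clone(&mgr);
              handles.push(tokio::spawn(async move {
                  for _ in 0..requests_per_task {
                      mgr.request_redo().await;
                  }
              }));
          }
          for h in handles {
              h.await.unwrap();
          }
      
          let total_requests = n_tasks * requests_per_task;
          println!("avg latency: {:?}", start.elapsed() / total_requests);
      }
      ```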
    • proxy: simplify password validation (#7188) · d5304337
      Conrad Ludgate authored
      ## Problem
      
      For HTTP/WS/password hack flows we imitate SCRAM to validate passwords.
      This code was unnecessarily complicated.
      
      ## Summary of changes
      
      Copy in the `pbkdf2` and 'derive keys' steps from the
      `postgres_protocol` crate in our `rust-postgres` fork. Derive the
      `client_key`, `server_key` and `stored_key` from the password directly.
      Use constant time equality to compare the `stored_key` and `server_key`
      with the ones we are sent from cplane.
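      
      The derivation follows SCRAM (RFC 5802 / RFC 7677). A minimal sketch,
      assuming the RustCrypto `pbkdf2`, `hmac`, `sha2`, and `subtle` crates
      rather than the code the proxy actually uses:
      
      ```rust
      use hmac::{Hmac, Mac};
      use pbkdf2::pbkdf2_hmac;
      use sha2::{Digest, Sha256};
      use subtle::ConstantTimeEq;
      
      fn hmac_sha256(key: &[u8], msg: &[u8]) -> Vec<u8> {
          let mut mac =
              Hmac::<Sha256>::new_from_slice(key).expect("HMAC accepts any key length");
          mac.update(msg);
          mac.finalize().into_bytes().to_vec()
      }
      
      /// Derive the SCRAM keys from the plaintext password and compare them,
      /// in constant time, against the keys received from cplane.
      fn verify_password(
          password: &[u8],
          salt: &[u8],
          iterations: u32,
          expected_stored_key: &[u8],
          expected_server_key: &[u8],
      ) -> bool {
          // SaltedPassword := PBKDF2-HMAC-SHA-256(password, salt, iterations)
          let mut salted_password = [0u8; 32];
          pbkdf2_hmac::<Sha256>(password, salt, iterations, &mut salted_password);
      
          // ClientKey := HMAC(SaltedPassword, "Client Key"); StoredKey := H(ClientKey)
          let client_key = hmac_sha256(&salted_password, b"Client Key");
          let stored_key = Sha256::digest(&client_key);
      
          // ServerKey := HMAC(SaltedPassword, "Server Key")
          let server_key = hmac_sha256(&salted_password, b"Server Key");
      
          // Constant-time equality so the comparison leaks no timing information.
          bool::from(stored_key.ct_eq(expected_stored_key))
              & bool::from(server_key.ct_eq(expected_server_key))
      }
      ```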
    • pageserver: extend /re-attach response to include tenant mode (#6941) · 06cb582d
      John Spray authored
      This change improves the resilience of the system to unclean restarts.
      
      Previously, re-attach responses only included attached tenants:
      - If the pageserver had local state for a secondary location, it would
      remain, but with no guarantee that it was still _meant_ to be there.
      After this change, the pageserver will only retain secondary locations
      if the /re-attach response indicates that they should still be there.
      - If the pageserver had local state for an attached location that was
      omitted from a re-attach response, it would be entirely detached. This
      is wasteful in a typical HA setup, where an offline node's tenants might
      have been re-attached elsewhere before it restarts, but the offline
      node's location should revert to a secondary location rather than being
      wiped. Including secondary tenants in the re-attach response enables the
      pageserver to avoid throwing away local state unnecessarily.
      
      In this PR:
      - The re-attach items are extended with a 'mode' field.
      - Storage controller populates 'mode'
      - Pageserver interprets it (default is attached if missing) to construct
      either a SecondaryTenant or a Tenant.
      - A new test exercises both cases.
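      
      A sketch of what the extended response item could look like (field names
      and types are illustrative assumptions, not the actual control plane
      API), including the "missing mode means attached" default:
      
      ```rust
      use serde::Deserialize;
      
      #[derive(Debug, Deserialize, PartialEq)]
      #[serde(rename_all = "lowercase")]
      enum LocationMode {
          Attached,
          Secondary,
      }
      
      #[derive(Debug, Deserialize)]
      struct ReAttachResponseTenant {
          id: String,
          // Older storage controllers omit `mode`; a missing value is treated
          // as attached for backward compatibility.
          #[serde(default)]
          mode: Option<LocationMode>,
      }
      
      impl ReAttachResponseTenant {
          fn is_attached(&self) -> bool {
              matches!(self.mode, None | Some(LocationMode::Attached))
          }
      }
      
      fn main() {
          let old: ReAttachResponseTenant =
              serde_json::from_str(r#"{ "id": "t1" }"#).unwrap();
          let new: ReAttachResponseTenant =
              serde_json::from_str(r#"{ "id": "t2", "mode": "secondary" }"#).unwrap();
          assert!(old.is_attached());  // missing mode -> construct a Tenant
          assert!(!new.is_attached()); // secondary -> construct a SecondaryTenant
      }
      ```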
    • pageserver: quieten log on shutdown-while-attaching (#7177) · bb47d536
      John Spray authored
      ## Problem
      
      If a shutdown happened while a tenant was attaching, we logged at
      ERROR severity and with a backtrace. Yuck.
      
      ## Summary of changes
      
      - Pass a flag into `make_broken` to enable quietening this non-scary
      case.
    • storage controller: fixes to secondary location handling (#7169) · 59cdee74
      John Spray authored
      Stacks on:
      - https://github.com/neondatabase/neon/pull/7165
      
      These are fixes made while working on background optimization of
      scheduling after a split:
      - When a tenant has secondary locations, we weren't detaching the parent
      shards' secondary locations when doing a split
      - When a reconciler detaches a location, it was feeding back a
      locationconf with `Detached` mode in its `observed` object, whereas it
      should omit that location. This could cause the background reconcile
      task to keep kicking off no-op reconcilers forever (harmless but
      annoying).
      - During shard split, we were scheduling secondary locations for the
      child shards, but no reconcile was run for these until the next time the
      background reconcile task ran. Creating these ASAP is useful, because
      they'll be used shortly after a shard split as the destination locations
      for migrating the new shards to different nodes.
    • storage_controller: add metrics (#7178) · c75b5844
      Vlad Lazar authored
      ## Problem
      Storage controller had basically no metrics.
      
      ## Summary of changes
      1. Migrate the existing metrics to use Conrad's
      [`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
      2. Add metrics for incoming http requests
      3. Add metrics for outgoing http requests to the pageserver
      4. Add metrics for outgoing pass through requests to the pageserver
      5. Add metrics for database queries
      
      Note that the metrics response for the attachment service does not use
      chunked encoding like the rest of the metrics endpoints. Conrad has
      kindly extended the crate such that it can now be done. Let's leave it
      for a follow-up since the payload shouldn't be that big at this point.
      
      Fixes https://github.com/neondatabase/neon/issues/6875
    • proxy: async aware password validation (#7176) · 5ec6862b
      Conrad Ludgate authored
      ## Problem
      
      spawn_blocking in #7171 was a hack
      
      ## Summary of changes
      
      https://github.com/neondatabase/rust-postgres/pull/29
    • Enforce LSN ordering of batch entries (#7071) · 94138c1a
      Jure Bajic authored
      ## Summary of changes
      
      Enforce LSN ordering of batch entries.
      
      Closes https://github.com/neondatabase/neon/issues/6707
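      
      A hypothetical sketch of the invariant being enforced (types and names
      are illustrative, not the actual ingest code): entries within a batch
      must be applied in non-decreasing LSN order.
      
      ```rust
      #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
      struct Lsn(u64);
      
      fn check_batch_ordering(batch: &[(Lsn, Vec<u8>)]) -> Result<(), String> {
          for window in batch.windows(2) {
              let (prev, next) = (window[0].0, window[1].0);
              if next < prev {
                  return Err(format!(
                      "batch entries out of LSN order: {next:?} < {prev:?}"
                  ));
              }
          }
          Ok(())
      }
      
      fn main() {
          let batch = vec![
              (Lsn(0x10), vec![]),
              (Lsn(0x20), vec![]),
              (Lsn(0x18), vec![]), // out of order: must be rejected
          ];
          assert!(check_batch_ordering(&batch).is_err());
      }
      ```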
    • fix(layer): remove the need to repair internal state (#7030) · 2206e14c
      Joonas Koivunen authored
      ## Problem
      
      The current implementation of struct Layer supports canceled read
      requests, but those will leave the internal state such that a following
      `Layer::keep_resident` call will need to repair the state. In
      pathological cases seen during generation number resets in staging
      or with too many in-progress on-demand downloads, this repair activity
      will need to wait for the download to complete, which stalls disk
      usage-based eviction. Similar stalls have been observed in staging near
      disk-full situations, where downloads failed because the disk was full.
      
      Fixes #6028, the "layer is present on filesystem but not evictable"
      problem, by:
      1. not canceling pending evictions by a canceled
      `LayerInner::get_or_maybe_download`
      2. completing post-download initialization of the `LayerInner::inner`
      from the download task
      
      Not canceling evictions in case (1) and always completing initialization
      in case (2) mean that `LayerInner::inner` always has up-to-date
      information, so the old `Layer::keep_resident` never has to wait for
      downloads to complete. Finally, `Layer::keep_resident` is replaced
      with `Layer::is_likely_resident`. These fix #7145.
      
      ## Summary of changes
      
      - add a new test showing that a canceled get_or_maybe_download should
      not cancel the eviction
      - switch to using a `watch` internally rather than a `broadcast` to
      avoid hanging eviction while a download is ongoing
      - doc changes for new semantics and cleanup
      - fix `Layer::keep_resident` to use just `self.0.inner.get()` as the
      source of truth; this becomes `Layer::is_likely_resident`
      - remove `LayerInner::wanted_evicted` boolean as no longer needed
      
      Builds upon: #7185. Cc: #5331.
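      
      A minimal sketch of why a `watch` channel fits here better than a
      `broadcast` channel (names are illustrative, not the actual `Layer`
      internals): a `watch` receiver can always read the latest state without
      waiting for, or keeping up with, a stream of messages, so code that only
      needs the current state never hangs behind an in-flight download.
      
      ```rust
      use tokio::sync::watch;
      
      #[derive(Debug, PartialEq)]
      enum DownloadState {
          NotStarted,
          InProgress,
          Resident,
      }
      
      #[tokio::main]
      async fn main() {
          let (tx, rx) = watch::channel(DownloadState::NotStarted);
      
          // The download task publishes its progress; each send overwrites the
          // previous value rather than queueing a message.
          let download = tokio::spawn(async move {
              tx.send(DownloadState::InProgress).unwrap();
              // ... perform the download ...
              tx.send(DownloadState::Resident).unwrap();
          });
          download.await.unwrap();
      
          // Eviction-style code can peek at the current state at any time,
          // without blocking on an in-flight notification.
          assert_eq!(*rx.borrow(), DownloadState::Resident);
      }
      ```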
  3. Mar 19, 2024
    • storage controller: tech debt (#7165) · a5d5c2a6
      John Spray authored
      This is a mixed bag of changes split out for separate review while
      working on other things, and batched together to reduce load on CI
      runners. Each commit stands alone for review purposes:
      - do_tenant_shard_split was a long function and had a synchronous
      validation phase at the start that could readily be pulled out into a
      separate function. This also avoids the special casing of
      ApiError::BadRequest when deciding whether an abort is needed on errors
      - Add a 'describe' API (GET on tenant ID) that will enable storcon-cli
      to see what's going on with a tenant
      - the 'locate' API wasn't really meant for use in the field. It's for
      tests: demote it to the /debug/ prefix
      - The `Single` placement policy was a redundant duplicate of `Double(0)`,
      and `Double` was a bad name. Rename it to `Attached`.
      (https://github.com/neondatabase/neon/issues/7107)
      - Some neon_local commands were added for debug/demos, which are now
      replaced by commands in storcon-cli (#7114). Even though that's not
      merged yet, we don't need the neon_local ones any more.
      
      Closes https://github.com/neondatabase/neon/issues/7107
      
      ## Backward compat of Single/Double -> `Attached(n)` change
      
      A database migration is used to convert any existing values.
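      
      A sketch of the renamed type (variant shapes are assumptions based on the
      description above, not the actual storage controller code); the stored
      values change spelling, which is why a migration is needed.
      
      ```rust
      use serde::{Deserialize, Serialize};
      
      #[derive(Serialize, Deserialize)]
      enum PlacementPolicy {
          // n = number of secondary locations alongside the attached one;
          // the old `Single` corresponds to Attached(0), `Double(n)` to Attached(n).
          Attached(usize),
      }
      
      fn main() {
          // If stored values used the old spellings ("Single", {"Double": 1}),
          // they would no longer parse; hence the one-off migration.
          let v = serde_json::to_string(&PlacementPolicy::Attached(1)).unwrap();
          println!("{v}"); // {"Attached":1}
      }
      ```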
    • Move functions for creating/extracting tarballs into utils · 64c6dfd3
      Tristan Partin authored
      Useful for other code paths which will handle zstd compression and
      decompression.
    • fixup(#7168): neon_local: use pageserver defaults for known but unspecified config overrides (#7166) · a8384a07
      Alex Chi Z authored
      
      e2e tests cannot run on macOS unless the file engine env var is
      supplied.
      
      ```
      ./scripts/pytest test_runner/regress/test_neon_superuser.py -s
      ```
      
      will fail with "tokio-epoll-uring not supported".
      
      This is because we persist the file engine config by default. In this
      pull request, we only persist it when someone specifies it explicitly,
      so that the pageserver can use its platform-specific default config.
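      
      The idea, as a small hypothetical sketch (function names are
      illustrative, not the actual neon_local code): the override is only
      rendered into the generated pageserver config when the user actually
      supplied a value.
      
      ```rust
      // Only emit the io-engine override when explicitly provided, so the
      // pageserver falls back to its platform default otherwise.
      fn render_overrides(io_engine: Option<&str>) -> Vec<String> {
          let mut overrides = Vec::new();
          if let Some(engine) = io_engine {
              overrides.push(format!("virtual_file_io_engine='{engine}'"));
          }
          // Before the fix, a default value was always pushed here, which broke
          // platforms where that default (tokio-epoll-uring) is unsupported.
          overrides
      }
      
      fn main() {
          assert!(render_overrides(None).is_empty());
          assert_eq!(
              render_overrides(Some("std-fs")),
              vec!["virtual_file_io_engine='std-fs'".to_string()]
          );
      }
      ```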
      
      ---------
      
      Signed-off-by: Alex Chi Z <chi@neon.tech>
    • Merge pull request #7154 from neondatabase/rc/2024-03-18 · 7b860b83
      Arpad Müller authored
      Release 2024-03-18
      release-5147
    • tests: log hygiene checks for storage controller (#6710) · b80704cd
      John Spray authored
      ## Problem
      
      As with the pageserver, we should fail tests that emit unexpected log
      errors/warnings.
      
      ## Summary of changes
      
      - Refactor existing log checks to be reusable
      - Run log checks for attachment_service
      - Add allow lists as needed.