This project is mirrored from https://github.com/neondatabase/neon.
- Jul 22, 2024
-
Christian Schwarz authored
## Storage & Compute release 2024-07-22

This PR has so many commits because the release branch diverged from `main`. Details: https://neondb.slack.com/archives/C033A2WE6BZ/p1721650938949059?thread_ts=1721308848.034069&cid=C033A2WE6BZ

The commits that are truly new since the last storage release are the `main` commits that I cherry-picked using this command:

```
git cherry-pick 8a8b83df..4e547e62
```
-
Arpad Müller authored
PR #8299 switched the storage scrubber to use `DefaultCredentialsChain`. Now we do the same for `remote_storage`, which allows us to use `remote_storage` from inside Kubernetes. Most of the diff is due to `GenericRemoteStorage::from_config` becoming an `async fn`.
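A minimal sketch of what wiring in the SDK's default chain can look like (the function name and setup here are illustrative, not the actual `remote_storage` code); the chain resolves credentials from the environment, shared profiles, and pod/instance metadata, which is what makes it usable from inside Kubernetes:

```
use aws_config::default_provider::credentials::DefaultCredentialsChain;
use aws_sdk_s3::Client;

// Building the provider chain is async, which is why callers such as
// `GenericRemoteStorage::from_config` had to become `async fn`.
async fn make_s3_client() -> Client {
    let credentials = DefaultCredentialsChain::builder().build().await;
    let sdk_config = aws_config::from_env()
        .credentials_provider(credentials)
        .load()
        .await;
    Client::new(&sdk_config)
}

#[tokio::main]
async fn main() {
    let _client = make_s3_client().await;
}
```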
-
Arpad Müller authored
This adds an archival_config endpoint to the pageserver. Currently it has no effect and always "works", but the intent is that it will later make a timeline archived/unarchived.

- [x] add yml spec
- [x] add endpoint handler

Part of https://github.com/neondatabase/neon/issues/8088
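A minimal sketch of the shape such an endpoint handler could take (the type and function names are assumptions based on the description above, not the actual pageserver code):

```
// Hypothetical request body: the desired archival state of a timeline.
#[derive(Debug, Clone, Copy)]
enum TimelineArchivalState {
    Archived,
    Unarchived,
}

// Hypothetical handler: accepts the config and reports success without
// doing anything yet, matching "currently it has no effect and always
// 'works'".
fn handle_archival_config(desired: TimelineArchivalState) -> Result<(), String> {
    let _ = desired;
    Ok(())
}

fn main() {
    assert!(handle_archival_config(TimelineArchivalState::Archived).is_ok());
}
```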
-
Shinya Kato authored
## Problem

There are some swagger errors in `pageserver/src/http/openapi_spec.yml`:

```
Error 431 15000 Object includes not allowed fields
Error 569 3100401 should always have a 'required'
Error 569 15000 Object includes not allowed fields
Error 1111 10037 properties members must be schemas
```

## Summary of changes

Fixed the above errors.
-
John Spray authored
## Problem

This test had two locations with 2 second timeouts, which is rather low when we run on a highly contended test machine running lots of tests in parallel. It usually passes, but today I've seen both of these locations time out on separate PRs. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8432/10007868041/index.html#suites/837740b64a53e769572c4ed7b7a7eeeb/6c6a092be083d27c

## Summary of changes

- Change 2 second timeouts to 20 second timeouts
-
Shinya Kato authored
The safekeeper runtimes `WAL_REMOVER_RUNTIME` and `METRICS_SHIFTER_RUNTIME` are unused: `WAL_REMOVER_RUNTIME` was introduced in [#4119](https://github.com/neondatabase/neon/pull/4119) and its use was removed in [#7887](https://github.com/neondatabase/neon/pull/7887), while `METRICS_SHIFTER_RUNTIME` was also introduced in [#4119](https://github.com/neondatabase/neon/pull/4119) but has never been used. This removes both runtimes.
-
John Spray authored
## Problem

After a shard split, the pageserver leaves the ancestor shard's content in place. It may be referenced by child shards, but eventually the child shards will de-reference most ancestor layers as they write their own data and do GC. We would like to eventually clean up those ancestor layers to reclaim space.

## Summary of changes

- Extend the physical GC command with `--mode=full`, which includes cleaning up unreferenced ancestor shard layers
- Add test `test_scrubber_physical_gc_ancestors`
- Remove colored log output: in testing this is irritating ANSI code spam in logs, and in interactive use it doesn't add much.
- Refactor storage controller API client code out of storcon_client into a `storage_controller/client` crate
- During physical GC of ancestors, call into the storage controller to check that the latest shards seen in S3 reflect the latest state of the tenant, and that there is no shard split in progress.
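A minimal sketch of the ancestor-cleanup decision (types and names are illustrative, not the scrubber's actual code): an ancestor layer becomes deletable only once no child shard's index references it, and the real command additionally consults the storage controller before trusting the set of children.

```
use std::collections::HashSet;

// Return the ancestor layers that no child shard references any more.
fn ancestor_layers_to_delete(
    ancestor_layers: &[String],
    child_indices: &[HashSet<String>],
) -> Vec<String> {
    ancestor_layers
        .iter()
        .filter(|layer| !child_indices.iter().any(|index| index.contains(*layer)))
        .cloned()
        .collect()
}

fn main() {
    let ancestors = vec!["layer_a".to_string(), "layer_b".to_string()];
    let child1: HashSet<_> = ["layer_a".to_string()].into_iter().collect();
    let child2: HashSet<String> = HashSet::new();
    // Only layer_b is unreferenced and therefore deletable.
    assert_eq!(
        ancestor_layers_to_delete(&ancestors, &[child1, child2]),
        vec!["layer_b"]
    );
}
```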
-
Christian Schwarz authored
We're removing the usage of this long-meaningless config field in https://github.com/neondatabase/aws/pull/1599. Once that PR has been deployed to staging and prod, we can merge this PR.
-
Peter Bendel authored
## Problem

My prior PR https://github.com/neondatabase/neon/pull/8422 caused leftovers in the GitHub action runner work directory with root permission. As an example, see https://github.com/neondatabase/neon/actions/runs/10001857641/job/27646237324#step:3:37

To work around this, we install vanilla postgres as non-root using deb packages in the /home/nonroot user directory.

## Summary of changes

- since we cannot use root, we install the deb pkgs directly and create symbolic links for psql, pgbench and libs in the expected places
- continue jobs on AWS even if Azure jobs fail (because that region is currently unreliable)
-
Arpad Müller authored
Successor of #8288; this just enables zstd in tests. Also adds a test that creates easily compressible data. Part of #5431

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
-
Arthur Petukhovsky authored
The error means that the manager exited earlier than the `ResidenceGuard`, which is not unexpected with the current deletion implementation. This commit changes the log level to reduce noise.
-
Peter Bendel authored
## Problem

https://github.com/neondatabase/neon/issues/8275 is not yet fixed. Periodic benchmarking fails with SIGABRT in the pgvector step; see https://github.com/neondatabase/neon/actions/runs/9967453263/job/27541159738#step:7:393

## Summary of changes

Instead of using pgbench and psql from Neon artifacts, download vanilla postgres binaries into the container and use those to run the client side of the test.
-
Alex Chi Z. authored
Use the k-merge iterator in the compaction process to reduce memory footprint. Part of https://github.com/neondatabase/neon/issues/8002

## Summary of changes

* refactor the bottom-most compaction code to use the k-merge iterator
* add a Send bound on some structs, as they are used across await points

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
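A minimal sketch of the k-merge idea over pre-sorted inputs (the real iterator streams from layer files; this toy version uses vectors): a binary heap holds one head element per input, so memory scales with the number of layers rather than their total size.

```
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn k_merge<T: Ord>(inputs: Vec<Vec<T>>) -> Vec<T> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of each input.
    for (idx, it) in iters.iter_mut().enumerate() {
        if let Some(item) = it.next() {
            heap.push(Reverse((item, idx)));
        }
    }
    let mut out = Vec::new();
    // Repeatedly take the smallest head, then refill from that input.
    while let Some(Reverse((item, idx))) = heap.pop() {
        out.push(item);
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    out
}

fn main() {
    let merged = k_merge(vec![vec![1, 4, 7], vec![2, 5], vec![3, 6]]);
    assert_eq!(merged, vec![1, 2, 3, 4, 5, 6, 7]);
}
```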
-
Arthur Petukhovsky authored
We have an issue where some partially uploaded segments can actually be missing in remote storage. I found this issue when I was looking at the logs in staging, and it can be triggered by failed uploads:

1. Code tries to upload `SEG_TERM_LSN_LSN_sk5.partial`, but receives an error from S3
2. The failed attempt is saved to the `segments` vec
3. After some time, the code tries to upload `SEG_TERM_LSN_LSN_sk5.partial` again
4. This time the upload is successful and the code calls `gc()` to delete previous uploads
5. Since the new object and the old object share the same name, the uploaded data gets deleted from remote storage

This commit fixes the issue by patching `gc()` not to delete objects with the same name as the currently uploaded one.

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
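A minimal sketch of the `gc()` fix (names are illustrative): when removing previous partial uploads, skip any object whose name matches the segment that was just uploaded, because a retried upload reuses the same object name.

```
fn objects_to_delete(previous_uploads: &[String], current_name: &str) -> Vec<String> {
    previous_uploads
        .iter()
        // Deleting an object with the same name as the new upload would
        // delete the new data itself, which is exactly the bug described.
        .filter(|name| name.as_str() != current_name)
        .cloned()
        .collect()
}

fn main() {
    let previous = vec![
        "SEG_TERM_A_sk5.partial".to_string(),
        "SEG_TERM_B_sk5.partial".to_string(),
    ];
    let to_delete = objects_to_delete(&previous, "SEG_TERM_B_sk5.partial");
    assert_eq!(to_delete, vec!["SEG_TERM_A_sk5.partial"]);
}
```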
-
John Spray authored
## Problem

Ahead of enabling eviction in the field, where it will become the normal/default mode, let's enable it by default throughout our tests in case any issues become visible there.

## Summary of changes

- Make the default `extra_opts` for safekeepers enable offload & deletion
- Set low timeouts in `extra_opts` so that tests running for tens of seconds have a chance to hit some of these background operations.
-
John Spray authored
## Problem

These tests time out ~1 in 50 runs when in debug mode. There is no indication of a real issue: they're just wrappers that have large numbers of individual tests contained within one pytest case.

## Summary of changes

- Bump pg_regress timeout from 600s to 900s
- Bump test_isolation timeout from 300s (default) to 600s

In future it would be nice to break out these tests to run individual cases (or batches thereof) as separate tests, rather than this monolith.
-
John Spray authored
## Problem

This test would occasionally fail its metric check. This could happen in the rare case that the nodes had all been restarted before their most recent eviction. The metric check was added in https://github.com/neondatabase/neon/pull/8348

## Summary of changes

- Check metrics before each restart, accumulating into a bool that we assert on at the end of the test
-
Christian Schwarz authored
When `NeonEnv.from_repo_dir` was introduced, the storage controller stored its state exclusively in `attachments.json`. Since then, it has moved to using Postgres, which stores its state in `storage_controller_db`, but `NeonEnv.from_repo_dir` wasn't adjusted to do this. This PR rectifies the situation.

Context for this is failures in `test_pageserver_characterize_throughput_with_n_tenants`. Cf. https://neondb.slack.com/archives/C033RQ5SPDH/p1721035799502239?thread_ts=1720901332.293769&cid=C033RQ5SPDH

Notably, `from_repo_dir` is also used by the backwards- and forwards-compatibility tests, so the changes in this PR affect those tests as well. However, it turns out that the compatibility snapshot already contains the `storage_controller_db`, so it should just work, and in fact we can remove hacks like `fixup_storage_controller`.

Follow-ups created as part of this work:
* https://github.com/neondatabase/neon/issues/8399
* https://github.com/neondatabase/neon/issues/8400
-
dotdister authored
## Problem

There is something wrong in the comments of `control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.

## Summary of changes

Fixed the comments about component names and their data paths in `control_plane/src/broker.rs` and `control_plane/src/pageserver.rs`.
-
Joonas Koivunen authored
Fix flakiness in `test_sharded_timeline_detach_ancestor`, which does not reproduce on a fast enough runner, by allowing a cancelled request before completion on all pageservers; it was previously only allowed on half of the pageservers. Failure evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8352/9972357040/index.html#suites/a1c2be32556270764423c495fad75d47/7cca3e3d94fe12f2
-
John Spray authored
## Problem

We lack insight into:

- How much of a tenant's physical size is image vs. delta layers
- Average sizes of image vs. delta layers
- Total layer counts per timeline, indicating the size of the index_part object

As well as general observability love, this is motivated by https://github.com/neondatabase/neon/issues/6738, where we need to define some sensible thresholds for storage amplification. Using total physical size may not work well (if someone does a lot of DROPs then it's legitimate for the physical-synthetic ratio to be huge), but the ratio between image layer size and delta layer size may be a better indicator of whether we're generating unreasonable quantities of image layers.

## Summary of changes

- Add `pageserver_layer_bytes` and `pageserver_layer_count` metrics, labelled by timeline and `kind` (delta or image)
- Add & subtract these over `LayerInner`'s lifetime. I'm intentionally avoiding a generic metric RAII guard object, to avoid bloating `LayerInner`: it already has all the information it needs to update the metric on new+drop.
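A minimal sketch of tying the gauge to `LayerInner`'s lifetime without a dedicated RAII guard object (the metric is simplified to an atomic here; the real code uses labelled prometheus metrics):

```
use std::sync::atomic::{AtomicI64, Ordering};

// Stand-in for a labelled gauge such as pageserver_layer_bytes{kind="delta"}.
static LAYER_BYTES: AtomicI64 = AtomicI64::new(0);

struct LayerInner {
    file_size: i64,
}

impl LayerInner {
    fn new(file_size: i64) -> Self {
        // Incremented on construction...
        LAYER_BYTES.fetch_add(file_size, Ordering::Relaxed);
        LayerInner { file_size }
    }
}

impl Drop for LayerInner {
    fn drop(&mut self) {
        // ...and decremented on drop, so the gauge tracks live layers.
        LAYER_BYTES.fetch_sub(self.file_size, Ordering::Relaxed);
    }
}

fn main() {
    let layer = LayerInner::new(1024);
    assert_eq!(LAYER_BYTES.load(Ordering::Relaxed), 1024);
    drop(layer);
    assert_eq!(LAYER_BYTES.load(Ordering::Relaxed), 0);
}
```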
-
Yuchen Liang authored
The db was renamed from attachment_service to storage_controller, but the doc was stale.
-
John Spray authored
This test reproduces the case of a writer creating a deep stack of L0 layers. It uses realistic layer sizes and writes several gigabytes of data, so it runs as a performance test, although it is validating memory footprint rather than performance per se.

It acts as a regression test for two recent fixes:
- https://github.com/neondatabase/neon/pull/8401
- https://github.com/neondatabase/neon/pull/8391

In future it will demonstrate the larger improvement of using a k-merge iterator for L0 compaction (#8184).

This test can be extended to enforce limits on the memory consumption of other housekeeping steps, by restarting the pageserver and then running other things to do the same "how much did RSS increase" measurement.
-
Alex Chi Z. authored
Existing tenants and some selections of layers might produce duplicated keys. Add tests to ensure the k-merge iterator handles them correctly. We also enforce the ordering of the k-merge iterator to put images before deltas. Part of https://github.com/neondatabase/neon/issues/8002

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
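A minimal sketch of the tie-break (types are illustrative): entries sort by key, then LSN, and an image sorts before a delta at the same key/LSN, so duplicated keys come out in a deterministic order.

```
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Kind {
    // Declared first, so Image < Delta and images win ties.
    Image,
    Delta,
}

fn main() {
    // (key, lsn, kind) entries with a duplicated (key, lsn) pair:
    let mut entries = vec![(1u64, 10u64, Kind::Delta), (1, 10, Kind::Image)];
    entries.sort();
    assert_eq!(entries[0].2, Kind::Image);
}
```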
-
Peter Bendel authored
## Problem

We want to run performance tests on all supported cloud providers, and we want to run most tests on the postgres version that is the default for new projects in production, which (as of July 2024) is postgres version 16.

## Summary of changes

- change the default postgres version for some (performance) tests to 16 (which is our default for new projects in prod anyhow)
- add an Azure region to pgbench_compare jobs
- add an Azure region to pgvector benchmarking jobs
- the re-used project `weathered-snowflake-88107345` was prepared with 1 million embeddings running on 7 minCU / 7 maxCU in the Azure region, to compare with the AWS region (pgvector indexing and hnsw queries) - see job pgbench-pgvector
- note we now have 11 environment combinations where we run pgbench-compare; 5 are for k8s-pod (deprecated), which we can remove in the future once the auto-scaling team approves.

## Logs

A current run with the changes from this pull request is running here: https://github.com/neondatabase/neon/actions/runs/9972096222

Note that we currently expect some failures due to
- https://github.com/neondatabase/neon/issues/8275
- instability of projects in the Azure region
-
John Spray authored
## Problem

When a tenant creates a new timeline that they will treat as their 'main' history, it is awkward to permanently retain an 'old main' timeline as its ancestor. Currently this is necessary because it is forbidden to delete a timeline which has descendants.

## Summary of changes

A new pageserver API is proposed to 'adopt' data from a parent timeline into one of its children, such that the link between ancestor and child can be severed, leaving the parent in a state where it may then be deleted.

Co-authored-by: Joonas Koivunen <joonas@neon.tech>
-
John Spray authored
## Problem

`ValueRef` is an unnecessarily large structure, because it carries a cursor. L0 compaction currently instantiates gigabytes of these under some circumstances.

## Summary of changes

- Carry a ref to the parent layer instead of a cursor, and construct a cursor on demand.

This reduces the RSS high watermark during L0 compaction by about 20%.
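A minimal sketch of the before/after (types are illustrative, not the actual pageserver code): instead of storing a cursor in every `ValueRef`, keep only a reference to the parent layer and construct the cursor at read time.

```
struct Layer {
    bytes: Vec<u8>,
}

// A read cursor is comparatively heavy; millions of them add up.
struct Cursor<'a> {
    layer: &'a Layer,
    pos: usize,
}

// After the change: just a reference plus an offset.
struct ValueRef<'a> {
    layer: &'a Layer,
    off: usize,
}

impl<'a> ValueRef<'a> {
    fn load(&self) -> Cursor<'a> {
        // The cursor is built on demand instead of being stored per value.
        Cursor { layer: self.layer, pos: self.off }
    }
}

fn main() {
    let layer = Layer { bytes: vec![0u8; 16] };
    let vref = ValueRef { layer: &layer, off: 8 };
    let cursor = vref.load();
    println!("cursor at {}/{}", cursor.pos, cursor.layer.bytes.len());
}
```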
-
John Spray authored
## Problem

The `evictions_with_low_residence_duration` metric is used as an indicator of cache thrashing. However, there are situations where it is quite legitimate to only have a short residence during compaction, where a delta is downloaded, used to generate an image layer, and then discarded. This can lead to false positive alerts.

## Summary of changes

- Only track low residence duration for layers that have been accessed at least once (compaction doesn't count as an access).

This will give us a metric that indicates thrashing on layers that the _user_ is using, rather than those we're downloading for housekeeping purposes. Once we add "layer visibility" as an explicit property of layers, this can also be used as a cleaner condition (residence of non-visible layers should never be alertable).
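A minimal sketch of the eviction-time condition (names are illustrative): only layers with at least one access count toward the metric, so compaction's download-use-discard pattern no longer raises it.

```
struct EvictedLayer {
    access_count: u64,   // compaction reads don't count as accesses
    residence_secs: u64, // how long the layer stayed resident
}

fn record_low_residence(layer: &EvictedLayer, threshold_secs: u64, counter: &mut u64) {
    if layer.access_count > 0 && layer.residence_secs < threshold_secs {
        *counter += 1;
    }
}

fn main() {
    let mut evictions_with_low_residence_duration = 0;
    // A layer downloaded only for compaction: short residence, zero accesses.
    let housekeeping = EvictedLayer { access_count: 0, residence_secs: 30 };
    record_low_residence(&housekeeping, 600, &mut evictions_with_low_residence_duration);
    assert_eq!(evictions_with_low_residence_duration, 0);
}
```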
-
Alex Chi Z. authored
## Problem

close https://github.com/neondatabase/neon/issues/8389

## Summary of changes

A quick mitigation for tenants with fast writes. We compact at most 60 delta layers at a time, expecting a memory footprint of 15GB, and we pick the oldest 60 L0 layers. This should be a relatively safe change, so no test is added. The open question is whether to make this parameter configurable via tenant config.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: John Spray <john@neon.tech>
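A minimal sketch of the selection (types are illustrative): sort L0 layers oldest-first and take at most 60 per compaction pass to bound memory.

```
const MAX_L0_LAYERS_PER_COMPACTION: usize = 60;

// Each layer is (age_key, name); a lower age_key means an older layer.
fn pick_l0_layers(mut l0: Vec<(u64, String)>) -> Vec<String> {
    l0.sort_by_key(|(age_key, _)| *age_key);
    l0.into_iter()
        .take(MAX_L0_LAYERS_PER_COMPACTION)
        .map(|(_, name)| name)
        .collect()
}

fn main() {
    let layers: Vec<(u64, String)> = (0u64..100).map(|i| (100 - i, format!("l0_{i}"))).collect();
    assert_eq!(pick_l0_layers(layers).len(), MAX_L0_LAYERS_PER_COMPACTION);
}
```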
-
Tristan Partin authored
-
Tristan Partin authored
-
Tristan Partin authored
Previously, every migration was run in the same transaction. This is preparatory work for fixing CVE-2024-4317.
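A minimal sketch of per-migration transactions (using the `postgres` crate; the actual compute code may differ), where each migration commits independently and the `neon_migration.migration_id` table records progress:

```
use postgres::{Client, Error};

fn run_migrations(client: &mut Client, migrations: &[&str]) -> Result<(), Error> {
    for (idx, sql) in migrations.iter().enumerate() {
        // One transaction per migration, instead of one for all of them.
        let mut txn = client.transaction()?;
        txn.batch_execute(sql)?;
        let applied = (idx as i64) + 1;
        txn.execute("UPDATE neon_migration.migration_id SET id = $1", &[&applied])?;
        txn.commit()?;
    }
    Ok(())
}
```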
-
Tristan Partin authored
This matches what we put into the neon_migration.migration_id table.
-
John Spray authored
- `horizon` is a confusing term: it's not at all obvious that this means the space-based retention limit, rather than the total GC history limit. Rename to `GcCutoffs::space`.
- `pitr` is less confusing, but still an unnecessary level of indirection from what we really mean: a time-based condition. The fact that we use that time-history for Point In Time Recovery doesn't mean we have to refer to time as "pitr" everywhere. Rename to `GcCutoffs::time`.
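A minimal sketch of the renamed struct (field types simplified; the real struct lives in the pageserver's GC code):

```
struct Lsn(u64);

struct GcCutoffs {
    /// Formerly `horizon`: the cutoff implied by the space-based retention limit.
    space: Lsn,
    /// Formerly `pitr`: the cutoff implied by the time-based retention condition.
    time: Lsn,
}

fn main() {
    let cutoffs = GcCutoffs { space: Lsn(0x30), time: Lsn(0x20) };
    // One plausible way to combine them: keep history back to the more
    // conservative (lower) of the two cutoffs.
    let effective = cutoffs.space.0.min(cutoffs.time.0);
    println!("effective GC cutoff: {:#x}", effective);
}
```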
-
dependabot[bot] authored
Bumps [setuptools](https://github.com/pypa/setuptools) from 65.5.1 to 70.0.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: a-masterov <72613290+a-masterov@users.noreply.github.com>
-
Arpad Müller authored
As described in #8385, the likely source of flakiness in test_tenant_creation_fails is the following sequence of events:

1. test instructs the storage controller to create the tenant
2. storage controller adds the tenant, persists it to the database, and issues a creation request
3. the pageserver restarts with the failpoint disabled
4. storage controller's background reconciliation still wants to create the tenant
5. pageserver gets a new request to create the tenant from background reconciliation

This commit just avoids the storage controller entirely. It has its own set of issues, as the re-attach request will obviously not include the tenant, but it's still useful to test for non-existence of the tenant.

The generation is also no longer optional during tenant attachment: if you omit it, the pageserver yields an error. We change the signature of `tenant_attach` to reflect that.

Alternative to #8385
Fixes #8266
-
Anastasia Lubennikova authored
Fixes #8251
-
John Spray authored
## Problem

This structure was in an `Arc<>` unnecessarily, making it harder to reason about its lifetime (i.e. it was superficially possible for LayerManager to outlive the timeline, even though no code used it that way).

## Summary of changes

- Remove the `Arc<>`
-
Arpad Müller authored
The `doc_lazy_continuation` lint of clippy is still unknown on the latest stable Rust. Fixes fall-out from #8151.
-