This project is mirrored from https://github.com/neondatabase/neon.
- Sep 12, 2023
-
-
Shany Pozin authored
Release 2023-09-12
-
Christian Schwarz authored
The sequence that can lead to a deadlock:

1. A DELETE request gets all the way to `tenant.shutdown(progress, false).await.is_err()` while holding `TENANTS.read()`.
2. A POST request for tenant creation comes in and calls `tenant_map_insert`, which does `let mut guard = TENANTS.write().await;`.
3. Something that `tenant.shutdown()` needs to wait for needs `TENANTS.read().await`. The only case identified in an exhaustive manual scan of the code base is this one: imitate size access does `get_tenant().await`, which does `TENANTS.read().await` under the hood.

In the above case, (1) waits for (3), (3)'s read-lock request is queued behind (2)'s write-lock request, and (2) waits for (1). Deadlock. I made a reproducer that proves the above hypothesis in https://github.com/neondatabase/neon/pull/5281, but it's not ready for merge yet and we want the fix _now_. fixes https://github.com/neondatabase/neon/issues/5284
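The wait cycle can be sketched outside the codebase. Below is a minimal Python model, not Neon code: `FairRWLock` is a stand-in for a fair (write-preferring) `RwLock` like tokio's, where new readers queue behind an already-waiting writer, which is exactly what closes the cycle.

```python
import threading
import time

class FairRWLock:
    """Write-preferring RW lock: once a writer is waiting, new readers
    queue behind it (modeling a fair async RwLock)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False
        self._writers_waiting = 0

    def acquire_read(self, timeout=None):
        with self._cond:
            ok = self._cond.wait_for(
                lambda: not self._writer_active and self._writers_waiting == 0,
                timeout=timeout)
            if ok:
                self._readers += 1
            return ok

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout=None):
        with self._cond:
            self._writers_waiting += 1
            ok = self._cond.wait_for(
                lambda: not self._writer_active and self._readers == 0,
                timeout=timeout)
            self._writers_waiting -= 1
            if ok:
                self._writer_active = True
            return ok

TENANTS = FairRWLock()

# (1) DELETE: holds TENANTS.read() across tenant.shutdown().
assert TENANTS.acquire_read()

# (2) POST: tenant_map_insert requests TENANTS.write(), queued behind (1).
threading.Thread(target=TENANTS.acquire_write, daemon=True).start()
time.sleep(0.1)  # let (2) enqueue

# (3) shutdown waits on work that needs TENANTS.read(); the fair lock
# queues it behind writer (2): (1) waits on (3), (3) on (2), (2) on (1).
assert not TENANTS.acquire_read(timeout=0.2), "read (3) blocks: deadlock"
```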
-
Arpad Müller authored
## Problem

`block_in_place` is a quite expensive operation, and if it is used, we should have to opt into it explicitly by `allow`ing the `clippy::disallowed_methods` lint. For more, see https://github.com/neondatabase/neon/pull/5023#discussion_r1304194495. Similar arguments exist for `Handle::block_on`, but we don't do this yet as there are still usages.

## Summary of changes

Adds a clippy.toml file, configuring the [`disallowed_methods` lint](https://rust-lang.github.io/rust-clippy/master/#/disallowed_method).
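A clippy.toml configuring this lint might look like the following (the exact entries here are an assumption; the real file is in the PR):

```toml
# clippy.toml
disallowed-methods = [
    { path = "tokio::task::block_in_place", reason = "too expensive; opt in explicitly with #[allow(clippy::disallowed_methods)]" },
]
```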
-
- Sep 11, 2023
-
-
Arpad Müller authored
## Problem

Previously, we were using `observe_closure_duration` in the `VirtualFile` file-opening code, but this doesn't support async open operations, which we want to use as part of #4743.

## Summary of changes

* Move the duration measurement from the `with_file` macro into an `observe_duration` macro.
* Some smaller drive-by fixes to replace the old strings with the new variant names introduced by #5273.

Part of #4743, follow-up of #5247.
-
Arpad Müller authored
## Problem

For #4743, we want to convert everything up to the actual I/O operations of `VirtualFile` to `async fn`.

## Summary of changes

This PR is the last change in a series of changes to `VirtualFile`: #5189, #5190, #5195, #5203, and #5224. It does the last preparations before the I/O operations are actually made async. We are doing the following things:

* First, we change the locks for the file descriptor cache to tokio's locks that support Send. This is important when one wants to hold locks across await points (which we want to do), as otherwise the Future won't be Send. Also, one shouldn't generally block in async code, as executors don't like that.
* Due to the lock change, we now take an approach for the `VirtualFile` destructors similar to the one proposed by #5122 for the page cache: use `try_write`. Similarly to the situation in the linked PR, one can argue that if we are in the destructor and the slot has not been reused yet, we are the only user accessing the slot, due to owning the lock mutably. It is still possible that we fail to obtain the lock, but the only cause for that is the clock algorithm touching the slot, which should be quite an unlikely occurrence. For the instance of `try_write` failing, we spawn an async task to destroy the lock. As just argued, however, most of the time the code path where we spawn the task should not be visited.
* Lastly, we split `with_file` into a macro part and a function part that contains most of the logic. The function part returns a lock object that the macro uses. The macro exists to perform the operation in a more compact fashion, saving code from putting the lock into a variable and then doing the operation while measuring the time to run it. We take the locks approach because Rust has no support for async closures. One can make normal closures return a future, but that approach runs into lifetime issues the moment you want to pass data with a lifetime to these closures via parameters (captures work). For details, see [this](https://smallcultfollowing.com/babysteps/blog/2023/03/29/thoughts-on-async-closures/) and [this](https://users.rust-lang.org/t/function-that-takes-an-async-closure/61663) link. In #5224, we ran into a similar problem with the `test_files` function, and we ended up passing the path and the `OpenOptions` by value instead of by reference, at the expense of a few extra copies. This could be done because the data is cheaply copyable and we are in test code. But here we are not, and while `File::try_clone` exists, it [issues system calls internally](https://github.com/rust-lang/rust/blob/1e746d7741d44551e9378daf13b8797322aa0b74/library/std/src/os/fd/owned.rs#L94-L111). Also, it would allocate an entirely new file descriptor, something the fd cache was built to prevent.
* We change the `STORAGE_IO_TIME` metrics to support async.

Part of #4743.
-
bojanserafimov authored
-
duguorong009 authored
Introduce the `StorageIoOperation` enum, `StorageIoTime` struct, and `STORAGE_IO_TIME_METRIC` static which provides lockless access to histograms consumed by `VirtualFile`. Closes #5131 Co-authored-by: Joonas Koivunen <joonas@neon.tech>
-
Joonas Koivunen authored
Assorted flakiness fixes from #5198; might not be flaky on `main`. Migrate some tests using `neon_simple_env` to just `neon_env_builder`, and use `initial_tenant` to make the flakiness easier to understand. (Did not understand the flakiness of `test_timeline_create_break_after_uninit_mark`.) `test_download_remote_layers_api` is flaky because we have no atomic "wait for WAL, checkpoint, wait for upload and do not receive any more WAL". The `test_tenant_size` fixes are just boilerplate that should have always existed: we should wait for the tenant to be active. Similarly for `test_timeline_delete`. `test_timeline_size_post_checkpoint` often fails for me with reading zero from metrics; give it a few attempts.
-
- Sep 10, 2023
-
-
Rahul Modpur authored
## Problem

Detaching a tenant can involve many thousands of local filesystem metadata writes, but the control plane would benefit from us not blocking detach/delete responses on these.

## Summary of changes

After renaming the local tenant directory, acknowledge the tenant detach and delete the tenant directory in the background. #5183

---------

Signed-off-by: Rahul Modpur <rmodpur2@gmail.com>
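The idea can be sketched in a few lines (the function and the `.to-delete` suffix are illustrative, not Neon's actual API): the rename is a single cheap, atomic metadata operation, so the detach can be acknowledged immediately, while the thousands of per-file deletions happen off the request path.

```python
import os
import shutil
import tempfile
import threading

def detach_tenant(tenant_dir: str) -> threading.Thread:
    """Rename first (one atomic metadata op), then delete in the background."""
    tmp = tenant_dir + ".to-delete"
    os.rename(tenant_dir, tmp)  # from here on, the tenant is detached
    t = threading.Thread(target=shutil.rmtree, args=(tmp,), daemon=True)
    t.start()
    return t  # the detach response does not wait for this
```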
-
Alexander Bayandin authored
## Problem

Another thing I overlooked regarding `approved-for-ci-run`:
- When we create a PR, the action is associated with @vipvap and this triggers the pipeline — this is good.
- When we update the PR by force-pushing to the branch, the action is associated with @github-actions, which doesn't trigger a pipeline — this is bad.

Initially spotted in #5239 / #5211 ([link](https://github.com/neondatabase/neon/actions/runs/6122249456/job/16633919558?pr=5239)) — `check-permissions` should not fail.

## Summary of changes

- Use `CI_ACCESS_TOKEN` to check out the repo (I expect this token will be reused in the following `git push`)
-
Alexander Bayandin authored
## Problem

When a `ci-run/pr-*` PR is created, the GitHub Autocomment with test results is supposed to be posted to the original PR; currently, this doesn't work. I created this PR from a personal fork to debug and fix the issue.

## Summary of changes

- `scripts/comment-test-report.js`: use `pull_request.head` instead of `pull_request.base`
-
Alexander Bayandin authored
## Problem

Add a CI pipeline that checks GitHub Workflows with https://github.com/rhysd/actionlint (it uses `shellcheck` for shell scripts in steps). To run it locally: `SHELLCHECK_OPTS=--exclude=SC2046,SC2086 actionlint`

## Summary of changes

- Add `.github/workflows/actionlint.yml`
- Fix actionlint warnings
-
Em Sharnoff authored
Some VMs, when already scaled up as much as possible, end up spamming the autoscaler-agent with upscale requests that will never be fulfilled. If postgres is using memory greater than the cgroup's memory.high, it can emit new memory.high events 1000 times per second, which just means unnecessary load on the rest of the system. This changes the vm-monitor so that we skip sending upscale requests if we already sent one within the last second, to avoid spamming the autoscaler-agent. This matches the previous behavior that the vm-informant had.
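The debounce logic amounts to a timestamp check before each send. A minimal Python sketch (class and method names are illustrative, not the vm-monitor's actual API; the clock is injectable for testing):

```python
import time

class UpscaleRequester:
    """Skip upscale requests sent within the last second, so memory.high
    events firing ~1000x/s don't spam the autoscaler-agent."""

    MIN_INTERVAL_SECS = 1.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_sent = None
        self.requests_sent = 0

    def on_memory_high(self) -> bool:
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self.MIN_INTERVAL_SECS:
            return False  # debounced: upscale already requested recently
        self._last_sent = now
        self.requests_sent += 1  # stand-in for actually sending the request
        return True
```

Note that a skipped event does not update the timestamp, so a burst of events still yields at most one request per second.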
-
Em Sharnoff authored
It makes the logs too verbose. ref https://neondb.slack.com/archives/C03F5SM1N02/p1694281232874719?thread_ts=1694272777.207109&cid=C03F5SM1N02
-
- Sep 09, 2023
-
-
Alexander Bayandin authored
## Problem This PR creates a GitHub release from a release tag with an autogenerated changelog: https://github.com/neondatabase/neon/releases ## Summary of changes - Call GitHub API to create a release
-
Konstantin Knizhnik authored
See #5001. "No space" is what's expected if we're at the size limit. Of course, if the SK incorrectly returned "no space", the availability check wouldn't fire. But users would notice such a bug quite soon anyway, so ignoring "no space" is the right trade-off.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>
-
Konstantin Knizhnik authored
Refer to #5208.

## Problem

See https://neondb.slack.com/archives/C03H1K0PGKH/p1693938336062439?thread_ts=1693928260.704799&cid=C03H1K0PGKH. #5208 disables the LFC forever in case of an error. That is not good, because the problem causing the error (for example ENOSPC) can be resolved, and it would be nice to re-enable the LFC after fixing it. Also, #5208 disables the LFC locally in one backend, but other backends may still see corrupted data. This should not cause problems right now with the "permission denied" error, because there should be no backend able to normally open the LFC. But in case of an out-of-disk-space error, other backends can read corrupted data.

## Summary of changes

1. Clean up the hash table after an error to prevent access to stale or corrupted data.
2. Perform disk writes under an exclusive lock (hoping this will not affect performance, because usually a write just copies data from user to system space).
3. Use generations to prevent access to stale data in `lfc_read`.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
-
- Sep 08, 2023
-
-
Joonas Koivunen authored
I forgot a `str(...)` conversion in #5243. This led to log lines such as:
```
Using fs root 'PosixPath('/tmp/test_output/test_backward_compatibility[debug-pg14]/compatibility_snapshot/repo/local_fs_remote_storage/pageserver')' as a remote storage
```
This surprisingly works, creating a hierarchy under the current working directory (`repo_dir` for tests):
- `PosixPath('`
- `tmp` .. up until .. `local_fs_remote_storage`
- `pageserver')`

It should not work, but right now the test_compatibility.py tests find local metadata and layers, which end up being used. After #5172, when remote storage is the source of truth, it will no longer work.
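The mechanics of the bug can be reproduced in a few lines (the path here is shortened): formatting a `PosixPath` with `str()` yields the plain path, but formatting its repr leaks the `PosixPath('...')` wrapper into the value, which is then treated as a relative path starting with the literal directory `PosixPath('`.

```python
from pathlib import PosixPath

p = PosixPath("/tmp/local_fs_remote_storage/pageserver")

# With the str(...) conversion: the plain path, as intended.
assert str(p) == "/tmp/local_fs_remote_storage/pageserver"

# Without it: the repr leaks the wrapper into the config value.
assert repr(p) == "PosixPath('/tmp/local_fs_remote_storage/pageserver')"
print(f"Using fs root '{p!r}' as a remote storage")
```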
-
Heikki Linnakangas authored
It includes PostgreSQL 16 support.
-
Joonas Koivunen authored
Switches everyone without a `rustup override` to 1.72.0. The code changes required were already done in #5255. Depends on https://github.com/neondatabase/build/pull/65.
-
Alexander Bayandin authored
## Problem A bunch of fixes for different test-related things ## Summary of changes - Fix test_runner/pg_clients (`subprocess_capture` return value has changed) - Do not run create-test-report if check-permissions failed for not cancelled jobs - Fix Code Coverage comment layout after flaky tests. Add another healing "\n" - test_compatibility: add an instruction for local run Co-authored-by: Joonas Koivunen <joonas@neon.tech>
-
John Spray authored
## Problem

Currently our testing environment only supports running a single pageserver at a time. This is insufficient for testing failover and migrations.
- Dependency of writing tests for #5207

## Summary of changes

- `neon_local` and `neon_fixture` now handle multiple pageservers
- This is a breaking change to the `.neon/config` format: any local environments will need recreating
- Existing tests continue to work unchanged:
  - The default number of pageservers is 1
  - `NeonEnv.pageserver` is now a helper property that retrieves the first pageserver if there is only one, else throws.
- Pageserver data directories are now at `.neon/pageserver_{n}` where n is 1,2,3...
- Compatibility tests get some special casing to migrate neon_local configs: these are not meant to be backward/forward compatible, but they were treated that way by the test.
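The compatibility property described above can be sketched like this (a simplified stand-in for the real test fixture, not its actual code):

```python
class NeonEnv:
    """Simplified model: holds one entry per .neon/pageserver_{n}."""

    def __init__(self, pageservers):
        self.pageservers = list(pageservers)

    @property
    def pageserver(self):
        """Back-compat accessor: only valid for single-pageserver envs."""
        if len(self.pageservers) != 1:
            raise RuntimeError(
                "env has multiple pageservers; index env.pageservers explicitly"
            )
        return self.pageservers[0]
```

Existing single-pageserver tests keep using `env.pageserver` unchanged, while multi-pageserver tests must say which one they mean.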
-
Konstantin Knizhnik authored
See https://neondb.slack.com/archives/C03H1K0PGKH/p1692550646191429

## Problem

Concurrent index build writes WAL outside a transaction. `backpressure_throttling_impl` doesn't perform throttling for read-only transactions (no assigned XID). This causes a huge write lag, which can cause a large delay in accessing the table.

## Summary of changes

Look at `PROC_IN_SAFE_IC` in the process state, set during concurrent index build.

---------

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Heikki Linnakangas <heikki@neon.tech>
-
Joonas Koivunen authored
Prepare to upgrade the rust version to the latest stable.
- `rustfmt` has learned to format `let irrefutable = $expr else { ... };` blocks
- There's a new warning about the virtual (workspace) crate resolver; picked the latest resolver, as I suspect everyone would expect it to be the latest. It should not matter anyway.
- Some new clippies, which seem alright
-
Joonas Koivunen authored
Remote storage cleanup split from #5198:
- pageserver, extensions, and safekeepers now have their separate remote storage
- `RemoteStorageKind` has the configuration code
- `S3Storage` has the cleanup code
- with MOCK_S3, pageserver, extensions, and safekeepers use different buckets
- with LOCAL_FS, `repo_dir / "local_fs_remote_storage" / $user` is used as the path, where $user is `pageserver`, `safekeeper`
- no more `NeonEnvBuilder.enable_xxx_remote_storage`, but one `enable_{pageserver,extensions,safekeeper}_remote_storage`

Should not have any real changes. These will allow us to default to `LOCAL_FS` for pageserver in the next PR, remove `RemoteStorageKind.NOOP`, and work towards #5172. Co-authored-by: Alexander Bayandin <alexander@neon.tech>
-
Heikki Linnakangas authored
This includes PostgreSQL 16 support. There are no catalog changes, so this is a drop-in replacement, no need to run "ALTER EXTENSION UPDATE".
-
Heikki Linnakangas authored
This brings v16 support.
-
Alexander Bayandin authored
## Problem

`test_runner/performance/test_startup.py::test_startup` started to fail more frequently because of the timeout. Let's increase the timeout to see the failures on the perf dashboard.

## Summary of changes

- Increase the timeout for `test_startup` from 600 to 900 seconds
-
Heikki Linnakangas authored
The v1.4.0 release includes changes to make it compile with PostgreSQL 16. The commit log doesn't call it out explicitly, but I tested it manually. v1.4.0 includes some new functions, but I tested manually that the v1.3.1 functionality works with the v1.4.0 version of the library. That means this doesn't break existing installations. Users can run "ALTER EXTENSION hypopg UPDATE" if they want to use the new v1.4.0 functionality, but they don't have to.
-
- Sep 07, 2023
-
-
Heikki Linnakangas authored
This version includes trivial changes to make it compile with PostgreSQL 16. No functional changes.
-
Heikki Linnakangas authored
This includes PostgreSQL 16 support. No other changes, really. The extension version in the upstream was changed from 2.17 to 2.18, however, there is no difference between the catalog objects. So if you had installed 2.17 previously, it will continue to work. You can run "ALTER EXTENSION hll UPDATE", but all it will do is update the version number in the pg_extension table.
-
Heikki Linnakangas authored
Includes PostgreSQL v16 support. No functional changes.
-
Arpad Müller authored
## Problem

Once we use async file system APIs for `VirtualFile`, these functions will also need to be `async fn`.

## Summary of changes

Makes the functions `open`, `open_with_options`, `create`, `sync_all`, and `with_file` of `VirtualFile` `async fn`, including all functions that call them. Like in the prior PRs, the actual I/O operations are not using async APIs yet, as per request in the #4743 epic. We switch towards not using `VirtualFile` in the par_fsync module; hopefully this is only temporary until we can actually do fully async I/O in `VirtualFile`. This might cause us to exhaust fd limits in the tests, but it should only be an issue for local developers, as we have high ulimits in prod. This PR is a follow-up of #5189, #5190, #5195, and #5203. Part of #4743.
-
Heikki Linnakangas authored
This includes v16 support.
-
Heikki Linnakangas authored
It's a good idea to keep up-to-date in general. One noteworthy change is that PostGIS 3.3.3 adds support for PostgreSQL v16. We'll need that. PostGIS 3.4.0 has already been released, and we should consider upgrading to that. However, it's a major upgrade and requires running "SELECT postgis_extensions_upgrade();" in each database, to upgrade the catalogs. I don't want to deal with that right now.
-
Alexander Bayandin authored
-
Rahul Modpur authored
## Problem

`pg_config` version parsing is broken.

## Summary of changes

Use a regex to capture the major version of postgres. #5146
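The fix amounts to anchoring on the major version digits rather than assuming a fixed `major.minor` shape. An illustrative sketch (the exact pattern used in the PR may differ):

```python
import re

def pg_major_version(pg_config_output: str) -> int:
    """Extract the major version from `pg_config --version` output,
    which may be e.g. 'PostgreSQL 14.9' or 'PostgreSQL 16beta1'."""
    m = re.search(r"PostgreSQL\s+(\d+)", pg_config_output)
    if m is None:
        raise ValueError(f"cannot parse version from {pg_config_output!r}")
    return int(m.group(1))
```

Splitting on `.` alone would fail on pre-release strings like `16beta1`, which is why a regex on the leading digits is more robust.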
-
Alexander Bayandin authored
## Problem

We likely need this to support Postgres 16. It's also been asked for by a user: https://github.com/neondatabase/neon/discussions/5042. The latest version is 3.2.0, but it requires some changes in the build script (which I haven't checked, but it didn't work right away).

## Summary of changes

```
3.1.8 2023-08-01
- force v8 to compile in release mode
3.1.7 2023-06-26
- fix byteoffset issue with arraybuffers
- support postgres 16 beta
3.1.6 2023-04-08
- fix crash issue on fetch apply
- fix interrupt issue
```
From https://github.com/plv8/plv8/blob/v3.1.8/Changes
-
Alexander Bayandin authored
## Problem

We've got `approved-for-ci-run` to work, but it's still a bit rough. This PR should improve the UX for external contributors.

## Summary of changes

- `build_and_test.yml`: add a `check-permissions` job, which fails if the PR is created from a fork. Make all jobs in the workflow depend on `check-permissions` to fail fast
- `approved-for-ci-run.yml`: add a `cleanup` job to close `ci-run/pr-*` PRs and delete linked branches when the parent PR is closed
- `approved-for-ci-run.yml`: fix the layout of the `ci-run/pr-*` PR description
- GitHub Autocomment: add a comment with test results to the original PR (instead of the PR from `ci-run/pr-*`)
-
duguorong009 authored
Add a `walreceiver_state` field to `TimelineInfo` (response of `GET /v1/tenant/:tenant_id/timeline/:timeline_id`) and while doing that, refactor out a common `Timeline::walreceiver_state(..)`. No OpenAPI changes, because this is an internal debugging addition. Fixes #3115. Co-authored-by: Joonas Koivunen <joonas.koivunen@gmail.com>
-