This project is mirrored from https://github.com/cockroachdb/cockroach.
  1. Mar 06, 2023
  2. Feb 14, 2023
  3. Feb 13, 2023
  4. Feb 10, 2023
    • Nick Travers's avatar
      Merge pull request #96952 from nicktrav/nickt.22.1-disk-stall · 96481c61
      Nick Travers authored
      release-22.1: backport disk-stalled roachtest changes
      96481c61
    • Nick Travers's avatar
      roachtest: unskip fuse disk-stall roachtest variant · c656cfaf
      Nick Travers authored
      The `fuse` variant of the `disk-stalled` roachtest was skipped in
      #95865.
      
      Re-enable the skipped variant, updating it to make use of our forked
      version of `charybdefs`. This fork includes a patch that allows for
      specifying a delay time for syscalls, making it possible to simulate a
      complete disk stall. Previously, delay times were limited to 50ms, which
      meant that the detection time had to be even lower (e.g. 40ms), which
      was not representative of how Cockroach is configured in practice.
      
      Allow the roachprod infrastructure to interpolate strings such as
      `{store-dir}`, etc. when provided as `ExtraArgs` or the `KeyCmd`. Fix by
      expanding all args, rather than just `ExtraArgs`.
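
      A minimal sketch of the interpolation described above, assuming a simple
      `{name}` placeholder syntax. The `expandAll` helper, the `vars` map, and
      the store path are hypothetical and are not roachprod's actual expander;
      the point is only that every argument goes through the same expansion.

      ```go
      package main

      import (
          "fmt"
          "strings"
      )

      // expandAll replaces every "{name}" placeholder in every argument.
      // Hypothetical helper: the real roachprod expander is not shown in this commit.
      func expandAll(args []string, vars map[string]string) []string {
          pairs := make([]string, 0, 2*len(vars))
          for name, value := range vars {
              pairs = append(pairs, "{"+name+"}", value)
          }
          r := strings.NewReplacer(pairs...)
          out := make([]string, len(args))
          for i, arg := range args {
              out[i] = r.Replace(arg)
          }
          return out
      }

      func main() {
          // The store path is an assumed value for illustration.
          vars := map[string]string{"store-dir": "/mnt/data1/cockroach"}
          args := []string{"start", "--store={store-dir}", "--log-dir={store-dir}/logs"}
          fmt.Println(expandAll(args, vars))
      }
      ```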
      
      Fixes #95874.
      
      Release note: None.
      c656cfaf
    • Nick Travers's avatar
      roachprod: remove string splitting logic for arguments · 61203472
      Nick Travers authored
      Currently, if `ExtraArgs` (a `[]string`) is specified for the start
      options for a cluster, and an argument in the slice contains whitespace,
      the argument will be split into sub-arguments.
      
      This results in situations where an argument intended to be interpreted
      as a literal string is split into separate arguments. E.g.
      
      ```
      ExtraArgs = []string{"'foo bar baz'"} // becomes "'foo" "bar" "baz'"
      ```
      
      Remove the string splitting logic, instead relying on callers to specify
      arguments as already "pre-split".
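
      The difference can be sketched as follows. `splitOnWhitespace` is a
      hypothetical stand-in for the removed logic (assumed here to behave like
      `strings.Fields`); after the change, callers pass one slice element per
      argument and nothing is split.

      ```go
      package main

      import (
          "fmt"
          "strings"
      )

      // splitOnWhitespace mimics the removed behavior: every argument is split
      // on whitespace, regardless of quoting. Hypothetical reconstruction.
      func splitOnWhitespace(args []string) []string {
          var out []string
          for _, a := range args {
              out = append(out, strings.Fields(a)...)
          }
          return out
      }

      func main() {
          quoted := []string{"'foo bar baz'"}
          fmt.Printf("%q\n", splitOnWhitespace(quoted)) // ["'foo" "bar" "baz'"]

          // With the splitting removed, the slice is used as given, so a
          // literal containing spaces stays a single argument.
          fmt.Printf("%q\n", quoted) // ["'foo bar baz'"]
      }
      ```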
      
      Update two existing usages of `ExtraArgs` to pre-split the arguments.
      
      Improve documentation.
      
      Release note: None.
      61203472
    • Jackson Owens's avatar
      cmd/roachtest: adapt disk-stall detection roachtest · d67aab52
      Jackson Owens authored
      Move the existing disk-stall/* roachtests under disk-stall/fuse/* (for the FUSE
      filesystem approach to stalling) and skip them for now. Currently, they're not
      capable of stalling the disk longer than 50us (see #95886), which makes them
      unreliable at exercising stalls.
      
      Add two new roachtests, disk-stall/dmsetup and disk-stall/cgroup that use
      dmsetup and cgroup bandwidth restrictions respectively to reliably induce a
      write stall for an indefinite duration.
      
      Informs #94373.
      Epic: None
      Release note: None
      d67aab52
  5. Feb 09, 2023
  6. Feb 08, 2023
    • Oliver Tan's avatar
      Merge pull request #96401 from otan-cockroach/backport22.1-96159 · 0571670c
      Oliver Tan authored
      release-22.1: roachtest/tpcc: retry prometheus query during DRT
      0571670c
    • Evan Wall's avatar
      Merge pull request #96815 from ecwall/backport22.1-96659 · 31e6e4a6
      Evan Wall authored
      release-22.1: sql: wrap stacktraceless errors with errors.Wrap
      31e6e4a6
    • Evan Wall's avatar
      sql: wrap stacktraceless errors with errors.Wrap · 7b3302ee
      Evan Wall authored
      Fixes #95794
      
      This replaces the previous attempt to add logging in #95797.
      
      The context itself cannot be augmented to add a stack trace to errors because
      it interferes with grpc timeout logic - gRPC compares errors directly without
      checking causes https://github.com/grpc/grpc-go/blob/v1.46.0/rpc_util.go#L833.
      Although the method signature allows it, `Context.Err()` should not be
      overridden to customize the error:
      ```
      // If Done is not yet closed, Err returns nil.
      // If Done is closed, Err returns a non-nil error explaining why:
      // Canceled if the context was canceled
      // or DeadlineExceeded if the context's deadline passed.
      // After Err returns a non-nil error, successive calls to Err return the same error.
      Err() error
      ```
      Additionally, a child context of the augmented context may end up being used
      which will circumvent the stack trace capture.
      
      This change instead adds `errors.Wrap` calls in a few places that might end up
      helping debug the original problem:
      1) Where we call `Context.Err()` directly.
      2) Where gRPC returns an error after possibly calling `Context.Err()`
         internally or returns an error that does not have a stack trace.
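
      The trade-off can be sketched with the standard library alone. Here
      `fmt.Errorf` with `%w` stands in for `errors.Wrap` from
      `github.com/cockroachdb/errors` (which also records a stack trace; the
      stdlib call does not), and `wrapCtxErr` is a hypothetical helper. The
      wrapped error no longer compares equal to `context.Canceled`, which is
      why the wrapping happens at the call site rather than inside
      `Context.Err()`.

      ```go
      package main

      import (
          "context"
          "errors"
          "fmt"
      )

      // wrapCtxErr wraps a context error at the call site, leaving ctx.Err()
      // itself untouched. Hypothetical helper for illustration.
      func wrapCtxErr(ctx context.Context, msg string) error {
          if err := ctx.Err(); err != nil {
              return fmt.Errorf("%s: %w", msg, err)
          }
          return nil
      }

      // compare reports how a wrapped cancellation error behaves under direct
      // comparison and under errors.Is.
      func compare() (direct bool, viaIs bool) {
          ctx, cancel := context.WithCancel(context.Background())
          cancel()
          err := wrapCtxErr(ctx, "looking up user")
          return err == context.Canceled, errors.Is(err, context.Canceled)
      }

      func main() {
          direct, viaIs := compare()
          fmt.Println(direct) // false: direct comparison no longer matches
          fmt.Println(viaIs)  // true: the cause is still reachable
      }
      ```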
      
      Release note: None
      7b3302ee
  7. Feb 07, 2023
    • Rafi Shamim's avatar
      roachtest: fix sqlalchemy version pinning · f31c4da8
      Rafi Shamim authored
      The test setup was wrong, and was always using the latest sqlalchemy.
      This fixes the pinning, and also updates to a newer version.
      
      Release note: None
      f31c4da8
    • Marcus Gartner's avatar
      Merge pull request #96732 from mgartner/backport22.1-96001 · a25810b3
      Marcus Gartner authored
      release-22.1: sql/logictest: fix flaky test in unique
      a25810b3
    • Marcus Gartner's avatar
      sql/logictest: fix flaky test in unique · 65569c6b
      Marcus Gartner authored
      This commit fixes a flaky test in the `unique` logic tests. The test
      could flake because an `UPSERT` violated two unique constraints, making
      the error message non-deterministic.
      
      Fixes #95968
      
      Release note: None
      65569c6b
    • Nathan VanBenschoten's avatar
      Merge pull request #95215 from nvanbenschoten/backport22.1-83688 · 5b17995f
      Nathan VanBenschoten authored
      release-22.1: kvcoord: heartbeat immediately to avoid being considered expired
      5b17995f
    • Alex Sarkesian's avatar
      kvcoord: heartbeat immediately to avoid being considered expired · 26377f2c
      Alex Sarkesian authored
      This changes the `txnHeartbeater` to modify when we start our heartbeat
      loop in some cases. Previously, we would start the heartbeat loop (which
      writes the transaction record) one heartbeat interval (default 1s) after
      the first request in the transaction that acquires locks. In the case
      that more than 5 heartbeat intervals have passed since the first read in
      the transaction by the time that we encounter the first locking request,
      however, any other operations that encounter the locks and attempt to
      push (before the transaction heartbeats) will consider this transaction
      to be expired. To avoid this situation, this changes the interceptor to
      heartbeat immediately if the transaction would otherwise be considered
      expired before its first heartbeat interval.
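
      A rough sketch of that decision, assuming the 1s default interval and the
      5-interval expiration window mentioned above. `firstHeartbeatDelay` and
      the exact comparison are illustrative assumptions, not the
      `txnHeartbeater` code.

      ```go
      package main

      import (
          "fmt"
          "time"
      )

      // Assumed values taken from the description above.
      const (
          heartbeatInterval = time.Second
          livenessThreshold = 5 * heartbeatInterval
      )

      // firstHeartbeatDelay returns how long to wait before the first heartbeat
      // once the transaction acquires its first lock. sinceFirstRead is the time
      // elapsed since the transaction's first read.
      func firstHeartbeatDelay(sinceFirstRead time.Duration) time.Duration {
          // If waiting one more interval would leave the transaction looking
          // expired to pushers, heartbeat right away.
          if sinceFirstRead+heartbeatInterval > livenessThreshold {
              return 0
          }
          return heartbeatInterval
      }

      func main() {
          fmt.Println(firstHeartbeatDelay(100 * time.Millisecond)) // 1s
          fmt.Println(firstHeartbeatDelay(6 * time.Second))        // 0s
      }
      ```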
      
      Release note (bug fix): Fixes a race condition where some operations
      waiting on locks can cause the lockholder transaction to be aborted
      if they occur before the transaction can write its record.
      
      Release justification: Bug fix.
      26377f2c
  8. Feb 06, 2023
    • Nathan Stilwell's avatar
      ui: Adjusting package.json to include types · 96cbdb19
      Nathan Stilwell authored
      For unknown reasons[1], when publishing Cluster UI from `release-22.1` the
      TypeScript type files are not being included in the `.tgz` file that is
      published to npm. Adding a wildcard to the `dist/` entry in the `files`
      property of the package.json seems to include them, so that change was
      made along with a version bump to publish Cluster UI when this change is
      merged.
      
      [1]: the package.json `files` property is the same on branches release-22.1
      and release-22.2, as well as the `.npmignore` and the `.gitignore`
      files. When publishing Cluster UI from branch release-22.2 the types are
      included, but publishing from release-22.1 they are not. Other factors
      considered were npm version, node version, and dependency versions.
      
      Epic: none
      
      Release note: None
      96cbdb19
    • Jackson Owens's avatar
      Merge pull request #96666 from jbowens/jackson/pebble-release-22.1-a30d64b32b0b · 81d1ddb7
      Jackson Owens authored
      release-22.1: vendor: bump Pebble to a30d64b32b0b
      81d1ddb7
    • Jackson Owens's avatar
      vendor: bump Pebble to a30d64b32b0b · 804a1e01
      Jackson Owens authored
      ```
      a30d64b3 vfs: handle concurrent directory Syncs in disk-health checking
      5dee4bea db: add Options.WithFSDefaults
      ```
      
      Epic: None
      Release note (bug fix): Fix bug where a disk stall could go undetected in rare
      circumstances where multiple goroutines sync the data directory concurrently.
      Release justification: Fix severe issue of undetected disk stall.
      804a1e01
    • Jackson Owens's avatar
      Merge pull request #96369 from cockroachdb/blathers/backport-release-22.1-96145 · fb8dbeaf
      Jackson Owens authored
      release-22.1: cli: close listeners and all open connections on disk stall
      fb8dbeaf
    • Jackson Owens's avatar
      Merge pull request #96296 from jbowens/jackson/pebble-release-22.1-10f3aff6757a · d4f1e4a9
      Jackson Owens authored
      release-22.1: vendor: bump Pebble to 10f3aff6757a
      d4f1e4a9
    • Jackson Owens's avatar
      cli: close listeners and all open connections on disk stall · a3b917a7
      Jackson Owens authored
      Disk stalls prevent a node from making progress. Any ranges for which the
      stalled node is leaseholder may also be prevented from making progress while
      the stalled node remains online but incapacitated. CockroachDB nodes detect
      stalls within their stores by timing all filesystem write operations.
      Previously, when a stall was detected, Cockroach would simply fatal the
      process. However, a process blocked on disk IO cannot be terminated. The
      process would enter the zombie state, but would be unable to be reaped.
      
      This commit adds a new step to disk stall handling, closing all open sockets.
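
      As an illustration only, the new step might look like the sketch below:
      track every listener and connection, then close them all when a stall is
      detected so that peers stop waiting on the node. `socketSet` and its
      methods are hypothetical names, and `net.Pipe` stands in for real
      sockets.

      ```go
      package main

      import (
          "fmt"
          "io"
          "net"
          "sync"
      )

      // socketSet tracks listeners and connections so they can be closed together.
      // Hypothetical type; not the structure used in the cli package.
      type socketSet struct {
          mu      sync.Mutex
          closers []io.Closer
      }

      func (s *socketSet) track(c io.Closer) {
          s.mu.Lock()
          defer s.mu.Unlock()
          s.closers = append(s.closers, c)
      }

      // closeAll closes everything tracked and returns how many were closed.
      // Close errors are ignored: on a stalled node this is best effort.
      func (s *socketSet) closeAll() int {
          s.mu.Lock()
          defer s.mu.Unlock()
          n := len(s.closers)
          for _, c := range s.closers {
              _ = c.Close()
          }
          s.closers = nil
          return n
      }

      // peerSeesClose reports whether the remote end of a tracked in-memory
      // connection observes the close as end-of-stream.
      func peerSeesClose() bool {
          var set socketSet
          server, client := net.Pipe()
          defer client.Close()
          set.track(server)
          set.closeAll()
          _, err := client.Read(make([]byte, 1))
          return err == io.EOF
      }

      func main() {
          fmt.Println("peer observed close:", peerSeesClose())
      }
      ```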
      
      Epic: None
      Release note (bug fix): Fix a bug where a node with a disk stall would continue
      to accept new connections and preserve existing connections until the disk
      stall abated.
      a3b917a7
    • Jackson Owens's avatar
      vendor: bump Pebble to 10f3aff6757a · 091982eb
      Jackson Owens authored
      ```
      10f3aff6 .github: install crlfmt@024b567c
      287ed0f1 vfs: add SyncData,SyncTo,Preallocate to vfs.File
      2c4a74ee vfs: clean up Fd functionality
      ```
      
      Release note: Fix bug whereby a stalled disk would sometimes be undetected. Now
      the stall is detected any time a filesystem write operation is observed to last
      longer than the value in the storage.max_sync_duration cluster setting.
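
      The detection rule in the release note can be sketched as a timer armed
      around each write operation. This is an assumption-laden illustration,
      not Pebble's disk-health checker: `timedWrite`, `stallDetected`, and the
      callback are hypothetical, and `maxSyncDuration` merely plays the role of
      the `storage.max_sync_duration` setting.

      ```go
      package main

      import (
          "fmt"
          "sync/atomic"
          "time"
      )

      // timedWrite runs op and invokes onStall (on another goroutine) if op is
      // still running once maxSyncDuration has elapsed.
      func timedWrite(maxSyncDuration time.Duration, op func() error, onStall func()) error {
          timer := time.AfterFunc(maxSyncDuration, onStall)
          defer timer.Stop()
          return op()
      }

      // stallDetected simulates a write that takes opMs milliseconds under a
      // limit of limitMs milliseconds and reports whether a stall was flagged.
      func stallDetected(limitMs, opMs int) bool {
          var stalled int32
          _ = timedWrite(
              time.Duration(limitMs)*time.Millisecond,
              func() error {
                  time.Sleep(time.Duration(opMs) * time.Millisecond)
                  return nil
              },
              func() { atomic.StoreInt32(&stalled, 1) },
          )
          return atomic.LoadInt32(&stalled) == 1
      }

      func main() {
          fmt.Println("slow write flagged:", stallDetected(10, 200))
          fmt.Println("fast write flagged:", stallDetected(60000, 0))
      }
      ```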
      091982eb
    • Tobias Grieger's avatar
      tracingpb: add goroutine ID to jaeger trace · a19fe03e
      Tobias Grieger authored
      This improves #96332 by including (as a tag) the goroutine ID under
      which spans are created. This allows following the trace in a Go
      execution trace if one is available.
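
      For illustration, one well-known way to obtain a goroutine ID is to parse
      the header of the current stack trace; whether `tracingpb` obtains it
      this way is not stated here. The `span` type, `startSpan`, and the `goid`
      tag name below are hypothetical.

      ```go
      package main

      import (
          "bytes"
          "fmt"
          "runtime"
          "strconv"
      )

      // goroutineID parses the current goroutine's ID from the first line of
      // its stack trace, which looks like "goroutine 18 [running]:".
      func goroutineID() int64 {
          buf := make([]byte, 64)
          buf = buf[:runtime.Stack(buf, false)]
          buf = bytes.TrimPrefix(buf, []byte("goroutine "))
          if i := bytes.IndexByte(buf, ' '); i >= 0 {
              buf = buf[:i]
          }
          id, err := strconv.ParseInt(string(buf), 10, 64)
          if err != nil {
              return -1
          }
          return id
      }

      // span is a hypothetical stand-in for a tracing span with string tags.
      type span struct {
          op   string
          tags map[string]string
      }

      // startSpan records the creating goroutine's ID as a tag.
      func startSpan(op string) span {
          return span{op: op, tags: map[string]string{
              "goid": strconv.FormatInt(goroutineID(), 10),
          }}
      }

      func main() {
          done := make(chan span)
          go func() { done <- startSpan("child") }()
          fmt.Println(startSpan("parent").tags, (<-done).tags)
      }
      ```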
      
      Epic: none
      Release note: None
      a19fe03e
  9. Feb 03, 2023
  10. Feb 02, 2023
    • Evan Wall's avatar
      Revert "sql: improve stack trace for get-user-timeout timeouts" · f40002a2
      Evan Wall authored
      This reverts commit 60e39266.
      
      Fixes #96167
      
      Reverting due to failing test test_quit.tcl.
      
      Release note: None
      f40002a2
    • Nathan Stilwell's avatar
      workflow: Cluster UI publishing workflow test · d31f2453
      Nathan Stilwell authored
      - Using `actions/setup-node@v3`, a registry-url needs to be specified (one
        isn't defaulted and I'm unsure if npm publish will fall back to
        `registry.npmjs.org`, so better safe than sorry) as well as
        supplying the environment variable `NODE_AUTH_TOKEN` rather than
        `NPM_TOKEN` (which npm uses by default, but will be overridden by an
        `.npmrc`).
      - Adding a check for an existing tag.
      - Adding a "files" property to the package.json of Cluster UI to ensure
        that all the types are included in the publish.
      - ui: bumping Cluster UI version to trigger a publish
      
      Epic: none
      
      Release note: None
      d31f2453
    • Renato Costa's avatar
      roachtest: allow TC_BUILDTYPE_ID to be accessible by Docker · fe0d7199
      Renato Costa authored
      In #81103, the process of generating TeamCity links in test failure
      reports started relying on the `TC_BUILDTYPE_ID` environment
      variable. While that variable was added to TeamCity builds, it was not
      being passed down to Docker where the tests actually run. As a result,
      links generated by the GitHub poster were broken (see, for example, #81572).
      
      This commit makes `TC_BUILDTYPE_ID` accessible by Docker for every
      build that was already passing `TC_BUILD_BRANCH`. This should be
      sufficient to cover all existing cases and more, in case having
      access to this variable becomes useful in the future.
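
      As a hedged illustration of the idea, the sketch below builds
      `docker run` flags that forward selected host variables into the
      container (`-e NAME` with no value passes the host's value through). The
      `dockerEnvArgs` helper is hypothetical; the actual change may well live
      in build scripts rather than Go code.

      ```go
      package main

      import (
          "fmt"
          "os"
      )

      // dockerEnvArgs returns "-e NAME" flag pairs for every named variable
      // that lookup reports as set. Hypothetical helper for illustration.
      func dockerEnvArgs(names []string, lookup func(string) (string, bool)) []string {
          var args []string
          for _, name := range names {
              if _, ok := lookup(name); ok {
                  args = append(args, "-e", name)
              }
          }
          return args
      }

      func main() {
          names := []string{"TC_BUILD_BRANCH", "TC_BUILDTYPE_ID"}
          // Only variables set in the host environment are forwarded.
          fmt.Println(dockerEnvArgs(names, os.LookupEnv))
      }
      ```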
      
      Release note: None
      fe0d7199