This project is mirrored from https://github.com/cockroachdb/cockroach.
  1. Nov 28, 2018
  2. Nov 27, 2018
    • Merge pull request #32636 from rytaft/backport2.0-32614 · 85335982
      rytaft authored
      release-2.0: stats,distsqlrun: reduce memory usage of CREATE STATISTICS
    • stats,distsqlrun: reduce memory usage of CREATE STATISTICS · 452e75b8
      Rebecca Taft authored
      Prior to this patch, running CREATE STATISTICS on a large table
      could cause the memory utilization of the database to grow so large
      that it crashed the server. This was due to the way garbage collection
works in Go. As described in this blog post:
      https://blog.golang.org/go-slices-usage-and-internals,
      
      > Re-slicing a slice doesn't make a copy of the underlying array. The
      > full array will be kept in memory until it is no longer referenced.
      > Occasionally this can cause the program to hold all the data in memory
      > when only a small piece of it is needed.
      
      This is exactly what was happening with CREATE STATISTICS. The CREATE
      STATISTICS command performs a full table scan, and randomly samples
      rows in order to build a histogram for a column. When rows are scanned
      in the kv layer, they are scanned in batches of ~10,000, creating arrays
      of encoded datums of length 10,000. Since the EncDatum object passed to
      the sampler contains a slice referencing that underlying array, the
      entire scanned batch would remain in memory as long as a single datum
      from that slice was included in the sample. With random sampling, it
      is likely that most of the batches have at least one sampled row, meaning
      almost the entire table will be in memory until the histogram is built.
      
      The fix is to copy the EncDatum to the sampler with only the decoded
Datum, setting the encoded []byte slice to nil.
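
      To illustrate the pattern in Go (the struct below is a simplified,
      hypothetical stand-in for the real EncDatum, not the actual
      CockroachDB type):

        package main

        import "fmt"

        // encDatum is a simplified, hypothetical stand-in for the real EncDatum:
        // holding the encoded []byte keeps the whole scanned-batch buffer alive.
        type encDatum struct {
                encoded []byte      // slice into the large batch buffer
                datum   interface{} // decoded value
        }

        // forSampling copies only the decoded value and drops the reference to
        // the batch buffer, so retaining the sample does not retain the batch.
        func forSampling(ed encDatum) encDatum {
                return encDatum{encoded: nil, datum: ed.datum}
        }

        func main() {
                batch := make([]byte, 1<<20) // stands in for a ~10,000-row scanned batch
                row := encDatum{encoded: batch[10:20], datum: 42}

                sample := forSampling(row)
                // sample.datum is usable, but sample no longer pins `batch`, so the
                // garbage collector can reclaim it once the scan moves on.
                fmt.Println(sample.datum)
        }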
      
      Release note (bug fix): Fixed an issue where calling CREATE STATISTICS
      on a large table could cause the server to crash due to running out of
      memory.
      452e75b8
  3. Nov 26, 2018
  4. Nov 20, 2018
    • sql: catch invalid functions in the UPDATE SET RHS (v2.0.x) · f47ca677
      Raphael 'kena' Poss authored
      Prior to this patch, an invalid scalar function in the RHS of UPDATE
... SET would not be rejected during planning, causing a crash during
      execution.
      
      This patch fixes this by ensuring the bad construct is rejected during
      planning.
      
      The fix here is different from the one effected in master/v2.1.
      
      Release note (bug fix): CockroachDB now properly rejects queries
      that use an invalid function in UPDATE SET (e.g., an aggregation).
      f47ca677
  5. Nov 12, 2018
    • Merge pull request #32223 from andreimatei/backport2.0-32166 · 2fdb530a
      Andrei Matei authored
      release-2.0: storage: don't apply local results if cmd processing failed
    • storage: add a test for the clearing of local results · 09f68de1
      Andrei Matei authored
... in the case of a rejected command.
      
      Release note: None
      09f68de1
    • storage: don't apply local results if cmd processing failed · d3d40aa3
      Andrei Matei authored
      This patch fixes a bug with the application of LocalResults: they were
      applied even in cases when the processing of the respective command
      failed. So, we were evaluating a batch, proposing a command, failing to
      apply the command (in particular because of the lease index check), but
      then still applying the LocalResult even though the replicated state was
      not updated.
      This bug had catastrophic consequences: part of the LocalResult is a
      list of UpdatedTxn - this is used to ping the txn wait queue and unblock
      waiters, communicating a final txn disposition to them. Now imagine an
      EndTxn batch that evaluates to a command that fails the lease check. The
respective LocalResult is populated with our txn, updated to the
      COMMITTED disposition. The command fails to process, but we still ping
      the wait queue telling all the waiters that the transaction committed.
      The waiters, if any, now go and send ResolveIntent requests, which
      actually commit intents (thinking that the intent's txn committed). And
      so we end up with "partially committed transactions" - some of the
      writes from a txn are committed (the ones that happened to have waiters
on their intents), while others are not.
      
      In order for this bug to manifest itself, we need:
      - read/write contention on some keys
      - a command to fail processing. Generally this happens either because of
      the lease index check or because of the lease check; so either a
      Raft leadership change or a lease change can potentially cause the
      badness. In the case of a Raft leadership change, proposals can get
      dropped, which leads to reproposals, which seem to frequently(?) end up
      processing at the wrong lease index and thus be rejected (and it's the
      reproposal processing that triggers the bug). This kind of situation is
      triggered by the Jepsen register test, with various nemeses. The lease
      check failure can probably be triggered simply with lease transfers
      (e.g. lease rebalancing).
      
Interestingly, the rate of occurrence of this bug probably changed
      between 2.0 and 2.1 because of the introduction, in 2.1, of the
      QueryIntent request (and also, separately, because of increased usage of
      lease transfers; this claim was not verified though). Speaking of the
Raft leadership change scenario, once the reproposal fails to be applied
      (but the wait queue is erroneously signalled), we also re-evaluate the
      EndTransaction batch. Unless something else goes wrong, in 2.0 this
      re-evaluation was likely succeeding. In 2.1, however, it tends to fail
      if there was a waiter on our txn because the EndTxn requests are usually
      bundled with QueryIntent requests - and these QueryIntents fail during
      re-evaluation because they don't find the intent - it was, by then,
erroneously committed by a waiter through a ResolveIntent. Except for
transactions that wrote to multiple ranges: I think for those the
      QueryIntent requests are split off from the EndTransaction request by
      the DistSender, and so we're back to the situation in 2.0.
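
      The shape of the fix, sketched with simplified, hypothetical types
      (not the actual storage package code): only signal waiters when the
      command actually applied.

        package main

        import "fmt"

        // localResult is a hypothetical stand-in; in reality it carries, among
        // other things, the list of updated txns used to signal the wait queue.
        type localResult struct {
                updatedTxns []string
        }

        // applyResult only signals waiters when the command actually applied.
        func applyResult(applyErr error, lr localResult) {
                if applyErr != nil {
                        // Command was rejected (e.g. the lease index check failed):
                        // drop the LocalResult so waiters are never told the txn
                        // committed when the replicated state was not updated.
                        return
                }
                for _, txn := range lr.updatedTxns {
                        fmt.Println("signalling txn wait queue for", txn)
                }
        }

        func main() {
                applyResult(fmt.Errorf("illegal lease index"), localResult{updatedTxns: []string{"txn-1"}})
                applyResult(nil, localResult{updatedTxns: []string{"txn-2"}})
        }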
      
      Fixes #30792
      
      Release note (bug fix): Fix a bug causing transactions to appear to be
      partially committed. CRDB was sometimes claiming to have failed to
      commit a transaction but some (or all) of its writes were actually
      persisted.
      d3d40aa3
  6. Oct 25, 2018
  7. Oct 24, 2018
    • kv: avoid transaction too large error if already refreshed · b341d2a3
      Nathan VanBenschoten authored
      This fixes a bug where a txn that already refreshed its spans
      and no longer needed them could still run into a "transaction
      is too large to complete" error.
      
      This will be backported to 2.0 and 2.1.
      
      Release note (bug fix): Fix bug where transactions unnecessarily
      hit "too large" error.
      b341d2a3
  8. Oct 16, 2018
    • Merge pull request #31522 from benesch/backport2.0-31516 · ca7aaa3f
      Nikhil Benesch authored
      release-2.0: c-deps: bump CryptoPP to avoid SIGTRAP on macOS
    • c-deps: bump CryptoPP to avoid SIGTRAP on macOS · 54ea8d90
      Nikhil Benesch authored
      Bump CryptoPP to pick up a fix for cockroachdb/cockroach#31380.
      Details reproduced below.
      
      Fix #31380.
      
      ---
      
      As part of its CPU feature detection, CryptoPP installs a SIGILL signal
      handler before issuing the cpuid instruction. The intent is to
      gracefully degrade on CPUs that don't support the cpuid instruction.
      
      The problem is that it is impossible to safely overwrite a signal
      handler installed by the Go runtime in go1.10 on macOS
      (golang/go#22805). This causes CockroachDB 2.0 to crash on macOS Mojave:
      cockroachdb/cockroach#31380.
      
      The situation has improved on the Go front, as go1.11 makes it possible
      to safely save and restore signal handlers installed by the Go runtime
      on macOS.
      
      Still, we can do better and support go1.10. There is no need to bother
      installing a SIGILL handler, as the cpuid instruction is supported by
      every x86-64 CPU. We can instead use conditional compilation to make
      sure that we never execute a cpuid instruction on a non x86-64 CPU.
      
      Note that CPU feature detection is performed at executable load time
      (see the attribute(constructor) on DetectX86Features); therefore any
reference to a function which calls DetectX86Features (notably HasAESNI)
      corrupts the signal handler. It's not entirely clear why this corruption
      later leads to the SIGTRAP seen in cockroachdb/cockroach#31380--is
      something in macOS or the Go runtime generating a SIGILL and trying to
      handle it gracefully?--but regardless, not mucking with the signal
      handler fixes the issue.
      
      Release note (bug fix): CockroachDB no longer crashes due to a SIGTRAP error
      soon after startup on macOS Mojave (cockroachdb/cockroach#31380).
      
      diff --git a/c-deps/cryptopp b/c-deps/cryptopp
      index c621ce0532..6d6064445d 160000
      --- a/c-deps/cryptopp
      +++ b/c-deps/cryptopp
      @@ -1 +1 @@
      -Subproject commit c621ce053298fafc1e59191079c33acd76045c26
      +Subproject commit 6d6064445ded787c7d6bf0a021fb718fddb63345
      54ea8d90
  9. Oct 15, 2018
    • Merge pull request #31350 from tschottdorf/backport2.1-31013 · 366deb27
      Tobias Schottdorf authored
      backport-2.0: kv: try next replica on RangeNotFoundError
    • Merge pull request #31351 from tschottdorf/backport2.0-29579 · a403a481
      Tobias Schottdorf authored
      backport-2.0: storage: return one entry less in Entries
    • storage: return one entry less in Entries · 25369101
      Tobias Schottdorf authored
      This works around the bug outlined in:
      
      https://github.com/etcd-io/etcd/pull/10063
      
      by matching Raft's internal implementation of commit pagination.
      Once the above PR lands, we can revert this commit (but I assume
      that it will take a little bit), and I think we should do that
      because the code hasn't gotten any nicer to look at.
      
      Fixes #28918.
      
      Release note: None
      
      25369101
    • kv: try next replica on RangeNotFoundError · 08fd28cc
      Tobias Schottdorf authored
      Previously, if a Batch RPC came back with a RangeNotFoundError,
      we would immediately stop trying to send to more replicas, evict the
      range descriptor, and start a new attempt after a back-off.
      
      This new attempt could end up using the same replica, so if the
      RangeNotFoundError persisted for some amount of time, so would the
      unsuccessful retries for requests to it as DistSender doesn't
      aggressively shuffle the replicas.
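
      Sketched with hypothetical types (not the actual DistSender code),
      the new behavior treats a RangeNotFoundError as "try the next
      replica" rather than "give up, evict, and back off":

        package main

        import (
                "errors"
                "fmt"
        )

        // errRangeNotFound stands in for roachpb.RangeNotFoundError.
        var errRangeNotFound = errors.New("range not found on this replica")

        // sendToReplicas tries each replica in turn; a RangeNotFoundError no
        // longer aborts the attempt, it simply advances to the next replica.
        func sendToReplicas(replicas []string, send func(string) error) error {
                var lastErr error
                for _, r := range replicas {
                        err := send(r)
                        if err == nil {
                                return nil
                        }
                        if errors.Is(err, errRangeNotFound) {
                                lastErr = err
                                continue
                        }
                        return err // other errors keep their existing handling
                }
                return lastErr
        }

        func main() {
                err := sendToReplicas([]string{"n1", "n2", "n3"}, func(r string) error {
                        if r == "n1" {
                                return errRangeNotFound // e.g. replica not yet woken up
                        }
                        fmt.Println("batch succeeded on", r)
                        return nil
                })
                fmt.Println("error:", err)
        }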
      
      It turns out that there are such situations, and the
      election-after-restart roachtest spuriously hit one of them:
      
      1. new replica receives a preemptive snapshot and the ConfChange
      2. cluster restarts
      3. now the new replica is in this state until the range wakes
         up, which may not happen for some time.
      4. the first request to the range runs into the above problem
      
      @nvanbenschoten: I think there is an issue to be filed about the
      tendency of DistSender to get stuck in unfortunate configurations.
      
      Fixes #30613.
      
      Release note (bug fix): Avoid repeatedly trying a replica that was found
      to be in the process of being added.
      08fd28cc
  10. Oct 13, 2018
  11. Oct 12, 2018
  12. Oct 05, 2018
  13. Oct 01, 2018
    • Merge pull request #30823 from benesch/secure-grpc-2.0 · dfa0c881
      Nikhil Benesch authored
      release-2.0: rpc,server: authenticate all gRPC methods
    • rpc,server: authenticate all gRPC methods · d3078772
      Nikhil Benesch authored
      Previously only the roachpb.Batch RPC was correctly checking for an
      authenticated user. All other RPCs were open to the public, even when
      the server was running in secure mode.
      
      To prevent future accidents of this kind, hoist the authentication check
      to a gRPC interceptor that is guaranteed to run before all RPCs.
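
      A minimal sketch of that pattern using grpc-go; checkAuth is a
      hypothetical placeholder for the real certificate/user validation:

        package main

        import (
                "context"

                "google.golang.org/grpc"
                "google.golang.org/grpc/codes"
                "google.golang.org/grpc/status"
        )

        // checkAuth is a placeholder for the real authentication check
        // (e.g. validating the TLS client certificate).
        func checkAuth(ctx context.Context) error {
                return nil
        }

        // authInterceptor runs before every unary RPC handler, so no method
        // can be reached without passing the authentication check.
        func authInterceptor(
                ctx context.Context,
                req interface{},
                info *grpc.UnaryServerInfo,
                handler grpc.UnaryHandler,
        ) (interface{}, error) {
                if err := checkAuth(ctx); err != nil {
                        return nil, status.Error(codes.Unauthenticated, err.Error())
                }
                return handler(ctx, req)
        }

        func main() {
                // A matching StreamInterceptor would cover streaming RPCs as well.
                _ = grpc.NewServer(grpc.UnaryInterceptor(authInterceptor))
        }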
      
      Release note (bug fix): A security vulnerability in which data could be
      leaked from or tampered with in a cluster in secure mode has been fixed.
      
      Release note: None
      v2.0.6
      d3078772
  14. Sep 25, 2018
  15. Sep 19, 2018
  16. Sep 18, 2018
  17. Sep 17, 2018
    • Merge pull request #30309 from nvanbenschoten/backport2.0-29685 · 878b628b
      Nathan VanBenschoten authored
      release-2.0: kv: check transaction status before recording refresh spans
    • kv: check transaction status before recording refresh spans · 7b38b95b
      Nathan VanBenschoten authored
      Fixes #29253.
      
      We previously accounted for refresh spans before checking the
      transaction's status. This meant that we would attempt to add new
      refresh spans even if the txn had already been committed. If this
      attempt failed because the `max_refresh_spans_bytes` threshold was
      exceeded, we would erroneously return an error to the client even
      though their transaction had already been committed. The simple
      fix for this is to only record refresh spans for transactions that
      are still PENDING. The new test case would have failed with the
      error observed in #29253 without this fix.
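
      A minimal sketch of the guard with simplified, hypothetical types
      (the real limit comes from the max_refresh_spans_bytes setting):

        package main

        import "fmt"

        type txnStatus int

        const (
                statusPending txnStatus = iota
                statusCommitted
                statusAborted
        )

        type span struct{ key, endKey string }

        type refreshTracker struct {
                status txnStatus
                spans  []span
                bytes  int
                limit  int
        }

        // record only accounts for refresh spans while the txn is still PENDING,
        // so a batch whose txn has already committed can no longer trip the
        // size limit and surface a spurious error to the client.
        func (t *refreshTracker) record(s span) error {
                if t.status != statusPending {
                        return nil
                }
                t.bytes += len(s.key) + len(s.endKey)
                if t.bytes > t.limit {
                        return fmt.Errorf("transaction is too large to complete")
                }
                t.spans = append(t.spans, s)
                return nil
        }

        func main() {
                t := &refreshTracker{status: statusCommitted, limit: 8}
                fmt.Println(t.record(span{key: "a", endKey: "z"})) // <nil>
        }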
      
      This raises an auxiliary question about whether we should be asserting
      that a transaction's status is not COMMITTED if we ever return an
      error to the client. cc. @andreimatei.
      
      On the positive side, this should be a minor perf win, as we'll
      avoid recording refresh spans on the final batch of every txn.
      
      A similar fix will need to be backported to 2.0. The backport
      won't be clean though because of the new txnInterceptor structure.
      However, the test shouldn't need to change.
      
      Release note (bug fix): Don't return transaction size limit errors
      for transactions that have already committed.
      7b38b95b
  18. Sep 13, 2018
  19. Sep 12, 2018
    • sqlbase: fix composite fk column checking panic · d3fd5b9b
      Bram Gruneir authored
      Prior to this patch, it was possible to skip the check that ensures that all
      columns are present in a composite fk if only the final column was present,
      thus causing a panic.
      
      Fixes #26748
      
Release note (bug fix): Fixed a panic that occurred when not all values were
      present in a composite foreign key.
      d3fd5b9b
    • sqlbase: don't cascade through NULL values · 9fe33d1e
      Bram Gruneir authored
      Prior to this fix, we would cascade updates or deletes through NULLs. See #28896
      for a great test case.
      
This did, however, remind me that we use MATCH FULL instead of MATCH SIMPLE and we
should add support for MATCH SIMPLE (and make it the default). See #20305.
      
      Closes #28896
      
      Release note (bug fix): ON DELETE/ON UPDATE CASCADE should not cascade through
      NULLs.
      9fe33d1e
  20. Sep 11, 2018