Commits · scylla-1.5.0 · Synced / Scylladb

This project is mirrored from https://github.com/scylladb/scylladb.

Dec 21, 2016
- release: prepare for 1.5.0 · 654919cb
  Pekka Enberg authored 8 years ago
  
  scylla-1.5.0
  
  654919cb
Dec 20, 2016

tests: commitlog: Fix assumption about write visibility · 0d0e53c5

Tomasz Grabiec authored 8 years ago

The test assumed that mutations added to the commitlog are visible to
reads as soon as a new segment is opened. That's not true because
buffers are written back in the background, and a new segment may be
active while the previous one is still being written or not yet
synced.

Fix the test so that it expects that the number of mutations read
this way is <= the number of mutations read, and that after all
segments are synced, the number of mutations read is equal.

Message-Id: <1481630481-19395-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit fe6a70db)

0d0e53c5

Dec 19, 2016

commitlog: correctly report requests blocked · 99d9b4e7

Glauber Costa authored 8 years ago


The semaphore future may be unavailable for many reasons. Specifically,
if the task quota is depleted right between sem.wait() and the .then()
clause in get_units() the resulting future won't be available.

That is particularly visible if we decrease the task quota, since those
events will be more frequent: we can in those cases clearly see this
counter going up, even though there aren't more requests pending than
usual.

This patch improves the situation by replacing that check. We now verify
whether or not there are waiters in the semaphore.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <113c0d6b43cd6653ce972541baf6920e5765546b.1481222621.git.glauber@scylladb.com>
(cherry picked from commit 9b5e6d6b)

99d9b4e7

Dec 18, 2016
- release: prepare for 1.5.rc3 · e2790748
  Pekka Enberg authored 8 years ago
  
  scylla-1.5.rc3
  
  e2790748
Dec 16, 2016

Merge branch 'virtual-dirty-fixes-1.5-backport' from... · e82324fb

Tomasz Grabiec authored 8 years ago

Merge branch 'virtual-dirty-fixes-1.5-backport' from git@github.com:glommer/scylla.git into branch-1.5

Rework dirty memory hierarchy from Glauber.

e82324fb

config: get rid of memtable_total_space · 1ae62678

Glauber Costa authored 8 years ago


Those values are now statically set.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 2aa65146)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

1ae62678

database: rework dirty memory hierarchy · 09a463fd

Glauber Costa authored 8 years ago


Issue #1918 describes a problem, in which we are generating smaller
memtables than we could, and therefore not respecting the flush
criteria.

That happens because group sizes (and limits) for pressure purposes, and
the soft threshold is currently at 40 %. This causes system group's
soft threshold to be way below regular's virtual dirty limit and close
to regular group's soft threshold. The system group was very likely to
become under soft pressure when regular was because writes to regular
group are not yet throttled when they cross both soft thresholds.

This is a direct consequence of the linear hierarchy between the regions
and to guarantee that it won't happen we would have to acquire the semaphore
of all ancestor regions when flushing from a child region. While that
works, it can lead to problems on its own, like priority inversion if
the regions have different priorities - like streaming and regular, and
groups lower in the hierarchy, like user, blocking explicit flushes
from their ancestors.

To fix that, this patch reorganizes the dirty memory region groups so
that groups are now completely independent. As a disadvantage, when
streaming happens we will draw some memory from the cache, but we will
live with it for the time being.

Fixes #1918

[ glauber: fix conflicts in memtable.cc due to lack of graceful clear ]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 80440c0d)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

09a463fd

system keyspace: write batchlog mutation in user memory · 34713638

Glauber Costa authored 8 years ago


Batchlog is a potentially memory-intensive table whose workload is
driven by user needs, not system's. Move it to the user dirty memory
manager.

[ glauber: fix conflict with virtual readers ]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit db7cc3cb)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

34713638

database: remove friendship declaration · 8680174f

Glauber Costa authored 8 years ago


Not needed anymore since memtable started having a direct pointer to the
memtable list.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 2e8c7d2c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

8680174f

database: simplify flush_one · 261b67f4

Glauber Costa authored 8 years ago


flush_one has to make sure that we're using the correct
dirty_memory_manager object, because we could be flushing from a region
group different than the one the flush request originated.

It's simpler to just assume flush_one will be dealing with the right
object, and use a different object instead of "this" when calling it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit bb1509c2)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

261b67f4

database: make memtable_list aware in cases it can't flush · bb173e3e

Glauber Costa authored 8 years ago


Some of our CFs can't be flushed. Those are the ones that are not marked
as having durable writes. We treat them just the same from the point of
view of the flush logic, but they provide a function that doesn't do
anything and just returns right away.

We already had troubles with that in the past, and that also poses a
problem for an upcoming patch reworking the flush memtable pick
criteria.

It's easier, simpler, and cleaner, to just make the memtable_list aware
it can't flush. Achieving that is also not very complicated: we just
need a special constructor that doesn't take a seal function and then we
make sure that it is initialized to an empty std::function.
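
The approach described above can be sketched in plain C++. This is a minimal illustration under stated assumptions, not Scylla's actual code: `memtable_list_sketch`, `seal_fn`, `may_flush()` and `request_flush()` are hypothetical names, and the seal function here runs synchronously, whereas a real one would be asynchronous.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a memtable list; all names are illustrative.
class memtable_list_sketch {
public:
    using seal_fn = std::function<void()>;

    // Flushable list: the caller supplies the function that seals a memtable.
    explicit memtable_list_sketch(seal_fn fn) : _seal_fn(std::move(fn)) {
        assert(_seal_fn && "a flushable list needs a real seal function");
    }

    // Non-flushable list (for example, no durable writes): the seal
    // function is left as an empty std::function.
    memtable_list_sketch() = default;

    // An empty std::function converts to false, so no dummy callback
    // that "does nothing and returns right away" is required.
    bool may_flush() const { return static_cast<bool>(_seal_fn); }

    // Returns true only if a flush was actually performed.
    bool request_flush() {
        if (!may_flush()) {
            return false;
        }
        _seal_fn();
        return true;
    }

private:
    seal_fn _seal_fn;
};
```

With this shape, the flush logic can ask the list whether it can flush instead of calling a no-op function and treating the result as a completed flush.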

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 8ab7c04c)
Signed-off-by: Glauber Costa <glauber@scylladb.com>

bb173e3e

Dec 12, 2016

database: move reversion of virtual dirty state closer to update_cache. · 9688dca8

Glauber Costa authored 8 years ago


When we finish writing a memtable, we revert the dirty memory charges
immediately. When we do that, dirty memory will grow back to what it
was, and soon (we hope) will go down again when we release the requests
for real.

During that time, we may not accept new requests. Sealing can take a
long time, especially in the face of Linux issues like the ones we have
seen in the past. It also will take proportionally more time if the
SSTables end up being small, which is a possibility in some scenarios.

This patch changes the dirty_memory_manager so that the charges won't be
reverted right after we finish the flush. Rather, we will hold on to it,
and revert it right before we update the cache. We don't need to do it
for all classes of memtable writes, because after we finish flushing,
flush_one() will destroy the hashed element anyway.

[tgrabiec: conflicts]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <2d5a8f6ca57d5036f4850ac163557bca59b8063d.1480004384.git.glauber@scylladb.com>
(cherry picked from commit c32803f2)

9688dca8

Dec 11, 2016

lz4: Conditionally use LZ4_compress_default() · 549c9790

Duarte Nunes authored 8 years ago


Since not all distributions have a version of LZ4 with
LZ4_compress_default(), we use it conditionally.

This is especially important beginning with version 1.7.3 of LZ4,
which deprecates the LZ4_compress() function in favour of
LZ4_compress_default() and thus prevents Scylla from compiling
due to the deprecation warning.

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161124092339.23017-1-duarte@scylladb.com>
(cherry picked from commit cc3f26c9)

549c9790

Update seastar submodule · 631d9217

Avi Kivity authored 8 years ago

* seastar 386ccd9...bd9eda1 (1):
  > rpc: Conditionally use LZ4_compress_default()

631d9217

Dec 09, 2016

database: try to acquire semaphore before we start flush · 0a341b40

Glauber Costa authored 8 years ago


As Tomek pointed out, as we are starting the flush before we acquire the
semaphore, we are not really limiting parallelism, but only delaying the
end of the flush instead.

Fixes #1919

[tgrabiec: conflicts]

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <6cbf9ec2f3a341c76becf94f794cfa16539c5192.1481120410.git.glauber@scylladb.com>
(cherry picked from commit 733d87fc)

0a341b40

Dec 08, 2016

sstables: fix probe with Unknown component · 182f67cf

Avi Kivity authored 8 years ago

Commit 53b7b7de ("sstables: handle unrecognized sstable component")
ignores unrecognized components, but misses one code path during probe_file().

Ignore unrecognized components there too.

Fixes #1922.
Message-Id: <20161208131027.28939-1-avi@scylladb.com>

(cherry picked from commit 872b5ef5)

182f67cf

Dec 07, 2016

commitlog: Fix replay to not delete dirty segments · dc08cb46

Tomasz Grabiec authored 8 years ago

The problem is that replay will unlink any segments which were on disk
at the time the replay starts. However, some of those segments may
have been created by current node since the boot. If a segment is part
of reserve for example, it will be unlinked by replay, but we will
still use that segment to log mutations. Those mutations will not be
visible to replay after a crash though.

The fix is to record preexisting segments before any new segments will
have a chance to be created and use that as the replay list.
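
The fix can be illustrated with a toy model in standard C++. This is only a sketch under assumptions: the directory is modelled as a set of file names, and `segment_dir`, `list_preexisting_segments()` and `replay_and_unlink()` are hypothetical names, not the commitlog's API.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model of a commitlog directory: just a set of segment file names.
using segment_dir = std::set<std::string>;

// Snapshot the segments present at boot. This has to run before the
// commitlog gets a chance to create reserve or active segments.
std::vector<std::string> list_preexisting_segments(const segment_dir& dir) {
    return std::vector<std::string>(dir.begin(), dir.end());
}

// Replay, then unlink, only the segments named in the snapshot. Segments
// created after the snapshot may still be in use and are left alone.
void replay_and_unlink(segment_dir& dir,
                       const std::vector<std::string>& replay_list) {
    for (const auto& name : replay_list) {
        // A real implementation would read and apply the mutations here.
        dir.erase(name);
    }
}
```

If the snapshot is taken first, a segment added to the reserve afterwards is not in the replay list and therefore survives the unlink step.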

Introduced in abe73587.

dtest failure:

 commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup

Message-Id: <1481117436-6243-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f7197dab)

dc08cb46

Dec 06, 2016

systemd: reset housekeeping timer at each start · 06db918d

Amos Kong authored 8 years ago


Currently the housekeeping timer is not reset when we restart scylla-server.
We expect the service to run at each start, which is consistent with the
upstart script in Ubuntu 14.04.

When we restart scylla-server, the housekeeping timer is restarted as well,
so let's replace "OnBootSec" with "OnActiveSec".

Fixes: #1601

Signed-off-by: Amos Kong <amos@scylladb.com>
Message-Id: <a22943cc11a3de23db266c52fd476c08014098c4.1480607401.git.amos@scylladb.com>

06db918d

dist/common/systemd/scylla-housekeeping.timer: workaround to avoid crash of systemd on RHEL 7.3 · edbd25ea

Takuya ASADA authored 8 years ago

RHEL 7.3's systemd contains a known bug in timer.c:
https://github.com/systemd/systemd/issues/2632

This is a workaround to avoid hitting the bug.

Fixes #1846

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480452194-11683-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 84649030)

edbd25ea

Dec 05, 2016
- release: prepare for 1.5.rc2 · c7f7a3aa
  Pekka Enberg authored 8 years ago
  
  c7f7a3aa
Dec 01, 2016

row_cache: dummy entry does not count as partition · c014e738

Paweł Dziepak authored 8 years ago


Since the introduction of the continuity flag, the row cache contains a single dummy
entry. cache_tracker knows nothing about it so that it doesn't appear in
any of the metrics. However, cache destructor calls
cache_tracker::on_erase() for every entry in the cache including the
dummy one. This is incorrect since the tracker wasn't informed when the
dummy entry was created.

Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1478608776-10363-1-git-send-email-pdziepak@scylladb.com>

c014e738

prevent commitlog replay position reordering during reserve refill · abe73587

Glauber Costa authored 8 years ago


When requests hit the commitlog, each of them will be assigned a replay
position, which we expect to be ordered. If reorders happen, the request
will be discarded and re-applied. Although this is supposed to be rare,
it does increase our latencies, especially when big requests are
involved. Processing big requests is expensive and if we have to do it
twice that adds to the cost.

The commitlog is supposed to issue replay positions in order, and it
could be that the code that adds them to the memtables will reorder
them. However, there is one instance in which the commitlog will not
keep its side of the bargain.

That happens when the reserve is exhausted, and we are allocating a
segment directly at the same time the reserve is being replenished.  The
following sequence of events with its deferring points will illustrate
it:

on_timer:

    return this->allocate_segment(false). // defer here // then([this](sseg_ptr s) {

At this point, the segment id is already allocated.

new_segment():

    if (_reserve_segments.empty()) {
	[ ... ]
        return allocate_segment(true).then ...

At this point, we have a new segment that has an id that is higher than
the previous id allocated.

Then we resume the execution from the deferring point in on_timer():

    i = _reserve_segments.emplace(i, std::move(s));

The next time we need to allocate a segment, we'll pick it from the
reserve. But the segment in the reserve has an id that is lower than the
id that we have already used.

Reorders are bad, but this one is particularly bad: because the reorder
happens with the segment id side of the replay position, that means that
every request that falls into that segment will have to be reinserted.

This bug can be a bit tricky to reproduce. To make it more common, we
can artificially add a sleep() fiber after the allocate_segment(false)
in on_timer(). If we do that, we'll see a sea of reinsertions going on
in the logs (if dblog is set to debug).

Applying this patch (keeping the sleep) will make them all disappear.
We do this by rewriting the reserve logic, so that the segments always
come from the reserve. If we draw from a single pool all the time, there
is no chance of reordering happening. To make that more amenable, we'll
have the reserve filler always running in the background and take it out
of the timer code.
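
The "single pool" argument can be shown with a small synchronous model. This is a sketch under assumptions, not the commitlog implementation: `segment_manager_sketch` and `strictly_increasing()` are hypothetical names, segment ids come from a plain counter, and the background filler is represented by an explicit `refill_reserve()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy segment manager. Ids are handed out by a monotonically increasing
// counter at allocation time, as in the scenario described above.
class segment_manager_sketch {
public:
    // Background filler: allocate a segment and park it in the reserve.
    void refill_reserve() { _reserve.push_back(_next_id++); }

    // Every active segment is drawn from the reserve in FIFO order, so
    // ids are consumed in exactly the order in which they were allocated.
    std::uint64_t new_segment() {
        if (_reserve.empty()) {
            refill_reserve();  // stands in for waiting on the filler
        }
        std::uint64_t id = _reserve.front();
        _reserve.pop_front();
        return id;
    }

private:
    std::uint64_t _next_id = 1;
    std::deque<std::uint64_t> _reserve;
};

// True if the ids were used in strictly increasing order.
bool strictly_increasing(const std::vector<std::uint64_t>& ids) {
    for (std::size_t i = 1; i < ids.size(); ++i) {
        if (ids[i] <= ids[i - 1]) {
            return false;
        }
    }
    return true;
}
```

In the buggy sequence from the message, a directly allocated segment with a higher id is used before the reserve segment with a lower id, which `strictly_increasing()` would reject. Drawing only from the FIFO reserve rules that interleaving out.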

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49eb7edfcafaef7f1fdceb270639a9a8b50cfce7.1480531446.git.glauber@scylladb.com>
(cherry picked from commit 99a5a772)

abe73587

commitlog: sync segments before acquiring semaphore on shutdown. · 0bce0197

Glauber Costa authored 8 years ago


Sync all segments before acquiring the semaphore, otherwise the wait may
have to last until the timer kicks in and pushes them down.
Note that we can't guarantee that no other requests were executed in the
mean time, so we have to sync again.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <aea019fe49820acce5d2b55dd5ec31e975b3436c.1480388674.git.glauber@scylladb.com>
(cherry picked from commit 353a4cd2)

0bce0197

tests: Fix use-after-free on commitlog · ae3b1667

Tomasz Grabiec authored 8 years ago

Only shutdown() ensures all internal processes are complete. Call it before calling clear().

Message-Id: <1480495534-2253-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit c35e18ba)

ae3b1667

Update seastar submodule · 2aa73ac1

Tomasz Grabiec authored 8 years ago

* seastar 6fd4534...386ccd9 (1):
  > queue: allow queue to change its maximum size

2aa73ac1

Update scylla-ami submodule · 261fcc1e

Avi Kivity authored 8 years ago

* dist/ami/files/scylla-ami e1e3919...d5a4397 (3):
  > scylla_install_ami: allow specify different repository for Scylla installation and receive update
  > scylla_install_ami: delete unneeded authorized_keys from AMI image
  > scylla_ami_setup: run posix_net_conf.sh when NCPUS < 8

261fcc1e

dist/ami: allow specify different repository for Scylla installation and receive update · 3a7b9d55

Takuya ASADA authored 8 years ago


This fix splits build_ami.sh --repo into three different options:
 --repo-for-install is for Scylla package installation, only valid
 during AMI construction.

 --repo-for-update will be stored at /etc/yum.repos.d/scylla.repo, to
 receive package updates on the AMI.

 --repo is both, for installation and update.

Fixes #1872

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480438858-6007-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 17ef5e63)

3a7b9d55

Nov 30, 2016

database: do not call seal directly from the streaming timer · 60d5b21e

Glauber Costa authored 8 years ago


Streaming memtables have a delayed mode where many flushes are coalesced
together into one, with the actual flush happening later and propagated
to all the previous waiters.

However, the timer that triggers the actual flush was not using the
newly introduced flush infrastructure. This was a minor problem because
those flushes wouldn't try to take the semaphore, and so we could have
many flushes going on at the same time.

What was a potential performance issue became a correctness issue when
we moved the reversal of the dirty memory accounting out of
revert_potentially_cleaned_up_memory() into remove_from_flush_manager().

Since the latter is only called through the flush infrastructure, it
simply wasn't called. So the deferral of the reversal exposed this bug.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <0d5755375bc27524b8cfb9970c76d492b14d9eea.1480522742.git.glauber@scylladb.com>
(cherry picked from commit d7256e7b)

60d5b21e

commitlog: use read ahead for replay requests · 903a323b

Glauber Costa authored 8 years ago


Aside from putting the requests in the commitlog class, read ahead
will help us go through the file faster.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 59a41cf7)

903a323b

commitlog: use commitlog priority for replay · 0174b9ad

Glauber Costa authored 8 years ago


Right now replay is being issued with the standard seastar priority.
The rationale for that at the time was that it is an early event that
doesn't really share the disk with anybody.

That is largely untrue now that we start compactions on boot.
Compactions may fight for bandwidth with the commitlog, and with such
low priority the commitlog is guaranteed to lose.

Fixes #1856

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit aa375cd3)

0174b9ad

commitlog: close file after read, and not at stop · 3b7f646f

Glauber Costa authored 8 years ago


There are other code paths that may interrupt the read in the middle
and bypass stop. It's safer this way.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <8c32ca2777ce2f44462d141fd582848ac7cf832d.1479477360.git.glauber@scylladb.com>
(cherry picked from commit 60b7d35f)

3b7f646f

commitlog: close replay file · 127152e0

Glauber Costa authored 8 years ago


Replay file is opened, so it should be closed. We're not seeing any
problems arising from this, but they may happen. Enabling read ahead in
this stream makes them happen immediately. Fix it.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
(cherry picked from commit 4d3d7747)

127152e0

Nov 29, 2016

dist/common/scripts/scylla_kernel_check: fix incorrect document URL · 80811d38

Takuya ASADA authored 8 years ago


Fixes #1871

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1480327243-18177-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 1042e401)

80811d38

Nov 27, 2016

Update seastar submodule · c6ffda7a

Avi Kivity authored 8 years ago

* seastar df471a8...6fd4534 (1):
  > Collectd get_value_map safe scan the map

Fixes #1835.

c6ffda7a

Nov 24, 2016

dist/ubuntu: increase number of open files on Ubuntu 14.04(upstart) · be9f62bd

Takuya ASADA authored 8 years ago


Follow the change of NOFILE for non-systemd environment.

Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1479975050-14907-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit ce80fb3a)

be9f62bd

dist: increase number of open files · d6ab5ff1

Glauber Costa authored 8 years ago


This limit was found to be too low for production environments. It would
be hit at boot, when we're touching a lot of files from multiple shards
before deciding that we don't need them.

Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <87bbf43da1a67f5fa6174017205c6ef8bdb0dc3d.1479829232.git.glauber@scylladb.com>
(cherry picked from commit 18b9fa3d)

d6ab5ff1

thrift: Don't apply cell limit across rows · 8a83819f

Duarte Nunes authored 8 years ago


In Thrift, SliceRange defines a count that limits the number of cells
to return from that row (in CQL3 terms, it limits the number of rows
in that partition). While this limit is honored in the engine, the
Thrift layer also applies the same limit, which, while redundant in
most cases, is used to support the get_paged_slice verb.

Currently, the limit is not being reset per Thrift row (CQL3
partition), so in practice, instead of limiting the cells in a row,
we're limiting the rows we return as well. This patch fixes that by
ensuring the limit applies only within a row/partition.
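
A minimal sketch of the corrected behaviour, assuming a simplified in-memory model rather than the real Thrift layer: `partition_sketch` and `apply_cell_limit()` are hypothetical names, and cells are plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy partition (a Thrift "row"): a key plus its cells.
struct partition_sketch {
    std::string key;
    std::vector<std::string> cells;
};

// Apply the per-row cell limit. The remaining budget is reset for every
// partition, so it never carries over and trims or drops later rows.
std::vector<partition_sketch> apply_cell_limit(
        const std::vector<partition_sketch>& input, std::size_t cell_limit) {
    std::vector<partition_sketch> out;
    for (const auto& p : input) {
        std::size_t remaining = cell_limit;  // reset per partition
        partition_sketch trimmed{p.key, {}};
        for (const auto& c : p.cells) {
            if (remaining == 0) {
                break;
            }
            trimmed.cells.push_back(c);
            --remaining;
        }
        out.push_back(std::move(trimmed));
    }
    return out;
}
```

Had `remaining` been declared outside the outer loop, the first partition would consume the whole budget and later partitions would come back empty, which matches the behaviour the patch describes as incorrect.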

Fixes #1882

Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20161123220001.15496-1-duarte@scylladb.com>
(cherry picked from commit a527ba28)

8a83819f

dist/docker: Actually use 1.5... · 44249e4b

Pekka Enberg authored 8 years ago

Fix typo in the RPM repository URL to actually use 1.5.

44249e4b

dist/docker: Use Scylla 1.5 RPM repository · 33c3a7e7

Pekka Enberg authored 8 years ago

33c3a7e7

Nov 23, 2016

Update seastar submodule · de5327a4

Tomasz Grabiec authored 8 years ago

* seastar 25137c2...df471a8 (1):
  > semaphore_units: add missing return statement

de5327a4