Performance collapses with 10 million pictures in a bucket #851
Issue
Once ~4 million objects are uploaded, PUT requests take more than 10 seconds even when the cluster is idle, even on empty buckets, even for 2KB files. GET requests do not seem to be impacted (they all complete in less than a second).
Expected
PUT requests should take less than a second to execute, and probably less than 200ms considering the cluster spec. In our performance benchmark, we were able to do ~100 PUTs/second without any issue.
Cluster spec
ZFS is used to store both metadata and data.
LMDB is used for the metadata engine.
Garage runs in a VM on a powerful server; it has 4 real CPUs (not vCPUs) assigned and 8GB of RAM.
Configuration has been reviewed, it is standard (fsync disabled, 1MB chunks); a sketch of such a configuration is shown after this list.
There are no repair tasks running; everything is correctly synchronized.
The team running the cluster is knowledgeable and currently uses Riak KV to store these 10M objects without any issue.
During debugging, we noted that the metadata database reached 50GB (for 80GB of data chunks), i.e. around 7KB per entry; we are not sure whether that is intended.
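Purely as an illustration (this is not the reporter's actual file), a "standard" configuration of this kind might look roughly like the sketch below; paths, secrets and addresses are placeholders, and key names vary slightly between Garage versions.

```toml
# Hypothetical garage.toml sketch matching the description above
# (fsync disabled, 1MB chunks, LMDB metadata engine).
metadata_dir = "/var/lib/garage/meta"
data_dir = "/var/lib/garage/data"
db_engine = "lmdb"

metadata_fsync = false
data_fsync = false
block_size = 1048576          # 1MB chunks

replication_mode = "3"        # "replication_factor = 3" on recent versions

rpc_bind_addr = "[::]:3901"
rpc_secret = "<redacted>"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"
root_domain = ".s3.garage"
```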
Workload spec
~11 million pictures of ~1MB each.
Migration is done by sending batches of 4 pictures in parallel.
At the beginning, each batch finishes within a few seconds.
After a while, it takes more than 60 seconds and hits an internal timeout.
We then ran manual tests on the idle cluster, sending various small files that each took ~10 seconds to upload.
Investigations so far
We extracted a trace for a `PUT` and a `GET` request. For the `PUT` request, the performance hit is due to a slow `try_write_many_sets`, waiting a long time for the other nodes of the network. The corresponding call for `GET` is fast however. We can't investigate further currently, as we are "crossing a network boundary" and we have not implemented "distributed tracing" (mainly sending the trace/span id over the network).

Based on this observation, we suppose that the metadata engine is slow. Indeed, the network and the cluster are idle, so we have no bottleneck or buffer bloat/queuing issue here, and additionally, GET requests have no issue. Furthermore, the slow delays are observed for metadata writes (writing an entry in the bucket list, a new object version, etc.) and thus are not limited to the block manager.
Today, we have no OpenTelemetry metrics to measure the responsiveness of the metadata engine, the length of our network queues, the number of bytes written, the amount of data sent and received, etc.
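As a minimal sketch of the kind of instrumentation that is missing, one could time each metadata-engine operation and feed the duration into whatever metrics pipeline is available; `record_histogram` below is a hypothetical placeholder, not an existing Garage or OpenTelemetry function.

```rust
use std::time::Instant;

/// Hypothetical metrics hook: a real implementation would feed an
/// OpenTelemetry histogram; here it just prints the value.
fn record_histogram(name: &str, value_ms: f64) {
    eprintln!("metric {name}: {value_ms:.3} ms");
}

/// Wrap a metadata-engine operation and record how long it took.
fn timed<T, E>(op_name: &str, op: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
    let start = Instant::now();
    let res = op();
    record_histogram(op_name, start.elapsed().as_secs_f64() * 1000.0);
    res
}

fn main() {
    // Example usage with a dummy "insert" that always succeeds.
    let _ = timed("lmdb_insert_ms", || Ok::<u64, ()>(42));
}
```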
We were also not able to explore the LMDB database to see if we have some "stale" entries or things like that.
Investigations to conduct
Using the `garage` command, with:

- `OBJ_SIZE=1048576` - 1MB objects
- `BATCH_SIZE=16` - batch_size is multiplied by THREAD; each thread will send BATCH_SIZE PUT requests before reporting/synchronizing
- `THREAD=64` - should be called "goroutine": the number of parallel requests
- `BATCH_COUNT=10000` - we report every ~1k objects; to send 10M objects, we need 10k iterations
- `SSL`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `SSL_INSECURE`, `ENDPOINT` - must/might be configured

Summary of my tests on Grid 5k
On August 15th, I ran some tests on the grappe cluster of the Grid5k project.
Test setup
Relevant observations
full dashboard (different time span)
Docker compose + prometheus conf to explore data yourself
compose.yml:
prometheus.yml
Analysis
Garage ran in 2 different regimes:
Oscillations could be due to how the benchmark tool (`warp`) is implemented.

Summary of my tests on a Scaleway DEV1-S cluster
Servers have 2vCPU, 2GB of RAM, 50GB of SAN SSD storage. They run Ubuntu 24.04 with Garage on it.
Scaleway is broken, so my benchmarking node has been shut down numerous times (but not my Garage cluster): that's why the data has holes.
The test command is:
Reproducing performance collapse.
I analyzed a run going from 0 to 1.5 million objects:
Here also, performance collapses as soon as the LMDB database is larger than RAM.
Let's zoom in on the part where we go from 1.5M objects to 1.6M objects:
We go from ~1k req/sec to 40 req/sec (25 times slower). However, we don't see oscillations like on Grid5k.
Disabling read-ahead improves performance by at least 10x
It appears that read-ahead is activated by default on Linux. When fetching a page from disk, Linux also fetches the next 15 pages. It's meant to improve sequential reads on hard drives, but we have neither a hard drive nor sequential reads here, so it only hurts our performance.
LMDB has an option to deactivate read-ahead on Linux.
It's named `MdbNoRdAhead` in heed, the Rust wrapper we use for LMDB; I did a simple commit to activate it. Incidentally, activating this option when your database is larger than your memory is recommended by the LMDB author.
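For illustration, here is a minimal sketch of how the flag can be set when opening the environment with heed; the exact builder API depends on the heed version, so treat the method and flag usage as an assumption rather than a verbatim excerpt from Garage.

```rust
// Sketch only: opening an LMDB environment with read-ahead disabled via heed
// (heed 0.11-style API; names may differ in other heed versions).
use heed::EnvOpenOptions;

fn open_env(path: &std::path::Path) -> heed::Result<heed::Env> {
    let mut builder = EnvOpenOptions::new();
    builder.max_dbs(100);
    builder.map_size(1usize << 40); // large virtual map, adjust to your needs
    unsafe {
        // Disable OS read-ahead: recommended when the DB is larger than RAM.
        builder.flag(heed::flags::Flags::MdbNoRdAhead);
    }
    builder.open(path)
}
```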
So I did the same test on the same hardware but with this flag activated:
It started at the same speed, but once we reached the threshold of the RAM size, performance did not collapse as it did in the previous run. I was able to push Garage to 10M objects without any major issue. At 10M objects, the LMDB database takes ~33GB on the filesystem.
Let's zoom in around 1.5 million objects to see how it behaves:
We are putting objects at ~700 req/sec. Compared to the previous run at 40 req/sec, we are ~17x faster. And if we look around 10M objects, we are still at ~400 req/sec, in other words ~10x faster.
Discussion
About mmap. There are discussions about whether or not mmap is a good idea. There is the old debate: 1975 programming vs 2006 programming. More recently, scholars have published a paper where they heavily discourage DBMS implementers from using mmap. With these benchmarks, we have learnt that LMDB really has "2 regimes": one when the whole database fits in RAM and one when it no longer does, and when it does not fit anymore, it is very sensitive to its environment. Like Riak, the metadata engine is abstracted behind an interface in Garage: it's not hard to implement other engines to see how they perform. Trying an LSM-tree-based engine, like RocksDB or the Rust-native fjall, could be interesting. A more exhaustive list of possible metadata engines exists.
About ZFS. This whole discussion started by mentioning ZFS. While our benchmarks showed suboptimal performance, none of them were as abysmal as observed by the people reporting the initial issue. By digging on the Internet, it appears that ZFS has a 128kb record size (i.e. page size?) by default, and it is known to perform poorly with LMDB, as reported by a netizen trying to use Monero over ZFS (Monero records its blockchain in LMDB). Checking the ZFS record size and setting it to 4096 (4k, to match the LMDB page size) apparently drastically increases performance.
About Merkle Todo. It appears that on some nodes the "Merkle Todo" grows without any bound. Garage has a worker that asynchronously builds a Merkle tree that is then used for healing. New items to put in this Merkle tree are placed in a "todo queue" that is also persisted in the metadata engine. In the end, it's not a huge issue that this queue grows: it only means that the healing mechanism will not be aware of the most recent items until some time later (delaying the repair logic a little bit). However, because the queue is also managed by the LMDB database, we may in some way "double" the size of the LMDB database (by having one entry in the object table and one entry in the Merkle queue). Also, keeping a queue in LMDB requires many writes & removes, which is intensive. Maybe LMDB is not the best tool to manage this queue? We might also want to slow down the RPC if we are not able to process this queue fast enough, to allow some backpressure in Garage (and in the same way, if too many RPCs accumulate on a node, slow down the responses to the API and/or return an "overloaded" error).
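As a minimal, hypothetical sketch of such backpressure (the `AdmitError` type and the thresholds are illustrative, not Garage's actual API):

```rust
use std::time::Duration;

/// Hypothetical error type, for illustration only (could map to HTTP 503).
#[derive(Debug)]
enum AdmitError {
    Overloaded,
}

const SOFT_LIMIT: usize = 100_000;   // start slowing writes down
const HARD_LIMIT: usize = 1_000_000; // refuse writes outright

/// Decide whether to admit a new write RPC based on the Merkle todo length.
async fn admit_write(merkle_todo_len: usize) -> Result<(), AdmitError> {
    if merkle_todo_len > HARD_LIMIT {
        return Err(AdmitError::Overloaded);
    }
    if merkle_todo_len > SOFT_LIMIT {
        // Gentle backpressure: let the Merkle worker catch up a little.
        tokio::time::sleep(Duration::from_millis(50)).await;
    }
    Ok(())
}
```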
Explore the data
All the collected data are made available as a Prometheus snapshot (see `prom.tgz` below).

Retest on a cluster w/ data blocks (objects beyond 3KiB)
Very nice catch on the read-ahead flag, thank you for your commit! To further investigate the issue, we re-ran a benchmark of our own, using larger and random object sizes to ensure block refs were created and blocks were stored.
The `send_all_at_once` read strategy was not used.

warp command line
garage.toml
Cluster observability
That initial 4-hour-long benchmark yielded mixed results, as shown in the PutObject rate graphs below:
There appear to be roughly 3 phases:
It should also be noted that the database ended up at ~6.3GiB (below RAM size) so the impact of read-ahead must have been minimal. The RAM is also dedicated to the VMs so nothing external is forcing pages out of memory.
The speed pattern matches what is observed on RPCs and table operations:
The LMDB metrics also follow the pattern:
More: LMDB heat maps
Looking at cluster synchronisation, we observe that the block sync buffers (`block_ram_buffer_free_kb`) do not get overwhelmed:

Similarly, no node is lagging behind on Merkle tree updates, although node 3 (2nd in 1st AZ) exhibits higher values:
However, the block resync queue (`block_resync_queue_length`) goes absolutely crazy on that 3rd node:

System observability
We are also able to provide some system-level stats. First, looking at load, we can definitely see that node 3 is struggling more than the rest, although all of them remain within their CPU capacity:
Looking at cache usage, a strange behaviour comes up: while nodes 1 and 2 keep a steady cache usage, pages on node 3 are regularly being released:
That memory does not appear to be claimed for use though, meaning those drops are unlikely to be forced evictions:
Finally, network-wise, all nodes exhibit similar behaviours, communicating far below link capacity (I may have got those units wrong, but in any case, there's plenty of bandwidth left):
Discussion
While the read-ahead flag definitely delivers improvements when the database size exceeds the RAM, there appear to be other factors which, in our setup, hinder Garage's performance. Where the Scaleway cluster stabilised at around 400#/s, we were only able to reach (a much more unstable) 75#/s.
Regarding the impact of the object count, we just restarted another benchmark today, starting off where the first one had left off. The idea was to figure out whether the slowdown was related to cluster size (in which case the new benchmark should start at around 75#/s) or to load/work backlog (in which case the new benchmark should start over from about 800#/s now that the cluster has settled). The first results suggest the former.
➡️ Cluster size appears to still play a part in the performance drop, even below RAM capacity.
Regarding node 3, those results also confirm what we initially saw on our test cluster: the 2nd node in the 1st availability zone appears to do a lot more work than the rest of the nodes, causing it to lag behind at times. This has a cluster-wide impact since this node is still expected to participate in quorums.
➡️ Quick tests were made using `send_all_at_once`, but this only improves reads, which overall play a minor role in a PutObject call - node 3 may still be delaying write operations.

Regarding RAM cache, those releases on node 3 are definitely odd. They suggest that the node may be behaving more erratically than the rest and moving more things in and out of memory.
Similarly, regarding the block resync queue, we are still unable to explain why node 3 accumulates so much work (although this may just be another symptom).
➡️ Overall, it seems important to understand what makes node 3 so special, given that, system-wise, it does not appear to be particularly different from the rest of the cluster.
We will continue our investigations, but would definitely appreciate any feedback you may have on those results! If something worth testing comes to mind, also feel free to suggest it so we may run it on this cluster. While we can't share access to those servers, the resources will remain available for testing for some time.
Next steps for us:
JKR
Thanks a lot for your extensive feedback!
Regarding the fact that performance diminishes with the number of objects stored - and the size of LMDB - I think it can be expected as long as it's O(log(n)). Quickly: https://www.geeksforgeeks.org/introduction-of-b-tree-2/
On discussing your node 3 performance, it seems that you are now less concerned about raw performance and more about predictability / explainability / homogeneity of performance. Am I right?
It might be an obsession, but I don't like the following graph, as I don't understand what creates this pattern. I would really like to understand what's going on and be able to say whether it's expected or an issue.
I will try to get a closer look later to all of your data. We might find a thing or two!
I don't know exactly when I will be able to run additional benchmarks on my side, but that's what I have on my roadmap:
@quentin Hi, author of Fjall here.
This is observable for any database. While it's true that LMDB's read path is probably the quickest anywhere - as your data surpasses RAM size, any database will take quite a hit as I/O takes over. It can be really hard to benchmark LMDB because the OS just allows it to take all the memory, while with RocksDB etc. you have to explicitly set the block cache size, and you don't want it to start swapping memory. On the other hand, blocks in LSM-trees are tightly packed, while pages in a B-tree tend to be 1/2 or 1/3 empty (if values are small), so you may have to cache more pages for a B-tree to keep the same amount of items in memory...
There are definitely some considerations here. If you write large values, as you seem to do, you probably want to use RocksDB's BlobDB mode, which will heavily reduce compaction overhead caused by large values (the equivalent in Fjall being `use_kv_separation(true)` in the upcoming V2). And there are probably a myriad of other configuration options, which is one pain point of RocksDB really, next to its absurdly high compilation times.

A big problem with B-trees is that they need to read to write: ideally a lot of your pages are cached, but you may incur random read I/O to actually write data, which is then possibly a random write (LSM-trees always write sequentially, which SSDs like). Why the above benchmark degrades so much while having all the data still in memory, I'm not sure. It is an unsolvable downside of the LMDB write path - you can at least alleviate some of LSM-trees' shortcomings by throwing more memory at it. And an LSM-tree just appends to the WAL, which has O(1) complexity, plus the write path is shorter overall.
There are also circumstances where a tuned RocksDB's read performance is very competitive with LMDB, sometimes beating it in multithreaded workloads: http://www.lmdb.tech/bench/ondisk/HW96rscale.png. It's not black & white, but LMDB has the better average read performance, of course.
Sync
I'm not sure which sync level was used in the above benchmarks, but note that LMDB's NOSYNC level is possibly unsafe, unless you can ensure your file system works with it:
LSM-trees, because they use a WAL, can adjust the durability level per write during runtime, not by setting a global environment flag. And unlike LMDB's NoSync flag, writing to a WAL cannot corrupt the database. For example, you could write blocks without sync, and then sync after writing the last block, without worrying that your file system may corrupt your database as described in the LMDB docs.
While I know some stuff about S3's multipart uploads, I'm not really familiar with Garage and what durability levels it needs or how many KV operations it actually performs per PUT, and what kind of value sizes the metadata is.
Possible issues in code
https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/db/lmdb_adapter.rs#L135-L142 Db::insert and ::remove return the previous value. This is expensive: for every insert you essentially perform an O(log n) tree search (plus a heap allocation with memcpy because of Vec::from), followed by another O(log n) tree search (as explained above), followed by write I/O during commit, as LMDB needs to CoW the affected pages. If your data set is greater than RAM, your inserts now (1) may have to wait for read I/O and (2) can cause pages to be kicked out of the read cache. Using RocksDB wouldn't necessarily help here. The advantage of LSM-trees is to not read while inserting, but the code does exactly that.
Looking at the code base, the return value is (almost?) never used, so I would suggest changing `insert` and `remove` to be:

And if the return value is actually needed, add new operations, maybe something like:
Queue in key-value store
Creating a sorted collection from random writes is expensive. If the collection is sorted to begin with, it is not expensive.
If I am reading the code correctly, the keys for the MerkleTree are not monotonic, because it's a compound key (partition, prefix). This will cause writes to be distributed across the B-tree (same for an LSM-tree), which is undesirable. If you want a queue, you need monotonic keys, e.g. ULID, CUID, maybe Scru128, and store the partition + prefix in the value instead, not in the key. I still don't think LMDB is a great fit for this use case because it can (1) never shrink and (2) still needs random reads & writes as explained above. Plus, if your values are small (I can't tell from the code), a B-tree has very high write amplification.
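A minimal sketch of what a monotonic queue key could look like (a per-node counter plus node id; the struct and field names are made up for illustration, this is not Garage code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative monotonic key generator for a persisted queue:
/// keys sort by insertion order, so writes always land at the "end"
/// of the B-tree / LSM-tree instead of being scattered across it.
pub struct QueueKeyGen {
    node_id: u16,
    counter: AtomicU64,
}

impl QueueKeyGen {
    pub fn new(node_id: u16) -> Self {
        Self { node_id, counter: AtomicU64::new(0) }
    }

    /// 10-byte key: big-endian counter (8 bytes) + node id (2 bytes).
    pub fn next_key(&self) -> [u8; 10] {
        let seq = self.counter.fetch_add(1, Ordering::Relaxed);
        let mut key = [0u8; 10];
        key[..8].copy_from_slice(&seq.to_be_bytes());
        key[8..].copy_from_slice(&self.node_id.to_be_bytes());
        key
    }
}

// The (partition, prefix) information previously encoded in the key
// would then be stored in the value instead.
```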
Quick follow-up on the previous benchmark, discussion below - thanks again everyone for your input!
Inlined objects, no data logic
This is more of a retest following the read-ahead improvement, where the benchmark stores very small (inlined) values. We can confirm that we were able to reproduce the results obtained by @quentin on the Scaleway cluster:
Full dashboard
(forgive the gap, my benchmarking node likes to fail during lunch)
In those results, we can definitely see Garage performing steadily at about 800#/s. LMDB alone seems to perform well under stress and, with the read-ahead fix, barely suffers once the database has outgrown the RAM. The last graph confirms that the data partition remains unused (all inlined).
35KB objects, ZFS data partition, record size 128K
The previous results suggest that the performance drop may be related to the data block logic, since LMDB alone performs well. To eliminate possible ext4 performance issues when handling many small files, we reran that test using ext4 for LMDB and ZFS for the data blocks. We're keeping the default record size for now (128K) since LMDB isn't on ZFS.
Full dashboard
(no idea where the data partition usage data went, but FYI it kept growing steadily up to 30GB)
As before, in this scenario, we are definitely seeing a performance drop: as the number of data blocks grows, the insert operations begin to slow down. This is also visible in the LMDB heatmaps (insert op). This suggests that writing block ref objects is much slower than just writing objects/versions. Perhaps this causes more page swaps? There's also the (probably) non-negligible cost of writing to the filesystem.
There was also a very odd behaviour towards the end of the benchmark there... Disk usage for LMDB went straight from 10GB to 40GB in a matter of seconds, reaching 100% capacity. At this point all operations also took a serious performance hit.
Discussion
I fear that when data blocks are involved, we may be leaning towards the "worst-case" edge in terms of complexity. The performance graph is definitely giving logarithmic vibes, but it's going down pretty fast still. That sudden change soon after we exceeded RAM size in the second benchmark also leaves me a little perplexed.
Yes, the idea was to understand it in case it got involved in quorums and slowed down the entire cluster. But that seems to go against what we are seeing in the block resync queue graph. From what I understand, our setup implies that all nodes hold the same data and effectively act as 3 mirrors. In this case, am I right to assume that the preference list is the same for all keys, say `[node1, node2, node3]`? Then all quorums are reached using nodes 1 and 2, while node 3 accumulates an async backlog. In this scenario, node 3 never gets a chance to participate in quorums and therefore can't slow them down. Let me know if I got that wrong.

I'm afraid these last 2 benchmarks don't really bring much light on this particular point... We are still seeing the pattern even with inlined objects.
I am not nearly familiar enough with LSM-trees (and/or Garage) to provide much feedback on that subject I'm afraid. One thing I did notice when benchmarking LMDB earlier was that the Garage DB did in fact prefix its keys (at least for objects). I wonder if non-monotonic keys may be playing a part in the performance drop once block refs are involved. The values also aren't particularly large, so there is that last concern about write amplification.
JKR
I find this behaviour really curious. Is this maybe caused by the Todo queue's rapid growth? It would be nice to see the size of the different tables specifically; I can't imagine metadata taking 40GB. If so, one problem is that LMDB files never shrink. LMDB will try to reuse pages if possible, but even if the queue shrinks, you won't get those 30GB back. The only solution is to rewrite the entire database file (https://www.openldap.com/lists/openldap-devel/201505/msg00028.html).
I have tried to recreate the block ref schema in Garage in a CLI tool thingy: https://github.com/marvin-j97/write-bench
@withings Can you try running this on hardware that is comparable (or the same) as above? With:
I have found LMDB to really struggle on a memory-limited (1 GB) EC2 VM (at least 50x slower than LSM). On my personal machine (32 GB RAM) it's "only" 3x slower.
If the machine works through the workload fine, can you increase the amount of items, until you get to the breaking point?
Hello! Sorry for the delay, there was some OoO on my end.
Thanks for `write-bench`, I ran it on one of our nodes with Garage down: we observed roughly the same performance trend as the above graphs suggest, without any burst even beyond 10M objects.

It turns out the explanation was actually a lot less exotic: we had `metadata_auto_snapshot_interval` on, which creates a snapshot/copy of the DB under the same mount point after some time. Those snapshots were filling up the disk and hogging CPU time. There appears to be a timezone issue with that feature (it triggered after 4 hours instead of 6 in UTC+2 - I'll see about opening a separate issue).

With that feature off, Garage behaves as expected:
Performance went down to ~110#/s at 5M objects:
We've added some extra metrics to be able to split LMDB usage per tree and got the following graph:
As we may have expected, `object:merkle_tree` and `version:merkle_tree` are the busiest trees, although cumulatively the `todo` trees still account for a non-negligible portion of LMDB usage.

Most likely next steps for us:
- replacing the `todo` trees with queues, as suggested earlier

Again, not sure how far we'll be able to push it, just a rough roadmap for the time being.
JKR
@withings What NVMe drives are you running? The device can vary the write speed quite a lot for LMDB, even when using NO_SYNC (+ NO_READAHEAD, because that makes reads faster as we have established earlier).
Here's what I got (80M x 100 byte, random key):
Samsung 990 EVO:
Samsung PM9A3:
For both runs, you can see that until about 280s it runs quickly, but then, when it doesn't get any more memory from the OS, it starts running into fsyncs. For the EVO this becomes really expensive, reducing the write rate down to around 10'000 per second, and 24'000 for the PM9A3 respectively.
Our Garage instances run in virtual machines with a number of abstractions between them and the drives (RAID, encryption, ...) so it's a bit difficult to pin it down to the hardware itself. That being said, I can still run write-bench and graph its performance on top of it all.
Average insert time (ms) up to 100M entries
We can see some serious variations between nodes but a common breaking point just before reaching 50M inserts. It aligns nearly perfectly with the moment the DB size exceeded the RAM, so that's to be expected. All in all, I'm not going to complain about a 0.3ms insert time at 100M objects, that's more than fine.
Retry in production
That went well: our client is slower than Garage so we can't challenge its performance in that context anymore. We are still seeing some sporadic slow calls up to 7s which are somewhat annoying. Those will still need to be investigated, however last time I tried I couldn't find that "lost" time in traces.
Unifying key prefixes
Something which was mentioned before is that Garage's LMDB entries have prefixed keys. Objects, for instance, are recorded using the bucket ID as a prefix. I was curious to check performance with a more uniform key distribution so I did a run where keys were transparently hashed (XXH3) by the adapter. Those changes were pushed to withings/garage/feat/lmdb-hashed-keys but the performance difference is too insignificant to warrant a PR IMO.
Using queues for the todo trees
We have implemented that over at withings/garage/feat/todo-queues, using yaque as the first adapter. First results seem to suggest that the approach does indeed improve performance, and helps LMDB a lot.
Note that this approach does come with some warning signs. First, it is no longer possible to probe for queue length, meaning you can't "see" how many items are pending through Garage (yaque limitation). Second, I'm very uncertain of the approach I've taken when refactoring the GC. I think I will open a draft PR for this, I'd love your feedback.
Performance on the S3 endpoint appears to be more stable (although those "dips" remain unexplained)
LMDB performance is a lot more stable once the merkle_todo and gc_todo workloads have been moved out
Also observed on the LMDB op delay, much more stable
Using queues for block resync
In its current state, the resync worker actually needs a tree, as it relies on the order of its entries to work. Using queues there would require a much more significant rework IMO.
That being said, we did try (out of curiosity) to remove the resync logic from Garage (never resync). I don't have the graphs with me, but the results seemed to suggest that the resync worker claims a lot of CPU time, to the point where it can overpower the client-serving logic and ultimately render Garage unusable. There might be some room for fiddling in the tranquilizer logic which regulates that worker.
Fjall backend
We also tried fjall as a new database backend, the work for this is over at withings/garage/feat/fjall-db-engine. There's got to be something fundamentally wrong with my implementation though, because I could not get it beyond 50 req/s at the peak, and 10 req/s after a few 100k objects. I tried to fiddle with it a bit, but no dice.
RocksDB backend
Similar story here, you'll find that work on withings/garage/feat/feat/rocksdb-engine. The results were a bit more catastrophic there, with the engine straight up crashing after a couple 100k objects. I'm 99% sure there's something wrong with my implementation there too, but couldn't figure it out.
Regarding the merkle-todo worker stressing LMDB: I was wondering if applying todo items in batches to the merkle tree could help.
The idea would be to pop several items from the todo queue and update the tree at once for this batch of items, resulting in fewer writes to LMDB overall compared to applying items one by one. (For instance, applying 100 items individually means updating the root node 100 times, whereas applying a batch of 100 items would only need to update the root node of the tree once.)
I think this could be implemented in two ways: (1) by directly implementing `update_items`, a multi-item version of `update_item` (that would be logically equivalent to iterating `update_item`, but more efficient in terms of DB writes); (2) by generalizing `update_item` to an abstract type of database transaction, and implementing `update_items` on top of it by logging the underlying DB operations in a log, compressing the log (discarding writes overwritten by later writes) and applying it to the actual DB.

I've spent a bit of time trying to implement option (1) and I found it quite tricky; I think (2) is easier to implement conceptually, though probably more invasive.
@withings Your profile/repository are set to private I think, otherwise I would take a look
It would be nice if you could open PRs to include fjall and rocksdb support in Garage. Even if they are janky we might want to include them so that users could try them out and try to tune them to work better.
@marvinj97 @lx whoops sorry, force of habit when I created my profile probably. I've set my profile to public. I'll go through your reviews ASAP!
@withings So the most obvious thing I'm seeing without reading too much into the code is that:
Also, you should see both LSM-trees using less disk space compared to LMDB. For anything else, I would need to understand more about how Garage stores its KVs and what kind of values (and their sizes) are stored.