Replace Sled with other embedded databases #284

Closed · opened 8 months ago by quentin (Owner) · 0 comments

Sled is known to use a lot of resources, including RAM. We thought we could cap RAM consumption by changing Sled's cache size, but it does not work in practice.

Investigations

It seems that the cache size does not map to the effective RAM consumption:

> cache_capacity is currently a bit messed up as it uses the on-disk size of things instead of the larger in-memory representation. So, 1gb is not actually the default memory usage, it's the amount of disk space that items loaded in memory will take, which will result in a lot more space in memory being used, at least for smaller keys and values. So, play around with setting it to a much smaller value.

It also seems that for some operations the cache size has no impact at all. For example, we know of users with 2GB of RAM for whom it is impossible to simply open a sled database:

```
 INFO  garage::server > Loading configuration...
 INFO  garage::server > Opening database...
 TRACE sled::context  > starting context
 TRACE sled::pagecache > starting pagecache
 TRACE sled::pagecache::iterator > file len: 39635124224 segment len 524288 segments: 75598
 TRACE sled::pagecache::logger   > reading segment header at 0
 TRACE sled::pagecache::iterator > SA scanned header at lid 0 during startup: SegmentHeader { lsn: 212693155840, max_stable_lsn: 212692631551, ok: true }
 TRACE sled::pagecache::iterator > not using segment at lid 0, ok: true lsn: 212693155840 min lsn: 212695252992

...

 TRACE sled::pagecache::logger   > reading segment header at 39634599936
 TRACE sled::pagecache::iterator > SA scanned header at lid 39634599936 during startup: SegmentHeader { lsn: 212692631552, max_stable_lsn: 212692107263, ok: true }
 TRACE sled::pagecache::iterator > not using segment at lid 39634599936, ok: true lsn: 212692631552 min lsn: 212695252992
 DEBUG sled::pagecache::iterator > ordering before clearing tears: {}, max_header_stable_lsn: 212695252992
 DEBUG sled::pagecache::iterator > in clean_tail_tears, found missing item in tail: None and we'll scan segments {} above lowest lsn 212695252992
 DEBUG sled::pagecache::iterator > unable to load new segment: Io(Custom { kind: Other, error: "no segments remaining to iterate over" })
 DEBUG sled::pagecache::iterator > filtering out segments after detected tear at (lsn, lid) -1
 TRACE sled::pagecache::iterator > trying to find the max stable tip for bounding batch manifests with segment iter {} of segments >= first_tip -1
 TRACE sled::pagecache::iterator > generated iterator over segments {} with lsn >= 212695252992
memory allocation of 2818572288 bytes failed
```
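For reference, this is the knob we tried. A sketch of the relevant fragment of Garage's TOML configuration; treat the exact key name and unit as assumptions, they may differ between Garage versions:

```toml
# garage.toml (fragment) — illustrative; key name and unit are assumptions
# Supposed to cap sled's page cache. Per the quote above, this bounds the
# *on-disk* size of cached items, so actual RAM usage can be much larger.
sled_cache_capacity = 134217728  # 128 MiB, in bytes
```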

Issues documenting Sled's memory problem: #986, #1093, #1061, #1304
Work in progress that may improve the situation: #1304

We are considering multiple options:

  • Writing an abstraction layer and implementing different backends; one of them would be Sled, but others could be RocksDB, LMDB, Persy or Nebari. See this Reddit thread and the table below for more.
  • Asking for help on Sled's Discord channel
    • Done, no answer
| Name | Concurrency | Atomicity | Known issues | Roadmap | Community | Notes |
|---|---|---|---|---|---|---|
| [sled](https://sled.rs/) | Checked by Rust types | transactions, batch, compare-and-swap, update-and-fetch | memory usage: #986, #1093, #1061, #1304 | #1304 | | N/A |
| [sqlite](https://docs.rs/sqlite/latest/sqlite/) | Manually | Full SQL semantics, including transactions | ? | ? | ? | [rex by Criner](https://github.com/the-lean-crate/criner/discussions/5) |
| [LMDB](https://docs.rs/lmdb/latest/lmdb/) | ?? | | | | | [Heed](https://github.com/meilisearch/heed) |
| [RocksDB](https://docs.rs/rocksdb/latest/rocksdb/) | ?? | | | | | |
| [Nebari](https://github.com/khonsulabs/nebari) | ?? | | | | | |
| [Persy](https://persy.rs/) | ?? | | | | | |
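The abstraction-layer option could look something like the following minimal sketch (all names hypothetical, shown with an in-memory backend for illustration): a small key-value trait that adapters for Sled, LMDB, RocksDB, etc. would implement. A real layer would also need iteration, transactions and compare-and-swap to cover Garage's usage of sled.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal KV interface; names are illustrative, not Garage's API.
pub trait KvBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>>;
    /// Returns the previous value for this key, if any.
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) -> Option<Vec<u8>>;
    fn remove(&mut self, key: &[u8]) -> Option<Vec<u8>>;
}

/// In-memory reference backend, useful for tests and as a template
/// for the real Sled/LMDB/RocksDB adapters.
#[derive(Default)]
pub struct MemBackend {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl KvBackend for MemBackend {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.get(key).cloned()
    }
    fn insert(&mut self, key: Vec<u8>, value: Vec<u8>) -> Option<Vec<u8>> {
        self.map.insert(key, value)
    }
    fn remove(&mut self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.remove(key)
    }
}

fn main() {
    // Callers hold a trait object, so the backend can be swapped at runtime.
    let mut db: Box<dyn KvBackend> = Box::new(MemBackend::default());
    db.insert(b"k".to_vec(), b"v".to_vec());
    assert_eq!(db.get(b"k").as_deref(), Some(&b"v"[..]));
}
```

The trait-object approach keeps backend selection a runtime concern (e.g. driven by a config key), at the cost of dynamic dispatch; generics would be the static alternative.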

Note: in this specific case, the situation was triggered because v0.6.0 had a bug that put 185M objects in the resync queue. Still, that should not prevent the server from booting, since the on-disk database is only 6GB large.

quentin added the Bug, Performance labels 8 months ago
quentin changed title from Sled memory usage might be out of control to Replace Sled with another embedded database 8 months ago
quentin changed title from Replace Sled with another embedded database to Replace Sled with other embedded databases 8 months ago
lx closed this issue 6 months ago
Reference: Deuxfleurs/garage#284