Consider bumping default block size to 10MB #139
There are many advantages to using larger block sizes:
The downside is mainly the following:
model/block.rs
If blocks become too large relative to BLOCK_RW_TIMEOUT, lots of block read/write ops will fail and Garage will have to retry lots of operations, wasting bandwidth. In the case where many block read/write ops fail, Garage becomes globally unusable.

Quick calculation: on a 100 Mbit/s connection, a 10 MB block can be transferred in under a second. BLOCK_RW_TIMEOUT is currently set to 42 seconds, so here we clearly have no issue. Now suppose an average-quality ADSL link, for example 2 Mbit/s downstream. Transferring a single block now takes about 40 seconds (10 MB ≈ 80 Mbit at 2 Mbit/s), already dangerously close to the timeout. The node behind an ADSL connection can definitely not be the entry node we contact for S3 API calls: if it were, it would have to send back every block it receives to two other nodes, over an even slower uplink. Even if the ADSL node is not acting as an API entry point, it might still be receiving several blocks at the same time, in which case the delay is multiplied by the number of simultaneous transfers. It's quite obvious that the ADSL node will collapse very quickly.
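To make this arithmetic easy to re-check, here is a small self-contained Rust sketch (not part of Garage's code base; the 42 s timeout and link speeds come from the comment above, while the helper names and list of scenarios are purely illustrative):

```rust
// Back-of-the-envelope check of 10 MB block transfer times vs BLOCK_RW_TIMEOUT.
const BLOCK_RW_TIMEOUT_SECS: f64 = 42.0;

/// Seconds needed to move `block_bytes` over a `link_mbps` link, ignoring overhead.
fn transfer_secs(block_bytes: f64, link_mbps: f64) -> f64 {
    block_bytes * 8.0 / (link_mbps * 1_000_000.0)
}

fn main() {
    let block = 10_000_000.0; // one 10 MB block
    for (label, mbps, concurrent) in [
        ("100 Mbit/s link, 1 transfer", 100.0, 1.0),
        ("2 Mbit/s ADSL, 1 transfer", 2.0, 1.0),
        ("2 Mbit/s ADSL, 3 concurrent transfers", 2.0, 3.0),
    ] {
        let t = transfer_secs(block, mbps) * concurrent;
        let verdict = if t < BLOCK_RW_TIMEOUT_SECS { "fits" } else { "exceeds timeout" };
        println!("{label}: ~{t:.1} s ({verdict}, timeout = {} s)", BLOCK_RW_TIMEOUT_SECS);
    }
}
```

This prints roughly 0.8 s for the fast link, ~40 s for a single ADSL transfer, and ~120 s once three transfers share the ADSL link, which is where the timeout is blown.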
Conclusion: 10 MB blocks are OK as long as we don't have nodes behind slow connections such as ADSL. If we do have them, we probably need to set the block size back down to 1 MB (or try increasing BLOCK_RW_TIMEOUT, but it's not clear that's the right solution). Or, if we really want to support ADSL nodes with large block sizes, we need to invent a smarter block transfer protocol that splits transfers into smaller chunks and keeps track of them more intelligently; a rough sketch of that idea follows.
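For illustration only, here is one hypothetical shape such a protocol could take (this is not Garage's actual RPC layer; `SUB_CHUNK_SIZE`, `PER_CHUNK_TIMEOUT` and `send_block_chunked` are invented for this sketch and assume a tokio-based TCP transport). The point is that the timeout applies per sub-chunk rather than per block, so the total allowed time scales with block size:

```rust
// Hypothetical sketch only -- not Garage's actual block transfer code.
// Idea: split a large block into smaller sub-chunks and apply the timeout to
// each sub-chunk, so a slow-but-alive link is not killed by one big deadline.
use std::io;
use std::time::Duration;
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;
use tokio::time::timeout;

const SUB_CHUNK_SIZE: usize = 1024 * 1024; // 1 MiB sub-chunks (assumption)
const PER_CHUNK_TIMEOUT: Duration = Duration::from_secs(30); // assumption

async fn send_block_chunked(stream: &mut TcpStream, block: &[u8]) -> io::Result<()> {
    for chunk in block.chunks(SUB_CHUNK_SIZE) {
        // The deadline restarts for every sub-chunk, so the total time allowed
        // for the whole block grows with its size instead of being fixed.
        timeout(PER_CHUNK_TIMEOUT, stream.write_all(chunk))
            .await
            .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "sub-chunk write timed out"))??;
    }
    stream.flush().await
}
```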
Could we tie BLOCK_RW_TIMEOUT to whether we are actually receiving data on the TCP socket, i.e. reset the timer each time we receive an IP packet (or each time we read from the TCP buffer)?
Wouldn't that mean that first-byte latency becomes the normal latency (internal cluster latency + client-to-gateway latency) plus ~1 s for the gateway to receive the full RPC message?
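To make the timer-reset idea above concrete, here is a rough, hypothetical sketch (not Garage's actual RPC code; `IDLE_TIMEOUT` and `recv_block` are invented for illustration and assume a tokio TCP stream). An idle timeout is re-armed on every successful read, so a transfer only fails if the peer sends nothing at all for that period, however slow the link is:

```rust
// Hypothetical sketch only -- not Garage's actual code.
// Instead of one fixed deadline for the whole block, use an *idle* timeout
// that is re-armed every time some bytes arrive on the socket.
use std::io;
use std::time::Duration;
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;
use tokio::time::timeout;

const IDLE_TIMEOUT: Duration = Duration::from_secs(10); // assumption

async fn recv_block(stream: &mut TcpStream, len: usize) -> io::Result<Vec<u8>> {
    let mut block = Vec::with_capacity(len);
    let mut buf = vec![0u8; 64 * 1024];
    while block.len() < len {
        let want = buf.len().min(len - block.len());
        // Fail only if *nothing* arrives for IDLE_TIMEOUT: each successful
        // read effectively resets the timer.
        let n = timeout(IDLE_TIMEOUT, stream.read(&mut buf[..want]))
            .await
            .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "socket idle for too long"))??;
        if n == 0 {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        block.extend_from_slice(&buf[..n]);
    }
    Ok(block)
}
```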
Now that we have block streaming (#343), we should benchmark a cluster with 10 MB blocks against one with 1 MB blocks to see whether performance gets worse. If it doesn't, we can bump the default.
Last time we talked about this, the conclusion was that block streaming only helps in one direction: when reading objects, not when writing them. This means that in all cases, increasing the block size adds an incompressible delay between the moment the client finishes uploading its data and the moment the server can return 200 OK. This is the time needed to transfer the last block to two storage nodes, which grows with block size (for a 10 MB block, on the order of a second or two even over a 100 Mbit/s uplink) and can become quite significant on slow networks.
Since we'd rather Garage stay as responsive as possible even on modest networks (the main use case being WAN deployments, which aren't always extremely fast), at this point I think it is preferable to keep the current default block size of 1 MB, which time has shown to be a satisfactory default for most users, and to indicate in the documentation that users who have a very fast network between all their nodes may want to increase it to 10 MB to optimize their system.