Consider bumping default block size to 10MB #139

Closed
opened 2021-11-02 17:19:20 +00:00 by lx · 4 comments
Owner

There are many advantages to using larger block sizes:

  • Less metadata to store (shorter lists of blocks for large objects)
  • Fewer HDD I/O operations leading to better performance

The downside is mainly the following:

  • On slow connections, transferring blocks of size 10MB between nodes is going to take some time. If the delay is regularly larger than `BLOCK_RW_TIMEOUT` (see `model/block.rs`), lots of block read/write ops will fail and Garage will have to retry lots of operations, leading to wasted bandwidth. If many block read/write ops fail, Garage becomes globally unusable.

Quick calculation: on a 100Mbit connection, 10MB blocks can be transferred in ~1 second. `BLOCK_RW_TIMEOUT` is currently set to 42 seconds, so here we clearly have no issue. Suppose now an average-quality ADSL link, for example 2Mbps down. The time to transfer a block is now around 40 seconds, already uncomfortably close to the timeout. The node behind an ADSL connection can definitely not be the entry node we contact for S3 API calls: if it were, it would have to send back all blocks it receives to two other nodes, over an even slower uplink. Even if the ADSL node is not acting as an API entry point, it might still be receiving several blocks at the same time, in which case the delay is multiplied by the number of simultaneous transfers. It's quite obvious that the ADSL node will collapse really quickly.
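
A small standalone sketch to re-run the arithmetic above with other link speeds (the 42-second timeout and the link speeds are just the figures quoted in this issue, not values read from Garage's source):

```rust
// Illustrative only: re-runs the back-of-the-envelope numbers above.
fn transfer_secs(block_bytes: u64, link_mbps: f64) -> f64 {
    let bytes_per_sec = link_mbps * 1_000_000.0 / 8.0;
    block_bytes as f64 / bytes_per_sec
}

fn main() {
    const BLOCK_RW_TIMEOUT_SECS: f64 = 42.0; // value quoted in this issue
    let block = 10_000_000u64; // 10 MB

    for (name, mbps, concurrent) in [
        ("100 Mbit/s", 100.0, 1),
        ("2 Mbit/s ADSL", 2.0, 1),
        ("2 Mbit/s ADSL, 4 simultaneous blocks", 2.0, 4),
    ] {
        let t = transfer_secs(block, mbps) * concurrent as f64;
        let verdict = if t < BLOCK_RW_TIMEOUT_SECS { "under the timeout" } else { "exceeds the timeout" };
        println!("{name}: ~{t:.1} s per block ({verdict} of {BLOCK_RW_TIMEOUT_SECS} s)");
    }
}
```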

Conclusion: 10MB blocks are OK as long as we don't have nodes with a slow connection such as ADSL. If we have them, we probably need to set the block size back down to 1MB (or try increasing `BLOCK_RW_TIMEOUT`, but it's not clear that that's the right solution). Or, if we really want to support ADSL nodes with large block sizes, we need to invent a smarter protocol for block transfers that sends smaller chunks and keeps track of transfers in a smarter way.

Owner

Can we tie `BLOCK_RW_TIMEOUT` to the fact that we are actually receiving data on the TCP socket? So each time we receive an IP packet (or each time we read the TCP buffer), we reset the timer?
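
A minimal sketch of that idea, assuming a tokio-style async read loop (this is not Garage's actual RPC code, and `BLOCK_IDLE_TIMEOUT` is a made-up name): the timeout is applied per read rather than per block, so it only fires when the connection is genuinely idle.

```rust
use std::io;
use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

/// Hypothetical idle timeout between two successive reads (name invented here).
const BLOCK_IDLE_TIMEOUT: Duration = Duration::from_secs(42);

async fn read_block<R: AsyncRead + Unpin>(sock: &mut R, len: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 8192];
    let mut block = Vec::with_capacity(len);
    while block.len() < len {
        let want = buf.len().min(len - block.len());
        // The timeout only covers one read: a slow but live link never trips it,
        // while a stalled peer still does.
        let n = timeout(BLOCK_IDLE_TIMEOUT, sock.read(&mut buf[..want]))
            .await
            .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "idle timeout"))??;
        if n == 0 {
            return Err(io::ErrorKind::UnexpectedEof.into());
        }
        block.extend_from_slice(&buf[..n]);
    }
    Ok(block)
}
```

The same per-read (or per-write) deadline could in principle be applied on the sending side, so a slow-but-progressing transfer is never killed while a dead connection still times out quickly.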

lx added the Improvement and Performance labels 2021-11-08 15:29:29 +00:00

> Quick calculation: on a 100Mbit connection, 10MB blocks can be transferred in ~1 second.

Wouldn't that mean that first-byte latency would be the normal latency (internal cluster latency + client-to-gateway latency) plus ~1 second for the gateway to receive the full RPC message?

lx added this to the v0.8 milestone 2022-09-14 11:10:22 +00:00
Author
Owner

Now that we have block streaming (#343), we should benchmark a cluster with 10MB blocks and compare it to 1MB blocks to see whether it performs worse or about the same. If it's not worse, we can bump the default.

lx modified the milestone from v0.8 to v0.9 2022-09-19 13:14:12 +00:00
Author
Owner

Last time we talked about this, the conclusion was that block streaming only helps in one direction: when reading objects, not when writing them. This means that in all cases, increasing the block size will add an incompressible delay at the moment the client has just finished uploading all of its data, before the server can return 200 OK. This is the delay of transferring the last block to two storage nodes, which increases with block size and can become quite significant on slow networks.
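
A rough way to quantify that incompressible delay, as a standalone sketch (illustrative numbers; it assumes the gateway pushes the final block to two other storage nodes over one shared uplink, i.e. the worst case):

```rust
// Once the client has sent its last byte, the gateway still has to push the
// final block to two other storage nodes before it can return 200 OK.
fn final_block_delay_secs(block_mb: f64, uplink_mbps: f64, copies_to_push: u32) -> f64 {
    block_mb * 8.0 / uplink_mbps * copies_to_push as f64
}

fn main() {
    for block_mb in [1.0, 10.0] {
        for uplink_mbps in [1000.0, 100.0, 10.0] {
            let d = final_block_delay_secs(block_mb, uplink_mbps, 2);
            println!("{block_mb} MB block, {uplink_mbps} Mbit/s uplink: +{d:.2} s before 200 OK");
        }
    }
}
```

On a fast LAN the extra delay is negligible even at 10MB, but on a 10 Mbit/s uplink it grows from under two seconds to well over ten, which is exactly the reactivity loss described above.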

As we'd rather Garage be as reactive as possible even on modest networks (the main use case being WAN deployments, which aren't always extremely fast), at this point I think it is preferable to keep the current default block size of 1MB, which time has proven to be a satisfactory default for most users, and indicate in the documentation that users who have a very fast network between all their nodes might want to increase it to 10MB to optimize their system.

lx closed this issue 2022-11-10 10:32:13 +00:00
Reference: Deuxfleurs/garage#139