Use SRPT scheduling to improve liveness when sending messages between peers #361

Closed
opened 2022-08-17 15:14:39 +00:00 by baptiste · 5 comments
Owner

Currently, according to discussions and [current research](https://git.deuxfleurs.fr/quentin/mknet/src/branch/liveness/liveness.md), messages between Garage peers are scheduled according to the following strategy:

  • higher priority messages get scheduled first
  • messages with the same priority are sent with a Round-Robin strategy

Round-Robin is not ideal in the second case, because when there are many concurrent messages, all messages will experience a high completion time.

A better strategy for messages with identical priorities is SRPT: Shortest Remaining Processing Time. It is designed to minimize the number of objects in the system (depending on hypotheses), and it also minimizes the average completion time.

For Garage it would look like the following:

  • order all messages in the queue according to their remaining size (e.g. if we already sent 800 KB of a 1 MB block, the remaining size is 200 KB)
  • send messages one after the other according to this order (no parallelization)
  • whenever a new message needs to be sent, it is inserted in the ordered queue; if it comes first in the queue (i.e. it is smaller than the remaining size of all other messages), it pre-empts the message that we were currently sending
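The three rules above can be sketched with a min-heap keyed on remaining bytes. This is an illustrative sketch only (the `SrptQueue`, `Message`, and `send_chunk` names are hypothetical, not Garage's actual types); pre-emption falls out naturally because the smallest message is always popped first.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// A message in the send queue, tracked by its remaining size in bytes.
#[derive(Debug, PartialEq, Eq)]
struct Message {
    id: u64,
    remaining: u64,
}

/// SRPT queue: always serves the message with the least remaining bytes.
struct SrptQueue {
    // BinaryHeap is a max-heap, so wrap entries in Reverse to get a min-heap
    // ordered by (remaining, id).
    heap: BinaryHeap<Reverse<(u64, u64)>>,
}

impl SrptQueue {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    /// Insert a message into the ordered queue. If it is smaller than the
    /// remaining size of all other messages, it will be served next: this
    /// is exactly the pre-emption rule described above.
    fn push(&mut self, msg: Message) {
        self.heap.push(Reverse((msg.remaining, msg.id)));
    }

    /// Send one chunk of up to `chunk` bytes from the shortest message,
    /// re-queueing it if it is not yet finished. Returns the id served.
    fn send_chunk(&mut self, chunk: u64) -> Option<u64> {
        let Reverse((remaining, id)) = self.heap.pop()?;
        let left = remaining.saturating_sub(chunk);
        if left > 0 {
            self.heap.push(Reverse((left, id)));
        }
        Some(id)
    }
}
```

In this sketch, a newly pushed 50 KB message pre-empts a partially sent message with 100 KB remaining, since the heap re-orders on every pop.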

Here are some papers on the subject (in different theoretical settings, but the ideas should apply in a similar way):

  • http://www.bcamath.org/documentos_public/archivos/publicaciones/BSnetworksANORfinal.pdf
  • https://pubsonline.informs.org/doi/pdf/10.1287/opre.26.1.197
  • https://hal.inria.fr/hal-02570686/document
quentin added the
kind
improvement
label 2022-08-17 16:34:38 +00:00
Owner

Question: We are planning to move Netapp to a streaming-based API, in which clients and servers could use async streams to send and receive data. This means that Netapp would not necessarily know how much data remains to be sent on each stream it is currently processing, as that data has not been produced yet. Do we know what the optimal scheduling policy is when the remaining size is unknown?
lx added the
kind
performance
label 2022-08-31 17:27:55 +00:00
Owner

The netapp streaming PR that was merged recently (#343) prevents the use of SRPT because we don't know the length of streams that will be sent ahead of time. Instead it uses LAS scheduling (Least Attained Service). TODO: benchmark this.
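LAS needs no advance knowledge of stream lengths: it only tracks how many bytes each stream has already received, and always serves the stream with the least attained service. A minimal sketch, with hypothetical names not taken from netapp's actual code:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// LAS (Least Attained Service) scheduler: among active streams, always
/// serve the one that has been sent the fewest bytes so far. Unlike SRPT,
/// this requires no knowledge of total stream length.
struct LasScheduler {
    // Min-heap of (bytes_sent_so_far, stream_id).
    heap: BinaryHeap<Reverse<(u64, u64)>>,
}

impl LasScheduler {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    /// New streams start with zero attained service, so they are
    /// served before any stream that has already made progress.
    fn add_stream(&mut self, id: u64) {
        self.heap.push(Reverse((0, id)));
    }

    /// Pick the stream with the least attained service, account `chunk`
    /// bytes to it, and re-queue it. (A real implementation would drop
    /// finished streams instead of re-queueing them.)
    fn send_chunk(&mut self, chunk: u64) -> Option<u64> {
        let Reverse((sent, id)) = self.heap.pop()?;
        self.heap.push(Reverse((sent + chunk, id)));
        Some(id)
    }
}
```

Note that if all streams are the same size and arrive together, their attained-service counters stay in lockstep and LAS degenerates into Round-Robin, which is the concern raised in the next comment.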

lx added this to the v0.8 milestone 2022-09-14 11:38:43 +00:00
lx modified the milestone from v0.8 to v0.9 2022-09-19 13:14:01 +00:00
Author
Owner

As far as I know, LAS is a good idea for long-tail distribution of flow sizes (roughly speaking: many small flows, a few very big flows) when the actual flow size is not known in advance.

Do you have an idea of the typical distribution of stream sizes in Garage? If there are many streams of the same size, I have the feeling that LAS would be similar to Round-Robin.

Owner

After discussing with LX, it seems that in practice we always know the length of a stream ahead of time. Either the user is pushing an object of unknown size, in which case we need to buffer it anyway to compute its hash and choose a server; or the user is fetching an object, in which case we already have it in our system and can get its size.

Owner

The algorithm I have converged on for Garage v1.0 is the following:

  • First, poll and send packets for new streams, as these have a chance of being short messages that can be sent successfully in a single chunk.

  • For streams that require more than one chunk (this applies almost exclusively to data blocks), we use the basic Round-Robin strategy.

  • The block manager has a semaphore (buffer_kb_semaphore) that limits the maximum number of data chunks transferred concurrently. This avoids memory exhaustion from buffering too many chunks in RAM when network throughput is too slow (#788, #792).

For the time being, this solves the issue quite well.
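The two-tier strategy above can be sketched as follows. All names here (`Scheduler`, `next`, the byte budget standing in for `buffer_kb_semaphore`) are illustrative assumptions, not Garage's actual implementation:

```rust
use std::collections::VecDeque;

/// Sketch of the v1.0 strategy: new streams are polled first, multi-chunk
/// streams go through a Round-Robin queue, and a simple byte budget plays
/// the role of buffer_kb_semaphore for back-pressure.
struct Scheduler {
    new_streams: VecDeque<u64>, // polled first: may finish in one chunk
    round_robin: VecDeque<u64>, // multi-chunk streams (data blocks)
    budget_kb: u64,             // remaining buffer budget
}

impl Scheduler {
    /// Pick the next stream to send a chunk for, or None if the buffer
    /// budget is exhausted (back-pressure) or no stream is waiting.
    fn next(&mut self, chunk_kb: u64) -> Option<u64> {
        if chunk_kb > self.budget_kb {
            return None; // too many chunks already buffered in RAM
        }
        let id = if let Some(id) = self.new_streams.pop_front() {
            // New streams take priority over the round-robin tier.
            id
        } else {
            // Round-Robin: take from the front, re-queue at the back.
            let id = self.round_robin.pop_front()?;
            self.round_robin.push_back(id);
            id
        };
        self.budget_kb -= chunk_kb;
        Some(id)
    }

    /// Called once a buffered chunk has been flushed to the network,
    /// releasing its share of the budget (like releasing the semaphore).
    fn release(&mut self, chunk_kb: u64) {
        self.budget_kb += chunk_kb;
    }
}
```

The budget check before dequeueing is what prevents the memory-exhaustion scenario from #788/#792: a slow network stops the scheduler from producing new chunks instead of letting them pile up in RAM.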

lx closed this issue 2024-05-24 17:09:36 +00:00
Reference: Deuxfleurs/garage#361