Append / Stream #260

Closed
opened 2022-03-01 16:05:07 +00:00 by quentin · 1 comment

We had some questions about supporting appends/streams. Some use cases:

  • For Cryptpad (check my log)
  • For MLOps (https://news.ycombinator.com/item?id=30259418)

I think I understand the Cryptpad use case better than the MLOps one, but even for Cryptpad I would be curious to know how much data is stored for an average document, how large the updates are, how frequent they are, and so on.

Append/Streams on object storage APIs

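Some cloud providers have implemented append endpoints (with some restrictions):

  • Wasabi (https://wasabi.com/wp-content/themes/wasabi/docs/API_Guide/topics/Appending_to_Objects.htm)
  • Alibaba (https://www.alibabacloud.com/help/en/doc-detail/31851.htm)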

Append/Streams on Dynamo API

It might also be interesting to look at DynamoDB, as it may be useful in some cases. With Dynamo semantics, we can fetch multiple entries at once, and there is also a "list semantic" that lets us append data to a single entry.

Dynamo also has a stream feature (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) that generates a dynamic WAL (log) of what is happening.
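
To make the "list semantic" concrete, here is a minimal sketch of an append to a list attribute using DynamoDB's `list_append` update function. It only builds the request parameters in the low-level DynamoDB JSON format; the table name, key, and attribute name are hypothetical and chosen for illustration only.

```python
from typing import Any


def build_append_request(
    table: str, key: dict[str, Any], attr: str, messages: list[str]
) -> dict[str, Any]:
    """Build UpdateItem parameters that append `messages` to a list attribute.

    `table`, `key` and `attr` are placeholders (assumptions), not names
    taken from Garage or Cryptpad.
    """
    return {
        "TableName": table,
        "Key": key,
        # list_append concatenates two lists; if_not_exists covers the
        # first write, when the attribute does not exist yet.
        "UpdateExpression": "SET #log = list_append(if_not_exists(#log, :empty), :new)",
        "ExpressionAttributeNames": {"#log": attr},
        "ExpressionAttributeValues": {
            ":empty": {"L": []},
            ":new": {"L": [{"S": m} for m in messages]},
        },
    }


request = build_append_request(
    "pads", {"pad_id": {"S": "pad-42"}}, "messages", ["hello", "world"]
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("dynamodb").update_item(**request)
```

Note that a DynamoDB item is limited to 400 KB, so a list like this cannot grow without bound and would need to be split or compacted at some point.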

The Kinesis approach

Amazon recommends using Kinesis (https://docs.aws.amazon.com/firehose/latest/APIReference/Welcome.html) to append real-time logs to S3 files. This is probably not what we want.

About Cryptpad

Some insights we gathered during FOSDEM by talking with Cryptpad's team:

CryptPad would need to be able to make appends on S3 files. So that's a major performance hit if there is no append. Well, if a pad is active, there are messages coming in which need to be stored. They don't need to be stored right away, as the CryptPad server can send the messages to the other party and then store the data later, but this approach could create risks of inconsistencies. In the end it's important that the data is there, appended to the file representing the pad.
In any case the biggest issue is that the API that CryptPad uses would need to be rewritten in order to take into account different ways of implementing it; we might have APIs that are too close to the way we currently do it. It's mostly appending for the operation of pads. But to create a realtime editing session, what is needed is append logs. It doesn't make CryptPad automatically distributed though, as there are some application caches that would need to communicate if multiple servers talk to the same storage.

quentin added the Ideas label 2022-03-01 16:05:07 +00:00
quentin added this to the Speculative milestone 2022-03-01 16:05:12 +00:00

We will not implement appending/streaming to S3 objects as this is not supported by the S3 API and we don't want to implement custom non-standard features.

Workloads that work in an append/stream fashion can now be built using the new K2V API. Moreover, with the Bayou implementation in Aerogramme, we have an example of how to use such an append log structure on K2V with regular compaction into S3 objects so as to minimize the load on K2V (remember that K2V stores everything in the meta/ table so we don't want to overload it).
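
As a rough illustration of the pattern described above (small log entries in a key-value store, regularly compacted into a larger object), here is a toy in-memory sketch. It does not use the real K2V or S3 APIs and is not how Aerogramme's Bayou implementation works internally: the two dictionaries merely stand in for K2V and for S3 objects, and all names (`AppendLog`, `compact_every`, and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AppendLog:
    """Toy append log: small entries in a KV store, compacted into blobs."""

    compact_every: int = 4
    # Stand-in for K2V: (log id, sequence number) -> entry.
    kv: dict = field(default_factory=dict)
    # Stand-in for S3 objects: log id -> compacted content.
    blobs: dict = field(default_factory=dict)
    next_seq: dict = field(default_factory=dict)

    def _pending(self, log_id: str) -> list:
        """Return the not-yet-compacted entries of one log, in order."""
        return sorted(
            (seq, entry) for (lid, seq), entry in self.kv.items() if lid == log_id
        )

    def append(self, log_id: str, entry: bytes) -> None:
        """Store one entry, compacting once enough entries have piled up."""
        seq = self.next_seq.get(log_id, 0)
        self.kv[(log_id, seq)] = entry
        self.next_seq[log_id] = seq + 1
        if len(self._pending(log_id)) >= self.compact_every:
            self.compact(log_id)

    def compact(self, log_id: str) -> None:
        """Fold pending entries into the blob, then drop them from the KV store."""
        pending = self._pending(log_id)
        merged = b"".join(entry for _, entry in pending)
        self.blobs[log_id] = self.blobs.get(log_id, b"") + merged
        for seq, _ in pending:
            del self.kv[(log_id, seq)]

    def read(self, log_id: str) -> bytes:
        """Full content: the compacted blob plus the pending entries."""
        tail = b"".join(entry for _, entry in self._pending(log_id))
        return self.blobs.get(log_id, b"") + tail
```

With `compact_every=3`, the first two appends stay in the key-value stand-in, the third triggers a compaction that moves everything into the blob, and `read` always returns the full log whichever side the data currently lives on. This keeps the number of small entries bounded, which is the point of compacting regularly.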

Wrt Cryptpad, if we were to build a connector to Garage, it would make sense to build it upon K2V.

Closing for now, we can resume the discussion later.

lx closed this issue 2022-09-14 11:18:32 +00:00
Reference: Deuxfleurs/garage#260