Large Files fail to upload #662

Open
opened 2023-10-25 02:13:48 +00:00 by pdx_pizza · 1 comment

I have encountered a reproducible problem uploading large files, typically close to or greater than 1 TB in size. I can upload lots of data without issue; I have uploaded ~12 TB to this setup so far. But individual files of ~1 TB or larger fail every time. I verified that this also fails on older versions (tested with v0.8.2).

Current version:

garage v0.9.0 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]

Upload command:

aws s3 cp test3.tar s3://pizzatest/
upload failed: ./test3.tar to s3://pizzatest/test3.tar Read timeout on endpoint URL: "http://xxx.xxx.xxx.xxx:3900/pizzatest/test3.tar?uploadId=e06b810a5221c5aa16d1131d5c04bf8ce736e8ef5dd8db3f180b9018ddce6517"

File:

ls -l test3.tar
-rw-r--r-- 1 root users 1146850877441 Jul 4 17:18 test3.tar

Config setup:

cat garage.toml
metadata_dir = "/home/meta"
data_dir = "/home/data"
db_engine = "lmdb"

#replication_mode = "none"
replication_mode = "3"

rpc_bind_addr = "[::]:3901"
rpc_public_addr = "xxx.xxx.xxx.xxx:3901"
rpc_secret = "xxxx"

[s3_api]
s3_region = "garage"
api_bind_addr = "[::]:3900"
root_domain = ".s3.garage.localhost"

[s3_web]
bind_addr = "[::]:3902"
root_domain = ".web.garage.localhost"
index = "index.html"

[k2v_api]
api_bind_addr = "[::]:3904"

[admin]
api_bind_addr = "0.0.0.0:3903"
admin_token = "xxx="

Owner

The issue is probably linked to the fact that your 1 TB+ files get split into more than a million blocks if you keep the default block size of 1 MB. This means that Garage has to generate (when uploading the object) and read (when reading the object) a list of ~1M block IDs. In theory this is doable, but the problem you are seeing suggests you are stretching the limits of the system here. The first thing to try is increasing your block size to something like 10 MB or 100 MB, depending on your network conditions, and possibly increasing the part size of your multipart upload as well (to something like 1 GB). If that doesn't fix the issue, we will have to look into optimizing this code path.
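To make the scale concrete, here is a quick back-of-the-envelope calculation of how many blocks the reporter's file would be split into at different block sizes. The file size comes from the `ls -l` output above; the helper name is illustrative, not part of Garage's API.

```python
# Size of test3.tar from the reporter's `ls -l` output, in bytes (~1.07 TB).
FILE_SIZE = 1_146_850_877_441

def block_count(block_size_bytes: int) -> int:
    """Number of fixed-size blocks needed to cover FILE_SIZE (ceiling division)."""
    return -(-FILE_SIZE // block_size_bytes)

for label, size in [("1 MB (default)", 1 << 20),
                    ("10 MB", 10 << 20),
                    ("100 MB", 100 << 20)]:
    print(f"{label:>15}: {block_count(size):>9,} blocks")
# 1 MB blocks  -> 1,093,723 blocks
# 100 MB blocks ->    10,938 blocks
```

Going from 1 MB to 100 MB blocks shrinks the block list by two orders of magnitude. The block size is controlled by the `block_size` parameter in the top-level section of garage.toml (see the Garage configuration reference for the exact syntax in your version); the multipart part size on the client side can be raised with the AWS CLI, e.g. `aws configure set default.s3.multipart_chunksize 1GB`.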

lx added the kind/performance label 2023-10-25 07:36:32 +00:00
Reference: Deuxfleurs/garage#662