Block sync errors accumulating when using Mastodon #457

Closed
opened 2022-12-25 13:26:08 +00:00 by kaiyou · 10 comments
Contributor

We have been experiencing random block loss in our 2-node setup (edit: block loss does not seem related to the 2-node topology after all), and are investigating the issue on the garage side. We have been tracking this in our (French) internal bug tracker so far: https://forge.tedomum.net/tedomum/garage/-/issues/2

History

General context

We have been running a 2-node production garage cluster for the past 2 months or so, using replication_mode = "2". We are well aware of the risks of running a 2-node cluster in a single location (despite telling garage those nodes are in separate regions). We accept those risks for the time being and for our current use case.
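For context, this is roughly how such a layout is declared; the node ID prefixes, zone names and capacities below are placeholders, and replication_mode = "2" itself lives in garage.toml:

```bash
# Assign each node to its own zone so Garage keeps one copy per zone
# (node ID prefixes, zone names and capacities are placeholders).
garage layout assign -z zone-a -c 10 1a2b3c4d
garage layout assign -z zone-b -c 10 5e6f7a8b
garage layout show               # review the staged changes
garage layout apply --version 1  # commit the new layout
```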

After 4 weeks of use and freshly migrating our Mastodon instance to garage, we started noticing missing objects, while the application was not reporting any error. After investigating, the frontend was cleared of all suspicion, while the S3 backend was another story entirely.

Hardware failure

Garage logs were reporting missing blocks; we are not sure for how long, though, because metrics had only been scraped for so long:

![image](/attachments/516cc56e-ed70-472c-baac-062451cfa21c)

Rapid diagnosis showed that one of the two nodes had run into ATA failures, up to disconnecting the controller and corrupting some inodes, including some garage-related files:

[2355614.782308] ata1.00: status: { DRDY }
[2355614.782376] ata1: hard resetting link
[2355615.099416] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2355615.146996] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
...
[2587920.656733] EXT4-fs error (device sdc1): ext4_lookup:1619: inode #50332544: comm garage: iget: checksum invalid
[2587929.616588] EXT4-fs error (device sdc1): ext4_lookup:1619: inode #50332544: comm garage: iget: checksum invalid
[2588039.413170] EXT4-fs error (device sdc1): ext4_lookup:1619: inode #50332544: comm garage: iget: checksum invalid

The drive was quickly replaced with a new one and we ran a full fsck; the node then started again, yet showed inconsistent behavior at best.

We noticed that some zst files had not been properly renamed from their tmp temporary file after a successful write. Checksums were okay, so we renamed them all:

# Extract the block path between "data" and the extension, then rename .tmp -> .zst
cat tmpfiles-goodlist \
| sed 's/^.*\(data.*\.tmp\).*bytes$/\1/' \
| awk -F"data" '{print $2}' \
| awk -F"." '{print $1}' \
| xargs -n1 -I{} mv data{}.tmp data{}.zst
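For reference, a rough equivalent under stated assumptions (a generic data directory path, and zstd -t as the sanity check, which only validates the compressed frame, not Garage's content hash):

```bash
# Rename leftover .tmp block files back to .zst when the zstd frame is intact
# (the data directory path is an assumption; adjust to your data_dir).
find /var/lib/garage/data -name '*.tmp' -print0 |
while IFS= read -r -d '' f; do
  if zstd -t "$f" >/dev/null 2>&1; then
    mv -- "$f" "${f%.tmp}.zst"
  fi
done
```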

We also ran a full repair and scrub:

21  Thr:0.941  Block scrub worker
                 6.49% done (tranquility = 4) ; corruptions detected: 0
[...]
                 tranquility = 0, 732 blocks in queue, 595 blocks in error state
                 tranquility = 0, 732 blocks in queue, 595 blocks in error state
                 tranquility = 0, 732 blocks in queue, 595 blocks in error state
                 tranquility = 0, 732 blocks in queue, 595 blocks in error state
51  Done       Version repair worker
                 167231 items done
52  Done       Block refs repair worker
                 405708 items done
53  Done       Block repair worker
                 Phase 2: 100.00% done

This indeed helped recover some blocks:

![image](/attachments/200163de-ff9a-4e1c-898d-cefa35d710a5)

Wrongly created bucket

While the node seemed to run fine, fixing the blocks and getting it to serve objects again revealed another issue that had not occurred during the initial failure: new objects written to a specific bucket were successfully written and could be read from S3, yet were not exposed through the web endpoint. Listing from S3 only showed recent objects while the web endpoint served older ones.

This was traced back to a wrongly created local bucket for the key:

Key-specific bucket aliases:
  mastodon-data    xxx

Authorized buckets:
  RW   mastodon-data                 yyy
  RWO                 mastodon-data  xxx

Unaliasing the local bucket fixed the issue:

garage unalias --local GKxxx xxx

We also:

  • rcloned from the local bucket to the new bucket to recover missing objects,
  • emptied the local bucket,
  • deleted the local bucket (a rough sketch of these steps follows below).
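Roughly, the recovery looked like the following sketch; the rclone remote names are placeholders for however the two buckets were reachable, and the exact garage bucket delete invocation may differ by version:

```bash
# Copy everything from the wrongly created local bucket into the proper one,
# comparing by checksum, then empty and delete the local bucket.
rclone copy broken-remote:mastodon-data main-remote:mastodon-data --checksum --progress
rclone purge broken-remote:mastodon-data
garage bucket delete xxx   # confirmation flags may vary between versions
```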

Finally, this helped pinpoint a bug in the key-permission check for bucket creation, fixed in: 2ac75018a1.

Random missing blocks

Eventually, the cluster recovered general functionality. However, new blocks keep going missing while PutObject calls report no error. This happens at a steady rate of approximately 1% of all writes.

garage 2022-12-26T09:25:47.718417Z ERROR garage_block::resync: Error when resyncing 9ae9ff9724f7c14d: Unable to read block 9ae9ff9724f7c14d: no node returned a valid block                         
garage 2022-12-26T09:25:47.839923Z  INFO garage_block::resync: Resync block 870864e44892787b: fetching absent but needed block (refcount > 0)                
garage 2022-12-26T09:25:47.842481Z ERROR garage_block::resync: Error when resyncing 870864e44892787b: Unable to read block 870864e44892787b: no node returned a valid block

![image](/attachments/ddd44465-0170-4379-9a07-0880a145a83c)

By using the new block list-errors and block info (687660b27f) commands, we were able to list objects related to recent errors, which indeed match freshly written objects that the S3 client seemed to write successfully.

Most of them (approximately 75%) also show a dangling tmp file that is consistent yet was not properly renamed to zst.

Current work

Detailed traces

We are in the process of:

  • instrumenting garage with sysdig (see the example capture below),
  • enabling all debug logs,
  • collecting logs and traces when a new block goes missing.
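The sysdig capture mentioned above looks something like this (the filter is an approximation of what we run, not a verbatim command):

```bash
# Record garage's file-related syscalls to a capture file for offline analysis
sysdig -w garage-trace.scap \
  "proc.name=garage and evt.type in ('openat','write','rename','renameat','fsync')"
```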

Some facts

Based on traces, here are some facts about garage behavior while blocks go missing.

  • All write syscalls succeed
  • All written files in the block directory are associated with a single rename
  • All rename syscalls succeed
Author
Contributor

While writing this down, I started another instrumentation session in order to pinpoint the issue, a failing syscall, or anything else of note. It seems that no new block has gone missing for >24h, with no change on our side in the meantime.

![image](/attachments/a4b8ca46-8a62-4d8e-98b4-2325498d891b)

We will report back on how this goes; we are still trying to make sense of our existing traces.

Author
Contributor

A couple of days later: the issue is definitely not fixed; missing blocks keep accumulating.

We are making some progress though. It first seemed that this was some kind of write hole, where the client gets a success response from the backend while the data is in fact improperly written. We are not sure how we came to that conclusion, but it does seem false after all.

Our main use for garage is currently as a Mastodon S3 backend for media. Mastodon caches remote media, and our instance is a small one, so 99.9% of the bucket is actually cached media, and indeed all missing blocks except one are from cached media.

By comparing paths from block info with the Mastodon database, we established that Mastodon must be receiving some kind of error, even if it is not logged, because it does not write the attachment file name to its database:

# Export all missing blocks info
./garage block list-errors \
| cut -d' ' -f 1 | sed 1d \
| ./garage block info \
> missing_blocks
# Then load the block id list into database and run some joins
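As an illustration of the database side (assuming Mastodon's standard PostgreSQL schema, where paperclip stores the uploaded name in media_attachments.file_file_name; the file name below is taken from one of the affected keys):

```bash
# Check whether Mastodon recorded the attachment file name for an affected key
psql -d mastodon_production -c "
  SELECT id, remote_url, file_file_name, created_at
    FROM media_attachments
   WHERE file_file_name = '0f96e35a253058f8.mp4';"
```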

This also gives us a decent hypothesis as to why the number sometimes decreases, or stays stable. Because Mastodon catches the PutObject error (still unclear how) and records that the media was not properly cached, when a user later browses that part of the timeline, Mastodon retries caching the media, which succeeds.

Because the object path is the same and blocks are content-addressed, the block is then written and the path properly served from that point on. Some of our users must have real-time data displayed in their browser from time to time, and when their browser fetches the media this forces a refresh, which in turn might fix the missing block, all before metrics show anything (we poll every 10 minutes).

Author
Contributor

The hypothesis currently being tested: maybe the core issue is not on the garage side, but upstream in the chain.

What if, when caching a media file, Mastodon sometimes fails (the remote server fails, or any other networking error, or basically anything going glitchy)? What if it had already started calling PutObject and streaming data?

Then the object version would be created in garage, and possibly a block or even a couple of them. When PutObject fails, perhaps the current block is not written to storage but its metadata remains somehow, which would make for a failing block?

Owner

Are there actual objects that cannot be read in the bucket?

Author
Contributor

None of the objects listed in block info for missing blocks actually shows up in a call to ListObjects. Are the object version and key mentioned in block info stored in the blocks table directly? Which would mean there is actually no stored version of that object, only orphaned blocks?

Also, to answer your other question about unfinished multipart uploads: yes indeed (so our Mastodon instance must be failing many uploads for some reason and not reporting it in the logs):

Size: 110.7 GiB (118.9 GB)
Objects: 517817
Unfinished multipart uploads: 19566

After discussing this in the garage channel, it does not seem related to the 2-node topology, but to upload failures by Mastodon and how they end up showing up as missing blocks. I am renaming the issue accordingly.

kaiyou changed title from Losing random blocks in a 2-node setup to Block sync errors accumulating when using Mastodon 2023-01-02 19:21:45 +00:00
Owner

> Are the object version and key mentioned in block info stored in the blocks table directly?

In the block table there is an identifier of the version that owns the block. In the version table, there is the key (path) of the object for which this is a version.

> Which would mean there is actually no stored version of that object, only orphaned blocks?

I think there is actually a stored version of the object, but it is in unfinished state, which seems to be consistent with "unfinished multipart uploads" > 0.

You should try garage bucket cleanup-unfinished-uploads --delete-older-than 1d <bucket name>; the number of resync errors should go down to close to zero quite rapidly.
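For this cluster, a concrete invocation would be something like (bucket name taken from earlier in the thread):

```bash
# Purge multipart uploads started more than a day ago, then watch errors drop
garage bucket cleanup-unfinished-uploads --delete-older-than 1d mastodon-data
garage block list-errors | wc -l   # should trend toward zero after resync
```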

Overall, I think there are two sub-issues we can make from this:

  • when a PutObject cannot be completed, some stale data is kept and Garage continues to want to store blocks it doesn't need (a new issue should be opened for this)

  • unfinished uploads are not cleaned up automatically (#309 is a plan to fix this)

Author
Contributor

Content warning: please, please do not attempt to access the cached file mentioned in the following message. It was posted publicly and is technically legal under French law, yet disturbing enough that I regret opening it during my debugging session.

I noted an interesting behavior that might help shed light on the issue. Most blocks (1570) reporting as errored have a refcount of 1, while a couple (3) of them have a refcount of 2. Those with refcount 2 are actually referenced by two different versions at very similar keys; only the filename, which is randomly generated by Mastodon on upload, changes (which suggests caching was retried and failed again for those):

Block hash: 282e247b17e0aa3c389b53135b7adc6bd358f23d1a1df5bdf6427c440ceafc09
Refcount: 2

Version           Bucket            Key                                                                                  Deleted
b6770bbeaa38e247  c3d3a41d734daead  cache/media_attachments/files/106/581/428/466/857/711/original/0f96e35a253058f8.mp4  false
ddd91f1a01637e34  c3d3a41d734daead  cache/media_attachments/files/106/581/428/466/857/711/original/c58ee8b8af3deeb3.mp4  false

None of the keys are returned by ListObjects.

Also, all blocks that were listed as errors but are not anymore have a refcount of 2. This makes sense because the randomly generated name was different upon successful retry. This also means that those blocks were actually received completely and consistently in the first place, because the hash matches.

Block hash: d29f31baa3a586965fe201f4dc7613839302f754ec7be9ca1983bc7cd8c8891b
Refcount: 2

Version           Bucket            Key                                                                                  Deleted
879f83780ae42a2e  c3d3a41d734daead  cache/media_attachments/files/109/585/722/433/229/264/original/40607a6a50fd0a98.mp4  false
be1e65f1f931fba3  c3d3a41d734daead  cache/media_attachments/files/109/585/722/433/229/264/original/4fae45f98035a6b1.mp4  false

In this case, exactly one of the two keys is returned by ListObjects and served correctly by GetObject.

My understanding is that a block being referenced by two versions, one of which is not served, would be consistent with an incomplete multipart upload as you suggested (Object created, Version created, some blocks created, but ObjectVersion still in uploading state due to a failing or missing CompleteMultipartUpload). I completely understand why a failing implementation, or even occasional errors, would increase the incomplete upload count.

Mastodon indeed uses kt-paperclip, which in turn relies on aws-sdk-s3, which uses the dynamic Object#upload_file with automatic multipart support (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Object.html#upload_file-instance_method).

This is confirmed by listing pending multipart uploads for a given key referenced in a missing block:

{
        "mastodon-data": [
                {
                        "ChecksumAlgorithm": null,
                        "Initiated": "2022-12-27T12:50:27.8Z",
                        "Key": "cache/media_attachments/files/109/585/722/433/229/264/original/4fae45f98035a6b1.mp4",
                        "StorageClass": "STANDARD",
                        "UploadId": "be1e65f1f931fba3a59a3b9031ccd9629eec54db82cd2b356ebc674648546cf4"
                }
        ]
}
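(For reference, that listing comes from the S3 API; with the AWS CLI against our endpoint it would look roughly like the following, assuming credentials for the key are configured:)

```bash
# List in-progress multipart uploads under the affected key prefix
aws s3api list-multipart-uploads \
  --endpoint-url https://objects.tedomum.net \
  --bucket mastodon-data \
  --prefix cache/media_attachments/files/109/585/722/433/229/264/
```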

Errored blocks with refcount 2 even have 2 pending uploads, which sounds really consistent with your hypothesis.

However, I still do not understand how this would create block metadata without writing anything to disk. In fact, most pending uploads have valid referenced blocks; only a fraction has blocks missing on disk on top of incomplete multipart uploads.

Also, digging in the logs (or network traces) shows that most failing PUT requests do not look like UploadPart (no part number) but more like a plain PutObject:

2022-12-27T12:50:27.80004636Z stderr F 2022-12-27T12:50:27.799869Z  INFO garage_api::generic_server: [::ffff:10.21.1.210]:47472 PUT /mastodon-data/cache/media_attachments/files/109/585/722/433/229/264/original/4fae45f98035a6b1.mp4
2022-12-27T12:50:27.800093231Z stderr F 2022-12-27T12:50:27.799903Z DEBUG garage_api::generic_server: Request { method: PUT, uri: /mastodon-data/cache/media_attachments/files/109/585/722/433/229/264/original/4fae45f98035a6b1.mp4, version: HTTP/1.1, headers: {"host": "objects.tedomum.net", "user-agent": "aws-sdk-ruby3/3.130.2 ruby/3.0.3 x86_64-linux aws-sdk-s3/1.114.0", "content-length": "8254608", "accept": "*/*", "accept-encoding": "", "accept-encoding": "gzip", "authorization": "AWS4-HMAC-SHA256 Credential=GKxxx/20221227/garage/s3/aws4_request, SignedHeaders=cache-control;content-md5;content-type;host;user-agent;x-amz-acl;x-amz-content-sha256;x-amz-date, Signature=xxx", "cache-control": "public, max-age=315576000, immutable", "content-md5": "FrwZX1rbB4RfqRUZF13H2Q==", "content-type": "video/mp4", "expect": "100-continue", "x-amz-acl": "public-read", "x-amz-content-sha256": "4678f4d9d3475f3cea0341b7aef4d0fa007f04e8027bb6d7622859049b75572b", "x-amz-date": "20221227T125027Z", "x-forwarded-for": "10.21.1.1", "x-forwarded-host": "objects.tedomum.net", "x-forwarded-port": "443", "x-forwarded-proto": "https", "x-forwarded-server": "traefik-6b47f84dfd-djrxd", "x-real-ip": "10.21.1.1"}, body: Body(Streaming) }

Moreover, Mastodon sets a 15 MB threshold for multipart uploads, which most impacted files do not reach. This is why I am guessing there is indeed an issue with Mastodon not completing some multipart uploads and leaving uploaded parts in the bucket (which I will file with Mastodon). However, while possibly related, I think the missing blocks are a separate issue that relates to the way blocks are created, both for PutObject and UploadPart calls.

Author
Contributor

Ok, I just read your previous message and I guess you figured out the part with PutObject calls.

I will clean unfinished uploads then, confirm that all counters drop to zero and close this issue in favor of the more specific one you suggested.

Thank you so much for helping with this!

Owner

> However, I still do not understand how this would create block metadata without writing anything to disk. In fact, most pending uploads have valid referenced blocks; only a fraction has blocks missing on disk on top of incomplete multipart uploads.

It depends at what point the upload gets interrupted in Garage. For each block, two things are done simultaneously:

  1. store the block on storage nodes
  2. store a reference to the block

If step 1 has time to finish but not step 2, you end up with a block with no reference to it, which would be cleaned up at some point by a block repair. Conversely, if step 2 finished but no node was able to complete the storage of the block (step 1), you end up with a reference to a block that was never correctly stored.

Author
Contributor

This is generally fixed, I created the (probably also now fixed) follow-up issue: #469
