Block sync errors accumulating when using Mastodon #457
We have been experiencing random block loss in our 2-node setup (edit: block loss does not seem related to the 2-node topology after all), and are investigating the issue on the garage side. We have been tracking this in our (French) internal bug tracker so far: https://forge.tedomum.net/tedomum/garage/-/issues/2
History
General context
We have been running a 2-node production garage cluster for the past 2 months or so, using replication_mode = "2". We are well aware of the risks of running a 2-node cluster in a single location (despite telling garage the nodes are in separate regions), and we accept those risks for the time being and for our current use case.

After 4 weeks of use, having freshly migrated our Mastodon instance to garage, we started noticing missing objects, while the application was not reporting any error. After investigating, the frontend was cleared of all suspicion; the S3 backend was another story entirely.
Hardware failure
Garage logs were reporting missing blocks; we are not sure for how long, because metrics had only been scraped for a limited time:
A quick diagnosis showed that one of the two nodes had run into ATA failures, going as far as disconnecting the controller and corrupting some inodes, including some garage-related files:
The drive was quickly replaced with a new one and we ran a full fsck. The node then started again, yet showed inconsistent behavior at best.

We noticed that some zst files had not been properly renamed from their tmp temporary file after a successful write. Checksums were okay, so we renamed them all. We also ran a full repair and scrub:
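A minimal sketch of that kind of rename pass (not the exact script we ran), assuming leftover temporary files simply carry a .tmp suffix next to their final path; the checksum verification done before renaming is left out:

```python
#!/usr/bin/env python3
"""Promote leftover *.tmp block files to their final name when the final
file is missing. Sketch only: a real pass should verify the block's
checksum before renaming, and the .tmp naming scheme is an assumption."""
import os
import sys

data_dir = sys.argv[1]  # e.g. /var/lib/garage/data

for root, _dirs, files in os.walk(data_dir):
    for name in files:
        if not name.endswith(".tmp"):
            continue
        tmp_path = os.path.join(root, name)
        final_path = tmp_path[: -len(".tmp")]
        if os.path.exists(final_path):
            # A final copy already exists; leave the leftover alone.
            print(f"skip (final exists): {tmp_path}")
            continue
        print(f"rename {tmp_path} -> {final_path}")
        os.rename(tmp_path, final_path)
```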
This indeed helped recover some blocks:
Wrongly created bucket
While the node seemed to run fine, fixing the blocks and getting it to serve objects again revealed another issue that had not occurred during the initial failure: objects newly written to one specific bucket were successfully written and could be read over S3, yet were not exposed through the web endpoint. Listing over S3 only showed recent objects, while the web endpoint served older ones.
This was traced back to a wrongly created local bucket for the key:
Unaliasing the local bucket fixed the issue:
We also:
Finally, this helped pinpoint a bug in the key permission check for bucket creation, fixed in 2ac75018a1.
Random missing blocks
Eventually, general operation of the cluster was recovered. However, some new blocks keep going missing while PutObject calls report no error. This happens at a steady rate of approximately 1% of all writes.

By using the new block list-errors and block info commands (687660b27f), we were able to list the objects related to recent errors, and they indeed match freshly written objects that the S3 client seemed to write successfully. Most of them (approximately 75%) also show a dangling tmp file that is consistent yet was not properly renamed to zst.

Current work
Detailed traces
We are in the process of:
Some facts
Based on traces, here are some facts about garage behavior while blocks go missing: write syscalls succeed, and rename syscalls succeed.

While writing this down, I started another instrumentation session in order to pinpoint the issue, the failing syscall, or whatever else it might be. It seems that no new block has gone missing for >24h, with no change on our side in the meantime.
We will report back on how this goes, still trying to make sense of our existing traces.
A couple of days later: the issue is definitely not fixed; missing blocks keep accumulating.
We are making some progress though. It first seemed that this was some kind of write hole, where the client gets a success response from the backend while the data is in fact improperly written. We are not sure how we came to that conclusion, but it does seem to be false after all.
Our main use for garage is currently as a Mastodon S3 backend for media. Mastodon caches remote media, and our instance is a small one, so 99.9% of the bucket is actually cached media, and all missing blocks except one are indeed from cached media.
By comparing paths from block info and the Mastodon database, we establish that Mastodon must be receiving some kind of error, even if it is not logged, because it does not write the attachment file name to its database.

This also gives us a decent hypothesis as to why the number of errors sometimes decreases, or stays stable. Because Mastodon catches the PutObject error (still unclear how) and records that the media was not properly cached, when a user later browses that part of the timeline, Mastodon retries caching the media, which succeeds.

Due to the object path being the same and blocks being content-addressed, the block is then written and the path is properly served from that point on. Some of our users must have real-time data displayed in their browser from time to time, which forces a refresh when their browser fetches the media, which in turn might fix the missing block, all of this before metrics show anything (we poll every 10 minutes).

The hypothesis currently being tested: maybe the core issue is not on the garage side, but upstream in the chain.

What if, when caching a media file, Mastodon sometimes fails (the remote server fails, or any other networking error, or basically anything going glitchy)? What if it had already started calling PutObject and streaming data? Then the object version would be created in garage, and possibly a block or even a couple of them. When PutObject fails, maybe the current block is not written to storage, but its metadata remains somehow, which would make for a failing block?

Are there actual objects that cannot be read in the bucket?

None of the objects listed in block info for missing blocks actually shows up in a call to ListObjects. Are the object version and key mentioned in block info stored in the blocks table directly, which would mean there is actually no stored version of that object, only orphaned blocks?

Also, to answer your other question about unfinished multipart uploads: yes indeed (so our Mastodon instance must be failing many uploads for some reason and not reporting it in the logs):
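For illustration, this kind of check can be done against the S3 API directly; a minimal boto3 sketch, where the endpoint, credentials, bucket and key are placeholders (the key would be one reported by block info):

```python
import boto3
from botocore.exceptions import ClientError

# Placeholders: adjust endpoint, credentials, bucket and key to your setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)
bucket = "mastodon-media"
key = "cache/media_attachments/files/example.png"  # hypothetical key from `block info`

# Does the key show up in a listing?
listing = s3.list_objects_v2(Bucket=bucket, Prefix=key)
print("listed keys:", [obj["Key"] for obj in listing.get("Contents", [])])

# Can the object actually be read?
try:
    s3.head_object(Bucket=bucket, Key=key)
    print("object is readable")
except ClientError as err:
    print("object cannot be read:", err.response["Error"]["Code"])
```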
After discussing this in the garage channel, it does not seem related to the 2-node topology, but to upload failures by Mastodon and how they end up showing up as missing blocks. I am renaming the issue accordingly.
Changed title from Losing random blocks in a 2-node setup to Block sync errors accumulating when using Mastodon.

In the block table there is an identifier of the version that owns the block. In the version table, there is the key (path) of the object for which this is a version.
I think there is actually a stored version of the object, but it is in an unfinished state, which seems to be consistent with "unfinished multipart uploads" > 0.
You should try a garage bucket cleanup-unfinished-uploads --delete-older-than 1d <bucket name>; the number of resync errors should go down to close to zero quite rapidly (an S3-level equivalent is sketched below).

Overall, I think there are two sub-issues we can make from this:
when a PutObject cannot be completed, some stale data is kept and Garage continues to want to store blocks it doesn't need (a new issue should be opened for this)
unfinished uploads are not cleaned up automatically (#309 is a plan to fix this)
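For reference, roughly the same cleanup can also be done through the plain S3 API by aborting old multipart uploads; a boto3 sketch with placeholder endpoint, credentials and bucket name (the garage command above remains the simpler option):

```python
import boto3
from datetime import datetime, timedelta, timezone

# Placeholders: adjust endpoint, credentials and bucket name.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)
bucket = "mastodon-media"
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

# Walk all pending multipart uploads and abort the ones older than 1 day.
paginator = s3.get_paginator("list_multipart_uploads")
for page in paginator.paginate(Bucket=bucket):
    for upload in page.get("Uploads", []):
        if upload["Initiated"] < cutoff:
            print("aborting", upload["Key"], upload["UploadId"])
            s3.abort_multipart_upload(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            )
```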
Content warning: please, please do not attempt to access the cached file mentioned in the following message. It was posted publicly and is technically legal under French law, yet disturbing enough that I regret opening it during my debugging session.
I noted an interesting behavior that might help shed light on the issue. Most blocks (1570) reported as errored have a refcount of 1, while a couple (3) of them have a refcount of 2. Those with refcount 2 are actually referenced by two different versions at very similar keys; only the filename, which is randomly generated by Mastodon on upload, changes (which suggests caching was retried and failed again for those):
None of the keys are returned by ListObjects.

Also, all blocks that were listed as errors but are not anymore also have a refcount of 2. This makes sense because the randomly generated name was different upon the successful retry. It also means that those blocks were actually received completely and consistently in the first place, because the hash matches.
In this case, exactly one of the two keys is returned by ListObjects, and it is served correctly by GetObject.

My understanding is that a block being referenced by two versions, one of which is not served, would be consistent with an incomplete multipart upload as you suggested (Object created, Version created, some blocks created, but ObjectVersion still in the uploading state due to a failing or missing CompleteMultipartUpload). I completely understand why a failing implementation, or even occasional errors, would increase the incomplete upload count.

Mastodon indeed uses kt-paperclip, which in turn relies on aws-sdk-s3, which uses Object::upload_file with automatic multipart support (https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/S3/Object.html#upload_file-instance_method).

This is confirmed by listing pending multipart uploads for a given key referenced in a missing block:
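For illustration, the same kind of listing can be scripted against the S3 API; a minimal boto3 sketch with placeholder endpoint, credentials, bucket and key prefix:

```python
import boto3

# Placeholders: adjust endpoint, credentials, bucket and key prefix.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:3900",
    aws_access_key_id="GK...",
    aws_secret_access_key="...",
)

# List the multipart uploads still pending for one key (or key prefix)
# referenced by a missing block.
resp = s3.list_multipart_uploads(
    Bucket="mastodon-media",
    Prefix="cache/media_attachments/files/",  # hypothetical key prefix
)
for upload in resp.get("Uploads", []):
    print(upload["Key"], upload["UploadId"], upload["Initiated"])
```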
Errored blocks with refcount 2 even have 2 pending uploads, which sounds really consistent with your hypothesis.
However, I still do not understand how this would create block metadata without anything being written to disk. In fact, most pending uploads have valid referenced blocks; only a fraction have blocks missing on disk on top of the incomplete multipart upload.
Also, digging into the logs (or network traces) shows that most failing PUT requests do not look like UploadPart (no part number) but more like a plain PutObject:

Moreover, Mastodon sets a 15 MB threshold for multipart uploads, which most impacted files do not reach. This is why I am guessing there is indeed an issue with Mastodon not completing some multipart uploads and leaving uploaded parts in the bucket (which I will file with Mastodon). However, while possibly related, I think the missing blocks are a separate issue that relates to the way blocks are created, both for PutObject and UploadPart calls.

Ok, I just read your previous message, and I guess you figured out the part about PutObject calls. I will clean up the unfinished uploads, confirm that all counters drop to zero, and close this issue in favor of the more specific one you suggested.
Thank you so much for helping with this!
It depends on at what point the upload gets interrupted in Garage. For each block, two things are done simultaneously:
1. the block data is stored on the storage node(s)
2. a reference to the block is written in the metadata
If 1 has the time to finish but not 2, you would end up with a block with no reference to it, which would be cleaned up at some point by a block repair. Conversely, if 2 finishes but no node is able to complete the storage of the block (1), you would end up with a reference to a block that was never correctly stored.
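As a purely illustrative model of that race (a Python toy, not Garage's actual Rust implementation):

```python
import asyncio

async def store_block_data(block_hash: str, data: bytes) -> str:
    """(1) Ship the block's bytes to the storage node(s)."""
    await asyncio.sleep(0.05)  # stands in for network + disk write
    return "data stored"

async def write_block_reference(block_hash: str, version_id: str) -> str:
    """(2) Record that this object version references the block."""
    await asyncio.sleep(0.01)  # stands in for a metadata table update
    return "reference written"

async def put_block(block_hash: str, data: bytes, version_id: str):
    # Both steps run concurrently. If the upload is interrupted so that only
    # (1) completes, the result is an unreferenced block (cleaned up later by
    # a block repair); if only (2) completes, the result is a reference to a
    # block that was never stored, i.e. a lasting resync error.
    return await asyncio.gather(
        store_block_data(block_hash, data),
        write_block_reference(block_hash, version_id),
        return_exceptions=True,
    )

print(asyncio.run(put_block("deadbeef", b"...", "v1")))
```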
This is generally fixed; I created the (probably also now fixed) follow-up issue: #469