garage crashes on start #754

Closed
opened 2024-03-03 21:02:40 +00:00 by mr_tron · 3 comments

One node fell out of the cluster and now can't rejoin it.

2024-03-03T20:57:06.599673Z  INFO garage::server: Loading configuration...
2024-03-03T20:57:06.600064Z  INFO garage::server: Initializing Garage main data store...
2024-03-03T20:57:06.600155Z  INFO garage_model::garage: Opening database...
2024-03-03T20:57:06.600236Z  INFO garage_model::garage: Opening LMDB database at: /var/lib/garage/meta/db.lmdb
2024-03-03T20:57:06.600614Z  INFO garage_model::garage: Initialize background variable system...
2024-03-03T20:57:06.600635Z  INFO garage_model::garage: Initialize membership management system...
2024-03-03T20:57:06.600713Z  INFO garage_rpc::system: Node ID of this node: 5c59b2c749541080
2024-03-03T20:57:06.601044Z  INFO garage_model::garage: Initialize block manager...
2024-03-03T20:57:06.602004Z  INFO garage_model::garage: Initialize bucket_table...
2024-03-03T20:57:06.602354Z  INFO garage_model::garage: Initialize bucket_alias_table...
2024-03-03T20:57:06.602626Z  INFO garage_model::garage: Initialize key_table_table...
2024-03-03T20:57:06.602887Z  INFO garage_model::garage: Initialize block_ref_table...
2024-03-03T20:57:06.603153Z  INFO garage_model::garage: Initialize version_table...
2024-03-03T20:57:06.603422Z  INFO garage_model::garage: Initialize multipart upload counter table...
2024-03-03T20:57:06.603726Z  INFO garage_model::garage: Initialize multipart upload table...
2024-03-03T20:57:06.603987Z  INFO garage_model::garage: Initialize object counter table...
2024-03-03T20:57:06.604264Z  INFO garage_model::garage: Initialize object_table...
2024-03-03T20:57:06.604538Z  INFO garage_model::garage: Load lifecycle worker state...
2024-03-03T20:57:06.604711Z  INFO garage_model::garage: Initialize K2V counter table...
2024-03-03T20:57:06.605014Z  INFO garage_model::garage: Initialize K2V subscription manager...
2024-03-03T20:57:06.605036Z  INFO garage_model::garage: Initialize K2V item table...
2024-03-03T20:57:06.605332Z  INFO garage_model::garage: Initialize K2V RPC handler...
2024-03-03T20:57:06.605435Z  INFO garage::server: Initializing background runner...
2024-03-03T20:57:06.605530Z  INFO garage::server: Spawning Garage workers...
2024-03-03T20:57:06.605807Z  INFO garage::server: Initialize Admin API server and metrics collector...
2024-03-03T20:57:06.605856Z  INFO garage::server: Launching internal Garage cluster communications...
2024-03-03T20:57:06.606175Z  INFO garage::server: Create admin RPC handler...
2024-03-03T20:57:06.606215Z  INFO garage::server: Initializing S3 API server...
2024-03-03T20:57:06.606242Z  INFO garage::server: Launching Admin API server...
2024-03-03T20:57:06.606554Z  INFO garage_api::generic_server: S3 API server listening on http://[::]:3900
2024-03-03T20:57:06.606554Z  INFO garage_api::generic_server: Admin API server listening on http://0.0.0.0:3903
2024-03-03T20:57:06.607363Z ERROR garage_util::background::worker: Error in worker bucket_object_counter queue (TID 29): Too many errors
2024-03-03T20:57:06.635171Z ERROR garage_util::background::worker: Error in worker version GC (TID 40): in try_send_and_delete in table GC:
GC: send tombstones
Could not reach quorum of 2. 0 of 2 request succeeded, others returned errors: ["Netapp error: Not connected: 330fe77f09552f2a", "Netapp error: Not connected: 52f5c8753ee5b49b"]
/build/lmdb-rkv-sys-0.11.2/lmdb/libraries/liblmdb/mdb.c:3267: Assertion 'len >= 0 && id <= env->me_pglast' failed in mdb_freelist_save()

We had a similar issue today. Check whether the node that fell off sees the other nodes with proper IP addresses. For us, one of the nodes was listed as 127.0.0.1:port. We were unable to track down the root cause. Calling connect for each of the other nodes helped us. Something like:

garage node connect HASH@storage-b.original.instance:3901

We used connect the same way as in the documentation: https://garagehq.deuxfleurs.fr/documentation/cookbook/real-world/
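
For anyone checking this by hand, here is a minimal Python sketch of the idea: flag peer addresses that point at loopback and print a connect command for the remaining ones. The peer list, the `HASH_A`/`HASH_B` node IDs and both helper functions are hypothetical placeholders; only the `garage node connect ID@host:port` form comes from the command above. DNS names are not resolved, so they are assumed to be fine.

```python
import ipaddress


def is_loopback(peer: str) -> bool:
    """Return True if a 'NODE_ID@host:port' peer string points at a loopback address."""
    _, _, addr = peer.rpartition("@")
    host, _, _port = addr.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::1]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a DNS name: cannot tell without resolving it


def connect_commands(peers: list[str]) -> list[str]:
    """Build one `garage node connect` command per peer with a non-loopback address."""
    return [f"garage node connect {p}" for p in peers if not is_loopback(p)]


# Hypothetical peer list; the node IDs are placeholders, not real ones.
peers = [
    "HASH_A@127.0.0.1:3901",
    "HASH_B@storage-b.original.instance:3901",
]
for cmd in connect_commands(peers):
    print(cmd)
```

A peer that shows up with a loopback address is skipped here on purpose: it needs its real address looked up before connecting.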

Owner
/build/lmdb-rkv-sys-0.11.2/lmdb/libraries/liblmdb/mdb.c:3267: Assertion 'len >= 0 && id <= env->me_pglast' failed in mdb_freelist_save()

This line seems to indicate that your LMDB database file is corrupted. The best option is probably to rebuild it from scratch if you can (see "recovering from failures": https://garagehq.deuxfleurs.fr/documentation/operations/recovering/).

@kinovic Do you also have a similar message in your logs regarding LMDB?
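
To check a log for this kind of message, a small Python sketch like the following could help. It only looks for the assertion format quoted above; the regular expression, the function name and the sample text are assumptions for illustration, and other LMDB failures may be reported differently.

```python
import re

# Matches LMDB assertion failures shaped like the one quoted in this issue.
LMDB_ASSERT = re.compile(r"mdb\.c:\d+: Assertion '.*' failed in (\w+)\(\)")


def lmdb_assertions(log_text: str) -> list[str]:
    """Return the LMDB function name for each failed assertion found in the log."""
    return [m.group(1) for m in LMDB_ASSERT.finditer(log_text)]


# Sample built from two lines of the log posted in this issue.
sample = (
    "2024-03-03T20:57:06.606242Z  INFO garage::server: Launching Admin API server...\n"
    "/build/lmdb-rkv-sys-0.11.2/lmdb/libraries/liblmdb/mdb.c:3267: "
    "Assertion 'len >= 0 && id <= env->me_pglast' failed in mdb_freelist_save()\n"
)
print(lmdb_assertions(sample))  # prints ['mdb_freelist_save']
```

An empty list means no assertion of this shape was found, which is what one would expect for a pure connectivity problem.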

@lx I misread the issue... We had a different problem, then. Thank you for your explanation.

Reference: Deuxfleurs/garage#754