Panic during application of new cluster layout in 0.8.0-rc1 #414

Closed
opened 3 weeks ago by mediocregopher · 16 comments

Seeing a panic when trying to add a new instance to our cluster via the admin REST interface. The instance that the `POST /v0/layout/apply` is being performed against is the same one being added to the cluster. The logs on that instance look like:

```
2022-11-12T23:29:12.791487Z  INFO garage_api::generic_server: 10.242.0.3:51512 POST /v0/layout/apply
2022-11-12T23:29:12.791495Z  INFO garage_api::generic_server: 10.242.0.3:51512 POST /v0/layout/apply
Calculating updated partition assignation, this may take some time...
2022-11-12T23:29:12.796063Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
2022-11-12T23:29:12.796069Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
```

Even on another host, when running `garage layout show`, we can get the same panic:

```
2022-11-12T23:30:25.874951Z  INFO netapp::netapp: Connected to 10.242.0.1:3900, negotiating handshake...
2022-11-12T23:30:25.923971Z  INFO netapp::netapp: Connection established to 3e44b73b1dc65666
==== CURRENT CLUSTER LAYOUT ====
ID                Tags  Zone              Capacity
3e44b73b1dc65666        mediocre-desktop  1
3e70b1a7dabfd25b        mediocre-desktop  1
7d4aafc54fdae0f8        mediocre-desktop  1

Current cluster layout version: 1

==== STAGED ROLE CHANGES ====
ID                Tags  Zone       Capacity
c5c728378148e374        boobytrap  1

==== NEW CLUSTER LAYOUT AFTER APPLYING CHANGES ====
ID                Tags  Zone              Capacity
3e44b73b1dc65666        mediocre-desktop  1
3e70b1a7dabfd25b        mediocre-desktop  1
7d4aafc54fdae0f8        mediocre-desktop  1
c5c728378148e374        boobytrap         1

Calculating updated partition assignation, this may take some time...

2022-11-12T23:30:25.941615Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
2022-11-12T23:30:25.941624Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
[1]    2256531 IOT instruction  sudo cryptic-net garage cli layout show
```

This is all on garage `0.8.0-rc1` (`2197753d`), all linux/amd64.

I can attempt to retrieve the exact request being made to the instance, but it will take a bit of effort; I'm hoping the bug/solution will be evident just from this 🙏 Please let me know if I need to provide more info.

quentin added the
Bug
label 3 weeks ago
Owner

The error seems to be in the layout algorithm; more specifically, the algorithm detects that an invariant has been violated and crashes instead of building a buggy layout. So it (probably) means that 1) you have found a bug and 2) we caught it early with this assertion.

@mediocregopher : Thanks for reporting this bug. I have 3 questions/requests for you that could help us solve it:

  1. Could you try to add a third node defined in a 3rd zone, try the layout change again, and report whether or not you still have this issue?
  2. Could you share with us your configuration file (without its secrets of course!)?
  3. Are you able to reproduce it reliably? If so, could you share with us a step-by-step guide?

@lx : officially, our current algorithm should take a best-effort approach to spreading data over different zones when `replication_factor` is set to 3 but only 2 zones are available. Would it be possible that we still have a regression/bug in this logic?
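The invariant behind that assertion can be illustrated with a toy sketch (not Garage's actual code): pick `replication_factor` nodes for one partition, preferring distinct zones, and fall back to reusing zones when fewer zones than replicas exist — the "best effort" behaviour described above.

```python
def assign_partition(nodes, rf):
    """Toy illustration: choose rf nodes for one partition from a list of
    (node_id, zone) pairs, spreading over distinct zones on a best-effort
    basis. Not Garage's real algorithm."""
    chosen, used_zones = [], set()
    # First pass: at most one node per distinct zone.
    for node_id, zone in nodes:
        if len(chosen) == rf:
            break
        if zone not in used_zones:
            chosen.append(node_id)
            used_zones.add(zone)
    # Second pass: fill remaining replicas, reusing zones if needed.
    for node_id, zone in nodes:
        if len(chosen) == rf:
            break
        if node_id not in chosen:
            chosen.append(node_id)
    # This is the invariant that the assertion in layout.rs checks:
    assert len(chosen) == rf
    return chosen
```

With 3 nodes in a single zone and rf=3, the second pass fills all replicas from the same zone; with 2 zones, both zones get used first and the remainder is filled best-effort.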


Thanks @quentin, glad the assertion caught the bug :)

  1. Yes, even with a third instance in a third zone the bug still happens. It doesn't seem to matter if I'm applying the layout change on the new instance or an existing instance either.

  2. Attached is the TOML file for the instance being added. The other config files differ only in their storage locations and bind addresses.

  3. Yes, it happens every time. I added some logging to capture the exact HTTP requests:

** REQUEST 1 **

```
POST /v0/layout HTTP/1.1
Host: 10.242.0.1:3902
User-Agent: Go-http-client/1.1
Content-Length: 480
Authorization: Bearer ADMINSECRET
Accept-Encoding: gzip

{"3e44b73b1dc65666f6a2f7bdd14e1a0742298fb4becfc19e6b84c7f772b49f82":{"capacity":1,"zone":"mediocre-desktop","tags":[]},"3e70b1a7dabfd25bb70804a09e1a7bcda667ce10d1d0f9633a8ea473da274adc":{"capacity":1,"zone":"mediocre-desktop","tags":[]},"660fbe3df998664339778886325963b369112e155d6a5632da92c683bb5da0fd":{"capacity":1,"zone":"mediocre-other-desktop","tags":[]},"7d4aafc54fdae0f80b3a1cf11b3c78f17b0b7dc990a9ddd2a4b81a889e4f13d8":{"capacity":1,"zone":"mediocre-desktop","tags":[]}}

---

HTTP/1.1 200 OK
Content-Length: 0
Date: Sun, 13 Nov 2022 18:36:40 GMT
```

** REQUEST 2 **

```
GET /v0/layout HTTP/1.1
Host: 10.242.0.1:3902
User-Agent: Go-http-client/1.1
Authorization: Bearer ADMINSECRET
Accept-Encoding: gzip

---

HTTP/1.1 200 OK
Content-Length: 682
Content-Type: application/json
Date: Sun, 13 Nov 2022 18:36:40 GMT

{
  "version": 2,
  "roles": {
    "3e70b1a7dabfd25bb70804a09e1a7bcda667ce10d1d0f9633a8ea473da274adc": {
      "zone": "mediocre-desktop",
      "capacity": 1,
      "tags": []
    },
    "3e44b73b1dc65666f6a2f7bdd14e1a0742298fb4becfc19e6b84c7f772b49f82": {
      "zone": "mediocre-desktop",
      "capacity": 1,
      "tags": []
    },
    "7d4aafc54fdae0f80b3a1cf11b3c78f17b0b7dc990a9ddd2a4b81a889e4f13d8": {
      "zone": "mediocre-desktop",
      "capacity": 1,
      "tags": []
    }
  },
  "stagedRoleChanges": {
    "660fbe3df998664339778886325963b369112e155d6a5632da92c683bb5da0fd": {
      "zone": "mediocre-other-desktop",
      "capacity": 1,
      "tags": []
    }
  }
}
```

** REQUEST 3 **

```
POST /v0/layout/apply HTTP/1.1
Host: 10.242.0.1:3902
User-Agent: Go-http-client/1.1
Content-Length: 14
Authorization: Bearer ADMINSECRET
Accept-Encoding: gzip

{"version":3}
```

At this point the request never returns, and this shows up in the logs:

```
2022-11-13T18:36:40.927620Z  INFO garage_api::generic_server: 10.242.0.1:60688 POST /v0/layout/apply
2022-11-13T18:36:40.927626Z  INFO garage_api::generic_server: 10.242.0.1:60688 POST /v0/layout/apply
Calculating updated partition assignation, this may take some time...

2022-11-13T18:36:40.928435Z  WARN garage_table::sync: (key) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.928442Z  WARN garage_table::sync: (key) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.928486Z  WARN garage_table::sync: (bucket_alias) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.928490Z  WARN garage_table::sync: (bucket_alias) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.928529Z  WARN garage_table::sync: (bucket_v2) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.928532Z  WARN garage_table::sync: (bucket_v2) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.929623Z  WARN garage_table::sync: (version) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.929629Z  WARN garage_table::sync: (version) Sync error: Netapp error: Not connected: 7d4aafc54fdae0f8
2022-11-13T18:36:40.930843Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
2022-11-13T18:36:40.930851Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
```

Note: The instance that those `Not connected` errors refer to, `7d4aafc54fdae0f8`, is neither the instance being added nor the instance the REST requests are being performed against. I don't know whether they are related.
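For reference, the three captured calls boil down to this stage-then-apply sequence. The sketch below only builds the request bodies from the capture above (the host and `ADMINSECRET` bearer token are placeholder values from the report); nothing is actually sent:

```python
import json

def layout_change_requests(staged_roles, current_version):
    """Build the admin API call sequence seen above:
    1. POST /v0/layout stages role changes (node id -> zone/capacity/tags),
    2. GET /v0/layout reads back the staged layout and its current version,
    3. POST /v0/layout/apply applies the staged changes, targeting
       current version + 1."""
    return [
        ("POST", "/v0/layout", json.dumps(staged_roles)),
        ("GET", "/v0/layout", None),
        ("POST", "/v0/layout/apply",
         json.dumps({"version": current_version + 1})),
    ]

reqs = layout_change_requests(
    {"660fbe3df998664339778886325963b369112e155d6a5632da92c683bb5da0fd":
         {"capacity": 1, "zone": "mediocre-other-desktop", "tags": []}},
    current_version=2,  # "version": 2 in the GET response above
)
```

The important detail is that the apply body must name the version one higher than what GET returned, which matches the `{"version":3}` body in request 3.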

Owner

@mediocregopher thanks for the detailed report.

Could you check if the bug still happens:

  1. if you set all capacity values to 100 instead of 1?

  2. if you delete the previous layout, i.e. the `cluster_layout` file in the metadata directory, on all your nodes? (stop all nodes, delete `cluster_layout` everywhere, restart the nodes, and retry creating the layout you tried)


@lx

  1. The panic did not occur after setting capacity values to 100 rather than 1. Are capacity values on the admin API not in units of 100 GB? I assumed they would behave the same as on the CLI, but the docs don't explicitly say.

  2. After deleting the `cluster_layout` file on all instances, the panic also did not occur. I suppose that indicates the old `cluster_layout` had gotten itself into some kind of weird state?

Owner

Thanks for testing, @mediocregopher .

This issue is indeed a bug in the layout computation code. The current version of that code is at end of life, soon to be replaced by #296 (planned for v0.9), so I don't think we should spend too much time debugging it. There are two courses of action we could take from here:

  1. Do nothing and just tell people not to use too small capacity values

  2. Add a hack that detects when the capacity values are too small and multiplies them all by 10 or 100 when the layout is computed

What do you think?

lx added this to the v0.8 milestone 3 weeks ago

Can you clarify what is meant by "too small capacity values"? I was under the impression that a capacity value of `1` corresponded to 100 GB, as per the [Deployment on a cluster cookbook](https://garagehq.deuxfleurs.fr/documentation/cookbook/real-world/):

> Capacity values must be integers but can be given any signification. Here we chose that 1 unit of capacity = 100 GB.

Are capacity values actually unit-less and only used to determine the proportion of data each node will replicate? If so then (2) sounds ok, if I understand your suggestion right. Within the computation code you would multiply the capacity of each instance by 100, and since the capacities only have meaning relative to each other it wouldn't change anything.

It sounds like I can remove the "divide capacity by 100" operation in my codebase, since I was assuming capacity is a multiple of 100GB.

Or perhaps I'm completely misunderstanding :)
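If capacities are indeed unit-less weights, the scale invariance described above can be checked with a quick sketch (a toy calculation, not Garage's code; it assumes 256 partitions, which matches the partition totals reported later in this thread, and replication factor 3):

```python
from fractions import Fraction

def ideal_shares(capacities, partitions=256, rf=3):
    """Ideal number of partition replicas per node if partitions were split
    exactly in proportion to capacity. Each partition is stored on rf nodes,
    so there are partitions * rf replica slots to distribute."""
    total = sum(capacities)
    slots = partitions * rf
    return [Fraction(c, total) * slots for c in capacities]

# Scaling every capacity by the same factor (e.g. x100) changes nothing,
# because only the ratios between capacities matter:
assert ideal_shares([1, 1, 1, 2]) == ideal_shares([100, 100, 100, 200])
```

Under that reading, a client-side "divide capacity by 100" step is indeed unnecessary: the computed layout depends only on the relative weights.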

@quentin @lx we're still seeing the panic even when using larger capacity values:

```
> garage layout show
2022-11-20T16:43:32.856607Z  INFO netapp::netapp: Connected to 10.242.0.1:3900, negotiating handshake...
2022-11-20T16:43:32.900392Z  INFO netapp::netapp: Connection established to 3e44b73b1dc65666
==== CURRENT CLUSTER LAYOUT ====
ID                Tags  Zone              Capacity
3e44b73b1dc65666        mediocre-desktop  100
3e70b1a7dabfd25b        mediocre-desktop  100
7d4aafc54fdae0f8        mediocre-desktop  100

Current cluster layout version: 5

==== STAGED ROLE CHANGES ====
ID                Tags  Zone  Capacity
1afa07e9ea2d54b2        ig88  200
1ba1bffe7aa26414        ig88  200
3175dd13ae762d56        ig88  200

==== NEW CLUSTER LAYOUT AFTER APPLYING CHANGES ====
ID                Tags  Zone              Capacity
1afa07e9ea2d54b2        ig88              200
1ba1bffe7aa26414        ig88              200
3175dd13ae762d56        ig88              200
3e44b73b1dc65666        mediocre-desktop  100
3e70b1a7dabfd25b        mediocre-desktop  100
7d4aafc54fdae0f8        mediocre-desktop  100

Calculating updated partition assignation, this may take some time...

2022-11-20T16:43:32.916727Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
2022-11-20T16:43:32.916734Z ERROR garage: panicked at 'assertion failed: part.nodes.len() == self.replication_factor', layout.rs:262:21
[1]    699462 IOT instruction  sudo cryptic-net garage cli layout show
```
Owner

Thanks for this new report @mediocregopher. Since the "solution" of multiplying everything by 100 doesn't work, I won't try fixing this for v0.8.0, which is long overdue and which I'm going to try and release today. This will likely be fixed once the new algorithm in #296 is merged (I can't give a precise timeframe, but we're making good progress!). In the meantime, I think your options are:

  • starting from an empty cluster layout by deleting the cluster_layout file

  • tweaking the capacity values to see if there are some values for which the algorithm doesn't panic

I'm also quite intrigued by the fact that you are, AFAIK, the only person who has stumbled on this issue; please tell me if you hear of anyone else having the same problem.


@lx I'm also surprised no one else has run into this, though perhaps not so many people are using multiple instances in the same zone with the same capacity.

Can you give any hints as to what capacity values might result in this panic? Also, do you think 0.7.0 would have this same issue? We had tried it previously but had to reset our cluster for other reasons. I was hoping to be able to start from scratch with 0.8.0, but if 0.7.0 would work fine then we might just start there.

Owner

I might have figured out the cause of this issue. If I make a branch with a patch, will you try it out?

(I think this part of the code hasn't moved since v0.7.0, so you would probably also have the panic there.)


Yes absolutely!

lx referenced this issue from a commit 2 weeks ago
Owner

Here is the proposed fix: #429

Sorry I don't have time to make a full build today, tell me if you need one and I'll do it ASAP.

np, I can build it. Will get back asap


@lx This seems to work better! Neither of the configurations which caused panics previously are doing so now. One piece of oddness I'm seeing though is that the partition layout is a bit wacky:

```
# garage status
2022-11-21T17:03:58.019993Z  INFO netapp::netapp: Connected to 10.242.0.1:3900, negotiating handshake...
2022-11-21T17:03:58.064114Z  INFO netapp::netapp: Connection established to 3e44b73b1dc65666
==== HEALTHY NODES ====
ID                Hostname          Address          Tags  Zone                  Capacity
7d4aafc54fdae0f8  mediocre-desktop  10.242.0.1:3920  []    mediocre-desktop      100
3e70b1a7dabfd25b  mediocre-desktop  10.242.0.1:3910  []    mediocre-desktop      100
3e44b73b1dc65666  mediocre-desktop  10.242.0.1:3900  []    mediocre-desktop      100
c15e3709cacb1ae2  mediocre-desktop  10.242.0.1:3930  []    medicore-desktop-tmp  200
fa4caaa1121129f4  mediocre-desktop  10.242.0.1:3950  []    medicore-desktop-tmp  200
9bee8edce347bad3  mediocre-desktop  10.242.0.1:3940  []    medicore-desktop-tmp  200

# garage stats
2022-11-21T17:04:00.468517Z  INFO netapp::netapp: Connected to 10.242.0.1:3900, negotiating handshake...
2022-11-21T17:04:00.650795Z  INFO netapp::netapp: Connection established to 3e44b73b1dc65666

Garage version: 0.8.0-rc1 [features: k2v, sled, lmdb, sqlite, consul-discovery, kubernetes-discovery, metrics, telemetry-otlp, bundled-libs]

Database engine: Sled

Ring nodes & partition count:
  fa4caaa1121129f4 128
  3e70b1a7dabfd25b 102
  9bee8edce347bad3 128
  3e44b73b1dc65666 103
  7d4aafc54fdae0f8 102
  c15e3709cacb1ae2 205

...
```

I'd expect the partition counts of `c15e3709cacb1ae2`, `fa4caaa1121129f4`, and `9bee8edce347bad3` to be roughly the same, but that doesn't seem to be the case.

Maybe this isn't a real issue, or not something you'd want to fix for 0.8.0, just letting you know.

In any case, no more panics!
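A quick sanity check on the numbers above (assuming 256 partitions and replication factor 3, which is consistent with the totals): every partition is accounted for three times, but within the medicore-desktop-tmp zone the three equal-capacity (200) nodes hold very different counts.

```python
# Observed partition counts from the `garage stats` output above.
observed = {
    "7d4aafc54fdae0f8": 102, "3e70b1a7dabfd25b": 102, "3e44b73b1dc65666": 103,
    "c15e3709cacb1ae2": 205, "fa4caaa1121129f4": 128, "9bee8edce347bad3": 128,
}
# Every partition is stored on 3 nodes, so the counts sum to 256 * 3 slots.
assert sum(observed.values()) == 256 * 3

# Within the medicore-desktop-tmp zone the three nodes have equal capacity
# (200), so an even split of the zone's 461 replicas (~153.7 each) would be
# expected -- yet one node holds 77 more than the others.
tmp_zone = [observed[n] for n in
            ("c15e3709cacb1ae2", "fa4caaa1121129f4", "9bee8edce347bad3")]
assert max(tmp_zone) - min(tmp_zone) == 77
```

So the layout is valid (every partition has its 3 replicas) but far from capacity-proportional inside one zone, which matches the "unable to properly optimize globally" diagnosis below.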

Owner

Great news!

> I'd expect that the partition counts of c15e3709cacb1ae2, fa4caaa1121129f4, and 9bee8edce347bad3 to be roughly the same, but that doesn't seem to be so.

I think that's because the algorithm is fundamentally broken: in some cases it's unable to optimize properly in a global way. I don't think this is something I can fix on its own; #296 will hopefully solve this by redoing everything.

Thanks for your help working through this issue :)

lx closed this issue 2 weeks ago

Yeah fair enough, thank you!

Reference: Deuxfleurs/garage#414