slow deletion? #594

Closed
opened 2023-07-04 08:21:42 +00:00 by schmitch · 14 comments

hello,

we just deleted a ton of objects inside a garage cluster that runs inside k8s without any resource constraints. the cluster is not even close to saturated:

garage-0   615m         25349Mi
garage-1   1172m        14635Mi
garage-2   1129m        14581Mi

and these are the nodes:

rancher1   1497m        2%     97548Mi         75%
rancher2   1383m        2%     37465Mi         29%
rancher3   187m         0%     37595Mi         29%

so we still have some headroom.

the disks are NVMe drives in a software raid 10 and have a way higher iops rate. however, after deleting like 1 million or more objects (we use wal-g and had a misconfiguration so that it did not clean up old backups, so we had like 1 1/2 years of wal files) it only deletes a few each second. these are the busy worker stats:

1    Busy*  Block resync worker #1        2      -     44852     -       -
27   Busy   version sync                  -      -     187       11      0       1 week ago
29   Busy   version queue                 -      -     2491      -       -
30   Busy   block_ref Merkle              -      -     57        -       -
31   Busy   block_ref sync                -      -     227       11      0       1 week ago
33   Busy   block_ref queue               -      -     13672002  -       -

and these are some of the logs:

2023-07-04T08:08:30.335827Z  INFO garage_block::resync: Resync block 7d32274b632bc947: offloading and deleting
2023-07-04T08:08:44.477305Z  INFO garage_block::resync: Deleting unneeded block 7d32274b632bc947, offload finished (0 / 2)
2023-07-04T08:08:44.594597Z  INFO garage_table::sync: (version) Sending 10 items to 24b64e7509178ced
2023-07-04T08:09:35.614130Z  INFO garage_block::resync: Resync block af8829ecb57f1772: offloading and deleting
2023-07-04T08:09:36.637834Z  INFO garage_block::resync: Deleting unneeded block af8829ecb57f1772, offload finished (0 / 2)
2023-07-04T08:10:07.153267Z  INFO garage_table::sync: (block_ref) Sending 23 items to be7961fcb29af5bb
2023-07-04T08:10:07.417247Z  INFO garage_block::resync: Resync block 607ab7f0ca683c50: offloading and deleting
2023-07-04T08:10:25.501625Z  INFO garage_block::resync: Deleting unneeded block 607ab7f0ca683c50, offload finished (0 / 2)
2023-07-04T08:11:00.475748Z  INFO garage_block::resync: Resync block 9ab043656cd454ae: offloading and deleting
2023-07-04T08:11:27.228733Z  INFO garage_block::resync: Deleting unneeded block 9ab043656cd454ae, offload finished (0 / 2)
2023-07-04T08:11:45.749563Z  WARN garage_table::sync: (block_ref) Sync error: Netapp error: Not connected: be7961fcb29af5bb
2023-07-04T08:12:13.145169Z  INFO garage_block::resync: Resync block 490f8ff101f7d1ea: offloading and deleting
2023-07-04T08:12:18.970101Z  INFO garage_block::resync: Deleting unneeded block 490f8ff101f7d1ea, offload finished (0 / 2)
2023-07-04T08:13:40.171391Z  INFO garage_block::resync: Resync block 765feae278118997: offloading and deleting
2023-07-04T08:14:02.700252Z  INFO garage_block::resync: Deleting unneeded block 765feae278118997, offload finished (0 / 2)
2023-07-04T08:15:07.331317Z  INFO garage_block::resync: Resync block 93b0b28c56215fbc: offloading and deleting
2023-07-04T08:15:29.884928Z  INFO garage_block::resync: Deleting unneeded block 93b0b28c56215fbc, offload finished (0 / 2)
2023-07-04T08:15:32.019989Z  INFO garage_table::sync: (version) Sending 46 items to be7961fcb29af5bb
2023-07-04T08:16:04.838419Z  INFO garage_block::resync: Resync block b0d1c157fc1c5a90: offloading and deleting
2023-07-04T08:16:15.823083Z  INFO garage_block::resync: Deleting unneeded block b0d1c157fc1c5a90, offload finished (0 / 2)
2023-07-04T08:16:17.339009Z  INFO garage_table::sync: (version) Sending 40 items to be7961fcb29af5bb
2023-07-04T08:16:19.142009Z  INFO garage_table::sync: (block_ref) Sending 32 items to be7961fcb29af5bb
2023-07-04T08:17:21.907721Z  INFO garage_block::resync: Resync block 1c9941bc8d1a36f3: offloading and deleting
2023-07-04T08:17:37.712007Z  INFO garage_block::resync: Deleting unneeded block 1c9941bc8d1a36f3, offload finished (0 / 2)

it deletes, but really, really slowly. is there any way to speed this up? i'm not sure if it's even a good idea to change the config now.
does a restart do any harm to the deletion process? it's not a big deal that it deletes this slowly, but something is probably not correct here

Owner

Start by setting the resync worker count to 4 and resync tranquility to zero.

To make this run faster overall, we might need to change some parameters which are currently compile time constants, like the maximum number of resync workers or the batch sizes for internal data transfers. I can point you to the lines of code to change if you are interested in trying this.

These are quite large volumes of data which I don't think we ever experienced running garage with, so it's not necessarily expected that things will go smoothly. Please keep us updated on how it goes.
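
Concretely, that would be something like the following (a sketch only; I'm assuming here that the worker variable for the number of resync workers is named resync-worker-count):

garage worker set -a resync-worker-count 4
garage worker set -a resync-tranquility 0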

Owner

Btw, to answer your question, no, restarting garage will not impact the deletion process

Author

btw. what is also a little bit strange is that the queue grew, even though we did not change a thing:

1    Busy*  Block resync worker #1        2      -     50124     -       -
30   Busy   block_ref Merkle              -      -     51        -       -
31   Busy   block_ref sync                -      -     89        11      0       1 week ago
33   Busy   block_ref queue               -      -     14627730  -       -

I will change the values, however I'm not sure if I'll get to that today; I will give an update tomorrow.

> These are quite large volumes of data which I don't think we ever experienced running garage with, so it's not necessarily expected that things will go smoothly. Please keep us updated on how it goes.

well, under normal circumstances that wouldn't be the amount of data I'd want to work with either, it was just a wal-g misconfiguration...

> To make this run faster overall, we might need to change some parameters which are currently compile time constants, like the maximum number of resync workers or the batch sizes for internal data transfers. I can point you to the lines of code to change if you are interested in trying this.

might be a good idea.

Author

it's still running, but I can just let it keep going, will probably take a while:

1    Busy   Block resync worker #1        0      -     21257     -       -
2    Busy   Block resync worker #2        0      -     21255     -       -
3    Busy   Block resync worker #3        0      -     21255     -       -
4    Busy   Block resync worker #4        0      -     21254     -       -
26   Busy   version Merkle                -      -     2         -       -
30   Busy   block_ref Merkle              -      -     2         -       -
32   Busy   block_ref GC                  -      -     152177    -       -
33   Busy   block_ref queue               -      -     14350788  -       -

somehow it has not reclaimed that much storage space yet

Author

actually it does not look like everything is deleted:

31   Busy   block_ref sync                -      -     5         3       0       1 day ago
32   Busy   block_ref GC                  -      -     272598    -       -
33   Busy   block_ref queue               -      -     12017385  4       0       1 day ago
1    Idle   Block resync worker #1        0      -     309       -       -
2    Idle   Block resync worker #2        0      -     309       -       -
3    Idle   Block resync worker #3        0      -     310       -       -
4    Idle   Block resync worker #4        0      -     311       -       -

this is now after a few days.

a stats call gives me something like:

Storage nodes:
  ID                Hostname  Zone     Capacity  Part.  DataAvail                MetaAvail
  a030e3d0111a7a7a  garage-0  envisia  10        256    690.0 GB/1.9 TB (36.0%)  690.0 GB/1.9 TB (36.0%)
  be7961fcb29af5bb  garage-2  envisia  10        256    680.1 GB/1.9 TB (35.4%)  680.1 GB/1.9 TB (35.4%)
  24b64e7509178ced  garage-1  envisia  10        256    682.8 GB/1.9 TB (35.6%)  682.8 GB/1.9 TB (35.6%)

however garage is still at about 1.1 TB of storage usage:

1.1T	pvc-4171eefe-1f2f-4d41-af59-f42eb7a1cbb0_garage_garage-data-garage-2

we use kubernetes with local storage, so I'm not sure what to do now?
rebuild garage with better parameters?

Author

I would say the biggest problem is not that there are not enough workers or anything; the bigger problem is that it only does work every ~10 seconds instead of clearing the queue faster.

at the current speed it will probably take years to clean up the whole cluster.

Owner

What is most worrying to me is the size of your block_ref resync queue, 12M items will definitely take a while to clear unless you increase the batch size on this line: https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/table/queue.rs#L15 (unfortunately this is not tunable right now). This queue has to be processed before the block resync worker can do stuff. I'm not sure exactly what's in your cluster, but that might be a reason why blocks don't get deleted fast enough.
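
For illustration, the kind of change meant here would look roughly like this (a sketch only; the constant name and values are illustrative, check the linked line for the actual code):

// src/table/queue.rs (illustrative sketch, not the literal upstream code)
// Each pass of the table queue worker processes up to this many queued items;
// raising it lets a multi-million-item block_ref queue drain in fewer passes.
const BATCH_SIZE: usize = 100; // e.g. bump to 1000 and rebuild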

The other source of possible delay is the resync tranquility parameter, which you can check with garage worker get -a resync-tranquility. If it is larger than zero, Garage introduces some delays between block resyncs to avoid saturating the network. This is a safety measure that is useful during standard operation, but in your case you just want the resync queue to clear as fast as possible, so you should do a garage worker set -a resync-tranquility 0.

Concerning the size taken by Garage in your local storage, be mindful that Garage reports available space. So if 1.1T is your used space, Garage reporting 680GB free space for 1.9TB total is consistent.

Author

actually the resync-tranquility is already zero:

kubectl -n garage exec -t -i garage-2 -- ./garage worker get -a resync-tranquility
Defaulted container "garage" out of: garage, garage-upgrade (init)
24b64e7509178ced  resync-tranquility  0
a030e3d0111a7a7a  resync-tranquility  0
be7961fcb29af5bb  resync-tranquility  0

not sure if it needs a restart though. (the workers were updated on the fly)

> This queue has to be processed before the block resync worker can do stuff. I'm not sure exactly what's in your cluster, but that might be a reason why blocks don't get deleted fast enough.

actually that is the stuff that I deleted, so it's not new data or anything. I stopped wal-g completely until I can resolve this problem.

> Concerning the size taken by Garage in your local storage, be mindful that Garage reports available space. So if 1.1T is your used space, Garage reporting 680GB free space for 1.9TB total is consistent.

that is fine, garage takes up 1.1T right now while the rest is used by other stuff. the disk is not partitioned; instead we have a software raid 10 of NVMe drives that is not at its i/o limit (not even close, and at night the only thing that needs i/o is basically garage deletions...) (it's not the best idea to put multiple things on the same disk array, but we still have plenty of i/o left and we will probably keep that until we get to our next roadblock)
the 1.1T should go down, since we were at ~1.3T and I deleted ~800G, but as you can see it's now a week later and it did not really delete that much, only ~150G, so it's painfully slow.

and as said, resync-tranquility is at zero and it still only does deletions every ~10s or something like that; I only get new log entries every 10-20 seconds.

Author

today I restarted the nodes, but it did more harm than good.

it now hangs when I try to run a command and the logs say something like:

2023-07-12T09:40:46.256862Z  INFO garage_block::resync: Resync block 3ea2ee7a57a74b28: offloading and deleting
2023-07-12T09:40:46.257062Z ERROR garage_block::resync: Error when resyncing 3ea2ee7a57a74b28: NeedBlockQuery RPC
Netapp error: Not connected: a030e3d0111a7a7a
2023-07-12T09:40:46.257930Z  INFO garage_block::resync: Resync block 3ea30d9ec3768e8d: offloading and deleting
2023-07-12T09:40:46.257967Z ERROR garage_block::resync: Error when resyncing 3ea30d9ec3768e8d: NeedBlockQuery RPC
Netapp error: Not connected: a030e3d0111a7a7a
2023-07-12T09:40:46.282018Z  INFO garage_rpc::kubernetes: Found Pod: Some("24b64e7509178ced5dd0e404a63c4b7c2be4cad3b66971b3236705f058aacafc")
2023-07-12T09:40:46.282068Z  INFO garage_rpc::kubernetes: Found Pod: Some("a030e3d0111a7a7a86a799bbd926a24facc270a4fa485d2700a92ce94a094cd2")
2023-07-12T09:40:46.282084Z  INFO garage_rpc::kubernetes: Found Pod: Some("be7961fcb29af5bb64d38efacc4c746b020d72d9c979dd6057206c2cf7bfe401")
2023-07-12T09:41:46.284075Z  INFO garage_rpc::system: Doing a bootstrap/discovery step (not_configured: false, no_peers: false, bad_peers: true)
2023-07-12T09:41:46.293514Z  INFO garage_rpc::kubernetes: Found Pod: Some("24b64e7509178ced5dd0e404a63c4b7c2be4cad3b66971b3236705f058aacafc")
2023-07-12T09:41:46.293527Z  INFO garage_rpc::kubernetes: Found Pod: Some("a030e3d0111a7a7a86a799bbd926a24facc270a4fa485d2700a92ce94a094cd2")
2023-07-12T09:41:46.293532Z  INFO garage_rpc::kubernetes: Found Pod: Some("be7961fcb29af5bb64d38efacc4c746b020d72d9c979dd6057206c2cf7bfe401")

the CRD of the pod has the correct IP address though

EDIT: Looks like after an hour or so the commands started to work again

Author

i tried to compile garage on my own, however whatever I did the binary turned out to be quite big, > 500, even with a low number of features when I used bundled-libs. did I do something wrong?


in the top-level Cargo.toml, you can comment out these lines

[profile.release]
debug = true

to get smaller binaries
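
i.e. so that the section ends up looking like:

# [profile.release]
# debug = true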

Contributor

To get smaller binaries, you should build Garage with LTO, panic=abort and strip the resulting binary.

Instead of modifying Cargo.toml, you can also set the following environment variables:

export CARGO_PROFILE_RELEASE_OPT_LEVEL="2"
export CARGO_PROFILE_RELEASE_PANIC="abort"
export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1
export CARGO_PROFILE_RELEASE_LTO="true"
cargo build --release --locked
strip target/release/garage

You can also link garage dynamically with system-provided libsodium, sqlite and zstd instead of bundling vendored libraries.

export SODIUM_USE_PKG_CONFIG=1
cargo build --release --locked --no-default-features --features system-libs,sled,lmdb

(BTW, that's how the garage package in Alpine Linux is built: https://pkgs.alpinelinux.org/packages?name=garage. The resulting size is 14.24 MiB.)

Author

I tried compiling it without the debug setting and that made it smaller.
However the docker image did not work, so then I compiled it via:

RUSTFLAGS="-C target-feature=+crt-static" cargo build --release --locked --no-default-features --features bundled-libs,kubernetes-discovery,sled,lmdb --target x86_64-unknown-linux-gnu

however for some reason it then failed to correctly do the k8s discovery:

2023-07-13T16:11:45.909634Z  WARN garage_rpc::system: Could not retrieve node list from Kubernetes: HyperError: error trying to connect: dns error: Device or resource busy (os error 16)
2023-07-13T16:11:45.910451Z ERROR garage_rpc::system: Error while publishing node to Kubernetes: HyperError: error trying to connect: dns error: Device or resource busy (os error 16)
2023-07-13T16:11:45.914772Z ERROR garage_util::background::worker: Error in worker block_ref GC (TID 32): in try_send_and_delete in table GC:
GC: send tombstones
Could not reach quorum of 2. 0 of 2 request succeeded, others returned errors: ["Netapp error: Not connected: 24b64e7509178ced", "Netapp error: Not connected: a030e3d0111a7a7a"]
2023-07-13T16:12:17.752777Z ERROR garage_rpc::system: Error establishing RPC connection to remote node: 24b64e7509178ced5dd0e404a63c4b7c2be4cad3b66971b3236705f058aacafc@10.42.2.33:3901.
This can happen if the remote node is not reachable on the network, but also if the two nodes are not configured with the same rpc_secret.
IO error: Connection refused (os error 111)
2023-07-13T16:12:45.911344Z  INFO garage_rpc::system: Doing a bootstrap/discovery step (not_configured: false, no_peers: true, bad_peers: true)
2023-07-13T16:12:45.912453Z  WARN garage_rpc::system: Could not retrieve node list from Kubernetes: HyperError: error trying to connect: dns error: Device or resource busy (os error 16)
2023-07-13T16:12:45.913546Z ERROR garage_rpc::system: Error while publishing node to Kubernetes: HyperError: error trying to connect: dns error: Device or resource busy (os error 16)

maybe there is a problem when using a scratch docker image. somehow the dns stuff starts failing then, hm...

Maybe I should use the musl target: x86_64-unknown-linux-musl (but that resulted in build failures...). I need to look into it more.
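
(for reference, switching to the musl target would be roughly the same invocation with the target swapped, something like the following; it does not explain the build failures though:)

rustup target add x86_64-unknown-linux-musl
RUSTFLAGS="-C target-feature=+crt-static" cargo build --release --locked --no-default-features --features bundled-libs,kubernetes-discovery,sled,lmdb --target x86_64-unknown-linux-musl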

Author

I remade the whole cluster
