0.8.1: Heavy data directory reading after upgrade #470

Closed
opened 2023-01-09 17:46:25 +00:00 by jpds · 2 comments
Contributor

I upgraded my cluster from `0.7.2` to `0.8.1`, and followed all the steps at: https://garagehq.deuxfleurs.fr/documentation/working-documents/migration-08/

There's currently zero S3/web activity on the cluster and nothing in the queues, but I'm observing a lot of disk reads:

```
$ zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      5      0  2.33M      0
zroot       7.46G   225G      0      5      0   447K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      6      0  3.08M      0
zroot       7.46G   225G      0     70      0   575K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T     13      0  2.94M      0
zroot       7.46G   225G      0      5      0   469K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T     13      0  4.59M      0
zroot       7.46G   225G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      5      0  3.39M      0
zroot       7.46G   225G      0      5      0   384K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      6      0  5.01M      0
zroot       7.46G   225G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      6      0  5.39M      0
zroot       7.46G   225G      0     78      0  1.02M
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      7      0  4.99M      0
zroot       7.46G   225G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      6      0  5.66M      0
zroot       7.46G   225G      0      5      0   456K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T     13      0  6.97M      0
zroot       7.46G   225G      0      0      0  3.98K
----------  -----  -----  -----  -----  -----  -----
garage       132G   4.4T      8      0  3.19M      0
zroot       7.46G   225G      0      4      0   382K
----------  -----  -----  -----  -----  -----  -----
```

If I probe with DTrace for files being opened, it looks like `garage` is crawling its own data directory:

```
$ dtrace -n 'syscall::open*:entry { printf("%s %s", execname, copyinstr(arg0)); }'
  2  82631                       open:entry garage /srv/garage/20/c5
  2  82631                       open:entry garage /srv/garage/20/d2
  3  82631                       open:entry garage /srv/garage/20/ad
  2  82631                       open:entry garage /srv/garage/20/fc
  3  82631                       open:entry garage /srv/garage/20/d8
  3  82631                       open:entry garage /srv/garage/20/7a
...  
  2  82631                       open:entry garage /srv/garage/7f/8c
  0  82631                       open:entry garage /srv/garage/7f/46
  0  82631                       open:entry garage /srv/garage/7f/31
  2  82631                       open:entry garage /srv/garage/7f/b3
  1  82631                       open:entry garage /srv/garage/7f/e4
  3  82631                       open:entry garage /srv/garage/7f/e9
...
  0  82631                       open:entry garage /srv/garage/0a/9d
  0  82631                       open:entry garage /srv/garage/0a/51
  1  82631                       open:entry garage /srv/garage/0a/26
  1  82631                       open:entry garage /srv/garage/0a/ae
  1  82631                       open:entry garage /srv/garage/0a/fb
  0  82631                       open:entry garage /srv/garage/0a/d9
```

Is there something else I can try to debug where this is coming from?

Owner

Probably your node is doing a scrub of the stored data to check for corruption. It does this once a month. It's meant to be a background process that limits its own I/O in order to leave room for interactive requests to be served first. You can check the progress of the scrub using `garage worker list` and `garage worker info`, and you can change its speed using `garage worker set scrub-tranquility` (zero is the fastest; larger values mean a longer interval between iterations and therefore a smaller proportion of I/O time used by the scrub).
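
For reference, a minimal shell sketch of the commands above (the worker id placeholder and the tranquility value are illustrative, not output from this cluster):

```
# List background workers and note the id of the scrub worker
$ garage worker list

# Show details and progress for that worker (id taken from the list above)
$ garage worker info <worker-id>

# Throttle the scrub: 0 is fastest, larger values add more pause between iterations
$ garage worker set scrub-tranquility 4
```
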
Author
Contributor

Very interesting - that was indeed it.

jpds closed this issue 2023-01-09 18:19:05 +00:00