Garage connects to Consul node address instead of service address when using agent API #675

Open
opened 2023-12-02 12:35:28 +00:00 by max · 2 comments

Using Garage version 0.8.4. This used to work in 0.8.2.

Discovered this in a NixOS test, with the following nodes:

  • 192.168.1.1 is a node running the Consul server; to keep the test simple, this node also acts as the Consul agent for all Garage nodes
  • 192.168.1.2, 192.168.1.3 and 192.168.1.4 are Garage nodes, configured to use Consul service discovery

All Garage nodes correctly (meaning with the right addresses) register themselves as services in Consul. However, when trying to connect to each other, they instead try to connect to the Consul node.

It seems like 0.8.4 has reworked how registration works with the agent API. In 0.8.2, Garage used to create its own "virtual" garage:deadbeef nodes in Consul and register a service on them. In 0.8.4, this is still how it works with the catalog API, but with the agent API no virtual nodes are created and Garage just registers a service on the agent node itself. Presumably, then, Garage has always used the node IP address instead of the service IP address, and this only became visible now that Garage no longer controls what the node address is.
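
To make the symptom concrete, here is a minimal sketch of the test setup as a catalog listing might describe it. It is not Garage's actual code: the `CatalogEntry` struct and both function names are hypothetical, the fields only mirror Consul's `Address`, `ServiceAddress` and `ServicePort`, and port 3901 is taken from the strace output mentioned in this thread. It assumes, as suspected above, that only the node address is read when building the peer list.

```rust
/// One entry of a catalog service listing, reduced to the fields relevant here.
/// (Hypothetical struct; field meanings follow Consul's catalog response.)
#[derive(Debug, Clone)]
struct CatalogEntry {
    node_address: String,    // `Address`: address of the node the service is registered on
    service_address: String, // `ServiceAddress`: address given when registering the service
    service_port: u16,       // `ServicePort`
}

/// The three Garage services from the test, all registered through the
/// single agent running on 192.168.1.1.
fn test_setup_entries() -> Vec<CatalogEntry> {
    ["192.168.1.2", "192.168.1.3", "192.168.1.4"]
        .iter()
        .map(|ip| CatalogEntry {
            node_address: "192.168.1.1".to_string(),
            service_address: ip.to_string(),
            service_port: 3901,
        })
        .collect()
}

/// Peer list built from the node address only (the suspected behaviour).
fn peers_from_node_address(entries: &[CatalogEntry]) -> Vec<String> {
    entries
        .iter()
        .map(|e| format!("{}:{}", e.node_address, e.service_port))
        .collect()
}
```

Under these assumptions, `peers_from_node_address(&test_setup_entries())` yields the Consul node's address for all three peers, even though each entry carries the right service address.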

Contributor

Hi Max, as you point out, using the agent API does not create "virtual" nodes. This is by design, based on my assumption (informed by Consul's Architecture reference, https://developer.hashicorp.com/consul/docs/architecture#client-agents) that "In a typical deployment, you must run client agents on every compute node in your datacenter." That doesn't seem to be the case in your setup, so the strace screenshot you provided, which shows the process connecting to 192.168.1.1:3901, makes sense given how I built the agent API integration.

To accommodate your architecture, a change could be made here (https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/rpc/consul.rs#L141) to read from the service's TaggedAddresses field (https://developer.hashicorp.com/consul/api-docs/agent/service#sample-response-1) instead of the Address of the node, and to update the corresponding query declarations. The change must keep reading from the Address field so as not to break the workflow for users of the catalog API.

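As additional context, Address was used because the docs for that field (https://developer.hashicorp.com/consul/docs/services/configuration/services-configuration-reference#address) specify that address corresponds to the service IP and falls back to the node's: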

address: String value that specifies a service-specific IP address or hostname. If no value is specified, the IP address of the agent node is used by default.

I'm not sure how it would not be specified, though. Perhaps a bug here (https://git.deuxfleurs.fr/Deuxfleurs/garage/src/branch/main/src/rpc/consul.rs#L208)?


edit

I took another look at the image where you query Consul's API, and Address does seem to map to what's expected, so I'm wondering whether the garage1 host gets a different response than the one consul shows in the screenshot.

Author

You're right that using a single Consul agent for 3 nodes is not ideal. However, that's not the only scenario where service addresses might differ from the Consul node addresses. In my actual deployment, I have a Consul agent on every node, and the Consul agents talk to each other through a WireGuard mesh, so the node address listed for each agent is that node's IP address on the WireGuard mesh interface. Depending on the individual service's exposure, I can then configure its service address to be that same mesh IP, or the node's public IP, or the IP of a different tunnel interface. For example, the service representing Garage's S3 API is configured to use public IP addresses.

The problem seems to occur while querying, not while registering.
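I'm assuming that ServiceAddress should be used in this line (https://git.deuxfleurs.fr/Deuxfleurs/garage/src/commit/a2c1de646bce4a96cf8dc526f82bd88bcf3dde70/src/rpc/consul.rs#L18) instead of Address, as per https://developer.hashicorp.com/consul/api-docs/catalog#serviceaddress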
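
As a rough illustration of that suggestion (a sketch only, not a patch against consul.rs), the selection could prefer the service address and fall back to the node address when the service address is empty, which is how the Consul catalog docs describe an unset ServiceAddress. The function name `peer_socket_addr` and its parameters are hypothetical, and the sketch only handles IP literals, whereas Consul also allows hostnames in these fields.

```rust
use std::net::{IpAddr, SocketAddr};

/// Pick the address to dial for one catalog entry (hypothetical helper).
/// Prefers `ServiceAddress`; falls back to the node's `Address` when the
/// service address is empty. Returns `None` if the chosen value is not
/// an IP literal.
fn peer_socket_addr(
    node_address: &str,
    service_address: &str,
    service_port: u16,
) -> Option<SocketAddr> {
    let host = if service_address.is_empty() {
        node_address
    } else {
        service_address
    };
    let ip: IpAddr = host.parse().ok()?;
    Some(SocketAddr::new(ip, service_port))
}
```

With the test setup from this issue, `peer_socket_addr("192.168.1.1", "192.168.1.2", 3901)` resolves to the Garage node rather than the Consul node, while an entry with an empty service address still resolves to the node address, so the virtual-node registration used with the catalog API would keep working.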

Reference: Deuxfleurs/garage#675