When the controller manages many agents, it can have trouble starting up: many of its own workers need to connect to the controller over the API, and at the same time every other agent is also trying to connect to the controller.
The problem here is how to enable the controller to start and get all the workers stable before opening the flood-gates to the external agents.
A key problem here is that the controller should really not even accept any connections from the external agents before the controller has established that it is stable.
A side consideration for any change here: we need to add a callback from the http server worker so that it can report the port number (or numbers) used for the API connections. This would fix a race condition we currently hit in some of the full-stack agent tests, where a test opens a port just to get a number, closes it, and passes that number in as config for the machine agent to use, only to find that another test has taken the port in the meantime.
Proposal
Based on conversations with @jameinel, I think the best approach is to have a second API server port that is used only for the controller to communicate with itself, and for controller-to-controller communication.
I had been trying to work out how to shoe-horn this into the current system with minimal impact and I think I have worked out the best place.
In the main loop function of worker/httpserver, we create the main listener with this:
listener, err := net.Listen("tcp", listenAddr)
What I think we need here is a custom type. This type needs to implement net.Listener, and should initially wait on the controller_api port. It should also have another method to enable waiting on the normal api port. The Accept method needs to coalesce the Accept calls from both ports.
The httpserver manifold should also expose, through its output method, an interface for opening the api port.
The code in agent/agent.go already knows whether or not to use localhost as the address, and this code would need to be updated to use the controller port when connecting to localhost.
The main hard part, then, is deciding when the apiserver tells the new listener type to open the other port. I think it should be once the peer grouper has sent its initial message, as this indicates that the API connection is running and that the peer grouper has determined who should be up and running.
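To make the trigger concrete, here is a small sketch under the assumption that the "peer grouper has sent its initial message" event can be surfaced as a channel. The apiPortOpener interface, the openerFunc adapter, and openWhenStable are all hypothetical names; how the real signal would be wired between workers is not decided here.

```go
package main

import (
	"errors"
)

// apiPortOpener is a hypothetical form of the interface the httpserver
// manifold could expose for opening the public api port.
type apiPortOpener interface {
	OpenAPIPort() error
}

// openerFunc adapts a plain function to apiPortOpener.
type openerFunc func() error

func (f openerFunc) OpenAPIPort() error { return f() }

// errAborted is returned when the worker is stopped before the controller
// became stable.
var errAborted = errors.New("aborted before controller became stable")

// openWhenStable blocks until ready fires (assumed to mean the peer
// grouper has sent its initial message) and then opens the api port. If
// abort fires first, the port is left closed.
func openWhenStable(opener apiPortOpener, ready, abort <-chan struct{}) error {
	select {
	case <-ready:
		return opener.OpenAPIPort()
	case <-abort:
		return errAborted
	}
}
```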
This new listener type is also a key place where we should be rate limiting the Accept calls. We should rate limit only the api port and not the controller_api port.