Starting: timed out waiting for server handshake
The services in our group expose both gRPC and HTTP interfaces, and most of the HTTP interfaces are generated directly from the gRPC ones via grpc-gateway. One day, right after I updated the grpc version, accessing the HTTP interface suddenly failed with the following error:
[root@liqiang.io]# GET http://192.168.67.41/api/v3/metrics:query?query=host_cpu_usage_overall%5B2h%5D
HTTP/1.1 503 Service Unavailable
Server: nginx/1.10.2
Date: Wed, 10 Apr 2019 11:51:45 GMT
Content-Type: application/json
Transfer-Encoding: chunked
Connection: keep-alive
Trailer: Grpc-Trailer-Content-Type
{
"error": "all SubConns are in TransientFailure, latest connection error: timed out waiting for server handshake",
"code": 14,
"details": []
}
Committed: Add protocol handshake to ‘READY’ connectivity requirements
Based on the symptoms, we quickly determined that the gRPC upgrade had caused the problem, but not why. The logs showed that the handshake was what failed, so we first checked the grpc-go release notes for handshake-related changes, and soon found this entry in the v1.18.0 release notes (grpc/grpc-go/releases/tag/v1.18.0):
client: make handshake required ‘on’ by default, not ‘hybrid’ (#2565)
See issue #2406 for more information
Looking at the issue description and the PR changes, we can summarize the problem (#2406) as follows.
grpc-go used to treat a connection as READY as soon as the transport connection was established, without waiting for the server's HTTP/2 handshake (the server's initial SETTINGS frame). This did not match the other gRPC implementations and could cause clients to misbehave: in practice, a connection could be reported as ready before the server had shown that it speaks HTTP/2 at all, which is a potential risk for DoS-like problems.
Therefore, since 1.18, the client by default waits for the server handshake to complete before it treats the connection as ready. And in 1.19 the ability to skip that wait was removed, which is the root cause of this problem in my case. Since the release was urgent, I found a simple workaround, setting the environment variable GRPC_GO_REQUIRE_HANDSHAKE=off, but since I thought this was a nasty way to do things, I rolled back the grpc version first.
And that’s the end of the story.
Turn: timed out waiting for server handshake again
Recently, when I tried to switch our dependency management tool from dep to Go modules, the problem came back. Go modules select versions based on the minimum version that each dependency requires, rather than a maximum or an exactly specified version, so grpc was upgraded again, this time to v1.33.0. I thought the old method would still work, but I was slapped in the face and ran into this:
[root@liqiang.io]# tail log
panic: rpc error: code = Unavailable desc = timed out waiting for server handshake
goroutine 1 [running]:
main.main()
/gopath/src/github.smartx.com/xxxxx/xxxxx/cmd/client/main.go:24 +0x245
I thought I could just add the environment variable or set the client's options, but the error happened anyway. So I had to go back and read the earlier issue again (luckily we had documented the case well), and then I realized that, in my hurry, I had stupidly missed a very important paragraph:
During development for the 1.19 release, support for changing this behavior via the environment variable will be removed. Also, the grpc.WithWaitForHandshake() DialOption (was “experimental”; now “deprecated”) will be removed.
Users impacted: as far as we are aware, the only usage that may be impacted by the new behavior is cmux. cmux has a workaround for Java using MatchWithWriters to allow it to continue working in the face of this behavior.
Two important points are made here quite plainly:
- After 1.19, neither the environment variable nor the grpc.WithWaitForHandshake() option is valid anymore.
- cmux (exactly what I chose) can work around the problem for the Go implementation by using MatchWithWriters.
Hopefully: solving the problem
This time I had more time and wasn't going to dodge the problem, so I chose the second way, cmux with MatchWithWriters, and the final code reads:
[root@liqiang.io]# cat main.go
... ...
import "github.com/soheilhy/cmux"
... ...
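// Listen on a single TCP port that gRPC and HTTP will share.
l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
if err != nil {
	log.Fatalf("[E] Failed to listen on :%d: %v", port, err)
}
var (
	m = cmux.New(l)
	// gRPC: match on the content-type header and let the matcher write a
	// SETTINGS frame, so clients waiting for the server handshake can proceed.
	grpcListener = m.MatchWithWriters(
		cmux.HTTP2MatchHeaderFieldSendSettings("content-type", "application/grpc"),
	)
	// Plain HTTP/1 requests go to the HTTP server.
	httpListener = m.Match(cmux.HTTP1Fast())
)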
... ...
This is actually one way to serve HTTP and gRPC on the same port. So another question arises: gRPC uses HTTP/2, so how does HTTP/2 connection establishment actually work? (Another hole I have dug for myself.)