What was particularly interesting here was that all 15-minute requests were blocked on what appeared to be a 15-minute database query. That should be impossible, because we have a tight statement_timeout on Postgres on the scale of seconds (not minutes!). What's more, we log slow queries using the Postgres log_min_duration_statement setting. A query that took 15 minutes was sure to show up in that log, but it didn't.

Our general database access layout looks something like this:

Application → AWS Network Load Balancer → Pgbouncer → Database

If it is not the database, maybe the query was blocked on Pgbouncer? That is also highly unlikely, because we have a very short query_wait_timeout configured for our Pgbouncers. From the Pgbouncer docs, query_wait_timeout is the maximum time queries are allowed to spend waiting for execution; if the query is not assigned to a server during that time, the client is disconnected. So a query can't spend more than a few seconds on Pgbouncer before being sent to the database or rejected with a query_wait_timeout error.

Figure: A more recent but much more severe retransmission spike/storm.

Okay, so we have a smoking gun, but how does the increase in TCP retransmissions relate to the 15-minute requests?

By default, the Linux kernel will retransmit unacknowledged packets up to 15 times with exponential backoff. After that, it gives up and notifies the application that the connection is lost. The default settings yield a retransmission timeout that can reach up to 15 minutes, which lines up very well with the timeouts we are seeing.

We also saw this issue with background workers that don't have application-level timeouts. On these workers, the errors we saw were:

PQconsumeInput() could not receive data from server: Connection timed out

Searching for this error message in the PostgreSQL code reveals that the suffix "Connection timed out" comes directly from the OS, which gives more credibility to the TCP retransmissions hypothesis.

Using tcpdump and iptables, we reproduced this 15-minute wait locally by opening a connection, blocking traffic to port 5432, and sending a query to the database.

Figure: "ss -o" command showing the TCP retransmission timer (given the label "on"). It also shows the time until the next retransmission attempt (1min36sec) and the number of retransmissions so far (9).

How should we fix this?
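To make the "up to 15 minutes" figure concrete, here is a rough back-of-the-envelope sketch of the retransmission backoff. It assumes the retransmission timeout starts at 200 ms, doubles after every unacknowledged attempt, and is capped at 120 seconds (the usual Linux bounds); on a real connection the starting value depends on the measured round-trip time, so treat the result as an estimate. The function name and parameters are made up for illustration.

```python
def estimated_retransmission_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Estimate how long (in seconds) the kernel waits before giving up.

    Assumptions (not taken from the post): the RTO starts at rto_min,
    doubles after each unacknowledged transmission, and is capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    # The original transmission plus `retries` retransmissions each wait
    # one RTO interval before the next step, so there are retries + 1 waits.
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


total_seconds = estimated_retransmission_timeout()
total_minutes = total_seconds / 60
```

Under these assumptions the total comes out to about 924.6 seconds, a little over 15 minutes, which matches the request durations we observed.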