Discussion:
[uWSGI] listen queue overflow in a coroae app (not restart)
Руслан Закиров
2018-04-26 09:42:25 UTC
Hello,

We see the following messages in logs:

uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)

It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:

ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/

This request does a lookup in the app's routing table, fails, and returns a 404.

App's configuration:

[uwsgi]
buffer-size = 16384
log-master-bufsize = 32768
http-socket = 0.0.0.0:5002
psgi = /srv/sports/perl-deps/bin/sprt-combiner.psgi
processes = 1
master = true
uid = sports
gid = sports
plugin = http, coroae, logfile
coroae = 512
harakiri = 10
harakiri-verbose = true
reload-mercy = 4
max-requests = 100000
req-logger = file:/var/log/uwsgi/app/combiner.access.log
stats = 0.0.0.0:5052


I've increased the number of cores and decreased the number of processes to 1
so I can strace the worker; here is the strace output from the worker:

08:54:44.965905 accept4(10, {sa_family=AF_INET, sin_port=htons(53595),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 15 <0.000021>
08:54:44.965986 read(23, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000021>
08:54:44.966752 write(2, "[TRACE][b414fb46]\n\t0 - 0.000495("..., 136) =
136 <0.000040>
08:54:44.966922 writev(23, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000035>
08:54:44.967017 close(23) = 0 <0.000031>
08:54:44.967102 writev(8, [{"[pid: 26800|app: 0|req: 113092/1"..., 208}],
1) = 208 <0.000027>
08:54:44.967237 epoll_ctl(11, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15,
u64=25378961752079}}) = 0 <0.000022>
08:54:44.967310 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=15, u64=25378961752079}}], 64, 4780) = 2 <0.000018>
08:54:44.967378 accept4(10, {sa_family=AF_INET, sin_port=htons(53587),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 23 <0.000021>
08:54:44.967464 read(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000021>
08:54:44.968212 write(2, "[TRACE][cb58862b]\n\t0 - 0.0005(0."..., 132) =
132 <0.000046>
08:54:44.968357 writev(15, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000035>
08:54:44.968450 close(15) = 0 <0.000031>
08:54:44.968524 writev(8, [{"[pid: 26800|app: 0|req: 113093/1"..., 208}],
1) = 208 <0.000027>
08:54:44.968654 epoll_ctl(11, EPOLL_CTL_ADD, 23, {EPOLLIN, {u32=23,
u64=8194797600791}}) = 0 <0.000023>
08:54:44.968727 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=23, u64=8194797600791}}], 64, 4778) = 2 <0.000018>
08:54:44.968789 accept4(10, {sa_family=AF_INET, sin_port=htons(53599),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 15 <0.000021>
08:54:44.968869 read(23, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000020>
08:54:44.969653 write(2, "[TRACE][b5f9dff7]\n\t0 - 0.000506("..., 136) =
136 <0.000047>
08:54:44.969812 writev(23, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000036>
08:54:44.969908 close(23) = 0 <0.000030>
08:54:44.969979 writev(8, [{"[pid: 26800|app: 0|req: 113094/1"..., 208}],
1) = 208 <0.000027>
08:54:44.970106 epoll_ctl(11, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15,
u64=25383256719375}}) = 0 <0.000022>
08:54:44.970179 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=15, u64=25383256719375}}], 64, 4777) = 2 <0.000018>
08:54:44.970242 accept4(10, {sa_family=AF_INET, sin_port=htons(53598),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 23 <0.000022>
08:54:44.970326 read(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000022>
08:54:44.971013 write(2, "[TRACE][c8da4f8a]\n\t0 - 0.000477("..., 136) =
136 <0.000045>
08:54:44.971156 writev(15, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000034>
08:54:44.971252 close(15) = 0 <0.000029>
08:54:44.971328 writev(8, [{"[pid: 26800|app: 0|req: 113095/1"..., 208}],
1) = 208 <0.000026>


Core usage is very low (number of distinct cores seen in the access log):

$ zcat -f /var/log/uwsgi/app/combiner.access.log | perl -nE '/core (\d+)/
&& say $1' | sort | uniq -c | wc -l
18
$ zcat -f /var/log/uwsgi/app/combiner.access.log.1 | perl -nE '/core (\d+)/
&& say $1' | sort | uniq -c | wc -l
77


Do you have any ideas about what may cause this? Or ideas on how to debug it further?
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
Roberto De Ioris
2018-04-26 15:35:23 UTC
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
Hi, ab is not a good tool for this kind of test. In any case, it looks
like the system is not able to dequeue connections fast enough.

I would increase the listen queue (remember that on Linux you also need to
set the values in /proc).
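For example, something along these lines should do (the numbers are only
illustrative, pick what fits your load):

# in the [uwsgi] section
listen = 1024

# on the host, so the kernel actually allows a backlog that large
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=1024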
--
Roberto De Ioris
http://unbit.com
Руслан Закиров
2018-04-26 15:47:37 UTC
Post by Roberto De Ioris
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
Hi, ab is not a good tool for this kind of test.
Why? It does the job of reproducing the problem and simulating concurrent
pressure.
Post by Roberto De Ioris
In any case, it looks like the system is not able to dequeue connections
fast enough.
And I wonder why.
Post by Roberto De Ioris
I would increase the listen queue (remember that on Linux you also need to
set the values in /proc).
Increasing the listen queue will just make the problem worse.
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
INADA Naoki
2018-04-27 07:36:18 UTC
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
[...]
Do you have any ideas about what may cause this? Or ideas on how to debug it further?
--
INADA Naoki <***@gmail.com>
Руслан Закиров
2018-04-27 11:42:21 UTC
Post by INADA Naoki
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Guys,

I know how to change the size of the listen queue.

Why do you think I should change it? Are you suggesting I increase it or
decrease it?



--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
INADA Naoki
2018-04-27 11:46:49 UTC
Post by Руслан Закиров
Post by INADA Naoki
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Post by Руслан Закиров
Guys,
I know how to change the size of the listen queue.
Why do you think I should change it? Are you suggesting I increase it or
decrease it?


Because the log says "uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10)
full !!! (101/100)".

You used ab -c 200, so 200 connections are created at once. A listen queue
size of 100 is not enough.

I'm curious why you didn't think you should change the listen queue size...
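If it helps, you can also watch the queue directly on the box while ab runs;
for a listening socket, ss reports the current accept queue in Recv-Q and the
configured backlog in Send-Q:

ss -ltn 'sport = :5002'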
--
INADA Naoki <***@gmail.com>
Руслан Закиров
2018-04-27 16:10:08 UTC
Post by INADA Naoki
Because the log says "uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10)
full !!! (101/100)".
You used ab -c 200, so 200 connections are created at once. A listen queue
size of 100 is not enough.
I'm curious why you didn't think you should change the listen queue size...
Because it's not the right thing to do in production.

However, I agree that 100 is less than 200 and the queue will overflow if ab
generates requests faster than the app processes them. Sure, ab fires requests
quite fast.

I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.

I probably have to dig deeper into the production straces...
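Something like this against a production worker, limited to the syscalls I
care about, should keep the noise down (the worker pid here is just a
placeholder):

strace -f -tt -T -e trace=accept4,read,writev,close,epoll_wait,epoll_ctl \
    -o /tmp/combiner.strace -p <worker pid>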

What do you think? Ideas?
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
Roberto De Ioris
2018-04-27 16:18:32 UTC
Post by Руслан Закиров
I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.
I probably have to dig deeper into the production straces...
What do you think? Ideas?
Hi, if I understand correctly you are testing a production code base, so
isolating the problem will be very hard.

Can you try a test with a simple hello world? (Just to ensure coroae is
working right.)
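Something as minimal as this would do (hello.psgi is just an example name;
point the psgi option at it instead of your app):

# hello.psgi - returns a static 200 response, no routing involved
use strict;
use warnings;

my $app = sub {
    my $env = shift;
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ "Hello world\n" ] ];
};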
--
Roberto De Ioris
http://unbit.com
INADA Naoki
2018-04-28 12:23:37 UTC
Post by Руслан Закиров
However, I agree that 100 is less than 200 and the queue will overflow if ab
generates requests faster than the app processes them. Sure, ab fires requests
quite fast.
I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.
I probably have to dig deeper into the production straces...
What do you think? Ideas?
ab and wrk are tools for measuring throughput, not for emulating a realistic
workload.
They make 200 connections at once (within a very short time range) if you
specify a concurrency of 200.

I think a constant-rate load testing tool (e.g. vegeta, wrk2) may help you.
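For example, roughly (rate and duration here are just placeholders):

# wrk2 (the binary is still called wrk): constant 500 req/s over 200 connections
wrk -t4 -c200 -d60s -R500 http://192.168.1.48:5002/

# vegeta: same idea, constant rate with a report at the end
echo "GET http://192.168.1.48:5002/" | vegeta attack -rate=500 -duration=60s | vegeta report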

And if you want to emulate production requests, you must use other load
simulator tools (e.g. locust).
