Discussion:
[uWSGI] listen queue overflow in a coroae app (not restart)
Руслан Закиров
2018-04-26 09:42:25 UTC
Hello,

We see the following messages in logs:

uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)

It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:

ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/

This request does a lookup in the app's routing table, fails, and returns a 404.

App's configuration:

[uwsgi]
buffer-size = 16384
log-master-bufsize = 32768
http-socket = 0.0.0.0:5002
psgi = /srv/sports/perl-deps/bin/sprt-combiner.psgi
processes = 1
master = true
uid = sports
gid = sports
plugin = http, coroae, logfile
coroae = 512
harakiri = 10
harakiri-verbose = true
reload-mercy = 4
max-requests = 100000
req-logger = file:/var/log/uwsgi/app/combiner.access.log
stats = 0.0.0.0:5052


I've increased the number of cores and decreased the number of processes to 1
so I can strace the worker; here is the strace output from the worker:

08:54:44.965905 accept4(10, {sa_family=AF_INET, sin_port=htons(53595),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 15 <0.000021>
08:54:44.965986 read(23, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000021>
08:54:44.966752 write(2, "[TRACE][b414fb46]\n\t0 - 0.000495("..., 136) =
136 <0.000040>
08:54:44.966922 writev(23, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000035>
08:54:44.967017 close(23) = 0 <0.000031>
08:54:44.967102 writev(8, [{"[pid: 26800|app: 0|req: 113092/1"..., 208}],
1) = 208 <0.000027>
08:54:44.967237 epoll_ctl(11, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15,
u64=25378961752079}}) = 0 <0.000022>
08:54:44.967310 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=15, u64=25378961752079}}], 64, 4780) = 2 <0.000018>
08:54:44.967378 accept4(10, {sa_family=AF_INET, sin_port=htons(53587),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 23 <0.000021>
08:54:44.967464 read(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000021>
08:54:44.968212 write(2, "[TRACE][cb58862b]\n\t0 - 0.0005(0."..., 132) =
132 <0.000046>
08:54:44.968357 writev(15, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000035>
08:54:44.968450 close(15) = 0 <0.000031>
08:54:44.968524 writev(8, [{"[pid: 26800|app: 0|req: 113093/1"..., 208}],
1) = 208 <0.000027>
08:54:44.968654 epoll_ctl(11, EPOLL_CTL_ADD, 23, {EPOLLIN, {u32=23,
u64=8194797600791}}) = 0 <0.000023>
08:54:44.968727 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=23, u64=8194797600791}}], 64, 4778) = 2 <0.000018>
08:54:44.968789 accept4(10, {sa_family=AF_INET, sin_port=htons(53599),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 15 <0.000021>
08:54:44.968869 read(23, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000020>
08:54:44.969653 write(2, "[TRACE][b5f9dff7]\n\t0 - 0.000506("..., 136) =
136 <0.000047>
08:54:44.969812 writev(23, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000036>
08:54:44.969908 close(23) = 0 <0.000030>
08:54:44.969979 writev(8, [{"[pid: 26800|app: 0|req: 113094/1"..., 208}],
1) = 208 <0.000027>
08:54:44.970106 epoll_ctl(11, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=15,
u64=25383256719375}}) = 0 <0.000022>
08:54:44.970179 epoll_wait(11, [{EPOLLIN, {u32=10, u64=4294967306}},
{EPOLLIN, {u32=15, u64=25383256719375}}], 64, 4777) = 2 <0.000018>
08:54:44.970242 accept4(10, {sa_family=AF_INET, sin_port=htons(53598),
sin_addr=inet_addr("192.168.150.158")}, [16], SOCK_NONBLOCK) = 23 <0.000022>
08:54:44.970326 read(15, "GET / HTTP/1.0\r\nConnection: Keep"..., 16384) =
105 <0.000022>
08:54:44.971013 write(2, "[TRACE][c8da4f8a]\n\t0 - 0.000477("..., 136) =
136 <0.000045>
08:54:44.971156 writev(15, [{"HTTP/1.0 404 Not Found\r\ncontent-"..., 67},
{"Not found", 9}], 2) = 76 <0.000034>
08:54:44.971252 close(15) = 0 <0.000029>
08:54:44.971328 writev(8, [{"[pid: 26800|app: 0|req: 113095/1"..., 208}],
1) = 208 <0.000026>


Core usage is very low (number of distinct cores seen in the access log):

$ zcat -f /var/log/uwsgi/app/combiner.access.log | perl -nE '/core (\d+)/
&& say $1' | sort | uniq -c | wc -l
18
$ zcat -f /var/log/uwsgi/app/combiner.access.log.1 | perl -nE '/core (\d+)/
&& say $1' | sort | uniq -c | wc -l
77


Do you have any ideas about what may cause this? Or ideas on how to debug it further?
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
Roberto De Ioris
2018-04-26 15:35:23 UTC
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
Hi, ab is not a good tool for this kind of test. In any case, it looks
like the system is not able to dequeue connections fast enough.

I would increase the listen queue (remember that on Linux you also need to
set the values in /proc).
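For example, something along these lines should do (the numbers are only
illustrative, pick what fits your load):

# in the [uwsgi] section
listen = 1024

# on the host, so the kernel actually allows a backlog that large
sysctl -w net.core.somaxconn=1024
sysctl -w net.ipv4.tcp_max_syn_backlog=1024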
--
Roberto De Ioris
http://unbit.com
Руслан Закиров
2018-04-26 15:47:37 UTC
Post by Roberto De Ioris
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
Hi, ab is not a good tool for this kind of test.
Why? It does the job of reproducing the problem and simulating concurrent
pressure.
Post by Roberto De Ioris
In any case, it looks like the system is not able to dequeue connections
fast enough.
And I wonder why.
Post by Roberto De Ioris
I would increase the listen queue (remember that on Linux you also need to
set the values in /proc).
Increasing the listen queue will just make the problem worse.
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
INADA Naoki
2018-04-27 07:36:18 UTC
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Post by Руслан Закиров
Hello,
uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10) full !!! (101/100)
It's not related to a restart. I can reproduce it by bombarding the app
with the ab tool:
ab -k -l -r -c 200 -n 10000 http://192.168.1.48:5002/
[...]
Do you have any ideas about what may cause this? Or ideas on how to debug it further?
--
INADA Naoki <***@gmail.com>
Руслан Закиров
2018-04-27 11:42:21 UTC
Post by INADA Naoki
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Guys,

I know how to change the size of the listen queue.

Why do you think I should change it? Are you suggesting I increase it or
decrease it?



--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
INADA Naoki
2018-04-27 11:46:49 UTC
Post by Руслан Закиров
Post by INADA Naoki
https://uwsgi-docs.readthedocs.io/en/latest/Options.html#listen
https://serverfault.com/questions/271380/how-can-i-increase-the-value-of-somaxconn
Post by Руслан Закиров
Guys,
I know how to change the size of the listen queue.
Why do you think I should change it? Are you suggesting I increase it or
decrease it?


Because the log says "uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10)
full !!! (101/100)".

You used ab -c 200, so 200 connections are created at once. A listen queue
size of 100 is not enough.

I'm curious why you didn't think you should change the listen queue size...
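If it helps, you can also watch the queue directly on the box while ab runs;
for a listening socket, ss reports the current accept queue in Recv-Q and the
configured backlog in Send-Q:

ss -ltn 'sport = :5002'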
--
INADA Naoki <***@gmail.com>
Руслан Закиров
2018-04-27 16:10:08 UTC
Post by INADA Naoki
Because the log says "uWSGI listen queue of socket "0.0.0.0:5002" (fd: 10)
full !!! (101/100)".
You used ab -c 200, so 200 connections are created at once. A listen queue
size of 100 is not enough.
I'm curious why you didn't think you should change the listen queue size...
Because it's not the right thing to do in production.

However, I agree that 100 is less than 200 and the queue will overflow if ab
generates requests faster than the app processes them. Sure, ab fires requests
quite fast.

I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.

I probably have to dig deeper into the production straces...
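Something like this against a production worker, limited to the syscalls I
care about, should keep the noise down (the worker pid here is just a
placeholder):

strace -f -tt -T -e trace=accept4,read,writev,close,epoll_wait,epoll_ctl \
    -o /tmp/combiner.strace -p <worker pid>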

What do you think? Ideas?
--
Руслан Закиров
Head of Web Services Development
+7(916) 597-92-69, ruz @ <http://www.sports.ru/>
Roberto De Ioris
2018-04-27 16:18:32 UTC
Post by Руслан Закиров
I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.
I probably have to dig deeper into the production straces...
What do you think? Ideas?
Hi, if I understand correctly you are testing a production code base, so
isolating the problem will be very hard.

Can you try a test with a simple hello world? (Just to ensure coroae is
working right.)
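Something as minimal as this would do (hello.psgi is just an example name;
point the psgi option at it instead of your app):

# hello.psgi - returns a static 200 response, no routing involved
use strict;
use warnings;

my $app = sub {
    my $env = shift;
    return [ 200, [ 'Content-Type' => 'text/plain' ], [ "Hello world\n" ] ];
};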
--
Roberto De Ioris
http://unbit.com
INADA Naoki
2018-04-28 12:23:37 UTC
Post by Руслан Закиров
However, I agree that 100 is less than 200 and the queue will overflow if ab
generates requests faster than the app processes them. Sure, ab fires requests
quite fast.
I oversimplified my production case. In the trace presented above there is
about 1.5ms between accept() calls. In production I see 600ms gaps between
accept() calls. This is what brought me here to the list.
I probably have to dig deeper into the production straces...
What do you think? Ideas?
ab and wrk are tools for measuring throughput, not for emulating a realistic
workload.
They make 200 connections at once (within a very short time range) if you
specify a concurrency of 200.

I think a constant-rate load testing tool (e.g. vegeta, wrk2) may help you.
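For example, roughly (rate and duration here are just placeholders):

# wrk2 (the binary is still called wrk): constant 500 req/s over 200 connections
wrk -t4 -c200 -d60s -R500 http://192.168.1.48:5002/

# vegeta: same idea, constant rate with a report at the end
echo "GET http://192.168.1.48:5002/" | vegeta attack -rate=500 -duration=60s | vegeta report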

And if you want to emulate production requests, you must use other load
simulator tools (e.g. locust).
