Discussion: [uWSGI] uWSGI master process killing workers at random
Ryan Showalter
2011-10-14 08:11:05 UTC
Hey everyone,

I've been running into an issue lately where after my server has been
running for a few hours, the uWSGI master process will start killing
off workers for seemingly no reason. When this happens, spool jobs
also fail to be created, but no failures seem to occur in the logs.
What's weird is that this error seems to have only begun manifesting
itself in the last couple of weeks. I'm not sure if this is related
to increased traffic or what. My memory usage is not outrageous, I
still have plenty of physical memory and swap available. Also, CPU
usage is very minimal. Restarting uwsgi will fix the issue for a few
hours before it begins to creep up again: at first just one or two
workers are killed off every few minutes, and eventually it gets to
the point where at least one worker is killed on every request.

Below is my configuration, along with an excerpt from my log file
which shows several workers being killed and respawned within less
than a minute. I'm running uWSGI 0.9.9.2 with debug enabled (this
issue also happens with debug disabled).

Any help is much appreciated!

[program:uwsgi-www]
command=/opt/webapps/www.example.com/bin/uwsgi
--master
--processes 16
--memory-report
--harakiri 7200
--vacuum
--max-requests 500
--optimize 2
--enable-threads
--cache 1000
--import example-app.custom.uwsgi_utils
--spooler /opt/webapps/www.example.com/spooler
--home /opt/webapps/www.example.com/
--pythonpath /opt/webapps/www.example.com
--spooler-import example-app.custom.uwsgi_utils
--socket /opt/webapps/www.example.com/sock/uwsgi.sock
--pidfile /opt/webapps/www.example.com/pid/wsgi.pid
--wsgi-file /opt/webapps/www.example.com/django.wsgi
directory=/opt/webapps/www.example.com/example-app
environment=DJANGO_SETTINGS_MODULE='example-app.settings'
stdout_logfile=/var/log/uwsgi/www.example.com_supervisord.log
stderr_logfile=/var/log/uwsgi/www.example.com_supervisord.err
user=www-data
autostart=true
autorestart=true
stopsignal=QUIT


DAMN ! worker 12 (pid: 446) died :( trying respawn ...
Respawned uWSGI worker 12 (new pid: 1174)
DAMN ! worker 9 (pid: 893) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 9 (new pid: 1175)
DAMN ! worker 10 (pid: 1074) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 10 (new pid: 1176)
DAMN ! worker 14 (pid: 1101) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 14 (new pid: 1177)
DAMN ! worker 13 (pid: 1123) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 13 (new pid: 1178)
DAMN ! worker 1 (pid: 1134) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 1 (new pid: 1179)
DAMN ! worker 8 (pid: 1148) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 8 (new pid: 1180)
DAMN ! worker 5 (pid: 1149) died :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 1181)
DAMN ! worker 6 (pid: 1157) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2ec3bec 0xb2f43b0c 1
{address space usage: 111661056 bytes/106MB} {rss usage: 19304448
bytes/18MB} [pid: 1159|app: 0|req: 176/70] 50.23.94.74 () {30 vars in
390 bytes} [Fri Oct 14 00:04:46 2011] GET / => generated 8514 bytes in
227 msecs (HTTP/1.0 200) 4 headers in 333 bytes (1 switches on core 0)
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 6 (new pid: 1182)
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Fri Oct 14 00:04:57 2011 - master sent signal 63 to worker 8
Fri Oct 14 00:04:58 2011 - master sent signal 40 to worker 1
DAMN ! worker 8 (pid: 1180) died :( trying respawn ...
Respawned uWSGI worker 8 (new pid: 1183)
DAMN ! worker 1 (pid: 1179) died :( trying respawn ...
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Respawned uWSGI worker 1 (new pid: 1184)
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Fri Oct 14 00:04:59 2011 - master sent signal 47 to worker 9
DAMN ! worker 9 (pid: 1175) died :( trying respawn ...
Respawned uWSGI worker 9 (new pid: 1185)
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950
Fri Oct 14 00:05:02 2011 - master sent signal 42 to worker 9
DAMN ! worker 9 (pid: 1185) died :( trying respawn ...
Respawned uWSGI worker 9 (new pid: 1191)
[uWSGI DEBUG] called 0xb2e9d534 0xb2f4702c 6950

Ryan-
Roberto De Ioris
2011-10-14 08:15:20 UTC
Post by Ryan Showalter
Hey everyone,
I've been running into an issue lately where after my server has been
running for a few hours, the uWSGI master process will start killing
off workers for seemingly no reason. When this happens, spool jobs
also fail to be created, but no failures seem to occur in the logs.
What's weird is that this error seems to have only begun manifesting
itself in the last couple of weeks. I'm not sure if this is related
to increased traffic or what. My memory usage is not outrageous, I
still have plenty of physical memory and swap available. Also, CPU
usage is very minimal. Restarting uwsgi will fix the issue for a few
hours before it begins to creep up again, at first just one or two
workers will be killed off every few minutes, and eventually it gets
to the point where at least 1 worker is killed at every request.
Below is my configuration, along with an excerpt from my log file
which shows several workers being killed and respawned within less
than a minute. I'm running uWSGI 0.9.9.2 with debug enabled (this
issue also happens with debug disabled).
Any help is much appreciated!
[program:uwsgi-www]
command=/opt/webapps/www.example.com/bin/uwsgi
--master
--processes 16
--memory-report
--harakiri 7200
--vacuum
--max-requests 500
Can you try removing the --max-requests directive?
You should check in the logs whether the continuous reloads start after the max-requests count has been reached.
Post by Ryan Showalter
--optimize 2
remove the optimizations as well (just to be sure)


--
Roberto De Ioris
http://unbit.it
JID: ***@jabber.unbit.it
Ryan Showalter
2011-10-14 08:32:40 UTC
Post by Roberto De Ioris
Post by Ryan Showalter
Hey everyone,
I've been running into an issue lately where after my server has been
running for a few hours, the uWSGI master process will start killing
off workers for seemingly no reason.  When this happens, spool jobs
also fail to be created, but no failures seem to occur in the logs.
What's weird is that this error seems to have only begun manifesting
itself in the last couple of weeks.  I'm not sure if this is related
to increased traffic or what.  My memory usage is not outrageous, I
still have plenty of physical memory and swap available.  Also, CPU
usage is very minimal.  Restarting uwsgi will fix the issue for a few
hours before it begins to creep up again, at first just one or two
workers will be killed off every few minutes, and eventually it gets
to the point where at least 1 worker is killed at every request.
Below is my configuration, along with an excerpt from my log file
which shows several workers being killed and respawned within less
than a minute.  I'm running uWSGI 0.9.9.2 with debug enabled (this
issue also happens with debug disabled).
Any help is much appreciated!
[program:uwsgi-www]
command=/opt/webapps/www.example.com/bin/uwsgi
 --master
 --processes 16
 --memory-report
 --harakiri 7200
 --vacuum
 --max-requests 500
Can you try removing --max-requests directive ?
You should check in the logs if the continuous reloads starts after the max requests count has been reached
Post by Ryan Showalter
 --optimize 2
remove even optimizations (just to be sure)
Thanks Roberto,

I'll let you know how it goes tomorrow after it's had a chance to
build up the error again.
Ryan-
Ryan Showalter
2011-10-14 18:30:06 UTC
Unfortunately the issue persists, albeit at a lesser scale. Spooler
jobs were successfully created this morning and the workers don't
appear to be dying off quite as often. The config is the same as I
had posted before minus the max-requests and optimize flags. Here are
two excerpts from the logs:

[uWSGI DEBUG] HTTP_ACCEPT:
application/vnd.rim.html,text/html,application/xhtml+xml,application/vnd.wap.xhtml+xml,text/vnd.sun.j2me.app-descriptor,image/vnd.rim.png,image/jpeg,application/x-vnd.rim.pme.b,application/vnd.rim.ucs,image/gif;anim=1,application/vnd.rim.css;v=1,text/css;media=screen,application/vnd.wap.wmlc;q=0.9,application/vnd.wap.wmlscriptc;q=0.7,text/vnd.wap.wml;q=0.7,*/*;q=0.5
[uWSGI DEBUG] HTTP_PROFILE:
http://www.blackberry.net/go/mobile/profiles/uaprof/8330m/4.5.0.rdf
[uWSGI DEBUG] HTTP_VIA: BISB_3.5.1.84, 1.1
pmds116.p4.bisb5.blackberry:3128 (squid/2.7.STABLE7)
[uWSGI DEBUG] HTTP_CACHE_CONTROL: max-age=259200
[uWSGI DEBUG] HTTP_CONNECTION: keep-alive
[uWSGI DEBUG] called 0xb2f478ec 0xb2f4742c 1
{address space usage: 115351552 bytes/110MB} {rss usage: 23814144
bytes/22MB} [pid: 8244|app: 0|req: 870/662] 74.82.64.16 () {46 vars in
1311 bytes} [Fri Oct 14 11:22:15 2011] GET / => generated 0 bytes in 2
msecs (HTTP/1.0 301) 4 headers in 248 bytes (1 switches on core 0)
Fri Oct 14 11:22:15 2011 - master sent signal 19 to worker 4
DAMN ! worker 4 (pid: 8305) died :( trying respawn ...
Respawned uWSGI worker 4 (new pid: 8309)
[uWSGI DEBUG] called 0xb2f18374 0xb2fc802c 4641
Fri Oct 14 11:22:21 2011 - master sent signal 45 to worker 10
DAMN ! worker 10 (pid: 8295) died :( trying respawn ...
Respawned uWSGI worker 10 (new pid: 8310)
[uWSGI DEBUG] called 0xb2f18374 0xb2fc802c 4641

...

[uWSGI DEBUG] called 0xb2f478ec 0xb2f4742c 1
{address space usage: 149348352 bytes/142MB} {rss usage: 24137728
bytes/23MB} [pid: 8280|app: 0|req: 719/692] 154.5.185.254 () {40 vars
in 1057 bytes} [Fri Oct 14 11:24:44 2011] GET
/register/dJhBARJLRDhkJbGw/31e-b71d55fcd539a7ebdb2b/ => generated 0
bytes in 1 msecs (HTTP/1.1 301) 4 headers in 299 bytes (1 switches on
core 0)
DAMN ! worker 10 (pid: 8333) died :( trying respawn ...
Respawned uWSGI worker 10 (new pid: 8341)
[uWSGI DEBUG] called 0xb2f18374 0xb2fc802c 4641
[uWSGI DEBUG] uwsgi payload size: 972 (0x3CC) modifier1: 0 modifier2: 0
[uWSGI DEBUG] PATH_INFO=/register/dJhBARJLRDhkJbGw/31e-b71d55fcd539a7ebdb2b/
[uWSGI DEBUG] SERVER_NAME=www.example.com
searching for in 0xb2f478ec
[uWSGI DEBUG] QUERY_STRING:
[uWSGI DEBUG] REQUEST_METHOD: GET
[uWSGI DEBUG] CONTENT_TYPE:
[uWSGI DEBUG] CONTENT_LENGTH:
[uWSGI DEBUG] REQUEST_URI: /register/dJhBARJLRDhkJbGw/31e-b71d55fcd539a7ebdb2b/
[uWSGI DEBUG] PATH_INFO: /register/dJhBARJLRDhkJbGw/31e-b71d55fcd539a7ebdb2b/
[uWSGI DEBUG] DOCUMENT_ROOT: /var/www/www.example.com
[uWSGI DEBUG] SERVER_PROTOCOL: HTTP/1.1
[uWSGI DEBUG] REMOTE_ADDR: 154.5.185.254
[uWSGI DEBUG] REMOTE_PORT: 51910
[uWSGI DEBUG] SERVER_PORT: 443
[uWSGI DEBUG] SERVER_NAME: www.example.com
[uWSGI DEBUG] UWSGI_SCHEME: https
[uWSGI DEBUG] HTTP_ACCEPT: */*
[uWSGI DEBUG] HTTP_ACCEPT_LANGUAGE: en
[uWSGI DEBUG] HTTP_ACCEPT_ENCODING: gzip, deflate
[uWSGI DEBUG] HTTP_COOKIE: __utmc=241079170;
sessionid=3f76d18acea925f202791ea70d43fead;
__utmb=241079170.2.10.1318616649;
__utmz=241079170.1318616649.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none);
csrftoken=8d1ae47378b993aa2fb251c497800867;
__utma=241079170.1269303664.1318616649.1318616649.1318616649.1
[uWSGI DEBUG] HTTP_USER_AGENT: Mozilla/5.0 (Macintosh; U; Intel Mac OS
X 10_4_11; en) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/4.1.3
Safari/533.19.4
[uWSGI DEBUG] HTTP_CONNECTION: keep-alive
[uWSGI DEBUG] HTTP_HOST: www.example.com
[uWSGI DEBUG] called 0xb2f478ec 0xb2f4742c 1
{address space usage: 115351552 bytes/110MB} {rss usage: 23818240
bytes/22MB} [pid: 8244|app: 0|req: 875/693] 154.5.185.254 () {40 vars
in 972 bytes} [Fri Oct 14 11:24:44 2011] GET
/register/dJhBARJLRDhkJbGw/31e-b71d55fcd539a7ebdb2b/ => generated 1250
bytes in 25 msecs (HTTP/1.1 200) 5 headers in 258 bytes (1 switches on
core 0)
[uWSGI DEBUG] uwsgi payload size: 1066 (0x42A) modifier1: 0 modifier2: 0
[uWSGI DEBUG] PATH_INFO=/biz/196/223/self_visit/
[uWSGI DEBUG] SERVER_NAME=www.example.com
searching for in 0xb2f478ec
[uWSGI DEBUG] QUERY_STRING:
[uWSGI DEBUG] REQUEST_METHOD: GET
[uWSGI DEBUG] CONTENT_TYPE:
[uWSGI DEBUG] CONTENT_LENGTH:
[uWSGI DEBUG] REQUEST_URI: /biz/196/223/self_visit/
[uWSGI DEBUG] PATH_INFO: /biz/196/223/self_visit/
[uWSGI DEBUG] DOCUMENT_ROOT: /var/www/www.example.com
[uWSGI DEBUG] SERVER_PROTOCOL: HTTP/1.1
[uWSGI DEBUG] REMOTE_ADDR: 198.228.213.97
[uWSGI DEBUG] REMOTE_PORT: 49905
[uWSGI DEBUG] SERVER_PORT: 80
[uWSGI DEBUG] SERVER_NAME: www.example.com
[uWSGI DEBUG] UWSGI_SCHEME: http
[uWSGI DEBUG] HTTP_HOST: www.example.com
[uWSGI DEBUG] HTTP_USER_AGENT: Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like
Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8C148
[uWSGI DEBUG] HTTP_ACCEPT:
application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
[uWSGI DEBUG] HTTP_REFERER: http://www.example.com/biz/196/223/self_visit/
[uWSGI DEBUG] HTTP_CACHE_CONTROL: max-age=0
[uWSGI DEBUG] HTTP_ACCEPT_LANGUAGE: en-us
[uWSGI DEBUG] HTTP_ACCEPT_ENCODING: gzip, deflate
[uWSGI DEBUG] HTTP_COOKIE: csrftoken=65533535a38a0fd90719a6b85537978f;
sessionid=9d0a6e858a6edf4b3d07e3b2e0863cd1;
__utma=241079170.59454198.1311924321.1316443452.1316448832.26;
__utmz=241079170.1311924321.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none);
__qca=P0-871702505-1311924321780
[uWSGI DEBUG] HTTP_CONNECTION: keep-alive
[uWSGI DEBUG] called 0xb2f478ec 0xb2f4742c 1
{address space usage: 132333568 bytes/126MB} {rss usage: 23863296
bytes/22MB} [pid: 8155|app: 0|req: 730/691] 198.228.213.97 () {44 vars
in 1066 bytes} [Fri Oct 14 11:24:46 2011] GET /biz/196/223/self_visit/
=> generated 2982 bytes in 89 msecs (HTTP/1.1 200) 6 headers in 379
bytes (1 switches on core 0)
Fri Oct 14 11:24:46 2011 - master sent signal 56 to worker 10
DAMN ! worker 10 (pid: 8341) died :( trying respawn ...
Respawned uWSGI worker 10 (new pid: 8342)
[uWSGI DEBUG] called 0xb2f18374 0xb2fc802c 4641


Ryan-
Roberto De Ioris
2011-10-15 05:35:45 UTC
Post by Ryan Showalter
Unfortunately the issue persists, albeit at a lesser scale. Spooler
jobs were successfully created this morning and the workers don't
appear to be dying off quite as often. The config is the same as I
had posted before minus the max-requests and optimize flags. Here are
OK, this confirms an issue with max-requests and the spooler (propagating
to the workers).

Your latest errors seem to be triggered by the signal handling subsystem.
I do not know if you need to spawn threads in your app, but if you do not
need them, please disable --enable-threads to see if the signal-handler
crash is triggered by a broken GIL lock.

Many Thanks
--
Roberto De Ioris
http://unbit.it
Ryan Showalter
2011-10-15 18:04:06 UTC
Post by Roberto De Ioris
Unfortunately the issue persists, albeit at a lesser scale.  Spooler
jobs were successfully created this morning and the workers don't
appear to be dying off quite as often.  The config is the same as I
had posted before minus the max-requests and optimize flags.  Here are
Ok, this confirm an issue with max-requests and the spooler (propagating
to workers).
Your latest errors seems triggered by the signal handling subsystem.
I do not know if you need to spawn threads in your app, but if you do not
need it, please disable --enable-threads to see if the signal-handler
crash is triggered by a broken GIL lock.
Unfortunately I am currently using the enable-threads option to send
out emails. It looks something like this:

import Queue
from threading import Thread

queues = {}

class queueconsumer(object):
    def __init__(self, name, num=1, **kwargs):
        self.name = name
        self.num = num
        self.queue = Queue.Queue()
        self.threads = []
        self.func = None
        queues[self.name] = self

    @staticmethod
    def consumer(self):
        # each worker thread blocks on the queue and runs the wrapped
        # function on every item it pulls off
        while True:
            req = self.queue.get()
            self.func(req)
            self.queue.task_done()

    def __call__(self, f):
        self.func = f
        for i in range(self.num):
            t = Thread(target=self.consumer, args=(self,))
            self.threads.append(t)
            t.daemon = True
            t.start()


@queueconsumer('email', 2)
def send_email_task(args):
    from django.core.mail import EmailMultiAlternatives
    from django.utils import simplejson

    body = args['body']
    email_details = simplejson.loads(body)
    email = EmailMultiAlternatives(**email_details)
    email.send(fail_silently=False)

It's basically taken from the examples you guys posted a while ago for
multi-threaded spool jobs, but I don't create a spool job in this
situation; I just put the email to be sent on the queue in that
dictionary and let the threads take care of it. I can re-work this code
if you'd like me to, to see whether it is causing the issue.
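For context, the producer side of this pattern would look roughly like the
following; the payload shape is inferred from the args['body'] /
simplejson.loads() lines above, and queue_email is a hypothetical helper,
not code from the actual app:

# hypothetical producer-side sketch, based on the queues dict above
from django.utils import simplejson

def queue_email(**email_details):
    # hand the serialized email off to the two 'email' consumer threads
    queues['email'].queue.put({'body': simplejson.dumps(email_details)})

queue_email(subject='Hello', body='Hi there',
            from_email='noreply@example.com', to=['user@example.com'])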

Also, I spoke too soon when I said that spooler jobs were being
created again: after being left running long enough, the spooler
system shut down and no longer created spool jobs or executed the
spool jobs that had been left in its spool.

Thanks for all the help on this,
Ryan-
Roberto De Ioris
2011-10-15 19:21:08 UTC
Post by Ryan Showalter
Post by Roberto De Ioris
Unfortunately the issue persists, albeit at a lesser scale.  Spooler
jobs were successfully created this morning and the workers don't
appear to be dying off quite as often.  The config is the same as I
had posted before minus the max-requests and optimize flags.  Here are
Ok, this confirm an issue with max-requests and the spooler (propagating
to workers).
Your latest errors seems triggered by the signal handling subsystem.
I do not know if you need to spawn threads in your app, but if you do not
need it, please disable --enable-threads to see if the signal-handler
crash is triggered by a broken GIL lock.
Unfortunately I am currently using the enable-threads option to send
queues = {}
self.name = name
self.num = num
self.queue = Queue.Queue()
self.threads = []
self.func = None
queues[self.name] = self
@staticmethod
req = self.queue.get()
self.func(req)
self.queue.task_done()
self.func = f
t = Thread(target=self.consumer, args=(self,))
self.threads.append(t)
t.daemon = True
t.start()
@queueconsumer('email', 2)
from django.core.mail import EmailMultiAlternatives
from django.utils import simplejson
body = args['body']
email_details = simplejson.loads(body)
email = EmailMultiAlternatives(**email_details)
email.send(fail_silently=False)
It's basically taken from the examples you guys posted a while ago for
multi-threaded spool jobs, but I don't create a spool job in this
situation, I just queue the email to be sent into the dictionary and
let the threads take care of it. I can re-work the code if you guys
would like me to see if this is causing the issue or not.
Also, I spoke too soon when I said that spooler jobs were being
created again, after being left running long enough, the spooler
system shut down and no longer created spool jobs or executed spool
jobs that had been left in it's spool.
Thanks for all the help on this,
Ryan-
The problem with Python threads is that you can lose exceptions, and the
result is your spooler process silently dying. If the master process
sees a process respawning too fast, it slows the respawns down to avoid
fork bombing, so I assume this is your current status.

For some reason one of the threads in your spooler is crashing the whole
subsystem, but you do not get the exception.

If you can re-work this part of your app to be sure the problem is with
threads+spooler, it would be useful (check out mules in the 1.0 branch). In
the meantime I will check whether we can find a workaround to report thread
exceptions asynchronously.
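One application-side workaround (just a sketch against the queueconsumer
class posted earlier, not a uWSGI feature) is to catch and log exceptions
inside the consumer loop, so a failing job cannot silently take the whole
thread down:

import traceback

class queueconsumer(object):
    # ... __init__ and __call__ as posted earlier ...

    @staticmethod
    def consumer(self):
        while True:
            req = self.queue.get()
            try:
                self.func(req)
            except Exception:
                # log the failure instead of letting the exception
                # silently kill the consumer thread
                traceback.print_exc()
            finally:
                self.queue.task_done()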
--
Roberto De Ioris
http://unbit.it
Ryan Showalter
2011-10-19 18:01:29 UTC
I found the cause of the workers being killed. A few weeks ago I added
the following function to the uwsgi_utils.py file (see the config in my
original post):

@timer(900)
def update_directory_visible_cache(num):
    import cPickle as pickle

    from django.core.cache import cache

    from businesses.models import Business

    dir_visible_list = []
    for business in Business.objects.filter(biz_status='ACTIVE'):
        if business.directory_visible:
            dir_visible_list.append(business.pk)

    cache.set('dir_visible_list', pickle.dumps(dir_visible_list), 1800)


After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs. After reviewing the old logs more thoroughly,
I noticed several lines like this:

registered signal 66
you can register max 64 timers !!!

registered signal 67
you can register max 64 timers !!!

registered signal 68
you can register max 64 timers !!!

etc etc..

I currently have no way to confirm that the timer jobs were actually
being executed; I guess I could re-implement this on my dev site and
print debug info from the timer jobs if you guys need more info.

I hope this helps!

Ryan-
Roberto De Ioris
2011-10-19 18:22:15 UTC
Post by Ryan Showalter
I found the cause behind why the workers were being killed. A few
weeks ago I added the following function in the uwsgi_utils.py file
@timer(900)
import cPickle as pickle
from django.core.cache import cache
from businesses.models import Business
dir_visible_list = []
dir_visible_list.append(business.pk)
cache.set('dir_visible_list', pickle.dumps(dir_visible_list), 1800)
After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs. After reviewing the old logs more thoroughly,
registered signal 66
you can register max 64 timers !!!
registered signal 67
you can register max 64 timers !!!
registered signal 68
you can register max 64 timers !!!
etc etc..
This condition should raise a Python exception:

ValueError("unable to add timer")

Can you check whether it is raised? (You should see the message in the logs
soon after the "you can register max 64 timers !!!" error.)
--
Roberto De Ioris
http://unbit.it
Ryan Showalter
2011-10-19 18:43:17 UTC
Post by Roberto De Ioris
I found the cause behind why the workers were being killed.  A few
weeks ago I added the following function in the uwsgi_utils.py file
@timer(900)
    import cPickle as pickle
    from django.core.cache import cache
    from businesses.models import Business
    dir_visible_list = []
            dir_visible_list.append(business.pk)
    cache.set('dir_visible_list', pickle.dumps(dir_visible_list), 1800)
After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs.  After reviewing the old logs more thoroughly,
registered signal 66
you can register max 64 timers !!!
registered signal 67
you can register max 64 timers !!!
registered signal 68
you can register max 64 timers !!!
etc etc..
ValueError("unable to add timer")
can you check if it is raised ? (you should see the message in the logs
soon after the "you can register max 64 timers !!!" error)
I do in fact see the following in my stdout logs:

... blah blah...

unable to add timer
... blah blah blah...

But since the "you can register max 64 timers !!!" is in my stderr
logs and "unable to add timer" is in the stdout logs, it's impossible
for me to correlate the two since neither of the messages in either of
the logs have any sort of timestamps.
Roberto De Ioris
2011-10-19 18:47:26 UTC
Post by Ryan Showalter
Post by Roberto De Ioris
I found the cause behind why the workers were being killed.  A few
weeks ago I added the following function in the uwsgi_utils.py file
@timer(900)
    import cPickle as pickle
    from django.core.cache import cache
    from businesses.models import Business
    dir_visible_list = []
            dir_visible_list.append(business.pk)
    cache.set('dir_visible_list', pickle.dumps(dir_visible_list), 1800)
After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs.  After reviewing the old logs more thoroughly,
registered signal 66
you can register max 64 timers !!!
registered signal 67
you can register max 64 timers !!!
registered signal 68
you can register max 64 timers !!!
etc etc..
ValueError("unable to add timer")
can you check if it is raised ? (you should see the message in the logs
soon after the "you can register max 64 timers !!!" error)
... blah blah...
unable to add timer
... blah blah blah...
But since the "you can register max 64 timers !!!" is in my stderr
logs and "unable to add timer" is in the stdout logs, it's impossible
for me to correlate the two since neither of the messages in either of
the logs have any sort of timestamps.
Signals are registered as soon as possible, so if your module defines a
function at address 0xAAAAAAAA, uWSGI will put this address in the signal
table.

But at some point in your module code an exception is raised, all of the
module's memory is freed, and 0xAAAAAAAA is no longer valid.

The signal subsystem has no way to know this, so it will simply call an
undefined memory area and the worker crashes.

Honestly, I see no way to fix this behaviour (the signal subsystem would
have to roll back all of the registered signals, and that is impractical).
By the way, the 64-timer limit is only "theoretical"; it could be raised in
the code without problems (or made configurable). In the meantime you can
use @rbtimer instead of @timer for timers beyond the 64th (you can mix
rbtimers and timers without problems).
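For example, assuming the @timer/@rbtimer decorators are the stock ones
from uwsgidecorators (the thread does not show how they are imported),
mixing the two would look roughly like this:

from uwsgidecorators import timer, rbtimer

@timer(900)            # plain timer: uses one of the 64 timer slots
def refresh_cache(num):
    pass

@rbtimer(900)          # red-black-tree timer: not bound by that 64-slot limit
def refresh_other_cache(num):
    pass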
--
Roberto De Ioris
http://unbit.it
Ryan Showalter
2011-10-19 19:15:36 UTC
Post by Roberto De Ioris
Post by Ryan Showalter
Post by Roberto De Ioris
I found the cause behind why the workers were being killed.  A few
weeks ago I added the following function in the uwsgi_utils.py file
@timer(900)
    import cPickle as pickle
    from django.core.cache import cache
    from businesses.models import Business
    dir_visible_list = []
            dir_visible_list.append(business.pk)
    cache.set('dir_visible_list', pickle.dumps(dir_visible_list),
1800)
After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs.  After reviewing the old logs more thoroughly,
registered signal 66
you can register max 64 timers !!!
registered signal 67
you can register max 64 timers !!!
registered signal 68
you can register max 64 timers !!!
etc etc..
ValueError("unable to add timer")
can you check if it is raised ? (you should see the message in the logs
soon after the "you can register max 64 timers !!!" error)
... blah blah...
unable to add timer
... blah blah blah...
But since the "you can register max 64 timers !!!" is in my stderr
logs and "unable to add timer" is in the stdout logs, it's impossible
for me to correlate the two since neither of the messages in either of
the logs have any sort of timestamps.
Signals are registered ASAP so if your module define function at address
0xAAAAAAAA uWSGI will put this address in the signal table.
But at a point in your module code, an exception is raised and all of the
module memory is freed and 0xAAAAAAAA is no more valid.
Where exactly does it become an issue when an exception is raised?
Raising exceptions is an essential part of my @spool functions, so that
they keep retrying (SPOOL_RETRY) until the function returns
successfully and SPOOL_OK is sent to uwsgi.
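For comparison, the same retry behaviour can also be expressed with explicit
return values instead of exceptions; a sketch assuming the stock @spool
decorator and the uwsgi SPOOL_OK / SPOOL_RETRY constants, with do_send
standing in for the real work:

import uwsgi
from uwsgidecorators import spool

def do_send(args):
    # hypothetical stand-in for the real work (e.g. sending the email)
    pass

@spool
def send_email_job(args):
    try:
        do_send(args)
    except Exception:
        return uwsgi.SPOOL_RETRY   # keep the job in the spool and retry later
    return uwsgi.SPOOL_OK          # job finished, remove it from the spool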
Ryan Showalter
2011-10-19 20:03:41 UTC
Post by Ryan Showalter
Post by Roberto De Ioris
Post by Ryan Showalter
Post by Roberto De Ioris
I found the cause behind why the workers were being killed.  A few
weeks ago I added the following function in the uwsgi_utils.py file
@timer(900)
    import cPickle as pickle
    from django.core.cache import cache
    from businesses.models import Business
    dir_visible_list = []
            dir_visible_list.append(business.pk)
    cache.set('dir_visible_list', pickle.dumps(dir_visible_list),
1800)
After removing this function, everything has been working perfectly
over the last 2 days and no workers have been killed (that I can see,
at least) in the logs.  After reviewing the old logs more thoroughly,
registered signal 66
you can register max 64 timers !!!
registered signal 67
you can register max 64 timers !!!
registered signal 68
you can register max 64 timers !!!
etc etc..
ValueError("unable to add timer")
can you check if it is raised ? (you should see the message in the logs
soon after the "you can register max 64 timers !!!" error)
... blah blah...
unable to add timer
... blah blah blah...
But since the "you can register max 64 timers !!!" is in my stderr
logs and "unable to add timer" is in the stdout logs, it's impossible
for me to correlate the two since neither of the messages in either of
the logs have any sort of timestamps.
Signals are registered ASAP so if your module define function at address
0xAAAAAAAA uWSGI will put this address in the signal table.
But at a point in your module code, an exception is raised and all of the
module memory is freed and 0xAAAAAAAA is no more valid.
Where exactly does it become an issue when an exception is raised?
they keep retrying (SPOOL_RETRY) until the function returns
successfully and thus SPOOL_OK is sent to uwsgi
Ah, I just re-read the thread and noticed that you said the rogue
exception may be escaping from the spool thread somewhere. If this is
the case, why do you think the spooler exception is getting through
(and then possibly crashing the subsystem) only when timers are being
registered? As far as I can tell, the timers are not raising any sort
of exception when they're actually being executed.
Roberto De Ioris
2011-10-19 23:59:53 UTC
Post by Ryan Showalter
Post by Ryan Showalter
Post by Roberto De Ioris
Signals are registered ASAP so if your module define function at address
0xAAAAAAAA uWSGI will put this address in the signal table.
But at a point in your module code, an exception is raised and all of the
module memory is freed and 0xAAAAAAAA is no more valid.
Where exactly does it become an issue when an exception is raised?
they keep retrying (SPOOL_RETRY) until the function returns
successfully and thus SPOOL_OK is sent to uwsgi
Ah, I just re-read the thread and noticed that you said that the rogue
exception may be escaping from the spool thread somewhere. If this is
the case, why do you think the spooler exception is getting through
(and then possibly crashing the subsystem) only when timers are being
registered? As far as I can tell, the timer is not raising any sort
of exception when they're actually being executed.
You can safely raise all of the exceptions you want. You only have to
worry about exceptions raised during module import (with --import), as all
of the signal handlers registered before the exception will become invalid
and will crash workers as soon as they are called. For example:

# mymodule.py
register_signal(1,...)
register_signal(2,...)
register_signal(3,...)
register_signal(4,...)
register_signal(5,...)
raise Exception("foo")
register_signal(6,...)

If you do a --import mymodule, you will register the first 5 signals, but
as soon as the exception is raised their function pointers are removed from
memory. Then, if you raise any signal from 1 to 5, your worker will crash.

You only need to check for exceptions in the importing phase; then you are
safe. Removing your 65-67 timer registrations (the ones raising the
exception) allowed uWSGI to successfully import the module.

In a standard Python environment this is not a problem, as an uncaught
exception will always destroy the interpreter and exit the process.

uWSGI's --import, instead, can continue its execution (since you can try to
import multiple modules in sequence), which opens the door to this specific
problem.

Simply solve the timer-count issue and you are ready. (By the way, I will
raise the number of allowed timers to 256 in 1.0.)
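A defensive pattern that follows from this (a sketch, again assuming the
stock uwsgidecorators module): guard the registrations that might fail
inside the imported module itself, so an import-time exception cannot leave
the earlier handlers pointing at freed memory:

# uwsgi_utils.py -- guard risky import-time registrations
from uwsgidecorators import timer

@timer(900)
def first_job(num):
    pass

try:
    # anything here that can blow up at import time is contained,
    # so the handlers registered above stay valid
    @timer(900)
    def maybe_one_too_many(num):
        pass
except ValueError:
    # e.g. "unable to add timer" once the 64-timer limit is reached
    pass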
--
Roberto De Ioris
http://unbit.it