Using strace and lsof to debug blocked processes
================================================

You can attach strace to a running process by pid to see what that process is doing, e.g.:

    strace -fp <pid>

You might see something like:

    select(9, [3 5 8], [], [], {0, 999999}) = 0 (Timeout)

In this case, 3, 5 and 8 are the file descriptors select() may read from, and the 9 is ([highest FD] + 1). `{0, 999999}` is a time struct (seconds, microseconds) which says that select will wait just under one second before timing out. `= 0 (Timeout)` is the return value of select, indicating that none of the file descriptors were ready to read from.

Now figure out what these file descriptors are. As root, run:

    lsof -p <pid> -ad <fd>

to see what each descriptor refers to, for example a socket on which the process is waiting for a response. You can also list several file descriptors separated by commas:

    [root@ops-2-portal ~]# lsof -p 2947 -ad 3,5,8
    COMMAND    PID   USER   FD   TYPE   DEVICE SIZE NODE NAME
    mongrel_r 2947 deploy    3u  IPv4 57390385       TCP *:vcom-tunnel (LISTEN)
    mongrel_r 2947 deploy    5u  IPv4 57390749       TCP ops-2-portal:42717 (LISTEN)
    mongrel_r 2947 deploy    8u  IPv4 58983912       TCP ops-2-portal:35191->ops-2-websvc:7077 (ESTABLISHED)

As you can see, select() was waiting for data on these file descriptors. FD 8 shows that this mongrel has an established TCP connection to ops-2-websvc:7077, but no data is arriving on it.

Resources
---------

* [Using lsof in the Real World](http://wikis.sun.com/pages/viewpage.action?pageId=49906332)
* [Finding open files with lsof](http://www.ibm.com/developerworks/aix/library/au-lsof.html)
* [5 simple ways to troubleshoot using strace](http://www.hokstad.com/5-simple-ways-to-troubleshoot-using-strace.html)
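
The step from a select() line in the strace output to the matching lsof call can be scripted. The following is a minimal sketch, assuming strace prints the read set in the bracketed form shown earlier (`[3 5 8]`). `select_read_fds` is a hypothetical helper name, not part of strace or lsof, and `<pid>` remains a placeholder for the process you are inspecting.

```shell
#!/bin/sh
# Hypothetical helper (not part of strace or lsof): extract the read-FD set
# from a select() line as printed by strace and print it comma-separated,
# in the form that `lsof -d` accepts.
select_read_fds() {
    # $1: one line of strace output, e.g.
    #     select(9, [3 5 8], [], [], {0, 999999}) = 0 (Timeout)
    printf '%s\n' "$1" |
        sed -n 's/^.*select([0-9]*, \[\([0-9 ]*\)\].*$/\1/p' |
        tr ' ' ','
}

line='select(9, [3 5 8], [], [], {0, 999999}) = 0 (Timeout)'
fds=$(select_read_fds "$line")
echo "$fds"    # prints 3,5,8

# Print the lsof command instead of running it; replace <pid> with the real pid.
echo "lsof -p <pid> -ad $fds"
```

The sketch only looks at the first bracketed set (the descriptors select() reads from), since those are the ones the walkthrough above inspects with lsof. Lines that are not select() calls produce no output.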