DenyHostsWentBad
A few months back, a friend called me about a weird problem on the Linux server that runs his small company. The symptoms were that incoming ssh and http connections were hanging indefinitely. This was pretty odd, as the machine had been working just fine for several years.
I tried logging in remotely, and observed the same behavior - my ssh connection was accepted, but then just hung. Logging in with 'ssh -v' was no more helpful.
I suggested to my friend that he reboot the machine on the off chance it would come back alive, since I couldn't log in. No luck; the machine came back up with the same symptoms. I also walked him through starting another sshd on another port for me to attempt to log in - again, the same failure. He looked at the system logs and couldn't find anything suspicious to report. I did have him examine the top(1) output, which showed that every time someone attempted to ssh in, the child sshd process would start up and sit there consuming all the available cpu, indefinitely. My troubleshooting ideas were exhausted, so we made plans for me to come to the office and examine the machine myself the next day.
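For reference, starting a throwaway sshd on an alternate port looks roughly like this - the binary path and port number here are just illustrative:

/usr/sbin/sshd -p 2222        # second sshd instance listening on an alternate port
/usr/sbin/sshd -d -p 2222     # or: stay in the foreground and print debug output for one connection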
When I was finally in front of the system, I started doing the usual investigative work. I didn't see anything in the logs either, and I verified that every incoming ssh connection caused another child server process to be spawned that started sucking up cpu while doing nothing. The weirdest part was that Apache and sendmail were behaving exactly the same way - every incoming request spawned a new httpd or sendmail process that hung forever, eating cpu. I was completely stumped. Also, outgoing network traffic worked fine; for example, I could ssh to external servers with no problems.
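A quick way to watch those runaway children pile up (a generic sketch, not the exact command I ran) is to sort processes by cpu usage:

ps -eo pid,ppid,pcpu,etime,comm --sort=-pcpu | head -n 15    # top cpu consumers, with parent pid and age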
At this point my friend was ready to throw in the towel and rebuild the machine from scratch, but I persisted a bit longer. Perhaps, I thought, further instrumentation on sshd could provide an answer? I first tried setting the log level to DEBUG in /etc/ssh/sshd_config. That didn't reveal anything interesting. I then decided to try running strace on sshd. Since sshd forks a child process for each connection, I knew I would need to tell strace to follow children. Here's the command line I used:

strace -o sshd.trace -ff -p PID
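A couple of notes for anyone reproducing this: the parent sshd's PID can be grabbed with pgrep (or from its pidfile on many distros), and with -ff strace writes a separate output file per traced process, named after the -o argument plus that process's PID, so the fattest file is usually the runaway child:

pgrep -x sshd                    # or: cat /var/run/sshd.pid  (pidfile location is distro-dependent)
strace -o sshd.trace -ff -p <sshd PID>
ls -lS sshd.trace* | head        # biggest per-child trace = the process doing all the work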
After launching that strace on the sshd pid, I tried to ssh in to the server from an external machine. Same hang result. I inspected the sshd.trace file and saw this:
27600 open("/etc/hosts.deny", O_RDONLY) = 4 27600 fstat(4, {st_mode=S_IFREG|0644, st_size=1087, ...}) = 0 27600 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a9556c000 27600 read(4, "#\n# hosts.deny\tThis file contain"..., 4096) = 1087
followed by thousands and thousands of lines of reads on /etc/hosts.deny. Huh? How could reading /etc/hosts.deny take so long?
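At that point the obvious thing to check, for anyone seeing the same kind of trace, is how big the file actually is and what's in it:

ls -lh /etc/hosts.deny        # how big is it?
wc -l /etc/hosts.deny         # how many lines?
head -n 20 /etc/hosts.deny    # do the entries look sane?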
Looking at the file revealed the answer. We were running denyhosts on this server to protect against brute-force ssh attempts. Somehow, denyhosts had gone haywire while writing entries to /etc/hosts.deny. Instead of hosts.deny being a few hundred K, it was 500MB! The first few hundred lines were normal, but the rest of it was garbage data. Every application built with TCP Wrappers (libwrap) support was reading this corrupt file and then running off into the weeds trying to deal with way too much data. Result: hung incoming network connections.
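For contrast, a healthy denyhosts-managed /etc/hosts.deny is just a short list of tcp_wrappers "daemon: client" rules, something like this (addresses invented for illustration):

# /etc/hosts.deny
sshd: 203.0.113.45
sshd: 198.51.100.17
ALL: 192.0.2.99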
Once we figured that out, it was a simple matter to fix the problem: first I cleared out /etc/hosts.deny, then I cleared the bad data out of the denyhosts database files under /usr/share/denyhosts/data to ensure that denyhosts wouldn't continue to stuff garbage into /etc/hosts.deny.
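The cleanup boiled down to something like the following; the init-script name is an assumption and may differ per distro:

service denyhosts stop              # or: /etc/init.d/denyhosts stop
> /etc/hosts.deny                   # truncate the corrupt file (re-add any hand-written rules afterward)
ls -l /usr/share/denyhosts/data     # inspect the state files denyhosts rebuilds hosts.deny from
# remove or clean out the corrupted entries in those files, then:
service denyhosts start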
That fixed the problem, although I never figured out what originally caused it - possibly some random filesystem corruption. The problem hasn't reappeared in the several months since. The moral of the story: strace is a great tool to reach for when debugging weird hung-process issues.