June 14, 2002

system process monitoring...

http://linux.oreillynet.com/pub/a/linux/2002/05/09/sysadminguide.html

Looks like one of the tools referenced there might be useful.

monit web site: http://www.tildeslash.com/monit/

extract:

monit
Finally, I found Jan-Henrik Haukeland's monit on freshmeat.net. (I installed
version 2.2.1 for this article; the newest version is 2.3, the most
significant change of which is the addition of service grouping.) I chose to
implement it because, unlike supervise, it integrated easily into my
existing server, and, unlike mon, its feature set mapped very neatly onto my
requirement set. monit does just about exactly what I need a DMD to do, and
it doesn't do much else besides, which made it easier to install, configure,
and forget about.

While monit lacks some of the features of mon, the features missing are ones
I decided I could do very nicely without. Because it is too often overlooked
and undervalued, let me say that chief among monit's virtues is its
excellent documentation, which the author conveniently provides in man form.

As for its feature set, monit runs as a daemon; can start, stop, and restart
the service daemons it monitors; can manage services individually or in
groups; logs either to its own logfile or to syslog; has a very comfortable
configuration and control syntax; can do runtime and TCP/IP port checking
and knows about the protocol of most common service types, including HTTP,
FTP, SMTP, POP, IMAP, NNTP; can be configured to take actions depending on
the stability of a service over some time slice; will compute and monitor
MD5 checksums of service binaries; and does alert notification via email.
(monit also has a built-in Web server for remote control, but the author
does not recommend using it over the public Web; I enthusiastically endorse
that recommendation, at the very least until monit gets Digest
Authentication, as opposed to Basic.)

monit is written in C, and installing it on my Linux box was as simple as
invoking the standard:

[root@chomsky monit-2.2.1]# ./configure; make
[root@chomsky monit-2.2.1]# make install

Once installed, you have two main tasks before you: first, you must gather
the information monit needs in order to manage the services you want it to
manage; and, second, you have to configure monit.

I decided I wanted it to monitor the daemons which provide SSH, HTTP, and
DNS on my remote box. (I also have monit watching over my RDBMS and SMTP
daemons, but for the purposes of this article, I'll only show
configuration examples for the first three.) I subsequently started using
monit to monitor a misbehaving Zope server, which I describe below.

As for the information you need to gather about each service, that's basic
stuff and you shouldn't have any trouble; you need start and stop scripts
and the fully qualified path name of the service PID file. The basic runtime
setup of monit is to create a /root/.monitrc, which contains configuration
information for each daemon to be monitored, plus general configuration
directives in the prologue. monit is then invoked thus (you can change some
configuration directives via command-line switches, but I like to put that
kind of stuff into the control file, whenever possible):

[root@chomsky monit-2.2.1]# monit -c /path/to/.monitrc

A working monitrc file looks something like this:

(1) #
(2) #$Id: monit.html,v 1.1 2002/04/30 16:52:36 kclark Exp $
(3) #
(4)
(5) set daemon 300
(6) set logfile /var/log/monit
(7)
(8) check apache with pidfile /var/log/httpd/httpd.pid
(9) start = "/root/apache-start"
(10) stop = "/root/apache-stop"
(11) checksum /usr/local/bin/httpd
(12) timeout(3, 3) and alert kendall@monkeyfist.com
(13) host foo.com port 80 protocol http
(14) host bar.org port 80 protocol http
(15)
(16) check sshd with pidfile /var/run/sshd.pid
(17) start = "/root/sshd-start"
(18) stop = "/root/sshd-stop"
(19) timeout(3, 3) and alert kendall@monkeyfist.com
(20) checksum /usr/local/sbin/sshd
(21)
(22) check named with pidfile /var/run/named.pid
(23) start = "/root/named-start"
(24) stop = "/root/named-stop"
(25) checksum /usr/local/sbin/named
(26) timeout(3, 3) and alert kendall@monkeyfist.com
(27) port 53 use type udp

The first few lines are obvious. I use RCS to manage important config files,
especially since I share sysadmin duties with a fellow geek. That way we
don't step on each other's toes. Lines 5 and 6 contain general monit
configuration directives; the first tells monit how often I want it to poll
each service, and the second tells it where to write its logfile.

The other directives that are legal anywhere in the control file include
setting an SMTP server for monit alerts, setting the port number of its
built-in HTTP server, and specifying host names allowed to use the HTTP
server, including username-password pairs. One important note: monit uses
"localhost" as the SMTP server by default; it may make sense in some cases
to set it to a secondary SMTP server, if you have one, in case your SMTP
daemon is misbehaving. Postfix has been amazingly reliable for me, so I
haven't specified another SMTP server.

The configuration of daemons for monit to monitor is fairly straightforward,
but there are some features covered in the man page I don't discuss here, so
it's worth a look. Lines 16 through 20 tell monit how to monitor my SSH
daemon, which was my original reason for installing a DMD.

The format of service-monitoring statements in the control file is flexible,
but monit expects the first line to be of the form: check [service name]
with pidfile [fully qualified path name of PID file]. The start and stop
declarations are not mandatory, but monit is less useful if it has no way to
restart a daemon when it has died.
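
As a rough illustration of why the PID file matters, here is a minimal Python
sketch of a PID-file liveness test on a POSIX system. It is not monit's
implementation (monit is written in C), and the function names pid_from_file
and is_running are hypothetical, chosen only for this example.

```python
import os


def pid_from_file(pidfile):
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pidfile):
    """Return True if the process named in pidfile appears to be alive."""
    pid = pid_from_file(pidfile)
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks
        # without actually delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A monitor built this way would call is_running("/var/run/sshd.pid") once per
polling cycle and run the start script whenever it returns False.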

I ask monit to checksum the binary, because it's free and it would be nice
to know if it is tampered with. The timeout(foo,bar) and alert statement is
very useful; it instructs monit that if the service has to be restarted foo
times within bar cycles (in my case, 3 times in 3 cycles, or 900 seconds), I
want to be alerted, since that's usually an indication something needs
explicit sysadmin attention.
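
The timeout arithmetic can be sketched in Python. This is one reading of the
description above (alert once there have been foo restarts within the last bar
polling cycles), not monit's actual code; the class name RestartTracker and
its methods are hypothetical.

```python
from collections import deque


class RestartTracker:
    """Track restarts and report when too many fall inside a sliding window."""

    def __init__(self, max_restarts, window_cycles):
        self.max_restarts = max_restarts
        self.window_cycles = window_cycles
        self.restart_cycles = deque()  # cycle numbers at which a restart occurred

    def record(self, cycle, restarted):
        """Record one poll cycle; return True if the alert threshold is reached."""
        if restarted:
            self.restart_cycles.append(cycle)
        # Forget restarts that have fallen out of the window.
        while (self.restart_cycles
               and self.restart_cycles[0] <= cycle - self.window_cycles):
            self.restart_cycles.popleft()
        return len(self.restart_cycles) >= self.max_restarts


POLL_SECONDS = 300                      # from "set daemon 300"
tracker = RestartTracker(3, 3)          # from "timeout(3, 3)"
window_seconds = tracker.window_cycles * POLL_SECONDS  # 3 * 300 = 900
```

With these settings, three restarts in three consecutive cycles trip the
alert, while restarts in cycles 1, 2, and 4 do not, because the first has left
the three-cycle window by the time the third occurs.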

Lines 13 and 14 are worth mentioning. They tell monit not only to check the
Apache binary but also to check Apache at the HTTP protocol level, naming the
port and the virtual host to test. In the case that Apache stops being able to
answer requests -- because, say, one of my users has published an article
everyone suddenly wants to read -- but is still running, monit may be able
to alert me more quickly than I would otherwise be alerted.
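
To make the difference between "the process exists" and "the service answers"
concrete, here is a small Python sketch of a protocol-level HTTP probe. It
only approximates the idea and is not how monit implements its check; the
function name http_alive and its parameters are assumptions made for this
example.

```python
import http.client


def http_alive(host, port=80, vhost=None, timeout=10.0):
    """Return True if the server at host:port answers an HTTP HEAD request.

    vhost, if given, is sent as the Host header so that a name-based
    virtual host can be tested.
    """
    headers = {"Host": vhost} if vhost else {}
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/", headers=headers)
        conn.getresponse()  # any well-formed response counts as "alive"
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timed out, or the reply was not valid HTTP.
        return False
    finally:
        conn.close()
```

A hung Apache whose process is still present would pass a PID-file test but
fail a probe such as http_alive("foo.com", 80), which is the situation the
host lines are meant to catch.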

One of monit's most valuable bits, which wasn't immediately apparent to me
when I installed and configured it, is that it can be run by any user on my
system, which means users can use it to monitor daemons which they are
running, whether short or long term.

I run the Zope Web application server under a special user account, and
lately it's been falling down more often than I'd like, sometimes in the
middle of the night, which means its sites are unreachable until the next
day. So I created another control file, installed it in the Zope user
account, and spawned another monit instance to monitor Zope:

set daemon 240
set logfile /home/k/monit.log

check Zope with pidfile /home/k/Zope-2.4.0-linux2-x86/var/zProcessManager.pid
start = "/home/k/Zope-2.4.0-linux2-x86/start"
stop = "/home/k/Zope-2.4.0-linux2-x86/stop"
alert restart
timeout(3, 3) and alert kendall@monkeyfist.com
host foobarbaz.com port 8080 protocol http

In this case, I want a separate logfile, and I want monit to check on Zope
every 4 minutes, rather than every 5. This capability is useful in more than
production-service monitoring cases. For example, I'm starting work on a
WebDAV server in Python, and I expect it will be very unstable at first. I
will likely use a monit daemon to keep my prototype WebDAV server running
continuously while I iterate through develop-debug cycles. I can set the
daemon polling time very low, to say 60 seconds, so that I never have to wait
more than a minute to retest the server, and I don't have to continually
restart it by hand.

Posted by Steve at June 14, 2002 03:06 PM