

Plans for future monit releases

Introduction:

This document is a draft plan for future releases of monit. Each feature item is listed with the person responsible for its implementation, the current progress (the color will go from blue to green), and how we judge the importance of the feature (high, medium, low).

Items marked with low importance will not make it into the next monit release, but may appear in a later one. If you would like to change anything or add to this list, join the monit mailing lists and let us know.

Feature list:

Status legend: Done, In progress, Planned
  1. MySQL authentication test
  2. Customize monit log file output
  3. Event traceback
  4. URL request for protocol tests (like ldap, ftp, etc.)
  5. Network interfaces health and load monitoring
  6. Filesystem load average tests
  7. Filesystem related caches test
  8. Timeofday actions
  9. IPv6 support
  10. S.M.A.R.T capable devices monitoring support
  11. ARP (MAC address) tests in host services
  12. SCSI ping support for device test
  13. Action list support and optional service name target
  14. Support for status listing by service group
  15. Support for hard service dependency
  16. Matching timeout rule should set the service state to 'timed out'
  17. Display filesystem type
  18. Log full start/stop/exec command
  19. Allow to override the implicit action on some events
  20. Handle multiple lines matching the pattern in single MATCH statement as single group with one action
  21. Add the start/stop/restart throttling
  22. Log the start/stop/exec program output
  23. Support timestamp test relative to other file
  24. Watch process' file descriptor count

MySQL authentication test

Allow specifying a username and password in the mysql protocol test for authentication. Monit currently supports only anonymous authentication, so this is useful where anonymous authentication is disabled.

Example statement to be used in monitrc:

 
  if failed host 192.168.1.1 port 3306 
     protocol mysql://user:password@localhost:3306/mydatabase 
  then restart 

See e.g. http://www.redferni.uklinux.net/mysql/MySQL-Protocol.html for a description of the MySQL protocol.

Responsible: ?
Progress: 0%
Importance: MEDIUM
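
The URL form in the example statement already carries everything an authenticated test would need. As a rough illustration only (Python here, not monit's C parser), the fields could be pulled apart as below; the function name parse_mysql_url and the fallback to port 3306 are assumptions made for this sketch.

```python
from urllib.parse import urlsplit


def parse_mysql_url(url):
    """Split a mysql:// URL into the fields an authenticated test would need.

    Hypothetical helper for illustration; monit's real parser may differ.
    """
    parts = urlsplit(url)
    if parts.scheme != "mysql":
        raise ValueError("expected a mysql:// URL, got %r" % url)
    return {
        "user": parts.username,        # None means anonymous
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port or 3306,    # assumed default when omitted
        "database": parts.path.lstrip("/") or None,
    }
```

A URL without credentials, such as mysql://localhost/, yields user and password of None, which corresponds to the anonymous case monit handles today.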

Customize monit log file output

Use the same approach as the Apache project for configuring log file output. The log file format will be set using a global set-statement,

set logformat "%h %l %u %t %>s %b"

and, where applicable, the format specifiers match those of the Apache log file format.

Responsible: ?
Progress: 0%
Importance: MEDIUM
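
To make the idea concrete, here is a small Python sketch of how such a format string might be expanded. Which values monit would bind to each specifier is not decided in this plan, so the field dictionary is purely hypothetical; printing "-" for a missing value follows the Apache convention.

```python
import re

# A specifier is "%" followed by an optional ">" and one letter, e.g. %h or %>s.
SPECIFIER = re.compile(r"%(>?[A-Za-z])")


def render_log_line(fmt, fields):
    """Expand Apache-style specifiers in fmt using the fields mapping.

    Unknown or missing specifiers are rendered as "-", as Apache does.
    """
    def substitute(match):
        return str(fields.get(match.group(1), "-"))

    return SPECIFIER.sub(substitute, fmt)


if __name__ == "__main__":
    # Hypothetical values; monit's own meaning for each specifier is undefined here.
    sample = {"h": "192.168.1.1", ">s": 200, "b": 512}
    print(render_log_line("%h %l %u %t %>s %b", sample))
```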

Call external script from monit and check return value

We plan to support two levels of running an external script.
  1. As a full 'check status' service
  2. As an if test.
Here is the syntax, defined more formally (keywords in uppercase), for those two cases:
1) CHECK STATUS OF name WITH PATH "/path/to/script"
[[AND] TIMEOUT AFTER X sec]
IF FAILED THEN
{ALERT|MONITOR|UNMONITOR|START|STOP|RESTART|EXEC}
[ELSE
{ALERT|MONITOR|UNMONITOR|START|STOP|RESTART|EXEC}]
[ALERT ..]
[EVERY ..]
[DEPENDS ON ..]
[GROUP ..]

2) check X ...
IF FAILED STATUS OF [SCRIPT|PROGRAM] "/path/to/script"
[AND TIMEOUT AFTER X sec] THEN
{ALERT|MONITOR|UNMONITOR|START|STOP|RESTART|EXEC}
ELSE
{ALERT|MONITOR|UNMONITOR|START|STOP|RESTART|EXEC}
...
Detailed discussion: The script is executed by monit and its return value decides success. That is, if the script returns 0 it succeeded, and if it returns anything else it failed. The new sub-statement [TIMEOUT AFTER X sec] is used to time out execution, i.e. if the script has not returned after X seconds, monit aborts the execution and the test fails. This statement is optional and, if not used, defaults to 5 seconds.

We should not use the popen(3) function. It is considered unsafe and is only a variant of the system(3) call. Instead we should do our own plumbing and use fork(2), execv(3) and pipe(2) to read output from the script. The output will be logged if and only if an error occurred, and it will also be included in any alert message.
Responsible: hauk
Progress: 40%
Importance: HIGH
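
The behaviour described above (exit status 0 means success, a 5 second default timeout, output captured through a pipe without going through a shell) can be sketched in Python as follows. This is an illustration of the semantics only, not the planned C implementation; check_status is a hypothetical name, and Python's subprocess module stands in for the fork/exec/pipe plumbing.

```python
import subprocess
import sys


def check_status(argv, timeout=5.0):
    """Run argv without a shell and return (ok, output).

    ok is True only if the program exits with status 0 before the timeout.
    stdout and stderr are captured together so they can be logged on failure.
    """
    try:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired as exc:
        # The child is killed on timeout; keep whatever output was read.
        return False, (exc.output or b"").decode(errors="replace")
    except OSError as exc:
        # The program could not be executed at all.
        return False, str(exc)
    return proc.returncode == 0, proc.stdout.decode(errors="replace")


if __name__ == "__main__":
    ok, output = check_status([sys.executable, "-c", "raise SystemExit(3)"])
    if not ok:
        print("test failed, output was: %r" % output)
```

Passing the command as an argument list avoids the shell interpretation that makes popen(3) and system(3) risky.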

Event traceback

Refactor the internal message passing conducted inside validate.c to make code more flexible and to allow protocol routines to pass detailed error-messages upwards so they are part of the alert message. Having a kind of chained exception traceback would be nice. Something like,
      Event backtrace: 
      1. 'hostname' failed protocol test [http] at 192.168.1.1
      2. APACHE-STATUS error: 80 percent of Apache processes are logging
      
Currently only the first event line (1.) is sent in the alert message. The error in line 2 is logged, but it would be nice to include it in the alert message to describe why the http protocol test failed.
Responsible: ?
Progress: 0%
Importance: MEDIUM
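
The chained traceback idea can be illustrated with Python's exception chaining. This is only a sketch of the concept; the class and function names are made up for the example, and validate.c would need its own mechanism in C.

```python
class EventError(Exception):
    """Hypothetical error type carrying one line of an event backtrace."""


def backtrace(exc):
    """Walk the __cause__ chain from the outermost event to the root cause."""
    lines = []
    while exc is not None:
        lines.append(str(exc))
        exc = exc.__cause__
    return lines


def format_backtrace(exc):
    """Render the chain as a numbered list suitable for an alert body."""
    body = ["%d. %s" % (i, msg) for i, msg in enumerate(backtrace(exc), 1)]
    return "\n".join(["Event backtrace:"] + body)


def apache_status_check():
    # Stand-in for a protocol routine reporting a detailed error.
    raise EventError("APACHE-STATUS error: 80 percent of Apache processes are logging")


def http_protocol_test(host):
    try:
        apache_status_check()
    except EventError as cause:
        # Wrap the detailed error instead of discarding it.
        raise EventError("'hostname' failed protocol test [http] at %s" % host) from cause


def alert_body(host):
    try:
        http_protocol_test(host)
    except EventError as exc:
        return format_backtrace(exc)
    return ""
```

Calling alert_body("192.168.1.1") produces the two-line backtrace shown above, with the protocol-level reason preserved as line 2.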

Timeofday actions

Make it possible to decide monit's actions based on the time of day.

Request: For example, I have a nightly script that kicks CPU load up high, causing monit to alert. Since this is expected between 02:00 and 02:10, it creates a false alert, so I want to instruct monit not to alert about CPU load between these times.

Suggested solution from Mike Jackson

if timeofday 0400-0410 then {action}

Responsible: hauk
Progress: 0%
Importance: MEDIUM
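
A window such as 0400-0410 could be evaluated roughly as in the Python sketch below. Treating the end of the window as exclusive and allowing windows that wrap past midnight are assumptions of this example, not part of the suggestion.

```python
from datetime import time


def parse_window(spec):
    """Turn 'HHMM-HHMM' into a (start, end) pair of datetime.time values."""
    start, end = spec.split("-")
    if len(start) != 4 or len(end) != 4:
        raise ValueError("expected HHMM-HHMM, got %r" % spec)
    return (time(int(start[:2]), int(start[2:])),
            time(int(end[:2]), int(end[2:])))


def in_window(now, spec):
    """Return True if now falls inside the window (end exclusive)."""
    start, end = parse_window(spec)
    if start <= end:
        return start <= now < end
    # The window wraps past midnight, e.g. 2300-0100.
    return now >= start or now < end
```

With this, in_window(time(2, 5), "0200-0210") is true, so a CPU load alert raised at 02:05 could be suppressed.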

URL request for protocol tests (like ldap, ftp, etc.)

Add a document request to the relevant protocol tests. Currently only the http protocol test supports a request option.
Responsible: ?
Progress: 0%
Importance: LOW

Network interfaces health and load monitoring

Allow monitoring the status of network interfaces (for example "eth0" on Linux, "hme0" on Solaris, etc.), both functionality and throughput. If the interface fails or the load exceeds some limit, monit will take the appropriate action.
Responsible: ?
Progress: 0%
Importance: LOW

Filesystem load average tests

Watch filesystem load:

- read/write blocks per second ratio
- transactions per second ratio
- queue lengths
- response times

Responsible: Martin
Progress: 0%
Importance: LOW

Filesystem related caches test

Watch cache hit ratio for inode, directory entry, buffer and similar caches.
Responsible: Martin
Progress: 0%
Importance: LOW

IPv6 support

Make monit speak IPv6, both for network protocol tests and in the built-in web server.
Responsible: ?
Progress: 0%
Importance: LOW

S.M.A.R.T capable devices monitoring support

Support for monitoring the health of devices which support S.M.A.R.T technology. It allows you to watch, for example, disk and tape health, temperature, block reallocation, start count, power-on hours, spin-up time, etc., and to detect a bad device before a catastrophic failure occurs.
Responsible: ?
Progress: 0%
Importance: LOW

ARP (MAC address) tests in host services

Responsible: Christian
Progress: 0%
Importance: LOW

SCSI ping support for device test

Allow testing whether the device is accessible. This is a common test used by clusters for shared device (disk) based quorums (based on SCSI reservation).
Responsible: Martin
Progress: 0%
Importance: LOW

Action list support and optional service name target

Allow specifying a list of actions, optionally referencing another service name in the monit control file.

Possible syntax (example):

      IF FAILED test THEN {action [service], ...}
      

Example usage:

      check process ipsec with pidfile /var/run/ipsec.pid
        start program = "/etc/init.d/ipsec start"
        stop program = "/etc/init.d/ipsec stop"

      check host theotherside with address the.other.side
        if failed icmp type echo then alert, restart ipsec
      
Responsible: ?
Progress: 0%
Importance: LOW
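
How an action list with optional targets might be resolved is sketched below in Python. The parsing rules are an assumption for illustration (the exec action, which takes a command rather than a service name, is left out), and parse_action_list is a hypothetical name.

```python
# Actions that can take an optional service name in this sketch.
ACTIONS = {"alert", "monitor", "unmonitor", "start", "stop", "restart"}


def parse_action_list(text, current_service):
    """Turn 'alert, restart ipsec' into [(action, target_service), ...].

    An action without an explicit service name targets current_service.
    """
    actions = []
    for item in text.split(","):
        words = item.split()
        if not words or len(words) > 2 or words[0].lower() not in ACTIONS:
            raise ValueError("cannot parse action %r" % item.strip())
        target = words[1] if len(words) == 2 else current_service
        actions.append((words[0].lower(), target))
    return actions
```

For the example above, parse_action_list("alert, restart ipsec", "theotherside") gives an alert for theotherside and a restart of ipsec.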

Support for status listing by service group

The monit status and summary commands should support the group option for restricting output to a particular service group (currently the status of all services is listed regardless of the group option).
Responsible: ?
Progress: 0%
Importance: LOW

Support for hard service dependency

Monit currently supports the correct action sequence for a dependency chain, but it doesn't check whether a particular parent has started and is running correctly before the child action is performed. Hard dependency support should be added so that monit waits for the parent to start and validates, using the related testing rules, that it is available without errors before handling its dependants (and vice versa in the case of the stop action). This behavior could be optional, i.e. both hard (blocking) and soft (nonblocking) dependencies could be supported.
Responsible: Martin
Progress: 0%
Importance: LOW

Matching timeout rule should set the service state to 'timed out'

The timeout rule currently sets the state to 'unmonitored' (besides sending an alarm), so neither the monit http interface nor the CLI shows why the service is unmonitored (only a user who received the alarm or who checks the monit logs may know that the restart attempt ratio was too high). We should mark the unmonitored-by-timeout state in http and CLI as well, and use red in http rather than the standard unmonitored yellow. The state should be marked as 'timed out' or 'unmonitored by timeout'.
Responsible: ?
Progress: 0%
Importance: LOW

Display filesystem type

It could be good to display the filesystem type, such as ext3, hsfs, ntfs, ufs, etc., in the Monit and M/Monit http interface.
Responsible: ?
Progress: 0%
Importance: LOW

Log full start/stop/exec command

It could be good to log the full command as used during start program, stop program or exec action execution. Currently just the command itself (argv[0]) is logged. For example, when start program is defined as '/etc/init.d/policyd start', the following message is logged on start:
Feb 7 07:26:29 somehost monit[321]: 'policyd' start: /etc/init.d/policyd
instead of:
Feb 7 07:26:29 somehost monit[321]: 'policyd' start: /etc/init.d/policyd start
Responsible: Martin
Progress: 0%
Importance: LOW

Add the MONIT_DESCRIPTION environment variable for exec

It could be good to add the MONIT_DESCRIPTION environment variable when executing external programs (start/stop/exec). Currently there is MONIT_EVENT, but it contains just the short event description.
Responsible:
Progress: 100%
Importance: LOW

Allow to override the implicit action on some events

There are currently a few internal events which are not exposed for optional override:
  • DATA
  • EXEC
  • INVALID
  • NONEXIST
  • TIMEOUT
The actions are initialized in the parser ... it could be good to allow the user to change the default behavior when needed, as in the PID, PPID and FSFLAG case.
Responsible: ?
Progress: 0%
Importance: LOW

Start/stop/exec method timeouts

It could be good to add support for an optional timeout on start/stop/exec methods. Currently monit waits 1 cycle for the service to start; if the service has not recovered by then, this is handled as an error. Some services, however, take longer to start, so it could be useful to give the method temporary "protection" while it does its job.
Example syntax:
  start program = "/etc/init.d/httpd start" with timeout 3 cycles
One cycle is default if the 'timeout' option is omitted (full backward compatibility provided).
Responsible: Martin
Progress: 100%
Importance: LOW

Handle multiple lines matching the pattern in single MATCH statement as single group with one action

Monit currently performs the action defined by the MATCH statement for each matching line immediately. When multiple lines match in one cycle, monit thus performs the action multiple times.
 
It could be good to optionally group the matching lines per cycle and perform a single common action. When monit is watching a logfile, for example, a burst of identical messages can be found in one cycle, and they can be handled once (even one alert is enough to know that the given event occurred).
 
A few ideas:
 
1.) either perform the action on the first match and suspend the given matching rule for the rest of the cycle. The advantage of this approach is that monit's reaction to the event will be fast and the handling simple. The disadvantage is that when a more complex matching rule is used, only the first match will be sent as part of the alert and the others will be ignored, even though they may differ a little (for example, they can contain the name of the failed device, etc.)
 
2.) or evaluate the matching rules at the end of the input and buffer the matching lines. The advantage is that all lines matching even a complex regex will be reported. The disadvantage is that the reaction can be slower (suppose 100MB is added to the file; processing it will take some time) and it can require more memory (when 100MB of lines match, a 100MB buffer will be needed unless some hard limit is used, and the user most probably won't be interested in a 100MB mail). On the other hand, such situations (extremely large data additions between monit cycles) may be rare, and this approach could work well for most setups. We can also improve it by adding only lines which differ to the buffer and reporting the number of times the given message occurred (although this reduces the disadvantage in the most extreme circumstances with many identical lines, the problem still exists when each line differs a little).
 
Syntax proposal:
----------------
 
IF MATCH [FIRST] {regex|path} THEN <action>
FIRST is a new option which makes the MATCH rule act on the first matching occurrence
 
IF MATCH [GROUP [LIMIT <x>]] {regex|path} THEN <action>
GROUP is a new option which makes the MATCH rule collect the matching lines of one cycle and act on them once, as a group
 
LIMIT is another new option which limits the number of lines per group, thus reducing the memory footprint and the size of the alert body. The reaction can still be slow if a lot of data was added: even if two instances (GROUP LIMIT 2) are defined, the delta between the instances may be large, or there could be just one matching line.
Responsible: ?
Progress: 0%
Importance: LOW
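
The buffering idea (2.), combined with counting repeated lines and a LIMIT on distinct lines, could look roughly like the Python sketch below. The function name and the choice to count lines dropped by the limit are assumptions made for the example.

```python
import re


def group_matches(lines, pattern, limit=None):
    """Collect the lines of one cycle that match pattern.

    Returns (groups, dropped): groups is a list of (line, count) pairs in
    first-seen order, so identical lines are buffered only once; dropped is
    the number of matching lines discarded because limit distinct lines
    were already buffered.
    """
    regex = re.compile(pattern)
    counts = {}  # dicts keep insertion order in Python 3.7+
    dropped = 0
    for line in lines:
        if not regex.search(line):
            continue
        if line in counts:
            counts[line] += 1
        elif limit is None or len(counts) < limit:
            counts[line] = 1
        else:
            dropped += 1
    return list(counts.items()), dropped
```

One action would then be performed per cycle if the returned group list is non-empty, with the grouped lines and their counts placed in the alert body.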

Add the start/stop/restart throttling

It could be good to support start throttling to control startup parallelism. Some services (such as zope) may be configured as many standalone instances, and starting them all in parallel may create a burst in system resource usage. We can combine this with service groups, so that the user can, for example, limit monit to starting at most two services from the group per cycle.
Responsible: ?
Progress: 0%
Importance: LOW
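
As a minimal sketch of the per-cycle limit, assuming the services of a group are simply started in their configured order, the scheduling could be expressed in Python like this (plan_starts is a hypothetical name):

```python
def plan_starts(pending, max_per_cycle):
    """Split the pending service names into per-cycle batches.

    Batch 0 is started in the current cycle, batch 1 in the next, and so on.
    """
    if max_per_cycle < 1:
        raise ValueError("max_per_cycle must be at least 1")
    return [pending[i:i + max_per_cycle]
            for i in range(0, len(pending), max_per_cycle)]
```

With five zope instances and a limit of two, the instances would be started over three cycles.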

Log the start/stop/exec program output

It could be useful if we can catch the output from the start/stop/exec script and optionally log it and/or add it to the start-failed event's error message, so the user can figure out directly why the script failed. Currently monit logs:
Mar 26 20:58:14 localhost monit[24709]: 'testapp_mongrel_1' trying to restart
Mar 26 20:58:14 localhost monit[24709]: 'testapp_mongrel_1' start: /usr/bin/mongrel_rails
Mar 26 20:58:44 localhost monit[24709]: 'testapp_mongrel_1' failed to start
Whereas the script says more about the reason:
starting port 10001
 !!! Prefix must begin with / and not end in /
 !!! User does not exist: Rtestapp
 !!! Group does not exist: Rtestapp
 mongrel::start reported an error. Use mongrel_rails mongrel::start -h to get help.
Responsible: ?
Progress: 0%
Importance: LOW

Support timestamp test relative to other file

Add a timestamp test that compares file ages relative to another file, such as "older than" and "newer than". Example:
 if timestamp /etc/aliases newer than /etc/aliases.db
    then /sbin/newaliases
Responsible: ?
Progress: 0%
Importance: LOW
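
The comparison itself is simple. A Python sketch, assuming the test compares modification times (the plan does not say which timestamp would be used):

```python
import os


def is_newer(path, reference):
    """Return True if path was modified more recently than reference."""
    return os.stat(path).st_mtime > os.stat(reference).st_mtime


def is_older(path, reference):
    """Return True if path was modified less recently than reference."""
    return os.stat(path).st_mtime < os.stat(reference).st_mtime
```

In the example above, is_newer("/etc/aliases", "/etc/aliases.db") being true would mean the database is stale and should be rebuilt.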

Watch process' file descriptor count

Make it possible to monitor a process' file descriptor count. Every OS usually has per-process soft and hard limits, and if the process exceeds them there can be problems. If monit is able to watch file descriptors, it can prevent the problem (either report it or fix it automatically).
Responsible: ?
Progress: 0%
Importance: LOW
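
On Linux, one way to obtain the count is to list the entries in /proc/<pid>/fd; other platforms need different mechanisms. The Python sketch below assumes that layout, and the 80 percent warning threshold is an arbitrary value chosen for illustration.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count open file descriptors of pid by listing <proc_root>/<pid>/fd.

    Assumes a Linux-style procfs; proc_root is a parameter so the function
    can be pointed at another directory tree.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_usage_exceeds(pid, soft_limit, ratio=0.8, proc_root="/proc"):
    """Return True if pid uses more than ratio of its soft descriptor limit."""
    return count_open_fds(pid, proc_root) > soft_limit * ratio
```

Reading another user's /proc/<pid>/fd normally requires sufficient privileges, which monit usually has when run as root.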


Copyright (C) 2008 Tildeslash Ltd. All Rights Reserved