       README - reed-alert - Lightweight agentless alerting system for server
  HTML git clone git://bitreich.org/reed-alert/ git://enlrupgkhuxnvlhsf6lc3fziv5h2hhfrinws65d7roiv6bfj7d652fid.onion/reed-alert/
       ---
       README (16626B)
       ---
            1 Description
            2 ===========
            3 
            4 reed-alert is a small and simple monitoring tool for your server,
            5 written in Common LISP.
            6 
            7 reed-alert checks the status of various processes on a server and
            8 triggers user defined notifications.
            9 
           10 Each triggered message is called an 'alert'.
           11 Each check is called a 'probe'.
           12 Each probe can be customized by different parameters.
           13 
           14 
           15 Dependencies
           16 ============
           17 
           18 reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
           19 tested with both **sbcl** and **ecl** - which should be available for
           20 most distributions.
           21 
           22 (On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
           23 on the partition where the binary is.)
           24 
           25 To make reed-alert's deployment easier I avoid using external
           26 libraries. reed-alert only requires a Common LISP interpreter and
           27 its own files.
           28 
           29 Work had begun on using quicklisp libraries to write more
           30 sophisticated checks such as "does this URL contain a pattern?", but
           31 it was abandoned. Instead, users who need more elaborate checks
           32 should write a shell command in the **command** probe.
           33 
           34 
           35 Code-Readability
           36 ================
           37 
           38 Although the code is very rough for now, I think it's already fairly
           39 understandable by people who do need this kind of tool.
           40 
           41 I will try to improve on the readability of the config file in future
           42 commits. NOTE : declaration of notifiers is easier now.
           43 
           44 
           45 Usage
           46 =====
           47 
           48 Install reed-alert
           49 ------------------
           50 
           51     $ cd reed-alert
           52     $ make
           53     $ sudo make install
           54     $ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp
           55 
           56 
           57 Special folder
           58 --------------
           59 
           60 reed-alert will create a folder using the following path, in order to
           61 save the probe states between invocations.
           62 
           63     ~/.reed-alert/states/
           64 
           65 If you delete it, you will lose the failure states of previous runs.
           66 
           67 
           68 Reed-alert start automation
           69 ---------------------------
           70 
           71 You can use cron to start reed-alert every n minutes (or whatever time
           72 range you want). The frequency depends on what you check: if you only
           73 want to check that the daily backup worked, running reed-alert once a
           74 day is fine, but if you need to monitor a critical service then every
           75 minute seems more appropriate.
           76 
           77 As always with cron jobs, be sure that either you call the interpreter
           78 using its full path or that $PATH inside the crontab contains it.
           79 
           80 A cron job running every five minutes using ecl would look like this:
           81 
           82     */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )
           83 
           84 
           85 Personal Configuration File
           86 ---------------------------
           87 You may want to rename **example-simple.lisp** to **config.lisp** in
           88 order to create your own configuration file.
           89 
           90 The configuration is explained below.
           91 
           92 
           93 The Notification System
           94 =======================
           95 
           96 When a check returns a failure, a previously defined notifier will be
           97 called. It is triggered only when reed-alert finds exactly **3**
           98 failures in a row for this check, no more and no fewer. This default
           99 can be changed globally by modifying the *tries* variable, or per
          100 probe with the :try parameter, as explained later in this document.
          101 This prevents reed-alert from spamming notifications for a long time
          102 (when the number of failures gets very high, like a disk space usage
          103 that can't be fixed for a long time) and from sending notifications
          104 about a check on the edge of the limit, like a ping almost working
          105 but failing from time to time, or a load average hovering around the
          106 limit.
          107 
          108 reed-alert will use the notifier system when it reaches its try number
          109 and when the problem is fixed, so you know when it begins and when it
          110 ends.
          111 
          112 It is possible to be reminded about a failure every n tries by setting
          113 the keyword :reminder to a number. This is useful if you want to be
          114 reminded from time to time that a problem is not fixed, since some
          115 alerts, like mails, can easily be overlooked or lost in a large volume
          116 of mail. The :reminder is a per-check setting. For a global reminder
          117 setting, one can set the *reminder* variable.
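
As a rough sketch of how these settings might be combined in a configuration file: the fragment below assumes that *tries* and *reminder* are ordinary special variables that can be set with setf before the checks are declared, and that a notifier named "mail" already exists. The values are hypothetical.

```lisp
;; Assumption: both variables can be set from the configuration file.
(setf *tries* 5)      ; notify after 5 failures in a row instead of 3
(setf *reminder* 60)  ; global reminder every 60 tries

;; A per-check :reminder takes the place of the global value for this check.
(=> mail disk-usage :path "/" :limit 90 :reminder 10)
```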
          118 
          119 reed-alert keeps track of the count of failures with one file per
          120 failing probe in the "states" folder. To ensure unique filenames, the
          121 following format is used (+ means concatenation):
          122 
          123     alert-name + probe-name + hash of probe parameters
          124 
          125 The notifier is a shell command with a name. The shell command can
          126 contain variables from reed-alert.
          127 
          128 + %function%    : the name of the probe
          129 + %date%        : the current date with format YYYY/MM/DD hh:mm:ss
          130 + %params%      : the parameters of the probe
          131 + %hostname%    : the hostname of the server
          132 + %result%      : the error returned (the value exceeding the limit, file not found)
          133 + %desc%        : an arbitrary description naming a check, defaults to empty string
          134 + %level%       : the type of notification used
          135 + %os%          : the type of operating system (FreeBSD/Linux/OpenBSD)
          136 + %newline%     : a newline character
          137 + %state%       : "start" / "end" when the problem happens / is solved
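
To illustrate what this substitution amounts to, here is a small self-contained sketch in plain Common LISP. It is not reed-alert's actual implementation; the function names replace-all and expand-placeholders are hypothetical and only show the idea of replacing each %name% token with a value.

```lisp
(defun replace-all (text old new)
  "Return TEXT with every occurrence of OLD replaced by NEW.
OLD must not be the empty string."
  (with-output-to-string (out)
    (loop with start = 0
          for pos = (search old text :start2 start)
          ;; copy everything up to the next match (or to the end)
          do (write-string text out :start start :end pos)
          while pos
          do (write-string new out)
             (setf start (+ pos (length old))))))

(defun expand-placeholders (template values)
  "VALUES is a plist such as (:hostname \"srv1\" :result \"30\").
Each key :foo replaces the token %foo% in TEMPLATE."
  (loop for (key value) on values by #'cddr
        do (setf template
                 (replace-all template
                              (format nil "%~a%"
                                      (string-downcase (symbol-name key)))
                              value))
        finally (return template)))

(expand-placeholders "error on %hostname% : %function% %result%"
                     '(:hostname "srv1" :function "PING" :result "timeout"))
;; => "error on srv1 : PING timeout"
```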
          138 
          139 
          140 Example Probe 1: 'Check For Load Average'
          141 ---------------------------------------
          142 If you want to send a mail with a message like:
          143 
          144         "On 2016/10/06 11:11:12 server.foo.com has encountered a problem
          145         during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"
          146 
          147 
          148 write the following at the top of the file and use **pretty-mail** in your checks:
          149 
          150    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
          151                          %params% with a value of %result%' | mail yourmail@foo.bar")
          152 
          153 Example Probe 2: 'Don't do anything'
          154 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          155 If you don't want anything to be done when an error occurs, use the following:
          156 
          157     (alert nothing-to-send "")
          158 
          159 Example Probe 3: 'Send SMS'
          160 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
          161 You may want to use an external service to send an SMS. This is entirely
          162 possible, as we rely on a shell command:
          163 
          164     (alert sms "echo 'error on %hostname% : %function% %result%'
          165                       | curl -u login:pass http://api.sendsms.com/")
          166 
          167 
          168 The Probes
          169 ==========
          170 
          171 Probes are written in Common LISP. They are predefined checks.
          172 
          173 The :desc Parameter
          174 -------------------
          175 The :desc parameter allows you to describe specifically what your check
          176 does. It can be put in every probe.
          177 
          178     :desc "STRING"
          179 
          180 
          181 The :try Parameter
          182 ------------------
          183 The :try parameter allows you to change how many failures to wait for
          184 before the alert is triggered. By default, it's triggered after 3
          185 failures. Sometimes, when using ping for example, you want to be
          186 notified only when it fails for a few cycles and not at the first failure.
          187 
          188     :try INTEGER
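
As a short illustration, both parameters can be added to any check. The fragment below is a sketch that assumes a notifier named "mail" has been declared earlier in the configuration file; the host and values are hypothetical.

```lisp
;; Wait for 5 failed pings in a row before notifying, and give the
;; check a description that is available as %desc% in the notifier.
(=> mail ping :host "192.168.1.1" :desc "My local router" :try 5)

;; Notify at the very first failure for a check that must never fail.
(=> mail check-file-exists :path "/var/postgresql/standby" :try 1)
```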
          189 
          190 
          191 Overview
          192 --------
          193 As of this commit, reed-alert ships with the following probes:
          194 
          195         (1)         number-of-processes
          196         (2)         pid-running
          197         (3)         disk-usage
          198         (4)         check-file-exists
          199         (5)         file-updated
          200         (6)         load-average-1
          201         (7)         load-average-5
          202         (8)         load-average-15
          203         (9)         ping
          204         (10)        command
          205         (11)        service
          206         (12)        file-less-than
                      (13)        curl-http-status
                      (14)        ssl-expiration
                      (15)        write-to-file
          207 
          208 
          209 number-of-processes
          210 -------------------
          211 Check if the actual number of processes of the system exceeds a specific limit.
          212 
          213 > Set the limit that will trigger an alert when exceeded.
          214     :limit INTEGER
          215 
          216 Example: `(=> alert number-of-processes :limit 200)`
          217 
          218 
          219 pid-running
          220 -----------
          221 Check if the PID number found in a .pid file is alive.
          222 
          223 > Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
          224     :path "STRING"
          225 
          226 Example: `(=> alert pid-running :path "/var/run/nginx.pid")`
          227 
          228 
          229 disk-usage
          230 ----------
          231 Check if the disk usage of a chosen partition exceeds a specific limit.
          232 
          233 > Set the mountpoint to check.
          234     :path "STRING"
          235 
          236 > Set the limit that will trigger an alert when exceeded.
          237     :limit INTEGER
          238 
          239 Example: `(=> alert disk-usage :path "/tmp" :limit 50)`
          240 
          241 
          242 check-file-exists
          243 -----------------
          244 Check if a file exists.
          245 
          246 > Set the path of the file to check.
          247     :path "STRING"
          248 
          249 Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`
          250 
          251 
          252 file-updated
          253 ------------
          254 Check if a file exists and has been updated since a defined time.
          255 
          256 > Set the path of the file to check.
          257     :path "STRING"
          258 
          259 > Set the limit in minutes since the last modification time before triggering an alert.
          260     :limit INTEGER
          261 
          262 Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`
          263 
          264 
          265 load-average-1
          266 --------------
          267 Check if the load average during the last minute exceeds a specific limit.
          268 
          269 > Set the limit not to exceed.
          270     :limit INTEGER
          271 
          272 Example: `(=> alert load-average-1 :limit 2)`
          273 
          274 
          275 load-average-5
          276 --------------
          277 Check if the load average during the last five minutes exceeds a specific limit.
          278 
          279 > Set the limit not to exceed.
          280     :limit INTEGER
          281 
          282 Example: `(=> alert load-average-5 :limit 2)`
          283 
          284 
          285 load-average-15
          286 ---------------
          287 Check if the load average during the last fifteen minutes exceeds a specific limit.
          288 
          289 > Set the limit not to exceed.
          290     :limit INTEGER
          291 
          292 Example: `(=> alert load-average-15 :limit 2)`
          293 
          294 
          295 ping
          296 ----
          297 Check if a remote host answers 2 ICMP pings.
          298 
          299 > Set the host to ping. Returns an error if the ping command returns non-zero.
          300     :host "STRING" (can be IP or hostname)
          301 
          302 Example: `(=> alert ping :host "8.8.8.8")`
          303 
          304 
          305 command
          306 -------
          307 Execute an arbitrary command which triggers an alert if it returns a non-zero value.
          308 This may be the most useful probe because it lets the user do any check needed.
          309 
          310 > Command to execute, accepts commands with pipes.
          311     :command "STRING"
          312 
          313 Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`
          314 
          315 
          316 service
          317 -------
          318 Check if a service is started on the system.
          319 
          320 > Set the name of the service to test.
          321     :name "STRING"
          322 
          323 Example: `(=> alert service :name "mysql-server")`
          324 
          325 
          326 file-less-than
          327 --------------
          328 Check if a file has a size less than a specified limit.
          329 
          330 > Set the path of the file to check.
          331     :path "STRING"
          332 
          333 > Set the limit in bytes before triggering an alert.
          334     :limit INTEGER
          335 
          336 Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`
          337 
          338 
          339 curl-http-status
          340 ----------------
          341 Do an HTTP request and return an error if the return code isn't
          342 200. Requires curl.
          343 
          344 > Set the url to request.
          345     :url "STRING"
          346 
          347 > Set the time to wait before aborting.
          348     :timeout INTEGER
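
This probe has no example above, so here is a minimal sketch following the pattern of the other probes. The URL is hypothetical, and the assumption that :timeout is expressed in seconds is mine, not stated in this document.

```lisp
;; Alert if the page does not answer with HTTP status 200.
;; Assumption: the timeout value is in seconds.
(=> alert curl-http-status :url "https://example.org/" :timeout 10)
```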
          349 
          350 
          351 ssl-expiration
          352 --------------
          353 Check if a remote SSL certificate expires in less than a specified
          354 time. Requires openssl.
          355 
          356 > Set the hostname for the request.
          357     :host "STRING"
          358 
          359 > Set the expiration time limit in seconds.
          360     :seconds INTEGER
          361 
          362 > Set the port for the request (OPTIONAL).
          363     :port INTEGER (defaults to 443)
          364 
          365 > Use starttls (OPTIONAL).
          366     :starttls "STRING"
          367 
          368 Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`
          369 Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`
          370 Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`
          371 
          372 
          373 write-to-file
          374 -------------
          375 Write content to a file, creating it if it does not exist.
          376 
          377 The purpose of this probe is to be used at the end of a reed-alert
          378 script to update the modification time of a file, and to use file-updated
          379 on this file at the beginning of the script to monitor whether reed-alert
          380 finished correctly on its last run.
          381 
          382 > Set the path of the file.
          383     :path "STRING"
          384 
          385 > Set the content of the file (OPTIONAL).
          386     :text "STRING" (defaults to current time in seconds)
          387 
          388 Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`
          389 Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`
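
The heartbeat pattern described above could look like the following sketch. It assumes a notifier named "mail" has been declared and that reed-alert runs every five minutes, which is why a limit of 10 minutes is used; both are hypothetical choices.

```lisp
;; First check of the script: alert if the marker file was not
;; refreshed recently, meaning the previous run did not finish.
(=> mail file-updated :path "/tmp/reed-alert.txt" :limit 10)

;; ... all the other checks go here ...

;; Last check of the script: refresh the marker file.
(=> mail write-to-file :path "/tmp/reed-alert.txt")
```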
          390 
          391 
          392 The configuration file
          393 ======================
          394 
          395 The configuration file is Common LISP code, so it's evaluated. It's
          396 possible to write some logic within it.
          397 
          398 
          399 Loops
          400 -----
          401 It's possible to write loops if you don't want to repeat code:
          402 
          403     (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          404      do
          405        (=> mail ping :host host))
          406 
          407 or another example
          408 
          409     (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          410      do
          411        (=> mail service :name service))
          412 
          413 and another example using rows from a file to check remote hosts
          414 
          415     (with-open-file (stream "hosts.txt")
          416       (loop for line = (read-line stream nil)
          417         while line
          418         do
          419           (=> mail ping :host line)))
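
A loop can also carry a different parameter for each item. The sketch below uses standard loop destructuring over a list of (path . limit) pairs; the mountpoints and limits are hypothetical, and a notifier named "mail" is assumed to exist as in the examples above.

```lisp
;; Each pair holds a mountpoint and the usage limit to apply to it.
(loop for (path . limit) in '(("/" . 80) ("/tmp" . 50) ("/var" . 90))
  do
    (=> mail disk-usage :path path :limit limit))
```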
          420 
          421 
          422 Conditional
          423 -----------
          424 It is also possible to use conditionals. There are two very useful
          425 groups of conditionals.
          426 
          427 
          428 Dependency
          429 ~~~~~~~~~~
          430 Sometimes it may be a good idea to stop some probes if a probe
          431 fails, for instance when you need to check a path through a network,
          432 from the nearest machine to the remote target. If we can't reach our
          433 local router, probes requiring the router to work will trigger errors
          434 so we should skip them.
          435 
          436     (stop-if-error
          437       (=> mail ping :host "192.168.1.1" :desc "My local router")
          438       (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
          439       (=> mail ping :host "kernel.org"  :desc "Remote website"))
          440 
          441 Note : stop-if-error is an alias for the **and** function.
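
Because **and** stops at the first form that returns nil, the checks placed after a failing one are never evaluated. The self-contained sketch below shows this with a hypothetical stand-in function, fake-check, instead of real probes.

```lisp
(defvar *visited* '()
  "Labels of the checks that were actually evaluated.")

(defun fake-check (label ok)
  "Hypothetical stand-in for a probe: record LABEL and return OK."
  (push label *visited*)
  ok)

(defun run-chain ()
  "Evaluate three checks with AND; return the labels that were evaluated."
  (setf *visited* '())
  (and (fake-check "router" t)
       (fake-check "ISP DNS" nil)        ; this one fails...
       (fake-check "remote website" t))  ; ...so this one is skipped
  (reverse *visited*))

(run-chain)
;; => ("router" "ISP DNS")
```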
          442 
          443 
          444 Escalation
          445 ~~~~~~~~~~
          446 It could be a good idea to use different alerts
          447 depending on how critical a check is, but sometimes the level of
          448 criticality may depend on the value of the error and/or the delay
          449 between detecting it and fixing it. You may want to receive a mail when
          450 things need to be fixed in your spare time, but mail another person if
          451 things aren't fixed past some level.
          452 
          453     (escalation
          454       (=> mail-me disk-usage :path "/" :limit 70)
          455       (=> sms-me  disk-usage :path "/" :limit 90)
          456       (=> buzzer  disk-usage :path "/" :limit 98))
          457 
          458 In this example, we check the disk usage. I will get a mail through the
          459 "mail-me" alert if the disk usage goes above 70%. Once it goes
          460 that far, it will check if the disk usage is above 90%; if so,
          461 I'll receive an SMS through the "sms-me" alert. And then, if it goes
          462 above 98%, the "buzzer" alert will make some bad noises in the room to
          463 warn me about this.
          464 
          465 Note : escalation is an alias for the **or** function.
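
Because **or** stops at the first form that returns a true value, the next level is only evaluated while the previous check fails. The self-contained sketch below simulates this with a hypothetical function, fake-disk-check, which notifies immediately; real checks also wait for the configured number of tries.

```lisp
(defvar *notified* '()
  "Names of the notifiers that would have been called.")

(defun fake-disk-check (notifier limit usage)
  "Hypothetical stand-in for disk-usage: fail when USAGE exceeds LIMIT."
  (if (> usage limit)
      (progn (push notifier *notified*) nil) ; failure, go to next level
      t))                                    ; success, OR stops here

(defun run-escalation (usage)
  "Return the notifiers reached for a given disk USAGE in percent."
  (setf *notified* '())
  (or (fake-disk-check "mail-me" 70 usage)
      (fake-disk-check "sms-me"  90 usage)
      (fake-disk-check "buzzer"  98 usage))
  (reverse *notified*))

(run-escalation 50) ; => NIL
(run-escalation 93) ; => ("mail-me" "sms-me")
```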
          466 
          467 
          468 Extend with your own probes
          469 ===========================
          470 
          471 It is likely that you want to write your own probes. While using the
          472 command probe can be convenient, you may want to have a probe with
          473 more parameters and better integration than the command probe.
          474 
          475 There are two methods for adding probes:
          476 - in the configuration file, before using it
          477 - in a separate lisp file that you load from the configuration file
          478 
          479 If you want to reuse a probe across multiple configuration files or
          480 servers, I would recommend a separate file; otherwise, adding it at
          481 the top of the configuration file can be convenient too.
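
For the separate-file method, the standard Common LISP **load** function can be used at the top of the configuration file. This is only a sketch: the file path and the probe name my-custom-probe are hypothetical, and a notifier named "mail" is assumed to exist.

```lisp
;; Load the shared probe definitions before any check uses them.
;; The path is hypothetical; use an absolute path when running from cron.
(load "/opt/reed-alert/my-probes.lisp")

;; Probes defined in that file can now be used like the built-in ones.
(=> mail my-custom-probe :path "/tmp/example")
```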
          482 
          483 
          484 Using a shell command
          485 ---------------------
          486 
          487 A minimum of Common LISP comprehension is needed for this. But with
          488 the easiest approach, writing a probe around a shell command, the
          489 declaration can be really simple.
          490 
          491 We are going to write a probe that will use curl to fetch a page and
          492 then grep on the output to look for a pattern. The return code of grep
          493 will be the return status of the probe: if grep finds the pattern,
          494 it's a success, if not it's a failure.
          495 
          496 In the following code, the "create-probe" part is a macro that will
          497 write most of the code for you. Then, we use the "command-return-code"
          498 function, which will execute the shell command passed as a string (or
          499 as a list) and return the correct values in case of success or
          500 failure.
          501 
          502     (create-probe
          503      check-http-pattern
          504      (command-return-code (format nil "curl '~a' | grep -i '~a'"
          505                                   (getf params :url) (getf params :pattern))))
          506 
          507 If you don't know LISP, the "format" function works like "printf", using
          508 "~a" instead of "%s". This is the only thing you need to know if you
          509 want to reuse the previous code.
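
As a quick illustration of "format" on its own, the self-contained sketch below builds a shell command string from two values. The function name grep-command and the file path are hypothetical; passing nil as the first argument makes "format" return the string instead of printing it.

```lisp
(defun grep-command (pattern file)
  "Build a grep command line; each ~a is replaced by the next argument."
  (format nil "grep -i '~a' ~a" pattern file))

(grep-command "Powered by" "/tmp/page.html")
;; => "grep -i 'Powered by' /tmp/page.html"
```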
          510 
          511 Then we can call it like this :
          512 
          513     (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")
          514 
          515 
          516 Using plain LISP
          517 ----------------
          518 
          519 We have seen previously how to create new probes from a shell command,
          520 but one may want to do it in LISP, which allows using the full features
          521 of the language and even some libraries, to check values in a database
          522 for example. I recommend reading the "probes.lisp" file; it's the best
          523 way to learn how to write a new probe. But as an example, we will learn
          524 from the easiest probe included: check-file-exists
          525 
          526     (create-probe
          527      check-file-exists
          528      (let ((result (probe-file (getf params :path))))
          529        (if result
          530            t
          531            (list nil "file not found"))))
          532 
          533 Like before, we use the "create-probe" macro and give a name to the
          534 probe. Then, we have to write some code, in the current case, check if
          535 the file exists. Finally, if it is a success, we have to return **t**;
          536 if it fails, we return a list containing **nil** and a value or a
          537 string. The second element in the list will replace %result% in the
          538 notification command, so you can use something explicit, such as a
          539 concatenation of a message with the return value. Parameters
          540 should be read with getf from the **params** variable, which allows
          541 using a default value in case one is not defined in the configuration file.
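
The return convention and the getf default can be tried outside of reed-alert. The sketch below is a plain function, not a real probe: check-port-range and its :port parameter are hypothetical, and inside "create-probe" only the body of the let would be needed.

```lisp
(defun check-port-range (params)
  "Return T on success, or (NIL message) on failure, like a probe body."
  (let ((port (getf params :port 443))) ; 443 is used when :port is absent
    (if (<= 1 port 65535)
        t
        (list nil (format nil "invalid port ~a" port)))))

(check-port-range '(:host "domain.local"))  ; => T
(check-port-range '(:port 70000))           ; => (NIL "invalid port 70000")
```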