README - reed-alert - Lightweight agentless alerting system for server
Description
===========

reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.

reed-alert checks the status of various processes on a server and
triggers user defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.


Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl**, which should be available for
most distributions.

(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.

An attempt to use quicklisp libraries to write more sophisticated
checks like "does this URL contain a pattern?" was started and then
abandoned. Instead, users who need more elaborate checks are expected
to write a shell command in the **command** probe.


Code-Readability
================

Although the code is very rough for now, I think it's already fairly
understandable by people who need this kind of tool.

I will try to improve the readability of the config file in future
commits. NOTE: declaration of notifiers is easier now.


Usage
=====

Install reed-alert
------------------

    $ cd reed-alert
    $ make
    $ sudo make install
    $ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp


Special folder
--------------

reed-alert will create a folder at the following path in order to
save the probe states between invocations.

    ~/.reed-alert/states/

If you delete it, you will lose the failure states of previous runs.


Reed-alert start automation
---------------------------

You can use cron to start reed-alert every n minutes (or whatever time
range you want). The frequency depends on what you check: if you only
want to check that the daily backup worked, running reed-alert once a
day is fine, but if you need to monitor a critical service then every
minute seems more appropriate.

As always with cron jobs, be sure that either you call the interpreter
using its full path or that $PATH inside the crontab contains it.

A cron job running every five minutes using ecl would look like this:

    */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )


Personal Configuration File
---------------------------
You may want to rename **example-simple.lisp** to **config.lisp** in
order to create your own configuration file.

The configuration is explained below.


The Notification System
=======================

When a check returns a failure, a previously defined notifier will be
called. This is triggered only once reed-alert has found exactly **3**
failures in a row for this check. This default can be changed globally
by modifying the *tries* variable, or per probe with the :try parameter
as explained later in this document. The threshold prevents reed-alert
from sending notifications over and over for a long-lasting problem (a
very high number of failures, like a disk space usage that can't be
fixed for a long time) and from sending notifications about a check
on the edge of its limit, like a ping almost working but failing from
time to time or a load average hovering around the limit.

reed-alert will use the notifier system when a check reaches its try
number and when the problem is fixed, so you know when it begins and
when it ends.

It is possible to be reminded about a failure every n tries by setting
the keyword :reminder to a number. This is useful if you want to be
reminded from time to time that a problem is not fixed, since alerts
such as mails can easily be overlooked or lost in a large volume of
mail. The :reminder is a per-check setting. For a global reminder
setting, one can set the *reminder* variable.

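The threshold and reminder rules above can be sketched as a small
stand-alone function. This is only an illustration, not reed-alert's
actual code: the name notify-p is hypothetical, and counting reminders
from the first alert is an assumption. The "end" notification sent when
the problem is fixed is not modelled here.

```lisp
;; Hypothetical helper, not taken from reed-alert's source.
(defun notify-p (failures tries &optional reminder)
  "Return T when a notification should be sent after FAILURES consecutive
failed runs, given the TRIES threshold and an optional REMINDER interval."
  (cond ((< failures tries) nil)   ; below the threshold: stay silent
        ((= failures tries) t)     ; threshold just reached: first alert
        ((null reminder) nil)      ; no reminder configured: no repeats
        ;; Assumption: reminders are counted from the first alert.
        (t (zerop (mod (- failures tries) reminder)))))
```

With the default threshold of 3, `(notify-p 2 3)` is false and
`(notify-p 3 3)` is true. With a reminder of 10, `(notify-p 13 3 10)` is
true again while `(notify-p 8 3 10)` is not.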

reed-alert keeps track of the failure count with one file per failing
probe in the "states" folder. To ensure unique filenames, the
following format is used (+ means concatenation):

    alert-name + probe-name + hash of probe parameters

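A minimal sketch of how such a name could be built, assuming the parts
are joined with "-" and that the hash comes from the standard sxhash
function. Both choices, as well as the name state-file-name, are
assumptions made for illustration; the real implementation may differ.

```lisp
;; Hypothetical helper: the separator and the hash function are assumptions.
(defun state-file-name (alert-name probe-name params)
  "Build a file name from ALERT-NAME, PROBE-NAME and a hash of PARAMS."
  ;; ~( ... ~) prints the symbol names in lower case.
  ;; SXHASH returns the same value for lists that are EQUAL.
  (format nil "~(~a~)-~(~a~)-~a" alert-name probe-name (sxhash params)))

;; Two checks with the same alert, probe and parameters share one file.
(state-file-name 'mail 'ping '(:host "8.8.8.8"))
```

Because the parameters are part of the name, two ping checks on
different hosts would normally get two different state files.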

The notifier is a shell command with a name. The shell command can
contain variables from reed-alert.

+ %function% : the name of the probe
+ %date% : the current date with format YYYY/MM/DD hh:mm:ss
+ %params% : the parameters of the probe
+ %hostname% : the hostname of the server
+ %result% : the error returned (the value exceeding the limit, file not found)
+ %desc% : an arbitrary description naming a check, defaults to an empty string
+ %level% : the type of notification used
+ %os% : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline% : a newline character
+ %state% : "start" / "end" when the problem happens / is solved

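To make the substitution concrete, here is a stand-alone sketch of
expanding such placeholders in a command string. The function names
replace-all and expand-template are hypothetical and this is not
reed-alert's own code; it only shows the idea of replacing each
%name% with its value before the command is handed to the shell.

```lisp
;; Hypothetical helpers, written for illustration only.
(defun replace-all (string part replacement)
  "Return STRING with every occurrence of the non-empty PART replaced."
  (with-output-to-string (out)
    (loop with part-length = (length part)
          for start = 0 then (+ pos part-length)
          for pos = (search part string :start2 start)
          ;; Copy the text up to the next occurrence (or to the end).
          do (write-string string out :start start :end (or pos (length string)))
          when pos do (write-string replacement out)
          while pos)))

(defun expand-template (template bindings)
  "Apply every (placeholder . value) pair of BINDINGS to TEMPLATE."
  (reduce (lambda (text binding)
            (replace-all text (car binding) (cdr binding)))
          bindings
          :initial-value template))

(expand-template "On %date% %hostname% failed: %result%"
                 '(("%date%" . "2016/10/06 11:11:12")
                   ("%hostname%" . "server.foo.com")
                   ("%result%" . "30")))
;; => "On 2016/10/06 11:11:12 server.foo.com failed: 30"
```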

Example Alert 1: 'Check For Load Average'
-----------------------------------------
If you want to send a mail with a message like:

    "On 2016/10/06 11:11:12 server.foo.com has encountered a problem
    during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"

write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
    %params% with a value of %result%' | mail yourmail@foo.bar")

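As a sketch, the check that would produce the message quoted above
could be declared as follows. It is a configuration fragment that only
makes sense inside a reed-alert configuration file, after the
**pretty-mail** alert has been declared; the limit of 10 is simply the
value shown in the sample message.

```lisp
;; Configuration fragment: requires reed-alert and the pretty-mail alert.
;; Uses pretty-mail when the 15 minute load average exceeds 10.
(=> pretty-mail load-average-15 :limit 10)
```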

Example Alert 2: 'Don't do anything'
------------------------------------
If you don't want anything to be done when an error occurs, use the following:

    (alert nothing-to-send "")

Example Alert 3: 'Send SMS'
---------------------------
You may want to use an external service to send an SMS. This is
entirely possible since notifiers rely on a shell command:

    (alert sms "echo 'error on %hostname% : %function% %result%'
    | curl -u login:pass http://api.sendsms.com/")


The Probes
==========

Probes are predefined checks written in Common LISP.

The :desc Parameter
-------------------
The :desc parameter allows you to describe specifically what your check
does. It can be put in every probe.

    :desc "STRING"


The :try Parameter
------------------
The :try parameter allows you to change how many failures to wait for
before the alert is triggered. By default, it's triggered after 3
failures. Sometimes, when using ping for example, you want to be
notified only when it fails for a few cycles and not at the first
failure.

    :try INTEGER

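The two parameters above, and the :reminder keyword described in the
notification section, can be combined on any probe. The lines below are
a configuration fragment for illustration: "mail" is assumed to be an
alert declared earlier in the file, and the values are arbitrary.

```lisp
;; Configuration fragment: requires reed-alert and an alert named "mail".
;; Alert only after 5 failed pings in a row.
(=> mail ping :host "192.168.1.1" :desc "My local router" :try 5)

;; Default threshold, plus a reminder every 60 tries while it keeps failing.
(=> mail disk-usage :path "/" :limit 90 :desc "Root partition" :reminder 60)
```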

Overview
--------
As of this commit, reed-alert ships with the following probes:

    (1) number-of-processes
    (2) pid-running
    (3) disk-usage
    (4) check-file-exists
    (5) file-updated
    (6) load-average-1
    (7) load-average-5
    (8) load-average-15
    (9) ping
    (10) command
    (11) service
    (12) file-less-than
    (13) curl-http-status
    (14) ssl-expiration
    (15) write-to-file


number-of-processes
-------------------
Check if the actual number of processes of the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert number-of-processes :limit 200)`


pid-running
-----------
Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
    :path "STRING"

Example: `(=> alert pid-running :path "/var/run/nginx.pid")`


disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.
    :path "STRING"

> Set the limit (in percent) that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert disk-usage :path "/tmp" :limit 50)`


241
242 check-file-exists
243 -----------
244 Check if a file exists.
245
246 > Set the path of the file to check.
247 :path "STRING"
248
249 Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`
250
251
252 file-updated
253 ------------
254 Check if a file exists and has been updated since a defined time.
255
256 > Set the path of the file to check.
257 :path "STRING"
258
259 > Set the limit in minutes since the last modification time before triggering an alert.
260 :limit INTEGER
261
262 Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`
263
264
265 load-average-1
266 --------------
267 Check if the load average during the last minute exceeds a specific limit.
268
269 > Set the limit not to exceed.
270 :limit INTEGER
271
272 Example: `(=> alert load-average-1 :limit 2)`
273
274
275 load-average-5
276 --------------
277 Check if the load average during the last five minutes exceeds a specific limit.
278
279 > Set the limit not to exceed.
280 :limit INTEGER
281
282 Example: `(=> alert load-average-5 :limit 2)`
283
284
285 load-average-15
286 ---------------
287 Check if the load average during the last fifteen minutes exceeds a specific limit.
288
289 > Set the limit not to exceed.
290 :limit INTEGER
291
292 Example: `(=> alert load-average-15 :limit 2)`
293
294
ping
----
Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Return an error if the ping command returns non-zero.
    :host "STRING" (can be IP or hostname)

Example: `(=> alert ping :host "8.8.8.8")`


command
-------
Execute an arbitrary command which triggers an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.

> Command to execute, accepts commands with pipes.
    :command "STRING"

Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`


service
-------
Check if a service is started on the system.

> Set the name of the service to test.
    :name "STRING"

Example: `(=> alert service :name "mysql-server")`


file-less-than
--------------
Check if a file has a size less than a specified limit.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in bytes before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`


curl-http-status
----------------
Do an HTTP request and return an error if the return code isn't
200. Requires curl.

> Set the url to request.
    :url "STRING"

> Set the time to wait before aborting.
    :timeout INTEGER

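This probe has no example above, so here is a sketch in the same style
as the other probes. It is a configuration fragment; the URL is a
placeholder, and the timeout unit is assumed to be seconds since the
description does not state it.

```lisp
;; Configuration fragment: requires reed-alert and an alert named "alert".
;; Assumption: :timeout is expressed in seconds.
(=> alert curl-http-status :url "https://example.org/" :timeout 10)
```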

ssl-expiration
--------------
Check if a remote SSL certificate expires in less than a specified
time. Requires openssl.

> Set the hostname for the request.
    :host "STRING"

> Set the expiration time limit in seconds.
    :seconds INTEGER

> Set the port for the request (OPTIONAL).
    :port INTEGER (defaults to 443)

> Use starttls (OPTIONAL).
    :starttls "STRING"

Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`
Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`
Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`


write-to-file
-------------
Write content to a file, creating it if it does not exist.

The purpose of this probe is to be used at the end of a reed-alert
script to update the modification time of a file, and to use
file-updated on this file at the beginning of a script to monitor
whether reed-alert finished correctly on its last run.

> Set the path of the file.
    :path "STRING"

> Set the content of the file (OPTIONAL).
    :text "STRING" (defaults to the current time in seconds)

Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`
Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`

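The pattern described above could look like the following
configuration fragment. The alert name "mail" and the 10 minute limit
are assumptions for illustration; the limit should be larger than the
interval between two runs of reed-alert.

```lisp
;; Configuration fragment: requires reed-alert and an alert named "mail".

;; First check of the script: fails if the marker file is missing or
;; has not been touched during the last 10 minutes.
(=> mail file-updated :path "/tmp/reed-alert.txt" :limit 10)

;; ... all the other probes go here ...

;; Last line of the script: refresh the marker file.
(=> mail write-to-file :path "/tmp/reed-alert.txt")
```

On the very first run the marker file does not exist yet, so the first
check is expected to fail once before the file gets created.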

The configuration file
======================

The configuration file is Common LISP code, so it's evaluated. It's
possible to write some logic within it.


Loops
-----
It's possible to write loops if you don't want to repeat code:

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          do (=> mail ping :host host))

or another example:

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          do (=> mail service :name service))

and another example using rows from a file to check remote hosts:

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
            while line
            do (=> mail ping :host line)))

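A loop can also carry more than one value per check. The fragment
below is a sketch that uses the standard destructuring of loop to read
a mountpoint and its limit from each entry; the mountpoints, the limits
and the "mail" alert are assumptions for illustration.

```lisp
;; Configuration fragment: requires reed-alert and an alert named "mail".
;; Each entry is a (mountpoint limit) pair, destructured by LOOP.
(loop for (path limit) in '(("/" 90) ("/tmp" 50) ("/var" 80))
      do (=> mail disk-usage :path path :limit limit))
```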

Conditional
-----------
It is also possible to use conditionals. Two conditional groups are
particularly useful.


Dependency
~~~~~~~~~~
Sometimes it may be a good idea to stop some probes if a probe
fails. Consider a case where you need to check a path through a
network, from the nearest machine to the remote target. If we can't
reach our local router, probes requiring the router to work will
trigger errors, so we should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org" :desc "Remote website"))

Note: stop-if-error is an alias for the **and** operator.

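The short-circuit behaviour of **and** can be observed without
reed-alert. In this stand-alone sketch, run-chain and fake-probe are
hypothetical stand-ins for the three ping checks: each one records that
it ran and returns the success value it was given.

```lisp
;; Stand-alone illustration, not reed-alert code.
(defun run-chain (router-ok isp-ok website-ok)
  "Evaluate three stand-in probes under AND and return the ones that ran."
  (let ((ran '()))
    (flet ((fake-probe (name ok)
             (push name ran)   ; remember that this probe was executed
             ok))              ; and report its success value
      ;; AND stops at the first probe that returns NIL.
      (and (fake-probe :router router-ok)
           (fake-probe :isp-dns isp-ok)
           (fake-probe :website website-ok)))
    (reverse ran)))

(run-chain t nil t)   ; => (:ROUTER :ISP-DNS), the website is never checked
(run-chain nil t t)   ; => (:ROUTER), everything behind the router is skipped
```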

Escalation
~~~~~~~~~~
It can be a good idea to use different alerts depending on how
critical a check is, but sometimes the criticality depends on the
value of the error and/or the delay between detecting a problem and
fixing it. You may want to receive a mail when things can be fixed in
your spare time, but mail another person if the problem goes past
some level.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me disk-usage :path "/" :limit 90)
      (=> buzzer disk-usage :path "/" :limit 98))

In this example, we check the disk usage. I will get a mail through
the "mail-me" alert if the disk usage exceeds 70%. Once it goes that
far, reed-alert checks whether the disk usage exceeds 90%; if so,
I'll receive an SMS through the "sms-me" alert. Then, if it exceeds
98%, the "buzzer" alert will make some bad noises in the room to
warn me about it.

Note: escalation is an alias for the **or** operator.

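The same kind of stand-alone sketch shows why **or** gives an
escalation: the next check is evaluated only when the previous one
failed. The names escalate and check are hypothetical, and each
stand-in check succeeds while the usage does not exceed its limit.

```lisp
;; Stand-alone illustration, not reed-alert code.
(defun escalate (usage)
  "Run three stand-in disk checks under OR and return the limits evaluated."
  (let ((checked '()))
    (flet ((check (limit)
             (push limit checked)   ; remember that this level was reached
             (<= usage limit)))     ; success while USAGE is within LIMIT
      ;; OR stops at the first check that succeeds.
      (or (check 70) (check 90) (check 98)))
    (reverse checked)))

(escalate 50)   ; => (70), nothing is wrong and nothing else is checked
(escalate 75)   ; => (70 90), the first level failed and the second passed
(escalate 99)   ; => (70 90 98), every level failed
```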

Extend with your own probes
===========================

It is likely that you want to write your own probes. While using the
command probe can be convenient, you may want to have a probe with
more parameters and better integration than the command probe.

There are two methods for adding probes:
- in the configuration file before using it
- in a separate lisp file that you load from the configuration file

If you want to reuse a probe across multiple configuration files or
servers, I would recommend a separate file. Otherwise, adding it at
the top of the configuration file can be convenient too.


Using a shell command
---------------------

A minimal understanding of Common LISP is needed for this, but with
the easiest approach, writing a probe around a shell command, the
declaration can be really simple.

We are going to write a probe that will use curl to fetch a page and
then grep on the output to look for a pattern. The return code of grep
will be the return status of the probe: if grep finds the pattern,
it's a success, if not it's a failure.

In the following code, the "create-probe" part is a macro that will
write most of the code for you. Then, we use the "command-return-code"
function, which will execute the shell command passed as a string (or
as a list) and return the correct values in case of success or
failure. The single quotes in the command keep a pattern containing
spaces as one argument to grep.

    (create-probe
     check-http-pattern
     (command-return-code (format nil "curl '~a' | grep -i '~a'"
                                  (getf params :url) (getf params :pattern))))

If you don't know LISP, the "format" function works like "printf",
using "~a" instead of "%s". This is the only thing you need to know if
you want to reuse the previous code.

Then we can call it like this:

    (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")

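The way format fills the ~a directives can be tried on its own, without
reed-alert. In this sketch http-pattern-command is a hypothetical name,
and params is an ordinary property list like the one a probe receives.
The single quotes are a simple form of quoting and would not be enough
for values that themselves contain a single quote.

```lisp
;; Stand-alone illustration: only builds the string, nothing is executed.
(defun http-pattern-command (params)
  "Build the shell command for the :url and :pattern found in PARAMS."
  (format nil "curl '~a' | grep -i '~a'"
          (getf params :url)
          (getf params :pattern)))

(http-pattern-command '(:url "http://127.0.0.1" :pattern "Powered by cl-yag"))
;; => "curl 'http://127.0.0.1' | grep -i 'Powered by cl-yag'"
```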

Using plain LISP
----------------

We have seen previously how to create new probes from a shell command,
but one may want to do it in LISP, which allows using the full
features of the language and even some libraries, for example to check
values in a database. I recommend reading the "probes.lisp" file; it's
the best way to learn how to write a new probe. As an example, we will
learn from the easiest probe included: check-file-exists

    (create-probe
     check-file-exists
     (let ((result (probe-file (getf params :path))))
       (if result
           t
           (list nil "file not found"))))

Like before, we use the "create-probe" macro and give a name to the
probe. Then we have to write some code, in this case checking whether
the file exists. Finally, if it is a success, we have to return **t**;
if it fails, we return a list containing **nil** and a value or a
string. The second element of the list will replace %result% in the
notification command, so you can use something explicit, such as a
concatenation of a message with the return value. Parameters
should be read with getf from the **params** variable, which allows
using a default value in case a parameter is not defined in the
configuration file.
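
The return convention and the getf default can be tried outside
reed-alert with a plain function. In this sketch check-limit is a
hypothetical stand-in for the body of a probe, and the :value
parameter and the default limit of 10 are assumptions for
illustration.

```lisp
;; Stand-alone illustration of the values a probe body should return.
(defun check-limit (params)
  "Succeed while :VALUE stays below :LIMIT (10 when :LIMIT is absent)."
  (let ((value (getf params :value))
        (limit (getf params :limit 10)))   ; third argument is the default
    (if (< value limit)
        t                      ; success
        (list nil value))))    ; failure: VALUE would replace %result%

(check-limit '(:value 3))              ; => T
(check-limit '(:value 30))             ; => (NIL 30)
(check-limit '(:value 30 :limit 50))   ; => T
```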