Description
===========

reed-alert is a small and simple monitoring tool for your server,
written in Common LISP.

reed-alert checks the status of various processes on a server and
triggers user-defined notifications.

Each triggered message is called an 'alert'.
Each check is called a 'probe'.
Each probe can be customized by different parameters.


Dependencies
============

reed-alert is regularly tested on FreeBSD/OpenBSD/Linux and has been
tested with both **sbcl** and **ecl**, which should be available for
most distributions.

(On OpenBSD you may prefer to use ecl because sbcl needs 'wxallowed'
on the partition where the binary is.)

To make reed-alert's deployment easier I avoid using external
libraries. reed-alert only requires a Common LISP interpreter and
its own files.

Work had begun on using quicklisp libraries to write more
sophisticated checks such as "does this URL contain a pattern?", but
it was abandoned. If you need more elaborate checks, write a shell
command and run it with the **command** probe.


Code-Readability
================

Although the code is very rough for now, I think it's already fairly
understandable by people who need this kind of tool.

I will try to improve the readability of the config file in future
commits. NOTE: declaration of notifiers is easier now.


Usage
=====

Install reed-alert
------------------

    $ cd reed-alert
    $ make
    $ sudo make install
    $ /usr/local/bin/reed-alert ~/monitoring/my_config.lisp


Special folder
--------------

reed-alert creates a folder at the following path to save the probe
states between invocations.

    ~/.reed-alert/states/

If you delete it, you will lose the failure states of previous runs.


Reed-alert start automation
---------------------------

You can use cron to start reed-alert every n minutes (or at whatever
interval you want). The frequency depends on what you check: if you
only want to check that the daily backup worked, running reed-alert
once a day is fine, but if you need to monitor a critical service then
every minute is more appropriate.

As always with cron jobs, be sure that either you call the interpreter
using its full path or that $PATH inside the crontab contains it.

A cron job running every five minutes using ecl would look like this:

    */5 * * * * ( cd /opt/reed-alert/ && /usr/local/bin/ecl --shell server.lisp )


Personal Configuration File
---------------------------
You may want to rename **example-simple.lisp** to **config.lisp** in
order to create your own configuration file.

The configuration is explained below.


The Notification System
=======================

When a check returns a failure, a previously defined notifier is
called. By default this happens only once reed-alert has found exactly
**3** failures in a row for this check. The default can be changed
globally by modifying the *tries* variable, or per probe with the :try
parameter, as explained later in this document.
This prevents reed-alert from spamming notifications for a long time
when the number of failures gets very high (like a disk space usage
that can't be fixed for a long time), and from sending notifications
about a check on the edge of its limit, like a ping almost working but
failing from time to time, or a load average hovering around the limit.

reed-alert uses the notifier when a check reaches its try number and
again when the problem is fixed, so you know when it begins and when
it ends.

It is possible to be reminded about a failure every n tries by setting
the keyword :reminder to a number. This is useful if you want to be
reminded from time to time that a problem is not fixed, because alerts
such as mails are easily overlooked or lost among many other mails.
The :reminder is a per-check setting. For a global reminder setting,
set the *reminder* variable.

reed-alert keeps track of the count of failures with one file per
failing probe in the "states" folder. To ensure unique filenames, the
following format is used (+ means concatenation):

    alert-name + probe-name + hash of probe parameters

The notifier is a shell command with a name. The shell command can
contain variables from reed-alert.

+ %function% : the name of the probe
+ %date%     : the current date with format YYYY/MM/DD hh:mm:ss
+ %params%   : the parameters of the probe
+ %hostname% : the hostname of the server
+ %result%   : the error returned (the value exceeding the limit, file not found)
+ %desc%     : an arbitrary description naming a check, defaults to an empty string
+ %level%    : the type of notification used
+ %os%       : the type of operating system (FreeBSD/Linux/OpenBSD)
+ %newline%  : a newline character
+ %state%    : "start" when the problem happens / "end" when it is solved


Example Probe 1: 'Check For Load Average'
-----------------------------------------
If you want to send a mail with a message like:

    "On 2016/10/06 11:11:12 server.foo.com has encountered a problem
    during LOAD-AVERAGE-15 (:LIMIT 10) with a value of 30"


write the following at the top of the file and use **pretty-mail** in your checks:

    (alert pretty-mail "echo 'On %date% %hostname% has encountered a problem during %function%
    %params% with a value of %result%' | mail yourmail@foo.bar")

Example Probe 2: 'Don't do anything'
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you don't want anything to be done when an error occurs, use the following:

    (alert nothing-to-send "")

Example Probe 3: 'Send SMS'
~~~~~~~~~~~~~~~~~~~~~~~~~~~
You may want to use an external service to send an SMS. This is
entirely possible since notifiers rely on a shell command:

    (alert sms "echo 'error on %hostname% : %function% %result%'
    | curl -u login:pass http://api.sendsms.com/")


The Probes
==========

Probes are written in Common LISP. They are predefined checks.

The :desc Parameter
-------------------
The :desc parameter allows you to describe specifically what your check
does. It can be put in every probe.

    :desc "STRING"


The :try Parameter
------------------
The :try parameter allows you to change how many failures to wait for
before the alert is triggered. By default, it's triggered after 3
failures. Sometimes, when using ping for example, you want to be
notified when it fails for a few cycles and not at the first failure.

    :try INTEGER


Overview
--------
As of this commit, reed-alert ships with the following probes:

    (1)  number-of-processes
    (2)  pid-running
    (3)  disk-usage
    (4)  check-file-exists
    (5)  file-updated
    (6)  load-average-1
    (7)  load-average-5
    (8)  load-average-15
    (9)  ping
    (10) command
    (11) service
    (12) file-less-than
    (13) curl-http-status
    (14) ssl-expiration
    (15) write-to-file


number-of-processes
-------------------
Check if the actual number of processes of the system exceeds a specific limit.

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert number-of-processes :limit 200)`


pid-running
-----------
Check if the PID number found in a .pid file is alive.

> Set the path of the pid file. If $USER doesn't have permission to open it, return "file not found".
    :path "STRING"

Example: `(=> alert pid-running :path "/var/run/nginx.pid")`


disk-usage
----------
Check if the disk usage of a chosen partition exceeds a specific limit.

> Set the mountpoint to check.
    :path "STRING"

> Set the limit that will trigger an alert when exceeded.
    :limit INTEGER

Example: `(=> alert disk-usage :path "/tmp" :limit 50)`


check-file-exists
-----------------
Check if a file exists.

> Set the path of the file to check.
    :path "STRING"

Example: `(=> alert check-file-exists :path "/var/postgresql/standby")`


file-updated
------------
Check if a file exists and has been updated since a defined time.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in minutes since the last modification time before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-updated :path "/var/log/nginx/access.log" :limit 60)`


load-average-1
--------------
Check if the load average during the last minute exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-1 :limit 2)`


load-average-5
--------------
Check if the load average during the last five minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-5 :limit 2)`


load-average-15
---------------
Check if the load average during the last fifteen minutes exceeds a specific limit.

> Set the limit not to exceed.
    :limit INTEGER

Example: `(=> alert load-average-15 :limit 2)`


ping
----
Check if a remote host answers 2 ICMP pings.

> Set the host to ping. Return an error if the ping command returns non-zero.
    :host "STRING" (can be IP or hostname)

Example: `(=> alert ping :host "8.8.8.8")`


command
-------
Execute an arbitrary command which triggers an alert if it returns a non-zero value.
This may be the most useful probe because it lets the user do any check needed.

> Command to execute, accepts commands with pipes.
    :command "STRING"

Example: `(=> alert command :command "tail -n 10 /var/log/messages | grep -v CRITICAL")`


service
-------
Check if a service is started on the system.

> Set the name of the service to test.
    :name "STRING"

Example: `(=> alert service :name "mysql-server")`


file-less-than
--------------
Check if a file has a size less than a specified limit.

> Set the path of the file to check.
    :path "STRING"

> Set the limit in bytes before triggering an alert.
    :limit INTEGER

Example: `(=> alert file-less-than :path "/var/log/nginx.log" :limit 60)`


curl-http-status
----------------
Do an HTTP request and return an error if the return code isn't
200. Requires curl.

> Set the url to request.
    :url "STRING"

> Set the time to wait before aborting.
    :timeout INTEGER


ssl-expiration
--------------
Check if a remote SSL certificate expires in less than a specified
time. Requires openssl.

> Set the hostname for the request.
    :host "STRING"

> Set the expiration time limit in seconds.
    :seconds INTEGER

> Set the port for the request (OPTIONAL).
    :port INTEGER (defaults to 443)

> Use starttls (OPTIONAL).
    :starttls "STRING"

Example: `(=> alert ssl-expiration :host "domain.local" :seconds (* 7 24 60 60))`
Example: `(=> alert ssl-expiration :host "domain.local" :seconds 86400 :port 6697)`
Example: `(=> alert ssl-expiration :host "smtp.domain.local" :seconds 86400 :starttls "smtp" :port 25)`


write-to-file
-------------
Write content to a file, creating it if it doesn't exist.

The purpose of this probe is to be used at the end of a reed-alert
script to update the modification time of a file. You can then use
file-updated on this file at the beginning of a script to monitor
whether reed-alert finished correctly on its last run.

> Set the path of the file.
    :path "STRING"

> Set the content of the file (OPTIONAL).
    :text "STRING" (defaults to the current time in seconds)

Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt")`
Example: `(=> alert write-to-file :path "/tmp/reed-alert.txt" :text "hello world")`


The configuration file
======================

The configuration file is Common LISP code, so it's evaluated.
It's possible to write some logic within it.


Loops
-----
It's possible to write loops if you don't want to repeat code:

    (loop for host in '("bitreich.org" "dataswamp.org" "floodgap.com")
          do
          (=> mail ping :host host))

or another example:

    (loop for service in '("smtpd" "nginx" "mysqld" "postgresql")
          do
          (=> mail service :name service))

and another example using rows from a file to check remote hosts:

    (with-open-file (stream "hosts.txt")
      (loop for line = (read-line stream nil)
            while line
            do
            (=> mail ping :host line)))


Conditional
-----------
It is also possible to use conditionals. Two conditional groups are
especially useful.


Dependency
~~~~~~~~~~
Sometimes it may be a good idea to stop some probes if a probe
fails, for example when you need to check a path through a network,
from the nearest machine to the remote target. If we can't reach our
local router, probes requiring the router to work will trigger errors,
so we should skip them.

    (stop-if-error
      (=> mail ping :host "192.168.1.1" :desc "My local router")
      (=> mail ping :host "89.89.89.89" :desc "My ISP DNS server")
      (=> mail ping :host "kernel.org" :desc "Remote website"))

Note: stop-if-error is an alias for the **and** function.


Escalation
~~~~~~~~~~
It could be a good idea to use different alerts depending on how
critical a check is, but sometimes the critical level depends on the
value of the error and/or the delay between detecting it and fixing
it. You might want to receive a mail when things need to be fixed in
your spare time, but mail other people if things aren't fixed after
some level.

    (escalation
      (=> mail-me disk-usage :path "/" :limit 70)
      (=> sms-me disk-usage :path "/" :limit 90)
      (=> buzzer disk-usage :path "/" :limit 98))

In this example, we check the disk usage. I will get a mail through
the "mail-me" alert if the disk usage goes above 70%. Once it gets
that far, reed-alert checks whether the disk usage is above 90%; if
so, I'll receive an SMS through the "sms-me" alert. Then, if it goes
above 98%, the "buzzer" alert will make some bad noises in the room to
warn me about this.

Note: escalation is an alias for the **or** function.


Extend with your own probes
===========================

It is likely that you will want to write your own probes. While using
the command probe can be convenient, you may want to have a probe with
more parameters and better integration than the command probe.

There are two methods for adding probes:
- in the configuration file, before using it
- in a separate lisp file that you load from the configuration file

If you want to reuse a probe across multiple configuration files or
servers, I would recommend a separate file. Otherwise, adding it at
the top of the configuration file can be convenient too.


Using a shell command
---------------------

A minimum of Common LISP comprehension is needed for this. But if you
take the easiest route and write a probe around a shell command, the
declaration can be really simple.

We are going to write a probe that will use curl to fetch a page and
then grep on the output to look for a pattern. The return code of grep
will be the return status of the probe: if grep finds the pattern,
it's a success, if not it's a failure.

In the following code, the "create-probe" part is a macro that will
write most of the code for you.
Then, we use the "command-return-code" function, which executes the
shell command passed as a string (or as a list) and returns the
correct values in case of success or failure.

    (create-probe
     check-http-pattern
     (command-return-code (format nil "curl '~a' | grep -i '~a'"
                                  (getf params :url) (getf params :pattern))))

If you don't know LISP, the "format" function works like "printf",
using "~a" instead of "%s". This is the only thing you need to know
if you want to reuse the previous code.

Then we can call it like this:

    (=> notifier check-http-pattern :url "http://127.0.0.1" :pattern "Powered by cl-yag")


Using plain LISP
----------------

We have seen previously how to create new probes from a shell command,
but one may want to do it in LISP, which allows using the full
features of the language and even some libraries, to check values in a
database for example. I recommend reading the "probes.lisp" file, it's
the best way to learn how to write a new probe. But as an example, we
will learn from the easiest probe included: check-file-exists

    (create-probe
     check-file-exists
     (let ((result (probe-file (getf params :path))))
       (if result
           t
           (list nil "file not found"))))

Like before, we use the "create-probe" macro and give a name to the
probe. Then, we have to write some code, in the current case to check
if the file exists. Finally, if it is a success, we have to return
**t**; if it fails, we return a list containing **nil** and a value or
a string. The second element in the list will replace %result% in the
notification command, so you can use something explicit, such as a
concatenation of a message with the return value. Parameters should be
read with getf from the **params** variable, which allows using a
default value in case one is not defined in the configuration file.
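
To illustrate reading a parameter with getf and a default value, here
is a minimal sketch of a probe that is not part of reed-alert. The
name file-at-least, its behaviour and the default of 1 byte are
hypothetical. It assumes, as in the examples above, that create-probe
makes the probe parameters available in **params**. The probe succeeds
when the file at :path holds at least :limit bytes.

    ;; Hypothetical probe, shown only as an illustration.
    (create-probe
     file-at-least
     (let ((path  (getf params :path))
           ;; Third argument of getf is the default used when :limit is absent.
           (limit (getf params :limit 1)))
       (if (probe-file path)
           ;; Open as a byte stream so file-length is counted in bytes.
           (with-open-file (stream path :element-type '(unsigned-byte 8))
             (let ((size (file-length stream)))
               (if (>= size limit)
                   t
                   ;; The size found will replace %result% in the notifier.
                   (list nil size))))
           (list nil "file not found"))))

It could then be called with or without :limit (the path is only an
example):

    (=> notifier file-at-least :path "/var/backups/dump.sql" :limit 1024)
    (=> notifier file-at-least :path "/var/backups/dump.sql")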