Monitoring MDADM RAID arrays with Cacti & SNMP

As I’ve mentioned in a previous post, I use MDADM RAID arrays on Ubuntu for storing all of my media, plus a separate array elsewhere for backups. I also use Cacti to keep an eye on my various machines, combined with the Threshold plugin to alert me when anything goes wrong on the network. While MDADM’s monitoring feature can email you when a disk gets dropped, I was looking for something that would hook into my existing infrastructure, and was surprised to find that no templates existed for graphing MDADM details in Cacti – so I set about making some! This post explains how to set this up on your own machine.

Rather than using a local script, I have used SNMP, so that this can be set up on multiple machines and monitored remotely. So, our first step is to get the information into our SNMP configuration. This is accomplished using two scripts (as mdadm requires root to run; we’ll see how this is handled shortly) – to get them, run (as root):

$ cd /etc/snmp/
$ svn co http://projects.mattdyson.org/projects/cacti/mdadm/scripts/ mdadm
$ cd mdadm
$ chmod -R 0755 *

This will download two files: mdadm-monitor and mdadm-read. The first is the “business end” of the operation; it takes various arguments (as you’ll see in snmpd.conf shortly) and outputs an indexed list of the requested value for each array on the system. The mdadm-read script simply calls mdadm --detail /dev/$1, so we can add it to /etc/sudoers to access this information without needing a root password each time. Add the following line to /etc/sudoers (ideally via visudo) to make this happen; you may need to change the user (“snmp” in this example) to whichever user snmpd runs as on your system:

snmp    ALL=(ALL) NOPASSWD: /etc/snmp/mdadm/mdadm-read

At this point, you should be able to run /etc/snmp/mdadm/mdadm-monitor Name as your SNMP user and see a list of all MDADM arrays on your system!
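
To give an idea of what is going on under the hood, here is an illustrative sketch (not the actual script from the repository) of the kind of parsing mdadm-monitor has to do: pulling a counter such as “Active Devices” out of mdadm --detail output. The heredoc stands in for what `sudo /etc/snmp/mdadm/mdadm-read md0` would print on a real system:

```shell
#!/bin/sh
# Illustrative sketch only - not the real mdadm-monitor script.
# The heredoc below stands in for the output of:
#   sudo /etc/snmp/mdadm/mdadm-read md0
detail=$(cat <<'EOF'
/dev/md0:
        Version : 1.2
     Raid Level : raid5
  Active Devices : 4
 Working Devices : 5
  Failed Devices : 0
   Spare Devices : 1
EOF
)

# Split on ":" and strip spaces to pull the number after "Active Devices :"
echo "$detail" | awk -F: '/Active Devices/ { gsub(/ /, "", $2); print $2 }'
# prints: 4
```

The same pattern works for the Working, Failed and Spare counters; the real script simply repeats this per array and prefixes each value with an index.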

In order to access these through SNMP, we need to edit /etc/snmp/snmpd.conf and add the following lines:

extend .1.3.6.1.4.1.2021.80 Num /etc/snmp/mdadm/mdadm-monitor Num
extend .1.3.6.1.4.1.2021.81 Index /etc/snmp/mdadm/mdadm-monitor Index
extend .1.3.6.1.4.1.2021.82 Name /etc/snmp/mdadm/mdadm-monitor Name
extend .1.3.6.1.4.1.2021.83 Active /etc/snmp/mdadm/mdadm-monitor Active
extend .1.3.6.1.4.1.2021.84 Working /etc/snmp/mdadm/mdadm-monitor Working
extend .1.3.6.1.4.1.2021.85 Failed /etc/snmp/mdadm/mdadm-monitor Failed
extend .1.3.6.1.4.1.2021.86 Spare /etc/snmp/mdadm/mdadm-monitor Spare
extend .1.3.6.1.4.1.2021.87 Degraded /etc/snmp/mdadm/mdadm-monitor Degraded

Once this is done, you’ll need to restart SNMP with /etc/init.d/snmpd restart, and then you should be able to run snmpwalk -v1 -cpublic localhost .1.3.6.1.4.1.2021.82 (if you have snmpwalk installed) and see some output similar to this:

iso.3.6.1.4.1.2021.82.1.0 = INTEGER: 1
iso.3.6.1.4.1.2021.82.2.1.2.4.78.97.109.101 = STRING: "/etc/snmp/mdadm/mdadm-monitor"
iso.3.6.1.4.1.2021.82.2.1.3.4.78.97.109.101 = STRING: "Name"
iso.3.6.1.4.1.2021.82.2.1.4.4.78.97.109.101 = ""
iso.3.6.1.4.1.2021.82.2.1.5.4.78.97.109.101 = INTEGER: 5
iso.3.6.1.4.1.2021.82.2.1.6.4.78.97.109.101 = INTEGER: 1
iso.3.6.1.4.1.2021.82.2.1.7.4.78.97.109.101 = INTEGER: 1
iso.3.6.1.4.1.2021.82.2.1.20.4.78.97.109.101 = INTEGER: 4
iso.3.6.1.4.1.2021.82.2.1.21.4.78.97.109.101 = INTEGER: 1
iso.3.6.1.4.1.2021.82.3.1.1.4.78.97.109.101 = STRING: "md0"
iso.3.6.1.4.1.2021.82.3.1.2.4.78.97.109.101 = STRING: "md0"
iso.3.6.1.4.1.2021.82.3.1.3.4.78.97.109.101 = INTEGER: 1
iso.3.6.1.4.1.2021.82.3.1.4.4.78.97.109.101 = INTEGER: 0
iso.3.6.1.4.1.2021.82.4.1.2.4.78.97.109.101.1 = STRING: "md0"
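
A quick aside on those long OIDs: the numeric tail on each row is just the extend token (“Name” in this walk) encoded as a length followed by ASCII character codes, so 4.78.97.109.101 reads as the 4-character string “Name”. You can decode a suffix with a one-liner:

```shell
# Decode an extend OID suffix: the first number is the string length,
# the rest are ASCII codes (78=N, 97=a, 109=m, 101=e).
echo "4.78.97.109.101" | awk -F. '{ s = ""; for (i = 2; i <= NF; i++) s = s sprintf("%c", $i); print s }'
# prints: Name
```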

Excellent! We’re now up and running, with details of all MDADM arrays exposed over SNMP, including the number of active, working, failed and spare disks for each array! Next up, let’s add the SNMP query XML into our Cacti resource directory (substitute the relevant Cacti path for your system). The code I’ve written works fine on net-snmp 5.4.3, but I believe older versions may not play nicely with the long OIDs in net-snmp_mdadm.xml – you may need to remove everything in the OID after .1.3.6.1.4.1.2021.8<X>.4

$ cd /usr/share/cacti/site/resource/snmp_queries/
$ svn co http://projects.mattdyson.org/projects/cacti/mdadm/queries/ mdadm

This is referenced by the Cacti Data Template that I have put together, which can be downloaded here. Once you’ve used the “Import Templates” feature to import this file, add the SNMP – Get MDADM Arrays template to a host, and create the relevant graphs for each of your arrays. Here’s one of mine that’s a little boring, but at least we know everything is okay!

[Graph: saturn md0, 1 hour view]

There we have it – MDADM data available in Cacti! I’ve also created a Threshold template that monitors the Failed data source for any value other than 0, and then kicks off the usual alerting to let me know to replace a disk; Cacti won’t let me export that one easily, but it’s simple enough to recreate.

I haven’t fully tested this method with multiple arrays on the same host yet, please let me know in the comments if you find any problems, and I’ll do my best to help out.

Update 08/05/13: I’ve updated the graph template to make a bit more sense when an array is rebuilding.

Update 05/11/13: The graph template should now parse properly – thanks to Evandro Nabor for spotting the error!

Update 28/12/13: Doh! My ‘fix’ for the graph template above introduced another problem! I’ve updated the file again, and hopefully things should now be working – thanks to C. Alexander for spotting this!

15 thoughts on “Monitoring MDADM RAID arrays with Cacti & SNMP”

  1. Hi Matt, great post. I’m trying to do the same but am stuck on the last step.
    Cacti is on a FreeBSD system, version 0.8.7g.
    There is no directory at /usr/share/cacti/site/resource/snmp_queries/
    And secondly, I cannot import your XML template. It says “Error: XML parse error”.
    Thanks

    • Hi Rudi,
      This template was built with Cacti 0.8.8a, which may explain why the template fails to parse.
      As for the directory – this will vary depending on your installation directory – there should be a folder resource/snmp_queries in your Cacti directory, which is where the queries need to be checked out into.
      Hope this helps!

      • Updated my Cacti to the latest version and found my snmp_queries directory.
        I’ve checked snmpwalk, which gives me the correct results for the md0 software array as you describe, using the snmpwalk -v1 -cpublic (host ip) .1.3.6.1.4.1.2021.82 command.
        Now I can add my Data Query, but the Graph is showing no results:
        Active Disks: 0
        Failed: 0
        Spare: 0
        Working: 0

    • I should really check my fixes before declaring things fixed!! Thanks for pointing out that mistake – I’ve updated the template file and hopefully it should now be working!

  2. For some reason I get an “End of MIB” response when I try to run snmpwalk. I tried localhost and everything else, but no go. Everything is installed correctly and there are no errors. Does anyone know what the problem could be?

  3. I set up this “Monitoring MDADM RAID” approach a year ago on my Debian RAID server, and monitored it with “The Dude”, the well-known Mikrotik SNMP monitoring tool. It seemed to work, but after one or two months it caused a complete failure of the monitored server, which needed to be restarted locally (power removed). After some investigation I found that SNMP was generating a lot of log lines (about 10-20 per second!) on the monitored server, and I’m sure the monitoring was the cause of the failures, because stopping the SNMP requests prevented them. I tried a lot of things, such as slowing down The Dude’s SNMP requests, but without success (always 10-20 log entries per second, and always failures after a month or so), so I have now disabled monitoring by putting an incorrect IP address in The Dude. I occasionally change it back to the correct server address for a minute to see if the RAID is okay, then switch back to the incorrect one.
    So for me this monitoring system did not work out, as I can only use it to check the RAID manually, much as I would with the normal mdadm CLI.

  4. I have now made some modifications to the SNMP monitor I use, “The Dude”, to reduce the requests it makes to the SNMP server.
    These modifications include: 1) increasing the appearance update time to 1 minute or more (the SNMP UDP requests made to fill the device label in The Dude);
    2) reducing polls, by increasing the polling time to 5 minutes or more;
    3) reducing The Dude’s requests as described here: http://forum.mikrotik.com/viewtopic.php?t=63960 by forum guru lebowski: “If you have a device on 192.168.1.3, change its configured ip address to 192.168.1.50 or some other unused address and place the following on the device label and the SNMP system name will show up. Which proves my suggestion would work although it is nuts 🙂
    [oid("1.3.6.1.2.1.1.5.0",5,5,"192.168.1.3","public")]”
    I hope these modifications will let the server work with no trouble from SNMP requests.

  5. Hi, thank you for your script.

    I’m trying to run it on my server, but when I run mdadm-monitor it shows “error: unknown query”. I set the permissions with chmod as described and ran it as root, but it still shows the error.

    • Hi Andi,

      You need to pass a query name on the command line:

      Index, Name, Active, Working, Failed, Spare or Degraded

      e.g.: /etc/snmp/mdadm/mdadm-monitor Name

  6. I made an alteration to /etc/snmp/mdadm/mdadm-read
    I kept getting the following error when it was run:
    mdadm: metadata format 1.00 unknown, ignored.

    The change below fixed the issue for me:
    /sbin/mdadm --detail "/dev/$1" 2>&1 | sed '/mdadm/d'

  7. Happy New Year

    Thanks for this. I was attempting to do something similar with bash, but I stumbled across your work. I modified mdadm-monitor so the /dev/md/ directory is not included:

    ls /dev/ | grep -E 'md[0-9]{1}'

    Hope that helps

    Regards,
    George
