An agent, auditor, and bodyguard walk into a bar…

This evening I wasted a bunch of time on what turned out to be a simple problem. I really hate it when that happens.

I fixed a bug in tpe-lkm where users weren’t seeing all of their processes, and updated my servers with the new module. Suddenly, my phone started buzzing off the desk; nagios was complaining that some daemons were down. This data is retrieved via snmp, and upon further investigation, I noticed that the daemons were in fact up.

So it was an snmp problem.

Except, I couldn’t duplicate it. I would get a shell as the snmp user, and ps auxf showed the full process list just fine. The snmp user was in the proper group for viewing all processes, and that group was set in /proc/sys/tpe/extras/ps_gid. I ran the good old strace command on snmpd, and saw:

[pid  4749] open("/proc/744/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/745/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/748/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/751/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/754/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/769/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/775/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/776/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/937/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/962/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/986/status", O_RDONLY) = -1 EPERM (Operation not permitted)
[pid  4749] open("/proc/1002/status", O_RDONLY) = -1 EPERM (Operation not permitted)

Long story short, I thought the problem was in the tpe-lkm “ps” restriction code. I went through the code and inserted all sorts of printks to see what the problem was. According to the kernel, snmp wasn’t in the ps_gid group. What?!?!

As it turns out, before snmpd drops its root privileges, it makes a call to setgroups() and drops all of the supplementary groups it belongs to. So it doesn’t matter that the snmp user belongs to the correct group to start with; snmpd flees the group on startup. To fix the issue, I had to specifically set the ps_gid group as the “agentgroup” in snmpd.conf.
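In hindsight, a quicker check than sprinkling printks would have been to look at the groups the running daemon actually holds, rather than the groups the snmp user has on paper. Here is a minimal sketch of that check in Python, assuming a Linux /proc where /proc/<pid>/status carries the usual "Gid:" and "Groups:" lines. The helper names and the gid values in the usage note are made up for illustration, and you would want to run it as root so the tpe-lkm restriction doesn't get in the way.

```python
def parse_proc_groups(status_text):
    """Return (effective_gid, supplementary_gids) parsed from the text
    of a /proc/<pid>/status file."""
    egid = None
    groups = set()
    for line in status_text.splitlines():
        if line.startswith("Gid:"):
            # The four fields are: real, effective, saved, filesystem gid.
            fields = line.split()[1:]
            egid = int(fields[1])
        elif line.startswith("Groups:"):
            # Supplementary groups; empty after setgroups() with no groups.
            groups = {int(g) for g in line.split()[1:]}
    return egid, groups


def process_in_group(pid, gid):
    """True if the running process has gid as its effective gid or as
    one of its supplementary groups."""
    with open(f"/proc/{pid}/status") as fh:
        egid, groups = parse_proc_groups(fh.read())
    return gid == egid or gid in groups
```

For example, process_in_group(4749, 1001) would ask whether the snmpd process from the strace output above holds a hypothetical gid 1001. If the "Groups:" line comes back empty while the id command lists the group for the snmp user, the daemon shed its groups at startup.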

So we have the agent (snmpd), the auditor (nagios), and the bodyguard (tpe-lkm), in a bar (the server), and the agent … meh, I can’t come up with something to make this funny. Oh well, I tried.

Hope this helps someone else who may run into the same problem.
