Monday, July 20, 2015

Heads Up: Issues with UCM SIP Processing on 10.5(2)SU2 and 11.0

Users who recently downloaded Cisco Unified Communications Manager (CUCM) 10.5(2)SU2 [10.5(2.12900-14)] or CUCM 11.0 [11.0(1.10000-10)] may have received an email from Cisco alerting them that these versions have been deferred due to some serious defects. If so, then good for you. If you ignored the email, then you get the wag of the finger.

Maybe you are managing a system affected by the defects we are going to discuss, but you weren't the one to download the files. In that case, you wouldn't have been alerted. Either way, if you have installed 10.5(2)SU2 or 11.0 and are running SIP, then you want to heed the warnings and look at applying the appropriate fixes.


Background

One of the nasty software defects that you could run into if you are using SIP trunks to ITSPs or other call processing systems is CSCuu97800. I have a customer who hit this defect recently.

The basic gist is that if CUCM receives a SIP message that causes the CallManager process to resolve an FQDN via DNS, the CallManager process will abnormally terminate. This means that route lists, hunt lists, phones, gateways, media resources, etc. will be immediately impacted. IOW, all hell breaks loose.
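
To picture the trigger, consider a hypothetical inbound INVITE (every hostname and number below is made up for illustration; the defect notes do not publish a specific trigger message). Headers carrying an FQDN, like the Via and Contact here, are the kind of thing that can send the CallManager process off to DNS:

INVITE sip:5551000@cucm-sub1.example.com:5060 SIP/2.0
Via: SIP/2.0/UDP sbc.itsp.example.net:5060;branch=z9hG4bK776asdhds
From: <sip:+15555550100@sbc.itsp.example.net>;tag=49583
To: <sip:5551000@cucm-sub1.example.com>
Contact: <sip:+15555550100@sbc.itsp.example.net:5060>
Call-ID: a84b4c76e66710@sbc.itsp.example.net
CSeq: 101 INVITE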

Edit: Technically, you can still download 10.5(2)SU2 and 11.0 from CCO. This is at odds with the typical handling of Deferred Releases, where you should not be able to download them. Suffice it to say, these defects are serious enough that you should treat the software as deferred.


Are You Impacted?

It is very easy to determine if you are encountering this issue. According to the defect notes, this defect (and the related defect CSCut30176) affects specific versions of CUCM:

10.5(2.12900-14)
11.0(1.10000-10)
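
You can confirm the active version from the CLI with the standard CUCM OS admin command below (output trimmed; the version shown is just an example):

admin:show version active

Active Master Version: 10.5.2.12900-14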

If your version matches, the next check is whether your cluster nodes are generating core dumps. To do this, SSH to the console of any CUCM node running the CallManager service and execute this command:

admin:utils core active list

      Size         Date            Core File Name
=================================================================
 254020 KB   2015-07-20 09:13:07   core.8414.11.ccm.1437397985
 249276 KB   2015-07-20 10:24:38   core.3855.11.ccm.1437402278
 244924 KB   2015-07-20 08:38:13   core.22333.11.ccm.1437395891
 248900 KB   2015-07-20 10:18:28   core.28705.11.ccm.1437401907
 244228 KB   2015-07-20 11:28:22   core.1786.11.ccm.1437406099
 242040 KB   2015-07-20 10:46:03   core.10190.11.ccm.1437403562
 245716 KB   2015-07-20 12:30:38   core.27040.11.ccm.1437409838
 253316 KB   2015-07-20 09:26:18   core.29177.11.ccm.1437398777
 246008 KB   2015-07-20 12:02:19   core.27950.11.ccm.1437408138
 246908 KB   2015-07-20 08:04:03   core.10454.11.ccm.1437393837
 252348 KB   2015-07-20 09:33:50   core.9115.11.ccm.1437399230

To verify whether you are affected by CSCuu97800 (or the related defect CSCut30176), analyze one of the core files. For example:

admin:utils core active analyze core.8414.11.ccm.1437397985

You are prompted with a warning that core analysis will eat up CPU cycles. If you are doing this during core business hours, you will most likely not care about the CPU cycles because your system is already compromised.

Once the analysis has run, scroll down until you find the section "backtrace - CUCM". Review the trace and compare it to the conditions provided in CSCuu97800. If they line up, then you are definitely running into this bug.

Resolution

Simple. Follow the recommendations in the software defect. Take a couple of minutes to review the Readme file for the fix (for 10.5: http://www.cisco.com/web/software/282204704/18582/ciscocm.FQDNwithDNS-v1.0.k3.readme.pdf).

Then install the patch. It will restart the CallManager service. Again, given the nature of this defect, I doubt it matters much that this patch requires a service restart. That said, in a multi-node cluster environment, select specific nodes and patch them first. Then stop the CallManager service on the remaining nodes (allowing devices to fail over to the patched systems). After phones settle on a stable node, patch the rest of the cluster.
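
If you prefer the CLI over the OS Administration page, the rough shape of that sequence looks like the following (standard CUCM platform commands; the upgrade wizard prompts interactively for your SFTP server and the COP file name, so treat this as a sketch and defer to the Readme):

admin:utils system upgrade initiate
  (follow the prompts to point at your SFTP server and select the ciscocm.FQDNwithDNS COP file)

admin:utils service stop Cisco CallManager
  (on the not-yet-patched nodes, to push devices to the patched ones)

admin:utils service start Cisco CallManager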

We applied the necessary patch and so far, so good. 

Other Thoughts

The nature of this kind of defect definitely separates the pro troubleshooters from the amateurs. On the surface, it will appear like the whole world is coming down around you. When that happens, I follow a simple rule: Don't chase the symptoms.

If you have something like "All of my SIP calls are failing but everything else is working" then, by all means, follow the symptoms and try to isolate your fault domain. If you are seeing multiple, apparently unrelated symptoms across different services and devices, then following one symptom is going to waste time. Start from the "bottom up":

1. Check physical: network, compute resources, hypervisor management software (virtual resources fall here)

2. Check logical: focus on LACP, Layer 2 failures/convergence status indicators, Layer 3 routing topology changes, etc.

3. Check service logs on CUCM [this is where we got our first clue of a service issue; see the sketch after this list]

4. Check specific applications / features
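
For step 3, a couple of standard CUCM CLI commands give you a fast read on service health and recent cores (a sketch only; output formats vary by release):

admin:utils service list
  (look for Cisco CallManager showing STOPPED or repeatedly restarting)

admin:utils core active list
  (a growing pile of ccm core files, as in the listing above, is your clue)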

Once you find a clue, follow it. If you have a team working on the issue, pick one resource and have them start trolling for software defects while the other resources apply a logical troubleshooting methodology. If you are an integrator, line up three people: one to manage the customer, one to focus on the basic troubleshooting / data gathering, and one to work the vendor angle (including TAC).


Thanks for reading. If you have time, post a comment!

3 comments:

  1. We hit this issue as well.

    Our symptoms were:

    CallManager service failed at startup
    IMP services would not even ACTIVATE.

    DNS SRVs + GDPR with FQDN broke the world. We did not see any cores.

    Upgrading to the SU immediately resolved this. Great work by Cisco - I got an automated mail in my inbox while grepping logs that instantly identified these cluster defects as the root cause.

  2. Nice post Bill! Two things:

    1) Neither 10.5(2)SU2 nor 11.0(1) was deferred, despite the advisory notice email's From address containing the word "deferral". I replied to the advisory notice email and inquired about the deferral, and was informed that there is no deferral. Feel free to do the same; the more emails they receive about it, the more likely they'll change the name of the alias or use clearer language in the email about deferrals.

    2) Are the defects limited to systems where CUCM does not have DNS properly configured and therefore cannot correctly resolve the FQDN in the header? The other two accompanying defects mentioned in the COP file Readme state this, though the defect you referenced does not.

    In other words, for the customer of yours impacted by this defect, was their DNS set up in such a way that it could in fact correctly resolve the FQDN, or was that failing, and hence the core dump?

  3. Looks like this is listed as resolved in 10.5.2(SU2a)
