Wednesday, February 22, 2006

DEBUG_THREADID=1 (Show-n-Tell Thursday)

For my own benefit as much as that of others, I try to maintain a list of debugging parameters for Notes/Domino. For this Show-n-Tell Thursday (SnTT - per Chris Linfoot) tip, I would like to highlight the benefits of one debug parameter, with a real-world example of how it helped determine the cause of a crash.

The debugging parameter is:

DEBUG_THREADID=1
Per this Lotus technote, "This prefixes the console output with the process and threadid information in the format [ProcessID:Virtual Thread ID-Native Thread ID]. This can be helpful in identifying the process or thread holding a semaphore."
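To make the prefix format concrete, here is a minimal Python sketch that splits a prefixed console line into its parts. The function name and the regular expression are my own illustration, not anything shipped with Domino, and I am assuming each ID is a plain run of hex digits, as in the entries shown in this post.

```python
import re

# Prefix layout per the technote: [ProcessID:Virtual Thread ID-Native Thread ID]
# Assumption: each field is a run of hex digits; the values are kept as strings.
PREFIX_RE = re.compile(
    r"^\[(?P<pid>[0-9A-Fa-f]+):(?P<vtid>[0-9A-Fa-f]+)-(?P<ntid>[0-9A-Fa-f]+)\]"
    r"\s*(?P<message>.*)$"
)

def parse_console_line(line):
    """Return a dict with pid, vtid, ntid and message, or None if there is no prefix."""
    match = PREFIX_RE.match(line.strip())
    if match is None:
        return None
    return match.groupdict()
```

Lines without the prefix (for example, output captured before the parameter was turned on) simply come back as None.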

Here is an example from the NSD of a crash we experienced this week:
Fatal Error signal=0x00000001 JOB=AMGR/QNOTES/070217 PID/TID=17095/0x000001c9
2/20/2006 4:39:47 Fault cleanup is in progress


Obviously that informs the Administrator that the AMGR task caused the crash. While this is a good start, it leaves one wondering which agent caused the crash. Note, though, the PID/TID (Process ID/Thread ID) value at the end of the Fatal Error line. Searching through a dump of the console (or console.log, etc.), I would look specifically for the string "000001c9", since it is the Thread ID. Ignore case when searching, because the console prints the ID in uppercase. It is best to search starting at the bottom, so that the most recent entries for the thread turn up first. When I do that, I come to the following 3 console entries:

[17095:00002-000001C9] 02/20/2006 09:36:03 AMgr: Agent ('OS Administration' in 'workflow/costplus.nsf') message box: (unknown constant -MsgText002-)
[17095:00002-000001C9] 02/20/2006 09:36:06 AMgr: Agent ('OS Administration' in 'workflow/E16.nsf') error message: Object variable not set
[17095:00002-000001C9] 02/20/2006 09:36:06 AMgr: Agent ('OS Administration' in 'workflow/E16.nsf') error message: Cannot find external name: INITIALIZEADMINISTRATION
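The bottom-up search described above can be sketched in Python roughly as follows. This is only an illustration: the function names are made up, and I am assuming the process ID is decimal and the native thread ID is hex in both the NSD line and the console prefix, which is how they appear in this crash.

```python
import re

# Matches "PID/TID=17095/0x000001c9" in the NSD Fatal Error line.
FATAL_RE = re.compile(r"PID/TID=(\d+)/0x([0-9A-Fa-f]+)")
# Matches the DEBUG_THREADID prefix: [pid:virtual tid-native tid]
ID_RE = re.compile(r"^\[(\d+):[0-9A-Fa-f]+-([0-9A-Fa-f]+)\]")

def crashing_thread(nsd_line):
    """Return (pid, native_tid) as integers, or None if the line has no PID/TID."""
    match = FATAL_RE.search(nsd_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2), 16)

def last_entries_for_thread(console_lines, pid, tid, limit=3):
    """Scan from the bottom and return the last `limit` lines written by pid/tid."""
    hits = []
    for line in reversed(console_lines):
        match = ID_RE.match(line)
        # Comparing as integers sidesteps upper/lower case and zero padding.
        if match and int(match.group(1)) == pid and int(match.group(2), 16) == tid:
            hits.append(line)
            if len(hits) == limit:
                break
    hits.reverse()  # back to chronological order
    return hits
```

To use it, read console.log into a list of lines, pass the Fatal Error line to crashing_thread, and hand the resulting pair to last_entries_for_thread.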

This informs me of several things regarding this AMGR task. First, we see that at this time there were two different databases as possible culprits for this error. But, since an AMGR task can only run against a single database per thread, I can rule out the top one as the problem. Second, I see that the last entry for this thread was an error in a specific database, with a specific agent having an issue finding an "external name". In the past, I have generally been able to clear this error up with a design refresh or a recompile of the LotusScript. (This agent, by the way, is an unmodified agent in a Lotus Workflow database from IBM - I would never (wink, wink) write an agent that crashes a server...) Third, we see that the last action by this thread was about 3 minutes prior to the server crash. It is possible that another agent is the culprit. It is also possible that this agent simply hung for several minutes on this error or on a subsequent error in the same agent. Further debugging may be necessary, possibly along with a call to Lotus Support. If it is a specific agent, you could add some print statements to see where the agent is hanging.

This is, of course, not specific to the AMGR task, and it is most beneficial if you have more than one of the same task running on your system. I have about 5 AMGR tasks and multiple CLREPL and UPDATE tasks. I would definitely recommend turning this on as a first step in troubleshooting!

Now Playing: "Pull Me Under" by Accomplice
