Most bizarre BES problem I have ever had
OK, there is a really odd problem with my BES, not even sure how to describe it.
* Lotus Domino Server (Release 7.0.2 for Windows/32) - BES is NOT the Domino mail server, it actually connects to four mail servers (2 on same LAN in UK, 2 via VPNs in USA/CA)
* BlackBerry Enterprise Server, Version 220.127.116.11
* Windows 2000 Server
Problem seems to be:
* In a nutshell, BES slowly gives up processing anything for users.
- I do not ever receive a BES alert email
- When I check the server, Domino is running
- When I check "show tasks" on Domino, BES is running
- When I check services, all BlackBerry services are running
How does problem present?
- One by one, BES seems to just start "ignoring" the fact it has certain users.
- Watching the server console, it never seems to scan these users' inboxes for mail. Nor does it attempt to contact the device. Users find that they just "don't receive" any mail, and can have problems sending mail and completing lookups.
- Server does not show errors for those users, it just *ignores* the fact they exist
- Affected users are on different mail servers, so this is not (obviously) linked to one of the 4 domino servers running mail
- Only "fix" seems to be restarting Domino & BES
- Once this issue has appeared (it does not happen for some time after a restart), Domino & BES will not shut-down cleanly, I normally have to end task on Domino
- Have to put all services to manual before server reboot, and start them when server is up - when they are on auto, Domino & BES do not start reliably
- Domino does not seem able to start BES services any more, even when BES services are set to automatic. I bring up Domino, which says it has loaded BES but nothing processes, then I have to go to each BB service and manually start them.
- Different users are affected on different days
- Problem starts with one user, then another, then another, until everyone is affected, which makes it hard to know when the server is having problems. For example, today my BB is fine, but 2 colleagues on the same mail server have had no mail since the early hours of the morning.
Any error messages?
* Only in logs, never on BES or Domino console
* These messages will start appearing in Application Logs on Windows for affected users:
Event Type: Warning
Event Source: BlackBerry Messaging Agent XXXXXXXX
Event Category: None
Event ID: 20148
The description for Event ID ( 20148 ) in Source ( BlackBerry Messaging Agent BLACKBERRY ) cannot be found. The local computer may not have the necessary registry information or message DLL files to display messages from a remote computer. You may be able to use the /AUXSOURCE= flag to retrieve this description; see Help and Support for details. The following information is part of the event: Thread: *** No Response *** Thread Id=0x175C, Handle=0x8F0, WaitCount=121, Last Activity: New Message for user Mark Smith/OU/ORG.
Event Type: Warning
Event Source: BlackBerry Messaging Agent XXXXXXXX
Event Category: None
Event ID: 20149
Thread 175C, utilization=0.0000%, failed health check 121 times
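To see which threads are stuck and what they were last doing, the two warning texts above can be pulled apart with a small script. This is only a sketch: the patterns are built from the two sample messages quoted here, the function name is made up for illustration, and treating WaitCount as the same counter as the "failed health check" number is an assumption based on both samples showing 121.

```python
import re

# Patterns modelled on the two sample warnings (Event IDs 20148 and 20149).
# Other BES versions may word these messages differently.
NO_RESPONSE = re.compile(
    r"Thread Id=0x(?P<tid>[0-9A-Fa-f]+), Handle=0x[0-9A-Fa-f]+, "
    r"WaitCount=(?P<count>\d+), Last Activity: (?P<activity>.*?)\.?\s*$"
)
HEALTH_CHECK = re.compile(
    r"Thread (?P<tid>[0-9A-Fa-f]+), utilization=[\d.]+%, "
    r"failed health check (?P<count>\d+) times"
)


def summarize_stuck_threads(lines):
    """Map thread id -> highest failure count seen and last known activity."""
    threads = {}
    for line in lines:
        no_resp = NO_RESPONSE.search(line)
        health = HEALTH_CHECK.search(line)
        match = no_resp or health
        if match is None:
            continue
        entry = threads.setdefault(
            match["tid"].upper(), {"fails": 0, "last_activity": None}
        )
        # Assumption: WaitCount and the health check count are the same counter.
        entry["fails"] = max(entry["fails"], int(match["count"]))
        if no_resp:
            entry["last_activity"] = no_resp["activity"]
    return threads
```

Fed the exported event descriptions one per line, it returns one entry per thread id, so a thread that keeps reappearing with a growing count and the same user in its last activity stands out quickly.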
Has anyone seen anything like this before?
Any ideas where to start?!
Any requests for further info/Win logs/BES logs please let me know and I will post.
The result of this is that I have a defective and very difficult to manage BES service. This has been going on for a week or so now, and I am running out of ideas!
PLEASE HELP IF YOU CAN!
The alerting service was bunk in 4.1.3 and 4.1.4. It is supposed to be fixed in 4.1.5.
When a thread fails a health check it's going to fail for the users that are utilizing that particular thread. So in your case "Thread 175C, utilization=0.0000%, failed health check 121 times": if there were 10 people on that thread, 10 people will not have service. When you start seeing threads failing it usually waterfalls over time...meaning more and more threads are going to fail.
How many users do you have? What is your current threading model on the BES and what is it optimized to? Also what lvl is your logging set to right now?
32 users on this BES
Sorry no idea about threading model - how/where would I check?
Log levels are currently:
A lot of these were at 2, but we tried changing the logging levels after this issue started.
How long does it take to stop processing?
When we had our old 2.2 BES on Windows 2000 it did something similar to this about every 3 or 4 weeks. We would restart and it would be fine. I did not have to play with the services though.
You could try updating to 4.1.5.
Is 4.1.5 OK though? I've seen a lot of posts for 4.1.5 and Domino issues, but have not had time to look into how serious they are.
When this started a few weeks ago, it'd take days to start messing up. Now it is doing it in a matter of hours.
I have just noticed through testing:
-Restarting individual BES services does not appear to help (only tested Controller, Router & Dispatcher)
- Doing a "tell bes quit" then "load bes" on the Domino server clears the problem... albeit temporarily!
What's the latency of each mail server to the BES? If it's over 40ms, it won't work properly.
The 2 mailservers on same LAN as BES are fine.
Hard to test the overseas servers, they are on VPNs with firewall rules that only allow notes traffic over. Can't ping.
However, nothing on the VPNs has changed, and typically it is threads to one of the local servers that start to die first. So I don't think latency is an issue - particularly when "tell bes quit" "load bes" fixes it (temporarily) without a Domino restart or server restart or anything else...
Perhaps try to increase your MaxTotalThreads from default of 40 decimal to 80 decimal. This might just be a band aid however, as I believe the issue to be latency over the VPN links.
What is your ping time from your BES to your Domino mail servers over the VPN? If you can't ping how about NotesConnect (nPing)?
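If ICMP is blocked but Notes traffic is allowed through, timing a plain TCP connect to the Notes port gives a rough stand-in for ping. The sketch below assumes the mail servers listen on the standard Notes/Domino port 1352 and that the firewall lets the BES host reach it; the function names and the hostname in the usage comment are made up for illustration. A TCP connect takes roughly one round trip, so the figures are only approximate.

```python
import socket
import time


def tcp_connect_ms(host, port=1352, attempts=5, timeout=5.0):
    """Time several TCP connects; returns milliseconds, or None per failure."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            # Refused, unreachable or timed out: record as a failed attempt.
            results.append(None)
    return results


def summarize(results):
    """Return min/avg/max of the successful attempts, or None if all failed."""
    ok = [r for r in results if r is not None]
    if not ok:
        return None
    return {
        "min": min(ok),
        "avg": sum(ok) / len(ok),
        "max": max(ok),
        "failed": len(results) - len(ok),
    }


# Usage (hypothetical hostname):
#   print(summarize(tcp_connect_ms("mail-us-1.example.com")))
```

Run it from the BES box itself against each of the four mail servers, once while things are healthy and once while threads are failing, and compare the averages.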
Are your VPN users' mail files growing significantly? That could explain the issues now vs before. Scanning large mail files can be intensive even over our high speed, low latency WAN links within the US; I can't imagine doing that over a VPN link overseas.
I really think the ultimate solution is to set up an additional BES VM in the US connecting via WAN or LAN links to the mailboxes in US / CA.
Also if you could perform the following command on your MAGT log and post or PM the results that would help:
grep -i "thread.*pool" [BESServerName]_MAGT_01_20080501_0001.txt
For example my server shows this:
[ENV] ThreadpoolOptimizationInterval = 240
[ENV] NumThreadPools = 10
Optimize ThreadPools, total number of users 680
No empty thread-pools were found.
Thread pool for mail server (DominoServer1) has 20 threads to serve 172 handhelds
Thread pool for mail server (DominoServer2) has 39 threads to serve 348 handhelds
Thread pool for mail server (DominoServer3) has 2 threads to serve 6 handhelds
Thread pool for mail server (DominoServer4) has 17 threads to serve 151 handhelds
Thread pool for mail server (DominoServer5) has 2 threads to serve 3 handhelds
Note that the total number of threads adds up to 80, whereas 40 is the BES default.
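On a Windows box without grep, the same filter and the thread total can be done with a few lines of Python. This is a rough sketch: it assumes the MAGT log lines look like the sample output above, and the function names are invented for illustration.

```python
import re

# Same filter as: grep -i "thread.*pool"
GREP_PATTERN = re.compile(r"thread.*pool", re.IGNORECASE)

# Matches lines such as:
#   Thread pool for mail server (DominoServer1) has 20 threads to serve 172 handhelds
POOL_LINE = re.compile(
    r"Thread pool for mail server \((?P<server>[^)]+)\) has (?P<threads>\d+) "
    r"threads? to serve (?P<handhelds>\d+) handhelds?"
)


def grep_thread_pool(lines):
    """Keep only the lines the grep command above would print."""
    return [line for line in lines if GREP_PATTERN.search(line)]


def thread_pools(lines):
    """Return (server, threads, handhelds) for each thread pool line."""
    pools = []
    for line in lines:
        match = POOL_LINE.search(line)
        if match:
            pools.append(
                (match["server"], int(match["threads"]), int(match["handhelds"]))
            )
    return pools


def total_threads(pools):
    """Sum the thread counts across all mail servers."""
    return sum(threads for _, threads, _ in pools)
```

To use it, read the log with `open(path, errors="replace")` and pass the lines to `thread_pools`. Comparing `total_threads` against the configured maximum shows whether the pools are already using every available thread.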
I think we have found our problem.
No wonder why it's going sideways on you.
I'm having trouble finding a command line grep utility for windows - anyone know where to get one?
Scratch that last question, found it
Thankfully there are only a small number of users on this BES. All the USA users have replicas on the local servers, so in the short term (assuming the thread increase doesn't work) I might just need to point their mail at the local replicas, and set up a fast replication schedule between those and the servers in the States.
That would work. I have heard of an enhancement request to be able to choose which replica to point to instead of pulling it from the person doc, but I doubt if or when that will be implemented.
Another part of the problem (which I did not state above, as it seemed less urgent)...
Around the time that all this faffing about started, we noticed problems with starting BES.
If I rebooted the server and left everything on manual, then although all the services reported as started, they actually would not do anything (BlackBerry services and Domino). Currently, the only way I can get the server to come up and run is:
a) Set all BB and Domino services to Manual
b) Restart server
c) Set all the BB services to automatic
d) Log-on to Windows and start Domino as a service
e) Notice (again) that Domino seems unable to fully start BES, so then I set all the BES services to Automatic, and have to do a "tell BES quit" - "load BES" on the console
f) Then everything works
If I accidentally reboot the server with the BES services set to Automatic, I have to power off the server (as it hangs when windows starts) and boot into safe mode, set everything to manual and start again.
Anyway sorry if that doesn't make sense. That'll be because I was here till 22.30 last night, and back again at 06.30 this morning. I don't like spending so much time at work!!! :cry: :cry: :cry:
Bloody BES. It's at times like this I miss the old days, when Execs accepted "when you're out of the office, you don't have email". Gah.
Well, it seems that upgrading Win2000 server to Win2003 fixed, er, all of it.
I can't explain why!
I will post again if this issue kicks up again.
Thanks to everyone who posted...
Bloody windows. When will they port BES to Linux? :)