Summary
Three server redundancy (triad) failures.
Question
We're having problems with the stability of redundant servers. The problem has been intermittent for years. There is no specific error we're seeing or other deterministic data available. We're currently collecting:
- Number of TCP open requests per second to the lmgrd daemon.
- Service time in microseconds (from accept on the port to shutdown) of every request to the lmgrd daemon.
- The socket receive backlog (set to 500 by lmgrd), socket drops, and socket overflows on the port, measured every 10 seconds.
- File descriptor counts, sampled every 10 seconds for each of the three servers (lmgrd and two vendor daemons).
The only thing that correlates is the socket receive backlog: it increases or maxes out, and the socket overflow count usually increments when the redundant servers fail to maintain quorum. An increase in the socket backlog also leads to partial failures to verify a host.
Any ideas as to why this may be happening?
Answer
This is one of the many symptoms that can occur when license servers are loaded beyond their capacity. Symptoms can range from clients timing out to, as in this case, quorum being lost in a triad.
Engineering cannot practically engage with this unless a deterministic error behavior is reproduced, or an enhancement request designed to mitigate the high-load behavior is raised (which again requires determinism).
The kinds of things that you could experiment with today to alleviate lost-quorum are:
- Reduce licenses served per triad.
- Increase hardware resources on primary and secondary nodes (a high CPU count is more important than the number of cores or threads).
- Consider upgrading the vendor daemon to the latest version - there may be some benefit from our recent select->poll changes (such as FNP-17708: "Remove the OS select() call from the FNP code-base for all non-Windows platforms" implemented with FNP 11.15.1).
- Set LM_SERVER_HIGHEST_FD (see docs) to a value in the low hundreds, down from the default of 1024. This has the effect of lowering the accepted client connections per second to lmgrd, which may help mitigate lost-quorum situations.
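As an illustration of the last item, here is a minimal Python sketch of starting lmgrd with LM_SERVER_HIGHEST_FD set. It assumes the setting is read from the server process environment at startup (check the docs for your FNP version). The value 200, the helper names, and the paths in the usage note are hypothetical placeholders, not recommendations.

```python
import os
import subprocess


def build_server_env(highest_fd, base_env=None):
    """Return a copy of the environment with LM_SERVER_HIGHEST_FD set.

    Assumption: the license server reads this variable from its
    environment when it starts.
    """
    if highest_fd <= 0:
        raise ValueError("highest_fd must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["LM_SERVER_HIGHEST_FD"] = str(highest_fd)
    return env


def start_lmgrd(lmgrd_path, license_file, debug_log, highest_fd=200):
    """Hypothetical helper: launch lmgrd with the lowered descriptor limit."""
    command = [lmgrd_path, "-c", license_file, "-l", debug_log]
    return subprocess.Popen(command, env=build_server_env(highest_fd))
```

A call such as `start_lmgrd("/opt/flexlm/lmgrd", "/opt/flexlm/license.lic", "/var/log/lmgrd.log")` would start the server with the limit applied; all three paths are examples only. Applying the same value on each node keeps the triad members consistent.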
In general, an approach to mitigate lost-quorum is to control client load. Enhancement FNP-18732: "Have a graceful and predictable error just before VD hits its producer-tested limits" is under consideration to help mitigate this type of situation.
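To judge whether a change in client load has the intended effect, the socket overflow counters mentioned in the question can be tracked between samples. The following is a minimal sketch, assuming a Linux host where `/proc/net/netstat` exposes the `ListenOverflows` and `ListenDrops` counters under `TcpExt`. These counters are system-wide rather than per port, and the function names are illustrative.

```python
def parse_tcpext(netstat_text):
    """Parse the TcpExt header/value line pair into a name -> int dict."""
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            names = header.split()[1:]
            numbers = [int(v) for v in values.split()[1:]]
            return dict(zip(names, numbers))
    return {}


def listen_counter_deltas(previous, current):
    """Return how much each listen-queue counter grew between two samples."""
    return {
        name: current.get(name, 0) - previous.get(name, 0)
        for name in ("ListenOverflows", "ListenDrops")
    }


def read_listen_counters(path="/proc/net/netstat"):
    """Read the current counters from the kernel (Linux only)."""
    with open(path) as handle:
        return parse_tcpext(handle.read())
```

Calling `read_listen_counters()` every 10 seconds and passing consecutive results to `listen_counter_deltas` gives a per-interval overflow count. That count can be lined up against the timestamps of lost-quorum events in the debug log.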