Men and Mice Suite


Symptom:

Information compiled by Men & Mice with input from NLnet Labs:

The Unbound[1] resolving DNS Server, a secure and high-performance DNS Server used to resolve DNS queries for client machines, has an internal data structure called the "request-list". This article gives background information on how this request-list works internally, and how it enables Unbound to survive Denial of Service (DoS) attacks while still serving good client requests.

Solution

The Unbound request-list can be seen as the internal “to-do” list of the Unbound DNS Server (there is one request-list per thread). Whenever a new query (a new “task” to work on) arrives, Unbound will place it in the request-list, unless the query can be served directly from cache.

The request-list is measured in “slots”. Each slot can hold one query at a time. The size of this request-list can be configured with the “num-queries-per-thread” configuration setting (default is 512 or 1024, depending on the compile options[2]).

Internally, Unbound splits the request-list into two equally sized parts. One part is called the “run-to-completion” list, the other the “jostle-list”. When there are free slots in the “run-to-completion” list, Unbound places new requests into this list. Every query placed into this list is worked on with all resources and time available until it is resolved to a meaningful answer by following the DNS delegation tree. If an answer is found (NOERROR, NODATA or NXDOMAIN), it is placed into the cache. If the name resolution fails, Unbound returns a “SERVFAIL” condition to the client. Queries in the “run-to-completion” list can time out if the underlying network queries handled by the TCP/IP stack time out (the maximum waiting time here is 2 minutes).

Unbound tries very hard on every query to find an answer. It will not give up at the first sign of trouble (for example if the first authoritative DNS Server for a domain name turns out to be “lame”). Instead it will try to find and query all authoritative DNS Servers available for the domain name, as long as resources allow. In worst-case scenarios, this can take a considerable amount of time. If there are many authoritative DNS Servers for a domain, and each of these DNS Servers has multiple IP addresses, the timeouts can accumulate.

When the “run-to-completion” list is full (the request-list has grown to half of its maximum size), Unbound starts placing new queries into the “jostle-list” (“to jostle”: to make one's way by pushing and shoving <jostling toward the exit>, see http://www.merriam-webster.com/dictionary/jostle). Life for a query is much rougher and more dangerous in the “jostle-list”. If a query placed in this part of the request-list takes longer than the configured jostle timeout (200 ms by default, tunable with the “jostle-timeout” configuration parameter) and the request-list has no empty slots left, the oldest query in the “jostle-list” is terminated and replaced (overwritten) by the new, fresh query. These overwritten queries are counted in the “total.requestlist.overwritten” statistics counter.

Usually, a normal DNS request resolves just fine within 200 ms on the Internet in the north-western hemisphere. If a DNS query there takes more than 200 ms, something is often non-optimal: a network problem, an error in the DNS configuration, an Unbound server on a network that is not well connected to the Internet, or an authoritative DNS server located outside our solar system ;) [3].

A different situation arises if the “jostle-list” is filled entirely with young queries, all below the 200 ms jostle-timeout threshold. Now Unbound cannot find an old query to kick out, so instead of kicking out young queries (and creating a “thrashing” condition in Unbound[4]), it drops the new incoming query. When Unbound has to drop new incoming queries because the “jostle-list” is full (and the request-list has by definition reached its maximum size), these lost DNS queries are counted in the “total.requestlist.exceeded” statistics counter. Request-list exceeded situations are usually caused by denial-of-service attacks. The same load pattern can also be caused by a misconfigured downstream DNS resolver, so it is not always an aggressive, deliberate attack.

In both cases (“jostle-list” entries overwritten, or new queries dropped because the request-list is exceeded), slots in the “run-to-completion” list are still freed in the usual way, and new incoming queries are placed into these free slots with priority. This ensures that Unbound is able to process normal queries even during a denial-of-service attack. Some good queries will be lost, but many will make it through, and the answers will be placed in the cache for future queries to use.
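The admission rules described above can be sketched as a small model. This is a simplified illustration, not Unbound's actual implementation: the class `RequestList`, its method names and the returned labels are hypothetical, and the model only tracks slot counts and arrival times for one thread.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RequestList:
    """Toy model of one thread's request-list (hypothetical names)."""
    size: int = 1024              # corresponds to num-queries-per-thread
    jostle_timeout: float = 0.2   # corresponds to jostle-timeout, in seconds
    run_to_completion: int = 0    # occupied slots in the run-to-completion half
    jostle: deque = field(default_factory=deque)  # arrival times, oldest first
    overwritten: int = 0          # models total.requestlist.overwritten
    exceeded: int = 0             # models total.requestlist.exceeded

    def admit(self, now: float) -> str:
        """Place a new (non-cached) query and report what happened to it."""
        half = self.size // 2
        # Free run-to-completion slots are always used first.
        if self.run_to_completion < half:
            self.run_to_completion += 1
            return "run-to-completion"
        # Otherwise the query goes to the jostle-list while it has room.
        if len(self.jostle) < self.size - half:
            self.jostle.append(now)
            return "jostle"
        # Request-list is full: replace the oldest jostle entry if it is
        # older than the jostle timeout, otherwise drop the new query.
        if now - self.jostle[0] > self.jostle_timeout:
            self.jostle.popleft()
            self.jostle.append(now)
            self.overwritten += 1
            return "overwritten"
        self.exceeded += 1
        return "exceeded"

    def finish_run_to_completion(self) -> None:
        """A run-to-completion query resolved or failed; free its slot."""
        if self.run_to_completion > 0:
            self.run_to_completion -= 1
```

In this model, a request-list of 4 slots that is completely full drops a query arriving while the oldest jostle entry is younger than 200 ms ("exceeded"), replaces that entry once it is older ("overwritten"), and sends the next query to the run-to-completion half as soon as a slot there has been freed.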

Also, because the oldest query is taken from the “jostle-list”, queries in the “jostle-list” in effect have more than 200 ms to do their work if the incoming query rate allows for it. If the rate is only moderately high, 300 ms could effectively be allowed, for example. To calculate: if the “jostle-list” has 2000 elements and each slot takes 200 ms, this satisfies an incoming rate of 10,000 qps (not caught by cache hits). At 6,000 qps, elements have 333 ms to complete. If more than 10,000 qps are sent to this server, the oldest element on the “jostle-list” has not yet reached the age of 200 ms when a new query arrives. (If you include 80% cache hits in the calculation, you can imagine this server usefully replying to 50,000 qps based on the “jostle-list”; in reality the other half of the list can help as well.) The “run-to-completion” part of the list is meant to fill up the cache (with elements that take long to resolve).
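The arithmetic in this paragraph can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are hypothetical, and it assumes that every jostle-list slot is replaced in strict oldest-first order at a constant rate of cache-miss queries.

```python
def sustainable_miss_rate_qps(jostle_slots: int, jostle_timeout_ms: float) -> float:
    """Cache-miss rate at which each jostle slot gets exactly the jostle timeout."""
    return jostle_slots * 1000 / jostle_timeout_ms


def time_per_query_ms(jostle_slots: int, miss_rate_qps: float) -> float:
    """Time a query can stay in the jostle-list before it becomes the oldest entry."""
    return jostle_slots * 1000 / miss_rate_qps


def client_rate_qps(miss_rate_qps: float, cache_hit_ratio: float) -> float:
    """Total client query rate that produces the given cache-miss rate."""
    return miss_rate_qps / (1 - cache_hit_ratio)


if __name__ == "__main__":
    slots = 2000  # jostle-list size used in the example above
    print(round(sustainable_miss_rate_qps(slots, 200)))  # rate at 200 ms per slot
    print(round(time_per_query_ms(slots, 6000)))         # time per query at 6,000 qps
    print(round(client_rate_qps(10_000, 0.8)))           # client rate at 80% cache hits
```

With the values from the text (2000 slots, 200 ms, 6,000 qps, 80% cache hits) the three functions give approximately 10,000 qps, 333 ms and 50,000 qps.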

If you monitor your Unbound server using the statistics function of “unbound-control”, you will see the counters for the average and maximum utilization of the request-list, and the counters for “overwritten” and “exceeded”[5]. As long as the average utilization of the Unbound request-list is less than or equal to half of the configured request-list size, your Unbound DNS server is doing fine. If you see the request-list filling up to more than half, or even see non-zero “overwritten” and “exceeded” counter values during normal production use of the DNS Server (especially over a longer period of time), it is recommended to investigate the root cause of this high utilization of the request-list. Possible causes are that the DNS server's IP address is blacklisted, or that Unbound is trying to use IPv6 when there is no real IPv6 connectivity available (in this case, configure “do-ip6: no”).
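The check described above can be automated along the following lines. This is a minimal sketch that assumes the statistics are available as “name=value” lines, as printed by “unbound-control stats”. The helper names `parse_stats` and `request_list_warnings` are hypothetical, and comparing “total.requestlist.avg” directly against half of “num-queries-per-thread” is a simplification of the rule of thumb given in this article.

```python
def parse_stats(text: str) -> dict:
    """Parse 'name=value' lines into a dict of floats; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if not sep:
            continue
        try:
            stats[name] = float(value)
        except ValueError:
            pass  # ignore non-numeric values
    return stats


def request_list_warnings(stats: dict, num_queries_per_thread: int) -> list:
    """Return human-readable warnings based on the request-list counters."""
    warnings = []
    avg = stats.get("total.requestlist.avg", 0.0)
    if avg > num_queries_per_thread / 2:
        warnings.append(f"average request-list usage {avg} is above half the list size")
    for counter in ("total.requestlist.overwritten", "total.requestlist.exceeded"):
        if stats.get(counter, 0.0) > 0:
            warnings.append(f"{counter} is non-zero ({int(stats[counter])})")
    return warnings


if __name__ == "__main__":
    # Made-up sample values for illustration, not output from a real server.
    sample = (
        "total.requestlist.avg=600.5\n"
        "total.requestlist.max=1024\n"
        "total.requestlist.overwritten=12\n"
        "total.requestlist.exceeded=0\n"
    )
    for warning in request_list_warnings(parse_stats(sample), 1024):
        print(warning)
```

In production the text would come from running “unbound-control stats” (or “stats_noreset”, to avoid resetting the counters); a single non-zero counter after a restart is expected, so such a check is most useful when evaluated over a longer period of time.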

In real-world production environments, however, the issue(s) causing the sub-optimal query processing inside Unbound might be difficult to find. We will post our findings in the Men & Mice FAQ System, as well as on the Unbound mailing list. Men & Mice would be happy to assist if you see issues with the Unbound request-list (send an e-mail to support@menandmice.com).

[1] This article covers Unbound 1.4.6 and newer. Older versions of Unbound have an issue with the request-list handling that is solved in version 1.4.6.

[2] The request-list is 512 slots per thread when compiled with the built-in event handler, and 1024 when compiled with “libevent” or “libev”.

[3] The 200 ms is also meant to allow at least one round trip to an authoritative server and storing the result in the cache, so that even if the request could not be completed, some intermediate result has already been cached. In networks with long ping times (high latency) it is recommended to increase the “jostle-timeout” parameter to slightly above the ping times to outside authoritative DNS Servers.

[4] http://en.wikipedia.org/wiki/Thrashing_(computer_science): “In computer science, thrashing is a situation where large amounts of computer resources are used to do a minimal amount of work, with the system in a continual state of resource contention.”

[5] The “total.requestlist.exceeded” counter is also increased if there are too many IP sources asking for a query that is already in the request-list. Unbound then does not allocate extra memory for them. The total amount is scaled depending on the “num-queries-per-thread”. The exceeded counter is also increased when the Unbound process is stopped or restarted and the queries still waiting in the request-list have to be terminated.