Discussion:
[musl] Queries with fewer than `ndots` dots never lead to resolution using the global namespace if the `search` domains don't work
d***@glencore.com
2017-03-15 10:28:15 UTC
As you can see from the comments starting here:

https://github.com/gliderlabs/docker-alpine/issues/8#issuecomment-223901519

quite a number of people are finding that the `search` and `domain` support added to musl libc doesn't work in their case. In that same issue I wrote my findings up, here:

https://github.com/gliderlabs/docker-alpine/issues/8#issuecomment-286561614

which I'll duplicate here so that it's archived on the mailing list:

[MESSAGE]
Even going by its own documentation, this appears to be a bug in musl libc:

[QUOTE]
musl's resolver previously did not support the "domain" and "search" keywords in resolv.conf. This feature was added in version 1.1.13, but its behavior differs slightly from glibc's: queries with fewer dots than the `ndots` configuration variable are processed with search first then tried literally (just like glibc), but those with at least as many dots as `ndots` are only tried in the global namespace (never falling back to search, which glibc would do if the name is not found in the global DNS namespace). This difference comes from a consistency requirement not to return different results subject to transient failures or to global DNS namespace changes outside of one's control (addition of new TLDs).
[/QUOTE]

While I can confirm the second part (queries with at least `ndots` dots never fall back to search), the first part (queries with fewer than `ndots` dots fall back to an absolute query) isn't what I observe.

Using dig in an Ubuntu container to resolve the nonsensical name `google.com.default.svc.cluster.local` (which simulates the kind of first search-domain query a short name would trigger) returns a `QUESTION SECTION` and an `AUTHORITY SECTION`, but no `ANSWER SECTION`. This should cause musl libc to attempt the absolute query (`google.com`) instead, yet, judging by the final result of the lookup, it doesn't.

Here's the (tiny) commit where support for search and domain was added to musl libc, and here's the `name_from_dns` function that the diff relies on. I think the `dns_parse_callback` function may be what determines whether we consider that we've received a result, and the code indicates that this only happens if we receive an `A`, `AAAA` or `CNAME` record. In our case there's no `ANSWER SECTION` at all, so no result should be recorded.
[/MESSAGE]
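The ordering described in the quoted musl documentation can be sketched as a small decision function. This is an illustration, not musl's source: `lookup_order_for`, `count_dots` and the enum are hypothetical names, and treating a name with a trailing dot as absolute (literal lookup only) is an assumption based on the usual resolver convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: the two lookup orders described in musl's documentation. */
enum lookup_order {
    SEARCH_THEN_LITERAL, /* fewer dots than ndots: search list first, then the name as-is */
    LITERAL_ONLY         /* at least ndots dots: global namespace only, no search fallback */
};

/* Count every '.' in the name. */
static size_t count_dots(const char *name)
{
    size_t n = 0;
    for (const char *p = name; *p; p++)
        if (*p == '.')
            n++;
    return n;
}

/* Decide the lookup order for `name` given resolv.conf's ndots value.
   Assumption: a trailing dot marks an absolute name, so search is skipped. */
static enum lookup_order lookup_order_for(const char *name, size_t ndots)
{
    size_t len = strlen(name);
    if (len > 0 && name[len - 1] == '.')
        return LITERAL_ONLY;
    return count_dots(name) < ndots ? SEARCH_THEN_LITERAL : LITERAL_ONLY;
}
```

Assuming `ndots:5` (a common Kubernetes pod default; the actual `options` line isn't shown in this thread), `google.com` has only one dot, so it goes through the search list first, which is why the lookup starts with a name like `google.com.default.svc.cluster.local`.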

I'd really like to help debug this one if at all possible, and would appreciate any pointers on how best to go about it.

Thanks, Dominic.
LEGAL DISCLAIMER. The contents of this electronic communication
and any attached documents are strictly confidential and they may not
be used or disclosed by someone who is not a named recipient.
If you have received this electronic communication in error please notify
the sender by replying to this electronic communication inserting the
word "misdirected" as the subject and delete this communication from
your system.
Rich Felker
2017-03-15 12:25:15 UTC
Post by d***@glencore.com
https://github.com/gliderlabs/docker-alpine/issues/8#issuecomment-223901519
https://github.com/gliderlabs/docker-alpine/issues/8#issuecomment-286561614
[...]
While I can confirm the second part (queries greater
than `ndots` never fall-back to using search), the first part
(queries smaller than `ndots` fall-back to using an absolute query)
isn't what I observe.
Using dig on an Ubuntu container and attempting to resolve the
nonsensical query `google.com.default.svc.cluster.local` (simulates
the type of initial query for a short domain that would be
occurring) returns a `QUESTION SECTION` and an `AUTHORITY SECTION`,
but no `ANSWER SECTION`. This should cause musl libc to attempt to
resolve the absolute query (`google.com`) instead, yet it doesn't
seem to based on the final result of the query.

This is where your problem lies. A response with an empty answer
section is an affirmative answer that the requested name exists but
has no records of the requested type (A or AAAA). In this case the
answer must be accepted; otherwise results are inconsistent depending
on how the query is performed. See the previous discussion of the same
topic here: http://www.openwall.com/lists/musl/2017/01/19/4
and commit 0fef7ffac114befc94ab5fa794a1754442dcd531.

To fix the problem, whatever local nameserver is returning affirmative
no-A-record results for nonexistent domains needs to be fixed to
return NxDomain.
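The rule described above can be summarised as a per-candidate decision. The sketch below is a hypothetical illustration, not musl's implementation: `classify_reply` and its enum are invented names, `rcode` is the 4-bit RCODE from the reply header, and `n_addrs` stands for the number of usable address records the reply yielded. Only NXDOMAIN (RCODE 3) moves on to the next search candidate; a NOERROR reply with no addresses is accepted as a final "name exists, no address" answer.

```c
#include <stddef.h>

/* Hypothetical outcome of one reply for one search candidate. */
enum candidate_result {
    TRY_NEXT_CANDIDATE, /* NXDOMAIN: the name does not exist, keep searching */
    STOP_NO_ADDRESS,    /* NOERROR, no A/AAAA: the name exists, accept "no address" */
    STOP_WITH_ADDRESS,  /* NOERROR with usable address records */
    STOP_WITH_ERROR     /* SERVFAIL or other errors: fail instead of moving on */
};

static enum candidate_result classify_reply(unsigned rcode, size_t n_addrs)
{
    switch (rcode) {
    case 3: /* NXDOMAIN */
        return TRY_NEXT_CANDIDATE;
    case 0: /* NOERROR */
        return n_addrs ? STOP_WITH_ADDRESS : STOP_NO_ADDRESS;
    default:
        return STOP_WITH_ERROR;
    }
}
```

Stopping on SERVFAIL and similar codes follows the consistency requirement in the quoted musl documentation: moving on past a transient failure could make the same query return different answers at different times.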

Rich
d***@glencore.com
2017-03-15 12:58:02 UTC
Hi Rich,

Thanks for the prompt response here. Apologies for any confusion I may have created, but I think the server is responding with an overall `NXDOMAIN` response. This is what I get from running `dig google.com.default.svc.cluster.local`:

```
; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.com.default.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20863
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.default.svc.cluster.local. IN A

;; AUTHORITY SECTION:
cluster.local. 60 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1489579200 28800 7200 604800 60

;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Mar 15 12:49:14 UTC 2017
;; MSG SIZE rcvd: 147
```

Although there's less information with nslookup, the response from running `nslookup google.com.default.svc.cluster.local` seems even more definitive:

```
Server: 10.43.0.10
Address: 10.43.0.10#53

** server can't find google.com.default.svc.cluster.local: NXDOMAIN
```

Maybe I was just reading too much into the output from dig regarding exactly what was being returned from the server. Any further thoughts?

Thanks again, Dominic.

Rich Felker
2017-03-15 15:11:03 UTC
Post by d***@glencore.com
HI Rich,
Thanks for the prompt response here. Apologies for any confusion I
may have created, but I think the server is responding with an
overall `NXDOMAIN` response. This is what I get from running `dig
google.com.default.svc.cluster.local`:
```
; <<>> DiG 9.10.3-P4-Ubuntu <<>> google.com.default.svc.cluster.local
;; global options: +cmd
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 20863
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;google.com.default.svc.cluster.local. IN A
cluster.local. 60 IN SOA ns.dns.cluster.local. hostmaster.cluster.local. 1489579200 28800 7200 604800 60
;; Query time: 0 msec
;; SERVER: 10.43.0.10#53(10.43.0.10)
;; WHEN: Wed Mar 15 12:49:14 UTC 2017
;; MSG SIZE rcvd: 147
```
Although there's less information with nslookup, the response from
running `nslookup google.com.default.svc.cluster.local` seems even
more definitive:
```
Server: 10.43.0.10
Address: 10.43.0.10#53
** server can't find google.com.default.svc.cluster.local: NXDOMAIN
```
Maybe I was just reading too much into the output from dig regarding
exactly what was being returned from the server. Any further
thoughts?

Can you send an strace log of an affected lookup with musl's resolver
(rather than dig/nslookup which use bind's resolver) for me to look
at? Attached is source for a trivial sample utility to perform a
lookup.
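The attached utility isn't reproduced in the archive. A comparable lookup helper might look like the following; `lookup_and_print` is a hypothetical name, and the sketch relies only on the standard POSIX `getaddrinfo`, `inet_ntop` and `gai_strerror` calls. Because it resolves through libc, building it against musl (for example in an Alpine container) exercises musl's resolver rather than bind's.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Hypothetical helper: resolve `host` through libc's getaddrinfo() and
   print every IPv4/IPv6 address returned. Returns 0 on success, otherwise
   the getaddrinfo() error code (e.g. EAI_NONAME). */
static int lookup_and_print(const char *host)
{
    struct addrinfo hints, *res, *ai;
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;     /* request both IPv4 and IPv6 results */
    hints.ai_socktype = SOCK_STREAM; /* avoid one duplicate entry per socket type */

    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "%s: %s\n", host, gai_strerror(rc));
        return rc;
    }
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        char buf[INET6_ADDRSTRLEN];
        const void *addr;
        if (ai->ai_family == AF_INET)
            addr = &((const struct sockaddr_in *)ai->ai_addr)->sin_addr;
        else if (ai->ai_family == AF_INET6)
            addr = &((const struct sockaddr_in6 *)ai->ai_addr)->sin6_addr;
        else
            continue;
        if (inet_ntop(ai->ai_family, addr, buf, sizeof buf) != NULL)
            printf("%s\n", buf);
    }
    freeaddrinfo(res);
    return 0;
}
```

Called from a `main` that passes `argv[1]`, it could be traced with something like `strace -s 256 ./lookup google.com` (the binary name is hypothetical); strace's `-s` option raises the limit at which printed strings are truncated.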

Rich
d***@glencore.com
2017-03-15 16:58:54 UTC
Hi Rich,

I've attached the results from running the following two commands:

1. `strace ./gai3a google.com 2> strace_of_google-com`
2. `strace ./gai3a google.com.default.svc 2> strace_of_google-com-default-svc`

Is this what you were looking for or can I provide something more useful?

Thanks again, Dominic.

d***@glencore.com
2017-03-15 17:10:08 UTC
Hi Rich,

I've attached the results from running the following two commands:

1. `strace ./gai3a google.com 2> strace_of_google-com.txt`
2. `strace ./gai3a google.com.default.svc 2> strace_of_google-com-default-svc.txt`

Is this what you were looking for or can I provide something more useful?

Thanks again, Dominic.

Rich Felker
2017-03-15 17:22:53 UTC
Post by d***@glencore.com
HI Rich,
1. ` strace ./gai3a google.com 2> strace_of_google-com.txt`
2. ` strace ./gai3a google.com.default.svc 2> strace_of_google-com-default-svc.txt`
Is this what you were looking for or can I provide something more useful?
execve("./gai3a", ["./gai3a", "google.com"], [/* 15 vars */]) = 0
[...]
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\271\377\1\0\0\1\0\0\0\0\0\0\6google\3com\7default\3"..., 54, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, 16) = 54
sendto(3, "\341\262\1\0\0\1\0\0\0\0\0\0\6google\3com\7default\3"..., 54, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, 16) = 54
poll([{fd=3, events=POLLIN}], 1, 2500) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\271\377\201\203\0\1\0\0\0\1\0\0\6google\3com\7default\3"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, [16]) = 147
recvfrom(3, "\341\262\201\203\0\1\0\0\0\1\0\0\6google\3com\7default\3"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, [16]) = 147
^^^^^^^
Post by d***@glencore.com
[...]
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "H\244\1\0\0\1\0\0\0\0\0\0\6google\3com\7kubelet\n"..., 64, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, 16) = 64
sendto(3, "K\266\1\0\0\1\0\0\0\0\0\0\6google\3com\7kubelet\n"..., 64, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, 16) = 64
poll([{fd=3, events=POLLIN}], 1, 2500) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "H\244\205\200\0\1\0\0\0\1\0\0\6google\3com\7kubelet\n"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, [16]) = 100
recvfrom(3, 0x7fff9c3f83c0, 512, 0, 0x7fff9c3f7e70, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2498) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "K\266\205\200\0\1\0\0\0\1\0\0\6google\3com\7kubelet\n"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.43.0.10")}, [16]) = 100
^^^^^^^^

Here we see your nameserver returning an RCODE=0 (Success, the \200
byte) reply for google.com.kubelet[...] rather than NxDomain. Sorry I
don't have the full name; you need to pass a larger -s to strace to
stop it from truncating strings. You need to figure out why the
nameserver is claiming this name exists; it might be some sort of
wildcard record or just a buggy nameserver (probably some component of
kubernetes).
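The bytes pointed at in the trace can be decoded by hand: strace prints them as octal escapes, and the fixed 12-byte DNS header (RFC 1035, section 4.1.1) keeps QR, Opcode, AA, TC and RD in byte 2, and RA plus the 4-bit RCODE in byte 3. The sketch below parses the first 12 bytes of the two replies copied from the strace output above; `parse_dns_header` and `struct dns_header` are hypothetical names.

```c
#include <stdint.h>

/* Fields of the fixed 12-byte DNS message header (RFC 1035, section 4.1.1).
   Hypothetical struct for illustration. */
struct dns_header {
    uint16_t id;
    int qr, aa, tc, rd, ra;  /* single-bit flags */
    unsigned opcode;         /* 4 bits, 0 = standard query */
    unsigned rcode;          /* 4 bits: 0 = NOERROR, 3 = NXDOMAIN */
    uint16_t qdcount, ancount, nscount, arcount;
};

/* Read a big-endian (network order) 16-bit value. */
static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static struct dns_header parse_dns_header(const unsigned char *h)
{
    struct dns_header d;
    d.id      = be16(h);
    d.qr      = (h[2] >> 7) & 1;     /* byte 2: QR | Opcode(4) | AA | TC | RD */
    d.opcode  = (h[2] >> 3) & 0x0f;
    d.aa      = (h[2] >> 2) & 1;
    d.tc      = (h[2] >> 1) & 1;
    d.rd      = h[2] & 1;
    d.ra      = (h[3] >> 7) & 1;     /* byte 3: RA | Z(3) | RCODE(4) */
    d.rcode   = h[3] & 0x0f;
    d.qdcount = be16(h + 4);
    d.ancount = be16(h + 6);
    d.nscount = be16(h + 8);
    d.arcount = be16(h + 10);
    return d;
}

/* First 12 bytes of the two replies, copied from the strace output above. */
static const unsigned char nxdomain_reply[] = "\271\377\201\203\0\1\0\0\0\1\0\0";
static const unsigned char kubelet_reply[]  = "H\244\205\200\0\1\0\0\0\1\0\0";
```

Decoded this way, `\201\203` is 0x81 0x83 (RCODE 3, NXDOMAIN), while `\205\200` is 0x85 0x80 (RCODE 0, NOERROR, with the authoritative-answer bit set). Both replies carry ANCOUNT 0 and NSCOUNT 1, so the `kubelet` reply is exactly the empty-but-successful answer described earlier in the thread.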

Rich
d***@glencore.com
2017-03-15 19:26:46 UTC
With the trace no longer truncated (see attached), it appears that this only happens for names for which Rancher's DNS is authoritative, not for names for which Kubernetes' DNS is authoritative.

FYI, this is how `search` was defined in `resolv.conf`:

```
search default.svc.cluster.local svc.cluster.local cluster.local kubelet.kubernetes.rancher.internal kubernetes.rancher.internal rancher.internal
```

Where `default.svc.cluster.local`, `svc.cluster.local` and `cluster.local` are for service discovery in Kubernetes and `kubelet.kubernetes.rancher.internal`, `kubernetes.rancher.internal` and `rancher.internal` are (I believe) something to do with rancher-dns.
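Expanding that `search` line for `google.com` shows where the bogus answer enters. The sketch below is illustrative only: `nth_candidate` and `query_size` are hypothetical helpers, and the size formula assumes a plain query with no EDNS record (12-byte header, a QNAME encoded in `strlen(name) + 2` bytes, then 4 bytes of QTYPE and QCLASS).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: size of a plain DNS query for `name` (no trailing dot, no EDNS).
   12-byte header + QNAME (one length byte per label, the label bytes, and a
   final zero byte, i.e. strlen(name) + 2) + 2-byte QTYPE + 2-byte QCLASS. */
static size_t query_size(const char *name)
{
    return 12 + (strlen(name) + 2) + 4;
}

/* Hypothetical: write the n-th (0-based) search candidate "<name>.<domain>"
   from a whitespace-separated search list into out.
   Returns 1 on success, 0 if the list has fewer than n + 1 entries. */
static int nth_candidate(const char *name, const char *search, int n,
                         char *out, size_t outlen)
{
    const char *p = search;
    for (;;) {
        while (*p == ' ' || *p == '\t')
            p++;                          /* skip separators */
        if (*p == '\0')
            return 0;                     /* ran out of domains */
        size_t len = strcspn(p, " \t");   /* length of this domain */
        if (n-- == 0) {
            snprintf(out, outlen, "%s.%.*s", name, (int)len, p);
            return 1;
        }
        p += len;
    }
}
```

For candidate 0, `google.com.default.svc.cluster.local`, this gives 54 bytes, and for candidate 3, `google.com.kubelet.kubernetes.rancher.internal`, 64 bytes, matching the two `sendto` sizes in Rich's trace. Since musl only moves on to the next search domain after NXDOMAIN, reaching candidate 3 implies the three `cluster.local` candidates were all rejected, and the NOERROR reply for the first Rancher domain is what ends the search.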

This would explain why only some people have continued to have problems since Alpine 3.4 shipped a musl libc with `search` and `domain` support.

Raising a bug with Rancher now. Thanks so much for your help here!