tlsify - Re: Introduction & summary of tlsify discussions, part 2

Message-ID: <1443468478.2263.21.camel@zhasha.com>
Date: Mon, 28 Sep 2015 21:27:58 +0200
From: Joakim Sindholt <opensource@...sha.com>
To: tlsify@...ts.openwall.com
Subject: Re: Introduction & summary of tlsify discussions, part 2

On Mon, 2015-09-28 at 12:41 -0400, Rich Felker wrote:
> On Mon, Sep 28, 2015 at 06:01:56PM +0200, Joakim Sindholt wrote:
> > On Sun, 2015-09-27 at 22:20 -0400, Rich Felker wrote:
> > > The following are excerpts from notes by Daniel Kahn Gillmor (dkg),
> > > who was part of the CII-Madrid tlsify discussions, originally sent to
> > > me by email before the tlsify list was set up, and my replies. Reposted
> > > with permission. The original notes from the workshop were a lot more
> > > sparse than the expanded version I just sent to the tlsify list, so
> > > some of the questions below are probably already answered, but I think
> > > it's still useful discussion.
> > 
> > I do have some concerns based on this, mostly performance related. It
> > should be no secret that I think this should be in the kernel. Please
> > keep that in mind when reading my opinions.
> 
> Thanks for the feedback. I didn't go into the whole kernel topic yet
> because I'd written enough already, and it's still a ways off I think.
> In brief, the phases discussed (in both the musl community and at
> CII-Madrid) are roughly:
> 
> Phase 1: API development and implementations doing everything in
> userspace in the tlsify child process, using whatever existing TLS
> backends make sense (they're isolated in the child, so library-safety
> issues and other implementation warts don't matter so much).
> 
> Phase 2: Produce TLS backend code for use by the tlsify process that
> makes proper use of shared text and minimizes libraries so that
> exec'ing it is sufficiently fast and light to be practical for many
> real-world loads.
> 
> Phase 3: Develop a mechanism for handing off the symmetric crypto to
> the kernel. Session management would still need userspace help. This
> phase is not well-defined at this point, but I would like to keep the
> external-command API as the preferred way of setting this up for
> simple apps without huge scalability requirements but also have a
> library framework for using kernel-side TLS help.

Phase 3 sounds like a reasonable tradeoff. That still leaves what I
would deem a necessity: designing the initial handshake protocol so
that it can be really fast.

> Is supporting resumption something we really want to be encouraging?
> My impression is that it has negative impact on forward secrecy and
> little benefit for https (where keepalive/persistent connections
> achieve many of the same goals) but I don't by any means consider
> myself an expert on this topic.
> 
> In general, in cases where supporting a feature has purely negative
> impact on security and is not a hard requirement for usage cases, my
> leaning would be towards not supporting it. But I'm very open to
> discussion on this topic.

It does endanger forward secrecy, but given the herculean cost of
negotiating a new session I would say it's currently well worth it.
That said, my opinion is not the most informed on this matter, so maybe
just disregard it for now.
Besides, it's not a hard requirement and it can always be added later. I
do however maintain my position that it should be handled by tlsify, not
the user. Our ultimate goal should be to provide the best possible TLS
layer with the least amount of API interaction from the user's point of
view.

> > * 500µs startup time gives you 2000 key exchanges per second, while
> >   nginx posted benchmarks showing around 350 poorly defined negotiations
> >   per core per second[1]. By no means is that negligible overhead.
> 
> Absolutely. I don't think the model is really appropriate for
> high-load servers with large numbers of transient connections, but it
> may be reasonable to support an extension where the child process
> handles multiple TLS sessions for its caller (all in one process) and
> still get lots of the same benefits. I say an "extension" because
> implementing this should not be mandatory for tlsify API
> implementations and callers should be able to fall back to
> process-per-session if it's not implemented.
> 
> Going back to the big picture, the problem I see tlsify as solving is
> that the current approaches to TLS are all tailored towards highly
> engineered applications intended to scale to large numbers of
> connections, and don't fit well with simple client applications that
> just need TLS for privacy/authentication/etc. The early adopters I
> have in mind are things like:
> 
> - Git
> - Downloaders (wget-like)
> - Chat clients and servers (IRC, etc.)
> - Light REST API clients for services
> - Mail services

Those are some good targets, but anything statically linked in general is
going to practically scream for this. I am currently using mbedtls for
all my TLS needs and I am willing to take the performance hit over
libressl/libtls because it's "only" 600k of extra code linked in.
Clearly such numbers are completely unacceptable.

> > * Having one process per connection but still polling seems like kind of
> >   a waste. Might as well have two threads in it to send and recv
> >   asynchronously. Save some syscalls, parallelize, all that jazz.
> 
> This is on the other side of the API boundary, so there's no reason a
> tlsify implementation couldn't just use two threads like that. It
> probably wouldn't even add any startup latency if you create the
> second thread after sending the initial handshake while waiting for a
> reply.

Kinda lost track of context there. This was in the context of tlsify
doing more than one connection per process and being able to "handle
multiple sockets in its select loop without a problem". It sounded a bit
like the internals of the implementation were already more or less
settled on.

> > * It will undoubtedly waste an awful amount of time looking for and
> >   parsing certs in the CA folder if it's a client.
> 
> This sounds like an important problem we need to solve.

You know, this would all be so easy if we could just launch a
per-namespace CA cert caching daemon. Is there some Linux-y way of
having tlsify spawn a caching daemon into a group with its parent pid
and namespacing it somehow so only that group can see the cache?

In principle I'm against pre-processing on the file system. Frankly
it's like a cruel joke, making it stupidly difficult to add a
certificate to the CA store when it should just be a folder full of .pem
files. The current state of Linux CA trust stores is abysmal, by the way,
and I have my doubts that we're gonna change that. At least we can do
our part and support a sane structure.
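
For illustration only, here is a minimal C sketch of what walking such a
folder could look like. It assumes the store is a flat directory and that
candidate certificates are picked purely by a ".pem" suffix. The names
has_pem_suffix and count_pem_files are made up for this example and are
not part of any tlsify design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Return 1 if name ends in ".pem" and has at least one character
 * before the suffix, 0 otherwise. */
int has_pem_suffix(const char *name)
{
    size_t n = strlen(name);
    return n > 4 && strcmp(name + n - 4, ".pem") == 0;
}

/* Count directory entries whose names end in ".pem".
 * Returns -1 if the directory cannot be opened.
 * Hypothetical helper: it only enumerates names and does not open
 * or parse any certificate. */
int count_pem_files(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (has_pem_suffix(e->d_name))
            count++;
    }
    closedir(d);
    return count;
}
```

Enumerating names like this is the cheap part. Reading and parsing every
file is where a client would spend its time on each start, which is the
cost being worried about above.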

> > * And my main concern: this will be painful to integrate into existing
> >   applications. The goal here should be to replace the current model
> >   and requiring all new users to go through all their code, find all
> >   instances where they create new fds and ensure they set CLOEXEC on all
> >   of them is a big blocker. I don't consider enumerating and closing all
> >   1100 open fds to be an acceptable solution to this. Imagine doing that
> >   5000 times per second.
> 
> Missing close-on-exec is an issue, but it's only a race condition in
> multi-threaded applications. In others, the fd leaks either
> always-happen or never-happen, and if they always-happen, it's easy to
> find and fix them. It would be wonderful if there were some global
> solution to this problem like close-on-exec-by-default, but sadly that
> ship already sailed a long time ago...
> 
> Do you have any ideas for avoiding this?

None that aren't absolutely horrendous, supposing we go for the one
connection = one process model. Once you delve into the realm of using
/proc, you've gone too far.
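
For anyone following along, this is roughly what "setting CLOEXEC" on an
fd amounts to in C. It is a sketch, not a proposed fix, and the helper
names (set_cloexec, has_cloexec, open_ro_cloexec) are hypothetical. The
two-step fcntl version is the one that races with exec in multi-threaded
programs. Passing O_CLOEXEC at open time avoids that window, but only
for fds the application creates itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Mark an already-open descriptor close-on-exec.
 * Returns 0 on success, -1 on error. Another thread may fork+exec
 * between the creation of fd and this call. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Return 1 if fd is close-on-exec, 0 if it is not, -1 on error. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Open a path read-only with close-on-exec set atomically,
 * so there is no window in which the fd can leak into a child. */
int open_ro_cloexec(const char *path)
{
    return open(path, O_RDONLY | O_CLOEXEC);
}
```

Every place that creates an fd (open, socket, accept, pipe, dup and so
on) needs the equivalent treatment, which is exactly the audit burden
described above.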

> > Parents have one tlsify child handling all connections:
> > * This is only interesting for servers and suffers all the same problems
> >   as the approach above.
> > * Running it with the stdin/stdout pipes would effectively be a special
> >   case of its intended mode of operation, unless you then call it
> >   tlsifyd and have tlsify be a small shim around it.
> > 
> > One major tlsifyd and all users just connect to it:
> > * Security nightmare - even moreso than the current method and in more
> >   ways than one (file system permissions, probably more)
> > * This would require two binaries, the latter of which will rarely be
> >   used. That cannot possibly end well.
> > 
> > The last approach seems to have a far better potential for max
> > performance. You keep the whole CA cert pool in one place and you can
> > use session resuming across worker processes with zero issue.
> > 
> > The middle approach is a good midway station for servers but offers
> > nothing but some extra potential for screw-ups in clients.
> > 
> > The first approach is undoubtedly my favorite but it does have some
> > serious performance considerations that are vitally important when it
> > comes to servers and longrunning clients doing many connections. I would
> > like to see solutions to these problems rather than compromise on the
> > process model.
> 
> Ultimately I think we have to accept that we're sacrificing some of
> the high-performance and broken-code-friendly options for the sake of
> something that's much more secure and easy to integrate with clean and
> simple applications. I don't have any delusions that tlsify is going
> to displace direct in-process library usage for huge servers, but it's
> able to solve a problem that presently has no solution, and if we get
> to phase 3 (the kernel stuff) that could very well open a path to much
> greater performance for high-load https servers.

Broken-code-friendly I can stand to lose; however, I think it's crucial
we design to allow high performance.

Funny story:
So there I was, drunk, thrashing my server (Xeon E3-1265L v2) with
posix_spawn. As it turns out I can get upward of 410000 spawns per
second. Mind you it wrecks all other processes trying to start.
Now I don't know how many 256-bit ECDHE negotiations I can do in that
time but it seems to be a very small number comparatively. Combined with
HTTP requests and all that back and forth we're only talking a couple of
hundred connections per second at best.
This is on Linux 3.17.7 and I don't know if process spawning time is
changing. At these speeds I'm tempted to just say "fix the kernel"
should any serious issue arise here.
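
If anyone wants to try something similar, a minimal sketch of a spawn
timer follows. To be clear, this is not the code I ran for the number
above: it is a sequential spawn-then-wait loop, so it will report much
lower figures than anything that spawns in parallel. The function name
spawn_rate is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>

extern char **environ;

/* Spawn argv[0] (looked up via PATH) n times, one after another,
 * reaping each child before starting the next.
 * Returns spawns per second, or -1.0 on any failure. */
double spawn_rate(char *const argv[], int n)
{
    struct timespec t0, t1;

    if (n <= 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;

    for (int i = 0; i < n; i++) {
        pid_t pid;
        int status;
        /* posix_spawnp returns an error number directly, not via errno */
        if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
            return -1.0;
        if (waitpid(pid, &status, 0) < 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)n / secs : -1.0;
}
```

Calling it with an argv of { "true", NULL } and a few thousand
iterations gives a rough per-core lower bound for process startup on a
given machine, which can then be compared against key exchange rates.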

> > > > control channel formats
> > > > -----------------------
> > > > 
> > > > this seems like the big question: how can we structure this in a
> > > > friendly/simple way without needing to pull in json libraries or
> > > > other scary/dangerous/complex parsers.  My big fear is that there will
> > > > be control channel data that is large and possibly complex.
> > > > 
> > > > examples of large data: full certificate chains -- these can be quite
> > > > long.  i don't think there is a functional limit on their length,
> > > > actually, and the size of any one certificate can be huge.  Similarly
> > > > large are the hints provided by servers during CertificateRequest
> > > > messages.
> > > > 
> > > > examples of complex data: session initialization/configuration is
> > > > starting to sound possibly complex, though if it's just command-line
> > > > arguments, we should be able to get away with line-at-a-time reads
> > > > (maybe with embedded NULs between args so that the same command-line
> > > > parser can be used?).  what do you think is the most complex part you've
> > > > seen?
> > > 
> > > Embedded NULs aren't possible; arguments to exec are C strings. Of
> > > course it's possible to have multiple arguments. But since the command
> > > line is usually public (via ps, etc.) any potentially-private input
> > > needs to be via the environment or an active control-channel.
> > > 
> > > I'm really not sure what the most complex part is. I think we need to
> > > get some examples going with a complete inventory of inputs that would
> > > need to be passed.
> > 
> > I don't really see a need to make it particularly complicated. Pass
> > certs in PEM format over the ctl line. PEM is plain text and used
> > absolutely everywhere. This has the added benefit of making it slightly
> > easier to introspect for debugging purposes.
> > 
> > What other data is there that isn't plain text? I can't think of any.
> > For the love of all that is good, don't start with the JSON
> > serialization.
> > 
> > I assume this is going to be handled by a very tiny libtlsify that you'd
> > link in (always statically?) so it should be some manner of extensible
> > while also remaining backwards compatible. I fear the day I get the
> > "your tlsify is too new, please downgrade" message.
> 
> I started to write somewhere, but I forget whether I actually followed
> up on this -- I think it would be nice to have some canonical
> reference code for how to invoke tlsify as long as it's not mandatory
> to use (i.e. not the API boundary). But for many simple apps (think
> Busybox wget) it's just going to be a few lines of C you'd write
> inline. Script langs that automate opening a bidirectional IO channel
> to a child process will also have it very easy.

So I suppose the question here is: are we designing the IPC API to be so
user friendly that there may not even be a real tangible need for a
libtlsify? Because if you have some ideas on how to do that with
traditional UNIX then I am all for it. I mean sockets for IPC are not
the easiest or most introspectable thing in the world.
The smaller the API surface the better.
Oh, I know, let's use D-Bus!^H^H^H^H^H^H
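
To make the "few lines of C you'd write inline" idea concrete, here is a
rough sketch of opening a bidirectional channel to a child process over
its stdin/stdout. It deliberately says nothing about tlsify's command
line or control channel, since neither is defined yet. The function name
spawn_filter is hypothetical, and any program that reads stdin and
writes stdout (cat, for instance) can stand in for the child.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] (looked up via PATH) with one end of a socketpair as
 * both its stdin and stdout. Returns the parent's end of the pair and
 * stores the child's pid in *pid_out, or returns -1 on error. */
int spawn_filter(char *const argv[], pid_t *pid_out)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: keep only our end, wired to stdin and stdout. */
        close(sv[0]);
        if (dup2(sv[1], STDIN_FILENO) < 0 ||
            dup2(sv[1], STDOUT_FILENO) < 0)
            _exit(127);
        if (sv[1] > STDOUT_FILENO)
            close(sv[1]);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    /* Parent: the child's end is no longer needed here. */
    close(sv[1]);
    *pid_out = pid;
    return sv[0];
}
```

The parent then reads and writes plaintext on the returned fd, calls
shutdown(fd, SHUT_WR) when it is done sending, and reaps the child with
waitpid. Note that this sketch does nothing about other fds leaking into
the child, which is the CLOEXEC problem discussed earlier.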

-- Joakim