tlsify - Re: Introduction & summary of tlsify discussions, part 2

Message-ID: <20150928220101.GF17773@brightrain.aerifal.cx>
Date: Mon, 28 Sep 2015 18:01:01 -0400
From: Rich Felker <dalias@...c.org>
To: tlsify@...ts.openwall.com
Subject: Re: Introduction & summary of tlsify discussions, part 2

On Mon, Sep 28, 2015 at 09:27:58PM +0200, Joakim Sindholt wrote:
> On Mon, 2015-09-28 at 12:41 -0400, Rich Felker wrote:
> > On Mon, Sep 28, 2015 at 06:01:56PM +0200, Joakim Sindholt wrote:
> > > On Sun, 2015-09-27 at 22:20 -0400, Rich Felker wrote:
> > > > The following are excerpts from notes by Daniel Kahn Gillmor (dkg),
> > > > who was part of the CII-Madrid tlsify discussions, originally sent to
> > > > me by email before the tlsify list was set up, and my replies. Reposted
> > > > with permission. The original notes from the workshop were a lot more
> > > > sparse than the expanded version I just sent to the tlsify list, so
> > > > some of the questions below are probably already answered, but I think
> > > > it's still useful discussion.
> > > 
> > > I do have some concerns based on this, mostly performance related. It
> > > should be no secret that I think this should be in the kernel. Please
> > > keep that in mind when reading my opinions.
> > 
> > Thanks for the feedback. I didn't go into the whole kernel topic yet
> > because I'd written enough already, and it's still a ways off I think.
> > In brief, the phases discussed (in both the musl community and at
> > CII-Madrid) are roughly:
> > 
> > Phase 1: API development and implementations doing everything in
> > userspace in the tlsify child process, using whatever existing TLS
> > backends make sense (they're isolated in the child, so library-safety
> > issues and other implementation warts don't matter so much).
> > 
> > Phase 2: Produce TLS backend code for use by the tlsify process that
> > makes proper use of shared text and minimizes libraries so that
> > exec'ing it is sufficiently fast and light to be practical for many
> > real-world loads.
> > 
> > Phase 3: Develop a mechanism for handing off the symmetric crypto to
> > the kernel. Session management would still need userspace help. This
> > phase is not well-defined at this point, but I would like to keep the
> > external-command API as the preferred way of setting this up for
> > simple apps without huge scalability requirements but also have a
> > library framework for using kernel-side TLS help.
> 
> Phase 3 sounds like a reasonable tradeoff. This still leaves what I
> would deem a necessity to design the initial handshake protocol to be
> potentially really fast.

Handshake protocol? Between tlsify and the calling application? My
idea is to avoid the need for any bidirectional communication for most
common uses. The concept of "acceptable reference identifiers" from
RFC 6125 is highly aligned with this goal: rather than receiving the
certificate first and then evaluating whether it's acceptable for the
peer presenting it, the application should have a _predetermined_ list
of "acceptable reference identifiers" which it can pass off to the TLS
implementation (tlsify) before the TLS handshake, and then only
certificates that match are accepted.
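As a rough illustration of the "predetermined list" idea, the sketch below checks the DNS names presented in a certificate against a list of reference identifiers fixed before the handshake. The function names are hypothetical, and the wildcard rule is deliberately simplified (only a whole left-most `*` label is honored); RFC 6125 has further rules that are not modeled here.

```python
def matches_reference(presented, reference):
    """Compare one presented DNS name against one reference identifier."""
    p_labels = presented.lower().rstrip(".").split(".")
    r_labels = reference.lower().rstrip(".").split(".")
    if len(p_labels) != len(r_labels):
        return False
    # Simplified wildcard rule: "*" may only stand for the whole left-most label.
    if p_labels[0] == "*":
        return len(p_labels) > 2 and p_labels[1:] == r_labels[1:]
    return p_labels == r_labels


def acceptable(presented_names, reference_ids):
    """Accept if any presented name matches any predetermined identifier."""
    return any(matches_reference(p, r)
               for p in presented_names
               for r in reference_ids)
```

The point of the design is that `reference_ids` is an input known up front, so nothing has to flow back to the application for a decision during the handshake.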

> > Is supporting resumption something we really want to be encouraging?
> > My impression is that it has negative impact on forward secrecy and
> > little benefit for https (where keepalive/persistent connections
> > achieve many of the same goals) but I don't by any means consider
> > myself an expert on this topic.
> > 
> > In general, in cases where supporting a feature has purely negative
> > impact on security and is not a hard requirement for usage cases, my
> > leaning would be towards not supporting it. But I'm very open to
> > discussion on this topic.
> 
> It does endanger forward secrecy but given the herculean cost of
> negotiating a new session I would say it's currently well worth it.
> Although my opinion is not the most informed on this matter so maybe
> just disregard it for now.
> Besides, it's not a hard requirement and it can always be added later. I
> do however maintain my position that it should be handled by tlsify, not
> the user. Our ultimate goal should be to provide the best possible TLS
> layer with the least amount of API interaction from the user's point of
> view.

I don't think silently storing information that could be used to
retroactively compromise a user's past communications is very friendly
to the user's/application's interests. It's also probably highly
undesirable for your wget (which is likely operating as part of a
script) to resume a TLS session your interactive browser started (or
to use pinned certificates, cert exceptions, etc. from your
interactive browser).

If it's desirable to do session resumption and other stateful things
without heavy lifting from the application side, the application
probably needs to pass (as an input) its own private location for this
data to be kept.

> > > * 500µs startup time gives you 2000 key exchanges per second. While
> > >   nginx posted benchmarks showing around 350 poorly defined negotiations
> > >   per core per second[1]. By no means is that negligible overhead.
> > 
> > Absolutely. I don't think the model is really appropriate for
> > high-load servers with large numbers of transient connections, but it
> > may be reasonable to support an extension where the child process
> > handles multiple TLS sessions for its caller (all in one process) and
> > still get lots of the same benefits. I say an "extension" because
> > implementing this should not be mandatory for tlsify API
> > implementations and callers should be able to fall back to
> > process-per-session if it's not implemented.
> > 
> > Going back to the big picture, the problem I see tlsify as solving is
> > that the current approaches to TLS are all tailored towards highly
> > engineered applications intended to scale to large numbers of
> > connections, and don't fit well with simple client applications that
> > just need TLS for privacy/authentication/etc. The early adopters I
> > have in mind are things like:
> > 
> > - Git
> > - Downloaders (wget-like)
> > - Chat clients and servers (IRC, etc.)
> > - Light REST API clients for services
> > - Mail services
> 
> Those are some good targets but anything statically linked in general is
> going to practically scream for this. I am currently using mbedtls for
> all my TLS needs and I am willing to take the performance hit over
> libressl/libtls because it's "only" 600k of extra code linked in.
> Clearly such numbers are completely unacceptable.

Yes, but I don't think someone running https with 10k negotiations per
second is concerned about 600k of code.

> > > * Having one process per connection but still polling seems like kind of
> > >   a waste. Might as well have two threads in it to send and recv
> > >   asynchronously. Save some syscalls, parallelize, all that jazz.
> > 
> > This is on the other side of the API boundary, so there's no reason a
> > tlsify implementation couldn't just use two threads like that. It
> > probably wouldn't even add any startup latency if you create the
> > second thread after sending the initial handshake while waiting for a
> > reply.
> 
> Kinda lost track of context there. This was in the context of tlsify
> doing more than one connection per process and being able to "handle
> multiple sockets in its select loop without a problem". It sounded a bit
> like the internals of the implementation were already more or less
> settled on.

Oh, no, that was just one way it could be implemented. But if you
start using a thread-per-connection model internally in tlsify with a
"shared tlsify process" model as the external API, it's really not
much lighter than a process-per-connection external model. With an
efficient static-linked tlsify binary, posix_spawn would only take
about 3x the time of pthread_create.

I suspect that if the "shared tlsify process" model turns out to be an
interesting extension, it would be implemented with N threads (N
proportional to the number of cores), each running a poll loop, rather
than with one thread per connection.

> > > * It will undoubtedly waste an awful amount of time looking for and
> > >   parsing certs in the CA folder if it's a client.
> > 
> > This sounds like an important problem we need to solve.
> 
> You know, this would all be so easy if we could just launch a
> per-namespace CA cert caching daemon. Is there some linux-y way of
> having tlsify spawn a caching daemon into a group with its parent pid
> and namespacing it somehow so only that group can see the cache?

I don't think there's any reasonable way to do this kind of thing, for
the same reasons dbus is such a mess...

> Principally I'm against the pre-processing on the file system. Frankly
> it's like a cruel joke, making it stupidly difficult to add a
> certificate to the CA store when it should just be a folder full of .pem
> files. The current state of linux CA trust stores is abysmal by the way
> and I have my doubts that we're gonna change that. At least we can do
> our part and support a sane structure.

Would it be that bad to cache positive results in the filesystem in an
efficient manner and only do a search through all the certs when a
match isn't found in the cache?
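A minimal sketch of that caching idea, under stated assumptions: `find_ca`, the cache layout (one small file per looked-up name, holding the path of the matching cert), and the caller-supplied `subject_of(path)` parser are all hypothetical. Only positive results are cached, and a cached hit is re-checked against the single file it points to, so the store itself can stay a plain folder of .pem files.

```python
import hashlib
import os


def find_ca(name, ca_dir, cache_dir, subject_of):
    """Return the path of the CA file whose subject is `name`, or None."""
    key = hashlib.sha256(name.encode()).hexdigest()
    entry = os.path.join(cache_dir, key)
    try:
        with open(entry) as f:
            cached = f.read()
        # Fast path: re-check only the one cached file.
        if subject_of(cached) == name:
            return cached
    except OSError:
        pass  # no cache entry yet, or the cached file has vanished
    # Slow path: search through every certificate in the store.
    for fname in sorted(os.listdir(ca_dir)):
        path = os.path.join(ca_dir, fname)
        if subject_of(path) == name:
            with open(entry, "w") as f:
                f.write(path)  # remember positive results only
            return path
    return None
```

Adding a certificate then needs no preprocessing step: a new file is found by the slow path on first use and by the cache afterwards.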

> > > * And my main concern: this will be painful to integrate into existing
> > >   applications. The goal here should be to replace the current model
> > >   and requiring all new users to go through all their code, find all
> > >   instances where they create new fds and ensure they set CLOEXEC on all
> > >   of them is a big blocker. I don't consider enumerating and closing all
> > >   1100 open fds to be an acceptable solution to this. Imagine doing that
> > >   5000 times per second.
> > 
> > Missing close-on-exec is an issue, but it's only a race condition in
> > multi-threaded applications. In others, the fd leaks either
> > always-happen or never-happen, and if they always-happen, it's easy to
> > find and fix them. It would be wonderful if there were some global
> > solution to this problem like close-on-exec-by-default, but sadly that
> > ship already sailed a long time ago...
> > 
> > Do you have any ideas for avoiding this?
> 
> None that aren't absolutely horrendous, supposing we go for the one
> connection = one process model. Once you delve into the realm of using
> /proc, you've gone too far.

I think applications just have to do it right or deal with the
consequences of fd leaks. Giving them an encouragement to fix this
mess is not such a bad thing. And like I said it's really not hard to
find and fix except in the multi-threaded case. Presumably most such
bugs would be found at the time of integrating tlsify support.
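For the single-threaded case, finding and fixing such leaks can be sketched as below. The helper names are hypothetical; the `fcntl` calls are the standard `F_GETFD`/`F_SETFD` operations on the `FD_CLOEXEC` flag. This assumes a Unix-like system.

```python
import fcntl
import os


def set_cloexec(fd):
    """Mark fd close-on-exec so an exec'd child does not inherit it."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def inheritable_fds(fds):
    """Return the fds from `fds` that would leak into an exec'd child."""
    return [fd for fd in fds
            if not fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC]
```

In a multi-threaded program, setting the flag after the fd is created still leaves a window in which another thread can fork and exec; there the flag has to be requested atomically at creation time (for example with `O_CLOEXEC` on `open`).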

> > > Parents have one tlsify child handling all connections:
> > > * This is only interesting for servers and suffers all the same problems
> > >   as the approach above.
> > > * Running it with the stdin/stdout pipes would effectively be a special
> > >   case of its intended mode of operation, unless you then call it
> > >   tlsifyd and have tlsify be a small shim around it.
> > > 
> > > One major tlsifyd and all users just connect to it:
> > > * Security nightmare - even more so than the current method and in more
> > >   ways than one (file system permissions, probably more)
> > > * This would require two binaries, the latter of which will rarely be
> > >   used. That cannot possibly end well.
> > > 
> > > The last approach seems to have a far better potential for max
> > > performance. You keep the whole CA cert pool in one place and you can
> > > use session resuming across worker processes with zero issue.
> > > 
> > > The middle approach is a good midway station for servers but offers
> > > nothing but some extra potential for screw-ups in clients.
> > > 
> > > The first approach is undoubtedly my favorite but it does have some
> > > serious performance considerations that are vitally important when it
> > > comes to servers and long-running clients doing many connections. I would
> > > like to see solutions to these problems rather than compromise on the
> > > process model.
> > 
> > Ultimately I think we have to accept that we're sacrificing some of
> > the high-performance and broken-code-friendly options for the sake of
> > something that's much more secure and easy to integrate with clean and
> > simple applications. I don't have any delusions that tlsify is going
> > to displace direct in-process library usage for huge servers, but it's
> > able to solve a problem that presently has no solution, and if we get
> > to phase 3 (the kernel stuff) that could very well open a path to much
> > greater performance for high-load https servers.
> 
> Broken-code-friendly I can stand to lose, however I think it's crucial
> we design to allow high performance.
> 
> Funny story:
> So there I was, drunk, thrashing my server (Xeon E3-1265L v2) with
> posix_spawn. As it turns out I can get upward of 410000 spawns per
> second.

Wow, really? Doesn't that come out to ~2500 ns/spawn? That's about
100x the rate I've measured on machines I typically use.
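The rate/latency conversions used in this thread are plain reciprocals; a small sketch (the function names are just for illustration):

```python
def ns_per_op(ops_per_second):
    """Average time per operation, in nanoseconds, at a given rate."""
    return 1e9 / ops_per_second


def ops_per_second(seconds_per_op):
    """Maximum sequential rate if each operation takes seconds_per_op."""
    return 1.0 / seconds_per_op


print(round(ns_per_op(410000)))       # 2439
print(round(ops_per_second(500e-6)))  # 2000
```

So 410000 spawns per second is about 2.4 µs per spawn, and a 500 µs startup cost caps one core at 2000 sequential startups per second.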

> Mind you it wrecks all other processes trying to start.
> Now I don't know how many 256 bit ECDHE negotiations I can do in that
> time but it seems to be a very small number comparatively. Combined with
> HTTP requests and all that back and forth we're only talking a couple of
> hundred connections per second at best.
> This is on linux 3.17.7 and I don't know if process spawning time is
> changing. At these speeds I'm tempted to just say "fix the kernel"
> should any serious issue arise here.

Indeed. I would not be surprised if our concerns about excessive
performance cost at connection setup time are completely misplaced.
It's quite plausible that the only performance concern is the cost of
extra copies through the local socket and extra context switching at
very high transfer rates, which of course is where passing off the
symmetric encryption to the kernel would fix everything.

> > > I assume this is going to be handled by a very tiny libtlsify that you'd
> > > link in (always statically?) so it should be some manner of extensible
> > > while also remaining backwards compatible. I fear the day I get the
> > > "your tlsify is too new, please downgrade" message.
> > 
> > I started to write somewhere, but I forget whether I actually followed
> > up on this -- I think it would be nice to have some canonical
> > reference code for how to invoke tlsify as long as it's not mandatory
> > to use (i.e. not the API boundary). But for many simple apps (think
> > Busybox wget) it's just going to be a few lines of C you'd write
> > inline. Script langs that automate opening a bidirectional IO channel
> > to a child process will also have it very easy.
> 
> So I suppose the question here is: are we designing the IPC API to be so
> user friendly that there may not even be a real tangible need for a
> libtlsify? Because if you have some ideas on how to do that with
> traditional UNIX then I am all for it. I mean sockets for IPC are not
> the easiest most introspectable thing in the world.
> The smaller the API surface the better.

I think I mentioned this in the "part 1" email: prior to CII-Madrid,
my idea was that the main API boundary would be a C library where the
backend could be provided by threads in the same process or external
processes. But for many reasons making the "command line" (I use that
term loosely since more might be passed in env vars or fds, just for
privacy purposes) the API boundary seems like a much more interesting
choice. I went over some of these reasons briefly in the "part 1"
email.
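To give a feel for how little caller-side code an external-command boundary needs, here is a sketch of opening a bidirectional channel to a child process over a socketpair. The tlsify command line is not defined in this message, so a child that simply echoes its stdin stands in for it; `roundtrip` and `ECHO` are hypothetical names, and which fds a real tlsify would expect is an assumption.

```python
import socket
import subprocess
import sys


def roundtrip(argv, payload):
    """Run argv with a socket as its stdin/stdout; send payload, return reply."""
    parent, child = socket.socketpair()
    # The child sees the socket as fds 0 and 1; close_fds keeps other fds out.
    proc = subprocess.Popen(argv, stdin=child.fileno(), stdout=child.fileno(),
                            close_fds=True)
    child.close()  # the parent keeps only its own end
    parent.sendall(payload)
    parent.shutdown(socket.SHUT_WR)  # signal end of input to the child
    chunks = []
    while True:
        data = parent.recv(4096)
        if not data:
            break
        chunks.append(data)
    parent.close()
    proc.wait()
    return b"".join(chunks)


# Stand-in for the (hypothetical) tlsify command: a child that echoes stdin.
ECHO = [sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"]
```

With a real implementation, `argv` and any environment variables or extra fds would carry the connection parameters, and the application would treat its end of the socket as the plaintext stream.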

Rich