Unexpected NullPointerExceptions in Lift Production Applications

TLDR: NPEs/InvocationTargetExceptions happen when either the servlet container or the reverse proxy starts discarding requests associated with suspended continuations, which Lift uses for pending comet requests. This presumably happens when the container or proxy crosses some threshold of unprocessed requests and starts discarding old ones to keep up with new ones.

About eight months ago, when OpenStudy first started having a decent number of users online and interacting with the site at the same time, we started occasionally seeing NullPointerExceptions in Lift. These were very strange: they stemmed from attempts to access the server name or port, and seemed to happen mostly in comet or ajax requests. A thread regarding Jetty and these issues had been on the Lift mailing list (https://groups.google.com/d/msg/liftweb/x1SIveK_bK0/asHehgvGXe8J) for a while, so we decided to make the move to Glassfish.

Some people had mentioned that Jetty Hightide didn't have the problem, but we tried it and ran into the same issue. Upon moving to Glassfish, we found other problems: Glassfish 3.1 leaked file descriptors on our server, so it needed to be periodically kicked. Deeming that unacceptable, we switched back to Jetty 6 and decided to absorb the NPE issues until we had a chance to look more deeply into the cause. Around this time, we also switched to a bigger EC2 instance to relieve some load issues without scaling out quite yet. Jetty 6 stayed a lot quieter this time.

The issues seemed tied to load from the very beginning, but the key to our conclusions was a bug that sneaked into one of our releases and caused a rapidly increasing internal load to build up (a synchronization error in our actor registry that led to actors not being spawned properly). This proved to be the key because, once it produced the NPEs and we failed to track down the bug quickly, we decided to cycle through the other servlet containers until we could sort things out.

We went through Glassfish again, then Jetty 7, Jetty 8, and finally Tomcat. Each and every one of these exhibited similar behavior and exceptions. The only difference was that Tomcat finally gave us a useful underlying cause for the exception:

java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ~[na:1.6.0_26]
        at java.lang.reflect.Method.invoke(Method.java:597) ~[na:1.6.0_26]
        at net.liftweb.http.provider.servlet.containers.Servlet30AsyncProvider.resume(Servlet30AsyncProvider.scala:102) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
[…]
Caused by: java.lang.IllegalStateException: The request associated with the AsyncContext has already completed processing.
        at org.apache.catalina.core.AsyncContextImpl.check(AsyncContextImpl.java:438) ~[catalina.jar:7.0.23]
        at org.apache.catalina.core.AsyncContextImpl.getResponse(AsyncContextImpl.java:196) ~[catalina.jar:7.0.23]
        ... 22 common frames omitted

Finally, the underlying IllegalStateException gave us the clue we needed to figure out what exactly had happened: the servlet container is killing off requests before the relevant continuation is woken up to deal with them. Jetty 7/8 gave something similar, though much less clear:

java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ~[na:1.6.0_26]
        at java.lang.reflect.Method.invoke(Method.java:597) ~[na:1.6.0_26]
        at net.liftweb.http.provider.servlet.containers.Servlet30AsyncProvider.resume(Servlet30AsyncProvider.scala:106) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
        at net.liftweb.http.provider.servlet.HTTPRequestServlet.resume(HTTPRequestServlet.scala:163) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
[…]
Caused by: java.lang.IllegalStateException: IDLE,initial
        at org.eclipse.jetty.server.AsyncContinuation.complete(AsyncContinuation.java:534) ~[na:na]
        ... 22 common frames omitted
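
Both traces boil down to the same race: the container times out and completes a suspended request, and Lift's later attempt to resume that request (made via reflection, hence the InvocationTargetException wrapper) blows up. In Servlet 3.0 terms, the shape of the problem is roughly the following. This is a simplified, hypothetical sketch, not Lift's actual Servlet30AsyncProvider code; the servlet, field, and method names are made up for illustration.

import java.util.concurrent.ConcurrentHashMap
import javax.servlet.AsyncContext
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Hypothetical comet-style long poll built on Servlet 3.0 async support.
class CometLikeServlet extends HttpServlet {
  private val suspended = new ConcurrentHashMap[String, AsyncContext]()

  override def doGet(req: HttpServletRequest, res: HttpServletResponse): Unit = {
    val ctx = req.startAsync()    // suspend the request (the "continuation")
    ctx.setTimeout(120 * 1000L)   // the container completes it after this long
    suspended.put(req.getSession.getId, ctx)
  }

  // Called much later, when a comet actor finally has something to push.
  def resume(sessionId: String, payload: String): Unit = {
    val ctx = suspended.remove(sessionId)
    if (ctx != null) {
      // If the container (or a proxy in front of it) has already timed out
      // and completed this request, getResponse/complete throws the
      // IllegalStateExceptions seen above ("already completed processing"
      // on Tomcat, "IDLE,initial" on Jetty).
      ctx.getResponse.getWriter.write(payload)
      ctx.complete()
    }
  }
}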

All indications point to the problem being the same on Jetty 6, since the behavior manifested in the same situation. We did fix our own internal bug, and we shared what we had learned about the cause of the null exceptions with David Pollak.

We've filed a ticket on Lift itself for a more informative error message in this case. That said, generally speaking, if you see this, it means there's a load issue somewhere along the application pipeline that you need to look at. Requests are dying, and you need to figure out how to deal with that. At the moment, we haven't gotten a chance to investigate whether it's the container or nginx terminating the request early, but we'll be looking into it if and when the problem shows up on our system again.
10 responses
That's a well-timed post; I've spent the day pondering exactly the same issue in my first-ever Lift app. I'm running on Jetty 8.0.4, and getting the:

Caused by: java.lang.IllegalStateException: IDLE,initial
at org.eclipse.jetty.server.AsyncContinuation.complete(AsyncContinuation.java:534)
... 22 more

Exception. This only occurs when I have a browser showing a page with a Comet element on it, even though there is no other load on the application. A further data point for analysis...

Wow. I'd definitely see if you can simplify that and then post an example app with the report to the Lift list. Comet components may well be the only part of Lift that actually makes use of continuations, which is probably why they're what's triggering it. It may also be worth checking whether there's something strange about your Jetty or nginx configuration: Lift comet requests are designed to run for 90 seconds, if I'm not mistaken, so if, say, nginx is set up to time out requests in less than that, you may run into this issue.
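
For what it's worth, nginx's proxy_read_timeout defaults to 60 seconds, which is shorter than that 90-second window. A hypothetical proxy block that avoids cutting off comet long polls might look like this (the values are purely illustrative, not our actual config):

location / {
    proxy_pass http://127.0.0.1:8080;
    # proxy_read_timeout defaults to 60s; a comet request that Lift holds
    # open for ~90s would be cut off before Lift ever resumes it, so give
    # the backend comfortably more time than the long-poll window.
    proxy_read_timeout 120s;
}
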
Actually, following the discussion on the Lift list, I've just tried downgrading to Jetty 7.6.0.RC1 - the problem goes away. I'm guessing that the experimental version of Jetty has re-broken continuations, at least as far as Lift is concerned.
Thanks Antonio. This clarifies a lot of guesses.

>Requests are dying, and you need to figure out how to deal with that
Would you be kind enough to give a simple example of this?

I'm afraid it's been relatively tough to reproduce for me, except when we introduced that bug. To be honest, you should try triggering a loop of comet messages that will suspend the comet thread, then spin up many of those comets (200+ probably) and see what happens. I think that'll reproduce it fairly reliably. The server (either the servlet container or nginx) will then probably time out the requests before the continuation gets woken up to handle them.
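
Something along these lines might do it; this is a rough, untested sketch against Lift's CometActor API, and SlowComet, Tick, and the timings are made up for illustration:

import net.liftweb.http.CometActor
import net.liftweb.util.Schedule
import net.liftweb.util.Helpers._

case object Tick

// A comet actor that deliberately blocks while handling messages, so that
// suspended long-poll requests pile up behind it. Put a couple hundred of
// these on open pages and the container/proxy should start timing out the
// suspended requests before they can be resumed.
class SlowComet extends CometActor {
  // Kick off a self-perpetuating loop of messages.
  Schedule.schedule(this, Tick, 1 second)

  override def lowPriority = {
    case Tick =>
      Thread.sleep(10 * 1000L)                 // simulate slow message handling
      reRender(false)                          // push an update down the long poll
      Schedule.schedule(this, Tick, 1 second)  // and queue up the next round
  }

  def render = <span>last update: {new java.util.Date}</span>
}
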
Thanks for this investigation work Antonio.

When you say "suspend the comet thread", do you mean blocking during message handling of comet actor events? I've long been able to reproduce this during stress testing, but just now managed to reproduce it with a new piece of functionality that perhaps takes a little too long to handle some messages.

I used Thread.sleep, but blocking would probably do the same deali-o.

Another thing we've recently spotted is that full GC pauses can have a large adverse effect in the same way. Our full GC (tenured GC) pauses can take 5+ seconds sometimes, and that will back up requests enough to start screwing things up. That's something I'm investigating as we speak.

Yeah, I think that's what I see when stress testing causes the problem for me.

Jeeez this Posterous page hammers my Chrome performance! "GPU Process" goes nuts with the CPU.

Really? Eek. Not sure why that's happening, Safari's pretty smooth :/