Unexpected NullPointerExceptions in Lift Production Applications

TLDR: NPEs/InvocationTargetExceptions happen when either the servlet container or the reverse proxy starts discarding requests associated with suspended continuations, as used by Lift for pending comet requests. This happens presumably when the container or proxy reaches a certain threshold of unprocessed requests and starts discarding old requests to keep up with new ones.

About eight months ago, when OpenStudy first started having decent amounts of online users at the same time interacting with the site, we started occasionally seeing NullPointerExceptions in Lift. These were very strange, stemming from attempting to access the server name or port, and seemed to happen mostly in comet or ajax requests. There has been a thread regarding jetty and these issues on the Lift mailing list (https://groups.google.com/d/msg/liftweb/x1SIveK_bK0/asHehgvGXe8J) for a while, so we decided to make the move to Glassfish.

Some people had mentioned jetty hightide didn't have the problem, but we tried it and ran into the same issue. Upon moving to Glassfish, we found other problems. Glassfish 3.1 leaked file descriptors on our server, so it needed to be periodically kicked. Deeming that unacceptable, we switched back to jetty 6 and decided to absorb the NPE issues until we had a chance to look deeper at the cause. Around this time, we also switched to a bigger EC2 instance to defray some load issues without scaling out quite yet. Jetty 6 stayed a lot quieter this time.

The issues seemed to have been tied to load from the very beginning, but the key that led to our conclusions was a bug that sneaked into one of our releases that caused a rapidly-increasing internal load to build up (this was a synchronization error in our actor registry that led to actors not being spawned properly). The reason this proved to be the key is because, once it produced the NPEs and we failed to track down the bug quickly, we decided to try the other servlet containers until we could sort things out.

We went through Glassfish again, then Jetty 7, Jetty 8, and finally Tomcat. Each and every one of these exhibited similar behavior and exceptions. The only difference that arose is that Tomcat finally gave us a useful underlying cause for the exception:

java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ~[na:1.6.0_26]
        at java.lang.reflect.Method.invoke(Method.java:597) ~[na:1.6.0_26]
        at net.liftweb.http.provider.servlet.containers.Servlet30AsyncProvider.resume(Servlet30AsyncProvider.scala:102) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
[…]
Caused by: java.lang.IllegalStateException: The request associated with the AsyncContext has already completed processing.
        at org.apache.catalina.core.AsyncContextImpl.check(AsyncContextImpl.java:438) ~[catalina.jar:7.0.23]
        at org.apache.catalina.core.AsyncContextImpl.getResponse(AsyncContextImpl.java:196) ~[catalina.jar:7.0.23]
        ... 22 common frames omitted

Finally, the underlying IllegalStateException gave us the clue we needed to figure out what exactly had happened: the servlet containers are killing off the requests before the relevant continuation is woken up to deal with said request. Jetty 7/8 gave something similar, though much less clear:

java.lang.reflect.InvocationTargetException: null
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) ~[na:na]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ~[na:1.6.0_26]
        at java.lang.reflect.Method.invoke(Method.java:597) ~[na:1.6.0_26]
        at net.liftweb.http.provider.servlet.containers.Servlet30AsyncProvider.resume(Servlet30AsyncProvider.scala:106) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
        at net.liftweb.http.provider.servlet.HTTPRequestServlet.resume(HTTPRequestServlet.scala:163) ~[lift-webkit_2.8.1-2.4-M4.jar:2.4-M4]
[…]
Caused by: java.lang.IllegalStateException: IDLE,initial
        at org.eclipse.jetty.server.AsyncContinuation.complete(AsyncContinuation.java:534) ~[na:na]
        ... 22 common frames omitted

All indications point to the problem being the same in the case of Jetty 6, since the behavior manifested in the same situation. We did fix our own internal bug, but shared this valuable information as to the cause of the null exceptions with David Pollak.

We've filed a ticket on Lift itself for a more informative error message in this case. That said, generally speaking, if you see this, it means there's a load issue somewhere along the application pipeline that you need to look at. Requests are dying, and you need to figure out how to deal with that. At the moment, we haven't gotten a chance to investigate whether it's the container or nginx terminating the request early, but we'll be looking into it if and when the problem shows up on our system again.