"Timed out while waiting to write" under no load, in both local and EC2-based environments

Description

We're getting sporadic "Timed out while waiting to write" exceptions, with no discernible pattern to when they occur.

The exception is thrown at line 94 of ChannelPool.scala, when request.timestamp is more than writeTimeoutMillis old. However, we explicitly configure NetworkClientConfig as shown below, raising writeTimeoutMillis to 1500 ms from the default of 150 ms.

This occurs when only one Norbert-based client is accessing the system, under no load. How could it take a request more than 1500 ms to clear the Norbert networking layer (which I assume is buffered)?

Is this a bug or operator error?

NetworkClient configuration
---------------------------------------------------------------------

private NetworkClientConfig newConfig(String serviceName)
{
    NetworkClientConfig config = new NetworkClientConfig();
    config.setServiceName(serviceName);
    config.setZooKeeperConnectString(envDef.getZookeeper().getComputedConnectionStr());
    config.setZooKeeperSessionTimeoutMillis(30000);

    config.setConnectTimeoutMillis(1000);

    // config.setWriteTimeoutMillis(150);
    config.setWriteTimeoutMillis(1500);

    // config.setMaxConnectionsPerNode(5);
    config.setMaxConnectionsPerNode(50);

    log.debug("Returning new Norbert NetworkClientConfig with writeTimeoutMillis: " + config.getWriteTimeoutMillis()
            + " maxConnectionsPerNode: " + config.getMaxConnectionsPerNode());

    config.setStaleRequestTimeoutMins(10);
    config.setStaleRequestCleanupFrequencyMins(10);
    return config;
}
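
For reference, here is my mental model of the path that throws this, pieced together from the stack trace rather than from Norbert's actual ChannelPool code (the class and field names below are made up): requests wait in a per-node queue until a pooled connection becomes available, each one stamped with the time it was enqueued, and when a connection is checked back in, any queued request older than writeTimeoutMillis is failed instead of written.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeoutException;

// Conceptual sketch only -- not Norbert's implementation. Illustrates why a request
// can time out "while waiting to write": it sat in the write queue longer than writeTimeoutMillis.
class WriteQueueSketch
{
    static class PendingRequest
    {
        final long timestamp = System.currentTimeMillis(); // stamped when the request is queued
    }

    private final Queue<PendingRequest> waitingWrites = new ArrayDeque<PendingRequest>();
    private final long writeTimeoutMillis;

    WriteQueueSketch(long writeTimeoutMillis)
    {
        this.writeTimeoutMillis = writeTimeoutMillis;
    }

    void enqueue(PendingRequest request)
    {
        waitingWrites.offer(request);
    }

    // Called when a connection becomes available again. A request that has waited
    // longer than writeTimeoutMillis is failed rather than written to the channel.
    void drainOneOnChannelAvailable() throws TimeoutException
    {
        PendingRequest request = waitingWrites.poll();
        if (request == null)
        {
            return;
        }
        if (System.currentTimeMillis() - request.timestamp > writeTimeoutMillis)
        {
            throw new TimeoutException("Timed out while waiting to write");
        }
        // otherwise the request would be written to the now-free channel here
    }
}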

Here's the stacktrace:
---------------------------------------------------------------------

Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:103)
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:103)
at scala.Either.fold(Either.scala:41)
at com.linkedin.norbert.network.common.ResponseHelper$class.translateResponse(NorbertFuture.scala:103)
at com.linkedin.norbert.network.common.FutureAdapter.translateResponse(NorbertFuture.scala:31)
at com.linkedin.norbert.network.common.FutureAdapter.get(NorbertFuture.scala:48)
at com.yakr.athena.zeus.client.ZeusClient.getOrCreateContext(ZeusClient.java:27)
... 31 more
Caused by: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.netty.ChannelPool.com$linkedin$norbert$network$netty$ChannelPool$$checkinChannel(ChannelPool.scala:94)
at com.linkedin.norbert.network.netty.ChannelPool$$anon$2.operationComplete(ChannelPool.scala:134)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:381)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:372)
at org.jboss.netty.channel.DefaultChannelFuture.setSuccess(DefaultChannelFuture.java:316)
at org.jboss.netty.channel.socket.nio.NioWorker$RegisterTask.run(NioWorker.java:776)
at org.jboss.netty.channel.socket.nio.NioWorker.processRegisterTaskQueue(NioWorker.java:257)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:199)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

Environment

  • Mac Pro / OS X 10.5, and EC2 instances running vanilla Amazon Linux (in the US-West region, where the cloud hasn't rained yet)

  • JDK 1.6 in both cases

Activity

Denis K
April 27, 2011, 1:37 AM

Just updated to master and got a similar failure, albeit with a different stacktrace (also available at http://pastie.org/private/xflwvvxgkzm5bsj1t1wt8a):

Thoughts?

java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:181)
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:181)
at scala.Either.fold(Either.scala:41)
at com.linkedin.norbert.network.common.ResponseHelper$class.translateResponse(NorbertFuture.scala:181)
at com.linkedin.norbert.network.common.FutureAdapter.translateResponse(NorbertFuture.scala:32)
at com.linkedin.norbert.network.common.FutureAdapter.get(NorbertFuture.scala:49)
at com.yakr.netq.client.NetQClient2.peekAllAtHead(NetQClient2.java:69)
at com.yakr.eos.data.managers.NotificationManager2.peekNotifications(NotificationManager2.java:70)
at com.yakr.eos.eosPager.NotificationCenter.fetchNotifications(NotificationCenter.java:50)
at com.yakr.eos.eosPager.NotificationCenter.getOrSuspend(NotificationCenter.java:34)
at com.yakr.eos.entryPoints.services.NotificationService.getNotifications(NotificationService.java:24)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.yakr.serverCommon.ice.common.contentHandlers.amf.AMFWebHandler.invokeSreviceMethod(AMFWebHandler.java:199)
at com.yakr.serverCommon.ice.common.contentHandlers.amf.AMFWebHandler.serviceRequest(AMFWebHandler.java:85)
at com.yakr.serverCommon.ice.common.contentHandlers.amf.AMFWebHandler.service(AMFWebHandler.java:74)
at com.yakr.serverCommon.ice.common.contentHandlers.WebHandlerInvoker.invokeAll(WebHandlerInvoker.java:51)
at com.yakr.serverCommon.ice.nettyImpl.web.NettyWebProtocolHandler.messageReceived(NettyWebProtocolHandler.java:49)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:754)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:302)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.unfoldAndfireMessageReceived(ReplayingDecoder.java:513)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:497)
at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:434)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:545)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:540)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:274)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:261)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at com.yakr.serverCommon.ice.nettyImpl.EndpointCtx$1$1.run(EndpointCtx.java:54)
at java.lang.Thread.run(Thread.java:637)
Caused by: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.netty.ChannelPool.com$linkedin$norbert$network$netty$ChannelPool$$checkinChannel(ChannelPool.scala:94)
at com.linkedin.norbert.network.netty.ChannelPool$$anon$2.operationComplete(ChannelPool.scala:134)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:381)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:372)
at org.jboss.netty.channel.DefaultChannelFuture.setSuccess(DefaultChannelFuture.java:316)
at org.jboss.netty.channel.socket.nio.NioWorker$RegisterTask.run(NioWorker.java:776)
at org.jboss.netty.channel.socket.nio.NioWorker.processRegisterTaskQueue(NioWorker.java:257)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:199)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more

Denis K
April 27, 2011, 7:54 AM

More of the same issue, slightly different stacktrace: http://pastie.org/private/jqbk6adzhmekadwxf5etfq

java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:181)
at com.linkedin.norbert.network.common.ResponseHelper$$anonfun$translateResponse$1.apply(NorbertFuture.scala:181)
at scala.Either.fold(Either.scala:41)
at com.linkedin.norbert.network.common.ResponseHelper$class.translateResponse(NorbertFuture.scala:181)
at com.linkedin.norbert.network.common.FutureAdapter.translateResponse(NorbertFuture.scala:32)
at com.linkedin.norbert.network.common.FutureAdapter.get(NorbertFuture.scala:49)
at com.yakr.pluto.client.PlutoClient2.removeMemberFromRoster(PlutoClient2.java:121)
at com.yakr.eos.daemons.EosClientExpiryDaemon.removeClientFromAllRoomRosters(EosClientExpiryDaemon.java:74)
at com.yakr.eos.daemons.EosClientExpiryDaemon.access$000(EosClientExpiryDaemon.java:18)
at com.yakr.eos.daemons.EosClientExpiryDaemon$1.onClientExpired(EosClientExpiryDaemon.java:53)
at com.yakr.serverCommon.ice.common.userSessions.IceUserClientManager$1.invoke(IceUserClientManager.java:56)
at com.yakr.serverCommon.ice.common.userSessions.IceUserClientManager$1.invoke(IceUserClientManager.java:52)
at com.yakstar.events.ListenerSupport$1.fire(Unknown Source)
at com.yakr.serverCommon.ice.common.userSessions.IceUserClientManager.fireClientExpiredEvent(IceUserClientManager.java:62)
at com.yakr.serverCommon.ice.common.userSessions.IceUserClientExpiryDaemon$ExpireClientsTask.run(IceUserClientExpiryDaemon.java:118)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
Caused by: java.util.concurrent.TimeoutException: Timed out while waiting to write
at com.linkedin.norbert.network.netty.ChannelPool.com$linkedin$norbert$network$netty$ChannelPool$$checkinChannel(ChannelPool.scala:94)
at com.linkedin.norbert.network.netty.ChannelPool$$anon$2.operationComplete(ChannelPool.scala:134)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:381)
at org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:372)
at org.jboss.netty.channel.DefaultChannelFuture.setSuccess(DefaultChannelFuture.java:316)
at org.jboss.netty.channel.socket.nio.NioWorker$RegisterTask.run(NioWorker.java:776)
at org.jboss.netty.channel.socket.nio.NioWorker.processRegisterTaskQueue(NioWorker.java:257)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:199)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:637)

Joshua Hartman
April 27, 2011, 7:13 PM

Can you turn on debug logging? I just added a few more statements as well that might help us figure out what's going on.
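
If you're on log4j (the log.debug call in your config method suggests you might be), something along these lines before the client is created should surface the new statements; a sketch only, adjust for whatever logging backend you actually use:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

// Raise everything under com.linkedin.norbert to DEBUG programmatically.
// The log4j.properties equivalent would be: log4j.logger.com.linkedin.norbert=DEBUG
public final class NorbertDebugLogging
{
    public static void enable()
    {
        Logger.getLogger("com.linkedin.norbert").setLevel(Level.DEBUG);
    }
}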

Does the problem happen on the first write, and does it clear itself after that?

Denis K
April 28, 2011, 12:53 AM

The problem does occur upon startup, and then doesn't seem to appear again. This is the case in the local environment.

I've seen it occur several times on EC2 as well, but it's tougher to say whether that was the first request or after a warm-up period.

Let me turn on logging and see if I can pull in some more information.
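
In the meantime, since it seems confined to warm-up, I may paper over it on our side by retrying calls that fail with this timeout; a rough sketch of what I have in mind (the helper and retry count below are ours, not anything Norbert provides):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeoutException;

// Stopgap idea: re-issue a request a couple of times if it fails with the
// "Timed out while waiting to write" TimeoutException during warm-up.
public final class WriteTimeoutRetry
{
    // assumes attempts >= 1
    public static <T> T getWithRetry(Callable<Future<T>> sendRequest, int attempts) throws Exception
    {
        ExecutionException lastFailure = null;
        for (int i = 0; i < attempts; i++)
        {
            try
            {
                return sendRequest.call().get();
            }
            catch (ExecutionException e)
            {
                if (e.getCause() instanceof TimeoutException)
                {
                    lastFailure = e; // likely the write timeout; try again
                    continue;
                }
                throw e;
            }
        }
        throw lastFailure;
    }
}

Obviously that only hides the symptom, so I'd still like to understand why the first write can take more than 1500 ms in the first place.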

Joshua Hartman
May 10, 2011, 8:53 PM

Any data? I've been out of town a lot recently so I haven't been able to follow up.

Assignee

Joshua Hartman

Reporter

Denis K

Time tracking

48h

Due date

2011/04/29

Priority

Major