Friday, 10 June 2011

Why ec2-ami-tools - ec2-upload-bundle - hangs

Me and my colleagues had been fighting an issue for some days. Since it made me really-really f**** upset I decided to investigate it now.

The issue: is that if you are behind a proxy (not necessarily) - http or socks does not matter much - you might have problem with ec2-ami-tools. When you start ec2-upload-bundle it does nothing just looks like hanging. We had been waiting for some minutes but nothing happened. Actually it is trying to do something but can not start upload.
In contrast on my home network ec2-upload-bundle almost immediately starts uploading and showing a message to indicate it.
So what is wrong? Why this is happening from behind a proxy? Is it proxy related?

We have tested many possible workarounds without any luck.
First we have specified http_proxy and https_proxy of course but it did not work.
Then we also tried to socksify the process. Without any luck.
At that point we decided to eliminate the proxies. We got an access to a machine in DMZ and we made a dynamic port forwarding and tried to use that as a socks proxy server. And that was really surprising because even that did not work.
Everything else - Firefox, ssh - was working fine but not ec2-ami-tools.
Here we ran out of the ideas for a while.

But  of course problem happens to be solved.

So I made a tcpdump, had a look and saw that the process is trying to access a special IP address 169.254.169.254 . This is a link-local IP address and this is used by amazon within the cloud to fetch some instance specific data. It is unclear why ec2-upload-bundle is trying to reach it. It is much more unclear how the program is written to be able to hang on it. I mean it is also visible that it makes a retry before the TCP timeout (0,3,9 sec) but than it gives up and hangs.

Workaround: to the problem can be to configure this 169.254.169.254 IP on the ssh server or http/socks proxy server. Also you can try somehow locally - with a firewall - reject these connections. But you have to prevent the connection to timeout since if the connection times out ec2-upload-bundle will hang. See the note for an alternative workaround.

Note: This is working on my home machine because normally my Ubuntu Linux has a 169.254.0.0/16 route. And this is enough to prevent the timeout. But only if you are not using proxy. If proxy is in between than your own routing is not used to reach this IP but the route of the proxy is used. So you have to add this route to the proxy/ssh server if you can.

route add -net 169.254.0.0 netmask 255.255.0.0 dev eth0 # Replace eth0 with your interface.


Comment: I have less motivation now to investigate it further but probably I will check what is wrong in the code and why ec2-upload-bundle hangs in such a case.

Update 1: I have checked the source code of ec2-ami-tools and it seems the error is in instance-data.rb. The initialize method is checking if the meta data is available or not and set @instance_data_accessible according. But there is no timeout defined here so you have to wait ... It seems even the TCP timeout is not realized by the open method. A Timeout::timeout(10) { } can solve this error.

Update 2: Later in instance-data.rb the read_meta_data(path) method should call the open() method only if @instance_data_accessible is true. But this is not the case. I do not know ruby at all but this seems to be a coding error. I have just added some print command to see what was going there but do not understand the code really. Besides the open() methods here also hanging indefinitely. So this is a complete hang here.