We have a Cassandra cluster with 24 servers running version 1.2. A few weeks ago, two of the servers started crashing every day, always at the same time. They show as down in the nodetool status output, and the only way to bring them back is to reboot both servers. On one of them we can stop the Cassandra service and then reboot. On the other, the service does not stop until we force a reboot.
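For anyone wanting to track when the nodes flip to down, here is a minimal sketch of picking the down nodes out of nodetool status text. The function name and the sample text are illustrative assumptions, not output from our cluster; it only relies on node rows starting with a two-letter code where the first letter is U or D.

```python
def down_nodes(status_output):
    """Return the addresses of rows whose status letter is D (down)."""
    down = []
    for line in status_output.splitlines():
        fields = line.split()
        # Node rows start with a two-letter code:
        # U/D (up/down) followed by N/L/J/M (normal/leaving/joining/moving).
        if len(fields) < 2 or len(fields[0]) != 2:
            continue
        status, state = fields[0][0], fields[0][1]
        if status in "UD" and state in "NLJM" and status == "D":
            down.append(fields[1])
    return down


# Illustrative sample only (hypothetical addresses and host IDs).
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID  Rack
UN  10.0.0.1   120.5 GB  256     4.2%   aaaa     rack1
DN  10.0.0.2   118.9 GB  256     4.1%   bbbb     rack1
DN  10.0.0.3   121.3 GB  256     4.3%   cccc     rack1
"""

print(down_nodes(SAMPLE))
```

Running this from cron every minute and logging the result with a timestamp would show how close to the same minute the two nodes go down each day.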
After the reboot they work normally until the next day, when they crash at exactly the same time. We analysed the logs for errors, and the main problem seems to be heap usage exceeding the threshold. The servers have 32GB of memory and the heap is set to 22GB. All the other servers in the cluster have the same memory and heap size and show no problems whatsoever.
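To put the memory figures in perspective, this small sketch works out how much RAM is left outside the JVM heap with the sizes above. The helper name is just illustrative, and the calculation ignores whatever the JVM itself uses beyond the heap.

```python
def heap_headroom(total_gb, heap_gb):
    """Return (GB left outside the heap, heap share of total RAM as a fraction)."""
    return total_gb - heap_gb, heap_gb / total_gb


outside, share = heap_headroom(32, 22)
print(f"{outside} GB outside the heap, heap is {share:.0%} of RAM")
# prints: 10 GB outside the heap, heap is 69% of RAM
```

So the heap takes roughly two thirds of the machine, and about 10GB is left for the operating system, the page cache and any off-heap allocations.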
We have checked that the repair and compaction processes run without any errors. We also noticed that just before the crash, gossip starts reporting that some servers intermittently fail to respond to the handshake and then start responding again. They flip between DOWN and UP until these two servers crash.
If we reboot the servers before the time they usually crash, they don't crash again until the next day.
As a workaround we have set up a script that reboots the servers before the time they crash.
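The scheduling side of that workaround can be sketched as follows. This is not our actual script: the crash time of 03:00, the 30 minute lead and the function name are hypothetical placeholders, and the sketch only builds a crontab line that fires a fixed number of minutes before a given daily crash time.

```python
from datetime import datetime, timedelta


def restart_cron_line(crash_hhmm, lead_minutes, command):
    """Build a daily crontab entry that runs `command` lead_minutes before crash_hhmm."""
    crash = datetime.strptime(crash_hhmm, "%H:%M")
    # Subtracting a timedelta handles wrap-around past midnight correctly.
    restart = crash - timedelta(minutes=lead_minutes)
    return f"{restart.minute} {restart.hour} * * * {command}"


# Hypothetical values: crash at 03:00, reboot 30 minutes earlier.
print(restart_cron_line("03:00", 30, "/sbin/shutdown -r now"))
# prints: 30 2 * * * /sbin/shutdown -r now
```

Staggering the lead time between the two servers would avoid taking both of them out of the ring at the same moment.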
We are running out of options about what might be causing this problem on these two servers.
Any help would be much appreciated! Thanks in advance.