elcarOnOsdnaH: June 2017

Monday, June 26, 2017

DSE Cassandra Node Failed To Start Post OS Upgrade

One of the nodes in our production Cassandra cluster refused to start after OS patching, failing with the following error.

ERROR [main] 2017-06-25 18:09:40,906 CassandraDaemon.java:709 - Exception encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_66]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_66]
at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_66]
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_66]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_66]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_66]
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_66]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_66]
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_66]
at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[cassandra-all-3.0.8.1293.jar:3.0.8.1293]
at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.jav

Upon looking around, we found that this error matches a bug reported in the following JIRA ticket.

CASSANDRA-12728: Handling partially written hint files

Cause:
Corrupted hint files on the node caused Cassandra to go into a failure loop at startup. This could have happened for any of the following reasons.

1. The node was rebooted before the service was shut down properly.
2. The service went down abruptly while writing hints.
3. The node rebooted due to a power failure.

Since the cause of the issue was corrupted hint files, we needed to clean up the hints for the node and then try to restart it.
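The cleanup can be sketched roughly as below. This is a minimal illustration, not the exact procedure we ran: it assumes the node is stopped, that hint files end in .hints with matching .crc32 checksum files, and that the hints directory is the one set by hints_directory in cassandra.yaml. The function name quarantine_hints and both paths in the usage note are hypothetical. It moves the files aside instead of deleting them, so they can still be inspected later.

```python
import shutil
from pathlib import Path


def quarantine_hints(hints_dir, backup_dir):
    """Move hint files and their checksum files out of hints_dir.

    Run this only while Cassandra is stopped on the node.
    Returns the sorted list of file names that were moved.
    """
    hints_dir = Path(hints_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    # sorted() materializes the listing before any file is moved
    for path in sorted(hints_dir.iterdir()):
        # Assumption: hint data is in *.hints, checksums in *.crc32
        if path.is_file() and path.suffix in (".hints", ".crc32"):
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved
```

For example, quarantine_hints("/var/lib/cassandra/hints", "/var/tmp/hints_backup") would empty the hints directory on a node that uses that location; check your own cassandra.yaml before relying on the path. Any hints that are moved away are never replayed, which is why the repair step matters afterwards.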

After that, the node started fine. Also, since the node had been down and its pending hints were discarded, it is imperative to run a repair on the node to make sure the data is consistent.
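As a rough sketch, the repair can be kicked off from a small wrapper around nodetool. The helper names build_repair_command and run_repair and the keyspace name in the usage note are hypothetical, and the sketch assumes nodetool is on the PATH of the node being repaired.

```python
import subprocess


def build_repair_command(keyspace=None, full=False):
    """Build the nodetool repair command line as a list of arguments."""
    cmd = ["nodetool", "repair"]
    if full:
        # Request a full repair instead of an incremental one
        cmd.append("-full")
    if keyspace:
        # Limit the repair to one keyspace; omit to repair all keyspaces
        cmd.append(keyspace)
    return cmd


def run_repair(keyspace=None, full=False):
    """Run the repair on the local node; raises CalledProcessError on failure."""
    return subprocess.run(build_repair_command(keyspace, full), check=True)
```

Calling run_repair("my_keyspace", full=True) on the recovered node would run nodetool repair -full my_keyspace. Whether a full or incremental repair is appropriate depends on how repairs are normally run in your cluster.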

Hope that helps.