I just read the article about a Microsoft data center shutting down after overheating because of a failure in the cooling system. This was the bit I found interesting:

"Engineers have isolated an issue with cooling in one part of the data center, which caused a localized spike in temperature, as the preliminary root-cause."

The root cause is not the immediate cause; it is the ultimate cause behind all the other causes! Confused? Imagine an aircraft takes off and an engine falls off. The root cause is not that the engine fell off. An investigation would assume the engine should not fall off and ask why it did.

Maybe the bolts were not made properly or fitted properly. Maybe the design process for the mountings didn't take the correct temperatures into account. A real root cause might be (real example!): engines were fitted using incorrect equipment, which damaged the mountings, and this happened because of a poor working environment at an airline where speed was being pursued over correctness.

So a root-cause investigation of the outage needs to ask why a failure in the cooling system, which is something that happens, was not noticed, corrected, mitigated, or covered by redundancy. The answer to that question would be the root cause.

Rant over.