Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
For software, reliability typically means "continuing to work correctly, even when things go wrong."
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
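The fault/failure distinction can be made concrete with a small simulation: individual components are deliberately made faulty, and the system masks those faults by retrying on another replica. This is only an illustrative sketch; `flaky_replica` and `fault_tolerant_call` are hypothetical names, not from any real library.

```python
import random

def flaky_replica(x):
    # A fault: this one component deviates from its spec,
    # crashing on roughly half of all calls.
    if random.random() < 0.5:
        raise RuntimeError("replica crashed")
    return x * 2

def fault_tolerant_call(x, replicas=5):
    # The system as a whole copes with faults by retrying on
    # another replica, so the user still gets an answer.
    for _ in range(replicas):
        try:
            return flaky_replica(x)
        except RuntimeError:
            continue
    # Only if every replica is faulty at once does the fault
    # escalate into a system-level failure.
    raise RuntimeError("all replicas failed")
```

With five independent replicas that each fail half the time, the whole call fails only when all five fault together (probability about 0.5^5, roughly 3%), which is the essence of tolerating faults to prevent failures.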
Hardware faults happen all the time: on a storage cluster with 10,000 disks, one should expect on average one disk to die per day.
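That "one per day" figure follows from a back-of-the-envelope calculation. Assuming a mean time to failure (MTTF) of about 30 years per disk (the exact MTTF is an assumption here; published figures vary), the expected daily failure count across the cluster is:

```python
# Expected disk failures per day on a 10,000-disk cluster,
# assuming each disk has an MTTF of roughly 30 years.
disks = 10_000
mttf_years = 30
mttf_days = mttf_years * 365

expected_failures_per_day = disks / mttf_days
print(f"{expected_failures_per_day:.2f} failures/day")
```

This works out to roughly 0.9 failures per day, i.e. about one dead disk daily, purely from independent random hardware faults.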
One approach is to add redundancy to the individual hardware components: RAID configurations for disks, dual power supplies and hot-swappable CPUs for servers, and batteries and diesel generators for backup power in datacenters.
However, as data volumes and computing demands increase, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults and makes pure hardware redundancy harder to rely on. Hence there is a move toward systems that can tolerate the loss of entire machines, using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. Such a system also has operational advantages: it can be patched one node at a time, without downtime of the entire system.
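The one-node-at-a-time patching idea (a rolling upgrade) can be sketched as a simple loop. This is a schematic outline, not a real orchestration API; `drain`, `patch`, and `rejoin` are hypothetical stubs standing in for whatever mechanism the cluster actually uses.

```python
patched = []

def drain(node):
    # Stop routing new requests to this node (stub for illustration).
    pass

def patch(node):
    # Install the update while the node is out of rotation.
    patched.append(node)

def rejoin(node):
    # Put the updated node back into service (stub for illustration).
    pass

nodes = ["node-1", "node-2", "node-3"]
for node in nodes:
    drain(node)   # the remaining nodes keep serving traffic meanwhile
    patch(node)
    rejoin(node)
```

Because the system tolerates the loss of any single machine, each node can safely disappear for the duration of its patch while the rest of the cluster continues to provide service.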
In addition to hardware faults, which occur randomly and independently of one another, there can also be systematic errors within the system. For instance: