Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
For software, reliability typically means "continuing to work correctly, even when things go wrong."
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
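The fault/failure distinction can be made concrete with a small simulation: individual components are deliberately made faulty, and the system masks those faults by retrying on another replica. This is only an illustrative sketch; `flaky_replica` and `fault_tolerant_call` are hypothetical names, not from any real library.

```python
import random

def flaky_replica(x):
    # A fault: this one component deviates from its spec,
    # crashing on roughly half of all calls.
    if random.random() < 0.5:
        raise RuntimeError("replica crashed")
    return x * 2

def fault_tolerant_call(x, replicas=5):
    # The system as a whole copes with faults by retrying on
    # another replica, so the user still gets an answer.
    for _ in range(replicas):
        try:
            return flaky_replica(x)
        except RuntimeError:
            continue
    # Only if every replica is faulty at once does the fault
    # escalate into a system-level failure.
    raise RuntimeError("all replicas failed")
```

With five independent replicas that each fail half the time, the whole call fails only when all five fault together (probability about 0.5^5, roughly 3%), which is the essence of tolerating faults to prevent failures.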
Hardware faults happen all the time: on a storage cluster with 10,000 disks, one should expect on average one disk to die per day.
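That "one per day" figure follows from a back-of-the-envelope calculation. Assuming a mean time to failure (MTTF) of about 30 years per disk (the exact MTTF is an assumption here; published figures vary), the expected daily failure count across the cluster is:

```python
# Expected disk failures per day on a 10,000-disk cluster,
# assuming each disk has an MTTF of roughly 30 years.
disks = 10_000
mttf_years = 30
mttf_days = mttf_years * 365

expected_failures_per_day = disks / mttf_days
print(f"{expected_failures_per_day:.2f} failures/day")
```

This works out to roughly 0.9 failures per day, i.e. about one dead disk daily, purely from independent random hardware faults.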
One approach is to add redundancy to the individual hardware components: RAID configurations for disks, dual power supplies and hot-swappable CPUs for servers, and batteries and diesel generators for backup power in datacenters.
However, as data volumes and computing demands increase, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults and makes pure hardware redundancy harder to rely on. Hence there is a move toward systems that can tolerate the loss of entire machines, using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. Such a system also has operational advantages: it can be patched one node at a time, without downtime of the entire system.
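The one-node-at-a-time patching idea (a rolling upgrade) can be sketched as a simple loop. This is a schematic outline, not a real orchestration API; `drain`, `patch`, and `rejoin` are hypothetical stubs standing in for whatever mechanism the cluster actually uses.

```python
patched = []

def drain(node):
    # Stop routing new requests to this node (stub for illustration).
    pass

def patch(node):
    # Install the update while the node is out of rotation.
    patched.append(node)

def rejoin(node):
    # Put the updated node back into service (stub for illustration).
    pass

nodes = ["node-1", "node-2", "node-3"]
for node in nodes:
    drain(node)   # the remaining nodes keep serving traffic meanwhile
    patch(node)
    rejoin(node)
```

Because the system tolerates the loss of any single machine, each node can safely disappear for the duration of its patch while the rest of the cluster continues to provide service.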
In addition to hardware faults, which occur randomly and independently of one another, there can also be systematic errors within the system. For instance: