2026-02-02 - 2026-02-22-DDIA_5


chp 2 - section 3:

  • reliability - the system keeps working correctly even when something goes wrong, instead of grinding to a halt
  • a fault - one part of the system stops working correctly
  • a failure - the system as a whole stops working
  • SPOF - single point of failure
  • chaos engineering - deliberately injecting faults to test fault tolerance
  • tolerate faults instead of trying to prevent them all
  • hardware faults:
    • increase hardware fault tolerance through redundancy - so that if one hard drive/RAM module/whatever fails, there are others already in place to do the work
    • if your system is a single node - it will have to come down for upgrades and patches
    • if your system is multi-node - you can do rolling upgrades, patching one node at a time while the others keep serving
  • software faults: a bug in the software is present on every node it's deployed to, so the same fault can hit all of them at once
  • sociotechnical: interplay between social and technical aspects in complex systems.
  • humans:
    • make mistakes
    • blaming them is counter-productive
    • design systems and their configuration interfaces to encourage doing the right thing, the same way you would design for customers
    • preventable mistakes should be seen as a problem with the system that allowed them, not with the human, who may have been rushed and overwhelmed at the time
  • unreliable software has far-reaching consequences beyond the app being down and loss of productivity; people have been wrongly sent to jail due to bugs
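
To make the fault vs. failure distinction and the redundancy point above concrete, here is a minimal sketch, assuming a toy in-memory key-value store with three replicas. The names `Replica`, `ReplicatedStore`, and `demo` are made up for illustration and are not from the book. Flipping `healthy` to `False` stands in for a chaos-engineering style injected fault; recovery and resyncing of a replica that missed writes are not modelled.

```python
class ReplicaDown(Exception):
    """One replica cannot serve a request: a fault."""


class SystemFailure(Exception):
    """No replica can serve a request: a failure of the whole system."""


class Replica:
    """Hypothetical single node holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.healthy = True  # set to False to inject a fault

    def put(self, key, value):
        if not self.healthy:
            raise ReplicaDown(self.name)
        self.data[key] = value

    def get(self, key):
        if not self.healthy:
            raise ReplicaDown(self.name)
        return self.data[key]


class ReplicatedStore:
    """Hypothetical front end that tolerates faults in individual replicas."""

    def __init__(self, replicas):
        self.replicas = replicas

    def put(self, key, value):
        written = 0
        for replica in self.replicas:
            try:
                replica.put(key, value)
                written += 1
            except ReplicaDown:
                continue  # fault tolerated: other replicas may still accept the write
        if written == 0:
            raise SystemFailure("no replica accepted the write")

    def get(self, key):
        for replica in self.replicas:
            try:
                return replica.get(key)
            except ReplicaDown:
                continue  # fault tolerated: try the next replica
        raise SystemFailure("no replica could serve the read")


def demo():
    replicas = [Replica("a"), Replica("b"), Replica("c")]
    store = ReplicatedStore(replicas)
    store.put("k", "v")

    # Inject a fault on purpose: one replica goes down, reads still work.
    replicas[0].healthy = False
    tolerated = store.get("k")

    # Take every replica down: redundancy is exhausted, so the system fails.
    for replica in replicas:
        replica.healthy = False
    try:
        store.get("k")
        failed = False
    except SystemFailure:
        failed = True

    return tolerated, failed


if __name__ == "__main__":
    print(demo())
```

With one replica down the read is still answered by another copy, so the fault never becomes a failure. Only when all three are down does the caller see `SystemFailure`, which is the single-node situation the notes describe.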