NetDAS Crash Recovery

The two main techniques for remote recovery from software crashes are watchdog timers (WDT) and Internet-controlled power switches. Every system will eventually crash, so crash recovery is mandatory. One low-cost approach is a remote AC power switch controlled via a web page. When the system stops working, the operator logs on to the web page and presses the power-cycle button. This cycles power, clears the fault, and restores normal operation. The disadvantage is that the system may be down for some time before anyone notices that it has stopped working.

A better approach is a WDT circuit. Under normal operation, the system produces a periodic pulse that "kicks the dog" and keeps the system powered up. If the system crashes, the WD pulse stops, the dog is not kicked, and the WD circuitry power-cycles the system. In the case of NetDAS, the WD interval is about three minutes, so the longest outage will be about three minutes. No operator intervention is necessary, resulting in practically non-stop operation.
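
The timing behaviour described above can be illustrated with a small software simulation. This is only a sketch: the NetDAS watchdog is a hardware circuit, and the names WatchdogTimer and simulate, the 60-second kick period, and the one-second time step are assumptions made for illustration. Only the roughly three-minute (180-second) interval comes from the text.

```python
class WatchdogTimer:
    """Simulated watchdog: triggers a power cycle if not kicked in time."""

    def __init__(self, interval_s=180, now=0):
        self.interval_s = interval_s  # WD interval (about three minutes)
        self.last_kick = now          # time of the most recent kick
        self.power_cycles = 0         # number of power cycles triggered

    def kick(self, now):
        """Record a pulse from the healthy system."""
        self.last_kick = now

    def poll(self, now):
        """Return True if the interval has elapsed without a kick."""
        if now - self.last_kick >= self.interval_s:
            self.power_cycles += 1
            self.last_kick = now  # the timer restarts after the power cycle
            return True
        return False


def simulate(crash_at_s, kick_period_s=60, interval_s=180, end_s=600):
    """Step time in whole seconds; return when the watchdog fires, or None.

    The simulated system kicks the watchdog every kick_period_s seconds
    (a hypothetical value) until it crashes at crash_at_s.
    """
    wdt = WatchdogTimer(interval_s)
    next_kick = 0
    for t in range(end_s + 1):
        if t < crash_at_s and t >= next_kick:
            wdt.kick(t)
            next_kick += kick_period_s
        if wdt.poll(t):
            return t
    return None


if __name__ == "__main__":
    fired_at = simulate(crash_at_s=100)
    print("crash at 100 s, power cycle at", fired_at, "s")
    print("healthy run:", simulate(crash_at_s=10**9))
```

In this sketch a crash at 100 seconds is followed by a power cycle at 240 seconds, that is, 180 seconds after the last kick at 60 seconds. The outage is therefore never longer than the WD interval, and a healthy system that keeps kicking is never power-cycled.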

What causes system crashes? In new systems, the cause is often software bugs. In mature systems like NetDAS, the cause is static discharge, which corrupts the USB stack. Once the USB stack is corrupted, a software reboot is not sufficient to restore it; only a full power-cycle can clear the error. (This is a well-known USB issue and has nothing to do with NetDAS.) So why not just cut USB power? Unfortunately, at this time we are not aware of any Linux utility that can cycle USB power.