

A fact of life, one that almost every computer user has to face at some point, is that file systems fail. Whether it is for an unknown reason, usually explained to managers as alpha particles flying around the data center, or for a more mundane (and far more likely) reason, a software bug, users do not usually enjoy losing their data.

This is why file system developers put a huge effort not only into testing their code, but also into developing tools to recover volumes when they fail.

Fsck, the file system check and repair tool, is usually run by an administrator when they suspect the volume is corrupted, sometimes following a mount command that failed. In fact, all persistent file systems deployed in production are accompanied by check and repair tools, usually exposed through the fsck front-end. Indeed, fsck is quite efficient at recovering several file systems from errors, but it sometimes requires taking the file system offline and either walking through the disk to check for errors or poking the superblock for an error status. It is also run every few boots in almost every distribution, through the systemd-fsck service or equivalent logic. Some file systems even go a step further with online repair tools.

Fsck is not, however, the right tool for monitoring the health of a file system in real time, raising alarms and sirens when a problem is detected. That kind of real-time monitoring is quite important for ensuring data consistency and availability in data centers: once one needs to watch over a large number of machines, as in a cloud provider with hundreds of them, a reliable monitoring tool is essential. In fact, it is essential that administrators or recovery daemons be notified as soon as an error occurs, so that they can start emergency recovery procedures, like kicking off a backup, rebuilding a RAID array, replacing a disk, or maybe just running fsck.

The problem is that Linux didn't really expose a good interface to notify applications when a file system error happened. There wasn't much beyond the error code returned to the application that executed the failed operation, which says little about the cause of the error and is of little use to a health-monitoring application. Therefore, the approach taken by existing monitoring tools was either to watch the kernel log, which is a risky business since older messages might be overwritten by newer ones, or to query file-system-specific sysfs files that record the last error.
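The sysfs-polling approach mentioned above can be sketched in a few lines. This is a minimal illustration, assuming an ext4 volume: ext4 exposes per-volume attributes such as errors_count, first_error_time and last_error_time under /sys/fs/ext4/<device>/. Other file systems expose different attributes (or none at all), and the device name, function names, and polling interval here are just examples.

```python
import time
from pathlib import Path

# ext4-specific location; other file systems lay sysfs out differently.
SYSFS_EXT4 = Path("/sys/fs/ext4")

def read_error_count(device: str) -> int:
    """Return ext4's cumulative error counter for a volume, or 0 if absent."""
    attr = SYSFS_EXT4 / device / "errors_count"
    try:
        return int(attr.read_text())
    except (FileNotFoundError, ValueError):
        # Volume not mounted as ext4, or attribute missing on this kernel.
        return 0

def watch(device: str, interval: float = 5.0, rounds: int = 3) -> int:
    """Naive polling loop: report whenever the error counter increases.

    Returns the last counter value seen. Note the inherent weakness of
    polling: errors are only noticed up to `interval` seconds late, and
    the counter says nothing about *what* failed or *where*.
    """
    baseline = read_error_count(device)
    for _ in range(rounds):
        time.sleep(interval)
        current = read_error_count(device)
        if current > baseline:
            print(f"{device}: {current - baseline} new file system error(s)")
            baseline = current
    return baseline

# Example usage (hypothetical device name):
#   watch("sda1")   # polls /sys/fs/ext4/sda1/errors_count
```

The comments in `watch` hint at why this approach is unsatisfying for the data-center scenario described above: the information arrives late, is coarse-grained, and has to be wired up separately for every file system type.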
