Disk power management


Power consumption has not been measured systematically, but on an idle system the battery definitely lasts longer, and older drives in particular make much less noise.

Ideally, a machine would consume energy only when it is doing something that helps me get my work done. In practice, many programs run in the background all the time, and most of them are not relevant to what I am doing.

This is a quick list of settings for GNU/Linux systems related to disk drives.

Last access time

POSIX mandates that every access to a file is recorded in the filesystem itself, as the last access time (atime) of the file. Thus every read operation turns into a write operation.

This is not only a performance nightmare but also problematic on filesystems that provide a snapshot, copy-on-write, and/or rollback functionality, like the btrfs filesystem.

There are currently 4 options:

  • Be strictly POSIX-compliant, and update the atime field every time.

  • Use relatime. With this option, introduced in the Linux 2.6 kernel series, the access time is updated only if the modification time (or status change time) is newer than the current access time, or if the access time is older than a predefined interval (24 hours in current kernels).

  • Use nodiratime to disable updating the access time for directories only. There are no known applications that would be affected by this change.

  • Use noatime and never update the access time value. Some programs (for example mbox, popcon and tmpreaper) might break.

Warning ⚠️
One often-cited example is mutt/mbox, even if in the meantime both mutt and neomutt have the necessary logic for working even if the filesystem is mounted with noatime.

In 2007, the Linux kernel developers discussed whether changing the default value is a good idea. I believe the status quo, for most distributions, is to use relatime. Anyone who wants to hurt themselves can use strictatime to be strictly POSIX-compliant.

Especially on older drives (older SSDs in particular), it is often recommended to use noatime, as it avoids a write for every read, which reduces wear and can improve performance.

Using noatime for drives mounted through /etc/fstab is easy. In most cases, it is probably sufficient to add noatime to all or most entries in /etc/fstab. For example, the entry for / with noatime might look like

# <file system>                            <mount point>  <type>  <options>                 <dump>  <pass>
UUID=ac6b52c8-8125-46c6-9611-38910b3e3612  /              ext4    defaults,noatime,discard  0       1

External drives are mounted dynamically; it is possible to cat /proc/mounts to see how all filesystems are mounted.

In my case, external drives are mounted with relatime:

/dev/sdc1 /media/fekir/back fuseblk rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other,blksize=4096 0 0

The syntax is the same as the one used by fstab, and it is clear that, contrary to the drives registered in /etc/fstab, external drives are using relatime.
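Reading the options by eye gets tedious with many mounts. The following is only a sketch of how the fourth field of /proc/mounts could be classified: the function names atime_policy and list_atime_policies are made up for this example, and it assumes that a mount showing neither noatime nor relatime uses strict atime updates. nodiratime is ignored here.

```shell
# Hypothetical helper: classify the access-time policy from a
# comma-separated list of mount options (4th field of /proc/mounts).
atime_policy() {
    case ",$1," in
        *,noatime,*)  echo noatime ;;
        *,relatime,*) echo relatime ;;
        # Assumption: no atime-related option listed means strict updates.
        *)            echo strictatime ;;
    esac
}

# Print the policy next to every mount point listed in a mounts file
# (defaults to /proc/mounts).
list_atime_policies() {
    while read -r _dev mountpoint _type opts _rest; do
        printf '%s\t%s\n' "$mountpoint" "$(atime_policy "$opts")"
    done < "${1:-/proc/mounts}"
}
```

Calling list_atime_policies without arguments reads /proc/mounts and prints one line per mount point.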

For testing purposes, it is possible to use mount -o remount, even for dynamically mounted drives:

sudo mount -o remount,nodiratime /media/fekir/usb1

But after seeing that everything works as expected even with external drives, how can noatime be used by default on those too?

Unfortunately, I found no official solution. A possible workaround (doable in my case as I have only a couple of external drives) is to

  • copy the output of /proc/mounts

  • replace relatime with noatime,nofail (nofail is necessary in case the external drive is not available)

  • replace /dev/sdc1 with UUID=<output of blkid>, in my case UUID=B076FED076FE95F6

  • add the resulting string at the bottom of /etc/fstab
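The text substitutions in these steps can be sketched as a small shell function. This is an illustration under assumptions, not a tested tool: the name mounts_line_to_fstab is made up, the UUID has to be looked up with blkid beforehand, and the filesystem type and all other options are passed through unchanged.

```shell
# Hypothetical helper: turn one line of /proc/mounts into an fstab entry.
#   $1: the line copied from /proc/mounts
#   $2: the filesystem UUID as printed by blkid
mounts_line_to_fstab() {
    printf '%s\n' "$1" | awk -v uuid="$2" '{
        # Replace the device name with a stable UUID reference.
        $1 = "UUID=" uuid
        # Replace relatime with noatime,nofail, keep all other options.
        n = split($4, opts, ",")
        out = ""
        for (i = 1; i <= n; i++) {
            if (opts[i] == "relatime") opts[i] = "noatime,nofail"
            out = out (i > 1 ? "," : "") opts[i]
        }
        $4 = out
        print
    }'
}
```

For example, mounts_line_to_fstab "$(grep /media/fekir/back /proc/mounts)" B076FED076FE95F6 prints the line to review. Appending it to /etc/fstab stays a manual step on purpose.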

At this point, it is possible to remove the external drive, reattach it to the machine, and verify that mounting still works and that the output of /proc/mounts changed accordingly.

Note 📝
In my case, it was also necessary to change fuseblk to ntfs, otherwise the drive could not be mounted correctly. The output of cat /proc/mounts still shows fuseblk instead of ntfs, but with noatime instead of relatime, as desired.

The main disadvantage is that this operation needs to be done for every external drive, and it requires administrator rights for editing /etc/fstab. After editing /etc/fstab, the external drive can be mounted and unmounted like before from the desktop manager. Another disadvantage of this approach is that (at least for NTFS filesystems) the group and owner of the files appear to be root instead of the (only) user I am using for reading and writing files.

Note 📝
On Windows, there is a similar situation. It is possible to disable and enable writing the last access time automatically with fsutil. This is the default behavior between Windows Vista 🗄️ and Windows 10 🗄️. I wonder why the default settings have been changed in Windows 10 again. The current status can be queried with fsutil behavior query disablelastaccess.

commit time

In /etc/fstab, or when mounting a drive with mount, it is possible to specify with the commit option how often the data and metadata of files and directories are synchronized to the disk. The default value is 5 seconds.

By setting a higher value, the drive is accessed less frequently:

# <file system>                            <mount point>  <type>  <options>                           <dump>  <pass>
UUID=ac6b52c8-8125-46c6-9611-38910b3e3612  /              ext4    defaults,noatime,discard,commit=60  0       1

The drawback is that in case of power loss or system crash, more data might be lost.

Disk Power Management

Some hard disks support power management. With hdparm it is possible to query whether this feature is supported. The following is the output for a couple of drives:

> sudo hdparm -i /dev/sda

/dev/sda:

 Model=ST1000LM049-2GH172, FwRev=SDM1, SerialNo=WGS4VER4
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs RotSpdTol>.5% }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=1953525168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=yes: unknown setting WriteCache=enabled
 Drive conforms to: Reserved:  ATA/ATAPI-4,5,6,7

 * signifies the current active mode

> sudo hdparm -i /dev/sdb

/dev/sdb:

 Model=Samsung SSD 860 EVO M.2 500GB, FwRev=RVT22B6Q, SerialNo=S414NB0M118864V
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=976773168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode

If AdvancedPM=yes appears in the output of hdparm, as with /dev/sda, then power management is supported and can be configured with hdparm -B X -S Y /dev/sda.

From the man page

[...]
       -B     Get/set Advanced Power Management feature, if the drive supports
              it.  A  low  value  means aggressive power management and a high
              value means better performance.  Possible  settings  range  from
              values  1  through  127 (which permit spin-down), and values 128
              through 254 (which do not permit spin-down).  The highest degree
              of  power  management  is  attained with a setting of 1, and the
              highest I/O performance with a setting of 254.  A value  of  255
              tells  hdparm to disable Advanced Power Management altogether on
              the drive (not all drives support disabling it, but most do).
[...]
       -S     Put  the  drive  into  idle  (low-power)  mode, and also set the
              standby (spindown) timeout for the drive.  This timeout value is
              used  by  the  drive to determine how long to wait (with no disk
              activity) before turning off the spindle motor  to  save  power.
              Under  such circumstances, the drive may take as long as 30 sec‐
              onds to respond to a subsequent disk access, though most  drives
              are much quicker.  The encoding of the timeout value is somewhat
              peculiar.  A value of zero means "timeouts  are  disabled":  the
              device will not automatically enter standby mode.  Values from 1
              to 240 specify multiples of 5 seconds, yielding timeouts from  5
              seconds to 20 minutes.  Values from 241 to 251 specify from 1 to
              11 units of 30 minutes, yielding timeouts from 30 minutes to 5.5
              hours.   A  value  of  252  signifies a timeout of 21 minutes. A
              value of 253 sets a vendor-defined timeout period between 8  and
              12  hours, and the value 254 is reserved.  255 is interpreted as
              21 minutes plus 15 seconds.  Note that  some  older  drives  may
              have very different interpretations of these values.
— man hdparm
Note 📝
The encoding of the timeout value is not peculiar, it is pure madness.

To put the disk into power saving mode after 2 minutes of idle time, the command would be hdparm -B 127 -S 24 /dev/sda.
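Since the encoding is easy to get wrong, here is a minimal sketch that converts a timeout in seconds into a value for -S. It follows only the two ranges quoted from the man page above (1 to 240 and 241 to 251) and rounds up to the next representable timeout. The function name hdparm_standby_value is made up for this example, and the special values 252 to 255 are deliberately not handled.

```shell
# Hypothetical helper: encode a standby timeout (in seconds) as an
# hdparm -S value, rounding up to the next representable timeout.
hdparm_standby_value() {
    local secs="$1"
    if [ "$secs" -eq 0 ]; then
        # 0 disables the timeout.
        echo 0
    elif [ "$secs" -le 1200 ]; then
        # Values 1..240: multiples of 5 seconds (5 seconds to 20 minutes).
        echo $(( (secs + 4) / 5 ))
    elif [ "$secs" -le 19800 ]; then
        # Values 241..251: 1 to 11 units of 30 minutes (up to 5.5 hours).
        echo $(( 240 + (secs + 1799) / 1800 ))
    else
        echo "timeout longer than 5.5 hours is not supported by this sketch" >&2
        return 1
    fi
}
```

With this helper, the command above could be written as sudo hdparm -B 127 -S "$(hdparm_standby_value 120)" /dev/sda. Note that anything between 20 and 30 minutes is rounded up to 30 minutes.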

Setting a value that is too low might mean that the disk continuously switches between idle/standby and active. As switching state also costs some energy, it makes sense to find a value high enough to ensure that the state does not switch back shortly after, but not so high that the low-power state is never entered.

In the case of a computer with multiple drives, it is much easier to ensure that one of the disks is used as little as possible (see vmtouch).

writeback time

Like the filesystem, the kernel does not write changes to the disk immediately but has its own buffering layer. This caching allows the kernel to group consecutive writes into one big write, and thus optimize the usage of disks.

It is possible to increase this time, at the cost that, in case of a system crash, power failure, or system freeze, the data written in a larger time window might be lost.

It is possible to query the current value with cat /proc/sys/vm/dirty_writeback_centisecs.

The value is expressed in centiseconds (hundredths of a second), and it is possible to change it (in Debian, as described here) by adding the following entry in /etc/sysctl.conf

vm.dirty_writeback_centisecs = 1500

tlp 🗄️, laptop-mode-tools, and other programs might override the value written in /etc/sysctl.conf.

A similar role is played by vm.dirty_expire_centisecs 🗄️.

Note that setting values that are too high not only increases the risk of data loss but might also create long pauses if the cache gets too big and needs to be written to the disk at once.
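Because both settings are in centiseconds, a small conversion helper makes them easier to read; 1500, for example, corresponds to 15 seconds. This is only a sketch, and the function names are made up for this example.

```shell
# Hypothetical helper: convert centiseconds to seconds with two decimals.
centisecs_to_secs() {
    printf '%d.%02d\n' "$(( $1 / 100 ))" "$(( $1 % 100 ))"
}

# Print the current writeback and expire intervals in seconds.
show_writeback_settings() {
    for name in dirty_writeback_centisecs dirty_expire_centisecs; do
        printf 'vm.%s = %s s\n' "$name" \
            "$(centisecs_to_secs "$(cat "/proc/sys/vm/$name")")"
    done
}
```

Running show_writeback_settings before and after a change is a quick way to see whether another tool has overridden the configured value.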

Other sysctl settings

There are a couple of other settings in sysctl that might be relevant.

One is vm.swappiness 🗄️, which can be used for controlling how aggressively the kernel swaps out memory pages. The default value is 60; a lower value reduces swapping.

There are also vm.dirty_bytes 🗄️ and vm.dirty_ratio 🗄️. Both control how much dirty memory (modified data not yet written to the disk) a process can generate before it has to start writing it to the disk itself.

To avoid conflicts during a package upgrade, it might make sense to put those settings in a separate file. In my case I’m using /etc/sysctl.d/local.conf

Use sudo service procps force-reload to test if the config file has been edited correctly and the new values loaded.
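As an illustration, such a separate file could be generated as follows. The function name print_local_sysctl_conf is made up, the writeback value is the one used earlier in these notes, and the swappiness value is an arbitrary example, not a recommendation.

```shell
# Hypothetical helper: print an example /etc/sysctl.d/local.conf.
# The swappiness value below is only a placeholder for illustration.
print_local_sysctl_conf() {
    cat <<'EOF'
# Local overrides, kept separate from the distribution's /etc/sysctl.conf
vm.dirty_writeback_centisecs = 1500
vm.swappiness = 10
EOF
}
```

The output can be reviewed first and then written with print_local_sysctl_conf | sudo tee /etc/sysctl.d/local.conf. After reloading, sysctl vm.swappiness should print the new value.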

tmpfs

tmpfs is a file system that keeps all files in virtual memory.

This means three things.

  • There is no disk activity when using tmpfs.

  • What is stored in a tmpfs drive is lost when the machine is rebooted.

  • Better be sure that we have enough RAM.

tmpfs is thus useful when storing temporary data, and we want to do it as fast as possible or without using the disk (which is generally slower than the RAM).

Note 📝
Some distributions mount /tmp as tmpfs, even if there are clear disadvantages in doing so by default 🗄️. Otherwise one can use the already-existing /run/user/1000/ on systemd-based systems or the /dev/shm directory, as they are both tmpfs filesystems.

In some cases, like fuzzing, the disk is stressed a lot 🗄️ as files are continuously written and deleted. Performance is also affected 🗄️; in those cases a sensible approach is to use tmpfs and periodically sync the content to a hard drive.

Some (especially older) tutorials might explain how to mount /tmp and some directories under /var through init-scripts as tmpfs.

Before changing anything, it makes sense to verify if those directories are already stored in memory. On a "modern" Debian system:

> grep "^tmpfs" /proc/mounts
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=3258592k,mode=755,inode64 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
tmpfs /tmp tmpfs rw,noatime,inode64 0 0
tmpfs /run/user/1000 tmpfs rw,nosuid,nodev,relatime,size=3258588k,nr_inodes=814647,mode=700,uid=1000,gid=1002,inode64 0 0

> file /var/{run,lock}
/var/run:  symbolic link to /run
/var/lock: symbolic link to /run/lock

Thus any guide describing how to mount at least /var/run or /var/lock might be outdated. Otherwise, in most cases, it is sufficient to add an entry in /etc/fstab, like:

# <file system>  <mount point>  <type>  <options>                   <dump>  <pass>
tmpfs            /tmp           tmpfs   defaults,noatime,mode=1777  0       0
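The check against /proc/mounts can also be scripted, for example before deciding whether an fstab entry is needed at all. The following sketch uses a made-up function name, is_tmpfs_mount, and only matches exact mount points: it does not resolve symbolic links such as /var/run, nor paths below a mount point.

```shell
# Hypothetical helper: succeed if the given mount point is listed as tmpfs.
#   $1: mount point to look for (exact match)
#   $2: mounts file to read (defaults to /proc/mounts)
is_tmpfs_mount() {
    awk -v mp="$1" '
        $2 == mp && $3 == "tmpfs" { found = 1 }
        END { exit !found }
    ' "${2:-/proc/mounts}"
}
```

A possible use is is_tmpfs_mount /tmp && echo "/tmp is already in memory".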

Partitions

A carefully partitioned drive can improve how efficiently the drive is used by reducing seek times, but splitting the system into multiple partitions has multiple disadvantages:

  • fewer possibilities to move files, more copies will be necessary, especially if /tmp is on another drive

  • problematic if one wants to handle a big file

  • it is easier for a partition to fill up, unless the partitions are unnecessarily large

  • if partitions are too large, more space is unused/lost

With LVM it is possible to overcome most issues. It is not clear to me whether a carefully partitioned drive would still be more efficient in that case.

I do not believe that partitioning is useful for making the system work more efficiently; on the contrary. Also, the arguments seem to hold only for rotating drives, not for SSDs.

iotop

iotop is a command-line tool that shows which processes are reading from or writing to the disk, and how much data they transfer.

For observing IO usage over time, something like iotop --batch --only --delay=15 is very practical, as it writes to the console, every 15 seconds, a list of processes that read or wrote to the disk, and how much data they transferred.

vmtouch

vmtouch is a program that permits controlling how the OS caches single files and directories.

I noticed, for example, that the graphical music player I was using constantly read something from the disk, thus I decided to try other music players.

Another one provided a much better experience for my disk, and the drive was able to spin down most of the time. But when switching from one song to another, the music player reads another file from the disk, spinning the disk up again.

For such cases, as files are only read and not written, vmtouch is very practical. By telling the OS to cache all files in specific folders (like a Music Album) at once, instead of reading one file after the other, the reads happen only at the beginning of the first song, then the disk can spin down and stay idle the whole time.
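A minimal sketch of this preloading step is shown below. The function name preload_dir is made up for this example. It uses vmtouch -t when vmtouch is installed and otherwise falls back to simply reading every file once, which is an assumption about what is good enough to populate the page cache. In both cases the kernel may still evict the data later under memory pressure.

```shell
# Hypothetical helper: read all files below a directory into the page cache.
preload_dir() {
    if command -v vmtouch >/dev/null 2>&1; then
        # -t touches the pages of every file so they are loaded into memory.
        vmtouch -t "$1"
    else
        # Fallback without vmtouch: read every regular file once.
        find "$1" -type f -exec cat {} + >/dev/null
    fi
}
```

For a music album this could be preload_dir ~/Music/SomeAlbum, where the path is only a placeholder, run once before starting the player.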

Reduce disk writes (and reads)

With noatime, tmpfs, vmtouch, and by configuring a higher writeback time, it is possible to reduce the disk usage system-wide. And thanks to iotop it is possible to observe which programs are still using the disk while the system is idle.

But if a program periodically reads or writes data to the disk, and this data, generally, should not live on tmpfs, then it will be hard to configure the system in such a way that the disk enters a lower power state, even if the user is not using the machine at all.

In those cases, it might be possible to configure the program differently or search for an alternative program.

/etc/(r)syslog.conf

The rsyslog (or syslog) daemon is a process that saves all kernel and related log messages to the /var/log/<file> files.

Some messages are synchronized immediately, in order to make sure that crash logs of critical systems have a higher chance of being on the disk in case of a system crash.

It is possible to edit the /etc/rsyslog.conf (or /etc/syslog.conf) file and add a "-" in front of the file path of an entry to disable this immediate synchronization.

For example, messages matching the following entry are flushed to the disk immediately

mail.err                        /var/log/mail.err

while messages matching the following entry are not

mail.info                       -/var/log/mail.info

Other programs

There is no silver bullet: if a program needs to read from and write to the disk, it might not even be possible to configure how often this happens.

As a general solution, it is possible to copy all "interesting" files to a tmpfs drive, let the program work on the copied files, and sync those to the disk periodically.

Unfortunately, it is not a very user-friendly approach and is generally error-prone. The most common error would be forgetting to copy the files back before a shutdown or reboot.
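The copy-and-sync-back approach can be sketched with two small functions. The names stage_to_tmpfs and sync_to_disk are made up for this example, the -u option assumes GNU cp, and files deleted in the working copy are not deleted on the disk.

```shell
# Hypothetical helper: copy a directory from the disk to a working
# directory that lives on a tmpfs mount (for example below /dev/shm).
#   $1: directory on the disk, $2: working directory on tmpfs
stage_to_tmpfs() {
    mkdir -p "$2" && cp -a "$1"/. "$2"/
}

# Hypothetical helper: copy new and updated files back to the disk.
#   $1: working directory on tmpfs, $2: directory on the disk
sync_to_disk() {
    # -u (GNU cp) copies only files that are newer than the destination.
    cp -au "$1"/. "$2"/
}
```

To reduce the risk of forgetting the last step, sync_to_disk could be called periodically and once more from a trap on EXIT in the script that starts the program.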

Also, as a general guideline, programs that are not installed on the system cannot write to the drive. While obvious, this is a valid reason for preferring a minimal system. It is true that programs on the drive are not executed automatically and do not write anything unless executed, but some programs start services that run all the time (or are started in the background at startup), even if the user does not always need them.

laptop-mode-tools

laptop-mode-tools is a collection of utilities for automatically adjusting system settings depending on whether the computer is running on battery or not.

One of the many settings it can handle is disk power management.

In /etc/laptop-mode/laptop-mode.conf, there is a whole section for hard drives:

###############################################################################
# Hard drive behavior settings
# -----------------------------
#
# These settings specify how laptop mode tools will adjust the various
# parameters of your hard drives and file systems.
###############################################################################

In particular, there are default configurations for read-ahead, noatime/relatime, idle time, write cache settings and dirty ratio.

Conclusion

As mentioned at the beginning, I did not measure most changes, and definitely not one independently from another.

Maybe some settings are not useful at all, and maybe others might be useful only in certain circumstances.

For sure, after tweaking the system, there are far fewer writes to the drive, especially when the system is idle, and the hard drive spends much more time spun down. Even if the difference in energy consumption is low, the PC is definitely quieter, which is a very nice thing.

The most important part was to stop using programs that read or wrote continuously to the disk, use vmtouch when batch processing multiple files, and put the second drive to sleep as soon as possible (20 seconds).

