Thorsten's World: Random stuff

SSDs, wear-leveling and why you want TRIM

Ok, so you got one of these shiny, new, cool SSDs in your PC, or you are going to get one.

Nice, now maybe you want to know what problems might occur if you use your SSD incorrectly.

You might want:

  • constant high performance
  • a long lifetime
  • a better understanding of the technology

What's the difference between a HDD and an SSD?

An HDD reads and writes data from or to rotating platters using magnetism. Seeking between chunks of data is quite slow, since the read/write arm has to be physically moved to the right location.

An SSD instead has no moving parts and uses a controller to access the flash media directly. The controller also emulates an HDD to stay backwards compatible with HDD-based systems.

So here is an example of how small the difference looks from the OS (fdisk output for an SSD and an HDD):


Disk /dev/sda: 30.0 GB, 30016659456 bytes, 58626288 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes, 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

So they differ in storage capacity and sector size, but to the PC they look essentially the same.

The Problem and the Solution to SSD wear

Flash memory has the problem that it can only be rewritten a finite number of times. Your OS tends to store temporary data (like temporary internet files etc.) in the same locations on your disk. That is not good for the flash chips in your SSD.

Wear leveling tries to reduce the impact of this problem.

SSDs have no moving parts, so there is (nearly) no performance impact if your chunks of data are fragmented all over the SSD. The wear-leveling controller counts how often each block has been rewritten. If the rewrite count reaches a certain threshold, the controller takes a different, unused block and maps it to the block the data has been written to until now. So the wear of one emulated HDD block is leveled across different physical flash blocks.
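The idea can be illustrated with a toy sketch. This is not how a real flash translation layer works internally (real controllers are far more sophisticated); the class name, the threshold value, and the block counts below are all made up for illustration:

```python
WEAR_THRESHOLD = 3  # hypothetical rewrite limit before remapping

class WearLevelingController:
    """Toy model: maps logical (emulated HDD) blocks to physical flash blocks."""

    def __init__(self, num_physical_blocks, num_logical_blocks):
        # Start with an identity mapping; the rest of the flash is spare.
        self.mapping = {lb: lb for lb in range(num_logical_blocks)}
        self.erase_counts = [0] * num_physical_blocks
        self.free_blocks = list(range(num_logical_blocks, num_physical_blocks))

    def write(self, logical_block):
        pb = self.mapping[logical_block]
        self.erase_counts[pb] += 1
        # Once this physical block is worn, remap the logical block
        # to the least-worn free block and retire the old one.
        if self.erase_counts[pb] >= WEAR_THRESHOLD and self.free_blocks:
            new_pb = min(self.free_blocks, key=lambda b: self.erase_counts[b])
            self.free_blocks.remove(new_pb)
            self.free_blocks.append(pb)
            self.mapping[logical_block] = new_pb

ctrl = WearLevelingController(num_physical_blocks=8, num_logical_blocks=4)
for _ in range(12):
    ctrl.write(0)  # hammer the same logical block over and over
# The 12 rewrites were spread over 4 physical blocks, 3 each:
print(max(ctrl.erase_counts))  # → 3
```

Note that the remapping only works while `free_blocks` is non-empty, which is exactly the problem the next sections are about.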

Fine, so now we are done and everything is fine? No.

Now come the real problems: the Operating System (OS) and the File System (FS).

Both have been in use with HDDs for some decades, while SSDs appeared for PCs only recently. So both the OS and the FS have been optimized for use with HDDs.

HDD optimized Filesystems on SSDs

So what is the problem? The SSD emulates an HDD and uses wear leveling to use flash blocks evenly (as long as unused/free blocks are available). The file system writes data to disk and writes an index (the metadata) to 'link' the data as files. If a file is deleted, only the index entry is removed; the data itself is still on the disk, just no longer visible to the operating system.

The SSD controller only knows that there is data in those blocks; it cannot tell whether the data is still needed or belongs to long-deleted files. So from the controller's point of view, the unused space shrinks with every file you put on your drive, and deleting files does not free it up again. Sooner or later you 'hit the wall': no unused blocks are left for the controller to use for wear leveling. You usually notice this as degraded performance after a while of usage, and the block wear increases.

Now what is TRIM?

It is the only way for the OS to tell the SSD that the blocks used by deleted files can be reused for wear leveling. The problem is that some old operating systems (e.g. Windows XP) and some (old) filesystems under GNU/Linux have no TRIM support. So check the TRIM support lists for your OS and filesystems if you want to have a fast SSD for a long time.
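Under GNU/Linux you can check whether your hardware and kernel see TRIM support at all. A quick sketch (assuming your SSD is /dev/sda; adjust the device name for your system):

```shell
# Ask the drive itself whether it advertises TRIM support:
sudo hdparm -I /dev/sda | grep TRIM

# lsblk shows whether the kernel sees discard (TRIM) support;
# non-zero DISC-GRAN/DISC-MAX values mean it is available:
lsblk --discard /dev/sda

# On a mounted, TRIM-capable filesystem you can trim the free
# space manually and see how much was discarded:
sudo fstrim -v /
```

Whether TRIM actually reaches the drive still depends on the filesystem and how it is mounted, so treat these commands as a starting point, not a guarantee.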

Benchmarking Harddisks under GNU/Linux with dd and what you can do wrong

I recently got into an argument where one of my colleagues was telling me that the only correct way to benchmark disk throughput in terms of bandwidth was to disable the OS buffers (oflag=direct). Here is one example of how this can get you into trouble:

dd if=/dev/zero of=/dev/sdx3 bs=4k count=10000k oflag=direct

So basically we are trying to write a large number of 4k blocks to an HDD (hopefully in a near-sequential manner).

Our results (~50 MB/s) were very bad compared to the maximum throughput stated by the manufacturer (~140 MB/s).

So what is wrong with this benchmark?

dd is not writing one sequential heap of blocks here, but about 10 million individual blocks. This is very bad for measuring the throughput of your HDD, because each block is written with its own write request.

This is as if you wrote lots of 4k files onto your FS and synced each block to disk. Since your FS usually does not do this, this is not the correct way to obtain maximum throughput values for your disks.

There are several ways to get better values, but some of them are better than others:

Disabling direct writing (enabling the OS buffers):

dd if=/dev/zero of=/dev/sdx3 bs=4k count=10000k

Results: ~130 MB/s

So what happens in the background here?

We are still issuing a write for each block, but the OS (i.e. Linux) buffers them in memory, schedules the writes, and flushes them to disk. So the OS corrects what we are doing wrong: it translates (or at least tries to translate) the many small writes into larger write operations, which improves the measured performance of the disk. So you can see that cached writing is not harmful for benchmarking purposes. You use it in real usage with your OS too, so why would you want to benchmark without it?
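If you do want to keep oflag=direct, one common alternative (a sketch, not the only valid approach; /dev/sdx3 is a placeholder device as above) is to make each request itself large, so a single write covers many sectors:

```shell
# Same 40 GiB direct-I/O benchmark as before, but with 1 MiB
# requests instead of 4 KiB ones: the drive now gets large,
# near-sequential writes instead of millions of tiny ones.
dd if=/dev/zero of=/dev/sdx3 bs=1M count=40k oflag=direct
```

With a block size like this, direct I/O can get close to the drive's sequential maximum, because it is the tiny request size, not the missing cache, that ruined the first benchmark.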