Formatting a RAID volume with ext4

Contents: Overview of RAID volumes and the ext4 filesystem
In a previous post a RAID 5 array was created using mdadm. Once created, the array is ready for use, for example by creating a partition and placing a filesystem on it. Formatting a RAID array is similar to formatting any other partition or device, except that a few filesystem options (block size, stride, and stripe width) should be set to match the array layout in order to improve I/O performance and reduce overhead.

The process below creates a partition on the /dev/md0 device, which is a RAID level 5 array running on four disks. The partition is then formatted with the ext4 filesystem, the operating system is configured to automount the new filesystem, and test data is written to demonstrate functionality.

This guide was tested on a Debian 6.0 "Squeeze" system (Linux debian 2.6.32-5-amd64) with a RAID 5 array created out of four 5GB disks. Root access is required. Some arithmetic is involved, so a calculator is recommended if the resulting values do not come easily.

The steps outlined below will damage any data on the selected device. Do not run them on a production system without fully understanding the process and testing in a development environment.

These instructions are not meant to be exhaustive and may not be appropriate for your environment. Always check with your hardware and software vendors for the appropriate steps to manage your infrastructure.

Instructions and information are detailed in black font with no decoration.
Code and shell text are in black font, gray background, and a dashed border.
Input is green.
Literal keys are enclosed in brackets such as [enter], [shift], and [ctrl+c].
Warnings are in red font.

Steps to format a RAID volume with ext4
  1. Log in to your system and open a command prompt.

  2. Change your shell to run as root:
    user@debian~$: su -

  3. View the current information for /dev/md0:
    root@debian~$: fdisk -l /dev/md0
    Disk /dev/md0: 16.1 GB, 16101408768 bytes
    2 heads, 4 sectors/track, 3931008 cylinders
    Units = cylinders of 8 * 512 = 4096 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
    Disk identifier: 0x00000000

    Disk /dev/md0 doesn't contain a valid partition table
    The results show the total size of the device, 16.1GB, and that there are no partitions on the device yet.
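
    The minimum and optimal I/O sizes printed by fdisk are also exposed by the kernel as files in the device's sysfs queue directory. The following is a minimal sketch of reading them from a script; io_sizes is a hypothetical helper name, and the demonstration uses a scratch directory holding the values from the output above so that it runs without a RAID device.

```shell
# io_sizes is a hypothetical helper name, not part of any package.
# Usage: io_sizes <queue directory>, e.g. /sys/block/md0/queue
# Prints the minimum and optimal I/O sizes (in bytes) found there.
io_sizes() (
    queue_dir=$1
    min=$(cat "$queue_dir/minimum_io_size") || exit 1
    opt=$(cat "$queue_dir/optimal_io_size") || exit 1
    echo "minimum=$min optimal=$opt"
)

# Demonstration against a scratch directory filled with the values
# fdisk reported in step 3.
demo=$(mktemp -d)
echo 524288 > "$demo/minimum_io_size"
echo 1572864 > "$demo/optimal_io_size"
io_sizes "$demo"
rm -rf "$demo"
```

    On the system used in this guide, io_sizes /sys/block/md0/queue would be expected to report the same two numbers. The minimum, 524288 bytes, is one 512KB chunk, and the optimal, 1572864 bytes, is three chunks, which is one full data stripe on this array.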

  4. Create a partition on /dev/md0:
    root@debian~$: fdisk /dev/md0
    Command (m for help): n
    Command action
       e   extended
       p   primary partition (1-4)
    p
    Partition number (1-4): 1
    First cylinder (1-3931008, default 385): [enter]
    Using default value 385
    Last cylinder, +cylinders or +size{K,M,G} (385-3931008, default 3931008): [enter]
    Using default value 3931008

    Command (m for help): p

    Disk /dev/md0: 16.1 GB, 16101408768 bytes
    2 heads, 4 sectors/track, 3931008 cylinders
    Units = cylinders of 8 * 512 = 4096 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
    Disk identifier: 0x8af1e479

    Device            Boot    Start    End    Blocks    Id    System
    /dev/md0p1              385    3931008    15722496    83    Linux

    Command (m for help): w
    The partition table has been altered!

    Calling ioctl() to re-read partition table.
    Syncing disks.
    The first command starts the fdisk application with /dev/md0 as the target. The "n" command starts the creation of a new partition, "p" selects a primary partition, "1" makes it the first partition on the device, and the defaults are accepted for the first and last cylinders. The second "p" prints the partition table to confirm the settings just specified; at this point nothing has been saved and the device is unchanged. "w" writes the partition table to disk, which creates the partition, and then exits fdisk.

    The device /dev/md0 now has a single partition on it with the name of /dev/md0p1 (the output from the print command told us this).
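
    The default first cylinder of 385 lines up with the array geometry. With units of 4096 bytes, cylinder 385 begins 384 * 4096 = 1572864 bytes into the device, which equals the optimal I/O size reported by fdisk. The sketch below checks that arithmetic; partition_offset and is_aligned are hypothetical helper names, and the calculation assumes cylinders are numbered from 1.

```shell
# partition_offset and is_aligned are hypothetical helper names.
# Assumption: fdisk numbers cylinders from 1, so cylinder N begins
# (N - 1) * unit_bytes into the device.
# Usage: partition_offset <first cylinder> <bytes per cylinder>
partition_offset() {
    echo $(( ($1 - 1) * $2 ))
}

# Usage: is_aligned <offset bytes> <I/O size bytes>
# Succeeds when the offset is a whole multiple of the I/O size.
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

offset=$(partition_offset 385 4096)     # values from the fdisk output
if is_aligned "$offset" 1572864; then   # optimal I/O size from fdisk
    echo "offset $offset is aligned"
else
    echo "offset $offset is NOT aligned"
fi
```

    If a different first cylinder is entered by hand, the same check can show whether the partition still starts on a full-stripe boundary.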

  5. Determine the chunk size of the RAID device as reported by mdadm:
    root@debian~$: cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4]
    md0 : active raid5 sde[5] sdf[4](S) sdd[2] sdc[1] sdb[0]
        15724032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
    From the output above it is found that the chunk size is 512k.
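
    When the chunk size is needed in a script, it can be extracted from the same file. This is a minimal sketch, assuming the "NNNk chunk" wording shown above for a RAID 5 array; md_chunk_kib is a hypothetical helper name, and the demonstration parses a copy of the output from this step rather than the live /proc/mdstat.

```shell
# md_chunk_kib is a hypothetical helper name.
# Usage: md_chunk_kib <md name> [mdstat file]
# Prints the chunk size in KB for the named array. It assumes the
# "NNNk chunk" wording that /proc/mdstat shows for the RAID 5 array above.
md_chunk_kib() {
    sed -n "/^$1 :/,/chunk/ s/.* \([0-9][0-9]*\)k chunk.*/\1/p" "${2:-/proc/mdstat}"
}

# Demonstration against a copy of the output from step 5.
sample=$(mktemp)
printf '%s\n' \
    'Personalities : [raid6] [raid5] [raid4]' \
    'md0 : active raid5 sde[5] sdf[4](S) sdd[2] sdc[1] sdb[0]' \
    '      15724032 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]' \
    > "$sample"
md_chunk_kib md0 "$sample"
rm -f "$sample"
```

    On a live system the second argument can be omitted so the function reads /proc/mdstat directly. The output of mdadm --detail /dev/md0 also includes a chunk size line and can serve as a cross-check.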

  6. Calculate the stride and stripe-width sizes and then create the filesystem:

    Chunk size is 512KB as discovered in step 5.

    Block size is 4096 bytes; this is a chosen value.

    Stride size is the number of data blocks that fit in a chunk.
    (chunk size) / (block size) = stride size
    512KB / 4096 bytes = stride size
    524288 bytes / 4096 bytes = 128 blocks or 512KB / 4KB = 128 blocks

    Stripe width is the number of blocks that fit across an entire data stripe; it does not include parity.
    (stride size) * (number of data disks) = stripe width
    128 blocks * (4 disks in array - 1 for parity) = stripe width
    128 blocks * 3 disks = 384 blocks.

    root@debian~$: mkfs.ext4 -v -m .1 -b 4096 -E stride=128,stripe-width=384 /dev/md0p1
    mke2fs 1.41.12 (17-May-2010)
    fs_types for mke2fs.conf resolution: 'ext4'
    OS type: Linux
    Block size=4096
    Fragment size=4096
    Stride=128 blocks, Stripe width=384 blocks
    983040 inodes, 3930624 blocks
    3930 blocks (0.10%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=4026531840
    120 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

    Writing inode tables: done
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done

    The filesystem will be automatically checked every 32 mounts or 180 days, whichever comes first. Use tune2fs -c or -i to override.
    This runs the mke2fs application and specifies the filesystem type as ext4. Verbose output is enabled (-v), the percentage of blocks reserved for the super user is 0.1 percent (-m .1), the block size is 4096 bytes (-b 4096), extended options are enabled (-E) with the stride and stripe-width values set to 128 and 384 blocks respectively (see footnote 1), and the target of the operation is set to /dev/md0p1 (the first primary partition on /dev/md0, which was created in step 4).
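
    The stride and stripe-width arithmetic from this step can be written as a small shell function, which makes it easier to try other chunk sizes, block sizes, or disk counts. This is a minimal sketch; ext4_raid_opts is a hypothetical helper name, and the arguments shown are the values used in this guide.

```shell
# ext4_raid_opts is a hypothetical helper name.
# Usage: ext4_raid_opts <chunk KB> <block bytes> <total disks> <parity disks>
# Prints a value for mkfs.ext4 -E, following the arithmetic in step 6.
ext4_raid_opts() (
    chunk_bytes=$(( $1 * 1024 ))
    stride=$(( chunk_bytes / $2 ))            # blocks per chunk
    data_disks=$(( $3 - $4 ))                 # disks holding data in each stripe
    stripe_width=$(( stride * data_disks ))   # blocks per full data stripe
    echo "stride=$stride,stripe-width=$stripe_width"
)

# Values from this guide: 512KB chunk, 4096-byte blocks, 4 disks, 1 for parity.
ext4_raid_opts 512 4096 4 1
```

    The printed string can be passed to the -E option of mkfs.ext4 in place of the hand-calculated values. The total disk count covers only the active members of the array; a spare, such as the device marked (S) in step 5, is not counted.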

  7. Create a directory to mount the new filesystem to:
    root@debian~$: mkdir /data

  8. Obtain the UUID for the new filesystem:
    root@debian~$: blkid
    [data removed for length]
    /dev/md0p1: UUID="fa887f1a-03ba-48f4-a806-ea98d7fbae94" TYPE="ext4"

  9. Add an entry for the new filesystem at the end of the file system table file:
    root@debian~$: vi /etc/fstab
    [shift+G] [end] [a] [enter]
    UUID="fa887f1a-03ba-48f4-a806-ea98d7fbae94"    /data    ext4    defaults    0    2
    [escape] [escape] [shift+ZZ]
    Shift+G takes you to the last line of the file, end goes to the end of the line, "a" takes you to append mode, and enter will create a new line. After the entry is typed, escape returns to command mode and shift+ZZ saves the file and exits vi.

    The line created identifies the partition used with the UUID, sets the mount location as /data, defines the filesystem as ext4, uses the default mount options (see footnote 2), disables dump backups, and sets the filesystem check priority to 2 (because root is 1).
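
    The same entry can be generated by a script instead of being typed into vi. This is a minimal sketch; fstab_entry is a hypothetical helper name, the UUID in the demonstration is a placeholder, and the line is written without quotes around the UUID, a form that fstab also accepts.

```shell
# fstab_entry is a hypothetical helper name.
# Usage: fstab_entry <uuid> <mount point> [fs type]
# Prints one tab-separated /etc/fstab line with the same fields as step 9:
# default mount options, dump disabled, filesystem check priority 2.
fstab_entry() {
    printf 'UUID=%s\t%s\t%s\tdefaults\t0\t2\n' "$1" "$2" "${3:-ext4}"
}

# Placeholder UUID for illustration only; substitute the value blkid reports.
fstab_entry 00000000-0000-0000-0000-000000000000 /data
```

    On a live system the UUID could be supplied with blkid -s UUID -o value /dev/md0p1, and the output appended to /etc/fstab once it has been reviewed.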

  10. Manually mount the filesystem and verify it worked:
    root@debian~$: mount /dev/md0p1 /data
    root@debian~$: df -h
    Filesystem    Size    Used    Avail    Use%    Mounted on
    [data removed for length]
    /dev/md0p1    15G    166M    15G    2%    /data
    The mount command just mounts the filesystem from the target /dev/md0p1 to /data, making it accessible from /data.

    Output from the df command shows that /dev/md0p1 is 15GB in size, has 166MB used, roughly 15GB available, 2% is used, and it is mounted on /data.
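
    Mounting by device name, as above, does not exercise the entry added in step 9. Running mount /data with only the mount point makes mount look the filesystem up in /etc/fstab, which tests the new line before a reboot does. The sketch below shows one way a script could confirm the result; is_mounted is a hypothetical helper name, and the demonstration reads a made-up sample table rather than the live /proc/mounts.

```shell
# is_mounted is a hypothetical helper name.
# Usage: is_mounted <mount point> [mounts file]
# Succeeds when the mount point appears in the second field of the
# mounts table (by default /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Demonstration against a sample table, so no RAID device is needed.
sample=$(mktemp)
printf '%s\n' \
    'rootfs / rootfs rw 0 0' \
    '/dev/md0p1 /data ext4 rw 0 0' > "$sample"
is_mounted /data "$sample" && echo "/data is mounted"
is_mounted /mnt "$sample" || echo "/mnt is not mounted"
rm -f "$sample"
```

    On the system from this guide, unmounting /data, running mount /data, and then calling is_mounted /data would be one way to confirm that the fstab entry works.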

  11. Create test data to verify the filesystem is functional:
    root@debian~$: cd /data
    root@debian~$: touch testfile1
    root@debian~$: dd if=/dev/urandom of=testfile2 bs=10M count=1
    1+0 records in
    1+0 records out
    10485760 bytes (10MB) copied, 1.43536 s, 7.3 MB/s
    root@debian~$: ls -l
    total 10256
    drwx------ 2 root root    16384 Jul 27 00:34 lost+found
    -rw-r--r-- 1 root root        0 Jul 27 3:11 testfile1
    -rw-r--r-- 1 root root 10485760 Jul 27 3:11 testfile2
    root@debian~$: df -h
    Filesystem    Size    Used    Avail    Use%    Mounted on
    [data removed for length]
    /dev/md0p1    15G    176M    15G    2%    /data
    First the current directory is changed to /data.

    An empty file named testfile1 is created.

    The dd application reads from /dev/urandom and writes to the output file testfile2, using a block size of 10MB (bs=10M) and copying one block (count=1).

    ls shows that the two files were created, testfile1 has no data and testfile2 is 10MB.

    The output from df shows that the used space has increased by 10MB to 176MB.
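
    The write test can also be scripted so that it checks the resulting file size and cleans up after itself. This is a minimal sketch; write_and_check is a hypothetical helper name, it relies on GNU stat, and the demonstration writes to a scratch directory rather than /data.

```shell
# write_and_check is a hypothetical helper name.
# Usage: write_and_check <directory>
# Writes 10MB of random data into the directory, confirms the file size,
# and removes the file again.
write_and_check() (
    file=$1/testfile2
    dd if=/dev/urandom of="$file" bs=1M count=10 2>/dev/null || exit 1
    size=$(stat -c %s "$file")   # GNU stat: file size in bytes
    rm -f "$file"
    [ "$size" -eq 10485760 ] && echo "wrote and removed $size bytes"
)

# Demonstration in a scratch directory; on the system from this guide
# the argument would be /data.
scratch=$(mktemp -d)
write_and_check "$scratch"
rmdir "$scratch"
```

    The function returns a non-zero status if dd fails or the size does not match, so it can be used as a simple pass or fail check.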

  12. The filesystem is ready for use and will be automounted when the system is powered on (from the configuration in step 9).

Following the steps above should allow you to create an ext4 filesystem on top of a RAID level 5 array. These steps can be modified to work with other RAID levels (or even non-RAID devices), device sizes, and filesystem types. Before running a system in production with these settings it is important to test different values in step 6, as there is no one-size-fits-all solution.

Footnotes
  1. The chunk size is the amount of data, in bytes, written to each independent device of a stripe-set. This is a configuration of the RAID array.

    Block size is the size, in bytes, of the units that data is broken into when it is read from or written to the disk. This is a filesystem setting.

    Stride is the number of blocks that can fit in a chunk. This is a filesystem setting that allows the filesystem to read/write data to the RAID array more efficiently.

    Stripe-width is the size, in blocks, of a stripe set, not including the parity. For this guide RAID 5 is used across 4 disks. RAID 5 stripes data and calculates parity, so each stripe writes data on 3 disks and parity on 1 disk. Parity for the RAID array is not calculated by the filesystem; it is lower in the stack and performed by the kernel for the multidisk device itself (here: /dev/md0).

    Depending on the application, hardware used, RAID configuration, and filesystem configuration the chunk, block, stride, and stripe-width sizes may need to be different. The numbers used in this guide are an example of possibilities and the numbers should be experimented with to find which configuration will perform better.

  2. The default mount settings are equivalent to: mounting the filesystem read/write (rw), allowing suid (suid), interpreting device files (dev), allowing binary execution (exec), automounting the filesystem at boot (auto), allowing only root to mount the filesystem (nouser), and setting I/O to be asynchronous (async).

 ©Eric Wamsley - ewams.net