Creating ZFS Zpools for OpenVZ

You will first need to ensure ZOL (ZFS on Linux) is installed on your OpenVZ server. See http://zfsonlinux.org/epel.html for help with installing ZOL; a rough outline of the installation is also sketched below. In this guide I am using the OpenVZ kernel 2.6.32-042stab093.4 on CentOS release 6.5.
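
For reference, here is a rough sketch of how the installation looked on CentOS 6 at the time of writing. The exact zfs-release package URL and package names can change, so treat this as an outline only and follow the ZoL EPEL page above for the authoritative steps:

# Install the zfs-release repository package first (get the exact URL for your
# release from http://zfsonlinux.org/epel.html), then install ZFS itself.
yum localinstall --nogpgcheck <zfs-release package URL from the page above>
yum install kernel-devel zfs

# Load the kernel module and confirm it is available
modprobe zfs
lsmod | grep zfs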



STEP#1 Prepare Disk Partitions

Create Solaris Partitions using fdisk

Before we can do anything we need to create partitions for the pools to use. In this guide I am using one of our own OpenVZ servers as an example of how to set up two zpools. I used the fdisk command to create the partitions shown below:

[root@usa2 /]# fdisk -l

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x41f3650e

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          64      512000   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2              64        1339    10240000   fd  Linux raid autodetect
/dev/sda3            1339        2614    10240000   fd  Linux raid autodetect
/dev/sda4            2614      121601   955767008+  bf  Solaris

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000b32e6

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          64      512000   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2              64        1339    10240000   fd  Linux raid autodetect
/dev/sdb3            1339        2614    10240000   fd  Linux raid autodetect
/dev/sdb4            2614      121601   955767008+  bf  Solaris

Disk /dev/sdc: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x54247215

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1      121601   976760001   bf  Solaris
Partition 1 does not start on physical sector boundary.

Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x54244bde

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1      121601   976760001   bf  Solaris
Partition 1 does not start on physical sector boundary.
...
...

Here you can see our OpenVZ server has 4 x 1TB SATA disks. mdadm software RAID1 is already set up on two of them (partition type "fd", Linux raid autodetect) for the /boot, swap and / (root) partitions; we are not going to touch those. You will also see a partition on each disk of type "Solaris" (fdisk type "bf"); these are for the zpools we are about to create. On this server we are going to create 2 separate mirrored pools:
vztank1 = /dev/sda4 + /dev/sdb4
vztank2 = /dev/sdc1 + /dev/sdd1
I will not go into detail on creating these partitions with fdisk (I am assuming you already know how to use it), but a rough outline of the session is sketched below.
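
For completeness, here is roughly what the interactive fdisk session looks like when creating a single whole-disk Solaris ("bf") partition, using /dev/sdc as an example. Each keystroke is annotated; the exact prompts and defaults will vary with your disk layout:

fdisk /dev/sdc
   n          (create a new partition)
   p          (primary)
   1          (partition number 1)
   <Enter>    (accept the default first cylinder)
   <Enter>    (accept the default last cylinder, i.e. use the whole disk)
   t          (change the partition type)
   bf         (Solaris)
   w          (write the partition table and exit)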



STEP#2 Create the Pools

Create Pools (mirrored) without mount points

It's very important to use the disk "IDs" rather than the typical /dev/sdX device names, as device names can sometimes change after a reboot. To find your disk IDs:

[root@usa2 /]# cd /dev/disk/by-id
[root@usa2 by-id]# ls -l
total 0
lrwxrwxrwx 1 root root  9 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_3458SZUMS -> ../../sdd
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_3458SZUMS-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_34GD6P5NS -> ../../sdc
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_34GD6P5NS-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_63TU2A1NS -> ../../sda
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_63TU2A1NS-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_63TU2A1NS-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_63TU2A1NS-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_63TU2A1NS-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_935LYJENS -> ../../sdb
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_935LYJENS-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_935LYJENS-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_935LYJENS-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 ata-TOSHIBA_DT01ACA100_935LYJENS-part4 -> ../../sdb4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-name-usa2.hostname.com:0 -> ../../md0
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-name-usa2.hostname.com:1 -> ../../md1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-name-usa2.hostname.com:2 -> ../../md2
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-uuid-99b82f2a:42c4e0d2:c8fa265a:b96538d3 -> ../../md0
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-uuid-ad8c5588:6653e39c:69ec68e7:42098674 -> ../../md2
lrwxrwxrwx 1 root root  9 Sep 29 07:55 md-uuid-f9316e56:57e44d87:5c216d5d:8b5c6f2c -> ../../md1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_3458SZUMS -> ../../sdd
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_3458SZUMS-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_34GD6P5NS -> ../../sdc
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_34GD6P5NS-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_63TU2A1NS -> ../../sda
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_63TU2A1NS-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_63TU2A1NS-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_63TU2A1NS-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_63TU2A1NS-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_935LYJENS -> ../../sdb
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_935LYJENS-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_935LYJENS-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_935LYJENS-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 scsi-SATA_TOSHIBA_DT01ACA_935LYJENS-part4 -> ../../sdb4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 wwn-0x5000039ff6e79502 -> ../../sda
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6e79502-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6e79502-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6e79502-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6e79502-part4 -> ../../sda4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 wwn-0x5000039ff6f2e40b -> ../../sdb
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6f2e40b-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6f2e40b-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6f2e40b-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff6f2e40b-part4 -> ../../sdb4
lrwxrwxrwx 1 root root  9 Sep 29 07:55 wwn-0x5000039ff7e02c0c -> ../../sdd
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff7e02c0c-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Sep 29 07:55 wwn-0x5000039ff7efd40d -> ../../sdc
lrwxrwxrwx 1 root root 10 Sep 29 07:55 wwn-0x5000039ff7efd40d-part1 -> ../../sdc1
[root@usa2 by-id]#

Here you can see different prefixes on the disk names: "ata-", "md-", "scsi-" and "wwn-". On this server we are using SATA drives, so I will use the IDs with the "ata-" prefix. You can also see "-part1", "-part2", "-part3" or "-part4" suffixes in the IDs; these represent the existing partitions on each disk. Now to create the 2 mirrored pools for this server with no mount points:

[root@usa2 /]# zpool create -m none vztank1 mirror ata-TOSHIBA_DT01ACA100_63TU2A1NS-part4 ata-TOSHIBA_DT01ACA100_935LYJENS-part4
[root@usa2 /]# zpool create -m none vztank2 mirror ata-TOSHIBA_DT01ACA100_34GD6P5NS-part1 ata-TOSHIBA_DT01ACA100_3458SZUMS-part1
[root@usa2 /]# zpool status
  pool: vztank1
 state: ONLINE
  scan: none requested
config:

	NAME                                        STATE     READ WRITE CKSUM
	vztank1                                     ONLINE       0     0     0
	  mirror-0                                  ONLINE       0     0     0
	    ata-TOSHIBA_DT01ACA100_63TU2A1NS-part4  ONLINE       0     0     0
	    ata-TOSHIBA_DT01ACA100_935LYJENS-part4  ONLINE       0     0     0

errors: No known data errors

  pool: vztank2
 state: ONLINE
  scan: none requested
config:

	NAME                                        STATE     READ WRITE CKSUM
	vztank2                                     ONLINE       0     0     0
	  mirror-0                                  ONLINE       0     0     0
	    ata-TOSHIBA_DT01ACA100_34GD6P5NS-part1  ONLINE       0     0     0
	    ata-TOSHIBA_DT01ACA100_3458SZUMS-part1  ONLINE       0     0     0

errors: No known data errors
[root@usa2 /]#

Here you can see the 2 "zpool create" commands executed to create each mirrored pool. The "-m none" option ensures the pools are created with no mount points. The "zpool status" command simply confirms the pools exist and provides some extra detail. The "zfs list" command below confirms these pools were created with no mount points.

[root@usa2 /]# zfs list
NAME      USED  AVAIL  REFER  MOUNTPOINT
vztank1   520K   890G   136K  none
vztank2   520K   913G   136K  none
[root@usa2 /]#

Ok, so now we have 2 mirrored pools set up and ready for the next step, creating ZVOLs. But before creating ZVOLs it's best to enable compression on these pools. Compression is a "no-brainer": it generally gives faster reads/writes (less physical I/O) and more usable disk space. To enable compression on both pools I executed the following commands:

[root@usa2 /]# zfs set compression=lzjb vztank1
[root@usa2 /]# zfs set compression=lzjb vztank2
[root@usa2 /]# zfs get compression
NAME     PROPERTY     VALUE     SOURCE
vztank1  compression  lzjb      local
vztank2  compression  lzjb      local
[root@usa2 /]#

Here I have set compression to "lzjb" (the most commonly used algorithm) on both pools, then run "zfs get compression" to confirm compression is set, which it is.



STEP#3 Create ZVOL Block Devices and EXT4 Filesystems

Create ZVOLs (ZFS block devices)

ZVOLs are needed so we can use EXT4, which is the only filesystem OpenVZ truly works with at present. Zpools are analogous to LVM volume groups, while ZVOLs are analogous to LVM logical volumes. The first thing we need to determine is the maximum size of the ZVOL for each pool:

[root@usa2 /]# zpool get free
NAME     PROPERTY  VALUE  SOURCE
vztank1  free      904G   -
vztank2  free      928G   -
[root@usa2 /]#

This command shows us the amount of free space in each pool, reported in "G" units only. We cannot give all of it to a ZVOL, so I scale each figure down by multiplying it by 0.953674, which leaves a little headroom in each pool:
vztank1: 904 x 0.953674 ≈ 862 GBytes
vztank2: 928 x 0.953674 ≈ 885 GBytes
So we create both ZVOLs as shown below:

[root@usa2 /]# zfs create -V 862G vztank1/vzol
[root@usa2 /]# zfs create -V 885G vztank2/vzol
[root@usa2 /]# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
vztank1        889G   742M   136K  none
vztank1/vzol   889G   890G    72K  -
vztank2        913G   640M   136K  none
vztank2/vzol   913G   913G    72K  -
[root@usa2 /]#

Note we told ZFS to create ZVOLs by passing the "-V" option. Here "zfs list" confirms the creation of both ZVOLs. Now we are almost done; all that remains is to create the EXT4 filesystem on each ZVOL. To do this we need to find the device names of the ZVOLs, which are listed in the /dev/zvol/vztankX directories as follows:

[root@usa2 /]# ls -l /dev/zvol/vztank1
total 0
lrwxrwxrwx 1 root root 9 Sep 29 09:27 vzol -> ../../zd0
[root@usa2 /]# ls -l /dev/zvol/vztank2
total 0
lrwxrwxrwx 1 root root 10 Sep 29 09:36 vzol -> ../../zd16
[root@usa2 /]#

Both of these are symlinks back to block devices in /dev, which are simply:
vztank1/vzol = /dev/zd0
vztank2/vzol = /dev/zd16
Now that we know the device names we can create the EXT4 filesystems:

[root@usa2 /]# mkfs.ext4 /dev/zd0
mke2fs 1.41.12 (17-May-2010)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=2 blocks
56492032 inodes, 225968128 blocks
11298406 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
6896 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 24 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
[root@usa2 /]#
[root@usa2 /]# mkfs.ext4 /dev/zd16
mke2fs 1.41.12 (17-May-2010)
Discarding device blocks: done                            
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=2 blocks, Stripe width=2 blocks
57999360 inodes, 231997440 blocks
11599872 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
7080 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848

Writing inode tables: done                            
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 32 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
[root@usa2 /]#

Lastly, we can now mount each ZVOL's ext4 filesystem for use by OpenVZ:

[root@usa2 /]# mount -o noatime /dev/zd0 /vz
[root@usa2 /]# mount -o noatime /dev/zd16 /vz2
[root@usa2 /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        9.7G  2.5G  6.8G  27% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
/dev/md0        485M   96M  364M  21% /boot
/dev/zd0        849G  201M  806G   1% /vz
/dev/zd16       872G  200M  827G   1% /vz2
[root@usa2 /]#

That's it: you now have ZFS on Linux working with OpenVZ!



