
Bug 1356876

Summary: Skyring wants to create journal partition on a disk with not enough space
Product: Red Hat Storage Console
Reporter: Daniel Horák <dahorak>
Component: Ceph
Assignee: Shubhendu Tripathi <shtripat>
Ceph sub component: configuration
QA Contact: sds-qe-bugs
Status: CLOSED WONTFIX
Docs Contact:
Severity: unspecified
Priority: unspecified
CC: dahorak, nthomas, shtripat
Version: 2
Target Milestone: ---
Target Release: 3
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-19 05:40:23 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
  Add/configure OSD task failed (flags: none)

Description Daniel Horák 2016-07-15 07:50:26 UTC
Created attachment 1180064 [details]
Add/configure OSD task failed

Description of problem:
  It seems that Skyring doesn't check the size of the disk on which the journal partition should be created.

  I prepared an OSD node with 8 disks, 4*4GB and 4*1TB in size (see the "Additional info" section).
  I tried to create a Ceph cluster there with a 5GB journal size, but no OSD was created on this node, because:
    ~~~~~~~~~~~~~~~~~~
    create_partition: refusing to create journal on /dev/vdb
    create_partition: journal size (5120M) is bigger than device (4096M)
    ceph-disk: Error: /dev/vdb device size (4096M) is not big enough for journal
    ~~~~~~~~~~~~~~~~~~
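
  A minimal sketch of the missing check, for illustration only (the function name is made up; this is neither Skyring nor ceph-disk code): before assigning a journal disk, verify that the requested journal size fits on the candidate device.
    ~~~~~~~~~~~~~~~~~~
    # Hypothetical pre-flight check (not Skyring/ceph-disk code): skip a
    # candidate journal disk if the requested journal does not fit on it.
    def journal_fits(journal_size_mb, device_size_mb):
        return journal_size_mb <= device_size_mb

    # The sizes from this report: a 5120M journal on the 4096M /dev/vdb.
    print(journal_fits(5120, 4096))   # False -> /dev/vdb should be skipped
    ~~~~~~~~~~~~~~~~~~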

Version-Release number of selected component (if applicable):
  USM Server:
  ceph-ansible-1.0.5-26.el7scon.noarch
  ceph-installer-1.0.14-1.el7scon.noarch
  rhscon-ceph-0.0.33-1.el7scon.x86_64
  rhscon-core-0.0.34-1.el7scon.x86_64
  rhscon-core-selinux-0.0.34-1.el7scon.noarch
  rhscon-ui-0.0.47-1.el7scon.noarch
  
  Ceph OSD node:
  ceph-base-10.2.2-5.el7cp.x86_64
  ceph-common-10.2.2-5.el7cp.x86_64
  ceph-osd-10.2.2-5.el7cp.x86_64
  ceph-selinux-10.2.2-5.el7cp.x86_64
  libcephfs1-10.2.2-5.el7cp.x86_64
  python-cephfs-10.2.2-5.el7cp.x86_64
  rhscon-agent-0.0.15-1.el7scon.noarch
  rhscon-core-selinux-0.0.34-1.el7scon.noarch

How reproducible:
  100%

Steps to Reproduce:
1. Besides the other required nodes, prepare an OSD node with 8 spare disks: 4*4GB and 4*1TB.
2. Create a Ceph cluster and set the OSD journal size to 5GB (the default).
3. When the cluster is created, check the cluster creation task, the ceph-installer logs and the output of `ceph-disk list`.

Actual results:
  The Create Cluster task contains the following information about OSD addition failures:
    ~~~~~~~~~~~~~~~~~~
    OSD addition failed for [HOSTNAME:map[/dev/vdg:/dev/vde] HOSTNAME:map[/dev/vdh:/dev/vdb] HOSTNAME:map[/dev/vdf:/dev/vdd] HOSTNAME:map[/dev/vdi:/dev/vdc]]
    ~~~~~~~~~~~~~~~~~~

  The ceph-installer log contains the following error (acquired via `curl USM-SERVER:8181/api/tasks/ | jq .`):
    ~~~~~~~~~~~~~~~~~~
    ...
    create_partition: refusing to create journal on /dev/vdb
    create_partition: journal size (5120M) is bigger than device (4096M)
    ceph-disk: Error: /dev/vdb device size (4096M) is not big enough for journal
    ...
    ~~~~~~~~~~~~~~~~~~

  No OSD is created on this node:
    ~~~~~~~~~~~~~~~~~~
    # ceph-disk list
      /dev/vda :
       /dev/vda1 other, swap
       /dev/vda2 other, xfs, mounted on /
      /dev/vdb other, unknown
      /dev/vdc other, unknown
      /dev/vdd other, unknown
      /dev/vde other, unknown
      /dev/vdf other, unknown
      /dev/vdg other, unknown
      /dev/vdh other, unknown
      /dev/vdi other, unknown
    ~~~~~~~~~~~~~~~~~~

Expected results:
  It wouldn't make much sense to create OSDs from the small 4GB disks and put their journals on the 1TB disks, but it would still be possible to put both data and journal on the 1TB disks and create 2 OSDs (as there are 4*1TB disks).
  I'm not sure about the proper/best logic here, but the current behaviour is definitely wrong.

Additional info:
  All disks are rotational; the disk size layout is as follows:
    ~~~~~~~~~~~~~~~~~~
    # lsblk 
      NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
      vda    253:0    0  20G  0 disk 
      ├─vda1 253:1    0   2G  0 part 
      └─vda2 253:2    0  18G  0 part /
      vdb    253:16   0   4G  0 disk 
      vdc    253:32   0   4G  0 disk 
      vdd    253:48   0   4G  0 disk 
      vde    253:64   0   4G  0 disk 
      vdf    253:80   0   1T  0 disk 
      vdg    253:96   0   1T  0 disk 
      vdh    253:112  0   1T  0 disk 
      vdi    253:128  0   1T  0 disk 
    ~~~~~~~~~~~~~~~~~~

Comment 1 Shubhendu Tripathi 2016-07-15 09:09:45 UTC
Daniel, as the map [HOSTNAME:map[/dev/vdg:/dev/vde] HOSTNAME:map[/dev/vdh:/dev/vdb] HOSTNAME:map[/dev/vdf:/dev/vdd] HOSTNAME:map[/dev/vdi:/dev/vdc]] shows, the bigger 1 TB disks are trying to use the smaller 4GB disks as journals, so the mapping looks fine.

All of /dev/vdg, /dev/vdh, /dev/vdf and /dev/vdi are 1 TB disks and are marked to be used as data disks.

Here 5120M = 5GB is the journal size, which is certainly larger than the 4096M = 4GB journal disks. That's the reason the journal creation is failing.

As all the disks are rotational, one disk can be used as a journal for only one other disk, so the mapping and the errors are perfectly correct.

Comment 2 Shubhendu Tripathi 2016-07-15 09:19:38 UTC
So the algorithm is implemented in such a way that:
- a co-located journal is not allowed,
- in the case of rotational disks, only one disk can use another disk as its journal,
- the disks are sorted in descending order of size and, one by one, each bigger disk tries to use a smaller one as its journal (a rough sketch follows at the end of this comment).

In the above scenario there are exactly 4 bigger disks and 4 smaller disks, and unfortunately all 4 bigger ones try, one by one, to use the 4 smaller ones as journals; because of their smaller size, none of the journals are allowed to be created.

But as I mentioned, this is exactly as per the algorithm currently implemented in the system.
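
For illustration only, a rough Python sketch of the mapping rules described above (made-up names, not the actual Skyring implementation): sort descending, then repeatedly pair the biggest remaining disk (data) with the smallest remaining disk (journal).
~~~~~~~~~~~~~~~~~~
# Rough sketch of the described rules, not the real Skyring code:
# no co-located journals, one rotational journal disk per data disk,
# biggest remaining disk paired with the smallest remaining one.
def map_journals(disks):
    """disks: list of (name, size_mb) -> {data_disk: journal_disk}"""
    ordered = sorted(disks, key=lambda d: d[1], reverse=True)
    mapping = {}
    while len(ordered) >= 2:
        data = ordered.pop(0)      # biggest remaining disk -> OSD data
        journal = ordered.pop()    # smallest remaining disk -> its journal
        mapping[data[0]] = journal[0]
    return mapping

disks = [("/dev/vdf", 1048576), ("/dev/vdg", 1048576),
         ("/dev/vdh", 1048576), ("/dev/vdi", 1048576),
         ("/dev/vdb", 4096), ("/dev/vdc", 4096),
         ("/dev/vdd", 4096), ("/dev/vde", 4096)]
# Each 1T data disk is paired with a 4G journal disk; every pair then
# fails ceph-disk's check because a 5120M journal does not fit in 4096M.
print(map_journals(disks))
~~~~~~~~~~~~~~~~~~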

Comment 3 Daniel Horák 2016-07-15 09:41:27 UTC
I think I understand the algorithm well enough, and I agree that the behaviour matches it exactly.
The problem is that it tries to create a journal on a disk smaller than the journal size. It should check whether the journal fits on the disk.

I understand that rotational disks of this "small" size are quite an implausible scenario, but I still think it should be covered as a corner case.

The situation will be different with SSD disks: if there is a similar issue with SSDs (trying to put a journal on an SSD without enough space), it will be a more serious problem, because SSDs are smaller and it is more likely to hit a similar corner case.

I'll try to test some other scenarios, also in combination with SSD disks, and then update this BZ.

Comment 4 Daniel Horák 2016-07-19 12:08:31 UTC
I tried a few other scenarios, also with SSD disks, and I think there are still some issues which should be addressed for the rhscon-2 release - at least in the documentation.

This time I created a cluster with a 16GB journal.
The reason for the 16GB journal is as follows:
  I think that a 64GB SSD (rather than a 32GB one) might be quite a realistic scenario.
  And because I know that an SSD disk can hold journals for at most 4 OSD disks, for the best utilisation I divided 64 by 4, which gives 16GB.

My configuration was as follows (the differences are in the size of SSD disks):
- node1: 2 SSD (32G, 32G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
- node2: 2 SSD (64G, 64G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
- node3: 2 SSD (65G, 65G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
- node4: 2 SSD (40G, 80G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)

And the result was this:
- node1: 2 SSD (32G, 32G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
  3 OSDs created: 1 journal per SSD, 1 journal on an HDD;
  creation of 2 OSDs failed because a 32GB SSD is too small for 2x16GB journals

- node2: 2 SSD (64G, 64G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
  5 OSDs created: 3 journals on the first SSD, 2 journals on the second SSD;
  creation of 1 OSD failed because a 64GB SSD is too small for 4x16GB journals

- node3: 2 SSD (65G, 65G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
  6 OSDs created, 4 journals on one SSD, 2 journals on the second

- node4: 2 SSD (40G, 80G), 6 HDD (1T, 1T, 1T, 1T, 1T, 1T)
  6 OSDs created, 4 journals on 80GB SSD, 2 journals on 40GB SSD

The main issue I see is this: it is not possible to put four 16GB journals on a 64GB disk (or two 16GB journals on a 32GB disk, and so on) - you simply cannot divide the size of the disk by the number of desired journals and use that value, which seems to me like quite a straightforward approach.
So it should at least be documented that you should use a slightly smaller journal than the result of dividing the size of the SSD by the number of desired journals.
As you can see on node3, it is possible to create four 16GB journals on a 65GB disk, so for a 64GB SSD the ideal journal size would be 15GB.
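
For illustration, a tiny sketch of the rule of thumb suggested above (the 1 GB overhead is an assumed safety margin, not a measured value):
~~~~~~~~~~~~~~~~~~
# Suggested rule of thumb, not an official formula: leave some headroom
# instead of dividing the SSD size exactly by the number of journals.
def safe_journal_size_gb(ssd_size_gb, journals_per_ssd, overhead_gb=1):
    return (ssd_size_gb - overhead_gb) // journals_per_ssd

print(safe_journal_size_gb(64, 4))   # 15 -> matches the 15GB suggested above
print(safe_journal_size_gb(32, 2))   # 15 -> instead of an exact-fit 16
~~~~~~~~~~~~~~~~~~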

Comment 5 Shubhendu Tripathi 2016-07-19 13:00:30 UTC
Daniel, all I suspect here is that exactly 64GB for 4*16GB journals might not work, as some bytes might be used up down the line.

So, as you correctly pointed out, it might not be an exact-fit case and the SSD should have enough spare space to fit 4 journals.

The admin guide is already supposed to have a section explaining the journal mapping mechanism for OSDs. I will ask the doc team to add a note for this scenario as well.

Thanks for rigorously testing all the possible flows :)

Comment 6 Shubhendu Tripathi 2018-11-19 05:40:23 UTC
This product is EOL now.