Bug 1517128 - [RFE] CRUSH map ruleset for primary replicas on SSD OSDs and secondary on HDD OSDs on hosts with mix of SSD and HDD OSDs may place 2 replicas on the same host
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: 3.*
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-24 09:45 UTC by Tomas Petr
Modified: 2019-03-07 23:21 UTC
CC: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-07 23:21:51 UTC



Description Tomas Petr 2017-11-24 09:45:01 UTC
Description of problem:
Creating a CRUSH map ruleset that places primary replicas on SSD OSDs and secondary replicas on HDD OSDs may place 2 replicas on the same host when SSD and HDD OSDs are located on the same host: the primary on that host's SSD OSD and a secondary on one of the same host's HDD OSDs.


Version-Release number of selected component (if applicable):
# ceph version
ceph version 12.2.1-39.el7cp (22e26be5a4920c95c43f647b31349484f663e4b9) luminous (stable)


How reproducible:
Always

Steps to Reproduce:
1. create a Ceph environment where SSD OSDs and HDD OSDs are on the same host
2. follow the steps to create a CRUSH map ruleset with primary replicas on SSD OSDs and secondary replicas on HDD OSDs
3. check whether 2 replicas of the same PG land on the same host

Actual results:
Two replicas of the same PG may be placed on the same host (the primary on an SSD OSD and a secondary on an HDD OSD of that host).

Expected results:
Every replica of a PG is placed on a distinct host.

Additional info:
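The overlap follows from how such a rule works: it makes two independent passes over the hierarchy (one to pick the SSD primary, one to pick the HDD secondaries), and CRUSH keeps no state across emit steps, so the second pass is free to pick the primary's host again. A minimal Python sketch of this behavior (a toy model with made-up host names, not actual CRUSH code):

```python
import random

# Five hosts, each holding both an SSD OSD and HDD OSDs,
# mirroring the cluster described in this bug.
HOSTS = ["osds-0", "osds-1", "osds-2", "osds-3", "osds-4"]

def place_pg(rng, size=3):
    # Pass 1: pick 1 host from the SSD subtree (all five qualify).
    primary_host = rng.choice(HOSTS)
    # Pass 2: independently pick size-1 hosts from the HDD subtree.
    # CRUSH restarts here with no memory of pass 1, so the
    # primary's host can be chosen again.
    secondary_hosts = rng.sample(HOSTS, size - 1)
    return [primary_host] + secondary_hosts

rng = random.Random(42)
collisions = sum(
    1 for _ in range(1000)
    if (pg := place_pg(rng))[0] in pg[1:]
)
# With 5 hosts and 2 secondaries, the primary's host is reused
# in roughly 2/5 of placements.
print(collisions)
```

This is why the same-host placements show up regardless of whether device classes or split host buckets are used: both layouts leave the two passes unaware of each other.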

Comment 4 Tomas Petr 2017-11-24 09:54:52 UTC
Reproduction steps:
Crush map decompiled:
# cat crush.decom.bz
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osds-0 {
	id -2		# do not change unnecessarily
	id -7 class hdd		# do not change unnecessarily
	id -13 class ssd		# do not change unnecessarily
	# weight 0.113
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.025
	item osd.8 weight 0.029
	item osd.14 weight 0.029
	item osd.15 weight 0.029
}
host osds-3 {
	id -3		# do not change unnecessarily
	id -8 class hdd		# do not change unnecessarily
	id -14 class ssd		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.025
	item osd.6 weight 0.029
	item osd.12 weight 0.029
}
host osds-2 {
	id -4		# do not change unnecessarily
	id -9 class hdd		# do not change unnecessarily
	id -15 class ssd		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.025
	item osd.7 weight 0.029
	item osd.11 weight 0.029
}
host osds-1 {
	id -5		# do not change unnecessarily
	id -10 class hdd		# do not change unnecessarily
	id -16 class ssd		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 0.025
	item osd.5 weight 0.029
	item osd.10 weight 0.029
}
host osds-4 {
	id -6		# do not change unnecessarily
	id -11 class hdd		# do not change unnecessarily
	id -17 class ssd		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 0.025
	item osd.9 weight 0.029
	item osd.13 weight 0.029
}
root default {
	id -1		# do not change unnecessarily
	id -12 class hdd		# do not change unnecessarily
	id -18 class ssd		# do not change unnecessarily
	# weight 0.449
	alg straw
	hash 0	# rjenkins1
	item osds-0 weight 0.113
	item osds-3 weight 0.084
	item osds-2 weight 0.084
	item osds-1 weight 0.084
	item osds-4 weight 0.084
}

# rules
rule ssd-pool {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take default class ssd
	step chooseleaf firstn 0 type host
	step emit
}
rule hdd-pool {
	id 1
	type replicated
	min_size 1
	max_size 10
	step take default class hdd
	step chooseleaf firstn 0 type host
	step emit
}
rule ssd-primary {
	id 2
	type replicated
	min_size 1
	max_size 10
	step take default class ssd
	step chooseleaf firstn 1 type host
	step emit
	step take default class hdd
	step chooseleaf firstn -1 type host
	step emit
}

# end crush map


# ceph osd tree (test system, the ssd OSDs are not actual SSDs)
ID CLASS WEIGHT  TYPE NAME	 STATUS REWEIGHT PRI-AFF
-1	 0.44899 root default
-2	 0.11299     host osds-0
 8   hdd 0.02899         osd.8       up  1.00000 1.00000
14   hdd 0.02899         osd.14      up  1.00000 1.00000
15   hdd 0.02899         osd.15      up  1.00000 1.00000
 0   ssd 0.02499         osd.0       up  1.00000 1.00000
-5	 0.08400     host osds-1
 5   hdd 0.02899         osd.5       up  1.00000 1.00000
10   hdd 0.02899         osd.10      up  1.00000 1.00000
 4   ssd 0.02499         osd.4       up  1.00000 1.00000
-4	 0.08400     host osds-2
 7   hdd 0.02899         osd.7       up  1.00000 1.00000
11   hdd 0.02899         osd.11      up  1.00000 1.00000
 2   ssd 0.02499         osd.2       up  1.00000 1.00000
-3	 0.08400     host osds-3
 6   hdd 0.02899         osd.6       up  1.00000 1.00000
12   hdd 0.02899         osd.12      up  1.00000 1.00000
 1   ssd 0.02499         osd.1       up  1.00000 1.00000
-6	 0.08400     host osds-4
 9   hdd 0.02899         osd.9       up  1.00000 1.00000
13   hdd 0.02899         osd.13      up  1.00000 1.00000
 3   ssd 0.02499         osd.3       up  1.00000 1.00000


- pool creation
[root@mons-0 ~]# ceph osd pool create fast-ssd 8 8 replicated ssd-pool
pool 'fast-ssd' created
[root@mons-0 ~]# ceph osd pool create slow-hdd 8 8 replicated hdd-pool
pool 'slow-hdd' created
[root@mons-0 ~]# ceph osd pool create mixed 8 8 replicated ssd-primary
pool 'mixed' created


pool 3 'fast-ssd' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 122 flags hashpspool stripe_width 0
pool 4 'slow-hdd' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 last_change 125 flags hashpspool stripe_width 0
pool 5 'mixed' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 last_change 129 flags hashpspool stripe_width 0

# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e142 pg 3.0 (3.0) -> up [3,1,0] acting [3,1,0]
osdmap e142 pg 3.1 (3.1) -> up [2,3,1] acting [2,3,1]
osdmap e142 pg 3.2 (3.2) -> up [4,3,1] acting [4,3,1]
osdmap e142 pg 3.3 (3.3) -> up [4,0,2] acting [4,0,2]
osdmap e142 pg 3.4 (3.4) -> up [1,4,3] acting [1,4,3]
osdmap e142 pg 3.5 (3.5) -> up [0,4,2] acting [0,4,2]
osdmap e142 pg 3.6 (3.6) -> up [1,3,2] acting [1,3,2]
osdmap e142 pg 3.7 (3.7) -> up [2,4,1] acting [2,4,1]
^^^ all SSDs - check OK

osdmap e142 pg 4.0 (4.0) -> up [9,10,11] acting [9,10,11]
osdmap e142 pg 4.1 (4.1) -> up [6,13,14] acting [6,13,14]
osdmap e142 pg 4.2 (4.2) -> up [5,14,12] acting [5,14,12]
osdmap e142 pg 4.3 (4.3) -> up [12,15,9] acting [12,15,9]
osdmap e142 pg 4.4 (4.4) -> up [11,15,12] acting [11,15,12]
osdmap e142 pg 4.5 (4.5) -> up [9,6,11] acting [9,6,11]
osdmap e142 pg 4.6 (4.6) -> up [9,11,5] acting [9,11,5]
osdmap e142 pg 4.7 (4.7) -> up [10,9,15] acting [10,9,15]
^^^ all HDDs - check OK

osdmap e142 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e142 pg 5.1 (5.1) -> up [2,14,7] acting [2,14,7]
^^^ osd.2 and 7 are on the same host osds-2
osdmap e142 pg 5.2 (5.2) -> up [3,9,8] acting [3,9,8]
^^^ 3 and 9 on the same host osds-4
osdmap e142 pg 5.3 (5.3) -> up [4,15,6] acting [4,15,6]
osdmap e142 pg 5.4 (5.4) -> up [0,12,5] acting [0,12,5]
osdmap e142 pg 5.5 (5.5) -> up [2,15,9] acting [2,15,9]
osdmap e142 pg 5.6 (5.6) -> up [3,6,5] acting [3,6,5]
osdmap e142 pg 5.7 (5.7) -> up [0,5,13] acting [0,5,13]

Comment 5 Tomas Petr 2017-11-24 10:02:12 UTC
Following the same steps as in the upstream link:
http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#placing-different-pools-on-different-osds

# cat decomp
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osds-0-hdd {
	id -2		# do not change unnecessarily
	# weight 0.088
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 0.029
	item osd.14 weight 0.029
	item osd.15 weight 0.029
}
host osds-0-ssd {
	id -7		# do not change unnecessarily
	# weight 0.025
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.025
}
host osds-3-hdd {
	id -3		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 0.029
	item osd.12 weight 0.029
}
host osds-3-ssd {
	id -8		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.025
}
host osds-2-hdd {
	id -4		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.7 weight 0.029
	item osd.11 weight 0.029
}
host osds-2-ssd {
	id -9		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 0.025
}
host osds-1-hdd {
	id -5		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 0.029
	item osd.10 weight 0.029
}
host osds-1-ssd {
	id -10		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.4 weight 0.025
}
host osds-4-hdd {
	id -6		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.9 weight 0.029
	item osd.13 weight 0.029
}
host osds-4-ssd {
	id -11		# do not change unnecessarily
	# weight 0.084
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 0.025
}
#root buckets
root ssd-root {
	id -12		# do not change unnecessarily
	# weight 0.449
	alg straw
	hash 0	# rjenkins1
	item osds-0-ssd weight 0.025
	item osds-3-ssd weight 0.029
	item osds-2-ssd weight 0.029
	item osds-1-ssd weight 0.029
	item osds-4-ssd weight 0.029
}
root hdd-root {
	id -1		# do not change unnecessarily
	# weight 0.449
	alg straw
	hash 0	# rjenkins1
	item osds-0-hdd weight 0.088
	item osds-3-hdd weight 0.058
	item osds-2-hdd weight 0.058
	item osds-1-hdd weight 0.058
	item osds-4-hdd weight 0.058
}
# rules
rule ssd-pool {
	id 0
	type replicated
	min_size 1
	max_size 10
	step take ssd-root
	step chooseleaf firstn 0 type host
	step emit
}
rule hdd-pool {
	id 1
	type replicated
	min_size 1
	max_size 10
	step take hdd-root
	step chooseleaf firstn 0 type host
	step emit
}
rule ssd-primary {
	id 2
	type replicated
	min_size 1
	max_size 10
	step take ssd-root
	step chooseleaf firstn 1 type host
	step emit
	step take hdd-root
	step chooseleaf firstn -1 type host
	step emit
}
# end crush map


# ceph osd tree
ID  CLASS WEIGHT  TYPE NAME	      STATUS REWEIGHT PRI-AFF
-12	  0.14096 root ssd-root
 -7	  0.02499     host osds-0-ssd
  0   ssd 0.02499         osd.0           up  1.00000 1.00000
-10       0.02899     host osds-1-ssd
  4   ssd 0.02499         osd.4           up  1.00000 1.00000
 -9       0.02899     host osds-2-ssd
  2   ssd 0.02499         osd.2           up  1.00000 1.00000
 -8       0.02899     host osds-3-ssd
  1   ssd 0.02499         osd.1           up  1.00000 1.00000
-11       0.02899     host osds-4-ssd
  3   ssd 0.02499         osd.3           up  1.00000 1.00000
 -1       0.31999 root hdd-root
 -2       0.08800     host osds-0-hdd
  8   hdd 0.02899         osd.8           up  1.00000 1.00000
 14   hdd 0.02899         osd.14          up  1.00000 1.00000
 15   hdd 0.02899         osd.15          up  1.00000 1.00000
 -5       0.05800     host osds-1-hdd
  5   hdd 0.02899         osd.5           up  1.00000 1.00000
 10   hdd 0.02899         osd.10          up  1.00000 1.00000
 -4       0.05800     host osds-2-hdd
  7   hdd 0.02899         osd.7           up  1.00000 1.00000
 11   hdd 0.02899         osd.11          up  1.00000 1.00000
 -3	  0.05800     host osds-3-hdd
  6   hdd 0.02899         osd.6           up  1.00000 1.00000
 12   hdd 0.02899         osd.12          up  1.00000 1.00000
 -6	  0.05800     host osds-4-hdd
  9   hdd 0.02899         osd.9           up  1.00000 1.00000
 13   hdd 0.02899         osd.13          up  1.00000 1.00000


# for i in `seq 3 5` ; do for j in `seq 0 7`; do ceph pg map $i.$j ; done; done
osdmap e144 pg 3.0 (3.0) -> up [4,3,2] acting [4,3,2]
osdmap e144 pg 3.1 (3.1) -> up [4,2,3] acting [4,2,3]
osdmap e144 pg 3.2 (3.2) -> up [2,0,3] acting [2,0,3]
osdmap e144 pg 3.3 (3.3) -> up [2,1,4] acting [2,1,4]
osdmap e144 pg 3.4 (3.4) -> up [1,0,4] acting [1,0,4]
osdmap e144 pg 3.5 (3.5) -> up [4,0,1] acting [4,0,1]
osdmap e144 pg 3.6 (3.6) -> up [4,1,3] acting [4,1,3]
osdmap e144 pg 3.7 (3.7) -> up [3,1,2] acting [3,1,2]
^^^ all SSDs - check OK

osdmap e144 pg 4.0 (4.0) -> up [11,13,15] acting [11,13,15]
osdmap e144 pg 4.1 (4.1) -> up [11,5,13] acting [11,5,13]
osdmap e144 pg 4.2 (4.2) -> up [9,7,6] acting [9,7,6]
osdmap e144 pg 4.3 (4.3) -> up [15,9,12] acting [15,9,12]
osdmap e144 pg 4.4 (4.4) -> up [6,11,10] acting [6,11,10]
osdmap e144 pg 4.5 (4.5) -> up [9,10,15] acting [9,10,15]
osdmap e144 pg 4.6 (4.6) -> up [14,6,11] acting [14,6,11]
osdmap e144 pg 4.7 (4.7) -> up [14,9,6] acting [14,9,6]
^^^ all HDDs - check OK

osdmap e144 pg 5.0 (5.0) -> up [3,13,14] acting [3,13,14]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e144 pg 5.1 (5.1) -> up [2,5,6] acting [2,5,6]
osdmap e144 pg 5.2 (5.2) -> up [3,10,13] acting [3,10,13]
^^^ osd 3 and 13 are on the same host osds-4
osdmap e144 pg 5.3 (5.3) -> up [2,6,11] acting [2,6,11]
^^^ osd 2 and 11 are on the same host osds-2
osdmap e144 pg 5.4 (5.4) -> up [1,13,12] acting [1,13,12]
^^^ osd.1 (osds-3-ssd) and osd.12 (osds-3-hdd) are on the same physical host osds-3
osdmap e144 pg 5.5 (5.5) -> up [0,9,7] acting [0,9,7]
osdmap e144 pg 5.6 (5.6) -> up [1,6,5] acting [1,6,5]
^^^ osd.1 (osds-3-ssd) and osd.6 (osds-3-hdd) are on the same physical host osds-3
osdmap e144 pg 5.7 (5.7) -> up [4,5,13] acting [4,5,13]

Comment 13 Greg Farnum 2019-03-07 23:21:51 UTC
There are various workarounds that can let users approach this problem, but a generic way of handling this is unfortunately beyond the plausible scope of CRUSH.

I suspect the best approach would be to add a replica-affinity parameter that mirrors the primary-affinity so we could set SSDs to be primaries and not replicas, and vice-versa on hard drives. If it becomes a strategic imperative we can open an RFE for that.
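Part of this already exists today: `ceph osd primary-affinity osd.<id> 0` biases primary selection away from an OSD without moving any data. Setting it to 0 on the HDD OSDs of an ordinary host-failure-domain pool tends to make the SSD member of each up set the primary, while the plain `chooseleaf ... type host` rule keeps replicas on distinct hosts. A simplified Python model of that selection (the real monitor logic retries with hashing; this sketch just takes the first OSD in CRUSH order with nonzero affinity):

```python
# Primary affinity per OSD; unlisted OSDs default to 1.0.
# Here all HDD OSDs (5-15 in this bug's cluster) are set to 0,
# as `ceph osd primary-affinity osd.<id> 0` would do.
PRIMARY_AFFINITY = {o: 0.0 for o in range(5, 16)}

def pick_primary(up_set):
    """First OSD in the up set with nonzero primary affinity,
    falling back to the CRUSH-ordered head if all are zero."""
    for osd in up_set:
        if PRIMARY_AFFINITY.get(osd, 1.0) > 0:
            return osd
    return up_set[0]

print(pick_primary([8, 0, 5]))   # -> 0 (the SSD OSD becomes primary)
print(pick_primary([5, 10, 8]))  # -> 5 (all HDD: CRUSH head kept)
```

Unlike the ssd-primary rule, this keeps host uniqueness intact, at the cost of only biasing (not guaranteeing) SSD primaries when a replica set happens to contain no SSD OSD.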

