Bug 1358275 - Rados df gives wrong degraded object count
Summary: Rados df gives wrong degraded object count
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 2.1
Assignee: Josh Durgin
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks: 1349913
 
Reported: 2016-07-20 12:13 UTC by anmol babu
Modified: 2018-10-23 17:18 UTC (History)
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-21 18:44:03 UTC



Description anmol babu 2016-07-20 12:13:38 UTC
Description of problem:
The rados df CLI command reports a degraded object count greater than the total number of objects when some of the PGs are degraded.

Version-Release number of selected component (if applicable):

Mon rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-5.el7cp.x86_64
python-cephfs-10.2.2-5.el7cp.x86_64
ceph-common-10.2.2-5.el7cp.x86_64
ceph-base-10.2.2-5.el7cp.x86_64
libcephfs1-10.2.2-5.el7cp.x86_64
ceph-mon-10.2.2-5.el7cp.x86_64

OSD rpms:
rpm -qa|grep ceph
ceph-selinux-10.2.2-9.el7cp.x86_64
ceph-common-10.2.2-9.el7cp.x86_64
ceph-base-10.2.2-9.el7cp.x86_64
libcephfs1-10.2.2-9.el7cp.x86_64
python-cephfs-10.2.2-9.el7cp.x86_64
ceph-osd-10.2.2-9.el7cp.x86_64

How reproducible:
Frequently

Steps to Reproduce:
1. Prepare a pool and add several objects to it, e.g. 4 objects.
2. Remove some OSDs so that fewer OSDs remain than the pool's replica size requires.
3. Create another object (see the command sketch below).
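
A rough command sketch of these steps, assuming pool name p1, cluster name c1, and OSD id 2 to match the output further down; adjust names and ids to your environment:

ceph osd pool create p1 128 --cluster c1
ceph osd pool set p1 size 3 --cluster c1
# step 1: write a few objects
for i in 1 2 3 4; do rados -p p1 put obj$i /etc/hosts --cluster c1; done
# step 2: take one OSD down and out so fewer OSDs remain than the pool size requires
systemctl stop ceph-osd@2
ceph osd out 2 --cluster c1
# step 3: create another object, then check the counters
rados -p p1 put obj5 /etc/hosts --cluster c1
rados df --cluster c1 --format json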

Actual results:
degraded object count > total object count

Expected results:
Degraded object count should not be more than total object count.

Additional info:

rados df --cluster c1 --format json
{"pools":[{"name":"p1","id":"1","size_bytes":"114","size_kb":"1","num_objects":"4","num_object_clones":"0","num_object_copies":"12","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"8","read_ops":"5483","read_bytes":"4009984","write_ops":"8","write_bytes":"2048"},{"name":"p2","id":"2","size_bytes":"0","size_kb":"0","num_objects":"1","num_object_clones":"0","num_object_copies":"3","num_objects_missing_on_primary":"0","num_objects_unfound":"0","num_objects_degraded":"2","read_ops":"0","read_bytes":"0","write_ops":"2","write_bytes":"0"}],"total_objects":"5","total_used":"74284","total_avail":"31360428","total_space":"31434712"}

ceph -s --cluster c1
    cluster ef7329fe-01e5-4b60-8427-71112db95c9d
     health HEALTH_WARN
            256 pgs degraded
            256 pgs stuck unclean
            256 pgs undersized
            recovery 10/15 objects degraded (66.667%)
     monmap e1: 1 mons at {dhcp41-235=10.70.41.235:6789/0}
            election epoch 3, quorum 0 dhcp41-235
     osdmap e37: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v700: 256 pgs, 2 pools, 114 bytes data, 5 objects
            74284 kB used, 30625 MB / 30697 MB avail
            10/15 objects degraded (66.667%)
                 256 active+undersized+degraded

Comment 2 Ken Dreyer (Red Hat) 2016-07-20 13:40:30 UTC
The builds listed above are pretty old. Please confirm that this is still happening with the latest builds (10.2.2-24.el7cp).

Comment 3 Samuel Just 2016-07-20 14:45:13 UTC
It's actually OK for there to be more degraded objects than objects. If the pool is configured for 4 replicas but you only have 2, each object is degraded twice. This appears to have happened with 2 OSDs and pool size=3, however, so it seems like each object should have been degraded only once. Possibly a bug with constructing the stats.
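
To spell out the arithmetic under that reading: with pool size=3, 5 objects, and one missing replica per object, the expected report would be 5/15 objects degraded (33.3%). The cluster instead reports 10/15 (66.667%), which is what you would see if each object had only one of its three copies present, e.g. if the PGs were mapped to a single OSD.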

I do not think this should be a 2.0 blocker.

Comment 4 Josh Durgin 2016-07-23 01:07:29 UTC
I haven't been able to reproduce this myself. One possible cause would be PGs getting mapped to a smaller acting set than expected, as can happen with older crush tunables.

Can you reproduce with osd debugging (debug osd = 20, debug ms = 1) enabled and post the osd logs and output of 'ceph pg dump'?
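
For reference, a minimal sketch of one way to turn that logging on and capture the requested data, assuming the cluster name c1 from the report; the config file path is likewise an assumption:

# persistent: add to the [osd] section of /etc/ceph/c1.conf on the OSD nodes, then restart the OSDs
[osd]
    debug osd = 20
    debug ms = 1

# or inject at runtime without a restart
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1' --cluster c1

# capture the PG state
ceph pg dump --cluster c1 > pg_dump.txt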

Comment 5 Samuel Just 2016-09-21 18:44:03 UTC
Closing on the assumption that it's just the normal behavior absent any other information.

