Bug 1355970 - CRUSH Node create fails for a large cluster
Summary: CRUSH Node create fails for a large cluster
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Calamari
Version: 2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 3.0
Assignee: Boris Ranto
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-07-13 07:48 UTC by Nishanth Thomas
Modified: 2017-07-19 13:28 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-19 13:28:44 UTC


Attachments
calamari log (deleted), 2016-07-13 07:50 UTC, Nishanth Thomas
cthulhu log (deleted), 2016-07-13 07:51 UTC, Nishanth Thomas

Description Nishanth Thomas 2016-07-13 07:48:04 UTC
Description of problem:
When creating CRUSH hierarchies for a large cluster, the CRUSH node create request fails with the exceptions below:
2016-07-13 01:59:31,323 - ERROR - django.request Internal Server Error: /api/v2/cluster/c2942c44-471c-46ba-b738-e4ebf6e252b5/crush_node
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 91, in dispatch
    return super(RPCViewSet, self).dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
    response = self.handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 108, in handle_exception
    return super(RPCViewSet, self).handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 131, in create
    create_response = self.client.create(fsid, CRUSH_NODE, serializer.get_data())
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 260, in <lambda>
    return lambda *args, **kargs: self(method, *args, **kargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 245, in __call__
    return self._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 55, in _process_response
    result = super(ProfiledRpcClient, self)._process_response(request_event, bufchan, timeout)
  File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 226, in _process_response
    raise ex
TimeoutExpired: timeout after 30s, when calling remote method create
2016-07-13 01:59:31,388 - ERROR - django.request Internal Server Error: /api/v2/cluster/c2942c44-471c-46ba-b738-e4ebf6e252b5/crush_node
Traceback (most recent call last):
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
    return self.dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 91, in dispatch
    return super(RPCViewSet, self).dispatch(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
    return view_func(*args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
    response = self.handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 108, in handle_exception
    return super(RPCViewSet, self).handle_exception(exc)
  File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
    response = handler(request, *args, **kwargs)
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 131, in create
    create_response = self.client.create(fsid, CRUSH_NODE, serializer.get_data())
  File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/serializers/v2.py", line 51, in get_data
    for datum in self.data[field]:
TypeError: 'NoneType' object is not iterable
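
The second traceback ends in `for datum in self.data[field]` raising a TypeError, which is what Python reports when the value being iterated is None. A minimal sketch of that failure mode and one possible guard follows. The names `get_data_unsafe`, `get_data_guarded`, and the `fields` argument are hypothetical and used only for illustration; this is not Calamari's serializer code, and the bug does not say why the field was None.

```python
def get_data_unsafe(data, fields):
    """Flatten list-valued fields; fails if any field is None (hypothetical)."""
    out = []
    for field in fields:
        for datum in data[field]:  # TypeError when data[field] is None
            out.append(datum)
    return out


def get_data_guarded(data, fields):
    """Same idea, but treats a missing or None field as an empty list."""
    out = []
    for field in fields:
        for datum in data.get(field) or []:
            out.append(datum)
    return out


# Made-up input with a None-valued list field, for illustration only.
payload = {"name": "example-host", "items": None}

try:
    get_data_unsafe(payload, ["items"])
except TypeError as exc:
    error_text = str(exc)

guarded = get_data_guarded(payload, ["items"])
```

Whether silently treating None as empty is the right behaviour is a design choice; rejecting the request with a validation error may be preferable to an unhandled 500.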


Version-Release number of selected component (if applicable):
calamari-server-1.4.2-1.el7cp.x86_64

How reproducible:
100% when creating large clusters. In my case the cluster has 18 OSDs.

Steps to Reproduce:
1. Create a large cluster.
2. Create the CRUSH hierarchies from scratch after creating the cluster.


Additional info:

Can provide a setup where this issue is reproducible

Comment 2 Nishanth Thomas 2016-07-13 07:50:37 UTC
Created attachment 1179123
calamari log

Comment 3 Nishanth Thomas 2016-07-13 07:51:15 UTC
Created attachment 1179124
cthulhu log

Comment 4 Nishanth Thomas 2016-07-13 12:53:23 UTC
The strange thing is that the requests complete successfully even though Calamari returns a 500 Internal Server Error:

GET /api/v2/cluster/098da288-5034-4d28-8e93-9a5c12d2c54b/request
HTTP 200 OK
Vary: Accept
Content-Type: text/html; charset=utf-8
Allow: GET, HEAD, OPTIONS

{
    "count": 9, 
    "next": null, 
    "previous": null, 
    "results": [
        {
            "id": "eb2ebb0b-078a-4668-aa43-1aa923782bbc", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:21:08.342551+00:00", 
            "completed_at": "2016-07-13T12:21:40.981105+00:00"
        }, 
        {
            "id": "fd6a1693-40c7-4f47-a453-d906ba43b981", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:19:37.011427+00:00", 
            "completed_at": "2016-07-13T12:20:08.486005+00:00"
        }, 
        {
            "id": "064cda6d-a2f1-4a0d-857a-dd430727fc43", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:18:05.960423+00:00", 
            "completed_at": "2016-07-13T12:18:37.083870+00:00"
        }, 
        {
            "id": "76d066c5-556b-40bb-ae1a-66229b72353c", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:16:29.761690+00:00", 
            "completed_at": "2016-07-13T12:17:06.119425+00:00"
        }, 
        {
            "id": "671c2ab8-dcb2-4293-b0c8-00a3096eee24", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH rule in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:16:27.546834+00:00", 
            "completed_at": "2016-07-13T12:16:28.935236+00:00"
        }, 
        {
            "id": "bf6fbc52-3e90-4000-b1f0-61fb0e2ca941", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:16:23.448211+00:00", 
            "completed_at": "2016-07-13T12:16:25.622181+00:00"
        }, 
        {
            "id": "4252ce0c-090a-4cb3-8b86-b65fe8d757f1", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:15:51.924745+00:00", 
            "completed_at": "2016-07-13T12:16:21.577878+00:00"
        }, 
        {
            "id": "a43cb7b2-3542-4fd3-ad61-b9d18fb25215", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Creating CRUSH node in ceph", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:14:18.648555+00:00", 
            "completed_at": "2016-07-13T12:14:52.097737+00:00"
        }, 
        {
            "id": "88aa9c3a-05ec-4be6-a887-c6ba78b900f3", 
            "state": "complete", 
            "error": false, 
            "error_message": "", 
            "headline": "Deleting pool 'rbd'", 
            "status": "Completed successfully", 
            "requested_at": "2016-07-13T12:14:10.352536+00:00", 
            "completed_at": "2016-07-13T12:14:11.752854+00:00"
        }
    ]
}
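
One way to read the response above is to compare how long each request ran with the 30 s limit reported in the traceback ("TimeoutExpired: timeout after 30s"). The sketch below does that using the ids and timestamps copied from the response; the constant `RPC_TIMEOUT_S` and the helper `duration_seconds` are names introduced here for illustration, and it assumes Python 3.7+ for `datetime.fromisoformat`.

```python
from datetime import datetime

# Taken from "TimeoutExpired: timeout after 30s" in the traceback.
RPC_TIMEOUT_S = 30.0

# (id, headline, requested_at, completed_at) copied from the response above.
REQUESTS = [
    ("eb2ebb0b-078a-4668-aa43-1aa923782bbc", "Creating CRUSH node in ceph",
     "2016-07-13T12:21:08.342551+00:00", "2016-07-13T12:21:40.981105+00:00"),
    ("fd6a1693-40c7-4f47-a453-d906ba43b981", "Creating CRUSH node in ceph",
     "2016-07-13T12:19:37.011427+00:00", "2016-07-13T12:20:08.486005+00:00"),
    ("064cda6d-a2f1-4a0d-857a-dd430727fc43", "Creating CRUSH node in ceph",
     "2016-07-13T12:18:05.960423+00:00", "2016-07-13T12:18:37.083870+00:00"),
    ("76d066c5-556b-40bb-ae1a-66229b72353c", "Creating CRUSH node in ceph",
     "2016-07-13T12:16:29.761690+00:00", "2016-07-13T12:17:06.119425+00:00"),
    ("671c2ab8-dcb2-4293-b0c8-00a3096eee24", "Creating CRUSH rule in ceph",
     "2016-07-13T12:16:27.546834+00:00", "2016-07-13T12:16:28.935236+00:00"),
    ("bf6fbc52-3e90-4000-b1f0-61fb0e2ca941", "Creating CRUSH node in ceph",
     "2016-07-13T12:16:23.448211+00:00", "2016-07-13T12:16:25.622181+00:00"),
    ("4252ce0c-090a-4cb3-8b86-b65fe8d757f1", "Creating CRUSH node in ceph",
     "2016-07-13T12:15:51.924745+00:00", "2016-07-13T12:16:21.577878+00:00"),
    ("a43cb7b2-3542-4fd3-ad61-b9d18fb25215", "Creating CRUSH node in ceph",
     "2016-07-13T12:14:18.648555+00:00", "2016-07-13T12:14:52.097737+00:00"),
    ("88aa9c3a-05ec-4be6-a887-c6ba78b900f3", "Deleting pool 'rbd'",
     "2016-07-13T12:14:10.352536+00:00", "2016-07-13T12:14:11.752854+00:00"),
]


def duration_seconds(requested_at, completed_at):
    """Elapsed time between two ISO 8601 timestamps, in seconds."""
    start = datetime.fromisoformat(requested_at)
    end = datetime.fromisoformat(completed_at)
    return (end - start).total_seconds()


durations = {rid: duration_seconds(req, done)
             for rid, _headline, req, done in REQUESTS}
slow = sorted(rid for rid, secs in durations.items() if secs > RPC_TIMEOUT_S)

for rid, headline, _req, _done in REQUESTS:
    flag = "over timeout" if durations[rid] > RPC_TIMEOUT_S else "ok"
    print("%s  %-28s %7.3fs  %s" % (rid[:8], headline, durations[rid], flag))
```

By this measure, five of the nine requests listed ran longer than 30 s, all of them CRUSH node creations. That would be consistent with the API call timing out while the backend request still finishes, although the bug does not confirm which of these requests returned a 500.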

Comment 6 Gregory Meno 2016-07-14 05:14:41 UTC
can't reproduce, no steps and no POST data for failing request.

Comment 7 Nishanth Thomas 2016-07-14 12:09:10 UTC
Here is the POST data:

{
        "bucket_type": "host", 
        "name": "dhcp-126-122.lab.eng.brq.redhat.com-ssd", 
        "items": [
            {
                "id": 6, 
                "weight": 1.0, 
                "pos": 0
            }, 
            {
                "id": 7, 
                "weight": 0.097991943359375, 
                "pos": 1
            }, 
            {
                "id": 8, 
                "weight": 1.0, 
                "pos": 2
            }, 
            {
                "id": 9, 
                "weight": 1.0, 
                "pos": 3
            }, 
            {
                "id": 10, 
                "weight": 1.0, 
                "pos": 4
            }, 
            {
                "id": 11, 
                "weight": 1.0, 
                "pos": 5
            }
        ]
    }
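
For anyone trying to reproduce this, the body above can be rebuilt programmatically. The following is a minimal sketch, assuming only the three item keys shown above (`id`, `weight`, `pos`) and that `pos` is the item's index in the list; `build_crush_node_payload` is a hypothetical helper, not part of Calamari. The resulting JSON would presumably be sent as a POST to the `crush_node` endpoint seen in the traceback, which this sketch does not attempt.

```python
import json


def build_crush_node_payload(name, bucket_type, osd_weights):
    """Build a CRUSH node create body from (osd_id, weight) pairs.

    Hypothetical helper: positions are assigned in list order, starting at 0.
    """
    items = [
        {"id": osd_id, "weight": weight, "pos": pos}
        for pos, (osd_id, weight) in enumerate(osd_weights)
    ]
    return {"bucket_type": bucket_type, "name": name, "items": items}


# Values copied from the POST data above.
payload = build_crush_node_payload(
    "dhcp-126-122.lab.eng.brq.redhat.com-ssd",
    "host",
    [(6, 1.0), (7, 0.097991943359375), (8, 1.0),
     (9, 1.0), (10, 1.0), (11, 1.0)],
)

body = json.dumps(payload, indent=4)
print(body)
```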

Comment 8 Ken Dreyer (Red Hat) 2016-07-14 19:04:34 UTC
Discussed with Product Management (Neil) today; this failure on large clusters should not block the GA release. The workaround is to use a smaller cluster where this does not break.

Comment 11 Gregory Meno 2017-07-19 13:28:44 UTC
calamari won't ship in 3.0

