Bug 1060389 - Document that `sssd` cache needs to be cleared manually, if ID mapping configuration changes
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: sssd
Version: 7.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jakub Hrozek
QA Contact: Kaushik Banerjee
URL:
Whiteboard:
Depends On:
Blocks: 1044717
 
Reported: 2014-02-01 03:58 UTC by Chinmay Paradkar
Modified: 2014-06-18 04:06 UTC (History)

Fixed In Version: sssd-1.11.2-50.el7
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-13 10:44:18 UTC



Description Chinmay Paradkar 2014-02-01 03:58:20 UTC
Description of problem:

Sometimes we need to clear the `sssd` cache manually, either because we changed the `sssd` configuration or because `sssd` is failing to start. When the cache has to be cleared manually because `sssd` finds it unusable (a common issue), the customer would expect one of two things:

a) sssd logs the reason for the startup failure somewhere less hidden.  It should not be necessary to set debug_level on the daemon and infer the meaning from one of the files inside /var/log/sssd/. systemd already suggests checking the logs with 'systemctl status sssd.service' or 'journalctl -xn', so rather than simply logging:

    Jan 15 10:36:13 abc.xyz.com systemd[1]: Starting System Security Services Daemon...
    Jan 15 10:36:13 abc.xyz.com sssd[20912]: Starting up
    Jan 15 10:36:13 abc.xyz.com sssd[be[20913]: Starting up
    Jan 15 10:36:13 abc.xyz.com sssd[be[20914]: Starting up
    Jan 15 10:36:15 abc.xyz.com sssd[be[20915]: Starting up
    Jan 15 10:36:19 abc.xyz.com sssd[be[20920]: Starting up
    Jan 15 10:36:19 abc.xyz.com systemd[1]: sssd.service: control process exited, code=exited status=1
    Jan 15 10:36:19 abc.xyz.com systemd[1]: Failed to start System Security Services Daemon.
    Jan 15 10:36:19 abc.xyz.com systemd[1]: Unit sssd.service entered failed state.

Note: `sssd` should log an additional entry along the lines of:
    Jan 15 10:36:19 abc.xyz.com sssd[20920]: Unable to start, local cache is unusable.  Try rm -f /var/lib/sss/db/*cache* and restarting sssd.

Right now, even after knowing to turn up debug_level and look inside /var/log/sssd/, it is not readily apparent that the failures reported are the result of a cache that sssd finds unusable.

b) Don't fail on an unusable cache, but clean it up for the user.  If the cache file is unusable/corrupt to the point that sssd literally cannot even start, then the cache is providing no benefit to the user or system.  Rather than crashing, sssd should clear the cache files (or move them to /var/lib/sss/db-corrupt) and continue its startup.  This action should be logged to the system log so log monitoring tools can capture the event.  I am sure somewhere someone will not want the cache auto-cleared, but that is easily addressed by a config file flag to disable the auto-cleanup.

Options (a) and (b) are mutually exclusive.  While (a) is a minimal and hopefully relatively easy option to accomplish, (b) would provide the host a more robust identity lookup service and reduce admin frustration.
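
As a rough illustration of option (b), the move-aside step could look like the shell sketch below. It is not how sssd behaves today. The function name, the logger tag and the message text are hypothetical; only the paths /var/lib/sss/db and /var/lib/sss/db-corrupt come from this report.

```shell
#!/bin/sh
# Hypothetical helper: move SSSD cache files aside instead of deleting them.
# The defaults are the paths mentioned in this report; the function name and
# the logger tag are assumptions made for illustration only.
quarantine_sssd_cache() {
    db_dir=${1:-/var/lib/sss/db}
    quarantine_dir=${2:-/var/lib/sss/db-corrupt}
    moved=0

    mkdir -p "$quarantine_dir" || return 1
    for f in "$db_dir"/*cache*; do
        # When nothing matches, the loop sees the literal pattern; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$quarantine_dir"/ || return 1
        moved=$((moved + 1))
    done

    # Record the action in the system log so monitoring tools can capture it.
    logger -t sssd-cache-quarantine \
        "moved $moved cache file(s) from $db_dir to $quarantine_dir" \
        2>/dev/null || true

    # Print the number of files that were moved.
    echo "$moved"
}
```

Called without arguments as root, the function would act on the real cache directory; passing two scratch directories makes it safe to try out first.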

Regards,
Chinmay Paradkar

Comment 2 Jakub Hrozek 2014-02-03 11:08:16 UTC
You're right that failure to report why SSSD failed to start should be logged by default. But that's a minor bug that doesn't concern me as much as the fact that the cache got corrupt in the first place.

We're working on cleaning up the DEBUG messages for the next major version. It's not as easy a task as it may seem, because for historical reasons we used two different ways of expressing the verbosity of a message.

I agree that if we see a corrupt cache, we should try to recover. But unfortunately, there is not enough info in this bug report to know what's going on. Can you get us the debug logs from when the cache was broken, and ideally the cache as well?

Comment 3 Jakub Hrozek 2014-02-10 16:00:36 UTC
OK, I should have checked what the customer case is first. Can you confirm that there is no 'corruption' per se, but there was a confusion that you need to purge the cache in case the ID mapping changes? In other words, were you referring to:

(Wed Jan 15 10:40:18 2014) [sssd[be[ad.gatech.edu]]] [sdap_idmap_add_domain] (0x0020): BUG: Range maximum exceeds the global maximum: 4099232704 > 2000200000
(Wed Jan 15 10:40:18 2014) [sssd[be[ad.gatech.edu]]] [sdap_idmap_init] (0x0020): Could not add domain [coa.ad.gatech.edu][S-1-5-21-2133428053-1096432638-2007205440][4196] to ID map: [Invalid argument]

In this case the sssd cache was never 'corrupt'; the configuration simply was wrong. I agree, though, that it's very important to make the admin aware of what went wrong.
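
Until such failures reach syslog, messages like the two above have to be dug out of /var/log/sssd/ by hand. A minimal sketch of that search follows, assuming the log line layout shown above and the level values 0x0010 (fatal) and 0x0020 (critical) described in sssd.conf(5). The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print fatal (0x0010) and critical (0x0020) messages
# found in the SSSD log directory. The default path is the one named in
# this report; pass another directory as the first argument to override it.
sssd_critical_messages() {
    log_dir=${1:-/var/log/sssd}
    # -h drops file names from the output; the pattern matches the
    # "(0x0010): " or "(0x0020): " level field of each log line.
    grep -h -E '\(0x00(10|20)\): ' "$log_dir"/*.log 2>/dev/null
}
```

The function returns a non-zero status when nothing matches, so it can also be used as a quick yes/no check after a failed start.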

Comment 4 Chinmay Paradkar 2014-02-11 20:46:55 UTC
Hello,

-->> Can you confirm that there is no 'corruption' per se, but there was a confusion that you need to purge the cache in case the ID mapping changes?

Yes, we had made changes to the ID mapping and hence had to clear the cache and restart the sssd service.

Regards,
Chinmay Paradkar

Comment 5 Howey Vernon 2014-02-14 15:46:52 UTC
Apologies, it appears there was some nuance missed when this
was initially copied from the case.  As you deduced, the
issue was sssd failing to start when the cache was corrupt
OR when the cache became unusable as a result of a
configuration change.

There have been relatively rare cases of cache corruption
(different issue).  The primary concern was the almost
guaranteed unusable cache and consequential nearly silent
sssd startup failure upon use of realmd to setup the initial
sssd.conf.  This is especially problematic if the sssd.conf
file is non-interactively corrected from the realmd version
via a Satellite config channel or other configuration
management engine such as puppet.  Those tools would need to
be taught to detect a cache issue and clean it if sssd
failed to start after they applied their changes, a
non-trivial task.

This could potentially naively be addressed outside of the
sssd daemon itself with some tweaks to the sssd service
file.  By adding these two lines to the [Unit] section:
    OnFailure=sssd-cacheclean.service
    Conflicts=sssd-cacheclean.service

and creating sssd-cacheclean.service as below, when sssd
fails to successfully start, it tries cleaning the cache and
then restarting sssd.

sssd-cacheclean.service
--------------------------
[Unit]
Description=System Security Services Daemon Cache Cleaner
After=syslog.target
Conflicts=sssd.service
OnFailure=sssd.service

[Service]
EnvironmentFile=-/etc/sysconfig/sssd
ExecStart=/usr/bin/find /var/lib/sss/db/ -maxdepth 1 -mindepth 1 -name "*cache*" -delete ; /bin/false
Type=oneshot
Restart=no
--------------------------

There may be some reason why an automatic cache clean in
this manner is undesirable, so I did not go farther than the
trivial POC above.  Obviously if this was considered,
something more appropriate than a call to find should be
used (ideally something smart enough to determine it was an
actual unusable cache issue), as well as a need to include
options to prevent this from looping in the event of a
failure not solved by cleaning the cache.

If there is an easy way to determine if the cache is
unsuitable for sssd as configured, it would probably be
simpler to add correction straight into the sssd.service
file, avoiding the complexity of a secondary OnFailure
service.
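
One way the looping concern above might be handled is a marker file that allows only a single cleaning attempt. This is a sketch under assumptions: the function name and the marker path are hypothetical, and the find invocation is simply the one from the POC unit.

```shell
#!/bin/sh
# Hypothetical guard around the POC's cache cleaning: clean at most once
# until the marker file is removed again. The marker path is an assumption.
clean_cache_once() {
    db_dir=${1:-/var/lib/sss/db}
    marker=${2:-/run/sssd-cacheclean.attempted}

    if [ -e "$marker" ]; then
        echo "cache was already cleaned once; refusing to loop" >&2
        return 1
    fi

    # Leave the marker behind before touching the cache.
    : > "$marker" || return 1

    # Same selection as the POC: top-level entries whose name contains "cache".
    find "$db_dir" -maxdepth 1 -mindepth 1 -name '*cache*' -delete
}
```

Something would still have to remove the marker once sssd has started successfully, otherwise a later, unrelated unusable cache would never be cleaned.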

Comment 6 Dmitri Pal 2014-02-14 21:35:57 UTC
Our vision is that there will be an external way to clean up the cache via OpenLMI, and then the central server that pushes the change would be able to make a call whether this cleanup is needed depending upon the changes being made. I do not think SSSD has all the information to determine whether cleaning the cache is a good idea or not.

Comment 7 Jakub Hrozek 2014-02-17 09:16:13 UTC
(In reply to howey.vernon from comment #5)

I'm sorry for the late reply, I was out of the office on Friday.

> Apologies, it appears there was some nuance missed when this
> was initially copied from the case.  As you deduced, the
> issue was sssd failing to start when the cache was corrupt
> OR when the cache became unusable as a result of a
> configuration change.
> 

Ah, sorry, I was confused by the reply to my question in comment #3 and thought the issue was just documentation. Thanks for coming back to the BZ and pointing out the right status!

> There have been relatively rare cases of cache corruption
> (different issue).

OK, this sounds quite concerning and I agree that this is an issue. I would prefer to track it in a separate bugzilla entry if you don't mind.

>  The primary concern was the almost
> guaranteed unusable cache and consequential nearly silent
> sssd startup failure upon use of realmd to setup the initial
> sssd.conf.  

I must say I'm still a bit confused by the word 'corruption' here. Normally, when you say corruption, I would expect that the cache got damaged during a normal SSSD operation (say, while saving a group entry). Was the failure you were seeing caused by changing the configuration, or was sssd functioning normally and then stopped? Can you give us concrete examples? I'd like to get a clear picture on what the new bug should be tracking, or rather, whether there should be one or several.

> This is especially problematic if the sssd.conf
> file is non-interactively corrected from the realmd version
> via a Satellite config channel or other configuration
> management engine such as puppet.  Those tools would need to
> be taught to detect a cache issue and clean it if sssd
> failed to start after they applied their changes, a
> non-trivial task.
> 

I agree this is a problem. We actually talked about this last week with the realmd maintainer. The long-term fix would be to make the SSSD more resilient to misconfigurations. Short-term, we should scream loudly to syslog whenever the configuration can't be loaded so that the admin can fix the configuration himself.

> This could potentially naively be addressed outside of the
> sssd daemon itself with some tweaks to the sssd service
> file.  By adding these two lines to the [Unit] section:
>     OnFailure=sssd-cacheclean.service
>     Conflicts=sssd-cacheclean.service
> 
> and creating sssd-cacheclean.service as below, when sssd
> fails to successfully start, it tries cleaning the cache and
> then restarting sssd.
> 
> sssd-cacheclean.service
> --------------------------
> [Unit]
> Description=System Security Services Daemon Cache Cleaner
> After=syslog.target
> Conflicts=sssd.service
> OnFailure=sssd.service
> 
> [Service]
> EnvironmentFile=-/etc/sysconfig/sssd
> ExecStart=/usr/bin/find /var/lib/sss/db/ -maxdepth 1 -mindepth 1 -name
> "*cache*" -delete ; /bin/false
> Type=oneshot
> Restart=no
> --------------------------
> 
> There may be some reason why an automatic cache clean in
> this manner is undesirable, so I did not go farther than the
> trivial POC above.  Obviously if this was considered,
> something more appropriate than a call to find should be
> used (ideally something smart enough to determine it was an
> actual unusable cache issue), as well as a need to include
> options to prevent this from looping in the event of a
> failure not solved by cleaning the cache.
> 

This is a neat idea for environments that are always online. However, the sssd is often used on laptops where losing the cache means losing all the cached passwords and might lock the user out of his computer, so we should prefer a way that fixes the configuration or the database without resorting to wiping out the cache completely.

Comment 8 Howey Vernon 2014-02-17 15:17:51 UTC
(In reply to Dmitri Pal from comment #6)
> Our vision is that there will be an external way to
> clean up the cache via OpenLMI, and then the central server
> that pushes the change would be able to make a call
> whether this cleanup is needed depending upon the changes
> being made. I do not think SSSD has all the information to
> determine whether cleaning cache is a good idea or not.

Thanks for the clarification.  I am not coming up with a
scenario where it would be better for SSSD to fail to start
because of the cache being in such a state that SSSD cannot
use it, thereby requiring admin interaction rather than
detecting the condition and clearing itself, but I have
given this far less thought than the SSSD team.

In that vision, is there intent to include a way to ask SSSD
if the cache it has is usable with its current
configuration, or is the expectation that each configuration
utility and admin will need to maintain its own list of "bad
options" which require a cache clear, knowing that changing
any of those means needing to run a cache purge before
restarting the daemon?

There are of course certain configuration changes where it
would make sense to clear the cache without it being
strictly required; I am not talking about those situations.
As you described, those will probably be addressed by
whatever is making the change.
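
To make the "bad options" list concrete, here is a sketch of how a configuration tool might check whether ID mapping settings differ between two versions of sssd.conf. It assumes the ldap_id_mapping and ldap_idmap_* option names from sssd-ldap(5), ignores which [domain/...] section a setting belongs to, and uses hypothetical function names.

```shell
#!/bin/sh
# Hypothetical helper: print the ID-mapping related settings of an
# sssd.conf-style file in a normalized, sorted form.
idmap_settings() {
    # Keep only ldap_id_mapping and ldap_idmap_* lines, then strip the
    # whitespace around "=" so that cosmetic edits do not count as changes.
    grep -E '^[[:space:]]*ldap_(id_mapping|idmap_[a-z_]+)[[:space:]]*=' "$1" 2>/dev/null |
        sed -e 's/^[[:space:]]*//' \
            -e 's/[[:space:]]*=[[:space:]]*/=/' \
            -e 's/[[:space:]]*$//' |
        sort
}

# Succeeds (exit status 0) when the ID-mapping settings of the two files differ.
idmap_config_changed() {
    [ "$(idmap_settings "$1")" != "$(idmap_settings "$2")" ]
}
```

A configuration management run could call idmap_config_changed with the old and the new file and only purge the cache when it succeeds. The check covers just the ID mapping case from this report, not every option that might invalidate the cache.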


(In reply to Jakub Hrozek from comment #7)
<>
> > There have been relatively rare cases of cache
> > corruption (different issue).
> 
> OK, this sounds quite concerning and I agree that this is
> an issue. I would prefer to track it in a separate
> bugzilla entry if you don't mind.

Agreed, and that was my intent.  I am currently waiting on
locating a reliable reproducer and confirmation it is not
being triggered by hardware.  When and if that happens it
will be treated as a different issue.

> >  The primary concern was the almost guaranteed unusable
> >  cache and consequential nearly silent sssd startup
> >  failure upon use of realmd to setup the initial
> >  sssd.conf.  
> 
> I must say I'm still a bit confused by the word
> 'corruption' here. Normally, when you say corruption, I
> would expect that the cache got damaged during a normal
> SSSD operation (say, while saving a group entry). Was the
> failure you were seeing caused by changing the
> configuration, or was sssd functioning normally and
> then stopped? Can you give us concrete examples? I'd like
> to get a clear picture on what the new bug should be
> tracking, or rather, whether there should be one or
> several.

There was no "corruption" in the part you quoted, but I have
inadvertently added confusion so let me attempt to simplify.

The intent was to cover multiple reasons for which the SSSD
cache was in a state that the daemon could not launch.

    1) A cache that became corrupt and could not be loaded
    for whatever reason, up to and including hardware
    induced failure.  This would be inclusive of things like
    your example of cache damage during normal operation,
    caused by anything from a software bug to a solar flare.

    2) A cache that was unable to be used by SSSD because a
    configuration change was made and the previous cache was
    not compatible with the modification.

For simplicity, please scratch all references to "corrupt"
and replace them with "unusable".  How or why it became
unusable is irrelevant in this case, only that the cache is
in such a state that SSSD cannot start.

> > This is especially problematic if the sssd.conf file is
> > non-interactively corrected from the realmd version via
> > a Satellite config channel or other configuration
> > management engine such as puppet.  Those tools would
> > need to be taught to detect a cache issue and clean it
> > if sssd failed to start after they applied their
> > changes, a non-trivial task.
> > 
> 
> I agree this is a problem. We actually talked about this
> last week with the realmd maintainer. The long-term fix
> would be to make the SSSD more resilient to
> misconfigurations. Short-term, we should scream loudly to
> syslog whenever the configuration can't be loaded so that
> the admin can fix the configuration himself.

Logging the problem into journald would be a significant
improvement in the visibility of the issue.

<snip crappy systemd unconditional cache clean hack>
> This is a neat idea for environments that are always
> online. However, the sssd is often used on laptops where
> losing the cache means losing all the cached passwords and
> might lock the user out of his computer, so we should
> prefer a way that fixes the configuration or the database
> without resorting to wiping out the cache completely.

Understood.  It was never intended as a good solution, just
a crude example of what might be possible outside of SSSD
proper.

In your use case of a roaming laptop user, there are very
few good options.  If the cache is unusable (for whatever
reason) to the point SSSD cannot even start, not only is the
user locked out of their computer when away from the
authenticating domain, but giving the laptop access to the
authenticating domain is also insufficient; administrative
intervention is mandatory to clear the cache.

Automatically repairing the cache or converting it to a
usable form is of course preferable, but not an option in
all situations.  Right now there appears to be a binary
choice when the cache is unusable:

    0) Fail to launch SSSD and log to the admin.

    1) Move the unusable cache out of the way, launch SSSD,
    log to the admin

In (0), the user is completely locked out regardless of
if they are online with their authenticating domain or not.

In (1), the user is locked out only if they are not
online with their authenticating domain.

In the long run there may be a third option, attempt to make
the cache usable (repair, convert, etc.).  Obviously that
would be preferable in most situations, but at least some of
the time it will fall back to a choice between (0) and (1).

So, given the roaming laptop user with an unusable cache
such that SSSD cannot start, would it be preferable to be:

    a) completely locked out of the laptop, needing someone
    with administrative access so they can clear the cache
    allowing the daemon to start.

    b) able to solve the problem yourself by authenticating
    to the VPN from the GDM login screen (before logging in)
    OR bringing it to your office (and plugging into a
    network where the authentication domain is reachable).
    Never having to wait for an admin.

I think (b) would be preferable, but I may be missing
something.

That said, merely logging the issue would be an improvement.

Comment 9 Dmitri Pal 2014-02-17 19:22:08 UTC
There are some valid points that you bring up here. Some of the enhancements are even more drastic than what we have been thinking about. A fine-grained repair of the cache is something that we have not been thinking about. It might be possible, but there are so many other things that we should be doing before that. As a requirement, this so far seems distant and unrealistic.
However I will file separate upstream tickets based on your ideas. I suggest you follow up directly in SSSD trac.

Comment 11 Jakub Hrozek 2014-02-20 09:50:23 UTC
Upstream ticket:
https://fedorahosted.org/sssd/ticket/2252

Comment 12 Jakub Hrozek 2014-02-26 17:38:26 UTC
Fixed upstream:
    master: 3dfa09a826e5f63b4948462c2452937fc329834d
    sssd-1-11: 0aa145fd26584d129fb5a6974f58c232b87bb692

Comment 14 Kaushik Banerjee 2014-03-03 19:58:54 UTC
Verified in the manpage of sssd-ad with version 1.11.2-50.el7

Comment 15 Ludek Smid 2014-06-13 10:44:18 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

