Bug 1693816 - An engine restore from a backup can fail to start due to missing custom certificates (or similar) [NEEDINFO]
Summary: An engine restore from a backup can fail to start due to missing custom certi...
Keywords:
Status: POST
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backup-Restore.Engine
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
medium
high vote
Target Milestone: ovirt-4.3.4
: ---
Assignee: Yedidyah Bar David
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-03-28 17:06 UTC by Simone Tiraboschi
Modified: 2019-04-16 10:29 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Integration
steve.goodman: needinfo? (didi)
steve.goodman: needinfo? (sbonazzo)
sbonazzo: ovirt-4.3?
sbonazzo: planning_ack?
sbonazzo: devel_ack+
lleistne: testing_ack+


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 99066 master POST packaging: engine-backup: Handle httpd SSO 2019-04-01 08:06:57 UTC
oVirt gerrit 99068 master POST packaging: engine-backup: Exclude non-configuration files 2019-04-01 08:44:20 UTC
oVirt gerrit 99069 master POST packaging: engine-backup: Handle all of /etc/httpd 2019-04-01 08:44:25 UTC

Description Simone Tiraboschi 2019-03-28 17:06:28 UTC
Description of problem:
see https://bugzilla.redhat.com/show_bug.cgi?id=1660595#c16
According to the reports, there are cases where, after a restore, the engine does not accept connections correctly because of a missing custom certificate or a broken Kerberos configuration with an external identity provider.


Version-Release number of selected component (if applicable):
4.3

How reproducible:
?

Steps to Reproduce:
1. ?
2.
3.

Actual results:
A restored engine starts but doesn't accept user connections.

Expected results:
- Back up and restore whatever is needed (including custom certificates or Kerberos configuration) to "always" be able to start the engine correctly after the restore (not really sure how we can be exhaustive here).
- Provide a kind of "safe mode" for the engine, where we are confident that at least the core set of functionality can always come up during disaster recovery, and then let the user add/restore what's missing.

Additional info:

Comment 1 Yedidyah Bar David 2019-03-31 06:01:25 UTC
(In reply to Simone Tiraboschi from comment #0)
> Description of problem:
> see https://bugzilla.redhat.com/show_bug.cgi?id=1660595#c16
> according to the reports we have cases where, after the restore, the engine
> doesn't correctly accept connections due to a missing custom certificate or
> to a broken kerberos configuration with an external identity provider.
> 
> 
> Version-Release number of selected component (if applicable):
> 4.3
> 
> How reproducible:
> ?

Probably always

> 
> Steps to Reproduce:
> 1. ?

I guess at least one concrete example provided by the referenced bug 1660595
is to follow the documentation:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/administration_guide/configuring_ldap_and_kerberos_for_single_sign-on

> 2.
> 3.
> 
> Actual results:
> a restored engine starts but doesn't accept user connections.
> 
> Expected results:
> - backup and restore what's (including custom certificates or kerberos conf)
> needed to "always" (not really sure on how we can be exhaustive here) be
> able to correctly start the engine after the restore

So far, we have only backed up files written by our *code*, not files that the
documentation tells users to create.

Another option is to change the documentation to adhere to the way the code
works. Meaning - to tell users to write custom conf/stuff in engine-specific
locations, not system-wide ones. Need to check the specific cases, though.

> - provide a kind of "safe mode" for the engine where we are confident that
> at least the core set of functionalities can always go up on disaster
> recovery and then let the user add/restore what's missing
> 

Might make sense. We need a separate bug for this, and there, too, we need
to enumerate the concrete failure modes that the safe mode should prevent,
and/or make it a tracker bug.

Comment 2 Yedidyah Bar David 2019-03-31 06:18:48 UTC
How about this:

Back up all of /etc/pki and /etc/httpd (perhaps others as well).

On restore, skip all files that are packaged by some installed rpm.

Perhaps make the skipping optional, so that a user who changed a packaged
file still has a way to say "Please overwrite packaged files with the
restored ones".

This would also have prevented cases like bug 1452182.
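
A minimal sketch of the "skip packaged files on restore" idea, not the engine-backup implementation. It assumes `rpm -qf FILE` exits non-zero when no installed package owns the file. The function names and the OVERWRITE_PACKAGED switch are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Sketch only: not the engine-backup implementation.

# Return 0 if some installed rpm owns the file, non-zero otherwise.
is_packaged() {
    rpm -qf "$1" >/dev/null 2>&1
}

# Read candidate paths on stdin, one per line, and print the ones
# that should be restored. Packaged files are skipped unless the
# hypothetical OVERWRITE_PACKAGED switch is set to 1.
select_files_to_restore() {
    while IFS= read -r path; do
        if [ "${OVERWRITE_PACKAGED:-0}" = "1" ] || ! is_packaged "$path"; then
            printf '%s\n' "$path"
        fi
    done
}
```

For example, `printf '%s\n' /etc/httpd/conf/httpd.conf /etc/httpd/http.keytab | select_files_to_restore` would print only the paths that no installed package owns.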

Comment 3 Yedidyah Bar David 2019-04-01 10:18:18 UTC
It's hard to decide what to do :-)

99066 just adds two files to be backed up/restored. Simple and easy, but covers only the specific case of single-sign-on-following-RHV-docs.

99068+99069 are more general, but somewhat more risky. They also might not be enough in the general case. I talked with Simone in private, and he said "Why not all of /etc?", then checked the changed files there on a clean machine, and we agreed we do not want to start thinking about each and every one of them. People who want a full-machine-image backup (and restore) should use other means, not engine-backup.

I don't mind adding 99066, because I think it's very low risk, but generally speaking, engine-backup should handle what engine-setup and ovirt-engine-rename change. Nothing more, nothing less.

Regarding André's point, which is basically the last statement above _plus_ "... and what the official documentation instructs the user to change/add": well, I'm not sure about that in general. As a user, when I read documentation and it's obvious to me that what I'm reading is mainly an example, and not the only way to do something, I might not follow it strictly. Simplest example: there is nothing magical about the name "/etc/httpd/http.keytab". You can use any file name you want, as long as you set GssapiCredStore to point at it. No other component strictly expects to find or use this file. But if I merge 99066, we can no longer treat it as a mere example; it becomes part of the code. If a user reads the docs and says "well, I think I prefer to call it /etc/my-local-stuff/ovirt-engine.keytab", engine-backup will not handle it. So how should we treat this? Not sure. I currently tend to think that the best solution is to update the documentation, adding there:

mkdir -p /etc/ovirt-engine-backup/engine-backup-defaults.d

cat << __EOF__ >> /etc/ovirt-engine-backup/engine-backup-defaults.d/sso.sh
BACKUP_PATHS="\${BACKUP_PATHS}
/etc/httpd/http.keytab
/etc/httpd/conf.d/ovirt-sso.conf"
__EOF__

( This will work, since 3.6: https://gerrit.ovirt.org/#/q/Ice63783,n,z )
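
A small sketch of why the `\$` escape in the heredoc matters, assuming engine-backup reads the drop-in as a shell fragment. It writes the same content to a temporary directory instead of the real defaults directory, then sources it. The initial BACKUP_PATHS value is a placeholder, not engine-backup's actual built-in list.

```shell
#!/bin/sh
# Sketch only: uses a temporary directory instead of the real
# /etc/ovirt-engine-backup/engine-backup-defaults.d.
demo_dir=$(mktemp -d)

# The escaped \$ keeps ${BACKUP_PATHS} literal in the written file,
# so it is expanded later, when the file is sourced.
cat << __EOF__ >> "${demo_dir}/sso.sh"
BACKUP_PATHS="\${BACKUP_PATHS}
/etc/httpd/http.keytab
/etc/httpd/conf.d/ovirt-sso.conf"
__EOF__

# Placeholder standing in for engine-backup's built-in list.
BACKUP_PATHS="/etc/ovirt-engine"

# Sourcing the drop-in appends to the list rather than replacing it.
. "${demo_dir}/sso.sh"
printf '%s\n' "${BACKUP_PATHS}"

rm -rf "${demo_dir}"
```

The printed list contains the placeholder entry followed by the two SSO files, one per line. This only shows how the list of paths is extended; it says nothing about what happens at restore time.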

Comment 4 Yedidyah Bar David 2019-04-01 10:28:27 UTC
Sandro, what do you think?

The main drawback of the approach at the end of comment 3 is for people who already configured SSO and backup but never tested restore.
They will benefit from 99066 but not from a doc update.

Comment 5 Yedidyah Bar David 2019-04-02 06:43:56 UTC
Sorry, that was too quick. The end of comment 3 is enough to _back up_ these files, but not enough to _restore_ them. For that, we'll need either a patch to engine-backup or something considerably longer than what I wrote there, which is also quite ugly, so I haven't tested it yet: something like adding a file to /etc/ovirt-engine-backup/engine-backup-config.d that redefines dump_config_for_restore and adds BACKUP_PATHS to VARS_TO_SAVE there. Otherwise, you'll have to apply the end of comment 3 on the restored machine as well, before running restore, which would be no better than adding a custom playbook for the hosted-engine restore case.

Comment 6 Sandro Bonazzola 2019-04-03 07:24:12 UTC
Let's ensure the documentation tells users to use the exact file names given in the docs, so that backup and restore work out of the box with patch 99066.
Let's also ship those files, with comments, in the rpm, to make it clear they're part of the product.

Comment 7 Steve Goodman 2019-04-16 10:26:35 UTC
I'm looking at the page referenced in comment 1 and it's not clear to me precisely what you want the documentation to say.

Could you please indicate (just paste into this bug) the specific location and text you're proposing?

Comment 8 Steve Goodman 2019-04-16 10:29:19 UTC
Also, if the doc change you are suggesting is something in addition to the scope of this particular bug, would you please create a new bug so we can track it better?

