Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1693751 - rpm.labelCompare expects strings, but RPM headers are bytes
Summary: rpm.labelCompare expects strings, but RPM headers are bytes
Keywords:
Status: MODIFIED
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Panu Matilainen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On: 1693759 1693760 1693762 1693766 1693768 1693771 1693773 1693774 1693787 1693788 1693767
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-03-28 14:57 UTC by Panu Matilainen
Modified: 2019-04-15 11:06 UTC (History)
8 users (show)

Fixed In Version: rpm-4.14.2.1-6.fc31
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments (Terms of Use)

Description Panu Matilainen 2019-03-28 14:57:45 UTC
Description of problem:
With Python 3, the contents of RPM headers are returned as bytes and thus cannot be fed back to functions like rpm.labelCompare which expects strings.

Version-Release number of selected component (if applicable):
All released versions of rpm, including python3-rpm-4.14.2

How reproducible:
always

Steps to Reproduce:
import rpm
required_version = ('0', '4.8.0', '1')
transaction_set = rpm.TransactionSet()
db_result = transaction_set.dbMatch('name', 'rpm')
package = list(db_result)[0]
print(rpm.labelCompare((package['epoch'], package['version'], package['release']), required_version))

Actual results:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: argument 1, item 1 must be str or None, not bytes

Expected results:
the result of labelCompare() is printed

Additional info:
rpm.labelCompare() is just the tip of the iceberg, this type mismatch is present all over the rpm-python API and makes the bindings unusable and fundamentally incompatible with all the rpm-related scripts ever written.

This has been fixed upstream by making the python3 bindings return ALL string data as surrogate-escaped utf-8 string objects to make things "just work" for the normal case while allowing non-utf8 data to still be handled. For further rationale see the upstream commit message:

https://github.com/rpm-software-management/rpm/commit/84920f898315d09a57a3f1067433eaeb7de5e830 

This is the Fedora side counterpart of bug 1631292, to be used as a tracking bug for known incompatibilities and other possible blockers for making this change.

Comment 1 Panu Matilainen 2019-04-10 09:29:31 UTC
This change is now active in rawhide as of rpm-4.14.2.1-6.fc31.

As a temporary crutch to allow existing users to migrate to the new behavior, the strings returned by rpm will have a fake .decode() method attached that will issue the following warning pointing to this bug if used:

    UnicodeWarning: decode() called on unicode string, see https://bugzilla.redhat.com/show_bug.cgi?id=1693751

The evil crutch will be removed once critical users have been adapted to the new behavior. Thank you for your attention.

Comment 2 Jan Pazdziora 2019-04-15 10:09:45 UTC
Thanks for the warning message. It does not however seem to catch

  p_nevra = b"%s-%s-%s.%s" % (p["name"], p["version"], p["release"], p["arch"])

which failes with

  TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

Should the compatibility layer implement __bytes__ as well?

Comment 3 Jan Pazdziora 2019-04-15 10:10:59 UTC
I also wonder how decoding to str already in the python3-rpm will affect working with ancient rpms that may have ISO-8859-1 values, or in general random content there.

Comment 4 Panu Matilainen 2019-04-15 11:06:01 UTC
The returned string is surrogate-escaped so it can handle fairly arbitrary data, and in the unlikely event that you actually know the original encoding you *can* get back to that:

>>> import rpm
>>> str = u'älämölö'
>>> b = str.encode('iso-8859-1')
>>> b
b'\xe4l\xe4m\xf6l\xf6'
>>> h = rpm.hdr()
>>> h['group'] = b
>>> d = h['group']
>>> d
'\udce4l\udce4m\udcf6l\udcf6'
>>> t = bytes(d, 'utf-8', 'surrogateescape')
>>> t
b'\xe4l\xe4m\xf6l\xf6'
>>> t.decode('iso-8859-1')
'älämölö'

But in reality, unless the header has an explicit 'encoding' tag there's no way to know what that encoding is, and the only value that the encoding tag can be is 'utf-8' in packages that have been validated to consist only of utf-8 strings. In practise all API users I've looked into have just had hardcoded .decode('utf-8') calls in place anyway so they'd fail anyway.

As for __bytes__, we'd prefer keeping the evil compat trickery to absolute bare minimum, both in content- and time-wise. But lets see how if goes, if that's actually a common pattern then maybe it makes sense to support it too.


Note You need to log in before you can comment on or make changes to this bug.