Bug 1057420 - REST APIs generate illegal XML when files contain invalid characters like 0x1b, 0x08
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Zanata
Classification: Retired
Component: Component-Maven, Component-Logic, Component-PythonClient, Component-zanata-client
Version: 3.1
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Patrick Huang
QA Contact: Zanata-QA Mailing List
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-01-24 04:42 UTC by Patrick Huang
Modified: 2015-07-31 01:47 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-31 01:47:52 UTC


Attachments
Test file (deleted), 2014-01-24 04:43 UTC, Patrick Huang

Description Patrick Huang 2014-01-24 04:42:36 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31
Build Identifier: 

When the Zanata Maven client is used to push or pull, RESTEasy marshalling/unmarshalling fails if any text flow contains the Unicode character 0x1b (ESCAPE). Uploading the same file through the server UI does not trigger the problem.

Reproducible: Always

Steps to Reproduce:
1. Create a gettext project/version.
2. Run mvn zanata:push on a source file that contains the character 0x1b.


Actual Results:  
org.jboss.resteasy.plugins.providers.jaxb.JAXBUnmarshalException: javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 325; columnNumber: 7; An invalid XML character (Unicode: 0x1b) was found in the element content of the document.]


Expected Results:  
The push succeeds.

The server's RESTEasy version is different from the client's.
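
For anyone who wants to see the same class of failure outside Zanata, here is a minimal sketch in Python. It uses the standard library's xml.etree.ElementTree rather than JAXB/RESTEasy, so it only illustrates the general mechanism (a serializer that writes the control character unescaped, and a parser that then rejects it); it is not the Zanata code path. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def round_trip(text):
    """Serialize text into an XML element, then parse it back."""
    elem = ET.Element("content")
    elem.text = text
    # ElementTree escapes &, < and > but writes control characters as-is,
    # so the serialized bytes may not be well-formed XML.
    xml_bytes = ET.tostring(elem)
    return ET.fromstring(xml_bytes).text


def fails_round_trip(text):
    """Return True if the serialized XML cannot be parsed back."""
    try:
        round_trip(text)
    except ET.ParseError:
        return True
    return False
```

With this sketch, fails_round_trip("\x1b[31m") is True while fails_round_trip("hello") is False: the serializing side raises no error, and the problem only surfaces on the parsing side, much like the unmarshalling exception above.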

Comment 1 Patrick Huang 2014-01-24 04:43:40 UTC
Created attachment 854745
Test file

This is a cut-down version of a production file.

Comment 2 Sean Flanigan 2014-01-28 01:48:29 UTC
Patrick, regarding the ESCAPE char handling, see https://java.net/jira/browse/JAXB-614 .  Note that the workaround in the first comment is no good to us, because it irreversibly converts all illegal chars into the same char.

It seems that there is no good way of representing control characters in plain XML, even with CDATA (apparently).  We either need an alternative/extended XML schema for our REST service which performs (eg) base64 encoding/decoding for control chars, or we could just push POT/PO files directly to Zanata for processing on the server, thus bypassing the XML stage entirely.  

In the meantime, we should make sure we detect these control characters before JAXB goes and generates illegal XML.
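
A minimal sketch of such a check, in Python. The character ranges come from the XML 1.0 "Char" production; the function name is hypothetical, and the real check would need to live in the Java client before marshalling.

```python
import re

# Characters NOT permitted by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_illegal_xml_chars(text):
    """Return a list of (offset, code point) for characters XML 1.0 cannot carry."""
    return [(m.start(), ord(m.group())) for m in _ILLEGAL_XML_CHARS.finditer(text)]
```

For example, find_illegal_xml_chars("ok\x1b[0m") returns [(2, 27)], which is enough to report the offending text flow and position instead of sending illegal XML.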

As a workaround, I would recommend separating the non-translatable text[1] from the translatable text (eg names of colours).  Escape characters and ANSI sequences are very likely to be difficult for translators to deal with anyway, because the editor may not be able to show the escape character very well.

[1] including ANSI codes, any other control codes and command-line keywords like "status-line" and "save-confirmation"

Comment 3 Patrick Huang 2014-04-27 23:11:40 UTC
If we use JSON instead, will it help?

Comment 4 Sean Flanigan 2014-04-28 01:31:31 UTC
Good idea.  Yes, it's worth a try.

JSON can probably escape any problematic characters.  There may be portability issues with some characters, but we should be able to choose implementations which are compatible:

http://stackoverflow.com/a/8676021/14379
https://en.wikipedia.org/wiki/JSON#Data_portability_issues
http://www.bennadel.com/blog/2576-testing-which-ascii-characters-break-json-javascript-object-notation-parsing.htm
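
A quick sketch of the round trip, using Python's standard json module as a stand-in (the Java JSON provider used with RESTEasy may behave differently, so treat this as an illustration only; the "content" key and function name are made up):

```python
import json


def json_round_trip(text):
    """Encode text inside a JSON object, then decode it again."""
    encoded = json.dumps({"content": text})
    decoded = json.loads(encoded)["content"]
    return encoded, decoded
```

With this encoder the ESCAPE character is written as the six-character escape \u001b and backspace as \b, so the wire format contains no raw control characters and the decoded string equals the original.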

Comment 5 Chester Cheng 2014-09-18 01:56:50 UTC
I got a similar error, because of a hidden character in the translation.

==========
$ mvn org.zanata:zanata-maven-plugin:pull -Dzanata.encodeTabs=false
(...)
[ERROR] Operation failed: javax.xml.bind.UnmarshalException
 - with linked exception:
[org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]

    To retry from the last document, please set the following option(s):

        -Dzanata.fromDoc="Memory"

.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 19.387 s
[INFO] Finished at: 2014-09-18T11:44:11+10:00
[INFO] Final Memory: 19M/170M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.zanata:zanata-maven-plugin:3.3.2:pull (default-cli) on project standalone-pom: Zanata mojo exception: javax.xml.bind.UnmarshalException
[ERROR] - with linked exception:
[ERROR] [org.xml.sax.SAXParseException; lineNumber: 2; columnNumber: 35273; An invalid XML character (Unicode: 0x8) was found in the element content of the document.]
[ERROR] -> [Help 1]
[ERROR]

Comment 6 Sean Flanigan 2014-09-18 02:00:06 UTC
Just for reference, the workaround was to download the affected document from the web interface (fortunately, it was a PO file, so it could be downloaded that way) and search for the offending character:

    grep --color='auto' -P -n '\x08' *.po
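
Where grep -P is not available, a small Python script can do a similar search. This is only a sketch: the function name is hypothetical, it assumes the PO files are UTF-8, and it looks for any C0 control character other than tab, newline and carriage return instead of just 0x08.

```python
import re
from pathlib import Path

# C0 control characters except tab (0x09), newline (0x0A) and carriage return (0x0D).
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def scan_po_files(directory):
    """Return (file name, line, column, code point) for each control character found."""
    hits = []
    for path in sorted(Path(directory).glob("*.po")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # split("\n") rather than splitlines(), which would also break
        # lines on some of the control characters being searched for.
        for lineno, line in enumerate(text.split("\n"), start=1):
            for match in _CONTROL.finditer(line):
                hits.append((path.name, lineno, match.start() + 1, ord(match.group())))
    return hits
```

Calling scan_po_files(".") in the directory holding the downloaded PO files lists each hit with 1-based line and column numbers.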

Comment 7 Zanata Migrator 2015-07-31 01:47:52 UTC
Migrated; check JIRA for bug status: http://zanata.atlassian.net/browse/ZNTA-543

