Bug 1691487 - openQA transient test failure as duplicated first character just after a snapshot
Summary: openQA transient test failure as duplicated first character just after a snapshot
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: openqa
Version: 30
Hardware: powerpc
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Adam Williamson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-21 17:49 UTC by Michel Normand
Modified: 2019-04-16 15:09 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:



Description Michel Normand 2019-03-21 17:49:49 UTC
openQA test failure caused by a duplicated first character: 'hhostnamectl' was typed instead of the expected 'hostnamectl'

for server_database_client test:
https://openqa.stg.fedoraproject.org/tests/500172#step/_post_network_static/1

This seems to be an openQA problem that only appears when the worker host is overloaded (when a new set of tests is triggered for a new compose), because my local trials show that a manual retry of an individual test is sufficient for it to run successfully.

I locally hit this same problem with server_cockpit_default;
my local bypass was to add
script_run 'echo "bypass duplicated first character"';
at the top of the file.


=== extract of autoinst.log for the above referenced URL
...
[2019-03-21T02:26:26.891 UTC] [debug] Migrating total bytes:     	4312072192
[2019-03-21T02:26:26.891 UTC] [debug] Migrating remaining bytes:   	0
[2019-03-21T02:26:26.892 UTC] [debug] Snapshot complete
[2019-03-21T02:26:26.913 UTC] [debug] EVENT {"event":"RESUME","timestamp":{"microseconds":913591,"seconds":1553135186}}
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/hd0-overlay1', fmt=qcow2 size=10737418240 backing_file=/var/lib/openqa/pool/5/raid/hd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.921 UTC] [debug] QEMU: Formatting '/var/lib/openqa/pool/5/raid/cd0-overlay1', fmt=qcow2 size=3130112000 backing_file=/var/lib/openqa/pool/5/raid/cd0-overlay0 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2019-03-21T02:26:26.930 UTC] [debug] ||| starting _post_network_static tests/_post_network_static.pm
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::assert_script_run(cmd='hostnamectl set-hostname dbclient.domain.local', wait=90, fail_message=undef)
[2019-03-21T02:26:26.931 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.931 UTC] [debug] <<< testapi::type_string(string='hostnamectl set-hostname dbclient.domain.local', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:26.932 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:26.932 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.067 UTC] [debug] waiting for screen change: 0 32.9796599074581
[2019-03-21T02:26:28.069 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::type_string(string=' ; echo V5qMN-$?- > /dev/hvc0
', max_interval=250, wait_screen_changes=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2019-03-21T02:26:28.072 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.072 UTC] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
[2019-03-21T02:26:28.822 UTC] [debug] waiting for screen change: 0 35.9559735377936
[2019-03-21T02:26:28.822 UTC] [debug] >>> testapi::wait_screen_change: screen change seen at 0
[2019-03-21T02:26:28.822 UTC] [debug] /var/lib/openqa/share/tests/fedora/tests/_post_network_static.pm:12 called tapnet::setup_tap_static
[2019-03-21T02:26:28.822 UTC] [debug] <<< testapi::wait_serial(regexp=qr/V5qMN-\d+-/, timeout=90)
[2019-03-21T02:26:30.974 UTC] [debug] >>> testapi::wait_serial: (?^:V5qMN-\d+-): ok
[2019-03-21T02:26:31.096 UTC] [debug] command 'hostnamectl set-hostname dbclient.domain.local' failed at /var/lib/openqa/share/tests/fedora/lib/tapnet.pm line 39.
===
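
For context, a simplified, hypothetical sketch (not the actual os-autoinst implementation) of the mechanism visible in the log above: the command is typed, then a second type_string appends an echo of a marker plus the shell exit status to the serial console, and wait_serial reads it back. With the first keystroke duplicated, 'hhostnamectl ...' is not a valid command, so the shell reports a non-zero status and the check dies, which is the failure reported at tapnet.pm line 39.

# Hypothetical illustration only; the helper name and error handling are assumptions.
use testapi;

sub typed_command_with_serial_check {
    my ($cmd, $timeout) = @_;
    $timeout //= 90;
    my $marker = 'V5qMN';    # marker string taken from the log above

    # Type the command, then append an echo of the marker and exit status,
    # redirected to the serial console (/dev/hvc0 on this ppc64 guest).
    type_string($cmd);
    type_string(" ; echo $marker-\$?- > /dev/hvc0\n");

    # Expect e.g. "V5qMN-0-" on success, "V5qMN-127-" for "command not found".
    my $out = wait_serial(qr/$marker-\d+-/, $timeout);
    die "command '$cmd' failed" unless defined $out && $out =~ /\Q$marker\E-0-/;
}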

Comment 1 Michel Normand 2019-04-16 07:32:55 UTC
I have an openQA bypass via https://pagure.io/fedora-qa/os-autoinst-distri-fedora/pull-request/105, merged 2019-04-16.
Waiting for investigation of this bug.

Comment 2 Adam Williamson 2019-04-16 15:09:39 UTC
I mean, I dunno what I can really 'investigate', unfortunately; if it's happening because the host gets overloaded with work, all we could really do is cut the number of workers :/

/proc/cpuinfo shows the system as having 10 CPUs and being a "PowerNV 8348-21C". We're currently running 8 workers on it. 8 workers on 10 CPUs doesn't seem like it should be a problem.
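
As a rough sanity check, a throwaway sketch (hypothetical, not part of openQA) comparing the CPU count from /proc/cpuinfo against the configured worker count; the worker count is hard-coded here from the figure quoted above:

# Hypothetical one-off script; assumes a Linux-style /proc/cpuinfo layout.
use strict;
use warnings;

open my $fh, '<', '/proc/cpuinfo' or die "cannot read /proc/cpuinfo: $!";
my $cpus = grep { /^processor\s*:/ } <$fh>;    # count "processor" entries
close $fh;

my $workers = 8;    # value quoted in the comment above
printf "CPUs: %d, workers: %d (%.2f workers per CPU)\n",
    $cpus, $workers, $workers / $cpus;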

For storage it seems to have a pair of Seagate ST1000NM0033s being used as one big LVM VG. That's a bit weedy, I guess. The 'small' x86_64 worker hosts (which run 10 workers each) have RAID-6 arrays of 8 disks each (can't tell what type offhand, the RAID controller obfuscates it); the 'big' x86_64 worker hosts (which run 40 workers each) have RAID-6 arrays of 10 disks each (again can't tell what model, but I think they're something beefy; they may even be SSDs).

So... my working theory is that we're running into storage contention here. That seems plausible, especially since the way we schedule tests means many of them may reach the point at which they restore a snapshot at basically the same time.

Perhaps we can get the storage on the worker host beefed up? Or get a beefier worker host? :)

