
Bug 1361509

Summary: At least one pair of MPI processes are unable to reach each other for MPI communications over cxgb3/4
Product: Red Hat Enterprise Linux 7
Reporter: zguo <zguo>
Component: openmpi
Assignee: John Feeney <jfeeney>
Status: NEW
QA Contact: zguo <zguo>
Severity: high
Docs Contact:
Priority: high
Version: 7.3
CC: mstowell, rdma-dev-team
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description zguo 2016-07-29 08:50:49 UTC
Description of problem:
Open MPI jobs run over cxgb3/cxgb4 (iWARP) interfaces with the openib BTL abort during MPI_Init with "At least one pair of MPI processes are unable to reach each other for MPI communications", so the mpitests-IMB-MPI1 PingPong run never completes.

Version-Release number of selected component (if applicable):
compat-openmpi16, openmpi-1.10.3.x86_64
Distro: RHEL-7.3-20160719.1
Kernel: 3.10.0-470.el7.x86_64
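
The versions above can be double-checked on the test hosts with standard commands (a quick sketch; package names are the ones listed in this report):

  rpm -q openmpi compat-openmpi16   # expect openmpi-1.10.3.x86_64
  uname -r                          # expect 3.10.0-470.el7.x86_64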

How reproducible:
Always

Steps to Reproduce:
1. cxgb4: https://beaker.engineering.redhat.com/jobs/1420524
   test log: https://beaker.engineering.redhat.com/recipes/2909090/tasks/43647650/results/215608198/logs/test_log--kernel-infiniband-mpi-openmpi-non-root-client.log
2. cxgb3: https://beaker.engineering.redhat.com/jobs/1418974
   test log: https://beaker.engineering.redhat.com/recipes/2905457/tasks/43608036/results/215534800/logs/test_log--kernel-infiniband-mpi-openmpi-non-root-client.log
   (The mpirun invocation from these logs is sketched below.)
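
For reference, the invocation that fails in both test logs can be rerun by hand on the hosts. This is a sketch copied from the test log, not a verified standalone reproducer; the hostfile path and process count are the ones the Beaker task used:

  timeout 15m mpirun --allow-run-as-root --map-by node \
      -mca mtl '^psm2,psm,ofi' -mca btl openib,self \
      -hostfile /home/test/hfile_all_cores_iw -np 16 \
      mpitests-IMB-MPI1 PingPong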


Actual results:
 All processes entering MPI_Finalize

+ report IMBEXT.Accumulate IMB
+ '[' 0 -ne 0 ']'
+ echo IMBEXT.Accumulate
+ for hfile in '${HFILES}'
+ '[' -f /home/test/hfile_all_cores_iw ']'
++ cat imb_mpi.txt
+ for subrt in '$(cat imb_mpi.txt)'
++ cat /home/test/hfile_all_cores_iw
++ wc -l
+ timeout 15m mpirun --allow-run-as-root --map-by node -mca mtl '^psm2,psm,ofi' -mca btl openib,self -hostfile /home/test/hfile_all_cores_iw -np 16 mpitests-IMB-MPI1 PingPong
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[28335,1],3]) is on host: rdma-dev-09
  Process 2 ([[28335,1],1]) is on host: rdma-dev-09
  BTLs attempted: openib self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[rdma-dev-09:23606] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Expected results:
No such error; the MPI job should complete successfully.

Additional info:

Submitted a test job on mlx5 to see whether this issue is specific to cxgb3/4: https://beaker.engineering.redhat.com/jobs/1421692 (running).

Submitted a test job on RHEL-7.2 to see whether this is a regression: https://beaker.engineering.redhat.com/jobs/1421693 (queued).
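
Possible next diagnostic steps on the cxgb3/cxgb4 nodes (suggestions only, not yet run): confirm the iWARP ports are up as seen by libibverbs, and rerun a small case with the openib BTL verbosity raised so Open MPI logs why it excludes the interface:

  ibv_devinfo        # cxgb3/cxgb4 ports should report state PORT_ACTIVE
  timeout 15m mpirun --allow-run-as-root -mca btl openib,self -mca btl_base_verbose 100 \
      -hostfile /home/test/hfile_all_cores_iw -np 2 mpitests-IMB-MPI1 PingPong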