Bug 1361509 - At least one pair of MPI processes are unable to reach each other for MPI communications over cxgb3/4
Summary: At least one pair of MPI processes are unable to reach each other for MPI communications over cxgb3/4
Keywords:
Status: NEW
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: openmpi
Version: 7.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: John Feeney
QA Contact: zguo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-29 08:50 UTC by zguo
Modified: 2018-07-01 22:40 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments

Description zguo 2016-07-29 08:50:49 UTC
Description of problem:


Version-Release number of selected component (if applicable):
compat-openmpi16 openmpi-1.10.3.x86_64
RHEL-7.3-20160719.1
3.10.0-470.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. cxgb4: https://beaker.engineering.redhat.com/jobs/1420524
   test log: https://beaker.engineering.redhat.com/recipes/2909090/tasks/43647650/results/215608198/logs/test_log--kernel-infiniband-mpi-openmpi-non-root-client.log
2. cxgb3: https://beaker.engineering.redhat.com/jobs/1418974
   test log: https://beaker.engineering.redhat.com/recipes/2905457/tasks/43608036/results/215534800/logs/test_log--kernel-infiniband-mpi-openmpi-non-root-client.log
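
For reference, the failing invocation from the test logs above, stripped of the Beaker harness (the flags, hostfile path, and process count come from the log; the hostfile itself is site-specific and would need to be recreated):

# Hostfile lists the Chelsio (iWARP) nodes used by the test (site-specific).
timeout 15m mpirun --allow-run-as-root --map-by node \
    -mca mtl '^psm2,psm,ofi' \
    -mca btl openib,self \
    -hostfile /home/test/hfile_all_cores_iw \
    -np 16 mpitests-IMB-MPI1 PingPong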


Actual results:
 All processes entering MPI_Finalize

+ report IMBEXT.Accumulate IMB
+ '[' 0 -ne 0 ']'
+ echo IMBEXT.Accumulate
+ for hfile in '${HFILES}'
+ '[' -f /home/test/hfile_all_cores_iw ']'
++ cat imb_mpi.txt
+ for subrt in '$(cat imb_mpi.txt)'
++ cat /home/test/hfile_all_cores_iw
++ wc -l
+ timeout 15m mpirun --allow-run-as-root --map-by node -mca mtl '^psm2,psm,ofi' -mca btl openib,self -hostfile /home/test/hfile_all_cores_iw -np 16 mpitests-IMB-MPI1 PingPong
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[28335,1],3]) is on host: rdma-dev-09
  Process 2 ([[28335,1],1]) is on host: rdma-dev-09
  BTLs attempted: openib self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[rdma-dev-09:23606] Local abort before MPI_INIT completed successfully; not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

Expected results:
The error does not occur and the IMB PingPong run completes.

Additional info:

Submitted a test job on mlx5 to check whether the issue is specific to cxgb3/4: https://beaker.engineering.redhat.com/jobs/1421692 (running).

Submitted a test job on RHEL-7.2 to check whether this is a regression: https://beaker.engineering.redhat.com/jobs/1421693 (queued).
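
One hypothesis, not yet verified: cxgb3/4 are iWARP adapters, and the openib BTL's default connection manager in Open MPI 1.10 (udcm) only supports InfiniBand, while iWARP requires the rdmacm CPC. Since the run restricts the BTLs to openib,self with no shared-memory BTL, a failed openib connection setup would leave even same-host ranks, like the two reported on rdma-dev-09, with no way to reach each other. A sketch of two checks, assuming the same hostfile as above:

# Hypothesis (unverified): force the rdmacm connection manager, which iWARP needs.
timeout 15m mpirun --allow-run-as-root --map-by node \
    -mca mtl '^psm2,psm,ofi' \
    -mca btl openib,self \
    -mca btl_openib_cpc_include rdmacm \
    -hostfile /home/test/hfile_all_cores_iw \
    -np 16 mpitests-IMB-MPI1 PingPong

# Control run: also allow the shared-memory BTL, to confirm the failure is
# specific to the openib path rather than to MPI startup in general.
timeout 15m mpirun --allow-run-as-root --map-by node \
    -mca btl openib,sm,self \
    -hostfile /home/test/hfile_all_cores_iw \
    -np 16 mpitests-IMB-MPI1 PingPong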

