
Problem with multi GPU

  • #8809
    thanhphatvt
    Participant

    Dear Open LB team,
I used the gpu_openmpi config to run a case on multiple GPUs, but I don't think it works on my workstation. I use 2 K80 GPU cards.
Here are the CUDA and MPI versions I use:
root@acmt:/home/kc-lin/olb-1.6r0/examples/laminar/cylinder3d# nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2021 NVIDIA Corporation
    Built on Wed_Jun__2_19:15:15_PDT_2021
    Cuda compilation tools, release 11.4, V11.4.48
    Build cuda_11.4.r11.4/compiler.30033411_0
root@acmt:/home/kc-lin/olb-1.6r0/examples/laminar/cylinder3d# mpirun --version
    mpirun (Open MPI) 4.1.6

    Report bugs to http://www.open-mpi.org/community/help/
root@acmt:/home/kc-lin/olb-1.6r0/examples/laminar/cylinder3d# ompi_info --parsable -l 9 --all | grep mpi_built_with_cuda_support:value
    mca:mpi:base:param:mpi_built_with_cuda_support:value:true

After running make I got:
nvcc cylinder3d.o -o cylinder3d -lolbcore -L../../../external/lib -lpthread -lz -ltinyxml -lcuda -lcudadevrt -lcudart -L../../../build/lib

When I then use the command mpirun -np 2 ./cylinder3d to try to run with 2 GPUs, I find that only 1 GPU works; it seems to run both jobs at the same time:
| 0 N/A N/A 7808 C ./cylinder3d 376MiB |
| 0 N/A N/A 7809 C ./cylinder3d 376MiB |

Could you help me deal with this problem?
    Thank you so much!

    #8810
    Adrian
    Keymaster

The problem is that, by default, each process uses the first visible GPU. You can restrict this and assign each process its own GPU via:

    mpirun -np 2 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cylinder3d'

This is also documented in config/gpu_openmpi.mk (excerpt):

    
#  - Start the simulation using mpirun -np 2 ./cavity3d (All processes share default GPU, not optimal)
    #
    # Usage on a multi GPU system: (recommended when using MPI, use non-MPI version on single GPU systems)
    #  - Run  "mpirun -np 4 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cavity3d'"
    #    (for a 4 GPU system, further process mapping advisable, consult cluster documentation)
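
As an alternative to the inline export, the same per-rank binding can live in a small wrapper script. This is only a sketch (the name bind_gpu.sh is hypothetical); it relies on Open MPI exporting OMPI_COMM_WORLD_LOCAL_RANK to every process it launches:

    #!/usr/bin/env bash
    # bind_gpu.sh (hypothetical helper): restrict CUDA device visibility so that
    # each MPI process only sees the GPU matching its node-local rank.
    export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
    exec "$@"

It would then be invoked as mpirun -np 2 ./bind_gpu.sh ./cylinder3d (after chmod +x bind_gpu.sh).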
    
    #8811
    thanhphatvt
    Participant

Thank you so much. It was my mistake: I missed one character in the command, so it did not work.
It's working now.
Thank you

    #8812
    Adrian
    Keymaster

No worries! Glad to hear that it works now.

    #8823
    thanhphatvt
    Participant

Dear Adrian,
When I use MPI with 2 GPUs to run the simulation, I get a problem with the results: I cannot open them in ParaView. Here are the errors in ParaView:

    ERROR: In vtkXMLParser.cxx, line 368
    vtkXMLDataParser (000001E3D2A8DA40): Error parsing XML in stream at line 13, column 0, byte index 894: junk after document element

    ERROR: In vtkXMLReader.cxx, line 576
    vtkXMLImageDataReader (000001E3D2B3BFC0): Error parsing input file. ReadXMLInformation aborting.

    ERROR: In vtkExecutive.cxx, line 730
    vtkCompositeDataPipeline (000001E3CCEC9130): Algorithm vtkXMLImageDataReader (000001E3D2B3BFC0) returned failure for request: vtkInformation (000001E3CD00B1E0)
    Debug: Off
    Modified Time: 409141
    Reference Count: 1
    Registered Events: (none)
    Request: REQUEST_INFORMATION
    FORWARD_DIRECTION: 0
    ALGORITHM_AFTER_FORWARD: 1

    ERROR: In vtkXMLParser.cxx, line 368
    vtkXMLDataParser (000001E3D2A8E610): Error parsing XML in stream at line 13, column 0, byte index 894: junk after document element

    ERROR: In vtkXMLReader.cxx, line 576
    vtkXMLImageDataReader (000001E3D2B3CE80): Error parsing input file. ReadXMLInformation aborting.

    ERROR: In vtkExecutive.cxx, line 730
    vtkCompositeDataPipeline (000001E3CCEC10F0): Algorithm vtkXMLImageDataReader (000001E3D2B3CE80) returned failure for request: vtkInformation (000001E3CD00D550)

What should I do in this situation? Or did I do something wrong in the simulation?
Thank you, Adrian!

    #8824
    Adrian
    Keymaster

Is this using the laminar/cylinder3d case? One explanation could be that the ParaView files in tmp are in a broken state due to playing around with various parallelization modes. Does this also happen if you completely remove tmp and restart? Did you change anything in the example case?
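
One quick way to confirm a broken output state (a generic check, assuming xmllint from libxml2 is installed; not part of OpenLB) is to test whether the written .vti files are well-formed XML at all, since "junk after document element" typically means extra bytes were written after the closing root tag:

    # check every .vti file under tmp/ for well-formedness;
    # well-formed files produce no output, broken ones print the parse error
    find tmp -name '*.vti' -exec xmllint --noout {} \;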

    #8825
    thanhphatvt
    Participant

Yes, it's the laminar/cylinder3d case. This also happens when I completely remove it and restart. I haven't changed anything in the code.
I see this problem when I run with MPI via the command mpirun -np 4 bash -c 'export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}; ./cylinder3d'
If I run ./cylinder3d without MPI, I can open the files normally.

Here is my workstation's nvidia-smi output while running with MPI:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:05:00.0 Off |                    0 |
| N/A   44C    P0    57W / 149W |    369MiB / 11441MiB |     30%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:06:00.0 Off |                    0 |
| N/A   36C    P0    70W / 149W |    359MiB / 11441MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:09:00.0 Off |                    0 |
| N/A   43C    P0    59W / 149W |    369MiB / 11441MiB |     28%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   36C    P0    71W / 149W |    359MiB / 11441MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     16351      C   ./cylinder3d                      366MiB |
|    1   N/A  N/A     16352      C   ./cylinder3d                      356MiB |
|    2   N/A  N/A     16353      C   ./cylinder3d                      366MiB |
|    3   N/A  N/A     16355      C   ./cylinder3d                      356MiB |
+-----------------------------------------------------------------------------+

    Thanks

    #8838
    thanhphatvt
    Participant

Hello,
I'm still working on it, but now I get this message when I run make:

nvcc cavity3d.o -o cavity3d -lolbcore -L../../../external/lib -lmpi_cxx -lmpi -lpthread -lz -ltinyxml -lcuda -lcudadevrt -lcudart -L../../../build/lib
/usr/bin/ld: cannot find -lmpi_cxx: No such file or directory
collect2: error: ld returned 1 exit status
make: *** [../../../default.single.mk:38: cavity3d] Error 1

I understand the message says the linker cannot find that library, but it worked the last time I built; now it has this problem. Could you help me in this situation?
Thanks
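
(For reference, two standard Open MPI diagnostics, not OpenLB-specific, can narrow this down: the compiler wrapper can print the link line it would use, and the dynamic linker cache shows whether libmpi_cxx is installed at all:

    # print the link flags the Open MPI C++ wrapper would pass to the linker;
    # if -lmpi_cxx is absent here, this MPI build has no C++ bindings
    mpicxx -showme:link

    # check whether libmpi_cxx is visible to the dynamic linker
    ldconfig -p | grep libmpi_cxx
)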

    #8857
    mathias
    Keymaster
    #8859
    thanhphatvt
    Participant

    Thank you so much!
