dpgen RuntimeError #506

@Cc-12342234

Description

When I ran dpgen and it reached the fp step, this error occurred. I believe the cause is that tasks whose SCF did not converge are not categorized as failed tasks; instead dpdispatcher keeps resubmitting them until the retry limit is reached and then raises RuntimeError (a minimal detection sketch follows the log below). Part of the log file is as follows:
dpdispatcher.log
2024-11-16 19:41:00,268 - INFO : job: 8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b 17340881 terminated; fail_cout is 4; resubmitting job
2024-11-16 19:41:00,295 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b re-submit after terminated; new job_id is 17341121
2024-11-16 19:41:00,509 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b job_id:17341121 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-11-16 19:41:00,509 - INFO : job: 4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 failed 3 times.
Possible remote error message: ==> /public/home/1234/DPGEN-cp2k/remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59/task.000.000518/output <==
qs_scf.F:605 *


===== Routine Calling Stack =====

        4 scf_env_do_scf
        3 qs_energies
        2 qs_forces
        1 CP2K

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).

CP2K output:


  [ABORT] SCF run NOT converged. To continue the calculation regardless,
  please set the keyword IGNORE_CONVERGENCE_FAILURE.  (qs_scf.F:605)

===== Routine Calling Stack =====

        4 scf_env_do_scf
        3 qs_energies
        2 qs_forces
        1 CP2K

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.


prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
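
For reference, below is a minimal sketch (not dpgen code) of the kind of check I have in mind: it scans the fp task directories for the CP2K abort banner so that unconverged tasks can be recognized and treated as failed instead of being resubmitted until the retry limit is hit. The directory layout (task.*/output) is an assumption taken from the paths in the log above. CP2K's own suggestion, setting the IGNORE_CONVERGENCE_FAILURE keyword in the SCF input, would be an alternative if one wants such frames to finish anyway.

```python
# Hedged sketch, not dpgen's actual implementation: scan CP2K task outputs
# for the "SCF run NOT converged" abort banner and report which fp tasks
# should be treated as failed instead of being resubmitted.
# The glob pattern "task.*/output" is an assumption based on the directory
# layout shown in the log above.

from pathlib import Path

ABORT_MARKER = "SCF run NOT converged"  # text taken from the CP2K banner above


def find_unconverged_tasks(fp_dir: str) -> list[Path]:
    """Return task directories whose CP2K output contains the abort banner."""
    unconverged = []
    for output in Path(fp_dir).glob("task.*/output"):
        try:
            text = output.read_text(errors="ignore")
        except OSError:
            continue  # unreadable output: leave it for dpdispatcher to handle
        if ABORT_MARKER in text:
            unconverged.append(output.parent)
    return unconverged


if __name__ == "__main__":
    # Hypothetical local copy of the remote fp work directory from the log.
    fp_dir = "remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59"
    for task in find_unconverged_tasks(fp_dir):
        print(f"unconverged, should be treated as failed: {task.name}")
```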
