Description
When I ran dpgen and reached the fp step, this error occurred. I believe the cause is that unconverged CP2K tasks were not categorized as failed tasks, so dpdispatcher kept resubmitting them until the retry limit was reached and the run aborted (a minimal detection sketch is included after the logs). Part of the log file is as follows:
dpdispatcher.log:
2024-11-16 19:41:00,268 - INFO : job: 8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b 17340881 terminated; fail_cout is 4; resubmitting job
2024-11-16 19:41:00,295 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b re-submit after terminated; new job_id is 17341121
2024-11-16 19:41:00,509 - INFO : job:8478e1cf10abefb934c3ebc4e2ca1bb5a9265d5b job_id:17341121 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-11-16 19:41:00,509 - INFO : job: 4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
job.handle_unexpected_job_state()
File "/public/home/1234/soft/anaconda3/envs/dpdata/lib/python3.10/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
raise RuntimeError(err_msg)
RuntimeError: job:4f184443ae19c2074dab082c47f4935e9ebc6f7a 17340916 failed 3 times.
Possible remote error message: ==> /public/home/1234/DPGEN-cp2k/remotefile/fp/d1034656268ea53c6a7e208951eec1341db12e59/task.000.000518/output <==
:605 *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
prterun has exited due to process rank 0 with PID 0 on node node0402 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
CP2K output:
*   ___                                                                      *
*  /   \                                                                     *
* [ABORT]                                                                    *
*  \___/    SCF run NOT converged. To continue the calculation regardless,   *
*    |      please set the keyword IGNORE_CONVERGENCE_FAILURE.               *
*  O/|                                                                       *
* /| |                                                                       *
*  / \                                                    qs_scf.F:605       *
===== Routine Calling Stack =====
4 scf_env_do_scf
3 qs_energies
2 qs_forces
1 CP2K
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[42699,1],0]
Errorcode: 1
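For reference, this is a minimal sketch of the kind of check I have in mind, assuming an unconverged task can be recognized by the "SCF run NOT converged" banner in its output file. The file name output and the marker string come from the logs above; the function name and the task.* directory layout are just for illustration, this is not dpgen/dpdispatcher code.

```python
from pathlib import Path

# Marker printed by CP2K when the SCF loop does not converge
# (see the [ABORT] banner in the output above).
SCF_FAILURE_MARKER = "SCF run NOT converged"


def cp2k_task_converged(task_dir: str, output_name: str = "output") -> bool:
    """Return False if the CP2K output in `task_dir` reports an unconverged SCF.

    Illustrative only: shows how unconverged tasks could be told apart from
    successful ones instead of being resubmitted as "terminated" jobs.
    """
    output_file = Path(task_dir) / output_name
    if not output_file.exists():
        return False  # no output at all -> treat as failed
    text = output_file.read_text(errors="ignore")
    return SCF_FAILURE_MARKER not in text


if __name__ == "__main__":
    # Hypothetical usage: scan fp task directories and list the ones that
    # should be marked as failed rather than resubmitted.
    for task in sorted(Path(".").glob("task.*")):
        if not cp2k_task_converged(str(task)):
            print(f"{task}: SCF not converged or output missing; should be treated as failed")
```

Alternatively, as the CP2K banner itself suggests, setting IGNORE_CONVERGENCE_FAILURE (presumably in the &SCF section of the input) would let the calculation finish instead of aborting, but the unconverged frames would then still need to be filtered out rather than resubmitted.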