Typical error message:
Error: not enough free space for shared memory: need 11954331648, have 6760333312
KID0: Process received SIGBUS. Most likely cause: disk or shared memory full.
KID1: Process received SIGBUS. Most likely cause: disk or shared memory full.
Process received SIGBUS. Most likely cause: disk or shared memory full.
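Cause: insufficient disk space. On the node the job was submitted to, run the df command and check whether any entry in the Use% column has reached about 99%. That column shows how much of each disk partition is in use. A value of 99% means the partition is full, with no space left; clean up the data in that partition.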
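The df check described above can also be scripted. The following is a minimal Python sketch, not part of the original instructions. It assumes a Linux node where `df -P` prints the POSIX layout (six columns, with the usage percentage in the fifth column, labelled Capacity in that mode). The function names `full_filesystems` and `report` and the 99% default threshold are illustrative choices.

```python
import subprocess


def full_filesystems(df_output, threshold=99):
    """Return (mount_point, percent_used) pairs at or above `threshold`.

    `df_output` is the text printed by `df -P` (assumed POSIX layout:
    filesystem, blocks, used, available, capacity%, mount point).
    """
    hits = []
    for line in df_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 5)
        # Skip lines that do not have a percentage in the fifth column.
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        try:
            percent = int(fields[4].rstrip("%"))
        except ValueError:
            continue
        if percent >= threshold:
            hits.append((fields[5], percent))
    return hits


def report(threshold=99):
    """Run `df -P` on the current node and print the nearly full entries."""
    result = subprocess.run(
        ["df", "-P"], capture_output=True, text=True, check=True
    )
    for mount, percent in full_filesystems(result.stdout, threshold):
        print(f"{mount}: {percent}% used")
```

Calling `report()` on the node in question prints every mount point at or above the threshold; an empty result means no partition has reached it.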
The beginning of the job's *.out file lists the nodes that were used, for example:
Parallel Execution: Process Information
==============================================================================
Rank  Node Name  NodeID  MyNodeRank  NodeMaster
   0  node08          0           0           0
   1  node08          0           1          -1
   2  node08          0           2          -1
   3  node08          0           3          -1
   4  node08          0           4          -1
   5  node08          0           5          -1
   6  node08          0           6          -1
   7  node08          0           7          -1
   8  node08          0           8          -1
   9  node08          0           9          -1
  10  node08          0          10          -1
  11  node08          0          11          -1
  12  node08          0          12          -1
  13  node08          0          13          -1
  14  node08          1           0           1
  15  node08          1           1          -1
  16  node08          1           2          -1
  17  node08          1           3          -1
  18  node08          1           4          -1
  19  node08          1           5          -1
  20  node08          1           6          -1
  21  node08          1           7          -1
  22  node08          1           8          -1
  23  node08          1           9          -1
  24  node08          1          10          -1
  25  node08          1          11          -1
  26  node08          1          12          -1
  27  node08          1          13          -1
==============================================================================
This shows that the job was sent to node08 and ran there, using 2 of its CPUs with 14 cores on each (28 processes in total).
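To find the node without reading the table by eye, the rows can be counted per node name and NodeID. This is a rough Python sketch under two assumptions: in the actual *.out file each table row sits on its own line with five whitespace-separated fields, and the table ends at the second rule of "=" characters. The function name `ranks_per_node` is hypothetical.

```python
from collections import Counter


def ranks_per_node(out_text):
    """Count table rows per (node name, NodeID) pair.

    Assumes each data row has five whitespace-separated fields:
    Rank, Node Name, NodeID, MyNodeRank, NodeMaster.
    """
    counts = Counter()
    in_table = False
    for line in out_text.splitlines():
        if "Parallel Execution: Process Information" in line:
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        if len(fields) == 5 and fields[0].isdigit():
            counts[(fields[1], int(fields[2]))] += 1
        elif counts and fields and set(fields[0]) == {"="}:
            break  # closing rule after the last data row
    return counts
```

For the table shown above, this would give 14 rows for ("node08", 0) and 14 for ("node08", 1), matching the reading of 2 CPUs with 14 cores each.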