[Rhino-Bird hands-on issue] training: split_sizes error #102
Merged
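While trying to reproduce the bug described in the issue, I found that the error came from a misconfigured Point cloud sampling section in the yaml file the reporters were using. The latest code in the repository has already fixed this and runs without problems, so this PR only adds a comment as a reminder.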
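Repeatedly configuring the local network interface during reproduction was tedious, so I changed the hard-coded NCCL settings in Hunyuan3D-2.1-main2/hy3dshape/scripts/train_deepspeed.sh to auto-detect the network interface, and removed every hard-coded bond1 from the script.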
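The actual change lives in the shell script; the idea behind the auto-detection can be sketched in Python as follows. This is only an illustration under assumptions: it assumes a Linux host where `/proc/net/route` is available, and it assumes the hard-coded `bond1` was used for a variable such as `NCCL_SOCKET_IFNAME`. The helper names `default_interface` and `configure_nccl` are hypothetical and do not appear in the repository.

```python
import os
from typing import MutableMapping, Optional


def default_interface(route_table: str) -> Optional[str]:
    """Return the interface that carries the default route, or None.

    `route_table` is the text of /proc/net/route: one header line, then
    whitespace-separated rows whose second field is the destination in hex.
    The default route has destination 00000000.
    """
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "00000000":
            return fields[0]
    return None


def configure_nccl(
    route_path: str = "/proc/net/route",
    env: Optional[MutableMapping[str, str]] = None,
) -> Optional[str]:
    """Set NCCL_SOCKET_IFNAME to the detected interface unless already set."""
    env = os.environ if env is None else env
    try:
        with open(route_path) as f:
            iface = default_interface(f.read())
    except OSError:
        iface = None  # non-Linux host or unreadable file: leave env untouched
    if iface is not None:
        # setdefault keeps any value the user exported explicitly.
        env.setdefault("NCCL_SOCKET_IFNAME", iface)
    return iface
```

Using `setdefault` means a user who still needs a specific interface can export the variable before launching and the detection will not override it.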
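The experiments were run on 8 H100 GPUs. Testing on a single GPU still raises the error, which shows the problem is unrelated to the distributed (multi-GPU) setup; the root cause lies in the pretrained ShapeVAE model architecture itself.
A technical document with the experiment procedure, logs, and results has been sent to: seanxhyang@tencent.com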
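Update (8.6):
Further experiments showed that when users customize the hy3dshape training config file, the modified pc_size and pc_sharpedge_size parameters are not passed through to the from_pretrained loading path. Passing the params from config as keyword arguments (**kwargs) in hy3dshape/hy3dshape/utils/misc.py resolves the problem. See Files changed for the exact modifications.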
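The idea of the fix can be illustrated with a small self-contained sketch. It is not the code in misc.py: `ToyShapeVAE`, `instantiate_pretrained`, the `from_pretrained` config key, and the default values 1024 and 0 are all hypothetical stand-ins chosen for illustration. Only the names `params`, `pc_size`, `pc_sharpedge_size`, and `from_pretrained` (as a loading method) come from the description above.

```python
from typing import Any, Dict


class ToyShapeVAE:
    """Hypothetical stand-in for the pretrained model class."""

    def __init__(self, pc_size: int, pc_sharpedge_size: int) -> None:
        self.pc_size = pc_size
        self.pc_sharpedge_size = pc_sharpedge_size

    @classmethod
    def from_pretrained(cls, ckpt_path: str, **kwargs: Any) -> "ToyShapeVAE":
        # A real loader would read weights and a saved config from ckpt_path.
        # Here the saved config is faked with arbitrary values.
        saved = {"pc_size": 1024, "pc_sharpedge_size": 0}
        saved.update(kwargs)  # caller-supplied overrides win over saved values
        return cls(**saved)


def instantiate_pretrained(config: Dict[str, Any]) -> ToyShapeVAE:
    """Load a pretrained model, forwarding config['params'] as **kwargs."""
    params = config.get("params", {})
    # Without **params, the user's pc_size / pc_sharpedge_size would be
    # silently ignored and the checkpoint's saved values would be used.
    return ToyShapeVAE.from_pretrained(config["from_pretrained"], **params)
```

With a config such as `{"from_pretrained": "dummy", "params": {"pc_size": 2048, "pc_sharpedge_size": 512}}`, the loaded object carries the user's values instead of the saved ones; omitting `params` falls back to the saved values.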