【Code bug】Environment configuration bugs in the mmaction open-source library

tech, 2025-12-08

【bug1】

UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector. warnings.warn('Was asked to gather along dimension 0, but all '

fix: This warning appears on some servers when running PyTorch 1.3 (not every machine is affected). Downgrading PyTorch to 1.0-1.1 resolves it.

【bug2】

return torch._C._broadcast_coalesced(tensors, devices, buffer_size) RuntimeError: all tensors must be on devices[0]

fix: This error is caused by the mmcv version. The distributed training framework in mmcv 0.4 does not support early mmaction code well; switching to mmcv 0.2.15 resolves it.

【bug3】

    result = self.forward(*input, **kwargs)
  File "/home/hadoop-mtcv/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 392, in forward
    self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at ../torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6d (0x7fd23150733d in /home/hadoop-mtcv/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x778 (0x7fd25d5adf18 in /home/hadoop-mtcv/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x666524 (0x7fd25d59b524 in /home/hadoop-mtcv/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x1144b0 (0x7fd25d0494b0 in /home/hadoop-mtcv/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

fix: Downgrade PyTorch to 1.0.

【bug4】RuntimeError: all tensors must be on devices[0]

fix: Rebuild the environment. mmcv must be version 0.2.16.
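The version pins in bug1 through bug4 can be checked before launching training. The following is a minimal sketch, not part of mmaction: the `PINNED` table, `matches_pin`, and `check_environment` are hypothetical names, and the accepted mmcv set lists both 0.2.15 and 0.2.16 because the two fixes above name different patch versions. It assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata

# Pins taken from the fixes above. Listing both mmcv versions is an assumption,
# because bug2 names 0.2.15 while bug4 names 0.2.16.
PINNED = {
    "torch": ("1.0", "1.1"),
    "mmcv": ("0.2.15", "0.2.16"),
}


def matches_pin(version, accepted):
    """Return True if `version` equals or extends one of the accepted prefixes."""
    # Drop local build tags such as "+cu100" before comparing.
    base = version.split("+")[0]
    return any(base == p or base.startswith(p + ".") for p in accepted)


def check_environment(pins=PINNED):
    """Map each pinned package to (installed_version_or_None, ok)."""
    report = {}
    for name, accepted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and matches_pin(installed, accepted)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        print(f"{name}: installed={installed} ok={ok}")
```

Comparing on a dotted prefix rather than a plain `startswith` keeps a version such as 1.10.0 from being mistaken for 1.1.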

【bug5】video_dataset.py raises an error when indexing frames.

def _load_image(self, video_reader, directory, modality, idx):
    if modality in ['RGB', 'RGBDiff']:
        return [video_reader[idx - 1]]
    elif modality == 'Flow':
        raise NotImplementedError
    else:
        raise ValueError('Not implemented yet; modality should be '
                         '["RGB", "RGBDiff", "Flow"]')

Cause: some videos are broken and cannot be loaded, so video_reader is None and record.num_frames == 0.

fix: Add fault-tolerant code inside def __getitem__(self, idx):

else:
    video_reader = mmcv.VideoReader('{}{}'.format(
        osp.join(self.img_prefix, record.path), self.video_ext))
    record.num_frames = len(video_reader)
    # fix bug: some videos are broken and can't be loaded; the reader has length 0
    # (needs `import random` at the top of video_dataset.py)
    if record.num_frames == 0:
        while True:
            # fall back to a randomly chosen record until a readable video is found
            record = self.video_infos[random.randint(0, len(self.video_infos) - 1)]
            video_reader = mmcv.VideoReader('{}{}'.format(
                osp.join(self.img_prefix, record.path), self.video_ext))
            record.num_frames = len(video_reader)
            if record.num_frames != 0:
                break
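The fallback loop above never terminates if every video in the list is broken. A bounded variant of the same idea might look like the sketch below. It is deliberately independent of mmcv: `pick_readable` and its `open_reader` argument are hypothetical names, with `open_reader` standing in for the `mmcv.VideoReader(...)` call, and the `max_tries` limit is an assumption rather than anything from mmaction.

```python
import random


def pick_readable(records, idx, open_reader, max_tries=10, rng=random):
    """Return (record, reader) for `idx`, falling back to random records.

    `open_reader` is a hypothetical callable standing in for
    mmcv.VideoReader(path); it must return None or an object supporting len().
    """
    record = records[idx]
    for _ in range(max_tries):
        reader = open_reader(record)
        if reader is not None and len(reader) > 0:
            return record, reader
        # Broken video: try a different record chosen at random.
        record = records[rng.randrange(len(records))]
    raise RuntimeError(f"no readable video found after {max_tries} attempts")


if __name__ == "__main__":
    # Toy stand-ins: a "reader" is just a list of frames.
    frames = {"a.mp4": [], "b.mp4": ["f0", "f1"]}
    record, reader = pick_readable(list(frames), 1, lambda r: frames[r])
    print(record, len(reader))
```

Raising after a fixed number of attempts surfaces a dataset that is entirely unreadable instead of hanging the data loader.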