tensorflow资源耗净 Resource exhausted OOM when allocating tensor with shape

2020年5月01日 13:02

描述

 
tensorflow跑训练集经常会遇到错误Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048]
 

错误内容

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[total_loss/_7771]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
 SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)

Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
 SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)
 

原因

 
tensorflow在为张量shape[64,33,33,2048]分配gpu内存是发现资源不够。
 
假如数据类型是int8, 该张量需要的内存大小64*33*33*2048*1B=142737408B = 142.7MB
 

解决方法

 
1. 降低图片质量
2. batch_size改成1
3. 改用大内存的显卡
4. 增加显卡, 并行训练
 

 

Tags: tensorflow nvidia
评论(69) 阅读(6410)

tesla t4的坑Unable to load the kernel module 'nvidia.ko'.ipynb

2020年4月24日 15:35

说明

 
安nvidia tesla T4显卡遇到的坑, 在ubuntu16.04上安装t4会遇到下面错误
 

错误内容

 
   make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-142-generic'
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA GPU(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
 

解决方法

 
t4不支持普通服务器,更换成刀片服务器
 

补充

 
  • 如果普通主机操作系统是win10,插上t4,安装驱动正常。
 
  • 安装nvidia 2080Ti 驱动,如果忘记插显卡电源线,会提示同样错误
 
 
 

Tags: tensorflow
评论(283) 阅读(23821)

object-detection图片切割提示Invalid argument: Key: image/object/mask错误

2020年4月14日 14:09

  • tensorflow的object-detection切割图片出现错误
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Key: image/object/mask.  Data types don't match. Expected type: float, Actual type: string
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[IteratorGetNext]]
	 [[BatchMultiClassNonMaxSuppression/map/while/TensorArrayReadV3_5/_7587]]
  (1) Invalid argument: Key: image/object/mask.  Data types don't match. Expected type: float, Actual type: string
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
  • 错误原因

pipeline配置文件缺少参数mask_type去指定mask类型

  • 设置方法
train_input_reader: {
  mask_type: PNG_MASKS
}


eval_input_reader: {
  mask_type: PNG_MASKS
}

Tags: tensorflow object-detection
评论(133) 阅读(3953)