eventlet在ubuntu上出现OSError protocol not found

2020年6月21日 09:24

描述

 
tensorflow的nvidia docker镜像使用ubuntu16.04, ubuntu是精简之后的,有些包可能没有。在上面运行eventlet会出现下面问题
 
 

错误内容

 
Traceback (most recent call last):
  File "/app/defect-client/defect_client/cmd/wafer-worker.py", line 14, in <module>
    import eventlet
  File "/usr/local/lib/python3.6/dist-packages/eventlet/__init__.py", line 10, in <module>
    from eventlet import convenience
  File "/usr/local/lib/python3.6/dist-packages/eventlet/convenience.py", line 7, in <module>
    from eventlet.green import socket
  File "/usr/local/lib/python3.6/dist-packages/eventlet/green/socket.py", line 21, in <module>
    from eventlet.support import greendns
  File "/usr/local/lib/python3.6/dist-packages/eventlet/support/greendns.py", line 69, in <module>
    setattr(dns.rdtypes.IN, pkg, import_patched('dns.rdtypes.IN.' + pkg))
  File "/usr/local/lib/python3.6/dist-packages/eventlet/support/greendns.py", line 59, in import_patched
    return patcher.import_patched(module_name, **modules)
  File "/usr/local/lib/python3.6/dist-packages/eventlet/patcher.py", line 126, in import_patched
    *additional_modules + tuple(kw_additional_modules.items()))
  File "/usr/local/lib/python3.6/dist-packages/eventlet/patcher.py", line 100, in inject
    module = __import__(module_name, {}, {}, module_name.split('.')[:-1])
  File "/usr/local/lib/python3.6/dist-packages/dns/rdtypes/IN/WKS.py", line 25, in <module>
    _proto_tcp = socket.getprotobyname('tcp')
OSError: protocol not found
 

解决办法

 
apt-get -o Dpkg::Options::="--force-confmiss" install --reinstall netbase
 

Tags: eventlet python
评论(45) 阅读(4593)

tensorflow资源耗净 Resource exhausted OOM when allocating tensor with shape

2020年5月01日 13:02

描述

 
tensorflow跑训练集经常会遇到错误Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048]
 

错误内容

tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

[[total_loss/_7771]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[64,33,33,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node SecondStageBoxPredictor_1/ResizeBilinear (defined at /app/models/research/object_detection/predictors/heads/mask_head.py:149) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
 SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)

Input Source operations connected to node SecondStageBoxPredictor_1/ResizeBilinear:
 SecondStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/Relu (defined at /app/models/research/slim/nets/resnet_v1.py:136)
 

原因

 
tensorflow在为张量shape[64,33,33,2048]分配gpu内存是发现资源不够。
 
假如数据类型是int8, 该张量需要的内存大小64*33*33*2048*1B=142737408B = 142.7MB
 

解决方法

 
1. 降低图片质量
2. batch_size改成1
3. 改用大内存的显卡
4. 增加显卡, 并行训练
 

 

Tags: tensorflow nvidia
评论(67) 阅读(6217)

tesla t4的坑Unable to load the kernel module 'nvidia.ko'.ipynb

2020年4月24日 15:35

说明

 
安nvidia tesla T4显卡遇到的坑, 在ubuntu16.04上安装t4会遇到下面错误
 

错误内容

 
   make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-142-generic'
-> done.
-> Kernel module compilation complete.
ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA GPU(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver release.
 

解决方法

 
t4不支持普通服务器,更换成刀片服务器
 

补充

 
  • 如果普通主机操作系统是win10,插上t4,安装驱动正常。
 
  • 安装nvidia 2080Ti 驱动,如果忘记插显卡电源线,会提示同样错误
 
 
 

Tags: tensorflow
评论(276) 阅读(23165)

object-detection图片切割提示Invalid argument: Key: image/object/mask错误

2020年4月14日 14:09

  • tensorflow的object-detection切割图片出现错误
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Key: image/object/mask.  Data types don't match. Expected type: float, Actual type: string
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[IteratorGetNext]]
	 [[BatchMultiClassNonMaxSuppression/map/while/TensorArrayReadV3_5/_7587]]
  (1) Invalid argument: Key: image/object/mask.  Data types don't match. Expected type: float, Actual type: string
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
  • 错误原因

pipeline配置文件缺少参数mask_type去指定mask类型

  • 设置方法
train_input_reader: {
  mask_type: PNG_MASKS
}


eval_input_reader: {
  mask_type: PNG_MASKS
}

Tags: tensorflow object-detection
评论(124) 阅读(3722)

学习: matplotlib绘制阻尼正弦波

2019年4月06日 07:55

介绍

 
* 阻尼正弦波定义: 振幅会随时间增长而趋向零的正弦波函数
 
 

Tags: matplotlib
评论(640) 阅读(10933)

学习:人工智能-机器学习-深度学习概念的区别

2019年1月20日 12:36

 

一图胜千言

 
 

Tags: 机器学习 深度学习
评论(6) 阅读(3082)

使用numba的姿势不正确反而导致性能下降

2019年1月20日 09:24

numba能够极大的提高python在计算方面的性能。是不是所有的python代码上,都可以加上numba.jit装饰器?答案是否定的。
 
 

Tags: python 性能
评论(446) 阅读(8383)

numba加速python学习与尝试

2019年1月20日 09:20

 简介

 
探索python性能优化工具,发现了numba. 只需要给函数加上装饰器就可以。比cython和pypy方便多了。
 

numba是什么

 
numba是为了提高numpy速度而开发的,使用llvm将python代码翻译为bitcode,并在bitcode外面做了一层包装,让python可以调用
 
通过numba翻译的代码由于经过llvm优化并可在机器上直接执行,效率将有所提高,对海量数据处理非常有帮助
 
 

评论(523) 阅读(29590)