IT Panda Blog

PyTorch RNN CPU Issue

Posted on 2019-08-30 Edited on 2020-05-04 In machine learning

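Recently at work I have been building a machine learning platform based on MLflow. Our Data Scientists in production ran into a problem when using PyTorch, so here is a record of it.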

Symptoms

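When MLflow deployed a model trained with a PyTorch RNN, the service failed to start. The internal gunicorn workers restarted in an endless loop, and each crash dumped the thread stacks and heap into a core file, which at one point caused the production GFS to run out of space.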

Our service runs inside k8s Pods, and the strangest part was that some pods started without any problem while others did not.

Troubleshooting

Since we had the core dump files, the next step was to analyze them. I used gdb, and after opening a core file saw the following error:

[New LWP 470]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/gunicorn --timeout 60 -b 0.0.0.0:5000 -w 4'.
Program terminated with signal SIGILL, Illegal instruction.
#0 0x00007fc6db828ffa in Xbyak::Operand::Operand(int, Xbyak::Operand::Kind, int, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
(gdb) where
#0 0x00007fc6db828ffa in Xbyak::Operand::Operand(int, Xbyak::Operand::Kind, int, bool) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#1 0x00007fc6d9bb0877 in _GLOBAL__sub_I_verbose.cpp () from /opt/conda/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#2 0x00007fc71d91879a in call_init (l=<optimized out>, argc=argc@entry=9, argv=argv@entry=0x7ffc9bc85ad8, env=env@entry=0x55bfa73c4d20) at dl-init.c:72
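
With workers crashing in a loop there can be many core files, so it may help to script the same inspection. The following is a minimal sketch, not the procedure used in this post: it builds a non-interactive gdb command that prints the backtrace and pulls the signal name out of gdb's output. The function names and the example core file name `core.470` are hypothetical.

```python
import re
import subprocess


def gdb_backtrace_cmd(python_bin, core_path):
    """Build a non-interactive gdb invocation that prints the backtrace."""
    # --batch makes gdb exit after running its commands;
    # -ex runs a single gdb command ("where" prints the backtrace).
    return ["gdb", "--batch", "-ex", "where", python_bin, core_path]


def terminating_signal(gdb_output):
    """Return the signal name from gdb's 'Program terminated' line, or None."""
    match = re.search(r"Program terminated with signal (\w+)", gdb_output)
    return match.group(1) if match else None


def inspect_core(python_bin, core_path):
    """Run gdb on one core file and return (signal name, full gdb output)."""
    result = subprocess.run(
        gdb_backtrace_cmd(python_bin, core_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return terminating_signal(result.stdout), result.stdout
```

Calling `inspect_core("/opt/conda/bin/python", "core.470")` on a machine with gdb installed would return the signal name together with the raw output, so that many core files can be triaged quickly.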

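The backtrace shows the process dying with SIGILL (illegal instruction) inside PyTorch's native library libcaffe2.so, loaded by Python, so the problem lies in Python and PyTorch. A Google search showed that many people have hit the same issue: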

  • same issue 01
  • same issue 02

The issue is caused by the CPU architecture. The CPU flags of a node can be listed with:

cat /proc/cpuinfo | grep flag

After comparing the flags and cross-referencing same-issue-from-github, I confirmed that the model works well on CPUs with AVX2.
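
The flag comparison can also be scripted, for example to check a node before scheduling the service onto it. This is a minimal sketch under the assumption that the file follows the x86 Linux `/proc/cpuinfo` layout, where each processor has a `flags : ...` line. The helper names `cpu_flags` and `supports` are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly so other keys containing "flags" are ignored.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports(flag, cpuinfo_path="/proc/cpuinfo"):
    """Return True if the given flag (e.g. 'avx2') is listed for this CPU."""
    with open(cpuinfo_path) as handle:
        return flag in cpu_flags(handle.read())
```

On a Linux node, `supports("avx2")` reports whether the first listed processor advertises AVX2. Comparing the result between a pod that starts and one that crashes is one way to test the CPU-architecture hypothesis.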

Solution

Upgrading PyTorch to 1.2.0 fixed the issue.
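
To keep an older build from slipping back into a serving image, a startup check on the installed version is one option. The sketch below is an assumption about how such a guard could look and is not part of MLflow or PyTorch. It only parses a version string, so it runs without PyTorch installed. The names `MIN_TORCH`, `parse_version` and `torch_is_new_enough` are hypothetical.

```python
import re

# Minimum version taken from the fix described above.
MIN_TORCH = (1, 2, 0)


def parse_version(version):
    """Parse the leading 'X.Y.Z' of a string such as '1.2.0+cu92' into ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def torch_is_new_enough(version, minimum=MIN_TORCH):
    """Return True if the version is at least the minimum (tuple comparison)."""
    return parse_version(version) >= minimum
```

In a serving entrypoint this could be called as `torch_is_new_enough(torch.__version__)` after `import torch`, failing fast with a clear message instead of letting the workers crash.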
