zsx

Serving as a backup of the main site for now. Source site: https://my.toho.red

A-Card Deep Learning Server: Troubleshooting & Pitfall Memo

Title: Adventures and Pitfalls of Building and Troubleshooting an A-Card Deep Learning Server
Date: 2022-11-06 13:22:00
Categories: Random/Notes
URL Name: 113
Tags:#

I am a not-so-motivated student, but when I heard that the neighboring artificial intelligence major was offering online courses over winter break, I immediately signed up and was accepted. There was a catch, though: I had to provide my own equipment. My trashy computer crashes whenever I raise the resolution for AI drawing, so with my family's support I started building a server.

The initial configuration was as follows:

  • CPU: D1581
  • GPU: Tesla P40
  • Memory: 32GB DDR4

A few days after I settled on that configuration, my family, perhaps moved by my nonexistent motivation, decided to upgrade it:

  • CPU: 13700KF
  • GPU: 3070
  • Memory: 32GB DDR5

Before the excitement over the upgrade could settle in, my family mentioned that the configuration might need to be "slightly" downgraded. I stayed calm about that, right up until I learned what the actual configuration was. The CPU was indeed only slightly downgraded, to a 13600KF with 2 fewer cores (4 fewer threads), and since the price dropped significantly it could still be considered cost-effective. The GPU, however, was replaced with AMD's MI50 compute card.

I had never studied deep learning, but I had long heard about AMD's notorious compatibility issues, which made me hesitate. After a quick search, I discovered that AMD had released ROCm years ago to compete with CUDA. Seeing "support for PyTorch and TensorFlow" in the description, I became interested and immediately agreed to switch to the A-card. Now that I have experienced AMD's notoriety firsthand, I can say it was the most regrettable decision I have ever made.

Enough with the introduction, let's get to the complete process of tinkering:

When I started installing the driver, I didn't think much of it: what could go wrong? In the end I went through four installs across three different systems:

  1. Ubuntu 22.10
  2. Ubuntu 22.04 with GUI
  3. Windows 10 LTSC 2019
  4. Ubuntu 22.04

During the tinkering process, I encountered numerous pitfalls and gained a lot of experience. Here's a summary in chronological order:

  1. The newest supported system is Ubuntu 22.04. Why is this so important? Because the ROCm installer provided by AMD depends on outdated libraries, and the apt sources of newer releases no longer provide them, so the installation fails.

  2. When installing with "amdgpu-install", the "--no-dkms" parameter must be included. By default the installer builds the driver into the kernel as a DKMS module, and as far as I could tell that module only supports older kernel versions (4.x), so on newer kernels the installation fails every time. With "--no-dkms", the amdgpu driver that ships with the kernel is used instead.

  3. Never use Windows for this. ROCm does not support Windows, so there you can only use this card for gaming, rendering, and editing.
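
The first two pitfalls can be turned into a small pre-flight check. The following is a minimal sketch, not AMD's official procedure: it parses the release number from `/etc/os-release`-style text, rejects anything newer than 22.04, and only prints the install command rather than running it. The helper names (`parse_version_id`, `is_supported`, `build_install_command`) are made up for illustration, and `--usecase=rocm` is an assumption about which components you want installed.

```python
import shlex

# Assumption from pitfall 1: Ubuntu 22.04 is the newest release that works.
MAX_SUPPORTED = (22, 4)


def parse_version_id(os_release_text):
    """Extract VERSION_ID from /etc/os-release content as a tuple of ints."""
    for line in os_release_text.splitlines():
        if line.startswith("VERSION_ID="):
            value = line.split("=", 1)[1].strip().strip('"')
            return tuple(int(part) for part in value.split("."))
    raise ValueError("VERSION_ID not found")


def is_supported(version):
    """True if the release is not newer than MAX_SUPPORTED."""
    return version <= MAX_SUPPORTED


def build_install_command(use_dkms=False):
    """Build the amdgpu-install command; --no-dkms is added unless DKMS is wanted."""
    cmd = ["sudo", "amdgpu-install", "--usecase=rocm"]
    if not use_dkms:
        cmd.append("--no-dkms")
    return cmd


# Sample input standing in for the real /etc/os-release file.
sample = 'NAME="Ubuntu"\nVERSION_ID="22.04"\n'
version = parse_version_id(sample)
if is_supported(version):
    print(shlex.join(build_install_command()))
else:
    print("Release is newer than 22.04; expect missing dependencies.")
```

To check a real machine, pass the contents of `/etc/os-release` to `parse_version_id` instead of the sample string.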

Although these pitfalls may seem few, it took me several days of research to succeed. It's safe to say that AMD's inability to compete with Nvidia in deep learning is not without reason.

By the time you read this, I have finally succeeded. With the help of Google and AMD staff, I managed to install the graphics card driver and run AI benchmarks successfully.

Here are the tutorial links I used during the installation:

https://askubuntu.com/questions/1429376/how-can-i-install-amd-rocm-5-on-ubuntu-22-04
https://github.com/RadeonOpenCompute/ROCm/issues/1852#event-7730462672

Here is an overview of the current system:

  • System: Ubuntu Server 22.04.1
  • Kernel: 5.15
  • ROCm Version: 5.1.1
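
To confirm that a setup like this is usable from a framework, a quick check along the following lines can help. This is only a sketch and assumes a ROCm build of PyTorch is installed, which the steps above do not cover. ROCm builds of PyTorch expose the GPU through the regular `torch.cuda` API and set `torch.version.hip`. The function name `rocm_report` is made up for illustration.

```python
def rocm_report():
    """Return a dict describing whether PyTorch can see a ROCm GPU."""
    report = {
        "torch_installed": False,
        "hip_version": None,
        "gpu_available": False,
        "device_name": None,
    }
    try:
        import torch
    except ImportError:
        # PyTorch is missing; report the defaults instead of crashing.
        return report

    report["torch_installed"] = True
    # A version string on ROCm builds, None on CUDA or CPU-only builds.
    report["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm builds reuse the torch.cuda namespace for AMD GPUs.
    if torch.cuda.is_available():
        report["gpu_available"] = True
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


for key, value in rocm_report().items():
    print(f"{key}: {value}")
```

If `hip_version` is `None` while PyTorch is installed, the installed wheel is probably a CUDA or CPU-only build rather than a ROCm one.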

Looking back after all this tinkering, my efforts were not in vain: this card's AI benchmark score falls between those of the P100 and the 3070, both of which cost much more than it does. Add its 16GB of VRAM, and it has a real advantage in cost-effectiveness.

If AMD can improve its driver support, I believe the A-card is still worth buying, especially for students like me who value cost-effectiveness. On the other hand, thanks to AMD, I finally get to use cheaper N-cards and Intel CPUs (runs away).

2022.11.15
