Windows 11 – Running ROCm ZLUDA PyTorch Stable Diffusion 2024

Requirements

Windows 11
Radeon GPU: 7900 GRE, 7900 XT, or 7900 XTX. Other GPUs, such as the 7800 XT, may work but are not officially supported. I have personally tested this on both the 7900 GRE and 7900 XTX.
HIP SDK Driver: 6.1.0
Adrenalin Driver: 24.6.1
ZLUDA
Miniconda
Git
WSL2

Step 1: Install Adrenalin Driver: 24.6.1

https://www.amd.com/en/resources/support-articles/release-notes/RN-RAD-WIN-24-6-1.html
If for some reason this version is not available, I have a copy here: https://drive.google.com/file/d/1Evr9gzAsBxgb9VeJx-2KQ2A_cKkloV3k/view?usp=sharing

Install and reboot

NOTE: This is the only step I won't show on video, because of recording issues while installing GPU drivers. Install this version and click Next through everything; no changes are needed.

Step 2: Install HIP SDK Driver: 6.1.0

https://www.amd.com/en/developer/resources/rocm-hub/hip-sdk.html
If for some reason this version is not available, I have a copy here: https://drive.google.com/file/d/1nt5CQtTmC4uznOY2xN45pDAfW0fd7ZsP/view?usp=sharing

Install and reboot

Step 3: Turn on WSL2

https://docs.microsoft.com/en-us/windows/wsl/install
Open PowerShell as administrator and run the following command: wsl --install

NOTE: If this fails you may need to enable virtualization in your BIOS. This is a common issue with WSL2 on Intel.

Step 4: Install Miniconda

https://docs.conda.io/en/latest/miniconda.html
If for some reason this version is not available, I have a copy here: https://drive.google.com/file/d/1_SBpgwump9XZXiCzOql-zhf9Y2lPF3v8/view?usp=sharing

Click Next through the installer; no changes are needed.

Step 5: Install Git

https://git-scm.com/download/win
If for some reason this version is not available, I have a copy here: https://drive.google.com/file/d/1CfddS9h1n3LhyiN-s6nmngp0ZI-qOClp/view?usp=sharing

Click Next through the installer; no changes are needed.

Step 6: Download a copy of ZLUDA

ZLUDA was actively developed, then canceled, and in Q4 2024 it was restarted independently of AMD.
I am using ZLUDA version 3.8.3.
You can download it from the following link: https://drive.google.com/file/d/17_TprYhYro-v9ghAuG8p2rQXDBiGcLbi/view?usp=sharing

Unzip it to a location of your choice. I will be using: F:\AI\ZLUDA
Remember this path; it is added to the system path in the next step.

Step 7: Edit Windows environment variables

Add to the user path: C:\Program Files\Git\bin
Add to the system path:
%HIP_PATH%bin
F:\AI\ZLUDA
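
After saving the variables, open a new terminal so the changes are picked up. The Python snippet below is a minimal sketch for confirming that the entries actually landed on the path. The helper name missing_from_path is made up for illustration, and the ZLUDA folder is the one chosen in this guide, so adjust it if you unzipped elsewhere.

```python
import ntpath
import os


def missing_from_path(required, path_value):
    """Return the entries of `required` that are not present in `path_value`.

    Comparison follows Windows rules: case-insensitive, trailing slashes ignored.
    """
    def norm(p):
        # Expand %VAR% references, then normalize separators and case.
        return ntpath.normcase(ntpath.normpath(ntpath.expandvars(p.strip())))

    present = {norm(p) for p in path_value.split(";") if p.strip()}
    return [r for r in required if norm(r) not in present]


if __name__ == "__main__":
    required = [r"F:\AI\ZLUDA"]  # assumption: the ZLUDA folder used in this guide
    hip_path = os.environ.get("HIP_PATH", "")
    if hip_path:
        required.append(ntpath.join(hip_path, "bin"))
    else:
        print("HIP_PATH is not set; is the HIP SDK installed?")
    for entry in missing_from_path(required, os.environ.get("PATH", "")):
        print("Not on PATH:", entry)
```

If the script prints nothing, every expected entry was found.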

Step 8: Create WSL2 Ubuntu 22.04 instance

Open PowerShell

wsl --install -d Ubuntu-22.04

This will prompt you to enter a username and password for the new instance.

NOTE: You may want to read up on WSL2 and how to run multiple instances of a distro. Ideally you would create this image then clone it for future projects but what we are doing works just fine.
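
Cloning an instance, as mentioned in the note, is usually done with wsl --export followed by wsl --import. As a rough sketch, the Python helper below only builds and prints those two commands so you can review them before running anything. The function names, the clone's name, and the folder paths are all placeholders I made up for illustration.

```python
import subprocess


def clone_commands(source, clone, tar_file, install_dir):
    """Build the two wsl.exe invocations that copy `source` into `clone`."""
    return [
        # wsl --export <Distribution Name> <FileName>
        ["wsl", "--export", source, tar_file],
        # wsl --import <Distribution Name> <InstallLocation> <FileName>
        ["wsl", "--import", clone, install_dir, tar_file],
    ]


def run_clone(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmds = clone_commands(
        "Ubuntu-22.04",           # the instance created in this step
        "Ubuntu-22.04-project",   # hypothetical name for the clone
        r"F:\AI\wsl\base.tar",    # hypothetical export file
        r"F:\AI\wsl\project",     # hypothetical install folder
    )
    run_clone(cmds)  # dry run: prints the commands without executing them
```

Pass dry_run=False from a Windows Python session once the printed commands look right.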

Step 9: Configure Ubuntu 22.04

Once inside Ubuntu, enter the commands below. I have condensed them to just 2 lines: one short update and one very long chain of commands.

sudo apt update -y

wget https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/jammy/amdgpu-install_6.1.60103-1_all.deb && sudo apt install ./amdgpu-install_6.1.60103-1_all.deb -y && sudo amdgpu-install --list-usecase && amdgpu-install -y --usecase=wsl,rocm --no-dkms && rocminfo && sudo apt install python3-pip -y && pip3 install --upgrade pip wheel && wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torch-2.1.2%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl && wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/torchvision-0.16.1%2Brocm6.1.3-cp310-cp310-linux_x86_64.whl && wget https://repo.radeon.com/rocm/manylinux/rocm-rel-6.1.3/pytorch_triton_rocm-2.1.0%2Brocm6.1.3.4d510c3a44-cp310-cp310-linux_x86_64.whl && pip3 uninstall torch torchvision pytorch-triton-rocm numpy && pip3 install torch-2.1.2+rocm6.1.3-cp310-cp310-linux_x86_64.whl torchvision-0.16.1+rocm6.1.3-cp310-cp310-linux_x86_64.whl pytorch_triton_rocm-2.1.0+rocm6.1.3.4d510c3a44-cp310-cp310-linux_x86_64.whl numpy==1.26.4 && location=`pip show torch | grep Location | awk -F ": " '{print $2}'` && cd ${location}/torch/lib/ && rm libhsa-runtime64.so* && cp /opt/rocm/lib/libhsa-runtime64.so.1.2 libhsa-runtime64.so && python3 -c 'import torch' 2> /dev/null && echo 'Success' || echo 'Failure' && python3 -c 'import torch; print(torch.cuda.is_available())' && python3 -c "import torch; print(f'device name [0]:', torch.cuda.get_device_name(0))" && python3 -m torch.utils.collect_env

NOTE: I have seen a few people fail to get this working because they did not run the original commands one at a time. To avoid that, I condensed the sequence into a single chain and tested it, so there are only 2 lines to run.
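
If the long command scrolls past without a clear result, it can help to re-run just the final checks on their own. The sketch below gathers the same information in one place. gpu_report is a hypothetical helper name, and the script assumes nothing beyond the torch package installed above; it reports a missing install instead of crashing.

```python
def gpu_report():
    """Collect basic facts about the PyTorch install and the visible GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

Save it as a .py file inside the Ubuntu instance and run it with python3. On a working setup, gpu_available should be True and device_name should show your Radeon card.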

Step 10: Test run PyTorch

From within WSL Ubuntu 22.04 do the following:

cd ~
touch test.py
nano test.py

Paste in the following code. Be prepared to monitor GPU utilization to confirm the workload is actually running on the GPU and not the CPU:

######################################################################
# The MNIST dataset is widely used for introductory machine learning projects.
# Download MNIST dataset using torchvision
# Train the model
# Evaluate the model after training
# Validates our WSL2 environment setup and PyTorch installation for GPU acceleration

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# 1. Define device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 2. Define transforms to normalize the dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

# 3. Load the MNIST dataset
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=1024, shuffle=True)

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1024, shuffle=False)

# 4. Define a simple CNN model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)  # 1 input channel (grayscale), 32 filters, 3x3 kernel
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)  # Max pooling with 2x2 window
        self.fc1 = nn.Linear(64 * 7 * 7, 128)  # Fully connected layer (flattened after 2 pooling operations)
        self.fc2 = nn.Linear(128, 10)  # Output layer for 10 classes (digits 0-9)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)  # Flatten the tensor for fully connected layers
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# 5. Initialize the model, loss function, and optimizer
model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 6. Train the model
for epoch in range(5):  # 5 epochs
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(trainloader, 0):
        inputs, labels = inputs.to(device), labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

        # Backward pass and optimize
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 10 == 9:  # Print every 10 mini-batches (only 59 batches per epoch at batch_size=1024)
            print(f'Epoch {epoch+1}, Batch {i+1}, Loss: {running_loss/10:.4f}')
            running_loss = 0.0

print('Finished Training')

# 7. Evaluate the model on the test set
correct = 0
total = 0
model.eval()  # Set model to evaluation mode (disables dropout, etc.)
with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy on test set: {100 * correct / total:.2f}%')
######################################################################

Save and exit the file: Ctrl+O, Enter, Ctrl+X

Run the following command:

python3 test.py

You should see the loss and accuracy print out as the model trains and evaluates. If you do, you have successfully set up WSL2 with PyTorch for GPU acceleration.
You should also confirm the GPU is under load via Adrenalin, HWMon, HWiNFO64, etc.
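
Besides watching a hardware monitor, a quick timing comparison can hint at whether work is really landing on the GPU. This is a rough sketch, not a benchmark: time_matmul is a hypothetical helper, and the matrix size and repeat count are arbitrary values I picked.

```python
import time


def time_matmul(device, size=1024, repeats=5):
    """Return average seconds per (size x size) matrix multiply on `device`."""
    import torch

    on_gpu = str(device).startswith("cuda")
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    torch.matmul(a, b)  # warm-up so one-time setup cost is not measured
    if on_gpu:
        torch.cuda.synchronize()  # GPU calls are asynchronous; wait before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if on_gpu:
        torch.cuda.synchronize()  # make sure all queued work has finished
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(f"cpu : {time_matmul('cpu'):.4f} s per multiply")
        if torch.cuda.is_available():
            print(f"cuda: {time_matmul('cuda'):.4f} s per multiply")
        else:
            print("No GPU visible to PyTorch; only the CPU was timed")
```

If the cuda line is missing, or its time is no better than the cpu line, the workload is probably not reaching the GPU.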

Step 11: Stable Diffusion WebUI

Repo we will be using: https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu

If for some reason this version is not available I have a copy here: https://drive.google.com/file/d/1Ila5AQjgBmr2g6b1CeC2P_m5vItj8FvP/view?usp=sharing

We are mostly set up already, since ROCm is installed and our environment variables are loaded.

Create a folder where we can download the stable diffusion repo. I will be using F:\AI\tiger

NOTE: Conda is an environment manager that lets you run different versions of Python and packages side by side in isolated environments. We will be using it to run the webui.

Open Miniconda3 and run the following:

f:
cd ai
cd tiger
conda create --name tiger python=3.10.6 -y
conda activate tiger
git clone https://github.com/lshqqytiger/stable-diffusion-webui-amdgpu
cd stable-diffusion-webui-amdgpu

Edit the webui-user.bat file and change the COMMANDLINE_ARGS line to read:

set COMMANDLINE_ARGS=--use-zluda

Save and exit the file
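
If you would rather script this edit than make it by hand, the change amounts to replacing one line. The sketch below works on the file's text and assumes the file already contains a line starting with set COMMANDLINE_ARGS=. The helper name set_commandline_args and the sample text in the demo are illustrative, not taken from the repo.

```python
import re

# Matches the whole "set COMMANDLINE_ARGS=..." line but not its line ending.
_ARGS_LINE = re.compile(r"^set COMMANDLINE_ARGS=[^\r\n]*", re.IGNORECASE | re.MULTILINE)


def set_commandline_args(bat_text, args="--use-zluda"):
    """Return `bat_text` with its `set COMMANDLINE_ARGS=` line set to `args`."""
    if not _ARGS_LINE.search(bat_text):
        raise ValueError("no 'set COMMANDLINE_ARGS=' line found")
    new_line = f"set COMMANDLINE_ARGS={args}"
    # A function replacement keeps backslashes in `args` from being read as escapes.
    return _ARGS_LINE.sub(lambda _match: new_line, bat_text, count=1)


if __name__ == "__main__":
    # Illustrative stand-in for the real file contents.
    sample = "@echo off\n\nset PYTHON=\nset COMMANDLINE_ARGS=\n\ncall webui.bat\n"
    print(set_commandline_args(sample))
```

To apply it for real, read webui-user.bat into a string, pass it through the function, and write the result back. Keep a backup copy of the file first.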

Run the following command from the stable-diffusion-webui-amdgpu folder:

webui-user.bat

This will take a while to run the first time.
