This article is driven by two events:

  1. Meta, the largest AI supplier of this AI season (heavily criticized for its social and VR businesses, yet revered as a living Bodhisattva in the AI world), recently released Llama 2. It is said to compete head-to-head with OpenAI's GPT series, and it is easy to fine-tune.
  2. About a month ago, llama.cpp added support for CLBlast.

So, my AMD Radeon card can now join the fun without much hassle. Below, I'll share how to run llama.cpp + Llama 2 on Ubuntu 22.04 Jammy Jellyfish.

Download the Model

Thanks to TheBloke, who has kindly published converted Llama 2 models for download on Hugging Face.

Choose the version that fits your memory capacity. For example, the 70B version needs roughly 31 GB to 70 GB of memory, depending on the quantization. I downloaded llama-2-13b-chat.ggmlv3.q4_K_M.bin.

The "q4" in the file name indicates a 4-bit quantized version. Save the downloaded .bin file somewhere convenient, e.g., ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin.
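As a rough sanity check before downloading, the size of the weights can be estimated from the parameter count and the average number of bits stored per weight. The sketch below is a back-of-the-envelope calculation only: the function name and the bits-per-weight values in the loop are illustrative assumptions, not figures published for these files, and real memory use is higher once the context buffers are added.

```python
def estimate_weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # Assumed, illustrative bit widths; actual quantization formats store
    # some extra metadata, so the real average differs slightly.
    for bits in (4, 8, 16):
        size = estimate_weights_gb(13, bits)
        print(f"13B model at {bits} bits per weight: about {size:.1f} GB")
```

If the estimate is already close to your total RAM, a smaller model or a lower-bit file is probably the safer choice.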

Compile llama.cpp

On Ubuntu, install the necessary tools and libraries:

sudo apt install git make cmake vim

Clone the llama.cpp code and enter the directory:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

According to the official documentation, simply running make should suffice, but I ran into some issues, so I switched to cmake:

mkdir build
cd build
cmake ..
cmake --build . --config Release

The built programs are located in llama.cpp/build/bin: main is the command-line program and server is the web server.

Copy main and rename it:

cp ./bin/main ../llama-cpu
cd ..

Test It

Inside llama.cpp/examples, there are several test scripts. Copy one and modify it for our own use:

cp examples/chat-13B.sh examples/chat-llama2-13B.sh
vim examples/chat-llama2-13B.sh

Change the MODEL path in examples/chat-llama2-13B.sh to your own, like so:

MODEL="/home/lyric/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin"

In the same script, replace ./main with the name of your copy, ./llama-cpu.

Then run:

./examples/chat-llama2-13B.sh

Depending on your machine's configuration, it may take a while before you can start chatting. The result should look like the image below, where the green text is my input and the white text is Llama 2's response.

[Image: chat session in the terminal]

Enable GPU Acceleration

Download the driver from AMD: https://repo.radeon.com/amdgpu-install/

TIP

Note that the `5.*` versions are the newer ones, and the `2*.*` versions are actually older and not usable. I installed version `5.5`: https://repo.radeon.com/amdgpu-install/5.5/ubuntu/jammy/

After installation, run:

amdgpu-install --usecase=opencl,rocm

On Ubuntu, install the necessary libraries:

sudo apt install ocl-icd-dev ocl-icd-opencl-dev \
	opencl-headers libclblast-dev

Recompile with the -DLLAMA_CLBLAST=ON option:

cd build
cmake .. -DLLAMA_CLBLAST=ON -DCLBlast_dir=/usr/local
cmake --build . --config Release

Copy main and rename it:

cp ./bin/main ../llama-cl
cd ..

Then modify the launch script:

vim examples/chat-llama2-13B.sh

Replace ./llama-cpu with your new name, ./llama-cl.

Then add --n-gpu-layers 40 just before the final "$@" line, making sure every line except the last ends with a backslash. For example, mine is:

./llama-cl $GEN_OPTIONS \
  --model "$MODEL" \
  --threads "$N_THREAD" \
  --n_predict "$N_PREDICTS" \
  --color --interactive \
  --file ${PROMPT_FILE} \
  --reverse-prompt "${USER_NAME}:" \
  --in-prefix ' ' \
  --n-gpu-layers 40 \
  "$@"
TIP

The --n-gpu-layers option offloads model layers to VRAM to accelerate token generation. I set it to 40 for my card. You can also set a very large number, like 100000, and llama.cpp will cap it at the number of layers the model actually has. If that many layers do not fit in your card's VRAM, lower the value.

Then run:

./examples/chat-llama2-13B.sh

Theoretically, using GPU acceleration should significantly reduce waiting time, and you should see output indicating GPU acceleration, like:

ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
...
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required  =  710.19 MB (+ 1600.00 MB per state)
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloaded 40/41 layers to GPU
llama_model_load_internal: total VRAM used: 7285 MB
...

This indicates GPU acceleration is in use. On my computer, the generation speed reached over 600 tokens per second, which feels incredibly fast.
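The log above also gives a crude way to pick --n-gpu-layers for a card with less memory: 40 offloaded layers used 7285 MB of VRAM, so each layer of this particular model file costs very roughly 182 MB. The sketch below turns that into an estimate. The function name and the reserve_mb parameter are hypothetical, and the per-layer figure is only an average that ignores scratch buffers, so treat the result as a starting point to adjust by trial.

```python
def layers_that_fit(vram_mb: float, mb_per_layer: float,
                    total_layers: int, reserve_mb: float = 0.0) -> int:
    """Estimate how many layers fit in VRAM, capped at the model's layer count."""
    usable_mb = max(vram_mb - reserve_mb, 0.0)
    return min(int(usable_mb // mb_per_layer), total_layers)


# Average taken from the log above: 7285 MB of VRAM for 40 offloaded layers.
MB_PER_LAYER = 7285 / 40

if __name__ == "__main__":
    for vram in (4096, 8192, 16384):
        n = layers_that_fit(vram, MB_PER_LAYER, total_layers=41, reserve_mb=512)
        print(f"{vram} MB of VRAM: try --n-gpu-layers {n}")
```

The reserve is there because the desktop and other programs usually hold some VRAM too; how much to leave free is a guess that depends on your setup.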

Run a Service

llama.cpp also provides a server, which you can learn about in the official documentation.

For me, I simply ran the server binary from the build/bin directory:

./server -m ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
	-c 2048 -ngl 40 --port 10081

Then, opening http://localhost:10081 allowed me to use the Web UI.

This Web server supports API requests, for example:

curl --request POST \
    --url http://localhost:10081/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 steps:", "n_predict": 128}'

This makes it very convenient to experiment with the model.
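The same request can be sent from a script. The following is a minimal sketch using only the Python standard library; it mirrors the curl call above, with the port 10081 taken from my server command. The function names are my own, and reading the generated text from a "content" field in the JSON reply is an assumption about the server's response format that you should check against the official server documentation.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=128,
                             base_url="http://localhost:10081"):
    """Build the POST request for the /completion endpoint without sending it."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, n_predict=128, base_url="http://localhost:10081", timeout=120):
    """Send the request and return the generated text."""
    request = build_completion_request(prompt, n_predict, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the reply is a JSON object with the text under "content".
    return body.get("content", "")
```

With the server running, calling complete("Building a website can be done in 10 steps:") should return the model's continuation as a string.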

Troubleshooting

OpenCL Permission Issues

There may be cases where OpenCL functions are only accessible with root permissions. For instance, running clinfo might fail to find OpenCL, but sudo clinfo works just fine.

In such cases, execute the following (replacing LOGIN_NAME with your username):

sudo usermod -a -G video LOGIN_NAME
sudo usermod -a -G render LOGIN_NAME

This grants the user the necessary permissions. Log out and back in (or reboot) for the new group membership to take effect.
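To confirm that the change has taken effect, you can check which groups the current process belongs to. The sketch below uses Python's standard os and grp modules (Unix only); the helper names are hypothetical, and the two group names are simply the ones used in the usermod commands above.

```python
import grp
import os

REQUIRED_GROUPS = ("video", "render")


def current_group_names():
    """Return the names of all groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # The group id has no entry in the group database; skip it.
            pass
    return names


def missing_groups(current, required=REQUIRED_GROUPS):
    """List the required groups that are not in the given set of names."""
    return [name for name in required if name not in current]


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    if missing:
        print("Not yet a member of:", ", ".join(missing))
    else:
        print("Member of both video and render.")
```

If it still reports a missing group right after running usermod, the session most likely predates the change.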

About Llama 2's Quirks

OpenAI's ChatGPT has undergone extensive prompt engineering and optimization, but Llama 2, run on its own, has none of these refinements. If Llama 2 seems unintelligent, give it a more detailed prompt; otherwise it may not meet your expectations.

For example, if you want Llama 2 to output JSON, your prompt should include several examples of the JSON you expect, as in this intent recognition example:

You read the following text and recognize the user's intent.
Possible intents are:

1. "eating"
2. "sleeping"
3. "fighting"
9999. "unknown intent"

You must return the intent with the highest confidence.
You must return the result in JSON format.
Here is the template:
{ "id": id, "intent": "USER'S INTENT", "confidence": 0.9 }

**instructions: I'm hungry**
{ "id": 1, "intent": "eating", "confidence": 0.9 }

**instructions: I'm tired and want to sleep**
{ "id": 2, "intent": "sleeping", "confidence": 0.9 }

**instructions: Where's the bean? I want to hit it**
{ "id": 3, "intent": "fighting", "confidence": 0.7 }

**instructions: What time is it?**
{ "id": 9999, "intent": "unknown intent", "confidence": 0.9 }

Configured in the Web UI, it looks like this:

[Image: the prompt configured in the Web UI]

The result looks like this:

[Image: the model's JSON output in the Web UI]

Not bad~
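If you want to drive the same few-shot prompt from code instead of the Web UI, the idea can be sketched as follows. This is only an illustration under my own assumptions: the function names are hypothetical, the examples are a subset of the ones shown above, and the parser simply takes the first {...} span in the reply, which is enough for the flat JSON objects used here but not for nested ones.

```python
import json

INTENTS = {1: "eating", 2: "sleeping", 3: "fighting", 9999: "unknown intent"}

# (user text, intent id, confidence) pairs used as few-shot examples.
EXAMPLES = [
    ("I'm hungry", 1, 0.9),
    ("I'm tired and want to sleep", 2, 0.9),
    ("What time is it?", 9999, 0.9),
]


def build_prompt(user_text):
    """Assemble the instruction text, the worked examples, and the new input."""
    lines = [
        "You read the following text and recognize the user's intent.",
        "Possible intents are:",
        "",
    ]
    lines += [f'{intent_id}. "{name}"' for intent_id, name in INTENTS.items()]
    lines += [
        "",
        "You must return the intent with the highest confidence.",
        "You must return the result in JSON format.",
        "",
    ]
    for text, intent_id, confidence in EXAMPLES:
        answer = {"id": intent_id, "intent": INTENTS[intent_id], "confidence": confidence}
        lines += [f"**instructions: {text}**", json.dumps(answer), ""]
    lines.append(f"**instructions: {user_text}**")
    return "\n".join(lines) + "\n"


def parse_intent(reply):
    """Return the first flat JSON object found in the reply, or None."""
    start = reply.find("{")
    end = reply.find("}", start)
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
```

The prompt ends right after the new instruction line so that the model's most natural continuation is a single JSON object in the same shape as the examples.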