Importing Yi-9B to Ollama
A log of importing the Yi-9B LLM into the Ollama library.

Get The Yi-9B For Your Ollama

Here is the model: https://ollama.com/shinyzhu/yayi

Or just run it from terminal:

ollama run shinyzhu/yayi:9b --verbose

And you can run Yi-9B-200K too:

ollama run shinyzhu/yayi:9b-200k --verbose

Why I Do This

If you like to run LLM models locally with Ollama, like I do,

AND

you've discovered a better model that isn't published in the Ollama library yet,

then you can import the model to Ollama yourself by following the import doc.

This post is the log of the import steps on my machine. Let me show you.

Prerequisites

First of all, check what format the model is published in. This post walks through importing Yi-9B, which is published as safetensors on HuggingFace, so we'll follow the Importing (PyTorch & Safetensors) guide.

According to the doc, you need a dev machine to run the Python and make jobs.

Here I use a VM on my NAS, which has an i3-N300 CPU and 16 GB of RAM in total. The VM is configured with 4 CPU cores, 8 GB RAM, and 64 GB of storage. I just hoped it would work. But the import is mostly building and converting tasks, and it ran at an acceptable speed.

The OS is Ubuntu 22.04. You need to install these packages first:

sudo apt install python3.10-venv make gcc g++

And install Git LFS:

wget https://github.com/git-lfs/git-lfs/releases/download/v3.5.1/git-lfs-linux-amd64-v3.5.1.tar.gz
tar -xzf git-lfs-linux-amd64-v3.5.1.tar.gz
cd git-lfs-3.5.1
sudo ./install.sh
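
To double-check that Git LFS is hooked into your Git config (my own sanity check, not from the doc), you can run:

git lfs install
git lfs version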

Or you can install them whenever you get stuck along the way. Actually, that's what I did.

Now we are ready to go.

Setup Ollama & Llama.cpp

I'll show the shell output here.

shiny@ubuntuinr2:~/llmbuild$ git clone https://github.com/ollama/ollama.git ollama
Cloning into 'ollama'...
remote: Enumerating objects: 12866, done.
remote: Counting objects: 100% (1143/1143), done.
remote: Compressing objects: 100% (545/545), done.
remote: Total 12866 (delta 614), reused 1014 (delta 530), pack-reused 11723
Receiving objects: 100% (12866/12866), 7.64 MiB | 5.20 MiB/s, done.
Resolving deltas: 100% (8001/8001), done.

shiny@ubuntuinr2:~/llmbuild$ cd ollama/
shiny@ubuntuinr2:~/llmbuild/ollama$ git submodule init
Submodule 'llama.cpp' (https://github.com/ggerganov/llama.cpp.git) registered for path 'llm/llama.cpp'

shiny@ubuntuinr2:~/llmbuild/ollama$ git submodule update llm/llama.cpp
remote: Enumerating objects: 14257, done.
remote: Counting objects: 100% (14257/14257), done.
remote: Compressing objects: 100% (3902/3902), done.
remote: Total 13898 (delta 10363), reused 13382 (delta 9882), pack-reused 0
Receiving objects: 100% (13898/13898), 11.01 MiB | 5.65 MiB/s, done.
Resolving deltas: 100% (10363/10363), completed with 287 local objects.
From https://github.com/ggerganov/llama.cpp
 * branch            37e7854c104301c5b5323ccc40e07699f3a62c3e -> FETCH_HEAD
Submodule path 'llm/llama.cpp': checked out '37e7854c104301c5b5323ccc40e07699f3a62c3e'

shiny@ubuntuinr2:~/llmbuild/ollama$ python3 -m venv llm/llama.cpp/.venv
shiny@ubuntuinr2:~/llmbuild/ollama$ source llm/llama.cpp/.venv/bin/activate
(.venv) shiny@ubuntuinr2:~/llmbuild/ollama$ pip install -r llm/llama.cpp/requirements.txt

Successfully installed MarkupSafe-2.1.5 certifi-2024.2.2 charset-normalizer-3.3.2 einops-0.7.0 filelock-3.13.3 fsspec-2024.3.1 gguf-0.6.0 huggingface-hub-0.22.2 idna-3.6 jinja2-3.1.3 mpmath-1.3.0 networkx-3.3 numpy-1.24.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.1.105 packaging-24.0 protobuf-4.25.3 pyyaml-6.0.1 regex-2023.12.25 requests-2.31.0 safetensors-0.4.2 sentencepiece-0.1.99 sympy-1.12 tokenizers-0.15.2 torch-2.1.2 tqdm-4.66.2 transformers-4.39.3 triton-2.1.0 typing-extensions-4.11.0 urllib3-2.2.1

(.venv) shiny@ubuntuinr2:~/llmbuild/ollama$ make -C llm/llama.cpp quantize
make: Entering directory '/home/shiny/llmbuild/ollama/llm/llama.cpp'
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -c common/build-info.cpp -o build-info.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml.c -o ggml.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -c llama.cpp -o llama.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-alloc.c -o ggml-alloc.o
cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion    -c ggml-backend.c -o ggml-backend.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion     -c ggml-quants.c -o ggml-quants.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -c unicode.cpp -o unicode.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -c unicode-data.cpp -o unicode-data.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -c examples/quantize/quantize.cpp -o examples/quantize/quantize.o
g++ -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  build-info.o ggml.o llama.o ggml-alloc.o ggml-backend.o ggml-quants.o unicode.o unicode-data.o examples/quantize/quantize.o -o quantize  
make: Leaving directory '/home/shiny/llmbuild/ollama/llm/llama.cpp'

See? Just follow the Ollama doc. Haha.
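
One extra sanity check before moving on (my own addition, not part of the doc): confirm the quantize binary was actually produced by the build.

ls -lh llm/llama.cpp/quantize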

Clone Model: Yi-9B

I failed to clone from HuggingFace, so I switched to https://www.modelscope.cn/models/01ai/Yi-9B/files.

(.venv) shiny@ubuntuinr2:/data/llmbuild/ollama$ git clone https://www.modelscope.cn/01ai/Yi-9B.git model
Cloning into 'model'...
remote: Enumerating objects: 105, done.
remote: Counting objects: 100% (105/105), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 105 (delta 56), reused 90 (delta 44), pack-reused 0
Receiving objects: 100% (105/105), 1.04 MiB | 5.79 MiB/s, done.
Resolving deltas: 100% (56/56), done.
Filtering content: 100% (3/3), 4.44 GiB | 9.70 MiB/s, done.
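
Because the weights come down through Git LFS, it's worth confirming you got real safetensors files rather than tiny LFS pointer files. This is my own check, not part of the guide:

git -C model lfs ls-files
ls -lh model/*.safetensors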

Convert & Quantize

(.venv) shiny@ubuntuinr2:/data/llmbuild/ollama$ python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin
Loading model file model/model-00001-of-00002.safetensors
Loading model file model/model-00001-of-00002.safetensors
Loading model file model/model-00002-of-00002.safetensors
params = Params(n_vocab=64000, n_embd=4096, n_layer=48, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=4, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=None, f_rope_freq_base=10000, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('model'))
Loaded vocab file PosixPath('model/tokenizer.model'), type 'spm'
Vocab info: <SentencePieceVocab with 64000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens {'bos': False, 'eos': False}>
Permuting layer 0

Wrote converted.bin
(.venv) shiny@ubuntuinr2:/data/llmbuild/ollama$ llm/llama.cpp/quantize converted.bin quantized.bin q4_0
main: build = 2581 (37e7854)
main: built with  for unknown
main: quantizing 'converted.bin' to 'quantized.bin' as Q4_0
llama_model_loader: loaded meta data with 23 key-value pairs and 435 tensors from converted.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama

[ 435/ 435]                   output_norm.weight - [ 4096,     1,     1,     1], type =    f32, size =    0.016 MB
llama_model_quantize_internal: model size  = 16841.52 MB
llama_model_quantize_internal: quant size  =  4802.22 MB

main: quantize time = 313916.31 ms
main:    total time = 313916.31 ms

Then you'll get a quantized.bin file. Transfer it to your local machine (a MacBook Pro in my case) if you like.
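
If your Ollama machine is not the build machine, any transfer method works; a minimal scp sketch (the hostname and target path are placeholders, adjust them to your setup):

scp quantized.bin you@your-macbook.local:~/yi9b/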

Create Your Model

Create a Modelfile alongside the quantized.bin.

FROM quantized.bin
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
SYSTEM """
You are a helpful and powerful assistant. Respond to user's input carefully.
"""

Then run ollama to create a model:

yi9b % ollama create yayi:9b -f Modelfile 
transferring model data 
creating model layer 
creating template layer 
creating parameters layer 
creating config layer 
using already created layer sha256:0def2b0dd002d747de88ba0b8f396d4d4cb4fcc4daa71ab34bd68834f09b4734 
using already created layer sha256:a47b02e00552cd7022ea700b1abf8c572bb26c9bc8c1a37e01b566f2344df5dc 
using already created layer sha256:f02dd72bb2423204352eabc5637b44d79d17f109fdb510a7c51455892aa2d216 
writing layer sha256:85263e68fb902f60df147b8cf5984fdc76e84fb5783b4a85591ef7fb0a498e6f 
writing manifest 
success 

Using The Model

Now you can use the yayi:9b model like any other model you have.

[Screenshot: running yayi:9b in the Ollama CLI]
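
You can also call it through Ollama's local REST API instead of the interactive CLI; a minimal sketch, assuming Ollama is listening on its default port 11434:

curl http://localhost:11434/api/generate -d '{
  "model": "yayi:9b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'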

[Updated] Update The Model

After about two weeks, there were many new commits to both the model and Ollama, so I decided to re-create the model. Here is what I did.

Update the Ollama and llama.cpp git repos with these commands:

shiny@ubuntuinr2:/data/llmbuild/ollama$ git pull
shiny@ubuntuinr2:/data/llmbuild/ollama$ git submodule update --remote --merge

Remove “old” files.

shiny@ubuntuinr2:/data/llmbuild/ollama$ rm -f converted.bin
shiny@ubuntuinr2:/data/llmbuild/ollama$ rm -f quantized.bin

Update model repo:

shiny@ubuntuinr2:/data/llmbuild/ollama$ cd model/
shiny@ubuntuinr2:/data/llmbuild/ollama/model$ git pull
shiny@ubuntuinr2:/data/llmbuild/ollama$ cd ..

Then you are ready to build it again.

shiny@ubuntuinr2:/data/llmbuild/ollama$ source llm/llama.cpp/.venv/bin/activate
(.venv) shiny@ubuntuinr2:/data/llmbuild/ollama$ python llm/llama.cpp/convert.py ./model --outtype f16 --outfile converted.bin
(.venv) shiny@ubuntuinr2:/data/llmbuild/ollama$ llm/llama.cpp/quantize converted.bin quantized.bin q4_0

Create the model with Ollama, push it to the library, and TADA!
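
For reference, the create-and-push step looks roughly like this (shinyzhu is my namespace; use your own, and make sure your Ollama account key is set up first):

ollama create yayi:9b -f Modelfile
ollama cp yayi:9b shinyzhu/yayi:9b
ollama push shinyzhu/yayi:9b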

Fun Facts

Here are some fun things I learned along the way.

The compression of Git LFS is SUPER!

If you have a look at the sizes of the tensor files, they are about 16.5 GB in total across 2 files!

-rw-rw-r--  1 shiny shiny  9.3G Apr  7 09:49 model-00001-of-00002.safetensors
-rw-rw-r--  1 shiny shiny  7.2G Apr  7 09:47 model-00002-of-00002.safetensors

But did you see this:

Filtering content: 100% (3/3), 4.44 GiB | 9.70 MiB/s, done.

What? Just 4.44 GiB? And DONE?

I believe it benefited from Git LFS.

The outputs are really HUGE

Sizes matter again.

shiny@ubuntuinr2:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb         63G   59G  444M 100% /data

OK. It's full of LLM model files. :D

shiny@ubuntuinr2:/data/llmbuild/ollama$ du -h -d 1
33G     ./model
4.8G    ./llm
59G     .
4.7G    llm/llama.cpp/.venv

It’s ok for a build machine.

-rw-rw-r--  1 shiny shiny  17G Apr  7 10:06 converted.bin
-rw-rw-r--  1 shiny shiny 4.7G Apr  7 10:11 quantized.bin

The final quantized model is about 5 GB.

Python venv Error?

If you change the path of the .venv, for example by moving it to another volume, you need to re-create it.
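
Re-creating it only takes a minute; a sketch using the same paths as above, run from the ollama checkout:

rm -rf llm/llama.cpp/.venv
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt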

Questions and Further Doings

See what’s wrong with the responses:

[Screenshot: odd responses from yayi:9b]

Here is the first version of the Modelfile, which I copied from the doc.

FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"

So I changed it to the template that the official Yi models use.

I also added a system prompt to it.

Question here:

How can I turn a base model like Yi-9B into a chat model with less effort?

It doesn't seem to work correctly, e.g. this.

Question here:

How can I measure or explain how the model works?

Feedback

Any feedback would be greatly appreciated.

Don't forget to try it out: https://ollama.com/shinyzhu/yayi


Last modified on April 8, 2024