So I've actually been bringing self-hosted AI into my work lately
極北鷲 2024-11-22 00:16:18 Just sharing
TL;DR: host your own OpenAI-API-compatible server, then use it through a VSCode extension (with your boss's / infra team's approval, of course)

Frontend
There are currently several open-source VSCode extensions that support an OpenAI-compatible API server. I use Continue myself. A few ready-made options:
https://www.continue.dev/
https://github.com/cline/cline
(Neovim) https://github.com/yetone/avante.nvim
Continue can chat, autocomplete, and generate code; pretty nice to use
:^(


Backend
There are plenty of choices for the backend server: ollama, text-generation-webui, and so on. If you want to run the model purely on GPU (I use a 3090 myself), you can use vLLM or tabbyAPI. I use the latter.
https://github.com/theroyallab/tabbyAPI
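Since these backends expose an OpenAI-compatible endpoint, any standard OpenAI client can talk to them. A minimal sketch with the official openai Python package; the port and the API key here are placeholders (tabbyAPI, vLLM and ollama all serve a /v1 endpoint, but on different ports and with their own auth settings), so adjust to your own config:

# Sketch: calling a self-hosted OpenAI-compatible server with the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5000/v1",   # assumption: adjust to your backend's host/port
    api_key="YOUR_LOCAL_API_KEY",          # whatever key your backend is configured with
)

resp = client.chat.completions.create(
    model="Qwen2.5-Coder-32B-Instruct",    # model name as the backend reports it
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(resp.choices[0].message.content)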
tabbyAPI supports speculative decoding: roughly, a smaller model (the draft model) predicts the bigger model's output, which can speed up generation a lot (at least 25% faster in my case).
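For intuition, here's a toy sketch of the idea (greedy variant). It's purely illustrative, not tabbyAPI's actual implementation; draft_next_token() and target_argmax_tokens() are hypothetical stand-ins for the small and large models:

# Toy illustration of greedy speculative decoding.
def speculative_step(prompt_tokens, k=4):
    # 1) The cheap draft model proposes k tokens autoregressively.
    draft = []
    ctx = list(prompt_tokens)
    for _ in range(k):
        t = draft_next_token(ctx)          # hypothetical: small model, one token
        draft.append(t)
        ctx.append(t)

    # 2) The big model checks the whole proposed span in one forward pass,
    #    returning its own greedy choice at each of the k+1 positions.
    target = target_argmax_tokens(prompt_tokens, draft)   # hypothetical: large model

    # 3) Accept draft tokens while they match the big model's choices;
    #    on the first mismatch, keep the big model's token and stop.
    accepted = []
    for proposed, preferred in zip(draft, target):
        if proposed == preferred:
            accepted.append(proposed)
        else:
            accepted.append(preferred)
            break
    else:
        accepted.append(target[-1])        # all k accepted: one bonus token for free

    return accepted                        # always >= 1 token per big-model pass

The point is that every expensive large-model pass now yields at least one token, and often several when the draft model guesses right, which is where the speedup comes from (and, with greedy decoding, the output matches what the large model would have produced on its own).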

Model
For the model, I'm currently using Qwen2.5-Coder, made by Alibaba; the community says it's currently the strongest self-hosted coding model.
If you use tabbyAPI, you need an exl2-quantized model. I use these two:
Main model: https://huggingface.co/lucyknada/Qwen_Qwen2.5-Coder-32B-Instruct-exl2
Draft model: https://huggingface.co/lucyknada/Qwen_Qwen2.5-Coder-1.5B-Instruct-exl2
Both are the 4.0bpw variant.
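exl2 quants are usually published as separate branches in the repo, one per bpw. A hedged sketch of grabbing one with huggingface_hub; the branch name "4.0bpw" is an assumption, so check the repo's actual branch list first:

# Sketch: downloading one bpw branch of an exl2 quant with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="lucyknada/Qwen_Qwen2.5-Coder-32B-Instruct-exl2",
    revision="4.0bpw",     # assumed branch name for the 4.0bpw variant
    local_dir="models/Qwen2.5-Coder-32B-Instruct-exl2-4.0bpw",
)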
The 3090's 24 GB of VRAM can't fit those two together as-is, though; you have to set cache_mode: Q8 in the tabbyAPI config. With that setup it uses just about all of the VRAM:
user@ai:~$ nvidia-smi
Fri Nov 22 00:05:06 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:05:00.0 Off |                  N/A |
|  0%   32C    P8              48W / 390W |  24242MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     10972      C   /home/user/tabbyAPI/.venv/bin/python    24236MiB |
+---------------------------------------------------------------------------------------+
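Some rough back-of-envelope numbers on why it's so tight (approximations only, not measurements):

# Rough VRAM budget; all figures are approximate.
main_params_b  = 32.8    # Qwen2.5-Coder-32B, billions of parameters (approx.)
draft_params_b = 1.5     # draft model, billions of parameters (approx.)
bpw            = 4.0     # exl2 bits per weight

main_weights_gb  = main_params_b  * bpw / 8   # ~16.4 GB
draft_weights_gb = draft_params_b * bpw / 8   # ~0.8 GB

# That leaves only ~6-7 GB of the 3090's 24 GB for KV cache, activations and
# CUDA overhead, which is why quantizing the cache from FP16 down to Q8
# (cache_mode: Q8 in the tabbyAPI config) is what makes everything fit.
print(f"weights total: {main_weights_gb + draft_weights_gb:.1f} GB")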

The 48 W idle here is because I've got an EDID emulator plugged into the GPU
:^(
The machine doubles as a Sunshine server, and that needs the EDID emulator to work (a Windows thing, apparently). Normal idle should be much lower.

VPN
Last step is connecting back to the server from the office.
That part isn't hard; Tailscale or WireGuard will do it. Convincing your company's infra team is probably the trickier part
:^(
That one's on you


極北鷲 2024-11-22 00:24:01 Also, I've been meaning to benchmark the setup myself,
but for some reason the benchmark keeps crashing
:^(
Been busy lately, haven't had time to look into it

benchmark tools:
https://github.com/EleutherAI/lm-evaluation-harness
諸如此類 2024-11-22 00:26:38 "For the model, I'm currently using Qwen2.5-Coder, made by Alibaba; the community says it's currently the strongest self-hosted coding model"

This thread is over
細滴大雨雲 2024-11-22 00:27:09 Leaving a mark to read tomorrow
品客薯條 2024-11-22 00:28:19 Alibaba
:^(
極北鷲 2024-11-22 00:30:46 If not Qwen, I wouldn't know what else to use; I just go by what the foreigners recommend.
To be honest I haven't followed self-hosted LLMs for a while; last I heard the strongest was Codestral, you can check that out.
The model can be swapped out freely; it doesn't affect how the backend/frontend works.
炎明禎 2024-11-22 00:35:38 Lm
極北鷲 2024-11-22 00:37:22 If you don't have a GPU, or only have a little VRAM, use the GGUF-quantized versions.
They can run the model on CPU + RAM, and can also offload some of the layers to your GPU to speed up inference.

But pick a Q4 variant or above: as I understand it, below Q4 the output quality drops off sharply.
(For exl2 it seems to be at least 4.0bpw? I'm not sure about that.)

Also, if you go with GGUF, just use ollama as the backend; setup is dead simple
https://ollama.com/
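A quick sketch of what that looks like in practice, assuming the qwen2.5-coder tag in the Ollama library and the ollama Python client (pip install ollama); run `ollama pull qwen2.5-coder:7b` first and make sure ollama is running:

# Sketch: chatting with a GGUF model through ollama's Python client.
import ollama

resp = ollama.chat(
    model="qwen2.5-coder:7b",   # pick a tag that fits your VRAM/RAM
    messages=[{"role": "user", "content": "Explain what a GGUF quantization level like Q4_K_M means."}],
)
print(resp["message"]["content"])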
茫茫星海 2024-11-22 00:40:18 I don't really get it; is there any difference from just using ChatGPT?
極北鷲 2024-11-22 00:42:37 The way to find a quantized version of a model is to go to the base model's repo on Hugging Face first; for me that's Qwen/Qwen2.5-Coder-32B-Instruct

On the right-hand side of the page there are hyperlinks that let you quickly find finetuned/quantized versions:
:^(


If you use ollama you don't even need to do that: just run ollama run <model name>:<branch> and you're done.
ollama model list:
https://ollama.com/search
極北鷲 2024-11-22 00:43:54 The difference is that ChatGPT is hosted by OpenAI, while this one is hosted by me.

You can think of it as a private ChatGPT


大王子小王子 2024-11-22 00:47:52 Can this generate images? For example, give it a big chunk of text and have it generate a mind map
標槍佬會攪掂 2024-11-22 00:52:09 https://x.com/bnjmn_marie/status/1850805329610625355
蛋散一舊飯 2024-11-22 00:52:43 Why not just use Cursor instead? Hosting your own API server is such a hassle, and new models keep coming out, so you'll be patching it all the time
極北鷲 2024-11-22 00:52:56 This can't generate images

I haven't looked into image generation much; you'd probably use ComfyUI/Forge (both of those bundle the backend + frontend)

The models fresh off the press right now are Stable Diffusion 3.5 and Flux; both have GGUF versions to lower VRAM usage (though generation gets very slow, supposedly over a minute per image)
:^(

For something older, SDXL also works
極北鷲 2024-11-22 00:56:26
:^(


But honestly, it's really not that much hassle.
The frontend and backend are both set-up-and-forget.
When a new model comes out, you just swap it in.
:^(
And honestly, I shouldn't need to switch any time soon; Qwen2.5-Coder's output is genuinely OK
極北鷲 2024-11-22 01:07:40 https://simonwillison.net/2024/Nov/12/qwen25-coder/

That said, the strongest right now is still Claude Sonnet
:^(
旋風管家一拳超人 2024-11-22 01:08:43 why not use copilot
蛋散一舊飯 2024-11-22 01:09:31 Sonnet's code is the most correct and the fastest; for simple refactoring there's pretty much nothing to change
極北鷲 2024-11-22 01:11:41 Because I like self-hosting
:^(


But honestly, if you're willing to pay, go with either Copilot, or any of the frontends I mentioned + the Claude API
駐連燈首席美軍 2024-11-22 02:24:27 Isn't it just developed on top of Llama anyway


debugger; 2024-11-22 03:03:43 Local LLMs have way more upsides; not having to send company stuff out the door already makes a huge difference
大棍巴 2024-11-22 06:38:39 Honestly, no need to explain to them in detail why Qwen is good
:^(

Tell them to go read Reddit and the Livebench/Aider leaderboards themselves and learn what open model/open weights means
:^(


Good of you to put this post together, OP
:^(

For now the best local model for coding is still Qwen,
but Qwen's coder variant apparently wasn't trained on tool calling, so if you just use it through a plain API call, Cline won't work. Have you run into any problems with it?
:^(
大棍巴 2024-11-22 07:15:29 First I've heard of that, source?
Peter_Pan 2024-11-22 08:36:33
:^(
I use an LLM for translating the new FF14 story content too.
Not sure why, but my M2 Pro MacBook hosts the same model faster than my PC with a 3080 GPU.