Deploying Lightweight, Cross-Platform LLM Agents on the Ascend 910B

Digest 2024-10-25 18:10 Beijing

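The Ascend 910B is a popular alternative to the Nvidia H100^[1] in the Chinese market. While it is a powerful engine for AI training workloads, we are most interested in its inference performance. That matters all the more now that new Ascend NPUs targeting edge devices^[2] have been released.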

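Recently, Huawei generously donated 5 bare-metal servers, each equipped with 8 Ascend 910B cards, to support the GOSIM^[3] Super Agent hackathon. Each of these machines retails for over USD 100,000. We provided the participating student teams with OpenAI-compatible API services for popular LLMs^[4] running on these machines. The large VRAM (64GB) let us run a 70B LLM, quantized to 4 bits, on each Ascend 910B NPU.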
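As a rough check on why a 4-bit 70B model fits in 64GB, the arithmetic can be sketched as follows. The nominal parameter count of 70 billion and the figure of 4.5 bits per weight (llama.cpp's Q4_0 format stores each block of 32 four-bit weights together with one 16-bit scale) are assumptions about the model file, not numbers from this article. The KV cache and runtime buffers are ignored.

```python
# Back-of-envelope weight memory for a quantized model (illustrative only).

def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NPU_MEMORY_GB = 64   # per Ascend 910B, as stated in the article
PARAMS_70B = 70e9    # assumption: nominal parameter count

fp16 = weight_gb(PARAMS_70B, 16)    # unquantized half precision
q4_0 = weight_gb(PARAMS_70B, 4.5)   # assumption: 4-bit weights plus per-block scales

print(f"FP16: {fp16:.1f} GB, fits in 64GB: {fp16 <= NPU_MEMORY_GB}")
print(f"Q4_0: {q4_0:.1f} GB, fits in 64GB: {q4_0 <= NPU_MEMORY_GB}")
```

Under these assumptions the 4-bit weights take roughly 39 GB, which leaves headroom for the context on a single NPU, while FP16 weights would not fit.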

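The Agent hackathon used LLM agent frameworks such as MoFA^[5] and LangChain, consuming tens of millions of tokens per day, and the NPUs handled the load with ease. In this article, we discuss our experience with the Ascend 910B and give a detailed tutorial on how to set up and run LLMs on this hardware.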

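Lightweight and cross-platform LLM applications

The main use case for open-source LLMs is on heterogeneous edge devices. For the edge, Python and PyTorch are too bloated, with complex dependencies and an insecure software supply chain. However, without the device backend abstraction that Python provides, developers using languages such as Rust and C/C++ would need to recompile, or even rewrite, their applications for each GPU device.

Suppose you are a developer with a MacBook laptop. You compile an LLM inference application written in Rust and test it on the laptop. Most likely, you built it against the Apple Metal framework on an Apple M-series chip. The chance that this compiled binary runs as-is on an Nvidia CUDA device is zero.

The problem is especially severe for emerging GPU and NPU vendors such as Ascend. Ascend NPUs require their own runtime framework, CANN^[6] (similar to Nvidia's CUDA). Few developers have access to Ascend/CANN, and fewer still develop applications specifically for the platform.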

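One solution is WasmEdge Runtime^[7], an open-source project under the Linux Foundation and CNCF, which provides a GPU abstraction with native performance. With WasmEdge's standard WASI-NN API, developers only need to compile their application to Wasm, and it runs on any GPU or NPU that WasmEdge supports.

WasmEdge's support for Ascend NPUs and the CANN framework builds on open-source contributions to the llama.cpp project^[8].

Compared with Python and PyTorch, the WasmEdge runtime is only 1% of the size and has no dependencies beyond the operating system libraries and device drivers, which makes it lighter, more secure, and better suited to edge devices.

For this hackathon, we used the following OpenAI-compatible API servers built on WasmEdge. They are written in Rust and compiled to cross-platform Wasm to run on the Ascend 910B.

LlamaEdge^[9] is a componentized API server that can run a wide variety of AI models, including LLMs, Stable Diffusion/Flux models, Whisper models, and TTS models.
A Gaia node^[10] is a fully integrated stack of LLMs, prompts, vector knowledge bases, access control, load balancers, and domain services for serving knowledge-supplemented LLMs at scale.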

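Docker container for Ascend

Although the WasmEdge runtime is cross-platform, it does not yet ship a prebuilt release asset for Ascend. The easiest way to use WasmEdge on a bare-metal Ascend 910B server is through a Docker image, which builds the WasmEdge binaries for the CANN drivers inside the container. The Dockerfile is as follows.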

FROM dockerproxy.cn/hydai/expr-repo-src-base AS src
FROM dockerproxy.cn/ascendai/cann:8.0.rc1-910b-openeuler22.03

COPY --from=src /fmt /src/fmt
COPY --from=src /spdlog /src/spdlog
COPY --from=src /llama.cpp /src/llama.cpp
COPY --from=src /simdjson/ /src/simdjson
COPY ./WasmEdge /src/WasmEdge

ENV ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ENV LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LIBRARY_PATH
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:${ASCEND_TOOLKIT_HOME}/lib64/plugin/opskernel:${ASCEND_TOOLKIT_HOME}/lib64/plugin/nnengine:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe/op_tiling:${LD_LIBRARY_PATH}
ENV PYTHONPATH=${ASCEND_TOOLKIT_HOME}/python/site-packages:${ASCEND_TOOLKIT_HOME}/opp/built-in/op_impl/ai_core/tbe:${PYTHONPATH}
ENV PATH=${ASCEND_TOOLKIT_HOME}/bin:${ASCEND_TOOLKIT_HOME}/compiler/ccec_compiler/bin:${PATH}
ENV ASCEND_AICPU_PATH=${ASCEND_TOOLKIT_HOME}
ENV ASCEND_OPP_PATH=${ASCEND_TOOLKIT_HOME}/opp
ENV TOOLCHAIN_HOME=${ASCEND_TOOLKIT_HOME}/toolkit
ENV ASCEND_HOME_PATH=${ASCEND_TOOLKIT_HOME}
ENV LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/runtime/lib64/stub:$LD_LIBRARY_PATH

RUN yum install -y git gcc g++ cmake make llvm15-devel zlib-devel libxml2-devel libffi-devel
RUN cd /src/WasmEdge && source /usr/local/Ascend/ascend-toolkit/set_env.sh --force && \
  cmake -Bbuild -DCMAKE_BUILD_TYPE=Release \
  -DWASMEDGE_BUILD_TESTS=OFF \
  -DWASMEDGE_BUILD_WASI_NN_RPC=OFF \
  -DWASMEDGE_USE_LLVM=OFF \
  -DWASMEDGE_PLUGIN_WASI_NN_BACKEND=GGML \
  -DWASMEDGE_PLUGIN_WASI_NN_GGML_LLAMA_CANN=ON && \
  cmake --build build --config Release -j

RUN cd /src/llama.cpp && source /usr/local/Ascend/ascend-toolkit/set_env.sh --force && \
  cmake -B build -DGGML_CANN=ON -DBUILD_SHARED_LIBS=OFF  && \
  cmake --build build --config Release --target llama-cli

WORKDIR /root
RUN mkdir -p .wasmedge/{bin,lib,include,plugin} && \
  cp -f /src/WasmEdge/build/include/api/wasmedge/* .wasmedge/include/ && \
  cp -f /src/WasmEdge/build/tools/wasmedge/wasmedge .wasmedge/bin/ && \
  cp -f -P /src/WasmEdge/build/lib/api/libwasmedge.so* .wasmedge/lib/ && \
  cp -f /src/WasmEdge/build/plugins/wasi_nn/libwasmedgePluginWasiNN.so .wasmedge/plugin/
COPY ./env .wasmedge/env

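To build the Docker image, you need to fetch the WasmEdge source code and build from source. The Dockerfile copies ./WasmEdge on the host to /src/WasmEdge in the image and builds the binaries using the CANN libraries in the container.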

git clone https://github.com/WasmEdge/WasmEdge.git -b dm4/cann 

docker build -t build-wasmedge-cann .

Next, start the container as follows. Applications in the container directly access the CANN drivers and utilities on the host.

sudo docker run -it --rm --name LlamaEdge \
        --device /dev/davinci0 \
        --device /dev/davinci_manager \
        --device /dev/devmm_svm \
        --device /dev/hisi_hdc \
        -v /usr/local/dcmi:/usr/local/dcmi \
        -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
        -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
        -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
        -p 8080:8080 \
        build-wasmedge-cann bash

You should now be at a command-line prompt inside the container.

Clone the WasmEdge project's examples from GitHub.

git clone https://mirror.ghproxy.com/https://github.com/WasmEdge/WasmEdge.git -b dm4/cann

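API service

Inside the container, you can download an LLM model file. A current limitation of llama.cpp's CANN backend is that it only supports the Q4 and Q8 quantization levels.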

curl -LO https://hf-mirror.com/gaianet/Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

Download the cross-platform Wasm binary of the LlamaEdge API server.

curl -LO https://mirror.ghproxy.com/https://github.com/LlamaEdge/LlamaEdge/releases/download/0.14.11/llama-api-server.wasm

Start the API server.

nohup wasmedge --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q4_0.gguf llama-api-server.wasm --model-name llama3 --ctx-size 4096 --batch-size 128 --prompt-template llama-3-chat --socket-addr 0.0.0.0:8080 --log-prompts --log-stat &

Test it with an OpenAI-compatible API request!

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"system", "content": "You are an experienced Rust developer."}, {"role":"user", "content": "How do I convert a string into an integer?"}]}'

The API server returns the following response.

{"id":"chatcmpl-683a09ec-f0be-4d88-a0eb-77acd60dd8b5","object":"chat.completion","created":1729648349,"model":"llama3","choices":[{"index":0,"message":{"content":"You can convert a string into an integer in Rust with the `parse` function, which is associated with the `FromStr` trait. The specific method depends on the format of your string and the type you want to convert it to.\n\nFor example: \n\n```rust\nuse std::str::FromStr;\n\nlet s = \"12345\";\nif let Ok(n) = i32::from_str(&s) { // Replace 'i32' with the integer type that best fits your needs.\n println!(\"{}\", n); \n} else {\n eprintln!(\"Unable to parse {} into an integer\", s); \n}\n```\nThis code will convert a string into a 32-bit signed integer (i32). If the string does not represent a valid number in the chosen type or is out of range for that type, `parse` will return an `Err` value which you can handle as shown above.\n\nYou may also use `unwrap()` method instead of pattern matching if you want to crash your program with a clear message when parsing fails:\n\n```rust\nlet s = \"12345\";\nlet n = i32::from_str(&s).unwrap(); // Replace 'i32' with the integer type that best fits your needs.\nprintln!(\"{}\", n); \n```","role":"assistant"},"finish_reason":"stop","logprobs":null}],"usage":{"prompt_tokens":30,"completion_tokens":315,"total_tokens":345}}
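For programmatic use, the same request and response can be handled from a script. The following is a minimal sketch using only the Python standard library. The helper names `build_chat_request`, `post_chat`, and `extract_reply` are hypothetical, and the sample response is an abbreviated copy of the one above. `post_chat` is defined but not called here, since it needs the API server from the previous step to be running.

```python
import json
import urllib.request

# Plain HTTP on port 8080, matching the --socket-addr used to start the server.
API_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(system: str, user: str) -> bytes:
    """Encode a chat completion request body (hypothetical helper)."""
    body = {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]}
    return json.dumps(body).encode("utf-8")

def post_chat(payload: bytes, url: str = API_URL) -> dict:
    """POST the payload to the API server and decode the JSON response."""
    req = urllib.request.Request(
        url, data=payload, method="POST",
        headers={"accept": "application/json", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_reply(response: dict) -> tuple[str, int]:
    """Return the assistant text and the total token count."""
    choice = response["choices"][0]
    return choice["message"]["content"], response["usage"]["total_tokens"]

# Abbreviated copy of the response shown above, used to exercise the parser offline.
sample = {
    "model": "llama3",
    "choices": [{"index": 0,
                 "message": {"content": "You can convert a string into an integer in Rust ...",
                             "role": "assistant"},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 30, "completion_tokens": 315, "total_tokens": 345},
}

payload = build_chat_request("You are an experienced Rust developer.",
                             "How do I convert a string into an integer?")
text, total = extract_reply(sample)
print(total, text[:40])
```

With the server running, `extract_reply(post_chat(payload))` would return the live answer instead of the sample.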

Chatbot

Stop the LlamaEdge API server inside the container.

pkill -9 wasmedge

Download the chatbot's HTML, CSS, and JS files and extract them into the chatbot-ui folder.

curl -LO https://github.com/LlamaEdge/chatbot-ui/releases/latest/download/chatbot-ui.tar.gz
tar xzf chatbot-ui.tar.gz
rm chatbot-ui.tar.gz

Restart the LlamaEdge API server with the chatbot UI.

nohup wasmedge --dir .:. --nn-preload default:GGML:AUTO:Meta-Llama-3-8B-Instruct-Q4_0.gguf llama-api-server.wasm --model-name llama3 --ctx-size 4096 --batch-size 128 --prompt-template llama-3-chat --socket-addr 0.0.0.0:8080 --log-prompts --log-stat &

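You can now open a browser and point it at port 8080 of the server.

Tool calling

One requirement of the Agent hackathon was to demonstrate how an LLM can use tools and make function calls to access external resources and carry out complex tasks. LlamaEdge supports OpenAI-compatible tool calling on Ascend NPUs.

Stop the LlamaEdge API server inside the container.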

pkill -9 wasmedge

Download an LLM fine-tuned for tool calling.

curl -LO https://huggingface.co/gaianet/Llama-3-Groq-8B-Tool-Use-GGUF/resolve/main/Llama-3-Groq-8B-Tool-Use-Q4_0.gguf

Restart the API server inside the container.

nohup wasmedge --nn-preload default:GGML:AUTO:Llama-3-Groq-8B-Tool-Use-Q4_0.gguf llama-api-server.wasm --model-name tools --ctx-size 4096 --batch-size 128 --prompt-template groq-llama3-tool --socket-addr 0.0.0.0:8080 --log-prompts --log-stat &

Now we can make an OpenAI-style request that gives the LLM a list of available tools.

curl -X POST http://localhost:8080/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  --data-binary @tooluse.json

tooluse.json contains the user question and the following available tools.

{
    "messages": [
        {
            "role": "user",
            "content": "What is the weather like in San Francisco in Celsius?"
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "celsius",
                                "fahrenheit"
                            ],
                            "description": "The temperature unit to use. Infer this from the users location."
                        }
                    },
                    "required": [
                        "location",
                        "unit"
                    ]
                }
            }
        },
        {
            "type": "function",
            "function": {
                "name": "predict_weather",
                "description": "Predict the weather in 24 hours",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA"
                        },
                        "unit": {
                            "type": "string",
                            "enum": [
                                "celsius",
                                "fahrenheit"
                            ],
                            "description": "The temperature unit to use. Infer this from the users location."
                        }
                    },
                    "required": [
                        "location",
                        "unit"
                    ]
                }
            }
        }
    ],
    "tool_choice": "auto",
    "stream": false
}

The LLM responds with the function call it wants the Agent to execute.

{"id":"chatcmpl-f5c9efff-c742-4948-93c1-0e19287a764e","object":"chat.completion","created":1729653908,"model":"tools","choices":[{"index":0,"message":{"content":"<tool_call>\n{\"id\": 0, \"name\": \"get_current_weather\", \"arguments\": {\"location\": \"San Francisco, CA\", \"unit\": \"celsius\"}}\n</tool_call>","tool_calls":[{"id":"call_abc123","type":"function","function":{"name":"get_current_weather","arguments":"{\"location\":\"San Francisco, CA\",\"unit\":\"celsius\"}"}}],"role":"assistant"},"finish_reason":"tool_calls","logprobs":null}],"usage":{"prompt_tokens":404,"completion_tokens":38,"total_tokens":442}}
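On the Agent side, the next step is to run the requested function and hand its result back to the LLM. The following is a minimal sketch of that dispatch step, not the hackathon's actual agent code. `get_current_weather` is a hypothetical stub that returns a placeholder string, `run_tool_calls` and `TOOLS` are illustrative names, and the `tool` message layout follows the OpenAI chat format.

```python
import json

def get_current_weather(location: str, unit: str) -> str:
    """Hypothetical stand-in for a real weather lookup; returns a placeholder."""
    return f"(stub) weather for {location} in {unit}"

# Map tool names declared in tooluse.json to local callables.
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_calls(response: dict) -> list[dict]:
    """Execute each requested tool call and return `tool` role messages."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = TOOLS.get(fn["name"])
        if handler is None:
            raise KeyError(f"model requested unknown tool: {fn['name']}")
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        results.append({"role": "tool",
                        "tool_call_id": call["id"],
                        "content": handler(**args)})
    return results

# Abbreviated copy of the response shown above.
response = {"choices": [{"index": 0, "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [{"id": "call_abc123", "type": "function",
                    "function": {"name": "get_current_weather",
                                 "arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"celsius\"}"}}],
}, "finish_reason": "tool_calls"}]}

tool_messages = run_tool_calls(response)
for msg in tool_messages:
    print(msg)
```

The returned messages would be appended to the conversation, after the assistant message that carried the tool calls, and sent back to the API server so the LLM can compose its final answer. The exact follow-up format LlamaEdge expects should be checked against its tool-call documentation^[11].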

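Learn more about LLM tool calling here^[11].

Performance and future directions

On multi-GPU machines, LlamaEdge lets you specify which GPU runs the LLM. This allows us to run multiple LLM applications in parallel.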

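The Ascend 910B generates about 15 tokens per second for 8B-class LLMs and about 5 tokens per second for 70B-class LLMs. That is on par with Apple's M3 chip, which is far slower than the Ascend 910B in TOPS benchmarks. We believe llama.cpp's CANN backend still has plenty of room for optimization. We look forward to better software and driver support for this excellent hardware in the near future!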
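To relate these per-NPU rates to daily volume, a quick calculation helps. The sketch below assumes every one of the 40 donated NPUs serves a single request stream continuously at the quoted rate and counts generated tokens only. Both are simplifying assumptions, not measurements from the hackathon.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def tokens_per_day(tokens_per_second: float, num_npus: int) -> int:
    """Upper bound on generated tokens per day at a steady per-NPU rate."""
    return int(tokens_per_second * SECONDS_PER_DAY * num_npus)

NUM_NPUS = 5 * 8  # 5 servers with 8 Ascend 910B cards each

daily_8b = tokens_per_day(15, NUM_NPUS)   # 8B-class models
daily_70b = tokens_per_day(5, NUM_NPUS)   # 70B-class models

print(f"8B-class:  {daily_8b:,} tokens/day")
print(f"70B-class: {daily_70b:,} tokens/day")
```

Under these assumptions the ceiling is about 52 million generated tokens per day for 8B-class models and about 17 million for 70B-class models, the same order of magnitude as the daily consumption reported for the hackathon.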

References

[1] A popular alternative to the Nvidia H100: https://www.theregister.com/AMP/2024/08/13/huaweis_ascend_910_launches_this/

[2] Edge devices: https://www.tomshardware.com/raspberry-pi/orange-pi-teams-up-with-huawei-to-create-a-sbc-for-ai-development-huawei-ascend-chip-delivers-820-tops-of-ai-performance

[3] GOSIM: https://www.gosim.org/

[4] Popular LLMs: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/CANN.md#model-supports

[5] MoFA: https://github.com/moxin-org/mofa

[6] CANN: https://www.hiascend.com/en/software/cann

[7] WasmEdge Runtime: https://github.com/WasmEdge/WasmEdge

[8] Open-source contributions to the llama.cpp project: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/CANN.md

[9] LlamaEdge: https://github.com/LlamaEdge/LlamaEdge

[10] Gaia node: https://github.com/GaiaNet-AI/gaianet-node

[11] Learn more about LLM tool calling: https://llamaedge.com/docs/user-guide/tool-call

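About WasmEdge

WasmEdge is a lightweight, secure, high-performance, extensible, OCI-compatible software container and runtime. It is currently a CNCF sandbox project. WasmEdge is used in SaaS, cloud native, service mesh, edge computing, edge cloud, microservices, stream data processing, LLM inference, and other areas.

GitHub: https://github.com/WasmEdge/WasmEdge

Website: https://wasmedge.org/

Discord: https://discord.gg/U4B5sFTkFc

Docs: https://wasmedge.org/docs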

Second State

Rust functions as a service
