Serving on CctoctoFX

Recent content in Serving on CctoctoFX
https://pillumina.github.io/tags/serving/

LLM Systems Analysis Methodology (Part 7): Inference Serving Performance Modeling
https://pillumina.github.io/posts/aiinfra/llm-computation-methodology/part-7/
Mon, 22 Jun 2026 09:06:00 +0800

End-to-end performance modeling for inference serving, from single-token latency to multi-request concurrency, covering continuous batching, PagedAttention, prefill-decode disaggregation, speculative decoding, and quantized deployment. Includes a complete serving analysis of Llama-70B and serving strategies for MoE models, across both NVIDIA and Ascend platforms.