3. Models & Benchmarks
- LLM Leaderboard 2026 — Compare Top AI Models
Compare the latest LLM benchmarks for GPT, Claude, Gemini and more. Updated rankings across reasoning, coding, math, and multilingual tasks with pricing and speed data.
- SWE-bench Leaderboards
SWE-bench Multilingual features 300 tasks across 9 programming languages. SWE-bench Lite is a subset curated for less costly evaluation.
- Strongest LLM...
A panoramic view of benchmarks for the strongest LLMs: how six new models, including GPT-5, Claude-4, and Grok-4, "sit the exam", explained in one article. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.
- Try DeepSeek V4 for Free in Felo LLM Playground | Felo Search Blog
GPT-5.5 is now live on Felo Agent: try it for free. Try DeepSeek V4 for free in Felo LLM Playground. SWE-bench Verified.
- LLM Evaluation Sets | EvalScope
vLLM Bench vs Evalscope Perf: a load-testing comparison. End-to-end best practices for LLMs.
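- DeepSeek Intelligent Programming - DECHIN - 博客园
DeepSeek Intelligent Programming. This article presents two approaches to AI-assisted programming: one uses Cursor with a remote API for automated programming, and the other uses a VSCode plugin with a locally deployed Ollama model.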