GLM-4.7-Flash is a brand new member of the GLM-4.7 family and targets developers who need strong coding and reasoning performance in a model that is practical to run locally. Zhipu AI (Z.ai) describes GLM-4.7-Flash as a 30B-A3B MoE model and presents it as the strongest model in the 30B class, designed for lightweight deployment where performance and efficiency both matter.

Model class and position inside the GLM-4.7 family

GLM-4.7-Flash is a text generation model with 31B parameters, BF16 and F32 tensor types, and the architecture tag glm4_moe_lite. It supports English and Chinese, and it is configured for conversational use. GLM-4.7-Flash sits in the GLM-4.7 collection alongside the larger GLM-4.7 and GLM-4.7-FP8 models.

Z.ai positions GLM-4.7-Flash as a free tier and lightweight deployment option relative to the full GLM-4.7 model, while still targeting coding, reasoning, and general text generation tasks. This makes it interesting for developers who cannot deploy a 358B class model but still want a modern MoE design and strong benchmark results.

Architecture and context length

In a Mixture of Experts architecture of this kind, the model stores more parameters than it activates for each token. That allows specialization across experts while keeping the effective compute per token closer to that of a smaller dense model.
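
To make the routing idea concrete, here is a toy sketch of top-k expert selection. The expert count, hidden size, and gating details are invented for illustration and do not reflect GLM-4.7-Flash's actual configuration.

```python
import numpy as np

# Toy sketch of top-k MoE routing (illustrative, not GLM's actual code).
# The layer stores num_experts expert networks but routes each token to
# only top_k of them, so per-token compute stays close to a small dense model.

rng = np.random.default_rng(0)
num_experts, top_k, hidden = 64, 2, 8  # made-up toy sizes

def route(token_state, router_weights):
    logits = token_state @ router_weights        # (num_experts,) router scores
    top = np.argsort(logits)[-top_k:]            # indices of the top_k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                         # softmax over the selected experts
    return top, gates

token = rng.standard_normal(hidden)
router = rng.standard_normal((hidden, num_experts))
experts, gates = route(token, router)
print(experts, gates)  # only 2 of 64 experts fire for this token
```

The ratio of activated to stored parameters is what the "A3B" in 30B-A3B expresses: roughly 3B parameters active per token out of about 30B stored.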

GLM-4.7-Flash supports a context length of 128k tokens and achieves strong performance on coding benchmarks among models of similar scale. This context size is suitable for large codebases, multi-file repositories, and long technical documents, where many existing models would need aggressive chunking.
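
As a quick sanity check before sending a large codebase, you can count tokens against the window. This sketch assumes the tokenizer loads from the zai-org/GLM-4.7-Flash repo via AutoTokenizer, and repo_dump.txt is a hypothetical concatenated code dump.

```python
from transformers import AutoTokenizer

# Check whether a long document fits in the 128k-token window,
# leaving some room for the model's own output.
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

CONTEXT_LIMIT = 131_072      # 128k tokens
RESERVED_FOR_OUTPUT = 4_096  # arbitrary headroom for the reply

with open("repo_dump.txt") as f:  # hypothetical concatenated codebase
    text = f.read()

n_tokens = len(tokenizer.encode(text))
budget = CONTEXT_LIMIT - RESERVED_FOR_OUTPUT
print(f"{n_tokens} tokens; fits: {n_tokens <= budget}")
```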

GLM-4.7-Flash uses a standard causal language modeling interface and a chat template, which allows integration into existing LLM stacks with minimal changes.
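
A minimal Transformers sketch of that interface, assuming the repo id from the model card; the loading flags (dtype, device_map) and generation length here are illustrative choices, not prescribed settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.7-Flash"  # repo id from the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# The chat template turns a message list into the model's expected prompt format.
messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```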

Benchmark performance in the 30B class

The Z.ai team compares GLM-4.7-Flash with Qwen3-30B-A3B-Thinking-2507 and GPT-OSS-20B. GLM-4.7-Flash leads or is competitive across a mix of math, reasoning, long-horizon, and coding agent benchmarks.

(Benchmark comparison table: see the model card at https://huggingface.co/zai-org/GLM-4.7-Flash)

The comparison table linked above shows why GLM-4.7-Flash is one of the strongest models in the 30B class, at least among the models included in this comparison. The important point is that GLM-4.7-Flash is not only a compact deployment of GLM but also a high performing model on established coding and agent benchmarks.

Evaluation parameters and thinking mode

For most tasks, the default settings are: temperature 1.0, top-p 0.95, and max new tokens 131,072. This defines a relatively open sampling regime with a large generation budget.

For Terminal Bench and SWE-bench Verified, the configuration uses temperature 0.7, top-p 1.0, and max new tokens 16,384. For τ²-Bench, the configuration uses temperature 0 and max new tokens 16,384. These stricter settings reduce randomness for tasks that need stable tool use and multi-step interaction.
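
Collected in one place, those reported settings map naturally onto vLLM's SamplingParams. The dictionary keys below are convenience labels of our own, not official names.

```python
from vllm import SamplingParams

# The evaluation settings reported above, one entry per task family.
EVAL_CONFIGS = {
    "default":            SamplingParams(temperature=1.0, top_p=0.95, max_tokens=131_072),
    "terminal_bench":     SamplingParams(temperature=0.7, top_p=1.0,  max_tokens=16_384),
    "swe_bench_verified": SamplingParams(temperature=0.7, top_p=1.0,  max_tokens=16_384),
    "tau2_bench":         SamplingParams(temperature=0.0,             max_tokens=16_384),
}
```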

The Z.ai team also recommends turning on Preserved Thinking mode for multi-turn agentic tasks such as τ²-Bench and Terminal Bench 2. This mode preserves internal reasoning traces across turns, which is useful when you build agents that need long chains of function calls and corrections.
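
If your serving stack exposes the model's reasoning separately, one hedged way to approximate this behavior in your own agent loop is simply to keep that reasoning in the message history rather than dropping it each turn. The reasoning_content field below is illustrative only; check the model card and your server's documentation for the actual flag and message schema.

```python
# Sketch of a "preserved thinking" conversation history: earlier turns'
# reasoning stays in the message list instead of being stripped, so later
# turns can build on it.

history = []

def add_assistant_turn(history, answer, reasoning=None, preserve_thinking=True):
    msg = {"role": "assistant", "content": answer}
    if preserve_thinking and reasoning is not None:
        msg["reasoning_content"] = reasoning  # hypothetical field name
    history.append(msg)

history.append({"role": "user", "content": "Run the tests and fix the first failure."})
add_assistant_turn(
    history,
    "Running pytest now.",
    reasoning="Plan: run the suite, parse the first failure, then patch it.",
)
```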

How GLM-4.7-Flash fits developer workflows

GLM-4.7-Flash combines several properties that are relevant for agentic, coding-focused applications:

  • A 30B-A3B MoE architecture with 31B total parameters and a 128k token context length.
  • Strong benchmark results on AIME 25, GPQA, SWE-bench Verified, τ²-Bench, and BrowseComp compared with the other models in the same table.
  • Documented evaluation parameters and a Preserved Thinking mode for multi-turn agent tasks.
  • First-class support for vLLM, SGLang, and Transformers based inference, with ready to use commands (see the client sketch after this list).
  • A growing set of finetunes and quantizations, including MLX conversions, in the Hugging Face ecosystem.
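
Since vLLM and SGLang both expose an OpenAI-compatible endpoint once the model is served (for example, with vllm serve zai-org/GLM-4.7-Flash), a client call can look like the following sketch; the port and served model name are assumptions.

```python
from openai import OpenAI

# Talk to a locally served model through the OpenAI-compatible API
# that vLLM and SGLang both provide.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",  # assumed served model name
    messages=[{"role": "user", "content": "Explain what a chat template does in two sentences."}],
    temperature=0.7,
    max_tokens=512,
)
print(resp.choices[0].message.content)
```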

Check out the model weights on Hugging Face.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
