
I spent some time yesterday getting the new Qwen 3.6 27B model running locally on a (solar-powered) machine with dual RTX 3090 GPUs. With this setup I'm able to achieve around 50 tokens/second and use the model's full 256k context window. This deployment uses vllm/vllm-openai:latest; no patches or anything else needed:

```yaml
services:
  vllm-qwen36:
    image: vllm/vllm-openai:latest
    container_name: vllm-qwen36
    ipc: host
    shm_size: 32gb
    ports:
      - "8337:8000"
    volumes:
      - /opt/docker/data/vllm/cache/huggingface:/root/.cache/huggingface
      - /opt/docker/data/vllm/cache/vllm:/root/.cache/vllm
    environment:
      VLLM_ALLOW_LONG_MAX_MODEL_LEN: "1"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    command: >
      cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4
      --tensor-parallel-size 2
      --max-model-len 262144
      --gpu-memory-utilization 0.98
      --mm-encoder-tp-mode data
      --kv-cache-dtype fp8
      --enable-prefix-caching
      --enable-chunked-prefill
      --max-num-batched-tokens 4096
      --reasoning-parser qwen3…
```
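Since the container exposes vLLM's OpenAI-compatible API on port 8337, any OpenAI client can talk to it. Here's a minimal smoke test, assuming you're running it on the same host and that the served model name is left at vLLM's default (the model path passed on the command line):

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint.
# Assumes the server is reachable at localhost:8337 and the served
# model name defaults to the model path from the compose command.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8337/v1",
    api_key="not-needed",  # vLLM doesn't require a key unless --api-key is set
)

response = client.chat.completions.create(
    model="cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```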
