VQ-1: A New Approach to Enhancing Reasoning in AI Models
- Mert Can Elsner, Veyllo Labs

- Dec 13, 2025
- 5 min read
- Updated: Feb 26
In the realm of LLM development, a common belief persists: to enhance a model, one must either increase its parameter count or utilize a vast dataset. My objective with the VQ-1 experiment was to challenge this notion on a smaller scale. I aimed to answer the question: Can we significantly improve the reasoning reliability of a highly constrained, quantized model (4-bit) simply by altering how we fine-tune it, rather than how much data we provide?
To explore this, I employed a dataset of just 3,260 examples. I compared a "Logic Density" fine-tuning approach against the base model and larger competitors. The results indicate that for specific reasoning tasks, the quality and structure of the data are far more crucial than sheer volume.
Understanding Logic Density Fine-Tuning
Logic Density Fine-Tuning is an engineering-focused strategy for data selection. Instead of expanding the dataset size, I prioritize samples that necessitate multi-step logical resolution, explicit constraint handling, or structured problem decomposition. The aim is not to introduce a new learning paradigm but to empirically investigate how data composition influences reasoning behavior in small, resource-constrained models.
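The selection criteria were not published as code, but the idea can be sketched as a simple heuristic filter. The keyword markers and weights below are illustrative assumptions for the sake of the sketch, not the actual curation pipeline:

```python
# Illustrative "logic density" filter: keep samples whose target answers show
# multi-step reasoning, explicit constraints, or structured decomposition.
# The marker lists and weights are hypothetical stand-ins for the real criteria.

STEP_MARKERS = ("first", "then", "therefore", "step", "because")
CONSTRAINT_MARKERS = ("must", "cannot", "at most", "at least", "only if")

def logic_density(answer: str) -> int:
    """Crude score: count reasoning-step and constraint markers in the answer."""
    text = answer.lower()
    steps = sum(text.count(m) for m in STEP_MARKERS)
    constraints = sum(text.count(m) for m in CONSTRAINT_MARKERS)
    return steps + 2 * constraints  # weight explicit constraints more heavily

def select_samples(dataset: list[dict], min_score: int = 3) -> list[dict]:
    """Keep only samples whose answer clears the density threshold."""
    return [s for s in dataset if logic_density(s["answer"]) >= min_score]
```

A multi-step resource-allocation answer scores high under this heuristic, while a one-line factual answer scores zero and is dropped, which is the composition shift the experiment is testing.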
The Technical Foundation: Signal over Noise
For this experiment, I did not utilize a pre-tuned "Chat" model. Instead, I started with Qwen3-8B-bnb-4bit. It is essential to clarify that this model is the raw, quantized version of the base model. While it possesses general knowledge, it lacks the alignment needed to strictly follow complex logic or manage specific resource constraints out of the box.
The training environment was intentionally limited to consumer-grade constraints, optimized for an RTX 3080. This was to demonstrate that high performance does not necessitate an H100 cluster.
Configuration details:
Method: QLoRA (Rank 32, Alpha 64), i.e. low-rank adapters trained on top of the frozen 4-bit base weights, with a relatively high rank to enforce strong adaptation.
Data: 3,260 highly curated samples focusing on reasoning steps (Post-Hoc reasoning, constraints) rather than general conversation.
Training Time: 3 Epochs with a batch size of 2 (effective batch size 8).
By maintaining a small dataset while ensuring a strong learning signal, the goal was to "nudge" the raw model into a higher tier of performance without incurring the computational costs associated with full retraining.
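As a concrete reference, the setup above can be summarized as a hyperparameter sketch. The key names are illustrative rather than the exact config keys of the original run, and the gradient-accumulation value is inferred from the stated batch sizes (2 per device, 8 effective):

```python
# Illustrative QLoRA hyperparameters mirroring the setup described above.
# Key names are hypothetical; an actual peft/TRL config would differ.
qlora_config = {
    "base_model": "Qwen3-8B-bnb-4bit",   # raw 4-bit quantized base, not a chat model
    "lora_rank": 32,
    "lora_alpha": 64,                    # alpha = 2 * rank, a common scaling choice
    "epochs": 3,
    "per_device_batch_size": 2,
    "gradient_accumulation_steps": 4,    # inferred: 2 * 4 = effective batch of 8
    "dataset_size": 3260,
}

def effective_batch_size(cfg: dict) -> int:
    """Effective batch = per-device batch * gradient accumulation steps."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]

def optimizer_steps_per_epoch(cfg: dict) -> int:
    """Optimizer updates per epoch (last partial batch dropped)."""
    return cfg["dataset_size"] // effective_batch_size(cfg)
```

At an effective batch of 8, the 3,260-sample dataset yields only a few hundred optimizer steps per epoch, which is what keeps the whole run feasible on an RTX 3080.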
The Benchmark: Why Accuracy Wasn't Enough
Measuring the "Logic Density"
Standard benchmarks typically assess whether a model arrives at the correct answer. However, they rarely consider how "expensive" that answer is. In many real-world applications, a model that generates 1,000 tokens before arriving at the correct answer is often too slow and costly to be practical.
To capture this nuance, I developed a custom metric called the "Reasoning Efficiency Score (RES)." While conceptually straightforward, RES provides a powerful signal by weighing response accuracy against token consumption. It is defined as follows:

RES = (Complexity × Accuracy) / Token Count
Complexity (1-10): A subjective rating of the logical depth required (e.g., a simple fact is rated as 1, while a multi-step resource allocation problem is rated as 9).
Accuracy (0 or 1): A binary pass/fail. If the logic is sound but the final answer is incorrect, the score is 0.
Token Count: The total number of output tokens generated to reach the solution.
Using this formula, we can quantify "Logic per Token." A high RES indicates a model that has internalized the reasoning process, delivering high-value output with minimal latency and cost.
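The metric is simple enough to express in a few lines. A sketch, using the raw ratio from the definition above (the original evaluation may apply a scaling constant, which is not stated):

```python
def reasoning_efficiency_score(complexity: int, accuracy: int, token_count: int) -> float:
    """RES = (complexity * accuracy) / token_count, i.e. "logic per token".

    complexity:  subjective logical depth, 1-10
    accuracy:    binary pass/fail (0 or 1); sound logic with a wrong
                 final answer still scores 0
    token_count: total output tokens generated to reach the solution
    """
    if not 1 <= complexity <= 10:
        raise ValueError("complexity must be in 1..10")
    if accuracy not in (0, 1):
        raise ValueError("accuracy must be 0 or 1")
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return (complexity * accuracy) / token_count

# A hard task solved concisely outscores the same task solved verbosely:
concise = reasoning_efficiency_score(9, 1, 660)    # ~0.0136
verbose = reasoning_efficiency_score(9, 1, 1200)   # ~0.0075
```

Note the hard zero for wrong answers: a model cannot buy efficiency points with elegant but incorrect reasoning.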
Results: VQ-1 vs. Qwen 3 (Base)
We compared VQ-1 (8.2B) directly against its base foundation, Qwen 3 (8B), across a range of complex logical tasks, from recursive programming constraints to "Theory of Mind" psychological assessments. The results validated the "High-Density Training" hypothesis.
1. Efficiency Gains (RES)
As illustrated in the chart below, VQ-1 achieved a significantly higher average Reasoning Efficiency Score.

The data reveals a 41% increase in reasoning efficiency for VQ-1 compared to the base model. This suggests that the fine-tuning process effectively streamlined the reasoning steps: while the base model often relied on lengthy, externalized chain-of-thought and frequently lost the thread, VQ-1 arrived at solutions efficiently.
2. Token Economy
Higher efficiency correlates directly with lower inference cost. By internalizing the core reasoning patterns during fine-tuning, VQ-1 generates significantly fewer tokens on average.

On average, VQ-1 required 27% fewer tokens to complete the same tasks. For API-based deployments or local inference on limited hardware, this translates to a 27% reduction in latency and operational costs.
Where the Gap Widens
The performance difference was not uniform across tasks. The disparity between VQ-1 and the base model was most pronounced in tasks involving implicit constraint management, particularly within the "Resource Management" and "Real World Logic" categories, such as Tasks E2 and C3 in our dataset.

In these scenarios, the base model often experienced "analysis paralysis," generating numerous tokens to evaluate ethical or logistical pros and cons, frequently resulting in suboptimal conclusions. In contrast, VQ-1, aligned through our specific high-density logic samples, quickly identified constraints and provided mathematically optimal decisions without delay.
This confirms that even a small model (8B), when trained with high-precision data, can exceed the expected reasoning baseline of its parameter class, delivering robust logical outputs typically associated with larger models in specific domains.
Logic Density Fine-Tuning Final Results
I also evaluated the fine-tuned VQ-1 against Ministral 8B and DeepSeek R1 Distill. Here are the findings.

1. Stability over the Base Model
The most immediate observation is the stability improvement over the base model (shown in blue vs. VQ-1 in black). In tasks requiring strict logical constraints, such as the Resource Triage scenarios, the base model often failed to produce valid answers or became stuck in loops (marked as "Failed" in the chart). VQ-1, despite being the same underlying model size, managed these edge cases reliably.
2. Efficiency vs. "Reasoning" Models
Models like DeepSeek R1 (Light Blue) are designed to "think" through lengthy text outputs. While this often leads to correct answers, the efficiency trade-off is significant. My data shows that VQ-1 frequently arrives at the same correct conclusion but with considerably fewer tokens. This suggests that through high-density training, VQ-1 learned to navigate logical constraints implicitly, effectively bridging the reasoning gap without verbose verbalization.
3. A Closer Look: Token Consumption
When comparing VQ-1 directly to the base model, the efficiency gains become evident:
Logic Tasks (A1): VQ-1 required approximately 660 tokens to solve the problem, while the base model needed nearly 1,000 tokens to reach a conclusion that was often less coherent.
Complex Math (C3): This served as a stress test. The base model drifted into repetitive loops (using over 1,200 tokens), while VQ-1 identified the pattern and provided the solution concisely.
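The A1 numbers above can be turned into a quick back-of-the-envelope comparison using the RES definition. The complexity rating of 7 here is an assumed value for illustration, and both models are assumed to have answered correctly:

```python
# Back-of-the-envelope comparison for Logic Task A1, assuming both models
# answered correctly and the task is rated complexity 7 (hypothetical rating).
COMPLEXITY = 7

def res(complexity: int, accuracy: int, tokens: int) -> float:
    """RES = (complexity * accuracy) / tokens, as defined in the benchmark."""
    return (complexity * accuracy) / tokens

vq1_tokens, base_tokens = 660, 1000
token_savings = 1 - vq1_tokens / base_tokens   # 0.34, i.e. 34% fewer tokens on A1
res_ratio = res(COMPLEXITY, 1, vq1_tokens) / res(COMPLEXITY, 1, base_tokens)
# With equal accuracy, the RES advantage reduces to the inverse token ratio:
# 1000 / 660 ~ 1.52, i.e. roughly 52% higher efficiency on this single task.
```

This also shows why RES rewards concision so strongly: once accuracy is equal, the whole score difference comes from the denominator.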
Conclusion
This experiment was not about surpassing GPT-5. It aimed to explore how much performance we can extract from a small, 4-bit model. The results indicate that complex reasoning capabilities do not necessarily require a model to generate thousands of thinking tokens. With the right training data, we can align even a quantized base model to solve complex tasks efficiently. I do not claim to have mechanistic insights into internal reasoning processes; further interpretability work is needed to support such conclusions.
Try it yourself:
You can test VQ-1 and explore the findings directly on Hugging Face: