
Logic Density Fine-Tuning: Elevating 8B Models with Data


Mert Can Elsner - Veyllo Labs


There is a common assumption in LLM development that to improve a model, you either need a bigger parameter count or a massive dataset. My goal with the VQ-1 experiment was to challenge that assumption on a much smaller scale. I wanted to find out: can we take a highly constrained, 4-bit quantized model and significantly improve its reasoning reliability just by changing how we fine-tune it, rather than how much data we feed it? Using a dataset of only 3,260 examples, I compared a "Logic Density" fine-tuning approach against the base model and larger competitors. The results suggest that for specific reasoning tasks, data quality and structure matter far more than volume.

Logic Density Fine-Tuning is an engineering-focused data selection strategy. Instead of increasing dataset size, I prioritize samples that require multi-step logical resolution, explicit constraint handling, or structured problem decomposition. The goal is not to introduce a new learning paradigm, but to empirically explore how data composition affects reasoning behavior in small, resource-constrained models.

The Technical Foundation: Signal over Noise

For this Logic Density Fine-Tuning experiment, I didn't use a pre-tuned "Chat" model. I started with Qwen3-8B-bnb-4bit. It’s important to clarify what this model is: it is the raw, quantized version of the base model. It has general knowledge, but out of the box it lacks the alignment to strictly follow complex logic or handle specific resource constraints.

The training environment was deliberately limited to consumer-grade constraints (optimized for an RTX 3080) to prove this doesn't require an H100 cluster.


Configuration details:

  • Method: QLoRA (Rank 32, Alpha 64) to force a strong adaptation of the frozen weights.

  • Data: 3,260 highly curated samples focusing on reasoning steps (Post-Hoc reasoning, constraints) rather than general conversation.

  • Training Time: 3 epochs with a per-device batch size of 2 (gradient accumulation to an effective batch size of 8).


By keeping the dataset small but the learning signal strong, the aim was to "nudge" the raw model into a higher tier of performance without the computational cost of full retraining.
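
For anyone who wants to reproduce this setup, below is a minimal sketch of the configuration using the Hugging Face transformers, peft, and bitsandbytes stack. Only the rank, alpha, and the 4-bit Qwen3-8B base correspond to the run described here; the target modules, dropout, and compute dtype are illustrative assumptions.

```python
# Minimal QLoRA setup sketch (rank 32, alpha 64). The post used the pre-quantized
# Qwen3-8B-bnb-4bit checkpoint; here the 4-bit quantization is applied on the fly
# to the standard base, which is functionally equivalent for a sketch.
# The target modules, dropout, and compute dtype are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_id = "Qwen/Qwen3-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

lora_config = LoraConfig(
    r=32,                     # rank from the post
    lora_alpha=64,            # alpha from the post
    lora_dropout=0.05,        # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights train
```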




The Benchmark: Why Accuracy Wasn't Enough

Measuring the "Logic Density"


Standard benchmarks measure whether a model gets the answer right. They rarely measure how "expensive" that answer was. In many production cases, a model that rambles for 1,000 tokens before giving the right answer is too slow and costly to be useful.

To capture this nuance, I developed a custom metric, the "Reasoning Efficiency Score (RES)".

While conceptually simple, RES provides a powerful signal by weighting response accuracy against token consumption. It is defined as follows:



RES = (Complexity_Score × Accuracy) / Token_Count


  • Complexity (1-10): A subjective rating of the logical depth required (e.g., a simple fact is 1, a multi-step resource allocation problem is 9).

  • Accuracy (0 or 1): A binary pass/fail. If the logic is sound but the final answer is wrong, the score is 0.

  • Token Count: The total number of output tokens generated to reach the solution.


By using this formula, we can quantify "Logic per Token." A high RES indicates a model that has internalized the reasoning process, delivering high-value output with minimal latency and cost.
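
To make the metric concrete, here is a minimal sketch of the RES computation; the example values are purely illustrative and are not taken from the benchmark runs.

```python
# Reasoning Efficiency Score (RES) as defined above: logic per token.
def reasoning_efficiency_score(complexity: int, accuracy: int, token_count: int) -> float:
    """RES = (Complexity * Accuracy) / Token_Count

    complexity:  1-10 rating of the logical depth required
    accuracy:    1 if the final answer is correct, otherwise 0
    token_count: total output tokens generated to reach the solution
    """
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return (complexity * accuracy) / token_count

# Illustrative example: a complexity-9 task answered correctly in 700 tokens.
print(reasoning_efficiency_score(9, 1, 700))  # ~0.0129
```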





Results: VQ-1 vs. Qwen 3 (Base)


We pitted VQ-1 (8.2B) directly against its base foundation, Qwen 3 (8B), across a suite of complex logical tasks ranging from recursive programming constraints to "Theory of Mind" psychological assessments. The results validated the "High-Density Training" hypothesis.


1. Efficiency Gains (RES)

As shown in the chart below, VQ-1 achieved a significantly higher average Reasoning Efficiency Score.


[Chart: Reasoning Efficiency Score (higher is better). Qwen 3: 0.0126, Veyllo VQ-1: 0.0178]

The data indicates a 41% increase in reasoning efficiency for VQ-1 compared to the base model. This suggests that the fine-tuning process effectively streamlined the reasoning steps. While the base model frequently relied on extensive, externalized chain-of-thought processing (or simply got confused), VQ-1 arrived at the solution efficiently.


2. Token Economy


Higher efficiency translates directly into lower inference costs. By internalizing the underlying logic, VQ-1 substantially reduces the average number of tokens generated per task.


[Chart: Average output tokens per task (lower is better). Qwen 3: 1,228, Veyllo VQ-1: 892]

VQ-1 required, on average, 27% fewer tokens to complete the same set of tasks. For API-based deployments or local inference on limited hardware, that translates roughly into a 27% reduction in generation latency and operational cost.
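
As a quick sanity check, both headline percentages follow directly from the chart values above:

```python
# Derive the 41% RES gain and the 27% token saving from the chart values.
qwen_res, vq1_res = 0.0126, 0.0178           # average RES per model
qwen_tokens, vq1_tokens = 1228, 892          # average output tokens per task

print(f"RES gain:     {vq1_res / qwen_res - 1:.0%}")                    # ~41%
print(f"Token saving: {(qwen_tokens - vq1_tokens) / qwen_tokens:.0%}")  # ~27%
```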



Where the Gap Widens


The performance difference was not consistent across tasks. The disparity between VQ-1 and the base model was most pronounced in tasks that involved implicit constraint management, particularly within the "Resource Management" and "Real World Logic" categories, such as Tasks E2 and C3 in our dataset.


[Chart: Token usage per task. Veyllo VQ-1 uses significantly fewer tokens than the Qwen 3 base model to solve the same logical reasoning tasks]

In these scenarios, the base model often fell into "analysis paralysis," producing numerous tokens to weigh ethical or logistical pros and cons and frequently arriving at a suboptimal conclusion. In contrast, VQ-1, aligned through our high-density logic samples, identified the constraints quickly and provided the mathematically optimal decision.


This supports the idea that even a small model (8B), when trained on high-precision data, can exceed the expected reasoning baseline of its parameter class, delivering robust logical outputs typically associated with larger models in these specific domains.



Logic Density Fine-Tuning Final Results

I also ran the fine-tuned VQ-1 against Ministral 8B as well as DeepSeek R1 Distill. Here is what the data showed.


[Chart: Reasoning Efficiency Score by task and model. Tasks on the x-axis, RES on the y-axis, models differentiated by color]

1. Stability over the Base Model

The most immediate observation is the stability improvement over the base model (shown in blue vs. VQ-1 in black). In tasks requiring strict logical constraints, like the Resource Triage scenarios, the base model often failed to produce a valid answer or got stuck in loops (marked as "Failed" in the chart). VQ-1, despite being the same underlying model size, handled these edge cases reliably.


2. Efficiency vs. "Reasoning" Models

Models like DeepSeek R1 (light blue) are designed to "think" via long text outputs. While this often leads to correct answers, the efficiency trade-off is massive. My data shows that VQ-1 often arrives at the same correct conclusion with significantly fewer tokens. This suggests that through high-density training, VQ-1 learned to navigate these logical constraints implicitly, effectively bridging the reasoning gap without lengthy explicit verbalization.


3. A Closer Look: Token Consumption

When comparing VQ-1 directly to the base model, the efficiency gains become clearer:

  • Logic Tasks (A1): VQ-1 required ~660 tokens to solve the problem, whereas the base model needed nearly 1,000 tokens to reach a conclusion (which was often less coherent).

  • Complex Math (C3): This was a stress test. The base model drifted into repetitive loops (using 1,200+ tokens), while VQ-1 identified the pattern and provided the solution concisely.



Conclusion


This experiment wasn't about beating GPT-5. It was about seeing how much performance we can squeeze out of a small, 4-bit model. The results indicate that complex reasoning capabilities don't necessarily require a model to output thousands of thinking tokens. With the right training data, we can align even a quantized base model to solve complex tasks efficiently. I do not claim mechanistic insight into internal reasoning processes; further interpretability work would be required to support such conclusions.

Try it yourself: You can test VQ-1 and explore the findings directly on Hugging Face: 👉 https://huggingface.co/Veyllo/VQ-1_Instruct-q4_k_m
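
If you want to run the q4_k_m GGUF locally, the sketch below uses llama-cpp-python; the filename glob, context size, and prompt are my assumptions rather than official usage instructions.

```python
# Minimal local test of the released GGUF via llama-cpp-python.
# Assumption: the repo contains a single GGUF file, so the "*.gguf" glob matches it.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Veyllo/VQ-1_Instruct-q4_k_m",
    filename="*.gguf",   # glob for the quantized weights (assumed)
    n_ctx=4096,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": "Three workers must cover 7 tasks, and no worker may take "
                   "more than 3. Give one valid allocation and justify it briefly.",
    }],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```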
