Show HN: Duplicate 3 layers in a 24B LLM, logical deduction .22→.76. No training

simgt · 2026-03-19T09:59:24.000Z 1773914364

> I replicated David Ng's RYS method [...] found something I didn't expect.

> Transformers appear to have discrete "reasoning circuits" — contiguous blocks of 3-4 layers that act as indivisible cognitive units. Duplicate the right block and the model runs its reasoning pipeline twice. No weights change. No training. The model just thinks longer.

How did you not expect that if you read his post? That's literally what he discovered, two years ago.

For anyone interested, there's more meat in the post and comments from last week: https://news.ycombinator.com/item?id=47322887

regularfry · 2026-03-19T10:50:02.000Z 1773917402

That's explicitly not the unexpected part. Read the rest of the post.

yorwba · 2026-03-19T11:32:38.000Z 1773919958

After reading both the original post and this submission, what do you think is new here?

regularfry · 2026-03-19T13:18:06.000Z 1773926286

> The weird part: different duplication patterns create different cognitive "modes" from the same weights. Double-pass boosts math. Triple-pass boosts emotional reasoning. Interleaved doubling (13,13,14,14,15,15,16) creates a pure math specialist. Same model, same VRAM, different routing.

As far as I can see that's not implied by the original post.

But that's beside the point: quoting the bit where the poster says "here's what I'm building on top of" and using that to imply they haven't done anything new is a bit pointless, no?

simgt · 2026-03-19T13:43:32.000Z 1773927812

You're right that my quote was misleading, I overlooked "the weird part" in the post because it didn't seem new to me either.

Here's the section in the original post that covers it: https://dnhkng.github.io/posts/rys/#the-brain-scanner All heatmaps are split by tasks and show an optimal point for each. The resulting routing he chose is a trade-off for both tasks, there isn't much else to do unless you intend to train a router anyway.

> So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.

gavinray · 2026-03-19T17:19:27.000Z 1773940767

This is stated in the original post as well, under "The Beginning of LLM Neuroanatomy?" section:

  > From end-position 43 to 46, we then see solid boosts in math scores (red = good, yay). But include layer 46 or beyond, and the benefits collapse again. The hypothesis: position 47 is where a different circuit begins. Including even one step of the next recipe messes up the current recipe.

  > So the ‘math organ’ has boundaries on both sides. Too few layers and you get nothing — you’ve cut into the circuit and it can’t complete its operation. Too many layers and you also get nothing — you’ve included tissue from a neighbouring circuit that doesn’t belong. Pre-training carved these structures out of the layer stack, and they only work whole. It also doesn’t translate to other tasks, as the heatmap for EQ scores doesn’t have this patch.

  > This is a much more specific claim than “middle layers do reasoning.” It’s saying the reasoning cortex is organised into functional circuits: coherent multi-layer units that perform complete cognitive operations. Each circuit is an indivisible processing unit, and the sweeps seen in the heatmap is essentially discovering the boundaries of these circuits.

regularfry · 2026-03-20T12:29:43.000Z 1774009783

That's just saying there are circuits. It's not saying you get different effects by stacking the same circuit in different ways.

jstanley · 2026-03-19T13:05:19.000Z 1773925519

It's all new to me.

4bpp · 2026-03-19T02:43:19.000Z 1773888199

Assuming the benchmarks are sound (rather than capturing a fluke), the provided explanation still does not pass the smell test. As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n, unless perhaps these layers were initialised as identity and the training process did not get to change them much. (Plausible for middle layers?)

Considering this, I think (again, assuming the benchmarks themselves are sound) the most plausible explanation for the observations is (1) the layers being duplicated are close to the identity function on most inputs; (2) something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance; (3) the mechanism causing the degradation involves the duplicated layers, so their duplication has the effect of breaking the reasoning-degrading mechanism (e.g. by clobbering a "refusal" "circuit" that emerged in post-training).

More concisely, I'm positing that this is an approach that can only ever break things, and rather than boosting reasoning, it is selectively breaking things deleterious to reasoning.

ACCount37 · 2026-03-19T11:28:23.000Z 1773919703

Empirical findings tell a very different tale: all LLM layers use vaguely compatible internal representations. And middle layers in particular can be almost interchangeable - a lot of what they do seems to be "iterative refinement of the same representations". Proven by various probes and ablations, but the most obvious one is probably the good old logit lens.

This is likely to be shaped by tied embeddings and skips on one end, and maybe training pressures on the other.

The very top of FF stack and the very bottom of FF stack both reflect the same token embeddings - and this propagates through the model, setting up a shared identity space. Skip connections propagate that through the layers. No explicit shared identity imposed, but there is an implicit one set by the architecture. Fairly well established.

(Now: highly speculative! Attention over past tokens creates an implicit "robustness/convergence" pressure? The model can't be "certain" if it'll have access to the right representations at a given layer, because representations depend not just on the past layers, but also on the highly uncertain contents of previous tokens as passed through attention. Which in turn depends on more of the same, increasing variance further. So the training causes: "each layer can't be certain of what it will have access to, so it develops to refine anything it currently has access to in a convergent fashion, because that's what's useful under pressure of attention-induced uncertainty".)

LLMs are notoriously nonfragile, and robust to perturbations. Far more so if you anneal with SFT/distillation after your model surgery, although this wasn't done here. Plenty of weird franken-LLM experiments prove that empirically.

So I'm not too surprised to find that someone has managed to improve benchmark performance on a few narrow tasks by duplicating a few middle layers. "Duplicating a few layers that were doing convergent iterative refinement benefits a few tasks that suffered from insufficient depth of convergent iterative refinement" is a fairly reasonable hypothesis, in my eyes.

The chances of duplication "breaking something somewhere" are high, and I would expect the capability profile of an unannealed franken-LLM like this to have a few gaps in it if evaluated extensively against the original. But "franken-LLM layer duplication can actually improve some things" is far too plausible with what we know to be dismissed pre-emptively.

4bpp · 2026-03-19T12:31:05.000Z 1773923465

That's interesting, could you point me to some source on these findings?

It seems to me that the difference between "iterative improvement" as you put it and "close to the identity" (as in the output is close to the input for most of the volume of the input space) as I put it is fairly subtle, anyway. One experiment I would like to see is what happens to the reasoning performance if rather than duplicating the selected layers, they are deleted/skipped entirely. If the layers improve reasoning by iterative improvement, this should make the performance worse; but if they contain a mechanism that degrades reasoning and is not robust against unannealed self-composition, it should make the performance similarly better.
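The duplicate-versus-skip experiment proposed above comes down to two different layer execution orders over the same weights. A minimal sketch in plain Python, assuming a half-open `(start, end)` block convention; `duplicate_block` and `skip_block` are hypothetical helper names, not part of any toolkit mentioned in the thread:

```python
def duplicate_block(n_layers, start, end):
    """Execution order that runs layers start..end-1 twice in a row."""
    block = list(range(start, end))
    return list(range(0, end)) + block + list(range(end, n_layers))


def skip_block(n_layers, start, end):
    """Execution order that omits layers start..end-1 entirely."""
    return [i for i in range(n_layers) if not (start <= i < end)]


# Toy 8-layer model (size chosen only for illustration).
base = list(range(8))
dup = duplicate_block(8, 3, 6)   # block 3..5 runs twice
skip = skip_block(8, 3, 6)       # block 3..5 never runs
```

Running the same benchmark on `base`, `dup`, and `skip` would be one way to tell the two hypotheses apart: if the block refines representations, `skip` should score worse than `base`; if it hosts something that degrades reasoning, `skip` might score better.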

observationist · 2026-03-19T15:50:13.000Z 1773935413

https://arxiv.org/abs/2505.12540 https://arxiv.org/abs/2405.07987

These, other papers, and the lottery ticket phenomenon; what it boils down to is that any neural network like system which encodes some common mapping of a phenomenon in the context of the world - not necessarily a world model, but some "real-world thing" - will tend to map to a limited number of permutations of some archetypal representation, which will resemble other mappings of the same thing.

The lottery ticket phenomenon is a bit like the birthday paradox; there will be some number of structures in a large, random initialization of neural network weights that coincide with one or more archetypal mappings of complex objects. Some sub-networks are also useful mappings to features of one or more complex objects, which makes learning hierarchical nested networks of feature mappings easier; it's also why interpretability is so damned difficult.

jstanley · 2026-03-19T13:07:03.000Z 1773925623

> As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

Right, I had the same thought.

Even if the output was in the same "format", does the LLM even have any way to know which order the outputs will go in? The ordering of the nodes is part of our representation of the network, it's not fundamental to it.

It would be like shuffling the bytes in a PNG file and expecting the program still to understand it as a PNG file.

The more I think about this, the more I don't get this at all.

WithinReason · 2026-03-19T14:47:57.000Z 1773931677

These layers are residual layers, so what a layer does is:

x = x + layer(x)

so it's not too surprising that they can be used recurrently
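A toy illustration of that update rule, in plain Python. The `make_layer` helper and its scale values are made-up stand-ins for real attention/MLP blocks; the only point is that every layer reads and writes the same stream, so running one of them again is at least well-typed:

```python
def make_layer(scale):
    """Hypothetical layer: returns a delta that nudges each value toward 1.0."""
    def layer(x):
        return [scale * (1.0 - v) for v in x]
    return layer


def run(layers, order, x):
    """Apply layers in the given order using x = x + layer(x)."""
    for i in order:
        delta = layers[i](x)
        x = [v + d for v, d in zip(x, delta)]
    return x


layers = [make_layer(0.1), make_layer(0.2), make_layer(0.3)]
x0 = [0.0, 0.5]

once = run(layers, [0, 1, 2], x0)       # normal forward pass
twice = run(layers, [0, 1, 1, 2], x0)   # layer 1 applied recurrently
```

In this toy, the repeated layer simply pushes the stream further in the direction it was already going. Whether a trained block behaves that way is exactly what the benchmarks in the submission are trying to measure.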

jstanley · 2026-03-19T17:23:05.000Z 1773940985

Ah! Thank you

visarga · 2026-03-19T17:04:17.000Z 1773939857

> there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

There is something that does exactly that - the residual connections. Each layer adds a delta to it, but that means they share a common space. There are papers showing the correlation across layers, of course it is not uniform across depth, but consecutive layers tend to be correlated.

zozbot234 · 2026-03-19T12:23:44.000Z 1773923024

> As far as I can tell, there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

Wouldn't "pass-through" identity connections have exactly that effect? These are quite common in transformer models.

4bpp · 2026-03-19T12:32:38.000Z 1773923558

Yeah, that's what I meant with "initialised as identity and the training process did not get to change them much".

SCLeo · 2026-03-19T13:35:33.000Z 1773927333

There are explicit residual connections in a transformer block. Look up "residual connections" in Google images and you will see.

WithinReason · 2026-03-19T14:27:01.000Z 1773930421

Some transformers have a block recurrent structure, here is a paper that made a similar observation recently:

https://www.alphaxiv.org/abs/2512.19941

kirill5pol · 2026-03-19T04:00:37.000Z 1773892837

Basically all of them are using residual connections so it’s not that surprising honestly

getnormality · 2026-03-19T13:56:53.000Z 1773928613

> something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance

I've been seeing more people speculating like this and I don't understand why. What evidence do we have for RLHF degrading performance on a key metric like reasoning? Why would this be tolerated by model developers?

Can someone point to an example of an AI researcher saying "oops, RLHF forcefully degrades reasoning capabilities, oh well, nothing we can do"?

It strikes me as conspiracist reasoning, like "there's a car that runs on water but they won't sell it because it would destroy oil profits".

4bpp · 2026-03-19T18:23:57.000Z 1773944637

The most obvious way would simply be excessive agreeableness. Users rate responses more highly if they affirm the user's thinking, but a general tendency to affirm would presumably result in the model being more inclined to affirm its own mistakes in a reasoning chain.

There was some research about it early on that was shared widely and shaped the folklore perception around it, such as the graph in https://static.wixstatic.com/media/be436c_84a7dceb0d834a37b3... from the GPT-4 whitepaper which shows that RLHF destroyed its calibration (ability to accurately estimate the likelihood that its guesses are correct). Of course the field may have moved on in the 2+ years that have passed since then.

Karuma · 2026-03-19T01:52:05.000Z 1773885125

Wow, every single word in the original post and on that README.md is pure LLM. How sad.

In any case, this has been done at least since the very first public releases of Llama by Meta... It also works for image models. There are even a few ComfyUI nodes that let you pick layers to duplicate on the fly, so you can test as many as you want really quickly.

xlayn · 2026-03-19T02:22:44.000Z 1773886964

Fair point on the writing style, I used Claude extensively on this project, including drafting. The experiments and ideas are mine though.

On the prior art: you're right that layer duplication has been explored before. What I think is new here is the systematic sweep toolkit + validation on standard benchmarks (lm-eval BBH, GSM8K, MBPP) showing exactly which 3 layers matter for which model. The Devstral logical deduction result (0.22→0.76) was a surprise to me.

If there are ComfyUI nodes that do this for image models, I'd love links. The "cognitive modes" finding (different duplication patterns that lead to different capability profiles from the same weights) might be even more interesting for diffusion models.

abhikul0 · 2026-03-19T06:39:44.000Z 1773902384

I only know of this one: https://github.com/shootthesound/comfyUI-Realtime-Lora. Haven't played with any layer manipulation though.

Karuma · 2026-03-19T13:32:52.000Z 1773927172

I was thinking more like this one: https://github.com/AdamNizol/ComfyUI-Anima-Enhancer/

"It adds the Anima Layer Replay Patcher, which can enhance fine detail and coherence by replaying selected internal blocks during denoising."

abhikul0 · 2026-03-19T14:55:01.000Z 1773932101

I tried out the one I linked with sd1.5 today, moved the sliders around like a total noob and got pretty bad results, but I found no way to "replay" any of the layers like the one you linked does, so thanks for the link. Must take a lot of trial and error haha. I'll check it out, assuming it works for the anima preview 2 too.

taliesinb · 2026-03-19T01:57:57.000Z 1773885477

There is an obvious implication: since the initial models were trained without loops, it is exceedingly unlikely that a single stack of consecutive N layers represents only a single, repeatable circuit that can be safely looped. It is much more likely that the loopable circuits are superposed across multiple layers and have different effective depths.

That you can profitably loop some say 3-layer stack is likely a happy accident, where the performance loss from looping 3/4 of mystery circuit X that partially overlaps that stack is more than outweighed by the performance gain from looping 3/3 of mystery circuit Y that exactly aligns with that stack.

So, if you are willing to train from scratch, just build the looping in during training and let each circuit find its place, in disentangled stacks of various depths. Middle of transformer is:

(X₁)ᴹ ⊕ (Y₁∘Y₂)ᴺ ⊕ (Z₁∘Z₂∘Z₃)ᴾ ⊕ …

Notation: Xᵢ is a layer (of very small width) in a circuit of depth 1..i..D, ⊕ is parallel composition (which sums the widths up to that of the rest of the transformer), ∘ is serial composition (stacking), and ᴹ is looping. The values of ᴹ shouldn't matter as long as they are > 1; the point is to crank them up after training.

Ablating these individual circuits will tell you whether you needed them at all, but also roughly what they were for in the first place, which would be very interesting.

taliesinb · 2026-03-19T02:04:05.000Z 1773885845

And I bet these would be useful in the initial and final parts of the transformer too. Because syntactic parsing and unparsing of brackets, programming language ASTs, etc. is highly recursive; no doubt current models are painfully learning "unrolled" versions of the relevant recursive circuits, unrolled to some fixed depth that must compete for layers with other circuits, since your total budget is 60 or whatever. Incredibly duplicative and by definition unable to generalize to arbitrary depth!

awwaiid · 2026-03-19T02:50:43.000Z 1773888643

Maybe another idea, no idea if this is a thing: you could pick your block-of-layers size (say... 6) and then during training swap those around every now and then at random. Maybe that would force a common API between blocks and specialization of the blocks, and then post-training you could analyze what each block is doing (maybe by deleting it while running benchmarks).

taliesinb · 2026-03-19T02:09:44.000Z 1773886184

Amusingly, you need only have circuits of prime depth, though you should probably adjust their widths using something principled, perhaps Euler's totient function.

kgeist · 2026-03-19T03:26:07.000Z 1773890767

Heh, for the last couple of days, I've been doing this exact kind of "neuroanatomy" on Qwen2.5/Qwen3 too. Fascinating stuff. To make it easier to fiddle with the network, I created a small inference engine that is stripped of all the framework magic, just raw matmuls and all (main inference loop is just 50 lines of code!). For example, it's trivial to remove a layer: I just skip it in code with a simple "if". I've found that removing some layers doesn't appear to change anything (based on the vibes at least). If you remove some later layers, the model forgets how to insert the EOS token and keeps chatting ad infinitum (still coherently). Removing the earliest layers makes the model generate random garbage. Turns out abliteration is not hard to do: 10 examples were enough to find the refusal vector and cancel most refusals. Interestingly, I've found that refusal happens in the middle layers too (I think, layer 12 out of 26).

From what I understand, transformers are resistant to network corruption (without complete collapse) thanks to residual connections.

I tried to repeat some layers too but got garbage results. I guess I need to automate finding the reasoning layers too, instead of just guessing.
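For readers unfamiliar with "finding the refusal vector and cancelling it", here is a sketch of one common recipe (difference of mean activations, then projecting that direction out). This is an assumption about the general technique, not the commenter's actual code, and the activations below are made up:

```python
import math


def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the mean 'complied' activation to the mean 'refused' one."""
    diff = [a - b for a, b in zip(mean_vec(refused_acts), mean_vec(complied_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate(x, direction):
    """Remove the component of x along a unit direction: x - (x . r) r."""
    dot = sum(a * b for a, b in zip(x, direction))
    return [a - dot * r for a, r in zip(x, direction)]


# Toy activations: the two groups differ only along the first axis.
refused = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
complied = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

r = refusal_direction(refused, complied)
cleaned = ablate([3.0, 4.0, 5.0], r)
```

In a real model the activations would be residual-stream vectors captured at one layer for prompts that were and were not refused, and `ablate` would be applied to the stream (or folded into the weights) at inference time.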

gmerc · 2026-03-19T08:42:44.000Z 1773909764

Hook it up in autoresearch?

edg5000 · 2026-03-19T12:51:33.000Z 1773924693

Very interesting stuff

kristianp · 2026-03-19T04:03:14.000Z 1773892994

The method used here by David Ng, was discussed a few days ago at https://news.ycombinator.com/item?id=47322887

woadwarrior01 · 2026-03-19T00:55:05.000Z 1773881705

Reminds me of Solar 10.7B, which was a very good model for its size ~2 years ago, and the "Depth Up-Scaling" technique behind it. Although, that involved continued training after repeating the layers.

https://arxiv.org/abs/2312.15166

christianqchung · 2026-03-19T02:37:37.000Z 1773887857

Why test on Qwen 2.5 when Qwen 3 has been out for about a year, and Qwen 3.5 for a month? My problem with this is ironically entirely vibes based: that for some reason, LLMs love to talk about Qwen 2.5 instead of anything newer.

hackpert · 2026-03-19T12:46:47.000Z 1773924407

We found evidence of specific layer-localized "reasoning" circuits in a few models last year too! A very much work-in-progress paper is here: https://openreview.net/forum?id=mTjGBrkdtz

aimarketintel · 2026-03-21T02:11:28.000Z 1774059088

Interesting technique. For practical applications, structured tool access (MCP) matters more than model size — a 7B model with real-time data often beats 70B without it.

SyzygyRhythm · 2026-03-19T01:13:31.000Z 1773882811

If running twice is good, then is running N times even better? I wonder if you could even loop until some kind of convergence, say hitting a fixed point (input equals output). I wonder if there's even a sort of bifurcation property where it sometimes loops A->A->A, but other times A->B->A, or more, rather like the logistic map fractal.
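A sketch of the "loop until convergence" idea, under the strong assumption that the looped block is contractive (nothing in the thread establishes that for real transformer blocks). `toy_block` is a hypothetical stand-in that pulls every coordinate halfway toward 1.0:

```python
def loop_until_fixed(block, x, tol=1e-6, max_iters=100):
    """Apply x = x + block(x) until the largest update is below tol.

    Returns the final state and the number of passes used.
    """
    for n in range(1, max_iters + 1):
        delta = block(x)
        x = [v + d for v, d in zip(x, delta)]
        if max(abs(d) for d in delta) < tol:
            return x, n
    return x, max_iters


def toy_block(x):
    """Hypothetical contractive block with a fixed point at 1.0."""
    return [0.5 * (1.0 - v) for v in x]


x, iters = loop_until_fixed(toy_block, [0.0, 4.0])
```

A block that cycles A->B->A instead of settling would never satisfy the tolerance check and would simply run into `max_iters`, which is why the cap is there.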

xlayn · 2026-03-19T01:56:27.000Z 1773885387

I explored that, again with Devstral, but running the same circuit 4 times led to lower scores on the tests.

I chatted with the model to see if it was still working and it seemed coherent to me; I didn't notice anything off.

I need to automate testing like that, where you pick the local maximum and then iterate over it, picking layers to see if the result is actually better, and then leave the thing running overnight.

smusamashah · 2026-03-19T08:40:16.000Z 1773909616

Can Karpathy's autoresearch be used on this to explore what works and what does not? That is supposed to automate research like this from what I understand.

imtringued · 2026-03-19T09:00:07.000Z 1773910807

That's how deep equilibrium models were discovered.

What's more, it was found that a single looped layer is enough to be equivalent to a multi-layer network.

BoredomIsFun · 2026-03-19T12:44:16.000Z 1773924256

Phi-4-14b with layers duplicated (phi-4-25b) has increased performance. Phi-4-49b has degraded vs 14b.

nowittyusername · 2026-03-19T01:39:15.000Z 1773884355

There's still a lot of low hanging fruit left IMO. Good find and rather funny to think about as you can have someone simply clone the various layers multiple times and instead of spending millions of dollars retraining the model increase performance significantly with "this one trick".

xlayn · 2026-03-19T01:53:49.000Z 1773885229

The other interesting point is that right now I'm copy-pasting the layers, but a patch in llama.cpp could make the same model behave better simply by following a different "flow", without needing more VRAM...

If this is validated enough, it could eventually lead to shipping some kind of "mix" architecture, with layers executed to fit some "vibe"?

Devstral was the first one I tried, and I optimized for math/EQ, but that didn't result in a better model. Then I added the reasoning part, and that resulted in a "better" model.

I used Devstral with the vibe.cli and it looked sharp to me, nothing failed. I also used the chat to "vibe" check it and it looked OK to me.

The other thing is that I picked a particular circuit and it was "good", but I don't know if it was a local maximum. I think I ran just about 10 sets of the "fast test harness" and picked the config that gave the highest score... Once I had that, I used that model and ran it against lm_eval limited to only 50 tests... again for the sake of speed, I didn't want to wait a week to discover the config was bad.
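The "different flow, no extra VRAM" point can be illustrated with a list of references: the execution order names layers by index, so a repeated layer is the same object visited twice rather than a copy. This is only a sketch; the `Layer` class, the `blk.N` names, and the 20-layer size are hypothetical, and it says nothing about how llama.cpp would actually implement it:

```python
class Layer:
    def __init__(self, name):
        self.name = name
        self.weights = [0.0] * 4  # placeholder for the real tensors


def build_pipeline(layers, order):
    """Return the layers to execute, by reference; no weights are copied."""
    return [layers[i] for i in order]


layers = [Layer(f"blk.{i}") for i in range(20)]

# Interleaved doubling pattern quoted in the submission: 13,13,14,14,15,15,16.
order = list(range(13)) + [13, 13, 14, 14, 15, 15, 16] + list(range(17, 20))
pipeline = build_pipeline(layers, order)

steps = len(pipeline)                              # layer executions per token
unique = len({id(layer) for layer in pipeline})    # distinct layer objects in memory
```

Here `steps` grows while `unique` stays at the original layer count. Each repeated pass would presumably still need its own KV-cache entries and compute time, so "same VRAM" is best read as "same weights".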

skerit · 2026-03-19T10:45:39.000Z 1773917139

I've been running my own (admittedly naïve) experiments of new, wacky ideas for both LLMs (well, SLMs) and for Image-Super-Resolution models.

I'm just trying different kinds of attention mechanisms, different configurations of the network, adding loops, ... All kind of wacky ideas. And the real weird thing is that 99% of the ideas I try work at all.

Lerc · 2026-03-19T11:05:57.000Z 1773918357

That weird part is kind of what I was expecting.

This goes to the thing that I posted on the thread a couple of days ago. https://news.ycombinator.com/item?id=47327132

What you need is a mechanism to pick the right looping pattern. Then it really does seem to be Mixture of Experts on a different level.

Break the model into input path, thinking, and output path, and make the thinking phase a single looping layer of many experts. Then the router gets to decide 13,13,14,14,15,15,16.

Training the router left as an exercise to the reader.

simgt · 2026-03-19T13:30:33.000Z 1773927033

If you're adding a model to do the "routing", you're basically adding learned backward connections, and you end up with an RNN.

Lerc · 2026-03-19T17:35:49.000Z 1773941749

Mixture of Experts already have routing models,

I'm just suggesting we eliminate (or weaken) the distinction between layers and experts and have just the one, then iterate that one until its 'good enough' score plus (iterationcount*spontaneity) is greater than some threshold.
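The halting rule described here can be written down directly. A minimal sketch, where `step` and `score` are hypothetical stand-ins for the looped expert layer and a learned "good enough" head, and the default constants are arbitrary:

```python
def iterate_until_good_enough(step, score, x, spontaneity=0.05,
                              threshold=1.0, max_iters=50):
    """Repeat step(x) until score(x) + iterations * spontaneity > threshold."""
    iterations = 0
    while iterations < max_iters:
        if score(x) + iterations * spontaneity > threshold:
            break
        x = step(x)
        iterations += 1
    return x, iterations


# Toy stand-ins: the state is a number creeping toward 1.0,
# and the "good enough" score is just the state itself.
x, n = iterate_until_good_enough(lambda v: v + 0.5 * (1.0 - v),
                                 lambda v: v, 0.0)
```

The `iterations * spontaneity` term acts as a patience budget: even if the score never improves, the loop stops after roughly `threshold / spontaneity` passes.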

Imanari · 2026-03-19T09:37:23.000Z 1773913043

Fascinating! I wonder if new training techniques could emerge from this. If we say layer-1=translater, layer2-5=reasoner, layer6 retranslater, could we train small 6 layer models but evaluate their performance in a 1>n*(2-5)>6 setup to directly train towards optimal middle-layers that can be looped? You'd only have to train 6 layers but get the duplication-benefit of the middle layers for free.

zozbot234 · 2026-03-19T12:32:57.000Z 1773923577

Yes, training directly for a diverse mix of "looped" inference procedures makes a lot of sense as a way of allowing for increased inference-time compute. It would likely be complementary to the usual thinking approach, which essentially runs the "loop" LLM-wide - and, critically, yields interpretable output which lets us see what the LLM is thinking about.

snats · 2026-03-19T02:59:15.000Z 1773889155

you can also remove layers from models and keep the same score in benchmarks [1].

i feel that sometimes a lot of the layers might just be redundant and are not fully needed once a model is trained.

[1] https://snats.xyz/pages/articles/pruningg.html

rao-v · 2026-03-19T01:13:28.000Z 1773882808

I’d love to believe this is real, but I’m pretty sure you will lose performance on a “fair” mix of tasks, even after fine tuning. I know multiple teams have explored recurrent layers (great for limited VRAM) but I don’t think it’s ever been found to be optimal.

zhangchen · 2026-03-19T01:35:46.000Z 1773884146

this lines up with what pruning papers have been finding, the middle layers carry most of the reasoning weight and you can often drop the outer ones without much loss. cool to see the inverse also works, just stacking them for extra passes.

m3kw9 · 2026-03-19T02:47:47.000Z 1773888467

What, you just randomly choose some "layer", duplicate it, and some arbitrary reasoning score goes from 0.2 -> 0.7? I don't know, man. You need to use real benchmarks.

3eb7988a1663 · 2026-03-19T03:44:02.000Z 1773891842

Someone recently posted the exact same idea to much acclaim: https://news.ycombinator.com/item?id=47322887

getnormality · 2026-03-19T03:44:02.000Z 1773891842

Didn't we recently see another hack, where you could get better performance by repeating the prompt?

I wonder if they work for similar reasons.

puppykito · 2026-03-19T12:47:22.000Z 1773924442

I find it so cute that making the LLM think twice before outputting something makes it smarter.

colejhudson · 2026-03-19T01:04:53.000Z 1773882293

Would you be able to publish the individual benchmarks for Qwen2.5-Coder-32B? GSM8K specifically would be useful to look at.

xlayn · 2026-03-20T01:54:11.000Z 1773971651

I updated the results with just the Devstral part, but ran the full suite for it, and posted all the results files as well as a script to re-run the process.

The results are more spectacular...

The model scored way better on gsm8k, but lost a bit on the other categories.

xlayn · 2026-03-19T01:37:52.000Z 1773884272

I published the results for devstral... results folder of the github https://github.com/alainnothere/llm-circuit-finder/tree/main...

I'm using the following configuration --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp I did also try humaneval but something in the harness is missing and failed...

Note that I'm running 50 tests for each task, mostly because of time limitations, as it takes about two hours to validate the run for the base model and the modified one.

I'll also try to publish the results of the small test harness I use when testing the multiple layer configurations. For reference, this is phi-4-Q6_K.gguf, still running. I'm now giving more importance to the Reason factor, which comes from running a small subset of all the problems in the task config above.

Initially I tried the approach of maximizing math/EQ, but it resulted in models that were less capable overall with the exception of math. And math, like in the original research, is basically how good the model is at giving you the answer to a really tough question, say the cubic root of some really large number... but that didn't translate to the model being better at other tasks...

  Config  | Lyr | Math   | EQ    | Reas   | Math Δ  | EQ Δ  | Reas Δ  | Comb Δ
  --------|-----|--------|-------|--------|---------|-------|---------|-------
  BASE    |   0 | 0.7405 | 94.49 | 94.12% |     --- |   --- |     --- |    ---
  (6,9)   |   3 | 0.7806 | 95.70 | 94.12% | +0.0401 | +1.21 |  +0.00% |  +1.21
  (9,12)  |   3 | 0.7247 | 95.04 | 94.12% | -0.0158 | +0.55 |  +0.00% |  +0.55
  (12,15) |   3 | 0.7258 | 94.14 | 88.24% | -0.0147 | -0.35 |  -5.88% |  -6.23
  (15,18) |   3 | 0.7493 | 95.74 | 88.24% | +0.0088 | +1.25 |  -5.88% |  -4.63
  (18,21) |   3 | 0.7204 | 93.40 | 94.12% | -0.0201 | -1.09 |  +0.00% |  -1.09
  (21,24) |   3 | 0.7107 | 92.97 | 88.24% | -0.0298 | -1.52 |  -5.88% |  -7.41
  (24,27) |   3 | 0.6487 | 95.27 | 88.24% | -0.0918 | +0.78 |  -5.88% |  -5.10
  (27,30) |   3 | 0.7180 | 94.65 | 88.24% | -0.0225 | +0.16 |  -5.88% |  -5.73
  (30,33) |   3 | 0.7139 | 94.02 | 94.12% | -0.0266 | -0.47 |  +0.00% |  -0.47
  (33,36) |   3 | 0.7104 | 94.53 | 94.12% | -0.0301 | +0.04 |  +0.00% |  +0.04
  (36,39) |   3 | 0.7017 | 94.69 | 94.12% | -0.0388 | +0.20 |  +0.00% |  +0.20
  (6,10)  |   4 | 0.8125 | 96.37 | 88.24% | +0.0720 | +1.88 |  -5.88% |  -4.01
  (9,13)  |   4 | 0.7598 | 95.08 | 94.12% | +0.0193 | +0.59 |  +0.00% |  +0.59
  (12,16) |   4 | 0.7482 | 93.71 | 88.24% | +0.0076 | -0.78 |  -5.88% |  -6.66
  (15,19) |   4 | 0.7617 | 95.16 | 82.35% | +0.0212 | +0.66 | -11.76% | -11.10
  (18,22) |   4 | 0.6902 | 92.27 | 88.24% | -0.0504 | -2.23 |  -5.88% |  -8.11
  (21,25) |   4 | 0.7288 | 94.10 | 88.24% | -0.0117 | -0.39 |  -5.88% |  -6.27
  (24,28) |   4 | 0.6823 | 94.57 | 88.24% | -0.0583 | +0.08 |  -5.88% |  -5.80
  (27,31) |   4 | 0.7224 | 94.41 | 82.35% | -0.0181 | -0.08 | -11.76% | -11.84
  (30,34) |   4 | 0.7070 | 94.73 | 94.12% | -0.0335 | +0.23 |  +0.00% |  +0.23
  (33,37) |   4 | 0.7009 | 94.38 |100.00% | -0.0396 | -0.12 |  +5.88% |  +5.77
  (36,40) |   4 | 0.7057 | 94.84 | 88.24% | -0.0348 | +0.35 |  -5.88% |  -5.53
  (6,11)  |   5 | 0.8168 | 95.62 |100.00% | +0.0762 | +1.13 |  +5.88% |  +7.02
  (9,14)  |   5 | 0.7245 | 95.23 | 88.24% | -0.0160 | +0.74 |  -5.88% |  -5.14
  (12,17) |   5 | 0.7825 | 94.88 | 88.24% | +0.0420 | +0.39 |  -5.88% |  -5.49
  (15,20) |   5 | 0.7832 | 95.86 | 88.24% | +0.0427 | +1.37 |  -5.88% |  -4.52
  (18,23) |   5 | 0.7208 | 92.42 | 88.24% | -0.0197 | -2.07 |  -5.88% |  -7.95
  (21,26) |   5 | 0.7055 | 92.89 | 88.24% | -0.0350 | -1.60 |  -5.88% |  -7.48
  (24,29) |   5 | 0.5825 | 95.04 | 94.12% | -0.1580 | +0.55 |  +0.00% |  +0.55
  (27,32) |   5 | 0.7088 | 94.18 | 88.24% | -0.0317 | -0.31 |  -5.88% |  -6.19
  (30,35) |   5 | 0.6787 | 94.69 | 88.24% | -0.0618 | +0.20 |  -5.88% |  -5.69
  (33,38) |   5 | 0.6650 | 94.96 | 88.24% | -0.0755 | +0.47 |  -5.88% |  -5.41
  (6,12)  |   6 | 0.7692 | 95.39 | 94.12% | +0.0287 | +0.90 |  +0.00% |  +0.90
  (9,15)  |   6 | 0.7405 | 94.65 | 94.12% | -0.0000 | +0.16 |  +0.00% |  +0.16
  (12,18) |   6 | 0.7582 | 94.57 | 88.24% | +0.0177 | +0.08 |  -5.88% |  -5.80
  (15,21) |   6 | 0.7828 | 93.52 | 88.24% | +0.0423 | -0.98 |  -5.88% |  -6.86
  (18,24) |   6 | 0.7308 | 92.93 | 94.12% | -0.0097 | -1.56 |  +0.00% |  -1.56
  (21,27) |   6 | 0.6791 | 92.54 | 82.35% | -0.0615 | -1.95 | -11.76% | -13.72
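
For anyone reading the table cold, here is a minimal sketch of what an `(i, j)` row would mean, assuming (as in the RYS-style setup discussed in this thread) that layers `i` through `j-1` run a second time before the model continues. That reading matches the block-size column being `j - i`. The function name and the 40-layer example are illustrative, not taken from the poster's code.

```python
def duplicated_layer_order(n_layers: int, start: int, end: int, passes: int = 2) -> list[int]:
    """Layer indices in execution order when block [start, end) runs `passes` times.

    Assumption: (start, end) is half-open, so the block holds end - start layers.
    """
    if not 0 <= start < end <= n_layers:
        raise ValueError("need 0 <= start < end <= n_layers")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    block = list(range(start, end))
    # Layers before the block, the block repeated, then the remaining layers.
    return list(range(start)) + block * passes + list(range(end, n_layers))


if __name__ == "__main__":
    # Hypothetical 40-layer model, row (27,30) from the table: layers 27..29 run twice.
    order = duplicated_layer_order(40, 27, 30)
    print(len(order), order[25:35])
```

With `passes=3` the same helper gives a triple-pass variant. In this picture nothing is copied; only the order in which existing layers are visited changes.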

XCSme · 2026-03-19T01:28:14.000Z 1773883694

But if it got worse on other tests, it doesn't do much good, right?

ekianjo · 2026-03-19T01:30:12.000Z 1773883812

Which tests are worse?

XCSme · 2026-03-19T01:32:57.000Z 1773883977

Hard to tell. They only mention a few that got better, with no clear results for the others.

xlayn · 2026-03-19T02:00:02.000Z 1773885602

You can check the results for Devstral here. Speed limits me, so these are the results for only the first 50 tests of each task, from this command:

  # Run lm-evaluation-harness
  lm_eval --model local-chat-completions \
      --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
      --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
      --apply_chat_template --limit 50 \
      --output_path ./eval_results
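
One caveat on `--limit 50`: with only 50 items per task, single-run accuracies are noisy. Below is a rough sketch of the size of that noise, using the plain binomial standard error. The function names are illustrative, and the check treats the two runs as independent, which is a simplification since both runs see the same 50 items.

```python
import math


def binomial_stderr(accuracy: float, n_items: int) -> float:
    """Standard error of an accuracy measured on n_items pass/fail items."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def within_noise(acc_base: float, acc_new: float, n_items: int, z: float = 2.0) -> bool:
    """True if the difference is smaller than z combined standard errors."""
    combined = math.hypot(
        binomial_stderr(acc_base, n_items),
        binomial_stderr(acc_new, n_items),
    )
    return abs(acc_new - acc_base) < z * combined


if __name__ == "__main__":
    n = 50  # matches --limit 50 in the command above
    print(f"stderr at 50% accuracy: {binomial_stderr(0.5, n):.3f}")
    # A jump the size of the headline .22 -> .76, if measured on 50 items.
    print("0.22 -> 0.76 within noise:", within_noise(0.22, 0.76, n))
    # A few-point shift on the same sample size.
    print("0.70 -> 0.74 within noise:", within_noise(0.70, 0.74, n))
```

The practical upshot is that large jumps survive a 50-item sample, while differences of a few points in either direction are hard to tell apart from chance at this sample size.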

BoredomIsFun · 2026-03-19T12:47:28.000Z 1773924448

please post it on /r/localllama

gukoff · 2026-03-19T12:50:33.000Z 1773924633

How do you run these models on AMD GPUs?

DiabloD3 · 2026-03-19T17:54:57.000Z 1773942897

The same way you normally would, using llama.cpp.

seertaak · 2026-03-20T08:51:44.000Z 1773996704

This -- and obviously David Ng's article -- are absolutely fascinating pieces of work.

I have a few (very naive) questions:

There is a widespread intuition, encapsulated in the very terms "feed-forward networks" and "deep neural networks", that computation in such networks is akin to a circuit wired in series. My "observation" is that residual layers offer an "escape hatch" from this, allowing layers (or sets of layers) to operate in parallel (or, of course, somewhere in between).

So here are my dumb questions:

1. Is my intuition correct that residual networks, at least in principle, allow layers to operate in parallel? Or am I missing something fundamental? Let's say the intuition is correct -- is it possible to measure the degree to which a layer operates in series or in parallel?

2. The formula for residual layers (at least to my mind) reminds me of an Ornstein-Uhlenbeck time series process. If so, can we measure the degree of mean-reversion of a/several layer(s)? For me, this makes intuitive sense -- the goal of avoiding vanishing gradients feels similar to the goal of stationarity in time series processes.

3. Let's take as an article of faith the central idea of a tripartite network: input->latentspace block => reasoning block => latentspace->output block. Ng's intuition, iiuc, is that the reasoning block is, more or less, wired in series. Intuitively, it feels like that is what it ought to be (i.e., a chain of calculations), though I'll add -- again hand-wavingly -- that OP's efforts appear to cast doubt on this conjecture. Are the two "translation" blocks wired "more" in parallel, then?

4. So what both Ng and OP did was to "tape together" the ostensibly reasoning layers -- in different ways but that's essentially it. Another thing you could do is to treat the input and output translation blocks as fixed. You now train a totally new model on a much smaller corpus of training data, only instead of feeding the input directly to your new model you feed it translated training data (similarly, your targets are now the activations at the entrance to the reasoning->output block). Let's assume it's exactly the same architecture in the middle as the standard network, only it's initialized to random weights as per usual. Surely you should be able to pre-train that 6-layer reasoning network much, much faster. Has anyone tried this?

5. Having thus partitioned a very deep architecture into three distinct parts, there's no reason why you can't experiment with making the reasoning block wider or narrower. Has anyone tried that?

6. Another fun idea is to map a given input through the input block and read the pre-reasoning activations. You now let that vector be a random variable and do a random walk through reasoning input space, and use this to "augment" your corpus of training data. Reasonable idea or bullshit?
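
On question 1, here is a toy way to make "series vs. parallel" measurable. With residual updates, two layers compute x + f(x) + g(x + f(x)); if g would have written the same update without seeing f's contribution, the pair is effectively parallel. The metric and all names below are illustrative assumptions, not an established measure, and the "layers" are hand-written toy functions rather than transformer blocks.

```python
import math
from typing import Callable

Vector = list[float]
Layer = Callable[[Vector], Vector]  # returns the update added to the residual stream


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def _sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def _norm(a: Vector) -> float:
    return math.sqrt(sum(x * x for x in a))


def serial_dependence(first: Layer, second: Layer, x: Vector) -> float:
    """Relative change in `second`'s update caused by seeing `first`'s update.

    0.0 means `second` writes the same update either way (effectively parallel);
    larger values mean it depends on what `first` wrote (more serial).
    """
    after_first = _add(x, first(x))      # residual step: x + f(x)
    update_serial = second(after_first)  # g(x + f(x)), the real forward pass
    update_parallel = second(x)          # g(x), as if both layers read the same input
    denom = max(_norm(update_serial), _norm(update_parallel))
    if denom == 0.0:
        return 0.0
    return _norm(_sub(update_serial, update_parallel)) / denom


# Toy layers: one writes to dimension 0, the others read one dimension each.
def writes_dim0(x: Vector) -> Vector:
    return [0.1 * x[0], 0.0]


def reads_dim1(x: Vector) -> Vector:
    return [0.0, 0.5 * x[1]]


def reads_dim0(x: Vector) -> Vector:
    return [0.0, 0.5 * x[0]]


if __name__ == "__main__":
    x = [1.0, 2.0]
    print("independent pair:", serial_dependence(writes_dim0, reads_dim1, x))
    print("dependent pair:  ", serial_dependence(writes_dim0, reads_dim0, x))
```

In a real model the analogous experiment would compare a block's output with and without the preceding block's contribution in the residual stream; the toy only shows the arithmetic.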

Please remember, I'm only just (and belatedly) trying to wrap my head around how transformer architectures work -- I'm still waiting for my copy of "Build a Large Language Model (from scratch)"! I hope these questions aren't totally daft!
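
On question 2, it may help to write the two update rules side by side. This is only a sketch of the analogy: it ignores normalization layers, and the symbols \theta, \mu, \sigma are the usual Ornstein-Uhlenbeck parameters, not quantities defined anywhere in this thread.

```latex
% Residual layer update (layer index l):
x_{l+1} = x_l + f_l(x_l)

% Euler-Maruyama step of an Ornstein-Uhlenbeck process (time step \Delta t):
x_{t+\Delta t} = x_t + \theta\,(\mu - x_t)\,\Delta t + \sigma\sqrt{\Delta t}\,\varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, 1)

% The analogy holds to the extent that f_l(x_l) \approx \theta_l\,(\mu - x_l):
% mean reversion needs the layer's update to point back toward some fixed \mu.
```

One hedged way to probe this would be to regress each layer's update f_l(x_l) on x_l and look for a negative coefficient. Whether trained transformers actually show that is an empirical question this thread does not answer.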

edg5000 · 2026-03-19T12:51:41.000Z 1773924701

This is very cool

jacquesm · 2026-03-19T20:03:07.000Z 1773950587

> No weights change. No training. The model just thinks longer.

...

Singlaw · 2026-03-19T02:30:24.000Z 1773887424

What does this do?

BoredomIsFun · 2026-03-19T12:42:13.000Z 1773924133

Phi-4-25 is another example.