Prompt Details

Model

Chat - (gpt-5.5)

Token size

581

Example input

[CORE_CONCEPT]: Chain-of-Thought Reasoning
[TIMEFRAME]: 2015–2022
[LANDMARK_PAPER]: "Attention Is All You Need" (Vaswani et al., 2017)
[RECENT_FOCUS]: linear attention mechanisms, 2023–2025

Example output

PART 1: Concept pre-history of Chain-of-Thought Reasoning, 2015–2022

1. “End-To-End Memory Networks” (Sukhbaatar, Szlam, Weston, Fergus, 2015)

This paper belongs to the period when “reasoning” in neural NLP was still framed as multi-hop retrieval over memory, rather than as explicit verbal reasoning. The core problem was that conventional RNNs compressed everything into a hidden state, making it hard to retrieve multiple supporting facts for question answering. End-to-End Memory Networks introduced an external memory with recurrent attention “hops,” allowing a model to read from memory several times before producing an answer. The key mechanism was not a natural-language chain of thought, but a latent chain of attentional reads: each hop updated the query representation and allowed the model to focus on another relevant memory slot. This quietly foreshadowed CoT by treating reasoning as a sequence of intermediate operations rather than a single forward pass. Its limitation was that the intermediate steps were mostly hidden and not linguistically inspectable. That gap pushed later work toward either executable intermediate programs or human-readable rationales: the model could “reason” in multiple steps, but the user could not see the reasoning.

2. “Neural Programmer: Inducing Latent Programs with Gradient Descent” (Neelakantan, Le, Sutskever, 2015)

Where Memory Networks made reasoning look like repeated retrieval, Neural Programmer made it look like latent program execution. The problem was that neural networks were poor at arithmetic, table lookup, and logical operations, especially when tasks required composing several operations. The paper augmented a neural model with differentiable arithmetic and logic operators, then learned which operator to apply and where, using only answer supervision rather than full program annotations.
Its key insight was that multi-step reasoning might be decomposed into a sequence of learned operation calls, even when the operation trace was latent. This influenced the CoT lineage by strengthening the idea that “reasoning” should be represented as an intermediate trajectory, not only as input-output mapping. But the correction came from later explanation and prompting work: hand-designed operation sets and synthetic table tasks did not scale naturally to open-ended language. Later natural-language rationale papers replaced latent symbolic operators with human-readable textual justifications; later LLM papers replaced specialized modules with general-purpose sequence prediction.

3. “Rationalizing Neural Predictions” (Lei, Barzilay, Jaakkola, 2016)

This paper shifted the intermediate-step idea from programs to rationales. The core problem was practical interpretability: neural predictions without justification were hard to trust, but attention weights alone were not reliable explanations. Lei et al. proposed a generator-encoder architecture in which the generator selected short, coherent snippets of the input as rationales, and the encoder made the prediction from those snippets. The key mechanism was a constrained latent selection process: rationales should be sparse, coherent, and sufficient for the downstream answer. This did not yet produce step-by-step reasoning, but it made the model’s intermediate evidence visible. Its influence on the CoT lineage was conceptual: before models were asked to “think step by step,” researchers first asked them to expose why an answer was plausible. The next correction was that extracted rationales could only point to existing input text; they could not generate missing commonsense bridges, arithmetic transformations, or abstract deductions. That limitation led to free-form natural-language explanation datasets such as e-SNLI and CoS-E.

4.
“Attention Is All You Need” (Vaswani et al., 2017)

The Transformer was not a CoT paper, but it became one of the enabling conditions for CoT. The problem it addressed was architectural: sequence transduction systems relied heavily on recurrence or convolution, limiting parallelization and long-range dependency modeling. Vaswani et al. replaced recurrence with self-attention and multi-head attention, making sequence modeling more parallelizable and scalable. Its key mechanism was to let each token condition on other tokens through learned attention patterns, stacked across layers. For the CoT lineage, this mattered because later GPT-style LLMs inherited the Transformer’s ability to model long textual contexts, few-shot demonstrations, and generated continuations. The correction introduced by later CoT work was behavioral rather than architectural: even very large Transformers did not automatically perform reliable multi-step reasoning when prompted for direct answers. The architecture made long-form reasoning possible as text continuation, but prompting work had to discover that demonstrations containing intermediate steps could elicit that capacity.

5. “e-SNLI: Natural Language Inference with Natural Language Explanations” (Camburu, Rocktäschel, Lukasiewicz, Blunsom, 2018)

e-SNLI made a decisive move from extractive rationales to free-form natural-language explanations. The problem was that classification labels such as entailment, contradiction, and neutral did not reveal whether a model understood the relation between premise and hypothesis. The paper extended SNLI with human-written explanations and trained models both to predict labels and to output explanations. The key insight was that explanations could become part of the supervised learning signal and part of the model output. In CoT terms, this helped normalize the idea that an NLP model’s answer could include a linguistic intermediate product, not merely a class label.
The next correction came from commonsense QA: NLI explanations were useful, but they were still attached to a relatively constrained inference task. CoT-style reasoning needed explanations that could bridge unstated world knowledge, not just explain premise-hypothesis relations. That pressure led to CoS-E, where explanations were collected for commonsense multiple-choice reasoning.

6. “Explain Yourself! Leveraging Language Models for Commonsense Reasoning” (Rajani, McCann, Xiong, Socher, 2019)

CoS-E brought natural-language explanations into commonsense QA. The core problem was that deep models performed poorly on tasks requiring background knowledge not explicitly stated in the input. The paper collected human explanations for CommonsenseQA and introduced CAGE, a framework where language models generated explanations that could be used during training and inference. The mechanism was important for CoT: a model could improve answer selection by producing an explanatory sentence that served as a bridge between question and answer. The reported gain (about 10% over prior state of the art on CommonsenseQA) made explanations look like useful computational scaffolding, not just interpretability decoration. The correction came from GPT-3 and later CoT work: CoS-E still required task-specific explanation data and training pipelines. The next leap was to discover that sufficiently large pretrained LMs could use examples in the prompt itself as a temporary reasoning format, without retraining a dedicated explanation generator.

7. “Language Models are Few-Shot Learners” (Brown et al., 2020)

GPT-3 reframed the question. Instead of building task-specific reasoning modules or explanation datasets, it asked whether a large autoregressive Transformer could infer tasks from textual demonstrations alone. The core problem was the dependence of NLP systems on task-specific fine-tuning data.
GPT-3 showed that scaling an autoregressive LM to 175B parameters produced strong zero-shot and few-shot performance across many tasks, using prompts as the interface. The mechanism that mattered for CoT was in-context learning: examples in the prompt could shape the model’s continuation behavior without weight updates. But GPT-3 also exposed a limitation that directly motivated CoT: direct-answer prompting remained weak on multi-step arithmetic, symbolic, and compositional reasoning. The model had the capacity to continue text, but the prompt format often asked it to compress reasoning into one answer. The next papers corrected this by making intermediate solution paths explicit, first through benchmarks and verifiers, then through scratchpads and chain-of-thought demonstrations.

8. “Training Verifiers to Solve Math Word Problems” (Cobbe et al., 2021)

GSM8K and verifier training made multi-step mathematical reasoning a central stress test for LMs. The problem was diagnostic: large models could produce fluent text, but they still could not reliably solve grade-school word problems requiring several reasoning steps. The paper introduced GSM8K, a dataset of 8.5K math word problems with natural-language solutions, and proposed generating many candidate completions, then training a verifier to rank them. The key mechanism was selection over solution paths: instead of trusting one greedy answer, produce multiple possible derivations and choose the most credible. This directly influenced later CoT and self-consistency work. However, its limitation was also clear: the verifier was an extra trained component, and the model’s reasoning ability was still evaluated through candidate generation plus external ranking. CoT prompting simplified the interface: put worked solution steps into the prompt and let the LM generate its own intermediate reasoning. Self-consistency later inherited the “many candidate paths” idea but removed the trained verifier.

9.
“Show Your Work: Scratchpads for Intermediate Computation with Language Models” (Nye et al., 2021)

Scratchpads were the immediate technical ancestor of CoT. The core problem was that LMs struggled with computations requiring many latent intermediate states, such as long addition or program execution. Nye et al. showed that models performed much better when trained or prompted to emit intermediate computation steps into a “scratchpad.” The key insight was simple but powerful: externalizing intermediate state reduces the burden on hidden activations. Instead of forcing a Transformer to silently maintain every carry, variable, or program state, the model writes them into the sequence and attends back to them. This is very close to CoT, except that scratchpads were more computation-oriented and often involved supervised intermediate formats. The next correction was to generalize from structured scratchpads to natural-language rationales across reasoning tasks. CoT prompting showed that a few demonstrations containing intermediate reasoning steps could unlock arithmetic, commonsense, and symbolic reasoning in sufficiently large LMs.

10. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (Wei et al., 2022)

This is the naming breakthrough. The problem was that LMs often failed on tasks requiring multiple inferential steps when prompted only with input-output examples. Wei et al. proposed including demonstrations where each answer was preceded by a chain of intermediate reasoning steps. The mechanism was not architectural; it was representational and prompt-based. The model was encouraged to continue a pattern of “question → reasoning → answer,” turning reasoning into a language-generation trajectory. The paper showed gains across arithmetic, commonsense, and symbolic reasoning, with particularly strong results when model scale was large enough. Its inheritance from scratchpads is direct: intermediate computation is written into the context.
Its inheritance from e-SNLI and CoS-E is also visible: reasoning is expressed in natural language. But CoT corrected both traditions by showing that reasoning traces could be elicited through prompting rather than only through specialized datasets or modules. Its own limitation was dependence on hand-written exemplars and vulnerability to brittle, single-path decoding.

11. “Large Language Models are Zero-Shot Reasoners” (Kojima et al., 2022)

Zero-shot CoT asked whether explicit worked examples were really necessary. The core problem was prompt engineering cost: CoT worked well, but few-shot reasoning demonstrations had to be manually designed for each task. Kojima et al. showed that simply appending a phrase such as “Let’s think step by step” could elicit intermediate reasoning in large language models. The mechanism was a minimal natural-language control signal that shifted the model from direct-answer mode into reasoning-generation mode. This paper corrected the assumption that CoT was mainly a few-shot learning phenomenon. It suggested that reasoning formats had become latent capabilities in large LMs and could be activated by a generic instruction. The limitation was reliability: a single generated reasoning path could still be wrong, and the model could produce plausible but invalid explanations. That weakness led naturally to self-consistency, which treated reasoning as a distribution over possible paths rather than one deterministic trace.

12. “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (Wang et al., 2022)

Self-consistency corrected the fragility of greedy CoT decoding. The problem was that a model’s first reasoning path might be locally plausible but globally wrong. Wang et al. proposed sampling multiple diverse reasoning chains and selecting the answer that appeared most consistently across them.
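The sample-then-vote step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the chains in `sampled_chains` are hand-written stand-ins for outputs that would in practice be sampled from a model at non-zero temperature, and `extract_answer` and `self_consistent_answer` are hypothetical helper names. Taking the last number in a chain as its final answer is a simplifying assumption.

```python
import re
from collections import Counter


def extract_answer(chain):
    """Return the last number mentioned in a reasoning chain, or None.

    Treating the last number as the final answer is a simplifying assumption.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def self_consistent_answer(chains):
    """Majority vote over the final answers of several reasoning chains."""
    answers = [a for a in (extract_answer(c) for c in chains) if a is not None]
    if not answers:
        return None
    # most_common(1) returns the single most frequent answer and its count.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampled chains for one question (hand-written for illustration).
sampled_chains = [
    "There are 3 boxes with 4 pens each, so 3 * 4 = 12. The answer is 12.",
    "4 pens per box times 3 boxes gives 12 pens. The answer is 12.",
    "3 + 4 = 7. The answer is 7.",
]

print(self_consistent_answer(sampled_chains))
```

Two of the three chains end in the same answer, so the vote selects it even though one chain went wrong. The vote only looks at final answers; it says nothing about whether any individual chain is a faithful explanation.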
The mechanism was answer marginalization over reasoning paths: reasoning is no longer one trace but a set of possible trajectories converging on an answer. Empirically, the method produced large gains on arithmetic and commonsense benchmarks, including reported improvements of +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, +6.4% on StrategyQA, and +3.9% on ARC-Challenge. This completed the 2015–2022 arc: hidden multi-hop computation became visible scratchpad reasoning, then promptable natural-language reasoning, then stochastic path aggregation. Its remaining gap is still central today: self-consistency improves answer accuracy, but it does not guarantee that the selected chain is faithful, efficient, or grounded in external evidence.

PART 2: Breakthrough follow-ups to “Attention Is All You Need” in linear attention mechanisms, 2023–2025

1. “Retentive Network: A Successor to Transformer for Large Language Models” (Yutao Sun et al., 2023, arXiv)

Limitation of the anchor addressed. The original Transformer made sequence modeling highly parallel, but full self-attention scales quadratically with sequence length, and autoregressive decoding requires a KV cache that grows with every generated token. RetNet targets this inference-time and long-sequence cost while trying to keep Transformer-like training parallelism.

Technical mechanism. RetNet introduces a retention mechanism that can be computed in three equivalent modes: parallel for training, recurrent for decoding, and chunkwise recurrent for long sequences. The key architectural idea is to bridge attention and recurrence: instead of storing all past keys and values, the model maintains a compressed recurrent state with decay-like retention dynamics.

Evidence of improvement. The paper reports favorable scaling on language modeling, parallel training, lower-cost deployment, and efficient inference.
Its claimed advantage is the combination of training parallelism, low-cost recurrent inference, and competitive language-modeling performance.

What remains unsolved. RetNet compresses history, so it does not fully preserve the exact token-level random access of softmax attention. This makes the broader recall problem central: how much can a fixed or compressed state remember without losing long-range associative retrieval? Later papers such as BASED a
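The equivalence between RetNet's parallel and recurrent modes, described in the entry above, can be checked numerically with a stripped-down sketch. It assumes a single head with a scalar decay `gamma` and omits the paper's positional rotation, normalization, gating, and chunkwise mode. The function names are hypothetical, and pure-Python lists are used only to keep the example dependency-free.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: O[n] = sum over m <= n of gamma**(n - m) * (Q[n] . K[m]) * V[m]."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + outer(K[n], V[n]); O[n] = Q[n] @ S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    # The state has a fixed d_k x d_v size, independent of sequence length.
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: 6 tokens, key dimension 4, value dimension 3.
Q, K, V = rand_matrix(6, 4), rand_matrix(6, 4), rand_matrix(6, 3)
diff = max_abs_diff(retention_parallel(Q, K, V, 0.9),
                    retention_recurrent(Q, K, V, 0.9))
print(diff < 1e-9)
```

The recurrent version never stores past keys or values, only the running state `S`, which is where the decoding-cost saving comes from. It is also where the recall question arises: everything the model can retrieve later has to survive in that fixed-size, decayed state.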
