Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman
Quiet-STaR is a significant evolution of the Self-Taught Reasoner (STaR) algorithm, addressing key limitations of its predecessor. Unlike STaR, which was confined to specific question-answering datasets, Quiet-STaR generalizes reasoning to unstructured text, enabling language models (LMs) to autonomously generate a rationale for each token. This broad applicability rests on several methodological innovations: a tokenwise parallel sampling algorithm that keeps rationale generation computationally tractable, learned meta-tokens that mark where internal thoughts begin and end, and a mixing head mechanism that dynamically balances base model predictions with rationale-informed ones, mitigating the performance dip caused by initially out-of-distribution thoughts.
The potential impact of Quiet-STaR is substantial: it complements chain-of-thought (CoT) prompting by enhancing the depth and coherence of recursive answer generation. This symbiotic relationship underscores Quiet-STaR's capacity to foster deeper thinking in LMs, making it a valuable tool for researchers and practitioners pushing the boundaries of artificial intelligence. The efficiency gains and robust performance also suggest promising applications in diverse, real-world scenarios beyond academic benchmarks. Future research can build on these findings by exploring dynamic rationale generation, broader cross-task evaluations, and the integration of human feedback to further close the gap between machine and human-like reasoning. Quiet-STaR's advances not only refine reasoning in LMs but also pave the way for more sophisticated, contextually aware AI systems such as OpenAI's recent o1 model. The authors have released an official implementation and models; this blog post focuses on using Quiet-STaR to improve chain-of-thought recursively and to simulate OpenAI's o1 with the lightweight GPT-4o-mini model.
Mind Map
A high-resolution PDF and an interactive HTML version can be found in the GitHub repo.
Highlights explained
1. Generalization of STaR to Unstructured Text
Explanation
Quiet-STaR generalizes the Self-Taught Reasoner (STaR) method from question-answering datasets to unstructured text data. This allows language models (LMs) to generate rationales at each token to help predict future text in a variety of contexts.
Significance
This broadens the application of reasoning capabilities from niche, curated datasets to the vast and diverse array of natural text available on the internet, enhancing the LM’s utility and robustness.
Relation to Existing Work
While STaR was confined to specific QA datasets, Quiet-STaR leverages the diverse tasks present in language modeling, embodying the idea that “language models are unsupervised multitask learners” (Radford et al., 2019). This shift underscores Quiet-STaR’s adaptability and scalability.
2. Parallel Rationale Generation Algorithm
Explanation
Quiet-STaR employs a tokenwise parallel sampling algorithm that generates rationales for every token position in parallel. This is achieved with a custom attention mask: each rationale attends only to the original text up to its position and to its own preceding thought tokens, so all rationales can be continued independently in a single batched pass.
Significance
This method drastically improves computational efficiency, making it feasible to apply rationale generation to longer sequences and larger datasets without excessive computational cost.
Relation to Existing Work
This approach adapts standard batched-inference techniques to rationale generation, parallelizing it without loss of correctness. It contrasts with the sequential rationale generation used in STaR, removing a significant scalability bottleneck.
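To make the masking idea concrete, here is a minimal PyTorch sketch of an attention mask that lets every position's thought be continued in the same batched pass. The shapes and the helper name parallel_thought_mask are illustrative assumptions, not the paper's exact implementation.

import torch

def parallel_thought_mask(seq_len: int, thought_len: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, thought_len, seq_len + thought_len).

    Entry [i, j, k] is True when the j-th thought token started at position i
    may attend to key k. Keys 0..seq_len-1 are the original text; the last
    thought_len keys are that position's own thought tokens.
    """
    positions = torch.arange(seq_len)
    # Each thought may see the original text up to and including its own position...
    sees_text = (positions[None, None, :] <= positions[:, None, None]).expand(seq_len, thought_len, seq_len)
    # ...plus its own earlier thought tokens (causal within the thought),
    # but never the thoughts spawned at other positions.
    steps = torch.arange(thought_len)
    sees_thought = (steps[None, None, :] <= steps[None, :, None]).expand(seq_len, thought_len, thought_len)
    return torch.cat([sees_text, sees_thought], dim=-1)

print(parallel_thought_mask(seq_len=5, thought_len=3).shape)  # torch.Size([5, 3, 8])

Because no thought attends to any other position's thought, all seq_len rationales can be sampled with one forward pass per thought token rather than one full generation loop per position.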
3. Custom Meta-Tokens for Thought Delimitation
Explanation
The introduction of learned meta-tokens, specifically “start-of-thought” and “end-of-thought” tokens, guides the LM to delineate and utilize internal rationales effectively.
Significance
These tokens enable the model to transition seamlessly into and out of “thinking mode,” improving the structure and utility of generated rationales without requiring extensive re-training or manual intervention.
Relation to Existing Work
Meta-tokens have been explored in other contexts (e.g., prompt-tuning by Lester et al., 2021), but Quiet-STaR’s implementation focuses on using them to optimize internal thought processes, enhancing the LM’s reasoning capabilities.
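As a minimal sketch of this idea, the snippet below registers the two meta-tokens with a Hugging Face tokenizer and initializes their embeddings from an existing "pause-like" token, following the paper's description of initializing from the em dash; the exact token strings are illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # the base model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the start/end-of-thought meta-tokens and grow the embedding matrix.
tokenizer.add_special_tokens({"additional_special_tokens": ["<|startofthought|>", "<|endofthought|>"]})
model.resize_token_embeddings(len(tokenizer))

# Initialize the new embeddings from an existing pause-like token ("---"),
# so training starts from something meaningful rather than random noise.
dash_id = tokenizer("---", add_special_tokens=False).input_ids[0]
embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for token in ["<|startofthought|>", "<|endofthought|>"]:
        embeddings[tokenizer.convert_tokens_to_ids(token)] = embeddings[dash_id].clone()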
4. Mixing Head for Logit Integration
Explanation
A mixing head, which is a shallow multi-layer perceptron (MLP), combines the LM’s predictions with and without thoughts, adjusting the influence of rationale-based predictions dynamically.
Significance
This integration mechanism smooths the transition from base LM predictions to those informed by rationales, mitigating initial performance degradation due to out-of-distribution thoughts and stabilizing training.
Relation to Existing Work
The concept of integrating multiple sources of predictive information is reminiscent of attention mechanisms and residual connections in neural networks, but applied uniquely here to combine rationales and direct predictions.
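A minimal PyTorch sketch of such a mixing head follows; the layer sizes and the choice to condition on both hidden states are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

class MixingHead(nn.Module):
    """Shallow MLP that interpolates base and thought-conditioned logits."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
            nn.Sigmoid(),  # mixing weight w in [0, 1]
        )

    def forward(self, h_base, h_thought, logits_base, logits_thought):
        # w == 0 recovers the base LM exactly, so out-of-distribution thoughts
        # cannot hurt the pretrained predictions early in training.
        w = self.mlp(torch.cat([h_base, h_thought], dim=-1))  # (batch, seq, 1)
        return w * logits_thought + (1.0 - w) * logits_base

head = MixingHead(hidden_size=64)
h_base, h_thought = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
logits_base, logits_thought = torch.randn(2, 10, 1000), torch.randn(2, 10, 1000)
print(head(h_base, h_thought, logits_base, logits_thought).shape)  # torch.Size([2, 10, 1000])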
5. Complementary to Chain-of-Thought Prompting
Explanation
Quiet-STaR enhances reasoning capabilities when combined with chain-of-thought (CoT) prompting. Internal rationales generated by Quiet-STaR improve the coherence and structure of externally prompted CoT reasoning.
Significance
This complementary nature suggests that Quiet-STaR can be used alongside existing reasoning frameworks to further improve the depth and quality of recursive answer generation, enabling deeper and more thorough reasoning processes.
Relation to Existing Work
CoT prompting, demonstrated by works like Kojima et al. (2022), benefits from Quiet-STaR’s internal rationale generation, showing that the two methods are orthogonal and mutually beneficial, thus expanding the toolkit for advanced reasoning in LMs.
Code
The following code illustrates the Quiet-STaR idea of improving CoT by producing multiple thoughts at each step and using their "helpfulness" scores to evaluate the reasoning recursively toward the best possible chain. It shows promising results with the lightweight GPT-4o-mini model. An interesting example is counting the letter "r" in the word "strawberry" and in its variant "strawberrrrry": many CoT implementations that handle the "strawberry" case fail on "strawberrrrry" because they lack evaluation and backtracking.
pip install pydantic openai dspy
import os
import json
from typing import List
from pydantic import BaseModel, Field
from openai import OpenAI
import dspy
# Set up OpenAI client
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
# Pydantic models for structured data handling
class Thought(BaseModel):
    content: str = Field(..., description="The content of the thought")
    helpfulness: float = Field(..., description="A score from 0 to 1 indicating how helpful the thought was")

class PredictionResult(BaseModel):
    original_text: str = Field(..., description="The original input text")
    next_token: str = Field(..., description="The predicted next token")
    thoughts: List[Thought] = Field(..., description="List of thoughts generated before prediction")
    confidence: float = Field(..., description="Confidence score for the prediction")

class ReasoningStep(BaseModel):
    step: str = Field(..., description="A step in the chain-of-thought reasoning process")
    thoughts: List[Thought] = Field(..., description="Quiet-STaR thoughts generated for this step")

class ChainOfThoughtResult(BaseModel):
    question: str = Field(..., description="The original question")
    reasoning_steps: List[ReasoningStep] = Field(..., description="List of reasoning steps with associated thoughts")
    answer: str = Field(..., description="The final answer to the question")
class QuietSTaR(dspy.Module):
    def __init__(self):
        super().__init__()
        # All model calls go through the global OpenAI `client` defined above.

    def forward(self, text: str) -> PredictionResult:
        # Step 1: Generate thoughts
        # This is a key step in Quiet-STaR, where we generate internal thoughts to guide the reasoning process
        thoughts = self.generate_thoughts(text)
        # Step 2: Predict next token
        # Using the generated thoughts, we predict the next token in the sequence
        prediction = self.predict_next_token(text, thoughts)
        # Step 3: Evaluate thoughts
        # We evaluate the helpfulness of each thought, which is crucial for learning and improving the thought generation process
        evaluated_thoughts = self.evaluate_thoughts(text, thoughts, prediction)
        return PredictionResult(
            original_text=text,
            next_token=prediction,
            thoughts=evaluated_thoughts,
            confidence=self.calculate_confidence(evaluated_thoughts),
        )

    def generate_thoughts(self, text: str) -> List[str]:
        # This method generates internal thoughts that might help predict the next token
        # It's a key component of Quiet-STaR, allowing the model to "think before speaking"
        prompt = f"Generate 3 brief thoughts that might help predict the next token in this text: {text}"
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
        )
        thoughts = response.choices[0].message.content.split('\n')
        return [thought.strip() for thought in thoughts if thought.strip()]

    def predict_next_token(self, text: str, thoughts: List[str]) -> str:
        # This method uses the generated thoughts to predict the next token
        # It demonstrates how Quiet-STaR leverages internal reasoning to improve predictions
        prompt = f"Given the text '{text}' and these thoughts: {json.dumps(thoughts)}, predict the next token."
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
        )
        return response.choices[0].message.content.strip()

    def evaluate_thoughts(self, text: str, thoughts: List[str], prediction: str) -> List[Thought]:
        # This method evaluates the helpfulness of each thought
        # It's crucial for the learning process in Quiet-STaR, allowing the model to improve its thought generation over time
        prompt = f"Evaluate how helpful each thought was in predicting '{prediction}' as the next token for '{text}'. Rate each thought from 0 to 1."
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            functions=[{
                "name": "rate_thoughts",
                "description": "Rate the helpfulness of thoughts",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ratings": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "thought": {"type": "string"},
                                    "helpfulness": {"type": "number"}
                                }
                            }
                        }
                    }
                }
            }],
            function_call={"name": "rate_thoughts"},
        )
        ratings = json.loads(response.choices[0].message.function_call.arguments)["ratings"]
        return [Thought(content=r["thought"], helpfulness=r["helpfulness"]) for r in ratings]

    def calculate_confidence(self, thoughts: List[Thought]) -> float:
        # This method calculates the overall confidence based on the helpfulness of thoughts
        # It provides a measure of how reliable the model's prediction is
        if not thoughts:
            return 0.0  # guard against division by zero when no thoughts were rated
        return sum(t.helpfulness for t in thoughts) / len(thoughts)
class EnhancedChainOfThought(dspy.Module):
    def __init__(self):
        super().__init__()
        self.quiet_star = QuietSTaR()

    def forward(self, question: str) -> ChainOfThoughtResult:
        reasoning_steps = []
        current_context = question
        max_steps = 10  # guard so the loop cannot run forever if no conclusion is reached
        for _ in range(max_steps):
            # Generate the next reasoning step
            next_step = self.generate_next_step(current_context)
            # Use Quiet-STaR to generate thoughts for this step
            # This is where Quiet-STaR enhances the traditional chain-of-thought process
            quiet_star_result = self.quiet_star(next_step)
            reasoning_steps.append(ReasoningStep(
                step=next_step,
                thoughts=quiet_star_result.thoughts,
            ))
            current_context += f"\n{next_step}"
            # Check if we've reached a conclusion
            if self.is_conclusion(next_step):
                break
        # Generate the final answer
        answer = self.generate_answer(current_context)
        return ChainOfThoughtResult(
            question=question,
            reasoning_steps=reasoning_steps,
            answer=answer,
        )

    def generate_next_step(self, context: str) -> str:
        # This method generates the next step in the reasoning process
        # It's part of the traditional chain-of-thought approach
        prompt = f"Given the following context, provide the next step in the reasoning process:\n\n{context}\n\nNext step:"
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=2048,
        )
        return response.choices[0].message.content.strip()

    def is_conclusion(self, step: str) -> bool:
        # This method checks if the current step concludes the reasoning process
        # It helps determine when to stop generating new steps
        prompt = f"Does the following step conclude the reasoning process? Answer with 'yes' or 'no':\n\n{step}"
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1,
        )
        return response.choices[0].message.content.strip().lower() == "yes"

    def generate_answer(self, context: str) -> str:
        def is_complete_answer(answer: str) -> bool:
            # This helper function checks if an answer is complete
            # It helps determine when to stop the recursive answer generation process
            if answer.replace('.', '').isdigit() or len(answer.split()) <= 5:
                return True
            return answer.endswith((".", "!", "?"))

        def recursive_generate(current_answer: str, depth: int = 0) -> str:
            # This is the recursive part of the answer generation process
            # It continues generating until the answer is complete (capped at 5 rounds
            # to avoid unbounded recursion)
            if depth >= 5 or is_complete_answer(current_answer):
                return current_answer.strip()
            prompt = f"Continue the following answer:\n\n{current_answer}"
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024,
            )
            continuation = response.choices[0].message.content.strip()
            return recursive_generate(current_answer + " " + continuation, depth + 1)

        # Start the answer generation process
        prompt = f"Based on the following reasoning, what is the final answer? Provide only the answer without any additional explanation:\n\n{context}\n\nFinal answer:"
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
        )
        initial_answer = response.choices[0].message.content.strip()
        return recursive_generate(initial_answer)
if __name__ == "__main__":
    enhanced_cot = EnhancedChainOfThought()
    examples = [
        "If a train travels at 60 mph for 2 hours and then at 80 mph for 1 hour, how far has it traveled in total?",
        "What is the probability of rolling a sum of 7 with two six-sided dice?",
        "In a group of 30 people, 40% are wearing hats. If 5 more people put on hats, what percentage of the group will be wearing hats?",
        "If the Earth's radius is approximately 6,371 km, what is the approximate surface area of the Earth?",
        "A bacteria population doubles every 20 minutes. If you start with 100 bacteria, how many will there be after 2 hours?",
        "How many R's are in the word 'strawberry'?",
        "How many R's are in the word 'strawberrrrry'?",
        "In a room, there are 2 fathers, 2 sons, and 1 grandson. What is the minimum number of people in the room?",
        "If you have a 5-liter jug and a 3-liter jug, how can you measure exactly 4 liters of water?",
        "In a certain language, 'pim tim' means 'good morning', 'pim nim' means 'good night', and 'tim bim' means 'say morning'. What does 'tim' mean in this language?",
        "A certain species of tree grows 15 cm in its first year, then grows 10 cm each year after. How tall will the tree be after 10 years?",
    ]
    for i, example in enumerate(examples, 1):
        print(f"\nExample {i}:")
        result = enhanced_cot(example)
        print(json.dumps(result.model_dump(), indent=2))
        print("\nReflection:")
        print("Most helpful thoughts in each step:")
        for step in result.reasoning_steps:
            if not step.thoughts:
                continue  # skip steps where no thoughts were rated
            most_helpful_thought = max(step.thoughts, key=lambda t: t.helpfulness)
            print(f"- {most_helpful_thought.content} (helpfulness: {most_helpful_thought.helpfulness:.2f})")
"""
Core Steps of Quiet-STaR and Its Application to Chain-of-Thought:
1. Thought Generation: Quiet-STaR generates internal thoughts before making predictions or reasoning steps.
2. Thought Evaluation: The helpfulness of each thought is evaluated, allowing the model to learn and improve its thought generation over time.
3. Enhanced Prediction: The generated thoughts are used to improve the prediction of the next token or reasoning step.
4. Integration with Chain-of-Thought: Quiet-STaR is applied to each step of the chain-of-thought process, enhancing the overall reasoning capability.
5. Recursive Answer Generation: The final answer is generated recursively, ensuring completeness while avoiding unnecessary verbosity.
Possible Improvements and Their Effects:
1. Parallel Thought Generation: Implement parallel processing for thought generation to improve efficiency.
Effect: Faster processing, especially for complex problems requiring multiple thoughts.
2. Dynamic Thought Count: Adjust the number of thoughts generated based on the problem's complexity.
Effect: More efficient use of computational resources and potentially more accurate results for varying problem difficulties.
3. Thought Evolution: Implement a mechanism to evolve thoughts based on their historical performance.
Effect: Improved thought quality over time, leading to better reasoning and predictions.
4. Meta-Learning: Develop a meta-learning system to adapt the thought generation process across different problem types.
Effect: Enhanced versatility and performance across diverse problem domains.
5. Explainable AI Features: Add functionality to provide explanations for why certain thoughts were considered helpful.
Effect: Improved transparency and potential for human-AI collaboration in problem-solving.
6. Interactive Reasoning: Implement a system for the model to ask clarifying questions when faced with ambiguous problems.
Effect: More robust problem-solving capabilities, especially for complex or poorly defined problems.
7. Multi-Step Lookahead: Extend the prediction to consider multiple future tokens or steps.
Effect: Improved long-term coherence in reasoning and generation tasks.
8. Attention Mechanism: Implement an attention mechanism to weigh the importance of different thoughts.
Effect: More nuanced integration of thoughts into the reasoning process, potentially leading to better outcomes.
9. Confidence-Based Backtracking: Allow the model to backtrack in the reasoning process if confidence falls below a threshold (see the sketch after this block).
Effect: More robust reasoning, especially for problems where initial assumptions may be incorrect.
10. Fine-Tuning on Domain-Specific Data: Adapt the model to specific domains by fine-tuning on relevant datasets.
Effect: Improved performance in specialized areas while maintaining general reasoning capabilities.
"""