Our Research feature uses multiple Claude agents to explore complex topics more effectively. In this post, we share the engineering challenges we faced and the lessons we learned while building this system.

Claude now includes Research capabilities that allow it to search the entire web, Google Workspace, and any integrated systems or services in order to complete complex tasks.

The progression of this multi-agent system from prototype to production taught us crucial lessons about system architecture, tool design, and prompt engineering. A multi-agent system comprises several agents (large language models that autonomously use tools in a loop) working together. Our Research feature includes a lead agent that plans a research process based on the user's query, then uses tools to spawn subagents that search for information in parallel. Systems with multiple agents introduce new challenges in coordinating agents, evaluating their work, and ensuring reliability.

This post breaks down the principles that worked for us—we hope you find these insights useful when building your own multi-agent systems.

Benefits of a Multi-Agent System

Research often involves open-ended problems where it is very difficult to predict the necessary steps in advance. You cannot hardcode a fixed path for exploring complex topics since the process is inherently dynamic and path-dependent. When people conduct research, they continuously update their approach based on new discoveries, following leads as they arise during the investigation.

This unpredictability makes AI agents particularly well-suited for research tasks. Research requires the flexibility to pivot or explore tangential connections as the investigation unfolds. The model must operate autonomously over many turns, making decisions about which direction to pursue based on intermediate findings. A linear, one-shot process cannot handle such tasks.

The essence of search is compression: distilling insights from a vast corpus. Subagents aid this compression by working in parallel, each with its own context window, exploring various aspects of the query simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—with distinct tools, prompts, and exploration paths—which reduces path dependency and enables more thorough, independent investigations.

Once intelligence reaches a certain threshold, multi-agent systems become an essential means to scale performance. For example, while individual humans have grown more intelligent over the past 100,000 years, human societies have become exponentially more capable in the information age thanks to our collective intelligence and ability to coordinate. Even generally intelligent agents face limits when working independently; groups of agents can achieve far more together.

Our internal evaluations show that multi-agent research systems excel particularly in breadth-first queries that involve exploring multiple independent directions simultaneously. In our internal research evaluation, a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed a single Claude Opus 4 agent by 90.2%. For example, when tasked with identifying all board members of companies in the Information Technology S&P 500, the multi-agent system correctly answered by decomposing the task across subagents, while the single-agent system failed to find the answers due to slow, sequential searches.

Multi-agent systems are effective primarily because they allow spending sufficient tokens to solve a problem. In our analysis, three factors accounted for 95% of the performance variance in the BrowseComp evaluation (which tests a browsing agent’s ability to locate hard-to-find information). We found that token consumption alone explained 80% of the variance, with the number of tool calls and model choice accounting for the remaining differences. This finding validates our architecture that distributes work across agents with separate context windows to enhance parallel reasoning capacity. The latest Claude models greatly enhance token efficiency; for example, upgrading to Claude Sonnet 4 yields a larger performance boost than simply doubling the token allocation on Claude Sonnet 3.7. Multi-agent architectures effectively scale token usage for tasks that exceed the limits of single agents.

There is a downside: in practice, these architectures consume tokens very quickly. Our data shows that agents typically use about 4 times more tokens than regular chat interactions, and multi-agent systems use about 15 times more tokens than chats. For economic viability, multi-agent systems must be applied to tasks where the task’s value is high enough to justify the increased token usage. Moreover, some areas that require all agents to share the same context or have many interdependencies among agents are not well suited to multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable components than research tasks, and LLM agents are not yet adept at coordinating and delegating to other agents in real time. We have found that multi-agent systems excel at high-value tasks that involve extensive parallelization, information beyond a single context window, and the integration of numerous complex tools.

Architecture Overview for Research

Our Research system employs a multi-agent architecture based on an orchestrator-worker pattern, where a lead agent coordinates the process while delegating specialized tasks to parallel subagents.

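In outline, the orchestrator-worker pattern looks like the sketch below. The names are illustrative, not our production code, and the subagents—full LLM tool-use loops in the real system—are stubbed:

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    # Stand-in for an LLM tool-use loop making network-bound search calls.
    await asyncio.sleep(0)
    return f"findings for: {subtask}"

async def lead_agent(query: str) -> str:
    # 1. Plan: split the query into independent facets (stubbed heuristic).
    subtasks = [f"{query} (facet {i})" for i in range(3)]
    # 2. Delegate: spawn subagents concurrently rather than one at a time.
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # 3. Synthesize: condense subagent findings into a single answer.
    return " | ".join(findings)

answer = asyncio.run(lead_agent("AI agent companies in 2025"))
```

The key structural choice is step 2: delegation is a fan-out over independent facets, so the workers can run in parallel and the lead agent only sees their condensed findings.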

The multi-agent architecture in action: user queries pass to a lead agent, which creates specialized subagents to search concurrently for different aspects.

When a user submits a query, the lead agent analyzes the request, devises a strategy, and spawns subagents to investigate different facets simultaneously. As depicted in the diagram above, the subagents function as intelligent filters by iteratively using search tools to collect information—in this case, on AI agent companies in 2025—and then return a list of companies to the lead agent, which compiles a final answer.

Traditional Retrieval Augmented Generation (RAG) methods employ static retrieval. They fetch a set of text chunks most similar to the input query and use them to generate a response. In contrast, our architecture employs a multi-step search that dynamically locates relevant information, adapts to new findings, and analyzes the results to formulate high-quality answers.

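The contrast can be shown in miniature. Here `retrieve`, the toy corpus, and the query-refinement rule are all hypothetical stand-ins for an embedding search and an LLM's judgment; the point is the shape of the loop, not the retrieval mechanics:

```python
def retrieve(query: str) -> list[str]:
    # Toy corpus standing in for a search index.
    corpus = {"chip shortage": ["doc-a"], "chip shortage 2025": ["doc-b", "doc-c"]}
    return corpus.get(query, [])

def static_rag(query: str) -> list[str]:
    # One retrieval pass: the answer is only as good as the first query.
    return retrieve(query)

def agentic_search(query: str, max_steps: int = 3) -> list[str]:
    findings, current = [], query
    for _ in range(max_steps):
        findings.extend(retrieve(current))
        if len(findings) >= 2:        # stop once enough evidence is gathered
            break
        current = current + " 2025"   # stand-in for LLM query refinement
    return findings
```

With the query "chip shortage", the static path returns only `doc-a`, while the iterative loop refines its query based on what it has found so far and surfaces all three documents.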

This process diagram illustrates the complete workflow of our multi-agent Research system. When a user submits a query, the system creates a LeadResearcher agent that enters an iterative research cycle. The LeadResearcher begins by contemplating its approach and saving the plan to Memory to preserve context—important because if the context window exceeds 200,000 tokens, it will be truncated. Then, it creates specialized subagents (two are shown here, but the number can vary) with specific research tasks. Each subagent independently conducts web searches, evaluates tool results through interleaved thinking, and returns their findings to the LeadResearcher. The LeadResearcher synthesizes these findings and determines whether further research is necessary—if so, it can create additional subagents or refine its strategy. Once enough information is collected, the system exits the research loop and passes all findings to a CitationAgent, which processes the documents and research report to pinpoint specific citation locations. This ensures that all claims are properly attributed. The final research results, complete with citations, are then returned to the user.

Prompt Engineering and Evaluations for Research Agents

Multi-agent systems have key differences from single-agent systems, including rapidly growing coordination complexity. Early agents made mistakes such as spawning 50 subagents for simple queries, endlessly scouring the web for sources that didn't exist, and distracting one another with excessive updates. Since prompts guide every agent's behavior, prompt engineering was our primary method for refining these behaviors. Below are some principles we learned for prompting agents:

  1. Think like your agents. To iterate on prompts effectively, you must understand their effects. We built simulations in our Console using the exact prompts and tools from our system, watching the agents work step by step. This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompt design relies on developing an accurate mental model of the agent, which makes impactful changes apparent.

  2. Teach the orchestrator how to delegate. In our system, the lead agent breaks down queries into subtasks and explains them to subagents. Each subagent needs a clear objective, a defined output format, guidance on which tools and sources to use, and unambiguous task boundaries. Without precise task descriptions, agents may duplicate work, leave gaps, or fail to locate necessary information. Initially, we allowed the lead agent to give simple instructions like "research the semiconductor shortage," but these were often too vague, leading to misinterpretation or redundant searches. For example, one subagent examined the 2021 automotive chip crisis while two others redundantly investigated current 2025 supply chains, resulting in ineffective division of labor.

  3. Scale effort to query complexity. Agents often struggle to gauge the appropriate level of effort for different tasks, so we incorporated scaling rules into the prompts. Simple fact-finding might require a single agent with 3–10 tool calls, while direct comparisons might necessitate 2–4 subagents with 10–15 calls each, and complex research could require more than 10 subagents with clearly defined responsibilities. These guidelines help the lead agent assign resources efficiently and prevent overinvestment in simple queries—a common failure mode in our early implementations.

  4. Tool design and selection are critical. The interface between agents and tools is as important as human-computer interfaces. Using the correct tool is not only efficient but often essential. For example, an agent searching the web for context found only in Slack is likely to fail. With MCP servers that grant the model access to external tools, this problem is compounded, as agents encounter unfamiliar tools with varying quality in descriptions. We provided explicit heuristics: for instance, review all available tools first, match tool usage with user intent, use web search for broad exploration, and prioritize specialized tools over generic ones. Poor tool descriptions can mislead agents, so each tool must have a distinct purpose and clear description.

  5. Let agents improve themselves. We discovered that Claude 4 models can be excellent prompt engineers. When given a prompt along with a failure mode, they can diagnose the issue and suggest improvements. We even created a tool-testing agent—when it encountered a flawed MCP tool, it attempted to use it and then rewrote the tool description to prevent errors. By testing the tool multiple times, this agent identified subtle nuances and bugs. This process of refining tool ergonomics reduced task completion time by 40% for subsequent agents using the improved description, as they avoided most mistakes.

  6. Start wide, then narrow down. The search strategy should mirror expert human research: first explore the landscape before delving into specifics. Agents often begin with overly long, specific queries that return few results. We countered this by prompting agents to start with short, broad queries, assess the available information, and then gradually narrow the focus.

  7. Guide the thinking process. Extended thinking mode, which prompts Claude to produce additional tokens in a visible reasoning process, can serve as a controllable scratchpad. The lead agent uses this mode to plan its strategy, determine which tools fit the task, assess query complexity and required subagent count, and define each subagent's role. Our testing showed that extended thinking enhances instruction-following, reasoning, and efficiency. Subagents also plan and then use interleaved thinking after receiving tool results to evaluate quality, identify gaps, and refine their queries, making them more effective in adapting to various tasks.

  8. Parallel tool calling transforms speed and performance. Complex research tasks naturally involve exploring many sources. Our early agents performed sequential searches, which was painfully slow. To increase speed, we introduced two kinds of parallelization: (1) the lead agent launches 3–5 subagents in parallel rather than sequentially, and (2) subagents use 3 or more tools in parallel. These changes reduced research time by up to 90% for complex queries, enabling the Research feature to accomplish more in minutes instead of hours while covering more information than other systems.
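
As a concrete illustration of principle 3, the scaling rules could be encoded as explicit defaults the orchestrator consults. This is a minimal sketch—the category names and function shape are our own; only the numbers come from the guidelines above:

```python
def plan_effort(query_type: str) -> dict:
    # Rough encoding of the effort-scaling guidelines from principle 3.
    rules = {
        # (min, max) subagents and tool calls per subagent
        "simple_fact_finding": {"subagents": (1, 1), "tool_calls_each": (3, 10)},
        "direct_comparison":   {"subagents": (2, 4), "tool_calls_each": (10, 15)},
        # Complex research: >10 subagents with clearly divided responsibilities.
        "complex_research":    {"subagents": (10, None), "tool_calls_each": None},
    }
    # Default to minimal effort, matching the bias against overinvestment.
    return rules.get(query_type, rules["simple_fact_finding"])
```

Making the defaults explicit like this gives the lead agent an anchor, so it no longer has to guess how much effort a query deserves.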

Our prompting strategy focuses on instilling effective heuristics rather than rigid rules. We studied expert research methods used by skilled humans and encoded these strategies into our prompts—such as breaking down difficult questions into smaller tasks, carefully evaluating source quality, adjusting search approaches based on new findings, and discerning when to focus on depth versus breadth. We also proactively set explicit guardrails to mitigate unintended side effects and keep agents from spiraling out of control. Finally, we emphasized a rapid iteration cycle with high observability and comprehensive test cases.

Effective Evaluation of Agents

Robust evaluation is crucial for building reliable AI applications, and agents are no exception. However, evaluating multi-agent systems is uniquely challenging. Traditional evaluations typically assume the AI follows the same sequence every time: given input X, the system should follow path Y to produce output Z. But multi-agent systems do not operate this way. Even with identical starting points, agents might take entirely different yet equally valid paths to reach the goal. One agent might consult three sources while another reviews ten, or they may use different tools to arrive at the same answer. Because we cannot always determine the correct steps in advance, we must use flexible evaluation methods that assess whether agents achieved the correct outcomes while also following a reasonable process.

Start evaluating immediately with small samples. In early development, even small prompt adjustments can dramatically affect performance—improving success rates from 30% to 80% is common. With such pronounced effects, a few test cases can reveal significant changes. We began with roughly 20 queries representing real-world usage patterns, which often clearly demonstrated the impact of modifications. Although some teams delay creating evaluations until they have hundreds of test cases, we found that starting with a small-scale evaluation is best.

LLM-as-judge evaluation scales when done well. Research outputs are challenging to evaluate programmatically because they are free-form text and usually lack a single correct answer. LLMs are naturally suited for grading these outputs. We used an LLM judge to assess outputs against criteria such as factual accuracy (do claims match sources?), citation accuracy (are the sources properly cited?), completeness (are all aspects addressed?), source quality (are primary sources used over lower-quality secondary ones?), and tool efficiency (were the right tools used an appropriate number of times?). Although we experimented with multiple judges, a single LLM call providing scores between 0.0–1.0 and a pass-fail grade proved most consistent and aligned with human judgments. This method was particularly effective when test cases had a clear answer—for example, verifying if the top three pharma companies by R&D budget were correctly listed. Using an LLM as a judge allowed us to scale evaluations for hundreds of outputs.
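
A single-call judge of this kind can be sketched as prompt construction plus structured parsing. The model call itself is elided and the JSON response below is canned, since the exact API is not the point:

```python
import json

RUBRIC = ["factual_accuracy", "citation_accuracy", "completeness",
          "source_quality", "tool_efficiency"]

def build_judge_prompt(query: str, report: str) -> str:
    # One prompt, one call: a 0.0-1.0 score per criterion plus pass/fail.
    return (f"Grade this research report on: {', '.join(RUBRIC)}.\n"
            "Return JSON with a 0.0-1.0 score per criterion and 'pass': bool.\n"
            f"Query: {query}\nReport: {report}")

def parse_grade(raw_json: str) -> tuple[dict, bool]:
    grade = json.loads(raw_json)
    scores = {k: float(grade[k]) for k in RUBRIC}
    return scores, bool(grade["pass"])

# Canned judge response standing in for a real model call:
scores, passed = parse_grade(
    '{"factual_accuracy": 0.9, "citation_accuracy": 1.0, "completeness": 0.8, '
    '"source_quality": 0.7, "tool_efficiency": 0.9, "pass": true}')
```

Keeping the whole rubric in one call, rather than one judge per criterion, is what made the grades consistent enough to scale to hundreds of outputs.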

Human evaluation catches what automation misses. Human testers can identify edge cases that automated evaluations miss—such as hallucinated responses for unusual queries, system failures, or subtle biases in source selection. In our case, human testers noticed that our early agents consistently favored SEO-optimized content farms over authoritative but lower-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped correct this issue. Even with automated evaluations, manual testing remains indispensable.

Production Reliability and Engineering Challenges

In traditional software, a bug might cause a feature to malfunction, degrade performance, or trigger outages. In agent-based systems, minor changes can cascade into significant behavioral shifts, making it extremely difficult to write robust code for complex agents that maintain state over long periods.

Agents are stateful and errors compound. Agents can run for extended durations, maintaining state across numerous tool calls. This necessitates reliable code execution and robust error handling. Without effective safeguards, minor system failures can have catastrophic effects on agents. When errors occur, restarting from scratch is not viable—restarts are costly and can frustrate users. Instead, we built systems that resume from the point of failure. We also leverage the model's intelligence to handle issues gracefully; for instance, informing the agent when a tool fails and allowing it to adjust has proven surprisingly effective. We combine the adaptability of AI agents with deterministic safeguards like retry logic and regular checkpoints.
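
A minimal sketch of these safeguards, assuming a checkpoint store persisted externally (the names and structure are illustrative, not our production code):

```python
import time

def run_with_checkpoints(steps, checkpoint, max_retries=3):
    """Run `steps` in order, resuming from `checkpoint` (a dict of
    step index -> result) instead of restarting from scratch."""
    for i, step in enumerate(steps):
        if i in checkpoint:
            continue  # finished before a crash; do not redo the work
        for attempt in range(max_retries):
            try:
                checkpoint[i] = step()
                break  # success; move on to the next step
            except RuntimeError:
                if attempt == max_retries - 1:
                    raise  # retries exhausted; surface the error
                time.sleep(0)  # stand-in for exponential backoff
    return [checkpoint[i] for i in range(len(steps))]
```

In the real system the "let the model adapt" path sits alongside this: a tool failure is also reported back to the agent in-context, so deterministic retries and model-driven recovery complement each other.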

Debugging benefits from new approaches. Because agents make dynamic decisions and are non-deterministic—even with identical prompts—debugging can be challenging. For example, users might report that an agent "did not find obvious information," but the underlying cause could be unclear. Were poor search queries used? Were suboptimal sources selected? Was there a tool failure? Implementing comprehensive production tracing allowed us to diagnose why agents failed and systematically address issues. Beyond standard observability, we monitor decision patterns and interaction structures without inspecting individual conversations to maintain user privacy. This high-level observability helped us identify root causes, uncover unexpected behaviors, and resolve common failures.

Deployment needs careful coordination. Agent systems are highly stateful networks of prompts, tools, and execution logic that run almost continuously. As a result, when deploying updates, agents may be at various stages in their process. Consequently, we must prevent well-intentioned code changes from disrupting existing agents. We cannot update all agents simultaneously; instead, we use rainbow deployments to gradually shift traffic from the old version to the new version while both run concurrently.

Synchronous execution creates bottlenecks. Currently, our lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. While this simplifies coordination, it creates bottlenecks in information flow between agents. For example, the lead agent cannot intervene with subagents, subagents cannot coordinate, and the entire system may be delayed by a single slow subagent. Asynchronous execution could enable additional parallelism—agents working concurrently and spawning new subagents as needed. However, asynchronicity introduces challenges in result coordination, state consistency, and error propagation among subagents. We expect that as models handle longer and more complex research tasks, the performance gains from asynchronicity will justify the added complexity.

Conclusion

When building AI agents, the last mile often proves to be the longest part of the journey. Code that functions well on a developer’s machine may require extensive engineering to become a robust production system. The compounding nature of errors in agent systems means that issues which would be minor in traditional software can completely derail agent functionality. A failure at a single step can send agents down entirely different trajectories, leading to unpredictable outcomes. For all the reasons discussed in this post, the gap between prototype and production often proves wider than anticipated.

Despite these challenges, multi-agent systems have proven invaluable for open-ended research tasks. Users have reported that Claude helped them uncover business opportunities they hadn’t considered, navigate complex healthcare options, resolve difficult technical issues, and save days of work by discovering research connections they otherwise would have missed. With careful engineering, exhaustive testing, meticulous prompt and tool design, robust operational practices, and seamless collaboration between research, product, and engineering teams, multi-agent research systems can operate reliably at scale. We are already witnessing these systems transform the way people solve complex problems.

A Clio embedding plot displays the most common ways people are using the Research feature today. The top use-case categories include developing software systems in specialized domains (10%), developing and optimizing professional and technical content (8%), designing business growth and revenue strategies (8%), assisting with academic research and educational material development (7%), and investigating and verifying information about people, places, or organizations (5%).

Appendix

Below are some additional miscellaneous tips for multi-agent systems.

End-state evaluation of agents that mutate state over many turns.
Evaluating agents that modify persistent state over multiple conversation rounds presents unique challenges. Unlike straightforward research tasks, each action can change the environment for subsequent steps, creating dependencies that traditional evaluations struggle to handle. We found that focusing on end-state evaluation rather than turn-by-turn analysis is effective. Instead of judging whether an agent followed a specific process, evaluate whether it achieved the correct final state. This acknowledges that agents may take alternative paths to the same goal while still delivering the intended outcome. For complex workflows, break the evaluation into discrete checkpoints where specific state changes should have occurred rather than trying to validate every intermediate step.
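
This end-state evaluation can be sketched as follows; the state keys are invented for illustration:

```python
def evaluate_end_state(final_state: dict, expected: dict) -> bool:
    # Pass iff every required state change occurred, regardless of the
    # path the agent took to get there.
    return all(final_state.get(key) == value for key, value in expected.items())

expected = {"ticket_status": "resolved", "customer_notified": True}
# Two agents take different trajectories; both reach the required state.
agent_a = {"ticket_status": "resolved", "customer_notified": True, "steps": 4}
agent_b = {"ticket_status": "resolved", "customer_notified": True, "steps": 9}
```

Both hypothetical agents pass despite taking different numbers of steps, which is exactly the tolerance a trajectory-based evaluation would lack.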

Long-horizon conversation management.
Production agents often engage in conversations spanning hundreds of turns, necessitating careful context management strategies. As conversations progress, standard context windows become insufficient, requiring intelligent compression and memory mechanisms. We implemented patterns whereby agents summarize completed work phases and store essential information in external memory before proceeding to new tasks. When context limits are approached, agents can spawn fresh subagents with a clean context while maintaining continuity through careful handoffs. They can also retrieve stored context, such as a research plan, from memory rather than losing previous work when the context limit is reached. This distributed approach prevents context overflow while preserving conversation coherence across long interactions.
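
The summarize-and-hand-off pattern can be sketched as below. The 200,000-token limit matches the one mentioned earlier in the post; the four-characters-per-token heuristic and the one-line summarizer are crude stand-ins for real token counting and an LLM summarization call:

```python
CONTEXT_LIMIT = 200_000  # tokens; matches the truncation limit noted earlier

def approx_tokens(messages: list[str]) -> int:
    # Crude 4-characters-per-token heuristic; a real system counts properly.
    return sum(len(m) // 4 for m in messages)

def compress_if_needed(messages: list[str], memory: list[str]) -> list[str]:
    if approx_tokens(messages) < CONTEXT_LIMIT:
        return messages  # plenty of room; no handoff needed
    # Summarize the completed phase (stubbed) and persist it externally...
    memory.append("summary: " + messages[0][:40])
    # ...then continue in a fresh context seeded with the stored summary
    # plus only the most recent turns.
    return [memory[-1]] + messages[-2:]
```

The essential property is that nothing is silently truncated: older work is compressed into external memory before the fresh context takes over.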

Subagent output to a filesystem to minimize the ‘game of telephone.’
Allowing subagents to directly output results can bypass the main coordinator for certain types of outputs, thereby improving both fidelity and performance. Rather than requiring subagents to relay everything through the lead agent, implement artifact systems where specialized agents can generate independently persistent outputs. Subagents use tools to store their findings in external systems, then return lightweight references to the coordinator. This approach prevents information loss during multi-stage processing and reduces token overhead from transferring large outputs through conversation history. This pattern is especially effective for structured outputs such as code, reports, or data visualizations, where a specialized subagent's focused prompt produces better results than filtering it through a general coordinator.
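
A minimal sketch of the artifact pattern, with an in-memory dict standing in for the filesystem or blob store and all names invented for illustration:

```python
import uuid

ARTIFACTS: dict[str, str] = {}  # stand-in for a filesystem or blob store

def store_artifact(content: str) -> str:
    ref = f"artifact://{uuid.uuid4().hex[:8]}"
    ARTIFACTS[ref] = content
    return ref  # lightweight reference; the full output never transits chat

def subagent_report(large_output: str) -> dict:
    # The subagent persists its full output and relays only a summary + ref.
    return {"summary": large_output[:30], "ref": store_artifact(large_output)}

def coordinator_fetch(ref: str) -> str:
    return ARTIFACTS[ref]  # full fidelity on demand, no lossy relay
```

The coordinator's context holds only short summaries and references, yet any consumer can recover the subagent's output byte-for-byte—which is the whole point of sidestepping the telephone game.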

Appendix: Prompts

Anthropic has shared the prompts related to this research system in their GitHub repository:

citations_agent

You are an agent for adding correct citations to a research report. You are provided with a report encapsulated in <synthesized_text> tags that was generated based on specific sources; however, the sources are not cited in the <synthesized_text>. Your task is to enhance user trust by generating correct and appropriate citations for this report.

Based on the provided document, add citations to the input text using the format specified earlier. Output the resulting report, unchanged except for the added citations, contained within <exact_text_with_citation> tags.

Rules:

  • Do NOT modify the content within <synthesized_text>—keep all text identical, only adding citations.
  • Carefully maintain whitespace: DO NOT add or remove any whitespace.
  • ONLY add citations where the source documents directly support claims in the text.

Citation guidelines:

  • Avoid citing unnecessarily; do not cite every statement. Focus on key facts, conclusions, and substantive claims that readers might wish to verify, adding credibility by linking them to sources.
  • Cite meaningful semantic units: citations should span complete thoughts or findings rather than fragmented words.
  • Minimize sentence fragmentation: avoid interrupting the flow of a sentence with multiple citations. Use a single citation at the sentence end if multiple claims come from the same source.
  • Do not place redundant citations close to one another.

Technical requirements:

  • Citations should appear as a visual, interactive element at the closing tag. Be mindful not to disrupt sentence structure.
  • Output text with citations between <exact_text_with_citation> and </exact_text_with_citation> tags.
  • Do not include any preamble, thinking, or planning before the opening <exact_text_with_citation> tag.
  • ONLY add citation tags to the text that is within the <synthesized_text> tags.
  • Text without citations will be compared to the original report. If the text is not identical, your result will be rejected.

Now, add the citations to the research report and output the <exact_text_with_citation> content.

research_lead_agent

You are an expert research lead focused on high-level strategy, planning, efficient delegation to subagents, and final report composition. Your core goal is to be maximally helpful by leading a process to research the user’s query and generate an excellent research report that thoroughly answers their question. Take the current request, devise an effective research process to address it, and execute this plan by delegating key tasks to appropriate subagents.

The current date is {{.CurrentDate}}.

<research_process>

Follow this process to break down the user’s question and develop an effective research plan. Thoroughly consider the user's task to ensure complete understanding and determine the subsequent actions. Analyze each aspect of the user's question to identify the most important components, including main concepts, key entities, relationships, necessary data points, and any contextual or temporal constraints. Decide what format the final answer should take—be it a detailed report, a list of entities, an analysis of different perspectives, a visual report, or another format.

  1. Assessment and breakdown: Identify the central concepts, key facts, and data needed. Determine what aspects of the prompt are most critical to the user’s needs.
  2. Query type determination: Clearly reason whether this is a depth-first query (seeking multiple perspectives on a single issue), a breadth-first query (dividing into independent sub-questions), or a straightforward query (focused and well-defined).
  3. Detailed research plan development: Develop a precise research plan with clear task allocations across subagents based on the query type.
  4. Methodical plan execution: Execute the plan, deploying parallel subagents as needed for independent tasks, and synthesize the findings thoroughly.

</research_process>

<subagent_count_guidelines>
When determining how many subagents to create, follow these guidelines:

  1. Simple/Straightforward queries: 1 subagent.
  2. Standard complexity queries: 2–3 subagents.
  3. Medium complexity queries: 3–5 subagents.
  4. High complexity queries: 5–10 subagents (maximum 20).

Do not create more than 20 subagents unless essential. Prefer fewer, more capable subagents over many narrowly focused ones.
</subagent_count_guidelines>

<delegation_instructions>
Use subagents as your primary research team to perform major research tasks. Deploy subagents immediately after finalizing your research plan to begin the process quickly, using the run_blocking_subagent tool with clear instructions. Each subagent is a fully capable researcher with access to external tools. When assigning tasks, ensure that each subagent has specific, non-overlapping responsibilities. Provide detailed, concise instructions including research objectives, expected output format, relevant background context, key questions to answer, suggested starting points, and specific tools to use. Always structure subagent instructions to maximize efficiency and clarity.
</delegation_instructions>

<answer_formatting>
Before providing a final answer:

  1. Review the latest fact list compiled during the research process.
  2. Reflect whether these facts sufficiently answer the query.
  3. Provide a final answer in the best format for the user's needs according to the <writing_guidelines> below.
  4. Output the final result in Markdown using the complete_task tool, without including any Markdown citations; a separate agent will handle citations.
  5. Do not include a list of references or citations at the end.
</answer_formatting>

<use_available_internal_tools>
You may have additional tools available for exploring the user's integrations (e.g., Slack, Asana, GitHub). Always use any available read-only tools to gather basic information about these integrations; for instance, use slack_search or asana_user_info if available. Do not use write, create, or update tools. When handling internal information, explicitly instruct subagents on which tools to use so that internal context is utilized effectively.
</use_available_internal_tools>

<use_parallel_tool_calls>
For maximum efficiency, call all relevant tools simultaneously rather than sequentially. Utilize parallel tool calls to run subagents concurrently, particularly during the initial phase of research unless the query is straightforward.
</use_parallel_tool_calls>

<important_guidelines>
In communication with subagents, maintain high information density and conciseness. As the research process unfolds:

  1. Continuously review gathered core facts from your own research and subagent reports.
  2. For key facts, especially numbers and dates, note any discrepancies or quality issues.
  3. Carefully analyze new information and apply critical reasoning.
  4. When further research yields diminishing returns, cease deploying new subagents and compile your final report.
  5. NEVER create a subagent to generate the final report; you must synthesize the report yourself.
  6. Avoid deploying subagents for topics that may cause harm; include clear constraints to prevent harm if a sensitive query arises.
</important_guidelines>

You have received a query from the user. Do your best to thoroughly complete the task based on these instructions. No clarifications will be provided—use your best judgment and do not ask the user questions. Plan your use of subagents and parallel tool calls efficiently, then synthesize the final research report based on the gathered results.

research_subagent

You are a research subagent working as part of a team. The current date is {{.CurrentDate}}. You have been assigned a clear task by the lead agent and should use the available tools to carry out the research process. Follow the instructions below carefully:

<research_process>

  1. Planning: Carefully think through the task. Develop a detailed research plan that outlines the requirements and allocates an appropriate 'research budget' of tool calls based on complexity (simple tasks: under 5 calls; medium: around 5; hard: about 10; very difficult: up to 15).
  2. Tool selection: Identify the most useful tools for this task (e.g., google_drive_search, gmail, gcal, web_search, web_fetch). Always use internal tools when personal data or internal contexts are involved. Prioritize using web_fetch to retrieve full website content.
  3. Research loop: Execute an efficient observe-orient-decide-act (OODA) loop. Use multiple tool calls (minimum 5, up to 10 for complex queries) in parallel where possible, iterating based on new findings, while avoiding repeated queries for the same tool.

</research_process>

<research_guidelines>

  1. Provide detailed internal reasoning but report concisely.
  2. Use moderately broad search queries that balance specificity and generality.
  3. For important numerical or factual data, track findings and sources carefully.
  4. Be precise and thorough.
</research_guidelines>

<think_about_source_quality>
After obtaining results, critically evaluate their quality. Note any speculative language or indicators of unreliable sources (e.g., unnamed sources, passive voice, marketing language) and highlight uncertainties in your final report.
</think_about_source_quality>

<use_parallel_tool_calls>
Always perform at least two relevant tool calls in parallel to maximize efficiency.
</use_parallel_tool_calls>

<maximum_tool_call_limit>
Do not exceed 20 tool calls or 100 sources. If limits are approached, stop further calls and compile your final report immediately.
</maximum_tool_call_limit>

Follow these instructions to complete your task, gather all necessary information, and then provide a detailed, concise, and accurate report for the lead research agent.