
In this post, we cover the basics of using agentic workflows and LLM coding agent tools for software engineering. Read on for an introduction to agents, their capabilities and limitations, and their use cases.
How do LLM agents for software development work?
LLM agents are software systems that combine Large Language Models with other tools to perform complex tasks autonomously. In the context of coding agents, we define LLM agents as AI systems that can understand, write, and modify code, including solving complex software engineering problems, with minimal human intervention. That intervention generally comes in the form of natural language instructions. Agents are able to understand those instructions and translate them to code.
The key characteristics that make agents useful for software development are that they can:
- Break down complex tasks into manageable subtasks
- Make decisions throughout the delivery of a solution to the problem
- Interact with both the user and other systems as needed, making the resolution of coding problems an iterative, interactive process.
The key difference between basic code generation and real agentic behaviour is that agents can plan a solution, iteratively execute it, and use reflection to gauge and improve the success of the process. Those characteristics make agents very promising for certain software engineering use cases – but they also have some limitations that are important to consider.

Capabilities of LLM agents
LLM agents have a few core capabilities that make them useful for software engineering use cases. The following list is inspired by Google's take on LLM agents; a minimal sketch of the resulting agent loop follows the list:
- Planning & reasoning: An agent can decompose complex tasks into smaller subtasks, identify the steps necessary to reach a solution (including evaluating multiple options and choosing the best one to achieve the specified goal). It can also mimic the human cognitive process by using reflection to draw conclusions, and may also use chain-of-thought reasoning and reflection to review the effectiveness of the plan. A key element of planning and reasoning is considering potential future states and obstacles. For example, a software development agent tasked to fix a bug might decide first to find a reproducer and write a unit test before inspecting the code and attempting to fix the problem.
- Execution: The agent can take the actions necessary to perform the task (autonomously or asking for guidance from the user), including using tools or interacting with external systems. For software development, this includes the ability to read and modify files, but also to execute shell commands in a development environment.
- Observing: To understand the context, agents can draw upon the environment or situation at hand. In the case of software engineering agents, this could mean analyzing a project’s existing code base or interpreting the output of shell commands.
- Collaboration: An agent must be able to work together with humans and/or other agents. That involves communicating effectively, developing a holistic understanding of the situation and of the viewpoints of others, and coordinating work efficiently. While a coding agent is expected to solve tasks autonomously, the goals and initial instructions are discussed with a user, and this human input is often given through an iterative process.
- Self-improvement: Advanced agents can not only plan and reason in iterations; they also have long-term memory, letting them learn over time and self-refine based on their own experiences or external feedback. Some programming agents can learn to adhere to coding styles or avoid common mistakes over time.
- Tool integration: In the case of software engineering agents, tool calling is an important capability. Tool-calling interfaces enable the agent to interact with external systems, retrieve data from databases, and connect to external APIs. A recent development is the Model Context Protocol (MCP), a new standard for connecting AI assistants to external data sources and tools. For example, an MCP connection to GitHub can make an agent aware of open issues and pull requests.
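To make these capabilities concrete, here is a minimal, hypothetical sketch of the plan-act-observe loop in Python. The `call_llm` function, the JSON action format, and the tool names are illustrative assumptions, not the interface of any particular agent or of MCP.

```python
import json
import subprocess

def call_llm(messages: list[dict]) -> str:
    """Hypothetical LLM call: swap in your provider's SDK here."""
    raise NotImplementedError("plug in your LLM provider")

def read_file(path: str) -> str:
    """Tool: return the contents of a file in the working directory."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def run_shell(command: str) -> str:
    """Tool: run a shell command and return its combined output."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "run_shell": run_shell}

def run_agent(task: str, max_steps: int = 10) -> str:
    """Plan -> act -> observe: the LLM picks a tool, we execute it, and the
    observation is fed back until the model says it is finished."""
    messages = [
        {"role": "system", "content": (
            "You are a coding agent. Reply with JSON only: "
            '{"tool": "read_file" | "run_shell" | "finish", "argument": "..."}')},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        action = json.loads(call_llm(messages))
        if action["tool"] == "finish":
            return action["argument"]  # final summary produced by the model
        observation = TOOLS[action["tool"]](action["argument"])
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": f"Observation:\n{observation}"})
    return "Step limit reached without a final answer."
```

Production agents add stricter tool schemas, error handling, and safeguards around shell access, but the core loop looks roughly like this.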
Challenges & limitations
Despite all those capabilities and examples of agents being successfully applied in real-life software engineering workflows, the technology still has its limitations.
- Risks (isolation or safeguarding): An LLM with access to developer machines or even live systems can create chaos. It’s important to make sure the agent is completely isolated, or to use some other safeguarding mechanism such as explicit confirmation to run shell commands (a sketch of such a confirmation gate follows this list). Note that we have seen examples of a clever LLM breaking out of its environment to repair the infrastructure, so the risk is definitely real.
- Managing complexity & limited context: Agents can struggle to grasp the full context and develop a comprehensive understanding of complex system architectures. Especially with large code bases and complex problems, agents can have difficulties planning multi-step development processes. Smart techniques help make the most of limited LLM context windows, such as compressing the code base into a "repo map" (sketched after this list) or implementing RAG specifically for software engineering tasks.
- Inconsistent output and hallucination: Since they are based on LLMs, agents are also vulnerable to the usual problems: not following instructions correctly and producing unreliable outputs, including having a tendency to generate plausible but incorrect code. We have also seen examples of agents “faking” test results or success messages, and being overly confident about using approaches that are incorrect.
- Tool calling: As mentioned above, tool calling is a crucial capability of agents. Yet most agents still display reliability issues when interacting with external systems. MCP is still not universally supported by currently available agents. Tool calling is a challenge on its own as agents can have difficulties choosing the right tools to use and then properly executing the calls. Agents need to balance flexibility with constraints and handle errors resulting from tool use.
- LLM selection and costs: Not all agents can be configured to use any LLM that you choose to work with. Sure, there may be trade-offs between performance and token costs – but the most expensive LLMs aren’t always the best for every software development use case. To optimize costs when using agents, one needs to consider multiple criteria for selecting a model with an optimal price-value ratio and adopt strategies to optimize prompt efficiency. It also makes sense to find a way to monitor costs in real time.
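To illustrate the safeguarding point above, here is a small, hypothetical confirmation gate around shell execution. The allow-list and the prompt wording are assumptions for this sketch, not the behavior of any specific agent.

```python
import shlex
import subprocess

# Commands considered harmless enough to run without asking (an assumption).
SAFE_COMMANDS = {"ls", "cat", "git", "pytest"}

def run_with_confirmation(command: str) -> str:
    """Ask the user before executing anything outside the allow-list."""
    program = shlex.split(command)[0]
    if program not in SAFE_COMMANDS:
        answer = input(f"Agent wants to run {command!r}. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return "Command rejected by the user."
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr
```

Running the agent in a container or a throwaway VM is the stronger form of the same idea.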
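And to illustrate the "repo map" idea for coping with limited context, the following sketch compresses a Python code base into file paths plus top-level signatures. The exact output format is an assumption, not the format used by any particular tool.

```python
import ast
from pathlib import Path

def repo_map(root: str) -> str:
    """Compress a Python code base into file paths plus top-level signatures,
    so an agent can 'see' the whole project within a small context window."""
    lines = []
    root_path = Path(root)
    for path in sorted(root_path.rglob("*.py")):
        lines.append(str(path.relative_to(root_path)))
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that do not parse
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                lines.append(f"    def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                lines.append(f"    class {node.name}")
    return "\n".join(lines)

if __name__ == "__main__":
    print(repo_map("."))
```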
🤓 Wondering how different LLMs perform for coding tasks? Check out all the DevQualityEval deep dives:
- Anthropic’s Claude 3.7 Sonnet is the new king 👑 of code generation (but only with help), and DeepSeek R1 disappoints (Deep dives from the DevQualityEval v1.0)
- OpenAI’s o1-preview is the king 👑 of code generation but is super slow and expensive (Deep dives from the DevQualityEval v0.6)
- DeepSeek v2 Coder and Claude 3.5 Sonnet are more cost-effective at code generation than GPT-4o! (Deep dives from the DevQualityEval v0.5.0)
- Is Llama-3 better than GPT-4 for generating tests? And other deep dives of the DevQualityEval v0.4.0
- Can LLMs test a Go function that does nothing?
Use cases for coding agents
With the above limitations carefully considered, agents can already provide valuable help with a variety of software engineering tasks. The most common use cases of LLM coding agents include:
Automating repetitive tasks
A range of monotonous tasks inherent in coding can be automated with agents. Examples include creating the scaffolding for new projects or components with a consistent structure; generating boilerplate code; building standard interfaces; setting up development environments; and converting between data formats. Agents can also help automate the creation of logging, monitoring, testing, and observability code, and can be useful in creating and maintaining documentation.
Transpiling code
Agents have the potential to handle complex transpilation use cases. The DevQualityEval LLM benchmark evaluates models across a range of software engineering tasks including transpilation, and our findings from the latest deep dive into the results confirm that models are getting better at transpiling code.
🤖 Coding agents in action
Read about our experiences with Aider, All Hands' OpenHands (previously OpenDevin), Claude Code, Cline, Cursor, GitHub Copilot: “Agent Mode”, goose, gptme, and SWE-Agent.
Code maintenance and refactoring
Agents can help cut the effort it takes to identify and fix technical debt in a project. Agents can be used to standardize code patterns across codebases, as well as to update dependencies and fix compatibility issues.
Test generation
Test generation is an area ripe for automation, with a potential for agents to be used for expanding test coverage by creating unit and integration tests and generating test data. Test generation is another task type covered by the DevQualityEval benchmark, helping identify the LLMs that perform best at generating tests for a given language.
Learning and development
Those new to coding (or experienced developers working with new and complex projects or frameworks) can use agents to help understand the system at hand and get a better grasp of the context. Creating educational tutorials and examples and explaining code functionality are made easier by agents that let you “chat with your code” to better understand its intricacies.
Nondeterminism in agentic workflows
By design, LLMs are nondeterministic. That affects how you’ll be working with coding agents. We have touched on some of these points above, but since they can greatly influence your experience with agents, it makes sense to emphasize the following key recommendations for working with LLM agents:
- Isolation: Never give an agent direct access to your machine.
- Checkpoints: Commit frequently so you can always roll back to a known-good state if the agent breaks functionality (a small checkpoint script is sketched after this list).
- Tests: Exhaustive testing is the easiest way to confirm that the code produced by a model is working and did not introduce any regressions.
- Reviews: Always review the code yourself before it ships. If something breaks, you are to blame, not the agent.
- Costs: Especially when using autonomous agents, it is crucial to track costs and also monitor the progress of the agent. It might be better to stop an agent early than to waste additional money on a suboptimal solution to a problem.
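As a concrete take on the checkpoint and test recommendations, here is a small, hypothetical helper that runs the test suite and commits only when it passes. The `pytest` invocation and the default commit message are assumptions you would adapt to your project.

```python
import subprocess
import sys

def checkpoint(message: str = "agent checkpoint") -> None:
    """Run the test suite and commit only if it passes, so every commit
    is a known-good state you can roll back to."""
    tests = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if tests.returncode != 0:
        print("Tests failed, not committing:\n" + tests.stdout + tests.stderr)
        sys.exit(1)
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)

if __name__ == "__main__":
    checkpoint(sys.argv[1] if len(sys.argv) > 1 else "agent checkpoint")
```

Calling such a helper after every agent iteration keeps the history fine-grained enough that a bad step can simply be reverted.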
Building your own agent
There are already a range of off-the-shelf coding agents available (both paid and open-source), each with their own strengths and weaknesses. (We took all the major coding agents for a test drive and summarized our experiences in this post.)
However, creating your own custom agent is also a viable option. But since that requires a considerable investment of time and effort, you’ll want to carefully consider if that’s appropriate:
- For some complex projects and tasks, it’s better to rely on predefined workflows for consistency and predictability in how a task is solved. Opt for the simplest solution available: if plain LLM calls get the job done, they are likely to yield better results at lower cost.
- Agents offer more flexibility and decision-making. If your application follows an iterative flow that is affected by incoming data, or the application needs to adapt and follow different flows based on feedback, using an agent may be better.
If you think creating a custom coding agent is the way to go for your team or project, sign up for Symflower’s agent seminar to learn about building your own agent.