December 19, 2025

Now Is a Great Time to Build Computer Use Agents. Why?

We’ve evolved from the era of the chatbot, where you ask a question and the model reacts, to something more multimodal. Instead of requiring guidance at every step, the model can now observe what’s happening in the world and begin taking action on its own.

The gap today is that while people are training some very good multimodal models (models that can look at an image and generate answers or descriptions), they’re still not good at two critical things.

First, they struggle with planning their actions. They don’t fully understand how they can affect the world.

Second, they have to learn how to act once they receive feedback. This feedback forms a causal relationship between their actions and the internal dynamics of the environment.

Even in web or computer-use settings, this relationship can be more complicated than the physical world. The physical world roughly follows Newton’s three laws, but webpages and software (take Adobe, for example) are designed in ways that aren’t always convenient for humans. The model has to understand this feedback and adapt accordingly.

We have strong multimodal models that can understand images. We also have LLMs trained to use tools and interact with the world to finish tasks.

Computer use agents combine multimodal intelligence with agentic intelligence.

This approach offers a general interface for actually utilizing agents. Whenever you encounter a long-tail, infrequent workflow, you can simply describe it and let the agent handle it, rather than relying on specialized tools like MCP or writing custom tools for the agent ahead of time. Now is a very good moment to be working on agentic computer use.
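To make the observe, act, and feedback cycle concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: ToyFormEnv is a toy two-field form rather than a real screen, and choose_action is a hand-written rule where a real agent would call a multimodal model. It is not the API of any particular SDK.

```python
class ToyFormEnv:
    """Hypothetical environment: a form with two fields and a submit button."""

    def __init__(self):
        self.fields = {"name": "", "email": ""}
        self.submitted = False

    def observe(self):
        # Stand-in for a screenshot: report which fields are still empty.
        empty = [key for key, value in self.fields.items() if not value]
        return {"empty": empty, "done": self.submitted}

    def step(self, action):
        # Apply the action; the effect only becomes visible in the next observation.
        kind, target = action
        if kind == "type":
            self.fields[target] = "filled"
        elif kind == "click" and target == "submit":
            # Submitting only works once every field has a value.
            self.submitted = all(self.fields.values())


def choose_action(observation):
    # Stand-in for the model: fill the first empty field, otherwise submit.
    if observation["empty"]:
        return ("type", observation["empty"][0])
    return ("click", "submit")


def run_agent(env, max_steps=10):
    """Observe, pick an action, act, and repeat until done or out of steps."""
    history = []
    for _ in range(max_steps):
        observation = env.observe()
        if observation["done"]:
            break
        action = choose_action(observation)
        env.step(action)
        history.append(action)
    return history


if __name__ == "__main__":
    print(run_agent(ToyFormEnv()))
```

The loop never assumes an action worked. It re-observes after every step, which is the feedback the paragraphs above describe.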

 

Embracing the Challenges of Computer Use Model Training

However, we’re still in the early days. What are the main challenges we need to overcome to make computer use agents truly mainstream?

 

The first is reliability. When a computer use agent performs a task, you expect it to succeed 10 out of 10 times. Even 9 out of 10 can be disastrous in some settings. It’s not as dangerous as autonomous driving, but you still expect the model to reliably understand what’s happening on a website and correct itself when it makes mistakes. That’s the first critical factor.
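One way to see why small failure rates matter is to look at how errors compound across steps. The sketch below rests on a simplifying assumption that is not from this post: a task is a chain of independent steps, each succeeding with the same probability, and the agent never recovers from a failed step. The function name and the numbers are illustrative only.

```python
def task_success_rate(step_success: float, num_steps: int) -> float:
    """Probability that every step in a chain of independent steps succeeds."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_success ** num_steps


if __name__ == "__main__":
    # Compare per-step reliabilities over a 20-step workflow.
    for p in (0.9, 0.99, 0.999):
        print(f"per-step {p}: whole task {task_success_rate(p, 20):.3f}")
```

Under these assumptions, a step that works 9 times out of 10 completes a 20-step task only about 12% of the time, which is why the ability to notice and correct mistakes matters as much as raw per-step accuracy.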

 

The second challenge is that computer use agents should be considered a superset of what people are building inside ChatGPT or Gemini today. A strong computer use agent that collaborates with you needs to understand your content, be an excellent chatbot, and have enough world knowledge to understand its goals. Making these agents truly powerful requires balancing their capabilities as language models with their efficiency as actor models that can operate a computer.

The third challenge is data. Computer use data isn’t like YouTube videos or web content you can crawl. Companies building foundation computer-use models must set up efficient annotation workflows, know how to leverage synthetic data, and build large datasets so models can understand what’s happening on a screen. It feels like the early days of embodied intelligence: people have to return to the hard work of collecting data instead of assuming there’s already a gold mine on the internet.

An Exciting Future for Computer Use Agents

In the near term, computer use agents should operate as co-pilots. Whenever you need to do something dull — cook a meal, adjust boring pieces of a PowerPoint, edit documents — you hand off the task and let the agent do the work you don’t want to do. That’s the first phase: once reliability is achieved, you can delegate standard tasks.

 

In the second stage, as agents become more intelligent and capable of deeper decision-making, you could run a second computer or let the model live in the cloud, collaborating in a multi-agent way. At that point, the goal of computer use agents extends beyond simply managing a computer — they could run an Amazon store or take on even more ambitious operations.

 

This is when they begin making a meaningful economic impact. Many tasks can be done entirely on a computer: the agent could watch YouTube to see what’s trending, then source and list goods on Amazon.

 

The ultimate vision for computer use agents is straightforward: a human with many ideas about how to build businesses or products can use agents as amplifiers of their efficiency. Everyone could run a company with the help of computer use agents. Agents would essentially function like employees — your team. This unlocks a world where anyone can become an entrepreneur.

 

Not everyone has access to all the APIs and tools required to run a business, but with a truly capable computer use agent, everyone would have a unified interface to do and build whatever they want.
