GPT-5.4 Computer Use: Build a Desktop AI Agent Guide — ContentBuffer guide

Kodetra Technologies · 3 min read · Beginner

Summary

Automate desktop tasks with GPT-5.4's native computer-use API in six simple, tested steps.

Why GPT-5.4 Computer Use Matters

GPT-5.4 is the first mainline OpenAI model to ship with native computer use. Working from the screenshots you send, it can read the screen, click buttons, type text, and verify its own work in a build-run-verify-fix loop.

It scores 75% on OSWorld, beating the human expert baseline of 72.4%. You give it a task and a screenshot; it returns precise mouse and keyboard actions you execute locally.


Prerequisites

  • Python 3.10+
  • OpenAI API key with Tier 1 access (minimum $5 prior spend)
  • A desktop environment with a display
  • pip packages: openai, pyautogui, pillow

Step 1: Install the SDK

Upgrade openai to ensure you have the computer_use tool type. Install pyautogui for executing actions and Pillow for screenshots.

pip install --upgrade openai pyautogui pillow

Step 2: Take a Screenshot

GPT-5.4 needs a picture of your screen to reason about. Capture it and convert it to base64.

import pyautogui, base64, io

def capture_screen():
    img = pyautogui.screenshot()
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

screen_b64 = capture_screen()
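
Later steps embed this base64 string in a data: URL. Here is a minimal sketch of that wrapping, using only the standard library. The helper name to_data_url is hypothetical and not part of any SDK, and the signature check is just a guard against passing bytes that are not PNG-encoded.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file

def to_data_url(png_bytes: bytes) -> str:
    """Wrap raw PNG bytes in a base64 data URL (hypothetical helper)."""
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("expected PNG-encoded bytes")
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"
```

Passing buf.getvalue() from capture_screen() to a helper like this would keep the data: prefix in one place instead of repeating the f-string in every request.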

Step 3: Configure the Computer Use Tool

Tell GPT-5.4 your display size so it returns pixel-accurate coordinates.

from openai import OpenAI
import pyautogui

client = OpenAI()
w, h = pyautogui.size()

computer_tool = {
    "type": "computer_use",
    "display_width": w,
    "display_height": h,
    "environment": "linux"  # or "mac", "windows"
}
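
Rather than hard-coding the environment, you could derive it from the running OS. The sketch below assumes the three strings named in the comment above ("linux", "mac", "windows") are the accepted values. detect_environment is a hypothetical helper, not an SDK function.

```python
import platform

# Maps platform.system() results to the environment strings used in this guide.
_ENVIRONMENTS = {"Linux": "linux", "Darwin": "mac", "Windows": "windows"}

def detect_environment(system_name=None):
    """Return the guide's environment string for an OS name (hypothetical helper).

    With no argument, the name is taken from platform.system().
    """
    name = system_name or platform.system()
    if name not in _ENVIRONMENTS:
        raise ValueError(f"unsupported platform: {name!r}")
    return _ENVIRONMENTS[name]
```

The tool dict could then use "environment": detect_environment() so the same script runs unchanged on each OS.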

Step 4: Send the First Request

Give the model a task plus the screenshot. It responds with an action.

response = client.responses.create(
    model="gpt-5.4",
    tools=[computer_tool],
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Open Firefox and search for 'agentic AI'"},
            {"type": "input_image", "image_url": f"data:image/png;base64,{screen_b64}"}
        ]
    }]
)

print(response.output[0])

Example Output

{
  "type": "computer_call",
  "action": {
    "type": "click",
    "x": 42,
    "y": 1055,
    "button": "left"
  },
  "call_id": "call_abc123"
}
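
The snippets in this guide read response.output[0]. If the output list ever carries another item ahead of the call (an assumption here, not something this guide confirms), a small search is more robust than indexing. This sketch uses a hypothetical helper, first_computer_call, that relies only on each item exposing a type attribute.

```python
from types import SimpleNamespace

def first_computer_call(output_items):
    """Return the first item whose type is "computer_call", or None."""
    for item in output_items:
        if getattr(item, "type", None) == "computer_call":
            return item
    return None

# Stand-in objects for illustration only; real items come from the SDK.
items = [
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(type="computer_call", call_id="call_abc123"),
]
call = first_computer_call(items)
print(call.call_id if call else "no action requested")
```

A None result doubles as the stop signal: the model has answered with something other than an action.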

Step 5: Execute the Action Locally

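Map the model's action to pyautogui calls. Keep a safety check so nothing runs without you: PyAutoGUI's fail-safe, enabled below, aborts the script when you push the pointer into a corner of the screen.

import time
import pyautogui

pyautogui.FAILSAFE = True  # moving the mouse to a screen corner raises FailSafeException

def run_action(action):
    """Execute one action dict returned by the model."""
    t = action["type"]
    if t == "click":
        pyautogui.click(action["x"], action["y"], button=action.get("button", "left"))
    elif t == "type":
        pyautogui.typewrite(action["text"], interval=0.02)
    elif t == "key":
        pyautogui.hotkey(*action["keys"])
    elif t == "scroll":
        pyautogui.scroll(action["dy"])  # pyautogui: positive values scroll up
    elif t == "wait":
        time.sleep(action["ms"] / 1000)  # "ms" as listed in the actions table
    elif t == "screenshot":
        pass  # handled in loop
    else:
        raise ValueError(f"Unsupported action type: {t}")

# The SDK returns typed objects, not dicts, so convert before indexing.
run_action(response.output[0].action.model_dump())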
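
Step 5 asks for a safety check so nothing runs without you. One way to get that is a confirmation prompt in front of every action. The sketch below is an illustration, not part of the OpenAI SDK or PyAutoGUI: confirm_and_run is a hypothetical wrapper, and the ask parameter exists only so the prompt can be swapped out in tests.

```python
def confirm_and_run(action, runner, ask=input):
    """Run runner(action) only if the user answers "y" (hypothetical helper).

    `ask` is injectable so the gate can be exercised without a terminal.
    Returns True if the action was executed, False if it was skipped.
    """
    answer = ask(f"Execute {action}? [y/N] ").strip().lower()
    if answer != "y":
        return False
    runner(action)
    return True
```

Calling confirm_and_run(action, run_action) in place of run_action(action) makes every click and keystroke opt-in while you are still learning how the agent behaves.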

Step 6: Loop Until Task Complete

After each action, send a fresh screenshot plus the call_id. Stop when the model returns a final message instead of a computer_call.

prev_id = response.id
call_id = response.output[0].call_id  # SDK objects use attribute access, not dict keys

while True:
    screen_b64 = capture_screen()
    resp = client.responses.create(
        model="gpt-5.4",
        previous_response_id=prev_id,
        tools=[computer_tool],
        input=[{
            "type": "computer_call_output",
            "call_id": call_id,
            "output": {"type": "input_image",
                       "image_url": f"data:image/png;base64,{screen_b64}"}
        }]
    )
    out = resp.output[0]
    if out.type != "computer_call":
        # No further action requested: print the model's final text and stop.
        print("Done:", resp.output_text)
        break
    run_action(out.action.model_dump())  # run_action expects a plain dict
    prev_id, call_id = resp.id, out.call_id
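
The loop above runs until the model stops asking for actions. A bounded variant enforces the cap of 20 to 30 iterations recommended under Pro Tips. In this sketch, run_bounded and its step argument are hypothetical names: step stands for one screenshot, request, and action round, and returns True once the model has sent its final message.

```python
def run_bounded(step, max_steps=25):
    """Call step() until it returns True or max_steps rounds have run.

    Returns the number of rounds executed; raises RuntimeError if the
    cap is reached before step() reports completion.
    """
    for n in range(1, max_steps + 1):
        if step():
            return n
    raise RuntimeError(f"gave up after {max_steps} steps")
```

Raising instead of returning quietly makes a runaway task visible in logs, which matters when the agent runs unattended.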

Supported Actions at a Glance

Action      Purpose         Key Fields
click       Mouse click     x, y, button
type        Keyboard input  text
key         Hotkey combo    keys[]
scroll      Scroll page     dx, dy
screenshot  Re-observe      none
wait        Pause for UI    ms
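
The table above can double as a validation step before anything touches the mouse or keyboard. The sketch below copies the field names from the table. Treating every listed field as expected is an assumption, and missing_fields is a hypothetical helper.

```python
# Expected fields per action type, copied from the table above.
EXPECTED_FIELDS = {
    "click": ("x", "y", "button"),
    "type": ("text",),
    "key": ("keys",),
    "scroll": ("dx", "dy"),
    "screenshot": (),
    "wait": ("ms",),
}

def missing_fields(action):
    """Return the fields listed for this action type that the dict lacks."""
    kind = action.get("type")
    if kind not in EXPECTED_FIELDS:
        raise ValueError(f"unknown action type: {kind!r}")
    return [name for name in EXPECTED_FIELDS[kind] if name not in action]
```

An empty list means the action carries everything the table lists; anything else is worth logging before deciding whether to execute or skip.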

Pro Tips

  • Run in a sandbox or VM — agents sometimes click unexpected things.
  • Cap iterations at 20–30 to avoid runaway loops.
  • Use the Responses API with previous_response_id for state.
  • Log every action and screenshot for debugging.
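  • On HiDPI screens, the screenshot can have more pixels than the logical size pyautogui.size() reports; if clicks land off-target, rescale coordinates between the two.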
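
For the HiDPI tip, the fix usually comes down to a ratio between two pixel grids. The sketch below assumes the model answers in the pixel grid of the screenshot it was sent, which this guide does not confirm. scale_point is a hypothetical helper.

```python
def scale_point(x, y, from_size, to_size):
    """Map (x, y) from one pixel grid to another.

    from_size: (width, height) of the image the model saw
    to_size:   (width, height) of the grid the mouse driver uses
    """
    fw, fh = from_size
    tw, th = to_size
    return round(x * tw / fw), round(y * th / fh)

# Example: a 2880x1800 screenshot on a display whose logical size is 1440x900.
print(scale_point(84, 2110, (2880, 1800), (1440, 900)))  # (42, 1055)
```

In practice the two sizes would come from pyautogui.screenshot().size and pyautogui.size(); when they match, the function returns the point unchanged.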

Next Steps

Start with a tiny task like 'open calculator and compute 42 * 7'. Once the loop feels stable, graduate to multi-app workflows — scraping a dashboard, filing a report, or testing a web app end-to-end.

GPT-5.4 Computer Use turns desktop automation into a plain-English conversation. Build your first agent today, and you may find yourself writing far fewer brittle Selenium scripts.
