Performance Benchmarks

Comprehensive evaluation of IrisOS across various real-world scenarios and tasks.

WebArena

Open-source benchmark for evaluating web agents on real-world tasks using self-hosted websites

Success Rate: 62.5%
Type: Complex Web Tasks
Domains: E-commerce, CMS, Social Forums

OpenAI Agent

Computer-using agent capable of performing general-purpose tasks through web UI interaction

Success Rate: 85.2%
Type: System Tasks
Domains: Web Browsing, UI Interaction

IrisOS Agent

Our comprehensive benchmark suite for testing AI agents in real-world automation scenarios, with flexible model support

Success Rate: 88.7%
Type: Integrated Tasks
Domains: Web Automation, System Control

About Our Benchmarks

IrisOS performs strongly across different benchmarks. WebArena focuses on complex web tasks and WebVoyager tests simple web interactions, while our FlowBench evaluation, which combines web and system automation in integrated scenarios, shows the strongest results.

With a success rate of 89.5%, IrisOS comes within 2.8 percentage points of human-level performance (92.3%) on complex, multi-step automation tasks, substantially narrowing the gap between AI and human performance in real-world applications.