Performance comparison of different backbone LLMs under attack scenarios.
Model | Modal | Acc_safe | Acc_attack | MR | Score |
---|---|---|---|---|---|
gpt-4o | text-based | 62.45 | 40.38 | 29.17 | |
gpt-4o | vision-based | 72.62 | 41.58 | 42.29 | |
gpt-4o | multi-modal | 70.90 | 32.58 | 44.21 | |
gpt-4o-mini | text-based | 55.57 | 36.97 | 37.30 | |
gpt-4o-mini | vision-based | 68.21 | 29.56 | 52.62 | |
gpt-4o-mini | multi-modal | 67.63 | 19.78 | 61.20 | |
Claude 3.7 sonnet | text-based | 72.71 | 63.11 | 16.34 | |
Claude 3.7 sonnet | vision-based | 73.10 | 59.67 | 21.75 | |
Claude 3.7 sonnet | multi-modal | 82.32 | 64.09 | 23.17 | |
DeepSeek V3 | text-based | 62.80 | 53.97 | 23.57 | |
DeepSeek R1 | text-based | 64.52 | 50.33 | 21.43 | |
Notes:
- The score is calculated as the average of Acc_safe - Acc_attack + MR
Success Rates and Misleading Rates of agents on different attack settings.
Threat Level | Action | Metric | M3A (4o) | M3A (mini) | SeeAct (4o) | SeeAct (mini) | T3A (4o) | T3A (mini) | T3A (R1) | AutoDroid (4o) | AutoDroid (mini) | AutoDroid (R1) | Cog (Cog-9B) | UGround (4o) | UGround (mini) | Aria UI (4o) | Aria UI (mini) | Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Clean Env | - | SR | 45.8 | 20.0 | 18.3 | 6.5 | 45.8 | 10.0 | 39.2 | 21.7 | 8.3 | 19.2 | 17.5 | 48.3 | 31.8 | 38.3 | 18.3 | 25.9 |
Simple | Mislead to Click | ΔSR | -6.4 | -7.0 | -4.5 | -1.1 | -6.1 | -1.7 | -0.9 | 0.0 | -4.9 | 2.5 | -7.3 | -14.6 | -8.7 | -0.4 | -1.6 | -4.2 |
Simple | Mislead to Click | MR | 6.3 | 13.0 | 10.2 | 21.6 | 8.3 | 6.2 | 5.0 | 3.6 | 12.1 | 3.3 | 4.3 | 6.7 | 5.1 | 5.4 | 3.3 | 7.6 |
Simple | Mislead to Navigate | ΔSR | -2.8 | -11.3 | 1.0 | -6.5 | 4.3 | 0.4 | 3.8 | 0.0 | -3.1 | 0.8 | - | -17.5 | -9.1 | -1.6 | -4.5 | -3.3 |
Simple | Mislead to Navigate | MR | 0.0 | 4.3 | 0.0 | 2.6 | 0.0 | 4.2 | 0.0 | 3.0 | 0.0 | 0.0 | - | 4.3 | 15.9 | 2.0 | 12.1 | 3.5 |
Simple | Mislead to Terminate | ΔSR | 1.8 | -9.1 | -0.5 | -4.3 | -6.1 | -3.3 | 4.1 | -0.7 | -3.1 | -0.9 | - | -19.4 | -6.2 | 3.4 | -9.2 | -3.8 |
Simple | Mislead to Terminate | MR | 10.9 | 25.5 | 4.1 | 0.0 | 16.7 | 0.0 | 26.7 | 14.0 | 5.2 | 33.3 | - | 34.7 | 39.5 | 21.7 | 25.5 | 18.4 |
Medium | Mislead to Click | ΔSR | -16.2 | -12.5 | -6.5 | -6.5 | -14.4 | 2.5 | 0.8 | -4.8 | -1.4 | 2.5 | -7.0 | -20.7 | -11.8 | -13.7 | -9.8 | -8.0 |
Medium | Mislead to Click | MR | 27.1 | 52.8 | 39.6 | 51.1 | 10.4 | 6.2 | 9.0 | 30.5 | 20.0 | 11.7 | 11.9 | 28.3 | 47.5 | 28.1 | 33.9 | 27.2 |
Medium | Mislead to Navigate | ΔSR | -13.5 | -13.8 | -5.0 | -6.5 | -16.5 | 0.6 | -1.2 | 0.0 | -6.9 | 4.1 | - | -18.6 | -9.8 | -6.3 | -6.0 | -7.1 |
Medium | Mislead to Navigate | MR | 18.8 | 37.5 | 5.8 | 24.4 | 4.2 | 12.8 | 0.0 | 4.1 | 22.4 | 0.0 | - | 21.7 | 41.5 | 14.0 | 22.8 | 16.4 |
Medium | Mislead to Terminate | ΔSR | -23.1 | -12.5 | -6.5 | -6.5 | -14.0 | -1.7 | -4.2 | -2.7 | -4.5 | -2.5 | - | -26.7 | -22.0 | -11.6 | -11.1 | -10.7 |
Medium | Mislead to Terminate | MR | 40.5 | 39.6 | 22.9 | 10.8 | 33.3 | 14.6 | 33.3 | 24.1 | 24.1 | 28.3 | - | 38.3 | 39.0 | 35.0 | 45.0 | 30.6 |
Complex | Mislead to Click | ΔSR | -18.4 | -10.4 | 1.0 | -3.7 | -30.6 | 0.4 | -4.2 | -9.7 | 3.8 | -2.5 | -1.7 | -20.4 | -12.3 | -14.6 | -8.3 | -8.8 |
Complex | Mislead to Click | MR | 37.5 | 59.6 | 27.5 | 38.9 | 31.3 | 37.5 | 28.3 | 22.4 | 34.5 | 12.1 | 20.0 | 32.7 | 56.1 | 34.5 | 50.0 | 34.9 |
Complex | Mislead to Navigate | ΔSR | -10.5 | -8.5 | -0.7 | -4.0 | -16.5 | 0.4 | -7.2 | 0.0 | -4.9 | 0.8 | - | -14.2 | -6.8 | -3.3 | -5.6 | -5.8 |
Complex | Mislead to Navigate | MR | 8.5 | 15.4 | 0.0 | 2.5 | 8.3 | 6.2 | 0.0 | 3.5 | 5.2 | 0.0 | - | 10.9 | 24.4 | 15.6 | 18.5 | 8.5 |
Complex | Mislead to Terminate | ΔSR | -29.9 | -14.5 | -10.6 | -4.1 | -33.2 | -3.8 | -9.2 | -7.7 | 0.0 | -7.5 | - | -36.2 | -22.3 | -20.0 | -14.7 | -15.3 |
Complex | Mislead to Terminate | MR | 56.3 | 67.3 | 23.4 | 17.1 | 27.0 | 33.3 | 50.0 | 50.0 | 20.7 | 43.3 | - | 46.8 | 45.2 | 50.0 | 73.3 | 43.1 |
Notes:
- SR: Success Rate
- ΔSR: Change in Success Rate
- MR: Misleading Rate
- Highlighted values represent the most impactful values in each category
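
As a quick sanity check on how to read the table, the following Python sketch recomputes the Avg. column for two rows and derives success rates under attack. It rests on two assumptions that the notes do not state explicitly: that Avg. is the arithmetic mean over the agent/backbone columns with `-` entries skipped, and that ΔSR is the attacked SR minus the clean SR in percentage points. The helper names `row_average` and `attacked_sr` are illustrative only.

```python
from statistics import mean

# Clean-environment SR row, copied from the table (15 agent/backbone columns).
CLEAN_SR = [45.8, 20.0, 18.3, 6.5, 45.8, 10.0, 39.2, 21.7,
            8.3, 19.2, 17.5, 48.3, 31.8, 38.3, 18.3]

# ΔSR row for Simple / Mislead to Navigate; None marks the '-' entry (Cog-9B).
DELTA_SR_SIMPLE_NAVIGATE = [-2.8, -11.3, 1.0, -6.5, 4.3, 0.4, 3.8, 0.0,
                            -3.1, 0.8, None, -17.5, -9.1, -1.6, -4.5]


def row_average(values):
    """Mean over the reported entries, rounded to one decimal.

    Assumption: missing ('-') entries are excluded from the average.
    """
    present = [v for v in values if v is not None]
    return round(mean(present), 1)


def attacked_sr(clean, delta):
    """SR under attack per column.

    Assumption: delta = SR_attack - SR_clean, in percentage points.
    """
    return [None if d is None else round(c + d, 1)
            for c, d in zip(clean, delta)]


clean_avg = row_average(CLEAN_SR)
navigate_avg = row_average(DELTA_SR_SIMPLE_NAVIGATE)
sr_under_attack = attacked_sr(CLEAN_SR, DELTA_SR_SIMPLE_NAVIGATE)

print(clean_avg, navigate_avg, sr_under_attack[0])
```

Under these assumptions the recomputed averages agree with the Avg. values reported for both rows (25.9 and -3.3), which suggests the reading is at least consistent with the table.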