Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties

Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu

Overview of the AgentHazard dynamic task execution environment.

Abstract

Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios, where screen content may be partially manipulated by untrustworthy third parties, remains largely unexplored. Owing to their black-box and autonomous nature, these agents are vulnerable to manipulations that could compromise user devices. In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce AgentHazard, a scalable attack simulation framework that enables flexible and targeted modifications of screen content within existing applications. Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment and a static dataset of vision-language-action tuples, totaling over 3,000 attack scenarios. The dynamic environment encompasses 58 reproducible tasks in an emulator with various types of hazardous UI content, while the static dataset is constructed from 210 screenshots collected from 14 popular commercial apps. Importantly, our content modifications are designed to be feasible for unprivileged third parties. We evaluate 7 widely-used mobile GUI agents and 5 common backbone models using our benchmark. Our findings reveal that all examined agents are significantly influenced by misleading third-party content (with an average misleading rate of 28.8% in human-crafted attack scenarios) and that their vulnerabilities are closely linked to their perception modalities and backbone LLMs. Furthermore, we assess training-based mitigation strategies, highlighting both the challenges and opportunities for enhancing the robustness of mobile GUI agents.

Experimental Results

Performance comparison of different backbone LLMs under attack scenarios.

| Model | Modal | Accsafe | Accattack | MR | Score |
| --- | --- | --- | --- | --- | --- |
| gpt-4o | text-based | 62.45 | 40.38 | 29.17 | – |
| gpt-4o | vision-based | 72.62 | 41.58 | 42.29 | – |
| gpt-4o | multi-modal | 70.90 | 32.58 | 44.21 | – |
| gpt-4o-mini | text-based | 55.57 | 36.97 | 37.30 | – |
| gpt-4o-mini | vision-based | 68.21 | 29.56 | 52.62 | – |
| gpt-4o-mini | multi-modal | 67.63 | 19.78 | 61.20 | – |
| Claude 3.7 Sonnet | text-based | 72.71 | 63.11 | 16.34 | – |
| Claude 3.7 Sonnet | vision-based | 73.10 | 59.67 | 21.75 | – |
| Claude 3.7 Sonnet | multi-modal | 82.32 | 64.09 | 23.17 | – |
| DeepSeek V3 | text-based | 62.80 | 53.97 | 23.57 | – |
| DeepSeek R1 | text-based | 64.52 | 50.33 | 21.43 | – |

Notes:

  • The score is calculated as the average of Accsafe - Accattack + MR
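The Score note above is terse, so as a rough illustration only (not necessarily the authors' exact metric), the sketch below reads it as a per-row quantity Accsafe − Accattack + MR, averaged across a model's modality rows, using the gpt-4o rows from the table as sample input:

```python
def row_score(acc_safe: float, acc_attack: float, mr: float) -> float:
    # One plausible reading of the note: combine the accuracy gap under
    # attack with the misleading rate into a single per-row number.
    return acc_safe - acc_attack + mr

# gpt-4o rows from the table above: (Accsafe, Accattack, MR)
gpt_4o_rows = [
    (62.45, 40.38, 29.17),  # text-based
    (72.62, 41.58, 42.29),  # vision-based
    (70.90, 32.58, 44.21),  # multi-modal
]

avg_score = sum(row_score(*row) for row in gpt_4o_rows) / len(gpt_4o_rows)
print(f"average score across modalities: {avg_score:.2f}")
```

The paper's actual Score may weight or normalize these terms differently; this sketch only makes the stated arithmetic concrete.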

Success Rates and Misleading Rates of agents under different attack settings.

| Threat Level | Action | Metric | M3A (4o) | M3A (mini) | SeeAct (4o) | SeeAct (mini) | T3A (4o) | T3A (mini) | T3A (R1) | AutoDroid (4o) | AutoDroid (mini) | AutoDroid (R1) | Cog (Cog-9B) | UGround (4o) | UGround (mini) | Aria UI (4o) | Aria UI (mini) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clean Env | – | SR | 45.8 | 20.0 | 18.3 | 6.5 | 45.8 | 10.0 | 39.2 | 21.7 | 8.3 | 19.2 | 17.5 | 48.3 | 31.8 | 38.3 | 18.3 | 25.9 |
| Simple | Mislead to Click | ΔSR | -6.4 | -7.0 | -4.5 | -1.1 | -6.1 | -1.7 | -0.9 | 0.0 | -4.9 | 2.5 | -7.3 | -14.6 | -8.7 | -0.4 | -1.6 | -4.2 |
| Simple | Mislead to Click | MR | 6.3 | 13.0 | 10.2 | 21.6 | 8.3 | 6.2 | 5.0 | 3.6 | 12.1 | 3.3 | 4.3 | 6.7 | 5.1 | 5.4 | 3.3 | 7.6 |
| Simple | Mislead to Navigate | ΔSR | -2.8 | -11.3 | 1.0 | -6.5 | 4.3 | 0.4 | 3.8 | 0.0 | -3.1 | 0.8 | – | -17.5 | -9.1 | -1.6 | -4.5 | -3.3 |
| Simple | Mislead to Navigate | MR | 0.0 | 4.3 | 0.0 | 2.6 | 0.0 | 4.2 | 0.0 | 3.0 | 0.0 | 0.0 | – | 4.3 | 15.9 | 2.0 | 12.1 | 3.5 |
| Simple | Mislead to Terminate | ΔSR | 1.8 | -9.1 | -0.5 | -4.3 | -6.1 | -3.3 | 4.1 | -0.7 | -3.1 | -0.9 | – | -19.4 | -6.2 | 3.4 | -9.2 | -3.8 |
| Simple | Mislead to Terminate | MR | 10.9 | 25.5 | 4.1 | 0.0 | 16.7 | 0.0 | 26.7 | 14.0 | 5.2 | 33.3 | – | 34.7 | 39.5 | 21.7 | 25.5 | 18.4 |
| Medium | Mislead to Click | ΔSR | -16.2 | -12.5 | -6.5 | -6.5 | -14.4 | 2.5 | 0.8 | -4.8 | -1.4 | 2.5 | -7.0 | -20.7 | -11.8 | -13.7 | -9.8 | -8.0 |
| Medium | Mislead to Click | MR | 27.1 | 52.8 | 39.6 | 51.1 | 10.4 | 6.2 | 9.0 | 30.5 | 20.0 | 11.7 | 11.9 | 28.3 | 47.5 | 28.1 | 33.9 | 27.2 |
| Medium | Mislead to Navigate | ΔSR | -13.5 | -13.8 | -5.0 | -6.5 | -16.5 | 0.6 | -1.2 | 0.0 | -6.9 | 4.1 | – | -18.6 | -9.8 | -6.3 | -6.0 | -7.1 |
| Medium | Mislead to Navigate | MR | 18.8 | 37.5 | 5.8 | 24.4 | 4.2 | 12.8 | 0.0 | 4.1 | 22.4 | 0.0 | – | 21.7 | 41.5 | 14.0 | 22.8 | 16.4 |
| Medium | Mislead to Terminate | ΔSR | -23.1 | -12.5 | -6.5 | -6.5 | -14.0 | -1.7 | -4.2 | -2.7 | -4.5 | -2.5 | – | -26.7 | -22.0 | -11.6 | -11.1 | -10.7 |
| Medium | Mislead to Terminate | MR | 40.5 | 39.6 | 22.9 | 10.8 | 33.3 | 14.6 | 33.3 | 24.1 | 24.1 | 28.3 | – | 38.3 | 39.0 | 35.0 | 45.0 | 30.6 |
| Complex | Mislead to Click | ΔSR | -18.4 | -10.4 | 1.0 | -3.7 | -30.6 | 0.4 | -4.2 | -9.7 | 3.8 | -2.5 | -1.7 | -20.4 | -12.3 | -14.6 | -8.3 | -8.8 |
| Complex | Mislead to Click | MR | 37.5 | 59.6 | 27.5 | 38.9 | 31.3 | 37.5 | 28.3 | 22.4 | 34.5 | 12.1 | 20.0 | 32.7 | 56.1 | 34.5 | 50.0 | 34.9 |
| Complex | Mislead to Navigate | ΔSR | -10.5 | -8.5 | -0.7 | -4.0 | -16.5 | 0.4 | -7.2 | 0.0 | -4.9 | 0.8 | – | -14.2 | -6.8 | -3.3 | -5.6 | -5.8 |
| Complex | Mislead to Navigate | MR | 8.5 | 15.4 | 0.0 | 2.5 | 8.3 | 6.2 | 0.0 | 3.5 | 5.2 | 0.0 | – | 10.9 | 24.4 | 15.6 | 18.5 | 8.5 |
| Complex | Mislead to Terminate | ΔSR | -29.9 | -14.5 | -10.6 | -4.1 | -33.2 | -3.8 | -9.2 | -7.7 | 0.0 | -7.5 | – | -36.2 | -22.3 | -20.0 | -14.7 | -15.3 |
| Complex | Mislead to Terminate | MR | 56.3 | 67.3 | 23.4 | 17.1 | 27.0 | 33.3 | 50.0 | 50.0 | 20.7 | 43.3 | – | 46.8 | 45.2 | 50.0 | 73.3 | 43.1 |

Notes:

  • SR: Success Rate (%)
  • ΔSR: Change in Success Rate relative to the clean environment (percentage points)
  • MR: Misleading Rate (%)
  • Backbone labels 4o, mini, and R1 denote gpt-4o, gpt-4o-mini, and DeepSeek R1, respectively
  • Highlighted values represent the most impactful values in each category
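The three metrics can be made concrete with a small sketch. Assuming hypothetical per-episode records with `success` (task completed) and `misled` (agent acted on the injected content) flags, which are illustrative field names rather than the benchmark's actual schema, SR, ΔSR, and MR could be computed as:

```python
def success_rate(episodes) -> float:
    """SR: percentage of episodes in which the agent completed its task."""
    return 100.0 * sum(e["success"] for e in episodes) / len(episodes)

def misleading_rate(episodes) -> float:
    """MR: percentage of episodes in which the agent acted on injected content."""
    return 100.0 * sum(e["misled"] for e in episodes) / len(episodes)

def delta_sr(clean_episodes, attacked_episodes) -> float:
    """ΔSR: change in success rate caused by the attack (percentage points)."""
    return success_rate(attacked_episodes) - success_rate(clean_episodes)

# Toy data: 4 clean runs and 4 runs under attack (fields are hypothetical).
clean = [{"success": True, "misled": False}] * 3 + [{"success": False, "misled": False}]
attacked = [{"success": True, "misled": False}] * 2 + [{"success": False, "misled": True}] * 2

print(success_rate(clean))        # 75.0
print(delta_sr(clean, attacked))  # -25.0
print(misleading_rate(attacked))  # 50.0
```

A negative ΔSR means the attack reduced task success, while a high MR with a small ΔSR indicates the agent followed the injected content without necessarily failing its task, which is why the tables report both.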