It's a multimodal AI agent that runs in the background and continuously captures screenshots to understand what is on the screen, then directs the main action agent to act accordingly, using ...
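
The observe-then-direct loop described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `run_observer_loop`, `capture`, `describe`, and `act` are hypothetical names, the screenshot source and the multimodal model are passed in as plain callables because the description does not name them, and skipping unchanged frames is an added assumption to avoid redundant model calls.

```python
import hashlib
import time
from typing import Callable


def run_observer_loop(
    capture: Callable[[], bytes],     # hypothetical: returns raw screenshot bytes
    describe: Callable[[bytes], str], # hypothetical: multimodal model call
    act: Callable[[str], None],       # hypothetical: the main action agent
    max_frames: int,
    interval_s: float = 0.0,
) -> int:
    """Capture frames, describe each changed frame, and forward it to the action agent.

    Returns the number of frames that were forwarded.
    """
    last_digest = None
    handled = 0
    for _ in range(max_frames):
        frame = capture()
        # Assumption: hash the frame so identical consecutive screens are skipped.
        digest = hashlib.sha256(frame).hexdigest()
        if digest != last_digest:
            last_digest = digest
            act(describe(frame))
            handled += 1
        if interval_s > 0:
            time.sleep(interval_s)
    return handled
```

For example, feeding the loop three stub frames where the first two are identical results in only two calls to the action agent. In a real setup, `capture` would wrap a screen-capture library and `describe` would wrap a vision-capable model.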