With the rapid development of Multimodal Large Language Models (MLLMs), multimodal agents built on MLLMs are steadily moving into real-world applications. This progress makes it realistic to use a multimodal agent as a mobile operation assistant that completes complex tasks intelligently through visual perception and multimodal interaction.
This post walks through the study "Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception," which shows how a multimodal agent can operate a phone autonomously. Mobile-Agent is an autonomous multimodal mobile device agent that performs intelligent phone operations through visual perception; beyond extending the reach of on-device intelligence, it points to new possibilities for automation.
Mobile-Agent's core strength is its visual perception capability. It accurately identifies and localizes both the visual and textual elements of an app's front-end interface, autonomously plans and decomposes complex tasks from the perceived visual context, and then navigates the app by executing the plan step by step. Because the design is vision-centric, Mobile-Agent does not rely on an app's XML files or on mobile-system metadata, which makes it more adaptable across diverse mobile operating environments and avoids the need for system-specific customization.
To evaluate Mobile-Agent, the research team introduces Mobile-Eval, a benchmark built specifically to assess mobile device operation. Comprehensive tests on Mobile-Eval show clear advantages in both operation accuracy and task completion rate, and the agent remains efficient even on complex instructions such as multi-app operations.
To support further work in this area, the team has open-sourced the code and models at https://github.com/X-PLUG/MobileAgent. Beyond demonstrating the potential of multimodal agents on mobile devices, this work lays a solid foundation for more complex and more capable mobile operation agents.
Here is an example of Mobile-Agent searching for a video on YouTube and leaving a comment. The user asks the agent to search YouTube for videos about a particular celebrity, find suitable content, and post a comment. Throughout the process, Mobile-Agent completed the task without a single wrong, redundant, or invalid operation, demonstrating strong stability and execution ability.
Next is a multi-app example: the user asks the agent to first look up today's match result and then write a news piece based on it. The challenge is that two apps must be used for the two subtasks, and the output of the first subtask feeds the second. Mobile-Agent first looked up the match result, then exited the browser, opened the notes app, and finally wrote the result out accurately in the form of a news report.
To act on its decisions, Mobile-Agent works with a compact, fixed action space (a sketch of how such actions might be parsed follows the list):

Open App (app name)
Click the text (text content)
Click the icon (icon description)
Type (text content)
Page up / Page down
Back to the previous page
Exit the App
Stop
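This fixed action space lets the model's free-form plan be mapped deterministically onto device operations. As a minimal, hypothetical sketch (the regex patterns and function names below are illustrative, not from the released code), the textual actions could be parsed like this:

import re

# Hypothetical patterns for the parameterized actions; the released code may
# parse the model output differently.
ACTION_PATTERNS = {
    "open_app":   re.compile(r"Open App \((.+)\)"),
    "click_text": re.compile(r"Click the text \((.+)\)"),
    "click_icon": re.compile(r"Click the icon \((.+)\)"),
    "type":       re.compile(r"Type \((.+)\)"),
}
SIMPLE_ACTIONS = {"Page up": "page_up", "Page down": "page_down",
                  "Back": "back", "Exit": "exit", "Stop": "stop"}

def parse_action(line):
    # Map one line of model output to an (action, argument) pair.
    line = line.strip()
    for name, pattern in ACTION_PATTERNS.items():
        match = pattern.fullmatch(line)
        if match:
            return name, match.group(1)
    return SIMPLE_ACTIONS.get(line, "unknown"), None

print(parse_action("Open App (YouTube)"))  # ('open_app', 'YouTube')
print(parse_action("Stop"))                # ('stop', None)

To ground these actions on screen, Mobile-Agent localizes text with OCR and icons with a detector plus CLIP, then merges the resulting bounding boxes before acting. The utilities below, drawn from the project's repository (see the references), implement that box-merging logic.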
import numpy as np


def calculate_iou(box1, box2):
    # Intersection over Union for two boxes in [x_min, y_min, x_max, y_max]
    # format, treating coordinates as continuous (no inclusive-pixel +1).
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2

    inter_x_min = max(x1_min, x2_min)
    inter_y_min = max(y1_min, y2_min)
    inter_x_max = min(x1_max, x2_max)
    inter_y_max = min(y1_max, y2_max)

    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter_area = max(0, inter_x_max - inter_x_min) * max(0, inter_y_max - inter_y_min)

    box1_area = (x1_max - x1_min) * (y1_max - y1_min)
    box2_area = (x2_max - x2_min) * (y2_max - y2_min)
    union_area = box1_area + box2_area - inter_area

    iou = inter_area / union_area
    return iou
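A quick hand-checked sanity test: boxes [0, 0, 2, 2] and [1, 1, 3, 3] intersect in a unit square, so the IoU is 1 / (4 + 4 - 1).

# Sanity check: the boxes intersect in a 1x1 square, each has area 4,
# so IoU = 1 / (4 + 4 - 1) = 1/7 ≈ 0.1429.
print(calculate_iou([0, 0, 2, 2], [1, 1, 3, 3]))  # 0.14285714285714285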
def compute_iou(box1, box2):
    """
    Compute the Intersection over Union (IoU) of two bounding boxes.
    Parameters:
    - box1: list or array [x1, y1, x2, y2]
    - box2: list or array [x1, y1, x2, y2]
    Returns:
    - iou: float, IoU value
    """
    x1_inter = max(box1[0], box2[0])
    y1_inter = max(box1[1], box2[1])
    x2_inter = min(box1[2], box2[2])
    y2_inter = min(box1[3], box2[3])

    # The +1 terms treat box edges as inclusive pixel indices.
    inter_area = max(0, x2_inter - x1_inter + 1) * max(0, y2_inter - y1_inter + 1)

    box1_area = (box1[2] - box1[0] + 1) * (box1[3] - box1[1] + 1)
    box2_area = (box2[2] - box2[0] + 1) * (box2[3] - box2[1] + 1)

    iou = inter_area / float(box1_area + box2_area - inter_area)
    return iou
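Because of the +1 terms, compute_iou scores the same pair of boxes higher than calculate_iou; a hand-checked comparison:

# Inclusive-pixel convention: intersection (2-1+1)^2 = 4, each box (2-0+1)^2 = 9,
# union 14, versus 1/7 under the continuous convention above.
print(compute_iou([0, 0, 2, 2], [1, 1, 3, 3]))    # 4/14 ≈ 0.2857
print(calculate_iou([0, 0, 2, 2], [1, 1, 3, 3]))  # 1/7  ≈ 0.1429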
def merge_boxes(box1, box2):
    x1_min, y1_min, x1_max, y1_max = box1
    x2_min, y2_min, x2_max, y2_max = box2
    merged_box = [min(x1_min, x2_min), min(y1_min, y2_min),
                  max(x1_max, x2_max), max(y1_max, y2_max)]
    return merged_box
def merge_boxes_and_texts(texts, boxes, iou_threshold=0):
    """
    Merge bounding boxes and their corresponding texts based on IoU threshold.
    Parameters:
    - texts: List of texts corresponding to each bounding box.
    - boxes: List of bounding boxes, with each box represented as [x1, y1, x2, y2].
    - iou_threshold: Intersection-over-Union threshold for merging boxes.
    Returns:
    - merged_texts: List of merged texts corresponding to the bounding boxes.
    - merged_boxes: List of merged bounding boxes.
    """
    if len(boxes) == 0:
        return [], []

    merged_boxes = []
    merged_texts = []
    while len(boxes) > 0:
        # Take the first remaining box as the seed of a new group.
        box = boxes[0]
        text = texts[0]
        boxes = boxes[1:]
        texts = texts[1:]

        to_merge_boxes = [box]
        to_merge_texts = [text]
        keep_boxes = []
        keep_texts = []
        for i, other_box in enumerate(boxes):
            if compute_iou(box, other_box) > iou_threshold:
                to_merge_boxes.append(other_box)
                to_merge_texts.append(texts[i])
            else:
                keep_boxes.append(other_box)
                keep_texts.append(texts[i])

        # Merge the grouped boxes into a single enclosing box.
        if len(to_merge_boxes) > 1:
            x1 = min(b[0] for b in to_merge_boxes)
            y1 = min(b[1] for b in to_merge_boxes)
            x2 = max(b[2] for b in to_merge_boxes)
            y2 = max(b[3] for b in to_merge_boxes)
            merged_boxes.append([x1, y1, x2, y2])
            merged_texts.append(" ".join(to_merge_texts))  # merging strategy: space-join
        else:
            merged_boxes.extend(to_merge_boxes)
            merged_texts.extend(to_merge_texts)

        boxes = keep_boxes
        texts = keep_texts

    return merged_texts, merged_boxes
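A usage sketch with hand-checked numbers: the first two OCR fragments overlap (compute_iou > 0), so they merge into one line, while the third stays separate.

# Two horizontally adjacent OCR fragments overlap, so they merge into one
# region; the third box is disjoint and kept as-is.
texts = ["Mobile", "Agent", "Settings"]
boxes = [[10, 10, 60, 30], [55, 10, 110, 30], [10, 100, 80, 120]]
merged_texts, merged_boxes = merge_boxes_and_texts(texts, boxes)
print(merged_texts)  # ['Mobile Agent', 'Settings']
print(merged_boxes)  # [[10, 10, 110, 30], [10, 100, 80, 120]]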
def is_contained(bbox1, bbox2):
    x1_min, y1_min, x1_max, y1_max = bbox1
    x2_min, y2_min, x2_max, y2_max = bbox2
    # True if either box fully contains the other.
    if x1_min >= x2_min and y1_min >= y2_min and x1_max <= x2_max and y1_max <= y2_max:
        return True
    elif x2_min >= x1_min and y2_min >= y1_min and x2_max <= x1_max and y2_max <= y1_max:
        return True
    return False


def is_overlapping(bbox1, bbox2):
    x1_min, y1_min, x1_max, y1_max = bbox1
    x2_min, y2_min, x2_max, y2_max = bbox2
    inter_xmin = max(x1_min, x2_min)
    inter_ymin = max(y1_min, y2_min)
    inter_xmax = min(x1_max, x2_max)
    inter_ymax = min(y1_max, y2_max)
    # A strictly positive intersection rectangle means the boxes overlap.
    return inter_xmin < inter_xmax and inter_ymin < inter_ymax


def get_area(bbox):
    x_min, y_min, x_max, y_max = bbox
    return (x_max - x_min) * (y_max - y_min)
def merge_all_icon_boxes(bboxes):
    # Deduplicate icon detections: when one box contains or overlaps another,
    # keep the smaller (tighter) box. Note: consumes the input list.
    result_bboxes = []
    while bboxes:
        bbox = bboxes.pop(0)
        to_add = True
        for idx, existing_bbox in enumerate(result_bboxes):
            if is_contained(bbox, existing_bbox):
                # Keep the smaller of the two nested boxes.
                if get_area(bbox) < get_area(existing_bbox):
                    result_bboxes[idx] = bbox
                to_add = False
                break
            elif is_overlapping(bbox, existing_bbox):
                if get_area(bbox) < get_area(existing_bbox):
                    result_bboxes[idx] = bbox
                to_add = False
                break
        if to_add:
            result_bboxes.append(bbox)
    return result_bboxes
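Given the keep-the-smaller-box behavior above, a quick hand-checked example: a loose detection containing a tighter one collapses to the tighter box, while a disjoint detection survives.

# A large detection fully containing a tighter one collapses to the tighter
# box; the disjoint third detection is kept. Note the input list is consumed.
icons = [[0, 0, 100, 100], [20, 20, 40, 40], [200, 200, 240, 240]]
print(merge_all_icon_boxes(icons))  # [[20, 20, 40, 40], [200, 200, 240, 240]]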
def merge_bbox_groups(A, B, iou_threshold=0.8):
    # Merge boxes from group B into group A wherever a pair aligns closely.
    i = 0
    while i < len(A):
        box_a = A[i]
        has_merged = False
        for j in range(len(B)):
            box_b = B[j]
            iou = calculate_iou(box_a, box_b)
            if iou > iou_threshold:
                A[i] = merge_boxes(box_a, box_b)
                B.pop(j)
                has_merged = True
                break
        if has_merged:
            # Re-check the grown box against the remaining boxes in B.
            i -= 1
        i += 1
    return A, B
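And a short check for merge_bbox_groups (the IoU of the first pair is about 0.82, above the 0.8 default):

# The first boxes of A and B align closely (IoU ≈ 0.82 > 0.8), so they merge
# in place and the matched B entry is consumed; the far-away B box survives.
A = [[10, 10, 50, 50]]
B = [[12, 12, 52, 52], [300, 300, 340, 340]]
A, B = merge_bbox_groups(A, B)
print(A)  # [[10, 10, 52, 52]]
print(B)  # [[300, 300, 340, 340]]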
def bbox_iou(boxA, boxB):
    # Calculate Intersection over Union (IoU) between two bounding boxes,
    # again using the inclusive-pixel (+1) convention.
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)
    iou = interArea / float(boxAArea + boxBArea - interArea)
    return iou
def merge_boxes_and_texts_new(texts, bounding_boxes, iou_threshold=0):
    # Variant of merge_boxes_and_texts: groups every box that overlaps the
    # seed box, sorts the group top-to-bottom, and concatenates the texts
    # without a separator.
    if not bounding_boxes:
        return [], []

    bounding_boxes = np.array(bounding_boxes)
    merged_boxes = []
    merged_texts = []
    used = np.zeros(len(bounding_boxes), dtype=bool)

    for i, boxA in enumerate(bounding_boxes):
        if used[i]:
            continue
        x_min, y_min, x_max, y_max = boxA
        text = ''
        overlapping_indices = [i]
        for j, boxB in enumerate(bounding_boxes):
            if i != j and not used[j] and bbox_iou(boxA, boxB) > iou_threshold:
                overlapping_indices.append(j)

        # Sort overlapping boxes by vertical center (top to bottom) so the
        # concatenated text preserves reading order.
        overlapping_indices.sort(key=lambda idx: (bounding_boxes[idx][1] + bounding_boxes[idx][3]) / 2)

        for idx in overlapping_indices:
            boxB = bounding_boxes[idx]
            x_min = min(x_min, boxB[0])
            y_min = min(y_min, boxB[1])
            x_max = max(x_max, boxB[2])
            y_max = max(y_max, boxB[3])
            text += texts[idx]
            used[idx] = True

        merged_boxes.append([x_min, y_min, x_max, y_max])
        merged_texts.append(text)

    return merged_texts, merged_boxes
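A usage sketch with hand-checked values: two vertically adjacent OCR lines whose boxes overlap are merged top-to-bottom into one block.

texts = ["Hello", "World"]
boxes = [[10, 40, 100, 60], [10, 55, 100, 80]]
merged_texts, merged_boxes = merge_boxes_and_texts_new(texts, boxes)
print(merged_texts)                       # ['HelloWorld']
print([int(v) for v in merged_boxes[0]])  # [10, 40, 100, 80]

The remaining utilities, beginning with a fresh set of imports, handle screenshot cropping and CLIP-based icon matching; cv2 comes from opencv-python and clip from OpenAI's CLIP package.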
import math

import cv2
import numpy as np
import torch
from PIL import Image, ImageDraw, ImageFont

import clip


def crop_image(img, position):
    # Rectify a (possibly tilted) quadrilateral region of an OpenCV image into
    # an axis-aligned crop via a perspective transform.
    def distance(x1, y1, x2, y2):
        return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))

    position = position.tolist()

    # Order the four corners: sort by x, then fix the vertical order within
    # the left and right pairs.
    for i in range(4):
        for j in range(i + 1, 4):
            if position[i][0] > position[j][0]:
                position[i], position[j] = position[j], position[i]
    if position[0][1] > position[1][1]:
        position[0], position[1] = position[1], position[0]
    if position[2][1] > position[3][1]:
        position[2], position[3] = position[3], position[2]

    x1, y1 = position[0][0], position[0][1]  # top-left
    x2, y2 = position[2][0], position[2][1]  # top-right
    x3, y3 = position[3][0], position[3][1]  # bottom-right
    x4, y4 = position[1][0], position[1][1]  # bottom-left

    corners = np.zeros((4, 2), np.float32)
    corners[0] = [x1, y1]
    corners[1] = [x2, y2]
    corners[2] = [x4, y4]
    corners[3] = [x3, y3]

    # Target size: the average side lengths of the quadrilateral.
    img_width = distance((x1 + x4) / 2, (y1 + y4) / 2, (x2 + x3) / 2, (y2 + y3) / 2)
    img_height = distance((x1 + x2) / 2, (y1 + y2) / 2, (x4 + x3) / 2, (y4 + y3) / 2)

    corners_trans = np.zeros((4, 2), np.float32)
    corners_trans[0] = [0, 0]
    corners_trans[1] = [img_width - 1, 0]
    corners_trans[2] = [0, img_height - 1]
    corners_trans[3] = [img_width - 1, img_height - 1]

    transform = cv2.getPerspectiveTransform(corners, corners_trans)
    dst = cv2.warpPerspective(img, transform, (int(img_width), int(img_height)))
    return dst
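For orientation, a usage sketch of crop_image; the screenshot path and corner coordinates here are illustrative, not from the repository.

# Rectify a slightly tilted OCR region from a screenshot loaded with OpenCV.
img = cv2.imread("screenshot.jpg")
quad = np.array([[30, 40], [220, 50], [235, 120], [25, 110]], dtype=np.float32)
patch = crop_image(img, quad)
cv2.imwrite("patch.jpg", patch)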
def calculate_size(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def calculate_iou(box1, box2):
    # Note: redefines the earlier calculate_iou (same continuous-coordinate
    # convention, no +1 terms) if both files are loaded into one module.
    xA = max(box1[0], box2[0])
    yA = max(box1[1], box2[1])
    xB = min(box1[2], box2[2])
    yB = min(box1[3], box2[3])
    interArea = max(0, xB - xA) * max(0, yB - yA)
    box1Area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2Area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    unionArea = box1Area + box2Area - interArea
    iou = interArea / unionArea
    return iou
def crop(image, box, i, text_data=None):
    # Crop a region from a screenshot and save it under ./temp (the directory
    # must exist). If text_data is given, first draw a red highlight box.
    image = Image.open(image)
    if text_data:
        draw = ImageDraw.Draw(image)
        draw.rectangle(((text_data[0], text_data[1]), (text_data[2], text_data[3])),
                       outline="red", width=5)
    cropped_image = image.crop(box)
    cropped_image.save(f"./temp/{i}.jpg")


def in_box(box, target):
    # True if box lies strictly inside target.
    return (box[0] > target[0]) and (box[1] > target[1]) and (box[2] < target[2]) and (box[3] < target[3])
def crop_for_clip(image, box, i, position):
    # Save the crop only if it falls inside the named region of the screen
    # (e.g. "left", "top right"); returns whether the crop was kept.
    image = Image.open(image)
    w, h = image.size
    if position == "left":
        bound = [0, 0, w/2, h]
    elif position == "right":
        bound = [w/2, 0, w, h]
    elif position == "top":
        bound = [0, 0, w, h/2]
    elif position == "bottom":
        bound = [0, h/2, w, h]
    elif position == "top left":
        bound = [0, 0, w/2, h/2]
    elif position == "top right":
        bound = [w/2, 0, w, h/2]
    elif position == "bottom left":
        bound = [0, h/2, w/2, h]
    elif position == "bottom right":
        bound = [w/2, h/2, w, h]
    else:
        bound = [0, 0, w, h]

    if in_box(box, bound):
        cropped_image = image.crop(box)
        cropped_image.save(f"./temp/{i}.jpg")
        return True
    else:
        return False
def clip_for_icon(clip_model, clip_preprocess, images, prompt):
    # Score each candidate icon crop against a text prompt with CLIP and
    # return the index of the best match.
    image_features = []
    for image_file in images:
        image = clip_preprocess(Image.open(image_file)).unsqueeze(0).to(next(clip_model.parameters()).device)
        image_feature = clip_model.encode_image(image)
        image_features.append(image_feature)
    image_features = torch.cat(image_features)

    text = clip.tokenize([prompt]).to(next(clip_model.parameters()).device)
    text_features = clip_model.encode_text(text)

    # Normalize, then softmax the image-text similarities over the candidates.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=0).squeeze(0)

    _, max_pos = torch.max(similarity, dim=0)
    return max_pos.item()
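Putting the last two helpers together, a minimal end-to-end sketch ("ViT-B/32" is a standard CLIP checkpoint name; the screenshot path, candidate boxes, and prompt are illustrative, and ./temp must already exist):

# Crop candidate icon boxes from the left half of a screenshot, then let CLIP
# pick the crop that best matches a text description.
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

candidate_boxes = [[40, 100, 120, 180], [50, 400, 130, 480]]  # hypothetical
crops = []
for i, box in enumerate(candidate_boxes):
    if crop_for_clip("screenshot.jpg", box, i, "left"):
        crops.append(f"./temp/{i}.jpg")

with torch.no_grad():  # inference only, no gradients needed
    best = clip_for_icon(clip_model, clip_preprocess, crops, "a settings gear icon")
print("Best-matching crop:", crops[best])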
References:
1. Mobile-Agent code: https://github.com/X-PLUG/MobileAgent/tree/main/PC-Agent
2. Mobile-Agent paper: https://arxiv.org/pdf/2401.16158v1