A Roundup of Nearly 300 Robot Manipulation Papers: Tasks, Methods, and Applications from Grasping to Complex Manipulation

Source: 具身智能之心

Robot manipulation is a key area of robotics concerned with how robots interact with and operate on objects in the physical environment. The goal is to let robots autonomously perceive, plan, and execute complex tasks such as grasping, moving, rotating, and finely manipulating objects. Manipulation techniques are widely used in industrial automation, surgical robotics, household assistance, and logistics, and form the technical foundation that lets robots adapt to and complete diverse tasks.
This project collects the key research papers in robot manipulation, covering tasks, methods, and applications ranging from grasping to complex manipulation, and tracks recent progress in representation learning, reinforcement learning, multimodal learning, 3D representations, and more, so that researchers and practitioners in the field can study them in one place.
We recently collected and organized 300+ papers on robotics and manipulation and published the list on GitHub: https://github.com/BaiShuanghao/Awesome-Robotics-Manipulation

Grasp

1) Rectangle-based Grasp

  • Title: HMT-Grasp: A Hybrid Mamba-Transformer Approach for Robot Grasping in Cluttered Environments|https://arxiv.org/abs/2410.03522
  • Title: Lightweight Language-driven Grasp Detection using Conditional Consistency Model|https://arxiv.org/abs/2407.17967
  • Title: grasp_det_seg_cnn: End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB|https://arxiv.org/abs/2107.05287
  • Title: GR-ConvNet: Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network|https://arxiv.org/abs/1909.04810
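
Most rectangle-based methods above reduce grasp detection to predicting an oriented rectangle in the image. As a concrete illustration, here is a minimal sketch, not taken from any of the papers above, of the common 5-parameter rectangle (center x, y, rotation theta, gripper opening w, jaw height h) and its conversion to corner points; all names are illustrative:

```python
# Minimal sketch of the 5-parameter grasp rectangle (x, y, theta, w, h).
# Illustrative only; not code from the papers listed above.
import numpy as np

def rectangle_corners(x, y, theta, w, h):
    """Return the 4 image-space corners of a grasp rectangle."""
    u = np.array([np.cos(theta), np.sin(theta)])   # along gripper opening
    v = np.array([-np.sin(theta), np.cos(theta)])  # along the jaws
    c = np.array([x, y])
    return np.stack([
        c + (w / 2) * u + (h / 2) * v,
        c + (w / 2) * u - (h / 2) * v,
        c - (w / 2) * u - (h / 2) * v,
        c - (w / 2) * u + (h / 2) * v,
    ])

# Example: a 60-pixel-wide grasp rotated 30 degrees about pixel (120, 80).
print(rectangle_corners(120, 80, np.deg2rad(30), w=60, h=20))
```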

2) 6-DoF Grasp

  • Title: Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection|https://arxiv.org/abs/2410.06521
  • Title: OrbitGrasp: SE(3)-Equivariant Grasp Learning|https://arxiv.org/abs/2407.03531
  • Title: EquiGraspFlow: SE(3)-Equivariant 6-DoF Grasp Pose Generative Flows|https://openreview.net/pdf?id=5lSkn5v4LK
  • Title: An Economic Framework for 6-DoF Grasp Detection|https://arxiv.org/abs/2407.08366
  • Title: Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge|https://arxiv.org/abs/2404.01727
  • Title: Rethinking 6-Dof Grasp Detection: A Flexible Framework for High-Quality Grasping|https://arxiv.org/abs/2403.15054
  • Title: AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains|https://arxiv.org/abs/2212.08333
  • Title: GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping|https://openaccess.thecvf.com/content_CVPR_2020/papers/Fang_GraspNet-1Billion_A_Large-Scale_Benchmark_for_General_Object_Grasping_CVPR_2020_paper.pdf
  • Title: 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation|https://arxiv.org/abs/1905.10520
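
The 6-DoF works above output full gripper poses in SE(3) rather than image rectangles. Below is a minimal sketch of that representation, assuming (our convention, not the papers') that the gripper's local +z axis is the approach direction:

```python
# Minimal sketch: a 6-DoF grasp as a 4x4 homogeneous transform, plus a
# pre-grasp pose backed off along the approach axis. Illustrative only.
import numpy as np

def grasp_pose(rotation, translation):
    """Build a 4x4 transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def pregrasp(T_grasp, standoff=0.10):
    """Retreat along the gripper's local +z (assumed approach axis)."""
    offset = np.eye(4)
    offset[2, 3] = -standoff          # back off 10 cm in the gripper frame
    return T_grasp @ offset

T = grasp_pose(np.eye(3), np.array([0.4, 0.0, 0.2]))
print(pregrasp(T))                    # same orientation, shifted back 10 cm
```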

3) Grasp with 3D Techniques

  • Title: Implicit Grasp Diffusion: Bridging the Gap between Dense Prediction and Sampling-based Grasping|https://openreview.net/pdf?id=VUhlMfEekm

  • Title: Learning Any-View 6DoF Robotic Grasping in Cluttered Scenes via Neural Surface Rendering|https://arxiv.org/abs/2306.07392

  • Title: Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping|https://arxiv.org/abs/2309.07970

  • Title: GraspNeRF: Multiview-based 6-DoF Grasp Detection for Transparent and Specular Objects Using Generalizable NeRF|https://arxiv.org/abs/2210.06575

  • Title: GraspSplats: Efficient Manipulation with 3D Feature Splatting|https://arxiv.org/abs/2409.02084

  • Title: GaussianGrasper: 3D Language Gaussian Splatting for Open-vocabulary Robotic Grasping|https://arxiv.org/abs/2403.09637

4) Language-Driven Grasp

  • Title: RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment|https://arxiv.org/abs/2409.16033
  • Title: Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance|https://arxiv.org/abs/2407.13842
  • Title: Reasoning Grasping via Multimodal Large Language Model|https://arxiv.org/abs/2402.06798
  • Title: ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter|https://arxiv.org/abs/2407.11298
  • Title: Towards Open-World Grasping with Large Vision-Language Models|https://arxiv.org/abs/2406.18722
  • Title: Reasoning Tuning Grasp: Adapting Multi-Modal Large Language Models for Robotic Grasping|https://openreview.net/pdf?id=3mKb5iyZ2V
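
A pattern shared by these language-driven methods is to embed the instruction and each grasp candidate into a joint space and rank candidates by similarity. A minimal sketch, with random vectors standing in for a real vision-language encoder (every name below is illustrative):

```python
# Minimal sketch of language-conditioned grasp ranking by cosine
# similarity. Random embeddings stand in for a real VLM encoder.
import numpy as np

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)            # embedding of the instruction
cand_embs = rng.normal(size=(10, 512))     # embeddings of 10 grasp crops

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(text_emb, c) for c in cand_embs]
print("best candidate:", int(np.argmax(scores)))
```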

5) Grasp for Transparent Objects

  • Title: T2SQNet: A Recognition Model for Manipulating Partially Observed Transparent Tableware Objects|https://openreview.net/pdf?id=M0JtsLuhEE
  • Title: ASGrasp: Generalizable Transparent Object Reconstruction and Grasping from RGB-D Active Stereo Camera|https://arxiv.org/abs/2405.05648
  • Title: Dex-NeRF: Using a Neural Radiance Field to Grasp Transparent Objects|https://arxiv.org/abs/2110.14217

Manipulation

1) Representation Learning with Auxiliary Tasks

  • Title: Contrastive Imitation Learning for Language-guided Multi-Task Robotic Manipulation|https://arxiv.org/abs/2406.09738

  • Title: Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers|https://arxiv.org/abs/2403.12943

  • Title: R3M: A Universal Visual Representation for Robot Manipulation|https://arxiv.org/abs/2203.12601

  • Title: HULC: What Matters in Language Conditioned Robotic Imitation Learning over Unstructured Data|https://arxiv.org/abs/2204.06252

  • Title: BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning|https://arxiv.org/abs/2202.02005

  • Title: Spatiotemporal Predictive Pre-training for Robotic Motor Control|https://arxiv.org/abs/2403.05304

  • Title: MUTEX: Learning Unified Policies from Multimodal Task Specifications|https://arxiv.org/abs/2309.14320

  • Title: Language-Driven Representation Learning for Robotics|https://arxiv.org/abs/2302.12766

  • Title: Real-World Robot Learning with Masked Visual Pre-training|https://arxiv.org/abs/2210.03109

  • Title: RACER: Rich Language-Guided Failure Recovery Policies for Imitation Learning|https://arxiv.org/abs/2409.14674

  • Title: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought|https://arxiv.org/abs/2305.15021

  • Title: Chain-of-Thought Predictive Control|https://arxiv.org/abs/2304.00776

  • Title: VIRT: Vision Instructed Transformer for Robotic Manipulation|https://arxiv.org/abs/2410.07169

  • Title: KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance|https://www.arxiv.org/abs/2408.02912

  • Title: GENIMA: Generative Image as Action Models|https://arxiv.org/abs/2407.07875

  • Title: ATM: Any-point Trajectory Modeling for Policy Learning|https://arxiv.org/abs/2401.00025

  • Title: Learning Manipulation by Predicting Interaction|https://www.arxiv.org/abs/2406.00439

  • Title: Object-Centric Instruction Augmentation for Robotic Manipulation|https://arxiv.org/abs/2401.02814

  • Title: Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans|https://arxiv.org/abs/2312.00775

  • Title: CALAMARI: Contact-Aware and Language conditioned spatial Action MApping for contact-RIch manipulation|https://openreview.net/pdf?id=Nii0_rRJwN

  • Title: GHIL-Glue: Hierarchical Control with Filtered Subgoal Images|https://arxiv.org/abs/2410.20018

  • Title: FoAM: Foresight-Augmented Multi-Task Imitation Policy for Robotic Manipulation|https://arxiv.org/abs/2409.19528

  • Title: VideoAgent: Self-Improving Video Generation|https://arxiv.org/abs/2410.10076

  • Title: GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy|https://arxiv.org/abs/2408.14368

  • Title: GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation|https://arxiv.org/abs/2410.06158

  • Title: VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation|https://arxiv.org/abs/2407.09829

  • Title: GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation|https://arxiv.org/abs/2312.13139

  • Title: SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models|https://arxiv.org/abs/2310.10639

  • Title: VLP: Video Language Planning|https://arxiv.org/abs/2310.10625

2) Visual Representation Learning

  • Title: Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets|https://arxiv.org/abs/2410.22325
  • Title: Theia: Distilling Diverse Vision Foundation Models for Robot Learning|https://arxiv.org/abs/2407.20179
  • Title: Learning Manipulation by Predicting Interaction|https://www.arxiv.org/abs/2406.00439
  • Title: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware|https://arxiv.org/abs/2304.13705
  • Title: Language-Driven Representation Learning for Robotics|https://arxiv.org/abs/2302.12766
  • Title: VIMA: General Robot Manipulation with Multimodal Prompts|https://arxiv.org/abs/2210.03094
  • Title: Real-World Robot Learning with Masked Visual Pre-training|https://arxiv.org/abs/2210.03109
  • Title: R3M: A Universal Visual Representation for Robot Manipulation|https://arxiv.org/abs/2203.12601
  • Title: LIV: Language-Image Representations and Rewards for Robotic Control|https://arxiv.org/abs/2306.00958
  • Title: VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training|https://arxiv.org/abs/2210.00030
  • Title: Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation?|https://arxiv.org/abs/2204.11134

3) Multimodal Representation Learning

  • Title: Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation|https://arxiv.org/abs/2408.01366
  • Title: MUTEX: Learning Unified Policies from Multimodal Task Specifications|https://arxiv.org/abs/2309.14320

4) Latent Action Learning

  • Title: Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation|https://arxiv.org/abs/2409.18707
  • Title: IGOR: Image-GOal Representations Atomic Control Units for Foundation Models in Embodied AI|https://www.microsoft.com/en-us/research/uploads/prod/2024/10/Project_IGOR_for_arXiv.pdf
  • Title: Latent Action Pretraining from Videos|https://arxiv.org/abs/2410.11758
  • Title: Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control|https://arxiv.org/abs/2307.00117
  • Title: MimicPlay: Long-Horizon Imitation Learning by Watching Human Play|https://arxiv.org/abs/2302.12422
  • Title: Imitation Learning with Limited Actions via Diffusion Planners and Deep Koopman Controllers|https://arxiv.org/abs/2410.07584
  • Title: Learning to Act without Actions|https://arxiv.org/abs/2312.10812
  • Title: Imitating Latent Policies from Observation|https://arxiv.org/abs/1805.07914

5) World Model

  • Title: MOTO: Offline Pre-training to Online Fine-tuning for Model-based Robot Learning|https://arxiv.org/abs/2401.03306
  • Title: Finetuning Offline World Models in the Real World|https://arxiv.org/abs/2310.16029
  • Title: Surfer: Progressive Reasoning with World Models for Robotic Manipulation|https://arxiv.org/abs/2306.11335

6) Asynchronous Action Learning

  • Title: PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation|https://arxiv.org/abs/2410.10394
  • Title: HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers|https://arxiv.org/abs/2410.05273
  • Title: MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models|https://arxiv.org/abs/2401.14502

7) Diffusion Policy Learning

  • Title: Diffusion Transformer Policy|https://arxiv.org/abs/2410.15959
  • Title: SDP: Spiking Diffusion Policy for Robotic Manipulation with Learnable Channel-Wise Membrane Thresholds|https://arxiv.org/abs/2409.11195
  • Title: The Ingredients for Robotic Diffusion Transformers|https://arxiv.org/abs/2410.10088
  • Title: GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy|https://arxiv.org/abs/2410.17488
  • Title: EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning|https://arxiv.org/abs/2407.01479
  • Title: Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning|https://arxiv.org/abs/2407.01531
  • Title: MDT: Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals|https://arxiv.org/abs/2407.05996
  • Title: Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning|https://arxiv.org/abs/2405.18196
  • Title: DP3: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations|https://arxiv.org/abs/2403.03954
  • Title: PlayFusion: Skill Acquisition via Diffusion from Language-Annotated Play|https://arxiv.org/abs/2312.04549
  • Title: Equivariant Diffusion Policy|https://arxiv.org/abs/2407.01812
  • Title: StructDiffusion: Language-Guided Creation of Physically-Valid Structures using Unseen Objects|https://arxiv.org/abs/2211.04604
  • Title: Goal-Conditioned Imitation Learning using Score-based Diffusion Policies|https://arxiv.org/abs/2304.02532
  • Title: Diffusion Policy: Visuomotor Policy Learning via Action Diffusion|https://arxiv.org/abs/2303.04137
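
The diffusion policies above share one core recipe: treat a short chunk of future actions as the sample to denoise, conditioned on the observation. A minimal sketch of the DDPM-style training objective, with a toy MLP standing in for the papers' actual networks; all sizes and names here are illustrative:

```python
# Minimal sketch of a DDPM-style objective on action chunks, in the
# spirit of Diffusion Policy. The MLP, sizes, and names are toy stand-ins.
import torch
import torch.nn as nn

T_STEPS, ACT_DIM, OBS_DIM, HORIZON = 100, 7, 32, 16

betas = torch.linspace(1e-4, 0.02, T_STEPS)          # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# eps_theta(noisy_action_chunk, t, obs) -> predicted noise
net = nn.Sequential(
    nn.Linear(HORIZON * ACT_DIM + 1 + OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, HORIZON * ACT_DIM),
)

def diffusion_loss(obs, actions):
    """One training step: corrupt the action chunk, regress the noise."""
    b = actions.shape[0]
    t = torch.randint(0, T_STEPS, (b,))
    eps = torch.randn_like(actions)
    ab = alphas_bar[t].view(b, 1, 1)
    noisy = ab.sqrt() * actions + (1 - ab).sqrt() * eps
    inp = torch.cat(
        [noisy.flatten(1), t.float().view(b, 1) / T_STEPS, obs], dim=1)
    return ((net(inp).view_as(actions) - eps) ** 2).mean()

obs = torch.randn(8, OBS_DIM)                  # dummy observation features
actions = torch.randn(8, HORIZON, ACT_DIM)     # dummy demo action chunks
print(diffusion_loss(obs, actions).item())
```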

8) Other Policies

  • Title: Autoregressive Action Sequence Learning for Robotic Manipulation|https://arxiv.org/abs/2410.03132
  • Title: MaIL: Improving Imitation Learning with Selective State Space Models|https://arxiv.org/abs/2406.08234

9) Vision-Language-Action Models

  • Title: Run-time Observation Interventions Make Vision-Language-Action Models More Visually Robust|https://arxiv.org/abs/2410.01971
  • Title: TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation|https://arxiv.org/abs/2409.12514
  • Title: RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation|https://arxiv.org/abs/2406.04339
  • Title: A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM|https://arxiv.org/abs/2410.15549
  • Title: OpenVLA: An Open-Source Vision-Language-Action Model|https://arxiv.org/abs/2406.09246
  • Title: LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning|https://arxiv.org/abs/2406.11815
  • Title: Robotic Control via Embodied Chain-of-Thought Reasoning|https://arxiv.org/abs/2407.08693
  • Title: 3D-VLA: A 3D Vision-Language-Action Generative World Model|https://arxiv.org/abs/2403.09631
  • Title: Octo: An Open-Source Generalist Robot Policy|https://arxiv.org/abs/2405.12213
  • Title: RoboFlamingo: Vision-Language Foundation Models as Effective Robot Imitators|https://arxiv.org/abs/2311.01378
  • Title: RT-H: Action Hierarchies Using Language|https://arxiv.org/abs/2403.01823
  • Title: Open X-Embodiment: Robotic Learning Datasets and RT-X Models|https://arxiv.org/abs/2310.08864
  • Title: MOO: Open-World Object Manipulation using Pre-trained Vision-Language Models|https://arxiv.org/abs/2303.00905
  • Title: RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control|https://arxiv.org/abs/2307.15818
  • Title: RT-1: Robotics Transformer for Real-World Control at Scale|https://arxiv.org/abs/2212.06817
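
A recurring mechanism in the RT-1/RT-2/OpenVLA line is discretizing each continuous action dimension into 256 bins so that actions become tokens a language model can emit. A minimal sketch of that round-trip; the bounds and names are illustrative:

```python
# Minimal sketch of RT-style action tokenization: each continuous action
# dimension is binned into 256 integer tokens, then decoded back.
import numpy as np

N_BINS = 256

def actions_to_tokens(action, low, high):
    """Map continuous action dims to integer bins in [0, N_BINS-1]."""
    norm = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((norm * N_BINS).astype(int), N_BINS - 1)

def tokens_to_actions(tokens, low, high):
    """Invert: take each bin centre back to the continuous range."""
    return low + (tokens + 0.5) / N_BINS * (high - low)

low, high = np.full(7, -1.0), np.full(7, 1.0)   # 7-DoF action bounds
a = np.array([0.1, -0.5, 0.9, 0.0, 0.3, -0.9, 1.0])
t = actions_to_tokens(a, low, high)
print(t, tokens_to_actions(t, low, high))       # round-trips to ~bin width
```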

10) Reinforcement Learning

  • Title: Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning|https://arxiv.org/abs/2410.21845
  • Title: PointPatchRL -- Masked Reconstruction Improves Reinforcement Learning on Point Clouds|https://arxiv.org/abs/2410.18800
  • Title: SPIRE: Synergistic Planning, Imitation, and Reinforcement for Long-Horizon Manipulation|https://arxiv.org/abs/2410.18065
  • Title: Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning|https://arxiv.org/abs/2407.15815
  • Title: Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks|https://arxiv.org/abs/2405.01534
  • Title: Expansive Latent Planning for Sparse Reward Offline Reinforcement Learning|https://openreview.net/pdf?id=xQx1O7WXSA
  • Title: Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions|https://arxiv.org/abs/2309.10150
  • Title: Sim2Real Transfer for Reinforcement Learning without Dynamics Randomization|https://arxiv.org/abs/2002.11635
  • Title: Pre-Training for Robots: Offline RL Enables Learning New Tasks from a Handful of Trials|https://arxiv.org/abs/2210.05178
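
Whether online or offline, most of the RL methods above bottom out in the same one-step bootstrapped target. A purely illustrative sketch:

```python
# Minimal sketch of the temporal-difference target common to
# Q-learning-style methods: r + gamma * max_a' Q(s', a').
import numpy as np

def td_target(reward, q_next, done, gamma=0.99):
    """Bootstrapped one-step target; done=1 cuts the bootstrap."""
    return reward + gamma * (1.0 - done) * np.max(q_next)

print(td_target(reward=1.0, q_next=np.array([0.2, 0.7, 0.4]), done=0.0))
```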

11) Motion, Trajectory and Flow

  • Title: Language-Conditioned Path Planning|https://arxiv.org/abs/2308.16893

  • Title: DiffusionSeeder: Seeding Motion Optimization with Diffusion for Rapid Motion Planning|https://arxiv.org/abs/2410.16727

  • Title: ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation|https://arxiv.org/abs/2409.01652

  • Title: CoPa: General Robotic Manipulation through Spatial Constraints of Parts with Foundation Models|https://arxiv.org/abs/2403.08248

  • Title: Task Generalization with Stability Guarantees via Elastic Dynamical System Motion Policies|https://arxiv.org/abs/2309.01884

  • Title: ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs|https://arxiv.org/abs/2405.20321

  • Title: Learning Robotic Manipulation Policies from Point Clouds with Conditional Flow Matching|https://arxiv.org/abs/2409.07343

  • Title: RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation|https://arxiv.org/abs/2308.15975

  • Title: VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models|https://arxiv.org/abs/2307.05973

  • Title: LATTE: LAnguage Trajectory TransformEr|https://arxiv.org/abs/2208.02918

  • Title: Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation|https://arxiv.org/abs/2405.01527

  • Title: Any-point Trajectory Modeling for Policy Learning|https://arxiv.org/abs/2401.00025

  • Title: Waypoint-Based Imitation Learning for Robotic Manipulation|https://arxiv.org/abs/2307.14326

  • Title: Flow as the Cross-Domain Manipulation Interface|https://www.arxiv.org/abs/2407.15208

  • Title: Learning to Act from Actionless Videos through Dense Correspondences|https://arxiv.org/abs/2310.08576

12) Data Collection, Selection and Augmentation

  • Title: SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment|https://arxiv.org/abs/2410.18907

  • Title: Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models|https://arxiv.org/abs/2410.17772

  • Title: Autonomous Improvement of Instruction Following Skills via Foundation Models|https://arxiv.org/abs/2407.20635

  • Title: Manipulate-Anything: Automating Real-World Robots using Vision-Language Models|https://arxiv.org/abs/2406.18915

  • Title: DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation|https://arxiv.org/abs/2403.07788

  • Title: SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling|https://arxiv.org/abs/2306.11886

  • Title: Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition|https://arxiv.org/abs/2307.14535

  • Title: Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models|https://arxiv.org/abs/2211.11736

  • Title: RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation|https://arxiv.org/abs/2306.11706

  • Title: Active Fine-Tuning of Generalist Policies|https://arxiv.org/abs/2410.05026

  • Title: Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning|https://arxiv.org/abs/2408.14037

  • Title: An Unbiased Look at Datasets for Visuo-Motor Pre-Training|https://arxiv.org/abs/2310.09289

  • Title: Retrieval-Augmented Embodied Agents|https://arxiv.org/abs/2404.11699

  • Title: Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets|https://arxiv.org/abs/2304.08742

  • Title: RoVi-Aug: Robot and Viewpoint Augmentation for Cross-Embodiment Robot Learning|https://arxiv.org/abs/2409.03403

  • Title: Diffusion Augmented Agents: A Framework for Efficient Exploration and Transfer Learning|https://arxiv.org/abs/2407.20798

  • Title: Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning|https://arxiv.org/abs/2402.17768

  • Title: GenAug: Retargeting behaviors to unseen situations via Generative Augmentation|https://arxiv.org/abs/2302.06671

  • Title: Contrast Sets for Evaluating Language-Guided Robot Policies|https://arxiv.org/abs/2406.13636

13) Affordance Learning

  • Title: UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models|https://arxiv.org/abs/2409.20551

  • Title: A3VLM: Actionable Articulation-Aware Vision Language Model|https://arxiv.org/abs/2406.07549

  • Title: AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation|https://arxiv.org/abs/2406.11548

  • Title: SAGE: Bridging Semantic and Actionable Parts for Generalizable Manipulation of Articulated Objects|https://arxiv.org/abs/2312.01307

  • Title: Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs|https://arxiv.org/abs/2311.02847

  • Title: Ditto: Building Digital Twins of Articulated Objects from Interaction|https://arxiv.org/abs/2202.08227

  • Title: Language-Conditioned Affordance-Pose Detection in 3D Point Clouds|https://arxiv.org/abs/2309.10911

  • Title: Composable Part-Based Manipulation|https://arxiv.org/abs/2405.05876

  • Title: PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations|https://arxiv.org/abs/2303.16958

  • Title: GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts|https://arxiv.org/abs/2211.05272

  • Title: SpatialBot: Precise Spatial Understanding with Vision Language Models|https://arxiv.org/abs/2406.13642

  • Title: RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics|https://arxiv.org/abs/2406.10721

  • Title: SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities|https://arxiv.org/abs/2401.12168

  • Title: RAM: Retrieval-Based Affordance Transfer for Generalizable Zero-Shot Robotic Manipulation|https://arxiv.org/abs/2407.04689

  • Title: MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting|https://arxiv.org/abs/2403.03174

  • Title: SLAP: Spatial-Language Attention Policies|https://arxiv.org/abs/2304.11235

  • Title: KITE: Keypoint-Conditioned Policies for Semantic Manipulation|https://arxiv.org/abs/2306.16605

  • Title: HULC++: Grounding Language with Visual Affordances over Unstructured Data|https://arxiv.org/abs/2210.01911

  • Title: CLIPort: What and Where Pathways for Robotic Manipulation|https://arxiv.org/abs/2109.12098

  • Title: Affordance Learning from Play for Sample-Efficient Policy Learning|https://arxiv.org/abs/2203.00352

  • Title: Transporter Networks: Rearranging the Visual World for Robotic Manipulation|https://arxiv.org/abs/2010.14406
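
CLIPort- and Transporter-style methods in this list ultimately act through dense per-pixel affordance maps: score every pixel, then act at the argmax. A minimal sketch, with a random heatmap standing in for a learned model:

```python
# Minimal sketch of acting from a per-pixel affordance heatmap.
# The random scores stand in for a real affordance network's output.
import numpy as np

heatmap = np.random.rand(224, 224)           # per-pixel affordance scores
v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
print(f"act at pixel (u={u}, v={v}), score={heatmap[v, u]:.3f}")
```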

14) 3D Representation for Manipulation

  • Title: MSGField: A Unified Scene Representation Integrating Motion, Semantics, and Geometry for Robotic Manipulation|https://arxiv.org/abs/2410.15730
  • Title: Splat-MOVER: Multi-Stage, Open-Vocabulary Robotic Manipulation via Editable Gaussian Splatting|https://arxiv.org/abs/2405.04378
  • Title: IMAGINATION POLICY: Using Generative Point Cloud Models for Learning Manipulation Policies|https://arxiv.org/abs/2406.11740
  • Title: Physically Embodied Gaussian Splatting: A Realtime Correctable World Model for Robotics|https://arxiv.org/abs/2406.10788
  • Title: RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation|https://arxiv.org/abs/2403.19460
  • Title: RoboEXP: Action-Conditioned Scene Graph via Interactive Exploration for Robotic Manipulation|https://arxiv.org/abs/2402.15487
  • Title: D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement|https://arxiv.org/abs/2309.16118
  • Title: Object-Aware Gaussian Splatting for Robotic Manipulation|https://openreview.net/pdf?id=gdRI43hDgo
  • Title: Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation|https://arxiv.org/abs/2308.07931
  • Title: Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation|https://arxiv.org/abs/2112.05124
  • Title: SE(3)-Equivariant Relational Rearrangement with Neural Descriptor Fields|https://arxiv.org/abs/2211.09786

15) 3D Representation Policy Learning

  • Title: GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation|https://arxiv.org/abs/2409.20154

  • Title: 3D Diffuser Actor: Policy Diffusion with 3D Scene Representations|https://arxiv.org/abs/2402.10885

  • Title: DP3: 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations|https://arxiv.org/abs/2403.03954

  • Title: ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation|https://arxiv.org/abs/2403.08321

  • Title: SGRv2: Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation|https://arxiv.org/abs/2406.10615

  • Title: GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields|https://arxiv.org/abs/2308.16891

  • Title: Visual Reinforcement Learning with Self-Supervised 3D Representations|https://arxiv.org/abs/2210.07241

  • Title: PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation|https://arxiv.org/abs/2309.15596

  • Title: M2T2: Multi-Task Masked Transformer for Object-centric Pick and Place|https://arxiv.org/abs/2311.00926

  • Title: PerAct: Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation|https://arxiv.org/abs/2209.05451

  • Title: 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation|https://arxiv.org/abs/2406.18158

  • Title: Discovering Robotic Interaction Modes with Discrete Representation Learning|https://arxiv.org/abs/2410.20258

  • Title: SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation|https://arxiv.org/abs/2405.19586

  • Title: RVT: Robotic View Transformer for 3D Object Manipulation|https://arxiv.org/abs/2306.14896

  • Title: Learning Generalizable Manipulation Policies with Object-Centric 3D Representations|https://arxiv.org/abs/2310.14386

  • Title: SGR: A Universal Semantic-Geometric Representation for Robotic Manipulation|https://arxiv.org/abs/2306.10474

16) Reasoning, Planning and Code Generation

  • Title: AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation|https://arxiv.org/abs/2410.00371

  • Title: REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction|https://arxiv.org/abs/2306.15724

  • Title: Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models|https://arxiv.org/abs/2408.07975

  • Title: Physically Grounded Vision-Language Models for Robotic Manipulation|https://arxiv.org/abs/2309.02561

  • Title: Socratic Planner: Inquiry-Based Zero-Shot Planning for Embodied Instruction Following|https://arxiv.org/abs/2404.15190

  • Title: SayCan: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances|https://arxiv.org/abs/2204.01691

  • Title: LLM+P: Empowering Large Language Models with Optimal Planning Proficiency|https://arxiv.org/abs/2304.11477

  • Title: Inner Monologue: Embodied Reasoning through Planning with Language Models|https://arxiv.org/abs/2207.05608

  • Title: Teaching Robots with Show and Tell: Using Foundation Models to Synthesize Robot Policies from Language and Visual Demonstrations|https://openreview.net/pdf?id=G8UcwxNAoD

  • Title: RoCo: Dialectic Multi-Robot Collaboration with Large Language Models|https://arxiv.org/abs/2307.04738

  • Title: Gesture-Informed Robot Assistance via Foundation Models|https://arxiv.org/abs/2309.02721

  • Title: Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model|https://arxiv.org/abs/2305.11176

  • Title: ProgPrompt: Generating Situated Robot Task Plans using Large Language Models|https://arxiv.org/abs/2209.11302

  • Title: ChatGPT for Robotics: Design Principles and Model Abilities|https://arxiv.org/abs/2306.17582

  • Title: Code as Policies: Language Model Programs for Embodied Control|https://arxiv.org/abs/2209.07753

  • Title: TidyBot: Personalized Robot Assistance with Large Language Models|https://arxiv.org/abs/2305.05658

  • Title: Statler: State-Maintaining Language Models for Embodied Reasoning|https://arxiv.org/abs/2306.17840

  • Title: InterPreT: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning|https://arxiv.org/abs/2405.19758

  • Title: Text2Motion: From Natural Language Instructions to Feasible Plans|https://arxiv.org/abs/2303.12153

  • Title: Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations|https://arxiv.org/abs/2410.00436

  • Title: EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought|https://arxiv.org/abs/2305.15021

  • Title: ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation|https://arxiv.org/abs/2312.16217

  • Title: Chat with the Environment: Interactive Multimodal Perception Using Large Language Models|https://arxiv.org/abs/2303.08268

  • Title: PaLM-E: An Embodied Multimodal Language Model|https://arxiv.org/abs/2303.03378

  • Title: Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language|https://arxiv.org/abs/2204.00598

17) Generalization

  • Title: Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting|https://arxiv.org/abs/2402.19249

  • Title: Policy Architectures for Compositional Generalization in Control|https://arxiv.org/abs/2203.05960

  • Title: Programmatically Grounded, Compositionally Generalizable Robotic Manipulation|https://arxiv.org/abs/2304.13826

  • Title: Efficient Data Collection for Robotic Manipulation via Compositional Generalization|https://arxiv.org/abs/2403.05110

  • Title: Natural Language Can Help Bridge the Sim2Real Gap|https://arxiv.org/abs/2405.10020

  • Title: Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation|https://arxiv.org/abs/2403.03949

  • Title: Local Policies Enable Zero-shot Long-horizon Manipulation|https://arxiv.org/abs/2410.22332

  • Title: A Backbone for Long-Horizon Robot Task Understanding|https://arxiv.org/abs/2408.01334

  • Title: STAP: Sequencing Task-Agnostic Policies|https://arxiv.org/abs/2210.12250

  • Title: BOSS: Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance|https://arxiv.org/abs/2310.10021

  • Title: Learning Compositional Behaviors from Demonstration and Language|https://openreview.net/pdf?id=fR1rCXjCQX

  • Title: Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation|https://arxiv.org/abs/2408.16228

18) Generalist

  • Title: Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation|https://arxiv.org/abs/2408.11812

  • Title: All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents|https://arxiv.org/abs/2408.10899

  • Title: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers|https://arxiv.org/abs/2409.20537

  • Title: An Embodied Generalist Agent in 3D World|https://arxiv.org/abs/2311.12871

  • Title: Towards Synergistic, Generalized, and Efficient Dual-System for Robotic Manipulation|https://arxiv.org/abs/2410.08001

  • Title: Effective Tuning Strategies for Generalist Robot Manipulation Policies|https://arxiv.org/abs/2410.01220

  • Title: Octo: An Open-Source Generalist Robot Policy|https://arxiv.org/abs/2405.12213

  • Title: Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance|https://arxiv.org/abs/2410.13816

  • Title: Open X-Embodiment: Robotic Learning Datasets and RT-X Models|https://arxiv.org/abs/2310.08864

  • Title: RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking|https://arxiv.org/abs/2309.01918

  • Title: Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning|https://arxiv.org/abs/2407.15815

  • Title: CAGE: Causal Attention Enables Data-Efficient Generalizable Robotic Manipulation|https://arxiv.org/abs/2410.14974

  • Title: Robot Utility Models: General Policies for Zero-Shot Deployment in New Environments|https://arxiv.org/abs/2409.05865

19) Human-Robot Interaction and Collaboration

  • Title: Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration|https://openreview.net/pdf?id=ypaYtV1CoG
  • Title: APRICOT: Active Preference Learning and Constraint-Aware Task Planning with LLMs|https://openreview.net/pdf?id=nQslM6f7dW
  • Title: Text2Interaction: Establishing Safe and Preferable Human-Robot Interaction|https://arxiv.org/abs/2408.06105
  • Title: KNOWNO: Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners|https://arxiv.org/abs/2307.01928
  • Title: Yell At Your Robot: Improving On-the-Fly from Language Corrections|https://arxiv.org/abs/2403.12910
  • Title: "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy|https://arxiv.org/abs/2301.02555

Humanoid

1) Dexterous Manipulation

  • Title: DexGraspNet: A Large-Scale Robotic Dexterous Grasp Dataset for General Objects Based on Simulation|https://arxiv.org/abs/2210.02697
  • Title: Demonstrating Learning from Humans on Open-Source Dexterous Robot Hands|https://www.roboticsproceedings.org/rss20/p014.pdf
  • Title: CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation|https://arxiv.org/abs/2402.14795
  • Title: Dexterous Functional Grasping|https://arxiv.org/abs/2312.02975
  • Title: DEFT: Dexterous Fine-Tuning for Real-World Hand Policies|https://arxiv.org/abs/2310.19797
  • Title: REBOOT: Reuse Data for Bootstrapping Efficient Real-World Dexterous Manipulation|https://arxiv.org/abs/2309.03322
  • Title: Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation|https://arxiv.org/abs/2309.00987
  • Title: AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System|https://arxiv.org/abs/2307.04577

2) Other Applications

  • Title: Leveraging Language for Accelerated Learning of Tool Manipulation|https://arxiv.org/abs/2206.13074

Awesome Benchmarks

1) Grasp Datasets

  • Title: QDGset: A Large Scale Grasping Dataset Generated with Quality-Diversity|https://arxiv.org/abs/2410.02319
  • Title: Real-to-Sim Grasp: Rethinking the Gap between Simulation and Real World in Grasp Detection|https://arxiv.org/abs/2410.06521
  • Title: Grasp-Anything-6D: Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance|https://arxiv.org/abs/2407.13842
  • Title: Grasp-Anything++: Language-driven Grasp Detection|https://arxiv.org/abs/2406.09489
  • Title: Grasp-Anything: Large-scale Grasp Dataset from Foundation Models|https://arxiv.org/abs/2309.09818
  • Title: GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping|https://openaccess.thecvf.com/content_CVPR_2020/papers/Fang_GraspNet-1Billion_A_Large-Scale_Benchmark_for_General_Object_Grasping_CVPR_2020_paper.pdf

2) Manipulation Benchmarks

  • Title: RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots|https://arxiv.org/abs/2406.02523

  • Title: ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes|https://arxiv.org/abs/2304.04321

  • Title: HomeRobot: Open-Vocabulary Mobile Manipulation|https://arxiv.org/abs/2306.11565

  • Title: ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks|https://arxiv.org/abs/1912.01734

  • Title: Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy|https://arxiv.org/abs/2410.01345

  • Title: THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation|https://arxiv.org/abs/2402.08191

  • Title: VIMA: General Robot Manipulation with Multimodal Prompts|https://arxiv.org/abs/2210.03094

  • Title: CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks|https://arxiv.org/abs/2112.03227

  • Title: RLBench: The Robot Learning Benchmark & Learning Environment|https://arxiv.org/abs/1909.12271

  • Title: Evaluating Real-World Robot Manipulation Policies in Simulation|https://arxiv.org/abs/2405.05941

  • Title: LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation|https://arxiv.org/abs/2410.05191

  • Title: ClutterGen: A Cluttered Scene Generator for Robot Learning|https://arxiv.org/abs/2407.05425

  • Title: Efficient Tactile Simulation with Differentiability for Robotic Manipulation|https://openreview.net/pdf?id=6BIffCl6gsM

  • Title: Open X-Embodiment: Robotic Learning Datasets and RT-X Models|https://arxiv.org/abs/2310.08864

  • Title: DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset|https://arxiv.org/abs/2403.12945

  • Title: BridgeData V2: A Dataset for Robot Learning at Scale|https://arxiv.org/abs/2308.12952

  • Title: ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models|https://arxiv.org/abs/2403.11289

  • Title: OpenEQA: Embodied Question Answering in the Era of Foundation Models|https://open-eqa.github.io/assets/pdfs/paper.pdf

3) Cross-Embodiment Benchmarks

  • Title: All Robots in One: A New Standard and Unified Dataset for Versatile, General-Purpose Embodied Agents|https://arxiv.org/abs/2408.10899
  • Title: Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?|https://arxiv.org/abs/2303.18240

Awesome Techniques

  • Title: Awesome-Implicit-NeRF-Robotics: Neural Fields in Robotics: A Survey|https://arxiv.org/abs/2410.20220
  • Title: Awesome-Video-Robotic-Papers
  • Title: Awesome-Generalist-Robots-via-Foundation-Models: Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis|https://arxiv.org/abs/2312.08782
  • Title: Awesome-Robotics-3D
  • Title: Awesome-Robotics-Foundation-Models: Foundation Models in Robotics: Applications, Challenges, and the Future|https://arxiv.org/abs/2312.07843
  • Title: Awesome-LLM-Robotics

Vision-Language Models

3D

  • Title: Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding|https://arxiv.org/abs/2408.13024

  • Title: Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model|https://arxiv.org/abs/2404.14966

  • Title: PointMamba: A Simple State Space Model for Point Cloud Analysis|https://arxiv.org/abs/2402.10739

  • Title: Point Transformer V3: Simpler, Faster, Stronger|https://arxiv.org/abs/2312.10035

  • Title: Point Transformer V2: Grouped Vector Attention and Partition-based Pooling|https://arxiv.org/abs/2210.05666

  • Title: Point Transformer|https://arxiv.org/abs/2012.09164

  • Title: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space|https://arxiv.org/abs/1706.02413

  • Title: PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation|https://arxiv.org/abs/1612.00593

  • Title: LERF: Language Embedded Radiance Fields|https://arxiv.org/abs/2303.09553

  • Title: 3D Gaussian Splatting for Real-Time Radiance Field Rendering|https://arxiv.org/abs/2308.04079

  • Title: LangSplat: 3D Language Gaussian Splatting|https://arxiv.org/abs/2312.16084
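
Many of the point-cloud backbones above descend from PointNet's core trick: per-point MLP features pooled with a permutation-invariant max into a global descriptor. A minimal sketch with toy sizes and illustrative names:

```python
# Minimal sketch of the PointNet idea: per-point MLP features followed
# by an order-invariant max-pool into one global cloud descriptor.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, pts):                  # pts: (B, N, 3)
        f = self.mlp(pts)                    # per-point features (B, N, D)
        return f.max(dim=1).values           # permutation-invariant (B, D)

cloud = torch.randn(2, 1024, 3)              # two clouds of 1024 points
print(TinyPointNet()(cloud).shape)           # torch.Size([2, 128])
```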
