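Preface:
Author: 赵鉴
Original: https://zhuanlan.zhihu.com/p/2355110476
A while ago I attended RLChina and picked up the basics of large language models (LLMs). A persistent pain point in training LLMs is the lack of a reward model, and it occurred to me that decision-making tasks have reward functions in abundance, so why not have LLMs solve decision-making tasks? There is of course already a fair amount of work on LLMs for decision-making, but I found that most of it describes the state in text and asks the LLM to output macro or micro actions directly. That seems to demand too much of the model and leaves little room for further development. So I had a very straightforward idea: could we have the LLM write decision-tree code, and use that code to complete the decision task?
So I gathered some friends from the community and started experimenting on SMAC, the most widely used decision-making benchmark in MARL.
Speed matters in war: within a week we finished a first version, wrote a short paper, LLM-SMAC, and open-sourced our code. Citations are welcome.
Paper: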
https://arxiv.org/pdf/2410.1602...[1]
Code:
https://github.com/devindeng94...[2]
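Motivation:
I had actually thought about using an LLM to generate decision-tree code a long time ago. I have not read many LLM papers, so I do not know whether someone else has had a similar idea, but from my survey it does not seem to have been applied to StarCraft. So we believe we can still claim to be the first to propose this approach in the SMAC environment.
This approach has several advantages:
• 1. The generated policy is a white box and is interpretable. A neural network, by contrast, is a black box that needs post-processing such as decision-tree distillation to become one.
• 2. It needs very few environment interactions, unlike MARL-style algorithms, which need at least about 1e6 interactions before they learn a policy.
• 3. The policy transfers well and is robust. Because it is built on “knowledge”, it does not readily exploit bugs to win; it wins “fair and square”. As a result, on maps of the same type the decision-tree code keeps working without changing a single line.
The drawback is that you first have to stand on the shoulders of a giant: you need a good code-generation LLM. We tried a few code LLMs and found that the deepseek-V2.5 release performed best; we will test the other LLMs later.
Method:
This is the overall architecture diagram, and it is really quite simple. We walk through it using the 2s_vs_1sc task.
First, give the Planner a System Prompt that “drills” the output format into it: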
You are a StarCraft II player and a helpful assistant and you are facing the micro-management tasks.
I will describe the map and the units on the map. You should not build any new buildings or new units.
You may focus on micro-management tactics to win the combat.
You should provide me the tactics in the format below:
### Tactic 1: Tactic 1' name
**Condition to use:**
**Tactic Skeleton**
### Tactic 2: Tactic 2' name
**Condition to use:**
**Tactic Skeleton**
Meanwhile, I will tell you history taken tactics, you may add new tactics onto the history tactics by adding new tactic in the list or try a new tactic.
Then provide the map information:
The map is 2s_vs_1sc.
You can control 2 Stalker units individually and the enemy controls 1 Spine Crawler Structure.
The Stalker unit has 80 health, 80 shield, 1 defense, 6 attacking range, 4.13 speed, 13 damage with 9.7 DPS.
The Spine Crawler has 300 health, 0 shield, 2 defense, 7 attacking range, and 25 damage with 18.9 DPS.
All the units has no abilities such as blinking or equipments.
The map is a 32*32 sized square map.
The available area of x axis is from 4 to 23, and the y axis is from 7 to 30.
The enemy units are at (14, 21) point and your Stalker units are at (11, 9) and (17, 9) points initially.
There is no terrain advantages nor choke points in this map. You cannot get back to the enemy units.
You should provide me with at most 3 most important tactics and describe the chosen tactic skeleton in detail according to the situations of your unit and enemy units.
You should also indicate the condition to use this tactic. Make sure the conditions are not conflict with each other.
The Planner then replies with the skill it wants to use:
### Tactic 1: Hit and Run
**Condition to use:** When the enemy units are within the Stalkers' attack range and the Stalkers are not heavily damaged.
**Tactic Skeleton:**
1. Control both Stalkers to attack the Spine Crawler simultaneously.
2. After dealing damage, retreat slightly to avoid the Spine Crawler's counter-attack.
3. Repeat this process to slowly whittle down the Spine Crawler's health.
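The coder prompt expects the strategy as a JSON array with the keys 'tactic_name' and 'tactic_description', so the planner's markdown reply has to be converted somewhere along the way. A minimal sketch of that step, assuming the reply follows the heading format requested in the system prompt; `parse_tactics` is a hypothetical helper, not a function from the LLM-SMAC repository:

```python
import json
import re
from typing import Dict, List


def parse_tactics(planner_reply: str) -> List[Dict[str, str]]:
    """Split a planner reply into tactic_name / tactic_description records.

    Hypothetical helper: assumes every tactic starts with a heading such as
    "### Tactic 1: Hit and Run", as requested in the planner's system prompt.
    """
    heading = re.compile(r"^### Tactic \d+:\s*(.+)$", re.MULTILINE)
    matches = list(heading.finditer(planner_reply))
    tactics = []
    for i, match in enumerate(matches):
        # A tactic's body runs until the next heading or the end of the reply.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(planner_reply)
        tactics.append({
            "tactic_name": match.group(1).strip(),
            "tactic_description": planner_reply[match.end():end].strip(),
        })
    return tactics


reply = """### Tactic 1: Hit and Run
**Condition to use:** When the enemy units are within the Stalkers' attack range.
**Tactic Skeleton:**
1. Control both Stalkers to attack the Spine Crawler simultaneously.
2. After dealing damage, retreat slightly.
"""
tactics = parse_tactics(reply)
payload = json.dumps(tactics, indent=2)  # JSON array to paste into the coder prompt
```

Keeping the description as raw text avoids guessing at the internal structure of the skeleton; only the heading is parsed.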
Next, add the resulting skill to the prompt and ask the coder to generate the corresponding decision-tree code:
You are a StarCraft II player and a helpful assistant and you are facing the micro-management tasks.
I will describe the map and the units on the map. You should not build any new buildings or new units or expand your base.
You should concentrate on the micro-management strategy.
I will give you the strategy in JSON array format. The keys are 'tactic_name' and 'tactic_description'.
You should implement the strategy in python with burnysc2/pythonsc2 package.
You should concentrate on implementing the 'def async on_step(self, iteration: int):' function.
The result should be surrounded in the '```python' and '```' structure.
2024-10-17 10:15:58,142 DEBUG User Input
The map is 2s_vs_1sc.
You can control 2 Stalker units individually and the enemy controls 1 Spine Crawler Structure.
The Stalker unit has 80 health, 80 shield, 1 defense, 6 attacking range, 4.13 speed, 13 damage with 9.7 DPS.
The Spine Crawler has 300 health, 0 shield, 2 defense, 7 attacking range, and 25 damage with 18.9 DPS.
All the units has no abilities such as blinking or equipments.
The map is a 32*32 sized square map.
The available area of x axis is from 4 to 23, and the y axis is from 7 to 30.
The enemy units are at (14, 21) point and your Stalker units are at (11, 9) and (17, 9) points initially.
There is no terrain advantages nor choke points in this map. You cannot get back to the enemy units.
You should not use the await keyword. Make sure to check whether the list variables are empty or not.
The tactic is:
### Tactic 1: Hit and Run
**Condition to use:** When the enemy units are within the Stalkers' attack range and the Stalkers are not heavily damaged.
**Tactic Skeleton:**
1. Control both Stalkers to attack the Spine Crawler simultaneously.
2. After dealing damage, retreat slightly to avoid the Spine Crawler's counter-attack.
3. Repeat this process to slowly whittle down the Spine Crawler's health.
Please implement the code.
The coder then generates the decision-tree code:
```python
from sc2.bot_ai import BotAI
from sc2.ids.unit_typeid import UnitTypeId
from sc2.position import Point2
class HitAndRunBot(BotAI):
    async def on_step(self, iteration: int):
        stalkers = self.units(UnitTypeId.STALKER)
        spine_crawler = self.enemy_structures.first
        if not spine_crawler or not stalkers.exists:
            return
        for stalker in stalkers:
            # Check if the Stalker is in attack range of the Spine Crawler
            if stalker.distance_to(spine_crawler) <= stalker.ground_range + 1:
                # Attack the Spine Crawler
                stalker.attack(spine_crawler)
            else:
                # Move towards the Spine Crawler if not in attack range
                stalker.move(spine_crawler.position)
            # After attacking, retreat slightly to avoid counter-attack
            if stalker.weapon_cooldown > 0:
                retreat_position = stalker.position.towards(self.start_location, 3)
                stalker.move(retreat_position)
```
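Because the coder is told to wrap its answer in a python fence, the script has to be pulled out of the reply before it can be run. A minimal sketch of that extraction, assuming the reply contains a single fenced block; `extract_python_block` is a hypothetical helper, and the `compile` call is only an optional syntax check added here for illustration:

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence


def extract_python_block(reply: str) -> Optional[str]:
    """Return the body of the first python-fenced block in an LLM reply.

    Hypothetical helper; returns None when no fenced block is found.
    """
    pattern = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1).strip() if match else None


reply = "Here is the bot:\n" + FENCE + "python\nprint('hi')\n" + FENCE + "\nDone."
script = extract_python_block(reply)
compile(script, "<coder reply>", "exec")  # raises SyntaxError on malformed code
```

The non-greedy match also copes with a closing fence that sits on the same line as the last statement.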
We moved this code into py-sc2 and got the following result:
You Win 4 out of 10 times and lose 6 out of 10 times.
You achieve 227.5 scores, give 280.0 damages to the enemy, take 140.6625 damage on health, and take 170.7625 damage on shield on average.
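To decide automatically whether another refinement round is needed, the result text can be turned back into numbers. A small sketch, assuming the summary always uses the wording shown above; `parse_combat_result` and the keys of the returned dictionary are hypothetical names:

```python
import re
from typing import Dict


def parse_combat_result(text: str) -> Dict[str, float]:
    """Extract win rate, score and damage dealt from the evaluation summary.

    Hypothetical helper: the patterns mirror the phrasing of the sample
    result above and would need adjusting if the wording changes.
    """
    wins = re.search(r"Win (\d+) out of (\d+) times", text)
    score = re.search(r"achieve ([\d.]+) scores", text)
    damage = re.search(r"give ([\d.]+) damages", text)
    if not (wins and score and damage):
        raise ValueError("unrecognised result format")
    won, total = int(wins.group(1)), int(wins.group(2))
    return {
        "win_rate": won / total,
        "score": float(score.group(1)),
        "damage_dealt": float(damage.group(1)),
    }


summary = (
    "You Win 4 out of 10 times and lose 6 out of 10 times.\n"
    "You achieve 227.5 scores, give 280.0 damages to the enemy, "
    "take 140.6625 damage on health, and take 170.7625 damage on shield on average."
)
result = parse_combat_result(summary)
```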
So we then ask the critic LLM why the fight went badly:
You are a StarCraft II player and a helpful assistant and you are facing the micro-management tasks.
You are now working as a critic.
I will describe the map and the units on the map. You should not build any new buildings or new units.
After that, I will provide you the tactic and the python script which is the implementation of this tactic.
You should concentrate on the micro-management strategy to kill more enemy and preserve more units of yourself.
I will also provide you the result of the code, which might be the bug stacktrace or the combat results.
You should analyse why the code leads to the result and tell me the potential method to improve the performance based on the code.
You can suggest improve the current tactic or delete some tactic based on the current code.
You do not need to provide me the refinement code.
After that you should provide me a suggestion from ```[Change Tactic]``` or ```[Improve Tactic]```
The map is 2s_vs_1sc.
You can control 2 Stalker units individually and the enemy controls 1 Spine Crawler Structure.
The Stalker unit has 80 health, 80 shield, 1 defense, 6 attacking range, 4.13 speed, 13 damage with 9.7 DPS.
The Spine Crawler has 300 health, 0 shield, 2 defense, 7 attacking range, 0 speed, and 25 damage with 18.9 DPS.
All the units has no abilities such as blinking or equipments.
The map is a 32*32 sized square map.
The available area of x axis is from 4 to 23, and the y axis is from 7 to 30.
The enemy structure is at (14, 21) point and your Stalker units are at (11, 9) and (17, 9) points initially.
There is no terrain advantages nor choke points in this map. You cannot get back to the enemy units.
The code is:
```python
from sc2.bot_ai import BotAI
from sc2.ids.unit_typeid import UnitTypeId
from sc2.position import Point2
class HitAndRunBot(BotAI):
    async def on_step(self, iteration: int):
        stalkers = self.units(UnitTypeId.STALKER)
        spine_crawler = self.enemy_structures.first
        if not spine_crawler or not stalkers.exists:
            return
        for stalker in stalkers:
            # Check if the Stalker is in attack range of the Spine Crawler
            if stalker.distance_to(spine_crawler) <= stalker.ground_range + 1:
                # Attack the Spine Crawler
                stalker.attack(spine_crawler)
            else:
                # Move towards the Spine Crawler if not in attack range
                stalker.move(spine_crawler.position)
            # After attacking, retreat slightly to avoid counter-attack
            if stalker.weapon_cooldown > 0:
                retreat_position = stalker.position.towards(self.start_location, 3)
                stalker.move(retreat_position)
```
The result is:
You Win 4 out of 10 times and lose 6 out of 10 times. You achieve 227.5 scores, give 280.0 damages to the enemy, take 140.6625 damage on health, and take 170.7625 damage on shield on average.
You should check whether the api you invoked follows the burnysc2/python-sc2 package.
Please summarize why the code cause the result.
Meanwhile, please briefly provide me 1 most important refinement of the tactic aiming at killing more enemy units and cause more damage.
Do not show the revised code to me.
The critic LLM responds with a lengthy analysis:
### Summary of the Result and Code Analysis
The provided code implements a basic 'hit-and-run' tactic where Stalkers attack the Spine Crawler and then retreat slightly to avoid counter-attacks. However, the results show that this tactic is not very effective, with a 40% win rate and significant damage taken on both health and shields.
#### Why the Code Leads to the Result:
1. **Inconsistent Retreat Mechanism**:
- The retreat mechanism is triggered only when the Stalker's weapon cooldown is greater than 0. This means that the Stalker will only retreat after attacking, but not necessarily after being attacked by the Spine Crawler. This can lead to situations where the Stalker takes multiple hits from the Spine Crawler before retreating, resulting in significant damage.
2. **No Consideration for Health and Shield Levels**:
- The code does not consider the current health or shield levels of the Stalkers. If a Stalker's shield is depleted, it will take full damage from the Spine Crawler, which can quickly deplete its health. The tactic should prioritize retreating when shields are low or health is critically low.
3. **Fixed Retreat Distance**:
- The retreat distance is fixed at 3 units towards the starting location. This might not be sufficient to ensure the Stalker is out of range of the Spine Crawler's attack, especially if the Spine Crawler is positioned close to the starting location.
4. **No Group Coordination**:
- The code handles each Stalker individually without any coordination between them. This can lead to situations where one Stalker is retreating while the other is still attacking, potentially exposing both to damage.
#### Potential Methods to Improve the Tactic:
1. **Dynamic Retreat Mechanism**:
- Implement a more dynamic retreat mechanism that considers both the Stalker's weapon cooldown and its current health/shield levels. Retreat should be triggered not only after attacking but also when the Stalker's shields are low or health is critically low.
2. **Variable Retreat Distance**:
- Use a variable retreat distance based on the Stalker's current position relative to the Spine Crawler. Ensure that the retreat distance is sufficient to get the Stalker out of the Spine Crawler's attack range.
3. **Group Coordination**:
- Coordinate the actions of both Stalkers to ensure that they are not both attacking or retreating at the same time. This can help in managing the Spine Crawler's attention and reducing overall damage taken.
4. **Shield Management**:
- Prioritize retreating when shields are low to prevent the Stalker from taking full damage from the Spine Crawler. Consider using a shield recharge mechanic if available in the environment.
### Suggestion:
```[Improve Tactic]```
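The critic is asked to end its reply with one of two tags, which makes the next step easy to pick programmatically. A minimal sketch, assuming that `[Change Tactic]` sends control back to the planner and `[Improve Tactic]` to the coder; `parse_suggestion` and `NEXT_STAGE` are hypothetical names, and the routing table is an assumption rather than the repository's logic:

```python
import re

# Assumed routing between the critic's tag and the next stage of the loop.
NEXT_STAGE = {"Change Tactic": "planner", "Improve Tactic": "coder"}


def parse_suggestion(critic_reply: str) -> str:
    """Return the last suggestion tag found in the critic's reply."""
    tags = re.findall(r"\[(Change Tactic|Improve Tactic)\]", critic_reply)
    if not tags:
        raise ValueError("critic reply contains no suggestion tag")
    # Take the last tag so earlier mentions in the analysis text are ignored.
    return tags[-1]


critic_reply = "### Summary ...\n### Suggestion:\n[Improve Tactic]"
suggestion = parse_suggestion(critic_reply)
next_stage = NEXT_STAGE[suggestion]
```

Backticks around the tag, as in the reply above, do not affect the match.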
We then feed the analysis back to the coder model and have it regenerate the decision-tree code:
The map is 2s_vs_1sc.
You can control 2 Stalker units individually and the enemy controls 1 Spine Crawler Structure.
The Stalker unit has 80 health, 80 shield, 1 defense, 6 attacking range, 4.13 speed, 13 damage with 9.7 DPS.
The Spine Crawler has 300 health, 0 shield, 2 defense, 7 attacking range, 0 speed, and 25 damage with 18.9 DPS.
All the units has no abilities such as blinking or equipments.
The map is a 32*32 sized square map.
The available area of x axis is from 4 to 23, and the y axis is from 7 to 30.
The enemy structure is at (14, 21) point and your Stalker units are at (11, 9) and (17, 9) points initially.
There is no terrain advantages nor choke points in this map. You cannot get back to the enemy units.
You should not use the await keyword. Make sure to check whether the list variables are empty or not.
### Summary of the Result and Code Analysis
The provided code implements a basic 'hit-and-run' tactic where Stalkers attack the Spine Crawler and then retreat slightly to avoid counter-attacks. However, the results show that this tactic is not very effective, with a 40% win rate and significant damage taken on both health and shields.
#### Why the Code Leads to the Result:
1. **Inconsistent Retreat Mechanism**:
- The retreat mechanism is triggered only when the Stalker's weapon cooldown is greater than 0. This means that the Stalker will only retreat after attacking, but not necessarily after being attacked by the Spine Crawler. This can lead to situations where the Stalker takes multiple hits from the Spine Crawler before retreating, resulting in significant damage.
2. **No Consideration for Health and Shield Levels**:
- The code does not consider the current health or shield levels of the Stalkers. If a Stalker's shield is depleted, it will take full damage from the Spine Crawler, which can quickly deplete its health. The tactic should prioritize retreating when shields are low or health is critically low.
3. **Fixed Retreat Distance**:
- The retreat distance is fixed at 3 units towards the starting location. This might not be sufficient to ensure the Stalker is out of range of the Spine Crawler's attack, especially if the Spine Crawler is positioned close to the starting location.
4. **No Group Coordination**:
- The code handles each Stalker individually without any coordination between them. This can lead to situations where one Stalker is retreating while the other is still attacking, potentially exposing both to damage.
#### Potential Methods to Improve the Tactic:
1. **Dynamic Retreat Mechanism**:
- Implement a more dynamic retreat mechanism that considers both the Stalker's weapon cooldown and its current health/shield levels. Retreat should be triggered not only after attacking but also when the Stalker's shields are low or health is critically low.
2. **Variable Retreat Distance**:
- Use a variable retreat distance based on the Stalker's current position relative to the Spine Crawler. Ensure that the retreat distance is sufficient to get the Stalker out of the Spine Crawler's attack range.
3. **Group Coordination**:
- Coordinate the actions of both Stalkers to ensure that they are not both attacking or retreating at the same time. This can help in managing the Spine Crawler's attention and reducing overall damage taken.
4. **Shield Management**:
- Prioritize retreating when shields are low to prevent the Stalker from taking full damage from the Spine Crawler. Consider using a shield recharge mechanic if available in the environment.
### Suggestion:
```[Improve Tactic]```
Please implement the code.
Along with the newly generated decision-tree code, the coder also explains what was improved:
```python
from sc2.bot_ai import BotAI
from sc2.ids.unit_typeid import UnitTypeId
from sc2.position import Point2
class HitAndRunBot(BotAI):
    async def on_step(self, iteration: int):
        stalkers = self.units(UnitTypeId.STALKER)
        spine_crawler = self.enemy_structures.first
        if not spine_crawler or not stalkers.exists:
            return
        for stalker in stalkers:
            # Check if the Stalker is in attack range of the Spine Crawler
            if stalker.distance_to(spine_crawler) <= stalker.ground_range + 1:
                # Attack the Spine Crawler
                stalker.attack(spine_crawler)
            else:
                # Move towards the Spine Crawler if not in attack range
                stalker.move(spine_crawler.position)
            # After attacking, retreat slightly to avoid counter-attack
            if stalker.weapon_cooldown > 0 or stalker.shield_percentage < 0.2 or stalker.health_percentage < 0.2:
                retreat_distance = 5  # Adjust retreat distance based on current position
                retreat_position = stalker.position.towards(self.start_location, retreat_distance)
                stalker.move(retreat_position)
        # Group coordination: Ensure both Stalkers are not both attacking or retreating at the same time
        if len(stalkers) == 2:
            stalker1, stalker2 = stalkers
            if stalker1.weapon_cooldown > 0 and stalker2.weapon_cooldown == 0:
                stalker2.attack(spine_crawler)
            elif stalker2.weapon_cooldown > 0 and stalker1.weapon_cooldown == 0:
                stalker1.attack(spine_crawler)
```
### Explanation of Improvements:
1. **Dynamic Retreat Mechanism**:
- The retreat mechanism now considers not only the weapon cooldown but also the Stalker's shield and health percentages. Retreat is triggered when the shield is below 20% or health is below 20%, ensuring the Stalker avoids taking significant damage.
2. **Variable Retreat Distance**:
- The retreat distance is now set to 5 units, which should be sufficient to get the Stalker out of the Spine Crawler's attack range. This distance can be adjusted based on the specific map layout and unit positions.
3. **Group Coordination**:
- The code ensures that both Stalkers are not both attacking or retreating at the same time. This helps in managing the Spine Crawler's attention and reducing overall damage taken.
4. **Shield Management**:
- The retreat is prioritized when shields are low to prevent the Stalker from taking full damage from the Spine Crawler. This should help in maintaining the Stalkers' health and shields more effectively.
Moving this code into the environment yields a 100% win rate.
You Win 10 out of 10 times and lose 0 out of 10 times.
You achieve 350.0 scores, give 331.2 damages to the enemy, take 115.3125 damage on health, and take 405.0875 damage on shield on average.
Of course, sometimes the plan itself is judged not good enough; in that case the process goes back to the planner step.
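Taken together, the planner, coder, and critic form a simple refinement loop. The sketch below is an assumption about the control flow, not the repository's implementation: the four callables, `max_rounds`, and `target_win_rate` are hypothetical stand-ins for the LLM calls and the SMAC evaluation.

```python
from typing import Callable, List, Tuple


def refine_policy(
    plan: Callable[[str], str],
    code: Callable[[str, str], str],
    evaluate: Callable[[str], float],
    criticize: Callable[[str, str, float], Tuple[str, str]],
    max_rounds: int = 5,
    target_win_rate: float = 1.0,
) -> Tuple[str, List[float]]:
    """Run plan -> code -> evaluate -> criticize until the win rate is reached."""
    tactic = plan("")              # first plan, no feedback yet
    script = code(tactic, "")      # first script for that tactic
    history: List[float] = []
    for _ in range(max_rounds):
        win_rate = evaluate(script)
        history.append(win_rate)
        if win_rate >= target_win_rate:
            break
        feedback, suggestion = criticize(tactic, script, win_rate)
        if suggestion == "Change Tactic":
            # The plan itself is judged inadequate: go back to the planner.
            tactic = plan(feedback)
            script = code(tactic, "")
        else:
            # "Improve Tactic": keep the plan and regenerate the code.
            script = code(tactic, feedback)
    return script, history


# Toy stand-ins that mimic the 2s_vs_1sc walkthrough (40% -> 100%).
def fake_plan(feedback: str) -> str:
    return "Hit and Run"

def fake_code(tactic: str, feedback: str) -> str:
    return "v2" if feedback else "v1"

def fake_evaluate(script: str) -> float:
    return {"v1": 0.4, "v2": 1.0}[script]

def fake_criticize(tactic: str, script: str, win_rate: float) -> Tuple[str, str]:
    return "retreat when shields are low", "Improve Tactic"

final_script, win_rates = refine_policy(fake_plan, fake_code, fake_evaluate, fake_criticize)
```

Passing the stages in as callables keeps the loop independent of any particular LLM API or game client.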
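Experimental results:
All our SMAC maps use the same prompt structure with no map-specific customization, and we call deepseek through its API, so policy generation is fully automated (apart from having to describe the map). The results are as follows:
These are the results on SMAC. We were very pleased to find that only a few rounds of revision are needed to produce a decision-tree policy with a win rate close to 100%. The decision-tree model is also a white box, so you can see the logic the agents act on.
The visualizations were an even bigger pleasant surprise. The decisions generated by the LLM are very 'pure'. On the 2m_vs_1z map, for example, the resulting policy differs enormously from the one MARL learns.
If you are familiar with the MARL-trained policy, the two Marines keep “stepping back, then firing one basic attack”. This exploits a bug in the enemy's target acquisition: the enemy unit always tends to attack the nearest unit, so the Zealot keeps hesitating. If the Zealot's policy changes slightly, for example by committing to a single target, the trained policy stops working. The policy generated as a decision tree, by contrast, is a textbook one: one unit kites while the other attacks, and the enemy loses whatever strategy it uses.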
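The “one kites, one attacks” division of labour can be illustrated with plain geometry. This is only a sketch under stated assumptions, not the generated script: it assumes the Marine closest to the Zealot takes the kiting role, and `assign_roles`, `retreat_point`, and the `step` parameter are hypothetical names.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def assign_roles(marines: List[Point], zealot: Point) -> List[str]:
    """Give the Marine nearest the Zealot the 'kite' role, the rest 'attack'."""
    distances = [math.dist(marine, zealot) for marine in marines]
    kiter = distances.index(min(distances))
    return ["kite" if i == kiter else "attack" for i in range(len(marines))]


def retreat_point(unit: Point, enemy: Point, step: float = 2.0) -> Point:
    """Return a point `step` away from `unit`, directly away from `enemy`."""
    dx, dy = unit[0] - enemy[0], unit[1] - enemy[1]
    norm = math.hypot(dx, dy) or 1.0  # avoid division by zero when positions coincide
    return (unit[0] + step * dx / norm, unit[1] + step * dy / norm)


marines = [(0.0, 0.0), (5.0, 0.0)]
zealot = (1.0, 0.0)
roles = assign_roles(marines, zealot)
escape = retreat_point(marines[0], zealot)
```

Because the roles are recomputed from positions on every call, the rule does not depend on which unit the Zealot happens to target.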
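For more videos, see the GIFs on GitHub.
We also found that decision-tree code places fewer demands on the interface. The decision-tree model obtained on 8m can be applied to the 3m map without changing a single line of code, and the 3s_vs_4z decision tree switches seamlessly to 3s_vs_3z and 3s_vs_5z. This removes the issue in traditional MARL where a policy is tied to fixed input and output dimensions.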
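Future work:
The overall scheme we built in this 'hackathon' is still fairly simple. That was not our original intention; SMAC simply turned out to be too easy to beat.
In the future we hope to use match results as a reward function to fine-tune the LLM's parameters.
Further ahead, we have also crawled a large amount of StarCraft script code, which could be used in pre-training to obtain a StarCraft-specific large model.
Personally, I think that in the decades decision-making has been studied, a great deal of high-quality decision-tree code was written before reinforcement learning came along. Rather than letting it go to waste, we should use it as a corpus and combine it with RL to build a high-quality large model for decision-making.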
References
[1] https://arxiv.org/pdf/2410.1602...: https://arxiv.org/pdf/2410.16024
[2] https://github.com/devindeng94...: https://github.com/devindeng94/