01
Introduction
This is the fourth article in the Transformer-from-scratch series. It introduces the position-wise feed-forward network (FFN) from the ground up: a network that applies the same fully connected layers to every position in the sequence.
02
Background
The position-wise feed-forward network (FFN) consists of two fully connected layers, i.e. a multi-layer perceptron (MLP). The dimension of the hidden layer, d_ffn, is generally set to roughly four times d_model. For this reason it is sometimes called an expand-and-contract network.
The first FFN layer has weights of shape (d_model, d_ffn), which means the weights are broadcast across every sequence during the tensor multiplication. In other words, each sequence is multiplied by the same weights, so identical input sequences produce identical outputs. The same logic applies to the second fully connected layer of size (d_ffn, d_model), which returns the tensor to its original size.
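A minimal sketch of this broadcasting behavior, assuming d_model = 8 and d_ffn = 32 (w_1_demo here is only an illustrative stand-in for the first FFN layer):
import torch
import torch.nn as nn

torch.manual_seed(0)
w_1_demo = nn.Linear(8, 32)                  # weights of shape (d_model, d_ffn)
x = torch.randn(1, 1, 8)                     # a single position vector
x_repeated = x.repeat(1, 6, 1)               # the same vector at six positions
out = w_1_demo(x_repeated)                   # (1, 6, 32)
print(torch.allclose(out[0, 0], out[0, 5]))  # True: same input, same weights, same output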
A ReLU activation, max(0, X), is applied between the two layers. Any value greater than 0 remains unchanged, and any value less than or equal to 0 becomes 0. It introduces non-linearity and helps prevent vanishing gradients.
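A quick illustration of the activation on a toy tensor (not part of the original example):
import torch
x = torch.tensor([[-1.5, 0.0, 2.3]])
print(x.relu())  # negatives and zeros are clamped to 0: [[0.0, 0.0, 2.3]]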
03
Basic Implementation
import torch
import torch.nn as nn

# tokenize, stoi, Embeddings, PositionalEncoding, and MultiHeadAttention are
# the helpers defined in the earlier articles of this series
torch.set_printoptions(precision=2, sci_mode=False)
# convert the sequences to integers
sequences = ["I wonder what will come next!",
"This is a basic example paragraph.",
"Hello, what is a basic split?"]
# tokenize the sequences
tokenized_sequences = [tokenize(seq) for seq in sequences]
# index the sequences
indexed_sequences = [[stoi[word] for word in seq] for seq in tokenized_sequences]
# convert the sequences to a tensor
tensor_sequences = torch.tensor(indexed_sequences).long()
# vocab size
vocab_size = len(stoi)
# embedding dimensions
d_model = 8
# create the embeddings
lut = Embeddings(vocab_size, d_model) # look-up table (lut)
# create the positional encodings
pe = PositionalEncoding(d_model=d_model, dropout=0.1, max_length=10)
# embed the sequence
embeddings = lut(tensor_sequences)
# positionally encode the sequences
X = pe(embeddings)
# set the n_heads
n_heads = 4
# create the attention layer
attention = MultiHeadAttention(d_model, n_heads, dropout=0.1)
# use X as the query, key, and value; the attention layer projects it to Q, K, and V
output, attn_probs = attention(X, X, X, mask=None)
print(output)
The output above can now be fed into the FFN. This expands the 8-dimensional embeddings into a 32-dimensional representation, which also passes through the ReLU activation. The new tensor has shape (3, 6, 8) x (8, 32) → (3, 6, 32).
d_ffn = d_model * 4 # 32
w_1 = nn.Linear(d_model, d_ffn) # (8, 32)
w_2 = nn.Linear(d_ffn, d_model) # (32, 8)
ffn_1 = w_1(output).relu()
print(ffn_1)
tensor([
# sequence 0
[[ 0.00, 0.00, 0.58, 0.00, 0.86, 0.00, 0.00, 0.44, 0.00, 0.00, 0.00, 0.23, 0.00, 0.40, 0.00, 0.30, 0.10, 0.00, 0.48, 0.00, 0.00, 0.00, 0.30, 0.71, 0.17, 0.00, 0.47, 0.00, 0.00, 0.00, 0.00, 0.40],
[ 0.00, 0.00, 0.62, 0.00, 0.90, 0.00, 0.00, 0.51, 0.00, 0.00, 0.05, 0.29, 0.00, 0.37, 0.00, 0.33, 0.02, 0.00, 0.44, 0.00, 0.00, 0.00, 0.20, 0.83, 0.19, 0.00, 0.47, 0.00, 0.00, 0.00, 0.00, 0.32],
[ 0.00, 0.00, 0.28, 0.00, 0.81, 0.00, 0.00, 0.53, 0.00, 0.00, 0.00, 0.04, 0.23, 0.30, 0.00, 0.61, 0.00, 0.00, 0.52, 0.00, 0.00, 0.00, 0.17, 0.80, 0.08, 0.00, 0.46, 0.00, 0.00, 0.00, 0.00, 0.50],
[ 0.06, 0.00, 0.11, 0.00, 0.60, 0.00, 0.00, 0.47, 0.00, 0.00, 0.00, 0.00, 0.41, 0.10, 0.00, 0.76, 0.00, 0.14, 0.35, 0.00, 0.00, 0.00, 0.13, 0.49, 0.00, 0.00, 0.28, 0.00, 0.00, 0.00, 0.00, 0.57],
[ 0.00, 0.12, 0.40, 0.00, 0.63, 0.00, 0.00, 0.34, 0.00, 0.25, 0.26, 0.40, 0.00, 0.31, 0.00, 0.21, 0.03, 0.00, 0.62, 0.00, 0.00, 0.00, 0.00, 1.83, 0.45, 0.00, 0.65, 0.00, 0.00, 0.09, 0.00, 0.00],
[ 0.00, 0.13, 0.29, 0.00, 0.67, 0.00, 0.00, 0.41, 0.00, 0.15, 0.27, 0.30, 0.00, 0.27, 0.00, 0.40, 0.00, 0.00, 0.58, 0.00, 0.00, 0.00, 0.00, 1.78, 0.34, 0.00, 0.62, 0.00, 0.00, 0.06, 0.00, 0.00]],
# sequence 1
[[ 0.00, 0.00, 0.89, 0.00, 0.51, 0.00, 0.00, 0.28, 0.00, 0.00, 0.00, 0.38, 0.00, 0.17, 0.00, 0.00, 0.32, 0.00, 0.35, 0.06, 0.00, 0.00, 0.11, 0.54, 0.47, 0.00, 0.32, 0.20, 0.20, 0.04, 0.00, 0.17],
[ 0.00, 0.00, 0.96, 0.00, 0.39, 0.00, 0.00, 0.15, 0.00, 0.00, 0.00, 0.31, 0.00, 0.03, 0.05, 0.00, 0.50, 0.00, 0.12, 0.00, 0.00, 0.00, 0.20, 0.36, 0.41, 0.00, 0.32, 0.29, 0.56, 0.08, 0.00, 0.22],
[ 0.07, 0.00, 0.56, 0.00, 0.22, 0.00, 0.00, 0.38, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.13, 0.00, 0.26, 0.00, 0.00, 0.04, 0.07, 0.31, 0.00, 0.11, 0.25, 0.00, 0.41, 0.15, 0.00, 0.34],
[ 0.68, 0.00, 0.01, 0.31, 0.00, 0.18, 0.00, 0.00, 0.77, 0.23, 0.00, 0.00, 0.00, 0.00, 0.34, 0.00, 0.45, 0.00, 0.37, 0.00, 0.10, 0.00, 0.00, 0.50, 0.00, 0.00, 0.05, 0.00, 0.34, 0.00, 0.00, 0.00],
[ 0.00, 0.00, 0.31, 0.32, 0.00, 0.00, 0.00, 0.11, 0.00, 0.00, 0.00, 0.05, 0.00, 0.00, 0.00, 0.00, 0.23, 0.05, 0.58, 0.00, 0.23, 0.00, 0.00, 0.83, 0.62, 0.19, 0.34, 0.19, 0.19, 0.00, 0.00, 0.25],
[ 0.24, 0.00, 0.12, 0.00, 0.00, 0.00, 0.00, 0.22, 0.00, 0.00, 0.00, 0.00, 0.13, 0.00, 0.00, 0.32, 0.08, 0.00, 0.49, 0.00, 0.00, 0.00, 0.00, 0.59, 0.00, 0.00, 0.28, 0.00, 0.00, 0.00, 0.00, 0.24]],
# sequence 2
[[ 0.00, 1.00, 0.67, 0.07, 1.18, 0.00, 0.00, 0.85, 0.00, 0.00, 0.00, 0.98, 0.00, 0.44, 0.00, 0.17, 0.00, 0.09, 1.07, 0.38, 0.10, 0.12, 0.00, 1.89, 2.11, 1.44, 0.69, 0.91, 0.00, 0.06, 0.00, 0.22],
[ 0.00, 0.10, 0.00, 0.68, 0.00, 0.00, 0.42, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.19, 0.00, 0.92, 0.00, 0.43, 0.05, 0.00, 1.76, 0.92, 0.00, 0.57, 0.07, 0.00, 0.00, 0.00, 0.12],
[ 0.00, 0.00, 0.00, 0.14, 0.00, 0.00, 0.30, 0.00, 0.00, 0.10, 0.00, 0.00, 0.26, 0.00, 0.00, 0.50, 0.05, 0.00, 0.77, 0.00, 0.08, 0.00, 0.00, 1.43, 0.00, 0.00, 0.53, 0.00, 0.00, 0.00, 0.00, 0.23],
[ 0.00, 0.08, 0.00, 0.22, 0.00, 0.00, 0.45, 0.32, 0.00, 0.00, 0.00, 0.00, 0.27, 0.00, 0.00, 0.46, 0.00, 0.11, 1.03, 0.00, 0.22, 0.39, 0.00, 1.66, 0.49, 0.49, 0.54, 0.00, 0.00, 0.00, 0.00, 0.22],
[ 0.35, 0.00, 0.00, 0.55, 0.00, 0.13, 0.03, 0.00, 0.51, 0.42, 0.00, 0.00, 0.00, 0.00, 0.03, 0.00, 0.63, 0.00, 0.66, 0.00, 0.32, 0.00, 0.00, 0.61, 0.00, 0.00, 0.27, 0.00, 0.00, 0.00, 0.28, 0.09],
[ 0.01, 0.00, 0.00, 0.79, 0.00, 0.01, 0.42, 0.00, 0.28, 0.52, 0.00, 0.00, 0.00, ...]]])
(remaining output truncated)
ffn_2 = w_2(ffn_1)
print(ffn_2)
The second linear layer then projects the tensor back to its original shape: (3, 6, 32) x (32, 8) → (3, 6, 8).
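As a quick sanity check (the exact values depend on the random initialization, but the shapes do not):
print(ffn_1.shape)  # torch.Size([3, 6, 32])
print(ffn_2.shape)  # torch.Size([3, 6, 8])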
04
FFN in the Transformer
The implementation of the FFN in the Transformer is straightforward. It consists mainly of two linear layers: the first of size (d_model, d_ffn) and the second of size (d_ffn, d_model).
The input X to the module has shape (batch_size, seq_length, d_model), so it undergoes the following transformations:
(batch_size, seq_length, d_model) x (d_model, d_ffn) = (batch_size, seq_length, d_ffn)
max(0, (batch_size, seq_length, d_ffn)) = (batch_size, seq_length, d_ffn)
(batch_size, seq_length, d_ffn) x (d_ffn, d_model) = (batch_size, seq_length, d_model)
The corresponding code implementation is shown below:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ffn: int, dropout: float = 0.1):
        """
        Args:
            d_model: dimension of embeddings
            d_ffn: dimension of feed-forward network
            dropout: probability of dropout occurring
        """
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ffn)
        self.w_2 = nn.Linear(d_ffn, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Args:
            x: output from attention (batch_size, seq_length, d_model)

        Returns:
            expanded-and-contracted representation (batch_size, seq_length, d_model)
        """
        # w_1(x).relu(): (batch_size, seq_length, d_model) x (d_model, d_ffn) -> (batch_size, seq_length, d_ffn)
        # w_2(w_1(x).relu()): (batch_size, seq_length, d_ffn) x (d_ffn, d_model) -> (batch_size, seq_length, d_model)
        return self.w_2(self.dropout(self.w_1(x).relu()))
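A minimal usage sketch (ffn_demo and x_demo are illustrative names and the input is random; the forward pass in the next section feeds in the attention output instead):
ffn_demo = PositionwiseFeedForward(d_model=8, d_ffn=32, dropout=0.1)
x_demo = torch.randn(3, 6, 8)   # (batch_size, seq_length, d_model)
print(ffn_demo(x_demo).shape)   # torch.Size([3, 6, 8]) -- the shape is preserved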
05
Forward Pass
The forward pass assumes the data has already gone through the embedding layer, the positional encoding layer, and the multi-head attention layer. For now it does not use layer normalization or residual addition; those will be implemented immediately before and after this network in the encoder later on.
torch.set_printoptions(precision=2, sci_mode=False)
# convert the sequences to integers
sequences = ["I wonder what will come next!",
"This is a basic example paragraph.",
"Hello, what is a basic split?"]
# tokenize the sequences
tokenized_sequences = [tokenize(seq) for seq in sequences]
# index the sequences
indexed_sequences = [[stoi[word] for word in seq] for seq in tokenized_sequences]
# convert the sequences to a tensor
tensor_sequences = torch.tensor(indexed_sequences).long()
# vocab size
vocab_size = len(stoi)
# embedding dimensions
d_model = 8
# create the embeddings
lut = Embeddings(vocab_size, d_model) # look-up table (lut)
# create the positional encodings
pe = PositionalEncoding(d_model=d_model, dropout=0.1, max_length=10)
# embed the sequence
embeddings = lut(tensor_sequences)
# positionally encode the sequences
X = pe(embeddings)
# set the n_heads
n_heads = 4
# create the attention layer
attention = MultiHeadAttention(d_model, n_heads, dropout=0.1)
# use X as the query, key, and value; the attention layer projects it to Q, K, and V
output, attn_probs = attention(X, X, X, mask=None)
# calculate the d_ffn
d_ffn = d_model*4 # 32
# pass the tensor through the position-wise feed-forward network
ffn = PositionwiseFeedForward(d_model, d_ffn, dropout=0.1)
print(ffn(output))
The printed result again has shape (batch_size, seq_length, d_model), i.e. (3, 6, 8); the exact values depend on the random initialization and dropout.
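As a preview (a minimal sketch only; the exact arrangement is covered in the upcoming articles on layer normalization and the encoder), the residual addition and layer normalization mentioned earlier would wrap the FFN roughly like this:
norm = nn.LayerNorm(d_model)
# add the FFN output back to its input, then normalize
out = norm(output + ffn(output))
print(out.shape)  # torch.Size([3, 6, 8])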
With the above, the code implementation of the position-wise feed-forward network should be clear. The next article in this series covers layer normalization.