名师讲堂｜使用 Stata 测算数实融合水平

教育 2024-11-07 13:56 安徽

该课程已经讲解完，感兴趣的小伙伴可以点击文末的阅读原文跳转到 RStata 平台观看录播～

购买名师讲堂会员：https://rstata.duanshu.com/#/card/6c8dc065c753498e948c45e032a98729 更多详情欢迎添加李老师微信 r_stata 了解：

今天给大家分享一下「新质生产力背景下数实融合的测算与时空比较——基于专利共分类方法的研究」一文中的数实融合水平指标的 Stata 计算方法，不过由于论文中关于指标计算的一些细节描述的较为含糊，所以计算过程中的一些处理方法还有待探讨，感兴趣的小伙伴也可以说说自己的想法。

按照论文的介绍，数实融合水平的测算过程如下所示：

一个简单的示例

准备示例数据

按照上面的示意图。计算数实融合需要下面三个数据：

一是各个专利的分类号：

clear all  
input str2 id str20 ipc 
"P1" "IPC1;IPC3"
"P2" "IPC1;IPC2"
"P3" "IPC1;IPC2;IPC3"
end 
compress 
save 专利示例数据_示例, replace

二是 IPC 分类号与产业对照表：

clear all  
input str4 uniq_ipc str20 industry
"IPC1" "I1"
"IPC2" "I2;I3"
"IPC3" "I3;I4"
end 
compress 
save IPC与产业对照表_示例, replace

三是产业分类表：

clear all  
input str4 uniq_industry str20 class
"I1" "数字产业"
"I2" "数字产业"
"I3" "实体产业"
"I4" "实体产业"
end 
compress 
save 产业分类表_示例, replace

计算 IPC 融合矩阵

根据文献的介绍，IPC 融合矩阵的结果就是如果两个 IPC 号出现在一个专利的分类号上，返回 1，否则返回 0。因此我们可以使用 Mata 的代码循环各个专利，分别获取每个分类号里面的 IPC 组合，合并再统计数量：

use 专利示例数据_示例, clear 
gen _freq = 1 
mata:
    ipc = st_sdata(., "ipc")
    value = st_data(., "_freq") 
    fullcombinations = J(0, 3, "")
    for (r = 1; r <= rows(ipc); r++) {
        class_list = colshape(ustrsplit(ipc[r, 1], ";"), 1)
        n_classes = rows(class_list)
        if (n_classes == 2) {
          fullcombinations = fullcombinations \ (class_list[1,1], class_list[2,1], strofreal(value[r,1]))
        }
        if (n_classes > 2) {
          combinations = J(n_classes * n_classes, 3, "")
          row_index = 1
        
          for (i = 1; i <= n_classes; i++) {
              for (j = 1; j <= n_classes; j++) {
                  combinations[row_index, 1] = class_list[i,1]
                  combinations[row_index, 2] = class_list[j,1]
                  combinations[row_index, 3] = strofreal(value[r,1])
                  row_index++
              }
          }
          fullcombinations = fullcombinations \ combinations 
        }
    }

    st_matrix("obs", rows(fullcombinations))
    stata("clear")
    st_addvar("str100", "ipc1")
    st_addvar("str100", "ipc2")
    st_addvar("str100", "value")
    stata("set obs `=obs[1,1]'")
    st_sstore(., ("ipc1", "ipc2", "value"), fullcombinations)
end 

destring, replace 
compress 
save temp1, replace 

list in 1/10 

*>     +---------------------+
*>     | ipc1   ipc2   value |
*>     |---------------------|
*>  1. | IPC1   IPC3       1 |
*>  2. | IPC1   IPC2       1 |
*>  3. | IPC1   IPC1       1 |
*>  4. | IPC1   IPC2       1 |
*>  5. | IPC1   IPC3       1 |
*>     |---------------------|
*>  6. | IPC2   IPC1       1 |
*>  7. | IPC2   IPC2       1 |
*>  8. | IPC2   IPC3       1 |
*>  9. | IPC3   IPC1       1 |
*> 10. | IPC3   IPC2       1 |
*>     +---------------------+

汇总统计各组合的总共现次数：

use temp1, clear 
collapse (sum) value, by(ipc1 ipc2) 
replace value = 0 if ipc1 == ipc2
save temp1a, replace

不过这个结果并不是一个对称矩阵，下面我们再把这个数据转换成对称矩阵（对称矩阵转换成长数据就是一个平衡面板）：

*- 处理成对称矩阵 
use temp1a, clear 
gen id = _n 
drop value 
gather ipc1 ipc2 
keep value 
duplicates drop value, force 
save temp1b, replace 

ren value value2 
cross using temp1b
ren value ipc1 
ren value2 ipc2 
merge 1:1 ipc1 ipc2 using temp1a 
drop _m 
replace value = 0 if mi(value)
save temp1c, replace 

ren ipc1 temp 
ren ipc2 ipc1 
ren temp ipc2 
ren value value2 
merge 1:1 ipc1 ipc2 using temp1c 
*- 因为对于每个专利，ab ba 的情况都统计了，所以这里选择最大的（统计最全面的）作为共现结果
egen value3 = rowmax(value value2) 
replace value = value3 
drop value2 _m value3 
replace value = 0 if ipc1 == ipc2 
drop if value == 0 
save IPC融合数量计算结果, replace 

list in 1/6, sep(0)
*>     +---------------------+
*>     | ipc1   ipc2   value |
*>     |---------------------|
*>  1. | IPC1   IPC2       2 |
*>  2. | IPC1   IPC3       2 |
*>  3. | IPC2   IPC1       2 |
*>  4. | IPC2   IPC3       1 |
*>  5. | IPC3   IPC1       2 |
*>  6. | IPC3   IPC2       1 |
*>     +---------------------+

转换成矩阵就是这样的：

spread ipc2 value 
list in 1/3 

*>    +---------------------------+
*>    | ipc1   IPC1   IPC2   IPC3 |
*>    |---------------------------|
*> 1. | IPC1      .      2      2 |
*> 2. | IPC2      2      .      1 |
*> 3. | IPC3      2      1      . |
*>    +---------------------------+

然后我们把 IPC 分类号和 IPC与产业对照表 进行匹配，进而将 IPC 替换成产业分类编号：

use IPC融合数量计算结果, clear 
ren ipc1 uniq_ipc
ren ipc2 uniq_ipc2
merge m:1 uniq_ipc using IPC与产业对照表_示例
drop if _m == 2 
drop _m 
ren industry industry1 
drop uniq_ipc 

ren uniq_ipc2 uniq_ipc 
merge m:1 uniq_ipc using IPC与产业对照表_示例
drop if _m == 2 
drop _m 
ren industry industry2 
drop uniq_ipc 

order industry*
foreach i of varlist _all {
  cap format `i' %10s 
}
drop if mi(industry1) | mi(industry2) 

drop if industry1 == ";"
replace industry1 = subinstr(industry1, ";", " ", .)
replace industry1 = strtrim(industry1)
replace industry1 = subinstr(industry1, " ", ";", .)

drop if industry2 == ";"
replace industry2 = subinstr(industry2, ";", " ", .)
replace industry2 = strtrim(industry2)
replace industry2 = subinstr(industry2, " ", ";", .)
save temp2, replace 

list, sep(0)

*>     +-----------------------------+
*>     | indust~1   indust~2   value |
*>     |-----------------------------|
*>  1. |    I2;I3         I1       2 |
*>  2. |    I3;I4         I1       2 |
*>  3. |       I1      I2;I3       2 |
*>  4. |    I3;I4      I2;I3       1 |
*>  5. |       I1      I3;I4       2 |
*>  6. |    I2;I3      I3;I4       1 |
*>     +-----------------------------+

计算产业融合矩阵

使用类似上面的 Mata 代码就可以计算产业融合矩阵了：

use temp2, clear 
collapse (sum) value, by(industry1 industry2) 
mata:
  mata clear 
  fullcombinations = J(0, 3, "")
  value = st_data(., "value")
  // industry 矩阵
  industry1 = st_sdata(., "industry1")
  industry2 = st_sdata(., "industry2")

  for (z = 1; z <= rows(industry1); z++) {
    a = ustrsplit(industry1[z,1], ";")
    b = ustrsplit(industry2[z,1], ";")
    combinations = J(cols(a) * cols(b), 3, "")
    row_index = 1
    for (i = 1; i <= cols(a); i++) {
      for (j = 1; j <= cols(b); j++) {
        combinations[row_index, 1] = a[1, i]
        combinations[row_index, 2] = b[1, j]
        combinations[row_index, 3] = strofreal(value[z,1]) 
        row_index++
      }
    }
    fullcombinations = fullcombinations \ combinations 
  }
  st_matrix("obs", rows(fullcombinations))
  stata("clear")
  st_addvar("str100", "industry1")
  st_addvar("str100", "industry2")
  st_addvar("str100", "value")
  stata("set obs `=obs[1,1]'")
  st_sstore(., ("industry1", "industry2", "value"), fullcombinations)
end 
destring, replace 
compress 
collapse (sum) value, by(industry1 industry2)
save temp3, replace 

list, sep(0)

*>     +-----------------------------+
*>     | indust~1   indust~2   value |
*>     |-----------------------------|
*>  1. |       I1         I2       2 |
*>  2. |       I1         I3       4 |
*>  3. |       I1         I4       2 |
*>  4. |       I2         I1       2 |
*>  5. |       I2         I3       1 |
*>  6. |       I2         I4       1 |
*>  7. |       I3         I1       4 |
*>  8. |       I3         I2       1 |
*>  9. |       I3         I3       2 |
*> 10. |       I3         I4       1 |
*> 11. |       I4         I1       2 |
*> 12. |       I4         I2       1 |
*> 13. |       I4         I3       1 |
*>     +-----------------------------+

然后我们统计下数字产业和实体产业的数量：

use temp3, clear 
gen id = _n 
drop value 
gather industry1 industry2 
keep value
duplicates drop _all, force 
save temp3a, replace 

ren value uniq_industry
save temp3b, replace 
merge 1:m uniq_industry using 产业分类表_示例
keep if _m == 3 
drop _m 
contract class 
save 融合的各产业数量, replace

和前面一样，这里的 temp3 结果并不是一个对称矩阵，我们也要使用类似的方法把该结果处理成对称矩阵（平衡面板）：

*- 融合的各产业数量
use temp3, clear 
gen id = _n 
drop value 
gather industry1 industry2 
keep value
duplicates drop _all, force 
save temp3a, replace 

ren value uniq_industry
save temp3b, replace 
merge 1:m uniq_industry using 产业分类表_示例
keep if _m == 3 
drop _m 
contract class 
save 融合的各产业数量, replace 

*- 处理成对称矩阵
use temp3a, clear
cross using temp3b 
ren value industry1 
ren uniq_industry industry2 
merge 1:1 industry1 industry2 using temp3 
drop _m 
replace value = 0 if mi(value)
save temp3c, replace 

ren industry1 temp 
ren industry2 industry1 
ren temp industry2 
ren value value2 
merge 1:1 industry1 industry2 using temp3c 

egen value3 = rowmax(value value2) 
replace value = value3 
drop value2 _m value3 
replace value = 0 if industry1 == industry2 
save 行业融合数量计算结果, replace 

list, sep(0)

*>     +-----------------------------+
*>     | indust~2   indust~1   value |
*>     |-----------------------------|
*>  1. |       I1         I1       0 |
*>  2. |       I2         I1       2 |
*>  3. |       I3         I1       4 |
*>  4. |       I4         I1       2 |
*>  5. |       I1         I2       2 |
*>  6. |       I2         I2       0 |
*>  7. |       I3         I2       1 |
*>  8. |       I4         I2       1 |
*>  9. |       I1         I3       4 |
*> 10. |       I2         I3       1 |
*> 11. |       I3         I3       0 |
*> 12. |       I4         I3       1 |
*> 13. |       I1         I4       2 |
*> 14. |       I2         I4       1 |
*> 15. |       I3         I4       1 |
*> 16. |       I4         I4       0 |
*>     +-----------------------------+

然后我们就可以根据公式计算数实融合水平了：

use 行业融合数量计算结果, clear 
ren industry1 uniq_industry 
merge m:1 uniq_industry using 产业分类表_示例
keep if _m == 3
drop _m 
ren uniq_industry industry1 
ren class class1 

ren industry2 uniq_industry 
merge m:1 uniq_industry using 产业分类表_示例
keep if _m == 3
drop _m 
ren uniq_industry industry2 
ren class class2 
drop if value == 0 
save temp4, replace 

*- 计算三种融合的数量
collapse (sum) value, by(class1 class2)
save temp5, replace 

use temp4, clear 
ren class1 class 
merge m:1 class using 融合的各产业数量 
drop _m 
ren _freq freq1 
ren class class1 

ren class2 class 
merge m:1 class using 融合的各产业数量 
drop _m 
ren _freq freq2 
ren class class2 

collapse (sum) value (first) freq1 (first) freq2, by(class1 class2)
gen RH = value / (freq1 * freq2)

gen class = "数实融合" if class1 != class2 
replace class = "数数融合" if class1 == "数字产业" & class2 == "数字产业" 
replace class = "实实融合" if class1 == "实体产业" & class2 == "实体产业" 

keep class RH 
duplicates drop _all, force 

list
*>     +---------------+
*>     | RH      class |
*>     |---------------|
*>  1. | .5   实实融合 |
*>  2. |  2   数实融合 |
*>  3. |  1   数数融合 |
*>     +---------------+

这样就计算得到了数实融合水平是 2，数数融合水平是 1，实实融合水平是 0.5。

基于真实数据的计算

数字经济及其核心产业统计分类（2021）

从国家知识产权局的官网上就可以下载到「数字经济及其核心产业统计分类（2021）.docx」文件了，使用下面的代码即可提取里面的表格：

library(tidyverse)
library(docxtractr)

# 读取 docx 文件
doc <- read_docx("数字经济及其核心产业统计分类（2021）.docx")

# 提取表格数据
doc %>%
  docx_extract_all_tbls() %>% 
  .[[1]] %>% 
  slice(-1, -2) %>% 
  set_names("大类", "中类", "小类", "名称", "说明", "国民经济行业代码及名称") %>% 
  type_convert() %>% 
  mutate(国民经济行业代码 = str_extract(国民经济行业代码及名称, "\\d{4}")) %>% 
  haven::write_dta("数字经济及其核心产业统计分类.dta")

haven::read_dta("数字经济及其核心产业统计分类.dta") %>% 
  DT::datatable()

根据论文的建议，我们选取数字经济前四类核心产业作为测算数实融合的数字产业部分：

use 数字经济及其核心产业统计分类.dta, clear 
keep if substr(小类, 1, 2) != "05" & !mi(小类)
drop if mi(国民经济行业代码) 
gen 数字经济产业 = ""
replace 数字经济产业 = "数字产品制造业" if substr(小类, 1, 2) == "01"
replace 数字经济产业 = "数字产品服务业" if substr(小类, 1, 2) == "02"
replace 数字经济产业 = "数字技术应用业" if substr(小类, 1, 2) == "03"
replace 数字经济产业 = "数字要素驱动业" if substr(小类, 1, 2) == "04"
keep 国民经济行业代码 数字经济产业
ren 国民经济行业代码 industry2 
save 数字经济核心产业, replace 


list in 1/10, sep(0)

*>     +---------------------------+
*>     | indust~2     数字经济产业 |
*>     |---------------------------|
*>  1. | 3911       数字产品制造业 |
*>  2. | 3912       数字产品制造业 |
*>  3. | 3913       数字产品制造业 |
*>  4. | 3914       数字产品制造业 |
*>  5. | 3915       数字产品制造业 |
*>  6. | 3919       数字产品制造业 |
*>  7. | 3921       数字产品制造业 |
*>  8. | 3922       数字产品制造业 |
*>  9. | 3940       数字产品制造业 |
*> 10. | 3931       数字产品制造业 |
*>     +---------------------------+

国民经济分类与IPC分类号对照表

为了建立 IPC 分类号和国民经济行业分类的关系，我们还需要处理下「国民经济分类与IPC分类号对照表」，该文件也是来自国家知识产权局：

library(tidyverse)
readxl::read_xlsx("国际专利分类与国民经济行业分类参照关系表（2018）.xlsx") %>% 
  set_names("国民经济行业代码", "国民经济行业名称",
            "国际专利分类号", "国际专利分类号类名") %>% 
  mutate(行业门类代码 = if_else(str_detect(国民经济行业代码, "[A-Z]"), 国民经济行业代码, "")) %>% 
  type_convert() %>% 
  fill(行业门类代码) %>% 
  mutate(国民经济行业代码 = if_else(str_length(国民经济行业代码) == 3 &
                              !str_detect(国民经济行业代码, "^0"), 
                            paste0("0", 国民经济行业代码), 国民经济行业代码)) %>% 
  filter(str_length(国民经济行业代码) == 4 | is.na(国民经济行业代码)) %>% 
  fill(国民经济行业代码, 国民经济行业名称) %>% 
  filter(!is.na(国际专利分类号)) %>% 
  mutate_all(~str_remove_all(.x, "\\r|\\n")) %>% 
  mutate(国民经济行业代码 = paste0(行业门类代码, 国民经济行业代码)) %>% 
  select(-行业门类代码) %>% 
  haven::write_dta("国民经济分类与IPC分类号对照表.dta", label = "数据处理：微信公众号 RStata")

haven::read_dta("国民经济分类与IPC分类号对照表.dta") %>% 
  DT::datatable()

实体产业分类

按照论文的介绍，实体产业可以分为下面四类：

制造业
农业
建筑业及其他工业
服务业

use 国民经济分类与IPC分类号对照表.dta, clear 
gen 实体产业 = ""
gen 行业门类 = substr(国民经济行业代码, 1, 1)
order 行业门类 
replace 实体产业 = "制造业" if 行业门类 == "C"
replace 实体产业 = "农业" if 行业门类 == "A"
replace 实体产业 = "建筑业及其他工业" if inlist(行业门类, "B", "D", "E")
replace 实体产业 = "服务业" if inlist(行业门类, "I", "O")
save 实体产业分类, replace

不过这个在本次课中用不到。

专利数据中所有的 IPC 号

虽然和国民经济行业代码对照的国际专利分类号有差不多 18000 个，不过并不是每个 IPC 号都在专利数据中出现过。因此我们首先需要统计所有专利数据中互不相同的 IPC 号，然后和这个「国民经济分类与IPC分类号对照表」进行匹配：

专利数据太大了，这里就不再放在附件了。大家重现下面的代码可以从这里下载：https://rstata.duanshu.com/#/brief/course/edef981592a64257aa41e2b6d6d21746

local url = "/Volumes/A10R2copy/A10T/大数据/1985～2022 年专利申请数据（已更新，含授权信息、申请人地址的经纬度及其所处的省市区县信息）/newpatentdata_coord2/"
use "`url'/1985patent_v2_coord.dta", clear 
ren 分类号 IPC 
replace IPC = subinstr(IPC, "//", "", .)
replace IPC = subinstr(IPC, "(", "", .)
replace IPC = subinstr(IPC, ")", "", .)
replace IPC = subinstr(IPC, "=", "", .)
replace IPC = subinstr(IPC, ",", ";", .)
drop if 专利类型 == "外观设计"
drop if !index(IPC, ";")
keep newzlid IPC 
replace IPC = subinstr(IPC, " ", "", .)

mata:
ipc = st_sdata(., "IPC")
ipc2 = rowshape(ipc, cols(ipc))
ipc3 = invtokens(ipc2, ";")
ipc4 = ustrsplit(ipc3, ";")
ipc5 = colshape(ipc4, 1)
ipc6 = uniqrows(ipc5)
st_matrix("obs", rows(ipc6))
stata("clear")
stata("cap drop uniq_ipc")
st_addvar("strL", "uniq_ipc")
stata("set obs `=obs[1,1]'")
st_sstore(., "uniq_ipc", ipc6)
end

循环所有年份的：

*- 循环所有年份获得所有的 IPC 
cap mkdir "uniq_ipc"
local url = "/Volumes/A10R2copy/A10T/大数据/1985～2022 年专利申请数据（已更新，含授权信息、申请人地址的经纬度及其所处的省市区县信息）/newpatentdata_coord2/"
forval y = 1985/2022 {
  di "`y'"
  qui {
    use "`url'/`y'patent_v2_coord.dta", clear 
    ren 分类号 IPC 
    replace IPC = subinstr(IPC, "//", "", .)
    replace IPC = subinstr(IPC, ",", ";", .)
    replace IPC = subinstr(IPC, "=", "", .)
    drop if 专利类型 == "外观设计"
    drop if !index(IPC, ";")
    keep newzlid IPC 
    replace IPC = subinstr(IPC, " ", "", .)

    mata: ipc = st_sdata(., "IPC")
    mata: ipc2 = rowshape(ipc, cols(ipc))
    mata: ipc3 = invtokens(ipc2, ";")
    mata: ipc4 = ustrsplit(ipc3, ";")
    mata: ipc5 = colshape(ipc4, 1)
    mata: ipc6 = uniqrows(ipc5)
    mata: st_matrix("obs", rows(ipc6))
    mata: stata("clear")
    mata: stata("cap drop uniq_ipc")
    mata: st_addvar("strL", "uniq_ipc")
    mata: stata("set obs `=obs[1,1]'")
    mata: st_sstore(., "uniq_ipc", ipc6)
    save uniq_ipc/`y', replace 
  }
}

合并所有的：

use uniq_ipc/1985, clear 
forval y = 1986/2022 {
  append using uniq_ipc/`y'
}
duplicates drop uniq_ipc, force 
replace uniq_ipc = subinstr(uniq_ipc, "(", "", .) if ustrregexm(uniq_ipc, "^\(")
replace uniq_ipc = ustrregexs(1) if ustrregexm(uniq_ipc, "(.*)\(")
replace uniq_ipc = subinstr(uniq_ipc, ")", "", .)
replace uniq_ipc = ustrregexs(1) if ustrregexm(uniq_ipc, "(.*):")
replace uniq_ipc = subinstr(uniq_ipc, "C07D209/14C07D209", "C07D209/14", .)
replace uniq_ipc = subinstr(uniq_ipc, "A61K31/52A61K31", "A61K31/52", .)
drop if mi(uniq_ipc)
drop if !ustrregexm(substr(uniq_ipc, 1, 1), "[A-Z]")
save uniq_ipc, replace

筛选能够和国民经济分类对应上的 IPC 分类号

「国民经济分类与IPC分类号对照表.dta」数据中给出了非常多的分类号和行业小类的对照，不过并不是所有的 IPC 号都在专利数据中出现过，因此我们可以筛选能够匹配上行业小类的：

use 国民经济分类与IPC分类号对照表.dta, clear 
gen dzid = _n 
order dzid 
ren 国际专利分类号 IPC 
replace IPC = subinstr(IPC, "*", "", .) 
replace IPC = subinstr(IPC, " ", "", .) 
gen len = strlen(IPC)
tab len 

*- 分长度保存
cap mkdir "对照表"
forval i = 3/11 {
  preserve 
  keep if len == `i' 
  drop len 
  save 对照表/len`i', replace 
  restore 
}

*- IPC 数据和对照表匹配
use uniq_ipc, clear 
cap mkdir "行业匹配结果"
gen len = strlen(uniq_ipc)
tab len 
gsort -len 
*- 删除 len > 11 的，不值得为个别观测值浪费大量时间处理
drop if len > 11 | len < 3 
forval i = 3/11 {
  preserve 
  keep if len == `i'
  drop len 
  gen IPC = substr(uniq_ipc, 1, `i')
  joinby IPC using 对照表/len`i' 
  save 行业匹配结果/`i', replace 
  restore 
} 

use 行业匹配结果/3, clear 
forval i = 4/11 {
  append using 行业匹配结果/`i'
}
duplicates drop _all, force 
keep uniq_ipc 国民经济行业代码 
save 专利数据与行业小类代码简易对照表, replace 

list in 1/10

*>     +--------------------+
*>     | uniq_ipc   国民~码 |
*>     |--------------------|
*>  1. |     H01L   C3974   |
*>  2. |     H01L   C3563   |
*>  3. |     H01L   C3073   |
*>  4. |     H01L   C3979   |
*>  5. |     H01L   C3562   |
*>     |--------------------|
*>  6. |     H01L   C3973   |
*>  7. |     H01L   C3976   |
*>  8. |     H01L   C3825   |
*>  9. |     H01L   C4028   |
*> 10. |     H01L   C3421   |
*>     +--------------------+

可以看到每个分类号是可能对应多个行业小类的，我们把这些行业小类收集起来：

*- 注意每个 IPC 会对应不同的行业，每个行业也会对应不同的 IPC 分类号
use 专利数据与行业小类代码简易对照表, clear 
recast str11 uniq_ipc
bysort uniq_ipc: gen id = _n 
tostring id, replace 
replace id = "m" + id
spread id 国民经济行业代码
unite m*, gen(国民经济行业代码) sep(";")
drop m*
forval i = 1/30 {
  replace 国民经济行业代码 = subinstr(国民经济行业代码, ";;", ";", .)
}
replace 国民经济行业代码 = subinstr(国民经济行业代码, ";", " ", .)
replace 国民经济行业代码 = strtrim(国民经济行业代码)
replace 国民经济行业代码 = subinstr(国民经济行业代码, " ", ";", .)
save 专利数据与行业小类代码简易对照表2, replace

也就是大概只有 3000 多个 IPC 分类号是可以对应上行业小类的。

产业数实分类

基于上面的数字经济产业分类和实体产业分类，我们就可以对所有的行业小类进行分类了：

*- 产业数实分类
use 数字经济核心产业.dta, clear 
use 国民经济分类与IPC分类号对照表.dta, clear 
keep 国民经济行业代码 
duplicates drop 国民经济行业代码, force 
gen industry2 = substr(国民经济行业代码, 2, .)
joinby industry2 using 数字经济核心产业.dta, unmatched(master)
drop if mi(国民经济行业代码)
drop _m 
joinby 国民经济行业代码 using 实体产业分类.dta, unmatched(master)
drop industry2 _m 
keep 国民经济行业代码 数字经济产业 实体产业 
gen class = (!mi(数字经济产业)) 

collapse (sum) class, by(国民经济行业代码)
gen 类别 = "数字经济产业" if class > 0 
replace 类别 = "实体产业" if class == 0 
drop class 
tab 类别 
save 产业数实分类, replace

如果某个行业小类同时对应了实体产业和数字产业，归类到数字产业中。

以 2012 年的专利数据为例进行计算

这样我们就已经有了所需的所有数据了。

首先，保留所需的变量：

*- use 2012patent_v2_coord.dta, clear 
*- keep newzlid 公开公告号 专利类型 分类号 
*- ren 分类号 IPC 
*- save 2012patent_small, replace

去除重复的专利以及删除用不到的外观设计专利：

use 2012patent_small, clear 
*- 去除重复的专利
drop if 专利类型 == "外观设计" 
gen temp = substr(公开公告号, -1, 1)
replace 公开公告号 = subinstr(公开公告号, temp, "", .) if ustrregexm(temp, "[A-Z]")
drop temp 
duplicates drop 公开公告号, force 
drop if !index(IPC, ";")
keep newzlid IPC 
replace IPC = subinstr(IPC, " ", "", .)
split IPC, parse(;)
drop IPC 
keep newzlid IPC1-IPC10 
save tempdata1, replace

把没有行业小类对照的都替换成空（这样可以减少下面的计算量）：

use tempdata1, clear 
forval i = 1/10 {
  ren IPC`i' uniq_ipc 
  merge m:1 uniq_ipc using 专利数据与行业小类代码简易对照表2 
  drop if _m == 2 
  replace uniq_ipc = "" if _m != 3 
  drop _m 
  ren uniq_ipc IPC`i'
  drop 国民经济行业代码
}
save tempdata1a, replace 

use tempdata1a, clear 
unite IPC*, gen(国际专利分类号) sep(";")
drop IPC*

forval i = 1/30 {
  replace 国际专利分类号 = subinstr(国际专利分类号, ";;", ";", .)
}

drop if 国际专利分类号 == ";"
replace 国际专利分类号 = subinstr(国际专利分类号, ";", " ", .)
replace 国际专利分类号 = strtrim(国际专利分类号)
replace 国际专利分类号 = subinstr(国际专利分类号, " ", ";", .)

drop if !index(国际专利分类号, ";")
ren 国际专利分类号 ipc 
save tempdata2, replace

统计 IPC 共现矩阵：

use tempdata2, clear 
contract ipc 
mata:
    ipc = st_sdata(., "ipc")
    value = st_data(., "_freq") 
    fullcombinations = J(0, 3, "")
    for (r = 1; r <= rows(ipc); r++) {
        class_list = colshape(ustrsplit(ipc[r, 1], ";"), 1)
        n_classes = rows(class_list)
        if (n_classes == 2) {
          fullcombinations = fullcombinations \ (class_list[1,1], class_list[2,1], strofreal(value[r,1]))
        }
        if (n_classes > 2) {
          combinations = J(n_classes * n_classes, 3, "")
          row_index = 1
        
          for (i = 1; i <= n_classes; i++) {
              for (j = 1; j <= n_classes; j++) {
                  combinations[row_index, 1] = class_list[i,1]
                  combinations[row_index, 2] = class_list[j,1]
                  combinations[row_index, 3] = strofreal(value[r,1])
                  row_index++
              }
          }
          fullcombinations = fullcombinations \ combinations 
        }
    }

    st_matrix("obs", rows(fullcombinations))
    stata("clear")
    st_addvar("str100", "ipc1")
    st_addvar("str100", "ipc2")
    st_addvar("str100", "value")
    stata("set obs `=obs[1,1]'")
    st_sstore(., ("ipc1", "ipc2", "value"), fullcombinations)
end

destring, replace 
compress 
save tempdata3, replace 
list in 1/10 

*>     +-------------------------------+
*>     |      ipc1        ipc2   value |
*>     |-------------------------------|
*>  1. |  A01G1/00    A01G1/02       1 |
*>  2. |  A01G1/00    A01G1/04      23 |
*>  3. |  A01G1/00    A01G1/06      28 |
*>  4. |  A01G1/00    A01G1/12       2 |
*>  5. |  A01G1/00   A01G23/00      25 |
*>     |-------------------------------|
*>  6. |  A01G1/00    A01G1/00       2 |
*>  7. |  A01G1/00   A01G23/00       2 |
*>  8. |  A01G1/00   A01G23/04       2 |
*>  9. | A01G23/00    A01G1/00       2 |
*> 10. | A01G23/00   A01G23/00       2 |
*>     +-------------------------------+

汇总后再把这个结果转换成平衡面板（对称矩阵的长面板样式）：

use tempdata3, clear 
collapse (sum) value, by(ipc1 ipc2) 
replace value = 0 if ipc1 == ipc2
save tempdata3a, replace 

*- 处理成对称矩阵 
use tempdata3a, clear 
gen id = _n 
drop value 
gather ipc1 ipc2 
keep value 
duplicates drop value, force 
save tempdata3b, replace 

ren value value2 
cross using tempdata3b
ren value ipc1 
ren value2 ipc2 
merge 1:1 ipc1 ipc2 using tempdata3a 
drop _m 
replace value = 0 if mi(value)
save tempdata3c, replace 

ren ipc1 temp 
ren ipc2 ipc1 
ren temp ipc2 
ren value value2 
merge 1:1 ipc1 ipc2 using tempdata3c 
egen value3 = rowmax(value value2) 
replace value = value3 
drop value2 _m value3 
replace value = 0 if ipc1 == ipc2
drop if value == 0 
save IPC融合数量计算结果, replace 

list in 1/10 

*>     +------------------------------+
*>     |     ipc1        ipc2   value |
*>     |------------------------------|
*>  1. | A01G1/00    A01G1/02       1 |
*>  2. | A01G1/00    A01G1/04      23 |
*>  3. | A01G1/00    A01G1/06      28 |
*>  4. | A01G1/00    A01G1/12       2 |
*>  5. | A01G1/00   A01G23/00      30 |
*>     |------------------------------|
*>  6. | A01G1/00   A01G23/02       3 |
*>  7. | A01G1/00   A01G23/04      21 |
*>  8. | A01G1/00   A01G25/16       1 |
*>  9. | A01G1/00    A23L1/08       1 |
*> 10. | A01G1/00    A23L1/10       1 |
*>     +------------------------------+

根据 IPC 和行业小类的对照表就可以把上面的 IPC 号替换成行业小类了：

use IPC融合数量计算结果, clear 
ren ipc1 uniq_ipc
merge m:1 uniq_ipc using 专利数据与行业小类代码简易对照表2.dta
drop if _m == 2 
drop _m 
ren 国民经济行业代码 industry1 
drop uniq_ipc 

ren ipc2 uniq_ipc 
merge m:1 uniq_ipc using 专利数据与行业小类代码简易对照表2.dta
drop if _m == 2 
drop _m 
ren 国民经济行业代码 industry2 
drop uniq_ipc 

order industry*
foreach i of varlist _all {
  cap format `i' %10s 
}
drop if mi(industry1) | mi(industry2) 

drop if industry1 == ";"
replace industry1 = subinstr(industry1, ";", " ", .)
replace industry1 = strtrim(industry1)
replace industry1 = subinstr(industry1, " ", ";", .)

drop if industry2 == ";"
replace industry2 = subinstr(industry2, ";", " ", .)
replace industry2 = strtrim(industry2)
replace industry2 = subinstr(industry2, " ", ";", .)

save tempdata4, replace

再使用 Mata 代码统计行业共现频次：

*- 统计各行业间的合作频次
use tempdata4, clear 
collapse (sum) value, by(industry1 industry2) 
mata:
  mata clear 
  fullcombinations = J(0, 3, "")
  value = st_data(., "value")
  // industry 矩阵
  industry1 = st_sdata(., "industry1")
  industry2 = st_sdata(., "industry2")

  for (z = 1; z <= rows(industry1); z++) {
    a = ustrsplit(industry1[z,1], ";")
    b = ustrsplit(industry2[z,1], ";")
    combinations = J(cols(a) * cols(b), 3, "")
    row_index = 1
    for (i = 1; i <= cols(a); i++) {
      for (j = 1; j <= cols(b); j++) {
        combinations[row_index, 1] = a[1, i]
        combinations[row_index, 2] = b[1, j]
        combinations[row_index, 3] = strofreal(value[z,1]) 
        row_index++
      }
    }
    fullcombinations = fullcombinations \ combinations 
  }
  st_matrix("obs", rows(fullcombinations))
  stata("clear")
  st_addvar("str100", "industry1")
  st_addvar("str100", "industry2")
  st_addvar("str100", "value")
  stata("set obs `=obs[1,1]'")
  st_sstore(., ("industry1", "industry2", "value"), fullcombinations)
end 
destring, replace 
compress 
collapse (sum) value, by(industry1 industry2)
save tempdata5, replace

同样，这个结果也并不是平衡面板（对称矩阵），下面还会需要再处理下。不过我们先统计下数字和实体行业的数量：

use tempdata5, clear 
replace value = 0 if industry1 == industry2 
gen id = _n 
drop value 
gather industry1 industry2 
keep value
duplicates drop _all, force 
save tempdata5a, replace 
ren value 国民经济行业代码 
save tempdata5b, replace 
merge 1:m 国民经济行业代码 using 产业数实分类
keep if _m == 3 
drop _m 
contract 类别
ren 类别 class
save 融合的各产业数量, replace 

list

*>     +----------------------+
*>     |        class   _freq |
*>     |----------------------|
*>  1. |       实体产业     296 |
*>  2. |    数字经济产业      26 |
*>     +----------------------+

处理平衡：

use tempdata5a, clear 
cross using tempdata5b 
ren value industry1 
ren 国民经济行业代码 industry2
merge 1:1 industry1 industry2 using tempdata5 
replace value = 0 if mi(value)
drop _m 
replace value = 0 if industry1 == industry2 
save tempdata5c, replace 
ren industry1 temp 
ren industry2 industry1 
ren temp industry2 
ren value value2 
merge 1:1 industry1 industry2 using tempdata5c 
egen value3 = rowmax(value value2) 
replace value = value3 
drop value2 _m value3 
replace value = 0 if industry1 == industry2

匹配上行业类别再分类计算总数量：

ren industry1 国民经济行业代码 
recast str5 国民经济行业代码
merge m:1 国民经济行业代码 using 产业数实分类
keep if _m == 3
drop _m 
ren 国民经济行业代码 industry1 
ren 类别 class1

ren industry2 国民经济行业代码 
recast str5 国民经济行业代码
merge m:1 国民经济行业代码 using 产业数实分类
keep if _m == 3
drop _m 
ren 国民经济行业代码 industry2 
ren 类别 class2 
drop if value == 0 
collapse (sum) value , by(class1 class2)
save tempdata6, replace

最后就可以计算数实、数数和实实融合水平了：

use tempdata6, clear 
ren class1 class 
merge m:1 class using 融合的各产业数量 
drop _m 
ren _freq freq1 
ren class class1 

ren class2 class 
merge m:1 class using 融合的各产业数量 
drop _m 
ren _freq freq2 
ren class class2 

collapse (sum) value (first) freq1 (first) freq2, by(class1 class2)
gen RH = value / (freq1 * freq2)

gen class = "数实融合" if class1 != class2 
replace class = "数数融合" if class1 == "数字经济产业" & class2 == "数字经济产业" 
replace class = "实实融合" if class1 == "实体产业" & class2 == "实体产业" 

keep class RH 
duplicates drop _all, force 

list

*>     +---------------------+
*>     |       RH      class |
*>     |---------------------|
*>  1. | 18.74731   实实融合   |
*>  2. | 1.294958   数实融合   |
*>  3. | 52.50888   数数融合   |
*>     +---------------------+

不过计算结果和原论文中的有些差异，令人费解。不过考虑到数字经济产业并不多，所以数实融合的专利不会很多，因此数实融合水平应该也不会很高。这个结果应该是合理的。

如果想分城市，只需要循环使用各城市的专利数据计算即可；如果想要分类别（农业、建筑业、制造业、建筑业及其他工业），只需要最后在分组计算的时候使用这个类别变量即可。

如何参加课程？

购买 RStata 名师讲堂会员都可以参加这个课程啦！

详情可阅读这篇推文：数据处理、图表绘制、效率分析与计量经济学如何学习～，更多关于 RStata 培训班的信息可添加微信号 r_stata 咨询（上图中有二维码）。

会员购买：从首页的会员卡专区即可查看和购买会员卡。

https://rstata.duanshu.com/#/card/list/

课程主页（点击文末的阅读原文即可跳转）：
https://rstata.duanshu.com/#/brief/course/e4f7e30bd24f4b67a637b702b08d2d6b

http://mp.weixin.qq.com/s?__biz=Mzg5ODE0MDgyNQ==&mid=2247519700&idx=4&sn=19236388c5d333664661ca4713be8189

RStata

一起学习 R 语言和 Stata 吧！