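MI methods for arbitrary missing-data patterns rest on two imputation strategies: the joint model approach and fully conditional specification (FCS). The joint model approach, also referred to as data augmentation, assumes that the data follow a multivariate normal distribution and uses Bayesian theory to draw imputed values from the joint posterior distribution. The Markov chain Monte Carlo (MCMC) method is an imputation method built on the joint model strategy. FCS is also known as multiple imputation by chained equations (MICE) or sequential regression multivariate imputation (SRMI). FCS does not specify a joint distribution for the data in advance; instead it builds a separate regression model from the conditional distribution of each variable and imputes through a series of iterations. Because FCS imputes variable by variable, it can supply a tailored imputation model for each type of variable and so handles mixed missingness across covariates of different types flexibly, at the price of a heavier computational load. FCS has also been reported to outperform the joint model approach when imputing categorical variables and to be more robust when the imputation model is misspecified. Van Buuren recommends FCS for multiple imputation.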
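MCMC fits a multivariate joint Gaussian (multivariate normal) imputation model, that is, a joint distribution for all variables, whereas FCS builds an imputation model for one variable at a time.
Neither MCMC nor FCS places any restriction on the missing-data pattern.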
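To make the contrast with MCMC concrete, here is a minimal sketch of an FCS call in PROC MI. It reuses the dataset and variable names from the examples in this article; the output name DATAIN_FCS and the NIMPUTE, NBITER, and SEED values are illustrative assumptions, not recommendations.

```sas
/* Sketch only: FCS imputation for an arbitrary missing-data pattern.      */
/* DATAIN_FCS and the nimpute/nbiter/seed values are assumptions.          */
proc mi data=DATAIN out=DATAIN_FCS nimpute=20 seed=2024;
   class TRT;       /* unlike MCMC, FCS accepts CLASS variables */
   var TRT SCORE_0 SCORE_1 SCORE_2 SCORE_3;
   fcs nbiter=20;   /* no method given: continuous variables use regression */
run;
```

Because no method is named on the FCS statement, the continuous scores are imputed with the default regression method, each conditional on the other variables in the VAR list.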
Monotone missing data are imputed with a sequence of regression models (monotone regression).
SAS code:
/* Sequential regression imputation; DATAIN must already have a monotone missing pattern */
proc mi data=DATAIN out=DATAOUT;
  var TRT SCORE_0 SCORE_1 SCORE_2;
  monotone regression;
run;
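/* Step 1: partial MCMC imputation; fill only the intermittent gaps so that each dataset becomes monotone */
proc mi data=DATAIN out=DATAIN_MONO nimpute=100 seed=123;
  var TRT SCORE_0 SCORE_1 SCORE_2 SCORE_3;
  mcmc chain=multiple impute=monotone;
run;
proc sort data=DATAIN_MONO; by _Imputation_ TRT; run;
/* Step 2: one monotone regression imputation within each of the 100 datasets */
proc mi data=DATAIN_MONO out=DATAIN_REG seed=465 nimpute=1;
  by _Imputation_;
  var TRT SCORE_0 SCORE_1 SCORE_2 SCORE_3;
  class TRT;
  monotone regression;
run;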
/* Step 3: reshape to one record per visit and derive change from baseline */
data DATAIN_REG1; set DATAIN_REG;
  TIMEPTN=1; SCORE_C = SCORE_1 - SCORE_0; OUTPUT;
  TIMEPTN=2; SCORE_C = SCORE_2 - SCORE_0; OUTPUT;
  TIMEPTN=3; SCORE_C = SCORE_3 - SCORE_0; OUTPUT;
run;
proc sort data=DATAIN_REG1; by _Imputation_ TIMEPTN TRT; run;
/* Step 4: ANCOVA per imputed dataset and visit */
proc mixed data=DATAIN_REG1;
  by _Imputation_ TIMEPTN;
  class TRT;
  model SCORE_C = TRT SCORE_0 / solution covb;
  lsmeans TRT / diff=control('0') cl;
  ods output Diffs=DIFF_MI LSMeans=LSM_MI;
run;
/* Step 5: pool the LSM differences across imputations with Rubin's rules, by visit */
proc sort data=DIFF_MI; by TIMEPTN _Imputation_; run;
proc mianalyze parms(classvar=full)=DIFF_MI;
  by TIMEPTN;
  class TRT;
  modeleffects TRT;
  ods output ParameterEstimates=DIFF_MIAN;
run;
/* Step 6: pool the LS means themselves in the same way */
proc sort data=LSM_MI; by TIMEPTN _Imputation_; run;
proc mianalyze parms(classvar=full)=LSM_MI;
  by TIMEPTN;
  class TRT;
  modeleffects TRT;
  ods output ParameterEstimates=LSM_MIAN;
run;
SAP template:
MI assuming MAR using a joint imputation model for repeated measurements
Mean changes from baseline in MEASURE1 will be analyzed based on data observed while the subject remains on study as well as data imputed using multiple imputation (MI) methodology for time points at which no value is observed.
Multiple imputation will be performed under the assumption of missing-at-random (MAR) and will be implemented with a single joint imputation model using {software, version}.
The Multivariate Normal imputation model for MEASURE1 across {Visit x,…,y} will include the fixed, categorical effects of treatment, {list of baseline covariates}, visit, and treatment-by-visit interaction, as well as the continuous, fixed covariates of baseline score and baseline score-by-visit-interaction. A single shared unstructured covariance matrix will be used. {Explain if any aspect is different from model for direct likelihood}.
The MCMC method will be used with a single chain, 2000 tuning units, a minimum number of 4 tuning cycles, a burn-in of 1000, a thinning of 100, and non-informative priors for all parameters.
Imputed data will consist of {MM} imputed datasets. The random seed number for the MCMC stage will be {XXXXX}, and the random seed number for imputation stage will be {YYYYYY}.
Each of the {MM} imputed datasets will be analyzed using the following analysis method. Change in MEASURE1 from baseline to each post-baseline visit will be calculated based on observed and imputed data.
Data will be analyzed with ANCOVA using fixed, categorical effects of treatment, {list of baseline covariates}, as well as the continuous, fixed covariates of baseline score {specify the same terms as in the repeated measures model but without crossing with visit}. The same variance will be used for each arm. Treatment group comparison at Visit {T} will be based on the least squares mean (LSM) difference between treatment groups in change from baseline in MEASURE1 estimated by the ANCOVA model in each of the imputed datasets. Results from analysis of each imputed dataset, i.e., LSM treatment differences and their standard errors, will be combined using Rubin’s imputation rules to produce a pooled LSM estimate of treatment difference, its 95% confidence interval, and a pooled p-value for the test of null hypothesis of no treatment effect. The latter will be adjusted to allow for the degrees of freedom in the ANCOVA model.
MI assuming MAR with sequential imputation for monotone missing data and multivariate Gaussian model/partial MCMC imputation for non-monotone missing data
Mean changes from baseline in MEASURE1 will be analyzed based on data observed while the subject remains on study as well as data imputed using multiple imputation (MI) methodology for time points at which no value is observed.
Multiple imputation will be performed under the assumption of missing-at-random (MAR) and will be implemented in two steps using {software, version}.
First, partial imputation assuming MAR will be carried out to impute intermittent (non-monotone) missing data based on a multivariate joint Gaussian imputation model using the Markov chain Monte Carlo (MCMC) method. A separate imputation model will be used for each treatment arm. The imputation models will include {list of baseline covariates} and MEASURE1 assessments at each time point {Baseline, Visit x,…,y}. The MCMC method in the MI procedure in SAS will be used with multiple chains, 200 burn-in iterations, and a non-informative prior. In case of non-convergence or non-estimability issues, a ridge prior and a single model will be considered, with treatment arm added as an explanatory variable to the model.
The remaining monotone missing data will be imputed using sequential regression multiple imputation, where a separate regression model is estimated for imputation of each variable (i.e., measurement at each time point). Each regression model will include explanatory variables for {list of baseline covariates}, treatment and all previous (Baseline, Visit x,…,y) values of MEASURE1.
No rounding or range restrictions will be applied to imputed continuous values.
Imputed data will consist of {MM} imputed datasets. The random seed number for partial imputation with the MCMC method will be {XXXXX}, and the random seed number for the sequential regression multiple imputation will be {YYYYYY}.
Each of the {MM} imputed datasets will be analyzed using the following analysis method. Change in MEASURE1 from baseline to each post-baseline visit will be calculated based on observed and imputed data. {Insert description of analysis model/method, e.g., direct likelihood MMRM as described above, or ANCOVA.} Treatment group comparison at Visit T will be based on the least squares mean (LSM) difference between treatment groups in change from baseline in MEASURE1 estimated by the analysis model in each of the imputed datasets. Results from analysis of each imputed dataset, i.e., LSM treatment differences and their standard errors, will be combined using Rubin’s imputation rules to produce a pooled LSM estimate of treatment difference, its 95% confidence interval, and a pooled p-value for the test of null hypothesis of no treatment effect.
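Both templates pool results with Rubin's rules. For reference, the rules can be written out as follows, using notation assumed here rather than taken from the template: $\hat{Q}_k$ is the LSM treatment difference from imputed dataset $k$, $\hat{U}_k$ its squared standard error, $m$ the number of imputed datasets ({MM} in the template), and $Q_0$ the null value.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{k=1}^{m}\hat{Q}_k
  && \text{(pooled estimate)}\\
W &= \frac{1}{m}\sum_{k=1}^{m}\hat{U}_k
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{k=1}^{m}\bigl(\hat{Q}_k-\bar{Q}\bigr)^2
  && \text{(between-imputation variance)}\\
T &= W + \Bigl(1+\frac{1}{m}\Bigr)B
  && \text{(total variance)}\\
\frac{\bar{Q}-Q_0}{\sqrt{T}} &\sim t_{\nu_m},
  \qquad
  \nu_m = (m-1)\left[1+\frac{W}{(1+m^{-1})\,B}\right]^{2}
\end{align*}
```

The pooled 95% confidence interval is then $\bar{Q} \pm t_{\nu_m,\,0.975}\sqrt{T}$, and the pooled p-value comes from the same $t_{\nu_m}$ reference distribution.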
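Take-home messages:
1. MCMC uses a multivariate normal imputation model, so every variable has to be treated as continuous. In SAS PROC MI the CLASS statement cannot be used with the MCMC method, so categorical variables must be coded as binary indicator variables before they can enter the imputation model. In that case the SAP can state: “A categorical variable with K levels will be represented in the imputation model by (K-1) binary indicator variables reflecting membership in each non-reference category.” Further detail on the dummy coding can be added for clarity. If a categorical variable with missing data has to be imputed, the imputed values will be continuous (with decimals) and may need to be mapped back to categories before analysis. An alternative is to specify the categorical variable in the BY statement rather than in the VAR statement, so that a separate imputation model is fitted within each category.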
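A minimal sketch of the (K-1) indicator coding described in item 1. The variable REGION (character, levels 'A', 'B', 'C'), the datasets DATAIN_DUMMY and DATAIN_MCMC, and the NIMPUTE and SEED values are hypothetical and serve only as illustration; TRT is assumed to be numeric.

```sas
/* Sketch only: REGION and the dataset names below are hypothetical. */
data DATAIN_DUMMY;
   set DATAIN;
   /* K = 3 levels -> K-1 = 2 indicators; 'A' is the reference category */
   REGION_B = (REGION = 'B');
   REGION_C = (REGION = 'C');
   /* keep the indicators missing when the category itself is missing */
   if missing(REGION) then call missing(REGION_B, REGION_C);
run;

proc mi data=DATAIN_DUMMY out=DATAIN_MCMC nimpute=20 seed=123;
   /* no CLASS statement: MCMC treats every variable as continuous */
   var TRT REGION_B REGION_C SCORE_0 SCORE_1 SCORE_2 SCORE_3;
   mcmc chain=multiple;
run;
```

If REGION itself had missing values, the imputed REGION_B and REGION_C would be continuous and would need to be mapped back to a category before analysis, as noted in item 1.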
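2. The sample text above assumes that monotone missing data arise because subjects withdraw prematurely from the study as a whole. Other situations are possible, for example when subjects may stop study treatment early but are expected to remain in the study and be assessed according to the schedule of assessments, and/or when rescue therapy can be started and taken alongside study treatment. In such cases the sample text may need to be modified to specify whether data collected after discontinuation of study treatment and/or after initiation of rescue therapy are used in a particular analysis. For example, in the following sentence, the phrases marked by Option 1 and Option 2 (italicized in the original template) can be replaced as suggested: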
“Mean changes from baseline in MEASURE1 will be analyzed based on data observed while the subject remains on study [or see Option 1] as well as data imputed using multiple imputation (MI) methodology for time points at which no value is observed [and see Option 2].”
Option 1: … subject remains on
  o study, regardless of adherence to randomized treatment or concomitant administration of rescue therapy
  o randomized study treatment, regardless of concomitant administration of rescue therapy
  o randomized study treatment until initiation of rescue therapy
  o randomized study treatment
Option 2: … no value is observed
  o and at time points following premature discontinuation of randomized treatment
  o and at time points following premature discontinuation of randomized treatment or initiation of rescue therapy
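3. If the parameter estimated in the analysis of each imputed dataset is not approximately normally distributed (for example, an odds ratio), the estimate and its standard error must be put through a normalizing transformation before the results are combined with Rubin's rules. For an odds ratio, this means applying the log transformation. This step needs to be stated in the description of how results from the multiply imputed datasets are pooled.
Van Buuren (2012) proposes transformations that can be applied to several types of estimated statistic (see Table 1, which partly reproduces the summary table from Van Buuren's book). He also discusses methods for multivariate Wald tests, likelihood ratio tests, chi-square tests, and some custom hypothesis tests on model parameters with multiply imputed data, but notes that the last two, the chi-square test (Rubin, 1997; Li et al., 1991) and custom hypothesis tests, may not be very reliable and that practical experience with them is still limited.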
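For the odds ratio case in item 3, one possible workflow is to pool on the log-odds scale and back-transform only at the end. The sketch below assumes a multiply imputed dataset DATAIN_BIN with a binary response RESP and a numeric 0/1 TRT; these names, and the output dataset names, are hypothetical.

```sas
/* Sketch only: DATAIN_BIN, RESP and the output dataset names are hypothetical. */
/* The logistic coefficient of TRT is the log odds ratio, which is              */
/* approximately normal, so Rubin's rules are applied on that scale.            */
proc logistic data=DATAIN_BIN;
   by _Imputation_;
   model RESP(event='1') = TRT SCORE_0;
   ods output ParameterEstimates=LOGIT_PARMS;
run;

proc mianalyze parms=LOGIT_PARMS;
   modeleffects TRT;
   ods output ParameterEstimates=LOGIT_POOLED;
run;

/* Back-transform the pooled log odds ratio and its confidence limits */
data OR_POOLED;
   set LOGIT_POOLED;
   ODDS_RATIO = exp(Estimate);
   OR_LCL     = exp(LCLMean);
   OR_UCL     = exp(UCLMean);
run;
```

The p-value is read from the pooled log-scale test; only the point estimate and confidence limits are exponentiated.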
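4. Hypothesis tests based on the pooled parameter estimate and its standard error use a t statistic whose degrees of freedom depend on the number of imputations and on the within-imputation and between-imputation variances. When the complete-data analysis model has few degrees of freedom and the proportion of missing data is small, this standard definition of the degrees of freedom for the pooled test can be inappropriate. In that case it is advisable to use adjusted degrees of freedom, which take the complete-data degrees of freedom into account. In SAS this is available through the EDF= option of PROC MIANALYZE. If adjusted degrees of freedom are planned, this needs to be mentioned in the description of how results from the multiply imputed datasets are pooled. In fact, this approach could be recommended as the default general practice. The Variance Information table in the SAS output contains everything needed to compute the degrees of freedom: the core quantities are the total variance, the number of imputations, and the within-imputation variance. To compute the adjusted degrees of freedom, one only has to plug in, in addition, the degrees of freedom the t statistic would have had with no missing data.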
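The adjustment in item 4 can be written out explicitly. The notation is assumed here for illustration: $m$ is the number of imputations, $W$, $B$, and $T = W + (1+m^{-1})B$ are the within-imputation, between-imputation, and total variances, and $\nu_{\mathrm{com}}$ is the complete-data degrees of freedom.

```latex
\begin{align*}
\nu_m &= (m-1)\left[1+\frac{W}{(1+m^{-1})\,B}\right]^{2}
  && \text{(standard degrees of freedom)}\\
\gamma &= \frac{(1+m^{-1})\,B}{T}
  && \text{(share of total variance due to missingness)}\\
\nu_{\mathrm{obs}} &= \frac{\nu_{\mathrm{com}}+1}{\nu_{\mathrm{com}}+3}\,
  \nu_{\mathrm{com}}\,(1-\gamma)\\
\nu_{\mathrm{adj}} &= \left(\frac{1}{\nu_m}+\frac{1}{\nu_{\mathrm{obs}}}\right)^{-1}
  && \text{(adjusted degrees of freedom)}
\end{align*}
```

Since $\nu_{\mathrm{adj}}$ is smaller than both $\nu_m$ and $\nu_{\mathrm{obs}}$, it can never exceed the complete-data degrees of freedom, which is the point of the adjustment.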
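5. For a binary variable derived from a continuous variable, imputation can be applied directly to the underlying continuous variable; the binary variable is derived after multiple imputation is complete, and analysis and pooling follow.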
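A minimal sketch of item 5, deriving the binary variable only after imputation. The responder cut-off of -5 and the names RESP, SCORE_C3, and DATAIN_BIN are hypothetical; DATAIN_REG is the multiply imputed dataset from the code earlier in this article.

```sas
/* Sketch only: the cut-off (-5) and the names RESP, SCORE_C3, DATAIN_BIN are hypothetical. */
data DATAIN_BIN;
   set DATAIN_REG;                 /* imputed continuous scores, all imputations */
   SCORE_C3 = SCORE_3 - SCORE_0;   /* change from baseline at the last visit */
   /* In SAS a missing numeric compares lower than any number, so guard   */
   /* the comparison to avoid classifying a missing change as a responder */
   if not missing(SCORE_C3) then RESP = (SCORE_C3 <= -5);
run;
```

Each imputed dataset then carries its own RESP values, which are analyzed per imputation and pooled in the usual way.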
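6. The covariates in the MI model and in the model of the analysis procedure are essentially the same, or the analysis model has fewer, because the MI model also contains auxiliary variables. Auxiliary variables can be selected by univariate analysis of dropout rates against baseline characteristics in the two arms, by logistic regression, or from KM curves. A variable that this exploration did not pick up but that the physician wants included will in the end be included on the physician's advice. Researchers generally recommend including as many variables as possible, together with plausible interaction terms. As a rule of thumb, once there are more than 15 predictors, the additional variance explained in a linear regression is almost negligible. A possible priority order for selecting variables is: variables in the complete-data model, variables likely to affect missingness, and variables that explain a large share of the variance (finally dropping predictors that are themselves missing too often).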
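7. In PROC MI, specifying IMPUTE=MONOTONE yields datasets with a monotone missing pattern, after which one monotone regression imputation is run on each imputed dataset. With CHAIN=MULTIPLE the procedure uses multiple chains and completes the default 200 burn-in iterations before each imputation so that the chain converges to its stationary distribution. The default prior is the non-informative Jeffreys prior, and the EM algorithm supplies the starting values for the MCMC iterations. INITIAL=EM(ITPRINT) prints the iteration history of the EM algorithm used to compute the starting values, and DISPLAYINIT prints the starting values of the Markov chain (that is, the values at which EM converged).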
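The options named in item 7 can be spelled out in a single call. This sketch repeats the first PROC MI step from the code earlier in this article and simply writes the defaults out explicitly alongside the two diagnostic options.

```sas
/* Sketch only: same call as the partial-imputation step above, with the */
/* default burn-in and prior written out and EM diagnostics switched on. */
proc mi data=DATAIN out=DATAIN_MONO nimpute=100 seed=123;
   var TRT SCORE_0 SCORE_1 SCORE_2 SCORE_3;
   mcmc chain=multiple impute=monotone
        nbiter=200 prior=jeffreys     /* defaults, stated for transparency */
        initial=em(itprint)           /* print the EM iteration history */
        displayinit;                  /* print the MCMC starting values */
run;
```

Writing the defaults out makes the SAP text ("multiple chains, 200 burn-in iterations, and a non-informative prior") directly traceable to the program.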
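8. All of the SAS code is available at http://www.missingdata.org.uk; see also the two earlier articles, “Multiple imputation of categorical variables” and “Multiple imputation of missing values”.
9. Imputation based on retrieved dropout (RDO) data, that is, data from subjects in the same arm who discontinued treatment but stayed in the study and continued follow-up, is a derivative of MAR.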
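10. On estimands for non-inferiority/equivalence trials: in superiority trials the treatment policy strategy is commonly used because it is closer to real-world treatment conditions, but it is also a relatively conservative strategy, and in a non-inferiority setting it reduces the assay sensitivity of the test. As for missing-data handling, common methods such as LOCF, BOCF, and R2B reduce the assay sensitivity of the trial, and reference-based methods reduce the assay sensitivity of the non-inferiority test; a suitable missing-data method such as RDO should therefore be chosen to preserve the assay sensitivity of the trial.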
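11. Under the treatment policy strategy for TTE (time-to-event) variables, data after the intercurrent event should be collected as far as possible; if data are nevertheless missing, every effort should be made to recover what actually happened after the intercurrent event. The usual assumption of non-informative censoring may not be appropriate here, and reference-based or delta-adjusted multiple imputation can be used to recover the missing data. The while-on-treatment strategy for TTE variables considers only the endpoint before the intercurrent event, and in TTE analysis it is closely tied to competing-risks models. Although the while-on-treatment strategy also censors time at the intercurrent event, its target is the effect of the investigational drug in the presence of the intercurrent event as a competing event. Paralleling the hypothetical strategy based on non-informative censoring, the while-on-treatment strategy has analogous (but different) population-level summaries: the cause-specific hazard, the cumulative incidence, and the subdistribution hazard. The models commonly used with it are the Cox model and the Fine-Gray model. The hypothetical strategy generally assumes that the intercurrent event does not occur and censors the subject's time at the intercurrent event. Classical TTE methods assume non-informative censoring and analyze the data with the KM method and the Cox model. Depending on the situation, methods that allow for informative censoring, such as IPCW and RPSFT, can be used instead.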
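12. Adding or selecting experimental treatment arms; changing the control arm; adding, selecting, or pooling populations; other types of adaptive design, MRCTs, and so on. Regulators may differ in their views on how intercurrent events and missing values should be handled, so the strategy needs to be discussed with them in advance.
13. Multiple imputation was founded by Little and Rubin, revived by Stef van Buuren, and brought into full practical application by Carpenter and Kenward. The original books by these three groups of authors are worth reading closely; after working through them, you can largely teach yourself the rest.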
References:
1. Van Buuren, S. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC Press.
2. Ratitch, B., Lipkovich, I., O’Kelly, M. 2013. “Combining Analysis Results from Multiply Imputed Categorical Data.” Proceedings of the Pharmaceutical SAS Users Group Conference (PharmaSUG 2013), paper SP03. Cary, NC: SAS Institute Inc.
3. http://www.missingdata.org.uk, SAP Text_Describing_analyses_Final_2016-08-11