Source: Documenting Research in Scientific Articles: Guidelines for Authors, 3. Reporting Multivariate Analyses
Author: Tom Lang
CHEST, 2007, 131:628-632
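Multivariate analysis covers two families of statistical techniques: regression analysis and analysis of variance (ANOVA).
I. Reporting Regression Analyses
The purpose of regression analysis is to predict or estimate a response variable from one or more known variables (explanatory variables, or predictors). The form of the analysis is determined by the explanatory (independent) variables, the response (dependent) variable, and the level of measurement of those variables.
The level of measurement describes how much information is collected about a variable. Nominal (categorical) data have unordered categories, such as blood type (A, B, AB, O). Ordinal data have ordered categories, such as the severity of a reaction (mild, moderate, severe). Continuous data take values on a continuous scale with equal intervals.
The level of measurement can also be set by the researcher, who defines the criteria. Blood pressure, for example, can be recorded as a nominal variable (hypertensive, not hypertensive), an ordinal variable (hypotensive, normotensive, hypertensive), or a continuous variable (the value read from a sphygmomanometer).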
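As a rough illustration of how one measurement can be recorded at different levels, the sketch below recodes a continuous systolic reading into ordinal and nominal categories. The function name `recode_blood_pressure` and the cut-offs of 90 and 140 mmHg are assumptions made for this example; the article does not give thresholds.

```python
def recode_blood_pressure(systolic_mmhg, low=90.0, high=140.0):
    """Return one reading expressed at three levels of measurement.

    `low` and `high` are illustrative cut-offs (assumptions), not values
    taken from the article.
    """
    if systolic_mmhg < low:
        ordinal = "hypotensive"
    elif systolic_mmhg < high:
        ordinal = "normotensive"
    else:
        ordinal = "hypertensive"
    nominal = "hypertensive" if systolic_mmhg >= high else "not hypertensive"
    return {
        "continuous": systolic_mmhg,  # the raw measured value
        "ordinal": ordinal,           # three ordered categories
        "nominal": nominal,           # two unordered categories
    }
```

Each recoding discards information, which is why the level of measurement should be stated before the choice of regression model is discussed.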
Commonly used types of regression analysis:
1. Simple linear regression
2. Multiple linear regression
3. Simple logistic regression
4. Nonlinear regression
5. Polynomial regression
6. Cox proportional hazards regression
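Guidelines for reporting regression analyses:
1. Describe the purpose of the analysis, and state which variables are the independent variables (causes) and which is the dependent variable (effect).
2. Describe the data for each variable.
For continuous variables, report the mean and standard deviation (normally distributed data), or the median, range, and first and third quartiles (non-normally distributed data). For categorical variables, report counts and frequencies.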
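The summaries named in guideline 2 can be sketched with Python's standard library. The function names are hypothetical, and the quartiles use the default method of `statistics.quantiles`; other software may use a different quartile convention and give slightly different values.

```python
import statistics
from collections import Counter


def describe_continuous(values, normal=True):
    """Summarise a continuous variable.

    normal=True  -> mean and sample standard deviation
    normal=False -> median, range (min, max), first and third quartiles
    """
    if normal:
        return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }


def describe_categorical(values):
    """Return {level: (count, relative frequency)} for a categorical variable."""
    counts = Counter(values)
    return {level: (n, n / len(values)) for level, n in counts.items()}
```

For example, `describe_categorical(["A", "A", "B", "O"])` gives a count of 2 and a frequency of 0.5 for blood type A.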
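3. Identify the assumptions that apply to each variable and state how they were checked.
Each assumption should be checked with a formal hypothesis test or informally, for example with plots such as residual plots. If the data violate an assumption, adjust them, for example with a log transformation to bring them closer to a normal distribution.
4. Report any missing data and how they were handled.
In multivariate analysis, missing values often reduce the amount of usable data: if, for example, a subject's age (Xj) was not recorded, that subject's entire record is dropped.
Missing values can often be filled in by imputation. The simplest approaches substitute the mean of all observed values, the mean of subjects from the same observation period and the same group, or the mean of subjects surveyed just before and after.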
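The simplest of these approaches, replacing each missing value with the mean of all observed values, might look like the sketch below. The function name `impute_mean` and the use of `None` to mark a missing value are assumptions of this example. The count of imputed values is returned as well, because the guideline asks for missing data to be reported.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed entries.

    Returns (filled_values, number_of_imputed_values).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    filled = [mean if v is None else v for v in values]
    return filled, len(values) - len(observed)


ages = [30, None, 50]
filled_ages, n_imputed = impute_mean(ages)
print(filled_ages, n_imputed)
```

Mean imputation keeps every record in the analysis but understates the variability of the imputed variable, so the method used should always be stated.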
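5. Report how outliers were handled.
Outliers must not be ignored; even a single outlier can seriously affect the results. Report every outlier; the results obtained with and without the outliers can also be compared.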
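To show how much a single outlier can matter, the sketch below fits a simple least-squares line with and without one suspect observation and returns both slopes. The data and function names are invented for illustration.

```python
def ls_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_with_and_without(xs, ys, outlier_index):
    """Return (slope using all data, slope after dropping one observation)."""
    kept = [i for i in range(len(xs)) if i != outlier_index]
    slope_all = ls_slope(xs, ys)
    slope_kept = ls_slope([xs[i] for i in kept], [ys[i] for i in kept])
    return slope_all, slope_kept


# Hypothetical data: the last point lies far from the trend of the others.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 30]
print(slope_with_and_without(xs, ys, outlier_index=4))
```

In this made-up data set the slope is 6 with the outlier and 2 without it, the kind of difference a report should disclose.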
6. Report the regression model.
A typical report of a linear, multiple, or logistic regression model looks like this:
Y = 40.8 + 3.98 X1 + 1.22 X2 - 2.09 X3
Figure 1. A multiple linear regression equation. In this example, the model predicts overall function score, Y, for patients with multiple sclerosis based on: disease severity, X1; ambulatory ability (measured as the rate of walking in laps per minute), X2; and number of lesions, X3. Here, X1, X2, and X3 are explanatory variables (sometimes called risk factors); the numbers in front of the X values are called regression coefficients or β-weights. (40.8 is the Y intercept, where the line crosses the Y axis.) Coefficients are interpreted as follows: if X1 and X3 are held constant (or "controlling for" disease severity and number of lesions), then mean functional score increases by about 1.22 points (the coefficient for X2) for each additional lap per minute. The final model had a coefficient of multiple determination, R², of 0.58, indicating that the three variables in the model explain 58% of the variation in the response variable.
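The equation in Figure 1 can be evaluated directly. The sketch below uses the coefficients from the figure; the function name and the input values are hypothetical. It also shows that, with X1 and X3 held fixed, one additional lap per minute changes the predicted score by the X2 coefficient, 1.22.

```python
def predict_function_score(severity, laps_per_minute, lesions):
    """Predicted overall function score Y from the Figure 1 equation."""
    return 40.8 + 3.98 * severity + 1.22 * laps_per_minute - 2.09 * lesions


# Hypothetical patient: severity 2, three laps per minute, three lesions.
baseline = predict_function_score(2, 3, 3)
one_more_lap = predict_function_score(2, 4, 3)
print(round(one_more_lap - baseline, 2))  # difference equals the X2 coefficient
```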
Table 1. A Table for Reporting a Multiple Linear Regression Model With Three Explanatory Variables*
Variable | Coefficient (β) | SE | 95% CI | Wald χ² | p Value†
Intercept | 40.79 | 2.55 | | |
X1 | 3.98 | 2.37 | -0.67 to 8.63 | 1.68 | 0.10
X2 | 1.23 | 0.29 | 0.66 to 1.80 | 4.20 | < 0.001
X3 | -2.09 | 0.28 | -2.64 to -1.54 | -7.34 | < 0.001
*Intercept = a mathematical constant (no clinical interpretation); X1 to X3 = the explanatory variables; Coefficient = the mathematical weightings of the explanatory variables in the equation (the regression coefficient or β-weight); SE = estimated precision of the coefficients; 95% CI = 95% confidence intervals for the coefficients; Wald χ² = the Wald test statistic calculated from the data to be compared with the χ² distribution with 1 degree of freedom.
†Variables X2 and X3 are statistically significant independent predictors of the response variable.
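The 95% CI column of Table 1 is consistent with the usual normal-approximation interval, coefficient ± 1.96 × SE. The sketch below recomputes it from the coefficient and SE columns. The multiplier 1.96 is an assumption of this sketch (the table does not say how its intervals were computed), and the function name is hypothetical.

```python
Z_95 = 1.96  # normal-approximation multiplier for a 95% interval (assumed)


def wald_ci(coef, se, z=Z_95):
    """Return (lower, upper) limits of coef +/- z * se."""
    return coef - z * se, coef + z * se


# Coefficient and SE columns copied from Table 1.
TABLE_1 = {"X1": (3.98, 2.37), "X2": (1.23, 0.29), "X3": (-2.09, 0.28)}

for name, (coef, se) in TABLE_1.items():
    low, high = wald_ci(coef, se)
    print(f"{name}: {low:.2f} to {high:.2f}")
```

An interval that includes 0, as for X1, corresponds to a coefficient that is not statistically significant at the 0.05 level.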
Table 2. A Table for Reporting a Multiple Logistic Regression Model With Four Explanatory Variables*
Variable | Coefficient (β) | SE | Wald χ² | p Value | Odds Ratio | 95% CI
Intercept | -1.88 | 0.48 | | | |
X1 | 1.435 | 0.589 | 5.93 | 0.02 | 4.2 | 1.32–13.33
X2 | -0.847 | 0.690 | 1.51 | 0.22 | 0.43 | 0.11–1.66
X3 | 3.045 | 1.260 | 5.84 | 0.02 | 21.01 | 1.78–248.29
X4 | 2.200 | 0.990 | 4.94 | 0.03 | 9.03 | 1.30–62.83
*Odds Ratio = controlling for other variables in the model, for every unit increase in, for example, variable 1, the odds of having the event of interest increase by a factor of 4.2 (likewise, controlling for other variables in the model, for every unit increase in, for example, variable 2, the odds of having the event are multiplied by 0.43, a decrease of 57%); 95% CI = the 95% confidence interval for the estimated odds ratio. See Table 1 for other abbreviations or explanations not used in the text.
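The odds ratio, its confidence interval, and the Wald χ² in Table 2 all follow from the coefficient and its SE: the odds ratio is exp(β), the interval is exp(β ± 1.96 × SE), and the Wald statistic is (β / SE)². The sketch below applies these standard formulas to the X1 row. The 1.96 multiplier and the function name are assumptions of this sketch, and recomputed values can differ from the table in the last digit because of rounding.

```python
import math


def odds_ratio_summary(coef, se, z=1.96):
    """Odds ratio, 95% CI limits, and Wald chi-square for one logistic coefficient."""
    return {
        "odds_ratio": math.exp(coef),
        "ci_low": math.exp(coef - z * se),
        "ci_high": math.exp(coef + z * se),
        "wald_chi2": (coef / se) ** 2,
    }


# X1 row of Table 2: coefficient 1.435, SE 0.589.
summary_x1 = odds_ratio_summary(1.435, 0.589)
print({key: round(value, 2) for key, value in summary_x1.items()})
```

A negative coefficient, as for X2, gives an odds ratio below 1, meaning the odds fall as the variable increases.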
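For each regression coefficient, report the p value, the 95% confidence interval, and related statistics (Table 1).
For logistic regression, also report the odds ratio and its 95% CI (Table 2).
7. Report the statistical technique used to select the multiple regression model. The first step is to set the significance level for including a variable, usually expressed as a p value such as p = 0.05 or p = 0.1.
The second step is to choose the selection method, for example forward, backward, stepwise, or best-subset selection.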
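8. For multiple regression, check every variable for collinearity.
If a variable (xi) is correlated with one or more of the other explanatory variables (xj, xk, and so on), that is, not independent of them, the variables are collinear.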
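One simple screen for collinearity is to compute the Pearson correlation between every pair of explanatory variables and flag the pairs that are strongly correlated. In the sketch below the 0.8 threshold, the function names, and the data are assumptions for illustration. Pairwise correlation will not catch a variable that depends on a combination of several others, so it is a first check rather than a complete diagnostic.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, xa), (name_b, xb) in combinations(predictors.items(), 2):
        r = pearson_r(xa, xb)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical predictors: x2 is almost a multiple of x1.
predictors = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],
    "x3": [5, 1, 4, 2, 3],
}
print(flag_collinear(predictors))
```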
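9. For multiple regression, check for interaction effects.
Suppose a multiple regression has explanatory variables xi and xj. If the product xi·xj is itself a term that affects the dependent variable, then xi and xj interact. Regression analyses usually need to test whether such an interaction is present.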
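A small sketch can make the meaning of an interaction term concrete. With a product term in the model, the effect of a one-unit increase in xi is no longer a single number but depends on the value of xj. The coefficient names (b0, bi, bj, bij) and the numbers used below are hypothetical.

```python
def predict_with_interaction(xi, xj, b0, bi, bj, bij):
    """Linear model with an interaction term: b0 + bi*xi + bj*xj + bij*xi*xj."""
    return b0 + bi * xi + bj * xj + bij * xi * xj


def effect_of_xi(xj, bi, bij):
    """Change in the prediction for a one-unit increase in xi at a given xj."""
    return bi + bij * xj


# Hypothetical coefficients.
coefs = dict(b0=1.0, bi=2.0, bj=3.0, bij=0.5)
for xj in (0, 3):
    step = predict_with_interaction(1, xj, **coefs) - predict_with_interaction(0, xj, **coefs)
    print(xj, step, effect_of_xi(xj, coefs["bi"], coefs["bij"]))
```

If the fitted coefficient of the product term is not distinguishable from zero, the term can be dropped and each variable interpreted on its own.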
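10. Provide a measure of how well the model fits the data.
For simple linear regression, the simplest measure is the correlation coefficient (r). For multiple regression, the usual measure is the coefficient of determination (R²).
Residuals are another important criterion. Besides checking the model with residual plots, the residual values can be used to identify outliers.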
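The fit measures in guideline 10 can be computed by hand for a simple linear regression, as sketched below. R² is calculated as 1 minus the ratio of the residual sum of squares to the total sum of squares. The data and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus fitted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def r_squared(ys, resid):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum(e ** 2 for e in resid)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data.
xs = [1, 2, 3, 4]
ys = [1, 3, 2, 4]
intercept, slope = fit_line(xs, ys)
resid = residuals(xs, ys, intercept, slope)
print(round(r_squared(ys, resid), 2))
```

For a simple linear regression, R² equals the square of the correlation coefficient r; observations with unusually large residuals are candidates for the outlier check in guideline 5.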
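11. Validate the model with an independent set of data.
A. Build the model on 75% of the data, then check whether it holds on the remaining 25%.
B. Remove one observation at a time and build the model on the rest, then test the model on the observation that was removed. This is called a jackknife procedure.
C. Split the data into two groups, build a model on each, and compare whether the two models are similar.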
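Methods A and B can be sketched for a simple linear regression as follows. `holdout_split` sets aside a validation share (a quarter by default, matching the 25% in item A), and `jackknife_errors` refits the line with each observation left out in turn and records the prediction error for that observation. The function names, the random seed, and the data are assumptions of this sketch.

```python
import random


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def holdout_split(n, validation_share=0.25, seed=0):
    """Shuffle the indices 0..n-1 and return (training indices, validation indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_validation = round(n * validation_share)
    return indices[n_validation:], indices[:n_validation]


def jackknife_errors(xs, ys):
    """Leave each observation out, refit, and return its prediction error."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        errors.append(ys[i] - (intercept + slope * xs[i]))
    return errors


# Hypothetical data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
train_idx, validation_idx = holdout_split(len(xs))
print(len(train_idx), len(validation_idx))
print([round(e, 2) for e in jackknife_errors(xs, ys)])
```

Large leave-one-out errors relative to the ordinary residuals suggest that the model depends heavily on individual observations.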
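12. Report the statistical software used, for example SAS or SPSS.
II. Reporting ANOVA
Commonly used types of ANOVA:
1. One-way ANOVA
2. Two-way ANOVA
3. Multiway ANOVA
4. Analysis of covariance (ANCOVA)
Assesses the effect of one or more categorical variables on a continuous response variable while controlling for other explanatory variables (covariates), which may be continuous.
5. Repeated-measures ANOVA
Assesses the same subjects under different conditions or at different time points, for example a patient's blood pressure measured in three positions (lying, sitting, standing), or a patient's muscle strength measured 1, 5, 10, and 20 days after surgery.
A typical ANOVA report looks like this: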
Table 3. A Table for Presenting the Results of a Two-Way ANOVA for Analyzing the Two Factors Group and Age*
Source of Variation | df† | Sums of Squares | Mean Square | F Statistic | p Value
Group | 1 | 0.64 | 0.64 | 2.24 | 0.16
Age | 3 | 3.92 | 1.31 | 4.57 | 0.02
Group × age | 3 | 4.91 | 1.64 | 5.72 | 0.01
Error | 12 | 3.43 | 0.29 | |
*ANOVA = includes the two factors: group (two levels or categories) and age (four categories or levels), and the levels of each category should be stated in the description of the study (group and age significantly interact and so must be considered together); Source of variation = identification of the sources of variability in the response variable as the factors in the model (group, age, and the interaction between group and age) and as random error (the variability not explained by the factors); df = the degrees of freedom, a mathematical concept; Sums of squares = unlike one-way ANOVA, the sums of squares in multiway ANOVA are not easily explained and are best regarded as simply steps in the calculation of the mean squares; Mean square = the sums of squares divided by the degrees of freedom (essentially, estimates of the variation in the data); F statistic = the test statistic for the F distribution, for testing for interaction effects and main effects, equals the mean square for each factor divided by the mean square of the error; p Value = the probability values indicating the statistical significance of the effect of each factor on the response variable (eg, age and group interact [p = 0.01] in affecting the response variable and should be further investigated together; ie, the main effect of group or the main effect of age should not be investigated alone).
†For two groups, the df is 2 - 1, or 1. For four age categories, the df is 4 - 1, or 3. For the interaction effect between group and age (ie, group × age), the df values for each factor are multiplied (3 × 1 = 3).
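The mean square and F columns of Table 3 can be recomputed from the df and sums-of-squares columns, following the definitions in the footnote: each mean square is the sum of squares divided by its df, and each F statistic is the factor's mean square divided by the error mean square. The function name below is hypothetical, and recomputed values may differ from the table in the last digit because the published figures are rounded. The p values are not reproduced here, since they require the F distribution.

```python
def anova_rows(effects, error):
    """Compute mean squares and F statistics.

    effects: {source name: (df, sum of squares)}
    error:   (df, sum of squares) for the error term
    """
    error_df, error_ss = error
    ms_error = error_ss / error_df
    rows = {}
    for name, (df, ss) in effects.items():
        mean_square = ss / df
        rows[name] = {"df": df, "mean_square": mean_square, "f": mean_square / ms_error}
    rows["Error"] = {"df": error_df, "mean_square": ms_error, "f": None}
    return rows


n_groups, n_ages = 2, 4
effects = {
    "Group": (n_groups - 1, 0.64),
    "Age": (n_ages - 1, 3.92),
    "Group x age": ((n_groups - 1) * (n_ages - 1), 4.91),  # df values multiplied
}
table_3 = anova_rows(effects, error=(12, 3.43))
for name, row in table_3.items():
    print(name, row)
```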