Reporting Multivariate Analyses

 

Chiachung Chen (陳加忠), Biosystems Engineering Laboratory, National Chung Hsing University

 

 

Source: Documenting Research in Scientific Articles: Guidelines for Authors, 3.Reporting Multivariate Analyses

Author: Tom Lang

CHEST, 2007, 131:628-632

 

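Multivariate analysis covers two statistical techniques: regression analysis and analysis of variance (ANOVA).

I. Reporting Regression Analyses

The purpose of regression analysis is to predict or estimate a response variable from one or more known variables (explanatory variables, or predictors). The type of regression analysis is determined by the explanatory (independent) variables, the response (dependent) variable, and the level of measurement of each variable.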

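The level of measurement describes how much information is collected about a variable. Categorical (nominal) data have unordered categories, such as blood type (A, B, AB, O). Ordinal data have ordered categories, such as the severity of a reaction (mild, moderate, severe). Continuous data are values on a continuous scale with equal intervals.

The level of measurement can also be set by the researcher, who defines the criteria. Blood pressure, for example, can be recorded as a nominal variable (hypertensive, not hypertensive), an ordinal variable (hypotensive, normotensive, hypertensive), or a continuous variable (the value read from a sphygmomanometer).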
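As a minimal sketch of this idea, the function below records one blood pressure reading at all three levels of measurement. The function name and the cut-offs of 90 and 140 mmHg are illustrative assumptions, not values given in the source.

```python
def blood_pressure_levels(systolic_mmhg, low=90, high=140):
    """Express one systolic reading at three levels of measurement.

    The thresholds `low` and `high` are hypothetical cut-offs chosen
    only for illustration.
    """
    if systolic_mmhg < low:
        ordinal = "hypotensive"
    elif systolic_mmhg < high:
        ordinal = "normotensive"
    else:
        ordinal = "hypertensive"
    nominal = "hypertensive" if systolic_mmhg >= high else "not hypertensive"
    return {"continuous": systolic_mmhg, "ordinal": ordinal, "nominal": nominal}


print(blood_pressure_levels(152))
```

Each step from continuous to ordinal to nominal discards information, which is why the level chosen by the researcher should be stated in the report.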

Commonly used types of regression analysis are:

1. Simple linear regression

2. Multiple linear regression

3. Simple logistic regression

4. Nonlinear regression

5. Polynomial regression

6. Cox proportional hazards regression

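Guidelines for reporting regression analyses:

1. Describe the purpose of the regression analysis, stating which variables are independent (causes) and which is dependent (effect).

2. Describe the data for each variable.

For continuous variables, report the mean and standard deviation (normally distributed data), or the median, range, and 25th and 75th percentiles (non-normally distributed data). For categorical variables, report counts and frequencies.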
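The summaries described in guideline 2 can be sketched with Python's standard library as follows. The function names and the sample values are assumptions made for illustration, and the quartiles depend on the interpolation method used by `statistics.quantiles`.

```python
import statistics
from collections import Counter


def summarize_continuous(values, normal=True):
    """Mean and SD for normal data; median, range, and quartiles otherwise."""
    if normal:
        return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"median": median, "min": min(values), "max": max(values),
            "q1": q1, "q3": q3}


def summarize_categorical(values):
    """Count and relative frequency of each category."""
    counts = Counter(values)
    total = len(values)
    return {level: (count, count / total) for level, count in counts.items()}


# Hypothetical data, for illustration only.
print(summarize_continuous([1, 2, 3, 4, 5, 6, 7], normal=False))
print(summarize_categorical(["A", "A", "B", "O"]))
```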

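3. Identify the assumptions made about each variable and state how they were checked.

Each assumption should be checked, either with a formal hypothesis test or informally, for example with plots such as residual plots. If the data violate an assumption, they should be adjusted, for example by a logarithmic transformation to bring them closer to a normal distribution.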
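One informal check of this kind is to compare the skewness of a variable before and after a log transformation. The sketch below uses the population moment coefficient of skewness; the data are hypothetical, and skewness is only one of several possible checks.

```python
import math


def skewness(values):
    """Population moment coefficient of skewness (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (all positive, so log is defined).
raw = [1, 2, 2, 3, 4, 8, 16, 64]
logged = [math.log(v) for v in raw]

print("skewness before:", round(skewness(raw), 2))
print("skewness after: ", round(skewness(logged), 2))
```

A skewness closer to zero after the transformation suggests the transformed values are more symmetric, but it does not by itself establish normality.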

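4. Report any missing data and state how they were handled.

In multivariate analysis, missing data often reduce the amount of usable data. For example, if a subject's age (Xj) was not recorded, that subject's entire record is dropped from the analysis.

Missing data can often be remedied by imputation. The simplest methods substitute the mean of all observed values, the mean for the same group at the same observation time, or the mean of the values recorded before and after the missing one.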
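The simplest of these remedies, replacing each missing value with the mean of the observed values, might look like the sketch below. Representing a missing value as `None`, the function name, and the ages are all assumptions for illustration.

```python
def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical ages with two missing entries.
ages = [34, None, 51, 47, None, 40]
print(impute_mean(ages))
```

Mean imputation keeps every subject in the analysis but understates the variability of the imputed variable, so the number of imputed values should be reported alongside the method.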

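5. Report how outliers were handled.

Outliers must not be ignored. Even a single outlier can seriously affect the results. All outliers must be reported, and the results obtained with and without them can also be compared.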
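The comparison of results with and without an outlier can be sketched with an ordinary least-squares line. The data are hypothetical: five points lie exactly on y = 2x and the sixth is a deliberately extreme value.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data; the last observation is the outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 40]

with_outlier = fit_line(xs, ys)
without_outlier = fit_line(xs[:-1], ys[:-1])
print("slope with outlier:   ", with_outlier[1])
print("slope without outlier:", without_outlier[1])
```

Reporting both fits lets the reader judge how strongly a single observation drives the estimated coefficients.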

6. Report the regression model.

Typical reports of a linear, multiple, or logistic regression model are illustrated below:

 

Y= 40.8 + 3.98X1 + 1.22X2 - 2.09X3

Figure 1. A multiple linear regression equation. In this example, the model predicts overall function score, Y, for patients with multiple sclerosis based on: disease severity, X1; ambulatory ability (measured as the rate of walking in laps per minute), X2; and number of lesions, X3. Here, X1, X2, and X3 are explanatory variables (sometimes called risk factors); the numbers in front of the X values are called regression coefficients or β-weights. (40.8 is the Y intercept point, where the line crosses the Y axis.) Coefficients are interpreted as follows: if X1 and X3 are held constant (or "controlling for" disease severity and number of lesions), then the mean functional score increases by about 1.22 points (the coefficient for X2) for each additional lap per minute. The final model had a coefficient of multiple determination, R², of 0.58, indicating that the three variables in the model explain 58% of the variation in the response variable.
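To make the interpretation of the coefficients concrete, the sketch below evaluates the equation in Figure 1. The coefficients are taken from the equation; the function name and the patient values are hypothetical.

```python
def predict_function_score(severity, laps_per_minute, lesions):
    """Evaluate Y = 40.8 + 3.98*X1 + 1.22*X2 - 2.09*X3 from Figure 1."""
    return 40.8 + 3.98 * severity + 1.22 * laps_per_minute - 2.09 * lesions


# Two hypothetical patients who differ only by one lap per minute (X2).
slower = predict_function_score(severity=2, laps_per_minute=3, lesions=3)
faster = predict_function_score(severity=2, laps_per_minute=4, lesions=3)
print("difference in predicted score:", round(faster - slower, 2))
```

With X1 and X3 held fixed, the difference between the two predictions equals the coefficient of X2, which is what "controlling for" the other variables means.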

 

Table 1. A Table for Reporting a Multiple Linear Regression Model With Three Explanatory Variables*

Variable     Coefficient (β)   SE      95% CI            Wald χ²   p Value†
Intercept    40.79             2.55
X1           3.98              2.37    -0.67 to 8.63     1.68      0.10
X2           1.23              0.29    0.66 to 1.80      4.20      < 0.001
X3           -2.09             0.28    -2.64 to -1.54    -7.34     < 0.001

 

*Intercept = a mathematical constant (no clinical interpretation); X1 to X3 = the explanatory variables; Coefficient = the mathematical weightings of the explanatory variables in the equation (the regression coefficient or β-weight); SE = estimated precision of the coefficients; 95% CI = 95% confidence intervals for the coefficients; Wald χ² = the Wald test statistic calculated from the data to be compared with the χ² distribution with 1 degree of freedom.

†Variables X2 and X3 are statistically significant independent predictors of the response variable.
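The 95% CI column of Table 1 can be reproduced from the Coefficient and SE columns. The sketch below assumes a normal approximation (coefficient ± 1.96 × SE); the source does not state how its intervals were computed, so the multiplier is an assumption.

```python
def wald_ci(coef, se, z=1.96):
    """Approximate 95% confidence interval: coef +/- z * SE."""
    return coef - z * se, coef + z * se


# Coefficient and SE values copied from Table 1.
table_1 = {"X1": (3.98, 2.37), "X2": (1.23, 0.29), "X3": (-2.09, 0.28)}

for name, (coef, se) in table_1.items():
    lower, upper = wald_ci(coef, se)
    print(f"{name}: {lower:.2f} to {upper:.2f}")
```

An interval that includes 0, as for X1, corresponds to a coefficient that is not statistically significant at the 0.05 level, which matches the table footnote.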

 

Table 2. A Table for Reporting a Multiple Logistic Regression Model With Four Explanatory Variables*

Variable     Coefficient (β)   SE      Wald χ²   p Value   Odds Ratio   95% CI
Intercept    -1.88             0.48
X1           1.435             0.589   5.93      0.02      4.2          1.32–13.33
X2           -0.847            0.690   1.51      0.22      0.43         0.11–1.66
X3           3.045             1.260   5.84      0.02      21.01        1.78–248.29
X4           2.200             0.990   4.94      0.03      9.03         1.30–62.83

 

*Odds Ratio = controlling for other variables in the model, for every unit increase in, for example, variable 1 (X1), the odds of having the event of interest are multiplied by 4.2 (likewise, controlling for other variables in the model, for every unit increase in, for example, variable 2 (X2), the odds of having the event are multiplied by 0.43, that is, they decrease); 95% CI = the 95% confidence interval for the estimated odds ratio. See Table 1 for other abbreviations or explanations not used in the text.

 

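For each regression coefficient, report the p value, the 95% confidence interval, and related statistics (Table 1). For logistic regression, report the odds ratio, its 95% CI, and related statistics (Table 2).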
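In logistic regression the odds ratio is the exponential of the coefficient, and an approximate confidence interval is obtained by exponentiating the limits for the coefficient. The sketch below applies this to the values in Table 2, assuming a multiplier of 1.96; small differences from the printed interval limits can arise from rounding in the source.

```python
import math


def odds_ratio(coef, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a logistic coefficient."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))


# Coefficient and SE values copied from Table 2.
table_2 = {"X1": (1.435, 0.589), "X2": (-0.847, 0.690),
           "X3": (3.045, 1.260), "X4": (2.200, 0.990)}

for name, (coef, se) in table_2.items():
    ratio, lower, upper = odds_ratio(coef, se)
    print(f"{name}: OR = {ratio:.2f}, 95% CI about {lower:.2f} to {upper:.2f}")
```

An interval that includes 1, as for X2, indicates that the odds ratio is not statistically significant at the 0.05 level.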

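7. Report which statistical technique was used to select the multiple regression model. The first step in model selection is to set the significance level, usually expressed as a p value, such as p = 0.05 or p = 0.1.

The second step is to choose the selection method, such as forward, backward, stepwise, or best-subset selection.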

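8. For multiple regression, check each variable for collinearity.

If a variable (xi) is correlated with, rather than independent of, one or more other variables (xj, xk, and so on), the variables are collinear.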
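A simple screen for collinearity is the pairwise correlation between explanatory variables. The sketch below also computes the variance inflation factor, which for exactly two predictors reduces to 1 / (1 − r²). The variance inflation factor is a commonly used diagnostic that the source does not name, and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictors: x2 nearly duplicates x1, x3 is unrelated to x1.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]
x3 = [1, -1, 0, -1, 1]

print("r(x1, x2) =", round(pearson_r(x1, x2), 3), "VIF =", round(vif_two_predictors(x1, x2), 1))
print("r(x1, x3) =", round(pearson_r(x1, x3), 3), "VIF =", round(vif_two_predictors(x1, x3), 1))
```

With more than two predictors, the r² in the formula is replaced by the R² from regressing each predictor on all the others.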

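9. For multiple regression, check for interactions.

In a multiple regression with explanatory variables xi and xj, if the product xi·xj is itself a term that affects the dependent variable, then xi and xj interact. Regression analyses often need to test whether such an interaction is present.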
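Written out, a two-variable model with an interaction term takes the form below. The β symbols and the error term ε are notation assumed here for illustration; they do not appear in the source.

```latex
Y = \beta_0 + \beta_i x_i + \beta_j x_j + \beta_{ij}\, x_i x_j + \varepsilon,
\qquad
\frac{\partial Y}{\partial x_i} = \beta_i + \beta_{ij}\, x_j
```

The second expression shows why an interaction matters: when β_ij is not zero, the effect of xi on Y is no longer a single number but depends on the value of xj. Testing for an interaction amounts to testing whether β_ij differs from zero.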

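10. Provide a measure of how well the model fits the data.

For simple linear regression, the simplest measure is the correlation coefficient (r). For multiple regression, the usual measure is the coefficient of determination (R²).

Residuals are another important criterion. Besides checking the model with residual plots, the residual values can be used to identify outliers.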
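For a simple linear regression, these fit measures can be computed directly, as in the sketch below. The data and the function name are hypothetical; R² is computed as 1 − SSE/SST, which for a single predictor equals the square of r.

```python
import math


def fit_and_assess(xs, ys):
    """Fit a least-squares line and return r, R-squared, and the residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)
    return {"intercept": intercept, "slope": slope,
            "r": sxy / math.sqrt(sxx * syy),
            "r_squared": 1.0 - sse / syy,
            "residuals": residuals}


# Hypothetical observations lying close to a straight line.
result = fit_and_assess([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print("r =", round(result["r"], 4), "R2 =", round(result["r_squared"], 4))
print("residuals:", [round(e, 2) for e in result["residuals"]])
```

A residual that is much larger in absolute value than the others points to a possible outlier and is worth inspecting in a residual plot.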

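11. Validate the model with a separate, independent set of data.

      A. Within one data set, build the model on 75% of the data, then use the remaining 25% to check whether the model holds.

      B. Remove one observation at a time and build the model on the remaining data; then substitute the removed observation into the model to validate it. This method is called the jackknife procedure.

      C. Split the data into two sets, build a model on each set, and compare whether the two models agree.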
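Methods A and B can be sketched for a simple linear regression as follows. The data are hypothetical, and the split in method A simply takes the observations in their listed order (which assumes that order is arbitrary); in practice the data would usually be shuffled first.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def holdout_error(xs, ys, train_fraction=0.75):
    """Method A: fit on the first part, return mean absolute error on the rest."""
    cut = int(len(xs) * train_fraction)
    intercept, slope = fit_line(xs[:cut], ys[:cut])
    errors = [abs(y - (intercept + slope * x)) for x, y in zip(xs[cut:], ys[cut:])]
    return sum(errors) / len(errors)


def jackknife_errors(xs, ys):
    """Method B: leave each observation out in turn and predict it."""
    errors = []
    for i in range(len(xs)):
        intercept, slope = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(ys[i] - (intercept + slope * xs[i]))
    return errors


# Hypothetical observations.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.2, 4.9, 7.1, 9.0, 10.8, 13.1, 15.2, 16.9]
print("hold-out mean absolute error:", round(holdout_error(xs, ys), 3))
print("leave-one-out errors:", [round(e, 2) for e in jackknife_errors(xs, ys)])
```

Prediction errors on data that were not used to build the model give a more honest picture of performance than the fit statistics from the model-building data.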

12. Report the statistical software used, such as SAS or SPSS.

 

II. Reporting ANOVA

Commonly used types of ANOVA:

1. One-way ANOVA

2. Two-way ANOVA

3. Multiway ANOVA

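4. Analysis of covariance

Used when a continuous response variable is affected by other variables (which may be continuous) as well as by one or more categorical variables.

5. Repeated-measures ANOVA

Used to assess the same subjects under different conditions or at different time points. Examples are the same patient's blood pressure measured in three positions (lying, sitting, standing), or a patient's muscle strength measured 1, 5, 10, and 20 days after surgery.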

 

A typical ANOVA report is shown below:

Table 3. A Table for Presenting the Results of a Two-Way ANOVA for Analyzing the Two Factors Group and Age*

Source of Variation   df†   Sums of Squares   Mean Square   F Statistic   p Value
Group                 1     0.64              0.64          2.24          0.16
Age                   3     3.92              1.31          4.57          0.02
Group × age           3     4.91              1.64          5.72          0.01
Error                 12    3.43              0.29

*ANOVA = includes the two factors: group (two levels or categories) and age (four categories or levels), and the levels of each category should be stated in the description of the study (group and age significantly interact and so must be considered together); Source of variation = identification of the sources of variability in the response variable as the factors in the model (group, age, and the interaction between group and age) and as random error (the variability not explained by the factors); df = the degrees of freedom, a mathematical concept; Sums of squares = unlike one-way ANOVA, the sums of squares in multiway ANOVA are not easily explained and are best regarded as simply steps in the calculation of the mean squares; Mean square = the sums of squares divided by the degrees of freedom (essentially, estimates of the variation in the data); F statistic = the test statistic for the F distribution, for testing for interaction effects and main effects, equals the mean square for each factor divided by the mean square of the error; p Value = the probability values indicating the statistical significance of the effect of each factor on the response variable (eg, age and group interact [p = 0.01] in affecting the response variable and should be further investigated together; ie, the main effect of group or the main effect of age should not be investigated alone).

†For two groups, the df is 2 - 1, or 1. For four age categories, the df is 4 - 1, or 3. For the interaction effect between group and age (ie, group × age), the df values for each factor are multiplied (3 × 1 = 3).
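The arithmetic behind Table 3 can be checked with a short sketch: each mean square is the sum of squares divided by its df, and each F statistic is the mean square of the effect divided by the mean square of the error. The sums of squares are copied from the table; the function name is hypothetical, and p values are omitted because they require the F distribution, which the Python standard library does not provide.

```python
def anova_rows(effects, error_ss, error_df):
    """Compute mean squares and F statistics from sums of squares and df."""
    ms_error = error_ss / error_df
    rows = {}
    for name, (df, ss) in effects.items():
        mean_square = ss / df
        rows[name] = {"df": df, "ms": mean_square, "f": mean_square / ms_error}
    return rows, ms_error


groups, ages = 2, 4  # number of levels of each factor, as in Table 3
effects = {
    "Group": (groups - 1, 0.64),
    "Age": (ages - 1, 3.92),
    "Group x age": ((groups - 1) * (ages - 1), 4.91),
}
rows, ms_error = anova_rows(effects, error_ss=3.43, error_df=12)

for name, row in rows.items():
    print(f"{name}: df = {row['df']}, MS = {row['ms']:.2f}, F = {row['f']:.2f}")
print(f"Error: MS = {ms_error:.2f}")
```

The values computed this way agree with the table to within rounding in the second decimal place.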