Book

Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables

Patrick Royston, Willi Sauerbrei

ISBN: 978-0-470-02842-1  

322 pages

John Wiley & Sons Ltd, Chichester, England

May 2008

 

For datasets, see the end of the page.

Everything on this page was reproduced with permission from John Wiley & Sons Ltd.

From the preface:

"[...] Our general objective is to provide a readable text giving the rationale of, and practical advice on, a unified approach to multivariable modelling which aims to make such models simpler and more effective. [...] No multivariable model-building strategy has rigorous theoretical underpinnings. Even those approaches most used in practice have not had their properties studied adequately by simulation. In particular, handling continuous variables in a multivariable context has largely been ignored. Since there is no consensus among researchers on the ‘best’ strategy, a pragmatic approach is required. Our book reflects our views derived from wide experience. The text assumes a basic understanding of multiple regression modelling, but it can be read without detailed mathematical knowledge. [...] As expressed in a very readable paper by Chatfield (2002), we aim to ‘encourage and guide practitioners, and also to counterbalance a literature that can be overly concerned with theoretical matters far removed from the day-to-day concerns of many working statisticians’. [...]"

 

Table of Contents

1 Introduction 1
  1.1 Real-Life Problems as Motivation for Model Building 1
    1.1.1 Many Candidate Models 1
    1.1.2 Functional Form for Continuous Predictors 2
    1.1.3 Example 1: Continuous Response 2
    1.1.4 Example 2: Multivariable Model for Survival Data 5
  1.2 Issues in Modelling Continuous Predictors 8
    1.2.1 Effects of Assumptions 8
    1.2.2 Global versus Local Influence Models 9
    1.2.3 Disadvantages of Fractional Polynomial Modelling 9
    1.2.4 Controlling Model Complexity 10
  1.3 Types of Regression Model Considered 10
    1.3.1 Normal-Errors Regression 10
    1.3.2 Logistic Regression 12
    1.3.3 Cox Regression 12
    1.3.4 Generalized Linear Models 14
    1.3.5 Linear and Additive Predictors 14
  1.4 Role of Residuals 15
    1.4.1 Uses of Residuals 15
    1.4.2 Graphical Analysis of Residuals 15
  1.5 Role of Subject-Matter Knowledge in Model Development 16
  1.6 Scope of Model Building in our Book 17
  1.7 Modelling Preferences 18
    1.7.1 General Issues 18
    1.7.2 Criteria for a Good Model 18
    1.7.3 Personal Preferences 19
  1.8 General Notation 20
2 Variable Selection 23
  2.1 Introduction 23
  2.2 Background 24
  2.3 Preliminaries for a Multivariable Analysis 25
  2.4 Aims of Multivariable Models 26
  2.5 Prediction: Summary Statistics and Comparisons 29
  2.6 Procedures for Selecting Variables 29
    2.6.1 Strength of Predictors 30
    2.6.2 Stepwise Procedures 31
    2.6.3 All-Subsets Model Selection Using Information Criteria 32
    2.6.4 Further Considerations 33
  2.7 Comparison of Selection Strategies in Examples 35
    2.7.1 Myeloma Study 35
    2.7.2 Educational Body-Fat Data 36
    2.7.3 Glioma Study 38
  2.8 Selection and Shrinkage 40
    2.8.1 Selection Bias 40
    2.8.2 Simulation Study 40
    2.8.3 Shrinkage to Correct for Selection Bias 42
    2.8.4 Post-estimation Shrinkage 44
    2.8.5 Reducing Selection Bias 45
    2.8.6 Example 46
  2.9 Discussion 47
    2.9.1 Model Building in Small Datasets 47
    2.9.2 Full, Pre-specified or Selected Model? 47
    2.9.3 Comparison of Selection Procedures 49
    2.9.4 Complexity, Stability and Interpretability 49
    2.9.5 Conclusions and Outlook 50
3 Handling Categorical and Continuous Predictors 53
  3.1 Introduction 53
  3.2 Types of Predictor 54
    3.2.1 Binary 54
    3.2.2 Nominal 54
    3.2.3 Ordinal, Counting, Continuous 55
    3.2.4 Derived 55
  3.3 Handling Ordinal Predictors 55
    3.3.1 Coding Schemes 55
    3.3.2 Effect of Coding Schemes on Variable Selection 56
  3.4 Handling Counting and Continuous Predictors: Categorization 58
    3.4.1 ‘Optimal’ Cutpoints: A Dangerous Analysis 58
    3.4.2 Other Ways of Choosing a Cutpoint 59
  3.5 Example: Issues in Model Building with Categorized Variables 60
    3.5.1 One Ordinal Variable 61
    3.5.2 Several Ordinal Variables 62
  3.6 Handling Counting and Continuous Predictors: Functional Form 64
    3.6.1 Beyond Linearity 64
    3.6.2 Does Nonlinearity Matter? 65
    3.6.3 Simple versus Complex Functions 66
    3.6.4 Interpretability and Transportability 66
  3.7 Empirical Curve Fitting 67
    3.7.1 General Approaches to Smoothing 68
    3.7.2 Critique of Local and Global Influence Models 68
  3.8 Discussion 69
    3.8.1 Sparse Categories 69
    3.8.2 Choice of Coding Scheme 69
    3.8.3 Categorizing Continuous Variables 70
    3.8.4 Handling Continuous Variables 70
4 Fractional Polynomials for One Variable 71
  4.1 Introduction 72
  4.2 Background 72
    4.2.1 Genesis 72
    4.2.2 Types of Model 73
    4.2.3 Relation to Box–Tidwell and Exponential Functions 73
  4.3 Definition and Notation 74
    4.3.1 Fractional Polynomials 74
    4.3.2 First Derivative 74
  4.4 Characteristics 75
    4.4.1 FP1 and FP2 Functions 75
    4.4.2 Maximum or Minimum of a FP2 Function 75
  4.5 Examples of Curve Shapes with FP1 and FP2 Functions 76
  4.6 Choice of Powers 78
  4.7 Choice of Origin 79
  4.8 Model Fitting and Estimation 79
  4.9 Inference 79
    4.9.1 Hypothesis Testing 79
    4.9.2 Interval Estimation 80
  4.10 Function Selection Procedure 82
    4.10.1 Choice of Default Function 82
    4.10.2 Closed Test Procedure for Function Selection 82
    4.10.3 Example 83
    4.10.4 Sequential Procedure 83
    4.10.5 Type I Error and Power of the Function Selection Procedure 84
  4.11 Scaling and Centering 84
    4.11.1 Computational Aspects 84
    4.11.2 Examples 85
  4.12 FP Powers as Approximations to Continuous Powers 85
    4.12.1 Box–Tidwell and Fractional Polynomial Models 85
    4.12.2 Example 85
  4.13 Presentation of Fractional Polynomial Functions 86
    4.13.1 Graphical 86
    4.13.2 Tabular 87
  4.14 Worked Example 89
    4.14.1 Details of all Fractional Polynomial Models 89
    4.14.2 Function Selection 90
    4.14.3 Details of the Fitted Model 90
    4.14.4 Standard Error of a Fitted Value 91
    4.14.5 Fitted Odds Ratio and its Confidence Interval 91
  4.15 Modelling Covariates with a Spike at Zero 92
  4.16 Power of Fractional Polynomial Analysis 94
    4.16.1 Underlying Function Linear 95
    4.16.2 Underlying Function FP1 or FP2 95
    4.16.3 Comment 96
  4.17 Discussion 97
5 Some Issues with Univariate Fractional Polynomial Models 99
  5.1 Introduction 99
  5.2 Susceptibility to Influential Covariate Observations 100
  5.3 A Diagnostic Plot for Influential Points in FP Models 100
    5.3.1 Example 1: Educational Body-Fat Data 101
    5.3.2 Example 2: Primary Biliary Cirrhosis Data 101
  5.4 Dependence on Choice of Origin 103
  5.5 Improving Robustness by Preliminary Transformation 105
    5.5.1 Example 1: Educational Body-Fat Data 106
    5.5.2 Example 2: PBC Data 107
    5.5.3 Practical Use of the Pre-transformation g_δ(x) 107
  5.6 Improving Fit by Preliminary Transformation 108
    5.6.1 Lack of Fit of Fractional Polynomial Models 108
    5.6.2 Negative Exponential Pre-transformation 108
  5.7 Higher Order Fractional Polynomials 109
    5.7.1 Example 1: Nerve Conduction Data 109
    5.7.2 Example 2: Triceps Skinfold Thickness 110
  5.8 When Fractional Polynomial Models are Unsuitable 111
    5.8.1 Not all Curves are Fractional Polynomials 111
    5.8.2 Example: Kidney Cancer 112
  5.9 Discussion 113
6 MFP: Multivariable Model-Building with Fractional Polynomials 115
  6.1 Introduction 115
  6.2 Motivation 116
  6.3 The MFP Algorithm 117
    6.3.1 Remarks 118
    6.3.2 Example 118
  6.4 Presenting the Model 120
    6.4.1 Parameter Estimates 120
    6.4.2 Function Plots 121
    6.4.3 Effect Estimates 121
  6.5 Model Criticism 123
    6.5.1 Function Plots 123
    6.5.2 Graphical Analysis of Residuals 124
    6.5.3 Assessing Fit by Adding More Complex Functions 125
    6.5.4 Consistency with Subject-Matter Knowledge 129
  6.6 Further Topics 129
    6.6.1 Interval Estimation 129
    6.6.2 Importance of the Nominal Significance Level 130
    6.6.3 The Full MFP Model 131
    6.6.4 A Single Predictor of Interest 132
    6.6.5 Contribution of Individual Variables to the Model Fit 134
    6.6.6 Predictive Value of Additional Variables 136
  6.7 Further Examples 138
    6.7.1 Example 1: Oral Cancer 138
    6.7.2 Example 2: Diabetes 139
    6.7.3 Example 3: Whitehall I 140
  6.8 Simple Versus Complex Fractional Polynomial Models 144
    6.8.1 Complexity and Modelling Aims 144
    6.8.2 Example: GBSG Breast Cancer Data 144
  6.9 Discussion 146
    6.9.1 Philosophy of MFP 147
    6.9.2 Function Complexity, Sample Size and Subject-Matter Knowledge 148
    6.9.3 Improving Robustness by Preliminary Covariate Transformation 148
    6.9.4 Conclusion and Future 149
7 Interactions 151
  7.1 Introduction 151
  7.2 Background 152
  7.3 General Considerations 152
    7.3.1 Effect of Type of Predictor 152
    7.3.2 Power 153
    7.3.3 Randomized Trials and Observational Studies 153
    7.3.4 Predefined Hypothesis or Hypothesis Generation 153
    7.3.5 Interactions Caused by Mismodelling Main Effects 154
    7.3.6 The ‘Treatment–Effect’ Plot 154
    7.3.7 Graphical Checks, Sensitivity and Stability Analyses 154
    7.3.8 Cautious Interpretation is Essential 155
  7.4 The MFPI Procedure 155
    7.4.1 Model Simplification 156
    7.4.2 Check of the Results and Sensitivity Analysis 156
  7.5 Example 1: Advanced Prostate Cancer 157
    7.5.1 The Fitted Model 158
    7.5.2 Check of the Interactions 160
    7.5.3 Final Model 161
    7.5.4 Further Comments and Interpretation 162
    7.5.5 FP Model Simplification 163
  7.6 Example 2: GBSG Breast Cancer Study 163
    7.6.1 Oestrogen Receptor Positivity as a Predictive Factor 163
    7.6.2 A Predefined Hypothesis: Tamoxifen–Oestrogen Receptor Interaction 163
  7.7 Categorization 165
    7.7.1 Interaction with Categorized Variables 165
    7.7.2 Example: GBSG Study 166
  7.8 STEPP 167
  7.9 Example 3: Comparison of STEPP with MFPI 168
    7.9.1 Interaction in the Kidney Cancer Data 168
    7.9.2 Stability Investigation 168
  7.10 Comment on Type I Error of MFPI 171
  7.11 Continuous-by-Continuous Interactions 172
    7.11.1 Mismodelling May Induce Interaction 173
    7.11.2 MFPIgen: An FP Procedure to Investigate Interactions 174
    7.11.3 Examples of MFPIgen 175
    7.11.4 Graphical Presentation of Continuous-by-Continuous Interactions 179
    7.11.5 Summary 180
  7.12 Multi-Category Variables 181
  7.13 Discussion 181
8 Model Stability 183
  8.1 Introduction 183
  8.2 Background 184
  8.3 Using the Bootstrap to Explore Model Stability 185
    8.3.1 Selection of Variables within a Bootstrap Sample 185
    8.3.2 The Bootstrap Inclusion Frequency and the Importance of a Variable 186
  8.4 Example 1: Glioma Data 186
  8.5 Example 2: Educational Body-Fat Data 188
    8.5.1 Effect of Influential Observations on Model Selection 189
  8.6 Example 3: Breast Cancer Diagnosis 190
  8.7 Model Stability for Functions 191
    8.7.1 Summarizing Variation between Curves 191
    8.7.2 Measures of Curve Instability 192
  8.8 Example 4: GBSG Breast Cancer Data 193
    8.8.1 Interdependencies among Selected Variables and Functions in Subsets 193
    8.8.2 Plots of Functions 193
    8.8.3 Instability Measures 195
    8.8.4 Stability of Functions Depending on Other Variables Included 196
  8.9 Discussion 197
    8.9.1 Relationship between Inclusion Fractions 198
    8.9.2 Stability of Functions 198
9 Some Comparisons of MFP with Splines 201
  9.1 Introduction 201
  9.2 Background 202
  9.3 MVRS: A Procedure for Model Building with Regression Splines 203
    9.3.1 Restricted Cubic Spline Functions 203
    9.3.2 Function Selection Procedure for Restricted Cubic Splines 205
    9.3.3 The MVRS Algorithm 205
  9.4 MVSS: A Procedure for Model Building with Cubic Smoothing Splines 205
    9.4.1 Cubic Smoothing Splines 205
    9.4.2 Function Selection Procedure for Cubic Smoothing Splines 206
    9.4.3 The MVSS Algorithm 206
  9.5 Example 1: Boston Housing Data 207
    9.5.1 Effect of Reducing the Sample Size 208
    9.5.2 Comparing Predictors 212
  9.6 Example 2: GBSG Breast Cancer Study 214
  9.7 Example 3: Pima Indians 215
  9.8 Example 4: PBC 217
  9.9 Discussion 219
    9.9.1 Splines in General 220
    9.9.2 Complexity of Functions 221
    9.9.3 Optimal Fit or Transferability? 221
    9.9.4 Reporting of Selected Models 221
    9.9.5 Conclusion 222
10 How To Work with MFP 223
  10.1 Introduction 223
  10.2 The Dataset 223
  10.3 Univariate Analyses 226
  10.4 MFP Analysis 227
  10.5 Model Criticism 228
    10.5.1 Function Plots 228
    10.5.2 Residuals and Lack of Fit 228
    10.5.3 Robustness Transformation and Subject-Matter Knowledge 229
    10.5.4 Diagnostic Plot for Influential Observations 230
    10.5.5 Refined Model 231
    10.5.6 Interactions 231
  10.6 Stability Analysis 232
  10.7 Final Model 235
  10.8 Issues to be Aware of 235
    10.8.1 Selecting the Main-Effects Model 235
    10.8.2 Further Comments on Stability 236
    10.8.3 Searching for Interactions 238
  10.9 Discussion 238
11 Special Topics Involving Fractional Polynomials 241
  11.1 Time-Varying Hazard Ratios in the Cox Model 241
    11.1.1 The Fractional Polynomial Time Procedure 242
    11.1.2 The MFP Time Procedure 243
    11.1.3 Prognostic Model with Time-Varying Effects for Patients with Breast Cancer 243
    11.1.4 Categorization of Survival Time 245
    11.1.5 Discussion 246
  11.2 Age-specific Reference Intervals 247
    11.2.1 Example: Fetal Growth 247
    11.2.2 Using FP Functions as Smoothers 248
    11.2.3 More Sophisticated Distributional Assumptions 249
    11.2.4 Discussion 249
  11.3 Other Topics 250
    11.3.1 Quantitative Risk Assessment in Developmental Toxicity Studies 250
    11.3.2 Model Uncertainty for Functions 251
    11.3.3 Relative Survival 252
    11.3.4 Approximating Smooth Functions 253
    11.3.5 Miscellaneous Applications 254
12 Epilogue 255
  12.1 Introduction 255
  12.2 Towards Recommendations for Practice 255
    12.2.1 Variable Selection Procedure 255
    12.2.2 Functional Form for Continuous Covariates 257
    12.2.3 Extreme Values or Influential Points 257
    12.2.4 Sensitivity Analysis 257
    12.2.5 Check for Model Stability 258
    12.2.6 Complexity of a Predictor 258
    12.2.7 Check for Interactions 258
  12.3 Omitted Topics and Future Directions 258
    12.3.1 Measurement Error in Covariates 258
    12.3.2 Meta-analysis 258
    12.3.3 Multi-level (Hierarchical) Models 259
    12.3.4 Missing Covariate Data 259
    12.3.5 Other Types of Model 259
  12.4 Conclusion 259
Appendix A: Data and Software Resources 261
  A.1 Summaries of Datasets 261
  A.2 Datasets used more than once 262
  A.3 Software 267
Appendix B: Glossary of Abbreviations 269
References 271
Index 285

 

Datasets and some information are available for download here.
Datasets are available in the following formats: Stata, SAS, Excel, ASCII.
For more details about the data, see Appendix A of the book.

Table A.1 Datasets used once in our book. N/A = not applicable. Further details accompany the example in the relevant section (page 261).

Name (and Link)             Outcome   Obs.  Events  Variables (a)  Section reference
Myeloma                     Survival  65    48      16             2.7.1
Freiburg DNA breast cancer  Survival  109   56      1              3.4.1
Cervix cancer               Binary    899   141     21             3.5
Nerve conduction            Cont.     406   N/A     1              5.7.1
Triceps skinfold thickness  Cont.     892   N/A     1              5.7.2
Diabetes                    Cont.     42    N/A     2              6.7.2
Advanced prostate cancer    Survival  475   338     13             7.5
Quit smoking study          Cont.     250   N/A     3              7.11.3
Breast cancer diagnosis     Binary    458   133     6              8.6
Boston housing              Cont.     506   N/A     13             9.5
Pima Indians                Binary    768   268     8              9.7
Rotterdam breast cancer     Survival  2982  1518    11             11.1.3
Fetal growth                Cont.     574   N/A     1              11.2.1
Cholesterol                 Cont.     553   N/A     1              11.2.3

(a) Maximum number of predictors used in analyses. Categorical variables count as >1 predictor if modelled using several dummy variables.

 

Table A.2 Datasets used more than once in our book. N/A = not applicable. Further details are given in Appendix A.2 (page 262).

Name                  Outcome   Obs.   Events  Variables (a)  Section reference
Research body fat     Cont.     326    N/A     1              1.1.3, 4.2.1, 4.9.1, 4.9.2, 4.10.3, 4.12
GBSG breast cancer    Survival  686    299     9              1.1.4, 3.6.2, 5.6.2, 6.5.2, 6.5.3, 6.5.4, 6.6.5, 6.6.6, 6.8.2, 7.6, 7.7.2, 8.8, 9.6
Educational body fat  Cont.     252    N/A     13             2.7.2, 2.8.6, 5.2, 5.3.1, 5.5.1, 8.5
Glioma                Survival  411    274     15             2.7.3, 8.4
Prostate cancer       Cont.     97     N/A     7              3.6.2, 3.6.3, 4.15, 6.2, 6.3.2, 6.4.2, 6.4.3, 6.5.1, 6.5.3, 6.6.1, 6.6.2, 6.6.3, 6.6.4, 7.11.3
Whitehall I           Survival  17260  2576    10             6.7.3
Whitehall I           Binary    17260  1670    10             4.13.1, 4.13.2, 4.14, 7.11.1, 7.11.3
PBC                   Survival  418    161     17             5.3.2, 5.4, 5.5.2, 9.8
Oral cancer           Binary    397    194     1              6.7.1, 9.3.1
Kidney cancer         Survival  347    322     10             5.8.2, 7.9

(a) Maximum number of predictors used in analyses. Categorical variables count as >1 predictor if modelled using several dummy variables.