1. Introduction to applied statistics & applied statistical methods
   Prof. Dr. Chang Zhu
   Overview
     Chi-square test
     Discriminant analysis
     Logistic regression
   Nominal data / categorical data
2. Dichotomous variable: only 2 values, e.g. yes or no, male or female.
   Binary variable: assign a 1 (yes) or 0 (no) to indicate the presence or absence of something.
   Chi-square analysis
   The level of measurement is nominal.
   The chi-square test is non-parametric: it can be used when normality is not assumed.
3. Association between categorical variables
   Suppose both the response and the explanatory variables are categorical. There is an association if the population conditional distribution of the response variable differs among the categories of the explanatory variable.
   Example: contingency table of happiness cross-classified by family income (data from the 2006 GSS)

   Chi-square analysis
                          Happiness
   Income      Very         Pretty       Not too      Total
   ----------------------------------------------------------
   Above       272 (44%)    294 (48%)     49 (8%)       615
   Average     454 (32%)    835 (59%)    131 (9%)      1420
   Below       185 (20%)    527 (57%)    208 (23%)      920
   ----------------------------------------------------------
   Response: happiness. Explanatory: income.
   Is there a relationship between income and happiness?
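The percentages in the table are the conditional distributions of happiness within each income group. A minimal sketch of how they can be recomputed from the counts (the names `observed` and `row_percentages` are chosen here for illustration and are not part of the slides):

```python
# Observed counts from the 2006 GSS table on the slide.
# Rows = family income; columns = happiness (very, pretty, not too).
observed = {
    "Above":   [272, 294, 49],
    "Average": [454, 835, 131],
    "Below":   [185, 527, 208],
}

def row_percentages(counts):
    """Conditional distribution (in %) of the response within one row."""
    total = sum(counts)
    return [100 * c / total for c in counts]

conditional = {income: row_percentages(c) for income, c in observed.items()}

for income, pcts in conditional.items():
    # Rounded to whole percentages, as on the slide.
    print(income, [round(p) for p in pcts])
```

If the three rows of percentages were (nearly) identical, the sample would show no association; here they differ clearly between income groups.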
4. Chi-squared test of independence (Karl Pearson, 1900)
   Tests H0: the variables are statistically independent
         Ha: the variables are statistically dependent
   Intuition behind the test statistic: summarize the differences between the observed cell counts and the expected cell counts (what is expected if H0 is true).
   Notation:
     fo = observed frequency (cell count)
     fe = expected frequency
     r = number of rows in the table, c = number of columns
   Test statistic: χ² = Σ (fo - fe)² / fe, summed over all cells, with df = (r - 1)(c - 1).
   Chi-square analysis
   The chi-squared test answers: Is there an association?
   Standardized residuals answer: How do the data differ from what independence predicts?
   A measure of the strength of association, such as the difference of proportions, answers: How strong is the association?
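As a rough illustration of the notation above, the sketch below computes the expected frequencies, the Pearson statistic (the sum over cells of (fo - fe)² / fe), the degrees of freedom (r - 1)(c - 1), and standardized residuals for the income by happiness table. The function name is hypothetical, and the residual formula used, (fo - fe) / sqrt(fe (1 - row proportion)(1 - column proportion)), is one common definition assumed here; the slides do not spell it out.

```python
import math

# Income (rows: above, average, below) by happiness (very, pretty, not too).
observed = [
    [272, 294, 49],
    [454, 835, 131],
    [185, 527, 208],
]

def chi_square_independence(table):
    """Return (statistic, df, expected counts, standardized residuals)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    # Expected count under independence: row total * column total / n.
    expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]
    statistic = sum(
        (fo - fe) ** 2 / fe
        for row_o, row_e in zip(table, expected)
        for fo, fe in zip(row_o, row_e)
    )
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    residuals = [
        [
            (fo - fe) / math.sqrt(fe * (1 - rt / n) * (1 - ct / n))
            for fo, fe, ct in zip(row_o, row_e, col_tot)
        ]
        for row_o, row_e, rt in zip(table, expected, row_tot)
    ]
    return statistic, df, expected, residuals

statistic, df, expected, residuals = chi_square_independence(observed)
print("chi-square:", round(statistic, 2), "df:", df)
for row in residuals:
    print([round(z, 1) for z in row])
```

A statistic well above the .05 critical value for 4 df (9.488) points to an association; residuals that are large in absolute value show which cells depart most from independence.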
5. Chi-square analysis
   Like all hypothesis tests, chi-square is sensitive to sample size: for a given pattern of association, the obtained chi-square increases as N increases. With large samples, trivial relationships may be statistically significant. To guard against this, when N > 1000, set alpha = .01.
   Practice 1: chi-square test (cross-tab)
   A group of students was classified by personality (introvert or extrovert) and by colour preference (red, yellow, green or blue). Personality and colour preference are categorical variables.
   We want to answer this question: Is there an association between personality and colour preference?
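The effect of sample size can be seen in a small sketch. The 2 x 2 table below is hypothetical (52% vs 48%, a very weak association); multiplying every cell by 20 keeps the percentages unchanged but multiplies the chi-square statistic by 20. The critical values 3.841 (alpha = .05) and 6.635 (alpha = .01) are the standard ones for 1 df.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    return sum(
        (fo - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(table, row_tot)
        for fo, ct in zip(row, col_tot)
    )

# Hypothetical data: same weak pattern, N = 200 versus N = 4000.
small = [[52, 48], [48, 52]]
large = [[20 * c for c in row] for row in small]

chi_small = pearson_chi_square(small)
chi_large = pearson_chi_square(large)

print("N = 200: ", round(chi_small, 2))   # below 3.841
print("N = 4000:", round(chi_large, 2))   # above 3.841, below 6.635
```

With N = 4000 the same trivial pattern is significant at alpha = .05 but not at alpha = .01, which is the situation the stricter alpha is meant to handle.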
6. Practice 1
   In SPSS: Analyze > Descriptive Statistics > Crosstabs
   Practice 1 (output)
   Chi-Square Tests
                                    Value      df   Asymp. Sig. (2-sided)
   Pearson Chi-Square               71.200 a    3   .000
   Likelihood Ratio                 70.066      3   .000
   Linear-by-Linear Association     69.124      1   .000
   N of Valid Cases                 400
   a. 0 cells (0.0%) have expected count less than 5. The minimum expected count is 10.00.
   There is a relationship between students' personality and their colour preferences: χ²(3, N = 400) = 71.20, p < .0001.
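SPSS prints the significance as .000 because it rounds to three decimals. As a rough check, the tail probability for 3 degrees of freedom can be computed directly; the closed form below holds only for df = 3, and the function name is made up for this sketch.

```python
import math

def chi2_upper_tail_df3(x):
    """P(X > x) for a chi-square variable with exactly 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

p_value = chi2_upper_tail_df3(71.200)   # Pearson chi-square from the output
print(p_value < 0.0001)

# Sanity check: 7.815 is the usual .05 critical value for 3 df.
print(round(chi2_upper_tail_df3(7.815), 3))
```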
7. Discriminant analysis
   Similar to regression, except that the criterion (or dependent variable) is categorical rather than continuous.
   Used to identify boundaries between groups of objects. For example:
     (a) Does a person have the disease or not?
     (b) Is someone a good credit risk or not?
     (c) Should a student be admitted to college or not?
   Discriminant analysis
   We wish to predict group membership for a number of subjects from a set of predictor variables.
   The criterion variable (also called the grouping variable) is the object of classification. It is ALWAYS a categorical variable.
   Simple case: two groups and p predictor variables.
8. Discriminant analysis
   Similar to regression:
     Which predictor variables are related to the criterion (dependent variable)?
     Predict values of the criterion variable given new values of the predictor variables.
   Discriminant analysis
   Can we classify new (unclassified) subjects into groups?
   Given the classification functions, how accurate are we? And when we are inaccurate, is there some pattern to the misclassification?
   What is the strength of association between group membership and the predictors?
   D = (.024 age) + (.080 self-concept) + (-.100 anxiety) + (-.012 days absent) + (.134 anti-smoking score) - 4.543
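To make the discriminant function concrete, here is a minimal sketch that evaluates D for one subject. The coefficients are those in the equation above; the dictionary keys follow the variable names used in Practice 2 (age, selfcon, anxiety, absence, anti_smoking), and the subject's values are invented for illustration.

```python
# Unstandardized coefficients of the discriminant function D.
COEFFICIENTS = {
    "age": 0.024,
    "selfcon": 0.080,        # self-concept score
    "anxiety": -0.100,
    "absence": -0.012,       # days absent last year
    "anti_smoking": 0.134,   # anti-smoking test score
}
CONSTANT = -4.543

def discriminant_score(subject):
    """Weighted sum of the predictors plus the constant."""
    return CONSTANT + sum(
        coef * subject[name] for name, coef in COEFFICIENTS.items()
    )

# Hypothetical subject, not taken from the data file.
subject = {"age": 25, "selfcon": 45, "anxiety": 15,
           "absence": 2, "anti_smoking": 22}

score = discriminant_score(subject)
print(round(score, 3))
```

Higher self-concept and anti-smoking scores push D up, while higher anxiety and more days absent pull it down.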
9. Discriminant analysis
   Question: Which predictors are most important in predicting group membership?
   Practice 2
   A study is set up to determine whether the following variables help to discriminate between those who smoke and those who don't:
     age
     absence (days of absence last year)
     selfcon (self-concept score)
     anxiety (anxiety score)
     anti_smoking (attitude towards anti-smoking policies)
10. Practice 2
    In SPSS: Analyze > Classify > Discriminant
11. Practice 2
    D = (.024 age) + (.080 self-concept) + (-.100 anxiety) + (-.012 days absent) + (.134 anti-smoking score) - 4.543

    Functions at Group Centroids (group means of the D function)
                   Function 1
    non-smoker       1.125
    smoker          -1.598

    Canonical Discriminant Function Coefficients (unstandardized coefficients)
                                       Function 1
    age                                   .024
    self concept score                    .080
    anxiety score                        -.100
    days absent last year                -.012
    total anti-smoking test score         .134
    (Constant)                          -4.543

    Practice 2
    Classification Results (a, c)
                                               Predicted Group Membership
                             smoke or not      non-smoker    smoker     Total
    Original          Count  non-smoker           238           19       257
                             smoker                17          164       181
                      %      non-smoker            92.6          7.4     100.0
                             smoker                 9.4         90.6     100.0
    Cross-validated   Count  non-smoker           238           19       257
    (b)                      smoker                17          164       181
                      %      non-smoker            92.6          7.4     100.0
                             smoker                 9.4         90.6     100.0
    a. 91.8% of original grouped cases correctly classified.
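The percentages in the classification table can be recomputed from the counts, as in the sketch below. It also shows one simple way to turn a D score into a predicted group: assign the subject to the group whose centroid is closer. That nearest-centroid rule is a simplifying assumption made here; the classification SPSS performs also takes prior probabilities into account, so results near the boundary can differ.

```python
# Group centroids of the D function, from the output above.
CENTROIDS = {"non-smoker": 1.125, "smoker": -1.598}

def nearest_centroid(d_score):
    """Simplified rule: pick the group whose centroid is closest to D."""
    return min(CENTROIDS, key=lambda group: abs(d_score - CENTROIDS[group]))

# Classification counts: actual group -> predicted group -> count.
confusion = {
    "non-smoker": {"non-smoker": 238, "smoker": 19},
    "smoker":     {"non-smoker": 17,  "smoker": 164},
}

def hit_rate(table, group):
    """Percentage of cases in `group` that were classified correctly."""
    return 100 * table[group][group] / sum(table[group].values())

total_cases = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[g][g] for g in confusion)
overall = 100 * correct / total_cases

print(nearest_centroid(1.081), nearest_centroid(-2.0))
print(round(hit_rate(confusion, "non-smoker"), 1),
      round(hit_rate(confusion, "smoker"), 1),
      round(overall, 1))
```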
12. Practice 2
    When reporting the result, we should include the following:
      Names of the predictors and the sample size
      Results of the univariate ANOVAs and Box's M test
      The significance of the discriminant function
      The variance explained (canonical correlation coefficient)
      Significant predictors and their contribution to the model (discriminant function)
      Result of the cross-validation process (page 9)
    Logistic regression
    In logistic regression the response (Y) is a dichotomous categorical variable. For example, voting, mortality, and participation data are neither continuous nor normally distributed.
    Binary logistic regression is a type of regression analysis in which the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).
13. Logistic regression
    Models the relationship between a set of variables xi, which may be
      dichotomous (eat: yes/no),
      categorical (social class, ...),
      continuous (age, ...),
    and a dichotomous variable Y.
    Binary logistic regression
    Binary logistic regression is a type of regression analysis in which the dependent variable is a dummy variable (coded 0, 1).
14. Binary dependent variables
    A few examples:
      A consumer chooses a brand (1) or not (0);
      A quality defect occurs (1) or not (0);
      A person is hired (1) or not (0).
    Binary logistic regression
    The logistic regression model is simply a non-linear transformation of the linear regression. The logistic distribution function (a cumulative distribution function) is S-shaped, similar to that of the standard normal distribution, and constrains the estimated probabilities to lie between 0 and 1.
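The S-shape and the 0 to 1 bounds can be checked numerically. The sketch below evaluates the logistic distribution function next to the standard normal one on a small grid; the grid and the function names are arbitrary choices for illustration.

```python
import math

def logistic_cdf(z):
    """Logistic distribution function: maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

grid = [-6, -4, -2, 0, 2, 4, 6]
logistic_values = [logistic_cdf(z) for z in grid]
normal_values = [normal_cdf(z) for z in grid]

for z, lo, no in zip(grid, logistic_values, normal_values):
    print(f"z = {z:>2}: logistic = {lo:.4f}, normal = {no:.4f}")
```

Both curves rise from near 0 to near 1 and pass through .5 at z = 0; the logistic curve approaches its limits more slowly.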
15. Binary logistic regression
    p: the probability of success/event (ranges from 0 to 1)
    1-p: the probability of failure/non-event
    If the probability of success is .8 (80%), what is the probability of failure?
    The odds of success: the ratio of the probability of success to the probability of failure, p/(1-p).
    What are the odds of success in the above situation?
    What can we conclude about the probabilities of success and failure when the odds equal 1?
    Binary logistic regression
    Logistic regression models the logit-transformed probability as a linear function of the predictor variables:
      logit(p) = log(p/(1-p)) = log(odds) = b0 + b1*x1 + ... + bk*xk
    We can also transform the log of the odds back to a probability:
      p = exp(b0 + b1*x1 + ... + bk*xk) / (1 + exp(b0 + b1*x1 + ... + bk*xk))
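The relations between probability, odds and log odds can be worked through in a few lines. This is only a sketch of the formulas above; the function names are chosen for illustration.

```python
import math

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log odds of success."""
    return math.log(odds(p))

def inverse_logit(log_odds):
    """Transform log odds back to a probability."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

p = 0.8
print("probability of failure:", round(1 - p, 3))
print("odds of success:", round(odds(p), 3))
print("log odds:", round(logit(p), 3))
print("back to probability:", round(inverse_logit(logit(p)), 3))

# Odds of 1 correspond to equal probabilities of success and failure.
print("odds when p = .5:", odds(0.5))
```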
16. Binary logistic regression (SPSS output)
    Variables in the Equation
                          B (log odds)   S.E.    Wald    df   Sig.   Exp(B) (odds ratio)
    Step 1a   gender(1)      -.005       .202    .001     1   .981      .995
    If the odds ratio > 1: when the predictor increases, the odds of the event occurring increase.
    If the odds ratio < 1: when the predictor increases, the odds of the event occurring decrease.
    Practice 3
    Conduct a logistic regression to see whether gender is a significant predictor of whether someone is a smoker or a non-smoker.
    In SPSS: Analyze > Regression > Binary Logistic
    The data file is smoker_DA.sav.
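The columns of the output row for gender are linked by simple formulas, sketched below: Exp(B) is the exponential of B, the Wald statistic is (B / S.E.)², and its significance comes from a chi-square distribution with 1 df. Because the inputs are the rounded values printed in the table, the recomputed p-value only approximately matches the printed .981.

```python
import math

b, se = -0.005, 0.202          # B and S.E. for gender(1), as printed

odds_ratio = math.exp(b)       # Exp(B)
wald = (b / se) ** 2           # Wald statistic
# Upper-tail probability of a chi-square variable with 1 df.
p_value = math.erfc(math.sqrt(wald / 2))

print("Exp(B):", round(odds_ratio, 3))
print("Wald:  ", round(wald, 3))
print("Sig.:  ", round(p_value, 2))
```

An odds ratio this close to 1, with a p-value this close to 1, indicates that gender does not help to predict smoking status in this model.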
17. Practice 3
    Conduct a logistic regression to see whether anti-smoking attitude is a significant predictor of whether someone is a smoker or a non-smoker.
    In SPSS: Analyze > Regression > Binary Logistic
    The data file is smoker_DA.sav.
18. Practice 3
    Conduct a logistic regression to see whether the following are significant predictors of whether someone is a smoker or a non-smoker:
      age
      gender
      absence (days of absence last year)
      selfcon (self-concept score)
      anxiety (anxiety score)
      anti_smoking (attitude towards anti-smoking policies)
    When we have no idea about the relative importance of the predictors, we'll choose Stepwise (Forward: LR).
    Practice 3
                                 B          S.E.     Odds Ratio    95% C.I. for Odds Ratio
                                                                   Lower      Upper
    constant                    9.257**    2.050    10480.856
    self-concept                -.260**     .033       .771         .724       .822
    anxiety                      .236**     .036      1.266        1.181      1.357
    absence                      .075*      .030      1.078        1.016      1.144
    anti-smoking test score     -.303**     .075       .739         .638       .85
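The odds ratios and their confidence intervals in this table follow from B and S.E.: the odds ratio is exp(B) and the 95% limits are exp(B ± 1.96 × S.E.). The sketch below recomputes them from the rounded values printed above, so the third decimal can differ slightly from the SPSS output; the use of 1.96 as the normal critical value is an assumption about how the interval was formed.

```python
import math

# (B, S.E.) as printed in the table.
estimates = {
    "self-concept": (-0.260, 0.033),
    "anxiety": (0.236, 0.036),
    "absence": (0.075, 0.030),
    "anti-smoking test score": (-0.303, 0.075),
}

def odds_ratio_with_ci(b, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a 95% interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

results = {name: odds_ratio_with_ci(b, se) for name, (b, se) in estimates.items()}

for name, (or_, lower, upper) in results.items():
    print(f"{name:<24} OR = {or_:.3f}  95% CI = [{lower:.3f}, {upper:.3f}]")
```

None of the intervals contains 1, which agrees with all four predictors being marked as significant.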