Discrete Choice Models
Kosuke Imai, Princeton University
POL573 Quantitative Analysis III, Fall 2016

Recall Binary Logit and Probit Models

- Logit and probit models for a binary outcome $Y_i \in \{0, 1\}$:
$$Y_i \overset{\text{indep.}}{\sim} \text{Bernoulli}(\pi_i), \qquad \pi_i = \frac{\exp(X_i^\top \beta)}{1 + \exp(X_i^\top \beta)} = \frac{1}{1 + \exp(-X_i^\top \beta)}$$
- Logit function: $\text{logit}(\pi_i) \equiv \log(\pi_i / (1 - \pi_i)) = X_i^\top \beta$
- Probit function: $\Phi^{-1}(\pi_i) = X_i^\top \beta$
- [Figure: logit and probit response curves, probability against the linear predictor.] Both links are monotone increasing, symmetric around 0, with maximum slope at 0; logit coefficients are roughly 1.6 times the corresponding probit coefficients.

Latent Variable Interpretation

- The latent variable (the utility): $Y_i^*$
- The model:
$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{if } Y_i^* \le 0 \end{cases}, \qquad Y_i^* = X_i^\top \beta + \epsilon_i \text{ with } \mathbb{E}(\epsilon_i) = 0$$
- Logit: $\epsilon_i \overset{\text{i.i.d.}}{\sim}$ logistic (the density is $\exp(\epsilon_i)/\{1 + \exp(\epsilon_i)\}^2$)
- Probit: $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$
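The rule of thumb on the last bullet can be checked numerically: rescaling the probit linear predictor by $1/1.6$ makes the two response curves nearly coincide. A minimal sketch (NumPy and SciPy assumed available; the grid is arbitrary):

```python
import numpy as np
from scipy.stats import norm

def inv_logit(z):
    # Logistic CDF, i.e. the inverse logit link
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 1001)
logit_curve = inv_logit(z)
# Dividing the probit linear predictor by 1.6 aligns the two curves,
# which is why logit coefficients are roughly 1.6 times probit coefficients.
probit_curve = norm.cdf(z / 1.6)

print(np.max(np.abs(logit_curve - probit_curve)))  # small discrepancy
```

The maximum gap between the two curves over this range is under a few percentage points, which is why the two models give nearly identical fitted probabilities in practice.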
- The variance of $Y_i^*$ is not identifiable
- The cutpoint is not identifiable

Inference with the Logit and Probit Models

- Likelihood and log-likelihood functions:
$$L_n(\beta \mid Y, X) = \prod_{i=1}^n \pi_i^{Y_i} (1 - \pi_i)^{1 - Y_i}$$
$$l_n(\beta \mid Y, X) = \sum_{i=1}^n \{Y_i \log \pi_i + (1 - Y_i) \log(1 - \pi_i)\}$$
- Logit model:
  - Score function: $s_n(\beta) = \sum_{i=1}^n (Y_i - \pi_i) X_i$
  - Hessian: $H_n(\beta) = -\sum_{i=1}^n \pi_i (1 - \pi_i) X_i X_i^\top \le 0$
  - Approximate variance: $\mathbb{V}(\hat\beta_n \mid X) \approx \{\sum_{i=1}^n \pi_i (1 - \pi_i) X_i X_i^\top\}^{-1}$
  - Globally concave

Calculating Quantities of Interest

- Logistic regression coefficients are NOT quantities of interest
- Predicted probability: $\pi(x) = \Pr(Y = 1 \mid X = x) = \frac{\exp(x^\top \beta)}{1 + \exp(x^\top \beta)}$
- Attributable risk (risk difference): $\pi(x_1) - \pi(x_0)$
- Relative risk: $\pi(x_1) / \pi(x_0)$
- Odds and odds ratio: $\frac{\pi(x)}{1 - \pi(x)}$ and $\frac{\pi(x_1)/\{1 - \pi(x_1)\}}{\pi(x_0)/\{1 - \pi(x_0)\}}$
- Average treatment effect: $\mathbb{E}\{\Pr(Y_i = 1 \mid T_i = 1, X_i) - \Pr(Y_i = 1 \mid T_i = 0, X_i)\}$
- MLE: plug in $\hat\beta_n$
- Asymptotic distribution: the Delta method (a bit painful!)
$$\sqrt{n}\,(\hat\pi(x) - \pi(x)) \overset{D}{\longrightarrow} \mathcal{N}\left(0,\; \frac{\pi(x)^2}{\{1 + \exp(x^\top \beta_0)\}^2}\, x^\top I(\beta_0)^{-1} x\right)$$

Application 1: Case-Control Design

- Research design mantra: Don't select on the dependent variable
- But sometimes we want to select on the dependent variable (e.g., rare events): civil war, campaign contributions, protest, lobbying, etc.
- The standard case-control (choice-based sampling) design:
  1. Randomly sample cases
  2. Randomly sample controls
- Under this design, $\Pr(Y_i = 1)$ is known and hence $\Pr(Y_i = 1 \mid X_i)$ is nonparametrically identifiable
- When $\Pr(Y_i = 1)$ is unknown, the odds ratio is still nonparametrically identifiable
- The design extends to the contaminated control design
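The score and Hessian given earlier yield a simple Newton-Raphson fit, and a simulation illustrates the case-control point: subsampling controls leaves the slope (the log odds ratio) essentially unchanged and only shifts the intercept. A sketch on simulated data (the coefficient values and the 5% control-sampling rate are made up for illustration):

```python
import numpy as np

def fit_logit(X, Y, iters=50):
    # Newton-Raphson using s_n(beta) = sum (Y_i - pi_i) X_i and
    # H_n(beta) = -sum pi_i (1 - pi_i) X_i X_i^T from the slides
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        score = X.T @ (Y - p)
        hess = -(X * (p * (1 - p))[:, None]).T @ X
        beta -= np.linalg.solve(hess, score)
    return beta

rng = np.random.default_rng(0)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-4.0, 1.0])))))  # rare outcome

r = 0.05                                    # keep all cases, 5% of controls
keep = (Y == 1) | (rng.uniform(size=n) < r)
b_full, b_cc = fit_logit(X, Y), fit_logit(X[keep], Y[keep])

# Slope is preserved; intercept shifts by about log(1/r)
print(b_full[1], b_cc[1], b_cc[0] - b_full[0])
```

Because the likelihood is globally concave, Newton-Raphson from a zero start converges reliably here.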
Application 2: Ideal Point Estimation

- Originally developed for educational testing: measuring the ability of students based on their exam performance
- Political science application: measuring ideology using roll calls (Poole and Rosenthal; Clinton, Jackman and Rivers)
- Naive approach: count the number of correct answers. The problem: some questions are easier than others
- The model:
$$\Pr(Y_{ij} = 1 \mid x_i, \alpha_j, \beta_j) = \text{logit}^{-1}(-\alpha_j + \beta_j x_i)$$
where $x_i$ is the ideal point, $\alpha_j$ the difficulty parameter, and $\beta_j$ the discrimination parameter
- The key assumption: dimensionality

Connection to the Spatial Theory of Voting

- Quadratic random utilities:
$$U_i(\text{yea}) = -\|x_i - \zeta_j\|^2 + \eta_{ij}, \qquad U_i(\text{nay}) = -\|x_i - \psi_j\|^2 + \nu_{ij}$$
where $\eta_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$ and $\nu_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$
- Latent utility differential:
$$Y_{ij}^* = U_i(\text{yea}) - U_i(\text{nay}) = 2(\zeta_j - \psi_j)^\top x_i - \zeta_j^\top \zeta_j + \psi_j^\top \psi_j + \eta_{ij} - \nu_{ij} = \beta_j^\top x_i - \alpha_j + \epsilon_{ij}$$
- Identification: scale, rotation
- Estimation: EM algorithm, Markov chain Monte Carlo
- Various extensions to surveys, speech, etc.

Ordered Outcome

- The outcome: $Y_i \in \{1, 2, \ldots, J\}$, where the categories are ordered (e.g., $Y_i = 1 \prec Y_i = 3$)
- Assumption: there exists an underlying unidimensional scale
- 5-level Likert scale: strongly disagree, disagree, neither agree nor disagree, agree, strongly agree
- Ordered logistic regression model:
$$\Pr(Y_i \le j \mid X_i) = \frac{\exp(\tau_j - X_i^\top \beta)}{1 + \exp(\tau_j - X_i^\top \beta)}$$
for $j = 1, \ldots, J$.
This implies
$$\pi_j(X_i) \equiv \Pr(Y_i = j \mid X_i) = \frac{\exp(\tau_j - X_i^\top \beta)}{1 + \exp(\tau_j - X_i^\top \beta)} - \frac{\exp(\tau_{j-1} - X_i^\top \beta)}{1 + \exp(\tau_{j-1} - X_i^\top \beta)}$$
- Normalization for identification ($X_i$ includes an intercept): $\tau_0 = -\infty < \tau_1 = 0 < \tau_2 < \cdots < \tau_{J-1} < \tau_J = \infty$
- Generalization of binary logistic regression

Latent Variable Representation

- Random utility: $Y_i^* = X_i^\top \beta + \epsilon_i$ where $\epsilon_i \overset{\text{i.i.d.}}{\sim}$ logistic
- If $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$, then the model becomes ordered probit:
$$\pi_j(X_i) = \Phi(\tau_j - X_i^\top \beta) - \Phi(\tau_{j-1} - X_i^\top \beta)$$
- Normalization for the variance
- The observation mechanism:
$$Y_i = \begin{cases} 1 & \text{if } \tau_0 = -\infty < Y_i^* \le \tau_1, \\ 2 & \text{if } \tau_1 = 0 < Y_i^* \le \tau_2, \\ \vdots & \\ J & \text{if } \tau_{J-1} < Y_i^* < \tau_J = \infty \end{cases}$$

Inference and Quantities of Interest

- Likelihood function:
$$L(\beta, \tau \mid Y, X) = \prod_{i=1}^n \prod_{j=1}^J \left\{\frac{\exp(\tau_j - X_i^\top \beta)}{1 + \exp(\tau_j - X_i^\top \beta)} - \frac{\exp(\tau_{j-1} - X_i^\top \beta)}{1 + \exp(\tau_{j-1} - X_i^\top \beta)}\right\}^{1\{Y_i = j\}}$$
- $\beta$ itself is difficult to interpret; directly calculate the predicted probabilities and other quantities of interest
- Suppose $J = 3$ and $\beta > 0$. Then,
$$\frac{\partial}{\partial X_i} \Pr(Y_i = 1 \mid X_i) < 0, \qquad \frac{\partial}{\partial X_i} \Pr(Y_i = 3 \mid X_i) > 0, \qquad \frac{\partial}{\partial X_i} \Pr(Y_i = 2 \mid X_i) \gtrless 0$$
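The ordered-logit probabilities above are just differences of logistic CDFs at the cutpoints; a minimal sketch (the cutpoint values are made up for illustration):

```python
import numpy as np

def inv_logit(z):
    return 1 / (1 + np.exp(-z))

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities Pr(Y = j) given linear predictor eta = X'beta
    and interior cutpoints tau_1 < ... < tau_{J-1} (tau_0 = -inf, tau_J = inf)."""
    tau = np.concatenate([[-np.inf], cutpoints, [np.inf]])
    cdf = inv_logit(tau - eta)   # Pr(Y <= j); endpoints evaluate to 0 and 1
    return np.diff(cdf)

tau = np.array([0.0, 1.5])       # J = 3 categories; tau_1 = 0 for identification
probs = ordered_logit_probs(0.5, tau)
print(probs, probs.sum())        # three probabilities summing to 1
```

Increasing the linear predictor shifts mass from the lowest category to the highest one, matching the sign pattern above; the middle category can move either way.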
Differential Item Functioning in Survey Research

- Different respondents may interpret the same questions differently
- Cross-national surveys ⇒ cultural differences
- Vague questions ⇒ more room for different interpretations
- Such measurement bias is called differential item functioning (DIF)
- 2002 WHO survey in China and Mexico: "How much say do you have in getting the government to address issues that interest you?" (1) no say at all; (2) little say; (3) some say; (4) a lot of say; (5) unlimited say
- [Figure: distributions of self-assessment scores among respondents in China and Mexico.]

Anchoring to Reduce DIF

- Item Response Theory (IRT) and NOMINATE: how to bridge across chambers, over time, and across different actors?
- Key idea: anchoring responses using the same items
- King et al. (2004) APSR: anchoring vignettes
- "Alison lacks clean drinking water. She and her neighbors are supporting an opposition candidate in the forthcoming elections that has promised to address the issue. It appears that so many people in her area feel the same way that the opposition candidate will defeat the incumbent representative."
- "Jane lacks clean drinking water because the government is pursuing an industrial development plan. In the campaign for an upcoming election, an opposition party has promised to address the issue, but she feels it would be futile to vote for the opposition since the government is certain to win."
- "Moses lacks clean drinking water.
He would like to change this, but he can't vote, and feels that no one in the government cares about this issue. So he suffers in silence, hoping something will be done in the future."

- The respondent was then asked to assess each vignette in the same manner as the self-assessment question: "How much say does Alison/Jane/Moses have in getting the government to address issues that interest him/her?" (1) no say at all; (2) little say; (3) some say; (4) a lot of say; (5) unlimited say
- Plot the relative rank of self against the vignettes: 4 > Alison > 3 > Jane > 2 > Moses > 1
- [Figure: distributions of relative ranks among respondents in China and Mexico.]

Multinomial Outcome

- $Y_i \in \{1, 2, \ldots, J\}$ as before, but the categories are not ordered!
- A generalization of binary/ordered logit/probit
- Example: vote choice (abstain, vote for Dem., vote for Rep.)
- Multinomial logit model:
$$\pi_j(X_i) \equiv \Pr(Y_i = j \mid X_i) = \frac{\exp(X_i^\top \beta_j)}{\sum_{k=1}^J \exp(X_i^\top \beta_k)} = \frac{\exp(X_i^\top \beta_j)}{1 + \sum_{k=1}^{J-1} \exp(X_i^\top \beta_k)}$$
- $\beta_J = 0$ for identification; $\sum_{k=1}^J \pi_k(X_i) = 1$
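The multinomial logit probabilities above are a softmax over the category-specific linear predictors; a minimal sketch (the coefficient values are made up, with $\beta_J = 0$ imposed as in the identification constraint):

```python
import numpy as np

def multinomial_logit_probs(x_i, betas):
    """Pr(Y_i = j) = exp(x_i' beta_j) / sum_k exp(x_i' beta_k).
    betas has one column per category; the last column is fixed at 0."""
    utilities = x_i @ betas
    u = utilities - utilities.max()    # stabilize the softmax numerically
    expu = np.exp(u)
    return expu / expu.sum()

x_i = np.array([1.0, 0.3])                                       # intercept + covariate
betas = np.column_stack([[0.5, -1.0], [0.2, 0.8], [0.0, 0.0]])   # beta_J = 0
p = multinomial_logit_probs(x_i, betas)
print(p, p.sum())   # probabilities over J = 3 choices, summing to 1
```

Note that the ratio `p[0] / p[2]` depends only on those two categories' coefficients, which is the IIA property of this model.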
Latent Variable Representation

- Observation mechanism and model:
$$Y_i = j \iff Y_{ij}^* = \max(Y_{i1}^*, Y_{i2}^*, \ldots, Y_{iJ}^*), \qquad Y_{ij}^* = X_i^\top \beta_j + \epsilon_{ij}$$
- $\epsilon_{ij}$ has a Type I extreme-value distribution:
$$F(\epsilon) = \exp\{-\exp(-\epsilon)\}, \qquad f(\epsilon) = \exp\{-\epsilon - \exp(-\epsilon)\}$$
- McFadden's proof:
$$\Pr(Y_i = j \mid X_i) = \Pr(Y_{ij}^* > Y_{ij'}^* \text{ for all } j' \ne j \mid X_i) = \int_{-\infty}^{\infty} \prod_{j' \ne j} F\{\epsilon_{ij} + X_i^\top(\beta_j - \beta_{j'})\}\, f(\epsilon_{ij})\, d\epsilon_{ij} = \frac{\exp(X_i^\top \beta_j)}{\sum_{j'=1}^J \exp(X_i^\top \beta_{j'})}$$

Conditional Logit Model

- A further generalization:
$$\pi_j(X_{ij}) \equiv \Pr(Y_i = j \mid X_{ij}) = \frac{\exp(X_{ij}^\top \beta)}{\sum_{k=1}^J \exp(X_{ik}^\top \beta)}$$
- Subject-specific and choice-specific covariates, and their interactions
- Multinomial logit model as a special case:
$$X_{i1} = \begin{pmatrix} X_i \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \quad X_{i2} = \begin{pmatrix} 0 \\ X_i \\ \vdots \\ 0 \end{pmatrix}, \quad \ldots, \quad X_{iJ} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ X_i \end{pmatrix}$$
- Some restrictions are necessary for identification: for example, one cannot include a different intercept for each category

Multinomial Probit Model

- IIA (Independence of Irrelevant Alternatives) in the multinomial/conditional logit:
$$\frac{\Pr(Y_i = j \mid X_{ij})}{\Pr(Y_i = j' \mid X_{ij'})} = \exp\{(X_{ij} - X_{ij'})^\top \beta\}$$
- Blue bus vs. red bus; chicken vs. fish
- MNP: allowing for dependence among the errors:
$$Y_i = j \iff Y_{ij}^* = \max(Y_{i1}^*, Y_{i2}^*, \ldots, Y_{iJ}^*), \qquad \underset{J \times 1}{Y_i^*} = \underset{J \times K}{X_i}\,\underset{K \times 1}{\beta} + \underset{J \times 1}{\epsilon_i}, \qquad \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \underset{J \times J}{\Sigma})$$

Identification and Inference

- Two additional steps for identification:
  1. Subtract the $J$th equation from the other equations: $W_i = Z_i \beta + \tilde\epsilon_i$ with $\tilde\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \tilde\Sigma)$,
where $W_{ij} = Y_{ij}^* - Y_{iJ}^*$, $Z_{ij} = X_{ij} - X_{iJ}$, and $\tilde\Sigma = [I_{J-1}, -\mathbf{1}_{J-1}]\, \Sigma\, [I_{J-1}, -\mathbf{1}_{J-1}]^\top$
  2. Set $\tilde\Sigma_{11} = 1$
- Likelihood function for unit $i$ who selects the $J$th category:
$$L(\beta, \tilde\Sigma \mid X_i, Y_i) = \int_{-\infty}^{-Z_{i1}^\top \beta} \cdots \int_{-\infty}^{-Z_{i,J-1}^\top \beta} f(\tilde\epsilon_i \mid \tilde\Sigma)\, d\tilde\epsilon_{i1} \cdots d\tilde\epsilon_{i,J-1}$$
- High-dimensional integration ⇒ Bayesian MCMC

Other Discrete Choice Models

- Nested multinomial logit: modeling the first choice $j \in \{1, 2, \ldots, J\}$ and the second choice given the first, $k \in \{1, 2, \ldots, K_j\}$:
$$\Pr(Y = (j, k) \mid X_i) = \frac{\exp(X_{ijk}^\top \beta)}{\sum_{j'=1}^J \sum_{k'=1}^{K_{j'}} \exp(X_{ij'k'}^\top \beta)}$$
- Multivariate logit/probit: modeling multiple correlated choices $Y_i = (Y_{i1}, Y_{i2}, \ldots, Y_{iJ})$, where $Y_{ij} = 1\{Y_{ij}^* > 0\}$ and
$$Y_i^* = X_i^\top \beta + \epsilon_i \quad \text{where } \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma)$$
with $\Sigma_{jj} = 1$ and $\Sigma_{jj'} = \rho_{jj'}$ for all $j$ and $j' \ne j$

Optimization Using the EM Algorithm

- The Expectation-Maximization algorithm by Dempster, Laird, and Rubin: over 30,000 Google Scholar citations!
- Useful for maximizing the likelihood function with missing data
- Pedagogical reference: S. Jackman (AJPS, 2000)
- Goal: maximize the observed-data log-likelihood $l(\theta \mid Y_{\text{obs}})$
- The EM algorithm: repeat the following steps until convergence
  1. E-step: compute
$$Q(\theta \mid \theta^{(t)}) \equiv \mathbb{E}\{l(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}) \mid Y_{\text{obs}}, \theta^{(t)}\}$$
where $l(\theta \mid Y_{\text{obs}}, Y_{\text{mis}})$ is the complete-data log-likelihood
  2. M-step: find
$$\theta^{(t+1)} = \underset{\theta}{\text{argmax}}\; Q(\theta \mid \theta^{(t)})$$
- The ECM algorithm: the M-step is replaced with multiple conditional maximization steps

Monotone Convergence Property

- The observed-data likelihood increases at each step:
$$l(\theta^{(t+1)} \mid Y_{\text{obs}}) \ge l(\theta^{(t)} \mid Y_{\text{obs}})$$
- Proof:
  1. $l(\theta \mid Y_{\text{obs}}) = \log f(Y_{\text{obs}}, Y_{\text{mis}} \mid \theta) - \log f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta)$
  2. Taking the expectation with respect to
$f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})$:
$$l(\theta \mid Y_{\text{obs}}) = Q(\theta \mid \theta^{(t)}) - \int \log f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta)\, f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})\, dY_{\text{mis}}$$
  3. Finally,
$$l(\theta^{(t+1)} \mid Y_{\text{obs}}) - l(\theta^{(t)} \mid Y_{\text{obs}}) = Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \int \log \frac{f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})}{f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t+1)})}\, f(Y_{\text{mis}} \mid Y_{\text{obs}}, \theta^{(t)})\, dY_{\text{mis}} \ge 0$$
- Stable, and no derivatives required

Application 1: A Finite Mixture Model

- Used for flexible modeling and clustering in statistics
- Building block for many advanced models
- Can be used to test competing theories (Imai & Tingley AJPS)
- $M$ competing theories, each of which implies a statistical model $f_m(y \mid x, \theta_m)$ for $m = 1, \ldots, M$
- The data generating process:
$$Y_i \mid X_i, Z_i \sim f_{Z_i}(Y_i \mid X_i, \theta_{Z_i})$$
where $Z_i$ is the latent variable indicating the theory that generates observation $i$
- Probability that an observation is generated by theory $m$: $\pi_m(X_i, \psi_m) = \Pr(Z_i = m \mid X_i)$

Observed-Data and Complete-Data Likelihoods

- The observed-data likelihood function:
$$L_{\text{obs}}(\theta, \psi \mid \{X_i, Y_i\}_{i=1}^N) = \prod_{i=1}^N \left\{\sum_{m=1}^M \pi_m(X_i, \psi_m)\, f_m(Y_i \mid X_i, \theta_m)\right\}$$
- The complete-data likelihood function:
$$L_{\text{com}}(\theta, \psi \mid \{X_i, Y_i, Z_i\}_{i=1}^N) = \prod_{i=1}^N \pi_{Z_i}(X_i, \psi_{Z_i})\, f_{Z_i}(Y_i \mid X_i, \theta_{Z_i}) = \prod_{i=1}^N \prod_{m=1}^M \{\pi_m(X_i, \psi_m)\, f_m(Y_i \mid X_i, \theta_m)\}^{1\{Z_i = m\}}$$

Estimation via the EM Algorithm

- The E-step:
$$\zeta_{i,m}^{(t-1)} = \Pr(Z_i = m \mid \theta^{(t-1)}, \psi^{(t-1)}, \{X_i, Y_i\}_{i=1}^N) = \frac{\pi_m(X_i, \psi_m^{(t-1)})\, f_m(Y_i \mid X_i, \theta_m^{(t-1)})}{\sum_{m'=1}^M \pi_{m'}(X_i, \psi_{m'}^{(t-1)})\, f_{m'}(Y_i \mid X_i, \theta_{m'}^{(t-1)})}$$
and thus the $Q$ function is
$$\sum_{i=1}^N \sum_{m=1}^M \zeta_{i,m}^{(t-1)} \left\{\log \pi_m(X_i, \psi_m) + \log f_m(Y_i \mid X_i, \theta_m)\right\}$$
- The M-step: $2M$ weighted regressions!
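The E- and M-steps above can be sketched for the simplest case: a two-component normal mixture with known unit variances and a constant mixing weight (no covariates). The data and starting values are made up for illustration, and the running assertion checks the monotone convergence property proved earlier.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Two-component Gaussian mixture: 30% from N(-2, 1), 70% from N(2, 1)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 700)])

pi_w, mu = 0.5, np.array([-1.0, 1.0])
loglik_old = -np.inf
for _ in range(200):
    # E-step: responsibilities zeta_im = Pr(Z_i = m | y_i, current params)
    dens = np.column_stack([(1 - pi_w) * norm.pdf(y, mu[0], 1),
                            pi_w * norm.pdf(y, mu[1], 1)])
    zeta = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted means and updated mixing weight
    mu = (zeta * y[:, None]).sum(axis=0) / zeta.sum(axis=0)
    pi_w = zeta[:, 1].mean()
    # Monotone convergence: observed-data log-likelihood never decreases
    loglik = np.log(dens.sum(axis=1)).sum()
    assert loglik >= loglik_old - 1e-8
    loglik_old = loglik

print(mu, pi_w)   # component means and mixing weight recovered
```

With covariates, the M-step would instead run the $2M$ weighted regressions described above.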
Application 2: Sample Selection Model

- Non-random sampling: $S_i = 1$ if unit $i$ is in the sample, $S_i = 0$ otherwise
- The outcome model: $Y_i = X_i^\top \beta + \epsilon_i$
- The selection model: $S_i = 1\{S_i^* > 0\}$ with $S_i^* = X_i^\top \delta + \eta_i$
- Selection bias:
$$\mathbb{E}(Y_i \mid X_i, S_i = 1) = X_i^\top \beta + \mathbb{E}(\epsilon_i \mid X_i, \eta_i > -X_i^\top \delta) \ne \mathbb{E}(Y_i \mid X_i)$$
- Inverse Mills ratio, under normality $\binom{\epsilon_i}{\eta_i} \overset{\text{i.i.d.}}{\sim} \mathcal{N}\left[\binom{0}{0}, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right]$:
$$\mathbb{E}(\epsilon_i \mid X_i, \eta_i > -X_i^\top \delta) = \rho\sigma\, \frac{\phi(X_i^\top \delta)}{\Phi(X_i^\top \delta)}$$
- Sample selection as a specification error
- An exclusion restriction is needed for nonparametric identification
- Sensitive to changes in the assumptions: $Y_i$ is unobserved for $S_i = 0$

Application to Heckman's Selection Model

- Bivariate normal as the complete-data model:
$$\begin{pmatrix} Y_i \\ S_i^* \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} X_i^\top \beta \\ X_i^\top \delta \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho\sigma \\ \rho\sigma & 1 \end{pmatrix}\right)$$
- Factoring the bivariate normal:
$$\underbrace{\mathcal{N}(X_i^\top \delta,\, 1)}_{f(S_i^* \mid X_i)} \qquad \underbrace{\mathcal{N}(X_i^\top \beta + \rho\sigma(S_i^* - X_i^\top \delta),\; \sigma^2(1 - \rho^2))}_{f(Y_i \mid S_i^*, X_i)}$$
- The E-step:
  - Sufficient statistics: $Y_i, Y_i^2$ for $S_i = 0$, and $S_i^*, S_i^{*2}, Y_i S_i^*$ for all
  - Compute the conditional expectation of the sufficient statistics given all observed data and the parameters
- The M-step:
  - Run two regressions with the results from the E-step
  - Regress $S_i^*$ on $X_i$; regress $Y_i$ on $S_i^*$ and $X_i$

Application 3: Topic Modeling

- Clustering documents based on topics:
  - word or term: basic unit of data
  - document: a sequence of words
  - corpus: a collection of documents
- Supervised and unsupervised learning approaches
- The bag-of-words assumption: term-document matrix
$$\text{tf-idf}(w, d) = \text{tf}(w, d) \times \text{idf}(w)$$
  1. term frequency $\text{tf}(w, d)$: frequency of term $w$ in document $d$
  2. document frequency $\text{df}(w)$: number of documents that contain term $w$
  3. inverse document frequency $\text{idf}(w) = \log \frac{N}{\text{df}(w)}$, where $N$ is the total number of documents.
- tf-idf$(w, d)$ is:
  1. highest when $w$ occurs many times within a small number of documents
  2. lower when $w$ occurs fewer times in a document, or occurs in many documents

Latent Dirichlet Allocation (LDA)

- Probabilistic modeling ⇒ statistical inference
- Notation:
  - Documents: $i = 1, \ldots, N$
  - Number of words in document $i$: $M_i$
  - A sequence of words in document $i$: $W_{i1}, \ldots, W_{iM_i}$
  - Number of unique words: $K$
  - Latent topic for the $j$th word in document $i$: $Z_{ij} \in \{1, \ldots, L\}$
- topic = a probability distribution over word frequencies
- a document belongs…
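The tf-idf weighting defined above can be computed directly from a term-document matrix of raw counts; a toy sketch (the documents are made up for illustration):

```python
import numpy as np

docs = [["vote", "vote", "model"],
        ["vote", "model"],
        ["model", "choice", "choice"]]
N = len(docs)
vocab = sorted({w for d in docs for w in d})   # ['choice', 'model', 'vote']

# Term-document matrix of raw counts (the bag-of-words representation)
tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
df = (tf > 0).sum(axis=0)          # number of documents containing each term
tfidf = tf * np.log(N / df)        # tf-idf(w, d) = tf(w, d) * log(N / df(w))

# "model" appears in every document, so its idf -- and hence its weight -- is 0
print(dict(zip(vocab, np.round(tfidf[0], 3))))
```

This matches the two properties above: a term concentrated in few documents gets a large weight, while a term appearing everywhere is weighted down to zero.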