Numerical Descriptive Techniques

  • Published on
    04-Jan-2016

DESCRIPTION

Numerical Descriptive Techniques, Chapter 4. Introduction: Recall Chapter 2, where we used graphical techniques to describe data. While this histogram provides some new insight, other interesting questions (e.g., what is the class average? what is the mark spread?) go unanswered. (PowerPoint presentation)

Transcript

<ul><li><p>Numerical Descriptive Techniques: Chapter 4</p>
<p>2007()</p>
<p>Introduction: Recall Chapter 2, where we used graphical techniques to describe data. While this histogram provides some new insight, other interesting questions (e.g., what is the class average? what is the mark spread?) go unanswered.</p>
<p>Numerical Descriptive Techniques</p>
<p>Measures of Central Location: Mean, Median, Mode</p>
<p>Measures of Variability: Range, Standard Deviation, Variance, Coefficient of Variation</p>
<p>Measures of Relative Standing: Percentiles, Quartiles</p>
<p>Measures of Linear Relationship: Covariance, Correlation, Least Squares Line</p>
<p>4.1 Measures of Central Location</p>
<p>Usually, we focus our attention on two types of measures when describing population characteristics: central location (e.g., the average) and variability (spread).</p>
<p>The measure of central location reflects the locations of all the actual data points. How? With two data points, the central location should fall in the middle between them (in order to reflect the location of both of them). But if a third data point appears on the left-hand side of the midrange, it should pull the central location to the left.</p>
<p>The Arithmetic Mean</p>
<p>This is the most popular and useful measure of central location.</p>
<p>Notation</p>
<p>When referring to the number of observations in a population, we use the uppercase letter \(N\).</p>
<p>When referring to the number of observations in a sample, we use the lowercase letter \(n\).</p>
<p>The arithmetic mean for a population is denoted by the Greek letter mu: \(\mu\).</p>
<p>The arithmetic mean for a sample is denoted by \(\bar{x}\) (x-bar).</p>
<p>Statistics is a pattern language</p>
<p>Size: population \(N\), sample \(n\).</p>
<p>Mean: population \(\mu\), sample \(\bar{x}\).</p>
<p>The Arithmetic Mean</p>
<p>Sample mean: \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\), where \(n\) is the sample size.</p>
<p>Population mean: \(\mu = \frac{\sum_{i=1}^{N} x_i}{N}\), where \(N\) is the population size.</p>
<p>The Arithmetic Mean: Example 4.1</p>
<p>The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.</p>
<p>\(\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \cdots + 22}{10} = \frac{110}{10} = 11.0\) hours.</p>
<p>Additional Example: The Arithmetic Mean</p>
<p>\(\text{mean} = \frac{42.19 + 38.45 + \cdots + 45.77}{\text{number of observations}} = 43.59\)</p>
<p>The arithmetic mean is appropriate for describing measurement data (e.g., heights of people, marks on student papers).</p>
<p>It is seriously affected by extreme values, called outliers. For example, as soon as a billionaire moves into a neighborhood, the average household income rises above what it was previously.</p>
<p>The Median</p>
<p>The median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.</p>
<p>Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is the middle value, 8.</p>
<p>Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is the average of the two middle values, \((8 + 9)/2 = 8.5\).</p>
<p>The Mode</p>
<p>The mode of a set of observations is the value that occurs most frequently. A set of data may have one mode (or modal class), or two or more modes. For large data sets, the modal class is much more relevant than a single-value mode.</p>
<p>The Mode: Example 4.5</p>
<p>Find the mode for the data in Example 4.1. Here are the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22.</p>
<p>Solution: All observations except 0 occur once; 0 occurs twice. Thus, the mode is zero. Is this a good measure of central location? The value 0 does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).</p>
<p>The Mode: Additional Example</p>
<p>The manager of a men's store observes the waist sizes (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40. The mode of this data set is 34 in. This information seems valuable (for example, for the design of a new display in the store), much more so than the fact that the median is 33.5 in. 
</p>
<p>Measures of Central Location</p>
<p>The mode is useful for all data types, though it is mainly used for nominal data.</p>
<p>Sample and population modes are computed the same way.</p>
<p>=MODE(range) in Excel</p>
<p>Note: if you are using Excel for your data analysis and your data is multimodal (i.e., there is more than one mode), Excel only calculates the smallest one.</p>
<p>You will have to use other techniques (e.g., a histogram) to determine whether your data is bimodal, trimodal, etc.</p>
<p>The Mean, Median, and Mode: Additional Example</p>
<p>A professor of statistics wants to report the results of a midterm exam taken by 100 students. The mean of the test marks is 73.90, the median is 81, and the mode is 84. Describe the information each one provides.</p>
<p>The mean provides information about the overall performance level of the class. It can serve as a tool for making comparisons with other classes and/or other exams.</p>
<p>The median indicates that half of the class received a grade below 81% and half received a grade above 81%. A student can use this statistic to place his or her mark relative to other students in the class.</p>
<p>The mode must be used when data are nominal. If marks are classified by letter grade, the frequency of each grade can be calculated. Then, the mode becomes a logical measure to compute.</p>
<p>Relationship among Mean, Median, and Mode</p>
<p>If a distribution is symmetrical, the mean, median, and mode coincide.</p>
<p>If a distribution is asymmetrical, and skewed to the left or to the right, the three measures differ.</p>
<p>Figure: a positively skewed distribution (skewed to the right), with the mean, median, and mode marked.</p>
<p>Figure: a negatively skewed distribution (skewed to the left), with the mean, median, and mode marked.</p>
<p>Mean, Median, Mode</p>
<p>If data are symmetric, the mean, median, and mode will be approximately the same.</p>
<p>If data are multimodal, report the mean, median, and/or mode for each subgroup.</p>
<p>If data are skewed, report the median.</p>
<p>Mean, Median, &amp; Modes for Ordinal &amp; Nominal Data</p>
<p>For ordinal and nominal data, the calculation of the mean is not valid.</p>
<p>The median is appropriate for ordinal data.</p>
<p>For nominal data, a mode calculation is useful for determining the highest frequency but not central location.</p>
<p>The Geometric Mean</p>
<p>This is a measure of the average growth rate. Let \(R_i\) denote the rate of return in period \(i\) (\(i = 1, 2, \ldots, n\)). 
The geometric mean of the returns \(R_1, R_2, \ldots, R_n\) is the constant \(R_g\) that produces the same terminal wealth at the end of period \(n\) as do the actual returns for the \(n\) periods.</p>
<p>The Geometric Mean</p>
<p>For the given series of rates of return, the \(n\)th-period return is calculated by \((1+R_1)(1+R_2)\cdots(1+R_n)\).</p>
<p>If the rate of return were \(R_g\) in every period, the \(n\)th-period return would be calculated by \((1+R_g)^n\).</p>
<p>\(R_g\) is selected such that \((1+R_g)^n = (1+R_1)(1+R_2)\cdots(1+R_n)\), that is, \(R_g = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1\).</p>
<p>Finance Example</p>
<p>Suppose a 2-year investment of $1,000 grows by 100% to $2,000 in the first year, but loses 50%, from $2,000 back to the original $1,000, in the second year. What is your average return?</p>
<p>Using the arithmetic mean, we have \(\frac{100\% + (-50\%)}{2} = 25\%\).</p>
<p>A 25% return compounded over two years would imply $1,000 × 1.25² = $1,562.50 at the end of our investment, not $1,000.</p>
<p>Solving for the geometric mean yields \(R_g = \sqrt[2]{\prod_{i=1}^{2}(1+R_i)} - 1 = \sqrt{(1+1)(1-0.5)} - 1 = 0\), a rate of 0%, which is correct. The uppercase Greek letter Pi (\(\prod\)) represents a product of terms.</p>
<p>The Geometric Mean: Additional Example</p>
<p>A firm's sales were $1,000,000 three years ago. Sales have grown annually by 20%, 10%, and -5%. Find the geometric mean rate of growth in sales.</p>
<p>Solution: Since \(R_g\) is the geometric mean, \((1+R_g)^3 = (1+0.20)(1+0.10)(1-0.05) = 1.2540\). Thus, \(R_g = \sqrt[3]{1.2540} - 1 \approx 0.0784\), or about 7.84% per year.</p>
<p>Measures of Central Location: Summary</p>
<p>Compute the mean to describe the central location of a single set of interval data.</p>
<p>Compute the median to describe the central location of a single set of interval or ordinal data.</p>
<p>Compute the mode to describe a single set of nominal data.</p>
<p>Compute the geometric mean to describe a single set of interval data based on growth rates.</p>
<p>4.2 Measures of Variability</p>
<p>Measures of central location fail to tell the whole story about the distribution. A question of interest still remains unanswered: how much are the observations spread out around the mean value?</p>
<p>Observe two hypothetical data sets. With small variability, the average value provides a good representation of the observations in the data set. With larger variability, the same average value does not provide as good a representation of the observations in the data set as before.</p>
<p>For example, two sets of class grades are shown. The mean (= 50) is the same in each case.</p>
<p>But the red class has greater variability than the blue class.</p>
<p>The Range</p>
<p>The range of a set of observations is the difference between the largest and smallest observations. It is the simplest measure of variability, calculated as:</p>
<p>Range = Largest observation - Smallest observation</p>
<p>Its major advantage is the ease with which it can be computed.</p>
<p>Its major shortcoming is its failure to provide information on the dispersion of the observations between the two end points. But how do all the observations spread out between the smallest and the largest observation? The range cannot assist in answering this question.</p>
<p>E.g., Data: {4, 4, 4, 4, 50}, Range = 46. Data: {4, 8, 15, 24, 39, 50}, Range = 46. The range is the same in both cases, but the data sets have very different distributions.</p>
<p>Hence we need a measure of variability that incorporates all the data, not just two observations: the variance.</p>
<p>Variance</p>
<p>Variance and its related measure, standard deviation, are arguably the most important statistics. 
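Used to measure variability, they also play a vital role in almost all statistical inference procedures.</p>
<p>Population variance is denoted by \(\sigma^2\) (lowercase Greek letter sigma, squared).</p>
<p>Sample variance is denoted by \(s^2\) (lowercase s, squared).</p>
<p>Statistics is a pattern language</p>
<p>Size: population \(N\), sample \(n\).</p>
<p>Mean: population \(\mu\), sample \(\bar{x}\).</p>
<p>Variance: population \(\sigma^2\), sample \(s^2\).</p>
<p>The Variance</p>
<p>Population variance: \(\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}\).</p>
<p>Sample variance: \(s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\).</p>
<p>The Variance: Example 4.7</p>
<p>The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.</p>
<p>Solution: \(\bar{x} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14\) jobs, and \(s^2 = \frac{(17-14)^2 + (15-14)^2 + (23-14)^2 + (7-14)^2 + (9-14)^2 + (13-14)^2}{6-1} = \frac{166}{5} = 33.2\) jobs².</p>
<p>The Variance: Shortcut Method</p>
<p>\(s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\). For Example 4.7, \(s^2 = \frac{1}{5}\left[1342 - \frac{84^2}{6}\right] = \frac{166}{5} = 33.2\).</p>
<p>Why not use the sum of deviations?</p>
<p>Consider two small populations. Population A: 8, 9, 10, 11, 12. Population B: 4, 7, 10, 13, 16. The mean of both populations is 10, but the measurements in B are more dispersed than those in A. A measure of dispersion should agree with this observation.</p>
<p>Can the sum of deviations be a good measure of dispersion?</p>
<p>Deviations in A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2.</p>
<p>Deviations in B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6.</p>
<p>The sum of deviations is zero for both populations; therefore, it is not a good measure of dispersion.</p>
<p>The Variance</p>
<p>Let us calculate the variance of the two populations: \(\sigma_A^2 = \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} = 2\) and \(\sigma_B^2 = \frac{(-6)^2 + (-3)^2 + 0^2 + 3^2 + 6^2}{5} = 18\).</p>
<p>Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of variation instead? After all, the sum of squared deviations increases in magnitude when the variation of a data set increases!</p>
<p>Which data set has a larger dispersion? Data set A has ten observations, five at 1 and five at 3 (mean 2). Data set B has two observations, 1 and 5 (mean 3). Data set B is more dispersed around the mean.</p>
<p>Let us calculate the sum of squared deviations for both data sets.</p>
<p>\(\text{Sum}_A = 10\) and \(\text{Sum}_B = 8\), so Sum<sub>A</sub> &gt; Sum<sub>B</sub>. This is inconsistent with the observation that set B is more dispersed. 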
</p>
<p>However, when calculated on a per-observation basis (variance), the data set dispersions are properly ranked: \(\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1\) and \(\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4\).</p>
<p>Standard Deviation</p>
<p>The standard deviation of a set of observations is the square root of the variance: \(s = \sqrt{s^2}\) for a sample and \(\sigma = \sqrt{\sigma^2}\) for a population.</p>
<p>Statistics is a pattern language</p>
<p>Size: population \(N\), sample \(n\).</p>
<p>Mean: population \(\mu\), sample \(\bar{x}\).</p>
<p>Variance: population \(\sigma^2\), sample \(s^2\).</p>
<p>Standard deviation: population \(\sigma\), sample \(s\).</p>
<p>Standard Deviation: Example 4.8</p>
<p>To examine the consistency of shots for a new, innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club and 75 with the new club. The distances were recorded. Which 7-iron is more consistent?</p>
<p>Standard Deviation: Example 4.8 Solution</p>
<p>Excel printout, from the Descriptive Statistics sub-menu. The innovation club is more consistent (its standard deviation is smaller), and because the means are close, it is considered the better club.</p>
<p>Mean: Current 150.5466666667, Innovation 150.1466666667</p>
<p>Standard Error: Current 0.6688145579, Innovation 0.3570112842</p>
<p>Median: Current 151, Innovation 150</p>
<p>Mode: Current 150, Innovation 149</p>
<p>Standard Deviation: Current 5.792103976, Innovation 3.0918084157</p>
<p>Sample Variance: Current 33.5484684685, Innovation 9.5592792793</p>
<p>Kurtosis: Current 0.1267395862, Innovation -0.8854179947</p>
<p>Skewness: Current -0.4298882891, Innovation 0.1773377326</p>
<p>Range: Current 28, Innovation 12</p>
<p>Minimum: Current 134, Innovation 144</p>
<p>Maximum: Current 162, Innovation 156</p>
<p>Sum: Current 11291, Innovation 11261</p>
<p>Count: Current 75, Innovation 75</p>
<p>Sheet1 data, Current: 153, 151, 153, 158, 142, 146, 156, 149, 150, 142, 153, 159, 146, 141, 155, 152, 134, 160, 158, 149, 154, 143, 154, 150, 145, 150, 138, 145, 157, 155, 141, 162, 160, 151, 151, 150, 156, 145, 150, 150, 148, 152, 150, 152, 150, 150, 152, 145, 148, 157, 156, 155, 151, 148, 149, 150, 143, 148, 148, 155, 155, 153, 151, 153, 144, 156, 155, 144, 145, 149, 157, 161, 159, 137, 151</p>
<p>Sheet1 data, Innovation: 150, 149, 155, 145, 153, 153, 154, 147, 152, 149, 152, 150, 148, 152, 147, 151, 147, 148, 152, 149, 147, 153, 156, 150, 147, 148, 150, 152, 146, 149, 151, 154, 145, 155, 148, 148, 155, 148, 144, 146, 145, 149, 151, 146, 150, 156, 147, 148, 149, 153, 153, 149, 155, 151, 147, 152, 147, 151, 151, 151, 155, 151, 154, 156, 149, 149, 149, 155, 148, 152, 148, 154, 152, 147, 146</p>
<p>Standard Deviation: Additional Example</p>
<p>Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk? 
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5. Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4.</p>
<p>Standard Deviation: Solution</p>
<p>Let us use the Excel printout that is run from the Descriptive Statistics sub-menu. Fund A should be considered riskier because its standard deviation is larger.</p>
<p>Mean: Fund A 16, Fund B 12</p>
<p>Standard Error: Fund A 5.2947143455, Fund B 3.1523536181</p>
<p>Median: Fund A 14.6, Fund B 11.75</p>
<p>Mode: none reported for either fund</p>
<p>Standard Deviation: Fund A 16.7433568916, Fund B 9.9686174234</p>
<p>Sample Variance: Fund A 280.34, Fund B 99.3733333333</p>
<p>Kurtosis: Fund A -1.3419311008, Fund B -0.4639392636</p>
<p>Skewness: Fund A 0.2169714115, Fund B 0.1069521064</p>
<p>Range: Fund A 49.1, Fund B 30.6</p>
<p>Minimum: Fund A -6.2, Fund B -2.8</p>
<p>Maximum: Fund A 42.9, Fund B 27.8</p>
<p>Sum: Fund A 160, Fund B 120</p>
<p>Count: Fund A 10, Fund B 10</p>
<p>Interpreting Standard Deviation</p>
<p>The standard deviation can be used to compare the variability of several distributions and to make a statement about the general shape of a distribution.</p>
<p>The empirical rule applies if a sample of observations has a bell-shaped distribution.</p>
<p>The Empirical Rule</p>
<p>Approximately 68% of all observations fall within one standard deviation of the mean.</p>
<p>Approximately 95% of all observations fall within two standard deviations of the mean.</p>
<p>Approximately 99.7% of all obs...</p></li></ul>
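
As a rough check on the worked examples in the transcript, the following Python sketch recomputes the mean, median, and mode for the Example 4.1 data, the sample variance for Example 4.7 (by the definition and by the shortcut method), and the geometric mean for the two growth-rate examples. It is not part of the original slides: the helper names `geometric_mean_return` and `shortcut_sample_variance` are illustrative, and the code assumes Python 3.8 or later for `math.prod`.

```python
import math
import statistics

# Data copied from the transcript's examples.
internet_hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]  # Example 4.1
jobs_applied = [17, 15, 23, 7, 9, 13]                # Example 4.7
sales_growth = [0.20, 0.10, -0.05]                   # sales growth example
finance_returns = [1.00, -0.50]                      # +100%, then -50%


def geometric_mean_return(rates):
    """Return Rg such that (1 + Rg)**n equals the product of (1 + R_i)."""
    growth = math.prod(1 + r for r in rates)
    return growth ** (1 / len(rates)) - 1


def shortcut_sample_variance(xs):
    """Sample variance via the shortcut: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


# Measures of central location for Example 4.1.
mean_hours = statistics.mean(internet_hours)
median_hours = statistics.median(internet_hours)
mode_hours = statistics.mode(internet_hours)

# Sample variance (divides by n - 1) and standard deviation for Example 4.7.
sample_var = statistics.variance(jobs_applied)
sample_sd = math.sqrt(sample_var)

print("mean:", mean_hours, "median:", median_hours, "mode:", mode_hours)
print("sample variance:", sample_var, "shortcut:", shortcut_sample_variance(jobs_applied))
print("sample standard deviation:", round(sample_sd, 4))
print("geometric mean, finance example:", geometric_mean_return(finance_returns))
print("geometric mean, sales example:", round(geometric_mean_return(sales_growth), 4))
```

Run as written, this should agree with the figures in the slides: a mean of 11, a median of 8.5, and a mode of 0 for Example 4.1; a sample variance of 33.2 for Example 4.7 under both methods; a geometric mean of 0 for the finance example; and roughly 0.0784 (about 7.84% per year) for the sales example.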