Anscombe's quartet explained

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".[1]

Data

For all four datasets:

Property                                                      Value               Accuracy
Mean of x                                                     9                   exact
Sample variance of x: s_x^2                                   11                  exact
Mean of y                                                     7.50                to 2 decimal places
Sample variance of y: s_y^2                                   4.125               ±0.003
Correlation between x and y                                   0.816               to 3 decimal places
Linear regression line                                        y = 3.00 + 0.500x   to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression: R^2    0.67                to 2 decimal places
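
For reference, the properties listed above can be written out with the standard textbook definitions for n paired observations (here n = 11). This is a sketch of the usual formulas, not a quotation from Anscombe's paper; the symbols a and b for the intercept and slope are notation assumed here.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, &
s_x^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \\
\bar{y} &= \frac{1}{n}\sum_{i=1}^{n} y_i, &
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2, \\
r &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}, &
R^2 &= r^2, \\
b &= r\,\frac{s_y}{s_x}, &
a &= \bar{y} - b\,\bar{x}, \qquad \hat{y} = a + b x .
\end{aligned}
```

Because the fitted line and R^2 are functions of the two means, the two variances and r alone, any datasets that agree on those five quantities necessarily share the same regression line and coefficient of determination. With the rounded values above, a = 7.50 - 0.500 × 9 = 3.00.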

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.[2] [3] [4] [5] [6]

The datasets are as follows. The x values are the same for the first three datasets.[1]

Anscombe's quartet
Dataset I      Dataset II     Dataset III    Dataset IV
   x      y       x      y       x      y       x      y
10.0   8.04    10.0   9.14    10.0   7.46     8.0   6.58
 8.0   6.95     8.0   8.14     8.0   6.77     8.0   5.76
13.0   7.58    13.0   8.74    13.0  12.74     8.0   7.71
 9.0   8.81     9.0   8.77     9.0   7.11     8.0   8.84
11.0   8.33    11.0   9.26    11.0   7.81     8.0   8.47
14.0   9.96    14.0   8.10    14.0   8.84     8.0   7.04
 6.0   7.24     6.0   6.13     6.0   6.08     8.0   5.25
 4.0   4.26     4.0   3.10     4.0   5.39    19.0  12.50
12.0  10.84    12.0   9.13    12.0   8.15     8.0   5.56
 7.0   4.82     7.0   7.26     7.0   6.42     8.0   7.91
 5.0   5.68     5.0   4.74     5.0   5.73     8.0   6.89
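
The summary statistics can be checked directly from the values in the table. The following is a minimal sketch using only the Python standard library; the names QUARTET and summarize are illustrative choices made here, not part of any published code.

```python
from statistics import mean, variance

# x values shared by datasets I-III, and the x values of dataset IV.
X123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
X4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

# Each entry maps a dataset label to its (x values, y values).
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the descriptive statistics of one dataset as a dict."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    vx, vy = variance(xs), variance(ys)  # sample variances (divide by n - 1)
    # Sample covariance between x and y.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    r = sxy / (vx * vy) ** 0.5           # Pearson correlation coefficient
    slope = sxy / vx                     # least-squares slope of y on x
    intercept = my - slope * mx          # least-squares intercept
    return {
        "mean_x": mx, "var_x": vx, "mean_y": my, "var_y": vy,
        "r": r, "slope": slope, "intercept": intercept, "r2": r * r,
    }


for name, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows should agree with one another, and with the property table, to within the stated accuracy, even though the underlying point patterns differ.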

It is not known how Anscombe created his datasets.[7] Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.[7] [8] One of these, the Datasaurus dozen, consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.[9] [10]
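
One elementary way to see that such datasets are easy to construct is sketched below. It is not Anscombe's (unknown) method and not the method of any of the cited papers; it is a simple construction assumed here for illustration. Any pattern of values is reduced to its residuals from a straight-line fit on x, so that the residuals are uncorrelated with x, and those residuals are then rescaled and added to the target line. The function name match_statistics and its parameters are hypothetical.

```python
from statistics import mean, variance


def match_statistics(xs, shape, intercept=3.0, slope=0.5, var_y=4.125):
    """Build y values with a prescribed regression line and sample variance.

    `shape` is any list of numbers (same length as xs) that supplies the
    visual pattern; it must not be an exact linear function of xs.
    """
    n = len(xs)
    mx, ms = mean(xs), mean(shape)
    vx = variance(xs)
    # Slope of the least-squares fit of `shape` on x.
    b_shape = sum((x - mx) * (s - ms) for x, s in zip(xs, shape)) / ((n - 1) * vx)
    # Residuals of that fit: zero mean and zero covariance with x.
    resid = [(s - ms) - b_shape * (x - mx) for x, s in zip(xs, shape)]
    v_resid = variance(resid)
    spare = var_y - slope ** 2 * vx  # variance left over for the residuals
    if v_resid <= 0 or spare < 0:
        raise ValueError("shape is linear in x, or var_y is too small")
    # var(y) = slope^2 * var(x) + c^2 * var(resid), so solve for c.
    c = (spare / v_resid) ** 0.5
    return [intercept + slope * x + c * e for x, e in zip(xs, resid)]


# Example: a parabolic pattern over x = 4, 5, ..., 14.
xs = [float(x) for x in range(4, 15)]
ys = match_statistics(xs, [(x - 9.0) ** 2 for x in xs])
```

Because the residuals have zero mean and zero covariance with x, the resulting y values have mean intercept + slope × mean(x), least-squares slope equal to slope, and sample variance equal to var_y, whatever pattern is supplied. With the default arguments and x values whose mean is 9 and sample variance is 11, these match the quartet's nominal statistics.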

Notes and References

  1. Anscombe, F. J. (1973). Graphs in Statistical Analysis. 27 (1): 17–21. doi:10.1080/00031305.1973.10478966. JSTOR 2682899.
  2. Elert, Glenn (2021). Linear Regression. The Physics Hypertextbook. Archived 2020-10-01 at https://web.archive.org/web/20201001193224/http://physics.info/linear-regression/practice.shtml#4. Retrieved 2017-02-23.
  3. Janert, Philipp K. (2010). Data Analysis with Open Source Tools. pp. 65–66. ISBN 978-0-596-80235-6.
  4. Chatterjee, Samprit; Hadi, Ali S. (2006). Regression Analysis by Example. John Wiley and Sons. p. 91. ISBN 0-471-74696-7.
  5. Saville, David J.; Wood, Graham R. (1991). Statistical Methods: The geometric approach. p. 418. ISBN 0-387-97517-9.
  6. Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.
  7. Chatterjee, Sangit; Firat, Aykut (2007). Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset. 61 (3): 248–254. doi:10.1198/000313007X220057. JSTOR 27643902. S2CID 121163371.
  8. Matejka, Justin; Fitzmaurice, George (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 1290–1294. doi:10.1145/3025453.3025912. ISBN 9781450346559. S2CID 9247543.
  9. Matejka, Justin; Fitzmaurice, George (2017). Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. Autodesk Research. Archived 2020-10-04 at https://web.archive.org/web/20201004003855/https://www.autodesk.com/research/publications/same-stats-different-graphs. Retrieved 2021-04-20.
  10. Murray, Lori L.; Wilson, John G. (April 2021). Generating data sets for teaching the importance of regression analysis. Decision Sciences Journal of Innovative Education. 19 (2): 157–166. doi:10.1111/dsji.12233. ISSN 1540-4595. S2CID 233609149. Archived 2021-04-23 at https://web.archive.org/web/20210423155254/https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233. Retrieved 2021-04-20.