Introduction to Statistics

Guillaume Fürst, 2012

1. Basic Statistics

This first part introduces some basics about variables, graphs, and descriptive statistics. Notions of sampling and statistical inference are covered here as well.


1.1. About variables

Variables are the basis of any statistical analysis. They can represent a vast array of concepts; for instance:


1.1.1. Variable types

Variables come in different types (e.g., quantitative or qualitative). These types really matter: knowing the type of variable you are dealing with lets you choose the correct statistical analysis to run.


Qualitative and quantitative variables

We usually distinguish between:

Examples of variables

Dependent and independent variables

Symbol used to represent this relation: IV -> DV
(Correlation implies no direction: V1 <-> V2)


1.1.2. Variable modifications

Once in the database, a variable's type can't be changed. However, various manipulations can be done; for instance, variables can be recoded or transformed.



Recoding does not change the fundamental nature of the variable. However, recoding allows more flexibility. For instance, two categories can be treated as numerical values, and the difference between them quantified. (See "dummy variables in linear regression", section 2.4.1.)

Example: recoding sex

Standardization (linear transformation)

Standardization is not normalization. Standardization just changes the variable's scaling. The proportions between values are not affected; the shape of the distribution is not changed in the slightest.

Z scores

zi = (xi – μx) / σx, where:
xi are the raw scores
μx is the mean
σx is the standard deviation
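As a quick illustration, standardization can be sketched in a few lines of Python (the data are made up; `statistics.pstdev` is the population standard deviation):

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Standardize raw scores: z_i = (x_i - mean) / sd (population sd here)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

raw = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population sd 2
print(z_scores(raw))             # -> [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The transformed scores always have mean 0 and standard deviation 1, while the shape of the distribution is untouched.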

(See subsection 1.2.1. about univariate graph for an example.)
(See subsection 1.3.1. about univariate descriptive statistics.)

Non-linear transformation

Non-linear transformations change the original proportions (distances) between scores. Such transformations are, for instance, logarithmic, exponential, quadratic or square-root transformations.
These transformations are useful to model non-linear relations between variables in the linear models framework (i.e., generalized linear model).
For instance, the relation between learning and performance is more logarithmic than linear, while the relation between stress and performance is more quadratic in nature.

Example: logarithmic transformation

x'i = log(xi), where:
xi are the raw scores (original distribution)
log( ) is the function transforming the scores
x'i is the log-transformed distribution
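A small sketch of how a log transformation changes the distances between scores while preserving their order (made-up data):

```python
import math

raw = [1, 10, 100, 1000]               # distances between scores: 9, 90, 900
logged = [math.log10(x) for x in raw]
print([round(v, 6) for v in logged])   # equal distances of 1 after transformation
```

The order of the scores is preserved, but the original proportions are not: equal ratios become equal differences.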

[see subsection 1.2.1 about univariate graph for an example]

Operationalisation and variable type

Even if the fundamental nature of a variable can't be changed once measured, some important choices can nonetheless be made when designing the operationalisation.
For instance, "adult" vs. "child" is a poor operationalisation of the variable age. Simply measuring age in years or months is a much more powerful operationalisation.
The same logic holds for many variables (e.g., extraversion, prices, intelligence, buying intentions). As a general rule, and when the nature of the phenomenon allows it, continuous variables with many possible values should be preferred.
Continuous variables can always be categorized if necessary (though it's often not recommended), whereas the reverse is not true.


1.2. Graphs

Graphs are massively important to inspect and visualize data — either for representing one variable at a time (univariate graphs, e.g., histogram) or the relation between two variables (bivariate graphs, e.g., scatterplot).
Graphs allow the detection of unexpected distributions, extreme values (univariate or bivariate), as well as errors in the data (e.g., impossible values). Such problems often can hardly be detected otherwise, particularly in big datasets.


1.2.1. Univariate graphs

Univariate graphs (i.e., one variable at a time) come in two basic kinds. One is the representation of the frequency of discrete observations (qualitative variables, or quantitative variables with a few possible values). The other is the representation of the distribution of continuous variables.


Qualitative/categorical variable

Example of Pie chart
Example of Barplot

Continuous variable

Examples of histograms

Examples of boxplots


1.2.2. Bivariate graphs

Bivariate graphs allow us to visualize relations between two variables: two categorical variables, two continuous variables, or a mix of both.


Categorized plots

Categorized barplot
Categorized boxplot

Several quantitative variables

Matrix scatterplot

1.3. Descriptive statistics

Descriptive statistics are useful to summarize information about variables. Just as graphs, descriptive statistics come in univariate and bivariate kinds. Descriptive statistics should always be examined conjointly with appropriate graphs.


1.3.1. Univariate statistics

Notions of interest here are the frequency of observations, central tendency, and dispersion of a given variable.


Frequency tables

Frequency tables are useful to sum up qualitative variables or quantitative variables with a few values.

Example: number of children

Central tendency

Central tendency answers the question "Which values are the most frequent?".

Example: mean, median and mode for 3 variables


Dispersion answers the question "To what extent are values spread around the central tendency?".

Example: min./max., variance and quartiles
Variance's formula: σ²x = Σ (xi – μx)² / n (the sample estimate s² divides by n – 1 instead)
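These univariate indices can be computed with Python's standard library (the `children` data are hypothetical):

```python
from statistics import mean, median, mode, pvariance, variance, quantiles

children = [0, 0, 1, 1, 1, 2, 2, 3, 8]   # hypothetical "number of children" data

print(mean(children))                     # -> 2
print(median(children))                   # -> 1
print(mode(children))                     # -> 1
print(round(pvariance(children), 3))      # population variance, divides by n      -> 5.333
print(round(variance(children), 3))       # sample variance, divides by n - 1      -> 6.0
print(quantiles(children, n=4))           # quartiles
```

Note how the single extreme value (8) pulls the mean above the median and inflates the variance.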

1.3.2. Bivariate statistics

This subsection introduces indices and tables useful to summarize relations between variables, such as cross tabulation and correlation.


Cross tabulation

Cross tabulation is used to investigate relationships between qualitative variables or quantitative variables with a few values.

Example: sex and social class

Correlation and covariance

Covariance is a non-standardized correlation; correlation is a standardized covariance. Both represent the strength of association between two continuous variables. Both estimate the linear relation between variables. (See also section 2.4.1.)


Possible values of covariance range from –∞ to +∞. This makes the interpretation of covariance somewhat difficult: in some cases 34 could be large, while in other cases 4'000'000'000 could be small.


Possible values of correlation range from –1 (perfect negative correlation) to +1 (perfect positive correlation); 0 means no relation at all between the two variables. Correlations between several variables are classically represented in a triangular matrix.
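A minimal sketch of the covariance/correlation relationship described above (made-up data; y = 2x, so the linear relation is perfect):

```python
from statistics import mean
import math

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Correlation = covariance standardized by the two standard deviations."""
    sx = math.sqrt(covariance(xs, xs))
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]                 # y = 2x: perfect positive linear relation
print(round(covariance(x, y), 3))    # -> 5.0  (scale-dependent, harder to interpret)
print(round(correlation(x, y), 3))   # -> 1.0  (scale-free)
```

Multiplying y by 1000 would multiply the covariance by 1000 but leave the correlation at 1.0, which is exactly why correlation is easier to interpret.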

Example of correlations

Self-rated importance (imp) and usefulness (use)
of a compact (C) and reflex (R) camera. n=204.

1.4. Statistical inference

Inference is the backbone of statistics, the very reason for its existence. Nothing is absolutely certain or uncertain in statistics; you always have to deal with the probability of something being true or false.
Specifically, we often want to know whether a given sample reflects some true property of a population (e.g., to what extent a correlation found in a sample can be generalized to the population). The general aim of statistical theory is to provide formal tools to deal with uncertainty.

While the basic notion of probability is somewhat intuitive, things can get blurry when dealing with specific statistical notions (e.g., sampling distribution). Thus the aim of this section is to introduce some general concepts of statistical inference, which underlie virtually all statistical tests. (Specific tests are considered in parts 2 and 3.)


1.4.1. About sampling

All statistical decisions and estimations are based on samples. The aim of this first subsection is to provide an overview of the reasons for, and implications of, using samples to draw conclusions about the population.


Why sampling?

Studying the entire population is almost never an option: many populations are huge, or even of unknown or infinite size. Thus research hypotheses have to be tested on samples.

Example of sampling

An important implication of sampling is sampling error: a sample will seldom be identical to the population.


Role of randomness

There are many different types of sampling, but most of them rely on randomness. If the sample is biased, no valid conclusion can be drawn.

Example of a non-random sample

If, for whatever reason, the individuals of the population don't have an equal chance of being included in the sample, the sample will be biased.


Sampling error and sample size

Three examples of different sample size and impact on precision of estimation.

The underlying idea of these three examples is the following: the bigger the sample, the better the estimation, and the higher the certainty. (See the next subsection for more about this.)
In other terms, the probability of "bad luck" (i.e., a special sample very different from the population) gets smaller and smaller as sample size increases.
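This idea can be illustrated with a small simulation (the population — 60% "black dots" — and the function names are our own illustrative assumptions):

```python
import random

random.seed(42)

# Hypothetical population: 60% "black dots" (1) and 40% "white dots" (0).
def sample_proportion(n):
    """Proportion of black dots observed in one random sample of size n."""
    return sum(random.random() < 0.6 for _ in range(n)) / n

def estimate_spread(n, reps=500):
    """Range of the estimated proportion across many repeated samples of size n."""
    estimates = [sample_proportion(n) for _ in range(reps)]
    return max(estimates) - min(estimates)

for n in (10, 100, 1000):
    print(n, round(estimate_spread(n), 3))   # the spread shrinks as n grows
```

Whatever the seed, the spread of the estimates around the true value (0.6) decreases markedly as the sample size increases.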


1.4.2. Statistical decisions

We've just seen that there are "good" and "bad" samples — i.e., some seemingly more representative of the population than others. Since this is inevitable, one must have a solution, a criterion to decide whether or not a sample can be considered representative beyond sampling error.
The statistical tools that help us make such decisions are hypothesis testing, the p-value and the α level.


Null and alternative hypothesis (H0 and H1)

The basic notion underlying virtually all statistical tests is the null hypothesis, where "null" conveys the idea of equifrequency, no difference between groups, no effect of treatment, no correlation, etc.

The alternative hypothesis represents the other possible outcome — there is a difference between groups, there is an effect of the treatment, there is a correlation.

Note that this is a very "black and white" world here:


Correct decisions and error types

The null hypothesis is the one that is formally tested. This is so because it's simpler and more straightforward than testing the alternative hypothesis. Moreover, on a more fundamental, epistemic level, a hypothesis cannot be proven to be true.

An analogy

                             H0 is true             H1 is true
                             (truly not guilty)     (truly guilty)
Accept null hypothesis       Right decision         Wrong decision
                                                    (Type II error)
Reject null hypothesis       Wrong decision         Right decision
                             (Type I error)

But how do we formally take this decision?



The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. /!\ The p-value is not the probability of the null hypothesis being true.

Let's go back to our previous example (subsection 1.4.1.) to figure out what it means:


In this context, drawing a sample with 50% of black dots is very likely (i.e., large p-value), while drawing a sample with 90% of black dots is very unlikely (i.e., small p-value) — but not impossible!

Likelihood of samples

At this point we face the following question: where do we draw the line? When is "unlikely" closer to "almost impossible" than to "quite likely"?


Null hypothesis rejection, α level and significance

The basic rule of statistical decision goes as follows: when the p-value is small, we reject the null hypothesis.

In other terms, we conclude that H0 is unlikely to be true.
Then, symmetrically, we conclude that H1 is likely to be true.
Ultimately, we assert that there is a difference between groups, an effect of the treatment, a non-zero correlation. This is statistical significance.

But how small must a p-value be to allow the rejection of the null hypothesis? This is where arbitrary choices come into the picture. By convention, one rejects the null hypothesis when the p-value is smaller than the significance level α (Greek alpha), often 0.05 or 0.01 (5% or 1%).
Of course, these are just arbitrary cut-off values. In a more subtle perspective, we can say that the smaller the p-value, the more confident we can be about our conclusion. This is why you should always report exact p-values rather than merely saying "this is significant because p < .05".


One- and two-tailed test

One last detail about testing the null hypothesis: depending on whether or not the alternative hypothesis is oriented (directional), the test is one-tailed or two-tailed.

Illustration of one- and two-tailed tests

/!\ Decision about hypothesis orientation must be based on theory, not on data.


1.4.3. Statistical estimations

In statistics, significance is only one side of the story — the crude side: there is a significant effect, or there is not. This is necessary information, but it is hardly enough. Other very important pieces of information are effect size and confidence interval.


Effect size

If we've rejected the null hypothesis, and thus concluded that some significant effect does exist, the notion of effect size informs us about the magnitude of this effect.

Example of 3 effect sizes

Indeed, for a given p-value, say .03, effect sizes can be very different. This is because p-values are influenced by both effect size and sample size.


Precision of estimation and confidence interval

Every statistical test gives a point estimate (e.g., a mean difference, a correlation). We've seen that such an estimate can be tested to decide whether or not it is significant. There's one more thing we can consider: the confidence interval.

Confidence intervals give information about the reliability of the point estimate. If the interval is large, the estimation is uncertain and not very reliable; if the interval is small, the estimation is more certain and reliable.

Effect size and confidence intervals

The precision of estimations is a direct function of sample size (see above, end of subsection 1.4.1.). As shown below, the point estimate can be high or low, independently of the confidence interval's size.
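A sketch of how interval width depends on sample size, using the common normal-approximation 95% CI for a mean, x̄ ± 1.96·s/√n (the data are made up; duplicating the small sample is just a trick to keep the mean and spread roughly constant while n grows):

```python
import math
from statistics import mean, stdev

def mean_ci95(xs):
    """Approximate 95% CI for a mean: mean +/- 1.96 * sd / sqrt(n)."""
    m, s, n = mean(xs), stdev(xs), len(xs)
    half = 1.96 * s / math.sqrt(n)
    return m - half, m + half

small = [4.8, 5.1, 5.4, 4.9, 5.3]   # n = 5
large = small * 20                  # n = 100, same mean and (almost) same sd
lo1, hi1 = mean_ci95(small)
lo2, hi2 = mean_ci95(large)
print(round(hi1 - lo1, 3), round(hi2 - lo2, 3))   # the larger sample gives a narrower CI
```

The point estimate (the mean, 5.1) is the same in both cases; only the precision around it changes.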


2. Bivariate Analyses

This section is about statistical tests such as the Chi-squared test, t-test, one-way ANOVA and simple linear regression. These tests focus on the relation between two variables. They provide information about whether the variables are significantly related, whether the relation is strong, and so on.


2.1. Chi-squared

There are two kinds of Chi-squared test: the χ2 test of homogeneity (actually for one variable only) and the χ2 test of independence, which tests the relation between two variables. In both cases, variables should be qualitative (categorical).


2.1.1. Χ2 test of homogeneity

This test is useful to test the equifrequency of the values of a categorical variable (uniform distribution).
For instance, the variable "liked a movie", with possible values "yes" or "no".



(See links "more" at the end of this section to know what to do if assumptions are not respected.)


Null hypothesis


There is equifrequency; the distribution is uniform.


You ask people whether they liked a movie. Null hypothesis: you get an equal number of "yes" and "no". Alternative hypothesis: more people liked it OR more people did not.


Test statistic


χ2 = Σ (Oi – Ei)² / Ei, where:
Oi = an observed frequency;
Ei = an expected frequency, asserted by H0;
N = the number of cells in the table (the sum runs over i = 1, ..., N).

You get 114 "yes" and 90 "no". These are the observed frequencies.
The expected frequency, asserted by the null hypothesis, is (114+90)/2 = 102.
Thus, χ2 = ((114–102)² + (90–102)²)/102 = 2.82

Degrees of freedom

df = N – 1, where N is the variable's number of possible categories/values
Example (df and conclusion)

Here our variable has only two possible values, thus df = 2–1 = 1.
For this χ2(df=1) = 2.82, the p-value is .093.
Conclusion: we cannot reject H0; there is no evidence of a difference.
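The whole worked example can be reproduced in a few lines (for df = 1 the χ2 p-value equals the normal two-sided tail, computable with `math.erfc`; the helper name is ours):

```python
import math

def chi2_homogeneity(observed):
    """Chi-squared test of homogeneity against uniform expected frequencies.
    Returns (chi2, df, p); the p-value is computed exactly only for df = 1."""
    n_cells = len(observed)
    expected = sum(observed) / n_cells
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    df = n_cells - 1
    p = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else None
    return chi2, df, p

chi2, df, p = chi2_homogeneity([114, 90])   # the "liked the movie" data above
print(round(chi2, 2), df, round(p, 3))      # -> 2.82 1 0.093
```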


Effect size

Cohen's w

w = √(χ2 / n), where n = sample size.
w is most often between 0 and 0.90.

In the previous example, we couldn't reject the null hypothesis. This means that, until further evidence (and/or a bigger sample), we have to assume that the effect size is 0. Hence calculating an effect size doesn't make any sense.

Let's take another example: say we ask whether people have a compact camera. Imagine the data are "yes"=61 and "no"=29. In this case, χ2(df=1) would be 11.38, with an associated p-value of .00074.
Cohen's w would then be the square root of 11.38/90, that is, about 0.36. A fair-sized effect.

Yet another example: say the data are now "yes"=1'005'000 and "no"=1'000'000. χ2(df=1) would be 12.47, with a p-value of .00041 — smaller than in the previous example, i.e., more significant. What about the effect size? The square root of 12.47/2'005'000 is only 0.0025...
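The effect-size computations above follow directly from w = √(χ2/n); a sketch:

```python
import math

def cohens_w(chi2, n):
    """Cohen's w for a chi-squared test: w = sqrt(chi2 / n), n = total sample size."""
    return math.sqrt(chi2 / n)

print(round(cohens_w(11.38, 90), 2))         # camera example -> 0.36
print(round(cohens_w(12.47, 2_005_000), 4))  # huge-sample example -> 0.0025
```

The two calls show the point of the section: a more significant χ2 can correspond to a far smaller effect when the sample is enormous.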


2.1.2. Χ2 test of independence

This test is useful to test the relation between two categorical variables.



(Same as for the χ2 test of homogeneity. See links "more" at the end of this section to know what to do if assumptions are not respected.)


Null Hypothesis


The row variable is independent of the column variable; row percents are equal.


Test statistic


χ2 = Σ (Oi,j – Ei,j)² / Ei,j, where:
Ei,j = (Ri x Cj) / N = an expected frequency in a cell;
Ri = total in a row; Cj = total in a column; N = total number of observations.

Degrees of freedom

df = (R – 1) x (C – 1), where R = total number of rows; C = total number of columns.

Example (df and conclusion)

χ2(df=1) = 25.01; the associated p-value is < .000001. This is strongly significant: there is a relation between aspirin and heart attack.


Effect size

Cramér's phi

φ = √(χ2 / (n(q – 1))), where:
n = sample size;
q = R or C, whichever is less.


Here, Cramér's phi is 0.0336, a very small effect size. Although the test statistic is highly significant (because of the large sample size), the effect size is negligible.
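A sketch of the test of independence and Cramér's phi; the 2x2 counts below are an assumption approximating the aspirin example referenced above (heart attack yes/no by aspirin/placebo), so treat the exact numbers as illustrative:

```python
import math

def chi2_independence(table):
    """Chi-squared test of independence for an R x C contingency table.
    Expected cell frequency: E_ij = (row total * column total) / grand total."""
    R, C = len(table), len(table[0])
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(R)) for j in range(C)]
    n = sum(row)
    chi2 = sum((table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
               for i in range(R) for j in range(C))
    df = (R - 1) * (C - 1)
    return chi2, df, n

def cramers_phi(chi2, n, q):
    """Cramer's phi: sqrt(chi2 / (n * (q - 1))), q = min(R, C)."""
    return math.sqrt(chi2 / (n * (q - 1)))

table = [[104, 10933],    # assumed counts: heart attack yes/no in one group
         [189, 10845]]    # and in the other group
chi2, df, n = chi2_independence(table)
print(round(chi2, 2), df)
print(round(cramers_phi(chi2, n, 2), 4))   # tiny effect despite a huge chi2
```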


2.2. T-test

The t-test focuses on the means of continuous variables: the difference between an observed and an expected mean, as well as the mean difference between two variables. Here we focus on the latter.


2.2.1. T-test for dependent samples

We test the difference between two means:

μX1 – μX2 = μD

where X1 and X2 are dependent or related (e.g., same person measured twice, married couples, twins, etc.).
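A minimal sketch of the dependent-samples t statistic, computed on the pairwise differences (the scores are made up):

```python
import math
from statistics import mean, stdev

def paired_t(x1, x2):
    """t-test for dependent samples: t = mean(d) / (sd(d) / sqrt(n)), df = n - 1,
    where d are the pairwise differences."""
    d = [a - b for a, b in zip(x1, x2)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

before = [5, 7, 6, 8, 9, 5, 6, 7]   # hypothetical scores, same persons...
after  = [6, 8, 8, 9, 9, 7, 7, 9]   # ...measured a second time
t, df = paired_t(before, after)
print(round(t, 2), df)              # -> -5.0 7
```

A large negative t here reflects a consistent increase from the first to the second measurement.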



Assumption check

What if normality is not respected?

Hypotheses and test

Null hypothesis

Test statistic

Degrees of freedom



Confidence interval (CI) and effect size

Example for HE0-HE1
Cohen's d
Example for HE0-HE1

2.2.2. T-test for independent samples

This test also examines the difference between two means, but for independent samples (i.e., two groups of different/unrelated people).



Assumption check


Levene's test of homogeneity of variance tests the null hypothesis of equality of variances. (This test calculates, for each score in each group, the distance to the mean of the group. Then it compares the mean distances of the two groups.)

If the test is not significant, variances can be considered equal. (See below to know what to do if variances cannot be considered equal, or if other problems with assumptions occur.)
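Levene's idea and the pooled t statistic can both be sketched with the standard library (the groups are made up; `levene_distances` is an illustrative miniature of the idea — a comparison of mean distances to the group means — not the full Levene test):

```python
import math
from statistics import mean, variance

def pooled_t(g1, g2):
    """t-test for independent samples with pooled variance; df = n1 + n2 - 2."""
    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    t = (mean(g1) - mean(g2)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

def levene_distances(g1, g2):
    """Levene's idea in miniature: compute each score's distance to its group
    mean, then compare the mean distances of the two groups."""
    d1 = [abs(x - mean(g1)) for x in g1]
    d2 = [abs(x - mean(g2)) for x in g2]
    return pooled_t(d1, d2)   # t near 0 -> variances can be considered equal

group_a = [4, 5, 6, 5, 4, 6]
group_b = [6, 7, 8, 7, 6, 8]    # same spread as group_a, shifted by 2
t_lev, _ = levene_distances(group_a, group_b)
t, df = pooled_t(group_a, group_b)
print(round(t_lev, 2))          # -> 0.0 (identical spreads)
print(round(t, 2), df)          # -> -3.87 10
```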


Hypotheses and test

Null hypothesis
Test statistic


Degrees of freedom



What if assumptions are not respected?

If the equality of variances is not respected, you can use the alternative test statistic t' with df' degrees of freedom. (Same principle and usage as for t, but with a different estimation.)

The t' statistic


If other assumptions are not respected, transform the data (see section 1.1.2.) or use a non-parametric analysis (see "more", end of this section).


Confidence interval (CI) and effect size


(Same principle as the t-test for dependent samples.)


Cohen's d

(Same principle as the t-test for dependent samples.)

2.3. One-way ANOVA

Basically, ANOVA can be seen — among other things — as a generalization of the t-test: comparing multiple groups (different categories of a single factor) on a single continuous dependent variable.


2.3.1. Between-subjects design

If there are only two groups (i.e., the factor has two values), this is strictly equivalent to the t-test for independent groups (between-subjects designs).
If there are more than two groups, this is a generalization: same basic principles, with a few estimation and interpretation differences.
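A sketch of the F statistic for a one-way between-subjects ANOVA (three made-up conditions):

```python
from statistics import mean

def one_way_anova(groups):
    """One-way between-subjects ANOVA: F = MS_between / MS_within,
    with df = (a - 1, Ntot - a)."""
    a = len(groups)
    n_tot = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_b = ss_between / (a - 1)
    ms_w = ss_within / (n_tot - a)
    return ms_b / ms_w, (a - 1, n_tot - a)

groups = [[3, 4, 5, 4], [5, 6, 7, 6], [7, 8, 9, 8]]   # three hypothetical conditions
F, (df1, df2) = one_way_anova(groups)
print(round(F, 2), df1, df2)   # -> 24.0 2 9
```

With two groups only, this F equals the square of the independent-samples t statistic.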




Null and alternative hypotheses




Test statistic



(a is the number of groups, Ntot is the number of observations)


i are subjects, j are conditions.

Degrees of freedom

(a is the number of groups, Ntot is the number of observations)



Assumptions check (analysis of residuals)


Multiple comparisons, effect size and CI

Effect size associated with the F-test

Here partial eta-squared is 7.876/(7.876+34.3) = 0.187. This can be interpreted as 18.7% explained variance.

Multiple comparisons

The F statistic and partial eta-squared are not what ANOVA is all about. We are often interested in performing specific group comparisons or in testing specific patterns of differences. That's what multiple comparisons are for.

There are different kinds of multiple comparisons:
  1. planned group comparisons (contrasts);
  2. post-hoc group comparisons;
  3. tests of polynomial contrasts.

(For practical reasons, numbers 1 and 2 are discussed in this section, while number 3 is discussed in the next subsection about within designs. However, any of these comparisons can be done in both designs.)


Both planned and post-hoc group comparisons (numbers 1 and 2 above) consist in comparing groups. The basic principle stays the same: one group (or several) is compared to another (or several others); either way, the null hypothesis is always that there is no difference.
The main difference between planned comparisons and post-hoc comparisons concerns p-values. If you run a small set of educated comparisons, you can generally use raw, standard p-values (see also "more", end of this section, about the notion of orthogonal comparisons). Whereas if you run many blind comparisons, you have to correct the p-values for an additional "chance factor". Several solutions exist; some of them are introduced in the following examples.
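One of the simplest corrections for many blind comparisons is the Bonferroni adjustment, shown here as an illustration (other, less conservative corrections exist):

```python
def bonferroni(p_values):
    """Bonferroni correction: multiply each raw p-value by the number of
    comparisons, capping the result at 1. Conservative but simple."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

raw = [0.010, 0.020, 0.300]                      # hypothetical raw p-values
print([round(p, 3) for p in bonferroni(raw)])    # -> [0.03, 0.06, 0.9]
```

After correction, a comparison is declared significant only if its adjusted p-value is still below α; with α = .05, only the first comparison above survives.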

Example of planned group comparisons

Example of post-hoc groups comparisons

2.3.2. Within-subjects design

If there are only two groups (i.e., the factor has two values), this is strictly equivalent to the t-test for dependent groups (within-subjects designs). If there are more than two groups, this is a generalization: same basic principles, with a few estimation and interpretation differences.




Null and alternative hypotheses


(Same as for between designs.)



Test statistic


This is the same basic idea as for between-subjects designs. However, one main advantage of repeated-measures ANOVA is that it lets you partition out variability due to individual differences. In a between-subjects design, the variance due to individual differences is combined with the condition and residual terms:

SSTotal = SSCond + SSResid


In a repeated-measures design it is possible to account for these subject differences and partition them out from the condition and residual terms. In such a case, the variability can be broken down into between-conditions variability (the within-subjects effect) and within-conditions variability. The within-conditions variability can be further partitioned into between-subjects variability (individual differences) and residuals (excluding the individual differences).

SSTotal = SSSubjects + SSCond + SSResid

As the total variance is better "explained" (i.e., decomposed into more specific parts than in between designs), the residual term gets smaller. This contributes to higher F values, thanks to a smaller denominator (the residual term) in the formula.

Mean squares and degrees of freedom

a is the number of conditions;
n is the number of observations.



Assumptions check and sphericity correction factor

Normality of difference scores

Greenhouse-Geisser (GG) correction factor

Huynh-Feldt (HF) correction factor


Multiple comparisons, effect size and CI

Effect size associated with the F-test

Here partial eta-squared is 7.78/(7.78+76.8) = 0.092. This can be interpreted as 9.2% explained variance.

Multiple comparisons

As for between-subjects designs, the F statistic and partial eta-squared are not the whole story. Specific group comparisons can be done, and with them tests of specific patterns of differences.
We mentioned planned group comparisons (contrasts) and post-hoc group comparisons in the previous subsection; here we'll look at tests of polynomial contrasts. (However, remember that any of these comparisons can be done in both designs.)


Polynomial contrasts allow us to test specific patterns of differences between conditions. For instance:

Example of polynomial contrasts


2.4. Simple Linear Regression

Regression is classically used to investigate the linear relation between two continuous variables.
However, regression can also be used to test a non-linear relation, using a non-linear transformation of the data (see "more", end of section 1.1.).
Last, regression can also be used to test the mean difference between two groups (equivalent to the t-test for independent groups).


2.4.1. Basic principles


These are true for all linear models, including ANOVA and t-test.



Estimated with least squares method, see here for an intuitive illustration:


Null and alternative hypotheses


Tests and estimations


2.4.2. Examples

Two continuous variables



Assumptions check


Using a dummy variable as predictor


See section 2.2.2. (boxplot).
Actually, this whole test is equivalent to the t-test!
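The equivalence can be seen directly: regressing the scores on a 0/1 dummy gives the group-A mean as intercept and the mean difference as slope (made-up data):

```python
from statistics import mean

def simple_regression(x, y):
    """OLS estimates for y = a + b*x: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

# Dummy predictor: 0 = group A, 1 = group B (hypothetical scores).
x = [0, 0, 0, 1, 1, 1]
y = [4, 5, 6, 7, 8, 9]
a, b = simple_regression(x, y)
print(a, b)   # -> 5.0 3.0 : intercept = mean of group A, slope = mean difference
```

Testing the slope against 0 is then exactly the independent-samples t-test of the group difference.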


Assumptions check


Comparison with structural model of ANOVA

ANOVA models (and t-tests) are actually special cases of the linear regression model. The equations below (structural models of ANOVA) are indeed very similar to the equation of a regression model.

Between designs


Within designs



3. Multivariate Analyses

This section is about statistical tests such as factorial ANOVA and multiple linear regression. These tests allow the investigation of the impact of several IVs (predictors, Xs) on one DV (criterion, Y). With these possibilities come the notions of statistical interaction and mediation.


3.1. Interaction & mediation

For an introduction to mediation and interaction, cf. this document: (pp. 1-4).


3.2. Factorial ANOVA

This is a generalization of simpler ANOVA designs. We will specifically look here at the 2x2 between design. However, further generalizations can be done (e.g., 2x2 within designs, 2x3 between designs, 2x2x2 designs, mixed designs).


3.2.1. Basic principles

In 2x2 between designs, three types of effect can be tested: (1) main effect of factor 1; (2) main effect of factor 2; (3) interaction effect between factor 1 and factor 2.



Always the same.





Null and alternative hypotheses

You should be able to guess what the alternative hypotheses are! (Always the same story.)


Mean squares and degrees of freedom

a is the number of conditions of factor 1;
b is the number of conditions of factor 2;
n is the sample size.


Tests and estimations

In the (quite special) case of this 2x2 design:

/!\ In more complicated designs (i.e., factors with more than two conditions), the F-test does not represent the difference between two groups (see section 2.3.).


3.2.2. Example





Assumptions check




3.3. Multiple Regression

Multiple regression allows the prediction of one continuous dependent variable (criterion, Y) from multiple independent variables (predictors, Xs).


3.3.1. Basic principles


Always the same.

Additionally, one should also check that the intercorrelation (redundancy) between predictors is not massive: no independent variable should be strongly predicted by the others.
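A quick redundancy check is to inspect the correlations among predictors (made-up data; `x2` is nearly a copy of `x1`, `x3` is not):

```python
from statistics import mean
import math

def correlation(xs, ys):
    """Pearson correlation between two variables."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 2.0, 3.2, 3.9, 5.1, 6.0]   # almost a copy of x1: redundant predictor
x3 = [3, 1, 4, 1, 5, 9]               # only moderately related to x1
print(round(correlation(x1, x2), 3))  # near 1: worrying redundancy
print(round(correlation(x1, x3), 3))  # moderate: acceptable
```

In practice, a predictor correlating very highly with another (or with a combination of the others) contributes little unique information and makes the coefficient estimates unstable.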




Null and alternative hypotheses


Tests and estimations

Mostly similar to simple linear regression


3.3.2. Example





Assumptions check



You should be able to do it yourself! (See the first example of section 2.4.2. and the end of section 3.3.1.)