Some words of caution about significance testing by Kevin Gray: “I’ve long had three major concerns about significance testing. First, it assumes probability samples, which are rare in most fields. For example, even when probability sampling (e.g., RDD) is used in consumer surveys, because of (usually) low response rates, we don’t have true probability samples. Secondly, it assumes no measurement error. Measurement error can work in mysterious ways but generally weakens relationships between variables. Lastly, like automated modeling, it passes the buck to the machine and [READ MORE]

If we have a sample of data drawn randomly from a population with a normal distribution, we can assume that our sample distribution also has a normal distribution (provided a sample size of more than 30). If we have a mean of zero and a standard deviation (SD) of 1, then we can calculate the probability of getting a particular score based on the frequencies we have. To centre our data around a mean of zero, we need to subtract each individual score from the overall mean, then divide this by the standard deviation. This is the process of standardisation of raw data into z-scores. This [READ MORE]

When doing proposals or client reports, we often refer to “research questions” and “research hypotheses” (sometimes used interchangeably). What is the difference? Research Questions do NOT entail specific predictions (magnitude or direction of the outcome variable) and are therefore phrased in question format that could include questions about descriptives, difference or association (or relationship). These assist the researcher to choose the most appropriate statistics techniques. Lets look at each: 1. Research questions that relate to describing [READ MORE]

A test statistic such as the F-test, t-test, or the χ² test, all look at the proportion of variance explained (effect) by our model versus variance not explained (error) by our model. Our model can be as basic as a mean score which is calculated as the sum of the observed scores divided by the number of observations included. If this proportion is >1, then the variance explained (effect) is larger than the variance not explained (error). The higher this proportion the better our model. Lets say it is 5 (rather than 1), so the proportion of explained variance (effect) is 5 times [READ MORE]

In follow-up to the post about univariate outliers, there are a few ways we can identify the extent of bivariate and multivariate outliers: First, do the univariate outlier checks and with those findings in mind (and with no immediate remedial action), follow some, or all of these bivariate or multivariate outlier identifications depending on the type of analysis you are planning. _____________________________________________________ BIVARIATE OUTLIERS: For one-way ANOVA, we can use the GLM (univariate) procedure to save standardised or studentized residuals. Then do a normal [READ MORE]

(Statistical) Power Analysis refers to the ability of a statistical test to detect an effect of a certain size, if the effect really exists. In other words, power is the probability of correctly rejecting the null hypothesis when it should be rejected. So while statistical significance deals with Type I (α) errors (false positives), power analysis deals with Type II (β) errors (false negatives), which means power is 1- β Cohen (1988) recommends that research studies be designed to achieve alpha levels of at least .05 and if we use Cohen’s rule of .2 for β, then 1- β= 0.8 (an 80% [READ MORE]

I was talking this morning with someone about which blogs that review products and/or services are the most popular around my part of the world – Asia. I consulted Google Search but could not come up with an answer. I did however come across a recent report (June 25, 2012) by Kristen Sala, Senior Manager, Electronic Media at Cision (a public relations software and media tools firm) that lists the Top 50 independent “Product Review Blogs” in North America. Mama-B Blog is first, followed by Computer Audiophile, and 48 others. Still, I could not find much information [READ MORE]

Very brief description Multicollinearity is a condition in which the independent variables are highly correlated (r=0.8 or greater) such that the effects of the independents on the outcome variable cannot be separated. In other words, one of the predictor variables can be nearly perfectly predicted by one of the other predictor variables. Singularity is when the independent variables are (almost) perfectly correlated (r=1) so any one of the independent variables could be regarded as a combination of one or more of the other independent variables. In practice, you should not [READ MORE]

We’re all very familiar with the “Likert-scale” but do we know that a true Likert-scale consists not of a single item, but of several items which under the right conditions – i.e. subjected to an assessment of its reliability (e.g. intercorrelations between all pairs of items) and validity (e.g. convergent, discriminant, construct etc.) can be summed into a single score. The Likert-scale is a unidimensional scaling method (so it measures a one-dimensional construct), is bipolar, and in its purest form consists of only 5 scale points, though often we refer to a [READ MORE]

Many of the statistical procedures used by marketing researchers are based on “general linear models” (GLM). These can be categorised into univariate, multivariate, and repeated measures models. The underlying statistical formula is Y = Xb + e where Y is generally referred to as the “dependent variable”, X as the “independent variable”, b is the “parameters” to be estimated, and e is the “error” or noise which is present in all models (also generally referred to as the statistical error, error terms, or residuals). Note that both [READ MORE]

Now here’s an easy one: What is a variable? It is simply something that varies – either its value or its characteristic. In fact, it must vary. If it does not vary then we can’t call it a VARiable, so we call it a “constant” such as the regression constant (the y-intercept). In the equation of a straight line (linear relationship) Y = a + bX, where: Y=dependent variable X=independent variable a=constant (the Y-axes intercept, or the value of Y when X=0) b=coefficient (slope of the line, in other words the amount that Y increases [or [READ MORE]

BRIEF DESCRIPTION The Chi-square (χ²) goodness-of-fit test is a univariate measure for categorical scaled data, such as dichotomous, nominal, or ordinal data. It tests whether the variable’s observed frequencies differ significantly from a set of expected frequencies. For example, is our observed sample’s age distribution of 20%, 40%, 40% significantly different from what we expect (e.g. the population age distribution) of 30%, 30%, 40%. Chi-square (χ²) is a non-parametric procedure. SIMILAR STATISTICAL PROCEDURES: Binomial goodness-of-fit (for binary data) [READ MORE]