Correlation analysis, also known as bivariate analysis, is primarily concerned with finding out whether a relationship exists between variables and then determining the magnitude and direction of that relationship.
Correlation does not equal causation. Correlation analysis identifies and evaluates a relationship between two variables, but a positive correlation does not automatically mean one variable affects the other.
The main benefits of correlation analysis are that it helps companies determine which variables they want to investigate further, and it allows for rapid hypothesis testing.
The main type of correlation analysis uses Pearson’s r formula to identify the degree of the linear relationship between two variables.
Because of the amount of data available, companies must be thoughtful when deciding which variables to analyze.
John Bates is the director of product management for predictive marketing solutions and for Adobe Analytics Premium in Adobe Experience Cloud. His core responsibility is to develop the product roadmap for all advanced statistics, data mining, predictive modeling, machine learning, and text mining/natural language processing solutions found within the products of Adobe's digital experience business unit.
Q: How is correlation analysis used?
A: Correlational studies are our attempts to find the extent to which two variables are related. No variables are manipulated as part of an experiment — the analyst is measuring naturally occurring events, behaviors, or characteristics.
It’s important to remember that correlation doesn't equal causation. You can’t draw any conclusions regarding the causal effect of one type of data on the other, but you can determine the size, degree, and direction of the relationship.
Q: What is the business value of correlation analysis?
A: Correlation analysis is useful for identifying possible inputs for a more sophisticated analysis, or for testing for future changes while holding other things constant. You may also want to just understand the relationship between two variables.
The great thing about correlation analysis is that it's fairly easy to interpret and understand, because you're only focused on how one variable varies in relation to another. Correlation analysis can also be used to diagnose problems with multiple regression models. You may have issues with a multivariate or multiple regression model where it's not producing useful results, or where the independent variables are not truly independent of one another. Those issues can be discovered by running correlation analysis between the different independent variables.
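The diagnostic described above, checking whether supposedly independent variables are correlated with each other, can be sketched roughly as follows. This is a minimal illustration, not a prescribed method: the variable names, the sample numbers, and the 0.8 cutoff are all assumptions made up for the example.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical candidate inputs for a regression model (invented numbers).
predictors = {
    "visits": [100, 120, 90, 150, 130, 110],
    "page_views": [310, 365, 270, 455, 400, 330],
    "email_opens": [40, 22, 35, 30, 45, 25],
}

THRESHOLD = 0.8  # assumed cutoff for "too correlated"; pick one that suits your data

# Compare every pair of predictors and flag the strongly correlated ones.
flagged = []
for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
    r = pearson_r(a, b)
    if abs(r) >= THRESHOLD:
        flagged.append((name_a, name_b, round(r, 3)))

for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are strongly correlated (r = {r})")
```

In this made-up data, visits and page views move together almost perfectly, so a model that includes both is using two inputs that carry nearly the same information.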
Correlation analysis is also a quick way to identify potential areas for testing. If there is a correlation between two variables, correlation analysis provides an opportunity for rapid hypothesis testing, especially if the test is low risk and won’t require a significant investment of time and money. For example, you might find that there’s a positive correlation between customers looking at reviews for a particular product and whether or not they purchase it. You can't say for certain that the product reviews caused the purchase, but it indicates a place where testing can provide more information. If you can get 10% more people to look at product reviews, especially positive ones, can you increase the number of purchases? Correlations can help to fuel different hypotheses that can then be rapidly tested, especially in digital environments.
Q: What is Pearson’s r formula?
A: The Pearson’s r formula is the most commonly used statistic to measure the degree of a relationship between linearly related variables. Once you run the formula, you will get a correlation report about the two tested variables. The output is often expressed as something called the Pearson product-moment correlation coefficient, also known as r. The value of r always falls between negative one and positive one. An r value of positive one (+1) indicates a perfect positive correlation, while an r value of negative one (-1) indicates a perfect negative correlation. An r value of zero indicates no linear correlation.
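For reference, the standard textbook form of the coefficient for n paired observations is shown below. The symbols x_i, y_i, and the sample means are notation introduced here for illustration, not taken from the interview.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```

The numerator measures how the two variables move together around their means, and the denominator rescales that quantity so the result stays between -1 and +1.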
There are a couple of other parts of Pearson’s r formula and the correlation report. As explained before, r is another term for the coefficient that appears in your report. This coefficient usually appears alongside the degrees of freedom (df). The degrees of freedom is the number of paired data points you have, minus two. So the output would report r together with its degrees of freedom, typically in a form like r(df) = value. The other thing that's often reported alongside the coefficient is the p value, which indicates the statistical significance of the correlation.
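As a rough sketch of how these pieces fit together, the snippet below computes the degrees of freedom and the t statistic commonly used to test whether r differs from zero. The function name and the input values (r = 0.6 from 30 paired observations) are hypothetical. Turning t into a p value requires a Student's t distribution with df degrees of freedom, which a statistics package would normally supply, so that step is left out here.

```python
import math

def correlation_t_statistic(r, n):
    """Return (df, t) for a test of 'no linear relationship', given r and n pairs."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if not -1.0 < r < 1.0:
        raise ValueError("t is undefined when |r| equals 1")
    df = n - 2  # degrees of freedom: number of pairs minus two
    t = r * math.sqrt(df / (1.0 - r * r))
    return df, t

# Hypothetical report values, chosen only for illustration.
df, t = correlation_t_statistic(r=0.6, n=30)
print(f"r({df}) = 0.60, t = {t:.2f}")
```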
Another part of the correlation report is r-squared, which is called the coefficient of determination. The coefficient of determination is, with respect to the correlation, the proportion of the variance that is shared by both variables. It gives a measure of the amount of variation that can be explained by the model or the correlation. This value is usually written as a decimal or percentage, like r-squared equals 0.36, or 36%.
For the purposes of the following example, we will only focus on r, and the variables X and Y. If you want to determine the correlation between page views (X) and revenue (Y), you list all the X and Y values for a specific timeframe, and then plug those numbers into the formula in the correct places. If the value of r is between zero and one, that indicates that as page views go up, revenue tends to go up as well. Similarly, a value between zero and negative one would indicate that as page views go up, revenue tends to go down. However, Pearson’s r formula can only tell you if there is a correlation between two variables, not whether one of the variables directly affects the other.
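The page views and revenue example can be worked through in a few lines of code. This is a minimal sketch using only the Python standard library. The daily figures are invented for illustration, and `pearson_r` is a helper written for this example rather than a function from any particular analytics product.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient for paired samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical daily figures for one week (X = page views, Y = revenue).
page_views = [1200, 1500, 1100, 1800, 1700, 1300, 2000]
revenue = [340.0, 410.0, 300.0, 520.0, 480.0, 390.0, 560.0]

r = pearson_r(page_views, revenue)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r-squared = {r_squared:.3f}")
```

A positive r here says only that the two series rose and fell together during this week. It says nothing about whether page views drive revenue.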
Q: What are the main types of correlation analysis?
A: The most common types of correlation analysis fall into three main families. Pearson’s correlation coefficient is used for linearly related variables, like age and height or temperature and ice cream sales. It requires certain assumptions about the variables: for instance, it assumes the variables are linearly connected and are normally distributed.
Spearman’s rank-order correlation, on the other hand, doesn’t carry any assumptions regarding the distribution of the data. It's most appropriate when correlation analysis is being applied to variables that contain some kind of natural order, like the relationship between starting salary and various degrees (high school, bachelor’s, master’s, etc.), or age and income.
The third main type of correlation analysis is Kendall’s tau correlation, and it’s used with ranked pairings. The purpose of Kendall’s tau correlation is to determine the strength of dependence between two variables. If the coefficient value is zero, there is no association between the rankings of X and Y, which is what you would expect to see if the two variables were independent of each other.
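To make the two rank-based measures concrete, here is a small sketch of both. It assumes Spearman's coefficient is computed as Pearson's r on average ranks, and it uses the simple tau-a variant of Kendall's tau, in which tied pairs count as neither concordant nor discordant. The degree codes and salary figures are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank-order correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least 2 observations")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical data: degree coded in natural order
# (1 = high school, 2 = bachelor's, 3 = master's, 4 = doctorate).
degree = [1, 2, 2, 3, 3, 4]
starting_salary = [32000, 45000, 41000, 58000, 62000, 75000]

rho = spearman_rho(degree, starting_salary)
tau = kendall_tau_a(degree, starting_salary)
print(f"Spearman rho = {rho:.3f}, Kendall tau-a = {tau:.3f}")
```

Because both measures work on ranks rather than raw values, they respond to whether the ordering of one variable matches the ordering of the other, not to how far apart the values are.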
Q: What problems do companies run into when conducting correlation analysis?
A: The main problem that companies run into with correlation analysis is that people are quick to assume that the analysis indicates causation. Only proper testing can determine whether or not you’re looking at independent and dependent variables.
One of the modern challenges of correlation analysis is that, with so much data available, many different variables or data sets may show similarly strong correlations with the same target data. There can be some paralysis when deciding which variable to evaluate more closely later using multivariate analysis. It isn’t always immediately clear which correlated relationship will be the most beneficial to pursue. It is important to choose one variable that can represent the others that are not truly independent of it.
For example, when looking at orders or purchases, there might be similar correlations between that variable and visits to a website or store, page views, and number of visitors. One of the challenges is ensuring that your teams understand you can have multiple sets of data that correlate in a similar way because they're similar in nature. These data sets might get collected at the same time or with the same frequency, or they may have some sort of inherent relationship. It’s important to keep that relationship in mind when looking at different variables with similar correlation outcomes.
Companies can also run into problems with missing data. Let’s say you’re looking at the correlation between stock prices and sales in a specific time period. If you suddenly have missing data for a portion of that time, or if the variables don’t line up, it can really throw off the correlation analysis, because some tools will treat the missing data as zeros, even though a missing value and a true zero are not the same thing. To mitigate potential problems, make sure that you choose a period of time or a set of observations with the right distribution, that the assumptions of the technique align with the underlying data, and that you apply the proper technique. When there's missing data, exclude it. If you’re looking at time-based data, try to find an observation period with consistently collected data.
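The effect of missing data can be seen in a small sketch. It compares two ways of handling gaps in a made-up stock price series: filling them with zeros, and excluding any observation where either value is missing. The numbers and the helper names are hypothetical, and `None` is used here to mark a missing value.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def drop_missing_pairs(xs, ys):
    """Keep only the observations where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [p[0] for p in pairs], [p[1] for p in pairs]

# Hypothetical weekly data; None marks weeks where the price was not recorded.
stock_price = [10.0, 10.5, None, 11.5, 12.0, None, 13.0]
sales = [200, 215, 230, 240, 250, 262, 275]

# Wrong: pretend the missing prices were zero.
zero_filled = [0.0 if p is None else p for p in stock_price]
r_zero_filled = pearson_r(zero_filled, sales)

# Better: exclude the incomplete observations before correlating.
clean_price, clean_sales = drop_missing_pairs(stock_price, sales)
r_excluded = pearson_r(clean_price, clean_sales)

print(f"zeros substituted: r = {r_zero_filled:.3f}")
print(f"missing excluded:  r = {r_excluded:.3f}")
```

With these invented numbers, substituting zeros hides a relationship that is clearly visible once the incomplete weeks are left out.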
Finally, a company can assume that because a correlation is statistically significant there must be a strong association, but this is not always the case. The relationship can be statistically significant and still have a fairly weak association. The significance test is simply testing the null hypothesis that there is no relationship. By rejecting the null hypothesis, you accept the alternative hypothesis that there is a relationship, but the test gives no information about the strength of the relationship or its importance. Be careful about how you interpret association or correlation, because the correlation coefficient and statistical significance are two separate concepts.
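The gap between significance and strength can be illustrated with the t statistic for a correlation, t = r * sqrt(df / (1 - r^2)) with df = n - 2. The two scenarios below are hypothetical. The critical values used (about 1.96 for very large df and 2.306 for df = 8) are the standard two-sided 5% thresholds from a t table.

```python
import math

def correlation_t(r, n):
    """t statistic for testing 'no linear relationship', given r and n pairs."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))

# Scenario A (hypothetical): a very weak correlation measured on a huge sample.
r_weak, n_large = 0.05, 10_000
t_weak = correlation_t(r_weak, n_large)

# Scenario B (hypothetical): a moderate correlation measured on a tiny sample.
r_moderate, n_small = 0.5, 10
t_moderate = correlation_t(r_moderate, n_small)

CRITICAL_LARGE_DF = 1.96  # two-sided 5% threshold for very large df
CRITICAL_DF_8 = 2.306     # two-sided 5% threshold for df = 8

print(f"A: r = {r_weak}, r-squared = {r_weak ** 2:.4f}, t = {t_weak:.2f}, "
      f"significant: {abs(t_weak) > CRITICAL_LARGE_DF}")
print(f"B: r = {r_moderate}, r-squared = {r_moderate ** 2:.2f}, t = {t_moderate:.2f}, "
      f"significant: {abs(t_moderate) > CRITICAL_DF_8}")
```

In scenario A the correlation clears the significance threshold even though it explains only a quarter of one percent of the variance. In scenario B a much larger coefficient fails the same test because the sample is so small.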