Cluster analysis

Quick definition

Cluster analysis is a form of exploratory data analysis in which observations are divided into groups that share common characteristics. Those groups are compared and contrasted with other groups to derive information about the observations.

Key takeaways

Cluster analysis allows organizations to better understand their customers by identifying individuals with similar traits, which can inform how the organization communicates with those customers.

There are five main clustering approaches, with the most common being K-means clustering and hierarchical, or hierarchy, clustering. The clustering approach an organization takes depends on what is being analyzed and why.

To ensure an accurate cluster analysis, make sure you choose helpful variables (behavior, geography, demographics, etc.) to evaluate the observations, cluster the observations into the right number of groups, and create clusters that have high intra-cluster similarity and low inter-cluster similarity.

John Bates is the director of product management for Predictive Marketing Solutions and Analytics Premium for Adobe Marketing Cloud. His core responsibility is to develop the product roadmap for all advanced statistics, data mining, predictive modeling, machine learning, and text mining/natural language processing solutions found within the products of Adobe's digital experience business unit.

What is cluster analysis?

What is the purpose of clustering?

What are the different types of clustering?

What are the characteristics of a good cluster analysis?

How do you perform cluster analysis?

What do you do with the results of a cluster analysis?

How do you make sure your cluster analysis is accurate?

What is the business strategy of cluster analysis?

How often do organizations update clusters?

What are the disadvantages of cluster analysis, and how can companies avoid problems?

Q: What is cluster analysis?

A: Cluster analysis is a type of unsupervised classification, meaning it doesn’t have any predefined classes, definitions, or expectations up front. It's a statistical data mining technique that's used to cluster observations that are similar to each other but dissimilar from other groups of observations.

A good metaphor for understanding clustering is an individual sorting out the chocolates in a sampler box. The person may have preferences for certain types of chocolate. When they sift through their box, there are lots of ways they can group that chocolate. They can group it by milk chocolate vs. dark chocolate, nuts vs. no nuts, fruit filling, nougat, etc. The process of separating pieces of candy into piles of similar candy based on those characteristics is clustering. We do it all the time.

Q: What is the purpose of clustering?

A: The general purpose of cluster analysis is to construct groups, or clusters, so that observations within a group are as similar as possible and observations belonging to different groups are as different as possible. Ultimately, the purpose depends on the application. In marketing, clustering helps marketers discover distinct groups of customers in their customer base. They then use this knowledge to develop targeted marketing campaigns. For example, clustering may help an insurance company identify groups of motor insurance policy holders with a high average claim cost.

The purpose behind clustering depends on how it is intended to be used, which is largely informed by the industry, the business unit, and the ultimate goal of what you're trying to accomplish.

Q: What are the different types of clustering?

A: There are five different major clustering approaches:

  1. Partitioning algorithms
  2. Hierarchy algorithms
  3. Density-based algorithms
  4. Grid-based algorithms
  5. Model-based algorithms

The most commonly used are partitioning and hierarchy algorithms. The main difference between those two is that partitioning algorithms look to create various partitions and then evaluate them by some criterion, while hierarchy-based algorithms decompose, or split information, based on a criterion.

K-means clustering is probably the most common partitioning algorithm. It's generally used when the number of classes is fixed in advance. An analyst tells the algorithm how many clusters they want to divide the observations into. Then each cluster is represented by the center of the cluster, or the mean. It's an efficient option, but it does have some weaknesses. It's only applicable when the mean is defined and the number of clusters is determined in advance. It also doesn't deal well with outliers, so if there are observations that are very different from the rest, K-means isn’t the best option.
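The assign-then-update loop described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the sample points, the starting centers, and the fixed iteration count are all assumptions made for the example.

```python
import math

def kmeans(points, centers, iterations=10):
    """Basic K-means on a list of coordinate tuples.

    `centers` holds the initial cluster centers; its length is the
    number of clusters the analyst asked for.
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center where it is
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers

# Hypothetical observations with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centers)
```

Because every center is a mean, a single extreme observation can pull a center toward it, which is the outlier weakness mentioned above.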

Another type of algorithm is called expectation maximization (EM). EM is a type of partitioning algorithm, but it's model based. It works similarly to K-means, but instead of assigning each observation to a single cluster based on its distance to the cluster means, EM uses probability distributions to compute the probability of cluster membership, or the probability that a single observation falls into a particular cluster. The great thing about EM is that cluster membership is not mutually exclusive. A customer can have some probability of belonging to multiple clusters. They will typically get assigned to the one with the highest probability, but they may also share a lot of characteristics or traits with another cluster.
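To make the idea of soft membership concrete, here is a small sketch of only the probability calculation (the "expectation" step) for a one-dimensional model with two normally distributed clusters. The cluster weights, means, and standard deviations are hypothetical values chosen for illustration. A full EM run would alternate this step with re-estimating those parameters from the data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def membership_probabilities(x, components):
    """Return P(cluster | x) for each (weight, mean, std) component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical two-cluster model of monthly spend: (weight, mean, std).
components = [(0.5, 20.0, 5.0), (0.5, 40.0, 5.0)]

for spend in (22.0, 30.0, 38.0):
    probs = membership_probabilities(spend, components)
    print(spend, [round(p, 3) for p in probs])
```

A customer spending 30 sits exactly halfway between the two cluster means in this setup, so each cluster gets a probability of 0.5. Customers closer to one mean lean strongly toward that cluster without being excluded from the other.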

The purpose of hierarchical clustering is to create a hierarchy of groups. This can either be done with an agglomerative process, which starts with each observation in its own cluster and then pairs up similar observations in multiple levels, or a divisive process, which starts with all the observations in a single cluster and then breaks them into different groups. A hierarchy cluster is like a data visualization tree. You can see how people start together and then divide out based on different criteria. Hierarchical clustering is great for the end user to be able to see those relationships.
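The agglomerative process can be sketched as a loop that keeps merging the two closest clusters. The example below assumes single linkage (cluster distance is the distance between the closest pair of points), which is only one of several possible linkage rules. The sample points and the stopping point of three clusters are also assumptions.

```python
import math

def single_linkage(a, b):
    """Distance between two clusters: their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    clusters = [[p] for p in points]  # every observation starts alone
    merges = []  # record of which clusters were joined, in order
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical observations: two tight pairs and one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 9.0)]
clusters, merges = agglomerate(points, target_clusters=3)
print(clusters)
```

The `merges` list records the order in which groups were joined, which is the information a tree diagram of the hierarchy displays.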

Q: What are the characteristics of a good cluster analysis?

A: A good clustering method will produce high-quality clusters, which means there is high similarity between observations in a single cluster, and low similarity between observations in different clusters. The quality of the clustering result depends on both the similarity measure used by the method and its implementation. The quality is also measured by the method’s ability to discover some or all hidden patterns that may exist within the data.

A lot of this is evaluated using what's called a “distance.” Clustering algorithms use a distance measure or metric to determine how to separate observations into the different groups. The most common one is called Euclidean distance, which is the straight-line distance between two points, such as two observations or an observation and the center of a cluster, but there are many options. A distance measure often shows how close an observation is to the mean, or average value, of the cluster, and the choice of measure influences the shape of the clusters that are found.
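As a small illustration, the sketch below computes Euclidean distance alongside Manhattan distance, which is shown only as an example of one of the other options. The two customers and their feature values are made up.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each variable."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical customers described by two numeric variables.
customer_a = (3.0, 4.0)
customer_b = (0.0, 0.0)

print(euclidean(customer_a, customer_b))  # 5.0
print(manhattan(customer_a, customer_b))  # 7.0
```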

Q: How do you perform cluster analysis?

A: The first step of cluster analysis is usually to choose the analysis method, which will depend on the size of the data as well as the types of variables. Hierarchical clustering, for example, is appropriate for small datasets, while K-means clustering is more appropriate for moderately large datasets and when the number of clusters is known in advance. Large datasets, or datasets with a mixture of different types of variables, generally require a two-step procedure.

After you decide on what method of analysis to use, start off the process by choosing the cases to subdivide into homogeneous groups or clusters. Those cases, or observations, can be any subject, person, or thing you want to analyze.

Next, you choose the variables to include. There could be 1,000 variables, or even 10,000 or 25,000. The number and types of variables chosen will determine what type of algorithm should be used. Then decide whether to standardize those variables in some way, so that every variable contributes equally to the distance or similarity between the cases. However, the analysis can be run with both standardized and unstandardized variables.
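One common way to standardize is the z-score: subtract the variable's mean and divide by its standard deviation. The sketch below assumes that approach, and the age and income values are invented for illustration.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores).

    Assumes the values are not all identical; a constant variable has
    a standard deviation of 0 and cannot be rescaled this way.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical variables on very different scales.
ages = [20, 30, 40, 50, 60]
incomes = [20000, 30000, 40000, 50000, 60000]

print([round(z, 3) for z in standardize(ages)])
print([round(z, 3) for z in standardize(incomes)])
```

Without rescaling, income differences measured in the thousands would dominate any distance calculation. After standardizing, both variables are on the same scale and contribute equally.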

Each analysis method has a different approach. For K-means clustering, select the number of clusters, then the algorithm iteratively estimates the cluster means and assigns each case to the cluster for which its distance to the cluster mean is the smallest. For hierarchical clustering, choose a statistic that quantifies how far apart or similar two cases are. Next, select a method for forming the groups. Finally, determine how many clusters are needed to represent the data by looking at how similar the clusters are at each split.

Q: What do you do with the results of a cluster analysis?

A: Depending on which clustering method is used, there's usually an associated visualization, which is a very common way to investigate the results. In the case of K-means, it’s common to use a scatter plot on X and Y axes that shows how far apart the groups of observations are. With that type of visualization, the groupings become very clear. In the case of hierarchical clustering, a visualization called a dendrogram is used, which shows the splits in the tree.

Q: How do you make sure your cluster analysis is accurate?

A: When looking at the accuracy of a cluster analysis, there are really three important factors: clustering tendency, number of clusters, and clustering quality. Before evaluating cluster performance, make sure that the data set you're working with has clustering tendency, which means that it doesn't contain uniformly distributed points. For example, it doesn’t benefit the analysis to choose a variable like “species” if every observation belongs to the same species, because the value will be the same for every observation. There are statistical methods for assessing clustering tendency.

Number of clusters is a required parameter for K-means clustering, but it’s useful for evaluating accuracy in other methods as well. By identifying how many clusters a team intends to work with, they can make sure observations are grouped in the best way for deriving helpful insights. Too few clusters means putting together observations that aren’t similar enough to take action, while too many clusters will divide your observations up too much to be useful.
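One widely used heuristic for this trade-off, not named in the discussion above, is to run the clustering for several candidate cluster counts and watch how much the within-cluster spread drops each time. The sketch below assumes one-dimensional data, a deliberately tiny K-means with a deterministic start, and made-up values.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny one-dimensional K-means with a deterministic start."""
    values = sorted(values)
    # Spread the initial centers across the sorted data.
    centers = [values[(2 * j + 1) * len(values) // (2 * k)] for j in range(k)]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers

def within_cluster_ss(groups, centers):
    """Sum of squared distances from each value to its cluster center."""
    return sum((v - c) ** 2 for g, c in zip(groups, centers) for v in g)

# Hypothetical data with three obvious groups.
values = [1, 2, 3, 11, 12, 13, 21, 22, 23]
wcss = {}
for k in range(1, 5):
    groups, centers = kmeans_1d(values, k)
    wcss[k] = within_cluster_ss(groups, centers)
    print(k, round(wcss[k], 2))
```

On this toy data the spread falls sharply up to three clusters and barely improves with a fourth, which suggests three is a reasonable choice. Real data rarely shows such a clean break, so treat the result as one input to the decision rather than a definitive answer.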

Clustering quality looks at the level of similarity within a cluster and among separate clusters. There are multiple methods that can be used to assess clustering quality, including the adjusted Rand index, the Fowlkes-Mallows score, mutual information-based scores, and homogeneity and completeness.
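As an illustration of one of these measures, the sketch below computes the adjusted Rand index from its standard pair-counting formula. The index compares two labelings of the same observations, so it assumes there is something to compare against, such as a reference segmentation or the result of an earlier run. The label lists are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings, corrected for chance.

    1.0 means the two labelings group the observations identically
    (cluster names may differ); values near 0 mean chance-level agreement.
    """
    n = len(labels_a)
    # Pairs of observations placed together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:  # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

# Hypothetical labelings of six customers.
reference = [0, 0, 1, 1, 2, 2]
renamed = [1, 1, 0, 0, 2, 2]    # same grouping, different cluster names
different = [0, 1, 2, 0, 1, 2]  # splits every reference pair apart

print(adjusted_rand_index(reference, renamed))
print(adjusted_rand_index(reference, different))
```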

Q: What is the business strategy of cluster analysis?

A: Cluster analysis can benefit a company in multiple ways, including how they market their products. It can affect whom they market those products to, what retention and sales strategies might be employed, and how they might evaluate prospective customers. They can cluster current customers and determine their lifetime value relative to their propensity for attrition, and that can inform how they communicate with different customers, and how to identify new high-value customers.

Q: How often do organizations update clusters?

A: It often depends on the use case. A high-tech retailer, like Best Buy, might use clusters at the highest level to align the entire enterprise on personas. Every employee, from those in the call centers to the individuals that are in the stores themselves, can look at every customer and classify that individual into the cluster or persona that person most aligns with. The company won’t change those clusters very often because they are informing a higher-level strategy across the entire business.

But then within certain departments, you might have micro clusters. Given one of those higher-level clusters, companies may want to cluster individuals more often because they are moving through different life cycle stages of the sales process. Clusters like these become stale over time, so companies might re-cluster those individuals depending on how long the sales cycle is.

Q: What are the disadvantages of cluster analysis, and how can companies avoid problems?

A: Cluster analysis is an exploratory technique. It's not about making predictions. Expectation maximization, for example, might look at the probability distribution of the data and the probability of assignment to a cluster, but it's not making any predictions regarding what those people are likely to do next. All EM is really doing is helping you to make sense of data across lots of different variables for a given observation. On their own, companies can only look at a couple of datasets simultaneously and see patterns. These models are helpful for evaluating lots of data to identify those patterns and then group people that are similar to one another across those traits. The advantage is that it helps in exploration. It helps inform strategy, such as how a company might think about their marketing campaigns or make business decisions, but it's not the end.

Cluster analysis also only looks at known customers. When a new customer begins to interact with a business, and they don't have all of the necessary data yet, the customer is an unknown quantity. They haven't authenticated, for example, so the company has very little information about them, for instance where the customer lives. A cluster analysis is static to the assignment at the time, and it only pertains to the data that’s put into it. It’s important to regularly re-evaluate clustering and re-apply analysis. If new data comes in, it should be incorporated into the analysis. Don’t get too fixated on individual cluster assignments. Allow clusters to be fluid. And remember to evaluate how customers may move between clusters based on certain interactions that they have with the business.
