Cluster analysis — a guide to smarter data-driven decisions.

Adobe Experience Cloud Team

02-24-2025


Cluster analysis is a statistical method used to identify and group similar data points together while also highlighting differences between groups.

Imagine a clothing retailer grouping customers based on purchasing habits — frequent buyers, seasonal shoppers, or one-time purchasers. Cluster analysis helps businesses identify these groups and tailor marketing strategies, from targeted ads to personalized offers.

The purpose of cluster analysis in marketing is to segment consumers into distinct groups with similar characteristics, allowing businesses to understand their target audience better and tailor their marketing strategies accordingly.


What is cluster analysis, and how does it work?

Cluster analysis is a type of unsupervised classification, meaning it works without predefined classes, labels, or expectations up front. It's a statistical data mining technique that groups observations so that members of a group are similar to each other and dissimilar to observations in other groups.

An individual sorting out the chocolates from a sampler box is a good metaphor for understanding clustering. The person may have preferences for certain types of chocolate.

When they sift through their box, there are lots of ways they can group that chocolate. They can group it by milk chocolate versus dark chocolate, nuts versus no nuts, nougat versus no nougat, and so on.

The process of separating pieces of candy into piles of similar candy based on those characteristics is clustering. We do it all the time.

For instance, an ecommerce platform may group customers by purchasing habits—such as budget-conscious shoppers, premium product buyers, and occasional browsers. This segmentation allows the platform to create tailored promotions for each group, driving engagement and sales.
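The idea can be sketched in a few lines of Python. This is a minimal, illustrative k-means implementation, not production code; the customer figures (orders per year, average order value) are invented for the example.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: assign each point to the nearest center, then
    recompute each center as the mean of its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customers as (orders per year, average order value):
customers = [(24, 30), (22, 35), (2, 200), (3, 180), (1, 25), (2, 30)]
centers, clusters = kmeans(customers, k=3)
```

With data like this, the algorithm tends to separate frequent low-spend buyers, rare high-spend buyers, and occasional browsers, which is exactly the kind of segmentation described above.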

Understanding cluster analysis.

Cluster analysis is at the forefront of data analysis. It’s no wonder fields like finance, insurance, retail, ecommerce, and marketing find it useful to identify patterns and relationships within their data.

There are five main clustering approaches; the most common are k-means clustering and hierarchical clustering. The approach an organization takes depends on what is being analyzed and why. With visualization techniques such as scatter plots and dendrograms, businesses can present their cluster analysis results in a clear and understandable way.

What is the purpose of clustering datasets?

The general purpose of cluster analysis in marketing is to construct groups or clusters while ensuring that the observations are as similar as possible within a group.

Ultimately, the purpose depends on the application. In marketing, clustering helps marketers discover distinct groups of customers in their customer base. They then use this knowledge to develop targeted marketing campaigns.

For example, clustering may help an insurance company identify groups of motor insurance policyholders with a high average claim cost.

The purpose behind clustering depends on how a company intends to use it, which is largely informed by the industry, the business unit, and what the company is trying to accomplish.

Why is cluster analysis important for business strategy?

Cluster analysis can benefit a company in multiple ways, including how it markets its products, whom it markets them to, what retention and sales strategies it employs, and how it evaluates prospective customers.

For example, a company can cluster current customers and weigh each group's lifetime value against its propensity for attrition. That analysis can inform how the company communicates with different segments and how it identifies new high-value customers.

What are the different types of clustering and when do you use them?


There are five major clustering algorithms:

Partitioning algorithms
Description: Partitioning algorithms, such as k-means clustering, divide the dataset into a predefined number of clusters by optimizing an objective function (for example, minimizing the sum of squared distances).
Best for: Datasets where the number of clusters is known in advance and the clusters are well separated.
Disadvantages:
  • Requires specifying the number of clusters beforehand
  • May struggle with clusters of varying sizes and densities
  • Sensitive to outliers
Marketing use case: Segmenting corporate clients into distinct groups based on purchase patterns, enabling targeted B2B email campaigns and personalized product offerings.

Hierarchical algorithms
Description: Hierarchical algorithms, including agglomerative and divisive clustering, build a nested hierarchy of clusters by merging or splitting clusters based on similarity.
Best for: Data with a hierarchical structure or cases where the number of clusters is unknown.
Disadvantages:
  • Computationally intensive for large datasets
  • Early clustering decisions cannot be undone
  • Sensitive to noise and outliers
Marketing use case: Organizing business customer data into a hierarchical structure (e.g., by industry, then by company size) to tailor multi-level marketing strategies and account management.

Density-based algorithms
Description: Density-based algorithms, like DBSCAN, identify clusters as dense regions of data points separated by lower-density areas, allowing for the discovery of arbitrarily shaped clusters.
Best for: Datasets with clusters of varying shapes and sizes, especially in the presence of noise.
Disadvantages:
  • Sensitive to parameter selection
  • May struggle with clusters of varying densities and high-dimensional data
  • Can misclassify border points
Marketing use case: Detecting clusters of high engagement among B2B clients within noisy transaction data to focus marketing efforts on high-value accounts or regions of concentrated business activity.

Grid-based algorithms
Description: Grid-based algorithms divide the data space into a finite number of cells that form a grid structure, then identify clusters based on the density of data points within these cells.
Best for: Handling large datasets and situations where a fast clustering method is required.
Disadvantages:
  • Heavily dependent on grid resolution
  • May not capture clusters of arbitrary shapes
  • Potential loss of detail
Marketing use case: Rapidly clustering large volumes of B2B lead data (e.g., by geolocation or other business attributes) to identify regional hotspots for targeted sales outreach and marketing campaigns.

Model-based algorithms
Description: Model-based algorithms assume that the data is generated by a mixture of underlying probability distributions and aim to estimate the parameters of these distributions.
Best for: Datasets whose distribution can be well modeled by statistical distributions.
Disadvantages:
  • Computationally expensive
  • Requires correct model assumptions
  • Sensitive to initial conditions and potential model misspecification
Marketing use case: Segmenting B2B customers by fitting models (such as Gaussian mixture models) to transaction data, uncovering distinct buying behavior segments for tailored marketing strategies.
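To make the density-based idea concrete, here is a minimal DBSCAN sketch in Python. The points are invented for illustration, and a real project would use an optimized library implementation; the sketch only shows the core-point/border-point/noise logic.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: a point with at least `min_pts` neighbors (itself
    included) within `eps` is a core point; clusters grow outward from
    cores. Points reachable from no core are labeled -1 (noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # noise for now; a cluster may claim it later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # previously noise, now a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is also a core point, so expand
                queue.extend(nbrs)
    return labels

# Two tight hypothetical groups plus one stray point:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2.0, min_pts=2)
# → [0, 0, 0, 1, 1, 1, -1]  (the stray point is flagged as noise)
```

Note how the outlier is labeled noise rather than forced into a cluster, which is why density-based methods suit noisy transaction data.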

What are the characteristics of a good cluster analysis?

A good cluster analysis accurately groups data in a way that is useful and actionable. It uncovers real patterns in the data, leading to insights that drive decisions. A bad cluster analysis, on the other hand, creates misleading or arbitrary groups that don’t help solve a problem or add value.

A good cluster analysis produces clusters that are internally cohesive, clearly separated from one another, and interpretable enough to act on.

For example, imagine you're segmenting B2B customers based on their buying habits. A good clustering model might group them into segments such as high-volume repeat buyers, seasonal purchasers, and dormant accounts.

Each group can be targeted with a specific marketing strategy, improving conversions and increasing customer satisfaction.

In contrast, a poor cluster analysis produces groups that are arbitrary, overlapping, or disconnected from any business question.

For instance, suppose a marketing team clusters customers based on the number of vowels in their company name. While mathematically possible, this grouping has zero business value — it doesn’t predict behavior, preferences, or needs. The result? A useless segmentation that wastes time and resources.

What are the disadvantages of cluster analysis, and how can companies avoid problems?

Arbitrary number of clusters
Problem: Determining the optimal number of clusters (k) can be challenging and may not reflect the true data structure.
How to avoid it: Employ methods like the elbow method or silhouette score to estimate the appropriate k. Experiment with different values and validate the results.

Sensitivity to outliers and noise
Problem: Outliers can distort cluster formation, leading to inaccurate groupings.
How to avoid it: Preprocess data to identify and handle outliers. Consider using density-based clustering algorithms, such as DBSCAN, which are more robust to noise.

Poor interpretability
Problem: Clusters may be difficult to understand or apply in practical scenarios.
How to avoid it: Select relevant features carefully. Use visualization techniques like principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) to enhance interpretability.

Unequal cluster sizes and density
Problem: Algorithms like k-means assume clusters of similar size and density, which may not be realistic.
How to avoid it: Use alternative methods such as hierarchical clustering or Gaussian mixture models (GMM) that can accommodate clusters of varying shapes and sizes.

Computational complexity
Problem: Clustering large datasets can be resource intensive and time consuming.
How to avoid it: Implement grid-based or sampling-based approaches to improve computational efficiency.

Overfitting to noise
Problem: The model might identify patterns in random noise, resulting in spurious clusters.
How to avoid it: Regularly validate clusters against real-world business logic and use holdout datasets to test for overfitting.

Dependency on feature selection
Problem: Inappropriate feature selection can lead to misleading clusters.
How to avoid it: Perform thorough feature selection or apply dimensionality reduction techniques like PCA or linear discriminant analysis (LDA) before clustering.

In short, companies can avoid clustering pitfalls by validating the number of clusters, preprocessing data for outliers, selecting features deliberately, and checking every segmentation against real business logic.

How do you perform cluster analysis?


Step 1: Choose an analysis method.

The first step of cluster analysis is usually to choose the analysis method, which will depend on the size of the data and the types of variables.

Hierarchical clustering, for example, is appropriate for small datasets, while k-means clustering is more appropriate for moderately large datasets and when the number of clusters is known in advance.

Large datasets, or datasets that mix different types of variables, generally require a two-step procedure.

Step 2: Determine the number of cases.

After you decide on what method of analysis to use, start the process by choosing the number of cases to subdivide into homogeneous groups or clusters. Those cases, or observations, can be any subject, person, or thing you want to analyze.

Step 3: Select variables for analysis.

Next, choose the variables to include. There could be 1,000 variables, or even 10,000 or 25,000. The number and types of variables chosen will determine what type of algorithm should be used.

Step 4: Decide on variable standardization.

Then decide whether to standardize the variables so that each one contributes equally to the distance or similarity between cases. The analysis can be run with either standardized or unstandardized variables.
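A quick sketch of why standardization matters, using z-scores on invented columns with very different scales:

```python
import statistics

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1 so it
    contributes equally to distance calculations."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]

# Hypothetical columns: annual orders (small numbers) vs. revenue (large numbers).
orders = [2, 4, 6, 8]
revenue = [1000, 3000, 5000, 7000]
z_orders = standardize(orders)
z_revenue = standardize(revenue)
# After standardization both columns span the same range, so neither
# dominates a Euclidean distance between cases.
```

Without this step, the revenue column's raw magnitude would swamp the orders column in any distance-based algorithm.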

Step 5: Apply the chosen algorithm.

Each analysis method has a different approach. K-means, for example, iteratively assigns each case to the nearest cluster center and recomputes the centers, while hierarchical methods repeatedly merge or split clusters based on similarity.

Step 6: Finalize the number of clusters.

Finally, determine how many clusters are needed to represent the data. The algorithm measures how similar the clusters are and merges or splits them accordingly.

What do you do with the results of a cluster analysis?

Depending on the clustering method, there's usually an associated visualization, which is the most common way to investigate the results. In the case of k-means, it's common to plot the observations on x and y axes so the distances between groups are visible.

With that type of visualization, the groupings become very clear. In the case of hierarchical clustering, a dendrogram is used: a tree diagram showing where clusters merge or split and at what distance.
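To make the dendrogram concrete, here is a minimal single-linkage agglomerative sketch in Python on invented toy points. The recorded merge distances are exactly the heights a dendrogram plots.

```python
import math

def agglomerative(points, target_k):
    """Single-linkage agglomerative clustering: start with every point in
    its own cluster, then repeatedly merge the two closest clusters until
    `target_k` remain. The merge distances are the dendrogram heights."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(d)  # the height at which the dendrogram joins i and j
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(pts, target_k=2)
# Two pairs merge at distance 1.0 each, leaving two clusters of two points.
```

Cutting the tree at a chosen height, as the dendrogram visualization suggests, is equivalent to stopping the merging at that distance.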

How do you make sure your cluster analysis is accurate?

First, assess cluster tendency. Before diving into any clustering algorithm, it’s important to verify whether your dataset even has the potential to form meaningful clusters or if it is randomly distributed.

One common method to determine this is the Hopkins statistic, which measures how likely it is that your data could have come from a uniform distribution. Under one common formulation, a value near 0 suggests that the data has a strong cluster tendency, whereas a value around 0.5 indicates randomness.

In addition, visual tools like the visual assessment of cluster tendency (VAT) help by reordering the dissimilarity matrix to visually highlight potential clusters. If these tests indicate that your data naturally groups together, you can proceed with clustering; if not, clustering might not yield useful insights.
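A rough sketch of the Hopkins statistic for 2-D data, following the convention above (values near 0 indicate cluster tendency). The dataset is invented, and a real analysis would use a vetted implementation.

```python
import math
import random

def hopkins(points, n_samples=10, seed=0):
    """Hopkins statistic, convention where values near 0 mean clustered:
    w sums nearest-neighbor distances from sampled real points, u sums
    nearest-real-point distances from uniformly random points in the
    data's bounding box. Returns w / (u + w)."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    sample = rng.sample(points, n_samples)
    # Nearest-neighbor distance within the real data (small if clustered):
    w = sum(min(math.dist(p, q) for q in points if q != p) for p in sample)
    # Nearest real point from uniform random locations (large if clustered):
    uniform = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
               for _ in range(n_samples)]
    u = sum(min(math.dist(p, q) for q in points) for p in uniform)
    return w / (u + w)

# Two tight invented blobs — strong cluster tendency, so the value is low:
clustered = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (30, 30)]
             for dx in (0, 1, 2) for dy in (0, 1, 2)]
h = hopkins(clustered)
```

For this clearly clustered data, h comes out well below 0.5; on uniformly scattered points it would hover around 0.5.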

Next, determine the optimal number of clusters. Selecting the right number of clusters (k) is crucial because too few clusters may oversimplify the data, while too many clusters can lead to overfitting.

The elbow method is one popular approach: You plot the within-cluster sum of squares against the number of clusters and look for a point where the improvement in clustering performance begins to level off — the “elbow.” Another useful metric is the silhouette score, which evaluates how well each data point fits into its assigned cluster relative to other clusters. Higher silhouette scores indicate more distinct and well-separated clusters.

Additionally, the gap statistic compares the observed within-cluster dispersion with that expected under a null distribution, helping to identify the optimal k by highlighting where the gap is largest.
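The elbow method can be sketched end to end. This toy example uses invented blob data and a deterministic farthest-first initialization (rather than random restarts, purely so the sketch is reproducible), running k-means for several values of k and recording the within-cluster sum of squares (WCSS).

```python
import math

def farthest_first(points, k):
    """Deterministic initialization: start from the first point, then
    repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def wcss_for_k(points, k, iters=20):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        centers = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sum(math.dist(p, centers[i]) ** 2
               for i, g in enumerate(groups) for p in g)

# Three well-separated toy blobs of four points each:
pts = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (20, 0), (10, 20)]
       for dx in (0, 1) for dy in (0, 1)]
wcss = {k: wcss_for_k(pts, k) for k in (1, 2, 3, 4)}
# WCSS falls steeply up to k=3 (the true number of blobs), then levels off.
```

Plotting these WCSS values against k produces the curve whose bend ("elbow") at k=3 identifies the natural number of clusters.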

Finally, evaluate the clustering quality. Once you have established clusters, it’s important to confirm that they are both internally cohesive and externally separated.

The silhouette coefficient, which ranges from –1 to 1, is widely used for this purpose — a score closer to 1 means that the clusters are well defined. The Dunn index calculates the ratio between the smallest distance between observations not in the same cluster (inter-cluster distance) and the largest distance within a cluster (intra-cluster distance). Higher Dunn index values suggest better quality clusters. Conversely, the Davies–Bouldin index measures the average similarity between each cluster and its most similar one, with lower values indicating better clustering quality.
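The silhouette coefficient described above can be computed directly. A minimal Python sketch on invented points, comparing a sensible labeling with an arbitrary one:

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: for each point, a is the mean distance
    to the rest of its own cluster and b the mean distance to the nearest
    other cluster; the point's score is (b - a) / max(a, b)."""
    def mean_dist(p, members):
        return sum(math.dist(p, q) for q in members) / len(members)
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(p, own)
        b = min(mean_dist(p, [q for j, q in enumerate(points)
                              if labels[j] == lab])
                for lab in set(labels) if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])  # two tight, separated groups
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])   # labels that ignore the structure
```

The well-separated labeling scores close to 1, while the arbitrary labeling scores near zero or below, which is exactly the contrast the metric is designed to expose.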

In summary, ensuring an accurate cluster analysis involves a three-step process:

  1. Assess cluster tendency: Determine if your data naturally forms clusters using statistical tests like the Hopkins statistic and visualization tools such as VAT.
  2. Determine the optimal number of clusters: Identify the right number of clusters (k) by employing methods like the elbow method, silhouette score, and gap statistic to avoid oversimplification or overfitting.
  3. Evaluate clustering quality: Confirm that your clusters are both compact and well separated by using metrics such as the silhouette coefficient, Dunn index, and Davies–Bouldin index.

Getting started with cluster analysis.

The main benefit of cluster analysis is that it allows businesses to uncover patterns and relationships within their data, enabling them to make informed decisions and take action based on real-time insights.

If you’re ready to get started with cluster analysis, the first step is to find a proven software tool that can help you analyze and interpret your data effectively.

Adobe Analytics turns real-time data into real-time insights. As more than a web analytics solution, it takes data from any point in the customer journey and turns it into an insight that guides your next best action. Analytics uses artificial intelligence (AI) to deliver predictive insights based on the full scope of your data, allowing users to view and manipulate data in real time.

Request a demo or watch the overview video to learn more about Adobe Analytics.