A descriptive application of association analysis for use in business data analytics.
Market basket analysis is a derivation of association analysis, where businesses analyse volumes of customer transaction data to understand purchasing behaviour. This information is beneficial in supporting operations such as marketing promotions, product placement, inventory management and customer relationship management. [1],[2] Online businesses also leverage it to carry out cross-selling (selling related or complementary products together), product recommendations, checkout/point-of-sale offers and post-purchase marketing, all in real time. [3]
Association analysis, on the other hand, involves discovering hidden patterns within a dataset and measuring their strength. [3] These relationships are represented as association rules, such as {Sunflower Oil} -> {Salt}, which suggests that, based on past transactions, if sunflower oil is found in a customer’s ‘basket’ (the antecedent), there is a chance that salt exists within the same basket (the consequent). [4] Rules do not venture into the transaction history of individual customers but focus solely on the items within each transaction, whether or not they were purchased by a repeat customer. Analysing individual purchase histories is instead the domain of collaborative filtering, which is largely implemented in recommendation systems. [4],[5]
To measure the strength of generated rules, different metrics are applied to the itemset, i.e., the set of all items within an association rule. Two popular measures are support and confidence. Support, as shown in Equation 1, measures how frequently an itemset occurs in the set of transactions. [3] Itemsets with high support values are given more consideration in the analysis than those with low support, as they point to products that are purchased more frequently and thus warrant active promotion. [2] Confidence, on the other hand, defines the likelihood of the occurrence of the consequent in a rule given its antecedent(s), as illustrated in Equation 2. [4] A higher value implies that the rule is more reliable.
\[ \small \textit{Equation 1: Support Metric }\\ \small Support(\{Detergent\}→\{Bleach\}) = \frac{freq(\{Detergent\} \cup \{Bleach\})}{N} \text{; where } freq \text{ = number of transactions containing the itemset, } N \text{ = number of all transactions} \\~\\ \small \textit{Equation 2: Confidence Metric }\\ \small Confidence(\{Detergent\}→\{Bleach\}) = \frac{Support(\{Detergent\} \cup \{Bleach\})}{Support(\{Detergent\})} \]
The Apriori algorithm is widely used in association rule mining; besides the analysis of retail transactions, it is also applied in areas such as web mining, to analyse online clickstream data across web pages, and healthcare, to determine frequently occurring diseases within a particular area. [6],[2] Although the method is well suited to market basket analysis, the Apriori algorithm can be computationally expensive when discovering patterns in a large dataset and can generate many insignificant rules. [7] Nevertheless, improvements to the algorithm [8] and other techniques such as time series clustering [7] and the UniqAR algorithm [9] have been developed to bridge these gaps.
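To make Equations 1 and 2 concrete, consider a toy illustration with hypothetical counts: suppose 5 out of 100 transactions contain both detergent and bleach, and 20 of the 100 contain detergent.
\[ \small Support(\{Detergent\}→\{Bleach\}) = \frac{5}{100} = 0.05 \qquad Confidence(\{Detergent\}→\{Bleach\}) = \frac{0.05}{20/100} = 0.25 \]
That is, detergent and bleach appear together in 5% of all transactions, and a quarter of the baskets containing detergent also contain bleach.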
The following is a demonstration of association rule mining using the Apriori algorithm for market basket analysis, where the objectives are to discover the most purchased products and product combinations from a dataset of purchases made at a grocery store between January 2014 and December 2015. [10], [11], [12], [13]
For this project, we’ll work with these libraries:
install.packages('arules' , repos = "http://cran.us.r-project.org")
install.packages('arulesViz', repos = "http://cran.us.r-project.org")
install.packages('ggplot2', repos = "http://cran.us.r-project.org")
install.packages('plyr', repos = "http://cran.us.r-project.org")
install.packages('dplyr', repos = "http://cran.us.r-project.org")
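Once installed, the libraries can be loaded and the dataset imported. The sketch below assumes the Kaggle file has been saved as Groceries_dataset.csv in the working directory; the file name and path are assumptions, so adjust them to your setup.
# Load the required libraries (plyr before dplyr to avoid masking issues)
library(arules)     # read.transactions(), apriori(), inspect()
library(arulesViz)  # visualising association rules
library(ggplot2)    # charting the best sellers
library(plyr)       # ddply() for grouping items into baskets
library(dplyr)      # glimpse(), %>%, group_by(), summarise(), arrange()
# Import the grocery transactions downloaded from Kaggle (file name assumed)
Groceries_dataset <- read.csv("Groceries_dataset.csv", stringsAsFactors = FALSE)
glimpse(Groceries_dataset)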
Rows: 38,765
Columns: 3
$ Member_number <int> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 38…
$ Date <chr> "21-07-2015", "05-01-2015", "19-09-2015", "1…
$ itemDescription <chr> "tropical fruit", "whole milk", "pip fruit",…
The preview above shows the first few records in the dataset (which are not stored in any particular order); each record contains the customer (member) number, date of purchase and description of the item bought.
There is no need to resolve missing values since all attributes are complete (refer to the data card on Kaggle).
Currently, the transactions are in single format, i.e., each row represents a single item purchased by a customer on a particular date. This needs to be converted to basket format, where each row represents the collection of items bought by a customer on a given date.
To prepare the baskets/sets of transactions, the member number and date of purchase are first merged into a single character-type column, transactionId, using paste(). Note that both columns need to be of type character, so Member_number is converted before merging. Afterwards, the Member_number and Date columns can be dropped.
Groceries_dataset$transactionId <- paste(as.character(Groceries_dataset$Member_number),Groceries_dataset$Date,sep="_")
Groceries_dataset$Member_number <- NULL
Groceries_dataset$Date <- NULL
glimpse(Groceries_dataset)
Rows: 38,765
Columns: 2
$ itemDescription <chr> "tropical fruit", "whole milk", "pip fruit",…
$ transactionId <chr> "1808_21-07-2015", "2552_05-01-2015", "2300_…
Next, the items are grouped by transaction ID using the ddply() function: each group of rows is passed to a function as the dataframe df1, and paste() with collapse = "," combines that group's items into a single row of characters separated by commas. The results are stored in the Groceries_basket dataframe, after which transactionId is dropped and the new column of transactions is renamed.
Groceries_basket <- ddply(Groceries_dataset, c("transactionId"), function(df1) paste(df1$itemDescription, collapse = ","))
Groceries_basket$transactionId<- NULL
colnames(Groceries_basket)<-c("basketTransactions")
head(Groceries_basket)
basketTransactions
1 sausage,whole milk,semi-finished bread,yogurt
2 whole milk,pastry,salty snack
3 canned beer,misc. beverages
4 sausage,hygiene articles
5 soda,pickled vegetables
6 frankfurter,curd
Lastly, the baskets of transactions are exported to an external CSV file for later use in generating association rules. Here, quotes and row names are removed.
write.csv(Groceries_basket,"groceries_basket.csv",quote = FALSE,row.names = FALSE)
Alternatively, the single format can still be applied, where items are not grouped per transaction. Here, each transactionId is paired with its item in a character vector (singleTransactions), from which a new dataframe, Groceries_singles, is created. Eventually, this is written to a CSV file for later use and the intermediate vector is removed.
singleTransactions <- paste(as.character(Groceries_dataset$transactionId),Groceries_dataset$itemDescription,sep=";")
Groceries_singles <- data.frame(singleTransactions)
glimpse(Groceries_singles)
Rows: 38,765
Columns: 1
$ singleTransactions <chr> "1808_21-07-2015;tropical fruit", "2552_0…
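The export step described above could look as follows; a minimal sketch, assuming the file is saved as groceries_single.csv (the name used when the transactions are read back in later).
# Write the single-format transactions to CSV, dropping quotes and row names
write.csv(Groceries_singles, "groceries_single.csv", quote = FALSE, row.names = FALSE)
# Remove the intermediate character vector
rm(singleTransactions)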
To generate a chart of the best sellers, the dataset is grouped by item name, the number of purchases per item is counted, and the results are arranged in descending order. Note that %>% is the pipe operator, which passes the value on its left into the next function or expression.
Best_sellers <- Groceries_dataset %>% group_by(itemDescription) %>% summarise(count=n()) %>% arrange(desc(count))
Best_sellers <- head(Best_sellers, n=15)
Best_sellers
# A tibble: 15 × 2
itemDescription count
<chr> <int>
1 whole milk 2502
2 other vegetables 1898
3 rolls/buns 1716
4 soda 1514
5 yogurt 1334
6 root vegetables 1071
7 tropical fruit 1032
8 bottled water 933
9 sausage 924
10 citrus fruit 812
11 pastry 785
12 pip fruit 744
13 shopping bags 731
14 canned beer 717
15 bottled beer 687
The top 3 sellers are whole milk, other vegetables and rolls/buns. These items will be visualised as follows; more information on the ggplot() functions used can be found here.
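A bar chart of the 15 best sellers could be generated with ggplot2 along these lines; this is a sketch of one possible chart, not necessarily the exact figure from the original analysis.
# Horizontal bar chart of the top 15 best-selling items, ordered by purchase count
ggplot(Best_sellers, aes(x = reorder(itemDescription, count), y = count)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Item", y = "Number of purchases", title = "Top 15 Best Sellers")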
Before the rules are generated, the set of transactions is first prepared through read.transactions(), which can read sets in either basket or single format.
# Basket transactions - the separator between items per transaction is ','
# Transactions <- read.transactions("groceries_basket.csv", format = 'basket',sep = ",")
# Single transactions - the separator between the transactionId and item is ';'; 'cols' specifies the columns holding the transaction ID and the item
Transactions <- read.transactions("groceries_single.csv", format = 'single',sep = ";", cols = c(1,2))
summary(Transactions)
transactions as itemMatrix in sparse format with
14964 rows (elements/itemsets/transactions) and
168 columns (items) and a density of 0.01511803
most frequent items:
whole milk other vegetables rolls/buns soda
2363 1827 1646 1453
yogurt (Other)
1285 29432
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10
206 10013 2726 1273 338 179 113 96 19 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 2.00 2.54 3.00 10.00
includes extended item information - examples:
labels
1 1808_21-07-2015
2 abrasive cleaner
3 artif. sweetener
includes extended transaction information - examples:
transactionID
1 1000_15-03-2015
2 1000_24-06-2014
3 1000_24-07-2015
This returns more information on the transactions, i.e.:
- the number of transactions (14964) and number of items (168)
- the most frequently purchased items, which align with the chart generated above
- the number of items per transaction, which ranges from 1 to 10; most transactions contained 2 or 3 items
To generate the rules, the Apriori algorithm is run with a support of 0.001 and a confidence of 10%. The minimum and maximum number of items per rule are set to 2 and 5 respectively; the minimum of 2 avoids rules with an empty antecedent being generated. The rules are then sorted by confidence in descending order.
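The call producing the output below follows the form reported later in the mining info of summary(Rules):
Rules <- apriori(Transactions, parameter = list(supp = 0.001, conf = 0.1, minlen = 2, maxlen = 5))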
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 5 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [131 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Rules <- sort(Rules, by='confidence', decreasing = TRUE)
options(digits=2) # display numbers to 2 significant digits
summary(Rules)
set of 131 rules
rule length distribution (lhs + rhs):sizes
2 3
114 17
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.13 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.0010 Min. :0.100 Min. :0.005 Min. :0.65
1st Qu.:0.0013 1st Qu.:0.110 1st Qu.:0.010 1st Qu.:0.81
Median :0.0019 Median :0.122 Median :0.017 Median :0.88
Mean :0.0029 Mean :0.126 Mean :0.024 Mean :0.95
3rd Qu.:0.0038 3rd Qu.:0.135 3rd Qu.:0.032 3rd Qu.:1.03
Max. :0.0148 Max. :0.256 Max. :0.122 Max. :2.18
count
Min. : 15
1st Qu.: 20
Median : 29
Mean : 44
3rd Qu.: 56
Max. :222
mining info:
data ntransactions support confidence
Transactions 14964 0.001 0.1
call
apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.1, minlen = 2, maxlen = 5))
inspect(Rules[1:10])
lhs rhs support confidence
[1] {sausage, yogurt} => {whole milk} 0.0015 0.26
[2] {rolls/buns, sausage} => {whole milk} 0.0011 0.21
[3] {sausage, soda} => {whole milk} 0.0011 0.18
[4] {semi-finished bread} => {whole milk} 0.0017 0.18
[5] {rolls/buns, yogurt} => {whole milk} 0.0013 0.17
[6] {sausage, whole milk} => {yogurt} 0.0015 0.16
[7] {detergent} => {whole milk} 0.0014 0.16
[8] {ham} => {whole milk} 0.0027 0.16
[9] {bottled beer} => {whole milk} 0.0072 0.16
[10] {frozen fish} => {whole milk} 0.0011 0.16
coverage lift count
[1] 0.0057 1.62 22
[2] 0.0053 1.35 17
[3] 0.0059 1.14 16
[4] 0.0095 1.11 25
[5] 0.0078 1.08 20
[6] 0.0090 1.91 22
[7] 0.0086 1.03 21
[8] 0.0171 1.01 41
[9] 0.0453 1.00 107
[10] 0.0068 0.99 16
131 rules were generated in total, each containing either 2 or 3 items across both sides (lhs + rhs). A summary of quality measures is also provided; find more information about them here. Looking at the top 10 rules, an analysis can be made using the confidence values, e.g. rule [1] has a confidence of 0.26, meaning that roughly 26% of the transactions containing sausage and yogurt also contained whole milk.
Lastly, rules regarding a particular item of interest, e.g. rolls/buns, can be generated to find the products most frequently purchased together with it. Here, the items on the lhs are the ones regularly found in a basket together with rolls/buns (the rhs). This information can be used by retailers for targeted promotions/discounts and product bundling.
Rolls_rules <- apriori(Transactions, parameter = list(supp=0.001, conf=0.1), minlen=2, appearance = list(default="lhs",rhs="rolls/buns"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [17 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
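The top rules for rolls/buns, sorted by confidence, can then be inspected; the exact call is not shown above, but a sketch such as the following would list them.
# Sort the rolls/buns rules by confidence and view the strongest ones
inspect(head(sort(Rolls_rules, by = "confidence"), n = 6))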
lhs rhs support confidence
[1] {processed cheese} => {rolls/buns} 0.0015 0.14
[2] {packaged fruit/vegetables} => {rolls/buns} 0.0012 0.14
[3] {seasonal products} => {rolls/buns} 0.0010 0.14
[4] {red/blush wine} => {rolls/buns} 0.0013 0.13
[5] {sausage, whole milk} => {rolls/buns} 0.0011 0.13
[6] {whole milk, yogurt} => {rolls/buns} 0.0013 0.12
coverage lift count
[1] 0.0102 1.3 22
[2] 0.0085 1.3 18
[3] 0.0071 1.3 15
[4] 0.0105 1.2 20
[5] 0.0090 1.2 17
[6] 0.0112 1.1 20
This can also be used to answer questions like: customers who bought rolls/buns also bought…. We simply switch the order of purchasing, i.e., rolls/buns moves from the rhs to the lhs.
Rolls_rules <- apriori(Transactions, parameter = list(supp=0.001, conf=0.1), minlen=2, appearance = list(lhs="rolls/buns",default="rhs"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
lhs rhs support confidence coverage lift
[1] {rolls/buns} => {whole milk} 0.014 0.13 0.11 0.8
count
[1] 209
Lastly, an interactive graph can be used to illustrate the top 10 rules.
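One way to draw such an interactive graph is with arulesViz, using the graph method and the htmlwidget engine; a minimal sketch:
# Interactive network graph of the top 10 rules (rendered as an HTML widget)
plot(Rules[1:10], method = "graph", engine = "htmlwidget")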