A descriptive application of association analysis for use in business data analytics.
Market basket analysis is a derivation of association analysis, where businesses analyse volumes of customer transaction data to understand purchasing behaviour. This information is beneficial in supporting operations such as marketing promotions, product placement, inventory management and customer relationship management. [1],[2] Online businesses also leverage it to carry out cross-selling (selling related or complementary products together), product recommendations, checkout/point-of-sale offers and post-purchase marketing, all in real time. [3]
Association analysis, on the other hand, involves discovering hidden patterns within a dataset and measuring their strength. [3] These relationships are represented as association rules, such as {Sunflower Oil} -> {Salt}, which suggests that, based on past transactions, if sunflower oil is found in a customer’s ‘basket’ (the antecedent), there is a chance that salt exists within the same basket (the consequent). [4] Rules do not venture into the transaction history of individual customers but focus solely on the items within each transaction, whether or not they were purchased by a repeat customer. Analysing individual purchase histories is instead the domain of collaborative filtering, which is largely implemented in recommendation systems. [4],[5]
To measure the strength of generated rules, different metrics are applied to the itemset, i.e., the set of all items within an association rule. Two popular measures are support and confidence. Support, as shown in Equation 1, measures how frequently an itemset occurs in the set of transactions. [3] Itemsets with high support values are given more consideration in the analysis than those with low support, as they point to products that are purchased more frequently and thus warrant active promotion. [2] Confidence, on the other hand, defines the likelihood of the occurrence of the consequent in a rule given its antecedent(s), as illustrated in Equation 2. [4] A higher value implies that the rule is more reliable.
\[ \small \textit{Equation 1: Support Metric }\\ \small Support(\{Detergent\}→\{Bleach\}) = \frac{freq(\{Detergent\} \cup \{Bleach\})}{N} \text{; where } freq \text{ = number of transactions containing the itemset, } N \text{ = number of all transactions} \\~\\ \small \textit{Equation 2: Confidence Metric }\\ \small Confidence(\{Detergent\}→\{Bleach\}) = \frac{Support(\{Detergent\} \cup \{Bleach\})}{Support(\{Detergent\})} \]
The Apriori algorithm is widely used in association rule mining; besides the analysis of retail transactions, it is also applied in areas such as web mining, to analyse online clickstream data across web pages, and healthcare, to determine frequently occurring diseases within a particular area. [6],[2] Although the method is well suited to market basket analysis, the Apriori algorithm can be computationally expensive when discovering patterns in a large dataset and can generate many insignificant rules. [7] Nevertheless, improvements to the algorithm [8] and other techniques such as time series clustering [7] and the UniqAR algorithm [9] have been developed to bridge these gaps.
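To make Equations 1 and 2 concrete, consider a toy illustration with hypothetical counts: suppose 5 out of 100 transactions contain both detergent and bleach, and 20 of the 100 contain detergent.
\[ \small Support(\{Detergent\}→\{Bleach\}) = \frac{5}{100} = 0.05 \qquad Confidence(\{Detergent\}→\{Bleach\}) = \frac{0.05}{20/100} = 0.25 \]
That is, detergent and bleach appear together in 5% of all transactions, and a quarter of the baskets containing detergent also contain bleach.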
The following is a demonstration of association rule mining using the Apriori algorithm for market basket analysis, where the objectives are to discover the most purchased products and product combinations from a dataset of purchases made at a grocery store between January 2014 and December 2015. [10], [11], [12], [13]
For this project, we’ll work with these libraries:
install.packages('arules' , repos = "http://cran.us.r-project.org")
install.packages('arulesViz', repos = "http://cran.us.r-project.org")
install.packages('ggplot2', repos = "http://cran.us.r-project.org")
install.packages('plyr', repos = "http://cran.us.r-project.org")
install.packages('dplyr', repos = "http://cran.us.r-project.org")
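Once installed, the libraries can be loaded and the dataset imported. The sketch below assumes the Kaggle file has been saved as Groceries_dataset.csv in the working directory; the file name and path are assumptions, so adjust them to your setup.
# Load the required libraries (plyr before dplyr to avoid masking issues)
library(arules)     # read.transactions(), apriori(), inspect()
library(arulesViz)  # visualising association rules
library(ggplot2)    # charting the best sellers
library(plyr)       # ddply() for grouping items into baskets
library(dplyr)      # glimpse(), %>%, group_by(), summarise(), arrange()
# Import the grocery transactions downloaded from Kaggle (file name assumed)
Groceries_dataset <- read.csv("Groceries_dataset.csv", stringsAsFactors = FALSE)
glimpse(Groceries_dataset)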
Rows: 38,765
Columns: 3
$ Member_number <int> 1808, 2552, 2300, 1187, 3037, 4941, 4501, 38…
$ Date <chr> "21-07-2015", "05-01-2015", "19-09-2015", "1…
$ itemDescription <chr> "tropical fruit", "whole milk", "pip fruit",…
The preview above shows the first few records in the dataset (which are not stored in any particular order); each record contains the customer (member) number, date of purchase and description of the item bought.
There is no need to resolve missing values since all attributes are complete (refer to the data card on Kaggle).
Currently, the transactions are in single format, i.e., each row represents a single item purchased by a customer on a particular date. This needs to be converted to basket format, where each row represents the collection of items bought by a customer on a given date.
To prepare the baskets/sets of transactions, the member number and date of purchase are first merged into a single character-type column, transactionId, using paste(). Note that both columns need to be of type character, so Member_number is converted before merging. Afterwards, the Member_number and Date columns can be dropped.
Groceries_dataset$transactionId <- paste(as.character(Groceries_dataset$Member_number),Groceries_dataset$Date,sep="_")
Groceries_dataset$Member_number <- NULL
Groceries_dataset$Date <- NULL
glimpse(Groceries_dataset)
Rows: 38,765
Columns: 2
$ itemDescription <chr> "tropical fruit", "whole milk", "pip fruit",…
$ transactionId <chr> "1808_21-07-2015", "2552_05-01-2015", "2300_…
Next, the items are grouped by transaction ID using the ddply() function: each group of rows is passed to a function as the dataframe df1, and paste() with collapse = "," combines that group's items into a single row of characters separated by commas. The results are stored in the Groceries_basket dataframe, after which transactionId is dropped and the new column of transactions is renamed.
Groceries_basket <- ddply(Groceries_dataset, c("transactionId"), function(df1) paste(df1$itemDescription, collapse = ","))
Groceries_basket$transactionId<- NULL
colnames(Groceries_basket)<-c("basketTransactions")
head(Groceries_basket)
basketTransactions
1 sausage,whole milk,semi-finished bread,yogurt
2 whole milk,pastry,salty snack
3 canned beer,misc. beverages
4 sausage,hygiene articles
5 soda,pickled vegetables
6 frankfurter,curd
Lastly, the baskets of transactions are exported to an external CSV file for later use in generating association rules. Here, quotes and row names are removed.
write.csv(Groceries_basket,"groceries_basket.csv",quote = FALSE,row.names = FALSE)
Alternatively, the single format can still be applied, where items are not grouped per transaction. Here, each transactionId is paired with its item in a character vector (singleTransactions), from which a new dataframe, Groceries_singles, is created. Eventually, this is written to a CSV file for later use and the intermediate vector is removed.
singleTransactions <- paste(as.character(Groceries_dataset$transactionId),Groceries_dataset$itemDescription,sep=";")
Groceries_singles <- data.frame(singleTransactions)
glimpse(Groceries_singles)
Rows: 38,765
Columns: 1
$ singleTransactions <chr> "1808_21-07-2015;tropical fruit", "2552_0…
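The export step described above could look as follows; a minimal sketch, assuming the file is saved as groceries_single.csv (the name used when the transactions are read back in later).
# Write the single-format transactions to CSV, dropping quotes and row names
write.csv(Groceries_singles, "groceries_single.csv", quote = FALSE, row.names = FALSE)
# Remove the intermediate character vector
rm(singleTransactions)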
To generate a chart of the best sellers, the dataset is grouped by item name, the number of purchases per item is counted, and the results are arranged in descending order. Note that %>% is the pipe operator, which passes the value on its left into the next function or expression.
Best_sellers <- Groceries_dataset %>% group_by(itemDescription) %>% summarise(count=n()) %>% arrange(desc(count))
Best_sellers <- head(Best_sellers, n=15)
Best_sellers
# A tibble: 15 × 2
itemDescription count
<chr> <int>
1 whole milk 2502
2 other vegetables 1898
3 rolls/buns 1716
4 soda 1514
5 yogurt 1334
6 root vegetables 1071
7 tropical fruit 1032
8 bottled water 933
9 sausage 924
10 citrus fruit 812
11 pastry 785
12 pip fruit 744
13 shopping bags 731
14 canned beer 717
15 bottled beer 687
The top 3 sellers are whole milk, other vegetables and rolls/buns. These items will be visualised as follows; more information on the ggplot() functions used can be found here.
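A bar chart of the 15 best sellers could be generated with ggplot2 along these lines; this is a sketch of one possible chart, not necessarily the exact figure from the original analysis.
# Horizontal bar chart of the top 15 best-selling items, ordered by purchase count
ggplot(Best_sellers, aes(x = reorder(itemDescription, count), y = count)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Item", y = "Number of purchases", title = "Top 15 Best Sellers")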
Before the rules are generated, the set of transactions is first prepared through read.transactions(), which can read sets in either basket or single format.
# Basket transactions - the separator between items per transaction is ','
# Transactions <- read.transactions("groceries_basket.csv", format = 'basket',sep = ",")
# Single transactions - the separator between the transactionId and item is ';'; 'cols' specifies the columns holding the transaction ID and the item
Transactions <- read.transactions("groceries_single.csv", format = 'single',sep = ";", cols = c(1,2))
summary(Transactions)
transactions as itemMatrix in sparse format with
14964 rows (elements/itemsets/transactions) and
168 columns (items) and a density of 0.01511803
most frequent items:
whole milk other vegetables rolls/buns soda
2363 1827 1646 1453
yogurt (Other)
1285 29432
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10
206 10013 2726 1273 338 179 113 96 19 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 2.00 2.54 3.00 10.00
includes extended item information - examples:
labels
1 1808_21-07-2015
2 abrasive cleaner
3 artif. sweetener
includes extended transaction information - examples:
transactionID
1 1000_15-03-2015
2 1000_24-06-2014
3 1000_24-07-2015
This returns more information on the transactions, i.e.:
- the number of transactions (14964) and number of items (168)
- the most frequently purchased items, which align with the chart generated above
- the number of items per transaction, which ranges from 1 to 10; most transactions contained 2 or 3 items
To generate the rules, the Apriori algorithm is run with a support of 0.001 and a confidence of 10%. The minimum and maximum number of items per rule are set to 2 and 5 respectively; the minimum of 2 avoids rules with an empty antecedent being generated. The rules are then sorted by confidence in descending order.
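The call producing the output below follows the form reported later in the mining info of summary(Rules):
Rules <- apriori(Transactions, parameter = list(supp = 0.001, conf = 0.1, minlen = 2, maxlen = 5))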
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 5 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [131 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Rules <- sort(Rules, by='confidence', decreasing = TRUE)
options(digits=2) # display numbers to 2 significant digits
summary(Rules)
set of 131 rules
rule length distribution (lhs + rhs):sizes
2 3
114 17
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.13 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.0010 Min. :0.100 Min. :0.005 Min. :0.65
1st Qu.:0.0013 1st Qu.:0.110 1st Qu.:0.010 1st Qu.:0.81
Median :0.0019 Median :0.122 Median :0.017 Median :0.88
Mean :0.0029 Mean :0.126 Mean :0.024 Mean :0.95
3rd Qu.:0.0038 3rd Qu.:0.135 3rd Qu.:0.032 3rd Qu.:1.03
Max. :0.0148 Max. :0.256 Max. :0.122 Max. :2.18
count
Min. : 15
1st Qu.: 20
Median : 29
Mean : 44
3rd Qu.: 56
Max. :222
mining info:
data ntransactions support confidence
Transactions 14964 0.001 0.1
call
apriori(data = Transactions, parameter = list(supp = 0.001, conf = 0.1, minlen = 2, maxlen = 5))
inspect(Rules[1:10])
lhs rhs support confidence
[1] {sausage, yogurt} => {whole milk} 0.0015 0.26
[2] {rolls/buns, sausage} => {whole milk} 0.0011 0.21
[3] {sausage, soda} => {whole milk} 0.0011 0.18
[4] {semi-finished bread} => {whole milk} 0.0017 0.18
[5] {rolls/buns, yogurt} => {whole milk} 0.0013 0.17
[6] {sausage, whole milk} => {yogurt} 0.0015 0.16
[7] {detergent} => {whole milk} 0.0014 0.16
[8] {ham} => {whole milk} 0.0027 0.16
[9] {bottled beer} => {whole milk} 0.0072 0.16
[10] {frozen fish} => {whole milk} 0.0011 0.16
coverage lift count
[1] 0.0057 1.62 22
[2] 0.0053 1.35 17
[3] 0.0059 1.14 16
[4] 0.0095 1.11 25
[5] 0.0078 1.08 20
[6] 0.0090 1.91 22
[7] 0.0086 1.03 21
[8] 0.0171 1.01 41
[9] 0.0453 1.00 107
[10] 0.0068 0.99 16
131 rules were generated in total, each containing either 2 or 3 items across both sides (lhs + rhs). A summary of quality measures is also provided; find more information about them here. Looking at the top 10 rules, an analysis can be made using the confidence values, e.g. rule [1] has a confidence of 0.26, meaning that roughly 26% of the transactions containing sausage and yogurt also contained whole milk.
Lastly, rules regarding a particular item of interest, e.g. rolls/buns, can be generated to find the products most frequently purchased together with it. Here, the items on the lhs are the ones regularly found in a basket together with rolls/buns (the rhs). This information can be used by retailers for targeted promotions/discounts and product bundling.
Rolls_rules <- apriori(Transactions, parameter = list(supp=0.001, conf=0.1), minlen=2, appearance = list(default="lhs",rhs="rolls/buns"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [17 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
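The top rules for rolls/buns, sorted by confidence, can then be inspected; the exact call is not shown above, but a sketch such as the following would list them.
# Sort the rolls/buns rules by confidence and view the strongest ones
inspect(head(sort(Rolls_rules, by = "confidence"), n = 6))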
lhs rhs support confidence
[1] {processed cheese} => {rolls/buns} 0.0015 0.14
[2] {packaged fruit/vegetables} => {rolls/buns} 0.0012 0.14
[3] {seasonal products} => {rolls/buns} 0.0010 0.14
[4] {red/blush wine} => {rolls/buns} 0.0013 0.13
[5] {sausage, whole milk} => {rolls/buns} 0.0011 0.13
[6] {whole milk, yogurt} => {rolls/buns} 0.0013 0.12
coverage lift count
[1] 0.0102 1.3 22
[2] 0.0085 1.3 18
[3] 0.0071 1.3 15
[4] 0.0105 1.2 20
[5] 0.0090 1.2 17
[6] 0.0112 1.1 20
This can also be used to answer questions like: customers who bought rolls/buns also bought…. We simply switch the order of purchasing, i.e., rolls/buns moves from the rhs to the lhs.
Rolls_rules <- apriori(Transactions, parameter = list(supp=0.001, conf=0.1), minlen=2, appearance = list(lhs="rolls/buns",default="rhs"))
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support
0.1 0.1 1 none FALSE TRUE 5 0.001
minlen maxlen target ext
2 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 14
set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
sorting and recoding items ... [149 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [1 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
lhs rhs support confidence coverage lift
[1] {rolls/buns} => {whole milk} 0.014 0.13 0.11 0.8
count
[1] 209
Lastly, an interactive graph can be used to illustrate the top 10 rules.
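One way to draw such an interactive graph is with arulesViz, using the graph method and the htmlwidget engine; a minimal sketch:
# Interactive network graph of the top 10 rules (rendered as an HTML widget)
plot(Rules[1:10], method = "graph", engine = "htmlwidget")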