R Basket Analysis using arules package with unique order number but duplicate order combinations

Submitted by 你说的曾经没有我的故事 on 2019-12-31 16:47:53

Question


Just learning R. I'm trying to do a basket analysis using the arules package (but I'm totally open to any other package suggestions!) to compare all possible combinations of 6 different item types being purchased.

My original data set looked like this:

OrderNo, ItemType, ItemCount  
111, Health, 1  
111, Leisure, 2  
111, Sports, 1  
222, Health, 3      
333, Food, 7  
333, Clothing, 1  
444, Clothing, 2  
444, Health, 1  
444, Accessories, 2  

. . .

The list goes on and has about 3,000 observations.

I collapsed the data into a matrix that contains one row for each unique order containing counts of specific ItemType:

 OrderNo, Accessories, Clothing, Food, Health, Leisure, Sports  
 111, 0, 0, 0, 1, 2, 1  
 222, 0, 0, 0, 3, 0, 0  
 333, 0, 1, 7, 0, 0, 0  
 444, 2, 2, 0, 1, 0, 0  
 . . .
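
The collapsing step described above can be sketched in base R. This is only an illustration under assumptions: the post does not say how the matrix was built, so the data frame name `orders` and the use of `xtabs()` are hypothetical choices, and the rows are just the sample shown in the question.

```r
# Hypothetical reconstruction of the sample rows shown in the question.
orders <- data.frame(
  OrderNo   = c(111, 111, 111, 222, 333, 333, 444, 444, 444),
  ItemType  = c("Health", "Leisure", "Sports", "Health", "Food",
                "Clothing", "Clothing", "Health", "Accessories"),
  ItemCount = c(1, 2, 1, 3, 7, 1, 2, 1, 2)
)

# Sum ItemCount for every OrderNo x ItemType pair; pairs that never
# occur in the data show up as 0.
counts <- xtabs(ItemCount ~ OrderNo + ItemType, data = orders)
print(counts)
```

Each row of `counts` is one order and each column one item type, which matches the layout of the collapsed file above.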

Every time I try to read in the transactions using the following command (and a million attempted variations of it):

tr <- read.transactions("dataset.csv", rm.duplicates=FALSE, format="basket", sep=",")

I get the error message: Error in asMethod(object): can not coerce list with transactions with duplicated items.

I'm assuming this is because I have 3,000 observations and inevitably certain combinations are going to show up more than once (i.e., more than one person is purchasing only one piece of Clothing and nothing else: OrderNo, 0, 1, 0, 0, 0, 0). I know I could collapse the data set on counts of unique combinations, but I'm worried that if I do that, there will be no weights to show the most frequent combinations.

I thought that using format="basket" would account for different orders containing the same item combinations, but apparently that's not the case. I'm so lost. All the documentation I've read implies that this is possible but I can't find any examples or advice on how to approach the problem.

Any advice would be so appreciated! My head is spinning on this one.

Extra info: For my end result, I'm looking to get the top five most significant combinations of purchase combinations. I don't know if that helps.


Answer 1:


You must remove the duplicates. If you are using a .csv file, run Data -> Remove Duplicates in Excel before processing it. arules throws an error when it finds duplicated items, and that is why you are getting this error.

Another way is to use duplicated() on your itemset to find the duplicates and unique() to remove them.
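
A minimal base R sketch of the duplicated()/unique() idea follows. It assumes that each line of the collapsed file is treated as one basket whose comma-separated fields are the items, which is how format="basket" reads a file. The variable names are hypothetical.

```r
# One row of the collapsed count matrix, as it would appear in dataset.csv.
line <- "111, 0, 0, 0, 1, 2, 1"

# Split on commas and trim whitespace to get the fields of this row.
items <- trimws(strsplit(line, ",")[[1]])

# duplicated() flags every repeat of an earlier value; here the counts
# "0" and "1" occur more than once within the same row.
print(items[duplicated(items)])

# unique() keeps only the first occurrence of each value.
deduped <- unique(items)
print(deduped)
```

read.transactions() also has an rm.duplicates argument, and setting it to TRUE drops repeated items within a transaction on import. Either way, the fields left in this row are still counts rather than item names, so de-duplication on its own may not give a meaningful basket.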

Or, for a simpler approach, see this SO post:

Association analysis with duplicate transactions using arules package in R




Answer 2:


OK, after hours of searching and reading all the PDFs I could find, I finally found the answer (and the most helpful walkthrough of apriori/basket analysis ever!) in the Data Mining Desktop Survival Guide by Graham Williams:

The read.transactions function can also read data from a file with transaction ID and a single item per line (using the format="single" option).

So there was no need to do all those transformations. I should have just imported straight from the original csv file, specifying the "single" format option instead of "basket". I also had to make sure the file contained no column names and that each pairing of order number and item type appeared only once (for instance, if a person ordered two items from the "Grocery" category, this needs to be represented on one row). For the "single" format, cols gives the column holding the transaction IDs first and the column holding the items second, so cols=c(1,2) indicates that column 1 contains the order number and column 2 contains the item (ItemType).

tr <- read.transactions(file='dataset.csv', format='single', sep=',', cols=c(1,2))
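
As a hedged end-to-end sketch of the "single" format, the snippet below writes the question's sample rows to a temporary file and reads them back. The file contents, the support threshold, and the use of eclat() to list frequent combinations are assumptions made for illustration, not part of the original answer. Per the arules documentation, cols for the "single" format names the transaction ID column first and the item column second, so a file laid out as OrderNo,ItemType uses cols = c(1, 2).

```r
library(arules)

# Write the sample rows to a temporary file: no header line, and one
# (OrderNo, ItemType) pair per line with each pair appearing only once.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "111,Health", "111,Leisure", "111,Sports",
  "222,Health",
  "333,Food", "333,Clothing",
  "444,Clothing", "444,Health", "444,Accessories"
), path)

# cols = c(1, 2): transaction IDs in column 1, items in column 2.
tr <- read.transactions(path, format = "single", sep = ",", cols = c(1, 2))
print(summary(tr))

# Number of orders that contain each item type.
print(itemFrequency(tr, type = "absolute"))

# Frequent combinations of at least two item types, highest support first.
# The support value is arbitrary and only suits this tiny sample.
sets <- eclat(tr, parameter = list(support = 0.25, minlen = 2))
inspect(head(sort(sets, by = "support"), n = 5))
```

Because every order is its own transaction, two orders with the same combination of item types simply raise that combination's support, so identical baskets across orders do not need to be removed.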


Source: https://stackoverflow.com/questions/16515370/r-basket-analysis-using-arules-package-with-unique-order-number-but-duplicate-or
