Iowa Liquor Sales Recommender System in Spark

Project objective

Build recommender systems for product cross-selling opportunity

Data Overview:

Iowa Liquor Sales Dataset(link)

The dataset contains alcoholic purchases by liquor stores registered with the Iowa Department of Commerce that were logged in the Commerce department system from January 1, 2012 to current.

  • Size of Dataset: 5GB
  • Start Date: 2012-01-01
  • Update Frequency: Monthly
  • Row Count: 18,532,158
  • Variables: Invoice ID, Store Name, Address, County, Item ID, Sales(Dollars), Volume Sold (Liters), Sales (Gallons) etc.

Technique and Library

Method : Collaborative Filtering

PySpark MLlib: Alternating Least Squares (ALS) algorithm

Recommender Systems built

  1. ALS with Explicit preferences (custom rating)
  2. ALS with Implicit feedback (percentage of total order cost)
  3. Fine tune ALS-Implicit feedback with Cross-validation

Import libraries and import data

In [1]:
spark
Out[1]:

SparkSession - hive

SparkContext

Spark UI

Version
v2.3.4
Master
yarn
AppName
PySparkShell
In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp, month, year, date_format, date_sub, to_date, date_add, lag,  dayofweek, dayofmonth, dayofyear, dayofweek
import pandas as pd
import numpy as np
from pyspark.sql.functions import approxCountDistinct
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql.types import DateType, IntegerType,NumericType
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import Bucketizer
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder,CrossValidatorModel

import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
spark = SparkSession.builder.appName('LiquorSales').getOrCreate()
In [4]:
df = spark.read.csv('gs://bigdata2_data/liquor_sales', inferSchema=True, header=True)
In [5]:
df.count()
Out[5]:
18532158
In [6]:
df.dtypes
Out[6]:
[('Invoice/Item Number', 'string'),
 ('Date', 'string'),
 ('Store Number', 'int'),
 ('Store Name', 'string'),
 ('Address', 'string'),
 ('City', 'string'),
 ('Zip Code', 'string'),
 ('Store Location', 'string'),
 ('County Number', 'int'),
 ('County', 'string'),
 ('Category', 'int'),
 ('Category Name', 'string'),
 ('Vendor Number', 'int'),
 ('Vendor Name', 'string'),
 ('Item Number', 'string'),
 ('Item Description', 'string'),
 ('Pack', 'int'),
 ('Bottle Volume (ml)', 'int'),
 ('State Bottle Cost', 'double'),
 ('State Bottle Retail', 'double'),
 ('Bottles Sold', 'int'),
 ('Sale (Dollars)', 'double'),
 ('Volume Sold (Liters)', 'double'),
 ('Volume Sold (Gallons)', 'double')]
In [7]:
df.printSchema()
root
 |-- Invoice/Item Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Store Number: integer (nullable = true)
 |-- Store Name: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Zip Code: string (nullable = true)
 |-- Store Location: string (nullable = true)
 |-- County Number: integer (nullable = true)
 |-- County: string (nullable = true)
 |-- Category: integer (nullable = true)
 |-- Category Name: string (nullable = true)
 |-- Vendor Number: integer (nullable = true)
 |-- Vendor Name: string (nullable = true)
 |-- Item Number: string (nullable = true)
 |-- Item Description: string (nullable = true)
 |-- Pack: integer (nullable = true)
 |-- Bottle Volume (ml): integer (nullable = true)
 |-- State Bottle Cost: double (nullable = true)
 |-- State Bottle Retail: double (nullable = true)
 |-- Bottles Sold: integer (nullable = true)
 |-- Sale (Dollars): double (nullable = true)
 |-- Volume Sold (Liters): double (nullable = true)
 |-- Volume Sold (Gallons): double (nullable = true)

In [8]:
df.show(3)
+-------------------+----------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+-----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+
|Invoice/Item Number|      Date|Store Number|          Store Name|            Address|        City|Zip Code|      Store Location|County Number|    County|Category|    Category Name|Vendor Number|         Vendor Name|Item Number|    Item Description|Pack|Bottle Volume (ml)|State Bottle Cost|State Bottle Retail|Bottles Sold|Sale (Dollars)|Volume Sold (Liters)|Volume Sold (Gallons)|
+-------------------+----------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+-----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+
|    INV-05539600037|06/15/2017|        2544|Hy-Vee Food Store...|   802 South Center|Marshalltown|   50158|POINT (-92.912817...|           64|  MARSHALL| 1011100| Blended Whiskies|          434|           LUXCO INC|      23277|Calvert Extra Bl ...|  12|              1000|             6.88|              10.32|          12|        123.84|                12.0|                 3.17|
|    INV-05532600011|06/15/2017|        4988|Happy's Wine & Sp...|5925 University Ave| Cedar Falls|   50613|POINT (-92.429331...|            7|BLACK HAWK| 1081600|  Whiskey Liqueur|          421|SAZERAC COMPANY  INC|      64866|Fireball Cinnamon...|  12|               750|              9.0|               13.5|          12|        161.64|                 9.0|                 2.38|
|    INV-05494700021|06/13/2017|        4265|Kwik Stop 3 / Wat...| 1104 Washington St|    Waterloo|   50702|POINT (-92.336965...|            7|BLACK HAWK| 1051100|American Brandies|          115|CONSTELLATION BRA...|      52199|Paul Masson Grand...|  48|               100|              1.0|                1.5|          48|          72.0|                 4.8|                 1.27|
+-------------------+----------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+-----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+
only showing top 3 rows

Let's look at how many liquor stores in Iowa

In [9]:
df.select(["Store Number","Store Name"]).dropDuplicates().show(5)
+------------+--------------------+
|Store Number|          Store Name|
+------------+--------------------+
|        4405|KUM & GO #92 / AN...|
|        4726|Walgreens #07454 ...|
|        3056|Clarion Super Val...|
|        9002|Mississippi River...|
|        4909|Pump N Pak Rock V...|
+------------+--------------------+
only showing top 5 rows

In [10]:
df.select("Store Name").distinct().count()
Out[10]:
2528
In [11]:
df.select("Store Number").distinct().count()
Out[11]:
2379

The number of unique store number and store name are not equal, indicating there are might be some typos in the store names or name changed for some stores. In the following analysis, we are using store number as the unique identifier. And we will use a lookup table, Lowa_liquor_stores (sourced from City of Lowa website) to look up store names.

Data preprocessing

  • Feature engineering
  • Data Cleaning

Extract date, month year from date

In [6]:
df=df.withColumn('Date',
                  unix_timestamp(col('Date'), "MM/dd/yyyy").cast("timestamp"))
# add the month column 
df = df.withColumn('month', month(df['Date']))
# add the year column 
df = df.withColumn('year', year(df['Date']))
# add a week_day column 
df = df.withColumn("week_day", dayofweek(df['Date']))\
.withColumn("DayofMonth",dayofmonth(df['Date']))\
.withColumn("DayofYear", dayofyear(df['Date']))
In [7]:
# add a column of 1's
df= df.withColumn("Number of Sale",lit(1))

Normalize cleaned liquor type

This step to normalize text data for liquor category and store name

In [8]:
# read in updated categories file
liquor_categories = spark.read.csv('gs://bigdata2_data/liquor_categories.csv', header=True)
In [9]:
liquor_categories.show()
+--------------------+--------------------+-----------+
|       Category Name| Category Name Clean|Liquor Type|
+--------------------+--------------------+-----------+
|     100 PROOF VODKA|     100 Proof Vodka|      Vodka|
|  100% Agave Tequila|  100% Agave Tequila|    Tequila|
|       Aged Dark Rum|       Aged Dark Rum|        Rum|
| AMARETTO - IMPORTED| Amaretto - Imported|     Spirit|
|    AMERICAN ALCOHOL|    American Alcohol|      Other|
|   AMERICAN AMARETTO|   Imported Amaretto|     Spirit|
|   American Brandies|     American Brandy|     Brandy|
|  AMERICAN COCKTAILS|   American Cocktail|      Other|
|American Cordials...|American Cordials...|     Spirit|
|American Cordials...|American Cordials...|     Spirit|
|American Distille...|American Distille...|     Spirit|
|American Distille...|American Distille...|     Spirit|
|   AMERICAN DRY GINS|    American Dry Gin|        Gin|
|   American Dry Gins|    American Dry Gin|        Gin|
|American Flavored...|American Flavored...|      Vodka|
|AMERICAN GRAPE BR...|American Grape Br...|     Brandy|
|   American Schnapps|   American Schnapps|     Spirit|
|  American Sloe Gins|   American Sloe Gin|        Gin|
|  AMERICAN SLOE GINS|   American Sloe Gin|        Gin|
|      American Vodka|      American Vodka|      Vodka|
+--------------------+--------------------+-----------+
only showing top 20 rows

In [10]:
# join categories to sales data
df = df.join(liquor_categories,df['Category Name'] == liquor_categories['Category Name'],how='left')

Data clearning: liquor store

The names of liquor store is not consistent within the dataset. We cleaned up the liquor store names to match with liquor store ID

In [11]:
# read in liquor store lookup tables
liquor_store_clean = spark.read.csv('gs://bigdata2_data/Iowa_Liquor_Stores.csv', header=True)
In [12]:
liquor_store_clean.show(5)
+-----+--------------------+------------+--------------------+------------+-----+--------+--------------------+-----------+
|Store|                Name|Store Status|             Address|        City|State|Zip Code|       Store Address|Report Date|
+-----+--------------------+------------+--------------------+------------+-----+--------+--------------------+-----------+
| 2538|Hy-Vee Food Store...|           A|    1422 Flammang Dr|    Waterloo|   IA|   50702|POINT (-92.327917...| 04/01/2020|
| 3968|       Monroe Liquor|           I|105 W Sherman Po ...|      Monroe|   IA|   50170|                null| 04/01/2020|
| 4176|     Todds On The Go|           I|  235 Edgewood Rd NE|Cedar Rapids|   IA|   52405|                null| 04/01/2020|
| 4410|Kum & Go #203 / P...|           I|         1219 1st St|       Perry|   IA|   50220|                null| 04/01/2020|
| 4502|Neighborhood Mart...|           I|  2100, Lafayette St|    Waterloo|   IA|   50703|POINT (-92.313972...| 04/01/2020|
+-----+--------------------+------------+--------------------+------------+-----+--------+--------------------+-----------+
only showing top 5 rows

In [13]:
liquor_store_clean.printSchema()
root
 |-- Store: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Store Status: string (nullable = true)
 |-- Address: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zip Code: string (nullable = true)
 |-- Store Address: string (nullable = true)
 |-- Report Date: string (nullable = true)

In [14]:
# Update Store column to integertype for later modeling purpose
liquor_store_clean = liquor_store_clean.withColumn("Store", liquor_store_clean["Store"].cast(IntegerType()))
In [15]:
# join categories to sales data
df = df.join(liquor_store_clean,df['Store Number'] == liquor_store_clean['Store'],how='left')
In [20]:
df.columns
Out[20]:
['Invoice/Item Number',
 'Date',
 'Store Number',
 'Store Name',
 'Address',
 'City',
 'Zip Code',
 'Store Location',
 'County Number',
 'County',
 'Category',
 'Category Name',
 'Vendor Number',
 'Vendor Name',
 'Item Number',
 'Item Description',
 'Pack',
 'Bottle Volume (ml)',
 'State Bottle Cost',
 'State Bottle Retail',
 'Bottles Sold',
 'Sale (Dollars)',
 'Volume Sold (Liters)',
 'Volume Sold (Gallons)',
 'month',
 'year',
 'week_day',
 'DayofMonth',
 'DayofYear',
 'Number of Sale',
 'Category Name',
 'Category Name Clean',
 'Liquor Type',
 'Store',
 'Name',
 'Store Status',
 'Address',
 'City',
 'State',
 'Zip Code',
 'Store Address',
 'Report Date']
In [21]:
df.show(2)
+-------------------+-------------------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+-----+----+--------+----------+---------+--------------+----------------+-------------------+-----------+-----+--------------------+------------+-------------------+------------+-----+--------+--------------------+-----------+
|Invoice/Item Number|               Date|Store Number|          Store Name|            Address|        City|Zip Code|      Store Location|County Number|    County|Category|   Category Name|Vendor Number|         Vendor Name|Item Number|    Item Description|Pack|Bottle Volume (ml)|State Bottle Cost|State Bottle Retail|Bottles Sold|Sale (Dollars)|Volume Sold (Liters)|Volume Sold (Gallons)|month|year|week_day|DayofMonth|DayofYear|Number of Sale|   Category Name|Category Name Clean|Liquor Type|Store|                Name|Store Status|            Address|        City|State|Zip Code|       Store Address|Report Date|
+-------------------+-------------------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+-----+----+--------+----------+---------+--------------+----------------+-------------------+-----------+-----+--------------------+------------+-------------------+------------+-----+--------+--------------------+-----------+
|    INV-05539600037|2017-06-15 00:00:00|        2544|Hy-Vee Food Store...|   802 South Center|Marshalltown|   50158|POINT (-92.912817...|           64|  MARSHALL| 1011100|Blended Whiskies|          434|           LUXCO INC|      23277|Calvert Extra Bl ...|  12|              1000|             6.88|              10.32|          12|        123.84|                12.0|                 3.17|    6|2017|       5|        15|      166|             1|Blended Whiskies|    Blended Whiskey|    Whiskey| 2544|Hy-Vee Food Store...|           A|   802 South Center|Marshalltown|   IA|   50158|POINT (-92.912817...| 04/01/2020|
|    INV-05532600011|2017-06-15 00:00:00|        4988|Happy's Wine & Sp...|5925 University Ave| Cedar Falls|   50613|POINT (-92.429331...|            7|BLACK HAWK| 1081600| Whiskey Liqueur|          421|SAZERAC COMPANY  INC|      64866|Fireball Cinnamon...|  12|               750|              9.0|               13.5|          12|        161.64|                 9.0|                 2.38|    6|2017|       5|        15|      166|             1| Whiskey Liqueur|    Whiskey Liqueur|     Spirit| 4988|Happy's Wine & Sp...|           A|5925 University Ave| Cedar Falls|   IA|   50613|POINT (-92.429331...| 04/01/2020|
+-------------------+-------------------+------------+--------------------+-------------------+------------+--------+--------------------+-------------+----------+--------+----------------+-------------+--------------------+-----------+--------------------+----+------------------+-----------------+-------------------+------------+--------------+--------------------+---------------------+-----+----+--------+----------+---------+--------------+----------------+-------------------+-----------+-----+--------------------+------------+-------------------+------------+-----+--------+--------------------+-----------+
only showing top 2 rows

Infer Ratings of liquor product by each liquor Store

As we need a rating feature for the ALS model, we are going to use the sales(in dollars) for a liquor product as a percentage of total sales for that store to infer the rating of product by the store. The more a liquor store order a specific product, the more likely a store like that product. We added a calculated column Perc_of_store_total_salewe calculate the percentage of sales from that liquor item out of the total sales by the store

In [16]:
# Calculate the most popular liqour item for each store
sale_by_store_item = df.groupBy("Store", "Item Number").sum("Sale (Dollars)")
In [17]:
# calculate total sales in dallors for each store
sales_by_store = df.groupBy("Store").sum("Sale (Dollars)")
In [18]:
# rename columns
sales_by_store = sales_by_store.withColumnRenamed("sum(Sale (Dollars))", "store_total_sale_dollars")
In [19]:
# Rename columns
sales_by_store = sales_by_store.withColumnRenamed("Store", "store_num")
In [20]:
# join two tables
sale_by_store_by_item = sale_by_store_item.alias("a").join(sales_by_store\
                     .alias("b"),sale_by_store_item['Store'] == sales_by_store['store_num'],how='left')

Here we created a table to group sales in dollars for each store and further for each items. We also calulated the total sales in dollars for all transactions of each store. Then we left-joined table 1 by table 2.

  • sum(Sale (Dollars)): Total cost of ordering the specific liquor product(Item Number*) for that store (store_num)
  • store_total_sale_dollars : Total cost of liquor orders for all past transactions store_num

*The sales in dollars means the cost of ordering that liquor products liquor store

In [21]:
sale_by_store_by_item.show(5)
+-----+-----------+-------------------+---------+------------------------+
|Store|Item Number|sum(Sale (Dollars))|store_num|store_total_sale_dollars|
+-----+-----------+-------------------+---------+------------------------+
| 2659|      34433|  4931.760000000001|     2659|               131108.44|
| 2659|      75214|              450.0|     2659|               131108.44|
| 2659|      27292|             138.06|     2659|               131108.44|
| 2659|      41989|  618.6400000000001|     2659|               131108.44|
| 2659|      88540|             236.16|     2659|               131108.44|
+-----+-----------+-------------------+---------+------------------------+
only showing top 5 rows

Perc_of_store_total_sale : percentage of sales generated from each liquor item (Item Number) for that specific store(store_num). We will use this column as the "rating" by a store towards a liquor item.

In [22]:
sale_by_store_by_item = sale_by_store_by_item.withColumn("perc_of_store_total_sale", 
                                                         sale_by_store_by_item["sum(Sale (Dollars))"]/ sale_by_store_by_item["store_total_sale_dollars"] )
In [23]:
sale_by_store_by_item.show(2)
+-----+-----------+-------------------+---------+------------------------+------------------------+
|Store|Item Number|sum(Sale (Dollars))|store_num|store_total_sale_dollars|perc_of_store_total_sale|
+-----+-----------+-------------------+---------+------------------------+------------------------+
| 2659|      57148| 1331.9700000000003|     2659|      131108.44000000003|    0.010159300194556506|
| 2659|      34006|            2400.99|     2659|      131108.44000000003|     0.01831300868197348|
+-----+-----------+-------------------+---------+------------------------+------------------------+
only showing top 2 rows

In [24]:
# For example, here are the top 3 liquor items in sales for store_num 2659 
sale_by_store_by_item.filter(sale_by_store_by_item.store_num == 2659).sort('perc_of_store_total_sale',ascending=False)\
.show(3)
+-----+-----------+-------------------+---------+------------------------+------------------------+
|Store|Item Number|sum(Sale (Dollars))|store_num|store_total_sale_dollars|perc_of_store_total_sale|
+-----+-----------+-------------------+---------+------------------------+------------------------+
| 2659|      11776|  6132.839999999998|     2659|               131108.44|    0.046776851284326154|
| 2659|      34433|            4931.76|     2659|               131108.44|     0.03761588498802976|
| 2659|      11788|            4882.56|     2659|               131108.44|     0.03724062310557581|
+-----+-----------+-------------------+---------+------------------------+------------------------+
only showing top 3 rows

In [25]:
from pyspark.sql.functions import *
from pyspark.sql.window import Window

#For each store, we can rank liquor items from most popular to least based on sales in dolloars 
ranked =  sale_by_store_by_item.withColumn("rank", dense_rank().over(Window.partitionBy("store_num").orderBy(desc("perc_of_store_total_sale"))))
In [26]:
ranked.show(5)
+-----+-----------+-------------------+---------+------------------------+------------------------+----+
|Store|Item Number|sum(Sale (Dollars))|store_num|store_total_sale_dollars|perc_of_store_total_sale|rank|
+-----+-----------+-------------------+---------+------------------------+------------------------+----+
| 2659|      11776|            6132.84|     2659|               131108.44|     0.04677685128432617|   1|
| 2659|      34433|            4931.76|     2659|               131108.44|     0.03761588498802976|   2|
| 2659|      11788|            4882.56|     2659|               131108.44|     0.03724062310557581|   3|
| 2659|      25608|  4727.400000000001|     2659|               131108.44|     0.03605717526651984|   4|
| 2659|      37998|             4516.2|     2659|               131108.44|     0.03444629499061998|   5|
+-----+-----------+-------------------+---------+------------------------+------------------------+----+
only showing top 5 rows

Given that there is no rating data within this database, we created a customized rating for each liquor product by each store based on perc_of_store_total_sale

In [27]:
store_item_rated = QuantileDiscretizer(numBuckets=5, inputCol="perc_of_store_total_sale",outputCol="rating")\
.fit(sale_by_store_by_item).transform(sale_by_store_by_item)
In [28]:
# For example for store 2659, the most highly rated items are the following:

store_item_rated.filter(store_item_rated.Store == 2659).sort('rating',ascending=False)\
.show(2)
+-----+-----------+-------------------+---------+------------------------+------------------------+------+
|Store|Item Number|sum(Sale (Dollars))|store_num|store_total_sale_dollars|perc_of_store_total_sale|rating|
+-----+-----------+-------------------+---------+------------------------+------------------------+------+
| 2659|      34006| 2400.9900000000002|     2659|               131108.44|    0.018313008681973487|   4.0|
| 2659|      57148| 1331.9700000000003|     2659|               131108.44|    0.010159300194556508|   4.0|
+-----+-----------+-------------------+---------+------------------------+------------------------+------+
only showing top 2 rows

To run recommender system, we create a table that contains store_num, item_num and rating on each item by each store

In [29]:
ratings = store_item_rated.select("Store","Item Number", "rating","perc_of_store_total_sale")
In [30]:
ratings.show(5)
+-----+-----------+------+------------------------+
|Store|Item Number|rating|perc_of_store_total_sale|
+-----+-----------+------+------------------------+
| 2659|      34433|   4.0|    0.037615884988029764|
| 2659|      75214|   4.0|    0.003432273315127...|
| 2659|      27292|   3.0|    0.001053021453081129|
| 2659|      89199|   4.0|    0.010245793482097719|
| 2659|      41846|   4.0|    0.006101285317711048|
+-----+-----------+------+------------------------+
only showing top 5 rows

Prepare data for modeling

Convert item number to format that's acceptable by the Recommender ALS function

In [31]:
df = df.withColumn("Item Number", regexp_extract("Item Number", "\\d+", 0))
In [32]:
ratings= ratings.withColumn("Item Number", regexp_extract("Item Number", "\\d+", 0))
In [33]:
ratings = ratings.withColumn("Item Number", ratings["Item Number"].cast(IntegerType()))
In [34]:
ratings = ratings.withColumnRenamed("Item Number", "item_num")
In [35]:
ratings.show(5)
+-----+--------+------+------------------------+
|Store|item_num|rating|perc_of_store_total_sale|
+-----+--------+------+------------------------+
| 2659|   63959|   4.0|    0.003983267591316012|
| 2659|   41846|   4.0|    0.006101285317711047|
| 2659|   89199|   4.0|    0.010245793482097717|
| 2659|   82836|   3.0|    6.486996565591047E-4|
| 2659|   15248|   4.0|    0.005510858034768775|
+-----+--------+------+------------------------+
only showing top 5 rows

In [36]:
#Count if any missing values in the row
ratings.dropna().count()
Out[36]:
991890
In [37]:
ratings.count()
Out[37]:
994839
In [38]:
# drop na rows
ratings = ratings.dropna()
In [39]:
liquor_store_clean = liquor_store_clean.withColumnRenamed("Store", "store_num_b")

Train , Test Split

In [40]:
training, test = ratings.randomSplit([0.8,0.2], seed = 42)

Build the recommendation model using ALS on the training data

In [42]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als = ALS(maxIter=10, regParam=0.01, userCol="Store", itemCol="item_num", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True)
In [43]:
#fit model
model = als.fit(training)
In [44]:
# Apply model on the test set to predict 
predictions = model.transform(test)
In [45]:
#explain parameters of the model
model.explainParams()
Out[45]:
'coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: nan,drop. (default: nan, current: drop)\nitemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: item_num)\npredictionCol: prediction column name (default: prediction)\nuserCol: column name for user ids. Ids must be within the integer value range. (default: user, current: Store)'
In [46]:
#item factors 
model.itemFactors.show(10, truncate = False)
+---+--------------------------------------------------------------------------------------------------------+
|id |features                                                                                                |
+---+--------------------------------------------------------------------------------------------------------+
|160|[0.104079135, 0.0, 0.0, 0.015655806, 0.24764529, 0.42537525, 0.098392375, 0.64675057, 0.0, 0.0023257982]|
|220|[0.0, 0.27672154, 0.0, 0.3154943, 0.36548373, 0.6829337, 0.10906999, 0.7571212, 0.0, 0.0]               |
|250|[0.3054234, 1.304413, 0.5826429, 0.14280456, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]                              |
|430|[0.0, 0.14261767, 0.2418759, 0.0, 0.0, 0.5199911, 0.10521233, 0.84113264, 0.0, 0.0]                     |
|470|[0.21089111, 0.0, 0.19378562, 0.0, 0.048613407, 0.5329675, 0.08425057, 1.0456526, 0.0, 0.51771694]      |
|490|[0.26544386, 0.23861194, 0.690093, 0.15776108, 0.16946802, 0.8162467, 0.0, 0.5008631, 0.0, 0.09647964]  |
|500|[0.0066083563, 0.0, 0.112353936, 0.5946616, 0.0130815385, 0.8685164, 0.0, 0.53497124, 0.0066140117, 0.0]|
|550|[0.0, 0.0, 0.44153577, 0.0, 0.0, 0.0, 0.0, 0.867034, 0.0, 0.23463565]                                   |
|580|[0.0, 0.32676858, 0.75448537, 0.014662421, 0.0, 0.0, 0.236765, 0.64169574, 0.3076353, 0.0]              |
|680|[0.016086703, 0.0, 0.23190835, 0.79327846, 0.0, 0.0, 0.29599348, 1.1050018, 0.0, 0.10177011]            |
+---+--------------------------------------------------------------------------------------------------------+
only showing top 10 rows

See what the model predicts

In [48]:
predictions = predictions.join(liquor_store_clean,liquor_store_clean.store_num_b== predictions.Store)
In [49]:
predictions.show(5)
+-----+--------+------+------------------------+----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale|prediction|store_num_b|                Name|Store Status|           Address|           City|State|Zip Code|       Store Address|Report Date|
+-----+--------+------+------------------------+----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
| 3698|   10623|   0.0|    4.750013577122141E-6|0.95741534|       3698|Ingersoll Wine Me...|           A|     1300  50th St|West Des Moines|   IA|   50266|POINT (-93.771159...| 04/01/2020|
| 4295|   10623|   0.0|     3.65156058143765E-5| 0.6251636|       4295|Nat Food Mart #1 ...|           I|      3804 Hubbell|     Des Moines|   IA|   50317|POINT (-93.541351...| 04/01/2020|
| 4545|   10623|   0.0|    1.892105279021528...|0.70939386|       4545|           Fill R Up|           I|  14151 Wabash Ave| Council Bluffs|   IA|   51503|POINT (-95.817909...| 04/01/2020|
| 2290|   10623|   0.0|    7.363685507009042...|0.21609718|       2290|Al's Liquors Unli...|           I|215 West Milwaukee|     Storm Lake|   IA|   50588|POINT (-95.202405...| 04/01/2020|
| 4077|   10623|   0.0|    1.870078847527492...| 0.8536081|       4077|  Xo Food And Liquor|           A|   428 Franklin St|       Waterloo|   IA|   50703|POINT (-92.334641...| 04/01/2020|
+-----+--------+------+------------------------+----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
only showing top 5 rows

In [50]:
predictions = predictions.na.drop()
predictions.show(10, truncate = False)
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale|prediction|store_num_b|Name                                      |Store Status|Address              |City           |State|Zip Code|Store Address                       |Report Date|
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|3698 |10623   |0.0   |4.750013577122141E-6    |0.95741534|3698       |Ingersoll Wine Merchants                  |A           |1300  50th St        |West Des Moines|IA   |50266   |POINT (-93.771159 41.590871)        |04/01/2020 |
|4295 |10623   |0.0   |3.65156058143765E-5     |0.6251636 |4295       |Nat Food Mart #1 / Hubbell                |I           |3804 Hubbell         |Des Moines     |IA   |50317   |POINT (-93.541351 41.630849)        |04/01/2020 |
|4545 |10623   |0.0   |1.8921052790215288E-5   |0.70939386|4545       |Fill R Up                                 |I           |14151 Wabash Ave     |Council Bluffs |IA   |51503   |POINT (-95.81790900000001 41.220431)|04/01/2020 |
|2290 |10623   |0.0   |7.3636855070090425E-6   |0.21609718|2290       |Al's Liquors Unlimited                    |I           |215 West Milwaukee   |Storm Lake     |IA   |50588   |POINT (-95.202405 42.647492)        |04/01/2020 |
|4077 |10623   |0.0   |1.8700788475274923E-5   |0.8536081 |4077       |Xo Food And Liquor                        |A           |428 Franklin St      |Waterloo       |IA   |50703   |POINT (-92.33464100000002 42.501602)|04/01/2020 |
|3917 |10623   |0.0   |2.5384333237112743E-6   |0.89227206|3917       |Smokin' Joe's #2 Tobacco and Liquor Outlet|A           |1606 W Locust St     |Davenport      |IA   |52804   |POINT (-90.599037 41.538254)        |04/01/2020 |
|5487 |10623   |1.0   |9.288992031104113E-5    |1.8399215 |5487       |Independence Liquor & Food                |I           |1761 Independence Ave|Waterloo       |IA   |50707   |POINT (-92.302514 42.498404)        |04/01/2020 |
|4046 |10623   |0.0   |1.7309503054682417E-5   |0.95853496|4046       |J and K Market                            |I           |113 W Vanburen       |Centerville    |IA   |52544   |POINT (-92.874795 40.734946)        |04/01/2020 |
|4452 |15727   |0.0   |2.8354200819844552E-5   |0.8921843 |4452       |Select Mart    Gordon Dr                  |A           |2825 Gordon Dr       |Sioux City     |IA   |51105   |POINT (-96.372014 42.489632)        |04/01/2020 |
|5001 |15727   |0.0   |1.9929193655094986E-5   |0.54696256|5001       |The Music Station / Independence          |A           |709 First St W       |Independence   |IA   |50644   |POINT (-91.901687 42.468633)        |04/01/2020 |
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
only showing top 10 rows

Prediction Performance

In [51]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating')
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
Root-mean-square error = 0.765679137334

Generate top 10 liquor product for each store

The ALS function recommend these liquor items to liquor store to purchased based on our customed ratings

In [52]:
userRecs = model.recommendForAllUsers(10)
In [53]:
userRecs.show(10, truncate=False)
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Store|recommendations                                                                                                                                                                                                  |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4900 |[[926875, 13.443826], [965269, 7.6748443], [902349, 7.668518], [902491, 7.6209292], [903722, 7.3122754], [904577, 7.125351], [969718, 6.8338003], [967275, 6.8059115], [965266, 6.701254], [940055, 5.936145]]   |
|5300 |[[926875, 18.143875], [903722, 9.45404], [965266, 8.400826], [967275, 8.382714], [922040, 8.2596855], [903929, 8.133111], [904827, 7.9986653], [969718, 7.978921], [903881, 7.692285], [902342, 7.2973957]]      |
|5803 |[[926875, 19.798565], [904827, 16.153008], [956366, 12.742643], [903590, 11.77609], [901140, 10.780106], [900750, 10.780106], [901290, 10.780106], [901150, 10.780106], [900250, 10.780106], [900620, 10.780106]]|
|4935 |[[926875, 12.685142], [965269, 7.6445694], [902349, 7.6116753], [903722, 7.589792], [902491, 7.532099], [967286, 7.0094624], [967275, 6.997804], [904577, 6.99329], [965266, 6.92415], [969718, 6.6844726]]      |
|5156 |[[926875, 17.93303], [904827, 15.411972], [922040, 10.499099], [903722, 10.226695], [956366, 10.221641], [903590, 10.220772], [967286, 10.165482], [967275, 10.064253], [902424, 9.828769], [976642, 9.710113]]  |
|3997 |[[926875, 10.480882], [904827, 9.429964], [940570, 6.3598623], [904912, 6.262345], [903590, 6.107545], [902219, 5.9371405], [903867, 5.847443], [904510, 5.712657], [994553, 5.599021], [967286, 5.5815654]]     |
|3918 |[[926875, 19.012634], [903722, 9.446412], [903881, 9.018579], [977358, 7.885659], [902349, 7.2009387], [904921, 7.1158476], [904577, 6.988691], [969718, 6.9796844], [903929, 6.7778115], [965269, 6.739849]]    |
|4818 |[[926875, 12.25255], [965266, 6.69639], [922040, 6.257215], [903929, 5.8107934], [969718, 5.6107855], [903265, 5.4943156], [902223, 5.414456], [922208, 5.210143], [967286, 5.0910196], [903722, 5.074853]]      |
|5518 |[[904827, 15.767314], [926875, 13.807736], [956366, 10.157382], [903590, 9.657692], [967286, 9.359275], [989458, 9.3565235], [922040, 9.146953], [976642, 8.869421], [902375, 8.830952], [994553, 8.486274]]     |
|2659 |[[926875, 14.038046], [904827, 12.877292], [989458, 9.375131], [956366, 9.066524], [922040, 8.539777], [965266, 8.418287], [902424, 8.332985], [967275, 8.331763], [976642, 8.271325], [903590, 8.060306]]       |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

Generate top 10 liquor store recommendations for each liquor product

The ALS function recommend these liquor stores to liquor items to sell to based on our customed ratings

In [54]:
liquorRecs = model.recommendForAllItems(10)
In [55]:
liquorRecs.show(10, truncate=False)
+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|item_num|recommendations                                                                                                                                                                                    |
+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|21220   |[[2690, 2.9958515], [5894, 2.987344], [4834, 2.9722185], [5729, 2.944199], [5742, 2.8219154], [5814, 2.8108227], [2707, 2.7784762], [5855, 2.7715654], [2688, 2.7262979], [2676, 2.704467]]        |
|32460   |[[9010, 3.5773158], [9910, 3.5537372], [9919, 2.9481382], [9042, 2.8897936], [9936, 2.8589585], [9931, 2.8037949], [9014, 2.7649574], [9018, 2.758128], [5927, 2.5766726], [5053, 2.5295358]]      |
|64590   |[[5894, 0.9324489], [9031, 0.91519815], [5595, 0.8981756], [5213, 0.862606], [2691, 0.85508406], [5501, 0.84762967], [5888, 0.84289974], [2688, 0.8386687], [2703, 0.83819175], [2706, 0.83816123]]|
|65220   |[[2130, 0.0], [2190, 0.0], [2200, 0.0], [2290, 0.0], [2460, 0.0], [2500, 0.0], [2550, 0.0], [2560, 0.0], [2590, 0.0], [2600, 0.0]]                                                                 |
|100800  |[[5894, 3.8296232], [5595, 2.9337368], [5213, 2.8873358], [9901, 2.8818653], [2706, 2.875552], [2701, 2.8591323], [5729, 2.829994], [9031, 2.829974], [2708, 2.8283963], [9902, 2.8158348]]        |
|902440  |[[4996, 2.5922718], [9031, 2.5713181], [5178, 2.5560606], [5246, 2.501119], [5765, 2.4912188], [4753, 2.4876099], [5457, 2.484739], [5433, 2.4841352], [5630, 2.4712691], [5053, 2.4615793]]       |
|967100  |[[2130, 0.0], [2190, 0.0], [2200, 0.0], [2290, 0.0], [2460, 0.0], [2500, 0.0], [2550, 0.0], [2560, 0.0], [2590, 0.0], [2600, 0.0]]                                                                 |
|987960  |[[5894, 9.971245], [5729, 9.438034], [9031, 9.176365], [2688, 8.795011], [5208, 8.739379], [2707, 8.710201], [5742, 8.5697365], [2703, 8.552623], [4834, 8.549055], [2708, 8.521898]]              |
|34061   |[[9913, 4.1278815], [5851, 4.046236], [5675, 4.0416036], [5016, 4.028426], [5053, 4.023656], [9928, 4.008515], [9031, 3.9784608], [5500, 3.9781082], [5433, 3.9773424], [5729, 3.9752762]]         |
|37111   |[[5742, 2.076934], [4834, 2.0317795], [5814, 1.9850522], [5729, 1.9766899], [2690, 1.9724284], [2707, 1.9651794], [9928, 1.9549849], [9910, 1.9387245], [2688, 1.9316267], [9014, 1.9143723]]      |
+--------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

In [64]:
# Find the top 10 stores by total sales in dollars
top_stores = df.groupBy(df['Store']).agg({'Sale (Dollars)':"sum"}).sort("sum(Sale (Dollars))", ascending=False).dropna().limit(10)
In [65]:
top_stores.select("Store").collect()
#.select("Store")
#top_stores.collect()
Out[65]:
[Row(Store=2633),
 Row(Store=4829),
 Row(Store=2512),
 Row(Store=3385),
 Row(Store=3420),
 Row(Store=3814),
 Row(Store=3952),
 Row(Store=3354),
 Row(Store=3773),
 Row(Store=2190)]
In [66]:
#Get the store number of top 10 stores by total sales in dollars
top10_store_list = [row.Store for row in top_stores.select("Store").collect()]
In [67]:
top10_store_list
Out[67]:
[2633, 4829, 2512, 3385, 3420, 3814, 3952, 3354, 3773, 2190]
In [62]:
# collect the recommendations to top 10 stores by total sales in dollars
rec_liquor_top10_store ={}
for store in top10_store_list:
    rec_liquor = userRecs.where(userRecs.Store == store).select("recommendations").collect()
    rec_liquor_top10_store[store] = [i.item_num for i in rec_liquor[0]["recommendations"]]
In [63]:
rec_liquor_top10_store
Out[63]:
{2190: [926875,
  935969,
  900087,
  903638,
  902585,
  34457,
  922178,
  15627,
  34422,
  927011],
 2512: [926875,
  34457,
  19477,
  15627,
  69947,
  902585,
  19067,
  28867,
  43127,
  34007],
 2633: [902585,
  933281,
  15627,
  41694,
  34029,
  34457,
  69947,
  10627,
  25607,
  19477],
 3354: [926875,
  902209,
  940718,
  903611,
  904916,
  903707,
  967286,
  956366,
  904884,
  904827],
 3385: [926875,
  903611,
  902585,
  902209,
  904827,
  940718,
  933281,
  904916,
  903707,
  968171],
 3420: [940718,
  902209,
  903611,
  904916,
  903707,
  967286,
  904827,
  956366,
  987955,
  987407],
 3773: [902585,
  34457,
  933281,
  15627,
  34029,
  19477,
  69947,
  28233,
  28867,
  34747],
 3814: [926875,
  903265,
  902223,
  903486,
  935969,
  902864,
  940570,
  901230,
  916850,
  989653],
 3952: [902585,
  933281,
  15627,
  41694,
  25607,
  34029,
  10627,
  69947,
  34457,
  67527],
 4829: [902585,
  933281,
  34457,
  15627,
  34029,
  19477,
  69947,
  41694,
  10627,
  28233]}
In [72]:
liquor_product_list = df.select("Item Number",'Item Description','Category Name Clean').dropDuplicates()
In [65]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2190).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-------------------------+
|store_num_b|Name                     |
+-----------+-------------------------+
|2190       |Central City Liquor, Inc.|
+-----------+-------------------------+

In [66]:
# liquor products recommended to store 2190
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[2190]))\
.show(truncate = False)
+-----------+----------------------------------------+------------------------+
|Item Number|Item Description                        |Category Name Clean     |
+-----------+----------------------------------------+------------------------+
|34422      |Grey Goose                              |Imported Vodka          |
|900087     |Smirnoff 1.75L Retail Ready 3pk Shipper |American Vodka          |
|935969     |Stolichnaya Hot                         |Imported Vodka - Misc   |
|926875     |Templeton Rye Special Reserve 6YR       |Blended Whiskey         |
|903638     |Merry's Irish Cream                     |Cream Liqueur           |
|15627      |Jameson                                 |Irish Whiskey           |
|935969     |Stolichnaya Hot                         |null                    |
|902585     |Belvedere Vodka                         |Special Order Item      |
|926875     |Templeton Rye Special Reserve 8YR       |Blended Whiskey         |
|34457      |Ketel One Imported Vodka                |Imported Vodka          |
|927011     |Cedar Ridge Rye Whiskey - Whole Cask Buy|Straight Rye Whiskey    |
|34422      |Grey Goose Vodka                        |Imported Vodka          |
|922178     |Russell's Reserve Single Barrel - Barrel|null                    |
|922178     |Russell's Reserve Single Barrel - Barrel|Straight Bourbon Whiskey|
|902585     |Belvedere Vodka                         |Imported Vodka          |
|927011     |Cedar Ridge Rye Whiskey - Whole Cask Buy|Corn Whiskey            |
|903638     |Merry's Irish Cream                     |Special Order Item      |
|34457      |Ketel One                               |Imported Vodka          |
+-----------+----------------------------------------+------------------------+

In [265]:
# Rename columns
sale_by_store_by_item =sale_by_store_by_item.withColumnRenamed("Item Number", "item_num")

Compared to currently most ordered liquor products by store 2190

In [271]:
sale_by_store_by_item.join(liquor_product_list ,sale_by_store_by_item["item_num"] == liquor_product_list["Item Number"], how='inner')\
.filter(sale_by_store_by_item.store_num==2190)\
.sort("sum(Sale (Dollars))", ascending=False)\
.select("store_num", "item_num", "Item Description","Category Name Clean")\
.dropDuplicates(["item_num"])\
.show(5, truncate=False)
+---------+--------+-----------------------------------------------+--------------------------------+
|store_num|item_num|Item Description                               |Category Name Clean             |
+---------+--------+-----------------------------------------------+--------------------------------+
|2190     |100227  |Glen Moray Scotch Classic Port Cask w/2 Glasses|null                            |
|2190     |20158   |Old Camp Wolf Moon Whiskey                     |Straight Bourbon Whiskey        |
|2190     |22121   |Wild Turkey 81                                 |Straight Bourbon Whiskey        |
|2190     |39917   |New Amsterdam Apple Vodka                      |American Flavored Vodka         |
|2190     |42370   |Bacardi 8 Rum                                  |Puerto Rico & Virgin Islands Rum|
+---------+--------+-----------------------------------------------+--------------------------------+
only showing top 5 rows

In [67]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2512).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------------------------+
|store_num_b|Name                               |
+-----------+-----------------------------------+
|2512       |Hy-Vee Wine and Spirits / Iowa City|
+-----------+-----------------------------------+

In [68]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[2512])).show(truncate = False)
+-----------+----------------------------------------+--------------------------------+
|Item Number|Item Description                        |Category Name Clean             |
+-----------+----------------------------------------+--------------------------------+
|28867      |Tanqueray Gin                           |Imported Dry Gin                |
|43127      |Bacardi Superior Rum                    |Puerto Rico & Virgin Islands Rum|
|926875     |Templeton Rye Special Reserve 6YR       |Blended Whiskey                 |
|69947      |Rumple Minze Peppermint Schnapps Liqueur|Imported Schnapps               |
|34007      |Absolut Swedish Vodka 80prf             |Imported Vodka                  |
|15627      |Jameson                                 |Irish Whiskey                   |
|902585     |Belvedere Vodka                         |Special Order Item              |
|926875     |Templeton Rye Special Reserve 8YR       |Blended Whiskey                 |
|43127      |Bacardi Superior Rum                    |White Rum                       |
|34457      |Ketel One Imported Vodka                |Imported Vodka                  |
|19067      |Jim Beam                                |Straight Bourbon Whiskey        |
|34007      |Absolut Swedish Vodka 80 Prf            |Imported Vodka                  |
|902585     |Belvedere Vodka                         |Imported Vodka                  |
|43127      |Bacardi Superior                        |White Rum                       |
|19477      |Makers Mark                             |Straight Bourbon Whiskey        |
|19477      |Maker's Mark                            |Straight Bourbon Whiskey        |
|34457      |Ketel One                               |Imported Vodka                  |
+-----------+----------------------------------------+--------------------------------+

In [69]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2633).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+----------------------------+
|store_num_b|Name                        |
+-----------+----------------------------+
|2633       |Hy-Vee #3 / BDI / Des Moines|
+-----------+----------------------------+

In [70]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[2633]))\
.show(truncate = False)
+-----------+----------------------------------------+------------------------+
|Item Number|Item Description                        |Category Name Clean     |
+-----------+----------------------------------------+------------------------+
|933281     |Seagram's Twisted Pineapple Gin         |Flavored Gin            |
|34029      |Absolut Citron                          |Imported Flavored Vodka |
|69947      |Rumple Minze Peppermint Schnapps Liqueur|Imported Schnapps       |
|34029      |Absolut Citron (lemon Vodka)            |Imported Flavored Vodka |
|15627      |Jameson                                 |Irish Whiskey           |
|902585     |Belvedere Vodka                         |Special Order Item      |
|34457      |Ketel One Imported Vodka                |Imported Vodka          |
|25607      |Seagrams 7 Crown                        |Blended Whiskey         |
|34029      |Absolut Citron (lemon Vodka)            |Imported Vodka - Misc   |
|25607      |Seagrams 7 Crown Bl Whiskey             |Blended Whiskey         |
|41694      |Uv Blue (raspberry) Vodka               |American Flavored Vodka |
|41694      |UV Blue Raspberry                       |American Flavored Vodka |
|902585     |Belvedere Vodka                         |Imported Vodka          |
|19477      |Makers Mark                             |Straight Bourbon Whiskey|
|19477      |Maker's Mark                            |Straight Bourbon Whiskey|
|10627      |Canadian Club Whisky                    |Canadian Whiskey        |
|34457      |Ketel One                               |Imported Vodka          |
|41694      |Uv Blue (raspberry) Vodka               |Vodka Flavored          |
+-----------+----------------------------------------+------------------------+

In [71]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==3354).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+---------------------------+
|store_num_b|Name                       |
+-----------+---------------------------+
|3354       |Sam's Club 8238 / Davenport|
+-----------+---------------------------+

In [72]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[3354]))\
.show(truncate = False)
+-----------+---------------------------------+----------------------+
|Item Number|Item Description                 |Category Name Clean   |
+-----------+---------------------------------+----------------------+
|903611     |Martell Cordon Bleu 12           |Special Order Item    |
|967286     |Ice Hole Salty Caramel Schnapps  |Triple Sec            |
|926875     |Templeton Rye Special Reserve 6YR|Blended Whiskey       |
|902209     |Edradour 10yr Sing Malt Scotch   |Single Malt Scotch    |
|903707     |Wiser's Deluxe                   |Special Order Item    |
|926875     |Templeton Rye Special Reserve 8YR|Blended Whiskey       |
|903707     |Wiser's Deluxe                   |Canadian Whiskey      |
|904884     |Ice Hole Exotic Schnapps         |Special Order Item    |
|904827     |Everclear Alcohol 190 Proof      |Special Order Item    |
|904916     |Scorpion Mezcal Reposado         |Special Order Item    |
|904827     |Everclear Alcohol 190 Proof      |American Alcohol      |
|903611     |Martell Cordon Bleu 12           |Imported Grape Brandy |
|904916     |Scorpion Mezcal Reposado         |Tequila               |
|940718     |Orloff Light Vodka               |Low Proof Vodka       |
|967286     |Ice Hole Salty Caramel Schnapps  |American Schnapps     |
|902209     |Edradour 10yr Sing Malt Scotch   |Special Order Item    |
|904884     |Ice Hole Exotic Schnapps         |Peppermint Schnapps   |
|956366     |Dekuyper Coffee Brandy           |Miscellaneous  Brandey|
|904827     |Everclear 190prf                 |Special Order Item    |
+-----------+---------------------------------+----------------------+

In [73]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==3385).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+------------------------------+
|store_num_b|Name                          |
+-----------+------------------------------+
|3385       |Sam's Club 8162 / Cedar Rapids|
+-----------+------------------------------+

In [74]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[3385]))\
.show(truncate = False)
+-----------+---------------------------------------+---------------------+
|Item Number|Item Description                       |Category Name Clean  |
+-----------+---------------------------------------+---------------------+
|903611     |Martell Cordon Bleu 12                 |Special Order Item   |
|933281     |Seagram's Twisted Pineapple Gin        |Flavored Gin         |
|926875     |Templeton Rye Special Reserve 6YR      |Blended Whiskey      |
|902209     |Edradour 10yr Sing Malt Scotch         |Single Malt Scotch   |
|903707     |Wiser's Deluxe                         |Special Order Item   |
|902585     |Belvedere Vodka                        |Special Order Item   |
|926875     |Templeton Rye Special Reserve 8YR      |Blended Whiskey      |
|903707     |Wiser's Deluxe                         |Canadian Whiskey     |
|968171     |Coole Swan Irish Superior Cream Liqueur|Cream Liqueur        |
|904827     |Everclear Alcohol 190 Proof            |Special Order Item   |
|904916     |Scorpion Mezcal Reposado               |Special Order Item   |
|904827     |Everclear Alcohol 190 Proof            |American Alcohol     |
|903611     |Martell Cordon Bleu 12                 |Imported Grape Brandy|
|904916     |Scorpion Mezcal Reposado               |Tequila              |
|940718     |Orloff Light Vodka                     |Low Proof Vodka      |
|902585     |Belvedere Vodka                        |Imported Vodka       |
|902209     |Edradour 10yr Sing Malt Scotch         |Special Order Item   |
|904827     |Everclear 190prf                       |Special Order Item   |
+-----------+---------------------------------------+---------------------+

In [75]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==3420).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+---------------------------------+
|store_num_b|Name                             |
+-----------+---------------------------------+
|3420       |Sam's Club 6344 / Windsor Heights|
+-----------+---------------------------------+

In [76]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[3420]))\
.show(truncate = False)
+-----------+---------------------------------+----------------------+
|Item Number|Item Description                 |Category Name Clean   |
+-----------+---------------------------------+----------------------+
|903611     |Martell Cordon Bleu 12           |Special Order Item    |
|967286     |Ice Hole Salty Caramel Schnapps  |Triple Sec            |
|902209     |Edradour 10yr Sing Malt Scotch   |Single Malt Scotch    |
|987407     |Jose Cuervo Especial Silver Flask|null                  |
|903707     |Wiser's Deluxe                   |Special Order Item    |
|987955     |Kavalan King Car Conductor       |null                  |
|903707     |Wiser's Deluxe                   |Canadian Whiskey      |
|987407     |Jose Cuervo Especial Silver/Flask|null                  |
|904827     |Everclear Alcohol 190 Proof      |Special Order Item    |
|904916     |Scorpion Mezcal Reposado         |Special Order Item    |
|904827     |Everclear Alcohol 190 Proof      |American Alcohol      |
|903611     |Martell Cordon Bleu 12           |Imported Grape Brandy |
|904916     |Scorpion Mezcal Reposado         |Tequila               |
|987955     |Kavalan King Car Conductor       |Single Malt Scotch    |
|940718     |Orloff Light Vodka               |Low Proof Vodka       |
|967286     |Ice Hole Salty Caramel Schnapps  |American Schnapps     |
|902209     |Edradour 10yr Sing Malt Scotch   |Special Order Item    |
|956366     |Dekuyper Coffee Brandy           |Miscellaneous  Brandey|
|904827     |Everclear 190prf                 |Special Order Item    |
+-----------+---------------------------------+----------------------+

In [77]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==3814).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+---------------------------+
|store_num_b|Name                       |
+-----------+---------------------------+
|3814       |Costco Wholesale #788 / WDM|
+-----------+---------------------------+

In [78]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[3814]))\
.show(truncate = False)
+-----------+-------------------------------------+------------------------+
|Item Number|Item Description                     |Category Name Clean     |
+-----------+-------------------------------------+------------------------+
|901230     |Tres Agaves Reposado                 |null                    |
|902864     |Carolan's Irish Cream                |Special Order Item      |
|935969     |Stolichnaya Hot                      |Imported Vodka - Misc   |
|926875     |Templeton Rye Special Reserve 6YR    |Blended Whiskey         |
|935969     |Stolichnaya Hot                      |null                    |
|903486     |Mr Boston Five Star Brandy           |American Grape Brandy   |
|902223     |Gosling Gold Rum                     |Jamaica Rum             |
|989653     |Patron Reposado Barrel Select Program|100% Agave Tequila      |
|940570     |Skinnygirl Bare Naked Vodka          |100 Proof Vodka         |
|926875     |Templeton Rye Special Reserve 8YR    |Blended Whiskey         |
|902864     |Carolan's Irish Cream                |Cream Liqueur           |
|940570     |Skinnygirl Bare Naked                |American Vodka          |
|903486     |Mr Boston Five Star Brandy           |Special Order Item      |
|901230     |Tres Agaves Reposado                 |Tequila                 |
|916850     |Blantons Buy the Barrel              |Straight Bourbon Whiskey|
|903265     |Tanqueray Sterling Vodka             |Imported Vodka          |
|940570     |Skinnygirl Bare Naked Vodka          |American Vodka          |
|989653     |Patron Barrel Select Reposado Barrel |100% Agave Tequila      |
|901230     |Tres Agaves Reposado                 |Special Order Item      |
+-----------+-------------------------------------+------------------------+

In [79]:
 
liquor_store_clean.filter(liquor_store_clean.store_num_b==3773).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------+
|store_num_b|Name             |
+-----------+-----------------+
|3773       |Benz Distributing|
+-----------+-----------------+

In [80]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store[3773]))\
.show(truncate = False)
+-----------+----------------------------------------+------------------------+
|Item Number|Item Description                        |Category Name Clean     |
+-----------+----------------------------------------+------------------------+
|34747      |Stolichnaya 80prf                       |Imported Vodka          |
|933281     |Seagram's Twisted Pineapple Gin         |Flavored Gin            |
|28867      |Tanqueray Gin                           |Imported Dry Gin        |
|34029      |Absolut Citron                          |Imported Flavored Vodka |
|69947      |Rumple Minze Peppermint Schnapps Liqueur|Imported Schnapps       |
|34029      |Absolut Citron (lemon Vodka)            |Imported Flavored Vodka |
|15627      |Jameson                                 |Irish Whiskey           |
|902585     |Belvedere Vodka                         |Special Order Item      |
|34457      |Ketel One Imported Vodka                |Imported Vodka          |
|34029      |Absolut Citron (lemon Vodka)            |Imported Vodka - Misc   |
|34747      |Stolichnaya Premium Vodka 80 Proof (lv) |Imported Vodka          |
|34747      |Stolichnaya Premium Vodka 80 Proof (lv) |null                    |
|902585     |Belvedere Vodka                         |Imported Vodka          |
|28233      |Bombay Sapphire Gin                     |Imported Dry Gin        |
|19477      |Makers Mark                             |Straight Bourbon Whiskey|
|19477      |Maker's Mark                            |Straight Bourbon Whiskey|
|34457      |Ketel One                               |Imported Vodka          |
+-----------+----------------------------------------+------------------------+

Recommend liquor stores to liquor items

Identify most popular sales liquor items

In [81]:
# Find the top 10 liquor items by total sales in dollars
df=df.withColumnRenamed("Item Number", "item_num")
top_items = df.groupBy(df['item_num']).agg({'Sale (Dollars)':"sum"}).sort("sum(Sale (Dollars))", ascending=False).dropna().limit(10)
In [82]:
# Find the top 10 stores by total sales in dollars
top_stores = df.groupBy(df['Store']).agg({'Sale (Dollars)':"sum"}).sort("sum(Sale (Dollars))", ascending=False).dropna().limit(10)
In [83]:
# Find the top 10 liquor items by total sales in dollars
top10_item_list = [row.item_num for row in top_items.select("item_num").collect()]
In [84]:
# Find the top 110 liquor items by total sales in dollars
top10_item_list
Out[84]:
[u'11788',
 u'43337',
 u'26827',
 u'11297',
 u'36308',
 u'11296',
 u'43338',
 u'88296',
 u'26826',
 u'1799']

Recommend liquor stores to top 10 best-selling liquor products

In [91]:
rec_store_top10_item ={}
for item in top10_item_list:
    rec_store = liquorRecs.where(liquorRecs.item_num == item).select("recommendations").collect()
    rec_store_top10_item[item] = [i.Store for i in rec_store[0]["recommendations"]]
In [92]:
rec_store_top10_item
Out[92]:
{u'11296': [3696, 3663, 2507, 3769, 4014, 3645, 5899, 3776, 5575, 3644],
 u'11297': [3524, 3494, 5675, 3447, 5178, 5053, 5690, 4957, 5457, 5020],
 u'11788': [3663, 3769, 4346, 2584, 3776, 3549, 3696, 3644, 4014, 3592],
 u'1799': [5053, 5629, 5082, 5721, 5675, 5853, 5925, 5192, 5554, 5796],
 u'26826': [3696, 3663, 2507, 3769, 4014, 3645, 5899, 4447, 4004, 3644],
 u'26827': [3524, 5675, 3494, 5053, 5178, 3447, 4957, 5690, 5144, 5457],
 u'36308': [3769, 3663, 3644, 3696, 3776, 4004, 3645, 4014, 3592, 3722],
 u'43337': [3549, 3524, 3494, 4762, 3447, 3818, 3942, 2636, 2578, 2602],
 u'43338': [5575, 3696, 5899, 5178, 5278, 5727, 5053, 3660, 5565, 4014],
 u'88296': [5178, 5899, 5811, 5516, 5398, 5630, 5020, 5720, 5531, 5602]}
In [93]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description').dropDuplicates().show() 
+--------+----------------+
|item_num|Item Description|
+--------+----------------+
|   11296|     Crown Royal|
+--------+----------------+

In [94]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11296   |Crown Royal     |Canadian Whiskey   |
+--------+----------------+-------------------+

In [95]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['11296'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+------------------------------------+---------------------+---------------+
|store_num_b|Name                                |Address              |City           |
+-----------+------------------------------------+---------------------+---------------+
|2507       |Hy-Vee Food Store #1 / Burlington   |939 Angular          |Burlington     |
|3645       |Wal-Mart 1764 / Windsor Heights     |1001  73rd St        |Windsor Heights|
|3776       |Wal-Mart 5115 / Davenport           |3101 West Kimberly Rd|Davenport      |
|3644       |Wal-Mart 2764 / Altoona             |3501  8th St SW      |Altoona        |
|4014       |Wal-Mart 1732 / Denison             |510 C Ave            |Denison        |
|3769       |Wal-Mart 0753 / Cedar Fall          |525 Brandilynn Blvd  |Cedar Falls    |
|3696       |Wal-Mart 1723 / Des Moines          |5101 SE 14th St      |Des Moines     |
|5575       |Casey's General Store #2919 / Marion|340 Marion Blvd      |Marion         |
|3663       |Wal-Mart 1496 / Waterloo            |1334 Flammang Dr     |Waterloo       |
|5899       |Fort Dodge Smoke Shop               |1923 5th Ave S       |Fort Dodge     |
+-----------+------------------------------------+---------------------+---------------+

In [96]:
df.filter(df.item_num=='11297').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+---------------------------+-------------------+
|item_num|Item Description           |Category Name Clean|
+--------+---------------------------+-------------------+
|11297   |Crown Royal Canadian Whisky|Canadian Whiskey   |
|11297   |Crown Royal                |Canadian Whiskey   |
+--------+---------------------------+-------------------+

In [97]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['11297'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------+----------------------+------------+
|store_num_b|Name                             |Address               |City        |
+-----------+---------------------------------+----------------------+------------+
|3524       |Sam's Club 6568 / Ames           |305 Airport Rd        |Ames        |
|5020       |Petromart Stanwood               |1118, Highway 30      |Stanwood    |
|4957       |Kimmes Wall Lake Country Store   |129 Center St         |Wall Lake   |
|5690       |Dyno's #42 / Sac City            |210 W Main St         |Sac City    |
|5053       |Cork & Bottle / Iowa City        |610, Eastbury Dr      |Iowa City   |
|3447       |Sam's Club 6432 / Sioux City     |4201 S. York St.      |Sioux City  |
|3494       |Sam's Club 6514 / Waterloo       |210 East Tower Park Dr|Waterloo    |
|5457       |ThuLords                         |405 Broad St          |Whittemore  |
|5178       |New York Dollar Stores           |320, N Calhoun St     |West Liberty|
|5675       |10th Hole Inn & Suite / Gift Shop|402 E Main St STE A   |Riceville   |
+-----------+---------------------------------+----------------------+------------+

In [98]:
df.filter(df.item_num=='11788').select("item_num", 'Item Description','Category Name Clean')\
                                       .dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11788   |Black Velvet    |Canadian Whiskey   |
+--------+----------------+-------------------+

In [99]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['11788'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+--------------------------------+---------------------+------------+
|store_num_b|Name                            |Address              |City        |
+-----------+--------------------------------+---------------------+------------+
|2584       |Hy-Vee Food Store / Fort Madison|2606 Avenue L        |Fort Madison|
|3592       |Wal-Mart 0886 / Fort Dodge      |3036 1st Ave South   |Fort Dodge  |
|3776       |Wal-Mart 5115 / Davenport       |3101 West Kimberly Rd|Davenport   |
|3644       |Wal-Mart 2764 / Altoona         |3501  8th St SW      |Altoona     |
|4014       |Wal-Mart 1732 / Denison         |510 C Ave            |Denison     |
|3769       |Wal-Mart 0753 / Cedar Fall      |525 Brandilynn Blvd  |Cedar Falls |
|3696       |Wal-Mart 1723 / Des Moines      |5101 SE 14th St      |Des Moines  |
|3549       |Quicker Liquor Store            |1414 48th St         |Fort Madison|
|3663       |Wal-Mart 1496 / Waterloo        |1334 Flammang Dr     |Waterloo    |
|4346       |Roy's Foodland                  |105 Pearl St         |Shellsburg  |
+-----------+--------------------------------+---------------------+------------+

In [100]:
df.filter(df.item_num=='1799').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+-------------------------------------+-------------------+
|item_num|Item Description                     |Category Name Clean|
+--------+-------------------------------------+-------------------+
|1799    |Captain Morgan Original Spiced Barrel|Spiced Rum         |
|1799    |Captain Morgan Spiced Barrel         |Spiced Rum         |
+--------+-------------------------------------+-------------------+

In [101]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['1799'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------------+---------------------------+--------------+
|store_num_b|Name                                   |Address                    |City          |
+-----------+---------------------------------------+---------------------------+--------------+
|5082       |Casey's General Store #2365 / Atlant   |911 SW 7th St              |Atlantic      |
|5192       |Guppy's On the Go  /  Robins           |990, Robins Square Dr      |Robins        |
|5796       |Casey's General Store #1541 / West Bend|316 2nd Ave NE             |West Bend     |
|5053       |Cork & Bottle / Iowa City              |610, Eastbury Dr           |Iowa City     |
|5554       |Blairstown Quick Stop                  |321 Locust St NW           |Blairstown    |
|5721       |West Main Liquor                       |306 West Main Street Unit C|Calmar        |
|5853       |R & R Town Mart / Rudd                 |522 Chickasaw St           |Rudd          |
|5629       |The Ville                              |111 W Main St              |Riceville     |
|5925       |Fredricksburg Food Center              |100 W Main St              |Fredericksburg|
|5675       |10th Hole Inn & Suite / Gift Shop      |402 E Main St STE A        |Riceville     |
+-----------+---------------------------------------+---------------------------+--------------+

In [102]:
df.filter(df.item_num=='26826').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+-------------------------------+-------------------+
|item_num|Item Description               |Category Name Clean|
+--------+-------------------------------+-------------------+
|26826   |Jack Daniels Old #7 Black Lbl  |Tennessee Whiskey  |
|26826   |Jack Daniels Old #7 Black Label|Tennessee Whiskey  |
+--------+-------------------------------+-------------------+

In [103]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['26826'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------+-------------------+---------------+
|store_num_b|Name                             |Address            |City           |
+-----------+---------------------------------+-------------------+---------------+
|2507       |Hy-Vee Food Store #1 / Burlington|939 Angular        |Burlington     |
|3645       |Wal-Mart 1764 / Windsor Heights  |1001  73rd St      |Windsor Heights|
|4004       |Wal-Mart 4256 / Ames             |534 S Duff Ave     |Ames           |
|4447       |QUIK TRIP #567 / URBANDALE       |9915 Douglas Ave   |Urbandale      |
|3644       |Wal-Mart 2764 / Altoona          |3501  8th St SW    |Altoona        |
|4014       |Wal-Mart 1732 / Denison          |510 C Ave          |Denison        |
|3769       |Wal-Mart 0753 / Cedar Fall       |525 Brandilynn Blvd|Cedar Falls    |
|3696       |Wal-Mart 1723 / Des Moines       |5101 SE 14th St    |Des Moines     |
|3663       |Wal-Mart 1496 / Waterloo         |1334 Flammang Dr   |Waterloo       |
|5899       |Fort Dodge Smoke Shop            |1923 5th Ave S     |Fort Dodge     |
+-----------+---------------------------------+-------------------+---------------+

In [104]:
df.filter(df.item_num=='26827').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+-------------------------------+-------------------+
|item_num|Item Description               |Category Name Clean|
+--------+-------------------------------+-------------------+
|26827   |Jack Daniels Old #7 Black Lbl  |Tennessee Whiskey  |
|26827   |Jack Daniels Old #7 Black Label|Tennessee Whiskey  |
+--------+-------------------------------+-------------------+

In [105]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item["26827"])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------+----------------------+------------+
|store_num_b|Name                             |Address               |City        |
+-----------+---------------------------------+----------------------+------------+
|3524       |Sam's Club 6568 / Ames           |305 Airport Rd        |Ames        |
|4957       |Kimmes Wall Lake Country Store   |129 Center St         |Wall Lake   |
|5690       |Dyno's #42 / Sac City            |210 W Main St         |Sac City    |
|5053       |Cork & Bottle / Iowa City        |610, Eastbury Dr      |Iowa City   |
|3447       |Sam's Club 6432 / Sioux City     |4201 S. York St.      |Sioux City  |
|5144       |Sam's Club 6979 / Ankeny         |4625 SE Delaware Ave  |Ankeny      |
|3494       |Sam's Club 6514 / Waterloo       |210 East Tower Park Dr|Waterloo    |
|5457       |ThuLords                         |405 Broad St          |Whittemore  |
|5178       |New York Dollar Stores           |320, N Calhoun St     |West Liberty|
|5675       |10th Hole Inn & Suite / Gift Shop|402 E Main St STE A   |Riceville   |
+-----------+---------------------------------+----------------------+------------+

In [106]:
df.filter(df.item_num=='36308').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|36308   |Hawkeye Vodka   |American Vodka     |
|36308   |Hawkeye Vodka   |Vodka 80 Proof     |
+--------+----------------+-------------------+

In [107]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item['36308'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+-------------------------------+-----------------------+---------------+
|store_num_b|Name                           |Address                |City           |
+-----------+-------------------------------+-----------------------+---------------+
|3722       |Wal-Mart 1361 / Sioux City     |3400 Singing Hills Blvd|Sioux City     |
|3645       |Wal-Mart 1764 / Windsor Heights|1001  73rd St          |Windsor Heights|
|3592       |Wal-Mart 0886 / Fort Dodge     |3036 1st Ave South     |Fort Dodge     |
|4004       |Wal-Mart 4256 / Ames           |534 S Duff Ave         |Ames           |
|3776       |Wal-Mart 5115 / Davenport      |3101 West Kimberly Rd  |Davenport      |
|3644       |Wal-Mart 2764 / Altoona        |3501  8th St SW        |Altoona        |
|4014       |Wal-Mart 1732 / Denison        |510 C Ave              |Denison        |
|3769       |Wal-Mart 0753 / Cedar Fall     |525 Brandilynn Blvd    |Cedar Falls    |
|3696       |Wal-Mart 1723 / Des Moines     |5101 SE 14th St        |Des Moines     |
|3663       |Wal-Mart 1496 / Waterloo       |1334 Flammang Dr       |Waterloo       |
+-----------+-------------------------------+-----------------------+---------------+

Implicit ALS

As the rating matrix is derived from another source of information (i.e. it is inferred from product sales as percentage of store total sales), we set implicitPrefs to true to see how it performs.

For this model, instead of using the rating column (a column transformed by QuantileDiscretizerizing Percentage of Total Store Sales), we are using the Percentage of Total Store Sales the column itself.

In [108]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als_imp = ALS(maxIter=10, regParam=0.01, implicitPrefs=True, userCol="Store", itemCol="item_num", 
              ratingCol="perc_of_store_total_sale",
          coldStartStrategy="drop", nonnegative = True)
In [109]:
#fit model
model_imp = als_imp.fit(training)
In [110]:
# Apply model on the test set to predict 
predictions_imp = model_imp.transform(test)
In [111]:
#explain parameters of the model
model_imp.explainParams()
Out[111]:
'coldStartStrategy: strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: nan,drop. (default: nan, current: drop)\nitemCol: column name for item ids. Ids must be within the integer value range. (default: item, current: item_num)\npredictionCol: prediction column name (default: prediction)\nuserCol: column name for user ids. Ids must be within the integer value range. (default: user, current: Store)'
In [112]:
#item factors 
model_imp.itemFactors.show(10, truncate = False)
+---+--------------------------------------------------------------------------------------------------------------------+
|id |features                                                                                                            |
+---+--------------------------------------------------------------------------------------------------------------------+
|160|[0.0, 0.051229566, 0.0, 0.0034925682, 0.0, 0.08537091, 0.0, 0.02805142, 0.030124618, 0.0]                           |
|220|[0.0, 0.054519143, 0.0, 0.0, 0.0, 0.096707225, 0.0, 0.0, 0.0, 0.0]                                                  |
|250|[5.4738927E-4, 0.0, 0.0, 0.0012837213, 0.0, 0.055962477, 0.0, 0.011006589, 0.0, 0.0]                                |
|430|[3.040627E-5, 0.0, 0.0, 0.36154985, 0.0, 0.0522809, 0.0, 0.0, 0.0, 0.0]                                             |
|470|[0.0, 0.20460382, 0.07106565, 0.52246356, 0.003693969, 0.21560815, 0.0077198595, 0.22964433, 0.18124579, 0.10420551]|
|490|[0.0012321548, 0.2585866, 0.007357442, 0.027171498, 0.0, 0.27428597, 0.0, 0.0, 0.0061031673, 0.0]                   |
|500|[0.0, 0.0042500193, 0.0, 0.0062343413, 0.0, 0.071106695, 0.0, 0.049014922, 0.030283857, 0.0]                        |
|550|[0.0, 0.0, 0.0012960728, 0.0038924422, 0.0, 0.09571959, 0.0, 0.009968499, 0.0, 0.013368217]                         |
|580|[0.0, 0.0, 0.0, 0.2599259, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]                                                            |
|680|[0.0, 0.21680893, 0.0012619422, 0.0, 0.0, 0.05519837, 0.004345927, 0.0, 0.0, 0.0]                                   |
+---+--------------------------------------------------------------------------------------------------------------------+
only showing top 10 rows

In [113]:
#liquor_store_clean = liquor_store_clean.withColumnRenamed("Store", "store_num_b")

See what the model predicts

In [114]:
predictions_imp = predictions_imp.join(liquor_store_clean,liquor_store_clean.store_num_b== predictions.Store)
In [115]:
predictions_imp.show(5)
+-----+--------+------+------------------------+-----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale| prediction|store_num_b|                Name|Store Status|           Address|           City|State|Zip Code|       Store Address|Report Date|
+-----+--------+------+------------------------+-----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
| 3698|   10623|   0.0|    4.750013577122141E-6|0.048487708|       3698|Ingersoll Wine Me...|           A|     1300  50th St|West Des Moines|   IA|   50266|POINT (-93.771159...| 04/01/2020|
| 4295|   10623|   0.0|     3.65156058143765E-5|0.090148866|       4295|Nat Food Mart #1 ...|           I|      3804 Hubbell|     Des Moines|   IA|   50317|POINT (-93.541351...| 04/01/2020|
| 4545|   10623|   0.0|    1.892105279021528...|0.023556076|       4545|           Fill R Up|           I|  14151 Wabash Ave| Council Bluffs|   IA|   51503|POINT (-95.817909...| 04/01/2020|
| 2290|   10623|   0.0|    7.363685507009042...|0.076922834|       2290|Al's Liquors Unli...|           I|215 West Milwaukee|     Storm Lake|   IA|   50588|POINT (-95.202405...| 04/01/2020|
| 4077|   10623|   0.0|    1.870078847527492...| 0.06517759|       4077|  Xo Food And Liquor|           A|   428 Franklin St|       Waterloo|   IA|   50703|POINT (-92.334641...| 04/01/2020|
+-----+--------+------+------------------------+-----------+-----------+--------------------+------------+------------------+---------------+-----+--------+--------------------+-----------+
only showing top 5 rows

In [116]:
predictions_imp =predictions_imp.na.drop()
predictions_imp.show(3, truncate = False)
+-----+--------+------+------------------------+-----------+-----------+--------------------------+------------+----------------+---------------+-----+--------+------------------------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale|prediction |store_num_b|Name                      |Store Status|Address         |City           |State|Zip Code|Store Address                       |Report Date|
+-----+--------+------+------------------------+-----------+-----------+--------------------------+------------+----------------+---------------+-----+--------+------------------------------------+-----------+
|3698 |10623   |0.0   |4.750013577122141E-6    |0.048487708|3698       |Ingersoll Wine Merchants  |A           |1300  50th St   |West Des Moines|IA   |50266   |POINT (-93.771159 41.590871)        |04/01/2020 |
|4295 |10623   |0.0   |3.65156058143765E-5     |0.090148866|4295       |Nat Food Mart #1 / Hubbell|I           |3804 Hubbell    |Des Moines     |IA   |50317   |POINT (-93.541351 41.630849)        |04/01/2020 |
|4545 |10623   |0.0   |1.8921052790215288E-5   |0.023556076|4545       |Fill R Up                 |I           |14151 Wabash Ave|Council Bluffs |IA   |51503   |POINT (-95.81790900000001 41.220431)|04/01/2020 |
+-----+--------+------+------------------------+-----------+-----------+--------------------------+------------+----------------+---------------+-----+--------+------------------------------------+-----------+
only showing top 3 rows

Prediction Performance

In [133]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='perc_of_store_total_sale')
rmse_imp = evaluator.evaluate(predictions_imp)
print("Root-mean-square error = " + str(rmse_imp))
Root-mean-square error = 0.48240176621
In [118]:
userRecs_imp = model_imp.recommendForAllUsers(10)
In [119]:
userRecs_imp.show(2, truncate=False)
+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Store|recommendations                                                                                                                                                                                             |
+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4900 |[[89193, 1.0120503], [49189, 1.0109377], [49185, 1.0046719], [36903, 1.0030771], [48099, 0.9988279], [53213, 0.9939643], [26823, 0.9802368], [11773, 0.9775729], [43333, 0.9661156], [35913, 0.96293867]]   |
|5300 |[[26821, 0.7096183], [37994, 0.69815487], [26826, 0.69717896], [11290, 0.6882901], [43331, 0.6870129], [64866, 0.6850258], [11294, 0.68019557], [11774, 0.6794355], [64858, 0.67601943], [64864, 0.6743206]]|
+-----+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows

Generate top 10 liquor store recommendations for each liquor product

The ALS function recommend these liquor stores to liquor items to sell to based on our customed ratings

In [120]:
liquorRecs_imp = model_imp.recommendForAllItems(10)
In [121]:
liquorRecs_imp.show(2, truncate=False)
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|item_num|recommendations                                                                                                                                                                                        |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|21220   |[[2619, 0.22014774], [2614, 0.20923689], [2643, 0.20775041], [2603, 0.20030355], [2665, 0.19950663], [2648, 0.19860944], [2515, 0.19681376], [2512, 0.1955989], [2590, 0.19223362], [2538, 0.19113725]]|
|32460   |[[3773, 0.2162186], [4129, 0.18812409], [2538, 0.18042792], [3869, 0.17780903], [2190, 0.17657104], [2648, 0.17634706], [2619, 0.17563295], [2643, 0.17469415], [4988, 0.17314237], [2666, 0.17279202]]|
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows

In [122]:
#Get the store number of top 10 stores by total sales in dollars
top10_store_list 
Out[122]:
[2633, 4829, 2512, 3385, 3420, 3814, 3952, 3354, 3773, 2190]
In [123]:
# collect the recommendations to top 10 stores by total sales in dollars
rec_liquor_top10_store_imp ={}
for store in top10_store_list:
    rec_liquor_imp = userRecs_imp.where(userRecs_imp.Store == store).select("recommendations").collect()
    rec_liquor_top10_store_imp[store] = [i.item_num for i in rec_liquor_imp[0]["recommendations"]]
In [124]:
rec_liquor_top10_store_imp
Out[124]:
{2190: [16850, 18348, 49186, 88291, 5747, 21236, 89610, 18006, 89175, 88186],
 2512: [18348, 21236, 89610, 5133, 22175, 34244, 18006, 16850, 27100, 67595],
 2633: [18348, 16850, 64776, 27605, 27125, 16676, 15658, 21236, 75165, 5747],
 3354: [42717, 34007, 19067, 41694, 82607, 43127, 82847, 65257, 10627, 82787],
 3385: [42717, 34007, 19067, 82607, 41694, 43127, 82847, 65257, 82787, 11297],
 3420: [42717, 34007, 19067, 41694, 82607, 43127, 82847, 65257, 10627, 11297],
 3773: [18348, 16850, 5747, 27125, 15658, 27605, 75165, 27100, 28279, 5635],
 3814: [41694, 19067, 42717, 10627, 34007, 43127, 82607, 86887, 65257, 82847],
 3952: [19067, 41694, 42717, 43127, 10627, 34007, 28867, 67527, 65257, 82607],
 4829: [19067, 38177, 28867, 64776, 41694, 67527, 10627, 42717, 43127, 80096]}
In [125]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2190).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-------------------------+
|store_num_b|Name                     |
+-----------+-------------------------+
|2190       |Central City Liquor, Inc.|
+-----------+-------------------------+

In [126]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_imp[2190]))\
.show(truncate = False)
+-----------+---------------------------+-----------------------------+
|Item Number|Item Description           |Category Name Clean          |
+-----------+---------------------------+-----------------------------+
|88186      |Olmeca Altos Plata         |100% Agave Tequila           |
|89610      |Olmeca Altos Reposado      |100% Agave Tequila           |
|89175      |Don Julio Anejo Tequila    |100% Agave Tequila           |
|88291      |Patron Tequila Silver Mini |100% Agave Tequila           |
|89175      |Don Julio Anejo Tequila    |Tequila                      |
|49186      |Remy Martin VSOP           |Imported Brandy              |
|16850      |Blanton's Bourbon          |Straight Bourbon Whiskey     |
|49186      |Remy Martin Vsop Cognac    |Imported Brandy              |
|5747       |Johnnie Walker White Walker|Temporary & Specialty Package|
|18348      |Four Roses Small Batch     |Straight Bourbon Whiskey     |
|16850      |Blantons Bourbon           |Straight Bourbon Whiskey     |
|89610      |Olmeca Altos Reposado      |null                         |
|88186      |Olmeca Altos Plata         |Tequila                      |
|49186      |Remy Martin Vsop Cognac    |Imported Grape Brandy        |
|21236      |Ridgemont Reserve 1792     |Straight Bourbon Whiskey     |
|88291      |Patron Silver Mini         |100% Agave Tequila           |
|89175      |Don Julio Anejo            |100% Agave Tequila           |
|5747       |Johnnie Walker White Walker|Scotch Whiskey               |
|18006      |Buffalo Trace Bourbon      |Straight Bourbon Whiskey     |
|88291      |Patron Tequila Silver Mini |Tequila                      |
+-----------+---------------------------+-----------------------------+
only showing top 20 rows

In [134]:
liquor_store_clean.filter(liquor_store_clean.store_num_b== 2512).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------------------------+
|store_num_b|Name                               |
+-----------+-----------------------------------+
|2512       |Hy-Vee Wine and Spirits / Iowa City|
+-----------+-----------------------------------+

In [135]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_imp[2512]))\
.show(truncate = False)
+-----------+-------------------------------------+------------------------+
|Item Number|Item Description                     |Category Name Clean     |
+-----------+-------------------------------------+------------------------+
|89610      |Olmeca Altos Reposado                |100% Agave Tequila      |
|67595      |Patron Xo Cafe                       |Coffee Liqueur          |
|67595      |Patron XO Cafe                       |Coffee Liqueur          |
|22175      |Russells Reserve 10YR                |Straight Bourbon Whiskey|
|22175      |Russell's Reserve 10YR               |Straight Bourbon Whiskey|
|5133       |Glenmorangie 10 Yr Single Malt Scotch|Irish Whiskey           |
|5133       |Glenmorangie 10 Yr Single Malt Scotch|Single Malt Scotch      |
|16850      |Blanton's Bourbon                    |Straight Bourbon Whiskey|
|18348      |Four Roses Small Batch               |Straight Bourbon Whiskey|
|16850      |Blantons Bourbon                     |Straight Bourbon Whiskey|
|89610      |Olmeca Altos Reposado                |null                    |
|27100      |Sazerac Rye                          |Straight Rye Whiskey    |
|21236      |Ridgemont Reserve 1792               |Straight Bourbon Whiskey|
|18006      |Buffalo Trace Bourbon                |Straight Bourbon Whiskey|
|34244      |Chopin Vodka                         |Imported Vodka          |
|89610      |Olmeca Altos Reposado                |Tequila                 |
|5133       |Glenmorangie 10YR                    |Single Malt Scotch      |
+-----------+-------------------------------------+------------------------+

Recommend liquor stores to top 10 best-selling liquor products

In [127]:
rec_store_top10_item_imp ={}
for item in top10_item_list:
    rec_store_imp = liquorRecs_imp.where(liquorRecs_imp.item_num == item).select("recommendations").collect()
    rec_store_top10_item_imp[item] = [i.Store for i in rec_store_imp[0]["recommendations"]]
In [128]:
rec_store_top10_item_imp
Out[128]:
{u'11296': [4140, 4779, 4621, 4678, 5382, 4873, 4767, 5061, 4457, 5357],
 u'11297': [3612, 5236, 5140, 2465, 3390, 4312, 3525, 2191, 5068, 5102],
 u'11788': [4779, 5382, 4621, 5357, 4873, 4767, 5061, 4519, 4457, 4140],
 u'1799': [3731, 3644, 4516, 3719, 3899, 3696, 3645, 3628, 4521, 2556],
 u'26826': [4457, 4779, 4873, 5382, 4353, 4140, 5061, 4427, 4621, 4395],
 u'26827': [3612, 2465, 3390, 4312, 3525, 5140, 2191, 5326, 5068, 5236],
 u'36308': [4140, 4253, 4702, 2562, 4688, 2200, 4703, 4705, 4795, 4678],
 u'43337': [3612, 5236, 4312, 5326, 2465, 3525, 3390, 5068, 5102, 5140],
 u'43338': [5357, 3943, 2200, 5382, 5326, 4621, 4767, 4253, 4779, 3390],
 u'88296': [5326, 5427, 4167, 5251, 3525, 4312, 2465, 5236, 5162, 3612]}
In [129]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11296   |Crown Royal     |Canadian Whiskey   |
+--------+----------------+-------------------+

In [130]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_imp['11296'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------------+-------------------+------------+
|store_num_b|Name                                   |Address            |City        |
+-----------+---------------------------------------+-------------------+------------+
|4621       |Casey's General Store #23 / Maxwell    |100, Main St       |Maxwell     |
|4678       |Fareway Stores #124 / Adel             |804 Nile Kinnick Dr|Adel        |
|4457       |Kum & Go #422 / Iowa City              |731 S Riverside Dr |Iowa City   |
|4779       |Casey's General Store #3782 / Mount Ayr|1305 E SOUTH STREET|Mount Ayr   |
|4873       |Casey's General Store #2559 / Granger  |1802, Sycamore St  |Granger     |
|5061       |Casey's General Store #3382 / Cedar    |560, 33rd Ave SW   |Cedar Rapids|
|4767       |Casey's General Store #3075 / Ankeny   |6981 NE 14th St    |Ankeny      |
|5382       |Casey's General Store # 2870/ Altoona  |1419 1st Ave N     |Altoona     |
|5357       |Gameday Liquor/ Orange City            |632 8th St. SE     |Orange City |
|4140       |Fareway Stores #900 / Euclid           |100 Euclid Ave     |Des Moines  |
+-----------+---------------------------------------+-------------------+------------+

In [131]:
df.filter(df.item_num=='11297').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+---------------------------+-------------------+
|item_num|Item Description           |Category Name Clean|
+--------+---------------------------+-------------------+
|11297   |Crown Royal Canadian Whisky|Canadian Whiskey   |
|11297   |Crown Royal                |Canadian Whiskey   |
+--------+---------------------------+-------------------+

In [132]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_imp['11297'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+--------------------------------+-------------------+--------------+
|store_num_b|Name                            |Address            |City          |
+-----------+--------------------------------+-------------------+--------------+
|5236       |Giggle Juice Liquor Station, LLC|600 W Main St      |Anamosa       |
|5140       |A to Z Liquor                   |114 South  2nd  St |Cherokee      |
|3390       |Okoboji Avenue Liquor           |1610 Okoboji Avenue|Milford       |
|2191       |Keokuk Spirits                  |1013 Main          |Keokuk        |
|5068       |Tobacco Shop / Arnolds Park     |612 S Highway 71   |Arnolds Park  |
|5102       |Wilkie Liquors                  |724 1st  St  SE    |Mount Vernon  |
|2465       |Sid's Beverage Shop             |2727 Dodge St      |Dubuque       |
|4312       |I-80 Liquor / Council Bluffs    |2411 S 24TH ST #1  |Council Bluffs|
|3525       |Wines and Spirits               |106 W 2nd St       |Washington    |
|3612       |B and C Liquor / Maquoketa      |509 E Platt        |Maquoketa     |
+-----------+--------------------------------+-------------------+--------------+

Model Fine-Tuning with cross validation-ALS Explicit Model

In [136]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als_cv = ALS(userCol="Store", itemCol="item_num", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True)

evaluator_cv = RegressionEvaluator(metricName='rmse', labelCol='rating')
In [143]:
# build parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(als_cv.maxIter, [10, 15, 20]) \
    .addGrid(als_cv.rank, [10, 15, 20]) \
    .addGrid(als_cv.regParam, [0.01, 0.001]) \
    .build()
In [144]:
crossval = CrossValidator(estimator=als_cv,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator_cv, 
                          numFolds=5)  
In [ ]:
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
In [ ]:
# Apply model on the test set to predict 
prediction_cv = cvModel.transform(test)
In [220]:
prediction_cv = prediction_cv.join(liquor_store_clean,liquor_store_clean.store_num_b== predictions.Store)
prediction_cv.show(10, truncate = False)
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale|prediction|store_num_b|Name                                      |Store Status|Address              |City           |State|Zip Code|Store Address                       |Report Date|
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|3698 |10623   |0.0   |4.750013577122141E-6    |1.127011  |3698       |Ingersoll Wine Merchants                  |A           |1300  50th St        |West Des Moines|IA   |50266   |POINT (-93.771159 41.590871)        |04/01/2020 |
|4295 |10623   |0.0   |3.65156058143765E-5     |0.9814801 |4295       |Nat Food Mart #1 / Hubbell                |I           |3804 Hubbell         |Des Moines     |IA   |50317   |POINT (-93.541351 41.630849)        |04/01/2020 |
|4545 |10623   |0.0   |1.8921052790215288E-5   |0.8427552 |4545       |Fill R Up                                 |I           |14151 Wabash Ave     |Council Bluffs |IA   |51503   |POINT (-95.81790900000001 41.220431)|04/01/2020 |
|2290 |10623   |0.0   |7.3636855070090425E-6   |0.35837775|2290       |Al's Liquors Unlimited                    |I           |215 West Milwaukee   |Storm Lake     |IA   |50588   |POINT (-95.202405 42.647492)        |04/01/2020 |
|4077 |10623   |0.0   |1.8700788475274923E-5   |0.7510376 |4077       |Xo Food And Liquor                        |A           |428 Franklin St      |Waterloo       |IA   |50703   |POINT (-92.33464100000002 42.501602)|04/01/2020 |
|3917 |10623   |0.0   |2.5384333237112743E-6   |0.9247605 |3917       |Smokin' Joe's #2 Tobacco and Liquor Outlet|A           |1606 W Locust St     |Davenport      |IA   |52804   |POINT (-90.599037 41.538254)        |04/01/2020 |
|5487 |10623   |1.0   |9.288992031104113E-5    |1.6438034 |5487       |Independence Liquor & Food                |I           |1761 Independence Ave|Waterloo       |IA   |50707   |POINT (-92.302514 42.498404)        |04/01/2020 |
|4046 |10623   |0.0   |1.7309503054682417E-5   |1.1092533 |4046       |J and K Market                            |I           |113 W Vanburen       |Centerville    |IA   |52544   |POINT (-92.874795 40.734946)        |04/01/2020 |
|4452 |15727   |0.0   |2.8354200819844552E-5   |0.8519919 |4452       |Select Mart    Gordon Dr                  |A           |2825 Gordon Dr       |Sioux City     |IA   |51105   |POINT (-96.372014 42.489632)        |04/01/2020 |
|5001 |15727   |0.0   |1.9929193655094986E-5   |0.47148913|5001       |The Music Station / Independence          |A           |709 First St W       |Independence   |IA   |50644   |POINT (-91.901687 42.468633)        |04/01/2020 |
+-----+--------+------+------------------------+----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
only showing top 10 rows

Model evaluation

In [219]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating')
rmse_cv = evaluator.evaluate(prediction_cv)
print("Root-mean-square error = " + str(rmse_cv))
Root-mean-square error = 0.762046505889
In [188]:
sorted(zip(cvModel.avgMetrics, paramGrid))[0][1]
Out[188]:
{Param(parent=u'ALS_4f30acbd8c3ed1e4abd9', name='maxIter', doc='max number of iterations (>= 0).'): 20,
 Param(parent=u'ALS_4f30acbd8c3ed1e4abd9', name='rank', doc='rank of the factorization'): 10,
 Param(parent=u'ALS_4f30acbd8c3ed1e4abd9', name='regParam', doc='regularization parameter (>= 0).'): 0.01}
In [161]:
params = [{p.name: v for p, v in m.items()} for m in cvModel.getEstimatorParamMaps()]
In [194]:
# All grid search results
cv_rmse_df = pd.DataFrame.from_dict({"params": params, cvModel.getEvaluator().getMetricName():cvModel.avgMetrics}).sort_values(by="rmse")
In [195]:
cv_rmse_df
Out[195]:
params rmse
12 {u'regParam': 0.01, u'rank': 10, u'maxIter': 20} 0.769431
14 {u'regParam': 0.01, u'rank': 15, u'maxIter': 20} 0.770444
6 {u'regParam': 0.01, u'rank': 10, u'maxIter': 15} 0.770842
8 {u'regParam': 0.01, u'rank': 15, u'maxIter': 15} 0.772819
0 {u'regParam': 0.01, u'rank': 10, u'maxIter': 10} 0.773809
16 {u'regParam': 0.01, u'rank': 20, u'maxIter': 20} 0.775638
2 {u'regParam': 0.01, u'rank': 15, u'maxIter': 10} 0.776515
10 {u'regParam': 0.01, u'rank': 20, u'maxIter': 15} 0.777429
4 {u'regParam': 0.01, u'rank': 20, u'maxIter': 10} 0.780080
13 {u'regParam': 0.001, u'rank': 10, u'maxIter': 20} 0.781602
7 {u'regParam': 0.001, u'rank': 10, u'maxIter': 15} 0.784289
1 {u'regParam': 0.001, u'rank': 10, u'maxIter': 10} 0.790510
15 {u'regParam': 0.001, u'rank': 15, u'maxIter': 20} 0.793520
9 {u'regParam': 0.001, u'rank': 15, u'maxIter': 15} 0.796225
3 {u'regParam': 0.001, u'rank': 15, u'maxIter': 10} 0.801525
17 {u'regParam': 0.001, u'rank': 20, u'maxIter': 20} 0.803200
11 {u'regParam': 0.001, u'rank': 20, u'maxIter': 15} 0.804655
5 {u'regParam': 0.001, u'rank': 20, u'maxIter': 10} 0.806767
In [218]:
#  parameters for best model
for k,v in cv_rmse_df.params[cv_rmse_df.rmse == cv_rmse_df.rmse.min()].values[0].items():
    print k, v
regParam 0.01
rank 10
maxIter 20
In [258]:
# RMSE for best model
cv_rmse_df.rmse.min()
Out[258]:
0.7694314028344444

Use best model to produce recommendations

In [233]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als_cv = ALS(maxIter=20, regParam=0.01, rank = 10, userCol="Store", itemCol="item_num", ratingCol="rating",
          coldStartStrategy="drop", nonnegative = True)
In [234]:
#fit model
als_cv_model = als_cv.fit(training)
In [235]:
# Apply model on the test set to predict 
als_cv_predictions = als_cv_model.transform(test)
In [237]:
userRecs_cv = als_cv_model.recommendForAllUsers(10)
In [246]:
userRecs_cv.show(3, truncate=False)
+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Store|recommendations                                                                                                                                                                                              |
+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4900 |[[926875, 9.377141], [935969, 8.673643], [940570, 8.64259], [903786, 8.1399555], [902349, 7.568014], [904577, 7.545292], [965269, 7.2390184], [903722, 7.0344067], [902491, 6.223467], [901882, 5.6931076]]  |
|5300 |[[926875, 13.615356], [903881, 9.838879], [903722, 9.159977], [965266, 7.813334], [989488, 7.7378144], [989458, 7.5569925], [904921, 7.258367], [904577, 7.0953326], [965269, 7.0501237], [969617, 7.015012]]|
|5803 |[[926875, 13.257656], [989458, 12.766717], [904921, 10.984689], [967275, 10.531608], [903881, 10.195558], [956366, 9.701281], [977534, 9.577065], [902780, 9.366842], [902156, 9.176537], [901065, 9.1558]]  |
+-----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

In [242]:
liquorRecs_cv = als_cv_model.recommendForAllItems(10)
In [247]:
liquorRecs_cv.show(3, truncate=False)
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|item_num|recommendations                                                                                                                                                                                   |
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|21220   |[[5894, 3.4472413], [9922, 3.1110659], [4834, 2.9522429], [5729, 2.940638], [2690, 2.8818986], [9902, 2.8748298], [2688, 2.747865], [5855, 2.7437987], [2707, 2.739994], [5742, 2.6964037]]       |
|32460   |[[9010, 3.9188008], [9910, 3.8851376], [9014, 3.687474], [5894, 3.6865988], [9919, 3.3773234], [5053, 3.1073449], [9042, 3.065691], [9931, 2.9080954], [5729, 2.8365765], [5095, 2.8064992]]      |
|64590   |[[9922, 0.86530656], [5595, 0.7800746], [5501, 0.7376865], [5209, 0.7194774], [9001, 0.71321666], [2690, 0.70833623], [2706, 0.707105], [2708, 0.70677423], [5507, 0.70655644], [2703, 0.7050432]]|
+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

In [250]:
# collect the recommendations to top 10 stores by total sales in dollars
rec_liquor_top10_store_cv ={}
for store in top10_store_list:
    rec_liquor = userRecs_cv.where(userRecs_cv.Store == store).select("recommendations").collect()
    rec_liquor_top10_store_cv[store] = [i.item_num for i in rec_liquor[0]["recommendations"]]
In [255]:
rec_liquor_top10_store_cv
Out[255]:
{2190: [940570,
  935969,
  903786,
  926875,
  903638,
  902349,
  933281,
  902156,
  34422,
  15627],
 2512: [902156,
  933281,
  903638,
  15627,
  19477,
  34007,
  28867,
  43127,
  34422,
  19067],
 2633: [933281,
  15627,
  41694,
  937882,
  67527,
  82847,
  25607,
  69947,
  28867,
  10627],
 3354: [904827,
  933281,
  902209,
  904884,
  989458,
  940718,
  915962,
  67706,
  987407,
  906266],
 3385: [933281,
  902780,
  43127,
  15627,
  25607,
  28867,
  34007,
  19067,
  42717,
  67527],
 3420: [933281,
  904827,
  902209,
  940718,
  915962,
  989458,
  956366,
  67706,
  906266,
  903707],
 3773: [933281,
  903638,
  15627,
  34457,
  19477,
  69947,
  902156,
  28867,
  28233,
  67527],
 3814: [926875,
  940570,
  977534,
  901230,
  903265,
  902156,
  977476,
  904921,
  935969,
  68703],
 3952: [933281,
  41694,
  937882,
  15627,
  67527,
  82847,
  25607,
  10627,
  43127,
  69947],
 4829: [933281,
  15627,
  903638,
  41694,
  19477,
  69947,
  34457,
  67527,
  28867,
  82847]}
In [125]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2190).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-------------------------+
|store_num_b|Name                     |
+-----------+-------------------------+
|2190       |Central City Liquor, Inc.|
+-----------+-------------------------+

In [251]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv[2190]))\
.show(truncate = False)
+-----------+---------------------------------+----------------------+
|Item Number|Item Description                 |Category Name Clean   |
+-----------+---------------------------------+----------------------+
|34422      |Grey Goose                       |Imported Vodka        |
|902349     |Hiram Walker Blackberry Brandy   |Blackberry Brandy     |
|935969     |Stolichnaya Hot                  |Imported Vodka - Misc |
|933281     |Seagram's Twisted Pineapple Gin  |Flavored Gin          |
|926875     |Templeton Rye Special Reserve 6YR|Blended Whiskey       |
|903638     |Merry's Irish Cream              |Cream Liqueur         |
|15627      |Jameson                          |Irish Whiskey         |
|935969     |Stolichnaya Hot                  |null                  |
|940570     |Skinnygirl Bare Naked Vodka      |100 Proof Vodka       |
|926875     |Templeton Rye Special Reserve 8YR|Blended Whiskey       |
|940570     |Skinnygirl Bare Naked            |American Vodka        |
|903786     |Distillers Gin No. 6             |Special Order Item    |
|34422      |Grey Goose Vodka                 |Imported Vodka        |
|902349     |Hiram Walker Blackberry Brandy   |Special Order Item    |
|903786     |Distillers Gin No. 6             |American Dry Gin      |
|903638     |Merry's Irish Cream              |Special Order Item    |
|940570     |Skinnygirl Bare Naked Vodka      |American Vodka        |
|902156     |Ice Hole Salty Caramel Schnapps  |Miscellaneous Schnapps|
+-----------+---------------------------------+----------------------+

In [134]:
liquor_store_clean.filter(liquor_store_clean.store_num_b== 2512).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------------------------+
|store_num_b|Name                               |
+-----------+-----------------------------------+
|2512       |Hy-Vee Wine and Spirits / Iowa City|
+-----------+-----------------------------------+

In [252]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv[2512]))\
.show(truncate = False)
+-----------+-------------------------------+--------------------------------+
|Item Number|Item Description               |Category Name Clean             |
+-----------+-------------------------------+--------------------------------+
|34422      |Grey Goose                     |Imported Vodka                  |
|933281     |Seagram's Twisted Pineapple Gin|Flavored Gin                    |
|28867      |Tanqueray Gin                  |Imported Dry Gin                |
|43127      |Bacardi Superior Rum           |Puerto Rico & Virgin Islands Rum|
|34007      |Absolut Swedish Vodka 80prf    |Imported Vodka                  |
|903638     |Merry's Irish Cream            |Cream Liqueur                   |
|15627      |Jameson                        |Irish Whiskey                   |
|43127      |Bacardi Superior Rum           |White Rum                       |
|34422      |Grey Goose Vodka               |Imported Vodka                  |
|19067      |Jim Beam                       |Straight Bourbon Whiskey        |
|34007      |Absolut Swedish Vodka 80 Prf   |Imported Vodka                  |
|43127      |Bacardi Superior               |White Rum                       |
|19477      |Makers Mark                    |Straight Bourbon Whiskey        |
|903638     |Merry's Irish Cream            |Special Order Item              |
|902156     |Ice Hole Salty Caramel Schnapps|Miscellaneous Schnapps          |
|19477      |Maker's Mark                   |Straight Bourbon Whiskey        |
+-----------+-------------------------------+--------------------------------+

Recommend liquor stores to top 10 best-selling liquor products

In [253]:
rec_store_top10_item_cv ={}
for item in top10_item_list:
    rec_store = liquorRecs_cv.where(liquorRecs_cv.item_num == item).select("recommendations").collect()
    rec_store_top10_item_cv[item] = [i.Store for i in rec_store[0]["recommendations"]]
In [254]:
rec_store_top10_item_cv
Out[254]:
{u'11296': [3696, 4014, 3663, 3719, 3789, 4615, 2507, 3722, 4864, 3660],
 u'11297': [3494, 3524, 3447, 3420, 5178, 3385, 3354, 5144, 3525, 2636],
 u'11788': [3663, 3549, 3719, 3789, 3696, 4346, 3660, 3592, 3769, 4001],
 u'1799': [5053, 5082, 5629, 5178, 5103, 5721, 5144, 5438, 5192, 5518],
 u'26826': [3696, 4014, 3663, 2507, 4615, 3719, 3789, 3722, 4794, 3769],
 u'26827': [3494, 3524, 5178, 5690, 3385, 3447, 5053, 5675, 3420, 5144],
 u'36308': [3663, 3696, 3719, 3789, 3769, 3776, 3592, 3722, 4014, 4001],
 u'43337': [3494, 3549, 3524, 2636, 3525, 3447, 3818, 4512, 4762, 3456],
 u'43338': [5178, 5899, 3696, 2659, 5727, 5278, 3660, 5290, 5282, 5053],
 u'88296': [5178, 4705, 5690, 5016, 4679, 5398, 5310, 4877, 5899, 3524]}
In [129]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11296   |Crown Royal     |Canadian Whiskey   |
+--------+----------------+-------------------+

In [256]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv['11296'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+----------------------------------+-----------------------+------------+
|store_num_b|Name                              |Address                |City        |
+-----------+----------------------------------+-----------------------+------------+
|2507       |Hy-Vee Food Store #1 / Burlington |939 Angular            |Burlington  |
|3660       |Wal-Mart 2935 / Knoxville         |814 W Bell Ave         |Knoxville   |
|3722       |Wal-Mart 1361 / Sioux City        |3400 Singing Hills Blvd|Sioux City  |
|4615       |North Side Liquor & Grocery       |1303 North Federal Ave |Mason City  |
|4864       |CLINTON LIQUOR & CONVENIENCE STORE|1641 S Bluff Blvd      |Clinton     |
|4014       |Wal-Mart 1732 / Denison           |510 C Ave              |Denison     |
|3696       |Wal-Mart 1723 / Des Moines        |5101 SE 14th St        |Des Moines  |
|3719       |Wal-Mart 0581 / Marshalltown      |2802 S Center St       |Marshalltown|
|3663       |Wal-Mart 1496 / Waterloo          |1334 Flammang Dr       |Waterloo    |
|3789       |Wal-Mart 1393 / Oskaloosa         |2203 A Avenue West     |Oskaloosa   |
+-----------+----------------------------------+-----------------------+------------+

In [131]:
df.filter(df.item_num=='11297').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+---------------------------+-------------------+
|item_num|Item Description           |Category Name Clean|
+--------+---------------------------+-------------------+
|11297   |Crown Royal Canadian Whisky|Canadian Whiskey   |
|11297   |Crown Royal                |Canadian Whiskey   |
+--------+---------------------------+-------------------+

In [257]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv['11297'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------+-----------------------+---------------+
|store_num_b|Name                             |Address                |City           |
+-----------+---------------------------------+-----------------------+---------------+
|3524       |Sam's Club 6568 / Ames           |305 Airport Rd         |Ames           |
|3420       |Sam's Club 6344 / Windsor Heights|1101  73rd St          |Windsor Heights|
|3447       |Sam's Club 6432 / Sioux City     |4201 S. York St.       |Sioux City     |
|5144       |Sam's Club 6979 / Ankeny         |4625 SE Delaware Ave   |Ankeny         |
|3494       |Sam's Club 6514 / Waterloo       |210 East Tower Park Dr |Waterloo       |
|3354       |Sam's Club 8238 / Davenport      |3845 Elmore Ave.       |Davenport      |
|3525       |Wines and Spirits                |106 W 2nd St           |Washington     |
|5178       |New York Dollar Stores           |320, N Calhoun St      |West Liberty   |
|2636       |Hy-Vee Wine and Spirits / Hubbell|2310 Hubbell Ave       |Des Moines     |
|3385       |Sam's Club 8162 / Cedar Rapids   |2605 Blairs Ferry Rd NE|Cedar Rapids   |
+-----------+---------------------------------+-----------------------+---------------+

In [ ]:
 
In [255]:
rec_liquor_top10_store_cv
Out[255]:
{2190: [940570,
  935969,
  903786,
  926875,
  903638,
  902349,
  933281,
  902156,
  34422,
  15627],
 2512: [902156,
  933281,
  903638,
  15627,
  19477,
  34007,
  28867,
  43127,
  34422,
  19067],
 2633: [933281,
  15627,
  41694,
  937882,
  67527,
  82847,
  25607,
  69947,
  28867,
  10627],
 3354: [904827,
  933281,
  902209,
  904884,
  989458,
  940718,
  915962,
  67706,
  987407,
  906266],
 3385: [933281,
  902780,
  43127,
  15627,
  25607,
  28867,
  34007,
  19067,
  42717,
  67527],
 3420: [933281,
  904827,
  902209,
  940718,
  915962,
  989458,
  956366,
  67706,
  906266,
  903707],
 3773: [933281,
  903638,
  15627,
  34457,
  19477,
  69947,
  902156,
  28867,
  28233,
  67527],
 3814: [926875,
  940570,
  977534,
  901230,
  903265,
  902156,
  977476,
  904921,
  935969,
  68703],
 3952: [933281,
  41694,
  937882,
  15627,
  67527,
  82847,
  25607,
  10627,
  43127,
  69947],
 4829: [933281,
  15627,
  903638,
  41694,
  19477,
  69947,
  34457,
  67527,
  28867,
  82847]}
In [125]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2190).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-------------------------+
|store_num_b|Name                     |
+-----------+-------------------------+
|2190       |Central City Liquor, Inc.|
+-----------+-------------------------+

In [251]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv[2190]))\
.show(truncate = False)
+-----------+---------------------------------+----------------------+
|Item Number|Item Description                 |Category Name Clean   |
+-----------+---------------------------------+----------------------+
|34422      |Grey Goose                       |Imported Vodka        |
|902349     |Hiram Walker Blackberry Brandy   |Blackberry Brandy     |
|935969     |Stolichnaya Hot                  |Imported Vodka - Misc |
|933281     |Seagram's Twisted Pineapple Gin  |Flavored Gin          |
|926875     |Templeton Rye Special Reserve 6YR|Blended Whiskey       |
|903638     |Merry's Irish Cream              |Cream Liqueur         |
|15627      |Jameson                          |Irish Whiskey         |
|935969     |Stolichnaya Hot                  |null                  |
|940570     |Skinnygirl Bare Naked Vodka      |100 Proof Vodka       |
|926875     |Templeton Rye Special Reserve 8YR|Blended Whiskey       |
|940570     |Skinnygirl Bare Naked            |American Vodka        |
|903786     |Distillers Gin No. 6             |Special Order Item    |
|34422      |Grey Goose Vodka                 |Imported Vodka        |
|902349     |Hiram Walker Blackberry Brandy   |Special Order Item    |
|903786     |Distillers Gin No. 6             |American Dry Gin      |
|903638     |Merry's Irish Cream              |Special Order Item    |
|940570     |Skinnygirl Bare Naked Vodka      |American Vodka        |
|902156     |Ice Hole Salty Caramel Schnapps  |Miscellaneous Schnapps|
+-----------+---------------------------------+----------------------+

In [134]:
liquor_store_clean.filter(liquor_store_clean.store_num_b== 2512).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------------------------+
|store_num_b|Name                               |
+-----------+-----------------------------------+
|2512       |Hy-Vee Wine and Spirits / Iowa City|
+-----------+-----------------------------------+

In [252]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv[2512]))\
.show(truncate = False)
+-----------+-------------------------------+--------------------------------+
|Item Number|Item Description               |Category Name Clean             |
+-----------+-------------------------------+--------------------------------+
|34422      |Grey Goose                     |Imported Vodka                  |
|933281     |Seagram's Twisted Pineapple Gin|Flavored Gin                    |
|28867      |Tanqueray Gin                  |Imported Dry Gin                |
|43127      |Bacardi Superior Rum           |Puerto Rico & Virgin Islands Rum|
|34007      |Absolut Swedish Vodka 80prf    |Imported Vodka                  |
|903638     |Merry's Irish Cream            |Cream Liqueur                   |
|15627      |Jameson                        |Irish Whiskey                   |
|43127      |Bacardi Superior Rum           |White Rum                       |
|34422      |Grey Goose Vodka               |Imported Vodka                  |
|19067      |Jim Beam                       |Straight Bourbon Whiskey        |
|34007      |Absolut Swedish Vodka 80 Prf   |Imported Vodka                  |
|43127      |Bacardi Superior               |White Rum                       |
|19477      |Makers Mark                    |Straight Bourbon Whiskey        |
|903638     |Merry's Irish Cream            |Special Order Item              |
|902156     |Ice Hole Salty Caramel Schnapps|Miscellaneous Schnapps          |
|19477      |Maker's Mark                   |Straight Bourbon Whiskey        |
+-----------+-------------------------------+--------------------------------+

Recommend liquor stores to top 10 best-selling liquor products

In [253]:
rec_store_top10_item_cv ={}
for item in top10_item_list:
    rec_store = liquorRecs_cv.where(liquorRecs_cv.item_num == item).select("recommendations").collect()
    rec_store_top10_item_cv[item] = [i.Store for i in rec_store[0]["recommendations"]]
In [254]:
rec_store_top10_item_cv
Out[254]:
{u'11296': [3696, 4014, 3663, 3719, 3789, 4615, 2507, 3722, 4864, 3660],
 u'11297': [3494, 3524, 3447, 3420, 5178, 3385, 3354, 5144, 3525, 2636],
 u'11788': [3663, 3549, 3719, 3789, 3696, 4346, 3660, 3592, 3769, 4001],
 u'1799': [5053, 5082, 5629, 5178, 5103, 5721, 5144, 5438, 5192, 5518],
 u'26826': [3696, 4014, 3663, 2507, 4615, 3719, 3789, 3722, 4794, 3769],
 u'26827': [3494, 3524, 5178, 5690, 3385, 3447, 5053, 5675, 3420, 5144],
 u'36308': [3663, 3696, 3719, 3789, 3769, 3776, 3592, 3722, 4014, 4001],
 u'43337': [3494, 3549, 3524, 2636, 3525, 3447, 3818, 4512, 4762, 3456],
 u'43338': [5178, 5899, 3696, 2659, 5727, 5278, 3660, 5290, 5282, 5053],
 u'88296': [5178, 4705, 5690, 5016, 4679, 5398, 5310, 4877, 5899, 3524]}
In [129]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11296   |Crown Royal     |Canadian Whiskey   |
+--------+----------------+-------------------+

In [256]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv['11296'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+----------------------------------+-----------------------+------------+
|store_num_b|Name                              |Address                |City        |
+-----------+----------------------------------+-----------------------+------------+
|2507       |Hy-Vee Food Store #1 / Burlington |939 Angular            |Burlington  |
|3660       |Wal-Mart 2935 / Knoxville         |814 W Bell Ave         |Knoxville   |
|3722       |Wal-Mart 1361 / Sioux City        |3400 Singing Hills Blvd|Sioux City  |
|4615       |North Side Liquor & Grocery       |1303 North Federal Ave |Mason City  |
|4864       |CLINTON LIQUOR & CONVENIENCE STORE|1641 S Bluff Blvd      |Clinton     |
|4014       |Wal-Mart 1732 / Denison           |510 C Ave              |Denison     |
|3696       |Wal-Mart 1723 / Des Moines        |5101 SE 14th St        |Des Moines  |
|3719       |Wal-Mart 0581 / Marshalltown      |2802 S Center St       |Marshalltown|
|3663       |Wal-Mart 1496 / Waterloo          |1334 Flammang Dr       |Waterloo    |
|3789       |Wal-Mart 1393 / Oskaloosa         |2203 A Avenue West     |Oskaloosa   |
+-----------+----------------------------------+-----------------------+------------+

In [131]:
df.filter(df.item_num=='11297').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+---------------------------+-------------------+
|item_num|Item Description           |Category Name Clean|
+--------+---------------------------+-------------------+
|11297   |Crown Royal Canadian Whisky|Canadian Whiskey   |
|11297   |Crown Royal                |Canadian Whiskey   |
+--------+---------------------------+-------------------+

In [257]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv['11297'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------+-----------------------+---------------+
|store_num_b|Name                             |Address                |City           |
+-----------+---------------------------------+-----------------------+---------------+
|3524       |Sam's Club 6568 / Ames           |305 Airport Rd         |Ames           |
|3420       |Sam's Club 6344 / Windsor Heights|1101  73rd St          |Windsor Heights|
|3447       |Sam's Club 6432 / Sioux City     |4201 S. York St.       |Sioux City     |
|5144       |Sam's Club 6979 / Ankeny         |4625 SE Delaware Ave   |Ankeny         |
|3494       |Sam's Club 6514 / Waterloo       |210 East Tower Park Dr |Waterloo       |
|3354       |Sam's Club 8238 / Davenport      |3845 Elmore Ave.       |Davenport      |
|3525       |Wines and Spirits                |106 W 2nd St           |Washington     |
|5178       |New York Dollar Stores           |320, N Calhoun St      |West Liberty   |
|2636       |Hy-Vee Wine and Spirits / Hubbell|2310 Hubbell Ave       |Des Moines     |
|3385       |Sam's Club 8162 / Cedar Rapids   |2605 Blairs Ferry Rd NE|Cedar Rapids   |
+-----------+---------------------------------+-----------------------+---------------+

In [ ]:
 

Model Fine-Tuning with cross validation-ALS Implicit Model

In [41]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics


# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als_imp_cv = ALS(implicitPrefs=True, userCol="Store", itemCol="item_num", 
              ratingCol="perc_of_store_total_sale",
          coldStartStrategy="drop", nonnegative = True)

evaluator_cv = RegressionEvaluator(metricName='rmse', labelCol='perc_of_store_total_sale')
In [42]:
# build parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(als_imp_cv.maxIter, [10, 15, 20]) \
    .addGrid(als_imp_cv.rank, [10, 15, 20]) \
    .addGrid(als_imp_cv.regParam, [0.01, 0.001]) \
    .build()
In [ ]:
crossval_imp = CrossValidator(estimator=als_imp_cv,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator_cv, 
                          numFolds=5)  
In [ ]:
# Run cross-validation, and choose the best set of parameters.
cvModel_imp = crossval_imp.fit(training)
In [ ]:
# Apply model on the test set to predict 
prediction_cv_imp = cvModel_imp.transform(test)
In [ ]:
prediction_cv_imp = prediction_cv_imp.join(liquor_store_clean,liquor_store_clean.store_num_b== prediction_cv_imp.Store)
prediction_cv_imp.show(10, truncate = False)
+-----+--------+------+------------------------+-----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|Store|item_num|rating|perc_of_store_total_sale|prediction |store_num_b|Name                                      |Store Status|Address              |City           |State|Zip Code|Store Address                       |Report Date|
+-----+--------+------+------------------------+-----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
|4077 |10623   |0.0   |1.8700788475274926E-5   |0.06517759 |4077       |Xo Food And Liquor                        |A           |428 Franklin St      |Waterloo       |IA   |50703   |POINT (-92.33464100000002 42.501602)|04/01/2020 |
|3698 |10623   |0.0   |4.750013577122141E-6    |0.04848771 |3698       |Ingersoll Wine Merchants                  |A           |1300  50th St        |West Des Moines|IA   |50266   |POINT (-93.771159 41.590871)        |04/01/2020 |
|4295 |10623   |0.0   |3.65156058143765E-5     |0.090148866|4295       |Nat Food Mart #1 / Hubbell                |I           |3804 Hubbell         |Des Moines     |IA   |50317   |POINT (-93.541351 41.630849)        |04/01/2020 |
|2290 |10623   |0.0   |7.363685507009043E-6    |0.076922834|2290       |Al's Liquors Unlimited                    |I           |215 West Milwaukee   |Storm Lake     |IA   |50588   |POINT (-95.202405 42.647492)        |04/01/2020 |
|4545 |10623   |0.0   |1.892105279021529E-5    |0.023556076|4545       |Fill R Up                                 |I           |14151 Wabash Ave     |Council Bluffs |IA   |51503   |POINT (-95.81790900000001 41.220431)|04/01/2020 |
|3917 |10623   |0.0   |2.5384333237112735E-6   |0.05065586 |3917       |Smokin' Joe's #2 Tobacco and Liquor Outlet|A           |1606 W Locust St     |Davenport      |IA   |52804   |POINT (-90.599037 41.538254)        |04/01/2020 |
|4046 |10623   |0.0   |1.7309503054682413E-5   |0.028931752|4046       |J and K Market                            |I           |113 W Vanburen       |Centerville    |IA   |52544   |POINT (-92.874795 40.734946)        |04/01/2020 |
|5487 |10623   |1.0   |9.288992031104113E-5    |0.037823807|5487       |Independence Liquor & Food                |I           |1761 Independence Ave|Waterloo       |IA   |50707   |POINT (-92.302514 42.498404)        |04/01/2020 |
|5728 |15727   |1.0   |9.088837012728021E-5    |0.121096015|5728       |Beer Thirty Denison                       |A           |923 4th Avenue South |Denison        |IA   |51442   |POINT (-95.360162 42.012412)        |04/01/2020 |
|2551 |15727   |0.0   |3.7963324641862363E-6   |0.12539591 |2551       |Hy-Vee Food Store / Chariton              |A           |2001 West Court      |Chariton       |IA   |50049   |null                                |04/01/2020 |
+-----+--------+------+------------------------+-----------+-----------+------------------------------------------+------------+---------------------+---------------+-----+--------+------------------------------------+-----------+
only showing top 10 rows

Model evaluation

In [ ]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='perc_of_store_total_sale')
rmse_cv_imp = evaluator.evaluate(prediction_cv_imp)
print("Root-mean-square error = " + str(rmse_cv_imp))
Root-mean-square error = 0.481845640967
In [ ]:
sorted(zip(cvModel_imp.avgMetrics, paramGrid))[0][1]
Out[ ]:
{Param(parent=u'ALS_44dda395bf35493d2cfe', name='rank', doc='rank of the factorization'): 10,
 Param(parent=u'ALS_44dda395bf35493d2cfe', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
 Param(parent=u'ALS_44dda395bf35493d2cfe', name='maxIter', doc='max number of iterations (>= 0).'): 10}
In [ ]:
params = [{p.name: v for p, v in m.items()} for m in cvModel_imp.getEstimatorParamMaps()]
In [ ]:
# All grid search results
cv_rmse_df_imp = pd.DataFrame.from_dict({"params": params, cvModel_imp.getEvaluator().getMetricName():cvModel_imp.avgMetrics}).sort_values(by="rmse")
In [ ]:
cv_rmse_df_imp
Out[ ]:
params rmse
0 {u'regParam': 0.01, u'rank': 10, u'maxIter': 10} 0.383220
6 {u'regParam': 0.01, u'rank': 10, u'maxIter': 15} 0.386976
1 {u'regParam': 0.001, u'rank': 10, u'maxIter': 10} 0.387568
12 {u'regParam': 0.01, u'rank': 10, u'maxIter': 20} 0.388750
7 {u'regParam': 0.001, u'rank': 10, u'maxIter': 15} 0.392382
2 {u'regParam': 0.01, u'rank': 15, u'maxIter': 10} 0.393426
4 {u'regParam': 0.01, u'rank': 20, u'maxIter': 10} 0.394524
13 {u'regParam': 0.001, u'rank': 10, u'maxIter': 20} 0.395031
8 {u'regParam': 0.01, u'rank': 15, u'maxIter': 15} 0.396630
14 {u'regParam': 0.01, u'rank': 15, u'maxIter': 20} 0.398164
10 {u'regParam': 0.01, u'rank': 20, u'maxIter': 15} 0.398677
3 {u'regParam': 0.001, u'rank': 15, u'maxIter': 10} 0.399804
16 {u'regParam': 0.01, u'rank': 20, u'maxIter': 20} 0.400576
5 {u'regParam': 0.001, u'rank': 20, u'maxIter': 10} 0.400603
9 {u'regParam': 0.001, u'rank': 15, u'maxIter': 15} 0.402489
15 {u'regParam': 0.001, u'rank': 15, u'maxIter': 20} 0.404025
11 {u'regParam': 0.001, u'rank': 20, u'maxIter': 15} 0.404243
17 {u'regParam': 0.001, u'rank': 20, u'maxIter': 20} 0.406487
In [ ]:
#  parameters for best model
for k,v in cv_rmse_df_imp.params[cv_rmse_df_imp.rmse == cv_rmse_df_imp.rmse.min()].values[0].items():
    print k, v
regParam 0.01
rank 10
maxIter 10
In [ ]:
# RMSE for best model
cv_rmse_df_imp.rmse.min()
Out[ ]:
0.38322025914585806

Use best model to produce recommendations

In [54]:
# Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
als_cv_imp = ALS(maxIter=10, regParam=0.01, rank = 10, implicitPrefs=True, 
                       userCol="Store", itemCol="item_num", ratingCol="perc_of_store_total_sale",
                   coldStartStrategy="drop", nonnegative = True)
In [55]:
#fit model
als_cv_model_imp = als_cv_imp.fit(training)
In [56]:
# Apply model on the test set to predict 
als_cv_predictions_imp = als_cv_model_imp.transform(test)
In [57]:
userRecs_cv_imp = als_cv_model_imp.recommendForAllUsers(10)
In [58]:
userRecs_cv_imp.show(3, truncate=False)
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Store|recommendations                                                                                                                                                                                            |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|4900 |[[89193, 1.0120503], [49189, 1.0109377], [49185, 1.0046718], [36903, 1.0030771], [48099, 0.9988279], [53213, 0.9939644], [26823, 0.9802368], [11773, 0.977573], [43333, 0.9661156], [35913, 0.96293867]]   |
|5300 |[[26821, 0.7096183], [37994, 0.69815487], [26826, 0.697179], [11290, 0.6882901], [43331, 0.6870129], [64866, 0.68502575], [11294, 0.68019557], [11774, 0.6794355], [64858, 0.67601943], [64864, 0.6743206]]|
|5803 |[[10807, 0.9805303], [64864, 0.97485816], [64858, 0.9744692], [38176, 0.9662294], [87408, 0.9614938], [43331, 0.9189896], [26821, 0.9176314], [36301, 0.917402], [11290, 0.9114194], [43334, 0.9105728]]   |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

In [59]:
liquorRecs_cv_imp = als_cv_model_imp.recommendForAllItems(10)
In [60]:
liquorRecs_cv_imp.show(3, truncate=False)
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|item_num|recommendations                                                                                                                                                                                        |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|21220   |[[2619, 0.22014774], [2614, 0.20923689], [2643, 0.20775041], [2603, 0.20030355], [2665, 0.19950663], [2648, 0.19860943], [2515, 0.19681376], [2512, 0.1955989], [2590, 0.19223362], [2538, 0.19113725]]|
|32460   |[[3773, 0.2162186], [4129, 0.18812409], [2538, 0.18042792], [3869, 0.17780903], [2190, 0.17657104], [2648, 0.17634706], [2619, 0.17563295], [2643, 0.17469415], [4988, 0.17314237], [2666, 0.17279202]]|
|64590   |[[2515, 0.49738222], [2619, 0.48381987], [2605, 0.47958094], [2614, 0.46836323], [2651, 0.46449745], [2590, 0.45988753], [2637, 0.4564313], [2599, 0.44975865], [2571, 0.4431762], [2647, 0.44315928]] |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 3 rows

In [68]:
# collect the recommendations to top 10 stores by total sales in dollars
rec_liquor_top10_store_cv_imp ={}
for store in top10_store_list:
    rec_liquor = userRecs_cv_imp.where(userRecs_cv_imp.Store == store).select("recommendations").collect()
    rec_liquor_top10_store_cv_imp[store] = [i.item_num for i in rec_liquor[0]["recommendations"]]
In [69]:
rec_liquor_top10_store_cv_imp
Out[69]:
{2190: [16850, 18348, 49186, 88291, 5747, 21236, 89610, 18006, 89175, 88186],
 2512: [18348, 21236, 89610, 5133, 22175, 34244, 18006, 16850, 27100, 67595],
 2633: [18348, 16850, 64776, 27605, 27125, 16676, 15658, 21236, 75165, 5747],
 3354: [42717, 34007, 19067, 41694, 82607, 43127, 82847, 65257, 10627, 82787],
 3385: [42717, 34007, 19067, 82607, 41694, 43127, 82847, 65257, 82787, 11297],
 3420: [42717, 34007, 19067, 41694, 82607, 43127, 82847, 65257, 10627, 11297],
 3773: [18348, 16850, 5747, 27125, 15658, 27605, 75165, 27100, 28279, 5635],
 3814: [41694, 19067, 42717, 10627, 34007, 43127, 82607, 86887, 65257, 82847],
 3952: [19067, 41694, 42717, 43127, 10627, 34007, 28867, 67527, 65257, 82607],
 4829: [19067, 38177, 28867, 64776, 41694, 67527, 10627, 42717, 43127, 80096]}
In [70]:
liquor_store_clean.filter(liquor_store_clean.store_num_b==2190).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-------------------------+
|store_num_b|Name                     |
+-----------+-------------------------+
|2190       |Central City Liquor, Inc.|
+-----------+-------------------------+

In [73]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv_imp[2190]))\
.show(truncate = False)
+-----------+---------------------------+-----------------------------+
|Item Number|Item Description           |Category Name Clean          |
+-----------+---------------------------+-----------------------------+
|88186      |Olmeca Altos Plata         |100% Agave Tequila           |
|89610      |Olmeca Altos Reposado      |100% Agave Tequila           |
|89175      |Don Julio Anejo Tequila    |100% Agave Tequila           |
|88291      |Patron Tequila Silver Mini |100% Agave Tequila           |
|89175      |Don Julio Anejo Tequila    |Tequila                      |
|49186      |Remy Martin VSOP           |Imported Brandy              |
|16850      |Blanton's Bourbon          |Straight Bourbon Whiskey     |
|49186      |Remy Martin Vsop Cognac    |Imported Brandy              |
|5747       |Johnnie Walker White Walker|Temporary & Specialty Package|
|18348      |Four Roses Small Batch     |Straight Bourbon Whiskey     |
|16850      |Blantons Bourbon           |Straight Bourbon Whiskey     |
|89610      |Olmeca Altos Reposado      |null                         |
|88186      |Olmeca Altos Plata         |Tequila                      |
|49186      |Remy Martin Vsop Cognac    |Imported Grape Brandy        |
|21236      |Ridgemont Reserve 1792     |Straight Bourbon Whiskey     |
|88291      |Patron Silver Mini         |100% Agave Tequila           |
|89175      |Don Julio Anejo            |100% Agave Tequila           |
|5747       |Johnnie Walker White Walker|Scotch Whiskey               |
|18006      |Buffalo Trace Bourbon      |Straight Bourbon Whiskey     |
|88291      |Patron Tequila Silver Mini |Tequila                      |
+-----------+---------------------------+-----------------------------+
only showing top 20 rows

In [74]:
liquor_store_clean.filter(liquor_store_clean.store_num_b== 2512).select('store_num_b', "Name")\
.show(truncate = False)
+-----------+-----------------------------------+
|store_num_b|Name                               |
+-----------+-----------------------------------+
|2512       |Hy-Vee Wine and Spirits / Iowa City|
+-----------+-----------------------------------+

In [75]:
liquor_product_list.filter(liquor_product_list["Item Number"].isin(rec_liquor_top10_store_cv_imp[2512]))\
.show(truncate = False)
+-----------+-------------------------------------+------------------------+
|Item Number|Item Description                     |Category Name Clean     |
+-----------+-------------------------------------+------------------------+
|89610      |Olmeca Altos Reposado                |100% Agave Tequila      |
|67595      |Patron Xo Cafe                       |Coffee Liqueur          |
|67595      |Patron XO Cafe                       |Coffee Liqueur          |
|22175      |Russells Reserve 10YR                |Straight Bourbon Whiskey|
|22175      |Russell's Reserve 10YR               |Straight Bourbon Whiskey|
|5133       |Glenmorangie 10 Yr Single Malt Scotch|Irish Whiskey           |
|5133       |Glenmorangie 10 Yr Single Malt Scotch|Single Malt Scotch      |
|16850      |Blanton's Bourbon                    |Straight Bourbon Whiskey|
|18348      |Four Roses Small Batch               |Straight Bourbon Whiskey|
|16850      |Blantons Bourbon                     |Straight Bourbon Whiskey|
|89610      |Olmeca Altos Reposado                |null                    |
|27100      |Sazerac Rye                          |Straight Rye Whiskey    |
|21236      |Ridgemont Reserve 1792               |Straight Bourbon Whiskey|
|18006      |Buffalo Trace Bourbon                |Straight Bourbon Whiskey|
|34244      |Chopin Vodka                         |Imported Vodka          |
|89610      |Olmeca Altos Reposado                |Tequila                 |
|5133       |Glenmorangie 10YR                    |Single Malt Scotch      |
+-----------+-------------------------------------+------------------------+

Recommend liquor stores to top 10 best-selling liquor products

In [85]:
rec_store_top10_item_cv_imp ={}
for item in top10_item_list:
    rec_store = liquorRecs_cv_imp.where(liquorRecs_cv_imp.item_num == item).select("recommendations").collect()
    rec_store_top10_item_cv_imp[item] = [i.Store for i in rec_store[0]["recommendations"]]
In [86]:
rec_store_top10_item_cv_imp
Out[86]:
{u'11296': [4140, 4779, 4621, 4678, 5382, 4873, 4767, 5061, 4457, 5357],
 u'11297': [3612, 5236, 5140, 2465, 3390, 4312, 3525, 2191, 5068, 5102],
 u'11788': [4779, 5382, 4621, 5357, 4873, 4767, 5061, 4519, 4457, 4140],
 u'1799': [3731, 3644, 4516, 3719, 3899, 3696, 3645, 3628, 4521, 2556],
 u'26826': [4457, 4779, 4873, 5382, 4353, 4140, 5061, 4427, 4621, 4395],
 u'26827': [3612, 2465, 3390, 4312, 3525, 5140, 2191, 5326, 5068, 5236],
 u'36308': [4140, 4253, 4702, 2562, 4688, 2200, 4703, 4705, 4795, 4678],
 u'43337': [3612, 5236, 4312, 5326, 2465, 3525, 3390, 5068, 5102, 5140],
 u'43338': [5357, 3943, 2200, 5382, 5326, 4621, 4767, 4253, 4779, 3390],
 u'88296': [5326, 5427, 4167, 5251, 3525, 4312, 2465, 5236, 5162, 3612]}
In [87]:
df.filter(df.item_num=='11296').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+----------------+-------------------+
|item_num|Item Description|Category Name Clean|
+--------+----------------+-------------------+
|11296   |Crown Royal     |Canadian Whiskey   |
+--------+----------------+-------------------+

In [88]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv_imp['11296'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+---------------------------------------+-------------------+------------+
|store_num_b|Name                                   |Address            |City        |
+-----------+---------------------------------------+-------------------+------------+
|4621       |Casey's General Store #23 / Maxwell    |100, Main St       |Maxwell     |
|4678       |Fareway Stores #124 / Adel             |804 Nile Kinnick Dr|Adel        |
|4457       |Kum & Go #422 / Iowa City              |731 S Riverside Dr |Iowa City   |
|4779       |Casey's General Store #3782 / Mount Ayr|1305 E SOUTH STREET|Mount Ayr   |
|4873       |Casey's General Store #2559 / Granger  |1802, Sycamore St  |Granger     |
|5061       |Casey's General Store #3382 / Cedar    |560, 33rd Ave SW   |Cedar Rapids|
|4767       |Casey's General Store #3075 / Ankeny   |6981 NE 14th St    |Ankeny      |
|5382       |Casey's General Store # 2870/ Altoona  |1419 1st Ave N     |Altoona     |
|5357       |Gameday Liquor/ Orange City            |632 8th St. SE     |Orange City |
|4140       |Fareway Stores #900 / Euclid           |100 Euclid Ave     |Des Moines  |
+-----------+---------------------------------------+-------------------+------------+

In [91]:
df.filter(df.item_num=='1799').select("item_num", 'Item Description','Category Name Clean').dropDuplicates().show(truncate = False)
+--------+-------------------------------------+-------------------+
|item_num|Item Description                     |Category Name Clean|
+--------+-------------------------------------+-------------------+
|1799    |Captain Morgan Original Spiced Barrel|Spiced Rum         |
|1799    |Captain Morgan Spiced Barrel         |Spiced Rum         |
+--------+-------------------------------------+-------------------+

In [92]:
liquor_store_clean.filter(liquor_store_clean["store_num_b"].isin(rec_store_top10_item_cv_imp['1799'])).select("store_num_b", 
                        "Name", "Address","City").show(truncate = False)
+-----------+--------------------------------------------+-----------------------+---------------+
|store_num_b|Name                                        |Address                |City           |
+-----------+--------------------------------------------+-----------------------+---------------+
|2556       |Hy-Vee Wine and Spirits / Estherville       |1033 Central Avenue    |Estherville    |
|3731       |Wal-Mart 1241 / Davenport                   |5811 Elmore Ave        |Davenport      |
|3899       |Wal-Mart 3762 / WDM                         |6365 Stagecoach Dr     |West Des Moines|
|3645       |Wal-Mart 1764 / Windsor Heights             |1001  73rd St          |Windsor Heights|
|4516       |Casey's General Store #1569 / Oakland       |205 Brown St           |Oakland        |
|3628       |Wal-Mart 1528 / Cedar Rapids                |2645 Blairs Ferry Rd NE|Cedar Rapids   |
|4521       |Casey's General Store #1002 / Grand Junction|504 16th St N          |Grand Junction |
|3644       |Wal-Mart 2764 / Altoona                     |3501  8th St SW        |Altoona        |
|3696       |Wal-Mart 1723 / Des Moines                  |5101 SE 14th St        |Des Moines     |
|3719       |Wal-Mart 0581 / Marshalltown                |2802 S Center St       |Marshalltown   |
+-----------+--------------------------------------------+-----------------------+---------------+

In [ ]:
 
In [ ]: