Broadleaf provides a standalone Apache Spark training component that is designed to be submitted to and run on any Apache Spark cluster.
The Bought Also Bought model recommends items that are likely to be bought together. The default algorithm looks at different conversion associations such as purchases. If the same customer buys items in the same purchase within a particular time interval (e.g. a day), those items are considered to be "bought together". The model also looks at historical purchases, attributes about the customer (segment, account), and other contextual information about those customer/purchase associations to recommend items that are likely to go together.
This component is provided as a .jar file and can be extended or customized to further enhance the training model for your needs.
<dependency>
    <groupId>com.broadleafcommerce.microservices</groupId>
    <artifactId>broadleaf-recommendation-engine-bought-also-bought-spark-job</artifactId>
    <version>${blc.recommendationengine.boughtalsojob.version}</version>
</dependency>
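The packaged jar can be submitted to the cluster with whatever submission mechanism you already use (for example spark-submit). The following is a minimal sketch using Spark's org.apache.spark.launcher.SparkLauncher API; the Spark home, master URL, and jar path are assumptions for illustration, and you should confirm the correct main class for your version of the job.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitBoughtAlsoBoughtJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths and URLs -- adjust to your environment.
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")
                .setMaster("spark://localhost:7077")
                .setAppResource("/jobs/broadleaf-recommendation-engine-bought-also-bought-spark-job.jar")
                .setMainClass("com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob")
                .startApplication();

        // Block until the training job reaches a terminal state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(5_000);
        }
        System.out.println("Training job finished with state: " + handle.getState());
    }
}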
The "Bought Also Bought" training job works off of data stored in a relational data store and thus requires connectivity to a DB.
To run the ALS algorithm and generate recommendations, the job specifically looks at the following tables:
blc_application
blc_application_catalog
blc_catalog
blc_catalog_item
blc_catalog_ref
blc_customer_reference
blc_purchase
blc_purchase_item
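To give a sense of how these tables feed the job, here is a minimal sketch of reading one of them (blc_purchase) into a Spark Dataset over JDBC. The connection URL and credentials are assumptions for illustration only; the job's actual data access code may differ.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PurchaseTableReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bought-also-bought-read-sketch")
                .getOrCreate();

        // Hypothetical JDBC connection details -- replace with your own.
        Dataset<Row> purchases = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/recommendationengine")
                .option("dbtable", "blc_purchase")
                .option("user", "recommendation")
                .option("password", "secret")
                .load();

        purchases.printSchema();
        System.out.println("Purchase rows available for training: " + purchases.count());
    }
}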
Once the model has been calculated, the results are stored back into the relational data store, specifically into the following tables:
blc_user_recommendation
blc_user_recommendation_item
Currently, this model will only consider purchases (i.e. blc_purchase records) that are associated with a customer (i.e. have a customer_id). This means that anonymous/guest purchases don’t influence the recommendation model.
Currently, the default implementation "fully cycles" recommendations after a successful training run. That is, after new recommendations have been generated, the blc_user_recommendation and blc_user_recommendation_item tables are truncated and repopulated with the new recommendations.
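As a rough illustration of this "full cycle" behavior, the sketch below writes a recommendation Dataset back over JDBC using Spark's truncate option, so the target table is truncated rather than dropped before being repopulated. The connection details and helper are hypothetical; the actual persistence logic lives inside the training job.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class RecommendationWriteSketch {

    // Hypothetical helper: overwrite a recommendation table with freshly trained results.
    static void fullCycle(Dataset<Row> recommendations, String table) {
        recommendations.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/recommendationengine")
                .option("dbtable", table)       // e.g. blc_user_recommendation_item
                .option("user", "recommendation")
                .option("password", "secret")
                .option("truncate", "true")     // truncate instead of drop/recreate
                .mode(SaveMode.Overwrite)       // replace the previous recommendations
                .save();
    }
}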
To create relevant recommendations, an anecdotal rule of thumb is that the user/item interaction matrix should be less than 99% sparse.
As such, the model enforces a minimum and a maximum number of "events" (e.g. purchases) when training:
The minimum number of purchases required per Application is 300.
The maximum number of purchases queried by submit date is 5,000,000; anything over this limit is ignored.
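As a quick illustration of the sparsity rule of thumb, the sketch below computes the sparsity of a user/item interaction matrix and checks it against the 99% guideline along with the purchase-count bounds above. The method, counts, and checks are hypothetical and only demonstrate the arithmetic.

public class SparsityCheckSketch {

    // Sparsity = fraction of user/item cells with no interaction.
    static double sparsity(long users, long items, long interactions) {
        return 1.0 - (double) interactions / ((double) users * items);
    }

    public static void main(String[] args) {
        long users = 10_000;
        long items = 2_500;
        long purchases = 400_000; // observed interactions

        double s = sparsity(users, items, purchases);
        System.out.printf("Sparsity: %.4f%n", s);

        // Illustrative checks mirroring the guidelines described above.
        boolean enoughPurchases = purchases >= 300;
        boolean underQueryCap   = purchases <= 5_000_000;
        boolean denseEnough     = s < 0.99;
        System.out.println("Trainable: " + (enoughPurchases && underQueryCap && denseEnough));
    }
}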
The primary training job is facilitated by the following component:
com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob
This job gathers Spark Datasets from the relational data store and utilizes the org.apache.spark.ml.recommendation.ALS collaborative filtering (Alternating Least Squares) implementation to generate a training model.
More details can be found on the Apache Spark website here: https://spark.apache.org/docs/3.5.3/ml-collaborative-filtering.html
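For reference, this is roughly what training and generating recommendations with Spark's ALS implementation looks like. The ratings Dataset, column names, and hyperparameter values are assumptions for illustration; they are not the defaults used by the Broadleaf job.

import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AlsTrainingSketch {

    // Hypothetical input: one row per (customerId, itemId, rating) derived from purchase history.
    // ALS expects numeric user and item id columns.
    static Dataset<Row> trainAndRecommend(Dataset<Row> ratings) {
        ALS als = new ALS()
                .setUserCol("customerId")
                .setItemCol("itemId")
                .setRatingCol("rating")
                .setRank(10)
                .setMaxIter(10)
                .setRegParam(0.1)
                .setColdStartStrategy("drop"); // drop users/items unseen during training

        ALSModel model = als.fit(ratings);

        // Top 10 recommended items per user.
        return model.recommendForAllUsers(10);
    }
}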
The default "Bought Also Bought" algorithm will look at all the different customer and purchase associations and develop a rating that would be use to train the model. Certain associations will "boost" a rating, such as if the items were in the same purchase, if the customer is in the same segment or account etc…
The algorithm and boosting applied here can be tweaked by extending the com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob class to fine-tune your results.
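As a rough illustration of what such boosting might look like, the helper below derives a rating from a base conversion signal and applies multipliers for the kinds of associations described above. The method name, weights, and flags are hypothetical and are not the job's actual implementation.

public class RatingBoostSketch {

    // Hypothetical boost weights -- tune these when extending the training job.
    private static final double SAME_PURCHASE_BOOST = 2.0;
    private static final double SAME_SEGMENT_BOOST  = 1.5;
    private static final double SAME_ACCOUNT_BOOST  = 1.25;

    // Derive an implicit rating for a customer/item association.
    static double boostedRating(double baseRating,
                                boolean inSamePurchase,
                                boolean inSameSegment,
                                boolean inSameAccount) {
        double rating = baseRating;
        if (inSamePurchase) rating *= SAME_PURCHASE_BOOST;
        if (inSameSegment)  rating *= SAME_SEGMENT_BOOST;
        if (inSameAccount)  rating *= SAME_ACCOUNT_BOOST;
        return rating;
    }
}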
In Docker, you can tail the logs of the following containers to view progress and see work being done during the training phase: recommendationenginespark, apachespark, and apachesparkworker.
If you are running locally with the default single-worker Apache Spark cluster running in Docker, you can view the Apache Spark Master console at http://localhost:8080/. This console shows the workers available to the Spark cluster, any running applications (training jobs), as well as any completed jobs.