Broadleaf provides a standalone Apache Spark training component that is designed to be submitted to and run on any Apache Spark cluster.
The Bought Also Bought model recommends items that are likely to be bought together. The default algorithm looks at different conversion associations such as purchases. If the same customer buys items in the same purchase within a particular time interval (e.g. a day), those items are considered to be "bought together". The model also looks at historical purchases, attributes about the customer (segment, account), and other contextual information about those customer/purchase associations to recommend items that are likely to go together.
This component is provided as a .jar file and can be extended or customized to further enhance the training model for your needs.
<dependency>
    <groupId>com.broadleafcommerce.microservices</groupId>
    <artifactId>broadleaf-recommendation-engine-bought-also-bought-spark-job</artifactId>
    <version>${blc.recommendationengine.boughtalsojob.version}</version>
</dependency>
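The packaged jar can be submitted to the cluster with whatever submission mechanism you already use (for example spark-submit). The following is a minimal sketch using Spark's org.apache.spark.launcher.SparkLauncher API; the Spark home, master URL, and jar path are assumptions for illustration, and you should confirm the correct main class for your version of the job.

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class SubmitBoughtAlsoBoughtJob {
    public static void main(String[] args) throws Exception {
        // Hypothetical paths and URLs -- adjust to your environment.
        SparkAppHandle handle = new SparkLauncher()
                .setSparkHome("/opt/spark")
                .setMaster("spark://localhost:7077")
                .setAppResource("/jobs/broadleaf-recommendation-engine-bought-also-bought-spark-job.jar")
                .setMainClass("com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob")
                .startApplication();

        // Block until the training job reaches a terminal state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(5_000);
        }
        System.out.println("Training job finished with state: " + handle.getState());
    }
}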
The "Bought Also Bought" training job works off of data stored in a relational data store and thus requires connectivity to a DB.
To run the ALS algorithm and generate recommendations, the job specifically looks at the following tables:
blc_application
blc_application_catalog
blc_catalog
blc_catalog_item
blc_catalog_ref
blc_customer_reference
blc_purchase
blc_purchase_item
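To give a sense of how these tables feed the job, here is a minimal sketch of reading one of them (blc_purchase) into a Spark Dataset over JDBC. The connection URL and credentials are assumptions for illustration only; the job's actual data access code may differ.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PurchaseTableReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bought-also-bought-read-sketch")
                .getOrCreate();

        // Hypothetical JDBC connection details -- replace with your own.
        Dataset<Row> purchases = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/recommendationengine")
                .option("dbtable", "blc_purchase")
                .option("user", "recommendation")
                .option("password", "secret")
                .load();

        purchases.printSchema();
        System.out.println("Purchase rows available for training: " + purchases.count());
    }
}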
Once the model has been calculated, the results are stored back into the relational data store, specifically into the following tables:
blc_user_recommendation
blc_user_recommendation_item
Currently, this model will only consider purchases (i.e. blc_purchase records) that are associated with a customer (i.e. have a customer_id). This means that anonymous/guest purchases don’t influence the recommendation model.
Currently, the default implementation "fully cycles" recommendations after a successful training run. That is, after new recommendations have been generated, the blc_user_recommendation and blc_user_recommendation_item tables are truncated and repopulated with the new recommendations.
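As a rough illustration of this "full cycle" behavior, the sketch below writes a recommendation Dataset back over JDBC using Spark's truncate option, so the target table is truncated rather than dropped before being repopulated. The connection details and helper are hypothetical; the actual persistence logic lives inside the training job.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class RecommendationWriteSketch {

    // Hypothetical helper: overwrite a recommendation table with freshly trained results.
    static void fullCycle(Dataset<Row> recommendations, String table) {
        recommendations.write()
                .format("jdbc")
                .option("url", "jdbc:postgresql://localhost:5432/recommendationengine")
                .option("dbtable", table)       // e.g. blc_user_recommendation_item
                .option("user", "recommendation")
                .option("password", "secret")
                .option("truncate", "true")     // truncate instead of drop/recreate
                .mode(SaveMode.Overwrite)       // replace the previous recommendations
                .save();
    }
}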
To create relevant recommendations, an anecdotal rule of thumb is that the user/item interaction matrix should be less than 99% sparse.
As such, the model enforces a minimum and a maximum number of "events" (e.g. purchases) when training:
The minimum number of purchases required per Application is 300.
The maximum number of purchases queried by submit date is 5,000,000; anything over this limit is ignored.
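As a quick illustration of the sparsity rule of thumb, the sketch below computes the sparsity of a user/item interaction matrix and checks it against the 99% guideline along with the purchase-count bounds above. The method, counts, and checks are hypothetical and only demonstrate the arithmetic.

public class SparsityCheckSketch {

    // Sparsity = fraction of user/item cells with no interaction.
    static double sparsity(long users, long items, long interactions) {
        return 1.0 - (double) interactions / ((double) users * items);
    }

    public static void main(String[] args) {
        long users = 10_000;
        long items = 2_500;
        long purchases = 400_000; // observed interactions

        double s = sparsity(users, items, purchases);
        System.out.printf("Sparsity: %.4f%n", s);

        // Illustrative checks mirroring the guidelines described above.
        boolean enoughPurchases = purchases >= 300;
        boolean underQueryCap   = purchases <= 5_000_000;
        boolean denseEnough     = s < 0.99;
        System.out.println("Trainable: " + (enoughPurchases && underQueryCap && denseEnough));
    }
}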
The primary training job is facilitated by the following component:
com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob
This job gathers Spark Datasets from the relational data store and utilizes the org.apache.spark.ml.recommendation.ALS collaborative filtering (Alternating Least Squares) implementation to generate a training model.
More details can be found on the Apache Spark website here: https://spark.apache.org/docs/3.5.3/ml-collaborative-filtering.html
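For reference, this is roughly what training and generating recommendations with Spark's ALS implementation looks like. The ratings Dataset, column names, and hyperparameter values are assumptions for illustration; they are not the defaults used by the Broadleaf job.

import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class AlsTrainingSketch {

    // Hypothetical input: one row per (customerId, itemId, rating) derived from purchase history.
    // ALS expects numeric user and item id columns.
    static Dataset<Row> trainAndRecommend(Dataset<Row> ratings) {
        ALS als = new ALS()
                .setUserCol("customerId")
                .setItemCol("itemId")
                .setRatingCol("rating")
                .setRank(10)
                .setMaxIter(10)
                .setRegParam(0.1)
                .setColdStartStrategy("drop"); // drop users/items unseen during training

        ALSModel model = als.fit(ratings);

        // Top 10 recommended items per user.
        return model.recommendForAllUsers(10);
    }
}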
The default "Bought Also Bought" algorithm will look at all the different customer and purchase associations and develop a rating that would be use to train the model. Certain associations will "boost" a rating, such as if the items were in the same purchase, if the customer is in the same segment or account etc…
The algorithm and boosting applied here can be tweaked by extending the com.broadleafcommerce.recommendationengine.sparkjob.SparkPurchaseHistoryRecommendationJob class to fine-tune your results.
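As a rough illustration of what such boosting might look like, the helper below derives a rating from a base conversion signal and applies multipliers for the kinds of associations described above. The method name, weights, and flags are hypothetical and are not the job's actual implementation.

public class RatingBoostSketch {

    // Hypothetical boost weights -- tune these when extending the training job.
    private static final double SAME_PURCHASE_BOOST = 2.0;
    private static final double SAME_SEGMENT_BOOST  = 1.5;
    private static final double SAME_ACCOUNT_BOOST  = 1.25;

    // Derive an implicit rating for a customer/item association.
    static double boostedRating(double baseRating,
                                boolean inSamePurchase,
                                boolean inSameSegment,
                                boolean inSameAccount) {
        double rating = baseRating;
        if (inSamePurchase) rating *= SAME_PURCHASE_BOOST;
        if (inSameSegment)  rating *= SAME_SEGMENT_BOOST;
        if (inSameAccount)  rating *= SAME_ACCOUNT_BOOST;
        return rating;
    }
}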
In Docker, you can tail the logs of the following containers to view progress and see work being done during the training phase: recommendationenginespark, apachespark, and apachesparkworker.
If you are running locally with the default single-worker Apache Spark cluster running in Docker, you can view the Apache Spark Master console at http://localhost:8080/. This console shows the workers available to the Spark cluster, any running applications (training jobs), as well as any completed jobs.