Broadleaf Microservices
  • v1.0.0-latest-prod

Recommendation Engine Apache Spark Launcher Program

Overview

Broadleaf provides an optional Apache Spark Launcher program that simplifies connectivity and integration with a running Apache Spark cluster. This Spark Launcher program has the following properties and responsibilities:

  • It is a simple Spring Boot application

  • It has a built-in dependency on Broadleaf’s default training model JARs, which can be "submitted" to an Apache Spark cluster

  • It connects to Broadleaf’s Config Service to pull in the application properties and configuration needed by the training model jobs

  • It exposes an endpoint to start the training model job on an Apache Spark cluster

  • It is expected to be containerized and deployed alongside an Apache Spark cluster (e.g., as a sidecar in a Kubernetes cluster)

  • It provides an example Dockerfile that contains the Apache Spark binaries and libraries needed to "submit applications" to a running Spark cluster

Topology

Conceptually, Broadleaf’s Recommendation Engine capabilities can be thought of in 3 distinct "layers":

  • The core Broadleaf Microservices Layer

  • The Data Layer

  • The Model Training Layer (powered by Apache Spark)

The following diagram depicts this in more detail:

Recommendation Engine Topology

Note that the optional Spark Launcher component resides in the Model Training Layer.

How It Works

The Spark Launcher program exposes an endpoint (/recommendation-engine/train-model) to "kick-off" a training model job.

An example cURL might look like:

curl -k -v --location 'https://localhost:8485/recommendation-engine/train-model' \
--request POST \
--header 'X-Context-Request: {"tenantId":"Telco"}' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer eyJh....' \
--data '{
    "type":"BOUGHT_ALSO_BOUGHT"
}'

Typically, we recommend invoking this endpoint via Broadleaf’s Scheduled Job Service. Broadleaf’s starter data inserts a "Recommendation Engine Training Scheduled Job" entry, which by default is set to run nightly. With that said, you can also kick off the job manually via the Admin console. To do that, log into the admin using the master credentials and navigate to Processes > Scheduled Jobs. In the list, you should see a job called "Run Recommendation Training Job". Click on it, and in the right-hand context menu (3 dots) you will see an option to Run Now.

Recommendation Engine Example Flow - Admin Model Training Scheduled Job

Invoking this job will begin the training model process by invoking the endpoint above.

The endpoint runs through any registered handlers and begins the training model process by invoking an org.apache.spark.launcher.SparkLauncher instance.

Under the hood, a Spark Launcher invokes an Apache Spark Submit flow.
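
The Spark Submit flow described above can be sketched as the following command assembly. This is an illustrative sketch only: the master URL and main class are hypothetical assumptions, while the JAR names and ports are taken from the Dependencies section and the manifest example elsewhere in this document. The real invocation is built by SparkLauncherService.

```shell
# Sketch of the spark-submit call the launcher effectively performs under the
# hood. Master URL and main class are assumptions for illustration; the actual
# values come from the launcher's configuration (see SparkLauncherService).
SPARK_MASTER="spark://spark-master:7077"      # assumed cluster URL
MAIN_CLASS="com.example.BoughtAlsoBoughtJob"  # hypothetical main class
JOB_JAR="broadleaf-recommendation-engine-bought-also-bought-spark-job-x.y.z.jar"
EXTRA_JARS="postgresql-42.7.3.jar,broadleaf-recommendation-engine-core-lib-x.y.z.jar,ulidj-1.0.4.jar"

# Assemble (and echo, rather than execute) the command so this sketch does not
# require a running Spark cluster. Ports match the manifest's driver and
# blockManager entries.
CMD="spark-submit --master ${SPARK_MASTER} \
  --class ${MAIN_CLASS} \
  --jars ${EXTRA_JARS} \
  --conf spark.driver.port=44720 \
  --conf spark.blockManager.port=44721 \
  ${JOB_JAR}"
echo "${CMD}"
```

The extra JARs are distributed to every node in the cluster so the training job can load the DB driver and Broadleaf's common libraries at runtime.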

The parameters and variables (such as DB connectivity details) that are necessary for the training job and passed along to the Spark Submit call can be reviewed in more detail in this class: com.broadleafcommerce.recommendationengine.sparklauncher.service.SparkLauncherService

More details around launching Spark applications can be found in the Apache Spark documentation.

Note
the default DB credentials that are passed to Spark Submit are injected into the application on startup using Broadleaf’s Spring Cloud Config Service (along with other configuration parameters).
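
For a rough sense of what such injected configuration might look like, the fragment below uses standard Spring Boot datasource property names; the keys and values shown are illustrative assumptions, not Broadleaf's actual property names.

```yaml
# Hypothetical config-served datasource properties (illustrative only)
spring:
  datasource:
    url: jdbc:postgresql://postgres:5432/recommendation
    username: recommendation_user
    password: '{cipher}AQB...'   # typically stored encrypted in the config repo
```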

Dependencies

As a convenience, this launcher program also defines a few dependencies that are passed into "Spark Submit" for the training model job. This includes things like:

  • A Postgres Driver JAR (postgresql-42.7.3.jar)

  • A Broadleaf Common Lib Jar (broadleaf-recommendation-engine-core-lib-x.y.z.jar)

  • A ULID Jar (ulidj-1.0.4.jar)

  • The "Bought Also Bought" training model JAR (broadleaf-recommendation-engine-bought-also-bought-spark-job-x.y.z.jar)

Important
If you are running a DB other than Postgres, you will need to customize the launcher (following the directions in the next section) to include the appropriate driver for connectivity to your specific DB. This driver is passed to Apache Spark and distributed to all the nodes in the cluster running the training model.

Customization

Broadleaf provides the source for this launcher component so that implementations have the freedom to augment and change the default functionality for their needs.

To do this, set editSource to true in the recommendationenginespark component of your manifest.yml:

supporting:
- name: recommendationenginespark
  platform: recommendationenginespark
  enabled: true
  domain:
    cloud: recommendationenginespark
    docker: recommendationenginespark
    local: localhost
  ports:
    port: '8485'
    sparkui: '4040'
    driver: '44720'
    blockManager: '44721'
  editSource: false <---------- change this to true