Broadleaf Microservices
  • v1.0.0-latest-prod

Monitoring & Observability

Overview

Broadleaf supports the OpenTelemetry specification and related open source components for exporting application health, logging, and performance signals. OpenTelemetry is supported by most APM (Application Performance Monitoring) products available today. As a result, Broadleaf can work with the APM of your choosing and provide it with performance signals tailored to Broadleaf implementation needs. The OpenTelemetry support in Broadleaf can also be customized to introduce new log locations, metrics, and trace information not supported out-of-the-box.

Key Concepts

  • A JavaAgent is attached at runtime and is responsible for recording execution flow traces, extracting health metrics, and exporting logs. These signals are correlated (e.g. you should be able to identify the specific logs generated by a given execution trace) and communicated over a single channel using OTLP (the OpenTelemetry Protocol).

  • One or more OpenTelemetry collectors receive OTLP communications, optionally process the information, and export to a backend (e.g. APM) or another collector in a chain.

Requirements

  • To benefit from the JavaAgent, the Broadleaf OpenTelemetry starter must be available on the classpath. The simplest way to achieve this is to set supportOpenTelemetry to true in the Broadleaf project manifest.yml file. Also, if not already present, add the otellgtm element to the list under others, and add the otellgtm name to the comma-delimited list at docker.components. See the manifest.yml sample below.

  • Once the manifest property is set, issuing a mvn flex:generate command from the project manifest directory should result in an update of the project config to include OpenTelemetry support. Most notably, the broadleaf-open-telemetry-starter will be included in the appropriate maven pom files. The Broadleaf OpenTelemetry starter uses Spring autoconfiguration and should automatically set up any Broadleaf microservice flex component for runtime instrumentation.

  • During project Maven builds using the -Pdocker profile, application images will now include JavaAgent support.

  • Issuing the mvn docker-compose:up project command from the manifest directory will now also include a reference APM for demonstration purposes (visible at http://localhost:3030 by default). This is delivered via the Grafana OTEL-LGTM docker image.

manifest.yml
schema:
  version: '1.0'
project:
  groupId: com.example.microservices
  javaMajorVersion: '17'
  packageName: com.example.microservices
  starterParentVersion: 1.0.0-SNAPSHOT
  version: 1.0.0-SNAPSHOT
  supportOpenTelemetry: true (1)
  enableOpenTelemetryInLocal: false (2)
  useAlpineJavaImages: false
...
others:
...
- name: otellgtm (3)
  platform: otellgtm
  descriptor: snippets/docker-otel.yml
  enabled: true
  domain:
    cloud: otellgtm
    docker: otellgtm
    local: localhost
  ports:
    port: '4317'
    addl: '4318'
    web: '3030'
...
docker:
  components: [...],otellgtm (4)
---
1 supportOpenTelemetry[true/false] - Denote to the initializr system that support for OpenTelemetry instrumentation should be included in the project build artifacts. Default value is false.
2 enableOpenTelemetryInLocal[true/false] - Whether OpenTelemetry instrumentation and container monitoring support should be enabled when running the stack locally. When true, the Prometheus instance in otellgtm will scrape various components running in docker-compose to show in the reference dashboard (visible at http://localhost:3030 by default). In addition, when launched, the application flex packages will be instrumented for OpenTelemetry and send telemetry to otellgtm. This setting is useful when you want to review the results of OpenTelemetry configuration, or examine the impact of JavaAgent Customization. Optional. Default value is false.
3 Configuration and a reference to the snippet for the otellgtm reference APM backend must be declared
4 The otellgtm container must be added to the list of components to include during docker-compose launch

Reference APM

Broadleaf includes a reference dashboard compatible with Grafana for viewing overall system health. Similar visualizations are achievable in other APM platforms. It is useful to note the different sections in the dashboard as points of reference.

Figure 1. Cluster Throughput Graph

The cluster throughput graph describes the execution components (including application flex packages), throughput between each, and success/error rate. This is a useful high-level view of performance and health.

Figure 2. Performance
  • The Duration Heatmap shows the density of requests by response time using a colormap. The dots are clickable exemplars for a given response time and expose a link to the full execution trace.

  • The Hot Traces Sampler displays a limited sampling of traces that are greater than or equal to a threshold value (the trace duration min selector at the top of the dashboard).

Figure 3. Health
  • CPU, Memory, and Working Disk are standard metrics around resource consumption for the selected application.

  • GC Pause Duration and GC Pressure show the impact of garbage collection. Notably, GC Pressure displays time spent in GC as a percentage of total elapsed time.

  • Threads and Thread States are standard metrics for thread quantity and status.

  • This information is generally important to understand if your application is utilizing resources in a healthy manner.
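
One common way to express GC pressure is the share of elapsed time the JVM has spent in garbage collection. The sketch below computes that figure from the standard java.lang.management beans. It is only an illustration: the class name GcPressureSample is hypothetical, and the query behind the reference dashboard panel is not shown in this document and may be computed differently (for example, as a rate over a time window).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sample; the reference dashboard's own query may differ. */
final class GcPressureSample {

    /** Returns GC time as a percentage of elapsed time, or 0 when no time has elapsed. */
    static double gcPercent(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    /** Sums the accumulated collection time across all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC pressure: %.2f%%%n", gcPercent(totalGcMillis(), uptime));
    }
}
```

A value that stays high over time suggests the application is spending a meaningful share of its time collecting garbage rather than doing work.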

Figure 4. Connection Pool
  • Connection Pool Size shows the status of the DB connection pool. This is useful for identifying possible connection leakage or over-utilized pools, either of which can contribute to excessive blocking that presents as slow response times without high CPU usage.

  • Total Acquire Time and Total Usage Time show how long processes wait to get a connection and how long they hold on to connections. These are additional data points to help you understand your application’s behavior.

  • This visualization is an important part of the system health picture, but is commonly overlooked in many dashboards.
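
To make the distinction between acquire time and usage time concrete, here is a minimal sketch that uses a Semaphore as a stand-in for a connection pool. TimedPool and its fields are hypothetical names used only for illustration. In a real deployment these metrics come from your connection pool library, not from code like this.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a connection pool; permits play the role of connections. */
final class TimedPool {

    private final Semaphore permits;
    final AtomicLong totalAcquireNanos = new AtomicLong();
    final AtomicLong totalUsageNanos = new AtomicLong();

    TimedPool(int size) {
        this.permits = new Semaphore(size);
    }

    /** Runs the work while holding a permit, timing the wait and the hold separately. */
    void withConnection(Runnable work) {
        long requested = System.nanoTime();
        permits.acquireUninterruptibly(); // blocks while the pool is exhausted
        long acquired = System.nanoTime();
        totalAcquireNanos.addAndGet(acquired - requested);
        try {
            work.run();
        } finally {
            totalUsageNanos.addAndGet(System.nanoTime() - acquired);
            permits.release(); // a connection leak is equivalent to skipping this release
        }
    }

    /** Idle permits; a value that never recovers to the pool size suggests a leak. */
    int available() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        TimedPool pool = new TimedPool(2);
        pool.withConnection(() -> { /* simulated query */ });
        System.out.println("acquire ns: " + pool.totalAcquireNanos.get());
        System.out.println("usage ns: " + pool.totalUsageNanos.get());
        System.out.println("available: " + pool.available());
    }
}
```

Rising acquire time with flat CPU usage points to callers queuing for connections, while rising usage time points to callers holding connections for too long.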

Figure 5. Cache
  • The visualizations in this section are intended for the default Apache Ignite cache.

  • Cache Hit Ratio exposes the hit rate by cache region.

  • Cache Member Count is the count of cache members by cache region.

  • Off Heap demonstrates the amount of off heap memory being used to support the cache.

  • Get Rate and Put Rate expose the in/out rate by cache region.

  • Eviction Rate shows the rate at which cache members are being evicted, by cache region.

  • This information is important to understand if your cache is effective, and where adjustments may be required.

Other Sections

  • Gateways Health shows what level of pressure the microservice gateways are currently under. These metrics are useful in determining when additional scale at the gateway layer may be required.

  • Solr Cluster Health, Kafka Cluster Health, and Zookeeper Cluster Health show the level of pressure being applied to these important supporting components. This is useful in determining when additional scale may be required.

What’s Missing?

Most notably, a visualization for the health of the database is not shown in this dashboard. Most Broadleaf implementations use a cloud native database as the persistence layer for the applications. Acquiring metrics for such components can be handled with OpenTelemetry, but setting that up is outside the scope of this document. Alternatively, APMs often provide facilities for connecting directly to these databases for health information. In either case, once a means of metric input is established, new visualizations can be added to expose the information in the dashboard.

Why is the dashboard so busy?

While there is a lot of information on the dashboard, sections can be collapsed for more targeted viewing. Note, we are of the opinion that being able to see a slice at a point in time that includes all relevant components in the stack is a powerful benefit to the analysis. Without it, obvious contributors may be hidden from view.

What can I do with the reference APM?

While not intended for production in the manner deployed, the experience is feature-rich and can be used to explore capabilities. This includes:

  • Reviewing all shipped logs, including trace correlation

  • Reviewing, filtering, and analyzing distributed traces with log correlation

  • Reviewing system health

  • Viewing Broadleaf-specific routing and audit metadata on traces

  • Seeing trace granularity and dashboard metrics tailored for Broadleaf installations

JavaAgent Customization

Broadleaf’s OpenTelemetry JavaAgent leverages the official OpenTelemetry JavaAgent with additions and tweaks to support Broadleaf’s specific needs. It is generally satisfactory for client applications as it exists out-of-the-box. However, if further customization is desired, developers can use the same facilities Broadleaf uses.

Customize OpenTelemetry JavaAgent

Spring environment properties can be declared for any OpenTelemetry agent-specific system properties (those beginning with otel.). They are passed on when Broadleaf attaches the JavaAgent.
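
As a rough illustration of this pass-through, the sketch below filters a set of configuration entries down to those with the otel. prefix and publishes them as JVM system properties. This is not Broadleaf's implementation: OtelPropertyPassThrough and applyOtelProperties are hypothetical names, and the Map stands in for the Spring environment. The keys otel.service.name and otel.exporter.otlp.endpoint are standard OpenTelemetry agent properties used here only as sample input.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; illustrates the idea only, not Broadleaf's actual code. */
final class OtelPropertyPassThrough {

    private static final String PREFIX = "otel.";

    /** Copies every entry whose key starts with "otel." into JVM system properties. */
    static Map<String, String> applyOtelProperties(Map<String, String> environment) {
        Map<String, String> applied = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : environment.entrySet()) {
            if (entry.getKey().startsWith(PREFIX)) {
                System.setProperty(entry.getKey(), entry.getValue());
                applied.put(entry.getKey(), entry.getValue());
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        // The Map below is a stand-in for properties resolved from the Spring environment.
        Map<String, String> environment = new LinkedHashMap<>();
        environment.put("otel.service.name", "catalog");
        environment.put("otel.exporter.otlp.endpoint", "http://localhost:4317");
        environment.put("spring.application.name", "catalog");
        System.out.println(applyOtelProperties(environment).keySet());
    }
}
```

Only the two otel. entries are forwarded. The spring.application.name entry is left untouched.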

Add or remove classes/methods from instrumentation eligibility

Review the Broadleaf Agent Starter Environment Configuration table below for properties (such as broadleaf.otel.additional-packages and broadleaf.otel.exclusion-regexes) you can define in your Spring configuration to influence instrumentation inclusion.
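
The following sketch shows one way package-prefix inclusion and regex exclusion could combine into a single eligibility decision. It is an assumption-laden illustration, not the agent's actual logic: InstrumentationFilter is a hypothetical class, the candidate string format (class name, a dot, then method name) is assumed, and partial regex matching via find() is assumed as well.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical filter; the real agent's matching rules may differ. */
final class InstrumentationFilter {

    private final List<String> packagePrefixes;
    private final List<Pattern> exclusions;

    InstrumentationFilter(List<String> packagePrefixes, List<String> exclusionRegexes) {
        this.packagePrefixes = List.copyOf(packagePrefixes);
        this.exclusions = exclusionRegexes.stream()
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    /** Eligible when the class is under an included prefix and no exclusion regex matches. */
    boolean isEligible(String className, String methodName) {
        boolean included = packagePrefixes.stream().anyMatch(className::startsWith);
        // Assumption: exclusions are tested against "<class name>.<method name>".
        String candidate = className + "." + methodName;
        boolean excluded = exclusions.stream().anyMatch(p -> p.matcher(candidate).find());
        return included && !excluded;
    }

    public static void main(String[] args) {
        InstrumentationFilter filter = new InstrumentationFilter(
                List.of("com.example.microservices"),   // like broadleaf.otel.additional-packages
                List.of("\\.get[A-Z]\\w*$"));           // like broadleaf.otel.exclusion-regexes
        System.out.println(filter.isEligible("com.example.microservices.OrderService", "submit"));
        System.out.println(filter.isEligible("com.example.microservices.OrderService", "getId"));
    }
}
```

In this sketch the submit method is eligible, while getId is excluded by the getter regex even though its class falls under an included package.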

Customize the Broadleaf trace instrumentation

To introduce custom Spans, you should:

  1. Add one or more com.broadleafcommerce.otel.instrumentation.ltw.ObservationBuilderFactory implementations to your codebase and set broadleaf.otel.enable-custom-observations to true. The factories will be discovered on the classpath at runtime and engaged to return ObservationBuilder instances.

  2. When coding your ObservationBuilder construction, review the javadocs for ObservationBuilder. You have a number of options for how to match classes and methods that should cause a Span to register with OpenTelemetry at time of execution. A simple factory implementation might look like this:

TestObservationBuilderFactory.class
public class TestObservationBuilderFactory implements ObservationBuilderFactory {

    @Override
    public ObservationBuilder create(OTelProperties properties) {
        return new ObservationBuilder(properties, "broadleaf-request") (1)
                .withMatchingInterfaces(
                        "com.broadleafcommerce.data.tracking.core.context.ContextInfoCustomizer"); (2)
    }

}
1 The span name will be broadleaf-request and should appear with this name in the APM visualization for the trace.
2 This example informs the instrumentation to intercept any method calls to any class implementing ContextInfoCustomizer and register a Span at that intercept point.

Customize metrics reported via OTLP

It is especially convenient to interact with Micrometer in Spring implementations. Moreover, with a Micrometer-to-OpenTelemetry bridge installed (as is the case with Broadleaf), it is easy to include Micrometer metrics in OTLP output.

  • If you have some generic metrics you want to report that are not necessarily tied to a logical operation, you can register a bean implementing io.micrometer.core.instrument.binder.MeterBinder in your Spring configuration. Here’s an example that already exists in the Broadleaf codebase:

SystemLevelMetrics.class
public class SystemLevelMetrics implements MeterBinder {

    @Override
    public void bindTo(MeterRegistry registry) {
        OperatingSystemMXBean osBean =
                (OperatingSystemMXBean) java.lang.management.ManagementFactory
                        .getOperatingSystemMXBean();
        Gauge.builder("system.total.memory.bytes", osBean,
                OperatingSystemMXBean::getTotalMemorySize)
                .tags("type", "system-level")
                .description("Total OS Memory").register(registry);
        Gauge.builder("system.free.memory.bytes", osBean,
                OperatingSystemMXBean::getFreeMemorySize)
                .tags("type", "system-level")
                .description("Total Free OS Memory").register(registry);
        Gauge.builder("system.cpu.load.percentage", osBean,
                item -> Double.valueOf(item.getCpuLoad() * 100).intValue())
                .tags("type", "system-level")
                .description("Aggregate CPU Usage").register(registry);
    }

    @PreDestroy
    public void unregister() {
        final List<Meter> meters = MeterAccess.getMeterRegistry().getMeters();
        for (final Meter meter : meters) {
            if ("system-level".equals(meter.getId().getTag("type"))) {
                MeterAccess.getMeterRegistry().remove(meter);
            }
        }
    }

}
  • You can also influence metrics during code execution in a more direct fashion using the MeterRegistry API.

Increment a counter in Micrometer
try {
    Optional.ofNullable(MeterAccess.getMeterRegistry())
            .ifPresent(registry -> registry.counter("audit.ingestion.processed")
                    .increment(getProcessedCount()));
} catch (Exception e) {
    log.debug("Failed to increment counter", e);
}

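In this example, we access the meter registry and increment the audit.ingestion.processed counter by the processed count each time the flow is executed. The try/catch keeps a metrics failure from interrupting the flow.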

Going To Production

  • While the default reference APM is not intended for production, the out-of-the-box OpenTelemetry tooling is ready to go with some configuration.

  • Developers can utilize the Broadleaf project starter features to generate a complete Helm chart for K8S deployment that includes OpenTelemetry support. This is generally achieved by executing mvn helm:generate from within the project’s manifest directory.

  • The opentelemetry chart in the generated helm directory contains several files that should be edited according to your needs.

    • In opentelemetry/values.yaml, edit the exporter.target value to represent the appropriate OTLP destination for your signals. This may be an internal destination if self-hosting the APM, but will often instead be a destination prescribed by a third-party APM cloud vendor. This should generally be the gRPC endpoint.

    • You may wish to edit opentelemetry/templates/flex.yaml to refine local collection and processing of your signals before forwarding on. This may include adding additional processors like memory limiting and batching.

    • You may wish to edit opentelemetry/templates/supporting.yaml to refine which logs are shipped for applications not covered by the sidecar introduced in flex.yaml. This generally includes logs from supporting components like solr, zk, and kafka.

You should not remove the env processor from the flex.yaml collector definition. This processor is responsible for decorating signals with K8S metadata needed to make sense of visualizations in the APM.

The K8S installation script includes a step for installing the OpenTelemetry Operator. This operator introduces several CRDs (Custom Resource Definitions) that allow more complex automated setup, such as collector sidecar deployment or log collection. Broadleaf leverages this operator to deploy the support codified in flex.yaml and supporting.yaml.

As part of the OpenTelemetry Operator execution, the webhooks require a TLS cert. There are various strategies for providing the cert. The best place to establish your desired configuration for this is in helm/opentelemetry/operator-values.yaml in your project. By default, operator-values.yaml assumes you will not use cert-manager and will leverage a self-signed cert for the webhooks.

The Broadleaf dashboard can be imported into a production Grafana environment, if that is your goal. You can find it at helm/otellgtm/broadleaf-dashboard.json in your project.

What about metrics from my supporting components (zk, solr, kafka)?

So far we’ve discussed shipping several pieces of information:

  1. Application metrics, logs, and traces

  2. Supporting component logs (zk, solr, kafka)

  3. Cloud native database metrics

We still have not discussed how supporting component metrics are made observable in a possibly remote location (e.g. at an APM vendor backend). This is generally achieved with the Prometheus remote_write feature, where the internal Prometheus instance used to scrape the supporting components forwards those metrics to a vendor-hosted Prometheus for visualization on the backend. There are likely several ways to achieve this depending on the vendor of choice. In many cases, editing the values passed to the Prometheus Helm chart will do the trick (see helm/kube-prometheus-stack/blc-values.yaml in the project).

Broadleaf Agent Starter Environment Configuration

| Name | Description | Type | Default |
|---|---|---|---|
| broadleaf.otel.optimize-span-size | Whether to reduce the span size by not including some unnecessary verbose resource attributes. | Boolean | true |
| broadleaf.otel.include-user-attributes | Whether to include user attributes in the span for audit purposes. | Boolean | true |
| broadleaf.otel.enabled | Whether instrumentation is enabled. | Boolean | true |
| broadleaf.otel.use-spring-cloud-stream-instrumentation | Whether Spring Cloud Stream flows should be instrumented. broadleaf.otel.use-broadleaf-instrumentation must also be true. | Boolean | true |
| broadleaf.otel.use-mapping-pipeline-instrumentation | Whether verbose span data should be included during entity mapping pipeline flows. | Boolean | false |
| broadleaf.otel.use-broadleaf-instrumentation | Whether Broadleaf-specific facets should be instrumented. This includes user audit, Micrometer export, Spring Cloud Stream, and Kafka. | Boolean | true |
| broadleaf.otel.include-micrometer-metrics | Whether to include Micrometer metrics in OTLP. broadleaf.otel.use-broadleaf-instrumentation must also be true. | Boolean | true |
| broadleaf.otel.log-transformation | Whether to log the transformation of a class during instrumentation. Probably only useful as a debugging measure. | Boolean | false |
| broadleaf.otel.additional-packages | List of class package prefixes that denote groups of classes that should be included during instrumentation. Included classes will be eligible for Span generation as a result of any configured ObservationBuilder instances. | List | [] |
| broadleaf.otel.exclusion-regexes | A list of regular expressions used to exclude execution stack points from instrumentation. Each regular expression is matched against the fully qualified class name and method name, and can match as generally or specifically as desired. Matches exclude the effect of any eligible ObservationBuilder instances. | List | [] |
| broadleaf.otel.enable-custom-observations | Whether to enable custom observations. When enabled, the system will look for any ObservationBuilderFactory classes on the classpath, instantiate them, and register their output with the Broadleaf instrumentation agent. | Boolean | false |
| broadleaf.otel.use-ok-http-instrumentation | Whether okHttp instrumentation should be used. In general, Broadleaf does not use okHttp. However, there are traces related to k8s lease and metric acquisition that account for a significant amount of noise. | Boolean | false |