Broadleaf Microservices
  • v1.0.0-latest-prod

Apache Solr

Overview

Broadleaf Commerce generally makes use of Apache Solr as its default Search Engine. Apache Solr is a general purpose Search Engine providing rich capabilities around indexing, searching, and querying data, including "fuzzy" logic that assists in free text queries. Specifically, we use Solr as the back-end Search Engine to power Broadleaf’s Search Services. Apache Solr allows for a flexible schema with loosely defined fields and documents. Documents - typically JSON or XML - can have one or more fields, are ingested by Solr, indexed, and then ready to be searched. As a result, Solr has some overlap with a document-oriented database. However, Solr is not meant to replace a database. Rather, it is meant to provide advanced, scalable search features that augment the capabilities of a database. These advanced features include things like:

  • Tokenization - Breaking phrases into words (or tokens), making all tokens lower case, and normalizing / removing punctuation

  • Stop Words - Removing words (tokens) that aren’t useful in searching (e.g. "a", "an", "and", "are", "as", "at", "by", "but", "if", "it", "of", "or", "the", etc.)

  • Stemming - Determining a stem or root word from tokens (e.g. ["search", "searching", "searched", "searcher"] → "search"), which allows normalization across all four words

  • n-gram - The deconstruction of a token into n tokens starting at a certain point in the character sequence (e.g. "transmission" → ["tra", "tran", "trans", "transm", "transmission"]) so that a search for "trans" would return results containing "transmission".

  • Phonetic Searching - Applying phonetic algorithms such as Double Metaphone to tokens so that a search for "Fenix" would return results containing "Phoenix".

  • Filtering

  • Etc.
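To make these ideas concrete, here is a rough Python sketch of a few of the steps above - tokenization, stop word removal, and edge n-grams. This is only an illustration of the concepts; Solr performs this work with its configurable tokenizer and filter chains, not with code like this:

```python
import re

# A small subset of the stop words listed above.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "but", "if",
              "it", "of", "or", "the"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Discard tokens that aren't useful for searching."""
    return [t for t in tokens if t not in STOP_WORDS]

def edge_ngrams(token, min_size, max_size):
    """n-grams anchored at the start of the token (edge n-grams)."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]

tokens = remove_stop_words(tokenize("The Transmission of a Signal"))
# tokens -> ["transmission", "signal"]
grams = edge_ngrams("transmission", 3, 6)
# grams -> ["tra", "tran", "trans", "transm"]
```

A search for "trans" would match the stored "trans" gram, which is how a query for a prefix can return documents containing "transmission".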

This list is not exhaustive, and each item in the list has a number of flexible parameters that can be configured. Solr provides quite an extensive feature set and a rich collection of possible configurations. The Apache Solr Reference Guide can be found here. Our intent is to document our default use and configuration of Solr. Broadleaf currently provides the following default search and facet functionality with Solr:

  • Catalog Search (Website) - Search capabilities for text like product name, description, SKU, etc.

  • Catalog Browse (Website) - Browse products by clicking on Category links

  • Customer Search (Admin) - Feature to allow admins to search/filter for Customers based on name, address, phone number, email, etc.

  • Order Search (Admin) - Feature to allow admins to search/filter for Orders based on Customer info, date, product names, etc.

Broadleaf clients may wish to use Solr for additional search capabilities of custom data, beyond what Broadleaf provides by default. Customers may also choose to re-configure Solr to alter the way that default Broadleaf entities are queried, filtered, or boosted.

We encourage you to read the Solr documentation for additional details about configuring, deploying, and using Apache Solr.

Configuration

By default, Broadleaf Commerce Microservices uses SolrCloud, which is the clustered deployment option for Solr (it is not a different binary or code base, but rather a deployment option). It allows for multiple indexes, shards, and replicas to be maintained and queried across multiple server nodes, with different levels of redundancy, allowing for massive numbers of documents (records) and massive scalability.

Apache ZooKeeper

Apache ZooKeeper is a general purpose cluster and configuration management service that can be used for many things including centralized configuration, leader election, state management, service discovery, distributed locks, etc. Apache Solr uses Apache ZooKeeper in SolrCloud mode for configuration and cluster state management. SolrJ (see below) also uses ZooKeeper for Solr cluster node discovery and load balancing when connecting to SolrCloud.

Although Solr comes bundled with Apache ZooKeeper, you are strongly encouraged to use an external ZooKeeper cluster in production. ZooKeeper recommends an odd number of nodes (a minimum of 3) for redundancy and failover: three nodes allow a single node to fail, five nodes allow two nodes to fail, and so on.

For a ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority.

For this reason, ZooKeeper deployments are usually made up of an odd number of machines.

— ZooKeeper Administrators Guide
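The majority rule quoted above reduces to a one-line calculation; note how an even ensemble size buys no additional fault tolerance over the next-smaller odd size:

```python
def tolerated_failures(nodes):
    """A ZooKeeper ensemble stays active only while a strict majority of
    its nodes can communicate, so N nodes tolerate floor((N - 1) / 2)
    failures (the 2xF+1 rule from the Administrator's Guide)."""
    return (nodes - 1) // 2

# 3 -> 1, 4 -> 1, 5 -> 2, 6 -> 2: hence the odd-numbered deployments.
tolerance = {n: tolerated_failures(n) for n in (3, 4, 5, 6)}
```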

ZooKeeper nodes can be run on the same server (or in the same container) as Solr nodes. This means that in a production setting you could have a minimum of 3 combined Solr/ZooKeeper nodes, which is enough for many applications. However, because ZooKeeper is a general purpose system for cluster, state, and configuration management, some organizations prefer an enterprise ZooKeeper cluster, or a cluster that supports more than just Solr. This is an architectural and infrastructure choice that can affect infrastructure cost, since it requires additional nodes or machines, so keep that in mind. By default, and to reduce the infrastructure footprint, Broadleaf prescribes running Solr and ZooKeeper together on an odd number of nodes or containers.

When you start Solr in SolrCloud mode, you specify the ZooKeeper addresses in the command, which allows Solr (and SolrJ) to "discover" other nodes:

bin/solr start -cloud -z zkHost1:2181,zkHost2:2181,zkHost3:2181 -h solr1.internal.mycompany.com -p 8983

In the above example, consider the following arguments:

  • -cloud (also -c for short) - Indicates that Solr should be started in SolrCloud mode

  • -z - Indicates a comma-separated list of ZooKeeper addresses (host names and ports)

  • -p - Indicates the port with which to access this Solr node

  • -h - Indicates the hostname with which to access this Solr node

chroot

ZooKeeper has a virtual file system that is distributed across its cluster members. Optionally, you can create a root (a "chroot") under which all related data will live in ZooKeeper. For example, you could create a root called /solr or /broadleaf/solr. You can do that with a Solr utility command:

bin/solr zk mkroot /solr -z zkHost1:2181,zkHost2:2181,zkHost3:2181

To connect using that root, append it to the ZooKeeper (-z) parameter (e.g. /solr):

bin/solr start -cloud -z zkHost1:2181,zkHost2:2181,zkHost3:2181/solr -p 8983

This tells Solr to use the ZooKeeper cluster specified, and use /solr as the root of ZooKeeper’s virtual file system. This helps, especially when using ZooKeeper for more than Solr or even Broadleaf, to keep data and access organized and isolated.

Collections

Apache Solr uses Apache Lucene under the covers: Lucene is the actual search "engine", while Solr provides the server and administrative capabilities on top of it. Lucene creates and searches against indices. In Solr, these indices are called cores; a core and an index are synonymous. However, Solr extends this concept with something called a collection. A collection is a logical, often distributed, index. It can be made up of one or more cores on one or more nodes of a Solr cluster. Here are some terms to consider:

  • Collection: A complete logical index in a SolrCloud cluster. It is made up of one or more shards. If the number of shards is more than one, it is a distributed index, but SolrCloud lets you refer to it by the collection name and not worry about the shards parameter that is normally required for distributed search.

  • Core: A physical index. This could represent the entire collection, or a collection can be made up of many cores.

  • Replica: One copy of a shard. Each replica exists within Solr as a core.

  • Shard: A logical piece (or slice) of a collection. A shard can have 1 or more replicas.

  • Node: A single instance of Solr running within a SolrCloud cluster. It can contain multiple collections, shards, and replicas.

So, collections are logical, often distributed, indexes. Collections contain 1 or more shards. Each shard is made up of 1 or more replicas. Each replica exists within a Solr (Lucene) core, which is a physical index.
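The physical footprint of a collection follows directly from these definitions. As a rough sketch (the arithmetic, not a Solr API):

```python
def total_cores(shards, replication_factor):
    """Each shard is stored as `replication_factor` replicas, and each
    replica lives in a physical core, so a collection occupies
    shards * replication_factor cores spread across the cluster."""
    return shards * replication_factor

# The simple case: 1 shard replicated to each of 3 nodes -> 3 cores,
# one self-contained copy of the whole index per node.
assert total_cores(1, 3) == 3

# A sharded example: 4 shards with 2 replicas each -> 8 cores.
assert total_cores(4, 2) == 8
```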

In order to create a collection, you must specify the number of shards and replicas, as well as a config set. In many cases, the number of shards can be 1. This is especially true when the number of documents (records) in the collection is expected to be less than a few million. The number of replicas can equal the number of nodes. This is the simplest setup because it means that the entire collection fits into a single shard, and there is one replica of that shard per Solr node. This means that data is all stored together and that cross-shard queries and collation will not be required. It also means that load balancing and redundancy will be simpler: if a node goes down, the other nodes have self-contained replicas of the same shard and continue to serve data without interruption. The same automatic failover and load balancing will occur with more complicated shard and replica schemes. It’s just, well, more complicated and less self-contained.

Technically, a shard can contain up to Integer.MAX_VALUE (just over 2.1 billion) documents or records. However, practically, we recommend sizing and planning for a smaller number of documents per shard - a planned maximum in the 10s or 100s of millions, depending on document size, hardware, memory and heap, and OS page cache. Smaller shards will perform better, but more shards means more multi-shard queries and collation.

Config Sets

Each collection requires a config set. A config set is a collection of configuration files that must be uploaded into ZooKeeper and that Solr uses to create a collection. Config sets can contain a number of configuration files, depending on the complexity of the configuration. However, all collections require a solrconfig.xml and a managed-schema.xml. Additional configuration files include stop words, synonyms, and other types of optional files (see the Apache Solr documentation for more information about configurations).

The solrconfig.xml file contains information about how Solr should behave, technically. For example, autoCommit and autoSoftCommit settings are defined there (see below), as are cache settings, buffer sizes, search and indexer component definitions, etc. In most cases, the defaults will be appropriate.

The managed-schema.xml file includes definitions of fields that might be indexed, stored, and retrieved. This is the configuration most likely to directly affect how Broadleaf interacts with Solr. Unlike relational databases, Solr allows for indexing, storing, and retrieving semi-structured JSON or XML documents. They are semi-structured in that they can contain as few as 1 field, or as many as hundreds of fields. Fields, in Solr, can be static or dynamic. Dynamic fields are fields whose names are not explicitly defined, but whose types are known. Consider, for example, this snippet from a Solr schema:

<dynamicField name="*_t" type="text_general" multiValued="false" indexed="true" stored="true"/>
...
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

In the above example, the dynamic field is defined as *_t. It is expected to be a single value. If it were multi-valued, then it could store an array of values in that field. It is indexed, meaning that it will be searchable and facetable and can be used as a predicate for filtering. It is also stored, which means that its original value can be retrieved. If it is not stored, then it can be searched, but not retrieved.

In this example, the type is text_general. This is a more detailed field type definition that includes a tokenizer and filter chain for both indexing and querying. With these two definitions, we can define any number of fields by simply indexing and then searching fields whose names match the pattern. For example, name_t, description_t, someOtherFieldName_t, etc. The point is that Solr allows you to define a field type and then dynamically create fields on documents with arbitrary field names. Consider the following, more complicated, field definition:

<dynamicField name="*_tta" type="text_type_ahead" indexed="true" stored="true"/>
...
<fieldType name="text_type_ahead" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="0" splitOnCaseChange="0" generateWordParts="0" splitOnNumerics="0" preserveOriginal="1" catenateAll="1" catenateWords="1"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="0" generateNumberParts="0" splitOnCaseChange="0" generateWordParts="0" splitOnNumerics="0" preserveOriginal="0" catenateAll="0" catenateWords="0"/>
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replace="all" replacement="$1"/>
    </analyzer>
</fieldType>

In this case, we generically define a field for use in type-ahead functionality. You’ll notice that the index and query analyzers are a bit different from those of a text_general field. Now you can define 2 fields, for 2 different purposes, from the same original data. Consider, for example, a product name like "Green Ghost". When we index the Green Ghost product, we’ll index its name. But we’ll do it in 2 (or more) Solr fields:

  • name_t

  • name_tta

In both cases, "Green Ghost" is the data that is passed in. But each field is indexed and searched differently, for different purposes, or to allow for different levels of relevance.
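An update payload for this document might look like the following sketch. The field names come from the schema snippets above; the "id" unique-key field is an assumption for illustration, not part of those snippets:

```python
import json

# The same source value indexed into two dynamic fields, each analyzed
# by a different field type.
doc = {
    "id": "product-42",         # assumed unique-key field
    "name_t": "Green Ghost",    # analyzed as text_general
    "name_tta": "Green Ghost",  # analyzed as text_type_ahead
}
payload = json.dumps([doc])
# POST the payload to /solr/<collection>/update with
# Content-Type: application/json to index it.
```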

There are field types for integer, long, float, boolean, double, date/time, and others. The most complicated and highly configurable field types are the String and Text fields, because of the ways they can be tokenized and filtered, and because of their locale- and language-specific processing capabilities. A plain String field is case-sensitive and requires an exact match:

<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
...
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>

A simple case-insensitive, string literal field definition would look like this:

<dynamicField name="*_lowers" type="lowercase" multiValued="true" indexed="true" stored="true"/>
...
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

In the above example, *_lowers defines a case-insensitive, string literal, multi-valued dynamic field. So, name_lowers, in our example above, would contain ["green ghost"] - an array of 1 value that has been made lower case. All queries and filters against that field will be converted to lower case prior to accessing the index.
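The behavior of that analyzer chain is simple enough to simulate in a couple of lines (a rough sketch of the concept, not Solr's implementation):

```python
def analyze_lowercase(value):
    """Rough simulation of the `lowercase` field type above:
    KeywordTokenizerFactory emits the entire value as a single token,
    and LowerCaseFilterFactory lowercases it."""
    return [value.lower()]

# "Green Ghost" is stored as a one-element array: ["green ghost"]
tokens = analyze_lowercase("Green Ghost")
```

Because the same analyzer runs at query time, a filter on "GREEN GHOST" or "Green Ghost" matches the same indexed token.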

So Solr schemas are very important to the configuration of a collection. Broadleaf has provided a rich set of default field definitions that will meet the needs of most users.

Uploading config sets to ZooKeeper can be done using command line utilities or SolrJ. Generally, this is best done as a DevOps process rather than a development or application runtime process, so we recommend using the command line utilities, if necessary.

./server/scripts/cloud-scripts/zkcli.sh -zkhost zkHost1:2181 -cmd upconfig -confname catalog -confdir /path/to/my/catalog/confs

This command will upload all files and folders in the /path/to/my/catalog/confs directory to the /configs/catalog directory in ZooKeeper’s virtual file system. Note that you only have to specify 1 ZooKeeper server, since it will propagate the files to the other servers.

From here, you can create a collection with command line utilities, SolrJ, or even a browser (REST URLs). Solr’s Collections API requires you to specify the config set name (catalog), the collection name (catalog_collection_1, for example), the number of shards, and the number of replicas:

http://solrhost1:8983/solr/admin/collections?action=CREATE&name=catalog_collection_1&numShards=1&replicationFactor=3&collection.configName=catalog

The above URL is an example of a Solr Collections API command that creates a collection called catalog_collection_1 with 1 shard and 3 replicas (assuming 3 Solr nodes), using the config set called catalog. Note: see Solr’s Control Script reference documentation for command line utility usage to create collections.

See Solr’s Collection API documentation for more details and parameters, and additional APIs for creating and deleting aliases, and deleting or modifying collections, etc.

Aliases

Solr offers a concept called aliases. An alias is simply another, more logical, name by which to address a collection. This is useful for collection swapping. Broadleaf maintains 2 collections for each type of data it indexes. For example, for catalog data, Broadleaf has a "foreground" collection that represents the data customers are browsing and searching online. If it is necessary to completely refresh and reindex the catalog data, Broadleaf does so with a "background" collection - i.e. an empty copy of the foreground collection that customers are using. Since it’s a background collection (customers are not using it), it can be truncated and refreshed over a period of time without disruption to customers. Once the background collection has been refreshed, we reassign aliases so that the background collection becomes the foreground and vice versa. Consider the following:

  1. There is a collection called catalog_collection_1 with an alias of catalog_alias_1

  2. There is a collection called catalog_collection_2 with an alias of catalog_alias_2

The Broadleaf Search Services always address the foreground collection as catalog_alias_1. When a reindex occurs, it happens against catalog_alias_2. When the reindex process is complete and a commit has been issued to finalize the process and make the data searchable and durable, the collections are re-aliased. This means that we issue an administrative command to Solr to assign the catalog_alias_1 alias to catalog_collection_2 and the catalog_alias_2 alias to catalog_collection_1. This happens very quickly, with no downtime for the customers.
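The re-aliasing step uses the Collections API's CREATEALIAS action, which (re)points an alias at a collection. A sketch of the two requests a swap would issue, using the example host name from earlier:

```python
from urllib.parse import urlencode

SOLR = "http://solrhost1:8983/solr"  # assumed host, as in the CREATE example

def createalias_url(alias, collection):
    """Build a Collections API CREATEALIAS request; if the alias already
    exists, it is repointed at the given collection."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{SOLR}/admin/collections?{urlencode(params)}"

# Swap foreground and background after a full reindex:
swap = [
    createalias_url("catalog_alias_1", "catalog_collection_2"),
    createalias_url("catalog_alias_2", "catalog_collection_1"),
]
```

Since both requests are fast metadata operations in ZooKeeper, the swap completes quickly with no customer-facing downtime.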

Incidentally, incremental re-indexing (the reindexing of specific, discrete documents as a result of changes to catalog records, for example, via the administrative console) happens against the foreground alias. In these cases we issue a soft commit and allow Solr to handle the hard commit using autoCommit, described below.

Broadleaf makes use of 6 collections (and 6 aliases), by default:

  • 2 for Catalog data

  • 2 for Customer data

  • 2 for Order data

Commits and Auto Commits

Unlike database commits, which are supposed to be ACID-compliant and, specifically, isolated to the connection, Solr has a more global "commit" concept. Because connections to Solr are typically stateless and RESTful, commits are not isolated to the client or connection issuing updates (writes).

In fact, Solr has 2 types of commits: soft and hard. Soft commits are all about searchability and hard commits are (mostly) about durability. The reason that hard commits are "mostly" about durability is that they can make updates searchable too. Because commits are global, Solr also provides the capability of doing auto commits, both soft and hard.

Solr’s solrconfig.xml has these configurations:

<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

The autoCommit setting has a maxTime of 15 seconds, meaning that any new writes will be committed and made durable automatically within 15 seconds. The openSearcher property is set to false because opening a new searcher is expensive; its job of making newly committed documents searchable is, in some ways, better left to the soft commit functionality, whose purpose is to make things searchable, but not necessarily durable.

The rule of thumb is that you should usually allow `autoCommit`s to happen automatically.

Broadleaf’s general advice is to leave these properties as they are. You can force a hard commit and open a new searcher at the end of a reindex process. It’s important not to have multiple threads issuing commits at the same time (or often), which is part of the reason that it’s suggested to only handle auto commits. Broadleaf does two types of indexing: Incremental and Full. Incremental indexing means making small changes in near real time (NRT), as needed. An example of this is changes to products that occur via Broadleaf’s administrative console. Full reindexing involves truncating the current index, and then writing updates as quickly as possible to fully restore or refresh the data in that index from another source. During incremental indexing, Broadleaf can issue soft commits from the client (IndexService). Hard commits (making data durable) will happen within 15 seconds of updates. During full reindexing, Broadleaf issues a hard commit at the very end, including a command to open a searcher (making the data both durable and searchable at the end of the process). A good article on the subject of hard vs soft commits can be found here.
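The two commit styles map onto standard parameters of Solr's update handler. A hedged sketch of the parameter sets each indexing mode would send (how the commit is actually issued by the IndexService is an implementation detail):

```python
def commit_params(hard=False, open_searcher=False):
    """Parameters for Solr's update handler (/solr/<collection>/update).
    A soft commit makes recent updates searchable; a hard commit flushes
    them to stable storage and can optionally open a new searcher."""
    if hard:
        return {"commit": "true", "openSearcher": str(open_searcher).lower()}
    return {"softCommit": "true"}

# Incremental (NRT) indexing: soft commit only; autoCommit handles durability.
incremental_commit = commit_params()

# End of a full reindex: hard commit that also opens a new searcher,
# making the refreshed data durable and searchable in one step.
end_of_full_reindex = commit_params(hard=True, open_searcher=True)
```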

Apache SolrJ

Solr exposes RESTful APIs for Searching, Indexing, and Administration of Solr. Apache SolrJ is a Java client library that makes using Solr’s REST APIs easier. In particular, Broadleaf uses SolrCloud - a clustered "flavor" of Solr deployment that uses ZooKeeper for discovery of the cluster nodes. Broadleaf specifically uses CloudSolrClient, which is a subclass of the more generic SolrClient. While interaction with Solr is RESTful and therefore stateless, CloudSolrClient maintains a stateful connection to ZooKeeper which constantly provides the status and availability of each Solr node in the cluster, and other configurations such as which nodes contain certain types of data. As a result, SolrJ and CloudSolrClient handle load balancing and connectivity to appropriate Solr nodes, as needed, and transparently.

SolrJ can be used for searching, filtering, and faceting (reading); (re)indexing (writing); and administration.
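Because SolrJ ultimately issues HTTP requests, it helps to know the shape of the request it builds. A sketch of an equivalent raw request against the select handler; the field names (name_t, category_s) are illustrative only, not Broadleaf's actual schema:

```python
from urllib.parse import urlencode

# A query with a filter, faceting, and paging - the kind of request a
# SolrJ SolrQuery becomes on the wire.
params = {
    "q": "name_t:ghost",         # main (scored) query
    "fq": "category_s:toys",     # filter query: cached separately, no scoring
    "facet": "true",
    "facet.field": "category_s", # facet counts for this field
    "rows": 20,                  # page size
    "start": 0,                  # paging offset
}
query_string = urlencode(params)
# GET http://<solr-node>:8983/solr/<collection>/select?<query_string>
```

With CloudSolrClient, the node to send this request to is chosen for you, based on the cluster state held in ZooKeeper.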

Security

By default, Solr and ZooKeeper are unsecured, meaning that anyone with network access can modify configurations in ZooKeeper, read or modify data in Solr, and even create, modify, or delete collections in Solr. Both Solr and ZooKeeper can be secured via configuration. Please see Solr’s documentation for securing Solr and ZooKeeper.

Docker

Broadleaf offers a default Solr and ZooKeeper installation, complete with all configuration sets, as a Docker container. This can be used for development purposes, or as a reference for creating a new SolrCloud deployment. Note that this Docker container does not have security enabled for Solr or ZooKeeper. For production and other higher-level environments, enabling security is recommended.

Other Search Engines

By default Broadleaf uses Apache Solr as its search engine. We have not yet built integrations with other search appliances. However, Broadleaf Search and Indexing APIs have been designed with flexibility to allow for the use of additional Search Engines and Services.

Several other popular Search Engines and Appliances include:

  • LucidWorks Fusion

  • Elasticsearch

  • Google Cloud Search

  • Amazon CloudSearch

  • Oracle Endeca

Broadleaf does not directly support these integrations at this time. However, they can be integrated in a pluggable way, as needed, by implementors.