Faceting Overview
  • 25 Sep 2022

Faceting is a way to count and display how the results of a query distribute into subsets. Faceting makes it easy for users to "drill down" into more and more specific results, discovering patterns as they go.

The search appliance's faceting features are powerful and convenient to use, and they can greatly enhance the user experience. Faceting is implemented efficiently, but a faceted query is still more expensive to run than a simple query; therefore, you should understand and weigh the benefits of faceting versus the performance and complexity trade-offs.

Background Concepts

A facet is an organized perspective on data, and it is associated with a particular field that has groupable values. For example, when indexing classified ads, each ad might have fields for price, list_date, product_category, description, and seller_name. Of those fields, price, list_date, and product_category all lend themselves to grouping (show all the ads where price is less than $5,000, or show all the ads where list_date was last week), and are thus good candidates for faceting. On the other hand, description and seller_name are not good facets, because each ad will have a unique value in those fields; they don't group easily.

The criteria used to group facet values are called constraints. The subset of values that matches a particular constraint is called a bin. Each bin typically has a count; the counts associated with bins can be used to build a histogram.

A constraint is expressed by describing a range and its subsets. All numeric and date fields are potential range facets. If faceting on a field named test_score_pct, the overall range is likely 0 to 100. The report writer might want to count how many values fall into the "A" range (93 to 100), how many fall into the "B" range (86 to 92), and so forth. A second scenario may be faceting by year on a field named birth_date.
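The idea of range constraints, bins, and counts can be illustrated with a minimal sketch in plain Python. This is not appliance code: the bin labels, the "below B" catch-all bin, and the sample scores are assumptions made up for illustration.

```python
# Hypothetical grade constraints for a test_score_pct facet.
# Each entry is (label, low, high); both bounds are inclusive here
# because the sample scores are whole percentages.
GRADE_BINS = [
    ("A", 93, 100),
    ("B", 86, 92),
    ("below B", 0, 85),  # catch-all bin added only for this illustration
]


def bin_for(score):
    """Return the label of the bin whose range contains score, or None."""
    for label, low, high in GRADE_BINS:
        if low <= score <= high:
            return label
    return None


def facet_counts(scores):
    """Count how many scores fall into each bin (a simple histogram)."""
    counts = {label: 0 for label, _, _ in GRADE_BINS}
    for score in scores:
        label = bin_for(score)
        if label is not None:
            counts[label] += 1
    return counts


sample_scores = [95, 88, 100, 72, 86, 93, 40]
print(facet_counts(sample_scores))
```

Each tuple in GRADE_BINS plays the role of a constraint, and each entry in the returned dictionary plays the role of a bin with its count.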

Some constraints can be enumerated. For example, if faceting on a field named color, and the possible values for the field consisted of the primary colors ("red", "yellow", "blue"), then one possible constraint for the color facet would be color="red", another would be color="yellow", and so forth. Enumerable constraints are managed as taxonomies in the search appliance, and facets with enumerable constraints are called taxonomy facets. More information about defining and using taxonomies is provided below.


Asking for Facets

Sometimes prior configuration is required to make faceting useful, but many faceting features work out-of-the-box. Let's assume a working configuration for a moment, so we can cover the basic query syntax; then we'll circle back and explain configuration issues when our foundation is complete.

Number and Date Facets

A faceted query is any normal query that contains a &facets= in the URL. Consider the following simple (unfaceted) query:

Query

https://<appliance_ip>/search?q="molecular biology"

In this example, the caller wants all documents that contain the phrase "molecular biology." The response will include XML containing a series of <hit> tags describing the documents that match.

Now, imagine that the caller makes the same query, but appends the faceting syntax:

Query

https://<appliance_ip>/search?q="molecular biology"&facets=lastmodified:range(begin:-1y, end:today, gap:-mo),contentlength:range(begin:0,end:4G,gap:x100)

This request still causes the same series of <hit> tags to be returned, but by adding &facets=... the caller is also requesting additional information about how those hits are distributed across a lastmodified facet and a contentlength facet. The resulting XML response will have an additional block at the bottom:

Response

<facets>
  <facet name="lastmodified" type="date">
    <bin range="[2011-09-01 TO 2011-10-01}">19</bin>
    <bin range="[2011-10-01 TO 2011-11-01}">6</bin>
    …
    <bin range="[2012-08-01 TO 2012-09-01}">25</bin>
  </facet>
  <facet name="contentlength" type="number">
    <bin range="[0 TO 100}">0</bin>
    <bin range="[100 TO 10000}">2</bin>
    …
    <bin range="[100000000 TO 4294967296}">2</bin>
  </facet>
</facets>

This block describes, for each facet, how the results subdivide according to the constraints that define the requested bins: 19 of the hits have a lastmodified value during September 2011; 6 during October 2011, and so forth. An advanced discussion of facet range specifications is covered in Faceting Query Syntax.

Note: The half-open range notation (square bracket = inclusive; curly brace = exclusive) is utilized to prevent the same items from being included in multiple bins.
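A client could read the facet block with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the sample response (the elided bins are left out). The helper names parse_range and read_facets are hypothetical, and the range pattern is an assumption inferred only from the sample shown here.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the sample response; elided bins are omitted.
SAMPLE = """
<facets>
  <facet name="lastmodified" type="date">
    <bin range="[2011-09-01 TO 2011-10-01}">19</bin>
    <bin range="[2011-10-01 TO 2011-11-01}">6</bin>
  </facet>
  <facet name="contentlength" type="number">
    <bin range="[0 TO 100}">0</bin>
    <bin range="[100 TO 10000}">2</bin>
  </facet>
</facets>
"""

# Assumed shape of a range attribute: bracket, low, TO, high, bracket.
# "[" or "]" marks an inclusive bound; "{" or "}" marks an exclusive bound.
RANGE_RE = re.compile(r"^([\[{])(\S+) TO (\S+)([\]}])$", re.IGNORECASE)


def parse_range(text):
    """Split a range string into (low, high, low_inclusive, high_inclusive)."""
    match = RANGE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized range: {text!r}")
    open_ch, low, high, close_ch = match.groups()
    return low, high, open_ch == "[", close_ch == "]"


def read_facets(xml_text):
    """Return {facet name: [(low, high, low_incl, high_incl, count), ...]}."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for facet in root.findall("facet"):
        bins = []
        for bin_el in facet.findall("bin"):
            low, high, low_incl, high_incl = parse_range(bin_el.get("range"))
            bins.append((low, high, low_incl, high_incl, int(bin_el.text)))
        result[facet.get("name")] = bins
    return result


facets = read_facets(SAMPLE)
for name, bins in facets.items():
    print(name, bins)
```

Because every bin is inclusive at the low end and exclusive at the high end, a value such as 2011-10-01 belongs only to the second lastmodified bin.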

String Facets

Faceting on text strings requires a taxonomy file (see the example below). The &facets= section of the request URI contains a comma-separated list of specifications, one for each requested facet. The syntax for a string facet is &facets=name:field(...). Replace name with the name of the facet established in the taxonomy file; capitalization matters. The word field is a keyword and must be entered as is. A query over classified ads that facets on color and price looks like this:

Query

https://<appliance_ip>/search?&q=corvette&facets=color:field(),price:range(begin:0, end:75000,gap:{1000,5000,15000}, after:true)

This request asks that all ads for corvettes be grouped according to color and price. The response might look like this:

Response

<facets>
  <facet name="color" type="tax" undertn="9">
    <bin undertn="25" under="red">39</bin>
    <bin undertn="26" under="yellow">17</bin>
    <bin undertn="27" under="blue">12</bin>
  </facet>
  <facet name="price" type="number">
    <bin range="[0 TO 1000}">0</bin>
    <bin range="[1000 TO 5000}">16</bin>
    <bin range="[5000 TO 20000}">34</bin>
    <bin range="[20000 TO 50000}">11</bin>
    <bin range="[50000 TO *}">7</bin>
  </facet>
</facets>

Notice that the sum of the counts for the price facet is 68 and the sum of the counts for the color facet is also 68. This query returned 68 hits; if all records are assigned exactly one value from each facet, then this equality (hit count = sum of counts for facet X) will always hold.

If our color taxonomy is complex, with children like "red" → "candy apple" or "blue" → "navy", and we wanted to see only red corvettes, we might specify the facets like this:

Query

&facets=color:field(undertn=25, under="red")

In this example, undertn identifies a specific node in our hierarchy (the node for "red"; we discovered this number by running the first query; see XML above), and under provides an optional, redundant label for that node for cross-checking purposes. All shades of red would be returned in bins, with the count for each:

Response

<facet name="color" type="tax" undertn="25">
  <bin undertn="96" under="candy apple">22</bin>
  <bin undertn="97" under="rust">7</bin>
  <bin undertn="98" under="maroon">10</bin>
</facet>

Notice that the facet information for the top-level values in the color hierarchy is no longer provided; the request asked for counts only under taxonomy node (undertn) 25. Notice also that the counts in this new set of bins sum to 39, and the count for the "red" bin in the original query was also 39.
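A client that offers drill-down links could derive the next facet specification from each bin of the previous response. The following sketch assumes the response shape and the field(undertn=..., under="...") syntax shown in the examples above; the function name drill_down_specs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Color facet copied from the first sample response.
COLOR_FACET = """
<facet name="color" type="tax" undertn="9">
  <bin undertn="25" under="red">39</bin>
  <bin undertn="26" under="yellow">17</bin>
  <bin undertn="27" under="blue">12</bin>
</facet>
"""


def drill_down_specs(facet_xml):
    """Map each bin label to a facet spec that counts under that node."""
    facet = ET.fromstring(facet_xml.strip())
    name = facet.get("name")
    specs = {}
    for bin_el in facet.findall("bin"):
        node_id = bin_el.get("undertn")
        label = bin_el.get("under")
        # Mirrors the syntax of the example query: undertn selects the node,
        # under repeats its label for cross-checking.
        specs[label] = f'{name}:field(undertn={node_id}, under="{label}")'
    return specs


specs = drill_down_specs(COLOR_FACET)
print(specs["red"])
```

The value printed for "red" is the same specification used in the drill-down query above.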


Viewing Facets Using Report Builder

Note: The following example is specific to Report Builder.

get_facets(q, facets=True, store=0)

To use get_facets(), assign the result to a name, call get_facets with a Boolean query expression, and then supply the facets and store parameters. The function requires the q and facets parameters in order to operate. See the Query Syntax guide for details on each of these parameters. Calling get_facets() only creates the bins for the facet; to view the facets in the output, a print function is necessary.

facet_query = get_facets('()insomnia', facets = 'n.age:range(0, 101, 10)', store=0)

print(facet_query)

Note: The end of the range should be one greater than the last value to be included in the counts. In the example above, the range ends at 101, so the last value counted is 100.

If using multiple stores when faceting, enter the stores as a comma-separated string. The facet parameter must include the name of the field followed by a colon and the word range.


Configuration

Search appliances ship with predefined configuration for a date field named "lastmodified" and a numeric field named "contentlength"; these automatically capture the last modified date and size of files and web pages, and support faceting without any additional work.

Before a field can be faceted, the search system must know the field's name and data type. This information is specified in conf/parsetable.xml. To facet on other date or numeric fields, edit the parsetable.xml using the predefined fields as examples.

Text fields that map onto values in a taxonomy require one special setup step in parsetable.xml. If they are flat (see the condition facet below), define them as type="keyliteral"; if they are nested (see the color facet below), define them as type="hierarchy". In addition, the taxonomy onto which their values map must be defined in a special file named conf/tax.<store number>.txt. For example, to define a color taxonomy and a condition taxonomy in space 0, you might create a conf/tax.0.txt file that looks like this:

// colors for cars
color
    red
        candy apple
        rust
        maroon
    yellow
        sunflower
        gold
    blue
        navy
        sky

# condition of cars
condition
    doesn't run
    fair
    good
    excellent
    new

Taxonomy files should be encoded as UTF-8. They accept either tabs or spaces (any number less than or equal to 8) as the indent. Tabs and spaces cannot be mixed, and with spaces, the number must be consistent; whatever indent is used on the first indented non-comment line must be used at each indent level thereafter. Blank lines are ignored, trailing space characters are trimmed, and comments can begin with #, //, or ;. Indents in front of comments are irrelevant. It shouldn't matter if lines end with \r\n or just \n.
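The formatting rules above can be checked before a file is deployed. The following is a minimal sketch of a validator and parser written from those rules alone; it is not the appliance's own loader, and the function name parse_taxonomy and the sample text are hypothetical. The sample uses a consistent four-space indent unit.

```python
SAMPLE_TAX = """\
// colors for cars
color
    red
        candy apple
        rust
    yellow

; condition of cars
condition
    fair
    new
"""


def parse_taxonomy(text):
    """Parse taxonomy text into nested dicts: {name: {child: {...}}}."""
    root = {}
    stack = [root]  # stack[d] holds the children dict at depth d
    unit = None     # indent string taken from the first indented line
    for raw in text.splitlines():      # handles both \r\n and \n endings
        line = raw.rstrip()            # trailing whitespace is trimmed
        name = line.lstrip(" \t")
        if not name or name.startswith(("#", "//", ";")):
            continue                   # blank lines and comments are ignored
        indent = line[: len(line) - len(name)]
        depth = 0
        if indent:
            if unit is None:
                if len(set(indent)) > 1:
                    raise ValueError("tabs and spaces cannot be mixed")
                if indent[0] == " " and len(indent) > 8:
                    raise ValueError("space indent must be 8 or fewer")
                unit = indent
            depth, remainder = divmod(len(indent), len(unit))
            if remainder or indent != unit * depth:
                raise ValueError(f"inconsistent indent: {raw!r}")
        if depth > len(stack) - 1:
            raise ValueError(f"indent skips a level: {raw!r}")
        del stack[depth + 1:]          # return to the parent at this depth
        node = {}
        stack[depth][name] = node
        stack.append(node)
    return root


tree = parse_taxonomy(SAMPLE_TAX)
print(list(tree))                   # top-level facets
print(list(tree["color"]["red"]))   # children of "red"
```

A file that switches from a four-space unit to a different width partway through raises ValueError in this sketch, which mirrors the consistency rule stated above.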

The default search included with the appliance does not expose any taxonomy fields, because customer taxonomies cannot be anticipated in advance. Also, the current implementation does not support discovering taxonomy values from the data. You must define your taxonomy in the external files; values you place there are used to create bins when you facet on the associated fields. Values are exact; variations in case or punctuation are different values.

Tax files are loaded once, when the ps-searchserver service starts, so if you add new values to your taxonomy, you need to restart the service. The undertn numbers used to drill into the tax node hierarchy are stable across edits to a tax file, which allows you to edit with confidence. For example, if you have a simple 3-node hierarchy for colors ("red", "yellow", "blue") and you feed a document whose color value is "white", that document will not be counted in any of the facet bins, even if it appears in the hit results. To correct the problem, add a line for "white" to tax.0.txt and restart the ps-searchserver service. Re-feeding or re-indexing is not required.
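The effect of a missing or differently cased taxonomy value can be illustrated with a small sketch in plain Python. It only models the exact-match rule described above; it is not appliance code, and the function name and sample values are made up for illustration.

```python
def count_taxonomy_bins(values, taxonomy_values):
    """Count values per taxonomy entry using exact, case-sensitive matching.

    Returns (counts, uncounted), where uncounted is the number of values
    that match no taxonomy entry and therefore land in no bin.
    """
    counts = {entry: 0 for entry in taxonomy_values}
    uncounted = 0
    for value in values:
        if value in counts:
            counts[value] += 1
        else:
            uncounted += 1
    return counts, uncounted


taxonomy = ["red", "yellow", "blue"]
hits = ["red", "white", "blue", "Red", "red"]  # hypothetical color values

counts, uncounted = count_taxonomy_bins(hits, taxonomy)
print(counts, uncounted)

# After "white" is added to the taxonomy, the same hits count differently.
counts_fixed, uncounted_fixed = count_taxonomy_bins(hits, taxonomy + ["white"])
print(counts_fixed, uncounted_fixed)
```

In the first call, "white" and "Red" are both left out of the bins; adding "white" to the taxonomy recovers one of them, while "Red" still fails to match "red" because values are exact.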

Taxonomies can be large—tens of thousands of nodes and many levels of nested hierarchy—with no particular performance impact. However, no single node in a taxonomy can have more than 64 thousand children. In a case where you have millions of possible values (e.g., you want to facet a billion phone calls by the target phone number), subdivide the desired facet into smaller fields (e.g., area code, then first three digits of number) to achieve smaller sets; without this kind of change, the number of bins in the facet would be unusable to a human wanting to browse.


Drilling In and Breadcrumbs

When you "drill into" a facet, the scope of your query narrows. This narrowing is not accomplished with the &facets parameter—that only tells the search engine how to count. Instead, you must add an &fq= parameter to narrow the search. Contrast the following:

Query

https://<appliance_ip>/search?&q=corvette&facets=price:range(begin:0, end:10000, gap:5000)
https://<appliance_ip>/search?&q=corvette&facets=price:range(begin:0, end:10000, gap:5000)&fq=price:[* TO 10000]

The first query says to find all corvettes, and to facet (assign to bins and count) only those with prices between $0 and $10,000. This query returns all corvettes, but counts only the inexpensive ones. Corvettes with prices greater than $10,000 would be in the results, but would not be counted in a bin.

The second query says to find all corvettes where prices are less than or equal to $10,000 and to facet them into bins of $0–$5,000 and $5,000–$10,000.
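The difference between counting and narrowing can be sketched as a small URL builder. The host name appliance.example is a hypothetical stand-in for <appliance_ip>, the function name search_url is made up, and percent-encoding the parameters with urlencode is an assumption about what the appliance accepts rather than a documented requirement.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical host standing in for <appliance_ip>.
BASE = "https://appliance.example/search"


def search_url(q, facets=None, fq=None):
    """Build a search URL; facets only counts, fq narrows the result set."""
    params = [("q", q)]
    if facets:
        params.append(("facets", facets))
    if fq:
        params.append(("fq", fq))
    return BASE + "?" + urlencode(params)


PRICE_FACET = "price:range(begin:0, end:10000, gap:5000)"

# Counts inexpensive corvettes but still returns every corvette.
counted_url = search_url("corvette", facets=PRICE_FACET)

# Returns only corvettes priced at or below 10000, then counts them.
narrowed_url = search_url("corvette", facets=PRICE_FACET, fq="price:[* TO 10000]")

print(counted_url)
print(narrowed_url)
print(parse_qs(urlsplit(narrowed_url).query)["fq"])
```

A breadcrumb trail can then be kept as the list of fq values the user has applied so far, under the same assumptions.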


Performance Cost

In most cases, queries run dramatically faster when they can calculate only the top N hits. Because faceting has to count all hits according to a faceted characteristic, faceted queries must examine every hit individually; the optimization of finding only the top N hits is lost.

If the number of hits for a query is relatively small (e.g., less than 10,000), the extra overhead of faceting is unlikely to be noticeable. However, in systems that have billions of records, and that run queries with vast hit totals (find all documents containing the word "the"), faceting overhead may become undesirable. The number of bins in a given facet and the number of facets being examined also influence overhead.


Miscellaneous

The search appliance cannot query with facets across spaces.

For security reasons, taxonomy bins with 0 count are not returned in facet results.


