Execute and Raw Execute

Query Object

The Query object is a Python class that wraps the search API call to execute a simple query from Python code. A query is not run until its execute() function is called.


Constructor Arguments

Constructor arguments mirror the parameters available in a simple query. See the Search Query Syntax Guide for more information on formatting the query string and for details about each parameter. A usage sketch follows the list.

  • query string: the only required parameter
  • fields
  • limit: defaults to 33554000 (or 1 GB of data)
  • store
  • outputset: when set to True, the outside search API result id is used as a temporary file name, allowing Report Builder to create clickable charts and graphs
  • facets: when set to True, the facets value from the outside search API will be used, allowing Report Builder to create charts and graphs from its predefined facet list
  • lang
  • logquery: flag to output search patterns into a temp file; used by engineering for debugging
  • snippetlength
  • format: XML, CSV, DF (dataframe)
  • sample
  • sort
  • start
  • fq: facet query, Boolean query string to further constrain the main query; it is used during "facet drill down" searches
  • wq
  • wqm
  • report
  • total_type: 'exacttotal' or 'approxtotal'; computes (or estimates) the number of hits for the query in the corpus
  • outfile
  • snippetcount
  • count: deprecated alias for limit
  • space: deprecated alias for store
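
For example, a minimal sketch of constructing a query (the query string and field names here are hypothetical):

my_query = Query("diabetes",
                 fields="n.age,gender",
                 limit=1000,
                 format="csv",
                 sort="n.age")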

Execute Function

The execute function can compute multiple items for a single query, including hits in XML, CSV, or a dataframe (DF); facets in XML; and the hits total for the corpus. The items are stored in a ResultsHolder.

resultHolder = Query(...).execute(aggregator=None, dictCsvParams=None)

Neither the aggregator nor the dictCsvParams parameter is required. If the defaults of None are acceptable, there is no need to pass them at all.

aggregator
This interface receives dataframe "chunks" so the script can aggregate the data incrementally instead of building one large dataframe. It is only relevant when format='df' on the Query object. When None, a single large dataframe is built. See the Aggregator section for more details.

dictCsvParams
This is a dictionary of parameters that can override the default parameters for the Pandas read_csv() function for constructing dataframes. Perfect Search automatically sets infer_datetime_format = True since we normalize date-time fields in our feed system. Thus customers typically do not need to provide any overrides.
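
For example, a hedged sketch of overriding two read_csv() defaults (these particular overrides are illustrative, not required):

csv_overrides = {"na_values": ["NULL", "N/A"],     # treat these strings as missing
                 "dtype": {"gender": "category"}}  # hypothetical field name
rh = Query("test3", format="df").execute(dictCsvParams=csv_overrides)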

A query can be executed by calling .execute() inline with the constructor call, or by assigning the Query to a variable and calling execute() later.

my_resultHolder = Query("test1").execute()

OR

my_query = Query("test2")
my_resultHolder = my_query.execute()
WARNING:

For string formats (XML, CSV), the execute() function is limited to returning about 1 GB of data per call. This is a search server configuration setting; the data in a response body returned by the search server can never exceed 4 GB. For format='df', however, the execute function makes multiple calls to the search server to fulfill the requested limit as it builds a potentially huge Pandas dataframe for analysis. The script writer must be careful to avoid exceeding physical memory and depriving other users of system resources. Rather than banning the ability to build large dataframes, Perfect Search recognized there are appropriate use cases where an entire server is dedicated to a single analysis task. When running such a large script, it is advisable to run it overnight, while demand on the server is minimal.


Raw Execute Function

This function returns the raw string from the search server, without creating a ResultsHolder object.

stringObject = Query("test").raw_execute()

The string size is constrained by the search server to approximately 1 GB.


ResultsHolder

The ResultsHolder holds the requested hits, facets, and corpus hit total, as requested by the Query parameters. It is the object returned from the Query.execute() function (in place of the deprecated ResultSet object). Because the hit list can be large, the object only stores the hit list in the requested format: XML, CSV, or DF (dataframe). Facets are only retrieved when the XML or DF format is specified (not CSV); they are stored internally as the original XML string received from the search server, and functions are provided to convert them to Pandas objects (either Series or DataFrame). The ResultsHolder has the following functions to retrieve the query results.

Because get_hits, get_facets, and show_hits are somewhat limited in usability, additional functions are available for advanced users.

As an example, the following query is executed and returns a ResultsHolder into the variable rh:

rh = Query("()diabetes", facets="n.age:range(0,90,10)", store=0, format="df").execute()
  • hits(self): returns hits in the format requested in the query (XML, CSV, or DF)
myHitsDataframe = rh.hits()
  • facets_xml(self): returns a source-facets XML string or an empty string if no facets exist
myFacetsXmlStr = rh.facets_xml()
  • facets_series_list(self): returns a list of Pandas Series objects (one for each facet) or None if no facets exist
myListOfSeriesObjects = rh.facets_series_list()
  • facets_dataframe(self): returns a concatenated Dataframe containing all facets or None if no facets exist
myFacetsDataframe = rh.facets_dataframe()
  • raw_string(self): returns the raw string as received from the search server; can be None if the raw CSV was converted to a hits DataFrame and no other XML is stored
myRawString = rh.raw_string()
  • hits_total(self): returns the total number of hits if the query determined it, otherwise it is None
totalNumberOfHitsInCorpus = rh.hits_total()
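
Putting these together, a brief sketch of a typical retrieval sequence (reusing the query above; hits_total() returns None unless the query requested a total via total_type):

hits_df = rh.hits()                # hit list as a Pandas DataFrame
facets_df = rh.facets_dataframe()  # all facets in one DataFrame, or None
total = rh.hits_total()            # corpus hit total, or None
if facets_df is not None:
    print(facets_df.head())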

Aggregator

The aggregator is an interface with one defined function:

class Aggregator(object):
    def absorb(self, dataframe):
        raise NotImplementedError
Note:

This functionality has been wrapped in a function: use get_distinct_values() for a simpler version of this pattern that avoids writing the code yourself.

The script writer codes a customized subclass that overrides the absorb() function. Query().execute(myAggregator) passes dataframe "chunks" to the absorb() function, and the custom code extracts and uses the information appropriately (typically aggregating summary information).

The following example is an aggregator that gathers the unique values for a list of fields along with the count of records that have each value. Here fields is a comma-separated string listing the field names:

class UniqueMultiFieldAggregator(object):
    def __init__(self, fields):
        # Map each field name to a running Pandas Series of value counts
        self.unique_dict = dict.fromkeys(fields.split(','))

    def absorb(self, df):
        for field, unique in self.unique_dict.items():
            # Count occurrences of each distinct value in this chunk
            uc = df.groupby(field).size()
            try:
                # Merge this chunk's counts into the running totals
                self.unique_dict[field] = unique.add(uc, fill_value=0)
            except AttributeError:
                # First chunk: the running value is still None
                self.unique_dict[field] = uc

The unique_dict is a Python dictionary object that maps each field name to a Pandas Series object. The Series object contains the list of unique values for the field (as the Series index) along with the corresponding record counts (as the Series data).
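
A short usage sketch of this aggregator (the field names here are hypothetical):

agg = UniqueMultiFieldAggregator("gender,n.age")
Query("()diabetes", format="df").execute(aggregator=agg)
for field, counts in agg.unique_dict.items():
    print(field)
    print(counts.sort_values(ascending=False).head())  # most common values first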

