pdb.more_like_this.
For instance, the following query finds documents that are “like” a document with an id of 3:
description, rating, and category were returned.
This is because, by default, all fields present in the index are considered for matching.
The only exception is JSON fields, which are not yet supported and are ignored by the more like this query.
Because JSON fields are not yet supported for MLT, an error will be returned if a JSON field is passed into the array.
How It Works
Let’s look at how the MLT query works under the hood:
- Stored values for the input document’s fields are retrieved. If they are text fields, they are tokenized and filtered in the same way as the field was during index creation.
- A set of representative terms is created from the input document. For example, in the statement above, these terms would be sleek, running, and shoes for the description field; 5 for the rating field; footwear for the category field.
- Documents with at least one term match across any of the fields are considered a match.
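The three steps above can be modeled with a small in-memory sketch in Python. This is an illustration of the logic only, not ParadeDB's implementation: the sample rows, the `tokenize` function (lowercase, split on non-alphanumeric characters), and the function names are all assumptions made for the example.

```python
import re

# Hypothetical sample rows, keyed by id (not the table from the examples above).
DOCS = {
    1: {"description": "Ergonomic metal keyboard", "rating": 4, "category": "Electronics"},
    2: {"description": "Plastic keyboard", "rating": 4, "category": "Electronics"},
    3: {"description": "Sleek running shoes", "rating": 5, "category": "Footwear"},
    4: {"description": "White jogging shoes", "rating": 3, "category": "Footwear"},
    5: {"description": "Hardcover book", "rating": 2, "category": "Books"},
}


def tokenize(value):
    """Assumed tokenizer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", str(value).lower()) if t]


def representative_terms(doc):
    """Steps 1 and 2: turn each stored field value into a set of terms."""
    return {field: set(tokenize(value)) for field, value in doc.items()}


def more_like_this(doc_id, docs=DOCS):
    """Step 3: keep documents that share at least one term in any field."""
    wanted = representative_terms(docs[doc_id])
    matches = []
    for other_id, other in docs.items():
        have = representative_terms(other)
        if any(wanted[field] & have.get(field, set()) for field in wanted):
            matches.append(other_id)
    return sorted(matches)


print(more_like_this(3))  # [3, 4]
```

In this sketch the input document matches itself, and document 4 matches because it shares the term shoes in the description field.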
Using a Custom Input Document
In addition to providing a key field value, a custom document can also be provided as JSON. The JSON keys are field names and must correspond to field names in the index.
Configuration Options
Term Frequency
min_term_frequency excludes terms that appear fewer than a certain number of times in the input document, while max_term_frequency excludes terms that appear more than that many times. By default, no terms are excluded based on term frequency.
For instance, the following query returns no results because no term appears twice in the input document.
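The filtering rule itself can be modeled in a few lines of Python. This is a sketch rather than ParadeDB code: `filter_by_term_frequency` is a hypothetical helper, and only the two option names come from the text above.

```python
from collections import Counter


def filter_by_term_frequency(tokens, min_term_frequency=None, max_term_frequency=None):
    """Keep terms whose count in the input document lies within the given bounds.

    Hypothetical helper; None means "no bound", mirroring the default of
    excluding nothing.
    """
    kept = {}
    for term, count in Counter(tokens).items():
        if min_term_frequency is not None and count < min_term_frequency:
            continue
        if max_term_frequency is not None and count > max_term_frequency:
            continue
        kept[term] = count
    return kept


tokens = ["sleek", "running", "shoes"]
print(filter_by_term_frequency(tokens))                        # every term kept
print(filter_by_term_frequency(tokens, min_term_frequency=2))  # {}
```

With a minimum of 2, every term is dropped because each appears only once, so nothing is left to match on.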
Document Frequency
min_doc_frequency excludes terms that appear in fewer than a certain number of documents across the entire index, while max_doc_frequency excludes terms that appear in more than that many documents. By default, no terms are excluded based on document frequency.
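As a rough Python model of this filter, the sketch below treats the index as a list of token sets. The corpus and the helper names `document_frequency` and `filter_by_doc_frequency` are assumptions made for illustration.

```python
def document_frequency(term, corpus):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for tokens in corpus if term in tokens)


def filter_by_doc_frequency(terms, corpus, min_doc_frequency=None, max_doc_frequency=None):
    """Keep terms whose document frequency lies within the given bounds."""
    kept = []
    for term in terms:
        df = document_frequency(term, corpus)
        if min_doc_frequency is not None and df < min_doc_frequency:
            continue
        if max_doc_frequency is not None and df > max_doc_frequency:
            continue
        kept.append(term)
    return kept


# Hypothetical index contents: one set of tokens per document.
corpus = [
    {"sleek", "running", "shoes"},
    {"white", "jogging", "shoes"},
    {"generic", "shoes"},
    {"plastic", "keyboard"},
]
terms = ["sleek", "running", "shoes"]

print(filter_by_doc_frequency(terms, corpus, min_doc_frequency=2))  # ['shoes']
print(filter_by_doc_frequency(terms, corpus, max_doc_frequency=2))  # ['sleek', 'running']
```

A minimum bound drops rare terms and a maximum bound drops very common ones, which are often too generic to signal similarity.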
Max Query Terms
By default, only the top 25 terms across all fields are considered for matching. Terms are scored using a combination of inverse document frequency and term frequency (TF-IDF), which means that terms that appear frequently in the input document and are rare across the index score the highest. This can be configured with max_query_terms:
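To make the ranking concrete, here is a simplified Python model of top-term selection. The exact scoring formula is not given above, so the sketch assumes `tf * ln(1 + N / df)`. It also works on a single field for brevity, and `top_terms` and the corpus are hypothetical.

```python
import math
from collections import Counter


def top_terms(input_tokens, corpus, max_query_terms=25):
    """Rank the input document's terms by an assumed TF-IDF score and keep the top N."""
    n_docs = len(corpus)
    scored = []
    for term, tf in Counter(input_tokens).items():
        df = sum(1 for tokens in corpus if term in tokens)
        if df == 0:
            continue  # a term absent from the index cannot match anything
        idf = math.log(1 + n_docs / df)  # assumed IDF variant
        scored.append((tf * idf, term))
    # Highest score first; ties are broken alphabetically to keep the result stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


corpus = [
    {"sleek", "running", "shoes"},
    {"white", "jogging", "shoes"},
    {"generic", "shoes"},
    {"plastic", "keyboard"},
]
tokens = ["sleek", "running", "shoes"]

print(top_terms(tokens, corpus))                     # ['running', 'sleek', 'shoes']
print(top_terms(tokens, corpus, max_query_terms=2))  # ['running', 'sleek']
```

Here shoes ranks last because it appears in three of the four documents, so lowering the limit to 2 removes it from the query.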
Term Length
min_word_length and max_word_length can be used to exclude terms that are too short or too long, respectively. By default, no terms are excluded based on length.
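A minimal Python sketch of the length filter follows. It assumes length is measured in characters, which the text above does not specify, and `filter_by_word_length` is a hypothetical helper.

```python
def filter_by_word_length(terms, min_word_length=None, max_word_length=None):
    """Keep terms whose length (assumed: in characters) lies within the bounds."""
    kept = []
    for term in terms:
        if min_word_length is not None and len(term) < min_word_length:
            continue
        if max_word_length is not None and len(term) > max_word_length:
            continue
        kept.append(term)
    return kept


terms = ["sleek", "running", "shoes", "5"]
print(filter_by_word_length(terms, min_word_length=5))  # ['sleek', 'running', 'shoes']
print(filter_by_word_length(terms, max_word_length=5))  # ['sleek', 'shoes', '5']
```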
Custom Stopwords
To exclude terms from being considered, provide a text array to stopwords:
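Conceptually, stopwords are removed from the candidate term set before matching. The Python sketch below illustrates that idea with a hypothetical `remove_stopwords` helper. Exact, case-sensitive comparison is an assumption here, not documented behavior.

```python
def remove_stopwords(terms, stopwords=()):
    """Drop any term that appears in the stopword list (assumed: exact match)."""
    blocked = set(stopwords)
    return [term for term in terms if term not in blocked]


terms = ["sleek", "running", "shoes"]
print(remove_stopwords(terms, ["shoes"]))  # ['sleek', 'running']
```

With shoes excluded, only documents sharing sleek or running, or a term from another field, would still count as matches.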