Approximate search with a special character

son-fast-shard · May 24, 2024, 7:23am

Hi, I ran into an issue while filtering on a field that has approximate search enabled.
Given: Field with value: VSMA 24# 78280679
Input for filter: VSMA 24# 78280679 => returns nothing.
Input for filter: VSMA 24 78280679 => returns the correct result.

How can I fix this issue?

ahmad-warm-vale · May 24, 2024, 7:53am

Hi @son-fast-shard,

It depends on the search engine you are using. If you are using Solr, please check your app configuration and make sure it contains this:

mgmtp.a12.dataservices.search.analysis.fullText.ngrams.enabled=true

son-fast-shard · May 24, 2024, 8:45am

This is the search configuration that we’re using:

mgmtp.a12.dataservices.search.analysis.fullText.ngrams.enabled=true
mgmtp.a12.dataservices.search.index.initialization.mode=REGULAR
mgmtp.a12.dataservices.search.service=lucene
mgmtp.a12.dataservices.search.lucene.homeDir=./cosmo-claims/database/lucene

We haven’t upgraded to 2023.06 yet, so we are still using Lucene.

ahmad-warm-vale · May 24, 2024, 8:48am

Can you try adding the following configs:

mgmtp.a12.dataservices.search.solr.urls=http://localhost:8983/solr
mgmtp.a12.dataservices.search.service=solrclient
mgmtp.a12.dataservices.search.index.initialization.mode=REBUILD_INDEX
mgmtp.a12.dataservices.search.analysis.fullText.ngrams.enabled=true

petr-high-peak · May 24, 2024, 9:13am

Hi @son-fast-shard ,

Please check the schema.xml in your Solr installation. Which tokenizers are used for indexing and for querying in approximate search? I suspect that one of them is the whitespace tokenizer and the other is a different tokenizer, which prevents special characters from being used in ngram-based search.

son-fast-shard · May 24, 2024, 9:49am

Currently we use the default A12 configuration, so we don’t have a customized schema.xml.

petr-high-peak · May 24, 2024, 10:25am

Yes, then the definition is:

	<fieldType name="fulltextNGram" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
		<analyzer type="index">
			<tokenizer class="solr.WhitespaceTokenizerFactory"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="40"/>
			<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
		</analyzer>
		<analyzer type="query">
			<tokenizer class="solr.KeywordTokenizerFactory"/>
			<filter class="solr.LowerCaseFilterFactory"/>
		</analyzer>
	</fieldType>

and with this definition it’s not possible to use tokenizing characters in a search.

son-fast-shard · May 27, 2024, 2:39am

Is there any way to enable it? And how would that affect the current search?

tomas-thin-gale · June 12, 2024, 12:50pm

Hi @son-fast-shard ,

You can specify how the approximate match behaves by changing the type of the dynamic field *_APPROXIMATE_MATCH

	<dynamicField name="*_APPROXIMATE_MATCH" type="fulltextNGram" indexed="true" stored="false" multiValued="true"/>

to any type you would like. @petr-high-peak provided the default configuration.

Looking at your use case, I would suggest trying a different tokenizer for the index analyzer. I cannot recommend a specific one based on a single input string. The tokenizer of the index analyzer creates the searchable tokens from the field values. For example, the whitespace tokenizer splits the field value into tokens at whitespace, the standard tokenizer additionally splits at most special characters, and the keyword tokenizer takes the whole field value as a single token.
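As a rough, untested sketch of what such a variant could look like: the field type below copies the default fulltextNGram definition and only swaps the tokenizer of the index analyzer, then points the dynamic field at it. The name fulltextNGramCustom is made up for illustration, and solr.StandardTokenizerFactory is just one of the candidates to try, not a recommendation.

```xml
<!-- Hypothetical field type name; it is not part of the default A12 schema. -->
<fieldType name="fulltextNGramCustom" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
	<analyzer type="index">
		<!-- Tokenizer under test. Candidates mentioned in this thread:
		     solr.WhitespaceTokenizerFactory (the default),
		     solr.StandardTokenizerFactory,
		     solr.KeywordTokenizerFactory. -->
		<tokenizer class="solr.StandardTokenizerFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="40"/>
		<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
	</analyzer>
	<!-- Query analyzer left unchanged from the default definition. -->
	<analyzer type="query">
		<tokenizer class="solr.KeywordTokenizerFactory"/>
		<filter class="solr.LowerCaseFilterFactory"/>
	</analyzer>
</fieldType>

<!-- Same dynamic field as before, now referencing the custom type. -->
<dynamicField name="*_APPROXIMATE_MATCH" type="fulltextNGramCustom" indexed="true" stored="false" multiValued="true"/>
```

Whether this gives acceptable results for values like VSMA 24# 78280679 has to be checked against your own data after re-indexing.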

I would suggest trying out a couple of schema.xml configurations on your data in a test environment until you get results that you find acceptable. Please keep in mind that changing schema.xml requires restarting the DS server and enabling re-indexing of the documents.

The answer is valid for DS versions 33.0 - 37.0