Azure Search Cyrillic
At the moment Azure Search does not work very well with stemmers, and there are no n-gram tokenizers and the like. Still, there is a way to make this work.
Here is my experiment: suppose we have a "vacancies" index with some job offers in Russian, and we want to search over them.
All requests are made with PowerShell for simplicity.
To reproduce them you will need your API key.
Besides its key, each vacancy in the index has just one field, "Position", against which the search is performed.
Note: for PowerShell to produce the right JSON, do not forget to add something like -Depth 10 to your ConvertTo-Json calls.
Note: for things to work properly, do not forget to convert your data to UTF-8 with [System.Text.Encoding]::UTF8.GetBytes(...)
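Both notes can be checked locally without touching the service. The snippet below is only a small sketch: the nested `$Nested` hashtable is made up for illustration, and the script file is assumed to be saved in an encoding that preserves Cyrillic.

```powershell
# Four levels of nesting: deeper than ConvertTo-Json's default depth of 2.
$Nested = @{ a = @{ b = @{ c = @{ d = 'Киев' } } } }

# Default depth: the innermost hashtable is replaced by its type name.
$Shallow = $Nested | ConvertTo-Json -Compress -WarningAction SilentlyContinue

# Explicit depth: the whole structure is serialized.
$Deep = $Nested | ConvertTo-Json -Compress -Depth 10

# Encode the JSON text as UTF-8 bytes; this is what goes into -Body.
$Body = [System.Text.Encoding]::UTF8.GetBytes($Deep)

$Shallow                      # "c" holds the string System.Collections.Hashtable
$Deep                         # "c" holds a real nested object
$Body.Length - $Deep.Length   # positive: Cyrillic letters take two bytes each in UTF-8
```

Without -Depth the request would still be valid JSON, just not the JSON you meant to send, so this kind of mistake is easy to miss.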
Setting up
$Headers = @{
    'Content-Type' = 'application/json; charset=utf-8'
    'api-key'      = '********************************' # Provide your API key
}
Invoke-RestMethod -Method Delete -Uri 'https://testrus.search.windows.net/indexes/vacancies?api-version=2015-02-28-Preview' -Headers $Headers
Search with default analyzers
Create index
$IndexDefinition = @{
    'name'   = 'vacancies'
    'fields' = @(
        @{
            'name'        = 'VacancyId'
            'type'        = 'Edm.String'
            'searchable'  = $False
            'filterable'  = $False
            'sortable'    = $False
            'facetable'   = $False
            'key'         = $True
            'retrievable' = $True
        },
        @{
            'name'        = 'Position'
            'type'        = 'Edm.String'
            'searchable'  = $True
            'filterable'  = $True
            'sortable'    = $True
            'facetable'   = $True
            'key'         = $False
            'retrievable' = $True
        }
    )
}
Invoke-RestMethod -Method Post -Uri 'https://testrus.search.windows.net/indexes?api-version=2015-02-28-Preview' -Headers $Headers -Body ($IndexDefinition | ConvertTo-Json -Depth 10)
Insert some data
$Documents = @{
    'value' = @(
        @{
            'VacancyId' = '1'
            'Position'  = 'Менеджер по продажам в Киеве' # Translation: Sales manager in Kiev
        },
        @{
            'VacancyId' = '2'
            'Position'  = '1-С Программист Киев' # Translation: 1-C programmer, Kiev
        },
        @{
            'VacancyId' = '3'
            'Position'  = '1-С Программист во Львове' # Translation: 1-C programmer in Lviv
        },
        @{
            'VacancyId' = '4'
            'Position'  = 'Acme ищет менеджера по продажам' # Translation: Acme is looking for a sales manager
        }
    )
}
Invoke-RestMethod -Method Post -Uri 'https://testrus.search.windows.net/indexes/vacancies/docs/index?api-version=2015-02-28-Preview' -Headers $Headers -Body ([System.Text.Encoding]::UTF8.GetBytes(($Documents | ConvertTo-Json -Depth 10)))
Notice that the first two vacancies both contain the word Киев (Kiev, the capital of Ukraine), and that in the first vacancy the word has an additional letter е at the end (Киеве).
So what I want is to search for киеве and get the first two vacancies back (both the search request and the index should be stemmed).
Invoke-RestMethod -Method Get -Uri 'https://testrus.search.windows.net/indexes/vacancies/docs?api-version=2015-02-28-Preview&search=киеве' -Headers $Headers | select -ExpandProperty value
@search.score VacancyId Position
------------- --------- --------
0,74075186 1 Менеджер по продажам в Киеве
But only one comes back :(
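A side note on the request itself: the query above puts raw Cyrillic straight into the URL. That worked in this experiment, but percent-encoding the search text is a cheap precaution once queries contain spaces or reserved characters. Here is a minimal sketch. `Get-VacancySearchUri` is a hypothetical helper, not something Azure Search provides, and it only builds the string.

```powershell
# Hypothetical helper: builds the search URI with the query text
# percent-encoded as UTF-8. Service name and API version default to
# the ones used throughout this post.
function Get-VacancySearchUri {
    param(
        [Parameter(Mandatory)] [string] $SearchText,
        [string] $Service    = 'testrus',
        [string] $ApiVersion = '2015-02-28-Preview'
    )
    $Encoded = [uri]::EscapeDataString($SearchText)
    "https://${Service}.search.windows.net/indexes/vacancies/docs?api-version=${ApiVersion}&search=${Encoded}"
}

$Uri = Get-VacancySearchUri -SearchText 'киеве'
$Uri   # ends with search=%D0%BA%D0%B8%D0%B5%D0%B2%D0%B5
```

The result can be passed to Invoke-RestMethod -Uri exactly like the hard-coded URLs in this post. Encoding does not change what the analyzer does with the term, so the search still returns one document here.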
Azure Search and Apache Lucene?
Apache has made a great product called Apache Lucene,
which is used by many projects (e.g. Apache Solr, Elasticsearch) and has a huge number of features and analyzers.
Azure Search lets us use preconfigured Lucene analyzers:
$IndexDefinition = @{
    'name'   = 'vacancies'
    'fields' = @(
        @{
            'name'        = 'VacancyId'
            'type'        = 'Edm.String'
            'searchable'  = $False
            'filterable'  = $False
            'sortable'    = $False
            'facetable'   = $False
            'key'         = $True
            'retrievable' = $True
        },
        @{
            'name'        = 'Position'
            'type'        = 'Edm.String'
            'searchable'  = $True
            'filterable'  = $True
            'sortable'    = $True
            'facetable'   = $True
            'key'         = $False
            'retrievable' = $True
            'analyzer'    = 'ru.lucene' # <--- Here is the tricky part
        }
    )
}
Invoke-RestMethod -Method Post -Uri 'https://testrus.search.windows.net/indexes?api-version=2015-02-28-Preview' -Headers $Headers -Body ($IndexDefinition | ConvertTo-Json -Depth 10)
Do not forget to delete the index first, otherwise you will get the error: Cannot create index 'vacancies' because it already exists.
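Since the index definitions in this post differ only in the analyzer on Position, they can be generated instead of copy-pasted. This is just a convenience sketch: `New-VacancyIndexDefinition` is a name I made up, and it builds the same hashtable as above.

```powershell
# Hypothetical helper: returns the 'vacancies' index definition,
# optionally with an analyzer set on the Position field.
function New-VacancyIndexDefinition {
    param([string] $Analyzer)

    $Position = @{
        'name'        = 'Position'
        'type'        = 'Edm.String'
        'searchable'  = $True
        'filterable'  = $True
        'sortable'    = $True
        'facetable'   = $True
        'key'         = $False
        'retrievable' = $True
    }
    # Add the analyzer only when one was requested.
    if ($Analyzer) { $Position['analyzer'] = $Analyzer }

    @{
        'name'   = 'vacancies'
        'fields' = @(
            @{
                'name'        = 'VacancyId'
                'type'        = 'Edm.String'
                'searchable'  = $False
                'filterable'  = $False
                'sortable'    = $False
                'facetable'   = $False
                'key'         = $True
                'retrievable' = $True
            },
            $Position
        )
    }
}

$IndexDefinition = New-VacancyIndexDefinition -Analyzer 'ru.lucene'
$IndexDefinition | ConvertTo-Json -Depth 10
```

Calling it without -Analyzer gives the definition from the first experiment.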
Now we can insert the data exactly as before and try to run our search again:
Invoke-RestMethod -Method Get -Uri 'https://testrus.search.windows.net/indexes/vacancies/docs?api-version=2015-02-28-Preview&search=киеве' -Headers $Headers | select -ExpandProperty value | ft -AutoSize
@search.score VacancyId Position
------------- --------- --------
0,8465736 1 Менеджер по продажам в Киеве
So here is the funny part: on the one hand Azure gives us a serious analysis tool, but only with preconfigured options, and in this test they do not help :)
Microsoft Natural Language Processing
Thank the gods, Azure provides another, probably even cooler, way to analyze our data: Microsoft's own NLP (which, by the way, is used in Office and Bing).
$IndexDefinition = @{
    'name'   = 'vacancies'
    'fields' = @(
        @{
            'name'        = 'VacancyId'
            'type'        = 'Edm.String'
            'searchable'  = $False
            'filterable'  = $False
            'sortable'    = $False
            'facetable'   = $False
            'key'         = $True
            'retrievable' = $True
        },
        @{
            'name'        = 'Position'
            'type'        = 'Edm.String'
            'searchable'  = $True
            'filterable'  = $True
            'sortable'    = $True
            'facetable'   = $True
            'key'         = $False
            'retrievable' = $True
            'analyzer'    = 'ru.microsoft' # <--- Microsoft NLP can stem Russian words
        }
    )
}
Invoke-RestMethod -Method Post -Uri 'https://testrus.search.windows.net/indexes?api-version=2015-02-28-Preview' -Headers $Headers -Body ($IndexDefinition | ConvertTo-Json -Depth 10)
As usual, do not forget to delete the old index first. And now the magic happens:
Invoke-RestMethod -Method Get -Uri 'https://testrus.search.windows.net/indexes/vacancies/docs?api-version=2015-02-28-Preview&search=киеве' -Headers $Headers | select -ExpandProperty value | ft -AutoSize
@search.score VacancyId Position
------------- --------- --------
0,30007723 1 Менеджер по продажам в Киеве
0,30007723 2 1-С Программист Киев
At last we got our two documents, and everything seems to work.
Tip: do not forget that your index can have PositionRaw, PositionLucene, and PositionMicrosoft fields with different analyzers, so you can run queries against all of them. Unfortunately it is not as elegant as in Elasticsearch, but come on, Azure Search was released only a few days ago :)
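The tip can be sketched roughly like this. The three field names come from the tip, but the reduced attribute set is my shortcut (attributes left out are assumed to fall back to the service defaults), and I have not run this exact definition against the service.

```powershell
# Field name -> analyzer ($null means "use the default analyzer").
$Analyzers = [ordered]@{
    'PositionRaw'       = $null
    'PositionLucene'    = 'ru.lucene'
    'PositionMicrosoft' = 'ru.microsoft'
}

# Start with the key field, then add one searchable field per analyzer.
$Fields = @(
    @{ 'name' = 'VacancyId'; 'type' = 'Edm.String'; 'key' = $True; 'searchable' = $False }
)
foreach ($Name in $Analyzers.Keys) {
    $Field = @{ 'name' = $Name; 'type' = 'Edm.String'; 'searchable' = $True }
    if ($Analyzers[$Name]) { $Field['analyzer'] = $Analyzers[$Name] }
    $Fields += $Field
}

$IndexDefinition = @{ 'name' = 'vacancies'; 'fields' = $Fields }

# Comma-separated list for the searchFields query parameter.
$SearchFields = $Analyzers.Keys -join ','
$SearchFields   # PositionRaw,PositionLucene,PositionMicrosoft
```

Every document then has to carry the same text in all three fields, and a query can pick which ones to hit by appending &searchFields=... with a subset of that list.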
Azure Search Suggesters and Analyzers
There is always something wrong just when everything seems to be fine :)
$IndexDefinition = @{
    'name'   = 'vacancies'
    'fields' = @(
        @{
            'name'        = 'VacancyId'
            'type'        = 'Edm.String'
            'searchable'  = $False
            'filterable'  = $False
            'sortable'    = $False
            'facetable'   = $False
            'key'         = $True
            'retrievable' = $True
        },
        @{
            'name'        = 'Position'
            'type'        = 'Edm.String'
            'searchable'  = $True
            'filterable'  = $True
            'sortable'    = $True
            'facetable'   = $True
            'key'         = $False
            'retrievable' = $True
            'analyzer'    = 'ru.microsoft'
        }
    )
    'suggesters' = @(
        @{
            'name'         = 'sg'
            'searchMode'   = 'analyzingInfixMatching'
            'sourceFields' = @('Position')
        }
    )
}
Invoke-RestMethod -Method Post -Uri 'https://testrus.search.windows.net/indexes?api-version=2015-02-28-Preview' -Headers $Headers -Body ($IndexDefinition | ConvertTo-Json -Depth 10)
If you try to create an index with suggesters over fields that use any analyzer, you will get: Field 'Position' in suggester 'sg' uses a custom analyzer, suggesters are not currently supported with custom analyzers.
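One possible workaround, going only by what the error message says, is to keep a second copy of the text in a field without an analyzer and point the suggester at that one. I have not tested this against the service, so treat it as a guess. `PositionSuggest` is a hypothetical field name, and attributes left out are assumed to fall back to the service defaults.

```powershell
# Hypothetical extra field: same text as Position, but no analyzer,
# so the suggester restriction from the error message should not apply.
$SuggestField = @{ 'name' = 'PositionSuggest'; 'type' = 'Edm.String'; 'searchable' = $True }

$IndexDefinition = @{
    'name'   = 'vacancies'
    'fields' = @(
        @{ 'name' = 'VacancyId'; 'type' = 'Edm.String'; 'key' = $True; 'searchable' = $False },
        @{ 'name' = 'Position'; 'type' = 'Edm.String'; 'searchable' = $True; 'analyzer' = 'ru.microsoft' },
        $SuggestField
    )
    'suggesters' = @(
        @{
            'name'         = 'sg'
            'searchMode'   = 'analyzingInfixMatching'
            'sourceFields' = @('PositionSuggest')
        }
    )
}

$Json = $IndexDefinition | ConvertTo-Json -Depth 10
$Json
```

Documents would then need the position text in both Position and PositionSuggest, and suggestions would come from unstemmed text, so they would not benefit from ru.microsoft.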
I hope to see this feature sometime in the future.