<p><a href="/blogs/viktor-sahlstrom/">Blog posts by Viktor Sahlström</a>, Optimizely World (updated 2016-08-22T18:55:48Z)</p>
<p><a href="/blogs/viktor-sahlstrom/dates/2016/82/episerver-find-stemming/">Episerver Find stemming</a> (2016-08-22T18:55:48Z)</p>
<p>One of the things that makes search challenging (and really interesting) is language handling. Natural languages differ from what a programmer is used to: their rules read more like a never-ending set of exceptions than actual rules. To analyze a string of text properly, the system must account for all of these exceptions.</p>
<p>A great feature of Episerver Find is <em>stemming</em>. Stemming reduces an inflected word to its root form (a.k.a. stem), for example, "fishing", "fished", and "fisher" have a root word of "fish." If the root word is determined, it can be used to return the full set of related items, thus improving retrievability and relevancy of search results.</p>
<p>Episerver Find uses the Snowball stemmers that ship as the default stemmers with Elasticsearch. These stemmers handle the general rules quite well but do not cover every special case. For many languages (English, for example), this works very well. But depending on the complexity of the language and the maturity of the stemmer, it is not always enough. Swedish is one case where the default stemmer falls short: it often maps unrelated words to the same stem, which in turn causes unexpected search hits.</p>
<p>As an example, consider the Swedish words “bananen” (the banana) and “banans” (the race tracks). Using normal Swedish stemming rules, both are stemmed down to “banan.” Any search term that is also stemmed down to “banan” therefore matches both words, even though half of the hits are not relevant.</p>
<p>To fix this, a list of exceptions has been added to the Find stemming. We have started with Swedish and will look at additional languages going forward. The new algorithm recognizes that “bananen” and “banans” are different words even though their stem is the same, so it creates unique tokens for them and the search engine can distinguish them at query time. This improves search relevancy in many cases. One case remains unsolved: “banan” (banana) and “banan” (the race track). In this form, the words are spelled exactly the same and cannot be distinguished without looking at the context, so search results are returned for both words.</p>
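<p>The collision and the exception list can be illustrated with a small sketch. This is a toy example, not the Snowball algorithm and not the actual Find implementation: the suffix rules, the exception table, and the function names naive_stem and stem_with_exceptions are all assumptions made for illustration.</p>

```python
# Toy illustration only: a naive suffix stripper, NOT the Snowball algorithm.
SUFFIXES = ("en", "s")  # hypothetical, deliberately tiny rule set

# Hypothetical exception list: words that keep their own token
# instead of being reduced to the shared stem.
EXCEPTIONS = {
    "bananen": "bananen",
    "banans": "banans",
}


def naive_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_with_exceptions(word):
    """Consult the exception list first, then fall back to the naive stemmer."""
    word = word.lower()
    return EXCEPTIONS.get(word, naive_stem(word))


if __name__ == "__main__":
    for w in ("bananen", "banans", "banan"):
        print(w, naive_stem(w), stem_with_exceptions(w))
```

<p>With the naive rules alone, all three words collapse to the same token. With the exception table, “bananen” and “banans” stay distinct, while the bare form “banan” remains ambiguous, which matches the unsolved case described above.</p>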
<p>To keep the list updated, we would love users and partners to let us know when they find searches that return unexpected results.</p>
<p><a href="/blogs/viktor-sahlstrom/dates/2016/82/episerver-find-persistent-queue/">Episerver Find persistent queue</a> (2016-08-22T18:38:12Z)</p>
<p>As part of the continuous effort to operate optimally in the cloud, we have made several changes to adapt the Episerver platform.</p>
<p>In Episerver Find 12.1.0, we changed a fundamental part of indexing to work better in a cloud environment, where several client instances may be present and there is less control over the actual computing power. Previously, a document to be indexed was put into a memory queue and processed by the client in order of submission. In a cloud environment (such as Azure), this approach may cause issues. For example, if an instance is stopped due to scaling or maintenance, the information in memory is lost, and those documents are not indexed. Also, with a memory queue, the indexing load could not be split across multiple instances unless the indexing was initiated on different machines.</p>
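<p>One way to avoid losing queued work when an instance stops is to keep the queue in a database rather than in memory. The following is a minimal sketch of that idea, not Episerver Find's implementation: it assumes SQLite as the store, and the class name PersistentIndexQueue, the table layout, and the hash-based check that skips unchanged resubmissions are all hypothetical. It is a single-process sketch and does not address locking between concurrent workers.</p>

```python
import hashlib
import sqlite3


class PersistentIndexQueue:
    """Hypothetical SQL-backed indexing queue with change detection."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
            "doc_id TEXT NOT NULL, body TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS seen ("
            "doc_id TEXT PRIMARY KEY, digest TEXT NOT NULL)"
        )
        conn.commit()

    def enqueue(self, doc_id, body):
        """Queue a document unless its content matches the last submission."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        row = self.conn.execute(
            "SELECT digest FROM seen WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is not None and row[0] == digest:
            return False  # unchanged content: nothing to index
        with self.conn:  # commits both statements as one transaction
            self.conn.execute(
                "INSERT INTO queue (doc_id, body) VALUES (?, ?)", (doc_id, body)
            )
            self.conn.execute(
                "INSERT OR REPLACE INTO seen (doc_id, digest) VALUES (?, ?)",
                (doc_id, digest),
            )
        return True

    def dequeue(self):
        """Return the oldest queued (doc_id, body) pair, or None if empty."""
        row = self.conn.execute(
            "SELECT seq, doc_id, body FROM queue ORDER BY seq LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        with self.conn:
            self.conn.execute("DELETE FROM queue WHERE seq = ?", (row[0],))
        return row[1], row[2]
```

<p>Because the queue lives in a database file rather than in process memory, any instance that can reach the database can pick up work, and pending items are still there after a restart.</p>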
<p>To solve these issues, the memory queue approach has been re-implemented using a persistent queue that stores requests and events in a SQL database. This way, data is maintained more reliably, and queue information is accessible from all machines in a cluster. No data is dropped if a machine is stopped. As an additional benefit, an item that is submitted repeatedly without any changes in the subsequent data sets is not indexed again. Finally, we added the ability to trace failed events. See also: <a href="/link/90631c2db74a49088f33278c95dfd242.aspx" target="_blank">Tracing events in Episerver Find</a></p>
<p><a href="/blogs/viktor-sahlstrom/dates/2016/1/improving-search-relevancy-for-attachments/">Improving attachment search relevancy using the attachment helper</a> (2016-01-13T09:19:00Z)</p>
<p>The way that Episerver Find treats attachments is not optimal in all scenarios. If you are a large company with lots of documents in many languages, you might notice that the search hits are not always relevant when trying to find a document. This is due to how Find handles indexing of attachments.</p>
<p>When indexing an attachment, the file is sent to Find as a Base64-encoded string. The string is parsed by Apache Tika, and the resulting text is indexed using the standard language analyzer. This approach creates several issues.</p>
<ul>
<li>A lot of the data passed from the client to Find is not really needed, such as images in a PDF. This causes an unnecessary flow of data over the network.</li>
<li>The parsed text is only indexed using the standard analyzer. This significantly reduces the quality of hits if the attachment content is not written in the standard language.</li>
<li>When browsing indexed content using the explorer view in the Find edit mode, an attachment's document does not contain the actual text but rather the Base64 representation of the file. This can make it hard to browse the index and to verify that the correct optimization is done.</li>
</ul>
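<p>The network cost in the first point can be made concrete with a short calculation. This is only a sketch using the Python standard library; the function name base64_payload_size and the 3000-byte sample are assumptions chosen for illustration, not values taken from Find.</p>

```python
import base64


def base64_payload_size(raw_bytes):
    """Return the number of characters needed to send raw_bytes as Base64."""
    return len(base64.b64encode(raw_bytes))


if __name__ == "__main__":
    raw = bytes(3000)  # stand-in for a binary attachment, e.g. an image-heavy PDF
    encoded = base64_payload_size(raw)
    # Base64 maps every 3 input bytes to 4 output characters.
    print(f"raw: {len(raw)} bytes, Base64: {encoded} characters")
```

<p>Base64 inflates every file by roughly a third, and that overhead is paid on the whole file, including parts such as images that contribute nothing to the searchable text.</p>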
<p>To solve these issues in one go, the Attachment Helper interface was created. This interface lets the developer decide how to handle attachments. Out of the box, there is an implementation created by Episerver that uses the Windows built-in IFilter features, available as the <a title="AttachmentFilter on Episerver nuget" href="http://nuget.episerver.com/en/OtherPages/Package/?packageId=EPiServer.Find.Cms.AttachmentFilter">EPiServer.Find.Cms.AttachmentFilter NuGet package</a>. This version supports a wide range of file types and is easy to get going.</p>
<p>Install the NuGet package and the IFilters that suit your needs, and your attachment search experience is vastly improved. You might notice that network traffic is reduced when running the index job, that your searches provide more relevant hits, and that the Find administrator can view the document content from inside the Find admin UI.</p>
<p>For more details on the search attachment filter, check out the <a title="attachment documentation" href="/link/a5b706c3cd7842e1802137db6a600430.aspx">docs</a>.</p>