
Stephan Lonntorp
Oct 28, 2016

URL Transliteration for EPiServer CMS 10

A while back we built a website that had a Chinese language version, and we had a few issues with URL segments not looking very nice. It took me a while to figure out that what I was looking for is called transliteration. I implemented a really hacky way of modifying the URL segment that EPiServer produces, so that I could inject my transliterated page name instead of the page name in Chinese.

Now in CMS 10, the old UrlSegment class has been removed, and instead we have IUrlSegmentGenerator, IUrlSegmentCreator and IUrlSegmentLocator. You can read more about that in the release note CMS-3824.

The default implementations of these interfaces are all internal, so it's still a bit hacky to extend, but I've implemented a transliterating UrlSegmentGenerator and swapped out the default implementation, so you don't have to.

OK, so what is transliteration, and why is this important?

Let's say we have a page named "伤寒论 勘误" (I don't know what that means, it's just some Chinese text that I copied). The default UrlSegmentGenerator would produce the URL segment "-", since everything but alphanumeric characters is stripped out, so the only thing that remains is the hyphen that replaces the whitespace character in the name.
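To make that concrete, here is a minimal Python sketch of that kind of stripping. It is not EPiServer's actual implementation: the function name `naive_segment` and the exact rules (whitespace becomes a hyphen, anything that is not an ASCII letter, digit or hyphen is dropped) are assumptions based on the behaviour described above.

```python
import re


def naive_segment(name: str) -> str:
    """Hypothetical stand-in for a default URL segment generator."""
    # Replace each run of whitespace with a single hyphen.
    hyphenated = re.sub(r"\s+", "-", name.strip())
    # Drop everything that is not an ASCII letter, digit or hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", hyphenated)
    return cleaned.lower()


print(naive_segment("About Us"))     # Latin text survives
print(naive_segment("伤寒论 勘误"))   # only the hyphen is left
```

With these assumed rules, a Latin page name keeps its words, while the Chinese name loses every character except the hyphen that replaced the space.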

Using transliteration, the Chinese characters are converted to their alphanumeric versions, so the same input string "伤寒论 勘误" is converted to "Shang Han Lun Kan Wu", and the transliterating UrlSegmentGenerator then produces the URL segment "shang-han-lun-kan-wu".
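The two steps, transliterate first and then build the segment, can be sketched in Python like this. This is only an illustration, not the code from the package: the lookup table `PINYIN` is hypothetical and covers just the five characters of the example, whereas a real transliteration library ships a far larger table, and the function names are made up for this sketch.

```python
import re

# Hypothetical lookup table for the five characters in the example only.
PINYIN = {"伤": "Shang", "寒": "Han", "论": "Lun", "勘": "Kan", "误": "Wu"}


def transliterate(name: str) -> str:
    """Replace each known character with its romanised form."""
    # Pad every replacement with spaces so adjacent characters become
    # separate words; characters not in the table are kept unchanged.
    parts = [f" {PINYIN[ch]} " if ch in PINYIN else ch for ch in name]
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(parts).split())


def transliterated_segment(name: str) -> str:
    """Transliterate first, then strip to a URL-friendly segment."""
    text = transliterate(name)
    hyphenated = re.sub(r"\s+", "-", text)
    cleaned = re.sub(r"[^A-Za-z0-9-]", "", hyphenated)
    return cleaned.lower()


print(transliterate("伤寒论 勘误"))
print(transliterated_segment("伤寒论 勘误"))
```

The important design point is the order: because transliteration runs before the stripping step, the stripping step only ever sees alphanumeric text and has something left to keep.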

Granted, I don't know Chinese, so I can't verify that this is 100% correct. But I do know that "shang-han-lun-kan-wu" is a better representation than "-", since three pages in Chinese, in the same location, would have the URL segments "-", "-1" and "-2" using the default generator.
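Here is a small Python sketch of how sibling pages could end up with "-", "-1" and "-2". It assumes that a colliding segment simply gets a counter appended, which matches the three values above but is otherwise a guess at the mechanism; `unique_segment` is a hypothetical helper, not an EPiServer API.

```python
def unique_segment(base: str, taken: set) -> str:
    """Return base, or base plus a counter if base is already in use."""
    candidate = base
    counter = 0
    while candidate in taken:
        counter += 1
        candidate = f"{base}{counter}"
    # Remember the segment so later siblings cannot reuse it.
    taken.add(candidate)
    return candidate


siblings = set()
# Three Chinese-named pages that all collapse to "-" before de-duplication.
print([unique_segment("-", siblings) for _ in range(3)])
```

None of the resulting segments says anything about the page it belongs to, which is the real problem with the stripped-down result.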

This approach should work for all languages, not just Chinese, but you'll have to test it for yourself. If you find any bugs, please let us know by sending a pull request.

The code is available at https://github.com/creunaab/EPi.UrlTransliterator, and a package with the same name should be available in the EPiServer NuGet feed shortly.


Comments

Oct 28, 2016 12:06 PM

Nice work!

Oct 28, 2016 03:08 PM

Yes, as you have pointed out, we have made it possible to change the default handling for URL segments.

We will officially support this from version 10.1.0 (no need to replace implementations in the container), where we have added encoding support as well (even though most browsers handle unencoded URLs with IRI characters, the recommendation is to encode such characters). It will be announced when we release 10.1.0.

Stephan Lonntorp Oct 28, 2016 03:25 PM

@Johan, care to elaborate? Have you implemented transliteration, or encoding? or both?

Oct 28, 2016 11:02 PM

There is a class, UrlSegmentOptions, registered as a singleton in the IoC container, where you can specify which regexp a URL segment should be validated against (this exists in CMS 10 as well), meaning you can, for example, specify a regexp that allows Unicode characters. So you can replace the default instance with your own instance.

What we have added in 10.1 is encoding, that is, IRI URLs get encoded. In CMS 10 those URLs will not be encoded (most browsers will handle them correctly anyway). In CMS 10.1 we have also opened up simple address to allow IRI characters.

Oct 28, 2016 11:07 PM

So to clarify, you do not need to replace IUrlSegmentGenerator in the IoC container; you can instead set the regexp on UrlSegmentOptions.

Vincent Nov 1, 2016 12:54 AM

Nice work mate.

I can read Chinese, and I can confirm each Chinese character is translated to appropriate Pinyin. 

Stephan Lonntorp Nov 2, 2016 01:57 PM

@code monkey: Thanks!

@Johan: Being able to replace the regexp isn't really useful for transliteration though. I use another library for transliteration, and AFAIK that has nothing to do with regular expressions. It's nice that you've made it configurable, but being able to change a regular expression really just caters to the use case of using regular expressions to generate URL segments.
