My SearchDataSource control is not returning pages that have been added recently. I'm assuming that it's stopped indexing, for some reason.
I'm just a little confused about the search architecture -- EPiServer has a "CMS Indexing Service," but my review of the API and the DLLs tells me that EPiServer is using Microsoft Indexing Service (if so, I can't figure out what catalog they're using...). Additionally, there's all sorts of Lucene integration floating around. The whole thing is a little black-box-ish.
So, in the end, what actually maintains the search index in EPiServer? And how do I get it to start indexing again? Is there some way to "reset" it?
Try restarting the EPiServer indexing service. You should easily find it under Services. If you check the VPP folders, there should be an indexing folder.
There's a setting in web.config for the delay before a newly created page gets indexed.
But I think restarting the service should do it. I've had to reset it several times in different environments.
Thanks, but that hasn't really solved it. Search works great on pages we created long ago, but the newer the page, the less likely it is to be in the index. It's like the indexer stopped indexing at some point in time.
I'd really like some more insight on how this thing works behind the scenes. I used Reflector to dig through the SearchDataSource control, and I see that there's a method in there that calls IndexServerSearch, which actually uses Microsoft Indexing Service to run a search on...something. I checked and I still only have the System and Web catalogs, and I never set anything else up, so I have no idea what index it's searching.
And, on top of all this, I have no idea how Lucene fits into all this.
A quick overview:
* The versioned VPP (files) is using the EPiServer Indexing Service which uses Lucene for the index.
* The native VPP (files) is using Microsoft Indexing Service and is the "classic" implementation and not enabled by default. Not dependent on the EPiServer Indexing Service.
* Searching for pages is using a custom search implementation stored as keywords in the database (see tblKeyword, tblPageKeyword). It listens for events and is not dependent on any indexing service. Implemented in EPiServer.LazyIndexer.
We are looking into consolidating this for a future version.
So Microsoft Indexing Services is essentially deprecated by default?
To sum up --
(1) Lucene indexes binary files, and (2) a custom SQL implementation indexes pages.
You are correct. We still support native file systems using MS Indexing, but by default we don't use it since you don't get permanent links and versioning if you go that route.
Thanks, Per. I have a ticket open with support on this, and it's gotten really odd. In particular, since the page indexing is event-based, not service-based, it couldn't have just "stopped indexing." Also, support has had me run SQL queries on the keywords in the database, and the results are not consistent.
I'll report back here with the solution. I appreciate the background info -- that helped clear up some questions for me.
One workaround is to delete the contents of tblKeyword and tblPageKeyword, and then run the following code:
// Collect the ID of every page in the site.
ArrayList array = new ArrayList();
IList pages = new EPiServer.DataAccess.PageListDB().ListAll();
foreach (EPiServer.Core.PageReference page in pages)
    array.Add(page.ID);
// Hand the collected IDs to the indexing job as an int[].
IndexPageJob job = new IndexPageJob((int[])array.ToArray(typeof(int)));
This will start the re-indexing process. It is time-consuming, and you only need to start it once.
Thanks for this code. I've wrapped a Scheduled Job around it, and I'm running it now.
One question, though -- Reflector tells me that IndexPageJob just passes the IDs to LazyIndexer which queues them up. What process actually clears the queue? Is this done in the Web process, or is it the EPiServer CMS Indexing Service?
LazyIndexer has a timer that checks the queue every minute (the "lazy" part, to get better performance when a lot of pages are being published). It is done in the web process.
The IndexPageJob is used internally when the application is being shut down, to make sure we don't lose unprocessed pages in the queue. That's why it looks a bit strange and just queues up pages.
You could also call LazyIndexer.IndexPage(pageID) to force an instant re-index of a page (no queues or timers involved).
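To make the queue-and-timer behaviour concrete, here is a minimal sketch of the pattern in plain C#. It is a toy model only: LazyQueueSketch and all of its members are made-up names for illustration, not EPiServer's actual classes, and the real LazyIndexer may well work differently inside.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: hypothetical names, not EPiServer's implementation.
public class LazyQueueSketch
{
    private readonly Queue<int> _pending = new Queue<int>();
    private readonly Action<int> _indexPage;
    private readonly object _sync = new object();

    // indexPage is whatever does the real (expensive) indexing work.
    public LazyQueueSketch(Action<int> indexPage)
    {
        _indexPage = indexPage;
    }

    // Called from a publish event: cheap, it only remembers the page ID.
    public void Enqueue(int pageId)
    {
        lock (_sync)
        {
            _pending.Enqueue(pageId);
        }
    }

    // Called from a timer tick. Returns how many pages were indexed.
    public int Drain()
    {
        int[] batch;
        lock (_sync)
        {
            batch = _pending.ToArray();
            _pending.Clear();
        }
        foreach (int pageId in batch)
        {
            _indexPage(pageId);
        }
        return batch.Length;
    }

    // Bypasses the queue entirely and indexes one page immediately.
    public void IndexNow(int pageId)
    {
        _indexPage(pageId);
    }

    // Unprocessed IDs, which a graceful shutdown could save for later.
    public int[] PendingSnapshot()
    {
        lock (_sync)
        {
            return _pending.ToArray();
        }
    }
}
```

In this model, anything still sitting in the queue when the process is killed is simply gone unless PendingSnapshot() was saved first, which is why a graceful shutdown matters.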
You make an interesting point there -- what happens if there's 1,000 pages in the LazyIndexer queue, and the process suddenly goes away? I don't see any persistence layer anywhere, so it strikes me that these pages just wouldn't be indexed.
If the app is shut down gracefully, you might be able to do something, but if it's reset suddenly, I think you end up with holes in the index. (And, I don't think there's a clean way around this, either.)
We store the queue when the appdomain unloads. You are correct that we cannot handle the case where something forcefully kills the process.
Okay, I've figured out what the problem is. I don't know why it's happening, but based on my reading, it's a bug in the indexing system.
I've dug into this issue and found that when a page (Page A, let's say) fetches its data from another page (Page B), the indexing system does not index the fetched properties.
Page B (the source page) is clearly indexed for all search terms in the page name and the searchable properties (MainBody, in this instance -- an XHTML field).
But Page A (the one fetching from Page B) is only indexed for its page name. None of the fetched properties are indexed. The page works fine otherwise -- when I render a property from Page A, it transparently fetches the MainBody from Page B behind the scenes. But this doesn't seem to extend to indexing, for some reason.
I have a ticket open with support on this. I'll update this thread when we come to some resolution on it.
I think I've proven this via an experiment. Consider --
Page A is fetching its content from Page B.
Reflecting through the API, I think I found the source of the bug.
To index a page, LazyIndexer calls PageTextIndexDB().LoadPageTextData(pageID). The problem is, that method doesn't use the API to get the text of the page to index. It makes a direct database call. Specifically, it calls editGetPageTextData, and passes the page ID and the language branch.
Since it goes straight to the database rather than through the API, it doesn't do any fetching of the properties. I looked through the stored proc, and while I don't claim to completely understand it, I'm pretty sure it's not doing any property fetching at all.
So, my workaround is this --
I'll create a LongString property called "Searchable Text." I'll mark this property as searchable, but not display it in Edit Mode. On page publishing, I'll use the API (which does fetch) to write the contents of any searchable properties to this hidden field. This should give the indexing system a text string with the correct values in it to index.
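To illustrate the idea, here is a rough sketch in plain C#. It deliberately avoids the real EPiServer types: PageSketch, Resolve, and SearchableTextSketch are hypothetical stand-ins, and the property name "SearchableText" is just my placeholder. Stored plays the role of what a direct database read would see for the page, while Resolve plays the role of the API, which falls back to the source page.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a page; not an EPiServer type.
public class PageSketch
{
    public string Name;
    public PageSketch FetchFrom; // the "fetch data from" source page, or null

    // Values stored on this page itself (what a raw database read would see).
    public Dictionary<string, string> Stored = new Dictionary<string, string>();

    // API-style read: use the local value, else fall back to the source page.
    public string Resolve(string property)
    {
        string value;
        if (Stored.TryGetValue(property, out value) && !string.IsNullOrEmpty(value))
        {
            return value;
        }
        return FetchFrom != null ? FetchFrom.Resolve(property) : null;
    }
}

public static class SearchableTextSketch
{
    // Placeholder name for the hidden, searchable property.
    public const string HiddenProperty = "SearchableText";

    // Meant to run when a page is published: copies the resolved values into
    // the hidden property, so a raw read of this page alone finds the text.
    public static void Update(PageSketch page, IEnumerable<string> searchableProperties)
    {
        StringBuilder text = new StringBuilder();
        foreach (string property in searchableProperties)
        {
            string value = page.Resolve(property);
            if (string.IsNullOrEmpty(value)) continue;
            if (text.Length > 0) text.Append(' ');
            text.Append(value);
        }
        page.Stored[HiddenProperty] = text.ToString();
    }
}
```

With Page A fetching from Page B, calling SearchableTextSketch.Update(pageA, new[] { "MainBody" }) leaves a copy of B's MainBody stored on A itself. One consequence of this design is that the copy is only as fresh as the last time Update ran for A, so a later change to B would not reach A's hidden field on its own.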
If anyone sees anything wrong with this plan, let me know.
Are you updating the hidden searchable-text property on Page A when publishing Page A, or when publishing Page B?