Using Find to index multiple sites


We are starting a new project that will be built with EPiServer 7.5

Our issue is that several sections of the site will link off to existing websites.
These live on subdomains, or are sometimes subsections of existing sites.

For example:

www.mysite.com
section.mysite.com
www.mysite.com/section2
section.mysite.com/section3

We will have no direct control over the existing sites, apart from making sure that meta tags are completed and providing updated headers and footers.

These "other" sites would fit under one of 8 main navigation items in the new site.


We want to implement a "global" search that covers not only the new EPiServer pages, but also all the existing pages.
When we get search results back, we want the user to be able to filter down further based on the section the result came from. It could be an actual child page of the section within EPiServer, but it could also be a page within one of the sub-sites.

I am envisioning creating a spider/service that would crawl the existing sites, create a simple object to represent each page, and manually add it to the index. At this stage I could define the "section" in which the site/page appears.
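
Something along these lines, perhaps (a rough sketch; all names are just illustrative):

```csharp
// minimal representation of an externally crawled page,
// with a Section property for filtering search results later
public class CrawledPage
{
    public string Url { get; set; }
    public string Title { get; set; }
    public string Text { get; set; }

    // which of the 8 main navigation sections this page sits under
    public string Section { get; set; }
}
```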

I hope this makes sense, but please let me know if any further clarification is needed.


#80368
Jan 21, 2014 17:23

I think Per Magne did a demo of something similar at an EPiServer Norway partner event a while ago. I'll try to get hold of the code.

#80463
Jan 23, 2014 10:55

Ooh, that sounds promising. It would be great if you could find something, thanks!

#80464
Jan 23, 2014 10:56

Hey Mari,

Were you able to find Per's example code?

Thanks

Danny

#80499
Jan 23, 2014 16:29

When crawling, save each scraped URL to an object that inherits from, or looks like, UnifiedSearchHit. That way you can easily add everything to the UnifiedSearchRegistry (example: http://joelabrahamsson.com/docs/episerver-find-alloy-search-page/findinitialization.html).

I would also think about sitting tight for a bit, since the next generation of Find will, if I recall correctly, have crawling capabilities available.
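
To illustrate, adding a custom type to unified search is typically a one-liner in an initialization module (a sketch along the lines of the linked example; CrawledPage is a placeholder for your own type):

```csharp
using EPiServer.Find;
using EPiServer.Find.Framework;
using EPiServer.Framework;
using EPiServer.Framework.Initialization;

[InitializableModule]
public class FindInitialization : IInitializableModule
{
    public void Initialize(InitializationEngine context)
    {
        // include the crawled type in unified search results
        SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add<CrawledPage>();
    }

    public void Uninitialize(InitializationEngine context) { }
}
```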


#80501
Jan 23, 2014 17:13

Cool, I was thinking something along those lines. Good tip about inheriting UnifiedSearchHit, thanks.
I am also thinking about using alchemyapi.com to extract keywords and key content for adding to the index. I could then have facet results.

Oh, great, that sounds perfect. Good to know it will indeed have crawling capabilities.

EPiServer are presenting to our client next week, so I'll put a question to them about this.
Any idea when next gen Find is coming?


Thanks for the heads up!


#80502
Jan 23, 2014 17:20

Hi,

I did indeed create a crawler for Find at a partner event. It is very similar to your original idea, Danny.

For this particular demo I used NCrawler (http://ncrawler.codeplex.com/) for the crawling part and HtmlAgilityPack for the scraping part.

If you decide to use NCrawler, here is some basic code to get you started:

First, set up the crawler and execute the crawl. A scheduled job would be perfect for this:

    using (Crawler c = new Crawler(new Uri("http://yoururlgoeshere.com/"), new HtmlDocumentProcessor(), new Step()))
    {
        // you could set a maximum crawl count or time
        c.MaximumCrawlCount = 100;
        c.MaximumCrawlTime = new TimeSpan(0, 0, 1, 0);

        // you could exclude certain files and paths
        c.ExcludeFilter = new[]
            {
                new RegexFilter(
                    new Regex(@"(\.jpg|\.css|\.js|\.gif|\.jpeg|\.png|\.ico|=atom|=rss)",
                              RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase))
            };
        c.Crawl();
    }

The "Step" class is executed for every URL that is crawled, and is where we do the actual indexing:

    public class Step : IPipelineStep
    {
        public void Process(Crawler crawler, PropertyBag propertyBag)
        {
            var meta = (string[])propertyBag["Meta"].Value;
            var crawledItem = new CrawledItem
                {
                    Url = propertyBag.Step.Uri.ToString().ToLower(),
                    Title = propertyBag.Title,
                    Text = propertyBag.Text,
                    Published = MetaDataExtracter.GetPublishedFromMetaData(meta),
                    // ... and so on
                };
            SearchClient.Instance.Index(crawledItem);
        }
    }


You should probably collect the objects in a list and then index them in bulk, instead of one by one. I've written a blog post about that earlier:

http://world.episerver.com/Blogs/Per-Magne-Skuseth/Dates/2013/5/EPiServer-Find-Bulks-please/
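
A rough sketch of that batching (assuming the Index overload that accepts multiple objects; see the post above for the details):

```csharp
// collect items while crawling, then push them to Find in batches
var items = new List<CrawledItem>();
// ... filled during the crawl ...

const int batchSize = 100;
for (int i = 0; i < items.Count; i += batchSize)
{
    // one request per batch instead of one request per item
    var batch = items.Skip(i).Take(batchSize).ToArray();
    SearchClient.Instance.Index(batch);
}
```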


The GetPublishedFromMetaData helper method:

        public static DateTime GetPublishedFromMetaData(string[] metatagValues)
        {
            // falls back to DateTime.MinValue if no last-modified meta tag is found
            var lastModified = metatagValues.FirstOrDefault(x => x.Contains("last-modified"));
            return lastModified == null
                ? DateTime.MinValue
                : Convert.ToDateTime(lastModified.Substring(15));
        }

And you could then add CrawledItem, or whatever you name it, to the unified search, as suggested by Johan.
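
And if the crawled type carries a section, the filter-by-section requirement from the original post could be sketched with a terms facet (assuming a string Section property on CrawledItem; the field name is just illustrative):

```csharp
// group hits by section so the UI can offer section filters
var result = SearchClient.Instance.Search<CrawledItem>()
    .For(query)
    .TermsFacetFor(x => x.Section)
    .GetResult();

// or restrict the search to a single section
var sectionOnly = SearchClient.Instance.Search<CrawledItem>()
    .For(query)
    .Filter(x => x.Section.Match("Section2"))
    .GetResult();
```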

I'm not sure when the next version of Find will be released. Maybe you'll get some answers in your meeting next week :-)


#80503
Edited, Jan 23, 2014 17:42

That's fantastic, thanks Per. Just what I needed, and good to know my conceptual idea is not a bad one!
This should be quite an interesting project: 10,000+ pages/PDF files from multiple sites, pulled into one highly searchable index.

#80504
Jan 23, 2014 17:49

Happy to help! That does indeed sound like an interesting project.
However, I would have to agree with Johan. You might want to sit tight for a while and wait for the next version of Find.

#80506
Jan 23, 2014 17:58

Fingers crossed it'll be out in time. I am hoping it will be!

#80507
Jan 23, 2014 17:59

Thanks, Per Magne. This was the code I was referring to! :)

#80509
Jan 23, 2014 18:32
This topic was created over six months ago and has been resolved. If you have a similar question, please create a new topic and refer to this one.