Searching for Word Documents and PDFs

Jonathan Roberts
Member since: 2012
 

Is it possible to search Word and PDF documents using FIND? I want a generic search box on the master page. When a user types in a word, the search should look at ALL the page types (templates) that use a generic content Composer block, and it also needs to search all Word and PDF documents on the page. I then need to identify the different types (.doc, .pdf, web page) so it is clear to the user where the word was found.

Is this possible?

Thanks,

Jonathan

#66308 Feb 26, 2013 17:40
  • Johan Petersson
    Member since: 2007
     

    Hi,

    Sure you can. For complete documentation, please see http://find.episerver.com/Documentation/dotnet-api-attachments.

    If you want to search across all kinds of data types, you should consider Unified Search: http://joelabrahamsson.com/entry/new-in-episerver-find-unified-search.

    #66311 Feb 26, 2013 18:42
  • Jonathan Roberts
    Member since: 2012
     

    Do you have any code yourself? The code on this site is very minimal and doesn't work in my environment. I'm using EPiServer 6 R2 with the latest FIND.

    The code looks like:

    var result = SearchClient.Instance
                     .UnifiedSearchFor(Query, EPiServer.Find.Language.English)
                     .Take(PageSize)
                     .Skip((ActivePageNumber - 1) * PageSize)
                     .GetResult();

    But result is empty, even though I use a generic query such as 'Test' or 'the' or something I know exists in the pages.

    Thanks

    #66324 Feb 27, 2013 11:11
  •  

    Unified Search exists in the general .NET client API for the CMS 6 integration. However, pages and files aren't automatically "included" the way they are in the CMS 7 integration. Luckily, Johan has written an awesome post about how to include them yourself.

    #66325 Feb 27, 2013 11:14
  • Johan Petersson
    Member since: 2007
     

    Hi again,


    Actually I wrote a blog post a couple of days ago about using Unified Search with EPiServer 6.

    There are some caveats because Unified Search isn't fully backported to EPi 6 and was developed for EPi 7.

    #66326 Feb 27, 2013 11:14
  • Jonathan Roberts
    Member since: 2012
     

    I had a look at the code and you'll have to excuse me, but is there a sample where the code passes in the search query word? Basically I can see the code, but I can't see where it is searching for a particular word.

    #66327 Feb 27, 2013 11:28
  • Johan Petersson
    Member since: 2007
     

    My blog post just shows how to set up Find in EPiServer 6. Joel's blog post goes into more detail about how to actually search for something.

    #66328 Edited, Feb 27, 2013 11:33
  • Jonathan Roberts
    Member since: 2012
     

    Hi, thanks for the updates. Is there any sample that puts both sets of code together, so I can see it all working as one?

    #66329 Feb 27, 2013 11:43
  • Jonathan Roberts
    Member since: 2012
     

    Hi,
    Do you have a full working set of code for EPiServer 6 R2? For example:
    the search page using a repeater to display the results, and the required classes, please?
    I have tried to work out where all your code goes but I'm having no luck. I was led to believe when we purchased the FIND licence that we could easily search pages and PDF and Word documents, but this doesn't seem to be the case.
    Many thanks
    Jonathan

    #66332 Feb 27, 2013 12:20
  •  

    Hi,

    Put the below line of code in an initialization module or in application start in global.asax:

    FileIndexer.Instance.Conventions.ShouldIndexVPPConvention 
      = new VisibleInFilemanagerVPPIndexingConvention();

    Now when you upload a file into a VPP other than page files, or you run the re-indexing job, files will be indexed. See more info in the documentation where you can also see how page files indexing can be enabled if you want that.
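
    As a rough sketch of where that line could live, the initialization module below sets the convention once at startup. The class name FileIndexingInitialization is made up for illustration, and the two using directives for the Find types (EPiServer.Find.Cms and EPiServer.Find.Cms.Conventions) are assumptions that should be checked against the Find assemblies actually installed in the project.

    ```csharp
    using EPiServer.Find.Cms;              // assumed namespace of FileIndexer
    using EPiServer.Find.Cms.Conventions;  // assumed namespace of VisibleInFilemanagerVPPIndexingConvention
    using EPiServer.Framework;
    using EPiServer.Framework.Initialization;

    // Hypothetical module name; any class implementing IInitializableModule will do.
    [InitializableModule]
    [ModuleDependency(typeof(EPiServer.Web.InitializationModule))]
    public class FileIndexingInitialization : IInitializableModule
    {
        public void Initialize(InitializationEngine context)
        {
            // Index files in every VPP that is visible in the file manager.
            FileIndexer.Instance.Conventions.ShouldIndexVPPConvention
                = new VisibleInFilemanagerVPPIndexingConvention();
        }

        // Nothing to tear down or preload for this convention.
        public void Uninitialize(InitializationEngine context) { }

        public void Preload(string[] parameters) { }
    }
    ```

    Files uploaded before the module was added are only picked up after the re-indexing job has run.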

    After enabling file indexing you can search for files like this:

    SearchClient.Instance.Search<UnifiedFile>()
      .For("banana")
      .GetFilesResult();

    To search for both files and pages you can do

    public class SearchHit
    {
      //Add more stuff you need here
      public string Title { get; set; }
      public string Url { get; set; }
    }
    
    SearchClient.Instance.Search<PageData>()
      .For("banana")
      .Select(x => new SearchHit
        { 
          Title = x.PageName, 
          Url = x.LinkURL 
        })
      .IncludeType<SearchHit, UnifiedFile>(x => new SearchHit 
        {
          Title = x.Name,
          Url = x.VirtualPath
        })
      .GetResult();

    However, the above can be done much more easily with the UnifiedSearch concept, which works great out-of-the-box with EPiServer 7. It's possible to use it with EPiServer 6 as well, but it will require you to adopt the code from Johan's blog post.

     

    You may also be helped by this and this.

    #66334 Edited, Feb 27, 2013 12:57
  • Jonathan Roberts
    Member since: 2012
     

    Hi, thanks for all your help. I think I'm going to quit my job and start selling shells to tourists in Bognor Regis - this is so complicated. What is FileIndexer? I can't find that anywhere. I thought I was a good programmer - now I feel like I'm back in year one. I'm really sorry for not getting this.

    Sorry if I didn't mention it - I'm using EPiServer 6 R2.

    #66335 Edited, Feb 27, 2013 13:08
  •  

    FileIndexer indexes files :)

    It hooks into events in the EPiServer API and indexes files when they are uploaded.

    However, by default uploaded files aren't indexed but that can easily be enabled with the code I posted above. It's also possible to control exactly what files are indexed using custom conventions, but the code example above works great in most cases.

    #66336 Feb 27, 2013 13:12
  •  

    The gist of what both Johan and I are saying can be summed up like this:

    1. Download Johan's code for unified search and include it in your project.

    2. Download the source code for my ready-to-use search page for EPiServer 7 here.

    3. In the code for the search page there's a class named FindInitialization. Include that in your project.

    4. Look at, or copy and paste the rest of the source code for the search page. If you copy and paste everything you'll have to do some modifications due to the code being written for EPi 7.

    You should now have a ready to use search page.

    #66338 Feb 27, 2013 13:17
  • Jonathan Roberts
    Member since: 2012
     

    Hi, I'm using EPiServer 6 R2 - do I just need to take out the bits from your EPiServer 7 code?

    #66339 Feb 27, 2013 13:21
  •  

    In #2 above I link to code for a ready-to-use search page for an EPiServer 7 site. Provided that you adopt the code for adding pages and files to unified search, which Johan describes and has made downloadable, the code I linked to should *almost* work with EPiServer 6 R2. However, there will of course be a few things that have to be adjusted. For instance, EPiServer 6 doesn't have strongly typed pages, and the downloadable code contains a page type class. You can either modify that to use Page Type Builder or create the corresponding page type in admin mode.

    #66340 Feb 27, 2013 13:26
  • Jonathan Roberts
    Member since: 2012
     

    Can you tell me why I get asked for a Language when I add this line:

    var query = SearchClient.Instance.UnifiedSearchFor(Query);

    #66354 Feb 27, 2013 14:52
  • Jonathan Roberts
    Member since: 2012
     

    Hi,

    I have added all the code to my application and it doesn't work, even on the first two lines of the search:

    var query = SearchClient.Instance.UnifiedSearchFor(Query, EPiServer.Find.Language.English);
    Response.Write("QUERY COUNT:" + query.Count());

    It comes back with ZERO.

    I then continue with my code:

    query = query.TermsFacetFor(x => x.SearchSection)
        .Skip((ActivePageNumber - 1) * PageSize)
        .Take(PageSize)
        .ApplyBestBets();

    var hitSpec = new HitSpecification
    {
        HighlightTitle = true,
        HighlightExcerpt = false,
        ExcerptLength = 120
    };

    Results = query.GetResult(hitSpec, true);

    Results has ZERO.

    I have indexed the site about 5 times. Each run takes ages and it still brings back nothing.

     

    #66355 Feb 27, 2013 15:00
  • Jonathan Roberts
    Member since: 2012
     

    Looking at the code, it seems that the indexer would index the whole VPP. We don't want to index the VPP, as there could be documents that shouldn't be found yet; we only want to search documents that have been published as links on pages. Is this something that the code does?

    #66377 Feb 27, 2013 16:47
  • Johan Petersson
    Member since: 2007
     

    Jonathan: Are you deleting your questions immediately after you've posted them? I get notified by e-mail, but they're not showing up here.


    To search only for files that are linked from pages, you have to add a filter that checks whether the file is actually linked, by querying the soft link API.

    There is no built-in functionality for this in Find. But that's actually one of my feature requests.

    #66378 Feb 27, 2013 17:01
  • Jonathan Roberts
    Member since: 2012
     

    Hi, I thought it was deleting them too, but the posts have been paged - there are now two pages.

    Blimey - I think I'm going to have to knock this on the head as this is way too complicated. I have spent days on this and I'm not getting anywhere.

    I can search ONE page type and that's as much as I have working. I'm amazed that there isn't a nice set of code available. My client has paid so much for EPiServer and FIND, and for it not to work properly is amazing. I'm really appreciative of all your help, but it's all very cryptic. Is there any source available in just one place? So far I have read about 5 blogs and tried implementing all variations of code and have got nowhere.

    Cheers,

    Jon

    #66379 Feb 27, 2013 17:06
  • Johan Petersson
    Member since: 2007
     

    Haha, yeah the paging is not that obvious.

    Have you read the documentation, http://find.episerver.com/Documentation? And especially the EPiServer 6 integration part. That's a good starting point.

    This is an example of indexing conventions that should work: http://pastebin.com/3xC5B1SM. You can find all extension methods used in this class in my blog post. You also have to implement the ISearchContent interface in your page types or a base class, as mentioned in my post. Why? I think Joel's blog post explains it well.

    Then after a re-index you should be able to search with your above posted code.

    #66385 Edited, Feb 27, 2013 17:59
  • Johan Petersson
    Member since: 2007
     

    Updated the link to the code.

    #66386 Feb 27, 2013 18:01
  • Jonathan Roberts
    Member since: 2012
     

    Hi, thanks for all your help over the last couple of days. Our client wanted to search only published documents on the site and not the entire VPP.

    I have also realised that Unified Search just doesn't work on my EPiServer 6 R2. We are using page types and have multiple templates, such as a 2 col template, a 3 col template and a home page template, and we need to search all these different page types or templates.

    I can search the 3 col template but I can't find how to search multiple templates. Is there any sample code that explains this, if it is possible at all?

    Thanks again

    #66435 Edited, Feb 28, 2013 14:40
  •  

    Hi,

    By template, do you mean page type? Unified search should search all pages no matter the page type. However, if by template you mean .aspx files, I'm a bit confused :)

    #66436 Feb 28, 2013 14:46
  • Johan Petersson
    Member since: 2007
     

    When you say templates, do you mean different Page Type Builder page types?

    Unified Search's main goal is to be able to search over multiple types.

    If you add

    SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add<PageData>()

    you will search for all page types.

    If you just want to search for, let's say two types and you have Page Type Builder you can specify those types instead:

    SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add(typeof(StartPage));
    SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add(typeof(RegularPage));

        

    #66437 Feb 28, 2013 14:46
  • Jonathan Roberts
    Member since: 2012
     

    This is the code I am currently using:

    protected SearchResults<SearchResultHit> Results { get; set; }

    IClient client3 = Client.CreateFromConfig();
    var query3 = client3.Search<StandardPage3ColType>()
        //var query = client.Search<NewsArticleType>()
        .For(Query);

    //.TermsFacetFor(x => x.Type);

    ////Filter on Language
    //query = query.Filter(x => x.LanguageID.Match("en-GB"));

    var results = query3.Select(x => new SearchResultHit
        {
            Title = !string.IsNullOrEmpty(x.PageName.AsHighlighted()) ? x.PageName.AsHighlighted() : x.PageName,
            Url = x.LinkURL,
            Description = !string.IsNullOrEmpty(x.MegaMenuDescription.AsHighlighted()) ? x.MegaMenuDescription.AsHighlighted() : x.MegaMenuDescription,
            CategoryOut = ""
        })
        .Take(PageSize)
        .Skip((ActivePageNumber - 1) * PageSize)
        .GetResult();

    Results = results;
    ItemCount = Results.TotalMatching;

    plSearchResults.DataSource = Results;
    plSearchResults.DataBind();

    This works well, but I'm not sure what the full code for using UnifiedSearch looks like.

    #66441 Feb 28, 2013 14:57
  • Jonathan Roberts
    Member since: 2012
     

    Is this something that can be achieved?

    #66444 Feb 28, 2013 15:39
  •  

    Hi,

    First of all, use SearchClient.Instance instead of Client.CreateFromConfig().

    Next, if you have added the unified search code from Johan's blog post/download, you should replace "client3.Search<StandardPage3ColType>().For(Query)" with "SearchClient.Instance.UnifiedSearchFor(Query, Language.English)". Finally, remove the call to the Select method.

    If you do that it should work fine. If it doesn't, please post details similar to the code you pasted above, but in its modified form, and I'm sure either Johan or I can help you.

    #66445 Feb 28, 2013 15:46
  •  
    #66448 Feb 28, 2013 16:13
  •  

    You get asked for a language so that stemming can be applied to the query. If you want it resolved from ContentLanguage.PreferredCulture, you could use the EPiServer.Find.Cms namespace, which has an extension for UnifiedSearchFor that picks up your preferred culture automatically.

    #66450 Feb 28, 2013 16:16
  • Jonathan Roberts
    Member since: 2012
     

    Hi,

    Great. Many thanks for all your help. I have it bringing back results - yeehaa!

    But it only brings back Excerpt and not Title or URL. What would be the reason for this?

     

    Cheers

    #66454 Feb 28, 2013 17:22
  •  

    Glad to hear it!

    Title is fetched from a property/field named SearchTitle and Url from SearchHitUrl. In Johan's post, in the last code example, he includes those. Do you have similar code?

    #66455 Feb 28, 2013 17:29
  • Jonathan Roberts
    Member since: 2012
     

    I have a class called PageTypeBase which looks like his code. Do I need to reference this in my search page in some way? My page has this:

    public partial class SiteSearchResults : TemplatePage<SiteSearchResultsType>

    And SiteSearchResultsType looks like this:

    namespace cSkills2S.Templates.PageTypes
    {
        [PageType(
            Filename = "/Templates/PageTemplates/SiteSearchResults.aspx",
            Name = "Site Search Results",
            AvailableInEditMode = true,
            Description = "The Site Search Results")]
        public class SiteSearchResultsType : PageTypeBase
        {
            [PageTypeProperty(
                Type = typeof(ExtensionContentAreaProperty),
                DisplayInEditMode = true,
                HelpText = "Area for other sections",
                Tab = typeof(ComposerTab))]
            public virtual string ContentBottom { get; set; }

            [PageTypeProperty(
                Type = typeof(ExtensionPageProperty),
                DisplayInEditMode = true,
                Searchable = false,
                UniqueValuePerLanguage = true,
                HelpText = "Specialized For Extension Added By Extension (Do not remove)",
                Tab = typeof(ComposerTab))]
            public virtual string ExtensionPageProperty { get; set; }

            [PageTypeProperty(
                Type = typeof(PropertyLongString),
                DisplayInEditMode = true,
                HelpText = "Text to show in Mega Menu when there is no child navigation to show")]
            public virtual string MegaMenuDescription { get; set; }

            [PageTypeProperty(
                Type = typeof(PropertyDropDownList),
                DisplayInEditMode = true,
                Required = true,
                HelpText = "Select the coloured column that you want your top level page to appear in")]
            public virtual string MegaMenuColumn { get; set; }
        }
    }

        Does this look ok?

    #66457 Feb 28, 2013 17:39
  • Johan Petersson
    Member since: 2007
     

    In your PageTypeBase class you have to implement the ISearchContent interface, or create public properties with the same names as those in ISearchContent.

    #66458 Feb 28, 2013 17:42
  • Jonathan Roberts
    Member since: 2012
     

    It looks like this:

    namespace cSkills2S.Templates.PageTypes
    {
        public class PageTypeBase : TypedPageData
        {
            public string SearchTitle
            {
                get { return this.Property.ExistsLocally("PageTitle") && this["PageTitle"] != null ? this["PageTitle"].ToString() : this.PageName; }
            }
    
            public string SearchHitUrl
            {
                get { return ""; }
            }
    
            public string SearchSection
            {
                get
                {
                    PageData sectionPage = this;
    
                    while (!sectionPage.ParentLink.CompareToIgnoreWorkID(PageReference.EmptyReference) &&
                           !sectionPage.ParentLink.CompareToIgnoreWorkID(PageReference.StartPage) &&
                           !sectionPage.ParentLink.CompareToIgnoreWorkID(PageReference.RootPage))
                    {
                        sectionPage = DataFactory.Instance.GetPage(sectionPage.ParentLink);
                    }
    
                    if (!sectionPage.PageLink.CompareToIgnoreWorkID(PageReference.StartPage))
                    {
                        return sectionPage.PageName;
                    }
    
                    return string.Empty;
                }
            }
    
            public string SearchHitTypeName
            {
                get { return "Web page"; }
            }
    
            public string SearchTypeName
            {
                get
                {
                    if (this.PageTypeID == 1)
                    {
                        return "News";
                    }
    
                    if (this.PageTypeID == 2)
                    {
                        return "Contact persons";
                    }
    
                    return "Other";
                }
            }
    
            public DateTime? SearchPublishDate
            {
                get { return this.Changed; }
            }
    
            public DateTime? SearchUpdateDate
            {
                get { return this.StartPublish; }
            }
    
        }
    }

        

    #66459 Feb 28, 2013 17:45
  • Johan Petersson
    Member since: 2007
     

    You're just returning an empty string in SearchHitUrl. But you should get the title back. Have you re-indexed the site after you added the properties to this base class?

    #66460 Feb 28, 2013 17:49
  • Jonathan Roberts
    Member since: 2012
     

    Thanks.

    I have been indexing it for a while but it keeps saying "Thread was being aborted".

    But I'll fix the URL. Thanks again, you guys have been very patient. I'm not the world's best programmer and you have helped a lot. Many thanks :)

    #66461 Feb 28, 2013 17:51
  • Jonathan Roberts
    Member since: 2012
     

    Hi, one last question. I have all the code in and the indexing went well - it took a while but finally finished. I have searchable content in my templates and I have set up code similar to that in your blogs. The Excerpt is working but not the other fields such as Title or URL. In your blog you set the URL from this.ExternalURL, but I don't have this.

    Where would Excerpt be coming from if the other bits don't work?

    Thanks

    #66464 Feb 28, 2013 18:23
  •  

    Hi,

    If your "PageTitle" is an empty string, it is not null, so that empty string is what your SearchTitle property returns. Instead of using "this["PageTitle"] != null", use "!string.IsNullOrEmpty(this["PageTitle"] as string)".

    As for the SearchHitUrl, you could return "this.LinkURL".
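
    The title fallback can be sketched as a small standalone helper, so the logic can be tried without EPiServer. SearchTitleResolver is a hypothetical name, not part of Find or the CMS; the pageTitle argument stands in for this["PageTitle"] and pageName for this.PageName.

    ```csharp
    using System;

    // Hypothetical helper, not part of EPiServer or Find.
    public static class SearchTitleResolver
    {
        // Returns the custom title when it is non-empty, otherwise the page name.
        public static string Resolve(object pageTitle, string pageName)
        {
            // Convert.ToString returns an empty string for a null object.
            string title = Convert.ToString(pageTitle);
            return string.IsNullOrEmpty(title) ? pageName : title;
        }
    }
    ```

    In the base class the property would then read something like: get { return SearchTitleResolver.Resolve(this["PageTitle"], this.PageName); }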

    Regards,
    Henrik

    #66467 Feb 28, 2013 23:13
  • Jonathan Roberts
    Member since: 2012
     

    Has anyone had success using UnifiedSearch? It says on the EPiServer Find website that using Find is easy. I have been trying to get this to work for 3 days solid and I have no idea what is wrong. I have read the blogs and I have been helped by the guys above, but I have had no success. I'm using EPiServer 6 R2, NOT EPiServer 7. I'm using Composer to populate the pages and using multiple template types. I index the site continuously but nothing is displayed: I get the correct number of pages returned, but each row shown in the Repeater is blank. The titles and URLs are blank.

    No idea

    #66480 Mar 01, 2013 11:08
  • Jonathan Roberts
    Member since: 2012
     

    I have been on this for over 1 week straight, doing nothing more than trying to get EPiServer FIND to work, BUT I have failed. I have no idea why, but it just doesn't work. Is there a built-in search that can be used instead?

    #66526 Mar 04, 2013 12:15
  • Johan Petersson
    Member since: 2007
     

    I saw in a different thread that you were using Composer. In that case I would recommend a crawling search engine, or overriding the SearchText property so that it also includes the rendered page (instead of only a concatenated string of all searchable properties on that page). Find has no clue how your pages are built or which Composer block is inserted on which page.

    I thought you managed to get the search working.

    #66533 Mar 04, 2013 13:21
  • Johan Petersson
    Member since: 2007
     

    BTW, you can always call Expert Services.

    #66534 Mar 04, 2013 13:24
  • Jonathan Roberts
    Member since: 2012
     

    I thought I did, but it's now pulling duplicates. It has to be me. I must have created a complicated site with too many controls.

    I thought there might be a way of trimming the duplicates, but all FIND is doing is pulling every content block control that matches the query string. I then split the control ID, for example CF__345_345_322_22, take the first ID, which is the page ID, call a function to get the page name, URL etc., and populate my repeater that way. Sometimes FIND brings back what look like duplicate pages, but it's actually bringing back controls that are on the same page. For example, the IDs on one page would be CF__345_345_322_22 and CF__345_345_322_23.

    Sorry if this doesn't make much sense - neither does my code :)
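
    The trimming described in this post can be sketched roughly as below. It assumes, based only on the example IDs above, that a control ID is a prefix followed by underscore-separated numbers with the page ID first; ComposerHits and its methods are hypothetical names.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical helper for collapsing several Composer control hits into one hit per page.
    public static class ComposerHits
    {
        // Returns the page ID from an ID such as "CF__345_345_322_22",
        // or null when the ID does not have the assumed shape.
        public static int? ExtractPageId(string controlId)
        {
            if (string.IsNullOrEmpty(controlId)) return null;
            string[] parts = controlId.Split(new[] { '_' }, StringSplitOptions.RemoveEmptyEntries);
            int pageId;
            if (parts.Length < 2 || !int.TryParse(parts[1], out pageId)) return null;
            return pageId;
        }

        // Keeps the first occurrence of each page ID, preserving the order of the hits.
        public static List<int> DistinctPageIds(IEnumerable<string> controlIds)
        {
            var seen = new HashSet<int>();
            var ordered = new List<int>();
            foreach (string controlId in controlIds)
            {
                int? pageId = ExtractPageId(controlId);
                if (pageId.HasValue && seen.Add(pageId.Value)) ordered.Add(pageId.Value);
            }
            return ordered;
        }
    }
    ```

    Note that trimming after the search has run means a page of results can end up shorter than PageSize, and the total hit count still counts controls rather than pages.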

    #66535 Mar 04, 2013 13:28
  • Johan Petersson
    Member since: 2007
     

    In your case I would exclude all Composer content blocks and just index pages, BUT also include a rendered version of the page in the SearchText property. When you're rendering the page for Find you could exclude navigation, header, footer and so on. I've posted some code that shows how to include extra content: http://world.episerver.com/Modules/Forum/Pages/Thread.aspx?id=65814&epslanguage=en. In your case you have to do a web request to download the page, then remove all HTML tags, navigation etc.
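
    A minimal sketch of the download-and-strip step follows. RenderedPageText is a hypothetical helper, the regular expressions are a crude approximation rather than a real HTML parser, and WebUtility.HtmlDecode assumes .NET 4.0 or later (HttpUtility.HtmlDecode in System.Web is the older equivalent). How the resulting text is fed into SearchText is covered by the thread linked above and is not shown here.

    ```csharp
    using System.Net;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical helper, not part of EPiServer or Find.
    public static class RenderedPageText
    {
        // Downloads the rendered HTML of a page.
        public static string Download(string url)
        {
            using (var client = new WebClient())
            {
                client.Encoding = Encoding.UTF8;
                return client.DownloadString(url);
            }
        }

        // Reduces HTML to plain text: drops script/style blocks, removes tags,
        // decodes entities and collapses whitespace.
        public static string StripHtml(string html)
        {
            if (string.IsNullOrEmpty(html)) return string.Empty;
            string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", " ",
                RegexOptions.Singleline | RegexOptions.IgnoreCase);
            text = Regex.Replace(text, @"<[^>]+>", " ");
            text = WebUtility.HtmlDecode(text);
            return Regex.Replace(text, @"\s+", " ").Trim();
        }
    }
    ```

    Navigation, header and footer are easiest to leave out at rendering time, for example by not outputting them when the request comes from the indexing code, rather than trying to strip them afterwards.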

    #66536 Mar 04, 2013 13:36
  •  

    Hi Jonathan,

    Please contact EPiServer Developer Support for assistance with this if needed. I have informed them about your situation and they will assist you as soon as possible.

    Best Regards

    Marcus

    #66539 Mar 04, 2013 14:55
  • bjarne.somme
    Member since: 2010
     

    I'm implementing EPiServer Find for CMS 6 R2 and have added the IndexingConventions class, but I'm getting a reference error at compile time for FilterForVisitor and FilterOnUnifiedFileReadAccess:

     

    SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add<PageData>().PublicSearchFilter(c => c.BuildFilter<PageData>().FilterForVisitor<PageData>());

    and

    SearchClient.Instance.Conventions.UnifiedSearchRegistry.Add<UnifiedFile>().PublicSearchFilter(c => c.BuildFilter<UnifiedFile>().FilterOnUnifiedFileReadAccess());

     

    #67989 Mar 14, 2013 14:09
  •  

    FilterForVisitor<PageData> and FilterOnUnifiedFileReadAccess<UnifiedFile> are not part of CMS 6 R2. I guess you have been following http://www.dodavinkeln.se/post/2013/02/25/using-episerver-finds-unified-search-in-episerver-6.aspx. If so, please download the search extensions Johan has uploaded: http://world.episerver.com/Code/Johan-Pettersson/EPiServer-Find-Filters/

    Regards,

    Henrik

    #67998 Mar 14, 2013 14:47
  • bjarne.somme
    Member since: 2010
     

    Thanks. I noticed that the extension method 'SearchSubsection' for UnifiedFile wasn't implemented but is still used when specifying the type fields to be included/excluded (line 31 in the article).

    Also, I had to change my code from client.UnifiedSearch(...) to SearchClient.Instance.UnifiedSearchFor(...)

    Why won't the instance IClient client = Client.CreateFromConfig() work for unified search when it works for getting pages with client.Search<T>() ?

    #68003 Mar 14, 2013 15:15
  •  

    "Why won't the instance IClient client = Client.CreateFromConfig() work for unified search when it works for getting pages with client.Search<T>() ?"


    It will not work because SearchClient is a singleton with the configuration for UnifiedSearch applied at startup. When you call Client.CreateFromConfig() you create a plain .NET client that does not know anything about EPiServer. SearchClient is part of the Framework and knows about, for instance, IContent and so on. The SearchClient is there to make your life a little bit easier when working with EPiServer Find.
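
    The difference can be illustrated with a toy model. ToyClient and ToySearchClient below are made-up stand-ins, not the Find API; they only show why settings registered on a shared instance at startup are missing from a freshly created client.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical stand-in for a plain client; not the EPiServer Find API.
    public class ToyClient
    {
        // Types registered for unified search on this particular instance.
        public readonly List<Type> UnifiedTypes = new List<Type>();

        // Every call builds a new, unconfigured client.
        public static ToyClient CreateFromConfig()
        {
            return new ToyClient();
        }
    }

    // Hypothetical stand-in for the singleton that is configured once at startup.
    public static class ToySearchClient
    {
        private static readonly ToyClient instance = CreateConfigured();

        public static ToyClient Instance
        {
            get { return instance; }
        }

        private static ToyClient CreateConfigured()
        {
            ToyClient client = ToyClient.CreateFromConfig();
            client.UnifiedTypes.Add(typeof(string)); // stands in for registering PageData
            return client;
        }
    }
    ```

    Here ToySearchClient.Instance carries the registration, while a second call to ToyClient.CreateFromConfig() returns a client with an empty list, which mirrors why the unified search query only returned hits through SearchClient.Instance.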

    #68072 Mar 15, 2013 15:08