I've set up a crawler connector in EPiServer Find, and it seems to create duplicates in the index. When I investigated, I found that the URLs of the duplicates differ only by the JSESSIONID in the URL. For example:
Note that the query string is the same and the links lead to the same page (this can easily be tested in a browser). I cannot find any way to configure the connector to ignore the jsessionid. Has anyone done this before? Can Find filter them out?
The crawler is very limited in terms of configuration. Our approach has been to detect the crawler on the target site and hide markup that isn't relevant for crawling. I assume you should be able to solve your issue on that side as well, for example by not emitting the jsessionid in links served to the crawler.
We've asked Epi to look at what was in SiteSeeker for crawling configuration. It was pretty complete in my opinion and had some options for querystring-keys that would've fixed your issue.
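To illustrate the kind of normalization that would collapse these duplicates, here is a minimal sketch in Python. It is not a Find connector setting (as far as this thread goes, the connector has no such option). It only shows the idea of stripping a servlet-style `;jsessionid=...` path parameter before a URL is emitted or compared. The function name `strip_jsessionid` and the example URLs are made up for illustration, and the sketch assumes the session ID appears as a path parameter rather than as a query-string key.

```python
import re

# Matches a servlet-style session path parameter such as ";jsessionid=ABC123",
# up to the next ";", "?" or "#". Matching is case-insensitive.
_JSESSIONID = re.compile(r";jsessionid=[^;?#]*", re.IGNORECASE)


def strip_jsessionid(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return _JSESSIONID.sub("", url)


# Hypothetical URLs, invented for this example only.
urls = [
    "http://example.com/products/list.do;jsessionid=1A2B3C?category=5",
    "http://example.com/products/list.do;jsessionid=9F8E7D?category=5",
]

# Both variants normalize to the same URL, so only one entry remains.
unique = {strip_jsessionid(u) for u in urls}
print(unique)  # {'http://example.com/products/list.do?category=5'}
```

On a Java servlet site the same effect can usually be had at the source by restricting session tracking to cookies, so that links never carry the session ID in the first place. Whether that is acceptable depends on whether any visitors rely on cookieless sessions.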