Which file types can the Crawler Connector extract text content from?


A quick investigation suggests that crawled Word documents get their content extracted, but PDFs do not.

I can't find (hehe) anything on this in the docs.

Sep 19, 2019 16:59

Johan, I'm researching this with the Find team. Do you have examples of PDF documents whose text is not being extracted properly?

Sep 19, 2019 20:17

After following up on my own, it seems the customer has a lot of scanned paper documents saved as PDFs, rather than real documents "exported to PDF".

The customer compares this with Google search, where Google seems to run text recognition on such images when indexing.

I've now also found crawled PDFs whose SearchText does contain extracted text.

Sep 20, 2019 9:39


Are you saying that you don't have examples of PDF documents whose text is not being extracted properly?

Sep 20, 2019 16:11

Yes. After looking further, the examples I have are essentially just scanned A4 pages (an image saved as a PDF), and I can understand that you don't support extracting text from those.

The examples I have of PDFs that they've exported from an actual digital document seem to get text extracted as expected.

I guess "character recognition in images to put in SearchText of the WebContent if the crawled file looks like a scanned document" would be my feature request. I think Azure Cognitive Services would do a good job with this.
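To make the "looks like a scanned document" part of that request concrete, here is a minimal sketch of one possible heuristic: treat a crawled PDF as a likely scan when almost no text ended up in SearchText. The function name, the `min_chars` threshold and the idea of checking the URL extension are all assumptions for illustration; none of this is part of the Crawler Connector or Find.

```python
from typing import Optional
from urllib.parse import urlparse


def looks_like_scanned_pdf(url: str, search_text: Optional[str], min_chars: int = 20) -> bool:
    """Hypothetical heuristic: a PDF with (almost) no extracted text
    is probably an image-only scan that would need OCR."""
    # Look only at the path so query strings such as "?v=2" are ignored.
    path = urlparse(url).path.lower()
    if not path.endswith(".pdf"):
        return False
    # Treat missing or whitespace-only text as "nothing extracted".
    extracted = (search_text or "").strip()
    return len(extracted) < min_chars


# Example: flag candidates among already crawled items (sample data is made up).
crawled = [
    ("https://example.com/docs/scan.pdf", ""),
    ("https://example.com/docs/report.pdf", "Annual report 2019, exported from Word."),
]
candidates = [url for url, text in crawled if looks_like_scanned_pdf(url, text)]
```

Files flagged this way could then be passed to an OCR service of your choice; that step is left out here because it depends entirely on the service used.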

For this thread, a list of the file types that get text extracted would still be welcome.

Edited, Sep 20, 2019 16:59

Word, Excel, PowerPoint, PDF

Sep 20, 2019 18:35
