Sinking in a sea of documents

Lately, because of one of my projects, I have been working my way through a bunch of long, semi-structured documents: requests for proposals, government program reports, threat models and the like. They are written in what I would call techno-legalese: highly structured, with section numbering and three, four or five levels of nesting. They are all in English, and they read as if the writers were getting paid by the word, if you know what I mean.

I need a more efficient way to locate the paragraphs that contain the nuggets that matter to me. What I’d like is a kind of local document index or repository that lets me keep some standing queries and easily locate the sections of documents that talk about them. Here’s an example:

  • I’d like to load in 10 large PDF files, each about 100 pages long. Each PDF contains English text, formatted very nicely into paragraphs and sections.
  • I’d like to specify that I am interested in “blogging platforms”, “weaknesses in Ruby” and “localization and internationalization”.
  • Ideally, I’d then get a list showing the section of text, the name of the document, and other information for everything that seems related to, or includes, the words and phrases I specified.
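
To make the example above concrete, here is a minimal sketch in Python of the kind of thing I have in mind. It rests on several assumptions: the text has already been extracted from the PDFs into plain strings (the PDF parsing itself is not shown), section headings look like "3.1.2 Some title" at the start of a line, and a "match" just means that a paragraph contains every word of the query. That is plain word matching, not concept search. The names `split_sections` and `run_standing_queries` and the sample document are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed heading shape: "3", "3.1" ... up to five levels, then a title.
# Caveat: a body line that starts with a number would also be taken as a heading.
HEADING = re.compile(r"^(\d+(?:\.\d+){0,4})[ \t]+(\S.*)$", re.MULTILINE)


def split_sections(text):
    """Split extracted text into (heading, body) pairs at numbered headings."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section's body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = f"{m.group(1)} {m.group(2).strip()}"
        sections.append((heading, text[m.end():end].strip()))
    return sections


def run_standing_queries(documents, queries):
    """Return {query: [(doc_name, heading, paragraph), ...]}.

    A paragraph matches a query when it contains every word of the query,
    ignoring case. Queries with no matches are left out of the result.
    """
    results = defaultdict(list)
    for doc_name, text in documents.items():
        for heading, body in split_sections(text):
            # Paragraphs are assumed to be separated by blank lines.
            for paragraph in re.split(r"\n\s*\n", body):
                words = set(re.findall(r"[a-z0-9]+", paragraph.lower()))
                for query in queries:
                    if all(w in words for w in query.lower().split()):
                        results[query].append((doc_name, heading, paragraph.strip()))
    return dict(results)


# Hypothetical sample standing in for text extracted from one PDF.
docs = {
    "rfp.txt": (
        "1 Overview\n"
        "This RFP covers hosting.\n\n"
        "1.1 Platforms\n"
        "We evaluate blogging platforms for staff.\n\n"
        "Other text here.\n\n"
        "2 Risks\n"
        "Known weaknesses in Ruby gems are listed."
    ),
}
queries = [
    "blogging platforms",
    "weaknesses in Ruby",
    "localization and internationalization",
]

results = run_standing_queries(docs, queries)
for query in queries:
    for doc_name, heading, paragraph in results.get(query, []):
        print(f"[{query}] {doc_name}, section {heading}: {paragraph}")
```

Grouping the results by query is what makes the searches "standing": rerun the same list of queries whenever new documents are loaded, and each query keeps its own list of matching sections.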

I am sure something like this exists. I would call it something like document indexing, document comprehension or structured searching.

Any suggested leads or ideas?

0 thoughts on “Sinking in a sea of documents”

  1. Hey Pito, it’s JJ Kennedy. We worked together waaaay back at eRoom. Anyway, Adobe Acrobat Pro has something that might do the trick. At the search box, use the dropdown to choose “Open Full Reader Search”. It allows you to search an entire drive, folder or whatever for your keywords in all .pdf files. I tend to have all my research already grouped in folders by topic, so it makes finding all instances of a subject very easy. Not sure if this is exactly what you are looking for, but it works for me :) JJ


  2. JJ, thanks! I tried that and it was decent, but I am looking for something a little smarter, something that searches for phrases (and even concepts). So I could ask it to show me all documents that are about “email spam”, for example, and it would show me the part of the document where that was mentioned (even better if it was the section/subsection), in context, for example with the full paragraph. Even better if I could have a standing set of searches, so I could easily see where each group of matches was to be found. I can imagine how to build this, but I am sure it must exist somewhere. Haven’t found it yet though…


