Method and apparatus for representation of unstructured data

  • US 7,467,155 B2
  • Filed: 07/12/2005
  • Issued: 12/16/2008
  • Est. Priority Date: 07/12/2005
  • Status: Expired due to Fees
First Claim

1. A system for representing and searching a document including unstructured data, the system comprising:

    a data store storing a plurality of documents;

    a processor executing program instructions, the program instructions including generating a binary representation of the unstructured data in the plurality of documents and searching the binary representation in response to a search request, the processor generating an output based on the search; and

    a memory storing the binary representation of the unstructured data in a plurality of data structures, the data structures including;

    a first binary bit vector identifying a plurality of unstructured data included in the plurality of documents;

    a plurality of second binary bit vectors, wherein for each of the plurality of unstructured data identified in the first binary bit vector, a corresponding second binary bit vector sets one or more bits for one or more position identifiers assigned to one or more instances of the associated unstructured data appearing in one or more of the plurality of documents, wherein the instance of an unstructured data appearing at the end of a first one of the plurality of documents is assigned a position identifier of n, and the instance of an unstructured data appearing at the beginning of a second one of the plurality of documents is assigned a position identifier of n+1, wherein n is an integer greater than 0; and

    a positional ID vector indicating a start position identifier of each word appearing at the beginning of each of the plurality of documents, wherein the program instructions for searching the binary representation include;

    determining if a particular search term provided with the search request is identified in the first binary bit vector;

    if the particular search term is identified in the first binary bit vector, retrieving the corresponding second binary bit vector;

    identifying from the positional ID vector the start position identifier of the word at the beginning of a particular one of the plurality of documents to be searched;

    deducing from the positional ID vector an end position identifier of a word at the end of the particular one of the plurality of documents to be searched; and

    identifying one or more bits set for one or more of the position identifiers in the retrieved secondary binary bit vector between the start position identifier and the end position identifier for identifying all instances of the search term occurring in the particular document.
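
The arrangement recited in claim 1 can be illustrated with a minimal Python sketch. This is not the patented implementation, only one possible reading of the claim language under stated assumptions: documents are whitespace-separated words, a Python integer stands in for each "second binary bit vector", membership in a dictionary stands in for the "first binary bit vector", and a list of start positions stands in for the "positional ID vector". The names `build_index` and `find_in_document` are hypothetical and do not appear in the patent.

```python
def build_index(documents):
    """Build the three structures for a list of document strings.

    Assumption: words are split on whitespace and lowercased.
    Position identifiers run continuously across documents, so if the
    last word of one document gets position n, the first word of the
    next document gets position n + 1.
    """
    term_bits = {}    # term -> int used as a bit vector (bit p set = term at position p)
    doc_starts = []   # stand-in for the positional ID vector: start position per document
    next_pos = 1      # position identifiers start at 1 (n > 0)
    for text in documents:
        doc_starts.append(next_pos)
        for word in text.lower().split():
            term_bits[word] = term_bits.get(word, 0) | (1 << next_pos)
            next_pos += 1
    last_pos = next_pos - 1
    return term_bits, doc_starts, last_pos


def find_in_document(index, term, doc_no):
    """Return the position identifiers of `term` inside document `doc_no` (0-based)."""
    term_bits, doc_starts, last_pos = index

    # Step 1: is the term known at all? (stand-in for the first bit vector)
    if term not in term_bits:
        return []

    # Step 2: retrieve the term's own bit vector.
    bits = term_bits[term]

    # Step 3: start position comes directly from the positional ID vector.
    start = doc_starts[doc_no]

    # Step 4: the end position is deduced as "start of next document minus 1",
    # or the last assigned position for the final document.
    if doc_no + 1 < len(doc_starts):
        end = doc_starts[doc_no + 1] - 1
    else:
        end = last_pos

    # Step 5: keep only the bits between start and end, then list them.
    window = (bits >> start) & ((1 << (end - start + 1)) - 1)
    positions = []
    offset = 0
    while window:
        if window & 1:
            positions.append(start + offset)
        window >>= 1
        offset += 1
    return positions


docs = ["the cat sat", "the dog saw the cat"]
index = build_index(docs)
print(find_in_document(index, "the", 1))   # [4, 7]
print(find_in_document(index, "cat", 0))   # [2]
```

In this example the first document ends at position 3 and the second begins at position 4, which mirrors the n and n+1 numbering in the claim. Because only the start of each document is stored, the end of a document is never recorded explicitly; it is derived from the start of the following document.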
