High accuracy document information-element vector encoding server

US 7,725,466 B2
Filed: 10/23/2007
Issued: 05/25/2010
Est. Priority Date: 10/24/2006
Status: Active Grant

Abstract

Some embodiments of a high-accuracy document information element-vector (IE-vector) encoding server are presented. In one embodiment, the high-accuracy document IE-vector encoding server applies a finite state automaton (FSA) to parse a document and identify one or more information elements (IEs) in it. A DNA sequence of the document is then derived from the one or more IEs. The DNA sequence of a document can be used to build automated tools, such as computer-based processes that reason about and search for similarity, dissimilarity, equivalence, and other relationships among structured, semi-structured, and unstructured data and information. It thereby provides a paradigm for building sophisticated information and data search and retrieval techniques and tools.
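As a rough illustration of the parsing step described in the abstract, the sketch below runs a small finite state automaton over the lines of a plain-text document and groups them into IEs. It is only a sketch under stated assumptions: the line classes (`#` for a header, `|` for a table row), the state names, and the `IE` record are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class IE:
    """A hypothetical information element: a kind plus its source lines."""
    kind: str
    lines: list = field(default_factory=list)


def classify(line: str) -> str:
    """Map a line to an input symbol (assumed conventions, not from the patent)."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("#"):
        return "header"
    if "|" in stripped:
        return "table"
    return "text"


STATES = ("idle", "header", "table", "free_text")
SYMBOL_TO_STATE = {
    "blank": "idle",
    "header": "header",
    "table": "table",
    "text": "free_text",
}
# Explicit transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    (state, symbol): target
    for state in STATES
    for symbol, target in SYMBOL_TO_STATE.items()
}


def parse(document: str) -> list:
    """Run the automaton over the document and return the IEs it finds."""
    state = "idle"
    elements = []
    for line in document.splitlines():
        next_state = TRANSITIONS[(state, classify(line))]
        if next_state != "idle":
            # A state change opens a new IE; headers are always single-line.
            if next_state != state or next_state == "header":
                elements.append(IE(kind=next_state))
            elements[-1].lines.append(line.strip())
        state = next_state
    return elements
```

For example, `parse("# Title\n\nSome text\nmore text\n\na | b\n1 | 2\n")` yields three IEs of kinds `header`, `free_text`, and `table`. A real parser would need a richer alphabet and state set to cover the IE types the claims list (figures, footnotes, glossaries, and so on).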

24 Citations

12 Claims

1. A computer-implemented method comprising:
- applying finite state automaton (FSA) to parse a document to identify one or more information elements (IEs) in the document;
  
  deriving a unique symbolic sequence particular to the document based on the one or more IEs contained in the document, such unique symbolic sequence being analogous to the DeoxyriboNucleic Acid (DNA) sequence in animals and/or plants;
  
  wherein deriving the unique symbolic sequence particular to the document comprises;
  
  if an IE of the one or more IEs includes a section of free text, determining a term frequency inverted document frequency (tfidf) of each of a plurality of words in the section of free text; and
  
  using the tfidf to generate a portion of the DNA sequence; and
  
  applying reduced concept space (RCS) to the one or more IEs, wherein the RCS includes polysemic analysis and synomemic analysis.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, further comprising:
    - recursively analyzing each of the one or more IEs to identify one or more embedded IEs within a corresponding IE.
  - 3. The method of claim 1, wherein each of the one or more IEs includes a basic building block that encapsulates a predetermined type of information.
  - 4. The method of claim 3, wherein a type of each of the one or more IEs includes at least one of:
    - free text, a table, a spreadsheet, a figure, an image, a field, a header, a footer, a footnote, an index, a glossary, and a table of content.
  - 5. The method of claim 3, wherein the type of each of the one or more IEs is associated with a predefined data schema.
  - 6. The method of claim 1, further comprising:
    - creating an IE-map to graphically represent the document based on the one or more IEs, wherein a structure of the IE-map corresponds to a structure of the document.
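The tfidf step in claim 1 can be sketched as follows. This is a minimal illustration, not the patented encoding: the claims do not say how tfidf values become part of the sequence, so the smoothed idf formula, the `top_k` cutoff, and the use of truncated SHA-1 digests as symbols are all assumptions made for the example, as are the function names.

```python
import hashlib
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(section: str, corpus: list) -> dict:
    """Score each word of a free-text section against a reference corpus.

    Uses tf = count / section length and a smoothed
    idf = ln((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    words = tokenize(section)
    counts = Counter(words)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        df = sum(word in doc_set for doc_set in doc_sets)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(words)) * idf
    return scores


def sequence_fragment(section: str, corpus: list, top_k: int = 3) -> str:
    """Encode the top_k highest-scoring words as a symbolic fragment.

    Each word becomes a 4-character hex symbol (hypothetical encoding).
    Ties are broken alphabetically so the output is deterministic.
    """
    scores = tfidf(section, corpus)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
    return "-".join(
        hashlib.sha1(word.encode("utf-8")).hexdigest()[:4] for word in ranked
    )
```

With a section such as `"merger merger terms and the schedule"` and a reference corpus in which "merger" never appears, "merger" receives the highest score and contributes the first symbol of the fragment. Common words shared with the corpus score lower and fall outside the `top_k` cutoff.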

7. An apparatus comprising:
- a finite state machine to parse a document to identify one or more information elements (IEs) in the document;
  
  a DeoxyriboNucleic Acid (DNA) generator coupled to the finite state machine to derive a DNA sequence of the document based on the one or more IEs;
  
  wherein if an IE of the one or more IEs includes a section of free text, then the DNA generator determines a term frequency inverted document frequency (tfidf) of each of a plurality of words in the section of free text; and
  
  the DNA generator further uses the tfidf to generate a portion of the DNA sequence; and
  
  a reduced concept space (RCS) processor coupled to the finite state machine, the RCS processor further comprising a polysemic analysis module and a synomemic analysis module.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The apparatus of claim 7, wherein the finite state machine is operable to recursively analyze each of the one or more IEs to identify one or more embedded IEs within a corresponding IE.
  - 9. The apparatus of claim 7, further comprising:
    - a graph processing module to create an IE-map to graphically represent the document based on the one or more IEs, wherein a structure of the IE-map corresponds to a structure of the document.
  - 10. The apparatus of claim 7, wherein a type of each of the one or more IEs includes at least one of free text, a table, a spreadsheet, a figure, an image, a field, a header, a footer, a footnote, an index, a glossary, and a table of content.
  - 11. A system comprising the apparatus of claim 7, further comprising:
    - a network; and
      
      a network security server communicatively coupled to the network, the network security server operable to use the DNA generator and the finite state machine to process each of a plurality of documents to be sent out from the network, and the network security server comprising a data extrusion prevention module to dynamically apply a data extrusion prevention policy to a respective document based on a corresponding DNA sequence generated.
  - 12. The system of claim 11, further comprising:
    - a database to store DNA sequences of a plurality of predetermined documents.
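Claims 11 and 12 describe a network security server that checks outgoing documents against stored DNA sequences. One way this might look is sketched below, under loud assumptions: sequences are taken to be `-`-separated symbol strings, similarity is measured with a Jaccard index over symbols, and the policy is a simple threshold. The patent text shown here specifies none of these, and the class and function names are hypothetical.

```python
def jaccard(seq_a: str, seq_b: str) -> float:
    """Jaccard similarity between the symbol sets of two sequences."""
    set_a, set_b = set(seq_a.split("-")), set(seq_b.split("-"))
    return len(set_a & set_b) / len(set_a | set_b)


class SequenceDatabase:
    """In-memory store of sequences for predetermined (protected) documents."""

    def __init__(self):
        self._store = {}

    def add(self, name: str, sequence: str) -> None:
        self._store[name] = sequence

    def best_match(self, sequence: str) -> tuple:
        """Return (name, score) of the closest stored sequence, or (None, 0.0)."""
        best_name, best_score = None, 0.0
        for name, stored in self._store.items():
            score = jaccard(sequence, stored)
            if score > best_score:
                best_name, best_score = name, score
        return best_name, best_score


def apply_policy(db: SequenceDatabase, sequence: str, threshold: float = 0.5) -> str:
    """Return "block" if the outgoing sequence is too close to a protected one."""
    _, score = db.best_match(sequence)
    return "block" if score >= threshold else "allow"
```

After `db.add("contract", "a1b2-c3d4-e5f6")`, an outgoing sequence `"a1b2-c3d4-ffff"` shares two of four distinct symbols with the stored one, giving a similarity of 0.5 and a "block" decision at the default threshold. A set-based comparison ignores symbol order; an order-sensitive measure would be a reasonable alternative if sequence position carries meaning.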

Current Assignee: Tarique Mustafa
Original Assignee: Tarique Mustafa
Inventors: Mustafa, Tarique
Primary Examiner(s): Breene; John E
Assistant Examiner(s): Ly; Anh
Application Number: US11/977,318
Publication Number: US 20080097990A1
Time in Patent Office: 945 Days
Field of Search: 707/715, 707/729, 707/739
US Class Current: 707/729
CPC Class Codes: G06F 16/31 Indexing; Data structures t...
