High accuracy document information-element vector encoding server
First Claim
1. A computer-implemented method comprising:
applying finite state automaton (FSA) to parse a document to identify one or more information elements (IEs) in the document;
deriving a unique symbolic sequence particular to the document based on the one or more IEs contained in the document, such unique symbolic sequence being analogous to the DeoxyriboNucleic Acid (DNA) sequence in animals and/or plants;
wherein deriving the unique symbolic sequence particular to the document comprises;
if an IE of the one or more IEs includes a section of free text, determining a term frequency inverted document frequency (tfidf) of each of a plurality of words in the section of free text; and
using the tfidf to generate a portion of the DNA sequence; and
applying reduced concept space (RCS) to the one or more IEs, wherein the RCS includes polysemic analysis and synomemic analysis.
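The tfidf step recited in the claim can be illustrated with a minimal sketch. Several parts of it are assumptions rather than anything taken from the patent: the smoothed idf formula, the reference `corpus` argument, the simple tokenizer, and especially `sequence_portion`, a hypothetical encoding that hashes the top-scoring words into short tokens. The claim does not say how tfidf values are turned into a portion of the DNA sequence.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter (deliberately simple).
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def tfidf(section, corpus):
    """Return {word: tf-idf score} for the words of one free-text section.

    `corpus` is an assumed list of reference documents used for the
    document-frequency counts.
    """
    words = tokenize(section)
    counts = Counter(words)
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(doc_sets)
    scores = {}
    for word, count in counts.items():
        tf = count / len(words)
        df = sum(1 for s in doc_sets if word in s)
        # Smoothed idf (an assumption) so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return scores


def sequence_portion(scores, k=3):
    # Hypothetical encoding: take the k highest-scoring words (ties broken
    # alphabetically) and hash each one to four hexadecimal characters.
    top = sorted(scores, key=lambda w: (-scores[w], w))[:k]
    return "-".join(hashlib.sha1(w.encode()).hexdigest()[:4] for w in top)
```

Calling `tfidf("the cat chased the cat", ["the cat sat", "the dog ran", "the bird flew"])` ranks `cat` above `chased` above `the`, since `the` occurs in every reference document. Passing the result to `sequence_portion` yields a deterministic string of three hashed tokens.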
Abstract
Some embodiments of a high-accuracy document information element-vector (IE-vector) encoding server are presented. In one embodiment, the server applies a finite state automaton (FSA) to parse a document and identify one or more information elements (IEs) in it. A DNA sequence of the document is then derived from the one or more IEs. The DNA sequence of a document can be used to build automated tools, such as computer-based processes that reason about and search for similarity, dissimilarity, equivalence, and other relationships among structured, semi-structured, and unstructured data and information. It thus provides a paradigm for building sophisticated information and data search and retrieval techniques and tools.
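As an illustration of the parsing step, the following is a minimal sketch of a character-level finite state automaton that scans a document once and emits typed elements. The states, the two element kinds (`word` and `number`), and the names `State`, `char_class`, and `parse_ies` are all hypothetical. The abstract does not specify which IE types the server recognizes or how its automaton is built.

```python
from enum import Enum, auto


class State(Enum):
    START = auto()   # between elements
    WORD = auto()    # inside a run of letters
    NUMBER = auto()  # inside a run of digits


def char_class(ch):
    # Map a character to the state the automaton should be in after reading it.
    if ch.isdigit():
        return State.NUMBER
    if ch.isalpha():
        return State.WORD
    return State.START  # any separator returns the automaton to START


def parse_ies(text):
    """Scan `text` once and return a list of (kind, value) elements."""
    state, buf, ies = State.START, [], []
    for ch in text + " ":  # the trailing separator flushes the last element
        nxt = char_class(ch)
        if nxt is not state:
            # A state change ends the current element, if there is one.
            if state is not State.START:
                ies.append((state.name.lower(), "".join(buf)))
            buf = []
            state = nxt
        if state is not State.START:
            buf.append(ch)
    return ies
```

For example, `parse_ies("Invoice 42 due 2024")` returns `[("word", "Invoice"), ("number", "42"), ("word", "due"), ("number", "2024")]`. A practical parser would need more states to recognize richer elements such as dates, names, or free-text sections.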
24 Citations
12 Claims
1. A computer-implemented method comprising:
applying finite state automaton (FSA) to parse a document to identify one or more information elements (IEs) in the document;
deriving a unique symbolic sequence particular to the document based on the one or more IEs contained in the document, such unique symbolic sequence being analogous to the DeoxyriboNucleic Acid (DNA) sequence in animals and/or plants;
wherein deriving the unique symbolic sequence particular to the document comprises;
if an IE of the one or more IEs includes a section of free text, determining a term frequency inverted document frequency (tfidf) of each of a plurality of words in the section of free text; and
using the tfidf to generate a portion of the DNA sequence; and
applying reduced concept space (RCS) to the one or more IEs, wherein the RCS includes polysemic analysis and synomemic analysis.
Dependent claims: 2, 3, 4, 5, 6
7. An apparatus comprising:
a finite state machine to parse a document to identify one or more information elements (IEs) in the document;
a DeoxyriboNucleic Acid (DNA) generator coupled to the finite state machine to derive a DNA sequence of the document based on the one or more IEs;
wherein if an IE of the one or more IEs includes a section of free text, then the DNA generator determines a term frequency inverted document frequency (tfidf) of each of a plurality of words in the section of free text; and
the DNA generator further uses the tfidf to generate a portion of the DNA sequence; and
a reduced concept space (RCS) processor coupled to the finite state machine, the RCS processor further comprising a polysemic analysis module and a synomemic analysis module.
Dependent claims: 8, 9, 10, 11, 12
Specification