Method and system for document classification or search using discrete words
First Claim
1. A method of operating a computerized document classification system, comprising the steps of:
generating at least one classification important word for distinguishing between two or more document classifications, wherein each classification important word has a classification importance value;
selecting a reference document having information content comprising a plurality of words related to at least one of the two or more document classifications;
automatically detecting at least one classification important word from the plurality of words within the reference document, wherein the at least one classification important word from the reference document has been processed using at least two dictionary functions selected from a group of dictionary functions consisting of:
Derived Words;
Acronym;
Word Capitalization; and
Hyphenation;
generating a word score value for the at least one detected classification important word from the reference document using a WordRatio, which comprises a normalization factor for the detected classification important word that is based upon the relative occurrences of a predetermined plurality of base words, and at least one value selected from a group of values consisting of:
a value defined for the at least one detected classification important word related to a document section that occurs in the reference document;
a classification importance value for the at least one detected classification important word;
a value defined for the at least one detected classification important word in a document type that applies to the document;
a value defined for the at least one detected classification important word across multiple document classifications; and
a value based on the statistical occurrence of the at least one detected classification important word in at least two different documents comparing the number of occurrences of the at least one classification important word in a first document having a first classification and a second document having a second classification; and
generating a classification score for the reference document that is related to the word score value for the at least one detected classification important word.
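The steps of this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the lookup tables, the base-word set, and the choice to compute WordRatio as a word's count divided by the total count of base words are all assumptions made for illustration, since the claim does not define them. The word score here uses only one of the listed values, the classification importance value.

```python
import re
from collections import Counter

# All tables and weights below are hypothetical; the claim does not define them.
ACRONYMS = {"ocr": "optical character recognition"}
DERIVED = {"patents": "patent", "patented": "patent"}
BASE_WORDS = {"the", "of", "and", "a"}


def normalize(word: str) -> str:
    """Apply four simple stand-ins for the dictionary functions to one word."""
    w = word.lower()            # Word Capitalization: fold case
    w = w.replace("-", "")      # Hyphenation: join hyphenated forms
    w = ACRONYMS.get(w, w)      # Acronym: expand known acronyms
    return DERIVED.get(w, w)    # Derived Words: map to a root form


def tokenize(text: str) -> list[str]:
    """Split text into words (keeping internal hyphens) and normalize each."""
    return [normalize(t) for t in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)]


def word_ratio(counts: Counter, word: str) -> float:
    """Assumed WordRatio: occurrences of `word` relative to all base-word occurrences."""
    base_total = sum(counts[b] for b in BASE_WORDS)
    return counts[word] / base_total if base_total else 0.0


def classification_score(text: str, importance: dict[str, float]) -> float:
    """Sum of word scores, each being WordRatio times the importance value."""
    counts = Counter(tokenize(text))
    return sum(word_ratio(counts, w) * v for w, v in importance.items())


doc = "The patent of the Patented device and a patent"
score = classification_score(doc, {"patent": 2.0})
```

In this example "patent" occurs three times after normalization and the base words occur five times, so the assumed WordRatio is 0.6 and the classification score is 1.2.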
Abstract
A method of operating a computerized document search system, in which information is matched against a database of documents in response to user queries, includes receiving a query that identifies a source document whose information content is related to documents in the database. Important words within the source document are detected automatically, where at least one of the important words has been processed using at least two dictionary functions selected from Derived Words, Acronym, Word Capitalization, and Hyphenation. An importance value is generated for important words in a processed document using a WordRatio and at least one of a selected set of values. A score is generated for a processed document based partly on the importance value of at least one important word in that document. A document list is created for identifying documents that are related to the source document.
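The search flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only: the base-word set, the rule for picking important words (most frequent non-base words), and the scoring formula are assumptions, and the function names are hypothetical.

```python
from collections import Counter

BASE_WORDS = {"the", "of", "and", "a", "is", "for"}  # hypothetical base-word set


def counts_of(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens."""
    return Counter(text.lower().split())


def important_words(source: str, top_k: int = 2) -> list[str]:
    """Assumed detection rule: the most frequent non-base words in the source."""
    counts = counts_of(source)
    candidates = [w for w in counts if w not in BASE_WORDS]
    candidates.sort(key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return candidates[:top_k]


def score(doc: str, words: list[str]) -> float:
    """Important-word occurrences normalized by base-word occurrences."""
    counts = counts_of(doc)
    base_total = sum(counts[b] for b in BASE_WORDS) or 1
    return sum(counts[w] for w in words) / base_total


def search(source: str, database: dict[str, str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return (doc_id, score) pairs for related documents, best first."""
    words = important_words(source, top_k)
    scored = [(doc_id, score(text, words)) for doc_id, text in database.items()]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: (-s[1], s[0]))


source = "the engine of the turbine engine"
database = {
    "d1": "the turbine engine is loud",
    "d2": "a recipe for the soup",
    "d3": "the engine and the engine of a car",
}
results = search(source, database)
```

Here `d2` shares no important word with the source document, so it is left out of the list.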
17 Claims
1. A method of operating a computerized document classification system, set out in full under First Claim above.
Dependent claims: 2, 3, 4, 5
6. A method of operating a computerized document classification system, comprising the steps of:
generating at least one classification important word for distinguishing between two or more document classifications, wherein each classification important word has a classification importance value;
selecting a reference document having information content comprising a plurality of words related to at least one of the two or more document classifications;
automatically detecting the at least one classification important word from the plurality of words within the reference document, the at least one classification important word identified as a SearchWord wherein the at least one classification important word from the reference document has been processed using at least two dictionary functions selected from a group of dictionary functions consisting of:
Derived Words;
Acronym;
Word Capitalization; and
Hyphenation;
generating a word score value for the at least one detected classification important word using a WordRatio, which comprises a normalization factor for the detected classification important word that is based upon the relative occurrences of a predetermined plurality of base words and at least one value selected from a group of values consisting of:
a value defined for the at least one detected classification important word related to a document section that occurs in the reference document;
a classification importance value for the at least one detected classification important word;
a value defined for the at least one detected classification important word in a document type that applies to the reference document;
a value defined for the at least one detected classification important word across multiple document classifications; and
a value based on the statistical occurrence of the at least one detected classification important word in at least two different documents;
comparing the number of occurrences of the at least one classification important word in a first document having a first classification and a second document having a second classification; and
generating a classification score for the reference document that is related to the word score value for the at least one detected classification important word.
Dependent claims: 7, 8, 9, 10
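The comparison step in claim 6, which counts occurrences of a classification important word in two documents with different classifications, could be expressed as a simple contrast value. The formula below is an assumption chosen for illustration; the claim does not state how the two counts are combined.

```python
def occurrence_contrast(word: str, doc_a: str, doc_b: str) -> float:
    """Hypothetical contrast in [-1, 1].

    +1 means the word occurs only in doc_a, -1 only in doc_b,
    and 0 means it is equally frequent in both (or absent from both).
    """
    n_a = doc_a.lower().split().count(word)
    n_b = doc_b.lower().split().count(word)
    total = n_a + n_b
    return (n_a - n_b) / total if total else 0.0


legal_doc = "the court held that the court erred"      # first classification
medical_doc = "the patient saw the court physician"    # second classification

court_contrast = occurrence_contrast("court", legal_doc, medical_doc)
patient_contrast = occurrence_contrast("patient", legal_doc, medical_doc)
```

A word with a contrast near zero distinguishes poorly between the two classifications, while a value near either extreme suggests the word is characteristic of one of them.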
11. A method of operating a computerized document classification system comprising the steps:
generating at least one classification important word for distinguishing between two or more document classifications, wherein each classification important word has a classification importance value;
selecting a reference document having information content comprising a plurality of words related to at least one of the two or more document classifications;
automatically detecting at least one classification important word within the reference document, the at least one detected classification important word identified as a SearchWord where the SearchWord is selected in part by using a WordRatio, which comprises a normalization factor for the SearchWord that is based upon the relative occurrences of a predetermined plurality of base words related to a document classification and a number of document classifications that contain the SearchWord;
processing a document in a database in which the classification importance value of the at least one detected classification important word within the document being processed is partially derived from the WordRatio and at least one value selected from a group of values consisting of:
a value defined for the at least one detected classification important word related to a document section that occurs in the document being processed;
a classification importance value for the at least one detected classification important word;
a value defined for the at least one detected classification important word in a document type that applies to the document being processed;
a value defined for the at least one detected classification important word across multiple document classifications; and
a value based on the statistical occurrence of the at least one detected classification important word across at least two different documents comparing the number of occurrences of the at least one classification important word in a first document having a first classification and a second document having a second classification; and
generating a score for the document being processed based in part on the importance value of the SearchWord in the document being processed.
Dependent claims: 12, 13, 14, 15
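Claim 11 ties SearchWord selection to two quantities: the relative occurrences of base words for a classification, and the number of classifications that contain the word. One way these might be combined is sketched below. The per-classification counts, the base-word set, the threshold, and the combining formula are all hypothetical.

```python
from collections import Counter

# Hypothetical per-classification word counts.
CLASS_COUNTS = {
    "legal": Counter({"court": 8, "contract": 4, "the": 20, "of": 10}),
    "medical": Counter({"patient": 9, "contract": 1, "the": 18, "of": 12}),
}
BASE_WORDS = {"the", "of"}  # hypothetical base-word set


def search_word_value(word: str) -> float:
    """Best per-classification ratio, divided by how many classifications contain the word."""
    containing = [c for c in CLASS_COUNTS.values() if c[word] > 0]
    if not containing:
        return 0.0
    best_ratio = max(
        c[word] / (sum(c[b] for b in BASE_WORDS) or 1) for c in containing
    )
    return best_ratio / len(containing)


def select_search_words(candidates: list[str], threshold: float = 0.1) -> list[str]:
    """Keep non-base words whose value reaches the (assumed) threshold."""
    return [
        w for w in candidates
        if w not in BASE_WORDS and search_word_value(w) >= threshold
    ]


selected = select_search_words(["court", "contract", "patient", "the"])
```

With these numbers "contract" is rejected because it appears in both classifications, which halves its value, while "court" and "patient" each appear in only one.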
16. A method of operating a computerized document classification, comprising the steps of:
establishing at least one classification important word for a plurality of document classifications derived from words likely to be within the document classifications;
automatically detecting the at least one classification important word within a reference document wherein the at least one classification important word from the reference document has been processed using at least two dictionary functions selected from a group of dictionary functions consisting of:
Derived Words;
Acronym;
Word Capitalization; and
Hyphenation;
processing documents in a database in which an importance value of the at least one classification important word for each processed document is generated using a WordRatio, which comprises a normalization factor for the classification important word that is based upon the relative occurrences of a predetermined plurality of base words and at least one value selected from a group of values consisting of:
a value defined for the at least one classification important word related to a document section that occurs in the document being processed;
a classification importance value for the at least one classification important word;
a value defined for the at least one classification important word in a document type that applies to the document being processed;
a value defined for the at least one classification important word across multiple document classifications; and
a value based on the statistical occurrence of the at least one classification important word across at least two different documents;
comparing the number of occurrences of the at least one classification important word in a first document having a first classification and a second document having a second classification;
generating a classification score for the reference document that is related to at least one word score value for the at least one classification important word; and
providing a list of documents from the database that are related to at least one word of the at least one classification important word and to an indication of a quality of a match between the at least one classification important word and each related document in the database.
Dependent claim: 17
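The final step of claim 16, a list of related documents with an indication of match quality, might look like the following sketch. The quality measure used here (the fraction of classification important words found in each document) is an assumption; the claim does not say how match quality is computed.

```python
def match_list(important_words: list[str], database: dict[str, str]):
    """Return (doc_id, matched_words, quality) tuples, best match first.

    Quality is the assumed fraction of important words present in the document.
    Documents that contain none of the words are omitted.
    """
    results = []
    for doc_id, text in database.items():
        tokens = set(text.lower().split())
        matched = sorted(w for w in important_words if w in tokens)
        if matched:
            quality = len(matched) / len(important_words)
            results.append((doc_id, matched, quality))
    results.sort(key=lambda r: (-r[2], r[0]))
    return results


words = ["turbine", "engine", "blade"]
database = {
    "d1": "turbine blade engine wear",
    "d2": "engine oil change",
    "d3": "soup recipe",
}
matches = match_list(words, database)
```

In this example `d1` contains all three words and is listed first with quality 1.0, `d2` follows with one of three words matched, and `d3` is not listed.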
Specification