System and method for document categorization
Abstract
The present invention provides methods and systems for automatic categorization of documents. More specifically, the present invention provides for the automatic assignment of a set of pre-defined topics to a set of documents.
140 Citations
4 Claims
1. A document categorization method implemented by a computer processor for creating associations between one or more documents with a predefined topic, wherein each said predefined topic comprising:
a topic name, a topic threshold and a topic query, wherein said topic threshold comprising a percentage value ranging from zero to one hundred, and wherein said topic query comprising one or more terms, each of said terms comprising; a word or a phrase, logical and grouping operators that define relationships between said terms, said method comprising;
identifying matching documents using said topic query;
screening said matching documents to produce document-topic associations for each of said documents that also match said topic threshold and said topic query, said screening comprising;
computing a score for each of said matching documents, wherein said score value equals the similarity of each of said document with said topic;
sorting said matching documents in order of said computed score; and
selecting a subset of said matching documents wherein said subset being defined by said topic threshold,
wherein said method implements a bimodal classifier that produces document-topic associations that are consistently accurate in terms of precision and recall over document collections that change over time in size and composition.
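The steps recited in claim 1 (identify matches, score, sort, select a subset defined by the threshold) can be illustrated with a minimal sketch. This is not the patented implementation. The names `Topic`, `matches`, `score`, and `categorize` are hypothetical. The topic query is simplified to a list of lowercase single words that must all appear. The similarity score is assumed to be term frequency normalized by document length. The threshold is assumed to mean "keep the top N percent of matching documents", an interpretation suggested by the "proportion of said matching documents" wording in claim 3.

```python
import math
from dataclasses import dataclass


@dataclass
class Topic:
    name: str
    threshold: float   # percentage value, 0 to 100
    terms: list[str]   # assumption: simplified query, all terms ANDed together


def matches(topic: Topic, text: str) -> bool:
    """Identify a matching document: every query term occurs as a word."""
    words = text.lower().split()
    return all(term in words for term in topic.terms)


def score(topic: Topic, text: str) -> float:
    """Assumed similarity measure: query-term occurrences per word of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(term) for term in topic.terms)
    return hits / len(words)


def categorize(topic: Topic, docs: dict[str, str]) -> list[str]:
    """Return ids of documents associated with the topic."""
    # Step 1: identify matching documents using the topic query.
    matching = [doc_id for doc_id, text in docs.items() if matches(topic, text)]
    # Steps 2 and 3: compute a score per match and sort, highest first.
    ranked = sorted(matching, key=lambda d: score(topic, docs[d]), reverse=True)
    # Step 4: keep the top `threshold` percent of the ranked matches.
    keep = math.ceil(len(ranked) * topic.threshold / 100)
    return ranked[:keep]
```

Under these assumptions the cutoff is relative to the current set of matches rather than a fixed absolute score, so the number of selected documents scales with the collection as it grows or shrinks.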
2. A computer implemented system for automatically creating associations between one or more documents with a predefined topic, comprising:
a memory; and
a processor;
wherein the predefined topic comprising a topic name, a topic threshold and a topic query, wherein said topic threshold comprising a percentage value ranging from zero to one hundred, and wherein said topic query comprising one or more terms, each of said terms comprising; a word or a phrase, logical and grouping operators that define relationships between said terms, said system comprising;
identifying matching documents using said topic query;
screening said matching documents to produce document-topic associations for each of said documents that also match said topic threshold and said topic query, said screening comprising;
computing a score for each of said matching documents, wherein said score value equals the similarity of each of said document with said topic;
sorting said matching documents in order of said computed score; and
selecting a subset of said matching documents wherein said subset being defined by said topic threshold,
wherein said system implements a bimodal classifier that produces document-topic associations that are consistently accurate in terms of precision and recall over document collections that change over time in size and composition.
3. A computer implemented system for creating associations between a set of topics and a set of documents, said system including:
a classifier for each topic, each said classifier comprising;
a query expression, each said query expression comprising; one or more terms, each said term comprising a word, or a phrase, and, logical and grouping operators, said operators defining relationships between said terms;
a similarity threshold;
a classification engine, said classification engine comprising;
a searcher for determining if a specified document matches a specified query expression,
a scorer for computing the similarity score between each said matching document and said query expression,
a selector for selecting a proportion of said matching documents as specified by said similarity threshold; and
storage means for storing the output of said classification engine,
wherein said system implements a bimodal classifier that produces document-topic associations that are consistently accurate in terms of precision and recall over document collections that change over time in size and composition.
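The components named in claim 3 (query expression with logical and grouping operators, searcher, scorer, selector, storage) might be arranged as in the following sketch. Everything here is an illustrative assumption rather than the system described in the specification: query expressions are modelled as nested tuples such as `("and", ("or", "solar", "wind"), ("not", "coal"))`, where nesting plays the role of grouping; terms are matched by case-insensitive substring search, which also handles phrases but would match `wind` inside `window`; the scorer counts occurrences of non-negated terms; and the storage means is a plain dictionary.

```python
import math


def search(expr, text):
    """Searcher: does the text match the query expression?"""
    if isinstance(expr, str):            # a word or a phrase
        return expr.lower() in text.lower()
    op, *args = expr
    if op == "and":
        return all(search(a, text) for a in args)
    if op == "or":
        return any(search(a, text) for a in args)
    if op == "not":
        return not search(args[0], text)
    raise ValueError(f"unknown operator: {op}")


def leaf_terms(expr, negated=False):
    """Yield the terms of the expression that are not under a 'not'."""
    if isinstance(expr, str):
        if not negated:
            yield expr
        return
    op, *args = expr
    for a in args:
        yield from leaf_terms(a, negated != (op == "not"))


def similarity(expr, text):
    """Scorer (assumed measure): total occurrences of non-negated terms."""
    lowered = text.lower()
    return sum(lowered.count(t.lower()) for t in leaf_terms(expr))


def select(scored, threshold_pct):
    """Selector: keep the top `threshold_pct` percent of scored matches."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = math.ceil(len(ranked) * threshold_pct / 100)
    return [doc_id for doc_id, _ in ranked[:keep]]


def classify(classifiers, docs):
    """Classification engine.

    classifiers: {topic_name: (query_expression, threshold_pct)}
    docs:        {doc_id: text}
    Returns the stored output: {topic_name: [doc_id, ...]}.
    """
    store = {}
    for topic, (expr, pct) in classifiers.items():
        scored = [(doc_id, similarity(expr, text))
                  for doc_id, text in docs.items() if search(expr, text)]
        store[topic] = select(scored, pct)
    return store
```

Because Python's sort is stable, documents with equal scores keep their input order in this sketch; a real system would need an explicit tie-breaking rule.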
4. A computer-readable medium having computer instructions stored thereon, which, when executed by a processor, causes the processor to perform operations comprising:
creating associations between one or more documents with a predefined topic, wherein said topic comprising;
a topic name, a topic threshold and a topic query, wherein said topic threshold comprising a percentage value ranging from zero to one hundred, and wherein said topic query comprising one or more terms, each of said terms comprising; a word or a phrase, logical and grouping operators that define relationships between said terms, said operations comprising;
identifying matching documents using said topic query;
screening said matching documents to produce document-topic associations for each of said documents that also match said topic threshold and said topic query, said screening comprising;
computing a score for each of said matching documents, wherein said score value equals the similarity of each of said document with said topic;
sorting said matching documents in order of said computed score; and
selecting a subset of said matching documents wherein said subset being defined by said topic threshold,
wherein said method implements a bimodal classifier that produces document-topic associations that are consistently accurate in terms of precision and recall over document collections that change over time in size and composition.
Specification