Methodologies and analytics tools for identifying white space opportunities in a given industry
First Claim
1. A method for use with at least one keyword retrieved from a first set of documents, wherein the keyword corresponds to a predefined subject matter, the method comprising:
constructing snippets from textual material in said first set of documents stored on a computer to produce constructed snippets, each of said constructed snippets including at least one non-key word appearing within a specified text distance of said at least one keyword;
defining, by a computer processor, a plurality of categories wherein each of said constructed snippets is assigned to one of said plurality of categories, only if said assigned snippet is not already assigned to another of said plurality of said categories, each of said plurality of categories being designated for receiving similar constructed snippets;
creating a respective mathematical model for each of said plurality of categories;
analyzing a second set of documents to determine an assignment for each document in said second set of documents to a selected one of said plurality of categories, said assignment based on matching each of said documents in said second set of documents to said mathematical model for the selected one of said plurality of categories;
assigning a numeric vector to each document of the first and second sets of documents, wherein the numeric vector represents occurrences of one of the constructed snippets within the respective document;
creating a partition taxonomy that includes less than all of the plurality of categories, wherein the partition taxonomy creation is based on a clustered configuration of the first and second sets of documents;
editing, using a computer processor, less than all of the plurality of categories in the partition taxonomy using domain expertise to produce edited categories in an edited partition taxonomy, such that each document of the first and second sets of documents is assigned to a corresponding one of the less than all of the plurality of categories;
creating a classification taxonomy based on the edited partition taxonomy, based on a number of documents in each of the edited categories, based on percentage similarity of words between documents in one of the edited categories, and based on distances between category centroids of the edited categories, wherein a category centroid for an edited category is an average of values of the numeric vectors for the documents in the category;
identifying at least one white space in said classification taxonomy, said at least one white space including one or more of the edited categories that contain fewer than a specified number of documents.
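The first and last steps of the claim (constructing snippets around a keyword, and flagging categories that hold fewer than a specified number of documents) can be illustrated with a minimal sketch. This is not the patented implementation: the function names `construct_snippets` and `find_white_spaces`, the word-count window used as the "specified text distance", and the sample data are all assumptions made for illustration.

```python
import re
from collections import Counter

def construct_snippets(text, keyword, window=3):
    """Return snippets: the keyword plus up to `window` words on each side.

    Assumption: "text distance" is measured in whole words.
    """
    key = keyword.lower()
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    snippets = []
    for i, word in enumerate(words):
        if word == key:
            start = max(0, i - window)
            snippet = words[start:i + window + 1]
            # Keep only snippets that contain at least one non-key word.
            if any(w != key for w in snippet):
                snippets.append(" ".join(snippet))
    return snippets

def find_white_spaces(category_of_doc, categories, min_docs):
    """Return the categories that hold fewer than `min_docs` documents."""
    counts = Counter(category_of_doc.values())
    # Counter returns 0 for categories with no documents at all.
    return sorted(c for c in categories if counts[c] < min_docs)

# Hypothetical sample data.
text = "A lithium battery stores energy. The battery anode uses graphite."
snippets = construct_snippets(text, "battery", window=1)

category_of_doc = {"d1": "anode", "d2": "anode", "d3": "cathode"}
white = find_white_spaces(category_of_doc, ["anode", "cathode", "separator"], min_docs=2)
```

Here an empty category ("separator") and a sparsely populated one ("cathode") are both reported, since the claim defines white space only by a document-count threshold.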
Abstract
A method for analyzing predefined subject matter in a patent database being for use with a set of target patents, each target patent related to the predefined subject matter, the method comprising: creating a feature space based on frequently occurring terms found in the set of target patents; creating a partition taxonomy based on a clustered configuration of the feature space; editing the partition taxonomy using domain expertise to produce an edited partition taxonomy; creating a classification taxonomy based on structured features present in the edited partition taxonomy; creating a contingency table by comparing the edited partition taxonomy and the classification taxonomy to provide entries in the contingency table; and identifying all significant relationships in the contingency table to help determine the presence of any white space.
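The contingency-table step in the abstract can be sketched as follows. The abstract does not say how "significant relationships" are measured, so the test used here (observed count at least `ratio` times the count expected under independence) is an assumption, as are the function names and the sample labels; the country codes stand in for a hypothetical structured feature.

```python
from collections import Counter

def contingency_table(partition_labels, classification_labels):
    """Count documents for each (partition category, classification category) pair."""
    return Counter(zip(partition_labels, classification_labels))

def significant_cells(table, ratio=1.5):
    """Return cells whose observed count is at least `ratio` times the
    count expected if the two taxonomies were independent (assumed test)."""
    total = sum(table.values())
    row_totals, col_totals = Counter(), Counter()
    for (row, col), n in table.items():
        row_totals[row] += n
        col_totals[col] += n
    flagged = []
    for (row, col), n in table.items():
        expected = row_totals[row] * col_totals[col] / total
        if n >= ratio * expected:
            flagged.append((row, col))
    return sorted(flagged)

# Hypothetical labels for six documents, one entry per document.
partition = ["anode", "anode", "anode", "cathode", "cathode", "cathode"]
classification = ["US", "US", "JP", "JP", "JP", "US"]

table = contingency_table(partition, classification)
flagged = significant_cells(table, ratio=1.3)
```

Category pairs that never co-occur do not appear in `table` at all; in a white space analysis those absent or under-populated cells are the ones of interest.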
21 Claims
1. The method set out under First Claim above. Dependent claims: 2 to 18.
19. A computer program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform a method comprising the steps of:
assembling a first set of target documents and a second set of target documents using one or more keywords, each of said first set of target documents and the second set of target documents including a predefined subject matter;
analyzing each of said first set of target documents and second set of target documents to derive a count of occurrences of said keywords in each of said first set of target documents and second set of target documents;
partitioning said first set of target documents and the second set of target documents into a plurality of categories based on non-key words or phrases appearing within a specified distance of one of said keywords;
determining a centroid for each category of the plurality of categories as an average of values of numeric vectors representing features of the non-key words and the keywords in the category of the plurality of categories, wherein each numeric vector represents occurrences of one of the non-key words or one of the keywords within one of the first set of target documents or the second set of target documents;
creating a first taxonomy for the first set of target documents, wherein the creation of the first taxonomy is based on less than all of the plurality of categories;
creating a second taxonomy for the second set of target documents, wherein the second taxonomy is based on a number of documents in each of the plurality of categories, percentage similarity of the non-key words and the keywords between different ones of the second set of target documents in one of the plurality of categories, and percentage difference that the centroids for the plurality of categories differ from one another;
creating a third taxonomy that includes both the first set of target documents and the second set of target documents, wherein the creation of the third taxonomy is based on both the first taxonomy and the second taxonomy, such that each one of the first set of target documents and each one of the second set of target documents are assigned to a corresponding category among the plurality of categories, wherein a total number of the plurality of categories is based on a combined size of the first set of target documents and the second set of target documents; and
merging a selected two of the plurality of categories that have a closeness of centroids between the selected two of the plurality of categories such that the closeness is less than a threshold.
Dependent claim: 20.
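The centroid and merge steps recited in claim 19 can be illustrated with a short sketch. The claim defines a centroid as an average of numeric vectors but does not name a closeness measure, so Euclidean distance is an assumption here, as are the function names, the choice to merge only the single closest pair, and the sample vectors.

```python
import math

def centroid(vectors):
    """Average the numeric vectors of a category, component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two centroids (assumed closeness measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_closest(categories, threshold):
    """Merge the closest pair of categories if their centroids are
    closer than `threshold`; otherwise return the categories unchanged."""
    names = sorted(categories)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = distance(centroid(categories[a]), centroid(categories[b]))
            if d < threshold and (best is None or d < best[0]):
                best = (d, a, b)
    if best is None:
        return dict(categories)
    _, a, b = best
    merged = {k: v for k, v in categories.items() if k not in (a, b)}
    merged[a + "+" + b] = categories[a] + categories[b]
    return merged

# Hypothetical categories: each document is a vector of term occurrence counts.
categories = {
    "anode": [[2, 0], [4, 0]],
    "electrode": [[3, 1]],
    "packaging": [[0, 9]],
}
merged = merge_closest(categories, threshold=2.0)
```

With these vectors the "anode" and "electrode" centroids are one unit apart and are merged, while "packaging" stays separate.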
21. A computer program product stored on a non-transitory computer storage medium for use with at least one keyword retrieval from a first set of documents corresponding to a predefined subject matter, wherein when executed on a computer the program product causes the computer to:
construct snippets from textual material in said first set of documents, each of said constructed snippets including at least one non-key word appearing within a specified text distance of said at least one keyword;
define a plurality of categories wherein each of said constructed snippets is assigned to one of said plurality of categories, only if said assigned snippet is not already assigned to another of said plurality of said categories, each of said plurality of categories designated for receiving at least one of said constructed snippets;
create a mathematical model for each of said plurality of categories;
analyze a second set of documents to determine an assignment for each document in said second set of documents to a first one of said plurality of categories, said assignment based on matching each of said documents in said second set of documents to said created mathematical models for the first one of said plurality of categories;
analyze a third set of documents to determine an assignment for each document in said third set of documents to a second respective one of said plurality of categories, said assignment based on matching each of said documents in said third set of documents to said created mathematical model for the second respective one of said plurality of categories, wherein a total number of the categories is generated based on a size of the third set of documents;
determine a centroid for each category of the plurality of categories as an average of values of numeric vectors representing features of both key and non-key words in the category of the plurality of categories, wherein each of the numeric vectors represents occurrences of one of the snippets within one of the first set of documents;
perform interactive clustering of the plurality of categories using domain expertise;
merge two of the plurality of categories that have a closeness of centroids between the plurality of categories such that the closeness is less than a threshold; and
identify at least one white space in said second set of documents, said at least one white space including all of said plurality of categories, including the merged categories, with fewer than a specified number of documents.
Specification