Automated document analysis for varying natural languages
First Claim
1. A computer-implemented method comprising:
receiving a plurality of documents containing text written in a type of natural language, each document associated with a unique document identification number;
representing text included in the plurality of documents using unique computer representations for each word in the text, the computer representations comprising ASCII, Unicode, or an equivalent technology;
preprocessing the plurality of documents by:
generating one or more document portions from each of the plurality of documents, each one of the document portions associated with one of the unique document identification numbers;
parsing the text included in the plurality of documents into separate words based at least in part on each word's associated computer representation;
identifying stop words, duplicate words, and punctuation in the text based at least in part on the respective computer representation associated with the individual stop words, duplicate words, and punctuation; and
removing the stop words, duplicate words, and punctuation from the text;
generating a word count for each of the document portions by counting the number of computer representations of separate words in each one of the document portions;
identifying a referential word count;
calculating a word count ratio for each of the document portions by dividing the referential word count by the word count for each individual one of the document portions;
determining, based at least in part on the computer representations, a word frequency for each word included in the document portions, the word frequency being a total number of instances that a word is found in the document portions prior to removal of duplicate words;
generating a commonness score for each of the document portions by taking the square root of the sum of the squares of the inverse of the word frequency for each one of the separate words in the individual ones of the document portions;
identifying a document portion of the document portions having a highest commonness score;
calculating a commonness score ratio for each of the document portions by dividing the highest commonness score by the commonness score for the individual ones of the document portions;
calculating an overall score for each of the document portions based on a normalization of the square root of the sum of the square of the word count ratio and the square of the commonness score ratio for the individual ones of the document portions; and
generating a user interface including at least one overall score for one of the document portions in proximity to the unique document identification number associated with the one of the document portions and an indicia indicating one or more anomalies for the one of the document portions.
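The scoring steps recited in this claim can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list is a toy one, the referential word count is assumed to be the largest word count among the portions, and the "normalization" is assumed to be division by the largest raw score. The names `tokenize` and `score_portions` are hypothetical, and the user-interface step is omitted.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real list would be far longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_portions(portions):
    """Map each portion id to an overall score in (0, 1].

    `portions` maps a portion id to its raw text. Portions left with no
    words after preprocessing are skipped (an assumption; the claim is
    silent on this case).
    """
    kept = {}
    for pid, text in portions.items():
        words = [w for w in tokenize(text) if w not in STOP_WORDS]
        if words:
            kept[pid] = words
    if not kept:
        return {}

    # Word frequency across all portions, counted before duplicates are removed.
    freq = Counter(w for words in kept.values() for w in words)

    # Remove duplicate words, then count what remains in each portion.
    unique = {pid: set(words) for pid, words in kept.items()}
    word_count = {pid: len(ws) for pid, ws in unique.items()}

    # Commonness score: square root of the sum of squared inverse frequencies.
    commonness = {
        pid: math.sqrt(sum((1 / freq[w]) ** 2 for w in ws))
        for pid, ws in unique.items()
    }

    ref_count = max(word_count.values())       # assumed referential word count
    top_commonness = max(commonness.values())  # highest commonness score

    # Raw score: sqrt(word_count_ratio^2 + commonness_ratio^2).
    raw = {
        pid: math.hypot(ref_count / word_count[pid],
                        top_commonness / commonness[pid])
        for pid in kept
    }

    # Assumed normalization: scale so the highest raw score becomes 1.0.
    top_raw = max(raw.values())
    return {pid: r / top_raw for pid, r in raw.items()}
```

For two portions, "a widget comprising a frame" and "a widget comprising a frame, a motor, and a rare flange", this sketch gives the first portion a score of 1.0 and the second about 0.47: the shorter portion built from more frequently used words scores higher.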
Abstract
Manual human processing of documents often generates results that are subjective and include human error. The cost and relatively slow speed of manual, human analysis make it effectively impossible or impracticable to perform document analysis at the scale, speed, and cost desired in many industries. Accordingly, it may be advantageous to employ objective, accurate rule-based techniques to evaluate and process documents. This application discloses data processing equipment and methods specially adapted for a specific application: analysis of the breadth of documents. The processing may include context-dependent pre-processing of documents and sub-portions of the documents. The sub-portions may be analyzed based on word count and commonality of words in the respective sub-portions. The equipment and methods disclosed herein improve upon other automated techniques to provide document processing by achieving a result that quantitatively improves upon manual, human processing.
20 Claims
7. A method for automatically assigning a claim breadth score to a patent claim, the method comprising:
obtaining a data file including a corpus of patent claims that include the patent claim;
obtaining a first set of rules that defines an anomalous patent claim, the first set of rules comprising a first rule for identifying at least one of a dependent patent claim, a deleted patent claim, a means-plus-function patent claim, or a patent claim containing normative language; and
at least one of:
generating an ignore list for patent claims included in the corpus of patent claims by applying the first set of rules; or
generating an indicium marking the patent claim by applying the first set of rules;
determining a jurisdiction in which the patent claim was filed;
determining substantive law associated with the jurisdiction;
obtaining a second set of rules that define a word count score for the patent claim as a function of word count in the patent claim;
obtaining a third set of rules that define a commonness score for the patent claim as a function of the frequency with which words in the patent claim are present in the corpus of patent claims;
generating the word count score and the commonness score for the patent claim by evaluating the patent claim against the second set of rules and the third set of rules;
generating a claim breadth score for the patent claim based at least in part on the word count score, the commonness score, and the substantive law associated with the jurisdiction; and
producing, based at least partly on the claim breadth score, a ranking of the patent claim with respect to a plurality of other patent claims from the corpus of patent claims.
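The first set of rules in this claim (flagging dependent, deleted, means-plus-function, and normative-language claims) and the final ranking step can be sketched as below. The regular expressions are hypothetical heuristics chosen for illustration; the claim does not specify how each category is detected. The jurisdiction and substantive-law steps are left out, and `flag_claim`, `build_ignore_list`, and `rank_claims` are assumed names.

```python
import re

# Hypothetical detection heuristics, one per anomaly category in the claim.
RULES = {
    "dependent": re.compile(r"\b(?:of|in|to|with)\s+claims?\s+\d+", re.I),
    "deleted": re.compile(
        r"^\s*\(?(?:canceled|cancelled|deleted)\)?\.?\s*$", re.I
    ),
    "means_plus_function": re.compile(r"\bmeans\s+for\b", re.I),
    "normative": re.compile(r"\b(?:must|required|essential|necessary)\b", re.I),
}


def flag_claim(text):
    """Return the names of all rules that the claim text triggers."""
    return [name for name, pattern in RULES.items() if pattern.search(text)]


def build_ignore_list(claims):
    """Map claim number -> triggered rule names, for flagged claims only."""
    flagged = {}
    for number, text in claims.items():
        hits = flag_claim(text)
        if hits:
            flagged[number] = hits
    return flagged


def rank_claims(breadth_scores, ignore_list):
    """Rank non-ignored claim numbers from highest to lowest breadth score."""
    kept = [n for n in breadth_scores if n not in ignore_list]
    return sorted(kept, key=lambda n: breadth_scores[n], reverse=True)
```

The same `flag_claim` output could instead be attached to a claim as a marker rather than used to exclude it, which corresponds to the "indicium" alternative in the claim.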
13. One or more computing devices for automatically analyzing a corpus of patent documents, the one or more computing devices comprising:
one or more processing units;
one or more memories coupled to the one or more processing units and storing computer-readable instructions that, when executed by the one or more processing units, perform operations comprising:
for a first point in prosecution for a first portion of the corpus of patent documents:
processing claim sections of the first portion of the corpus of patent documents by delimiting individual claims, stemming words in the individual claims to root forms, removing duplicate root forms from the individual claims, and removing stop words from the individual claims;
detecting and removing dependent claims and deleted claims from the individual claims of the first portion of the corpus of patent documents; and
calculating a first claim breadth score for each individual claim of the first portion of the corpus of patent documents that are not removed, the individual first claim breadth scores being based on a word count score of a first individual claim and a commonness score of the first individual claim;
for a second point in prosecution for the first portion of the corpus of patent documents:
processing the claim sections of the first portion of the corpus of patent documents by delimiting individual claims, stemming words in the individual claims to root forms, removing duplicate root forms from the individual claims, and removing stop words from the individual claims;
detecting and removing dependent claims and deleted claims from the individual claims of the first portion of the corpus of patent documents; and
calculating a second claim breadth score for each individual claim of the first portion of the corpus of patent documents that are not removed, the second individual claim breadth scores being based on a word count score of a second individual claim and a commonness score of the second individual claim.
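The preprocessing recited in this claim (delimiting claims, stemming, removing duplicate root forms and stop words, and dropping dependent and deleted claims) might look roughly like the sketch below, applied once per point in prosecution. Everything here is an assumption made for illustration: `crude_stem` is a toy suffix stripper standing in for a real stemmer, the stop-word list is tiny, stop words are filtered before stemming for simplicity, and the breadth score itself is not computed. Only the per-claim root-form lists that would feed a word count score are produced.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for"}  # toy list
DEPENDENT = re.compile(r"\bclaims?\s+\d+", re.I)
DELETED = re.compile(r"^\(?(?:canceled|cancelled|deleted)\)?\.?$", re.I)


def delimit_claims(section):
    """Split a claims section on leading 'N. ' markers into {number: text}."""
    parts = re.split(r"(?m)^\s*(\d+)\.\s+", section)
    # parts is [preamble, number, text, number, text, ...]
    return {int(parts[i]): parts[i + 1].strip()
            for i in range(1, len(parts) - 1, 2)}


def crude_stem(word):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def root_forms(text):
    """Return unique root forms in first-seen order, without stop words."""
    roots, seen = [], set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        root = crude_stem(word)
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


def independent_claims(section):
    """Preprocess a claims section, dropping dependent and deleted claims."""
    return {
        number: root_forms(text)
        for number, text in delimit_claims(section).items()
        if not DEPENDENT.search(text) and not DELETED.search(text)
    }
```

Running `independent_claims` on the claims as filed and again on the claims as granted gives two sets of root-form lists for the same documents. If a claim's list grows between the two points, limitations were added during prosecution, which would lower a word-count-based breadth score.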
Specification