Automated document analysis for varying natural languages

US 10,366,461 B2
Filed: 03/06/2017
Issued: 07/30/2019
Est. Priority Date: 03/06/2017
Status: Active Grant

Abstract

Manual human processing of documents often generates results that are subjective and include human error. The cost and relatively slow speed of manual, human analysis make it effectively impossible or impracticable to perform document analysis at the scale, speed, and cost desired in many industries. Accordingly, it may be advantageous to employ objective, accurate rule-based techniques to evaluate and process documents. This application discloses data processing equipment and methods specially adapted for a specific application: analysis of the breadth of documents. The processing may include context-dependent pre-processing of documents and sub-portions of the documents. The sub-portions may be analyzed based on word count and commonality of words in the respective sub-portions. The equipment and methods disclosed herein improve upon other automated techniques to provide document processing by achieving a result that quantitatively improves upon manual, human processing.

20 Claims

1. A computer-implemented method comprising:
- receiving a plurality of documents containing text written in a type of natural language, each document associated with a unique document identification number;
  
  representing text included in the plurality of documents using unique computer representations for each word in the text, the computer representations comprising ASCII, Unicode, or an equivalent technology;
  
  preprocessing the plurality of documents by:
  
  generating one or more document portions from each of the plurality of documents, each one of the document portions associated with one of the unique document identification numbers;
  
  parsing the text included in the plurality of documents into separate words based at least in part on each word's associated computer representation;
  
  identifying stop words, duplicate words, and punctuation in the text based at least in part on the respective computer representation associated with the individual stop words, duplicate words, and punctuation; and
  
  removing the stop words, duplicate words, and punctuation from the text;
  
  generating a word count for each of the document portions by counting the number of computer representations of separate words in each one of the document portions;
  
  identifying a referential word count;
  
  calculating a word count ratio for each of the document portions by dividing the referential word count by the word count for each individual one of the document portions;
  
  determining, based at least in part on the computer representations, a word frequency for each word included in the document portions, the word frequency being a total number of instances that a word is found in the document portions prior to removal of duplicate words;
  
  generating a commonness score for each of the document portions by taking the square root of the sum of the squares of the inverse of the word frequency for each one of the separate words in the individual ones of the document portions;
  
  identifying a document portion of the document portions having a highest commonness score;
  
  calculating a commonness score ratio for each of the document portions by dividing the highest commonness score by the commonness score for the individual ones of the document portions;
  
  calculating an overall score for each of the document portions based on a normalization of the square root of the sum of the square of the word count ratio and the square of the commonness score ratio for the individual ones of the document portions; and
  
  generating a user interface including at least one overall score for one of the document portions in proximity to the unique document identification number associated with the one of the document portions and an indicia indicating one or more anomalies for the one of the document portions.
- - 2. The computer-implemented method of claim 1, wherein the preprocessing further comprises:
    - identifying, based at least in part on the type of natural language, a listing of the stop words and the duplicate words specific to patent laws of a jurisdiction associated with the type of natural language; and
      
      determining a computer representation associated with each of the stop words and the duplicate words for the type of natural language.
  - 3. The computer-implemented method of claim 2, wherein the listing comprises a first listing of first stop words and first duplicate words, and wherein identifying the first listing comprises:
    - querying a database including:
      
      first computer representations of the first stop words and the first duplicate words, the first stop words and first duplicate words corresponding to words written in the type of natural language, wherein the type of natural language includes a first type of natural language that is a natural language other than English; and
      
      second computer representations of second stop words and second duplicate words included in a second listing, the second stop words and second duplicate words corresponding to a second type of natural language that includes words written in English; and
      
      selecting the first listing of the first stop words and the first duplicate words based at least in part on the type of natural language.
  - 4. The computer-implemented method of claim 1, wherein the preprocessing of the plurality of documents is performed independent of a translation of the text from the type of natural language to another type of natural language.
  - 5. The computer-implemented method of claim 4, wherein the type of natural language comprises a natural language other than English, and the other type of natural language comprises English.
  - 6. The computer-implemented method of claim 1, wherein the plurality of documents containing text comprise patents, the unique document identification numbers comprise patent numbers, and the document portions comprise patent claims.
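The scoring steps recited in claim 1 can be sketched in Python as follows. This is only an illustrative reading of the claim language, not the patented implementation. Several details are assumptions because the claim leaves them open: the `STOP_WORDS` set is a hypothetical placeholder, the referential word count is taken to be the highest word count among the portions, and the "normalization" is taken to be scaling so that the top raw score maps to 100. The names `tokenize` and `score_portions` are invented for this sketch.

```python
import math
import re
from collections import Counter

# Hypothetical stop list; the claim does not specify the actual listing.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "said", "wherein"}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation and stop words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def score_portions(portions):
    """Map each portion id to an overall score on an assumed 0-100 scale."""
    tokens = {pid: tokenize(text) for pid, text in portions.items()}

    # Word frequency: total instances across all portions,
    # counted before duplicate words are removed.
    freq = Counter(w for words in tokens.values() for w in words)

    # Remove duplicate words; portions left with no words are skipped.
    unique = {pid: set(words) for pid, words in tokens.items() if words}
    if not unique:
        return {}

    word_count = {pid: len(ws) for pid, ws in unique.items()}

    # Commonness score: sqrt of the sum of squared inverse word frequencies.
    commonness = {
        pid: math.sqrt(sum((1 / freq[w]) ** 2 for w in ws))
        for pid, ws in unique.items()
    }

    ref_word_count = max(word_count.values())  # assumption, see lead-in
    top_commonness = max(commonness.values())

    # Raw score: sqrt(word_count_ratio^2 + commonness_score_ratio^2).
    raw = {
        pid: math.hypot(ref_word_count / word_count[pid],
                        top_commonness / commonness[pid])
        for pid in unique
    }

    top_raw = max(raw.values())  # assumed normalization: top score -> 100
    return {pid: 100 * r / top_raw for pid, r in raw.items()}


if __name__ == "__main__":
    demo = {
        "US-1/claim-1": "A widget comprising a frame.",
        "US-2/claim-1": "A widget comprising a frame, a motor, "
                        "and a gearbox coupled to the motor.",
    }
    for pid, score in score_portions(demo).items():
        print(pid, round(score, 1))
```

Under these assumptions, the shorter portion built from words that recur across the corpus receives the higher overall score, which matches the claim's pairing of low word count and common wording with breadth. A language-specific stop list, as in claims 2 and 3, could be swapped in by selecting `STOP_WORDS` per language.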

7. A method for automatically assigning a claim breadth score to a patent claim, the method comprising:
- obtaining a data file including a corpus of patent claims that include the patent claim;
  
  obtaining a first set of rules that defines an anomalous patent claim, the first set of rules comprising a first rule for identifying at least one of a dependent patent claim, a deleted patent claim, a means-plus-function patent claim, or a patent claim containing normative language; and
  
  at least one of:
  
  generating an ignore list for patent claims included in the corpus of patent claims by applying the first set of rules;
  
  or generating an indicium marking the patent claim by applying the first set of rules;
  
  determining a jurisdiction in which the patent claim was filed;
  
  determining substantive law associated with the jurisdiction;
  
  obtaining a second set of rules that define a word count score for the patent claim as a function of word count in the patent claim;
  
  obtaining a third set of rules that define a commonness score for the patent claim as a function of the frequency with which words in the patent claim are present in the corpus of patent claims;
  
  generating the word count score and the commonness score for the patent claim by evaluating the patent claim against the second set of rules and the third set of rules;
  
  generating a claim breadth score for the patent claim based at least in part on the word count score, the commonness score, and the substantive law associated with the jurisdiction; and
  
  producing, based at least partly on the claim breadth score, a ranking of the patent claim with respect to a plurality of other patent claims from the corpus of patent claims.
- - 8. The method of claim 7, wherein the jurisdiction in which the patent claim was filed comprises China, and the substantive law comprises including words of a preamble of the patent claim in the word count in the patent claim.
  - 9. The method of claim 7, wherein the jurisdiction in which the patent claim was filed comprises the United States of America, and the substantive law comprises excluding words of a preamble of the patent claim in the word count in the patent claim.
  - 10. The method of claim 7, wherein when the ignore list is generated by applying the first set of rules, the word count score and the commonness score are not generated for patent claims included in the ignore list.
  - 11. The method of claim 7, wherein the second set of rules comprises a second rule defining the word count score as based on a number of words in the patent claim following pre-processing of the patent claim, wherein pre-processing the patent claim comprises stemming, removal of duplicate words, and removal of stop words.
  - 12. The method of claim 7, wherein the third set of rules comprises a third rule defining the commonness score as based on a per-claim commonness score, the per-claim commonness score calculated by a square root of a sum of, for each word in the patent claim following pre-processing, the square of the inverse of a global word count for each word.
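The rule-based steps of claims 7 through 10 can be illustrated with a short Python sketch. It is a hedged example only: the regular expressions in `ANOMALY_RULES` are hypothetical stand-ins for the "first set of rules", the jurisdiction codes `"US"` and `"CN"` are assumed labels, and splitting the preamble at the first occurrence of "comprising" is a simplification chosen for the example. Stemming and stop-word removal (claim 11) are omitted for brevity.

```python
import re
from dataclasses import dataclass, field

# Claims 8 and 9: preamble words are counted in China, excluded in the US.
COUNT_PREAMBLE = {"CN": True, "US": False}

# Hypothetical patterns standing in for the "first set of rules".
ANOMALY_RULES = {
    "dependent": re.compile(r"\b(of|according to|as claimed in) claim \d+", re.I),
    "deleted": re.compile(r"^\s*\(?(canceled|cancelled|deleted)\)?\.?\s*$", re.I),
    "means-plus-function": re.compile(r"\bmeans for\b", re.I),
    "normative language": re.compile(r"\b(must|required|essential)\b", re.I),
}


@dataclass
class Claim:
    number: int
    text: str
    jurisdiction: str
    anomalies: list = field(default_factory=list)


def flag_anomalies(claim):
    """Record and return the names of all rules the claim text matches."""
    claim.anomalies = [
        name for name, pattern in ANOMALY_RULES.items()
        if pattern.search(claim.text)
    ]
    return claim.anomalies


def countable_text(claim):
    """Return the text whose words count toward the word count score."""
    if COUNT_PREAMBLE.get(claim.jurisdiction, True):
        return claim.text
    # Assumption: the preamble ends at the first "comprising".
    _, sep, body = claim.text.partition("comprising")
    return body if sep else claim.text


def word_count(claim):
    """Count distinct words in the countable text (duplicates removed)."""
    return len(set(re.findall(r"\w+", countable_text(claim).lower())))


def ignore_list(claims):
    """Claim numbers for which no scores would be generated (claim 10)."""
    return [c.number for c in claims if flag_anomalies(c)]
```

With this layout, the same claim text yields a smaller word count when tagged `"US"` than when tagged `"CN"`, because only the latter includes the preamble. Whether a flagged claim is skipped (the ignore list) or merely marked (the indicium) is left to the caller, mirroring the "at least one of" wording in claim 7.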

13. One or more computing devices for automatically analyzing a corpus of patent documents, the one or more computing devices comprising:
- one or more processing units;
  
  one or more memories coupled to the one or more processing units and storing computer-readable instructions that, when executed by the one or more processing units, perform operations comprising:
  
  for a first point in prosecution for a first portion of the corpus of patent documents:
  
  processing claim sections of the first portion of the corpus of patent documents by delimiting individual claims, stemming words in the individual claims to root forms, removing duplicate root forms from the individual claims, and removing stop words from the individual claims;
  
  detecting and removing dependent claims and deleted claims from the individual claims of the first portion of the corpus of patent documents; and
  
  calculating a first claim breadth score for each individual claim of the first portion of the corpus of patent documents that are not removed, the individual first claim breadth scores being based on a word count score of a first individual claim and a commonness score of the first individual claim;
  
  for a second point in prosecution for the first portion of the corpus of patent documents:
  
  processing the claim sections of the first portion of the corpus of patent documents by delimiting individual claims, stemming words in the individual claims to root forms, removing duplicate root forms from the individual claims, and removing stop words from the individual claims;
  
  detecting and removing dependent claims and deleted claims from the individual claims of the first portion of the corpus of patent documents; and
  
  calculating a second claim breadth score for each of individual claim of the first portion of the corpus of patent documents that are not removed, the second individual claim breadth scores being based on a word count score of a second individual claim and a commonness score of the second individual claim.
- - 14. The one or more computing devices of claim 13, wherein:
    - the first point in prosecution comprises a first time prior to amendments being made to the individual claims of the first portion of the corpus of patent documents; and
      
      the second point in prosecution comprises a second time associated with a notice of allowance of the individual claims of the first portion of the corpus of patent documents.
  - 15. The one or more computing devices of claim 14, wherein the one or more memories store additional computer-readable instructions that, when executed by the one or more processing units, perform additional operations comprising determining an average change in claim breadth score between the first claim breadth scores and the second claim breadth scores.
  - 16. The one or more computing devices of claim 15, wherein:
    - the first portion of the corpus of patent documents corresponds to invention patents;
      
      a second portion of the corpus of patent documents corresponds to utility model patents; and
      
      the one or more memories store further computer-readable instructions that, when executed by the one or more processing units, perform further operations comprising determining a third claim breadth score for individual claims in the second portion of the corpus of patent documents based at least in part on the average change in claim breadth scores between the first claim breadth scores and the second claim breadth scores.
  - 17. The one or more computing devices of claim 16, wherein each patent of the corpus of patent documents are associated with a common classification, the common classification comprising at least one of:
    - a jurisdiction;
      
      a technology classification;
      
      an assignee;
      
      an applicant;
      
      or an inventor.
  - 18. The one or more computing devices of claim 13, wherein the one or memories store additional computer-readable instructions that, when executed by the one or more processing units, perform additional operations comprising determining a word count score for each of the individual claims based on a word count for each of the individual claims and a maximum word count for a claim from the corpus of patent documents having a highest word count.
  - 19. The one or more computing devices of claim 13, wherein the one or memories store additional computer-readable instructions that, when executed by the one or more processing units, perform additional operations comprising generating a user interface including a respective ranking, a respective claim breadth score, and a respective anomaly for at least a portion of the individual claims included in the corpus of patent documents.
  - 20. The one or more computing devices of claim 13, wherein the one or memories store additional computer-readable instructions that, when executed by the one or more processing units, perform additional operations comprising determining a commonness score for each of the individual claims based on frequencies of individual words in each of the individual claims occurring throughout all of the claims in the corpus of patent documents.
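The comparison across two points in prosecution (claims 13 through 16) can be sketched as below. This is an assumed reading rather than the disclosed method: claims are matched across the two points by a shared identifier, the average change is the mean of the per-claim differences, and the adjustment applied to a utility model claim in `adjusted_score` is a simple shift clamped to an assumed 0-100 range. Claim 16 says only that the third score is "based at least in part on" the average change, so the shift is purely illustrative. Both function names are invented for this sketch.

```python
from statistics import mean


def average_change(first_scores, second_scores):
    """Mean of (second - first) over claims scored at both points."""
    shared = first_scores.keys() & second_scores.keys()
    if not shared:
        raise ValueError("no claims scored at both points in prosecution")
    return mean(second_scores[c] - first_scores[c] for c in shared)


def adjusted_score(score, avg_change, floor=0.0, ceiling=100.0):
    """Assumed adjustment: shift a score by the average change and clamp it."""
    return min(ceiling, max(floor, score + avg_change))


if __name__ == "__main__":
    # Hypothetical breadth scores for invention patents.
    as_filed = {"A/1": 80.0, "B/1": 60.0}
    at_allowance = {"A/1": 70.0, "B/1": 40.0}

    delta = average_change(as_filed, at_allowance)
    # Apply the average change to a hypothetical utility model claim score.
    print(delta, adjusted_score(50.0, delta))
```

Claims that appear at only one of the two points (for example, claims removed as dependent or deleted) are skipped here, which is a design choice of the sketch and not something the claims specify.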

Current Assignee: Moat Metrics, Inc.
Original Assignee: Aon Risk Services, Inc. of Maryland (Aon Plc (Spain))
Inventors: Edmund, William Michael; Crouse, Daniel; Bradley, III, John E.
Primary Examiner(s): Godbold, Douglas
Assistant Examiner(s): Villena, Mark

Application Number: US15/451,138
Publication Number: US 20180253810A1
Time in Patent Office: 876 Days
CPC Class Codes

G06F 40/137   Hierarchical processing, e....

G06F 40/247   Thesauruses; Synonyms

G06F 40/253   Grammatical analysis; Style...

G06F 40/263   Language identification

G06F 40/284   Lexical analysis, e.g. toke...

G06Q 50/184   Intellectual property manag...
