Automated document analysis for varying natural languages

  • US 10,366,461 B2
  • Filed: 03/06/2017
  • Issued: 07/30/2019
  • Est. Priority Date: 03/06/2017
  • Status: Active Grant
First Claim

1. A computer-implemented method comprising:

    receiving a plurality of documents containing text written in a type of natural language, each document associated with a unique document identification number;

    representing text included in the plurality of documents using unique computer representations for each word in the text, the computer representations comprising ASCII, Unicode, or an equivalent technology;

    preprocessing the plurality of documents by:

    generating one or more document portions from each of the plurality of documents, each one of the document portions associated with one of the unique document identification numbers;

    parsing the text included in the plurality of documents into separate words based at least in part on each word's associated computer representation;

    identifying stop words, duplicate words, and punctuation in the text based at least in part on the respective computer representation associated with the individual stop words, duplicate words, and punctuation; and

    removing the stop words, duplicate words, and punctuation from the text;

    generating a word count for each of the document portions by counting the number of computer representations of separate words in each one of the document portions;

    identifying a referential word count;

    calculating a word count ratio for each of the document portions by dividing the referential word count by the word count for each individual one of the document portions;

    determining, based at least in part on the computer representations, a word frequency for each word included in the document portions, the word frequency being a total number of instances that a word is found in the document portions prior to removal of duplicate words;

    generating a commonness score for each of the document portions by taking the square root of the sum of the squares of the inverse of the word frequency for each one of the separate words in the individual ones of the document portions;

    identifying a document portion of the document portions having a highest commonness score;

    calculating a commonness score ratio for each of the document portions by dividing the highest commonness score by the commonness score for the individual ones of the document portions;

    calculating an overall score for each of the document portions based on a normalization of the square root of the sum of the square of the word count ratio and the square of the commonness score ratio for the individual ones of the document portions; and

    generating a user interface including at least one overall score for one of the document portions in proximity to the unique document identification number associated with the one of the document portions and an indicia indicating one or more anomalies for the one of the document portions.
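
The scoring steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation, and several details are assumptions the claim leaves open: the stop-word list, the regular-expression tokenizer, passing the referential word count in as a parameter, counting word frequency after stop words are removed, and normalizing by dividing each raw score by the largest raw score. The names `STOP_WORDS`, `tokenize`, and `score_portions` are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the claim does not enumerate one.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "is"}


def tokenize(text):
    """Split text into lowercase words, dropping punctuation and stop words."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def score_portions(portions, referential_word_count):
    """Return a normalized overall score for each document portion.

    portions maps (document_id, portion_index) to the portion's text.
    referential_word_count is supplied by the caller (an assumption; the
    claim only says it is "identified").
    """
    tokens = {key: tokenize(text) for key, text in portions.items()}

    # Word frequency across all portions, counted before duplicate removal.
    frequency = Counter(w for words in tokens.values() for w in words)

    rows = {}
    for key, words in tokens.items():
        unique_words = set(words)  # duplicate words removed
        if not unique_words:
            continue  # skip empty portions to avoid dividing by zero
        word_count_ratio = referential_word_count / len(unique_words)
        # Square root of the sum of squared inverse word frequencies.
        commonness = math.sqrt(sum((1 / frequency[w]) ** 2 for w in unique_words))
        rows[key] = (word_count_ratio, commonness)

    if not rows:
        return {}

    highest_commonness = max(c for _, c in rows.values())

    # Raw score: sqrt(word_count_ratio^2 + commonness_score_ratio^2).
    raw = {
        key: math.hypot(ratio, highest_commonness / commonness)
        for key, (ratio, commonness) in rows.items()
    }

    # Assumed normalization: scale so the largest raw score becomes 1.0.
    top = max(raw.values())
    return {key: value / top for key, value in raw.items()}
```

The returned mapping is keyed by document identification number and portion index, so a user interface could display each overall score next to its document identifier. The rule for flagging a portion as anomalous is not spelled out in the claim and is left out of this sketch.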
