Document relationship analysis system

US 9,928,295 B2
Filed: 02/02/2015
Issued: 03/27/2018
Est. Priority Date: 01/31/2014
Status: Active Grant

2 Assignments

0 Petitions

Abstract

A document relationship analysis system. Aspects of the system include ingesting, discovering, recommending, analyzing, and exporting documents of interest. The system dynamically searches large or streaming datasets using a tiered, multi-step approach that includes discovery techniques and recommender components to filter and refine these larger datasets into smaller datasets of documents of interest. The system dynamically selects and renders an appropriate visualization for result datasets based on predetermined measures that facilitate analysis of the documents of interest.

27 Citations

20 Claims

1. A system for analyzing relationships between documents, the system comprising:
- a user interface;
  
  an ingest memory configured to store source documents retrieved from an external document source;
  
  a text index memory configured to store a text index;
  
  a cluster index memory configured to store document vectors associated with each source document;
  
  a text extraction pipeline automatically extracting text from source documents added to the ingest memory;
  
  a document vector calculator automatically computing document vectors for source documents by applying term weights to the extracted text associated with the source document, the document vector calculator generating a plurality of profile document vectors associated with profile documents selected for use in a query against a target dataset;
  
  an indexer automatically building an index of the extracted text and storing the text index in the text index memory;
  
  a dataset manager component generating a result dataset containing documents of interest from a target dataset containing selected source documents based on a query by evaluating similarities between each profile document vector and the document vector calculated for each source document in the target dataset; and
  
  a relationship analyzer component automatically selecting a visualization model for clustering the documents of interest based on the number of documents of interest in the result dataset and rendering the result set using the selected visualization model in the user interface.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The system of claim 1 wherein the document vector calculator applies term weights to the extracted text using Term Frequency-Inverse Corpus Frequency.
  - 3. The system of claim 1 further comprising a language detection component automatically detecting a language of the extracted text.
  - 4. The system of claim 1 wherein the text extraction pipeline further comprises:
    - a plurality of text extraction components to extract text from source documents;
      
      a document queue containing a list of the source documents added to the ingest memory; and
      
      a scheduler routing each source document from the document queue to one of the text extraction components based on an evaluation of features of the source documents against scheduling parameters.
  - 5. The system of claim 4 wherein the text extraction components are divided into a small file text extraction path and a large file extraction path.
  - 6. The system of claim 1 further comprising:
    - a user property memory configured to store profile documents; and
      
      wherein the document vector calculator automatically computes document vectors for profile documents added to the user property memory.
  - 7. The system of claim 6 wherein the dataset manager component further comprises a recommender component and discovery component.
  - 8. The system of claim 7 further comprising:
    - a combined profile document vector generated from document vectors calculated for a plurality of profile documents selected for use in a query against a target dataset; and
      
      a result dataset generated by evaluating similarities between the combined profile document vector and the document vector calculated for each source document in the target dataset.
  - 9. The system of claim 7 further comprising:
    - a plurality of profile document vectors associated with profile documents selected for use in a query against a target dataset; and
      
      a result dataset generated by evaluating similarities between each profile document vector and the document vector calculated for each source document in the target dataset.
  - 10. The system of claim 1 further comprising an export component for outputting result datasets in formats readable by external applications.
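
As an illustration of the document vector calculator and the per-profile similarity evaluation recited in claims 1, 2, and 9, the following is a minimal sketch, not the patented implementation. The claims do not give a formula, so the weighting `log(1 + tf) * log((N + 1) / (n + 1))` against a fixed reference corpus is an assumed formulation of Term Frequency-Inverse Corpus Frequency. Cosine similarity, the `threshold` value, and all function names are likewise assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def corpus_frequencies(reference_docs):
    """Count, for each term, how many reference documents contain it."""
    counts = Counter()
    for doc in reference_docs:
        counts.update(set(tokenize(doc)))
    return counts, len(reference_docs)


def document_vector(text, corpus_counts, corpus_size):
    """Weight terms by log(1 + tf) * log((N + 1) / (n + 1)), then L2-normalize.

    The formula is an assumed TF-ICF variant; N and n come from a fixed
    reference corpus, not from the dataset being searched.
    """
    tf = Counter(tokenize(text))
    vec = {
        term: math.log(1 + f)
        * math.log((corpus_size + 1) / (corpus_counts.get(term, 0) + 1))
        for term, f in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def result_dataset(profile_vectors, target_vectors, threshold=0.2):
    """Keep target documents whose best similarity to any profile vector
    meets the (hypothetical) threshold."""
    results = {}
    for doc_id, vec in target_vectors.items():
        best = max(cosine(p, vec) for p in profile_vectors)
        if best >= threshold:
            results[doc_id] = best
    return results


# Usage with a toy reference corpus and two target documents.
reference = ["the cat sat", "the dog ran", "the bird flew"]
counts, n = corpus_frequencies(reference)
targets = {
    "a": document_vector("the cat chased the cat", counts, n),
    "b": document_vector("the dog barked", counts, n),
}
profiles = [document_vector("a cat sat", counts, n)]
hits = result_dataset(profiles, targets)
```

Because the corpus statistics are fixed in advance, each document vector can be computed as soon as the document's text is extracted, without rescanning the target dataset. That property is one plausible reason to prefer this weighting for streaming data.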

11. A method of analyzing relationships between documents, the method comprising the acts of:
- extracting text from source documents received from an external document source;
  
  storing the extracted text;
  
  creating an index of the extracted text;
  
  computing a document vector for each source document using the extracted text automatically when the extracted text is stored;
  
  storing the document vectors for each source document;
  
  extracting text from profile documents received from an external document source;
  
  storing the extracted text from the profile documents;
  
  computing a document vector for each profile document using the extracted text automatically when the extracted text is stored;
  
  computing a combined profile document vector from the profile document vectors of selected profile documents associated with a query;
  
  receiving a selection of a plurality of source documents as a target dataset and parameters of a query via a user interface;
  
  generating a result dataset containing documents of interest from the target dataset based on the query by evaluating similarities between the combined profile document vector and the document vector calculated for each source document in the target dataset; and
  
  automatically selecting a visualization model for clustering the documents of interest based on the number of documents of interest in the result dataset and rendering the result set using the selected visualization model in a user interface.
- View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19)
- - 12. The method of claim 11 further comprising the acts of:
    - indexing the extracted text from each source document automatically when the extracted text is stored; and
      
      storing the indexed text.
  - 13. The method of claim 11 wherein the act of extracting text from source documents received from an external document source further comprises the act of automatically routing each source document to a text extraction component selected from a plurality of text extraction components based on an evaluation of features of the source document against at least one scheduling parameter.
  - 14. The method of claim 13 wherein the at least one scheduling parameter is selected from a file size, a file type, a blacklist associated with the text extraction component, and a whitelist associated with the text extraction component.
  - 15. The method of claim 13 wherein the act of automatically routing each source document to a text extraction component further comprises the acts of:
    - routing source documents having a file size less than a first size threshold to a small text file extraction component configured for extracting text from files with file sizes less than the first size threshold;
      
      routing source documents having a file size greater than a second size threshold to a large text file extraction component configured for extracting text from files with file sizes greater than the second size threshold;
      
      routing source documents having a file size between the first size threshold and the second size threshold to either the small text file extraction component or the large text file extraction component based on a weighted average of sizes of source documents having a file size between the first size threshold and the second size threshold.
  - 16. The method of claim 11 further comprising the acts of:
    - extracting text from profile documents received from an external document source;
      
      storing the extracted text from the profile documents;
      
      computing a document vector for each profile document using the extracted text automatically when the extracted text is stored; and
      
      generating a result dataset by evaluating similarities between each profile document vector and the document vector calculated for each source document in the target dataset.
  - 17. The method of claim 11 further comprising the acts of:
    - extracting text from profile documents received from an external document source;
      
      storing the extracted text from the profile documents;
      
      computing a document vector for each profile document using the extracted text automatically when the extracted text is stored;
      
      computing a combined profile document vector from the profile document vectors of selected profile documents associated with a query;
      
      generating a result dataset by evaluating similarities between the combined profile document vector and the document vector calculated for each source document in the target dataset.
  - 18. The method of claim 11 further comprising the act of detecting a language associated with the extracted text of a source document based on the probability of characters in the extracted text appearing in a particular language.
  - 19. The method of claim 11 wherein the act of detecting a language further comprises the acts of:
    - calculating the probability of the characters in the extracted text appearing in a particular language using multiple code pages;
      
      evaluating the probabilities associated with each code page against the source documents to select the language of the source document.
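
The size-based routing of claim 15 can be sketched as follows. This is only one reading of the claim language: the claim does not say how the weighted average is formed or how it decides the route, so the example assumes an exponentially weighted running average of mid-range file sizes and sends a mid-range file to the small path when its size is at or below that average. The class name `SizeRouter`, the parameter `alpha`, and the threshold values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SizeRouter:
    small_max: int                       # first size threshold in bytes (hypothetical)
    large_min: int                       # second size threshold in bytes (hypothetical)
    alpha: float = 0.5                   # weight of the newest mid-range size (assumption)
    mid_average: Optional[float] = None  # running weighted average of mid-range sizes

    def route(self, size: int) -> str:
        """Return "small" or "large" for a document of the given size."""
        if size < self.small_max:
            return "small"
        if size > self.large_min:
            return "large"
        # Mid-range file: fold its size into the weighted average first,
        # then compare the file against the updated average.
        if self.mid_average is None:
            self.mid_average = float(size)
        else:
            self.mid_average = (
                self.alpha * size + (1 - self.alpha) * self.mid_average
            )
        return "small" if size <= self.mid_average else "large"


# Usage: two clear-cut files followed by three mid-range files.
router = SizeRouter(small_max=1_000, large_min=10_000)
routes = [router.route(s) for s in [500, 20_000, 4_000, 8_000, 2_000]]
# routes == ["small", "large", "small", "large", "small"]
```

A running average lets the split point between the two paths drift with the mix of mid-range files actually seen, rather than fixing a third threshold in advance.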

20. A computer readable medium containing computer executable instructions which, when executed by a computer, perform a method for analyzing relationships between documents, the method comprising the acts of:
- extracting text from source documents received from an external document source;
  
  storing the extracted text;
  
  creating an index of the extracted text;
  
  indexing the extracted text from each source document automatically when the extracted text is stored;
  
  storing the indexed text;
  
  computing a document vector for each source document using the extracted text automatically when the extracted text is stored;
  
  storing the document vectors for each source document;
  
  extracting text from profile documents received from an external document source;
  
  storing the extracted text from the profile documents;
  
  computing a document vector for each profile document using the extracted text automatically when the extracted text is stored;
  
  computing a combined profile document vector from the profile document vectors of selected profile documents associated with a query;
  
  receiving a selection of a plurality of source documents as a target dataset and parameters of a query via a user interface;
  
  generating a result dataset containing documents of interest from the target dataset based on the query by evaluating similarities between the combined profile document vector and the document vector calculated for each source document in the target dataset; and
  
  automatically selecting a visualization model for clustering the documents of interest based on the number of documents of interest in the result dataset and rendering the result set using the selected visualization model in a user interface.
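
Two steps shared by claims 11 and 20, computing a combined profile document vector and selecting a visualization model from the size of the result dataset, might look like the sketch below. The claims specify neither the combination rule nor the visualization models, so averaging the profile vectors into a normalized centroid, the size bands, and the model names are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def combine_profiles(profile_vectors):
    """Average sparse profile vectors into one L2-normalized vector.

    Using the centroid as the "combined profile document vector" is an
    assumption; the claims do not state how the vectors are combined.
    """
    total = defaultdict(float)
    for vec in profile_vectors:
        for term, weight in vec.items():
            total[term] += weight
    count = len(profile_vectors)
    combined = {t: w / count for t, w in total.items()}
    norm = math.sqrt(sum(w * w for w in combined.values()))
    return {t: w / norm for t, w in combined.items()} if norm else combined


# Hypothetical size bands and model names; neither appears in the claims.
VISUALIZATION_BANDS = [
    (50, "node-link graph"),   # small result sets: show every document
    (2_000, "cluster tree"),   # medium result sets: show grouped documents
]
DEFAULT_MODEL = "density map"  # large result sets: show aggregate structure


def select_visualization(result_count):
    """Pick a visualization model from the number of documents of interest."""
    for upper_bound, model in VISUALIZATION_BANDS:
        if result_count <= upper_bound:
            return model
    return DEFAULT_MODEL


# Usage
combined = combine_profiles([{"cat": 1.0}, {"dog": 1.0}])
model_small = select_visualization(10)
model_large = select_visualization(5_000)
```

Comparing each target document against a single combined vector costs one similarity computation per document, whereas comparing against every profile vector (as in claims 9 and 16) costs one per profile but preserves distinctions between dissimilar profiles.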

Current Assignee
VortexT Analytics, Inc.
Original Assignee
VortexT Analytics, Inc.
Inventors
Lambert, Matthew Cody; Angerani, Peter Joseph; Ostermayr, Gregory David
Primary Examiner(s)
Ly, Anh

Application Number

US14/611,624
Publication Number

US 20150220539A1
Time in Patent Office

1,149 Days
CPC Class Codes

G06F 16/338 Presentation of query results

G06F 16/355 Class or cluster creation o...