Systems and methods for identifying key phrase clusters within documents

US 10,180,929 B1
Filed: 10/13/2016
Issued: 01/15/2019
Est. Priority Date: 06/30/2014
Status: Active Grant

Abstract

Systems and methods are disclosed for key phrase clustering of documents. In accordance with one implementation, a method is provided for key phrase clustering of documents. The method includes obtaining a first plurality of documents based at least on a user input, obtaining a statistical model based at least on the user input, and obtaining, from content of the first plurality of documents, a plurality of segments. The method also includes identifying a plurality of clusters of segments from the plurality of segments, determining statistical significance of the plurality of clusters based at least on the statistical model and the content, and providing for display a representative cluster from the plurality of tokens, the representative cluster being determined based at least on the statistical significance. The method further includes determining a label for the representative cluster based at least on the plurality of clusters and the statistical significance.
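
The scoring steps described in the abstract can be illustrated with a minimal sketch. It assumes word n-grams as segments, a plain dictionary `modeled_freq` standing in for the statistical model, and a log-ratio of observed to modeled frequency as the significance value. None of these choices is stated in the patent text, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def segment(text, max_n=2):
    """Split text into lowercase word n-grams (an assumed notion of 'segment')."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def significance(documents, modeled_freq, floor=1e-6):
    """Score each segment as log(observed relative frequency / modeled frequency).

    Segments missing from the model fall back to `floor`, an assumed
    smoothing value, so they score as highly unexpected.
    """
    counts = Counter(seg for doc in documents for seg in segment(doc))
    total = sum(counts.values())
    return {seg: math.log((n / total) / modeled_freq.get(seg, floor))
            for seg, n in counts.items()}


def representative_segments(scores, threshold):
    """Keep the segments whose significance value exceeds the threshold."""
    return {seg for seg, score in scores.items() if score > threshold}
```

With a model that expects "the" often and "pump" rarely, a corpus that mentions "pump" frequently gives "pump" a high score while "the" scores near zero, so only "pump" passes a positive threshold.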

20 Claims

1. An electronic device comprising:
- a computer display;
  
  computer-readable storage media; and
  
  one or more processors configured to execute instructions to cause the electronic device to:
  
  obtain, based on a first user input, documents and a statistical model;
  
  segment contents of the documents into segments;
  
  determine frequencies at which the segments occur within the contents of the documents and store the frequencies in the computer-readable storage media;
  
  with the statistical model, determine modeled frequencies for the segments;
  
  compare the frequencies with the modeled frequencies;
  
  based on the comparison, determine statistical significance values for the segments;
  
  identify representative segments from the segments having statistical significance values exceeding a predetermined threshold value;
  
  cluster the documents into clusters, each cluster having identical or substantially identical representative segments;
  
  determine a label for each cluster;
  
  display within a graphical user interface a representation of the documents;
  
  receive a second user input and identify a set of clusters, from the clusters, associated with the second user input; and
  
  based on the received second user input, modify the graphical user interface to further include a representation of the second user input, and for each of the clusters of the set of clusters:
  
  an indication of the label associated with the cluster, and an indication of the documents associated with the cluster, wherein the clusters of the set of clusters are grouped and displayed in separate portions of the graphical user interface.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
  - 2. The electronic device of claim 1, wherein the identifying the representative segments further comprises, for a document, determining a positive number of the segments having the highest statistical significance values among the segments of the document.
  - 3. The electronic device of claim 1, wherein the identical or substantially identical representative segments are based on synonyms.
  - 4. The electronic device of claim 1, wherein the identical or substantially identical representative segments are based on an edit distance.
  - 5. The electronic device of claim 4, wherein the edit distance is based on a Levenshtein distance.
  - 6. The electronic device of claim 1, wherein determining the label for each cluster further comprises determining the representative segment appearing most frequently in the cluster.
  - 7. The electronic device of claim 1, wherein determining the label for each cluster further comprises determining a textual phrase different from the representative segments for the cluster.
  - 8. The electronic device of claim 7, wherein the textual phrase is based in part on one or more of the representative segments for the cluster.
  - 9. The electronic device of claim 1, wherein the second user input comprises a date or date range.
  - 10. The electronic device of claim 1, wherein the indications of the documents comprises contents of the documents, links to the documents, or a combination thereof.
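
Claims 4 to 6 refer to an edit distance, a Levenshtein distance, and a label taken from the most frequent representative segment. The sketch below shows one way these pieces could fit together. The `max_distance` cutoff and the function names are assumptions for illustration; the claims do not specify how "substantially identical" is decided.

```python
from collections import Counter


def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def substantially_identical(a, b, max_distance=2):
    """Treat two segments as matching when their edit distance is small.

    The cutoff of 2 is an assumed value, not one given in the claims.
    """
    return levenshtein(a, b) <= max_distance


def cluster_label(segments_in_cluster):
    """Pick the representative segment that appears most often in the cluster."""
    return Counter(segments_in_cluster).most_common(1)[0][0]
```

Under these assumptions, "pump failure" and "pump failures" differ by one edit and would be treated as substantially identical, and a cluster containing "pump failure" twice and "pump failures" once would be labeled "pump failure".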

11. A method performed by one or more processors, the method comprising:
- obtaining, based on a first user input, documents and a statistical model;
  
  segmenting contents of the documents into segments;
  
  determining frequencies at which the segments occur within the contents of the documents;
  
  with the statistical model, determining modeled frequencies for the segments;
  
  comparing the determined frequencies with the modeled frequencies;
  
  based on the comparison, determining statistical significance values for the segments;
  
  identifying representative segments from the segments based on a comparison of the statistical significance values with a predetermined threshold value;
  
  clustering the documents into clusters, each cluster having related representative segments;
  
  determining a label for each cluster;
  
  displaying within a graphical user interface a representation of the documents;
  
  receiving a second user input and identifying a set of clusters, from the clusters, associated with the second user input; and
  
  based on the received second user input, modifying the graphical user interface to further include a representation of the second user input, and for each of the clusters of the set of clusters:
  
  an indication of the label associated with the cluster, and an indication of the documents associated with the cluster, wherein the clusters of the set of clusters are grouped and displayed in separate portions of the graphical user interface.
- Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
  - 12. The method of claim 11, wherein the related representative segments are identical or substantially identical representative segments.
  - 13. The method of claim 11, wherein the related representative segments are related by a synonym relationship.
  - 14. The method of claim 11, wherein the related representative segments are related by an edit distance.
  - 15. The method of claim 14, wherein the edit distance is based on a Levenshtein distance.
  - 16. The method of claim 11, wherein determining the label for each cluster further comprises determining the representative segment appearing most frequently in the cluster.
  - 17. The method of claim 11, wherein determining the label for each cluster further comprises determining a textual phrase different from the representative segments for the cluster.
  - 18. The method of claim 17, wherein the textual phrase is based in part on one or more of the representative segments for the cluster.
  - 19. The method of claim 11, wherein the second user input comprises a date or date range.
  - 20. The method of claim 11, wherein the indications of the documents comprises contents of the documents, links to the documents, or a combination thereof.
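
The clustering and second-input steps of claim 11, with the date range of claim 19, might look like the following sketch. It assumes that documents sharing an identical representative segment form a cluster labeled by that segment, and that a cluster is "associated with" a date range when at least one of its documents is dated within it. Both rules, the data shapes, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def cluster_documents(doc_segments):
    """Group document ids under each representative segment they contain.

    `doc_segments` maps a document id to its set of representative segments.
    """
    clusters = defaultdict(list)
    for doc_id, segments in doc_segments.items():
        for seg in sorted(segments):
            clusters[seg].append(doc_id)
    return dict(clusters)


def clusters_in_range(clusters, doc_dates, start, end):
    """Select clusters with documents dated in [start, end], keeping only those documents."""
    selected = {}
    for label, doc_ids in clusters.items():
        hits = [d for d in doc_ids if start <= doc_dates[d] <= end]
        if hits:
            selected[label] = hits
    return selected


# Usage: each key of the result is a cluster label and each value lists the
# documents to indicate in that cluster's portion of the interface.
doc_segments = {"d1": {"pump failure"},
                "d2": {"pump failure", "valve leak"},
                "d3": {"valve leak"}}
doc_dates = {"d1": date(2016, 1, 5), "d2": date(2016, 2, 10), "d3": date(2016, 3, 20)}
clusters = cluster_documents(doc_segments)
march = clusters_in_range(clusters, doc_dates, date(2016, 3, 1), date(2016, 3, 31))
```

Returning a mapping from label to document ids keeps each cluster separate, which matches the claim language about displaying clusters in separate portions of the interface.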

Current Assignee
Palantir Technologies Incorporated
Original Assignee
Palantir Technologies Incorporated
Inventors
Kesin, Max; Wadhar, Hem
Primary Examiner(s)
Chbouki, Tarek

Application Number

US15/293,140
Time in Patent Office

824 Days
CPC Class Codes

G06F 16/345   Summarisation for human users

G06F 16/353   into predefined classes

G06F 3/0481   based on specific propertie...

G06F 40/106   Display of layout of docume...

G06F 40/117   Tagging; Marking up details...

G06F 40/205   Parsing
