Clustering of text units using dimensionality reduction of multi-dimensional arrays

US 9,141,882 B1
Filed: 10/19/2012
Issued: 09/22/2015
Est. Priority Date: 10/19/2012
Status: Active Grant

Abstract

Methods, systems, and apparatuses, including computer programs encoded on computer-readable media, for tokenizing n-grams from a plurality of text units. A multi-dimensional array is created having a plurality of dimensions based upon the plurality of text units and the n-grams from the plurality of text units. The multi-dimensional array is normalized and the dimensionality of the multi-dimensional array is reduced. The reduced dimensionality multi-dimensional array is clustered to generate a plurality of clusters that each cluster includes one or more of the plurality of text units.
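
The pipeline in the abstract (tokenize n-grams, build an array, normalize it, reduce its dimensionality, cluster) can be illustrated with a small, standard-library-only sketch. The abstract does not name a normalization scheme, a reduction technique, or a clustering algorithm, so the L2 row normalization, Gaussian random projection, and k-means used here are assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
import random
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and emit all 1..n_max-grams."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]


def build_matrix(text_units, n_max=2):
    """Rows are text units, columns are n-grams, cells are counts."""
    counts = [Counter(ngrams(t, n_max)) for t in text_units]
    vocab = sorted(set().union(*counts))
    return [[c[g] for g in vocab] for c in counts], vocab


def l2_normalize(matrix):
    """Scale each row to unit Euclidean length (assumed normalization)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def random_projection(matrix, k, seed=0):
    """Reduce each row to k dimensions with a Gaussian random projection
    (an assumed stand-in for whatever reduction the patent uses)."""
    rng = random.Random(seed)
    d = len(matrix[0])
    proj = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)]
            for _ in range(d)]
    return [[sum(row[j] * proj[j][c] for j in range(d)) for c in range(k)]
            for row in matrix]


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (assumed clustering step); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centers[c])))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    units = ["the cat sat", "the cat ran",
             "stocks fell today", "stocks rose today"]
    matrix, vocab = build_matrix(units)
    reduced = random_projection(l2_normalize(matrix), k=3)
    print(kmeans(reduced, k=2))
```

Setting `n_max=1` gives unigram tokenization; the default of 2 emits unigrams together with bigrams.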

20 Claims

1. A method comprising operations executed on a processor, the operations comprising:
- tokenizing a plurality of text units from a plurality of documents;
  
  creating a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
  
  normalizing the first multi-dimensional array;
  
  reducing the dimensionality of the first multi-dimensional array;
  
  creating a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
  
  determining a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
  
  determining a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
  
  minimizing divergence between the first distribution and the second distribution by iterating a cost function;
  - 2. The method of claim 1, further comprising determining a label for each of the clusters based on the one or more of the plurality of text units of the cluster.
  - 3. The method of claim 2, further comprising producing a visual graph of the plurality of clusters that includes an indication of each of the plurality of text units, wherein each indication is colored based upon the cluster of the text unit.
  - 4. The method of claim 3, further comprising:
    - determining a conditional probability that a text unit of a first cluster is related to a second cluster; and
      
      determining a distance between the text units of the first cluster and the text units of the second cluster in the visual graph based upon the determined conditional probability.
  - 5. The method of claim 1, wherein the text units are tokenized into unigrams.
  - 6. The method of claim 1, wherein the text units are tokenized into bigrams.
  - 7. The method of claim 1, further comprising:
    - determining one or more words from the first multi-dimensional array that are in a predetermined number of text units; and
      
      removing entries from the first multi-dimensional array corresponding to the one or more words.
  - 8. The method of claim 1, further comprising:
    - removing stop words from the text units;
      
      determining one or more documents text units that do not contain any stop words; and
      
      removing one or more text units from the one or more documents that do not contain any stop words from the first multi-dimensional array.
  - 9. The method of claim 1, wherein the x-coordinate is updated iteratively based on
  - 10. The method of claim 1, wherein the y-coordinate is updated iteratively based on
  - 11. The method of claim 1, wherein p_ij is calculated as
  - 12. The method of claim 11, wherein q_ij is calculated as
  - 13. The method of claim 1, wherein the second multi-dimensional array consists of two dimensions:
    - the x-coordinate and the y-coordinate.
  - 14. The method of claim 1, wherein the second multi-dimensional array consists of three dimensions:
    - the x-coordinate, the y-coordinate, and a z-coordinate.
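
The last four steps of claim 1 (random x- and y-coordinates, a first distribution from the high-dimensional array, a second distribution from the coordinates, and iterating a cost function to minimize divergence) resemble the general shape of t-SNE. The sketch below is written under that assumption: it uses a fixed-bandwidth Gaussian kernel for p_ij, a Student-t kernel for q_ij, and Kullback-Leibler divergence as the cost. The formulas of claims 9 to 12 are not reproduced in this record, so none of these choices should be read as the claimed equations, and all function and parameter names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def first_distribution(rows, sigma=1.0):
    """p_ij from the high-dimensional rows (assumed Gaussian kernel)."""
    n = len(rows)
    w = [[0.0 if i == j
          else math.exp(-sq_dist(rows[i], rows[j]) / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in r] for r in w]


def second_distribution(coords):
    """q_ij from the 2-D coordinates (assumed Student-t kernel).
    Also returns the unnormalized weights, which the gradient reuses."""
    n = len(coords)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(coords[i], coords[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[v / total for v in r] for r in w], w


def kl_divergence(p, q, eps=1e-12):
    """Cost: KL(P || Q), summed over all pairs with p_ij > 0."""
    return sum(pij * math.log((pij + eps) / (qij + eps))
               for pr, qr in zip(p, q)
               for pij, qij in zip(pr, qr) if pij > 0)


def embed(rows, iters=200, lr=1.0, seed=0):
    """Plain gradient descent on the KL cost; returns final coordinates
    and the cost recorded before each update."""
    rng = random.Random(seed)
    n = len(rows)
    # Each text unit starts at a small random (x, y) position.
    coords = [[rng.gauss(0.0, 0.01), rng.gauss(0.0, 0.01)] for _ in range(n)]
    p = first_distribution(rows)
    history = []
    for _ in range(iters):
        q, w = second_distribution(coords)
        history.append(kl_divergence(p, q))
        new_coords = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                # Standard t-SNE gradient term for symmetric P and Q.
                coef = 4.0 * (p[i][j] - q[i][j]) * w[i][j]
                gx += coef * (coords[i][0] - coords[j][0])
                gy += coef * (coords[i][1] - coords[j][1])
            new_coords.append([coords[i][0] - lr * gx,
                               coords[i][1] - lr * gy])
        coords = new_coords
    return coords, history


if __name__ == "__main__":
    rows = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
    coords, history = embed(rows)
    print(history[0], history[-1])
```

A three-dimensional variant, as in claim 14, would only need a third random coordinate per text unit and a third gradient component. The fixed `sigma` and the plain gradient step keep the sketch short; practical t-SNE implementations usually tune a per-point bandwidth and add momentum.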

15. A non-transitory computer-readable medium, having instructions stored thereon that when executed by a computing device cause the computing device to perform operations comprising:
- tokenizing a plurality of text units from a plurality of documents;
  
  creating a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
  
  normalizing the first multi-dimensional array;
  
  reducing the dimensionality of the first multi-dimensional array;
  
  creating a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
  
  determining a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
  
  determining a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
  
  minimizing divergence between the first distribution and the second distribution by iterating a cost function;
  - 16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise determining a label for each of the clusters based on the one or more of the plurality of text units of the cluster.
  - 17. The non-transitory computer-readable medium of claim 16, wherein the operations further comprise:
    - producing a visual graph of the plurality of clusters that includes an indication of each of the plurality of text units, wherein each indication is colored based upon the cluster of the text unit;
      
      determining a conditional probability that a text unit of a first cluster is related to a second cluster; and
      
      determining a distance between the text units of the first cluster and the text units of the second cluster in the visual graph based upon the determined conditional probability.

18. A system comprising:
- one or more electronic processors configured to:
  
  tokenize a plurality of text units from a plurality of documents;
  
  create a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
  
  normalize the first multi-dimensional array;
  
  reduce the dimensionality of the first multi-dimensional array;
  
  create a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
  
  determine a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
  
  determine a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
  
  minimize divergence between the first distribution and the second distribution by iterating a cost function;
  - 19. The system of claim 18, wherein the one or more electronic processors are further configured to determine a label for each of the clusters based on the one or more of the plurality of text units of the cluster.
  - 20. The system of claim 19, wherein the one or more electronic processors are further configured to:
    - produce a visual graph of the plurality of clusters that includes an indication of each of the plurality of text units, wherein each indication is colored based upon the cluster of the text unit;
      
      determine a conditional probability that a text unit of a first cluster is related to a second cluster; and
      
      determine a distance between the text units of the first cluster and the text units of the second cluster in the visual graph based upon the determined conditional probability.

Current Assignee: Networked Insights LLC
Original Assignee: Networked Insights LLC
Inventors: Cao, Baoqiang; Fitz-Gibbon, T. Ryan; Forehand, Lucas; McHale, Ryan; Burke, Bradley
Primary Examiner(s): Jalil, Neveen Abel
Assistant Examiner(s): Tamaru, Michael K
Application Number: US13/656,315
Time in Patent Office: 1,068 Days

Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/287   Visualization; Browsing
G06F 16/35   Clustering; Classification
G06F 16/355   Class or cluster creation o...
G06F 16/358   Browsing; Visualisation the...
G06F 18/2137   based on criteria of topolo...
G06F 40/284   Lexical analysis, e.g. toke...
