Clustering of text units using dimensionality reduction of multi-dimensional arrays
First Claim
1. A method comprising operations executed on a processor, the operations comprising:
tokenizing a plurality of text units from a plurality of documents;
creating a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
normalizing the first multi-dimensional array;
reducing the dimensionality of the first multi-dimensional array;
creating a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
determining a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
determining a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
minimizing divergence between the first distribution and the second distribution by iterating a cost function;
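The last four steps of the claim (random x/y coordinates, two similarity distributions, and an iterated cost function) resemble a t-SNE-style embedding. The following is a minimal sketch under that assumption, not the patented implementation: the claim does not name the similarity kernels, the divergence measure, or the optimizer, so the Gaussian kernel with fixed bandwidth, the Student-t kernel, the Kullback-Leibler divergence, plain gradient descent, and every function name here are illustrative choices.

```python
import math
import random

def joint_distribution(points, kernel):
    """Normalize pairwise kernel similarities so all i != j entries sum to 1."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights[i][j] = kernel(d2)
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]

def gaussian(d2):
    # Assumed kernel for the reduced first array (fixed bandwidth of 1.0).
    return math.exp(-d2 / 2.0)

def student_t(d2):
    # Assumed kernel for the 2-D coordinates.
    return 1.0 / (1.0 + d2)

def kl_divergence(p, q, eps=1e-12):
    """Assumed cost function: KL divergence between the two distributions."""
    return sum(pij * math.log((pij + eps) / (qij + eps))
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0.0)

def embed(high_dim, iterations=200, learning_rate=1.0, seed=0):
    rng = random.Random(seed)
    n = len(high_dim)
    p = joint_distribution(high_dim, gaussian)        # first distribution
    # Second array: one random (x, y) pair per text unit.
    y = [[rng.gauss(0.0, 0.01), rng.gauss(0.0, 0.01)] for _ in range(n)]
    costs = []
    for _ in range(iterations):
        q = joint_distribution(y, student_t)          # second distribution
        costs.append(kl_divergence(p, q))
        grads = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx, dy = y[i][0] - y[j][0], y[i][1] - y[j][1]
                # Gradient of the KL cost with respect to point i.
                c = 4.0 * (p[i][j] - q[i][j]) / (1.0 + dx * dx + dy * dy)
                gx += c * dx
                gy += c * dy
            grads.append((gx, gy))
        for point, (gx, gy) in zip(y, grads):
            point[0] -= learning_rate * gx
            point[1] -= learning_rate * gy
    costs.append(kl_divergence(p, joint_distribution(y, student_t)))
    return y, costs

# Hypothetical reduced array: six text units forming two obvious groups.
points = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
          [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1]]
coords, costs = embed(points)
```

In this sketch `coords` holds the final x/y pair for each text unit and `costs` records the divergence at each iteration, so a falling `costs` sequence indicates the two distributions are converging.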
Abstract
Methods, systems, and apparatuses, including computer programs encoded on computer-readable media, for tokenizing n-grams from a plurality of text units. A multi-dimensional array is created having a plurality of dimensions based upon the plurality of text units and the n-grams from the plurality of text units. The multi-dimensional array is normalized and the dimensionality of the multi-dimensional array is reduced. The reduced-dimensionality multi-dimensional array is clustered to generate a plurality of clusters such that each cluster includes one or more of the plurality of text units.
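The first three steps of the abstract (n-gram tokenization, array creation, normalization) can be sketched as follows. This is an illustration under stated assumptions only: the abstract does not specify the n-gram type, the array layout, or the normalization, so word-level unigrams and bigrams, a text-unit-by-n-gram count array, L2 row normalization, and all function names are hypothetical choices. Dimensionality reduction and clustering are left out of the sketch.

```python
import math
import re

def ngrams(text, n_max=2):
    """Word-level n-grams for n = 1..n_max (an assumed tokenization)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def build_array(text_units, n_max=2):
    """Rows are text units, columns are distinct n-grams, cells are counts."""
    tokenized = [ngrams(unit, n_max) for unit in text_units]
    vocab = sorted({gram for grams in tokenized for gram in grams})
    column = {gram: k for k, gram in enumerate(vocab)}
    rows = [[0.0] * len(vocab) for _ in text_units]
    for row, grams in zip(rows, tokenized):
        for gram in grams:
            row[column[gram]] += 1.0
    return vocab, rows

def l2_normalize(rows):
    """Scale each row to unit length (an assumed normalization)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row))
        normalized.append([v / norm if norm else 0.0 for v in row])
    return normalized

def cosine(a, b):
    """Cosine similarity of two unit-length rows is their dot product."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical text units.
units = ["The cat sat", "The cat ran", "Stock prices fell"]
vocab, counts = build_array(units)
array = l2_normalize(counts)
```

With unit-length rows, `cosine(array[0], array[1])` is larger than `cosine(array[0], array[2])`, because the first two text units share n-grams and the third shares none with them.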
20 Claims
1. A method comprising operations executed on a processor, the operations comprising:
tokenizing a plurality of text units from a plurality of documents;
creating a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
normalizing the first multi-dimensional array;
reducing the dimensionality of the first multi-dimensional array;
creating a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
determining a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
determining a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
minimizing divergence between the first distribution and the second distribution by iterating a cost function;
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. A non-transitory computer-readable medium, having instructions stored thereon that when executed by a computing device cause the computing device to perform operations comprising:
tokenizing a plurality of text units from a plurality of documents;
creating a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
normalizing the first multi-dimensional array;
reducing the dimensionality of the first multi-dimensional array;
creating a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
determining a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
determining a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
minimizing divergence between the first distribution and the second distribution by iterating a cost function;
Dependent claims: 16, 17
18. A system comprising:
one or more electronic processors configured to:
tokenize a plurality of text units from a plurality of documents;
create a first multi-dimensional array, wherein the dimensions of the first multi-dimensional array are based upon the plurality of text units;
normalize the first multi-dimensional array;
reduce the dimensionality of the first multi-dimensional array;
create a second multi-dimensional array for each of the text units, wherein each text unit is initially assigned a random x-coordinate and a random y-coordinate;
determine a first distribution based on similarity of each document with each other document in the plurality of text units using the first multi-dimensional array;
determine a second distribution based on similarity of each document with each other document in the plurality of text units using the second multi-dimensional array based upon the x-coordinates and y-coordinates;
minimize divergence between the first distribution and the second distribution by iterating a cost function;
Dependent claims: 19, 20
Specification