SYSTEMS AND METHODS FOR IDENTIFYING SETS OF SIMILAR PRODUCTS

US 20120265736A1
Filed: 04/16/2012
Published: 10/18/2012
Est. Priority Date: 04/14/2011
Status: Active Grant

9 Assignments

0 Petitions

Abstract

Embodiments of the present invention relate to systems and methods for determining sets of products which are similar to each other in terms of consumers' wants and needs. Queries are performed on a particular product. Documents relating to the query are received and stored. A dictionary is created from the received documents, whereby the documents, which are text files, are scrubbed of certain data to create a scrubbed text file. Topic modeling is then performed on the cleansed text file. Various methods can be used to perform topic modeling, including, but not limited to, latent semantic analysis, nonnegative matrix factorization, and singular value decomposition.
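The scrubbing and dictionary steps summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `cleanse` and `build_dictionary`, the regular expressions, and the sample product text are all assumptions made for illustration, and metadata scrubbing is not shown.

```python
import re
from collections import Counter

def cleanse(raw: str) -> str:
    """Scrub HTML tags, punctuation, and numbers, then lowercase the rest."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop punctuation and digits
    return re.sub(r"\s+", " ", text).strip().lower()

def build_dictionary(cleansed_docs):
    """Map each word to the number of documents in which it appears."""
    doc_freq = Counter()
    for doc in cleansed_docs:
        doc_freq.update(set(doc.split()))      # count each word once per document
    return dict(doc_freq)

# Hypothetical product pages, one string per retrieved document.
docs = ["<p>Acme Cola, 12 oz can!</p>", "<b>Acme Diet Cola</b> 2-liter bottle"]
cleansed = [cleanse(d) for d in docs]
dictionary = build_dictionary(cleansed)
```

Here `cleansed[0]` is `"acme cola oz can"`, and `dictionary` records that `cola` appears in both documents while `diet` appears in one.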

21 Citations

22 Claims

1. A computer-implemented method comprising:
- receiving an identification of a product for clustering into one or more sets of similar products;
  
  transmitting a query to one or more information sources having information related to the product;
  
  receiving from the one or more internet websites information relevant to the product;
  
  merging and storing at least a portion of the received information into one or more databases;
  
  transforming at least a portion of the received information into a text file;
  
  cleansing at least a portion of the text file to create a cleansed text file;
  
  creating a dictionary from at least a portion of the cleansed text file, wherein the dictionary comprises words found in the cleansed text file; and
  
  performing topic modeling on the cleansed text file to determine one or more clusters of one or more substitutes of the product.
  - 2. The method of claim 1, wherein the information sources are selected from the group consisting of an Internet website or a database.
  - 3. The method of claim 1, wherein transforming at least a portion of the received information into a text file comprises transforming all of the received information into a text file.
  - 4. The method of claim 1, wherein cleansing at least a portion of the text file to create a cleansed text file comprises cleansing the entire text file to create a cleansed text file.
  - 5. The method of claim 1, wherein creating a dictionary from at least a portion of the cleansed text file, wherein the dictionary comprises words found in the cleansed text file comprises creating a dictionary from the entire cleansed text file, wherein the dictionary comprises all words found in the cleansed text file.
  - 6. The method of claim 1, wherein cleansing at least a portion of the text file to create a cleansed text file comprises scrubbing the text file of data selected from the group consisting of HTML tags, metadata, punctuation, and numbers.
  - 7. The method of claim 6 further comprising converting the non-scrubbed data comprising alphanumeric characters to lower case characters.
  - 8. The method of claim 1, wherein performing topic modeling on the cleansed text file to determine one or more substitutes of the product is selected from the group consisting of latent semantic analysis, nonnegative matrix factorization, and singular value decomposition.
  - 9. The method of claim 1, wherein performing topic modeling on the cleansed text file to determine one or more substitutes of the product comprises:
    - computing a count of documents in which a word in the cleansed text file appears;
      
      deleting one or more words that appear at a predetermined frequency from the dictionary.
  - 10. The method of claim 9 further comprising:
    - stemming at least a portion of the non-deleted words;
      
      adding at least a portion of the stemmed words to the dictionary; and
      
      computing a count of documents in which one or more stemmed word appears.
  - 11. The method of claim 9 further comprising computing a weight calculated by
  - 12. The method of claim 10 further comprising:
    - converting the dictionary to a sparse m × p matrix, where m is a number of documents and p is the number of words, wherein each entry in the matrix is the count of each corresponding word multiplied by the weight; and
      
      performing Singular Value Decomposition on the matrix, wherein each document is represented by its corresponding row of the left singular vectors (eigenvectors), wherein the rows of the right singular vector (eigenvectors) represent the one or more clusters of one or more substitutes of the product;
      
      wherein a cluster of the one or more clusters to which each product belongs is found by projecting the sparse p-dimensional vector corresponding to its document onto each of the right eigenvectors representing the one or more clusters, wherein the projection which produces the maximum value identifies the product's cluster.
  - 13. The method of claim 1, wherein performing topic modeling on the cleansed text file is performed multiple times to increase the reliability of the one or more clusters.
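The document-frequency pruning, weighting, and Singular Value Decomposition steps recited in claims 9 through 12 can be sketched roughly as follows. This is an illustration under stated assumptions, not the claimed method itself: the weight formula of claim 11 is not reproduced in this text, so the sketch substitutes an assumed log-scaled inverse document frequency; the pruning threshold `max_doc_fraction`, the function name `cluster_by_svd`, and the sample documents are hypothetical; stemming is skipped; and a dense NumPy array stands in for the sparse matrix.

```python
import math
import numpy as np

def cluster_by_svd(doc_tokens, n_clusters=2, max_doc_fraction=0.9):
    """Assign each tokenized document to a cluster via SVD of a weighted count matrix."""
    m = len(doc_tokens)
    # Count the number of documents in which each word appears.
    doc_freq = {}
    for tokens in doc_tokens:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Delete words that appear in too large a share of documents (assumed rule).
    vocab = sorted(w for w, df in doc_freq.items() if df / m <= max_doc_fraction)
    index = {w: j for j, w in enumerate(vocab)}
    # Build the m x p matrix of word counts multiplied by a weight.
    # ASSUMPTION: log(1 + m / df) stands in for the weight, which the text omits.
    X = np.zeros((m, len(vocab)))
    for i, tokens in enumerate(doc_tokens):
        for word in tokens:
            if word in index:
                X[i, index[word]] += math.log(1 + m / doc_freq[word])
    # Rows of vt are the right singular vectors, ordered by singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    # Project each document row onto the leading right singular vectors.
    projections = X @ vt[:n_clusters].T
    # The sign of a singular vector is arbitrary, so compare magnitudes.
    return np.abs(projections).argmax(axis=1)

# Hypothetical tokenized documents: three beverages, then two snacks.
docs = [
    ["cola", "soda", "can"],
    ["cola", "soda", "bottle"],
    ["diet", "cola", "soda"],
    ["chips", "salty", "bag"],
    ["chips", "salty", "crunchy"],
]
labels = cluster_by_svd(docs)
```

With these sample documents the two product groups share no words, so the two leading right singular vectors each cover one group and the three beverage documents receive one label while the two snack documents receive the other. Real product text overlaps far more, and the choice of `n_clusters` would need tuning.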

14. A non-transitory computer-readable medium having software instructions stored thereon, which, when executed by a client device, causes the client device to perform the operations comprising:
- receiving an identification of a product for clustering into one or more sets of similar products;
  
  transmitting a query to one or more internet websites having information related to the product;
  
  receiving from the one or more internet websites information relevant to the product;
  
  merging and storing the received information into one or more databases;
  
  transforming all received information into a text file;
  
  cleansing the text file to create a cleansed text file;
  
  creating a dictionary from the cleansed text file, wherein the dictionary comprises all words found in the cleansed text file; and
  
  performing topic modeling on the cleansed text file to determine a cluster of one or more substitutes of the product.
  - 15. The non-transitory computer-readable medium of claim 14, wherein cleansing the text file to create a cleansed text file comprises scrubbing the text file of data selected from the group consisting of HTML tags, metadata, punctuation, and numbers.
  - 16. The non-transitory computer-readable medium of claim 15, further comprising converting the non-scrubbed data comprising alphanumeric characters to lower case characters.
  - 17. The non-transitory computer-readable medium of claim 14, wherein the method for performing topic modeling on the cleansed file is selected from the group consisting of latent semantic analysis, nonnegative matrix factorization, and singular value decomposition.
  - 18. The non-transitory computer-readable medium of claim 14, wherein performing topic modeling on the cleansed text file comprises:
    - computing a count of documents in which a word in the cleansed text file appears;
      
      deleting one or more words that appear at a predetermined frequency from the dictionary.
  - 19. The non-transitory computer-readable medium of claim 18 further comprising:
    - performing stemming on the non-deleted words;
      
      adding the stemmed words to the dictionary; and
      
      computing a count of documents in which each stemmed word appears.
  - 20. The non-transitory computer-readable medium of claim 18 further comprising computing a weight calculated by
  - 21. The non-transitory computer-readable medium of claim 19 further comprising:
    - converting the dictionary to a sparse m × p matrix, where m is a number of documents and p is the number of words, wherein each entry in the matrix is the count of each corresponding word multiplied by the weight; and
      
      performing Singular Value Decomposition on the matrix, wherein each document is represented by its corresponding row of the left singular vectors (eigenvectors), wherein the rows of the right singular vector (eigenvectors) represent the one or more clusters of one or more substitutes of the product;
      
      wherein a cluster of the one or more clusters to which each product belongs is found by projecting the sparse p-dimensional vector corresponding to its document onto each of the right eigenvectors representing the one or more clusters, wherein the projection which produces the maximum value identifies that product's cluster.
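The stemming step recited in claims 10 and 19 (stem the remaining words, add the stems to the dictionary, and recount document frequencies per stem) might look something like the sketch below. The claims do not name a stemming algorithm, so the suffix list and the `naive_stem` and `stemmed_doc_freq` helpers are hypothetical stand-ins; a production system would more likely use an established stemmer.

```python
from collections import Counter

# ASSUMPTION: a deliberately tiny suffix list, for illustration only.
SUFFIXES = ("ing", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stemmed_doc_freq(doc_tokens):
    """Count the number of documents in which each stemmed word appears."""
    freq = Counter()
    for tokens in doc_tokens:
        freq.update({naive_stem(t) for t in tokens})  # one count per document
    return freq

# Hypothetical tokenized documents.
docs = [["drinks", "cans"], ["drinking", "can"], ["drink", "chips"]]
freq = stemmed_doc_freq(docs)
```

In this example `drinks`, `drinking`, and `drink` collapse to the single stem `drink`, which is then counted in all three documents.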

22. A system for topic modeling comprising:
- a client computer comprising a processor configured to execute computer-executable instructions, the instructions comprising instructions for:
  
  receiving an identification of a product for clustering into one or more sets of similar products;
  
  transmitting a query to one or more internet websites having information related to the product;
  
  receiving from the one or more internet websites information relevant to the product;
  
  merging and storing at least a portion of the received information into one or more databases;
  
  transforming at least a portion of the received information into a text file;
  
  cleansing at least a portion of the text file to create a cleansed text file;
  
  creating a dictionary from at least a portion of the cleansed text file, wherein the dictionary comprises words found in the cleansed text file; and
  
  performing topic modeling on the cleansed text file to determine one or more substitutes of the product; and
  
  a database for storing:
  
  the text file;
  
  the cleansed text file; and
  
  the dictionary.

Current Assignee: Infor Incorporated
Original Assignee: Predictix, LLC (Infor Incorporated)
Inventors: Vasiloglou, Nikolaos; Williams, Loren; Pasalic, Emir

Granted Patent: US 8,682,883 B2
US Class Current: 707/692
CPC Class Codes:
G06F 16/36   Creation of semantic tools,...
G06F 16/9535   Search customisation based ...
G06F 40/242   Dictionaries
