Semiotic indexing of digital resources

US 8,903,825 B2
Filed: 05/23/2012
Issued: 12/02/2014
Est. Priority Date: 05/24/2011
Status: Active Grant

First Claim

1. A method of classifying a plurality of documents, comprising:

providing a first set of classification terms and a second set of classification terms, the second set of classification terms being different from the first set of classification terms;

generating a first frequency array of a number of occurrences of each term from the first set of classification terms in each document;

generating a second frequency array of a number of occurrences of each term from the second set of classification terms in each document;

generating a first similarity matrix from the first frequency array;

generating a second similarity matrix from the second frequency array;

determining an entrywise combination of the first similarity matrix and the second similarity matrix; and

clustering the plurality of documents based on the result of the entrywise combination.
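
The steps of the claim can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the patentees' implementation: the claim does not fix a similarity measure, a combination operator, or a clustering algorithm, so cosine similarity, entrywise multiplication, and a simple threshold grouping are assumed here. All function names, the sample documents, and the term lists are hypothetical, and terms are assumed to be single lowercase words.

```python
from collections import Counter
import re

def frequency_array(documents, terms):
    """One row per document, one column per term; each cell is an occurrence count."""
    rows = []
    for text in documents:
        counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        rows.append([counts[t] for t in terms])
    return rows

def cosine_similarity_matrix(freq):
    """Document-by-document similarity (cosine is an assumed choice)."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = sum(a * a for a in u) ** 0.5
        nv = sum(b * b for b in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0
    return [[cos(u, v) for v in freq] for u in freq]

def entrywise_product(m1, m2):
    """Entrywise combination of two equally sized matrices (multiplication assumed)."""
    return [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def threshold_clusters(sim, threshold):
    """Label documents so that any pair linked by similarity >= threshold shares a label."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Hypothetical sample data: four short documents and two different term sets.
docs = [
    "escherichia coli produces lactase enzyme",
    "lactase from escherichia coli strains",
    "bacillus subtilis secretes protease",
    "protease activity in bacillus cultures",
]
organism_terms = ["escherichia", "bacillus"]   # first set of classification terms
enzyme_terms = ["lactase", "protease"]         # second set of classification terms

freq1 = frequency_array(docs, organism_terms)
freq2 = frequency_array(docs, enzyme_terms)
combined = entrywise_product(cosine_similarity_matrix(freq1),
                             cosine_similarity_matrix(freq2))
labels = threshold_clusters(combined, threshold=0.5)
print(labels)
```

With these sample documents the first two and the last two documents end up in separate groups, because each pair shares terms from both sets. Multiplication is a deliberately strict choice: a pair of documents scores highly only when both term sets agree.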

Abstract

A method of classifying a plurality of documents. The method includes steps of providing a first set of classification terms and a second set of classification terms, the second set of classification terms being different from the first set of classification terms; generating a first frequency array of a number of occurrences of each term from the first set of classification terms in each document; generating a second frequency array of a number of occurrences of each term from the second set of classification terms in each document; generating a first similarity matrix from the first frequency array; generating a second similarity matrix from the second frequency array; determining an entrywise combination of the first similarity matrix and the second similarity matrix; and clustering the plurality of documents based on the result of the entrywise combination.

48 Claims

1. A method of classifying a plurality of documents, comprising:
- providing a first set of classification terms and a second set of classification terms, the second set of classification terms being different from the first set of classification terms;
  
  generating a first frequency array of a number of occurrences of each term from the first set of classification terms in each document;
  
  generating a second frequency array of a number of occurrences of each term from the second set of classification terms in each document;
  
  generating a first similarity matrix from the first frequency array;
  
  generating a second similarity matrix from the second frequency array;
  
  determining an entrywise combination of the first similarity matrix and the second similarity matrix; and
  
  clustering the plurality of documents based on the result of the entrywise combination.
  - 2. The method of claim 1, wherein the first set of classification terms comprises one of an externally managed set of terms and a patent classification code.
  - 3. The method of claim 1, wherein the first set of classification terms comprises an externally managed set of terms.
  - 4. The method of claim 3, wherein the externally managed set of terms has been disambiguated.
  - 5. The method of claim 1, wherein the first set of classification terms comprises an externally managed set of classification terms for Bacteria and Archaea.
  - 6. The method of claim 1, further comprising reordering metadata associated with the plurality of documents according to the clustering.
  - 7. The method of claim 1, wherein clustering the documents comprises a hierarchical clustering method selected from the group consisting of:
    - single linkage clustering, complete linkage clustering, group-average clustering, and centroid clustering.
  - 8. The method of claim 1, wherein clustering the documents comprises a non-hierarchical method selected from the group consisting of:
    - monothetic divisive clustering, minimization of trace clustering, multivariate mixture model clustering, Jardine and Sibson's K-dend clustering, distribution-based model clustering, density based model clustering, partitioning based clustering, and Bayesian based clustering.
  - 9. The method of claim 1, further comprising projecting the data as at least one of a heatmap and a hexagonal bin plot.
  - 10. The method of claim 1, wherein the plurality of documents comprises patent documents.
  - 11. The method of claim 1, wherein the plurality of documents comprises one of scientific, technical, medical, or legal literature.
  - 12. The method of claim 1, wherein generating a first similarity matrix comprises generating a first similarity matrix using the Jaccard coefficient.
  - 13. The method of claim 1, wherein the entrywise combination comprises at least one of multiplication, addition, subtraction, and division of the first similarity matrix and the second similarity matrix.
  - 14. The method of claim 1, further comprising:
    - providing a third set of classification terms, different from the first and second sets of classification terms;
    - generating a third frequency array of a number of occurrences of each term from the third set of classification terms in each document; and
      
      generating a third similarity matrix from the third frequency array;
      
      wherein determining an entrywise combination of the first similarity matrix and the second similarity matrix further comprises determining an entrywise combination of the first similarity matrix, the second similarity matrix, and the third similarity matrix.
  - 15. The method of claim 14, wherein the first, second, and third frequency arrays comprise an intersection of documents from the plurality of documents which have at least one term from each of the first, second, and third sets of classification terms.
  - 16. The method of claim 1, wherein the first frequency array includes only documents having at least one term from the first set of classification terms.
  - 17. The method of claim 16, wherein the second frequency array includes only documents having at least one term from the first set of classification terms.
  - 18. The method of claim 1, wherein the second frequency array includes only documents having at least one term from the second set of classification terms.
  - 19. The method of claim 1, wherein the first frequency array includes only documents having at least two different terms from the first set of classification terms.
  - 20. The method of claim 19, wherein the second frequency array includes only documents having at least one term from the first set of classification terms.
  - 21. The method of claim 1, wherein the plurality of documents comprises a digital resource.
  - 22. The method of claim 1, wherein the first set of classification terms comprises an externally managed set of classification terms for organisms, chemicals, enzymes, genes, proteins, minerals, materials, trademarks, or trade names.
  - 23. The method of claim 1, wherein the first set of classification terms comprises an externally managed set of classification terms comprising a computable terminology.
  - 24. The method of claim 1, wherein at least one step is carried out using a microprocessor.
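
Several of the dependent claims name concrete options: the Jaccard coefficient (claim 12), the four entrywise arithmetic operations (claim 13), and single linkage clustering (claim 7). The sketch below shows one way these pieces could fit together. It rests on assumptions the claims do not state: each document is reduced to the set of terms it contains at least once before the Jaccard coefficient is taken, and agglomeration stops at a requested number of clusters. The names `jaccard_matrix`, `entrywise`, and `single_linkage`, and the sample arrays, are hypothetical.

```python
import operator

def jaccard_matrix(freq):
    """Jaccard coefficient between documents, each treated as a set of present terms."""
    sets = [{k for k, c in enumerate(row) if c > 0} for row in freq]
    def jac(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0
    return [[jac(a, b) for b in sets] for a in sets]

def entrywise(m1, m2, op):
    """Apply a binary operator cell by cell to two equally sized matrices."""
    return [[op(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def single_linkage(sim, n_clusters):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    most similar pair of members has the highest similarity."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = max(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical frequency arrays for four documents over two term sets.
freq_a = [[2, 1, 0], [1, 1, 0], [0, 0, 3], [0, 1, 1]]
freq_b = [[1, 0], [1, 0], [0, 1], [0, 1]]

jac_a = jaccard_matrix(freq_a)
jac_b = jaccard_matrix(freq_b)

summed = entrywise(jac_a, jac_b, operator.add)    # entrywise addition
product = entrywise(jac_a, jac_b, operator.mul)   # entrywise multiplication

print(single_linkage(summed, n_clusters=2))
```

Passing `operator.sub` or `operator.truediv` gives the other two operations listed in claim 13. Division needs care in practice, since a zero entry in the second matrix raises `ZeroDivisionError` in this sketch.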

25. A computer-based system for classifying a plurality of documents, the system comprising:
- a processor; and
  
  a storage medium operably coupled to the processor, wherein the storage medium includes, program instructions executable by the processor for
  
  providing a first set of classification terms and a second set of classification terms, the second set of classification terms being different from the first set of classification terms;
  
  generating a first frequency array of a number of occurrences of each term from the first set of classification terms in each document;
  
  generating a second frequency array of a number of occurrences of each term from the second set of classification terms in each document;
  
  generating a first similarity matrix from the first frequency array;
  
  generating a second similarity matrix from the second frequency array;
  
  determining an entrywise combination of the first similarity matrix and the second similarity matrix; and
  
  clustering the plurality of documents based on the result of the entrywise combination.
  - 26. The computer-based system of claim 25, wherein the first set of classification terms comprises one of an externally managed set of terms and a patent classification code.
  - 27. The computer-based system of claim 25, wherein the first set of classification terms comprises an externally managed set of terms.
  - 28. The computer-based system of claim 27, wherein the externally managed set of terms has been disambiguated.
  - 29. The computer-based system of claim 25, wherein the first set of classification terms comprises an externally managed set of classification terms for Bacteria and Archaea.
  - 30. The computer-based system of claim 25, further comprising reordering metadata associated with the plurality of documents according to the clustering.
  - 31. The computer-based system of claim 25, wherein clustering the documents comprises a hierarchical clustering method selected from the group consisting of:
    - single linkage clustering, complete linkage clustering, group-average clustering, and centroid clustering.
  - 32. The computer-based system of claim 25, wherein clustering the documents comprises a nonhierarchical method selected from the group consisting of:
    - monothetic divisive clustering, minimization of trace clustering, multivariate mixture model clustering, Jardine and Sibson's K-dend clustering, distribution-based model clustering, density based model clustering, partitioning based clustering, and Bayesian based clustering.
  - 33. The computer-based system of claim 25, further comprising projecting the data as at least one of a heatmap and a hexagonal bin plot.
  - 34. The computer-based system of claim 25, wherein the plurality of documents comprises patent documents.
  - 35. The computer-based system of claim 25, wherein the plurality of documents comprises one of scientific, technical, medical, or legal literature.
  - 36. The computer-based system of claim 25, wherein generating a first similarity matrix comprises generating a first similarity matrix using the Jaccard coefficient.
  - 37. The computer-based system of claim 25, wherein the entrywise combination comprises at least one of multiplication, addition, subtraction, and division of the first similarity matrix and the second similarity matrix.
  - 38. The computer-based system of claim 25, wherein the program instructions executable by the processor further comprise instructions for:
    - providing a third set of classification terms, different from the first and second sets of classification terms;
    - generating a third frequency array of a number of occurrences of each term from the third set of classification terms in each document; and
      
      generating a third similarity matrix from the third frequency array;
      
      wherein determining an entrywise combination of the first similarity matrix and the second similarity matrix further comprises determining an entrywise combination of the first similarity matrix, the second similarity matrix, and the third similarity matrix.
  - 39. The computer-based system of claim 38, wherein the first, second, and third frequency arrays comprise an intersection of documents from the plurality of documents which have at least one term from each of the first, second, and third sets of classification terms.
  - 40. The computer-based system of claim 25, wherein the first frequency array includes only documents having at least one term from the first set of classification terms.
  - 41. The computer-based system of claim 40, wherein the second frequency array includes only documents having at least one term from the first set of classification terms.
  - 42. The computer-based system of claim 25, wherein the second frequency array includes only documents having at least one term from the second set of classification terms.
  - 43. The computer-based system of claim 25, wherein the first frequency array includes only documents having at least two different terms from the first set of classification terms.
  - 44. The computer-based system of claim 43, wherein the second frequency array includes only documents having at least one term from the first set of classification terms.
  - 45. The computer-based system of claim 25, wherein the plurality of documents comprises a digital resource.
  - 46. The computer-based system of claim 25, wherein the first set of classification terms comprises an externally managed set of classification terms for organisms, chemicals, enzymes, genes, proteins, minerals, materials, trademarks, or trade names.
  - 47. The computer-based system of claim 25, wherein the first set of classification terms comprises an externally managed set of classification terms comprising a computable terminology.
  - 48. The computer-based system of claim 25, wherein at least one step is carried out using a microprocessor.
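
Claims 38 to 44 describe restricting the frequency arrays to documents that contain terms from the relevant sets, including the intersection across three sets. A minimal sketch of that filtering step follows, assuming each frequency array has one row per document in the same document order. The helper names and the sample arrays are hypothetical and do not come from the patent.

```python
def rows_with_min_terms(freq, min_distinct=1):
    """Indices of documents containing at least `min_distinct` different terms."""
    return {
        i for i, row in enumerate(freq)
        if sum(1 for count in row if count > 0) >= min_distinct
    }

def restrict_to_intersection(freq_arrays):
    """Keep only documents that have at least one term in every frequency array."""
    keep = set.intersection(*(rows_with_min_terms(f) for f in freq_arrays))
    kept = sorted(keep)
    return kept, [[f[i] for i in kept] for f in freq_arrays]

# Hypothetical frequency arrays for four documents over three term sets.
freq1 = [[1, 0], [0, 2], [0, 0], [1, 1]]
freq2 = [[3], [0], [1], [2]]
freq3 = [[0, 1], [1, 0], [1, 1], [2, 0]]

kept, (f1, f2, f3) = restrict_to_intersection([freq1, freq2, freq3])
print(kept)   # indices of documents present in all three arrays
print(f1, f2, f3)

# Variant in the spirit of claim 43: at least two different terms from the first set.
print(rows_with_min_terms(freq1, min_distinct=2))
```

Filtering before the similarity matrices are built keeps the three matrices the same size, which the entrywise combination requires.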

Current Assignee: Regents of the University of California (University of California)
Original Assignee: Namesforlife Llc
Inventors: Parker, Charles T.; Garrity, George M.
Primary Examiner(s): Nguyen, Phong
Application Number: US13/478,973
Publication Number: US 20130013603A1
Time in Patent Office: 923 Days
Field of Search: 707/737, 707/738, 707/801, 707/740, 707/E17.089
US Class Current: 707/737
CPC Class Codes:
G06F 16/353 into predefined classes
G06F 16/355 Class or cluster creation o...
