Selection of initial document collection for visual interactive search

US 10,606,883 B2
Filed: 10/17/2016
Issued: 03/31/2020
Est. Priority Date: 05/15/2014
Status: Active Grant

Abstract

Roughly described, a system for user identification of a desired document. A database identifies a catalog of documents in an embedding space, in which the distance between documents corresponds to a measure of their dissimilarity. The system presents an initial collection of the documents toward the user from an initial candidate space which is part of the embedding space, then in response to iterative user input, refines the candidate space and subsequent collections of documents presented toward the user. The initial collection is determined using a weighted cost-based iterative addition to the initial collection of documents from the initial candidate space, trading off between two sub-objectives of representativeness and diversity.
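The iterative addition described above can be sketched as a greedy loop over a precomputed distance matrix. This is a minimal illustration, not the patented implementation: the function names, the `removal_radius` parameter, the inclusive `<=` removal test, and the particular `demo_cost` (mean distance to the remaining candidates minus a weighted mean distance to the documents already chosen) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence

Distance = Sequence[Sequence[float]]  # precomputed pairwise distances


def select_initial_collection(
    dist: Distance,
    k: int,
    cost: Callable[[int, List[int], List[int]], float],
    removal_radius: float,
) -> List[int]:
    """Greedily pick up to k document indices, lowest cost first."""
    candidates = list(range(len(dist)))  # the n_1 candidate documents
    chosen: List[int] = []
    for _ in range(k):
        if not candidates:
            break  # candidate space exhausted before k picks
        # Score every remaining candidate and take the minimum-cost one.
        best = min(candidates, key=lambda c: cost(c, candidates, chosen))
        chosen.append(best)
        # Remove the pick and the r documents within the removal radius,
        # so that n_(i+1) = n_i - (r + 1).
        candidates = [
            c for c in candidates
            if c != best and dist[best][c] > removal_radius
        ]
    return chosen


# Hypothetical demo data: five documents embedded on a line.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]


def demo_cost(c: int, candidates: List[int], chosen: List[int],
              weight: float = 1.0) -> float:
    """Assumed cost: representativeness term minus weighted diversity term."""
    rep = sum(dist[c][o] for o in candidates) / len(candidates)
    div = sum(dist[c][s] for s in chosen) / len(chosen) if chosen else 0.0
    return rep - weight * div


print(select_initial_collection(dist, 2, demo_cost, 2.0))  # -> [2, 4]
```

In this toy run the first pick is the most central document (index 2), its two neighbours within the radius are dropped, and the second pick is the remaining document farthest from the first.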

54 Citations

27 Claims

1. A method of identifying an initial collection of k documents I₁, I₂, . . . , I_k from n₁ candidate documents X₁, X₂, . . . , X_n1 in an embedding space, the initial collection of k documents I₁, I₂, . . . , I_k to be used for user identification of a desired document, the method comprising:
- providing, accessibly to a computer system, a database identifying (i) the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space and (ii) a distance between each pair of documents of the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space, the distance between each pair of candidate documents corresponding to a predetermined measure of dissimilarity between the pair of candidate documents, wherein n₁ > k > 1;
- identifying the k initial documents I₁, I₂, . . . , I_k to be identified to a user by, for each i'th one of k iterations, beginning with a first iteration (i=1), performing:
  - calculating a cost score for documents of the n_i candidate documents X₁, X₂, . . . , X_ni, the cost score being calculated according to an algorithm that operates in dependence on a representativeness calculation and a diversity calculation,
  - adding, to the initial collection of k documents I₁, I₂, . . . , I_k, a minimum cost document, from the scored documents, having a lowest cost score, and
  - removing, from the n_i candidate documents X₁, X₂, . . . , X_ni, the minimum cost document and all r documents that are within a predetermined distance from the minimum cost document, where r ≥ 0, and n_(i+1) being n_i − (r + 1); and
- identifying toward the user the initial collection of k documents I₁, I₂, . . . , I_k for selection of a document.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
  - 2. The method of claim 1, wherein the representativeness calculation decreases the cost score for documents which are more representative of documents in the n_i candidate documents, and increases the cost score for documents which are less representative of documents in the n_i candidate documents, and wherein the diversity calculation decreases the cost score for documents which are more diverse from documents previously added to the initial collection and increases the cost score for documents which are less diverse from documents previously added to the initial collection.
  - 3. The method of claim 2, wherein the representativeness calculation is calculated in dependence on a distance in the embedding space from the document being scored to a τ'th closest document in the n_i candidate documents, where τ is a predetermined number greater than or equal to 1.
  - 4. The method of claim 2, wherein the diversity calculation is calculated in dependence on an average distance between the document being scored from the documents previously added to the initial collection.
  - 5. The method of claim 2, wherein the cost score applies a predetermined weighting factor to one of the diversity calculation and the representativeness calculation, relative to the other of the diversity calculation and the representativeness calculation.
  - 6. The method of claim 5, wherein, as the predetermined weighting factor increases, diversity of the initial collection of k documents I₁, I₂, . . . , I_k increases, and as the predetermined weighting factor decreases, representativeness of the initial collection of k documents I₁, I₂, . . . , I_k increases.
  - 7. The method of claim 2, wherein the representativeness calculation is calculated for each scored document in dependence upon the number of documents within the n_i candidate documents X₁, X₂, . . . , X_ni that are within a fixed distance in the embedding space from the document being scored.
  - 8. The method of claim 2, wherein the diversity calculation is calculated in dependence on a sum of eigenvalues of a covariance matrix of embeddings of the document being scored in the embedding space.
  - 9. The method of claim 2, wherein the diversity calculation is calculated as α∥x_{0-meaned}∥_{L2}, where α is a predetermined weighting factor, and ∥x_{0-meaned}∥_{L2} is a square root of a sum of squares of 0-meaned values of elements in a feature vector x representing the document being scored.
  - 10. The method of claim 1, wherein the cost score for each scored document is calculated using a K-medoids algorithm as follows:
  - 11. The method of claim 1, wherein the cost score for each scored document is calculated using a K-medoids algorithm including a centered norm variance term as follows:
  - 12. The method of claim 1, wherein the cost score for each scored document is calculated using a K-medoids with a mean distance term as follows:
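The scoring terms named in claims 3, 4, 5 and 7 can be illustrated with a few small helpers. This is a hedged sketch only: the function names, the choice to exclude the scored document from its own neighbour list, and the way the terms are combined in `weighted_cost` are assumptions for the example. It does not reproduce the formulas of claims 10 to 12, which are not shown in this text.

```python
from typing import List, Sequence

Distance = Sequence[Sequence[float]]  # precomputed pairwise distances


def tau_nearest_distance(dist: Distance, doc: int,
                         candidates: List[int], tau: int = 1) -> float:
    """Distance from doc to its tau-th closest other candidate (claim 3 style)."""
    others = sorted(dist[doc][c] for c in candidates if c != doc)
    return others[tau - 1] if len(others) >= tau else float("inf")


def neighbor_count(dist: Distance, doc: int,
                   candidates: List[int], radius: float) -> int:
    """Number of other candidates within a fixed distance of doc (claim 7 style)."""
    return sum(1 for c in candidates if c != doc and dist[doc][c] <= radius)


def mean_distance_to_chosen(dist: Distance, doc: int,
                            chosen: List[int]) -> float:
    """Average distance from doc to already-selected documents (claim 4 style)."""
    if not chosen:
        return 0.0
    return sum(dist[doc][s] for s in chosen) / len(chosen)


def weighted_cost(dist: Distance, doc: int, candidates: List[int],
                  chosen: List[int], tau: int = 1, weight: float = 1.0) -> float:
    """Assumed combination: lower cost for representative and diverse documents.

    A larger weight emphasises diversity, in the spirit of claims 5 and 6.
    """
    rep = tau_nearest_distance(dist, doc, candidates, tau)
    div = mean_distance_to_chosen(dist, doc, chosen)
    return rep - weight * div


# Hypothetical demo data: five documents embedded on a line.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]

print(tau_nearest_distance(dist, 0, [0, 1, 2, 3, 4], tau=2))  # -> 2
print(neighbor_count(dist, 1, [0, 1, 2, 3, 4], radius=1))     # -> 2
print(mean_distance_to_chosen(dist, 3, [0, 2]))               # -> 9.0
print(weighted_cost(dist, 3, [3, 4], [2], tau=1, weight=0.5)) # -> -3.0
```

A small τ'th-nearest distance or a large neighbour count both indicate a document sitting in a dense region, which is one plausible reading of "more representative".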

13. A computer-readable storage medium impressed in a non-transitory manner with computer program instructions for identifying an initial collection of k documents I₁, I₂, . . . , I_k from n₁ candidate documents X₁, X₂, . . . , X_n1 in an embedding space, the initial collection of k documents I₁, I₂, . . . , I_k to be used for user identification of a desired document, the computer program instructions, when executed, causing a computer to perform a method comprising:
- providing, accessibly to a computer system, a database identifying (i) the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space and (ii) a distance between each pair of documents of the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space, the distance between each pair of candidate documents corresponding to a predetermined measure of dissimilarity between the pair of candidate documents, wherein n₁ > k > 1;
- identifying the k initial documents I₁, I₂, . . . , I_k to be identified to a user by, for each i'th one of k iterations, beginning with a first iteration (i=1), performing:
  - calculating a cost score for documents of the n_i candidate documents X₁, X₂, . . . , X_ni, the cost score being calculated according to an algorithm that operates in dependence on a representativeness calculation and a diversity calculation,
  - adding, to the initial collection of k documents I₁, I₂, . . . , I_k, a minimum cost document, from the scored documents, having a lowest cost score, and
  - removing, from the n_i candidate documents X₁, X₂, . . . , X_ni, the minimum cost document and all r documents that are within a predetermined distance from the minimum cost document, where r ≥ 0, and n_(i+1) being n_i − (r + 1); and
- identifying toward the user the initial collection of k documents I₁, I₂, . . . , I_k for selection of a document.

14. A system for identifying an initial collection of k documents I₁, I₂, . . . , I_k from n₁ candidate documents X₁, X₂, . . . , X_n1 in an embedding space, the initial collection of k documents I₁, I₂, . . . , I_k to be used for user identification of a desired document, the system including:
- a processor;
- a memory storing the embedding space; and
- a computer-readable medium coupled to the processor, computer-readable medium having stored thereon, in a non-transitory manner, a plurality of software code portions defining logic for:
  - a first module for providing, accessibly to a computer system, a database identifying (i) the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space and (ii) a distance between each pair of documents of the n₁ candidate documents X₁, X₂, . . . , X_n1 in the embedding space, the distance between each pair of candidate documents corresponding to a predetermined measure of dissimilarity between the pair of candidate documents, wherein n₁ > k > 1;
  - a second module for identifying the k initial documents I₁, I₂, . . . , I_k to be identified to a user by, for each i'th one of k iterations, beginning with a first iteration (i=1), performing:
    - calculating a cost score for documents of the n_i candidate documents X₁, X₂, . . . , X_ni, the cost score being calculated according to an algorithm that operates in dependence on a representativeness calculation and a diversity calculation,
    - adding, to the initial collection of k documents I₁, I₂, . . . , I_k, a minimum cost document, from the scored documents, having a lowest cost score, and
    - removing, from the n_i candidate documents X₁, X₂, . . . , X_ni, the minimum cost document and all r documents that are within a predetermined distance from the minimum cost document, where r ≥ 0, and n_(i+1) being n_i − (r + 1); and
  - a third module for identifying toward the user the initial collection of k documents I₁, I₂, . . . , I_k for selection of a document.
- View Dependent Claims (15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25)
  - 15. The system of claim 14, wherein the representativeness calculation decreases the cost score for documents which are more representative of documents in the n_i candidate documents, and increases the cost score for documents which are less representative of documents in the n_i candidate documents, and wherein the diversity calculation decreases the cost score for documents which are more diverse from documents previously added to the initial collection and increases the cost score for documents which are less diverse from documents previously added to the initial collection.
  - 16. The system of claim 15, wherein the representativeness calculation is calculated in dependence on a distance in the embedding space from the document being scored to a τ'th closest document in the n_i candidate documents, where τ is a predetermined number greater than or equal to 1.
  - 17. The system of claim 15, wherein the diversity calculation is calculated in dependence on an average distance between the document being scored from the documents previously added to the initial collection.
  - 18. The system of claim 15, wherein the cost score applies a predetermined weighting factor to one of the diversity calculation and the representativeness calculation, relative to the other of the diversity calculation and the representativeness calculation.
  - 19. The system of claim 18, wherein, as the predetermined weighting factor increases, diversity of the initial collection of k documents I₁, I₂, . . . , I_k increases, and as the predetermined weighting factor decreases, representativeness of the initial collection of k documents I₁, I₂, . . . , I_k increases.
  - 20. The system of claim 15, wherein the representativeness calculation is calculated for each scored document in dependence upon the number of documents within the n_i candidate documents X₁, X₂, . . . , X_ni that are within a fixed distance in the embedding space from the document being scored.
  - 21. The system of claim 15, wherein the diversity calculation is calculated in dependence on a sum of eigenvalues of a covariance matrix of embeddings of the document being scored in the embedding space.
  - 22. The system of claim 15, wherein the diversity calculation is calculated as α∥x_{0-meaned}∥_{L2}, where α is a predetermined weighting factor, and ∥x_{0-meaned}∥_{L2} is a square root of a sum of squares of 0-meaned values of elements in a feature vector x representing the document being scored.
  - 23. The system of claim 14, wherein the cost score for each scored document is calculated using a K-medoids algorithm as follows:
  - 24. The system of claim 14, wherein the cost score for each scored document is calculated using a K-medoids algorithm including a centered norm variance term as follows:
  - 25. The system of claim 14, wherein the cost score for each scored document is calculated using a K-medoids with a mean distance term as follows:
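The two norm-style diversity terms in the dependent claims (the 0-meaned L2 norm, and the sum of eigenvalues of a covariance matrix) can be sketched in a few lines of standard-library Python. The function names are hypothetical. For the covariance term, the claims do not say which vectors enter the covariance matrix, so the sketch simply takes a list of vectors as input and uses the population covariance (dividing by the number of vectors); it relies on the fact that the sum of a covariance matrix's eigenvalues equals its trace, that is, the sum of the per-dimension variances.

```python
import math
from typing import Sequence


def zero_meaned_l2(x: Sequence[float], alpha: float = 1.0) -> float:
    """alpha times the L2 norm of x after subtracting the mean of its elements."""
    mean = sum(x) / len(x)
    return alpha * math.sqrt(sum((v - mean) ** 2 for v in x))


def covariance_eigenvalue_sum(vectors: Sequence[Sequence[float]]) -> float:
    """Sum of eigenvalues of the population covariance matrix of the vectors.

    Computed as the trace (sum of per-dimension variances), so no
    eigendecomposition is needed.
    """
    n = len(vectors)
    dims = len(vectors[0])
    total = 0.0
    for d in range(dims):
        mean = sum(v[d] for v in vectors) / n
        total += sum((v[d] - mean) ** 2 for v in vectors) / n
    return total


print(zero_meaned_l2([4.0, 4.0, 4.0]))                           # -> 0.0
print(covariance_eigenvalue_sum([[0, 0], [2, 0], [0, 2], [2, 2]]))  # -> 2.0
```

A feature vector whose elements are all equal scores zero on the first term, while vectors with widely varying elements score higher.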

26. A method for user identification of a desired document, comprising:
- providing, accessibly to a computer system, a database identifying (i) n₀ candidate documents X₁, X₂, . . . , X_n0 in an embedding space and (ii) a distance between each pair of documents of the n₀ candidate documents X₁, X₂, . . . , X_n0 in the embedding space, the distance between each pair of candidate documents corresponding to a predetermined measure of dissimilarity between the pair of candidate documents;
- identifying an initial (j=0) collection of k documents I₁, I₂, . . . , I_k, n₀ > k > 1, from the n₀ candidate documents X₁, X₂, . . . , X_n0 by, for each i'th one of k iterations, beginning with a first iteration (i=1), performing:
  - calculating a cost score for documents of the n_i candidate documents X₁, X₂, . . . , X_ni, the cost score being calculated according to an algorithm that operates in dependence on a representativeness calculation and a diversity calculation,
  - adding, to the initial collection of k documents I₁, I₂, . . . , I_k, a minimum cost document, from the scored documents, having a lowest cost score, and
  - removing, from the n_i candidate documents X₁, X₂, . . . , X_ni, the minimum cost document and all r documents that are within a predetermined distance from the minimum cost document, where r ≥ 0, and n_(i+1) being n_i − (r + 1);
- a computer system identifying the initial (j=0) collection of k documents toward the user;
- for each j'th iteration in a plurality of iterations, beginning with a first iteration (j=1):
  - in response to user selection of a j'th selected subset of the documents from the (j−1)'th collection of documents, and in dependence upon the j'th selected subset, a computer system identifying a j'th candidate space of n_j candidate documents from the (j−1)'th candidate space of n_(j−1) candidate documents, the j'th candidate space being smaller than the (j−1)'th candidate space,
  - identifying a j'th collection of documents which is a subset of the j'th candidate space of n_j candidate documents, and
  - identifying toward the user the j'th collection of documents; and
- taking action in response to user indicating commitment to a particular collection of documents identified toward the user.
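One refinement step of the kind recited in claim 26 can be sketched as follows. The claim does not state how the candidate space is narrowed or how the next collection is chosen, so both rules here are assumptions made purely for illustration: a candidate is kept only if it is closer to the selected document than to every shown-but-unselected document, and the next collection is the k candidates nearest the selection. The sketch also assumes a single selected document rather than a selected subset, and all function names are hypothetical.

```python
from typing import List, Sequence

Distance = Sequence[Sequence[float]]  # precomputed pairwise distances


def refine_candidates(dist: Distance, candidates: List[int],
                      shown: List[int], selected: int) -> List[int]:
    """Assumed narrowing rule: keep candidates closer to the selected
    document than to any document that was shown but not selected."""
    rejected = [d for d in shown if d != selected]
    return [
        c for c in candidates
        if all(dist[c][selected] < dist[c][r] for r in rejected)
    ]


def next_collection(dist: Distance, candidates: List[int],
                    selected: int, k: int) -> List[int]:
    """Assumed presentation rule: the k candidates nearest the selection."""
    return sorted(candidates, key=lambda c: dist[c][selected])[:k]


# Hypothetical demo data: five documents embedded on a line.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]

# The user was shown documents 2 and 4 and selected document 4.
space = refine_candidates(dist, [0, 1, 2, 3, 4], shown=[2, 4], selected=4)
print(space)                               # -> [3, 4]
print(next_collection(dist, space, 4, 2))  # -> [4, 3]
```

Under this rule every unselected document that was shown drops out of the candidate space, so the space shrinks at each step whenever at least one shown candidate is left unselected.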

27. A system for user identification of a desired document, the system including:
- a processor;
- a memory storing accessibly to a computer system, a database identifying (i) n₀ candidate documents X₁, X₂, . . . , X_n0 in an embedding space and (ii) a distance between each pair of documents of the n₀ candidate documents X₁, X₂, . . . , X_n0 in the embedding space, the distance between each pair of candidate documents corresponding to a predetermined measure of dissimilarity between the pair of candidate documents; and
- a computer-readable medium coupled to the processor, computer-readable medium having stored thereon, in a non-transitory manner, a plurality of software code portions defining logic for:
  - a first module which identifies an initial (j=0) collection of k documents I₁, I₂, . . . , I_k, n₀ > k > 1, from the n₀ candidate documents X₁, X₂, . . . , X_n0 by, for each i'th one of k iterations, beginning with a first iteration (i=1), the first module performing:
    - calculating a cost score for documents of the n_i candidate documents X₁, X₂, . . . , X_ni, the cost score being calculated according to an algorithm that operates in dependence on a representativeness calculation and a diversity calculation,
    - adding, to the initial collection of k documents I₁, I₂, . . . , I_k, a minimum cost document, from the scored documents, having a lowest cost score, and
    - removing, from the n_i candidate documents X₁, X₂, . . . , X_ni, the minimum cost document and all r documents that are within a predetermined distance from the minimum cost document, where r ≥ 0, and n_(i+1) being n_i − (r + 1);
  - a second module which identifies the initial (j=0) collection of k documents toward the user;
  - a third module which, for each j'th iteration in a plurality of iterations, beginning with a first iteration (j=1), performs:
    - in response to user selection of a j'th selected subset of the documents from the (j−1)'th collection of documents, and in dependence upon the j'th selected subset, a computer system identifying a j'th candidate space of n_j candidate documents from the (j−1)'th candidate space of n_(j−1) candidate documents, the j'th candidate space being smaller than the (j−1)'th candidate space,
    - identifying a j'th collection of documents which is a subset of the j'th candidate space of n_j candidate documents, and
    - identifying toward the user the j'th collection of documents; and
  - a fourth module which takes action in response to user indicating commitment to a particular collection of documents identified toward the user.

Current Assignee: Sentient Technologies (Barbados) Limited
Original Assignee: Evolv Technology Solutions, Inc.
Inventors: Legrand, Diego; Long, Philip M.; Duffy, Nigel
Primary Examiner(s): Ruiz, Angelica
Application Number: US15/295,930
Publication Number: US 20170031904A1
Time in Patent Office: 1,261 Days
Field of Search: 707600-831, 707899, 707999001-999206
CPC Class Codes:
G06F 16/34   Browsing; Visualisation the...
G06F 16/38   Retrieval characterised by ...
G06F 16/387   using geographical or spati...
G06F 16/58   Retrieval characterised by ...
G06F 16/583   using metadata automaticall...
G06F 16/904   Browsing; Visualisation the...
G06F 16/9535   Search customisation based ...
G06F 16/9538   Presentation of query results
