Detection of junk in search result ranking

US 8,738,635 B2
Filed: 06/01/2010
Issued: 05/27/2014
Est. Priority Date: 06/01/2010
Status: Active Grant

Abstract

Embodiments are directed to ranking search results using a junk profile. For a given corpus of documents, one or more junk profiles may be created and maintained. The junk profile provides reference metrics to represent known junk documents. For example, a junk profile may comprise a dictionary of document data that is automatically inserted into documents created using a particular system or template. A junk profile may also comprise one or more representations (e.g., histograms) of a distribution of a particular junk variable for known junk documents. The junk profile provides a usable representation of known junk documents, and the present systems and methods employ the junk profile to predict the likelihood that documents in the corpus are junk. In embodiments, junk scores are calculated and used to rank such documents higher or lower in response to a search query.
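
The histogram comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how histograms are binned or compared, so the bin edges, the normalized-frequency representation, and the use of histogram intersection as the similarity measure are all assumptions made for illustration; the function names are hypothetical as well.

```python
from bisect import bisect_right


def build_histogram(values, bin_edges):
    """Return the fraction of values falling into each bin.

    bin_edges are ascending, exclusive upper bounds; one overflow bin is
    appended for values at or above the last edge. (Assumed layout.)
    """
    counts = [0] * (len(bin_edges) + 1)
    for value in values:
        counts[bisect_right(bin_edges, value)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def histogram_similarity(candidate, reference):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    return sum(min(c, r) for c, r in zip(candidate, reference))


def junk_score(candidate_values, reference_histogram, bin_edges):
    """Score a document by how closely its histogram matches a known junk histogram."""
    candidate = build_histogram(candidate_values, bin_edges)
    return histogram_similarity(candidate, reference_histogram)


# Hypothetical junk variable: sizes of text chunks, in words.
BIN_EDGES = [10, 100]
reference = build_histogram([2, 3, 4, 5], BIN_EDGES)          # known junk document
junk_like = junk_score([1, 2, 3, 50], reference, BIN_EDGES)   # mostly tiny chunks
prose_like = junk_score([40, 60, 150], reference, BIN_EDGES)  # longer chunks
```

In this sketch a higher score means the candidate's distribution looks more like the known junk document's, so a document made up mostly of very short chunks scores higher than one with longer passages.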

360 Citations

20 Claims

1. A computer-implemented method for ranking candidate documents in response to a search query, comprising steps of:
- creating, by at least a first processor, an index of a plurality of documents in a corpus;
- calculating a junk score for at least a first document in the corpus, wherein calculating the junk score comprises:
  - using a first candidate histogram for the first document in the corpus, wherein the first candidate histogram is specific to the first document; and
  - using a junk profile, wherein the junk profile comprises:
    - a first reference histogram for a first known junk document, wherein the first reference histogram is specific to the first known junk document and is based on a first junk variable; and
  - comparing the first candidate histogram to the first reference histogram;
- receiving a search query;
- identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document;
- ranking the candidate documents.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
  - 2. The computer-implemented method of claim 1, wherein ranking the candidate documents comprises ranking the candidate documents based at least in part on the junk score for the first document, and wherein the ranking of the first document is decreased where the first document is more similar to the first known junk document.
  - 3. The computer-implemented method of claim 1, wherein calculating the junk score further comprises determining a first similarity metric.
  - 4. The computer-implemented method of claim 3, wherein the junk profile comprises a second reference histogram for the first junk variable of a second known junk document, and wherein calculating the junk score comprises comparing the candidate histogram to the second reference histogram to determine a second similarity metric.
  - 5. The computer-implemented method of claim 4, wherein calculating the junk score comprises at least one of:
    - calculating a maximum of the first and second similarity metrics and calculating an average of the first and second similarity metrics.
  - 6. The computer-implemented method of claim 1, further comprising the step of displaying the ranked candidate documents and displaying a junk status for at least the first document.
  - 7. The computer-implemented method of claim 1, wherein the first junk variable comprises chunk size.
  - 8. The computer-implemented method of claim 1, wherein:
    - the junk profile comprises a dictionary of automatically generated data, and wherein creating the index comprises ignoring document data that matches the automatically generated data.
  - 9. The computer-implemented method of claim 1, wherein:
    - the junk profile comprises a dictionary of automatically generated data;
    - calculating the junk score further comprises comparing document data from the plurality of documents in the corpus to the dictionary of automatically generated data; and
    - creating the index comprises delineating in the index document data that matches the automatically generated data.
  - 10. The computer-implemented method of claim 9, wherein identifying the candidate documents includes comparing the search query to document data in the index, and wherein ranking the candidate documents includes determining whether document data matching the search query has been delineated as matching the automatically generated data.
  - 11. The computer-implemented method of claim 9, wherein calculating the junk score for the first document comprises determining a similarity metric between document data in the first document and the automatically generated data.
  - 12. The computer-implemented method of claim 9, further comprising:
    - creating the junk profile, comprising creating the dictionary of automatically generated data by:
      - creating a blank template containing automatically generated data; and
      - extracting the automatically generated data from the blank template.
  - 13. The computer-implemented method of claim 1, wherein the step of calculating includes calculating a junk score for a second document in the corpus and wherein the step of identifying comprises excluding the second document from the candidate documents when the junk score for the second document exceeds a predetermined threshold.
  - 14. The computer-implemented method of claim 1, wherein the step of calculating occurs after the step of identifying and wherein the step of calculating comprises calculating a junk score for a plurality of the candidate documents.
  - 15. The computer-implemented method of claim 1, wherein the corpus is an intranet, the plurality of documents is created using a particular template, and the junk profile is specific to the particular template.
  - 16. The computer-implemented method of claim 1, wherein the search query comprises a query for documents in the corpus that have a junk score that exceeds a predetermined threshold.
  - 17. The computer-implemented method of claim 1, wherein the junk score for the first document exceeds a predetermined threshold, further comprising:
    - sending to an administrator a message identifying the first document as junk.
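
Dependent claims 2, 5, and 13 describe combining several similarity metrics, demoting junk-like documents, and excluding documents whose junk score exceeds a threshold. The sketch below shows one way these pieces could fit together. The multiplicative demotion formula, the threshold value, and all function names are assumptions for illustration; the claims do not specify them.

```python
def combine_similarities(similarities, mode="max"):
    """Combine per-reference similarity metrics into one junk score (claim 5)."""
    if not similarities:
        return 0.0
    if mode == "max":
        return max(similarities)
    if mode == "average":
        return sum(similarities) / len(similarities)
    raise ValueError(f"unknown mode: {mode}")


def rank_candidates(candidates, exclude_threshold=0.9):
    """Rank (doc_id, relevance, junk_score) tuples.

    Documents whose junk score exceeds the threshold are dropped (claim 13).
    The rest are ordered by relevance * (1 - junk_score), an assumed formula
    that lowers the rank of junk-like documents (claim 2).
    """
    kept = [c for c in candidates if c[2] <= exclude_threshold]
    kept.sort(key=lambda c: c[1] * (1.0 - c[2]), reverse=True)
    return [doc_id for doc_id, _relevance, _junk in kept]


# Two reference histograms produced similarity metrics 0.25 and 0.75.
score_max = combine_similarities([0.25, 0.75], mode="max")
score_avg = combine_similarities([0.25, 0.75], mode="average")

ranking = rank_candidates([
    ("a", 1.0, 0.5),   # relevant but junk-like: adjusted score 0.5
    ("b", 0.8, 0.0),   # slightly less relevant, not junk: adjusted score 0.8
    ("c", 2.0, 0.95),  # above the threshold, excluded
])
```

Taking the maximum flags a document that closely resembles any single known junk document, while the average rewards resemblance to the junk profile as a whole; which is preferable depends on how varied the known junk documents are.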

18. A system for ranking candidate documents in response to a search query, comprising:
- at least one processor;
- a memory, operatively connected to the at least one processor and containing instructions that, when executed by the at least one processor, perform a method comprising:
  - creating an index of a plurality of documents in a corpus;
  - calculating a junk score for at least a first document in the corpus, wherein calculating the junk score comprises:
    - using a first candidate histogram for the first document in the corpus, wherein the first candidate histogram is specific to the first document; and
    - using a junk profile, wherein the junk profile comprises:
      - a first reference histogram for a first known junk document, wherein the first reference histogram is specific to the first known junk document and is based on a first junk variable; and
    - comparing the first candidate histogram to the first reference histogram;
  - receiving a search query;
  - identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document;
  - ranking the candidate documents based at least in part on the junk score for the first document;
  - wherein creating the index comprises separately delineating document data from the plurality of documents if the document data matches the junk profile.
- Dependent Claims (19)
  - 19. The system of claim 18, wherein the method further comprises:
    - creating, for at least the first document, a candidate histogram for at least a first junk variable;
    - wherein calculating the junk score comprises comparing the candidate histogram to the first reference histogram to determine a first similarity metric;
    - wherein the junk profile comprises a dictionary of automatically generated data;
    - wherein calculating the junk score further comprises comparing document data from the plurality of documents in the corpus to the dictionary of automatically generated data; and
    - wherein creating the index comprises delineating in the index document data that matches the automatically generated data.
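
Claims 9, 10, 12, 18, and 19 describe a dictionary of automatically generated data and an index that delineates document data matching it. The following is a minimal sketch of that idea under strong simplifying assumptions: the dictionary is the set of whitespace-separated terms in a blank template, matching is done per term rather than per position, and matches on delineated terms count for a small assumed weight at ranking time. All names and the weight are hypothetical.

```python
def extract_auto_generated(blank_template_text):
    """Build the dictionary from a blank template (claim 12), as a set of terms."""
    return set(blank_template_text.lower().split())


def build_index(documents, auto_dictionary):
    """Inverted index mapping term -> {doc_id: is_auto_generated}.

    The boolean delineates postings whose term matches the dictionary.
    """
    index = {}
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index.setdefault(term, {})[doc_id] = term in auto_dictionary
    return index


def search(index, query, auto_weight=0.1):
    """Rank documents; matches on delineated terms count for less (assumed weight)."""
    scores = {}
    for term in query.lower().split():
        for doc_id, is_auto in index.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + (auto_weight if is_auto else 1.0)
    # Sort by descending score, then by doc id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


auto_dictionary = extract_auto_generated("Status Report Owner")
documents = {
    "blank": "Status Report Owner",
    "real": "Crawler status is healthy",
}
index = build_index(documents, auto_dictionary)
results = search(index, "crawler status")
```

Here the unfilled template still matches the query term "status", but only through delineated data, so it ranks below the document that matches on authored content.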

20. A computer storage medium including computer-executable instructions that, when executed by at least one processor, perform a method comprising:
- creating an index of a plurality of documents in a corpus;
- creating, for at least a first document of the plurality of documents, a candidate histogram specific to the first document for at least a first junk variable;
- calculating a junk score for at least the first document using a junk profile, wherein:
  - the junk profile comprises:
    - a first reference histogram for a first known junk document, wherein the first reference histogram is specific to at least the first known junk document and is based on the first junk variable, and
    - a dictionary of automatically generated data; and
  - calculating a junk score comprises at least (a) comparing the candidate histogram to the first reference histogram to determine a first similarity metric and (b) determining a second similarity metric between document data in the first document and the dictionary of automatically generated data;
- receiving a search query;
- identifying, based on the search query and the index, candidate documents from the plurality of documents in the corpus, wherein the candidate documents include at least the first document;
- ranking the candidate documents based at least in part on the junk score for the first document.
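
Claim 20 combines two similarity metrics: one from the histogram comparison and one between document data and the dictionary of automatically generated data. As a rough sketch, the second metric could be the fraction of a document's terms that appear in the dictionary, and the two could be blended with a weighted average. Both choices, the weight, and the function names are assumptions; the claim requires only that both metrics contribute to the junk score.

```python
def dictionary_similarity(doc_text, auto_dictionary):
    """Fraction of the document's terms found in the auto-generated dictionary."""
    terms = doc_text.lower().split()
    if not terms:
        return 0.0
    return sum(1 for term in terms if term in auto_dictionary) / len(terms)


def combined_junk_score(hist_similarity, dict_similarity, hist_weight=0.5):
    """Weighted average of the two similarity metrics (assumed combination rule)."""
    return hist_weight * hist_similarity + (1.0 - hist_weight) * dict_similarity


auto_terms = {"status", "report", "owner"}
second_metric = dictionary_similarity("Status Report Owner Alice", auto_terms)
# 0.5 stands in for a first similarity metric from a histogram comparison.
score = combined_junk_score(0.5, second_metric)
```

A document that is three-quarters template text gets a second metric of 0.75 in this example, which pulls the combined score above what the histogram comparison alone would give.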

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Tankovich, Vladimir; Meyerzon, Dmitriy; Poznanski, Victor
Primary Examiner(s): CHANNAVAJJALA, SRIRAMA T
Application Number: US12/791,756
Publication Number: US 20110295850A1
Time in Patent Office: 1,456 Days
Field of Search: 707/609, 707/705-711, 707/726-728, 707/741-746, 707/723, 707/748, 707/749, 707/758, 709/203, 709/206, 709/215-220, 715/215, 715/233-237, 715/259, 715/968
US Class Current: 707/748
CPC Class Codes:
G06F 16/00   Information retrieval; Data...
G06F 16/3331   Query processing
G06F 16/951   Indexing; Web crawling tech...
G06F 17/00   Digital computing or data p...
