SEARCH RESULTS RANKING USING EDITING DISTANCE AND DOCUMENT INFORMATION

US 20090259651A1
Filed: 04/11/2008
Published: 10/15/2009
Est. Priority Date: 04/11/2008
Status: Active Grant

Abstract

Architecture for extracting document information from documents received as search results based on a query string, and computing an edit distance between the data string and the query string. The edit distance is employed in determining relevance of the document as part of result ranking by detecting near-matches of a whole query or part of the query. The edit distance evaluates how close the query string is to a given data stream that includes document information such as TAUC (title, anchor text, URL, clicks) information, etc. The architecture includes the index-time splitting of compound terms in the URL to allow the more effective discovery of query terms. Additionally, index-time filtering of anchor text is utilized to find the top N anchors of one or more of the document results. The TAUC information can be input to a neural network (e.g., 2-layer) to improve relevance metrics for ranking the search results.
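
To make the edit-distance idea in the abstract concrete, the following is a minimal sketch of a term-level edit distance between a query string and a data string (for example, a document title). It assumes whitespace tokenization, case-insensitive matching, and unit costs for term insertion and deletion; the function name `term_edit_distance` and the sample strings are illustrative and do not come from the patent.

```python
from typing import List


def term_edit_distance(query: str, data: str) -> int:
    """Count the term insertions and deletions needed to turn query into data."""
    q: List[str] = query.lower().split()
    d: List[str] = data.lower().split()
    # dist[i][j] = edits needed to turn the first i query terms
    # into the first j data-string terms.
    dist = [[0] * (len(d) + 1) for _ in range(len(q) + 1)]
    for i in range(1, len(q) + 1):
        dist[i][0] = i  # delete every query term
    for j in range(1, len(d) + 1):
        dist[0][j] = j  # insert every data-string term
    for i in range(1, len(q) + 1):
        for j in range(1, len(d) + 1):
            options = [dist[i - 1][j] + 1,  # delete a query term
                       dist[i][j - 1] + 1]  # insert a data-string term
            if q[i - 1] == d[j - 1]:
                options.append(dist[i - 1][j - 1])  # terms match, no cost
            dist[i][j] = min(options)
    return dist[len(q)][len(d)]


if __name__ == "__main__":
    print(term_edit_distance("microsoft office", "Microsoft Office home page"))
```

A smaller distance indicates a nearer match between the query and the data string, which is the signal the abstract describes feeding into result ranking.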

20 Claims

1. A computer-implemented relevance system, comprising:
- a processing component for extracting document information from documents received as search results based on a query string; and
  
  a proximity component for computing edit distance between the data string and the query string, the edit distance employed in determining relevance of a document as part of result ranking.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The system of claim 1, wherein the document information employed to generate the data string includes at least one of a title information, URL information, click information, or anchor text.
  - 3. The system of claim 1, wherein the processing component splits compound terms of the document information at index time to compute the edit distance relative to a URL.
  - 4. The system of claim 1, wherein the processing component filters anchor text of the document information at index time to compute a top-ranked set of anchor text.
  - 5. The system of claim 1, wherein the document information includes at least one of title characters, anchor characters, click characters, or URL characters, which document information is input to a neural network along with raw input features of a BM25F function, click distance, file type, language and URL depth to compute the relevance of the document.
  - 6. The system of claim 1, wherein the computing of the edit distance is based on insertion and deletion of terms to increase proximity between the data string and the query string.
  - 7. The system of claim 1, wherein the computing of the edit distance is based on costs associated with insertion and deletion of terms to increase proximity between the data string and the query string.
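
Claim 3 refers to splitting compound terms at index time so that edit distance can be computed relative to a URL. One way this might be sketched is a dictionary-based segmentation of URL tokens. The vocabulary, the function names `split_compound` and `url_terms`, and the choice to prefer the segmentation with the fewest pieces are all assumptions made for illustration; the claims do not specify the splitting method.

```python
import re
from typing import List, Optional, Set


def split_compound(token: str, vocabulary: Set[str]) -> List[str]:
    """Split token into vocabulary terms; return [token] if no full split exists."""
    n = len(token)
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            prefix = best[j]
            piece = token[j:i]
            if prefix is not None and piece in vocabulary:
                candidate = prefix + [piece]
                current = best[i]
                if current is None or len(candidate) < len(current):
                    best[i] = candidate
    result = best[n]
    return result if result is not None else [token]


def url_terms(url: str, vocabulary: Set[str]) -> List[str]:
    """Tokenize a URL on non-alphanumeric characters, then split compounds."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    terms: List[str] = []
    for token in tokens:
        terms.extend(split_compound(token, vocabulary))
    return terms


if __name__ == "__main__":
    vocab = {"microsoft", "office", "home", "page"}  # hypothetical vocabulary
    print(url_terms("http://www.microsoftoffice.com/homepage.aspx", vocab))
```

After splitting, query terms such as "office" can be matched against individual URL terms rather than being hidden inside a compound token.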

8. A computer-implemented method of determining relevance, comprising:
- receiving a query string as part of a search process;
  
  extracting document information from a document returned during the search process;
  
  generating a data string from the document information;
  
  computing edit distance between the data string and the query string; and
  
  calculating a relevance score based on the edit distance.
- Dependent claims (9, 10, 11, 12, 13, 14, 15, 16):
  - 9. The method of claim 8, further comprising employing term insertion as part of computing the edit distance and assessing an insertion cost for insertion of a term in the query string to generate the data string, the cost represented as a weighting parameter.
  - 10. The method of claim 8, further comprising employing term deletion as part of computing the edit distance and assessing a deletion cost for deletion of a term in the query string to generate the data string, the cost represented as a weighting parameter.
  - 11. The method of claim 8, further comprising computing a position cost as part of computing the edit distance, the position cost associated with term insertion and/or term deletion of a term position in the data string.
  - 12. The method of claim 8, further comprising performing a matching process between characters of the data string and characters of the query string to compute an overall cost of computing the edit distance.
  - 13. The method of claim 8, further comprising splitting compound terms of a URL of the data string at index time.
  - 14. The method of claim 8, further comprising filtering anchor text of the data string to find a top-ranked set of anchor text based on frequency of occurrence in the document.
  - 15. The method of claim 14, further comprising computing an edit distance score for anchor text in the set.
  - 16. The method of claim 8, further comprising inputting a score, derived from computing the edit distance, into a two-layer neural network after application of a transform function, the score generated based on calculating the edit distance associated with at least one of title information, anchor information, click information, or URL information, and other raw input features.
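
Claims 9 and 10 describe insertion and deletion costs expressed as weighting parameters, and claim 8 ends with a relevance score calculated from the edit distance. The sketch below illustrates one possible reading: a weighted term-level edit distance followed by a simple transform. The parameter names `insert_cost` and `delete_cost`, the default weights, and the transform `1 / (1 + distance)` are assumptions chosen for illustration. The position cost of claim 11 and the neural network of claim 16 are not modeled here.

```python
from typing import List, Sequence


def weighted_edit_distance(query_terms: Sequence[str],
                           data_terms: Sequence[str],
                           insert_cost: float = 1.0,
                           delete_cost: float = 1.0) -> float:
    """Cheapest sequence of weighted term insertions and deletions
    that turns query_terms into data_terms."""
    rows, cols = len(query_terms) + 1, len(data_terms) + 1
    cost: List[List[float]] = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = i * delete_cost
    for j in range(1, cols):
        cost[0][j] = j * insert_cost
    for i in range(1, rows):
        for j in range(1, cols):
            options = [cost[i - 1][j] + delete_cost,   # drop a query term
                       cost[i][j - 1] + insert_cost]   # add a data-string term
            if query_terms[i - 1] == data_terms[j - 1]:
                options.append(cost[i - 1][j - 1])     # matching terms are free
            cost[i][j] = min(options)
    return cost[-1][-1]


def relevance_feature(distance: float) -> float:
    """Hypothetical transform: map a distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    d = weighted_edit_distance(["microsoft", "office"],
                               ["microsoft", "office", "home"],
                               insert_cost=0.5, delete_cost=2.0)
    print(d, relevance_feature(d))
```

Setting the deletion weight higher than the insertion weight, as in the usage above, penalizes data strings that are missing query terms more than data strings that contain extra terms; whether that asymmetry is desirable depends on the ranking goal.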

17. A computer-implemented method of computing relevance of a document, comprising:
- processing a query string as part of a search process to return a result set of documents;
  
  generating a data string based on document information extracted from a document of the result set, the document information includes one or more of title information, anchor text information, click information, and URL information from the document;
  
  computing edit distance between the data string and the query string based on term insertion, term deletion, and term position; and
  
  calculating a relevance score based on the edit distance, the relevance score used to rank the document in the result set.
- Dependent claims (18, 19, 20):
  - 18. The method of claim 17, further comprising computing a cost associated with each of the term insertion, term deletion and term position, and factoring the cost into computation of the relevance score.
  - 19. The method of claim 17, further comprising splitting compound terms of the URL information at index time and filtering the anchor text information at index time to find a top-ranked set of anchor text based on frequency of occurrence of the anchor text in the document.
  - 20. The method of claim 17, further comprising reading occurrences of terms of the query string to construct a string of query terms in order of appearance in a source URL string and filling space between the terms with word marks.
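
Claim 20 describes building a string of query terms in their order of appearance in a source URL string, with word marks filling the space between them. A rough sketch of one interpretation follows. The `#` symbol used as the word mark, the function name `query_terms_in_url_order`, and the rule that any uncovered run of URL text containing a letter or digit (including text before the first and after the last matched term) becomes a single mark are all assumptions; the claim does not define these details.

```python
import re
from typing import List

WORD_MARK = "#"  # hypothetical placeholder for URL text that is not a query term


def query_terms_in_url_order(query: str, url: str) -> List[str]:
    """List query terms in the order they occur in the URL,
    with a word mark for each uncovered run of letters or digits."""
    text = url.lower()
    # Try longer terms first so a short term cannot pre-empt a longer one.
    terms = sorted(set(query.lower().split()), key=len, reverse=True)
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    result: List[str] = []
    cursor = 0
    for match in pattern.finditer(text):
        gap = text[cursor:match.start()]
        if re.search(r"[a-z0-9]", gap):
            result.append(WORD_MARK)  # non-query text precedes this term
        result.append(match.group())
        cursor = match.end()
    if re.search(r"[a-z0-9]", text[cursor:]):
        result.append(WORD_MARK)      # non-query text after the last term
    return result


if __name__ == "__main__":
    print(query_terms_in_url_order("microsoft office",
                                   "www.microsoft.com/office/home"))
```

The resulting sequence can then be compared against the query terms with a term-level edit distance, so that each word mark counts as an extra term separating the query terms in the URL.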

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Xu, Jun; Meyerzon, Dmitriy; Li, Hang; Tankovich, Vladimir

Granted Patent

US 8,812,493 B2
US Class Current

1/1
CPC Class Codes

G06F 16/951 Indexing; Web crawling tech...

G06F 40/194 Calculation of difference b...
