Pseudo-anchor text extraction for vertical search
Abstract
A search method uses pseudo-anchor text associated with search objects to improve search performance. The pseudo-anchor text may be extracted in combination with an identifier of the search objects (such as a pseudo-URL) from a digital corpus such as a collection of documents. Pseudo-anchor texts for each object are preferably extracted from candidate anchor blocks using a machine-learning-based approach. The pseudo-anchor texts are made available for searching and are used to help rank the objects in a search result. The method may be used in vertical search of objects such as published articles, products, and images that lack explicit URL and anchor text information.
19 Claims
1. A computer-implemented method comprising:
via a processor executing computer-readable instructions:
extracting from a digital corpus an object;
extracting from the digital corpus a pseudo-anchor text associated with the object, wherein extracting the pseudo-anchor text comprises:
  extracting from the digital corpus a parallel object identified with a second object identifier;
  identifying an occurrence of the object in the digital corpus;
  selecting a first candidate anchor block based on the occurrence of the object;
  identifying an occurrence of the parallel object in the digital corpus;
  selecting a second candidate anchor block based on the occurrence of the parallel object;
  comparing similarity between the first identifier and the second identifier;
  adding the first candidate anchor block and the second candidate anchor block to a common candidate anchor block set if the similarity between the first identifier and the second identifier satisfies a specified threshold; and
  extracting the pseudo-anchor text from the common candidate anchor block set; and
making the pseudo-anchor text available for searching.
Dependent claims: 2-15.
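The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the fixed-width text window used as a "candidate anchor block", the `difflib` ratio used as the identifier similarity, the 0.8 threshold, and all function names are assumptions made for illustration. The final extraction step is a trivial stand-in for the machine-learning-based extraction that the abstract describes as preferred.

```python
from difflib import SequenceMatcher


def candidate_blocks(corpus, identifier, window=40):
    """Return a text window around every occurrence of `identifier`.

    The fixed character window is a hypothetical block-selection rule.
    """
    blocks = []
    for doc in corpus:
        start = doc.find(identifier)
        while start != -1:
            lo = max(0, start - window)
            hi = min(len(doc), start + len(identifier) + window)
            blocks.append(doc[lo:hi])
            start = doc.find(identifier, start + 1)
    return blocks


def identifier_similarity(first_id, second_id):
    """Assumed similarity measure: difflib ratio on lowercased identifiers."""
    return SequenceMatcher(None, first_id.lower(), second_id.lower()).ratio()


def common_block_set(corpus, first_id, second_id, threshold=0.8):
    """Merge the blocks of both identifiers only if the identifiers are similar enough."""
    blocks = candidate_blocks(corpus, first_id)
    if identifier_similarity(first_id, second_id) >= threshold:
        blocks = blocks + candidate_blocks(corpus, second_id)
    return blocks


def extract_pseudo_anchor_text(blocks, identifiers):
    """Stand-in extraction: drop the identifiers and keep the surrounding words."""
    cleaned = []
    for block in blocks:
        for identifier in identifiers:
            block = block.replace(identifier, " ")
        cleaned.append(" ".join(block.split()))
    return " ".join(cleaned)


# Usage with a made-up two-document corpus.
corpus = [
    "As shown in Smith, J. Fast Indexing. 2004, hashing helps.",
    "See Smith J. Fast Indexing 2004 for details on ranking.",
]
first_id = "Smith, J. Fast Indexing. 2004"
second_id = "Smith J. Fast Indexing 2004"

blocks = common_block_set(corpus, first_id, second_id)
anchor_text = extract_pseudo_anchor_text(blocks, [first_id, second_id])
search_index = {first_id: anchor_text}  # "making the text available for searching"
```

In this sketch the two citation strings differ only in punctuation, so their blocks land in one common set and both surrounding sentences contribute to the pseudo-anchor text of the same object.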
16. One or more computer-readable media having computer-readable instructions thereon which, when executed by one or more processors, cause the one or more processors to perform the following:
extracting from a digital corpus an object, wherein extracting the object comprises:
  extracting from the digital corpus a plurality of pseudo-URLs;
  constructing a plurality of feature strings from the plurality of pseudo-URLs, at least one feature string for each pseudo-URL;
  applying a hash function to the plurality of feature strings to calculate a hash value of each feature string;
  placing the plurality of pseudo-URLs into slots of a hash-table according to the hash value of the at least one feature string of each pseudo-URL, such that each slot of the hash table contains pseudo-URLs each having at least one feature string with the same hash value;
  calculating similarity of the pseudo-URLs in each slot of the hash-table using a similarity function;
  selecting pseudo-URLs having similarity that satisfies a specified threshold to be members of a subgroup of pseudo-URLs; and
  associating the subgroup of the pseudo-URLs with the object;
extracting from the digital corpus a pseudo-anchor text associated with the object; and
making the pseudo-anchor text available for searching.
Dependent claims: 17, 18.
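The hash-table grouping recited in claim 16 can be sketched as follows. The claim does not fix the feature strings, hash function, or similarity function, so the choices here are assumptions for illustration only: feature strings are consecutive token pairs, the hash is SHA-256, similarity is Jaccard overlap of token sets, and the threshold is 0.6. All names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def tokens(pseudo_url):
    """Lowercased alphanumeric tokens of a pseudo-URL (e.g. a citation string)."""
    return re.findall(r"[a-z0-9]+", pseudo_url.lower())


def feature_strings(pseudo_url):
    """Assumed features: consecutive token pairs, or the whole string if too short."""
    toks = tokens(pseudo_url)
    return {" ".join(pair) for pair in zip(toks, toks[1:])} or {" ".join(toks)}


def hash_value(feature):
    """Hash of one feature string; SHA-256 is an arbitrary choice."""
    return hashlib.sha256(feature.encode("utf-8")).hexdigest()


def jaccard(a, b):
    """Assumed similarity function: Jaccard overlap of token sets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def group_pseudo_urls(pseudo_urls, threshold=0.6):
    # Place each pseudo-URL into every slot addressed by one of its feature hashes.
    table = defaultdict(set)
    for i, url in enumerate(pseudo_urls):
        for feature in feature_strings(url):
            table[hash_value(feature)].add(i)

    # Union-find over indices, so matches found in different slots merge.
    parent = list(range(len(pseudo_urls)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare only pseudo-URLs that share a slot.
    for members in table.values():
        for i, j in combinations(sorted(members), 2):
            if jaccard(pseudo_urls[i], pseudo_urls[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i, url in enumerate(pseudo_urls):
        groups[find(i)].append(url)
    return list(groups.values())


# Usage with made-up citation strings; each subgroup stands for one object.
pseudo_urls = [
    "Smith, J. Fast Indexing. 2004",
    "J. Smith, Fast Indexing, 2004",
    "Jones, A. Query Logs. 1999",
]
subgroups = group_pseudo_urls(pseudo_urls)
```

The point of the hash table is that similarity is computed only among pseudo-URLs sharing a slot, which avoids comparing every pair in the corpus.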
19. A system for generating a search result upon receiving a query, the system comprising:
one or more processors;
one or more I/O devices; and
one or more computer-readable media having a collection of digital objects and computer-readable instructions thereon, wherein at least some of the digital objects being each associated with a pseudo-anchor text; and wherein the computer readable instructions, when executed by the one or more processors, cause the one or more processors to perform the following acts:
extracting from a digital corpus an object, wherein extracting the object comprises:
  extracting from the digital corpus a plurality of pseudo-URLs;
  constructing a plurality of feature strings from the plurality of pseudo-URLs, at least one feature string for each pseudo-URL;
  applying a hash function to the plurality of feature strings to calculate a hash value of each feature string;
  placing the plurality of pseudo-URLs into slots of a hash-table according to the hash value of the at least one feature string of each pseudo-URL, such that each slot of the hash table contains pseudo-URLs each having at least one feature string with the same hash value;
  calculating similarity of the pseudo-URLs in each slot of the hash-table using a similarity function;
  selecting pseudo-URLs having similarity that satisfies a specified threshold to be members of a subgroup of pseudo-URLs; and
  associating the subgroup of the pseudo-URLs with the object;
receiving a search query, and
ranking at least one of the digital objects at least partially based on the pseudo-anchor text.
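The ranking act at the end of claim 19 might look like the sketch below. The claim only requires that ranking be "at least partially based on the pseudo-anchor text"; the linear combination of term counts, the `anchor_weight` parameter, and the object dictionary layout are assumptions for illustration, not the scoring function of the patent.

```python
import re
from collections import Counter


def terms(text):
    """Lowercased alphanumeric terms of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_score(query_terms, text):
    """Count how often the query terms occur in `text`."""
    counts = Counter(terms(text))
    return sum(counts[t] for t in query_terms)


def rank(query, objects, anchor_weight=2.0):
    """Rank objects by matches in their own text plus weighted matches in
    their pseudo-anchor text; objects with no match are dropped."""
    query_terms = terms(query)
    scored = []
    for obj in objects:
        score = (term_score(query_terms, obj["text"])
                 + anchor_weight * term_score(query_terms, obj["pseudo_anchor_text"]))
        if score > 0:
            scored.append((score, obj["id"]))
    scored.sort(key=lambda item: -item[0])  # stable: ties keep input order
    return [obj_id for _, obj_id in scored]


# Usage with made-up objects.
objects = [
    {"id": "A", "text": "Fast indexing of documents",
     "pseudo_anchor_text": "efficient hashing for large-scale search"},
    {"id": "B", "text": "Query log analysis", "pseudo_anchor_text": ""},
    {"id": "C", "text": "Search engines", "pseudo_anchor_text": ""},
]
results = rank("hashing search", objects)
```

Here object A matches the query only through its pseudo-anchor text, which is the situation the abstract targets: objects whose own text lacks the words that other documents use to describe them.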
Specification