Automatic metadata identification
Abstract
A system identifies metadata associated with a document by capturing text of a document and comparing the text of the document with a collection of metadata records. Sets of matches between the text of the document and at least one record in the collection of metadata records may be identified, where each set of matches corresponds to a metadata record in the collection of metadata records. Metadata records corresponding to each set of matches may be scored. At least one of the metadata records may be identified based on the scores of the metadata records. The at least one identified metadata record may be associated with the document.
30 Claims
1. A method performed by one or more processors associated with one or more network devices, the method comprising:
capturing text of a document;
comparing the text of the document to content of each of a plurality of metadata records, each of the plurality of metadata records storing information associated with a particular one of a plurality of documents that differs from the document;
selecting, based on comparing the text of the document to the content, one or more of the plurality of metadata records, where, for each of the selected metadata records, a portion of the associated content corresponds to at least a portion of the text of the document;
scoring each of the selected metadata records, including calculating a score representing a correspondence between the text of the document and the content of the respective one of the selected metadata records, where scoring each of the selected metadata records further includes:
calculating a first probability associated with a likelihood of one or more common phrases, that appear in both the text of the document and the content of the one of the selected metadata records, also appearing in the contents of the plurality of metadata records,
calculating a second probability associated with a likelihood of the one or more common phrases appearing in text of the plurality of documents, and
scoring the one of the selected metadata records based on the first probability and second probability;
ranking the selected metadata records based on scoring the selected metadata records; and
storing an association between the document and a particular number of highest ranking ones of the selected metadata records.
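The scoring, ranking, and storing steps of claim 1 can be illustrated with a minimal sketch. The claim does not say how the two probabilities are combined, so the formula below (summing the negative log of their product, which makes rarer shared phrases count for more) is an assumption made for illustration only. The function names, the add-one smoothing, and the sample data are likewise hypothetical.

```python
import math

def phrase_probability(phrase, texts):
    """Smoothed fraction of texts that contain the phrase (never exactly 0 or 1)."""
    hits = sum(1 for t in texts if phrase in t)
    return (hits + 1) / (len(texts) + 2)  # add-one smoothing: an assumption

def score_record(common_phrases, record_contents, document_texts):
    """Score one selected record from the phrases it shares with the document."""
    score = 0.0
    for phrase in common_phrases:
        p_records = phrase_probability(phrase, record_contents)   # first probability
        p_documents = phrase_probability(phrase, document_texts)  # second probability
        # Assumed combination: rare phrases are stronger evidence of a match.
        score += -math.log(p_records * p_documents)
    return score

def top_records(scores, n):
    """Rank record ids by score, highest first, and keep the top n."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical collection: contents of the metadata records and text of their documents.
record_contents = ["moby dick by herman melville",
                   "typee by herman melville",
                   "emma by jane austen"]
document_texts = ["moby dick herman melville call me ishmael",
                  "typee herman melville",
                  "emma jane austen"]

rare = score_record(["moby dick"], record_contents, document_texts)
common = score_record(["herman melville"], record_contents, document_texts)
best = top_records({"r1": rare, "r2": common}, 1)
```

Under this assumed formula, a phrase that appears in few records and few documents ("moby dick") contributes more to the score than one that appears in many ("herman melville"), so `best` holds the record matched on the rarer phrase.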
20. A system, comprising:
one or more processors to:
capture an image of a document;
recognize, based on the image, text of the document;
compare the text of the document to content of each of a plurality of metadata records associated with a plurality of captured documents that differ from the document;
identify sets of matching phrases that occur in both the text of the document and the content of one or more of the metadata records;
calculate a probability of finding each phrase in the set of matching phrases in the plurality of captured documents;
calculate a probability of finding each phrase in the set of matching phrases in the plurality of metadata records;
score each set of matching phrases based on the calculated probability of finding each of the phrases, in the set of matching phrases, in the plurality of captured documents, and the calculated probability of finding each of the phrases in the set of matching phrases in the plurality of metadata records; and
link at least one selected metadata record, from the plurality of metadata records, to the document based on the scoring of the sets of matching phrases.
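The "identify sets of matching phrases" step of claim 20 can be sketched as follows. The claim does not define what counts as a phrase, so treating a phrase as a lowercase word n-gram is an assumption, as are the function names and the record ids. The image capture and text recognition steps are assumed to have already produced `document_text`.

```python
def ngrams(text, n=2):
    """Set of lowercase word n-grams in the text (empty if the text is too short)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def matching_phrase_sets(document_text, records, n=2):
    """Map each record id to the set of phrases it shares with the document.

    Records that share no phrase with the document are left out.
    """
    doc_phrases = ngrams(document_text, n)
    matches = {}
    for record_id, content in records.items():
        common = doc_phrases & ngrams(content, n)
        if common:
            matches[record_id] = common
    return matches

# Hypothetical metadata records keyed by id.
records = {"r1": "Moby Dick by Herman Melville",
           "r2": "Pride and Prejudice"}
document_text = "MOBY DICK or the whale by herman melville"
matches = matching_phrase_sets(document_text, records)
```

Each resulting set corresponds to exactly one metadata record, which is the unit the later scoring and linking steps operate on.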
22. A system, comprising:
a first memory to store metadata records;
a second memory to store text of at least one page of a document; and
a processor to:
identify sets of matching phrases included in both the text of the at least one page of the document and the stored metadata records, where each of the sets of matching phrases is associated with one of the metadata records,
score each of the sets of matching phrases based on probabilities of each of the matching phrases, included in the respective set of matching phrases, appearing, respectively, in a randomly selected one of the stored metadata records and in a randomly selected one of a plurality of documents associated with the stored metadata records, where the plurality of documents differ from the document,
select at least one of the stored metadata records, associated with the sets of matching phrases, based on the scoring of each of the sets of matching phrases, and
store information to associate the document with the at least one selected metadata record in the first memory or the second memory.
25. A non-transitory computer-readable memory device that stores instructions executable by at least one processor, the computer-readable memory device comprising:
one or more instructions for receiving text of a document;
one or more instructions for identifying a particular page of the document based on the text of the document;
one or more instructions for identifying one or more of a plurality of metadata records, based on a comparison between text of the particular page and information in the plurality of metadata records;
one or more instructions for scoring each of the identified metadata records based on probabilities of one or more common phrases, that appear in both the text of the document and content of the respective identified metadata record, also appearing, respectively, in a randomly selected one of the plurality of metadata records and in a randomly selected one of a plurality of documents associated with the plurality of metadata records, where the plurality of documents differ from the document;
one or more instructions for selecting at least one highest scoring identified metadata record of the identified metadata records; and
one or more instructions for associating, based on the scoring, the selected at least one of the identified metadata records with the document.
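Claim 25 adds a step that picks a particular page of the document from its text before any comparison with the metadata records. The claim does not say how that page is chosen. The sketch below assumes, purely for illustration, that the page of interest is the one mentioning the most bibliographic cue words; the cue list and the function name are hypothetical.

```python
# Hypothetical cue words; the claim does not specify any selection criterion.
CUE_WORDS = ("copyright", "isbn", "published", "edition")

def pick_page(pages):
    """Return the index of the page whose text mentions the most cue words.

    Ties go to the earliest page. Raises ValueError if pages is empty.
    """
    def cue_count(text):
        lowered = text.lower()
        return sum(lowered.count(word) for word in CUE_WORDS)
    return max(range(len(pages)), key=lambda i: cue_count(pages[i]))

pages = ["Chapter 1. Loomings.",
         "Copyright 1851. Published in New York. First edition.",
         "Index"]
page_index = pick_page(pages)
```

Only the text of `pages[page_index]` would then be compared with the metadata records, which keeps the comparison focused on the page most likely to carry bibliographic information.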
Specification