Method for recognizing compound terms in a document
First Claim
1. A method for identifying a compound term in a document that is represented as a stream of document terms, the method comprising the steps of:
scanning the stream of document terms for an initial term associated with the compound term;
when the initial term is identified, accessing a compound term template that includes content, retention, and token specifications for the compound term;
comparing the stream, beginning with the initial term, to the compound term template; and
adding to the document stream a tagged token indicated by the token and retention specifications of the template, when the stream matches the content specification of the template.
Abstract
A method is provided for identifying compound terms in a document that is represented by a stream of tokens. The stream of document tokens is scanned for an initial term associated with a compound term and a compound term template is accessed when the initial term is identified. The template includes content, retention, and token specifications for the compound term. The stream of tokens is compared with the template, and when the stream matches the content specification of the template, a token representing the compound term is tagged according to the retention specification and added to the stream of tokens. The tagged token is stopped according to the retention specification represented by its tag.
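The flow summarized in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the names `Template`, `TEMPLATES`, `find_compounds`, and `stop_tagged`, the tag values "keep" and "drop", and the choice to place the compound token directly after its component terms are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    content: tuple   # component terms, beginning with the initial term
    token: str       # token that represents the compound term
    retention: str   # hypothetical tag value: "keep" or "drop"


# Hypothetical template table, keyed by the initial term of each compound.
TEMPLATES = {
    "new": [Template(("new", "york"), "new_york", "keep")],
}


def find_compounds(stream, templates=TEMPLATES):
    """Scan a list of terms and return (token, tag) pairs.

    Ordinary terms carry the tag None; a compound token carries the
    retention tag taken from the template that matched.
    """
    out = []
    i = 0
    while i < len(stream):
        matched = None
        # Access templates only when the current term is an initial term.
        for template in templates.get(stream[i], []):
            n = len(template.content)
            if tuple(stream[i:i + n]) == template.content:
                matched = template
                break
        if matched is None:
            out.append((stream[i], None))
            i += 1
        else:
            # Assumption: component terms stay in the stream, and the
            # tagged compound token is added right after them.
            out.extend((term, None) for term in matched.content)
            out.append((matched.token, matched.retention))
            i += len(matched.content)
    return out


def stop_tagged(tagged):
    """Drop tokens whose tag says they should be stopped."""
    return [token for token, tag in tagged if tag != "drop"]
```

In this sketch, calling `find_compounds(["i", "love", "new", "york", "city"])` adds the pair `("new_york", "keep")` after the two component terms, and `stop_tagged` then filters the tagged stream according to those tags.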
26 Citations
12 Claims
1. A method for identifying a compound term in a document that is represented as a stream of document terms, the method comprising the steps of:
scanning the stream of document terms for an initial term associated with the compound term;
when the initial term is identified, accessing a compound term template that includes content, retention, and token specifications for the compound term;
comparing the stream, beginning with the initial term, to the compound term template; and
adding to the document stream a tagged token indicated by the token and retention specifications of the template, when the stream matches the content specification of the template.
Dependent claims: 2, 3, 4, 5, 6

7. A method for identifying a compound term in a document, using a data structure that represents the compound term, the method comprising the steps of:
converting the document into a stream of document terms;
scanning the stream of document terms for an initial term associated with the compound term;
when the initial term is identified, comparing the stream, beginning with the initial term, to content indications specified by the data structure; and
when the content indications are matched by the stream, tagging a compound term token in accordance with status indications specified in the data structure and adding the tagged compound term token to the stream.
Dependent claims: 8, 9, 10

11. A system for identifying compound terms in a document that is represented as a stream of document terms, the system comprising:
a comparison engine coupled to receive the stream of document terms, for comparing the received document terms to a data structure representing the compound term; and
a data structure representing the compound term and coupled to the comparison engine through a location derived from the initial term, the data structure including
a content specification for indicating a component term of the compound term,
a retention tag associated with the content specification for indicating a status of the component term in an index representation of the document; and
a token specification, associated with the content specification, for identifying a token to be added to the document stream when the content specification is met.

12. A method for identifying a compound term in a document, using a data structure that represents the compound term, the method comprising the steps of:
tokenizing the document into a stream of document terms;
detecting an initial term of the compound term in the stream of document terms;
identifying content and retention specifications for the compound term from the data structure;
adding a token representing the compound term to the stream when the specified content indication is matched by the stream; and
tagging the token according to the retention specification indicated in the data structure.

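The data structure recited in claim 11 can be pictured with a small Python sketch. It is an illustration only: `CompoundTemplate`, `TemplateTable`, and `index_terms` are hypothetical names, a dictionary keyed by the initial term stands in for "a location derived from the initial term", and a boolean per component term stands in for the retention tag.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CompoundTemplate:
    # Content specification paired with retention tags:
    # (component term, keep this term in the index representation?)
    components: Tuple[Tuple[str, bool], ...]
    # Token specification: the token added when the content is matched.
    token: str


class TemplateTable:
    """Templates reachable through a slot derived from the initial term."""

    def __init__(self) -> None:
        self._slots: Dict[str, List[CompoundTemplate]] = {}

    def add(self, template: CompoundTemplate) -> None:
        initial_term = template.components[0][0]
        self._slots.setdefault(initial_term, []).append(template)

    def lookup(self, initial_term: str) -> List[CompoundTemplate]:
        return self._slots.get(initial_term, [])


def index_terms(stream: List[str], table: TemplateTable) -> List[str]:
    """Build an index representation from a list of document terms."""
    index: List[str] = []
    i = 0
    while i < len(stream):
        hit = None
        for template in table.lookup(stream[i]):
            terms = [term for term, _ in template.components]
            if stream[i:i + len(terms)] == terms:
                hit = template
                break
        if hit is None:
            index.append(stream[i])
            i += 1
        else:
            # Keep only the component terms whose retention tag allows it,
            # then add the token named by the token specification.
            index.extend(term for term, keep in hit.components if keep)
            index.append(hit.token)
            i += len(hit.components)
    return index
```

With a template whose components are `("hot", False)` and `("dog", True)` and whose token is `"hot_dog"`, the term "hot" is left out of the index while "dog" and "hot_dog" are kept. Whether component terms are retained alongside the compound token is a per-template choice in this sketch.
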
Specification