Finding selected character strings in text and providing information relating to the selected character strings
Abstract
Selected character strings are automatically found by performing an automatic search of a text to find character strings that match any of a list of selected strings. The automatic search includes a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending. Each iteration also finds a starting point for the next iteration that is a probable string beginning. The selected strings can be words and multiple word expressions, in which case probable string endings and beginnings are word boundaries. A finite state lexicon, such as a finite state transducer or a finite state automaton, can be used to determine whether character strings match the list of selected strings. A tokenizing automaton can be used to find starting points.
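The search loop summarized above can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a plain character trie as a stand-in for the finite state lexicon, treats any transition between alphanumeric and non-alphanumeric characters as a word boundary, and starts one iteration at every word beginning. The names `build_lexicon`, `word_starts`, `ends_at_boundary`, and `find_selected` are hypothetical.

```python
def build_lexicon(selected):
    """Build a character trie (assumed stand-in for the finite state lexicon).
    The key None holds acceptance data, reachable only at the end of a selected string."""
    root = {}
    for s in selected:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = s
    return root


def word_starts(text):
    """Yield probable string beginnings: positions where a word starts."""
    for i, ch in enumerate(text):
        if ch.isalnum() and (i == 0 or not text[i - 1].isalnum()):
            yield i


def ends_at_boundary(text, end):
    """A probable string ending: end of text, or the next character is not a word character."""
    return end == len(text) or not text[end].isalnum()


def find_selected(text, selected):
    """Return (start, end, string) for the longest selected string at each word start."""
    lexicon = build_lexicon(selected)
    matches = []
    for start in word_starts(text):          # one iteration per probable beginning
        node, longest = lexicon, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:                 # no transition: stop this iteration
                break
            # Accept only where acceptance data exists AND a word ends here.
            if None in node and ends_at_boundary(text, i + 1):
                longest = (start, i + 1, node[None])
        if longest is not None:
            matches.append(longest)
    return matches


text = "The finite state transducer and a finite state automaton differ from a state machine."
selected = ["finite state", "finite state transducer", "state machine", "auto"]
for start, end, s in find_selected(text, selected):
    print(start, end, s)
```

In this sketch the entry "auto" is never reported inside "automaton", because the match does not end at a word boundary, and the multi-word expression "finite state transducer" is preferred over its prefix "finite state" when both match at the same starting point.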
18 Claims
-
1. A method of automatically finding selected character strings in text and providing information relating to the selected character strings, the method comprising:
(A) automatically searching a text to find character strings that match any of a list of selected strings;
the act of automatically searching comprising a series of iterations, each iteration having a starting point in the text, each iteration comprising:
(A1) determining whether the iteration's starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending;
said determining further comprising:
accessing a finite state data structure with a character string following the iteration's starting point;
the finite state data structure including acceptance data that can be accessed at the end of one of the selected strings, and determining that the character string matches one of the selected strings if, at the end of the character string, the acceptance data can be accessed;
(A2) finding a starting point for a next iteration in the series, the next iteration's starting point being a probable string beginning; and
(A3) if the iteration's starting point is followed by a character string matching one of the selected strings and ending at a probable string ending, performing an operation relating to the matching character string;
in which the finite state data structure is a finite state transducer with a string matching level and an information output level;
the act of accessing the finite state data structure with the character string being performed at the string matching level;
the finite state transducer providing information output data at the information output level in response to the character string; and
in which the information output data is access data for accessing information relating to the matching character string.
-
performing tokenization on the text to find the next iteration's starting point.
-
7. The method of claim 6 in which tokenization is performed using a finite state tokenizer.
-
8. The method of claim 1 in which (A2) comprises:
performing tokenization on the text to find the next iteration's starting point.
-
9. The method of claim 8 in which tokenization is performed using a finite state tokenizer.
-
10. The method of claim 1 in which each of the selected strings is a word or a multi-word expression.
-
12. The method of claim 1 in which (A3) comprises:
determining whether the matching character string is the longest character string that begins at the iteration's starting point, that ends at a probable string ending, and that matches one of the selected strings.
-
13. The method of claim 1 in which (A3) comprises:
determining whether to retrieve information associated with the matching character string.
-
14. The method of claim 13 in which the associated information is an annotation and in which (A3) further comprises:
upon determining to retrieve the associated information, inserting the annotation into the text at a position associated with the matching character string;
the series of iterations producing an annotated version of the text;
the method further comprising:
(B) providing the annotated version of the text as output.
-
2. A method of automatically finding selected character strings in text and providing information relating to the selected character strings, the method comprising:
(A) automatically searching a text to find character strings that match any of a list of selected strings;
the act of automatically searching comprising a series of iterations, each iteration having a starting point in the text, each iteration comprising:
(A1) determining whether the iteration's starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending;
said determining further comprising:
accessing a finite state data structure with a character string following the iteration's starting point;
the finite state data structure including acceptance data that can be accessed at the end of one of the selected strings, and determining that the character string matches one of the selected strings if, at the end of the character string, the acceptance data can be accessed;
(A2) finding a starting point for a next iteration in the series, the next iteration's starting point being a probable string beginning; and
(A3) if the iteration's starting point is followed by a character string matching one of the selected strings and ending at a probable string ending, performing an operation relating to the matching character string;
in which the finite state data structure is a finite state automaton that can be accessed with any of a set of acceptable character strings to obtain a counterpart number to which the acceptable character string maps;
each acceptable character string in the set being one of the list of selected strings; and
further in which the counterpart number is for accessing information relating to the matching character string.
-
determining whether the matching character string is the longest character string that begins at the iteration's starting point, that ends at a probable string ending, and that matches one of the selected strings.
-
4. The method of claim 2 in which (A3) comprises:
determining whether to retrieve information associated with the matching character string.
-
5. The method of claim 4 in which the associated information is an annotation and in which (A3) further comprises:
upon determining to retrieve the associated information, inserting the annotation into the text at a position associated with the matching character string;
the series of iterations producing an annotated version of the text;
the method further comprising:
(B) providing the annotated version of the text as output.
-
15. The method of claim 2 in which (A2) comprises:
performing tokenization on the text to find the next iteration's starting point.
-
16. The method of claim 15 in which tokenization is performed using a finite state tokenizer.
-
17. The method of claim 2 in which each of the selected strings is a word or a multi-word expression.
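
The counterpart-number idea in claim 2 can be illustrated with a small sketch. The claims do not say how the number is computed, so the scheme below is an assumption: each acceptable string maps to its rank in sorted order, computed by counting the accepted strings that branch off earlier along the path. A trie again stands in for the finite state automaton, and `build_automaton`, `counterpart_number`, and the `info` table are hypothetical names.

```python
def build_automaton(selected):
    """Trie in which each state records how many accepted strings pass through it."""
    def new_state():
        return {"next": {}, "final": False, "count": 0}

    root = new_state()
    for s in set(selected):
        node = root
        for ch in s:
            node = node["next"].setdefault(ch, new_state())
            node["count"] += 1
        node["final"] = True
    return root


def counterpart_number(root, s):
    """Return the number s maps to, or None if s is not an acceptable string.
    Assumed scheme: the rank of s among all acceptable strings in sorted order."""
    node, number = root, 0
    for ch in s:
        if ch not in node["next"]:
            return None
        if node["final"]:
            number += 1                      # a shorter accepted prefix sorts first
        for other, child in node["next"].items():
            if other < ch:
                number += child["count"]     # every string down an earlier branch sorts first
        node = node["next"][ch]
    return number if node["final"] else None


selected = ["state", "finite state", "state machine", "automaton"]
automaton = build_automaton(selected)

# The number indexes a separate table of related information (placeholder values).
info = ["note on automaton", "note on finite state", "note on state", "note on state machine"]
n = counterpart_number(automaton, "state machine")
if n is not None:
    print(n, info[n])
```

Keeping the related information in a separate table indexed by number means the automaton itself only has to recognize strings; the table can be replaced without rebuilding it.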
-
11. A system for automatically finding selected character strings in text and providing information relating to the selected character strings, the system comprising:
text data defining a text;
a list of selected character strings; and
a processor connected for accessing the text data and the list of selected character strings;
the processor automatically searching the text to find character strings that match any of the list of selected strings;
in automatically searching, the processor performing a series of iterations, each having a starting point in the text;
in each iteration, the processor operating to:
determine whether the iteration's starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending; and
in which the processor further operates to:
access a finite state data structure with a character string following the iteration's starting point;
the finite state data structure including acceptance data that can be accessed at the end of one of the selected strings, and determine that the character string matches one of the selected strings if, at the end of the character string, the acceptance data can be accessed;
find a starting point for a next iteration in the series, the next iteration's starting point being a probable string beginning; and
if the iteration's starting point is followed by a character string matching one of the selected strings and ending at a probable string ending, performing an operation relating to the matching character string;
in which the finite state data structure is a finite state transducer with a string matching level and an information output level, the act of accessing the finite state data structure with the character string being performed at the string matching level;
the finite state transducer providing information output data at the information output level in response to the character string; and
in which the information output data is access data for accessing information relating to the matching character string.
-
18. A system for automatically finding selected character strings in text and providing information relating to the selected character strings, the system comprising:
text data defining a text;
a list of selected character strings; and
a processor connected for accessing the text data and the list of selected character strings;
the processor automatically searching the text to find character strings that match any of the list of selected strings;
in automatically searching, the processor performing a series of iterations, each having a starting point in the text;
in each iteration, the processor operating to:
determine whether the iteration's starting point is followed by a character string that matches any of the list of selected strings and that ends at a probable string ending; and
in which the processor further operates to:
access a finite state data structure with a character string following the iteration's starting point;
the finite state data structure including acceptance data that can be accessed at the end of one of the selected strings, and determine that the character string matches one of the selected strings if, at the end of the character string, the acceptance data can be accessed;
find a starting point for a next iteration in the series, the next iteration's starting point being a probable string beginning; and
if the iteration's starting point is followed by a character string matching one of the selected strings and ending at a probable string ending, performing an operation relating to the matching character string;
in which the finite state data structure is a finite state automaton that can be accessed with any of a set of acceptable character strings to obtain a counterpart number to which the acceptable character string maps;
each acceptable character string in the set being one of the list of selected strings; and
further in which the counterpart number is for accessing information relating to the matching character string.
-
Specification