Method for segmentation of text
Abstract
A computerized method, and a corresponding apparatus, for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses is disclosed. In the method, a predetermined number of consecutive text elements of said stream of text elements are scanned, starting from a given position. The predetermined number of consecutive text elements are compared with each pattern of a set of patterns for beginnings of initial clauses, and a beginning of an initial clause is identified in the predetermined number of consecutive text elements if they match one pattern of the set of patterns for beginnings of initial clauses. The given position is then moved at least one position forward, and the scanning, comparison and identification are repeated.
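The scan, compare, identify and advance loop described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the patent: the (token, lemma, tag) element layout, the predicate-based pattern representation, the tag names, and the choice to report the window start as the clause beginning are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed element layout: (token, lemma, part-of-speech tag).
Element = Tuple[str, str, str]
# Assumed pattern layout: one predicate per element position.
Pattern = Sequence[Callable[[Element], bool]]


def has_tag(tag: str) -> Callable[[Element], bool]:
    """Build a predicate that checks the tag field of an element."""
    return lambda element: element[2] == tag


def find_clause_beginnings(stream: List[Element],
                           patterns: List[Pattern],
                           window: int) -> List[int]:
    """Return every position whose window matches at least one pattern."""
    hits = []
    for pos in range(len(stream)):          # move one position forward each time
        chunk = stream[pos:pos + window]    # scan a fixed number of elements
        for pattern in patterns:            # compare with each pattern
            if len(pattern) <= len(chunk) and all(
                    pred(el) for pred, el in zip(pattern, chunk)):
                hits.append(pos)            # identify a beginning
                break
    return hits


# Hypothetical pattern: a conjunction followed by a pronoun starts a clause.
patterns = [[has_tag("CONJ"), has_tag("PRON")]]
stream = [("I", "i", "PRON"), ("left", "leave", "VERB"),
          ("because", "because", "CONJ"), ("she", "she", "PRON"),
          ("arrived", "arrive", "VERB")]
print(find_clause_beginnings(stream, patterns, window=2))  # [2]
```

Patterns longer than the remaining window are skipped here, which is one way to respect a fixed window size near the end of the stream.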
34 Claims
-
1. A method for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses, using a computer, comprising the steps of:
-
scanning, from a given position, a predetermined number of consecutive text elements of said stream of text elements;
comparing said predetermined number of consecutive text elements with each pattern of a set of patterns for beginnings of initial clauses;
identifying a beginning of an initial clause in said predetermined number of consecutive text elements if said predetermined number of consecutive text elements match one pattern of said set of patterns for beginnings of initial clauses; and
repeating the steps of scanning, comparing and identifying, wherein said given position is moved at least one position forward between each repetition.
-
2. The method according to claim 1, further comprising the step of:
inserting a marker for begin initial clause into said predetermined number of consecutive text elements in response to an identified beginning of an initial clause in the step of identifying.
-
-
3. The method according to claim 2, wherein each pattern of said set of patterns for beginnings of initial clauses is associated with an action, and wherein, in the step of inserting, said marker is inserted into said predetermined number of consecutive text elements in accordance with the action associated with said one pattern of said set of patterns for beginnings of initial clauses.
-
4. The method according to claim 3, wherein, in the step of inserting, said marker is inserted into said predetermined number of consecutive text elements in a position determined by the action associated with said one pattern of said set of patterns for beginnings of initial clauses.
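One way to picture a pattern whose associated action determines where the marker goes is to store an insertion offset next to each pattern. The Python sketch below is only an illustration under assumptions: the rule layout (tag sequence plus offset), the marker string `<clause>`, and the tag names are hypothetical and do not come from the claims.

```python
from typing import List, Tuple

BEGIN = "<clause>"  # hypothetical marker for "begin initial clause"

# Assumed rule layout: (tag sequence to match, offset of the marker
# relative to the start of the matched window). The offset plays the
# role of the "action" associated with the pattern.
Rule = Tuple[Tuple[str, ...], int]


def insert_begin_markers(tags: List[str], rules: List[Rule]) -> List[str]:
    """Insert BEGIN at the position chosen by the matching rule's offset."""
    insert_at = set()
    for pos in range(len(tags)):
        for pattern, offset in rules:
            if tuple(tags[pos:pos + len(pattern)]) == pattern:
                insert_at.add(pos + offset)
                break
    out = []
    for i, tag in enumerate(tags):
        if i in insert_at:
            out.append(BEGIN)
        out.append(tag)
    return out


# Hypothetical rule: after a comma, a conjunction opens a new clause,
# so the marker goes one element into the window (before the conjunction).
rules = [(("COMMA", "CONJ"), 1)]
tags = ["PRON", "VERB", "COMMA", "CONJ", "PRON", "VERB"]
print(insert_begin_markers(tags, rules))
# ['PRON', 'VERB', 'COMMA', '<clause>', 'CONJ', 'PRON', 'VERB']
```

Collecting the insertion positions first and writing the markers afterwards keeps the matching step independent of markers already inserted; whether the claimed method matches against inserted markers is not stated in the claims.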
-
5. The method according to claim 4, further comprising the step of:
storing said stream of text elements, including the markers inserted in the step of inserting, in an electronic storage medium.
-
6. The method according to claim 2, further comprising the step of:
indicating, for each marker for begin initial clause, the pattern of said patterns for beginnings of clauses to which the marker corresponds.
-
7. The method according to claim 2, wherein, in the step of inserting, a marker for end clause is inserted before every marker for begin clause, except before the first begin clause marker, and at the end of said analyzed text.
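The end-marker rule of claim 7 (an end marker before every begin marker except the first, and one at the end of the text) can be sketched in Python as follows. The marker strings are hypothetical, and appending the final end marker only when at least one begin marker was seen is an assumption of this sketch rather than something the claim states.

```python
from typing import List


def add_end_markers(elements: List[str],
                    begin: str = "<clause>",
                    end: str = "</clause>") -> List[str]:
    """Close each clause just before the next begin marker and at the end."""
    out = []
    seen_begin = False
    for element in elements:
        if element == begin:
            if seen_begin:
                out.append(end)   # end marker before every begin but the first
            seen_begin = True
        out.append(element)
    if seen_begin:                # assumption: no end marker without a begin
        out.append(end)
    return out


marked = ["<clause>", "I", "left", "<clause>", "because", "she", "arrived"]
print(add_end_markers(marked))
# ['<clause>', 'I', 'left', '</clause>', '<clause>', 'because', 'she', 'arrived', '</clause>']
```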
-
8. The method according to claim 1, further comprising the steps of:
-
scanning the text elements of an initial clause;
comparing said text elements of said initial clause with each pattern of a set of patterns for multiple finite verbs;
identifying a beginning of a clause in said text elements of said initial clause if said text elements of said initial clause match one pattern of said set of patterns for multiple finite verbs; and
repeating the steps of scanning, comparing and identifying for each initial clause.
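The second pass of claim 8 applies a separate pattern set to each initial clause. A minimal Python sketch is given below; the tag names, the example pattern, and the offset-based way of locating the new clause beginning are assumptions for illustration and are not specified in the claim.

```python
from typing import Dict, List, Tuple

# Assumed rule layout: (tag sequence containing more than one finite verb,
# offset of the new clause beginning relative to the start of the match).
Rule = Tuple[Tuple[str, ...], int]


def find_inner_beginnings(clauses: List[List[str]],
                          rules: List[Rule]) -> Dict[int, List[int]]:
    """Map clause index -> positions inside that clause where a clause begins."""
    result: Dict[int, List[int]] = {}
    for index, clause in enumerate(clauses):       # repeat for each initial clause
        hits = []
        for pos in range(len(clause)):             # scan the clause
            for pattern, offset in rules:          # compare with each pattern
                if tuple(clause[pos:pos + len(pattern)]) == pattern:
                    hits.append(pos + offset)      # identify a beginning
                    break
        if hits:
            result[index] = hits
    return result


# Hypothetical pattern for "I saw people she knew": two finite verbs (VFIN)
# in one initial clause, with the second clause starting at the pronoun.
rules = [(("VFIN", "NOUN", "PRON", "VFIN"), 2)]
clauses = [["PRON", "VFIN", "NOUN", "PRON", "VFIN"],
           ["CONJ", "PRON", "VFIN"]]
print(find_inner_beginnings(clauses, rules))  # {0: [3]}
```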
-
-
9. The method according to claim 8, further comprising the step of:
inserting a marker for begin initial clause into said text elements of said initial clause in response to an identified beginning of an initial clause in the step of identifying.
-
10. The method according to claim 9, wherein each pattern of said set of patterns for multiple finite verbs is associated with an action, and wherein, in the step of inserting, said marker is inserted into said text elements of said initial clause in accordance with the action associated with said one pattern of said set of patterns for multiple finite verbs.
-
11. The method according to claim 10, wherein, in the step of inserting, said marker for begin initial clause is inserted into said text elements of said initial clause in a position determined by the action associated with said one pattern of said set of patterns for multiple finite verbs.
-
12. The method according to claim 11, further comprising the step of:
storing said stream of text elements, including the markers inserted in the step of inserting, in an electronic storage medium.
-
13. The method according to claim 9, wherein, in the step of inserting, a marker for end clause is inserted before every marker for begin clause, except before the first begin clause marker, and at the end of said analyzed text.
-
14. The method according to claim 1, wherein said stream of text elements comprising analyzed tokens is segmented into said initial clauses such that every token belongs to exactly one initial clause.
-
15. The method according to claim 1, wherein said analyzed tokens have only been assigned a unique analysis in the form of a morphosyntactic description and a lemma.
-
16. The method according to claim 15, wherein said morphosyntactic description comprises a part-of-speech and an inflectional form.
-
17. The method according to claim 1, wherein each pattern of said set of patterns comprises at most said predetermined number of text elements.
-
18. The method according to claim 1, wherein said predetermined number is adapted to a specific language or application.
-
19. The method according to claim 1, wherein a text element comprises either a token or a text structure marker.
-
20. The method according to claim 19, wherein the presence of a text structure marker marks the beginning or the end of some text unit, and a type of text structure marker marks a type of text unit, such as head, paragraph, sentence, clause, phrase or word.
-
21. The method of claim 20, wherein a text unit comprises one or more consecutive tokens.
-
22. The method according to claim 1, wherein a text element that is a token and occurs in a pattern may refer to:
-
the token itself, the lemma of the token, or the morphosyntactic description of the token.
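A pattern element that can refer to the token itself, its lemma, or its morphosyntactic description might be modelled as below. This is a hedged Python sketch: the class names, field names, and tag strings are hypothetical, and allowing several fields to be constrained at once is an assumption that goes beyond what the claim states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str    # the token as it appears in the text
    lemma: str   # its base form
    msd: str     # morphosyntactic description, e.g. part of speech + inflection


@dataclass(frozen=True)
class PatternElement:
    # Any field left as None is unconstrained (assumed convention).
    form: Optional[str] = None
    lemma: Optional[str] = None
    msd: Optional[str] = None

    def matches(self, token: Token) -> bool:
        """True when every constrained field equals the token's field."""
        return ((self.form is None or self.form == token.form)
                and (self.lemma is None or self.lemma == token.lemma)
                and (self.msd is None or self.msd == token.msd))


token = Token(form="arrived", lemma="arrive", msd="VERB.PAST")
print(PatternElement(lemma="arrive").matches(token))    # True
print(PatternElement(form="arrives").matches(token))    # False
print(PatternElement(msd="VERB.PAST").matches(token))   # True
```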
-
-
33. A computer readable medium having computer-executable instructions for a general-purpose computer to perform the steps recited in claim 1.
-
34. A computer program comprising computer-executable instructions for performing the steps recited in claim 1.
-
23. An apparatus for segmentation of a stream of text elements comprising analyzed tokens into one or more initial clauses, comprising:
-
memory means arranged to store a set of patterns for beginnings of initial clauses;
scanning means arranged to scan a predetermined number of consecutive text elements of said stream of text elements;
comparing means arranged to compare said predetermined number of consecutive text elements of said stream of text elements with each pattern of said set of patterns for beginnings of initial clauses; and
matching means arranged to identify a match between said predetermined number of consecutive text elements and one pattern of said set of patterns for beginnings of initial clauses.
-
24. The apparatus according to claim 23, further comprising:
inserting means arranged to insert a marker for begin initial clause into said predetermined number of consecutive text elements in response to a match made by said matching means.
-
-
25. The apparatus according to claim 24, wherein said memory means are further arranged to store an action for each pattern of said set of patterns, and wherein said inserting means are arranged to insert said marker into said predetermined number of consecutive text elements in accordance with the action associated with said one pattern.
-
26. The apparatus according to claim 25, wherein said inserting means are arranged to insert said marker into said predetermined number of consecutive text elements in a position determined by the action associated with said one pattern.
-
27. The apparatus according to claim 26, wherein said memory means are further arranged to store said stream of text elements, including the markers inserted by said inserting means.
-
28. The apparatus according to claim 24, wherein said inserting means are further arranged to insert a marker for the pattern of said patterns for beginnings of clauses to which the marker for begin initial clause corresponds.
-
29. The apparatus according to claim 23, wherein:
-
said memory means are further arranged to store a set of patterns for multiple finite verbs;
said scanning means are further arranged to scan the text elements of an initial clause;
said comparing means are further arranged to compare said text elements of said initial clause with each pattern of said set of patterns for multiple finite verbs; and
said matching means are further arranged to identify a match between said text elements of said initial clause and one pattern of said set of patterns for multiple finite verbs.
-
-
30. The apparatus according to claim 29, further comprising:
inserting means arranged to insert a marker for begin initial clause into said predetermined number of consecutive text elements in response to a match made by said matching means.
-
31. The apparatus according to claim 30, wherein said memory means are further arranged to store an action for each pattern of said set of patterns for multiple finite verbs, and wherein said inserting means are arranged to insert said marker into said predetermined number of consecutive text elements in accordance with the action associated with said one pattern of said patterns for multiple finite verbs.
-
32. The apparatus according to claim 31, wherein said inserting means are arranged to insert said marker into said predetermined number of consecutive text elements in a position determined by the action associated with said one pattern of said set of patterns for multiple finite verbs.
Specification