Text summarization using part-of-speech
Abstract
Text is summarized using part-of-speech (POS) data indicating parts of speech for tokens in the text. The POS data can be obtained using input text data defining the text, such as by POS tagging. The POS data can be used to obtain group data indicating groups of tokens of the text, such as verb groups and noun groups. The group data can also indicate, within each group, any tokens that meet a POS-based removal criterion. The group data can be used to obtain summarized text data by removing tokens that meet the removal criterion. The original text may be obtained via scanner or video camera from a user's document, and may be recognized to obtain input text data. The summarized text may be output as text or as audio pronunciation using a speech synthesizer.
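The stages the abstract describes (tokenizing, POS tagging, grouping, and removal) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the toy `LEXICON`, the tag names, the `NOUN_GROUP` and `VERB_GROUP` sets, and the `REMOVABLE` criterion are all assumptions introduced here, and a real system would use a trained POS tagger.

```python
import re

# Hypothetical toy lexicon standing in for a real POS tagger (an assumption;
# the source does not name a tagger or a tag set).
LEXICON = {
    "the": "DET", "a": "DET", "very": "ADV", "quickly": "ADV",
    "old": "ADJ", "small": "ADJ", "dog": "NOUN", "cat": "NOUN",
    "chased": "VERB", "slept": "VERB",
}
REMOVABLE = {"DET", "ADJ", "ADV"}    # illustrative POS-based removal criterion
NOUN_GROUP = {"DET", "ADJ", "NOUN"}  # tags treated as part of a noun group
VERB_GROUP = {"ADV", "VERB"}         # tags treated as part of a verb group


def tokenize(text):
    """Split text into sentences, then each sentence into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[A-Za-z']+", s) for s in sentences if s]


def pos_tag(tokens):
    """Look each token up in the toy lexicon; unknown words default to NOUN."""
    return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]


def group(tagged):
    """Build group data: runs of consecutive tokens of one group type,
    with each token flagged if it meets the removal criterion."""
    groups = []
    for tok, pos in tagged:
        if pos in NOUN_GROUP:
            kind = "noun"
        elif pos in VERB_GROUP:
            kind = "verb"
        else:
            kind = "other"
        if not groups or groups[-1]["type"] != kind:
            groups.append({"type": kind, "tokens": []})
        groups[-1]["tokens"].append(
            {"text": tok, "pos": pos, "remove": pos in REMOVABLE}
        )
    return groups


def summarize(text):
    """Return one summarized string per sentence."""
    summaries = []
    for sentence in tokenize(text):
        kept = [t["text"]
                for g in group(pos_tag(sentence))
                for t in g["tokens"] if not t["remove"]]
        summaries.append(" ".join(kept))
    return summaries


print(summarize("The old dog quickly chased a small cat. The cat slept."))
```

With this toy lexicon the call prints `['dog chased cat', 'cat slept']`: each sentence keeps its nouns and verbs and loses its determiners, adjectives, and adverbs, so every summarized sentence has fewer tokens than the original.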
20 Claims
1. A method for automatically summarizing text, comprising:
(a) obtaining input text data defining a text that includes two or more tokens;
(b1) using the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
(b2) obtaining part-of-speech (POS) data indicating parts of speech for tokens in the text of each of the tokenized sentences from (b1);
(c) using the POS data for each tokenized sentence to obtain group data for the sentence indicating one or more groups of consecutive tokens of the text and indicating, within each group, any tokens that meet a POS-based removal criterion; and
(d) using the group data for each sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens in each group that are indicated as meeting the removal criterion are removed so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 2, 3, 4, 5
(a1) using an image capture device directed upon an image bearing portable medium containing text matter to generate image data representative of the text matter; and
(a2) converting the image data to machine readable text data, the text data being a representation of the text matter, the text data being said input text data.
4. The method of claim 1, further comprising:
(e) converting the summarized text data to audio data, the audio data being a representation of the pronunciation of the words in the summarized text data, and emitting sounds corresponding to said audio data.
5. The method of claim 1, wherein (b), (c), and (d) are performed in one pass through the input text data.
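One way to read the one-pass limitation of claim 5 is as a streaming loop that tags each token and applies the removal criterion as the token arrives, without storing intermediate results for the whole text. The sketch below assumes a hypothetical mini-lexicon and a criterion that depends only on a token's own tag, so the grouping step collapses into the per-token decision. Criteria that look at a token's position within its group would need a small per-group buffer.

```python
import re

# Hypothetical mini-lexicon and removal criterion, for illustration only.
LEXICON = {"the": "DET", "a": "DET", "red": "ADJ", "slowly": "ADV",
           "opened": "VERB", "door": "NOUN", "guard": "NOUN"}
REMOVABLE = {"DET", "ADJ", "ADV"}


def summarize_one_pass(text):
    """Yield one summarized sentence at a time while scanning text once."""
    kept = []  # tokens retained for the sentence in progress
    for match in re.finditer(r"[A-Za-z']+|[.!?]", text):
        token = match.group()
        if token in ".!?":  # sentence boundary: emit what was kept so far
            if kept:
                yield " ".join(kept)
            kept = []
            continue
        pos = LEXICON.get(token.lower(), "NOUN")  # tag the token on arrival
        if pos not in REMOVABLE:                  # apply the criterion at once
            kept.append(token)
    if kept:  # handle text that lacks a final sentence terminator
        yield " ".join(kept)


for summary in summarize_one_pass("The guard slowly opened the red door."):
    print(summary)
```

Because `summarize_one_pass` is a generator, a caller can start consuming summarized sentences before the rest of the input has been scanned.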
6. A method for automatically summarizing text, comprising:
(a) obtaining input text data defining a text that includes two or more tokens;
(b) using the input text data to obtain part-of-speech (POS) data indicating parts of speech for tokens in the text;
(c) using the POS data to obtain group data indicating one or more groups of consecutive tokens of the text and indicating, within each group, any tokens that meet a POS-based removal criterion; and
(d) using the group data to obtain summarized text data defining a summarized version of the text in which tokens in each group that are indicated as meeting the removal criterion are removed so that the number of tokens in the summarized version of the text is less than the number of tokens in the text;
wherein (c) comprises:
(c1) obtaining first group type data indicating one or more groups of consecutive tokens that have a first word group type, and, within each group having the first word group type, any tokens that meet a first POS-based removal criterion applicable to groups of the first word group type;
(c2) obtaining second group type data indicating one or more groups of consecutive tokens that have a second word group type, and, within each group having the second word group type, any tokens that meet a second POS-based removal criterion applicable to groups of the second word group type.
Dependent claims: 7, 8
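The two-group-type structure of claim 6 can be illustrated with noun groups and verb groups, each with its own removal criterion. The sketch assumes the tokens are already tagged. The tag names, the group definitions, and both criteria (keep only the last noun of a noun group; drop auxiliaries and adverbs from a verb group) are illustrative choices, not taken from the claims.

```python
# Tagged input is assumed; tag names and both criteria are illustrative.
NOUN_TAGS = {"DET", "ADJ", "NOUN"}   # first word group type
VERB_TAGS = {"AUX", "ADV", "VERB"}   # second word group type


def find_groups(tagged, member_tags):
    """Return (start, end) index pairs for maximal runs of member_tags."""
    spans, start = [], None
    for i, (_, pos) in enumerate(tagged):
        if pos in member_tags:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans


def noun_removals(tagged, span):
    """First criterion: keep only the last noun (the head) of a noun group."""
    start, end = span
    nouns = [i for i in range(start, end) if tagged[i][1] == "NOUN"]
    if not nouns:  # no head noun found: remove nothing from this group
        return set()
    return {i for i in range(start, end) if i != nouns[-1]}


def verb_removals(tagged, span):
    """Second criterion: drop auxiliaries and adverbs from a verb group."""
    start, end = span
    return {i for i in range(start, end) if tagged[i][1] in {"AUX", "ADV"}}


def summarize(tagged):
    """Apply each group type's own criterion, then drop the flagged tokens."""
    removed = set()
    for span in find_groups(tagged, NOUN_TAGS):
        removed |= noun_removals(tagged, span)
    for span in find_groups(tagged, VERB_TAGS):
        removed |= verb_removals(tagged, span)
    return [tok for i, (tok, _) in enumerate(tagged) if i not in removed]


sentence = [("The", "DET"), ("old", "ADJ"), ("stone", "NOUN"),
            ("bridge", "NOUN"), ("has", "AUX"), ("slowly", "ADV"),
            ("collapsed", "VERB"), ("into", "PREP"), ("the", "DET"),
            ("river", "NOUN")]
print(summarize(sentence))
```

For the sample sentence this prints `['bridge', 'collapsed', 'into', 'river']`. Keeping the two criteria in separate functions mirrors the claim's separation of first and second group type data, and it lets a criterion depend on a token's position within its group rather than on its tag alone.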
9. A system for automatically summarizing text, the system comprising:
input text data defining a text that includes two or more tokens; and
a processor connected for accessing the input text data;
the processor automatically summarizing the text;
in automatically summarizing, the processor operating to:
use the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
obtain part-of-speech (POS) data indicating parts of speech for tokens in the text of each of the tokenized sentences;
use the POS data for each tokenized sentence to obtain group data for the sentence indicating one or more groups of consecutive tokens of the text and indicating, within each group, any tokens that meet a POS-based removal criterion; and
use the group data for each sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens in each group that are indicated as meeting the removal criterion are removed so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 10
11. An article of manufacture for use in a system for automatically summarizing text;
the system including:
input text data defining a text that includes two or more tokens;
a storage medium access device; and
a processor connected for receiving data accessed on a storage medium by the storage medium access device and for accessing the input text data;
the article of manufacture comprising:
a storage medium; and
instruction data stored by the storage medium;
the instruction data indicating instructions the processor can execute;
the processor, in executing the instructions, automatically summarizing the text;
in automatically summarizing, the processor operating to:
use the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
obtain part-of-speech (POS) data indicating parts of speech for tokens in the text of each of the tokenized sentences;
use the POS data for each tokenized sentence to obtain group data for the sentence indicating one or more groups of consecutive tokens of the text and indicating, within each group, any tokens that meet a POS-based removal criterion; and
use the group data for each sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens in each group that are indicated as meeting the removal criterion are removed so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 12
13. A method for automatically summarizing text, comprising:
(A) receiving a signal from a user input device selecting one of a set of part-of-speech (POS) based removal criteria and obtaining input text data defining a text that includes two or more tokens;
(B1) using the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
(B2) obtaining POS data indicating parts of speech for tokens in the text of each of the tokenized sentences from (B1); and
(C) using the POS data for each tokenized sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens are removed in accordance with the selected POS-based criterion so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 14, 15, 16
(A1) displaying on a display device an image showing the set of POS based removal criteria; and
(A2) receiving the signal from the user input device, the signal selecting the selected POS based removal criterion.
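The selection step of claim 13, together with the display-and-select refinement in (A1) and (A2), amounts to choosing one predicate from a displayed set and applying it to the tagged tokens. In the sketch below the criteria names, the tag set, and the `menu` and `summarize` helpers are hypothetical, and an integer argument stands in for the signal from the user input device.

```python
# Hypothetical named criteria: each predicate returns True for a POS tag
# whose tokens should be removed. Names and tags are illustrative only.
CRITERIA = {
    "drop determiners": lambda pos: pos == "DET",
    "drop modifiers": lambda pos: pos in {"DET", "ADJ", "ADV"},
    "keep nouns and verbs": lambda pos: pos not in {"NOUN", "VERB"},
}


def menu():
    """Return the numbered list of criteria that a display could show."""
    return [f"{i}. {name}" for i, name in enumerate(CRITERIA, start=1)]


def summarize(tagged_sentence, choice):
    """Apply the criterion selected by its 1-based menu number."""
    names = list(CRITERIA)
    if not 1 <= choice <= len(names):
        raise ValueError(f"choice must be between 1 and {len(names)}")
    meets_criterion = CRITERIA[names[choice - 1]]
    return [tok for tok, pos in tagged_sentence if not meets_criterion(pos)]


tagged = [("The", "DET"), ("tired", "ADJ"), ("crew", "NOUN"),
          ("finally", "ADV"), ("left", "VERB"), ("in", "PREP"),
          ("silence", "NOUN")]
print("\n".join(menu()))
print(summarize(tagged, 2))
```

Selecting option 2 prints `['crew', 'left', 'in', 'silence']`, while option 3 would also drop the preposition. Storing the criteria as predicates keeps the summarizer itself unchanged when a criterion is added to the set.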
15. The method of claim 13, further comprising:
(D) converting the summarized text data to audio data, the audio data being a representation of the pronunciation of the words in the summarized text data, and emitting sounds corresponding to said audio data.
16. The method of claim 13, wherein (A), (B), and (C) are performed in one pass through the input text data.
17. A system for automatically summarizing text, the system comprising:
input text data defining a text that includes two or more tokens; and
a processor connected for accessing the input text data;
the processor automatically summarizing the text;
in automatically summarizing, the processor operating to:
receive a signal from a user input device selecting one of a set of part-of-speech (POS) based removal criteria;
use the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
obtain POS data indicating parts of speech for tokens in the text of each of the tokenized sentences; and
use the POS data for each tokenized sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens are removed in accordance with the selected POS-based criterion so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 18
19. An article of manufacture for use in a system for automatically summarizing text;
the system including:
input text data defining a text that includes two or more tokens;
a storage medium access device; and
a processor connected for receiving data accessed on a storage medium by the storage medium access device and for accessing the input text data;
the article of manufacture comprising:
a storage medium; and
instruction data stored by the storage medium;
the instruction data indicating instructions the processor can execute;
the processor, in executing the instructions, automatically summarizing the text;
in automatically summarizing, the processor operating to:
receive a signal from a user input device selecting one of a set of part-of-speech (POS) based removal criteria;
use the input text data to tokenize the text, the tokenized text including one or more tokenized sentences;
obtain POS data indicating parts of speech for tokens in the text of each of the tokenized sentences; and
use the POS data for each tokenized sentence to obtain summarized text data defining a summarized version of the text for the sentence in which tokens are removed in accordance with the selected POS-based criterion so that the number of tokens in the summarized version of the text for the sentence is less than the number of tokens in the text.
Dependent claims: 20
Specification