Method and system for topical segmentation, segment significance and segment function
Abstract
A “domain-general” method for topical segmentation of a document input includes the steps of: extracting one or more selected terms from a document; linking occurrences of the extracted terms based upon the proximity of similar terms; and assigning weighted scores to paragraphs of the document input corresponding to the linked occurrences. In accordance with the present invention, the values of the assigned scores depend upon the type of the selected terms, e.g., common noun, proper noun, pronominal, and the position of the linked occurrences with respect to the paragraphs, e.g., front, during, rear, etc. Upon zero-sum normalization, the assigned scores represent the boundaries of the topical segments of the document input.
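The zero-sum normalization step can be illustrated with a minimal sketch. The abstract does not spell out the arithmetic, so the sketch assumes one plausible reading: the per-paragraph scores are shifted by their mean so that they sum to zero, and paragraphs left with a positive score are treated as candidate boundaries. The function names `zero_sum_normalize` and `candidate_boundaries`, the positive-score rule, and the sample scores are all hypothetical.

```python
from typing import List


def zero_sum_normalize(scores: List[float]) -> List[float]:
    """Shift scores so they sum to zero (assumed reading of the claim language)."""
    if not scores:
        return []
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


def candidate_boundaries(scores: List[float]) -> List[int]:
    """Indices of paragraphs whose normalized score is positive (assumed rule)."""
    return [i for i, s in enumerate(zero_sum_normalize(scores)) if s > 0]


# Hypothetical weighted scores for five paragraphs.
raw = [10.0, 0.0, 2.0, 8.0, 0.0]
normalized = zero_sum_normalize(raw)
boundaries = candidate_boundaries(raw)
print(normalized)   # [6.0, -4.0, -2.0, 4.0, -4.0]
print(boundaries)   # [0, 3]
```

Under these assumptions, paragraphs 0 and 3 stand out as likely segment starts because their scores remain above the document-wide average.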
35 Claims
1. A computer-based method for identifying topical segments of a document input, comprising:
extracting one or more selected terms from a document;
linking occurrences of said extracted terms based upon the proximity of similar terms;
assigning weighted scores to paragraphs of said document input corresponding to said linked occurrences, wherein said scores depend upon the type of said selected terms and the position of said linked occurrences with respect to said paragraphs, and wherein said scores define boundaries of said topical segments; and
zero-sum normalizing said assigned weighted scores to determine said topical boundaries.
determining a segment importance; and
determining a segment coverage.
12. The method according to claim 10, wherein determining said segment significance comprises:
computing a segment importance score;
computing a segment coverage score; and
summing said segment importance score and segment coverage score.
13. The method according to claim 12, wherein said step of computing said segment importance score for a selected one of said topical segments comprises:
computing TF*SF values corresponding to each of said terms within said selected topical segment, wherein TF is defined as a term frequency and SF is defined as a segment frequency; and
summing said TF*SF values to obtain a TF*SF sum, wherein said sum represents said segment importance score.
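The TF*SF sum recited in claim 13 can be sketched as follows. The claim only names TF as a term frequency and SF as a segment frequency, so the sketch assumes TF is the number of occurrences of a term inside the selected segment and SF is the number of segments in which the term appears. The function name `segment_importance` and the toy segments are hypothetical.

```python
from collections import Counter
from typing import List


def segment_importance(segments: List[List[str]], selected: int) -> int:
    """Sum TF*SF over the distinct terms of the selected segment.

    Assumptions: TF = occurrences of the term in the selected segment;
    SF = number of segments that contain the term at least once.
    """
    tf = Counter(segments[selected])
    score = 0
    for term, freq in tf.items():
        sf = sum(1 for seg in segments if term in seg)
        score += freq * sf
    return score


# Hypothetical document: three segments, each a list of extracted terms.
segments = [["court", "judge", "court"], ["judge", "jury"], ["weather"]]
importance = segment_importance(segments, 0)
print(importance)  # 4  (court: 2*1, judge: 1*2)
```

A normalized SF (for example, dividing by the total number of segments) would rescale the score without changing the ranking of segments within one document.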
14. The method according to claim 10, wherein said segment coverage is defined at least in part on the number of said linked occurrences within the same topical segment.
15. The method according to claim 10, wherein said step of computing said segment coverage score for a selected one of said topical segments comprises:
initializing segment counters to zero for each of the topical segments;
incrementing, for linked occurrences contained within said selected segment, a corresponding one of said segment counters by a predetermined amount;
incrementing one or more segment counters corresponding to non-selected segments by a predetermined amount only if said non-selected segments contain one or more of said linked occurrences contained within said selected segment;
summing all of said segment counters to obtain a segment counter sum, wherein said sum represents said segment coverage score.
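The counter procedure of claim 15 can be sketched as below. The claim leaves the "predetermined amount" and the exact granularity of the increments open, so the sketch assumes an increment of 1 per link for both the selected and the non-selected segments. The mapping `link_segments` (link name to the set of segment indices containing its occurrences), the function name, and the step parameters are hypothetical.

```python
from typing import Dict, List, Set


def segment_coverage(
    link_segments: Dict[str, Set[int]],
    num_segments: int,
    selected: int,
    own_step: float = 1.0,    # assumed "predetermined amount" for the selected segment
    other_step: float = 1.0,  # assumed "predetermined amount" for other segments
) -> float:
    """Sum of segment counters after processing links present in the selected segment."""
    counters: List[float] = [0.0] * num_segments  # initialize counters to zero
    for segs in link_segments.values():
        if selected not in segs:
            continue  # link has no occurrence in the selected segment
        counters[selected] += own_step
        for seg in segs:
            if seg != selected:
                counters[seg] += other_step  # link also reaches this segment
    return sum(counters)


# Hypothetical links over three segments.
links = {"court": {0, 1}, "judge": {0, 2}, "weather": {2}}
coverage = segment_coverage(links, num_segments=3, selected=0)
print(coverage)  # 4.0
```

Choosing a smaller `other_step` than `own_step` would favor segments whose links stay local; the claim does not state which weighting is intended.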
16. The method according to claim 1, further comprising the step of determining a segment function to measure the relevance of said topical segments with respect to said document input as a whole.
17. The method according to claim 16, wherein said step of determining a segment function comprises identifying one or more summary segments.
18. The method according to claim 16, wherein said step of determining a segment function comprises identifying one or more anecdotal segments.
19. The method according to claim 16, wherein said step of determining a segment function comprises identifying one or more support segments.
20. The method according to claim 1, wherein the linking step includes using at least a first linking distance for a first term type and a second linking distance for a second term type.
21. The method according to claim 3, wherein the linking step includes using a first linking distance for said proper noun phrases, a second linking distance for common noun phrases and a third linking distance for pronominal noun phrases.
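Linking with a separate distance per term type, as in claims 20 and 21, can be sketched as follows. The claims give neither the distance values nor the unit, so the sketch assumes distances measured in sentences and uses placeholder values. `LINK_DISTANCE`, `link_occurrences`, and the sample positions are hypothetical.

```python
from typing import Dict, List

# Placeholder linking distances in sentences; the claims do not specify values.
LINK_DISTANCE: Dict[str, int] = {"proper": 8, "common": 4, "pronominal": 1}


def link_occurrences(positions: List[int], term_type: str) -> List[List[int]]:
    """Group the sentence positions of one term into links.

    Two consecutive occurrences join the same link when the gap between
    them does not exceed the linking distance for the term type.
    """
    max_gap = LINK_DISTANCE[term_type]
    links: List[List[int]] = []
    for pos in sorted(positions):
        if links and pos - links[-1][-1] <= max_gap:
            links[-1].append(pos)   # close enough: extend the current link
        else:
            links.append([pos])     # too far: start a new link
    return links


positions = [1, 3, 10, 14, 30]
common_links = link_occurrences(positions, "common")
proper_links = link_occurrences(positions, "proper")
print(common_links)  # [[1, 3], [10, 14], [30]]
print(proper_links)  # [[1, 3, 10, 14], [30]]
```

The same occurrences yield longer links for the term type with the larger distance, which is the effect the per-type distances are meant to capture in this reading.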
22. A computer based method for automatically extracting significant topical information from a document, comprising:
extracting topical information from a document in accordance with specified categories of information;
linking occurrences of said extracted topical information based on the proximity of similar topical information;
assigning weighted scores to paragraphs of said document input corresponding to said linked occurrences, wherein said scores depend upon the type of said selected terms and the position of said linked occurrences with respect to said paragraphs, and wherein said scores represent boundaries of said topical segments;
zero-sum normalizing said assigned weighted scores to determine said topical boundaries;
determining topical segments within said document corresponding to said linked occurrences of said topical information; and
determining the significance of said topical segments.
determining a segment importance; and
determining a segment coverage.
26. The method according to claim 22, wherein said step of determining said segment significance comprises:
computing a segment importance score;
computing a segment coverage score; and
summing said segment importance score and segment coverage score.
27. The method according to claim 26, wherein said step of computing said segment importance score for a selected one of said topical segments comprises:
computing TF*SF values corresponding to each of said terms within said selected topical segment, wherein TF is defined as a term frequency and SF is defined as a segment frequency; and
summing said TF*SF values to obtain a TF*SF sum, wherein said sum represents said segment importance score.
28. The method according to claim 26, wherein said segment coverage is defined at least in part on the number of said linked occurrences within the same topical segment.
29. The method according to claim 26, wherein said step of computing said segment coverage score for a selected one of said topical segments comprises:
initializing segment counters to zero for each of the topical segments;
incrementing, for linked occurrences within said selected segment, a corresponding one of said segment counters by a predetermined amount;
incrementing one or more segment counters corresponding to non-selected segments by a predetermined amount only if said non-selected segments contain one or more of said linked occurrences contained within said selected segment;
summing all of said segment counters to obtain a segment counter sum, wherein said sum represents said segment coverage score.
30. The method according to claim 22, further comprising the step of determining a segment function to measure the relevance of said topical segments with respect to said document input as a whole.
31. The method according to claim 30, wherein said step of determining a segment function comprises identifying one or more summary segments.
32. The method according to claim 30, wherein said step of determining a segment function comprises identifying one or more anecdotal segments.
33. The method according to claim 30, wherein said step of determining a segment function comprises identifying one or more support segments.
34. A computer program for identifying topical segments of a document input, comprising:
means for extracting selected terms from a document;
means for linking occurrences of said extracted terms based upon the proximity of similar terms;
means for assigning weighted scores to paragraphs of said document input corresponding to said linked occurrences, wherein said scores depend upon the type of said selected terms and the position of said linked occurrences with respect to said paragraphs, and wherein said scores represent boundaries for said topical segments; and
means for zero-sum normalizing said assigned weighted scores to determine said topical boundaries.
35. A computer program for automatically extracting significant topical information from a document, comprising:
means for extracting topical information from a document in accordance with specified categories of information;
means for linking occurrences of said extracted topical information based on the proximity of similar topical information;
means for assigning weighted scores to paragraphs of said document input corresponding to said linked occurrences, wherein said scores depend upon the type of said selected terms and the position of said linked occurrences with respect to said paragraphs, and wherein said scores represent boundaries of said topical segments;
means for determining topical segments within said document corresponding to said linked occurrences of said topical information; and
means for determining the significance of said topical segments including determining a segment importance and determining a segment coverage.
Specification