Article and method of automatically filtering information retrieval results using text genre
Abstract
A method of filtering according to text genre the results of a topic search of a heterogeneous corpus of untagged, machine-readable texts. Because each text of the corpus has a topic and a text genre, the corpus includes multiple text genres and covers multiple topics. According to the method, a processor first searches the corpus for a first multiplicity of texts that have a first topic. Next, the processor identifies a first set of texts of the first multiplicity that are instances of a first text genre and identifies a second set of texts of the first multiplicity that are instances of a second text genre. Finally, the processor identifies to a computer user the first multiplicity of texts in an order based upon the first text genre and second text genre.
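The flow described in the abstract can be sketched roughly as follows. This is an illustration only, not the patented implementation: the `Text` class, the substring-based `search_topic`, the punctuation-based `toy_genre` classifier, and the sample corpus are all hypothetical stand-ins chosen to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Text:
    title: str
    body: str


def search_topic(corpus, topic):
    """Hypothetical topic search: case-insensitive substring match."""
    return [t for t in corpus if topic.lower() in t.body.lower()]


def toy_genre(text):
    """Stand-in genre classifier (assumption): '!' suggests an editorial."""
    return "editorial" if "!" in text.body else "press report"


def order_by_genre(hits, genre_of, genre_order):
    """Sort hits so genres appear in genre_order; unknown genres go last."""
    rank = {g: i for i, g in enumerate(genre_order)}
    return sorted(hits, key=lambda t: rank.get(genre_of(t), len(rank)))


corpus = [
    Text("A", "Rates must fall now!"),
    Text("B", "The central bank held rates steady on Tuesday."),
    Text("C", "Local team wins the cup."),
]
hits = search_topic(corpus, "rates")
ordered = order_by_genre(hits, toy_genre, ["press report", "editorial"])
print([t.title for t in ordered])  # ['B', 'A']
```

All topical hits are still presented; only their order depends on genre, which is the distinction the abstract draws between the topic search and the genre step.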
22 Claims
-
1. A processor implemented method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a text genre and a topic, the corpus including at least a first text genre and a second text genre, the corpus including a multiplicity of topics, the processor implemented method comprising the steps of:
-
a) searching the corpus for a first multiplicity of untagged texts that have a first topic;
b) identifying a first set of texts of the first multiplicity of untagged texts that are instances of the first text genre;
c) identifying a second set of texts of the first multiplicity of untagged texts that are instances of the second text genre;
d) identifying the first multiplicity of untagged texts to a computer user in an order based upon at least a first type and a second type of text genre.
-
2. The method of claim 1 wherein step b) comprises the steps of:
-
b1) for each text of the first multiplicity, generating a cue vector from the text, the cue vector representing occurrences in the text of a set of nonstructural, surface cues; and
b2) for each text of the first multiplicity, determining whether the text is an instance of the first text genre using the cue vector and a weighting vector associated with the first text genre.
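One way to picture steps b1) and b2) is a weighted score of the cue vector against the genre's weighting vector. The sketch below assumes a logistic model with a 0.5 threshold; the claim itself only requires that both vectors be used, so the scoring function, the cue meanings, and the weight values are illustrative assumptions.

```python
import math


def is_genre_instance(cue_vector, weights, bias=0.0, threshold=0.5):
    """Decide genre membership from a cue vector and a weighting vector.

    The logistic squashing and the threshold are assumptions made for
    this sketch; they are not taken from the claim text.
    """
    if len(cue_vector) != len(weights):
        raise ValueError("cue vector and weighting vector must match in length")
    score = sum(c * w for c, w in zip(cue_vector, weights)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    return probability >= threshold


# Hypothetical cues: [question marks per sentence,
#                     first-person pronouns per 100 words,
#                     dateline present (0 or 1)]
press_report_weights = [-1.0, -0.8, 2.5]

print(is_genre_instance([0.0, 0.2, 1.0], press_report_weights))  # True
print(is_genre_instance([0.6, 4.0, 0.0], press_report_weights))  # False
```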
-
-
3. The method of claim 1 wherein step b) comprises the steps of:
-
1) for each text of the first multiplicity, generating a cue vector from the text, the cue vector representing occurrences in the text of a set of nonstructural surface cues;
2) for each text of the first multiplicity, determining a relevancy to the text of each facet of a set of facets using the cue vector and a weighting vector associated with the facet; and
3) for each text of the first multiplicity, determining whether the text is an instance of the first text genre based upon the facets relevant to the text.
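The two-stage variant in claim 3 (cues to facets, then facets to genre) might look like the following. The facet names come from claim 8, but the dot-product test, the weights, and the `genre_profiles` mapping from facet values to genres are hypothetical choices for illustration.

```python
def facet_relevancies(cue_vector, facet_weights, threshold=0.0):
    """Map each facet to True/False via a dot product with its weighting vector."""
    return {
        facet: sum(c * w for c, w in zip(cue_vector, weights)) > threshold
        for facet, weights in facet_weights.items()
    }


def genres_from_facets(relevant, genre_profiles):
    """Return the genres whose required facet values all match the text."""
    return [
        genre
        for genre, profile in genre_profiles.items()
        if all(relevant.get(facet) == wanted for facet, wanted in profile.items())
    ]


# Hypothetical weighting vectors for two facets over three cues.
facet_weights = {
    "narrative": [1.5, -0.5, 0.0],
    "suasive": [-0.5, 2.0, 0.5],
}
# Hypothetical genre definitions expressed as required facet values.
genre_profiles = {
    "press report": {"narrative": True, "suasive": False},
    "editorial opinion": {"suasive": True},
}

cues = [0.2, 1.0, 0.4]
relevant = facet_relevancies(cues, facet_weights)
print(relevant)  # {'narrative': False, 'suasive': True}
print(genres_from_facets(relevant, genre_profiles))  # ['editorial opinion']
```

Routing through facets means a new genre can be described as a combination of existing facet values without training a separate weighting vector for it.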
-
-
4. The method of claim 2 wherein the set of nonstructural, surface cues includes a punctuational cue.
-
5. The method of claim 4 wherein the set of cues further includes at least a one of a lexical cue, a string recognizable constructional cue, a formulae cue and a deviation cue.
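As a rough illustration of the cue types named in claims 4 and 5, the sketch below computes two punctuational cues, one lexical cue, and one deviation cue using string operations only. The particular cues, the default `lexical_terms`, and the reading of "deviation cue" as the standard deviation of sentence length are assumptions, not definitions taken from the claims.

```python
import re
import statistics


def cue_vector(text, lexical_terms=("yesterday", "said")):
    """Build a small cue vector from surface features (no parsing or tagging)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    lengths = [len(s.split()) for s in sentences]
    return [
        text.count("?") / n_words,  # punctuational cue
        text.count("!") / n_words,  # punctuational cue
        sum(words.count(t) for t in lexical_terms) / n_words,  # lexical cue
        statistics.pstdev(lengths) if lengths else 0.0,  # deviation cue (assumed)
    ]


print(cue_vector("Who knew? The mayor said no. Nobody was surprised!"))
```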
-
6. The method of claim 3 wherein the set of nonstructural surface cues includes a punctuational cue.
-
7. The method of claim 6 wherein the set of nonstructural surface cues further includes at least a one of a lexical cue, a string recognizable constructional cue, a formulae cue and a deviation cue.
-
8. The method of claim 6 wherein the set of facets includes at least a one of a date facet, a narrative facet, a suasive facet, a fiction facet, a legal facet, a science and technical facet, and an author facet.
-
9. The method of claim 2 wherein the first text genre is a one of a press report genre, an Email genre, an editorial opinion genre, and a market analysis genre.
-
10. The method of claim 3 wherein the first text genre is a one of a press report genre, an Email genre, an editorial opinion genre, and a market analysis genre.
-
11. An article of manufacture comprising:
-
a) a memory; and
b) instructions stored in the memory for a method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a text genre and a topic, the corpus including at least a first text genre and a second text genre, the corpus including a multiplicity of topics, the method being implemented by a processor coupled to the memory, the method comprising the steps of:
1) searching the corpus for a first multiplicity of texts that have a first topic;
2) identifying a first set of texts of the first multiplicity of untagged texts that are instances of the first text genre;
3) identifying the first set of texts to a computer user.
-
-
12. A processor implemented method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a text genre and a topic, the corpus including a first multiplicity of text genres and a second multiplicity of topics, the processor implemented method comprising the steps of:
-
a) receiving from a computer user a search request for texts having a first topic and a first text genre, the search request also identifying a second text genre to be excluded;
b) identifying a third multiplicity of untagged texts of the corpus having the first topic;
c) determining a text genre of each text of the third multiplicity of untagged texts; and
d) identifying to the computer user those texts of the third multiplicity that are instances of the first text genre and not identifying any text of the third multiplicity that are instances of the second text genre.
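Step d) of claim 12 amounts to an include/exclude filter over texts that already match the topic. A minimal sketch, assuming genres have been determined elsewhere and that one text may be an instance of several genres (hence the set); the document identifiers and genre assignments are made up.

```python
def filter_by_genre(hits, genres_of, include, exclude):
    """Keep hits that are instances of `include` and not of `exclude`."""
    kept = []
    for text in hits:
        genres = genres_of(text)
        if include in genres and exclude not in genres:
            kept.append(text)
    return kept


# Hypothetical genre assignments for texts already matched on topic.
assigned = {
    "doc1": {"press report"},
    "doc2": {"editorial opinion"},
    "doc3": {"press report", "market analysis"},
}

result = filter_by_genre(
    list(assigned), assigned.get, "press report", "market analysis"
)
print(result)  # ['doc1']
```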
-
-
13. An article of manufacture comprising:
-
a) a memory; and
b) instructions stored in the memory for a method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a text genre and a topic, the corpus including a first multiplicity of text genres and a second multiplicity of topics, the method being implemented by a processor coupled to the memory, the method comprising the steps of:
1) receiving from a computer user a search request for texts having a first topic and a first text genre, the search request also identifying a second text genre to be excluded;
2) identifying a third multiplicity of untagged texts of the corpus having the first topic;
3) determining a text genre of each text of the third multiplicity of untagged texts; and
4) identifying to the computer user those texts of the third multiplicity of untagged texts that are instances of the first text genre and not identifying any text of the third multiplicity of untagged texts that are instances of the second text genre.
-
14. The article of claim 13 wherein step b3) comprises the substeps of:
-
A) for each text of the third multiplicity of untagged texts generating a cue vector from the text, the cue vector representing occurrences in the text of a first set of nonstructural, surface cues; and
B) for each text of the third multiplicity of untagged texts identifying a text genre from a second set of text genres using the cue vector and a weighting vector associated with each text genre.
-
-
15. The article of claim 13 wherein step b3) comprises the substeps of:
-
A) for each text of the third multiplicity of untagged texts generating a cue vector from the text, the cue vector representing occurrences in the text of a first set of nonstructural, surface cues;
B) for each text of the third multiplicity of untagged texts determining a relevancy to the text of each facet of a second set of facets using the cue vector and a weighting vector associated with each facet; and
C) for each text of the third multiplicity of untagged texts identifying relevant text genres from a third set of text genres based upon the facets relevant to the text.
-
-
16. The article of claim 14 wherein the first set of cues includes at least a one of either a punctuational cue, a lexical cue, a string recognizable constructional cue, a formulae cue and a deviation cue.
-
17. The article of claim 15 wherein the second set of facets includes at least a one of either a date facet, a narrative facet, a suasive facet, a fiction facet, a legal facet, a science and technical facet, and an author facet.
-
18. The article of claim 13 wherein the third set of text genres includes at least a one of either a press report genre, an Email genre, an editorial opinion genre, and a market analysis genre.
-
19. An article of manufacture comprising:
-
a) a memory; and
b) instructions stored in the memory for a method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a text genre and a topic, the corpus including a first multiplicity of text genres and a second multiplicity of topics, the method being implemented by a processor coupled to the memory, the method comprising the steps of:
1) receiving from a computer user a search request for texts having a first topic and a first text genre to be excluded;
2) identifying a third multiplicity of untagged texts of the corpus having the first topic;
3) determining a text genre of each text of the third multiplicity of untagged texts; and
4) identifying to the computer user those texts of the third multiplicity of untagged texts that have a text genre other than the first text genre.
-
-
20. An article of manufacture comprising:
-
a) a memory; and
b) instructions stored in the memory for a method of searching a heterogeneous corpus of untagged machine-readable texts, each text of the corpus having a topic and a facet value for each facet of a first multiplicity of facets, the corpus including a second multiplicity of topics, the method being implemented by a processor coupled to the memory, the method comprising the steps of:
1) receiving from a computer user a search request for texts having a first topic and a first value of a first facet of the first multiplicity of facets;
2) identifying a third multiplicity of untagged texts of the corpus having the first topic;
3) for each text of the third multiplicity of untagged texts determining a value for the first facet; and
4) identifying to the computer user those texts of the third multiplicity of untagged texts that have the first value of the first facet.
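Claim 20 filters on a facet value directly instead of on a genre. The sketch below shows one possible shape for such a request; the `FacetRequest` type, the substring topic match, and the toy rule for the "narrative" facet are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetRequest:
    topic: str
    facet: str
    value: str


def toy_facet_value(text, facet):
    """Stand-in facet evaluator (assumption): ' then ' marks a narrative text."""
    if facet == "narrative":
        return "yes" if " then " in text.lower() else "no"
    raise KeyError(f"unknown facet: {facet}")


def answer(request, corpus, facet_value_of):
    """Topic search first, then keep texts whose facet has the requested value."""
    topical = [t for t in corpus if request.topic.lower() in t.lower()]
    return [t for t in topical if facet_value_of(t, request.facet) == request.value]


corpus = [
    "The merger closed, then shares rose.",
    "Why the merger is a mistake.",
    "Rain expected on Friday.",
]
request = FacetRequest(topic="merger", facet="narrative", value="yes")
print(answer(request, corpus, toy_facet_value))
# ['The merger closed, then shares rose.']
```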
-
Specification