Text sample entry group formulation
First Claim
1. A method comprising:
an act of accessing a set of text samples, each having a corresponding text sample identifier;
for each of at least some of the set of text samples, an act of preparing the text sample, the act of preparing the text sample comprising:
an act of parsing a plurality of text components from the text sample; and
for each of at least some of the parsed plurality of text components, an act of identifying the text component, the act of identifying the text component comprising:
an act of determining if the text component is already correlated to a text component identifier, the text component identifier representing the content while being distinguished from the content;
if the text component is already correlated to a text component identifier, assigning the text component identifier to the text component and such that when two text components are the same then the two text components will be assigned a same text component identifier;
if the text component is not already correlated to a text component identifier, assigning a new text component identifier to the text component; and
an act of creating a text component entry comprising a) the text sample identifier for the text sample from which the text component was parsed, and b) the assigned text component identifier;
an act of creating a text sample entry group comprising a plurality of text component entries corresponding to text components parsed from the text sample, and such that the plurality of text component entries are sorted by sequence of the corresponding text component within the text sample; and
an act of storing a plurality of text sample entry groups created by performance of the act of preparing the text sample for each of the at least some of the set of text samples, wherein the plurality of text samples entries are stored in a text component entry table that includes a duplicate set of text component entries having a same text sample identifier and component identifier pairing.
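The preparation steps recited in claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `prepare_samples`, the whitespace tokenization, and the sequential integer identifiers are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# (text sample identifier, text component identifier); hypothetical layout
Entry = Tuple[int, int]


def prepare_samples(samples: Dict[int, str]) -> Tuple[Dict[str, int], List[Entry]]:
    """Build component identifiers and a text component entry table.

    Assumption: a "text component" is a whitespace-delimited word.
    """
    component_ids: Dict[str, int] = {}  # component content -> identifier
    entry_table: List[Entry] = []       # entry groups stored back to back

    for sample_id in sorted(samples):
        group: List[Entry] = []
        # split() yields components in their order of appearance, so the
        # group is already sorted by sequence within the sample.
        for component in samples[sample_id].split():
            if component not in component_ids:
                # Not yet correlated to an identifier: assign a new one.
                component_ids[component] = len(component_ids) + 1
            # Identical components always receive the same identifier.
            group.append((sample_id, component_ids[component]))
        entry_table.extend(group)

    return component_ids, entry_table


ids, table = prepare_samples({1: "the cat saw the dog", 2: "the dog"})
print(ids)
print(table)
```

Because "the" occurs twice in sample 1, the pairing `(1, 1)` appears twice in the entry table, which corresponds to the duplicate entries the claim permits.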
Abstract
Text samples are stored in a manner that allows them to be searched quickly. The text samples are assigned a text sample identifier and are each parsed to extract text components from the text samples. Text components that have the same content are assigned the same text component identifier. For each parsed text component, a text component entry is created that includes the assigned text component identifier as well as the text sample identifier for the text sample from which the text component was parsed. A text sample entry group is created for each text sample that contains the text component entries in sequence for the text components found within the text sample. The text sample entry groups are stored so as to be scannable during a future search.
20 Claims
1. A method comprising:
an act of accessing a set of text samples, each having a corresponding text sample identifier;
for each of at least some of the set of text samples, an act of preparing the text sample, the act of preparing the text sample comprising:
an act of parsing a plurality of text components from the text sample; and
for each of at least some of the parsed plurality of text components, an act of identifying the text component, the act of identifying the text component comprising:
an act of determining if the text component is already correlated to a text component identifier, the text component identifier representing the content while being distinguished from the content;
if the text component is already correlated to a text component identifier, assigning the text component identifier to the text component and such that when two text components are the same then the two text components will be assigned a same text component identifier;
if the text component is not already correlated to a text component identifier, assigning a new text component identifier to the text component; and
an act of creating a text component entry comprising a) the text sample identifier for the text sample from which the text component was parsed, and b) the assigned text component identifier;
an act of creating a text sample entry group comprising a plurality of text component entries corresponding to text components parsed from the text sample, and such that the plurality of text component entries are sorted by sequence of the corresponding text component within the text sample; and
an act of storing a plurality of text sample entry groups created by performance of the act of preparing the text sample for each of the at least some of the set of text samples, wherein the plurality of text samples entries are stored in a text component entry table that includes a duplicate set of text component entries having a same text sample identifier and component identifier pairing.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
16. A computer program product comprising one or more computer-readable storage media having thereon computer-executable instructions that are structured such that, when executed by one or more processors of a computing system, cause the computing system to perform a method for storing representations of a set of text samples, each having a corresponding text sample identifier, the method comprising:
an act of creating a content identification table that includes text components and corresponding text component identifier pairings, that act of creating including for each of at least some of the set of text samples:
an act of parsing a plurality of text components from the text sample; and
for each of at least some of the parsed plurality of text components, an act of identifying the text component, the act of identifying the text component comprising:
an act of determining if the text component is already correlated to a text component identifier, the text component identifier representing the content while being distinguished from the content;
if the text component is already correlated to a text component identifier, assigning the text component identifier to the text component;
if the text component is not already correlated to a text component identifier, assigning a new text component identifier to the text component as a text component and corresponding text component identifier pairing; and
an act of creating a text component entry comprising a) the text sample identifier for the text sample from which the text component was parsed, and b) the assigned text component identifier;
while creating the content identification table, refraining from creating a new entry in the content identification table for a text component that is encountered that is the same as another text component in the content identification table, and such that the content identification table omits duplicate entries of the text component and corresponding text component identifier pairing; and
an act of creating a text component entry table by at least:
creating a text sample entry group comprising a plurality of text component entries corresponding to text components parsed from the text sample, and such that the plurality of text component entries are sorted by sequence of the corresponding text component within the text sample; and
an act of storing a plurality of text sample entry groups created by performance of the act of preparing the text sample for each of the at least some of the set of text samples.
Dependent claims: 17
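The content identification table of claim 16, which omits duplicate pairings, might look something like the sketch below. The class name `ContentIdentificationTable`, its methods, and the zero-based integer identifiers are hypothetical choices for illustration; the claim does not prescribe any particular data structure.

```python
from typing import Dict, List


class ContentIdentificationTable:
    """Maps each distinct text component to exactly one identifier."""

    def __init__(self) -> None:
        self._id_by_component: Dict[str, int] = {}
        self._component_by_id: List[str] = []

    def identify(self, component: str) -> int:
        """Return the identifier for a component, adding a pairing if new."""
        existing = self._id_by_component.get(component)
        if existing is not None:
            # Already correlated: refrain from creating a second entry.
            return existing
        new_id = len(self._component_by_id)  # assumption: sequential ids from 0
        self._component_by_id.append(component)
        self._id_by_component[component] = new_id
        return new_id

    def component(self, component_id: int) -> str:
        """Recover the content that an identifier represents."""
        return self._component_by_id[component_id]

    def __len__(self) -> int:
        return len(self._component_by_id)


table = ContentIdentificationTable()
assigned = [table.identify(word) for word in "to be or not to be".split()]
print(assigned, len(table))
```

Six components are identified in the usage example, but only four pairings are stored, since the repeated "to" and "be" reuse their existing identifiers.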
18. A computing system comprising:
at least one processor; and
one or more storage devices having stored computer-executable instructions that are executable by the at least one processor to cause the computing system to implement the following:
an act of accessing a set of text samples, each having a corresponding text sample identifier;
for each of at least some of the set of text samples, an act of preparing the text sample, the act of preparing the text sample comprising:
an act of parsing a plurality of text components from the text sample; and
for each of at least some of the parsed plurality of text components, an act of identifying the text component, the act of identifying the text component comprising:
an act of determining if the text component is already correlated to a text component identifier, the text component identifier representing the content while being distinguished from the content;
if the text component is already correlated to a text component identifier, assigning the text component identifier to the text component;
if the text component is not already correlated to a text component identifier, assigning a new text component identifier to the text component; and
an act of creating a text component entry comprising a) the text sample identifier for the text sample from which the text component was parsed, and b) the assigned text component identifier;
an act of creating a text sample entry group comprising a plurality of text component entries corresponding to text components parsed from the text sample, and such that the plurality of text component entries are sorted by sequence of the corresponding text component within the text sample;
an act of storing a plurality of text sample entry groups created by performance of the act of preparing the text sample for each of the at least some of the set of text samples, the plurality of text sample entry groups being stored in a predetermined ordering by text sample identifier;
an act of performing a search on the plurality of sample entry groups, wherein the act of performing a search comprises:
scanning through the text sample identifiers of the plurality of text sample entry groups in search of a first particular text component identifier that identifies a first particular text component and a second particular text component identifier that identifies a second particular text component, and upon finding the first and second particular text component identifiers during the act of scanning, an act of using text sample identifiers corresponding to the first and second particular text component identifiers to identify one or more text samples that include the first particular text component and to identify one or more text samples that include the second particular text component; and
an act of identifying a result of the search, the result including at least some of the text samples that include the first particular text component and the second particular text component, or identifying a result of the search as including at least some of the text samples that include the first particular text component but which omit the second particular text component.
Dependent claims: 19, 20
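The search recited in claim 18 can be illustrated with a single linear scan over the stored entries. This is only a sketch under stated assumptions: the function name `search`, the `exclude_second` flag, and the flat list of `(text sample identifier, text component identifier)` pairs are hypothetical and are not taken from the specification.

```python
from typing import List, Set, Tuple

# (text sample identifier, text component identifier); hypothetical layout
Entry = Tuple[int, int]


def search(entry_table: List[Entry], first_id: int, second_id: int,
           exclude_second: bool = False) -> List[int]:
    """Scan the entry groups once for two text component identifiers.

    Returns the identifiers of samples containing both components, or,
    when exclude_second is True, the first component but not the second.
    """
    with_first: Set[int] = set()
    with_second: Set[int] = set()

    for sample_id, component_id in entry_table:
        if component_id == first_id:
            with_first.add(sample_id)
        if component_id == second_id:
            with_second.add(sample_id)

    if exclude_second:
        matched = with_first - with_second
    else:
        matched = with_first & with_second
    return sorted(matched)


# Entry groups stored in order of text sample identifier (1, 2, 3).
entries = [(1, 1), (1, 2), (1, 3), (1, 1), (1, 4),
           (2, 1), (2, 4),
           (3, 2)]
print(search(entries, 1, 4))
print(search(entries, 1, 2, exclude_second=True))
```

The scan compares integer identifiers rather than component text, which is one plausible reason the stored form is described as quickly searchable.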
Specification