Methods and/or systems for selecting data sets
Abstract
Methods and apparatus for identifying associated key words in a data set. A parser first extracts key words from the data set, then analyzes them to identify which key words, if any, are associated according to a predefined set of rules. The rules are grammatical and include, for example, two key words that are both nouns and occur one after the other without intervening low value words. Similar rules apply to a verb followed by a noun and to an adjective followed by a noun. These rules allow terms and phrases such as “information technology” and “wide area network” to be identified as associated key words rather than as individual, unrelated key words.
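The noun-after-noun example in the abstract can be illustrated with a minimal sketch. It assumes the words have already been tagged by some upstream tagger; the tag names, the `LOW_VALUE` word list and the function name `associate_nouns` are hypothetical and do not come from the patent text.

```python
# Hypothetical low-value words; the source does not enumerate them.
LOW_VALUE = {"the", "of", "a", "and", "in", "for", "is"}

def associate_nouns(tagged):
    """Group runs of adjacent nouns into one associated key word.

    `tagged` is a list of (word, tag) pairs in document order.
    Any non-noun or low-value word ends the current run.
    """
    keywords, run = [], []
    for word, tag in tagged:
        if tag == "NOUN" and word.lower() not in LOW_VALUE:
            run.append(word)              # extend the run of adjacent nouns
            continue
        if run:                           # run interrupted: emit it as one key word
            keywords.append(" ".join(run))
            run = []
        # non-noun key words are ignored here for brevity
    if run:
        keywords.append(" ".join(run))
    return keywords

sentence = [("the", "DET"), ("information", "NOUN"), ("technology", "NOUN"),
            ("of", "PREP"), ("networks", "NOUN")]
print(associate_nouns(sentence))  # ['information technology', 'networks']
```

In this sketch the intervening low value word "of" keeps "technology" and "networks" from being associated, which is the behaviour the abstract describes.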
6 Claims
1. Apparatus for determining a measure of similarity between at least a first and a second data set, said apparatus comprising:
i) input means for receiving at least said first and second data sets;
ii) processing means for identifying a set of keywords in at least the first of the data sets, the processing means having access to at least one rule set and identifying the set of keywords by use of said at least one rule set, the processing means further determining said measure of similarity; and
iii) output means to output said measure of similarity;
wherein said rule set includes a rule concerning relative location of data items in a respective data set, and wherein said processing means determines the measure of similarity by comparing at least one set of key words, identified by said processing means in the first data set, with a set of keywords comprising or derived from said second data set;
said relative location of data items in a respective data set comprises adjacent location of at least two potential key words with respect to each other in the data set, the processing means identifying such adjacent potential key words as together providing a single key word in an identified set of key words; and
said at least one rule set comprises at least one of the following criteria:
1) a noun followed by a noun or a predetermined set of indicia;
2) a verb followed by a noun or a predetermined set of indicia;
3) an adjective followed by a noun or a predetermined set of indicia; and
4) a predetermined set of indicia followed by a noun or a verb or a further predetermined set of indicia;
the processing means identifying adjacent potential key words as together providing a single key word in an identified set of key words only when they meet said at least one criterion.
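One possible reading of the four criteria in claim 1 is a lookup table keyed on the category of the first word in an adjacent pair. The sketch below is illustrative only. The tag names ("N", "V", "ADJ", "IND"), the greedy chaining of more than two words, and the function names are assumptions, and every entry in the input list is assumed to be directly adjacent to the next in the original data set.

```python
# Assumed tags: "N" noun, "V" verb, "ADJ" adjective,
# "IND" a member of the predetermined set of indicia.
CRITERIA = {
    "N": {"N", "IND"},         # 1) noun followed by a noun or indicia
    "V": {"N", "IND"},         # 2) verb followed by a noun or indicia
    "ADJ": {"N", "IND"},       # 3) adjective followed by a noun or indicia
    "IND": {"N", "V", "IND"},  # 4) indicia followed by a noun, verb or indicia
}

def meets_criterion(first_tag, second_tag):
    """True when the adjacent pair satisfies at least one criterion."""
    return second_tag in CRITERIA.get(first_tag, set())

def merge_adjacent(potential):
    """Merge adjacent potential key words only when a criterion is met.

    `potential` is a list of (word, tag) pairs in document order.
    Chaining several qualifying pairs into one key word is an assumption.
    """
    keywords, group = [], []
    for word, tag in potential:
        if group and meets_criterion(group[-1][1], tag):
            group.append((word, tag))     # pair qualifies: extend the group
        else:
            if group:                     # emit the finished group as one key word
                keywords.append(" ".join(w for w, _ in group))
            group = [(word, tag)]
    if group:
        keywords.append(" ".join(w for w, _ in group))
    return keywords

words = [("wide", "ADJ"), ("area", "N"), ("network", "N"), ("fails", "V")]
print(merge_adjacent(words))  # ['wide area network', 'fails']
```

Here "network fails" is not merged because a noun followed by a verb is not among the listed criteria.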
6. A method of determining a level of similarity between first and second data sets, wherein said method comprises the steps of:
i) applying identifying tags to selected data items in at least the first of the data sets in accordance with at least a first rule;
ii) identifying a set of potential key words by reference to either the presence or the absence of said identifying tags;
iii) selecting sets of two or more potential keywords which are adjacent by applying at least a second rule;
iv) classifying each selected set of potential keywords as a single keyword;
v) generating a set of keywords which comprises each classified set of potential keywords as a single keyword, together with the remaining keywords from the identified set of potential keywords;
vi) comparing the generated set of keywords with a set of keywords either comprising or derived from the second data set; and
said first rule relates at least in part to the grammatical category of the data items;
said at least a second rule comprises one or more rules from the following set:
1) a noun followed by a noun or a predetermined set of indicia;
2) a verb followed by a noun or a predetermined set of indicia;
3) an adjective followed by a noun or a predetermined set of indicia; and
4) a predetermined set of indicia followed by a noun or a verb or a further predetermined set of indicia.
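Steps i) to vi) of the method can be sketched end to end under several stated assumptions. A tiny hypothetical `LEXICON` stands in for a real grammatical tagger, potential key words are identified by the presence of a tag, only the rules not involving indicia are included, and the Jaccard coefficient is used as the level of similarity. The claim does not specify any of these choices.

```python
# Hypothetical lexicon standing in for a tagger (first rule: grammatical category).
LEXICON = {"wide": "ADJ", "local": "ADJ", "area": "N", "network": "N",
           "information": "N", "technology": "N"}

# Subset of the second-rule set as (first tag, second tag); indicia rules omitted.
RULES = {("N", "N"), ("V", "N"), ("ADJ", "N")}

def tag(words):
    """Step i: apply identifying tags; untagged words get None."""
    return [(w, LEXICON.get(w.lower())) for w in words]

def keyword_set(text):
    """Steps ii to v: build the set of keywords for one data set."""
    keywords, group = set(), []
    # The (None, None) sentinel flushes the final group.
    for word, t in tag(text.split()) + [(None, None)]:
        # Step ii: the presence of a tag marks a potential key word.
        # Step iii: adjacent potential key words must satisfy a rule.
        if t is not None and group and (group[-1][1], t) in RULES:
            group.append((word, t))
            continue
        if group:
            # Steps iv and v: each group becomes a single keyword.
            keywords.add(" ".join(w.lower() for w, _ in group))
        group = [(word, t)] if t is not None else []
    return keywords

def similarity(first, second):
    """Step vi: compare keyword sets (Jaccard coefficient, an assumed measure)."""
    a, b = keyword_set(first), keyword_set(second)
    return len(a & b) / len(a | b) if (a or b) else 0.0

doc_a = "information technology for the wide area network"
doc_b = "information technology in a local area network"
print(sorted(keyword_set(doc_a)))  # ['information technology', 'wide area network']
print(similarity(doc_a, doc_b))
```

Because untagged words reset the current group, they also break adjacency, so "technology" and "wide" are never combined in this sketch.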
Specification