Information type identification method and apparatus, e.g. for music file name content identification
Abstract
The method serves to automatically identify in a set of data sequences at least one specific type of information contained in each data sequence of the set, wherein the type of information has an unknown presentation in the data sequences. It comprises the steps of:
initially defining at least one characteristic feature of the specific type of information, and of expressing the characteristic feature(s) in terms of at least one recognition rule executable by processor means (2),
applying the recognition rule(s) through the processor means to analyze the set of data sequences,
determining in each data sequence a data portion thereof satisfying the recognition rule(s), and
identifying the data portion as corresponding to the specific type of information.
The invention can be used notably for automatically processing the contents of music file names, where the data sequence corresponds to the characters for a music file, and the specific information types are an artist name and/or music title contained in some arbitrary form and order in the file name.
32 Claims
-
1. Method of automatically identifying an instance of at least one specific type of information present in a data sequence, wherein said instance is unknown and expressed in one of several different forms in said data sequence, characterised in that it comprises the steps of:
-
initially defining at least one characteristic feature of said specific type of information that is independent of said different possible forms of instances thereof, expressing said at least one characteristic feature in terms of at least one recognition rule executable by processor means (2), applying said at least one recognition rule through said processor means to analyse said data sequence, determining in said data sequence a data portion thereof satisfying said at least one recognition rule, and identifying said data portion as corresponding to said instance of the specific type of information for that data sequence.
said data sequence corresponds to a file name of a set of file names of music files, said data sequence being the characters forming a corresponding music file name, and said data portion being a character field containing information of a given type, and said specific type of information to be identified comprises at least one of:
a first type of information corresponding to an artist name contained in said music file name, and a second type of information corresponding to a music title name contained in said music file name.
-
-
4. Method according to claim 3, further comprising a step, prior to said determining step, of determining a separator character present between character fields respectively assigned to said first and second types of information.
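The separator-determination step of claim 4 can be illustrated with a minimal sketch. The claim does not say which characters are candidates or how one is chosen, so the candidate list and the "splits the most file names into exactly two fields" criterion below are assumptions made only for illustration, as are the names `CANDIDATE_SEPARATORS` and `detect_separator`.

```python
from collections import Counter

# Hypothetical candidate separators; the claim does not enumerate them.
CANDIDATE_SEPARATORS = ["-", "_", "~", "|"]


def detect_separator(file_names):
    """Return the candidate that splits the most file names into exactly
    two character fields, or None if no candidate does so for any name."""
    counts = Counter()
    for name in file_names:
        for sep in CANDIDATE_SEPARATORS:
            if len(name.split(sep)) == 2:
                counts[sep] += 1
    if not counts:
        return None
    # most_common(1) yields the separator with the highest count.
    return counts.most_common(1)[0][0]
```

Choosing the separator by a vote over the whole set, rather than per file name, reflects the set-level reasoning used elsewhere in the claims; a per-name choice would be an equally valid reading.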
-
5. Method according to claim 3, further comprising a step of detecting the presence of a character cluster composed of a first part which is constant and a second part which is variable over said set of music file names, said second part being e.g. an integer or equivalent count character, and of eliminating that character cluster from said character sequence.
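As a rough illustration of claim 5, the sketch below removes a cluster such as "Track 01 " from every file name in a set. It assumes, beyond what the claim states, that the cluster sits at the start of the name and that its variable part is a run of digits; `strip_counter_cluster` is a hypothetical helper name.

```python
import re

# Leading cluster: a non-digit constant part, then a digit run, then spaces.
_CLUSTER = re.compile(r"^(\D*?)(\d+)\s*")


def strip_counter_cluster(file_names):
    """Remove a leading 'constant + counter' cluster when the constant part
    is identical over the whole set and the counter actually varies."""
    matches = [_CLUSTER.match(name) for name in file_names]
    if not all(matches):
        return list(file_names)  # some name has no such cluster
    constants = {m.group(1) for m in matches}
    counters = {m.group(2) for m in matches}
    if len(constants) != 1 or len(counters) < 2:
        return list(file_names)  # not constant, or counter does not vary
    return [name[m.end():] for name, m in zip(file_names, matches)]
```

Requiring the constant part to be identical across the set keeps names such as "Blink 182 - ..." and "Sum 41 - ..." untouched, since their leading text differs.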
-
6. Method according to claim 3, wherein a said recognition rule instructs to identify said first type of information as contained in the character field forming the most words among character fields assigned to respective types of information.
-
7. Method according to claim 3, wherein a said recognition rule instructs to identify said first type of information as contained in the character field which has the most occurrence in identical form in said set of music file names.
-
8. Method according to claim 3, wherein a said recognition rule instructs to identify said first type of information as contained in the character field matching a character field in a set of stored character fields corresponding to artist names.
-
9. Method according to claim 3, wherein a said recognition rule instructs to identify said first type of information as contained in the first character field appearing in the music file name.
-
10. Method according to claim 3, wherein said determining and identifying steps involve the sub-steps of:
-
identifying in said characters forming said music file name a first character field and a second character field, one said field containing the first type of information as artist name and the other containing the second type of information as music title name, determining, by reference to an artist database containing character fields each corresponding to a respective artist name, a first value (OCC1) corresponding to the number of occurrences, over said set of music file names, of a first character field contained in said artist database, and a second value (OCC2) corresponding to the number of occurrences, over said set of music file names, of a second character field contained in said artist database, wherein if said first value (OCC1) is greater than said second value (OCC2), identifying said first character field as corresponding to an artist name, if said second value (OCC2) is greater than said first value (OCC1), identifying said second character field as corresponding to an artist name, if said first and second values (OCC1, OCC2) are equal, continuing by:
determining a new first value (OCC1) corresponding to the number of different contents of said first character field over the set of music file names and a new second value (OCC2) corresponding to the number of different contents of said second character field over the set of music file names, wherein if said first value (OCC1) is greater than said second value (OCC2), identifying said second character field as corresponding to an artist name, if said second value (OCC2) is greater than said first value (OCC1), identifying said first character field as corresponding to an artist name, if said first and second values (OCC1, OCC2) are equal, continuing by:
determining a new first value (OCC1) corresponding to the total number of words in said first character field summed over the entire set of music file names and a new second value (OCC2) corresponding to the total number of words in said second character field summed over the entire set of music file names, wherein if said first value (OCC1) is greater than said second value (OCC2), identifying said first character field as corresponding to an artist name, if said second value (OCC2) is greater than said first value (OCC1), identifying said second character field as corresponding to an artist name, and if said first and second values (OCC1, OCC2) are equal, identifying said first character field as corresponding to an artist name.
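The three-stage comparison of claim 10 can be sketched as a cascade of tie-breakers. The sketch assumes that each file name has already been split into two character fields, that the artist database is a plain set of strings, and that matching is by exact string equality; the function name `identify_artist_field` is hypothetical.

```python
def identify_artist_field(pairs, artist_db):
    """pairs: one (first_field, second_field) tuple per music file name.
    Returns 0 if the first field is taken as the artist name, 1 otherwise."""
    firsts = [first for first, _ in pairs]
    seconds = [second for _, second in pairs]

    # Stage 1: occurrences of each field in the artist database.
    occ1 = sum(f in artist_db for f in firsts)
    occ2 = sum(s in artist_db for s in seconds)
    if occ1 != occ2:
        return 0 if occ1 > occ2 else 1

    # Stage 2: number of different contents. The field with MORE distinct
    # values is taken as the title, so the other one is the artist.
    occ1, occ2 = len(set(firsts)), len(set(seconds))
    if occ1 != occ2:
        return 1 if occ1 > occ2 else 0

    # Stage 3: total word count over the set; more words means artist.
    occ1 = sum(len(f.split()) for f in firsts)
    occ2 = sum(len(s.split()) for s in seconds)
    if occ2 > occ1:
        return 1
    return 0  # occ1 greater, or a full tie: default to the first field
```

Note that the stage 2 comparison is inverted relative to stages 1 and 3: a collection typically holds several titles per artist, so the field with fewer distinct values is the better artist candidate.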
-
-
11. Method according to claim 3, further comprising the step of applying rewriting rules to at least one of an artist name and a music title name identified from a said music file name, said rewriting rules being executable by said processor means (2) for transforming an artist name/music title name into a form corresponding to that used for storing artist names/music title names in a database.
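A rewriting rule in the sense of claim 11 could look like the following sketch. The claim does not specify the stored form, so the choices here (underscores to spaces, collapsed whitespace, a trailing article moved to the front, upper case) are assumed purely for illustration, and `rewrite_name` is a hypothetical name.

```python
import re


def rewrite_name(raw):
    """Rewrite an identified artist or title string into an assumed
    canonical database form (upper case, single spaces, leading article)."""
    name = raw.replace("_", " ")
    name = re.sub(r"\s+", " ", name).strip()
    # "beatles, the" -> "the beatles" (article moved back to the front)
    m = re.fullmatch(r"(.+), (the|a|an)", name, flags=re.IGNORECASE)
    if m:
        name = f"{m.group(2)} {m.group(1)}"
    return name.upper()
```

Applying the same function to both the identified field and the database entries would make lookups insensitive to these surface differences.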
-
12. Method according to claim 11, further comprising a step of compiling a directory of rewritten music file names, corresponding to said identified music file names, in which at least one of an artist name and a music title name is organised to be machine readable.
-
13. Method according to claim 3, further comprising the step of constructing for each music file name a machine readable information module comprising at least one of an identified artist name and an identified music title name, to which is associated metadata, said metadata being provided from a database on the basis of said identified artist name and/or music title name.
-
14. Method according to claim 13, wherein said metadata is indicative of a genre or genre/subgenre associated with the corresponding music title.
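The machine readable information module of claims 13 and 14 might be represented as a small record type. In the sketch below the database is assumed to be an in-memory mapping keyed either by an (artist, title) tuple or by artist name alone; `TrackInfo`, `build_track_info` and the `genre` key are illustrative names, not taken from the claims.

```python
from dataclasses import dataclass, field


@dataclass
class TrackInfo:
    file_name: str
    artist: str
    title: str
    metadata: dict = field(default_factory=dict)


def build_track_info(file_name, artist, title, metadata_db):
    """Attach metadata found under (artist, title), falling back to the
    artist alone, and to an empty dict when nothing is known."""
    meta = metadata_db.get((artist, title)) or metadata_db.get(artist) or {}
    return TrackInfo(file_name, artist, title, dict(meta))
```

Copying the metadata into a fresh dict keeps later edits to one module from altering the shared database entry.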
-
15. Use of the method according to claim 3 in a music playlist generator (16), wherein said playlist generator accesses stored music files by reference to identified artist names and/or identified music title names.
-
28. Method according to claim 1, further comprising:
-
applying said at least one recognition rule to analyse a set of data sequences including said data sequence;
determining in each data sequence of said set of data sequences a data portion for each data sequence satisfying said at least one recognition rule; and
identifying said data portion for each data sequence as corresponding to said specific type of information.
-
-
29. Method according to claim 1, wherein said data sequence includes an instance for each of two or more specific types of information, and each instance is expressed in one of several possible forms, the method further comprising:
-
initially defining at least one characteristic feature of each specific type of information that is independent of said different possible forms of instances thereof, expressing said at least one characteristic feature in terms of said at least one recognition rule, applying said at least one recognition rule through said processor means to analyse said data sequence, determining in said data sequence a data portion for each specific type of information satisfying one or more of said at least one recognition rule, and identifying said data portions as respectively corresponding to said instance of each specific type of information for that data sequence.
-
-
30. Method according to claim 3, wherein the order of the types of information in said specific type of information is not known.
-
16. Apparatus for automatically identifying an instance of at least one specific type of information present in a data sequence, wherein said instance is unknown and expressed in one of several different possible forms in said data sequence, characterised in that it comprises:
-
means for expressing at least one characteristic feature of said specific type of information that is independent of said different possible forms of instances thereof, means for expressing said at least one characteristic feature in terms of at least one machine executable recognition rule, processor means for applying said at least one machine executable recognition rule to analyse said data sequence, determining means for determining in said data sequence a data portion thereof satisfying said at least one machine executable recognition rule, and identifying means for identifying said data portion as corresponding to said instance of the specific type of information for that data sequence.
said data sequence corresponds to a file name of a set of file names of music files, said data sequence being the characters forming a corresponding music file name, and a said data portion being a character field containing information of a given type, and said specific type of information to be identified comprises at least one of;
a first type of information corresponding to an artist name contained in said music file name, and a second type of information corresponding to a music title name contained in said music file name.
-
-
19. Apparatus according to claim 18, further comprising separator character means for detecting a separator character present between character fields respectively assigned to said first and second types of information.
-
20. Apparatus according to claim 18, further comprising means for detecting the presence of a character cluster composed of a first part which is constant and a second part which is variable over said set of music file names, said second part being e.g. an integer or equivalent count character, and for eliminating that character cluster from said character sequence.
-
21. Apparatus according to claim 18, wherein a said recognition rule instructs to identify said first type of information as contained in at least one of:
-
i) the character field forming the most words among character fields assigned to respective types of information, ii) the character field which has the most occurrence in identical form in said set of music file names, iii) the character field matching a character field in a set of stored character fields corresponding to artist names, and iv) the first character field appearing in the music file name.
-
-
22. Apparatus according to claim 18, further comprising:
-
means for identifying in said characters forming said music file name a first character field and a second character field, one said field containing the first type of information (artist name) and the other containing the second type of information (music title name), means for determining, by reference to an artist database containing character fields each corresponding to a respective artist name, a first value (OCC1) corresponding to the number of occurrences, over said set of music file names, of a first character field contained in said artist database, and a second value (OCC2) corresponding to the number of occurrences, over said set of music file names, of a second character field contained in said artist database, wherein if said first value (OCC1) is greater than said second value (OCC2), said first character field is identified as corresponding to an artist name, if said second value (OCC2) is greater than said first value (OCC1), said second character field is identified as corresponding to an artist name, means, operative if said first and second values (OCC1, OCC2) are equal, for determining a new first value (OCC1) corresponding to the number of different contents of said first character field over the set of music file names and a new second value (OCC2) corresponding to the number of different contents of said second character field over the set of music file names, wherein if said first value (OCC1) is greater than said second value (OCC2), said second character field is identified as corresponding to an artist name, and if said second value (OCC2) is greater than said first value (OCC1), said first character field is identified as corresponding to an artist name, means, operative if said first and second values (OCC1, OCC2) are equal, for determining a new first value (OCC1) corresponding to the total number of words in said first character field summed over the entire set of music file names and a new second value (OCC2) corresponding to the total number of words in said second character field summed over the entire set of music file names, wherein if said first value (OCC1) is greater than said second value (OCC2), said first character field is identified as corresponding to an artist name, and if said second value (OCC2) is greater than said first value (OCC1), said second character field is identified as corresponding to an artist name, and means, operative if said first and second values (OCC1, OCC2) are equal, for identifying said first character field as corresponding to an artist name.
-
-
23. Apparatus according to claim 18, further comprising rewriting means for applying rewriting rules to at least one of an artist name and a music title name identified from a said music file name, said rewriting rules being executable for transforming an artist name/music title name into a form corresponding to that used for storing artist names/music title names in a database.
-
24. Apparatus according to claim 23, further comprising compiling means for compiling a directory of rewritten music file names, corresponding to said identified music file names, in which at least one of an artist name and a music title name is organised to be machine readable.
-
25. Apparatus according to claim 18, further comprising constructing means for constructing for each music file name a machine readable information module comprising at least one of an identified artist name and an identified music title name, to which is associated metadata, said metadata being provided from a database on the basis of said identified artist name and/or music title name.
-
26. Apparatus according to claim 25, wherein said metadata is indicative of a genre or genre/subgenre associated with the corresponding music title.
-
27. System combining an apparatus according to claim 16 with a music playlist generator (16), wherein said playlist generator accesses stored music files by reference to identified artist names and/or identified music title names.
-
31. Apparatus according to claim 16, wherein:
-
said means for applying said at least one machine executable recognition rule also applies said recognition rule(s) to analyse a set of data sequences including said data sequence;
said determining means also determines in each data sequence a data portion for each data sequence satisfying said at least one machine executable recognition rule; and
said identifying means also identifies said data portion for each data sequence as corresponding to said specific type of information.
-
-
32. Apparatus according to claim 16, wherein:
-
said data sequence includes an instance for each of two or more specific types of information, each instance is expressed in one of several possible forms, said means for expressing at least one characteristic feature also expresses at least one characteristic feature of each specific type of information that is independent of said different possible forms of instances thereof, said determining means also determines in said data sequence a data portion for each specific type of information satisfying one or more of said at least one machine executable recognition rule, and said identifying means also identifies said data portions as respectively corresponding to said instance of each specific type of information for that data sequence.
-
Specification