Content search in complex language, such as Japanese
Abstract
A search facility provides searching capabilities in languages such as Japanese. The facility may use a vocabulary knowledge base organized by concepts. For example, each concept may be associated with at least one keyword (as well as any synonyms or variant forms) by applying one or more rules that relate to identifying common main forms, script variants, alternative grammatical forms, phonetic variants, proper noun variants, numerical variants, scientific names, cultural relevance, etc. The contents of the vocabulary knowledge base are then used in executing search queries. A user may enter a search query in which keywords (or synonyms associated with those keywords) may be identified, along with various stopwords that facilitate segmentation of the search query and other actions. Execution of the search query may return a list of assets, or similar indications, that relate to concepts identified within the search query.
44 Claims
1. A system for searching content requested using a text-based query including one or more scripts or orthographic forms associated with Japanese, the system comprising:
a hierarchically structured vocabulary knowledge base for storing vocabulary information associated with the one or more scripts or orthographic forms, wherein the vocabulary knowledge base is generated by a method comprising:
assigning an identifier to a semantic concept;
identifying a main orthographic form for the semantic concept, wherein the main orthographic form is based on kanji script, katakana script, hiragana script, or any combination of kanji script, katakana script, and hiragana script;
for at least one of the one or more scripts or orthographic forms, associating at least one synonymous orthographic form with the semantic concept, wherein the synonymous orthographic form is at least partially distinct from the main orthographic form, and wherein the at least one synonymous orthographic form includes any one or more of:
kanji script, katakana script, hiragana script, okurigana variant, romaji written form, phonetic variants associated with one or more of kanji script, katakana script, hiragana script, okurigana variant, and romaji written form, and/or hybrid variants associated with one or more of kanji script, katakana script, hiragana script, okurigana variant, and romaji written form;
storing the identifier, the main orthographic form, and the at least one synonymous orthographic form in a data storage component associated with the system; and
repeating the assigning, the identifying, the associating, and the storing for additional semantic concepts;
an asset repository for storing information associated with searchable assets;
a classification component for classifying the searchable assets in the asset repository to facilitate matching between the searchable assets and the vocabulary information, wherein the matching is based, at least in part, on identifiers assigned to semantic concepts included in the vocabulary knowledge base; and
a search engine for receiving and executing queries for the searchable assets.
Dependent claims: 2, 3
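The generation steps recited in claim 1 (assigning, identifying, associating, storing, repeating) and the identifier-based matching can be illustrated with a minimal sketch. This is not the patented implementation: the names `Concept`, `VocabularyKnowledgeBase`, `ASSET_CONCEPTS`, and `search`, the integer identifiers, and the sample vocabulary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Optional, Set


@dataclass
class Concept:
    concept_id: int                       # identifier assigned to the semantic concept
    main_form: str                        # main orthographic form
    synonyms: Set[str] = field(default_factory=set)  # synonymous orthographic forms
    parent_id: Optional[int] = None       # link that gives the vocabulary its hierarchy


class VocabularyKnowledgeBase:
    """Hypothetical in-memory stand-in for the data storage component."""

    def __init__(self) -> None:
        self._next_id = count(1)
        self.concepts: Dict[int, Concept] = {}
        self.form_index: Dict[str, Set[int]] = {}   # written form -> concept ids

    def add_concept(self, main_form, synonyms=(), parent_id=None) -> int:
        concept_id = next(self._next_id)                     # assign an identifier
        distinct = {s for s in synonyms if s != main_form}   # keep only distinct forms
        self.concepts[concept_id] = Concept(concept_id, main_form, distinct, parent_id)
        for form in {main_form} | distinct:                  # store every written form
            self.form_index.setdefault(form, set()).add(concept_id)
        return concept_id

    def lookup(self, form):
        return self.form_index.get(form, set())


# Repeat the steps for additional concepts (sample data is illustrative only).
kb = VocabularyKnowledgeBase()
animal_id = kb.add_concept("動物", ["どうぶつ", "doubutsu"])
cat_id = kb.add_concept("猫", ["ねこ", "ネコ", "neko"], parent_id=animal_id)

# Classification: each asset is tagged with concept identifiers, not with strings.
ASSET_CONCEPTS = {"img-001": {cat_id}, "img-002": {animal_id}}


def search(term):
    """Return the assets whose concept ids match any concept the term maps to."""
    ids = kb.lookup(term)
    return sorted(a for a, tagged in ASSET_CONCEPTS.items() if tagged & ids)
```

Because assets are tagged with identifiers rather than written forms, a query in any stored script (kanji, hiragana, katakana, or romaji) resolves to the same assets in this sketch.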
4. In a computer system, a method for searching for content identified using text and symbols of a language having multiple written variants of at least some semantic concepts, the method comprising:
receiving a search query comprising a textual expression represented using one or more characters in at least one script or orthographic form associated with the language;
normalizing the received search query, wherein the normalizing includes converting any characters with multiple possible representations to a standardized form;
tokenizing the normalized search query, wherein the tokenizing includes separating the textual expression into one or more tokens;
based on contents of a vocabulary knowledge base, determining a match for each token of the tokenized search query;
for each token where the determining of a match is not successful, segmenting the token, wherein the segmenting includes performing additional matching on identified segments of the token;
executing the search using a set of tokens for which a match has been determined; and
displaying a result set from the executed search.
Dependent claims: 5, 6, 7
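The sequence in claim 4 (normalize, tokenize, match, segment unmatched tokens, execute) might look roughly like the sketch below. Several choices here are assumptions rather than anything the claim specifies: Unicode NFKC stands in for "converting to a standardized form", tokenizing is a plain whitespace split, segmentation is a greedy longest match, and a result must carry every matched concept. `VOCAB` and `ASSETS` are hypothetical sample data.

```python
import unicodedata

# Hypothetical vocabulary (written form -> concept id) and tagged assets.
VOCAB = {"猫": 1, "ネコ": 1, "白": 2, "白い": 2, "写真": 3}
ASSETS = {"img-001": {1, 2}, "img-002": {1}, "img-003": {3}}


def normalize(query):
    # NFKC folds half-width katakana, full-width Latin letters, and the
    # ideographic space into their standard forms.
    return unicodedata.normalize("NFKC", query).strip()


def tokenize(query):
    return query.split()


def segment(token):
    """Greedy longest-match segmentation of a token with no direct match."""
    found, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):      # try the longest candidate first
            if token[i:j] in VOCAB:
                found.append(VOCAB[token[i:j]])
                i = j
                break
        else:
            i += 1                               # no match here: skip one character
    return found


def search(query):
    concept_ids = []
    for token in tokenize(normalize(query)):
        if token in VOCAB:                       # direct match on the whole token
            concept_ids.append(VOCAB[token])
        else:                                    # otherwise match on its segments
            concept_ids.extend(segment(token))
    if not concept_ids:
        return []
    wanted = set(concept_ids)
    # Assumption: an asset must be tagged with every matched concept.
    return sorted(a for a, tags in ASSETS.items() if wanted <= tags)
```

In this sketch the unspaced query "白い猫" fails the whole-token lookup and is only resolved by the segmentation step, which is the case the claim's fifth step addresses.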
8. A system for searching for content identified using text and symbols of a complex language associated with multiple written forms, the system comprising:
a vocabulary knowledge base for storing vocabulary information associated with the complex language, wherein the vocabulary knowledge base is generated or updated by a repeatable method comprising:
assigning an identifier to a semantic concept;
identifying a main written form for the semantic concept, wherein the main written form is based on at least one of the multiple written forms;
for at least one of the multiple written forms associated with the complex language, associating at least one synonymous written form with the semantic concept, wherein the synonymous written form is at least partially distinct from the main written form; and
storing the identifier, the main written form, and the at least one synonymous written form in a data storage component associated with the system;
an asset repository for storing information associated with searchable assets; and
a search engine for receiving and executing queries for the searchable assets, wherein the execution is based, at least in part, on the contents of the vocabulary knowledge base.
Dependent claims: 9, 10, 11, 12, 13, 14, 15
16. A method for generating or updating a vocabulary repository used in executing queries for retrieving content, wherein the queries are provided in a complex language associated with multiple written forms, the method comprising:
identifying a main form for a semantic concept, wherein the main form is based on at least one of the multiple written forms;
assigning an identifier to the semantic concept, wherein the assigned identifier is associated with at least the main form;
associating at least one synonymous variant with the semantic concept, wherein the at least one synonymous variant is at least partially distinct from the main form; and
storing the identifier, the main form, and the at least one synonymous variant in a data storage component accessible to a search engine configured to search for assets when provided with search queries in the complex language.
Dependent claims: 17, 18
19. A computer-readable medium containing a data structure associated with a semantic concept that is representable in a complex language, the data structure comprising:
identifier information associated with the semantic concept;
information identifying a main textual form for the semantic concept, wherein the main textual form is based on a first script type associated with the complex language; and
information associating at least one synonymous textual form with the semantic concept, wherein the synonymous textual form is based on a second script type associated with the complex language, wherein the second script type is at least partially distinct from the first script type, and wherein the data structure is configured for facilitating matching assets with search queries based on the complex language.
Dependent claims: 20, 21, 22
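One way to picture the data structure of claim 19 is a record that stores each textual form together with its script type. The sketch below is an assumption-laden illustration: the names `TextualForm`, `ConceptRecord`, `script_type`, and `make_record` are hypothetical, and script types are inferred from Unicode block ranges (Hiragana U+3041 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF), which is a simplification that ignores extension blocks.

```python
from dataclasses import dataclass
from typing import List


def script_of(ch):
    """Classify one character by Unicode block (simplified)."""
    cp = ord(ch)
    if 0x3041 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def script_type(text):
    """Return the single script of the text, or 'mixed' if it uses several."""
    scripts = {script_of(ch) for ch in text}
    return scripts.pop() if len(scripts) == 1 else "mixed"


@dataclass(frozen=True)
class TextualForm:
    text: str
    script: str


@dataclass
class ConceptRecord:
    concept_id: str                      # identifier information
    main_form: TextualForm               # main textual form (first script type)
    synonymous_forms: List[TextualForm]  # forms in other script types


def make_record(concept_id, main, synonyms):
    return ConceptRecord(
        concept_id,
        TextualForm(main, script_type(main)),
        [TextualForm(s, script_type(s)) for s in synonyms],
    )


# Hypothetical sample record: one concept written four ways.
record = make_record("C0001", "桜", ["さくら", "サクラ", "sakura"])
```

Storing the script type next to each form is a design choice of this sketch; it makes it easy to check that a synonymous form really uses a script distinct from the main form.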
23. A computer-implemented method for executing a search query, the method comprising:
receiving a search query including a textual expression, wherein the textual expression is written in a language that at least occasionally lacks discrete boundaries between words or autonomous language units;
determining whether the textual expression comprises a keyword or synonym associated with a structured vocabulary knowledge base for storing vocabulary information associated with a language having multiple orthographic forms or scripts; and
if the textual expression does not comprise a keyword or synonym associated with the structured vocabulary knowledge base, performing segmentation on the textual expression, wherein the segmentation includes systematically splitting the textual expression into two or more segments, and identifying at least one keyword from the vocabulary knowledge base based on the textual expression and the two or more segments.
Dependent claims: 24, 25, 26, 27, 28, 29, 30, 31
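The "systematically splitting" step of claim 23 can be sketched as trying every split point of an unmatched expression and recursing on the remainder. The claim does not prescribe a strategy, so the choices below are assumptions: the left segment is tried longest first, and a split is accepted only if every segment is a known keyword. `KEYWORDS`, `split_into_keywords`, and `find_keywords` are hypothetical names.

```python
from functools import lru_cache

# Hypothetical keyword set standing in for the vocabulary knowledge base.
KEYWORDS = frozenset({"東京", "京都", "東", "タワー", "夜景"})


@lru_cache(maxsize=None)
def split_into_keywords(text):
    """Return a tuple of keywords that exactly covers text, or () if none exists."""
    if text in KEYWORDS:                        # the whole expression is a keyword
        return (text,)
    for cut in range(len(text) - 1, 0, -1):     # longest left segment first
        left, right = text[:cut], text[cut:]
        if left in KEYWORDS:
            rest = split_into_keywords(right)   # keep splitting the remainder
            if rest:
                return (left,) + rest
    return ()


def find_keywords(expression):
    return list(split_into_keywords(expression))
```

With this sample vocabulary, "東京都" is resolved as "東" plus "京都" because "東京" leaves an unmatched remainder. That shows how the outcome depends on both the vocabulary contents and the chosen splitting order.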
32. A structured vocabulary stored in a computer memory usable with a computer system, comprising:
multiple digital media assets; and
multiple keywords, wherein the keywords are ordered in the structured vocabulary according to inherent relationships between the terms, such that at least some of the digital media assets are associated with a keyword and at least one variant form of the keyword, and wherein each digital media asset is associated with an identifier and can be identified by matching a requested term with a keyword or at least one variant form of the keyword associated with one of the identifiers.
Dependent claims: 33
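A small sketch can illustrate keywords "ordered according to inherent relationships" alongside their variant forms. It assumes the relationship is a parent/child hierarchy and that a request for a broader keyword should also return assets tagged with narrower ones; neither assumption is stated in claim 32. All names and data (`PARENT`, `VARIANTS`, `ASSET_TAGS`) are hypothetical.

```python
# Keyword id -> parent keyword id (None marks a root of the hierarchy).
PARENT = {"animal": None, "cat": "animal", "dog": "animal", "kitten": "cat"}

# Keyword id -> variant written forms of that keyword.
VARIANTS = {
    "animal": {"動物", "どうぶつ"},
    "cat": {"猫", "ねこ", "ネコ"},
    "dog": {"犬", "いぬ", "イヌ"},
    "kitten": {"子猫", "こねこ"},
}

# Digital media asset identifier -> keyword id it is associated with.
ASSET_TAGS = {"img-001": "kitten", "img-002": "dog", "img-003": "cat"}


def resolve(term):
    """Map a requested term to a keyword id through any of its variant forms."""
    for keyword_id, forms in VARIANTS.items():
        if term in forms:
            return keyword_id
    return None


def descendants(keyword_id):
    """Collect the keyword and everything below it in the hierarchy."""
    result, frontier = {keyword_id}, [keyword_id]
    while frontier:
        current = frontier.pop()
        for child, parent in PARENT.items():
            if parent == current and child not in result:
                result.add(child)
                frontier.append(child)
    return result


def find_assets(term):
    keyword_id = resolve(term)
    if keyword_id is None:
        return []
    wanted = descendants(keyword_id)
    return sorted(a for a, tag in ASSET_TAGS.items() if tag in wanted)
```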
34. A method in a first computer system for retrieving a media content unit from a second computer system having a plurality of media content units that have been classified according to keyword terms of a structured vocabulary, comprising:
sending a request for a media content unit, the request specifying a search term; and
receiving an indication of at least one media content unit that corresponds to the specified search term, wherein the search term is located within the structured vocabulary and is used to determine at least one media content unit that corresponds to the search term, and wherein orthographic variations of the search term are automatically provided to assist in determining the at least one media content unit that corresponds to the search term.
Dependent claims: 35
36. A computer-implemented method to search for data in a database, the method comprising:
receiving a search request, wherein the request includes a search query term;
automatically determining any orthographic variations of the search query term; and
conducting a search of the database based on the search query term and any orthographic variations of the search query term.
Dependent claims: 37, 38, 39, 40, 41, 42, 43, 44
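As a concrete, deliberately narrow illustration of "automatically determining orthographic variations", the sketch below derives kana variants of a term without any vocabulary lookup. It relies on two facts about Unicode: NFKC normalization folds half-width katakana into full-width katakana, and hiragana U+3041 to U+3096 map to katakana by adding 0x60 to the code point. Kanji and romaji variants would need a vocabulary and are out of scope here. The function names and the list-of-strings "database" are assumptions.

```python
import unicodedata

HIRAGANA_FIRST, HIRAGANA_LAST = 0x3041, 0x3096
KANA_OFFSET = 0x60          # katakana code point = hiragana code point + 0x60


def to_katakana(text):
    return "".join(
        chr(ord(c) + KANA_OFFSET) if HIRAGANA_FIRST <= ord(c) <= HIRAGANA_LAST else c
        for c in text
    )


def to_hiragana(text):
    first, last = HIRAGANA_FIRST + KANA_OFFSET, HIRAGANA_LAST + KANA_OFFSET
    return "".join(
        chr(ord(c) - KANA_OFFSET) if first <= ord(c) <= last else c
        for c in text
    )


def orthographic_variations(term):
    """Return the normalized term plus its hiragana and katakana renderings."""
    base = unicodedata.normalize("NFKC", term)
    return {base, to_katakana(base), to_hiragana(base)}


def search(database, term):
    """Substring search over rows using the term and all derived variations."""
    variants = orthographic_variations(term)
    return [row for row in database if any(v in row for v in variants)]


# Hypothetical rows standing in for the database.
DATABASE = ["ねこの写真", "ネコとイヌ", "犬の散歩"]
```

In this sketch a query for "いぬ" finds the row containing "イヌ" but not the one containing "犬", which marks the boundary between rule-based kana conversion and variations that require a stored vocabulary.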
Specification