Information processing for searching categorizing information in a document based on a categorization hierarchy and extracted phrases
Abstract
Information in a document is processed automatically so that it can be incorporated into databases to be searched, retrieved and learned. This can significantly enhance the categorizing of information in a domain, so that the information can be systematically and efficiently retrieved when needed. In one approach, the context or domain of the document is determined first. Then, domain-specific phrases in the document are automatically extracted based on grammar and dictionaries. From these phrases, categories in a category hierarchy are identified, and the document is linked to those categories. Phrases in the document that cannot be categorized are identified for analysis. If these new phrases are relevant, new categories may be created, based on suggestions provided, to categorize them. Later, when a user asks a question related to the categorized phrases, the corresponding categories are identified and the document is retrieved to respond to the question. In one approach, the question is in natural language.
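The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hierarchy, the dictionary, the document id and all function names are hypothetical. Phrase extraction here uses dictionary lookup only (no grammar), and the domain-determination step is skipped.

```python
import re
from collections import defaultdict

# Hypothetical toy hierarchy: category path -> phrases in that category.
HIERARCHY = {
    "travel/flights": {"boarding pass", "flight"},
    "travel/hotels": {"room service", "hotel"},
}

# Hypothetical domain dictionary. Entries absent from HIERARCHY are the
# "new" phrases that get flagged for analysis.
DICTIONARY = {"boarding pass", "flight", "hotel", "room service", "jet lag"}


def extract_phrases(text, dictionary=DICTIONARY, max_words=3):
    """Return dictionary phrases found in text, preferring the longest match."""
    words = re.findall(r"[a-z]+", text.lower())
    found, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in dictionary:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts at this word
    return found


def categorize(phrases, hierarchy=HIERARCHY):
    """Split phrases into {category: [phrases]} and a list of uncategorized ones."""
    categorized, uncategorized = defaultdict(list), []
    for phrase in phrases:
        hits = [cat for cat, members in hierarchy.items() if phrase in members]
        if hits:
            for cat in hits:
                categorized[cat].append(phrase)
        else:
            uncategorized.append(phrase)
    return dict(categorized), uncategorized


def index_document(doc_id, text, index):
    """Link doc_id to each matched category; return phrases left for analysis."""
    categorized, uncategorized = categorize(extract_phrases(text))
    for cat in categorized:
        index[cat].add(doc_id)
    return uncategorized


def answer(question, index):
    """Return ids of documents linked to the categories of the question's phrases."""
    categorized, _ = categorize(extract_phrases(question))
    docs = set()
    for cat in categorized:
        docs |= index.get(cat, set())
    return docs


index = defaultdict(set)
pending = index_document(
    "doc-1", "Print your boarding pass before the flight to avoid jet lag.", index
)
print(pending)
print(answer("Where is my boarding pass?", index))
```

In this sketch, the phrases returned by `index_document` are the ones a reviewer would inspect before deciding whether a new category is warranted.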
20 Claims
1. A method to process a document from a Web site, based on a categorization hierarchy which has a plurality of categories, each category including one or more phrases, the method comprising:
extracting phrases from the document;
categorizing at least one of the extracted phrases under a category of the categorization hierarchy; and
identifying at least one of the extracted phrases that cannot be categorized into the categorization hierarchy for analysis;
such that information in the document can be appropriately categorized and the document can be systematically retrieved by a natural language responding engine when needed;
wherein the location of the document is related to a URL;
wherein the document includes at least an image when the document is displayed on the Web site, and the method includes not categorizing the image; and
wherein the document includes at least a phrase that is hidden when the document is displayed on the Web site, and the method includes extracting that phrase for categorizing.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
a first phrase is related to a first category;
a second phrase is related to a second category; and
if the first phrase precedes the second phrase in normal usage, then the first category and the second category are grouped together in the categorization hierarchy.
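One way to read the grouping condition above is sketched below. The claim does not say how "normal usage" is observed, so this sketch assumes a sample of phrase sequences and treats "precedes" as "appears immediately before". The function name and the example data are hypothetical.

```python
def group_categories(phrase_to_category, usage_sequences):
    """Group categories whose phrases appear one directly before the other."""
    # Start with every category in its own group.
    groups = {cat: {cat} for cat in phrase_to_category.values()}
    for sequence in usage_sequences:
        for first, second in zip(sequence, sequence[1:]):
            cat_a = phrase_to_category.get(first)
            cat_b = phrase_to_category.get(second)
            if cat_a is None or cat_b is None:
                continue  # at least one phrase has no category
            if groups[cat_a] is groups[cat_b]:
                continue  # already grouped together
            merged = groups[cat_a] | groups[cat_b]
            for cat in merged:
                groups[cat] = merged
    # Deduplicate the shared set objects and return them in a stable order.
    unique = {id(group): group for group in groups.values()}
    return sorted(sorted(group) for group in unique.values())


phrase_to_category = {
    "check in": "arrival",
    "room key": "rooms",
    "invoice": "billing",
}
usage_sequences = [["check in", "room key"]]
print(group_categories(phrase_to_category, usage_sequences))
```

Merging whole groups rather than single pairs keeps the grouping transitive. That is a design choice of this sketch, not something the claim requires.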
11. An apparatus to process a document from a Web site, based on a categorization hierarchy, which has a plurality of categories, each category including one or more phrases, the apparatus comprising:
an extractor configured to extract phrases from the document;
a categorizer configured to categorize at least one of the extracted phrases under a category of the categorization hierarchy; and
an identifier configured to identify at least one of the extracted phrases that cannot be categorized into the categorization hierarchy for analysis;
such that information in the document can be appropriately categorized and the document can be systematically retrieved by a natural language responding engine when needed; and
wherein the location of the document is related to a URL;
wherein the document includes at least an image when the document is displayed on the Web site, and the method includes not categorizing the image; and
wherein the document includes at least a phrase that is hidden when the document is displayed on the Web site, and the method includes extracting that phrase for categorizing.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
a first phrase is related to a first category;
a second phrase is related to a second category; and
if the first phrase precedes the second phrase in normal usage, then the first category and the second category are grouped together in the categorization hierarchy.
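The wherein clauses of claims 1 and 11 about images and hidden phrases can be illustrated with a small sketch built on Python's standard `html.parser` module. The claims do not define what makes a phrase "hidden", so this sketch assumes two cases: a `<meta name="keywords">` tag and text inside an element styled `display:none`. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PhraseSourceParser(HTMLParser):
    """Collect text to hand to a phrase extractor; images contribute nothing."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []
        self.images_skipped = 0
        self._ignored = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._ignored += 1
        elif tag == "img":
            # Images are counted but never categorized; alt text is ignored too.
            self.images_skipped += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Assumed example of a phrase that is not displayed on the page.
            self.text_chunks.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._ignored:
            self._ignored -= 1

    def handle_data(self, data):
        # The parser does not apply CSS, so text hidden with display:none
        # arrives here like any other text and is kept for categorizing.
        if not self._ignored and data.strip():
            self.text_chunks.append(data.strip())


def text_for_categorizing(html):
    """Return (text chunks, number of images skipped) for an HTML document."""
    parser = PhraseSourceParser()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.images_skipped


sample = (
    '<html><head><meta name="keywords" content="boarding pass"></head>'
    '<body><p>Visible flight info</p><img src="a.png" alt="plane">'
    '<span style="display:none">jet lag</span></body></html>'
)
chunks, skipped = text_for_categorizing(sample)
print(chunks, skipped)
```

The returned chunks would then go to whatever extractor and categorizer the apparatus uses; only the text-gathering step is shown here.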
Specification