Method of thematic classification of documents, thematic classification module, and search engine incorporating such a module
First Claim
1. A method of thematically classifying documents, in particular for making up or updating thematic databases for a search engine, the method comprising the following steps:
manually and/or automatically selecting a sample of documents representative of each theme;
automatically identifying within the selected documents elements that are characteristic of each said theme;
automatically allocating a coefficient to each identified element, wherein said coefficient is representative of a relevance of said element to a corresponding theme;
downloading documents from a computer network;
for each downloaded document to be classified, identifying said theme-characterizing elements that are contained in the document for each said theme, and for each theme corresponding to the elements, using the coefficients allocated to said elements to calculate a characteristic value representative of the relevance of that theme for the document, in order to decide whether or not the document relates to the theme, said theme-characterizing elements identification and calculation steps being performed automatically for each document downloaded from the computer network;
automatically classifying the downloaded documents as a function of themes with which they deal;
automatically storing the documents classified thematically in databases that can be interrogated on the basis of themes contained in a request; and
making the databases available to users who interrogate the databases on the basis of themes contained in a request;
and the step of allocating said coefficient to each identified element comprises the following steps for each theme:
automatically calculating a frequency of the element in the selected documents relating to the theme;
automatically calculating a frequency of the element in the selected documents that do not relate to the theme; and
automatically calculating a ratio of the calculated frequencies of the theme-related element and of the non-theme-related element.
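The coefficient-allocation steps at the end of claim 1 can be illustrated with a minimal Python sketch. The function names, the whitespace tokenization, the reading of "frequency" as the fraction of sample documents containing the element, and the small `eps` term that avoids division by zero are all assumptions made for illustration; the claim itself only specifies a ratio of the two frequencies.

```python
from collections import Counter

def document_frequency(docs):
    """Fraction of documents in which each element (here: a lowercase word) appears."""
    counts = Counter()
    for text in docs:
        counts.update(set(text.lower().split()))
    n = len(docs)
    return {element: c / n for element, c in counts.items()} if n else {}

def allocate_coefficients(samples, theme, eps=1e-6):
    """Ratio of an element's frequency in theme documents to its frequency elsewhere.

    `samples` maps a theme name to its list of sample documents (hypothetical layout).
    `eps` is an assumed smoothing term, not part of the claim.
    """
    in_theme = document_frequency(samples[theme])
    others = [d for t, docs in samples.items() if t != theme for d in docs]
    out_theme = document_frequency(others)
    return {e: f / (out_theme.get(e, 0.0) + eps) for e, f in in_theme.items()}

# Tiny hypothetical sample: two themes, two documents each.
samples = {
    "sports": ["the match ended in a draw", "the team won the match"],
    "finance": ["the market fell", "the bank raised rates"],
}
coeffs = allocate_coefficients(samples, "sports")
```

In this toy sample, a word such as "match" that appears only in the sports documents receives a much larger coefficient than "the", which is equally frequent in both groups and therefore has a ratio close to 1.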
Abstract
A method of thematically classifying documents, in particular for making up or updating thematic databases (42) for a search engine, includes the steps of selecting documents representative of each theme, identifying within the selected documents, elements that are characteristic of each theme, allocating a coefficient (R) to each identified element, said coefficient being representative of the relevance of said element relative to the corresponding theme, and for each document (50) for classification, identifying said elements characteristic of each theme contained in the document and, for each theme corresponding thereto, using the coefficients allocated to said elements to calculate the value of a characteristic representative of the relevance of the theme for the document (50), in order to decide whether or not the document relates to the theme.
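One way to picture the classification step described in the abstract is the sketch below. It assumes that the characteristic value is the sum of the coefficients R of the theme elements found in the document, and that the decision is a fixed threshold; neither choice is stated in the abstract, and the coefficient values, the `threshold`, and the function names are hypothetical.

```python
def characteristic_value(document, coefficients):
    """Sum the coefficients of the theme-characteristic elements present in the document."""
    words = set(document.lower().split())
    return sum(r for element, r in coefficients.items() if element in words)

def classify(document, theme_coefficients, threshold=1.0):
    """Return the themes whose characteristic value reaches the assumed threshold."""
    return [
        theme
        for theme, coefficients in theme_coefficients.items()
        if characteristic_value(document, coefficients) >= threshold
    ]

# Hypothetical coefficients R for two themes.
theme_coefficients = {
    "sports": {"match": 0.8, "team": 0.6, "goal": 0.7},
    "finance": {"market": 0.9, "bank": 0.7, "rates": 0.5},
}
themes = classify("The team scored a late goal in the match", theme_coefficients)
```

A document can relate to several themes under this rule, or to none, since each theme is scored independently.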
13 Claims
- 9. A module for thematically classifying documents, in particular for a search engine, the module comprising a central processor unit having means for comparing elements extracted from each document with elements characteristic of various themes, each element being allocated a coefficient representative of a relevance of said element for a corresponding theme, and means for calculating a characteristic value representative of the relevance of a theme for the document on the basis of the coefficients of said characteristic elements that the document contains, in order to decide whether or not the document relates to said theme, said central processor unit being coupled to means for storing documents classified by theme that can be interrogated on the basis of themes contained in a request, and the module has means for calculating a frequency of the element in the documents relating to the theme, means for calculating a frequency of the element in the documents that do not relate to the theme, and means for calculating a ratio between the calculated frequencies of the theme-related element and of the non-theme-related element.
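The module of claim 9 couples the comparison and calculation means to storage that can be interrogated by theme. A rough Python sketch of that arrangement follows. The class name `ThematicClassifier`, the in-memory dictionary used as storage, the summation of coefficients, and the threshold are illustrative assumptions, not details taken from the claim.

```python
from collections import defaultdict

class ThematicClassifier:
    """Toy stand-in for the claimed module: compare, score, store, and query by theme."""

    def __init__(self, theme_coefficients, threshold=1.0):
        self.theme_coefficients = theme_coefficients  # theme -> {element: coefficient}
        self.threshold = threshold                    # assumed decision rule
        self.store = defaultdict(list)                # theme -> classified documents

    def add(self, document):
        """Score the document against every theme and store it under each matching theme."""
        words = set(document.lower().split())
        matched = []
        for theme, coefficients in self.theme_coefficients.items():
            value = sum(r for element, r in coefficients.items() if element in words)
            if value >= self.threshold:
                self.store[theme].append(document)
                matched.append(theme)
        return matched

    def query(self, theme):
        """Return the documents stored under the requested theme."""
        return list(self.store.get(theme, []))

# Hypothetical coefficients for a single theme.
module = ThematicClassifier({"sports": {"match": 0.8, "team": 0.6}})
matched = module.add("The team won the match")
skipped = module.add("A quiet match")
```

Here `query("sports")` plays the role of a request interrogating the thematic store; a real module would sit in front of a persistent database.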
Specification