Method and system for mining a document containing dirty text
Abstract
A method and system for mining a document containing dirty text. Dirty text is removed or replaced, and the document is then processed using a variety of text mining techniques. In one embodiment, dirty text removal and replacement occur in two stages. In the first stage, a general cleaning is applied to all documents without regard to their domain or the mining task to be performed. In the second stage, cleaning is specific to the anomalies of the domain and the mining task to be performed. In a third stage, which follows the two cleaning stages, the document is processed using a variety of data mining techniques according to the mining task. In one embodiment, the present invention scores and ranks sentences in a document according to their relevance, extracts the highest ranked sentences, and presents a summary. The present invention allows users to leverage existing domain knowledge and can be customized according to the domain and task requirements.
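The staged flow described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not taken from the patent: the function names (`general_clean`, `domain_clean`, `summarize`, `mine`), the idea that general cleaning means stripping markup and control characters, the hypothetical `DOMAIN_REPLACEMENTS` table, and the use of average word frequency as the relevance score.

```python
import re
from collections import Counter

# Hypothetical domain dictionary; the patent does not list specific entries.
DOMAIN_REPLACEMENTS = {"pls": "please", "cust": "customer", "acct": "account"}


def general_clean(text):
    """Stage 1 (assumed): domain-independent cleaning of markup and control characters."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML-like tags
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)  # drop control characters
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace


def domain_clean(text, replacements=DOMAIN_REPLACEMENTS):
    """Stage 2 (assumed): replace domain-specific shorthand using a lookup table."""
    def swap(match):
        word = match.group(0)
        return replacements.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)


def summarize(text, top_n=1):
    """Stage 3 (assumed scoring): rank sentences by average word frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text))

    def score(sentence):
        words = re.findall(r"[A-Za-z]+", sentence)
        return sum(freq[w.lower()] for w in words) / len(words) if words else 0.0

    ranked = sorted(sentences, key=score, reverse=True)[:top_n]
    # Present the extracted sentences in their original document order.
    return [s for s in sentences if s in ranked]


def mine(raw, top_n=1):
    """Run both cleaning stages, then the mining stage."""
    return summarize(domain_clean(general_clean(raw)), top_n)
```

Calling `mine(raw_text, top_n=3)` would return up to three of the highest scoring sentences from the cleaned text. Keeping the two cleaning stages as separate functions mirrors the abstract's split between general and domain-specific cleaning, so a different domain table or scoring rule can be swapped in without touching the rest.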
30 Claims
1. A method for mining a document containing dirty text comprising:
removing an instance of dirty text within said document to produce a cleaned document; and
performing a data mining operation on said cleaned document.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A computer system comprising:
a bus;
a memory unit coupled to said bus; and
a processor coupled to said bus, said processor for executing a method for mining a document containing dirty text comprising:
removing an instance of dirty text within said document to produce a cleaned document; and
performing a data mining operation on said cleaned document.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19, 20
21. A computer-usable medium having computer-readable program code embodied therein for causing a computer system to perform the steps of:
removing an instance of dirty text within said document to produce a cleaned document; and
performing a data mining operation on said cleaned document.
Dependent claims: 22, 23, 24, 25, 26, 27, 28, 29, 30
Specification