Method and apparatus for incorporating metadata in data clustering
Abstract
Documents in a high-density data stream are clustered. Incoming documents are analyzed to find metadata, such as words in a document's headline or abstract and the people, places, and organizations discussed in the document. The metadata is emphasized relative to other words found in the document. A single feature vector determined for each document based on the emphasized metadata accordingly takes into account the importance of such words, and clustering efficacy and efficiency are improved.
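The emphasis step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how emphasis is implemented, so this assumes one plausible approach: each word that appears in a document's metadata (for example its headline) has its count raised by a fixed factor. The function name `emphasized_counts`, the `boost` parameter, and the sample text are hypothetical and are not taken from the patent.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def emphasized_counts(body, metadata_text, boost=3):
    """Count words in `body`, then add `boost` for every metadata word.

    Assumption: emphasis is modeled as extra weight on the raw term count.
    The patent text shown here does not specify the mechanism.
    """
    counts = Counter(tokenize(body))
    for word in tokenize(metadata_text):
        counts[word] += boost
    return counts


# Hypothetical document: the headline words serve as metadata.
doc_counts = emphasized_counts(
    body="Shares of Acme rose after the merger was approved.",
    metadata_text="Acme merger approved",
    boost=3,
)
```

With these inputs, words that appear in the headline end up with a larger count than ordinary body words, so any weighting computed from the counts would reflect their added importance.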
12 Claims
1. A method of clustering a plurality of documents from a data stream comprising:
- identifying, by a processor, metadata in the plurality of documents;
- emphasizing, by the processor, one or more words corresponding to the metadata;
- generating, by the processor, a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. An apparatus for clustering a plurality of documents from a data stream comprising:
- a memory device for storing a program;
- a processor in communication with the memory device, the processor comprising;
  - means for identifying metadata in the plurality of documents;
  - means for emphasizing one or more words corresponding to the metadata;
  - means for generating a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
Dependent claims: 10
11. A non-transitory computer-readable storage medium having program instructions stored thereon, the instructions defining the steps of:
- clustering a plurality of documents from a data stream by;
  - identifying metadata in the plurality of documents;
  - emphasizing one or more words corresponding to the metadata;
  - generating a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
Dependent claims: 12
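The feature-vector step recited in the independent claims can be sketched as follows. The claims name TFIDF but the text shown here does not give a formula, so this assumes one common formulation: term frequency is a word's count divided by the document's total count, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The function `tfidf_vectors` and the sample counts are hypothetical; the counts are assumed to already include any metadata emphasis.

```python
import math
from collections import Counter


def tfidf_vectors(doc_counts):
    """Build one TF-IDF vector per document over a shared, sorted vocabulary.

    `doc_counts` is a list of word-count mappings, one per document.
    Assumed formula (not from the patent): tf = count / total, idf = ln(N / df).
    """
    vocabulary = sorted(set().union(*doc_counts))
    n_docs = len(doc_counts)
    # Document frequency: how many documents contain each word.
    df = Counter(word for counts in doc_counts for word in counts)
    vectors = []
    for counts in doc_counts:
        total = sum(counts.values()) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts.get(w, 0) / total) * math.log(n_docs / df[w]) for w in vocabulary]
        )
    return vocabulary, vectors


# Hypothetical counts for two documents; "acme" and "merger" were emphasized in the first.
docs = [
    Counter({"acme": 4, "merger": 4, "shares": 1}),
    Counter({"acme": 1, "earnings": 2}),
]
vocabulary, vectors = tfidf_vectors(docs)
```

Each document yields a single vector with one numerical value per vocabulary word. A word that occurs in every document (here "acme") receives a value of zero under this formulation, while an emphasized word that is rare across documents (here "merger") receives a comparatively large value.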
Specification