Method and apparatus for incorporating metadata in data clustering

US 7,809,718 B2
Filed: 01/15/2008
Issued: 10/05/2010
Est. Priority Date: 01/29/2007
Status: Expired due to Fees

Abstract

Documents in a high-density data stream are clustered. Incoming documents are analyzed to find metadata, such as words in a document's headline or abstract and the people, places, and organizations discussed in the document. The metadata is emphasized relative to other words found in the document. A single feature vector determined for each document from the emphasized metadata accordingly takes the importance of such words into account, improving clustering efficacy and efficiency.
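
The idea in the abstract can be illustrated with a minimal Python sketch, which is not the patented implementation. It assumes each document is a dict with hypothetical `headline` and `body` fields, treats headline words as the metadata, and applies an assumed multiplicative `boost` to their TF-IDF values. The function name `tfidf_vectors`, the whitespace tokenizer, and the particular TF-IDF variant (relative term frequency times the natural log of the inverse document frequency) are all illustrative choices.

```python
import math
from collections import Counter

def tfidf_vectors(docs, boost=2.0):
    """Build one {word: weight} vector per document.

    docs is a list of dicts with 'headline' and 'body' strings
    (a hypothetical layout chosen for this sketch). Words found in
    the headline are treated as metadata and scaled by `boost`.
    """
    tokenized = []
    for doc in docs:
        headline = doc["headline"].lower().split()
        body = doc["body"].lower().split()
        tokenized.append((set(headline), Counter(headline + body)))

    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter()
    for _, counts in tokenized:
        df.update(counts.keys())

    vectors = []
    for meta_words, counts in tokenized:
        total = sum(counts.values())
        vec = {}
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[word])
            # Emphasize metadata words relative to the other words.
            emphasis = boost if word in meta_words else 1.0
            vec[word] = emphasis * tf * idf
        vectors.append(vec)
    return vectors

docs = [
    {"headline": "turbine failure",
     "body": "the turbine failure was reported at the plant"},
    {"headline": "market update",
     "body": "the market closed higher after the report"},
]
vectors = tfidf_vectors(docs)
```

In this toy corpus the word "the" occurs in both documents, so its inverse document frequency is zero and it contributes nothing to either vector, while the headline word "turbine" ends up with a larger weight than the body-only word "plant".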

12 Claims

1. A method of clustering a plurality of documents from a data stream comprising:
- identifying, by a processor, metadata in the plurality of documents;
  
  emphasizing, by the processor, one or more words corresponding to the metadata;
  
  generating, by the processor, a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The method of claim 1 wherein the plurality of documents are text articles received from an article server.
  - 3. The method of claim 2 wherein the data stream is a continuous stream.
  - 4. The method of claim 1 wherein said identifying metadata in the plurality of documents comprises:
    - selecting a subset of words based on parameters of the words with respect to the document; and
      
      identifying a feature vector corresponding to each of the words in the subset of words.
  - 5. The method of claim 4 wherein the parameters of the words comprise at least one of:
    - locations of the words in the documents, physical location names detected in the documents, person names detected in the documents, and organization names detected in the document.
  - 6. The method of claim 5 wherein the locations of the words are selected from the group of a headline, an abstract, a category, and a title.
  - 7. The method of claim 1 wherein said emphasizing one or more words corresponding to the metadata comprises emphasizing the one or more words with one or more multiplicative weights.
  - 8. The method of claim 1 wherein said emphasizing one or more words corresponding to the metadata comprises emphasizing the one or more words with one or more additive weights.
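
Claims 4 through 8 describe selecting metadata words by where they appear and emphasizing them with multiplicative or additive weights. A minimal Python sketch of those two steps follows, under stated assumptions: documents are dicts keyed by field name, the per-word values are already computed (for example by TF-IDF), and the helper names `select_metadata_words` and `emphasize` are hypothetical. Detection of person, place, and organization names (claim 5) would need a named-entity recognizer and is left out.

```python
def select_metadata_words(doc, fields=("headline", "abstract", "category", "title")):
    """Collect the words that appear in the named fields of a document dict."""
    words = set()
    for field in fields:
        words.update(doc.get(field, "").lower().split())
    return words

def emphasize(values, metadata_words, weight, mode="multiplicative"):
    """Return a copy of `values` with the metadata words boosted.

    mode="multiplicative" scales each metadata word's value by `weight`;
    mode="additive" adds `weight` to it. Other words are unchanged.
    """
    if mode not in ("multiplicative", "additive"):
        raise ValueError(f"unknown mode: {mode}")
    boosted = {}
    for word, value in values.items():
        if word not in metadata_words:
            boosted[word] = value
        elif mode == "multiplicative":
            boosted[word] = value * weight
        else:
            boosted[word] = value + weight
    return boosted

doc = {"headline": "Turbine failure",
       "body": "inspection followed the turbine failure"}
meta = select_metadata_words(doc)

# Illustrative per-word values, not derived from any real corpus.
values = {"turbine": 0.5, "failure": 0.25, "inspection": 0.5, "the": 0.0}
scaled = emphasize(values, meta, weight=2.0)
shifted = emphasize(values, meta, weight=0.5, mode="additive")
```

One practical difference between the two modes: a multiplicative weight leaves a metadata word whose value is zero at zero, whereas an additive weight lifts it above zero.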

9. An apparatus for clustering a plurality of documents from a data stream comprising:
- a memory device for storing a program;
  
  a processor in communication with the memory device, the processor comprising;
  
  means for identifying metadata in the plurality of documents;
  
  means for emphasizing one or more words corresponding to the metadata;
  
  means for generating a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
- Dependent claims (10)
  - 10. The apparatus of claim 9 wherein the means for identifying metadata in the plurality of documents comprises:
    - means for selecting a subset of words based on parameters of the words with respect to the document; and
      
      means for identifying a feature vector corresponding to each of the words in the subset of words.

11. A non-transitory computer-readable storage medium having program instructions stored thereon, the instructions defining the steps of:
- clustering a plurality of documents from a data stream by;
  
  identifying metadata in the plurality of documents;
  
  emphasizing one or more words corresponding to the metadata;
  
  generating a single feature vector for each of the plurality of documents based at least in part on the emphasized words by determining a numerical value for each word in one or more of the plurality of documents by determining a Term Frequency Inverse Document Frequency (TFIDF);
- Dependent claims (12)
  - 12. The computer-readable storage medium of claim 11, wherein the instructions for identifying metadata in the plurality of documents further define the steps of:
    - selecting a subset of words based on parameters of the words with respect to the document; and
      
      identifying a feature vector corresponding to each of the words in the subset of words.

Current Assignee: Siemens Corp. (Siemens AG)
Original Assignee: Siemens Corp. (Siemens AG)
Inventors: Moerchen, Fabian; Brinker, Klaus
Primary Examiner(s): Mofiz; Apu M
Assistant Examiner(s): Le; Jessica N
Application Number: US12/008,886
Publication Number: US 20080183665A1
Time in Patent Office: 994 Days
Field of Search: 707/2, 707/705, 707/722-735, 707/737
US Class Current: 707/722
CPC Class Codes: G06F 16/355 Class or cluster creation o...
