Methods and apparatus for clustering news content

US 8,225,190 B1
Filed: 12/24/2008
Issued: 07/17/2012
Est. Priority Date: 09/20/2002
Status: Active Grant

Abstract

Methods and apparatus are described for scoring documents in response, in part, to parameters related to the document, source, and/or cluster score. Methods and apparatus are also described for scoring a cluster in response, in part, to parameters related to documents within the cluster and/or sources corresponding to the documents within the cluster. In one embodiment, the invention may identify the source; detect a plurality of documents published by the source; analyze the plurality of documents with respect to at least one parameter; and determine a source score for the source in response, in part, to the parameter. In another embodiment, the invention may identify a topic; identify a plurality of clusters in response to the topic; analyze at least one parameter corresponding to each of the plurality of clusters; and calculate a cluster score for each of the plurality of clusters in response, in part, to the parameter.
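
The second embodiment in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which cluster parameter is analyzed or how it is aggregated, so the averaging of per-source scores below is an assumption made for illustration only. The names `Document`, `cluster_score`, and `score_clusters_for_topic` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Document:
    source: str  # identifier of the publishing source
    topic: str   # topic label assigned to the document


def cluster_score(cluster, source_scores):
    """Average the source scores of a cluster's documents (assumed aggregation)."""
    if not cluster:
        return 0.0
    total = sum(source_scores.get(doc.source, 0.0) for doc in cluster)
    return total / len(cluster)


def score_clusters_for_topic(topic, clusters, source_scores):
    """Score every cluster holding at least one document on the given topic."""
    return {
        name: cluster_score(docs, source_scores)
        for name, docs in clusters.items()
        if any(doc.topic == topic for doc in docs)
    }
```

Averaging is only one possible choice; a sum, a maximum, or a weighted combination with document-level parameters would fit the abstract's wording equally well.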

56 Citations

33 Claims

1. A computer-implemented method comprising:
- identifying, by one or more processors, a plurality of documents published online by a source;
  
  calculating, by the one or more processors, a measure of freshness, for the plurality of documents published by the source, where the measure of freshness is derived from a difference between a time that the source published the plurality of documents and a time that a news event, described by the plurality of documents, occurred; and
  
  deriving, by the one or more processors, a source score for the source based, at least in part, on the measure of freshness.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11):
  - 2. The method of claim 1, where the time that the source published the plurality of documents is determined from a time stamp.
  - 3. The method of claim 1, where calculating the measure of freshness further comprises:
    - calculating a frequency with which the source, during a particular time period, published canonical documents.
  - 4. The method of claim 1, where calculating the measure of freshness further comprises:
    - calculating a measure of quality for the plurality of documents published by the source.
  - 5. The method of claim 4, where the measure of quality is calculated based on a number of views of the plurality of documents that occur within a particular time frame.
  - 6. The method of claim 4, where the measure of quality is calculated based on one or more circulation statistics associated with the source.
  - 7. The method of claim 1, where deriving the source score further comprises:
    - identifying a number of documents in the plurality of documents, and deriving the source score based on the number of documents.
  - 8. The method of claim 1, where deriving the source score further comprises:
    - deriving a measure of originality for the plurality of documents based on whether duplicates, of the plurality of documents, exist that were published before the plurality of documents were published by the source.
  - 9. The method of claim 1, further comprising:
    - identifying a first document, of the plurality of documents, that is a duplicate of a second document in the plurality of documents; and
      
      removing the first document from the plurality of documents.
  - 10. The method of claim 9, where identifying the first document that is a duplicate of the second document comprises:
    - comparing text of the first document to text of the second document.
  - 11. The method of claim 1, further comprising:
    - categorizing the source by comparing the source score to one or more category thresholds.
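
Claims 1, 7, and 11 can be read together as a small pipeline: measure freshness from the gap between publication and event, fold it into a source score, then compare that score to category thresholds. The claims give no formulas, so everything numeric below is an assumption: the exponential decay with a six-hour half-life, the 0.8/0.2 weighting, the volume cap of 10, and the threshold values. The tier labels are hypothetical too.

```python
from datetime import datetime, timedelta


def freshness(publish_times, event_time, half_life=timedelta(hours=6)):
    """Mean freshness in (0, 1]; 1.0 means published when the event occurred.

    Assumed form: each document's value halves for every `half_life` of lag.
    """
    if not publish_times:
        return 0.0
    total = 0.0
    for published in publish_times:
        lag = max(published - event_time, timedelta(0))
        total += 0.5 ** (lag / half_life)
    return total / len(publish_times)


def source_score(freshness_value, doc_count, volume_cap=10, freshness_weight=0.8):
    """Blend freshness with the number of documents (assumed weighting)."""
    volume = min(doc_count, volume_cap) / volume_cap
    return freshness_weight * freshness_value + (1 - freshness_weight) * volume


def categorize(score, thresholds=((0.75, "tier_1"), (0.4, "tier_2"))):
    """Return the label of the first threshold the score meets.

    `thresholds` must be ordered from highest minimum to lowest.
    """
    for minimum, label in thresholds:
        if score >= minimum:
            return label
    return "unrated"
```

With these assumed parameters, a source that publishes one document at the moment of the event and another six hours later has a freshness of 0.75.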

12. A system, comprising:
- a processor; and
  
  a memory to store one or more instructions, which when executed by the processor, cause the processor to:
  
  identify a plurality of documents published online by a source;
  
  calculate a measure of freshness, for the plurality of documents published by the source, where the measure of freshness is derived from a difference between a time that the source published the plurality of documents and a time that a news event, described by the plurality of documents, occurred; and
  
  derive a source score for the source based, at least in part, on the measure of freshness.
- Dependent claims (13, 14, 15, 16, 17, 18, 19, 20, 21, 22):
  - 13. The system of claim 12, where the memory further stores one or more instructions to cause the processor to:
    - determine the time that the source published the plurality of documents from a time stamp.
  - 14. The system of claim 12, where, when the processor is to calculate the measure of freshness, the processor further is to:
    - calculate a frequency with which the source, during a particular time period, published canonical documents.
  - 15. The system of claim 12, where, when the processor is to calculate the measure of freshness, the processor further is to:
    - calculate a measure of quality for the plurality of documents published by the source.
  - 16. The system of claim 15, where, when the processor is to calculate the measure of quality, the processor further is to:
    - calculate the measure of quality based on a number of views of the plurality of documents that occur within a particular time frame.
  - 17. The system of claim 15, where, when the processor is to calculate the measure of quality, the processor further is to:
    - calculate the measure of quality based on one or more circulation statistics associated with the source.
  - 18. The system of claim 12, where, when the processor is to derive the source score, the processor further is to:
    - identify a number of documents in the plurality of documents, and derive the source score based on the number of documents.
  - 19. The system of claim 12, where, when the processor is to derive the source score, the processor further is to:
    - derive a measure of originality for the plurality of documents based on whether duplicates, of the plurality of documents, exist that were published before the plurality of documents were published by the source.
  - 20. The system of claim 12, where the memory further stores one or more instructions to cause the processor to:
    - identify a first document, of the plurality of documents, that is a duplicate of a second document in the plurality of documents; and
      
      remove the first document from the plurality of documents.
  - 21. The system of claim 20, where, when the processor is to identify the first document that is a duplicate of the second document, the processor further is to:
    - compare text of the first document to text of the second document.
  - 22. The system of claim 12, where the memory further stores one or more instructions to cause the processor to:
    - categorize the source by comparing the source score to one or more category thresholds.
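
The duplicate handling in claims 20 and 21 (and the parallel method claims 9 and 10) can be sketched as follows. The claims say only that text is compared; the whitespace and case normalization, the use of Python's `difflib.SequenceMatcher`, and the 0.9 similarity threshold are assumptions chosen for illustration. The helper names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_duplicate(first, second, threshold=0.9):
    """Treat two texts as duplicates when their similarity ratio meets the threshold."""
    ratio = SequenceMatcher(None, normalize(first), normalize(second)).ratio()
    return ratio >= threshold


def remove_duplicates(documents, threshold=0.9):
    """Keep the first copy of each document and drop later duplicates."""
    kept = []
    for text in documents:
        if not any(is_duplicate(text, earlier, threshold) for earlier in kept):
            kept.append(text)
    return kept
```

Keeping the first-seen copy is a design choice of this sketch; the claims do not state which of the two documents should be the one removed beyond calling it the "first document".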

23. A non-transitory memory device that stores one or more computer-executable instructions for execution by one or more processors, the instructions, comprising:
- one or more instructions, which, when executed by the one or more processors, cause the one or more processors to:
  
  identify a plurality of documents published online by a source;
  
  calculate a measure of freshness, for the plurality of documents published by the source, where the measure of freshness is derived from a difference between a time that the source published the plurality of documents and a time that a news event, described by the plurality of documents, occurred; and
  
  derive a source score for the source based, at least in part, on the measure of freshness.
- Dependent claims (24, 25, 26, 27, 28, 29, 30, 31, 32, 33):
  - 24. The memory device of claim 23, further comprising:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to determine the time that the source published the plurality of documents from a time stamp.
  - 25. The memory device of claim 23, where the one or more instructions to calculate the measure of freshness further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to calculate a frequency with which the source, during a particular time period, published canonical documents.
  - 26. The memory device of claim 23, where the one or more instructions to calculate the measure of freshness further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to calculate a measure of quality for the plurality of documents published by the source.
  - 27. The memory device of claim 26, where the one or more instructions to calculate the measure of quality further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to calculate the measure of quality based on a number of views of the plurality of documents that occur within a particular time frame.
  - 28. The memory device of claim 26, where the one or more instructions to calculate the measure of quality further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to calculate the measure of quality based on one or more circulation statistics associated with the source.
  - 29. The memory device of claim 23, where the one or more instructions to derive the source score further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to:
      
      identify a number of documents in the plurality of documents, and derive the source score based on the number of documents.
  - 30. The memory device of claim 23, where the one or more instructions to derive the source score further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to derive a measure of originality for the plurality of documents based on whether duplicates, of the plurality of documents, exist that were published before the plurality of documents were published by the source.
  - 31. The memory device of claim 23, further comprising:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to:
      
      identify a first document, of the plurality of documents, that is a duplicate of a second document in the plurality of documents; and
      
      remove the first document from the plurality of documents.
  - 32. The memory device of claim 31, where the one or more instructions to identify the first document that is a duplicate of the second document further comprise:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to compare text of the first document to text of the second document.
  - 33. The memory device of claim 23, further comprising:
    - one or more instructions, which, when executed by the one or more processors, cause the one or more processors to categorize the source by comparing the source score to one or more category thresholds.
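
The originality measure in claim 30 (and in claims 8 and 19) turns on whether a duplicate was published earlier than the source's own copy. A minimal sketch, assuming originality is the fraction of a source's documents with no earlier-published duplicate elsewhere; the claims do not fix this formula, and the function name, the `(text, published)` tuple layout, and the pluggable `is_duplicate` callback are all hypothetical.

```python
from datetime import datetime


def originality(source_docs, other_docs, is_duplicate):
    """Fraction of the source's documents that no other outlet published first.

    Both document lists hold (text, published) pairs; `is_duplicate` is any
    callable that decides whether two texts are duplicates.
    """
    if not source_docs:
        return 0.0
    original = 0
    for text, published in source_docs:
        copied = any(
            other_published < published and is_duplicate(text, other_text)
            for other_text, other_published in other_docs
        )
        if not copied:
            original += 1
    return original / len(source_docs)
```

A duplicate that appears elsewhere only after the source published does not count against the source in this sketch, which matches the claim's "published before" wording.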

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Bharat, Krishna; Curtiss, Michael; Schmitt, Michael
Primary Examiner(s): Huynh, Thu
Application Number: US12/344,153
Time in Patent Office: 1,301 Days
Field of Search: 715/205, 715/209, 715/229, 715/231, 715/234, 715/255, 715/271, 715/272, 707/737, 707/748
US Class Current: 715/200

CPC Class Codes
- G06F 16/24578   using ranking
- G06F 16/355   Class or cluster creation o...
- G06F 17/00   Digital computing or data p...
- G06F 40/143   Markup, e.g. Standard Gener...
- H04L 67/02   based on web technology, e....
- Y10S 707/99937   Sorting
- Y10S 707/99953   Recoverability
