Classifying content

US 8,577,866 B1
Filed: 12/07/2006
Issued: 11/05/2013
Est. Priority Date: 12/07/2006
Status: Active Grant

Abstract

Methods, systems, and apparatus, including computer program products for identifying original content. In one aspect a method is described that includes deriving a plurality of content pieces from a collection of documents, each content piece occurring in one or more documents in the collection of documents. Each document in the collection of documents is associated with a time and an author. A first document in the collection of documents is identified, the identified first document being the earliest document containing an occurrence of a first piece of content. A first author associated with the first document is ranked based on a number of documents that contain at least one occurrence of the content piece and that are associated with an author other than the first author.
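The ranking described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Doc` record, the `rank_first_authors` name, and the choice to sum the counts over every piece an author used first are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Doc(NamedTuple):
    author: str
    time: int            # any sortable timestamp (assumed numeric here)
    pieces: frozenset    # content pieces occurring in the document


def rank_first_authors(docs):
    """Score authors by how many other-author documents reuse pieces they used first."""
    # Group documents by the content pieces they contain.
    by_piece = defaultdict(list)
    for doc in docs:
        for piece in doc.pieces:
            by_piece[piece].append(doc)

    scores = defaultdict(int)
    for holders in by_piece.values():
        # The earliest document containing the piece identifies the first author.
        first = min(holders, key=lambda d: d.time)
        # Count documents containing the piece that come from a different author.
        scores[first.author] += sum(1 for d in holders if d.author != first.author)

    # Highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Summing across pieces is only one possible aggregation; the abstract states the count for a single content piece and leaves the combination open.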

13 Claims

1. A method performed by data processing apparatus, the method comprising:
- fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
  
  comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
  
  determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
  
  in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
  
  determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
  
  comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
  
  selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
  
  associating the selected document classification with the document.
- 2. The method of claim 1 wherein fragmenting the document into the plurality of content pieces further comprises translating each content piece in the plurality of content pieces into a single base language.
- 3. The method of claim 1, wherein the document classification is one of news, a blog and an advertisement.
- 4. The method of claim 1, further comprising associating an author classification with the document's author based at least partly on the rate of occurrence over time of the one or more content pieces in the second corpus of documents.
- 5. The method of claim 4, wherein the author classification is one of reporter, blogger and advertiser.
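The first four steps of claim 1 (fragmenting, comparing against earlier documents by other authors, and marking unmatched pieces as original) can be sketched as below. This is a rough illustration under stated assumptions: the function names, the default piece length of three words, the overlapping word n-grams, the lowercasing, and the `(text, date, author)` tuple layout are hypothetical choices that the claim does not specify.

```python
def fragment(text, n=3):
    """Split text into pieces of n consecutive words (overlapping n-grams, an assumption)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def original_pieces(doc_text, doc_date, doc_author, corpus, n=3):
    """Return the document's pieces that do not occur in the first corpus.

    corpus is an iterable of (text, date, author) tuples. Only documents dated
    earlier than doc_date and written by a different author form the first corpus.
    """
    stored = set()
    for text, date, author in corpus:
        if date < doc_date and author != doc_author:
            stored.update(fragment(text, n))
    # A piece is original when it matches none of the stored pieces.
    return [piece for piece in fragment(doc_text, n) if piece not in stored]
```

A production system would more likely store hashes of the pieces than the raw strings, but a plain set keeps the matching step easy to follow.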

6. A non-transitory computer readable medium having stored therein instructions, which, when executed by one or more processors, causes the one or more processors to perform operations comprising:
- fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
  
  comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
  
  determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
  
  in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
  
  determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
  
  comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
  
  selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
  
  associating the selected document classification with the document.
- 7. The computer readable medium of claim 6, wherein the document classification is one of news, a blog or an advertisement.
- 8. The computer readable medium of claim 6, wherein the operations further comprise associating an author classification with the document's author based at least partly on the rate of occurrence over time of the one or more content pieces in the second corpus of documents.
- 9. The computer readable medium of claim 8, wherein the author classification is one of reporter, blogger or advertiser.
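The remaining operations (counting later occurrences per time interval, comparing the resulting curve to predefined copying patterns, and selecting the most consistent classification) might look like the sketch below. Everything specific here is an assumption: the function names, the use of numeric timestamps, the squared-distance measure of "most consistent", and especially the pattern shapes in the usage example, which are invented for illustration and do not come from the patent.

```python
def occurrence_rates(piece_times, start, interval, n_intervals):
    """Count later documents containing the piece in each consecutive time interval.

    piece_times holds the times of second-corpus documents that contain the piece;
    start is the time of the document being classified.
    """
    counts = [0] * n_intervals
    for t in piece_times:
        idx = (t - start) // interval
        # Only documents dated after the original and inside the window are counted.
        if t > start and 0 <= idx < n_intervals:
            counts[idx] += 1
    return counts


def classify(rates, patterns):
    """Select the label whose pattern is closest to the normalised rate curve."""
    total = sum(rates) or 1
    shape = [r / total for r in rates]

    def distance(pattern):
        # Squared Euclidean distance, an assumed stand-in for "most consistent".
        return sum((a - b) ** 2 for a, b in zip(shape, pattern))

    return min(patterns, key=lambda label: distance(patterns[label]))
```

A usage example with made-up patterns, where copying of news is assumed to peak early and then decay:

```python
patterns = {
    "news": [0.7, 0.2, 0.1, 0.0],
    "blog": [0.25, 0.25, 0.25, 0.25],
    "advertisement": [0.0, 0.1, 0.2, 0.7],
}
rates = occurrence_rates([1, 2, 2, 3, 9, 15], start=0, interval=7, n_intervals=4)
label = classify(rates, patterns)
```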

10. A system comprising:
- one or more processors; and
  
  a computer readable medium coupled to the one or more processors, having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
  
  fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
  
  comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
  
  determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
  
  in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
  
  determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
  
  comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
  
  selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
  
  associating the selected document classification with the document.
- 11. The system of claim 10, wherein the document classification is one of news, a blog or an advertisement.
- 12. The system of claim 10, wherein the operations further comprise:
  - associating an author classification with the document's author based at least partly on the rate of occurrence over time of the one or more content pieces in the second corpus of documents.
- 13. The system of claim 12, wherein the author classification is one of reporter, blogger or advertiser.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Osinga, Douwe; Christoph, Stefan
Primary Examiner(s): Smith, Brannon W
Application Number: US11/608,207
Time in Patent Office: 2,525 Days
Field of Search: 707/5, 707/999.005, 707/751, 707/710
US Class Current: 707/710
CPC Class Codes:
G06F 16/353   into predefined classes
G06F 16/382   using citations hypermedia ...
G06F 40/131   Fragmentation of text files...
G06F 40/194   Calculation of difference b...
