Classifying content
Abstract
Methods, systems, and apparatus, including computer program products for identifying original content. In one aspect a method is described that includes deriving a plurality of content pieces from a collection of documents, each content piece occurring in one or more documents in the collection of documents. Each document in the collection of documents is associated with a time and an author. A first document in the collection of documents is identified, the identified first document being the earliest document containing an occurrence of a first piece of content. A first author associated with the first document is ranked based on a number of documents that contain at least one occurrence of the content piece and that are associated with an author other than the first author.
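The author-ranking idea in the abstract can be illustrated with a minimal sketch. It is not taken from the specification: the `Doc` record, the `first_author_score` helper, the integer timestamps, and the sample documents are all assumptions made for illustration, and the score is simply the raw count of other-author documents that contain the piece.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record: an author, a time, and its content pieces."""
    doc_id: str
    author: str
    time: int  # any orderable timestamp; an integer day number in this sketch
    pieces: FrozenSet[str]


def first_author_score(docs: List[Doc], piece: str) -> Tuple[str, int]:
    """Return (first author, number of other-author documents containing piece)."""
    containing = [d for d in docs if piece in d.pieces]
    if not containing:
        raise ValueError("piece does not occur in the collection")
    # The earliest document containing the piece identifies the first author.
    first = min(containing, key=lambda d: d.time)
    # Count only documents attributed to someone other than the first author.
    copies = sum(1 for d in containing if d.author != first.author)
    return first.author, copies


docs = [
    Doc("a", "alice", 1, frozenset({"the quick brown", "quick brown fox"})),
    Doc("b", "bob", 3, frozenset({"quick brown fox"})),
    Doc("c", "carol", 5, frozenset({"quick brown fox", "lazy dog sleeps"})),
    Doc("d", "alice", 7, frozenset({"quick brown fox"})),
]
result = first_author_score(docs, "quick brown fox")
print(result)
```

Document "d" is excluded from the count because it shares an author with the earliest document; a ranking over authors could then sort on this score, though how scores are combined across pieces is left open here.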
13 Claims
1. A method performed by data processing apparatus, the method comprising:
fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
associating the selected document classification with the document.
Dependent claims: 2, 3, 4, 5
6. A non-transitory computer readable medium having stored therein instructions, which, when executed by one or more processors, causes the one or more processors to perform operations comprising:
fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
associating the selected document classification with the document.
Dependent claims: 7, 8, 9
10. A system comprising:
one or more processors; and
a computer readable medium coupled to the one or more processors, having instructions stored thereon which, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
fragmenting a document into a plurality of content pieces, each of the content pieces representing a number of consecutive words in the document, the document having a date of creation and an author;
comparing the document's content pieces to a repository of stored content pieces that occur in a first corpus of documents, each document of the first corpus having a date that is earlier than the date of the document and an author that is not the same as the author of the document;
determining that one or more of the document's content pieces does not match at least one of the stored content pieces;
in response to determining that one or more of the document's content pieces does not match at least one of the stored content pieces, determining that the one or more content pieces are original content pieces, wherein an original content piece is a content piece that has not occurred in the first corpus of documents;
determining a rate of occurrence over time of the one or more content pieces in a second corpus of documents, wherein determining the rate of occurrence comprises determining the rate of occurrence over time of the one or more content pieces for each of a plurality of time intervals, the rate of occurrence over time of the one or more content pieces in an interval based on a count of a number of the other documents that both contain the first content piece and are associated with a time within the interval, each document of the second corpus having a date that is later than the date of the document and an author that is not the same as the author of the document;
comparing the rate of occurrence over time of the one or more content pieces to predefined copying patterns that are each associated with a different document classification;
selecting a document classification that is associated with a predefined copying pattern that is most consistent with the rate of occurrence over time of the one or more content pieces; and
associating the selected document classification with the document.
Dependent claims: 11, 12, 13
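The operations shared by the independent claims above can be pictured with a short sketch. It is only one possible reading, under stated assumptions: content pieces are taken to be three-word n-grams, dates are integer day numbers, a later document is counted if it contains any of the original pieces, and "most consistent" is approximated by the smallest squared distance between normalized per-interval counts and a pattern. The names `Doc`, `fragment`, `original_pieces`, `occurrence_rates`, `classify`, and the pattern labels `burst` and `steady` are hypothetical and do not come from the claims.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record with an author, a date, and raw text."""
    author: str
    date: int  # day number; any orderable timestamp would do
    text: str


def fragment(text: str, n: int = 3) -> Set[str]:
    """Split text into pieces of n consecutive words (assumed piece definition)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def original_pieces(doc: Doc, corpus: List[Doc], n: int = 3) -> Set[str]:
    """Pieces of doc that match nothing in earlier documents by other authors."""
    stored: Set[str] = set()
    for other in corpus:
        if other.date < doc.date and other.author != doc.author:
            stored |= fragment(other.text, n)
    return fragment(doc.text, n) - stored


def occurrence_rates(doc: Doc, corpus: List[Doc], pieces: Set[str],
                     interval: int, num_intervals: int, n: int = 3) -> List[int]:
    """Per interval, count later other-author documents containing any piece."""
    counts: Counter = Counter()
    for other in corpus:
        if other.date > doc.date and other.author != doc.author:
            if pieces & fragment(other.text, n):
                counts[(other.date - doc.date - 1) // interval] += 1
    return [counts[i] for i in range(num_intervals)]


def classify(rates: List[int], patterns: Dict[str, List[float]]) -> str:
    """Pick the pattern label whose shape is closest to the normalized rates."""
    total = sum(rates)
    shape = [r / total for r in rates] if total else [0.0] * len(rates)

    def distance(pattern: List[float]) -> float:
        return sum((s - p) ** 2 for s, p in zip(shape, pattern))

    return min(patterns, key=lambda name: distance(patterns[name]))


doc = Doc("alice", 10, "new widget ships with solar charging today")
corpus = [
    Doc("bob", 5, "old widget ships with batteries"),
    Doc("carol", 11, "report: new widget ships with solar charging"),
    Doc("dave", 12, "wow new widget ships with solar power"),
    Doc("erin", 20, "recall that the new widget ships with solar charging"),
    Doc("alice", 13, "new widget ships with solar charging again"),
]
originals = original_pieces(doc, corpus)
rates = occurrence_rates(doc, corpus, originals, interval=7, num_intervals=3)
patterns = {"burst": [0.7, 0.2, 0.1], "steady": [1 / 3, 1 / 3, 1 / 3]}
label = classify(rates, patterns)
print(sorted(originals), rates, label)
```

In this toy data the piece "widget ships with" is not original because an earlier document by another author already contains it, and the later document by the same author is ignored when counting. Squared distance is just one convenient stand-in for a consistency measure; the claims do not prescribe one.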
Specification