Cloud-based plagiarism detection system performing predicting based on classified feature vectors

US 9,514,417 B2
Filed: 12/30/2013
Issued: 12/06/2016
Est. Priority Date: 12/30/2013
Status: Expired due to Fees

Abstract

Plagiarism may be detected, as disclosed herein, utilizing a database that stores documents for one or more courses. The database may restrict sharing of content between documents. A feature extraction module may receive edits and timestamp the edits to the document. A writing pattern for a particular user or group of users may be discerned from the temporal data and the documents for the particular user or group of users. A feature vector may be generated that represents the writing pattern. A machine learning technique may be applied to the feature vector to determine whether or not a document is plagiarized.
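
The flow described in the abstract (timestamped edits, then a writing pattern, then a feature vector) might be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Edit` record, the function `writing_pattern_features`, and the three summary features are hypothetical choices that the source does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Edit:
    """One timestamped edit (hypothetical record layout)."""
    content: str      # text inserted or removed by the edit
    edit_type: str    # e.g. "insertion", "deletion", "move", "replacement"
    timestamp: float  # seconds since the document was created


def writing_pattern_features(edits: List[Edit]) -> List[float]:
    """Summarize an edit history as a fixed-length feature vector.

    The features are illustrative assumptions, not taken from the patent:
      [0] mean seconds between consecutive edits
      [1] mean characters per edit
      [2] share of inserted characters that came from the largest insertion
    """
    if not edits:
        return [0.0, 0.0, 0.0]
    ordered = sorted(edits, key=lambda e: e.timestamp)
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    mean_gap = float(mean(gaps)) if gaps else 0.0
    mean_len = float(mean(len(e.content) for e in ordered))
    inserted = [len(e.content) for e in ordered if e.edit_type == "insertion"]
    total = sum(inserted)
    largest_share = max(inserted) / total if total else 0.0
    return [mean_gap, mean_len, largest_share]


# A document typed in small steps versus one dominated by a single large paste.
incremental = [Edit("The cat ", "insertion", 0.0),
               Edit("sat on ", "insertion", 4.0),
               Edit("the mat.", "insertion", 8.0)]
pasted = [Edit("x" * 900, "insertion", 0.0),
          Edit("typo", "replacement", 2.0)]
print(writing_pattern_features(incremental))
print(writing_pattern_features(pasted))
```

In this sketch a history dominated by one large insertion yields a high value for the third feature, which is the kind of signal a downstream classifier could weigh; whether the patented system uses such a feature is not stated in the source.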

16 Citations

20 Claims

1. A computer-implemented method comprising:
- generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
  
  training the predictive model using the training data;
  
  after training the predictive model, identifying a particular document stored in a database;
  
  receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
  
  generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
  
  determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the type of the edit comprises an insertion or a deletion.
  - 3. The method of claim 1, wherein the type of edit comprises a move.
  - 4. The method of claim 1, wherein the type of edit comprises a replacement.
  - 5. The method of claim 1, comprising automatically generating the plurality of training documents.
  - 6. The method of claim 1, comprising, for each of the plurality of training documents, pre-processing the content of the edit to insert variable-invariant features.
  - 7. The method of claim 6, wherein pre-processing the content comprises substituting a particular term included in the content with a synonym of the particular term.
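
The steps of claim 1 (build labeled training data, train a predictive model, then estimate a probability for a stored document) can be illustrated with a small sketch. The claim does not name a machine learning technique, so logistic regression is swapped in here purely as an assumption. The two numeric features (largest-insertion share and mean minutes between edits), the toy data, and the function names are likewise hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A labeled training example: (feature vector, label), label 1 = plagiarized.
Example = Tuple[Sequence[float], int]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w: Sequence[float], x: Sequence[float]) -> float:
    """Estimated probability that the document behind x is plagiarized.

    w holds one weight per feature followed by a bias term.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
    return sigmoid(z)


def train(examples: List[Example], epochs: int = 1000, lr: float = 0.5) -> List[float]:
    """Fit logistic-regression weights by batch gradient descent."""
    dim = len(examples[0][0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in examples:
            err = predict_proba(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[dim] += err  # gradient for the bias term
        w = [wi - lr * gi / len(examples) for wi, gi in zip(w, grad)]
    return w


# Toy training data (assumed): [largest-insertion share, mean minutes between edits].
training_data: List[Example] = [
    ([0.95, 0.1], 1), ([0.90, 0.2], 1), ([0.85, 0.1], 1),
    ([0.20, 1.5], 0), ([0.30, 2.0], 0), ([0.25, 1.0], 0),
]
weights = train(training_data)

# After training, score the feature vector of a particular stored document.
p = predict_proba(weights, [0.9, 0.15])
print(f"estimated probability of plagiarism: {p:.3f}")
```

The output is a probability rather than a hard label, which matches the claim's wording; any threshold for flagging a document would be a separate policy decision that the claim does not address.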

8. A system comprising:
- a processor configured to execute computer program instructions; and
  
  a computer storage medium encoded with the computer program instructions that, when executed by the processor, cause the system to perform operations comprising:
  
  generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
  
  training the predictive model using the training data;
  
  after training the predictive model, identifying a particular document stored in a database;
  
  receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
  
  generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
  
  determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The system of claim 8, wherein the type of the edit comprises an insertion or a deletion.
  - 10. The system of claim 8, wherein the type of edit comprises a move.
  - 11. The system of claim 8, wherein the type of edit comprises a replacement.
  - 12. The system of claim 8, wherein the operations comprise automatically generating the plurality of training documents.
  - 13. The system of claim 8, wherein the operations comprise, for each of the plurality of training documents, pre-processing the content of the edit to insert variable-invariant features.
  - 14. The system of claim 13, wherein pre-processing the content comprises substituting a particular term included in the content with a synonym of the particular term.
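
Claims 13 and 14 (like claims 6 and 7) describe pre-processing edit content by substituting a term with a synonym. One plausible reading, offered here only as an assumption, is that mapping synonyms to a single canonical term keeps the extracted features stable when copied text is lightly reworded. The synonym table and the function `normalize_synonyms` below are hypothetical.

```python
import re
from typing import Dict

# Hypothetical synonym table mapping variant terms to one canonical term.
CANONICAL: Dict[str, str] = {
    "big": "large",
    "huge": "large",
    "quick": "fast",
    "rapid": "fast",
}


def normalize_synonyms(text: str, table: Dict[str, str] = CANONICAL) -> str:
    """Replace each known term with its canonical synonym, word by word.

    Lookup is case-insensitive; replaced words come out in the table's
    lowercase form, and unknown words are left untouched.
    """
    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        return table.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


original = "A huge dog ran at a rapid pace."
reworded = "A big dog ran at a quick pace."
# Both sentences normalize to the same string, so features computed
# from the normalized text would not differ between them.
print(normalize_synonyms(original))
print(normalize_synonyms(reworded))
```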

15. A computer-readable storage device encoded with a computer program, the computer program comprising instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
- generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
  
  training the predictive model using the training data;
  
  after training the predictive model, identifying a particular document stored in a database;
  
  receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
  
  generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
  
  determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The device of claim 15, wherein the type of the edit comprises an insertion or a deletion.
  - 17. The device of claim 15, wherein the type of edit comprises a move.
  - 18. The device of claim 15, wherein the type of edit comprises a replacement.
  - 19. The device of claim 15, wherein the operations comprise automatically generating the plurality of training documents.
  - 20. The device of claim 15, wherein the operations comprise, for each of the plurality of training documents, pre-processing the content of the edit to insert variable-invariant features.
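
The dependent claims name four edit types: insertion, deletion, move, and replacement. The claims do not say how a type is determined, so the following is only a sketch under assumptions: it compares word sequences before and after an edit with Python's standard `difflib.SequenceMatcher`, and the function name `classify_edit` and the move heuristic are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def classify_edit(before: str, after: str) -> str:
    """Label one edit as insertion, deletion, move, replacement, or none."""
    old, new = before.split(), after.split()
    if old == new:
        return "none"
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    ops = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    removed: List[str] = []
    added: List[str] = []
    for _tag, i1, i2, j1, j2 in ops:
        removed.extend(old[i1:i2])
        added.extend(new[j1:j2])
    if removed and added:
        # Heuristic: the same words deleted in one place and inserted in
        # another, with no in-place substitution, is treated as a move.
        no_substitution = all(op[0] != "replace" for op in ops)
        if no_substitution and sorted(removed) == sorted(added):
            return "move"
        return "replacement"
    return "insertion" if added else "deletion"


print(classify_edit("the cat sat", "the black cat sat"))
print(classify_edit("the black cat sat", "the cat sat"))
print(classify_edit("the cat sat", "the dog sat"))
print(classify_edit("alpha beta gamma delta", "beta gamma delta alpha"))
```

The resulting label could be encoded as one component of the feature vector alongside the edit's content and time, as the independent claims require for the training data.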

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Kumar, Sanjiv; Kernighan, Brian
Primary Examiner(s): Vincent, David
Application Number: US14/143,710
Publication Number: US 20150186787A1
Time in Patent Office: 1,072 Days
Field of Search: 706/12, 706/45
US Class Current: 1/1
CPC Class Codes:
G06F 16/93   Document management systems
G06N 20/00   Machine learning
G06Q 10/107   Computer-aided management o...
