Cloud-based plagiarism detection system performing predicting based on classified feature vectors
Abstract
As disclosed herein, plagiarism may be detected using a database that stores documents for one or more courses. The database may restrict sharing of content between documents. A feature extraction module may receive edits to a document and timestamp each edit. A writing pattern for a particular user or group of users may be discerned from this temporal data and from the documents of that user or group. A feature vector may be generated that represents the writing pattern. A machine learning technique may be applied to the feature vector to determine whether or not a document is plagiarized.
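As a rough illustration of how timestamped edits might be turned into a feature vector that represents a writing pattern, consider the minimal Python sketch below. Everything in it is an assumption made for illustration: the `Edit` record, the edit type names ("insert", "delete", "paste"), and the four summary features are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Edit:
    """One timestamped edit to a document (hypothetical record layout)."""
    content: str      # text added or removed by the edit
    kind: str         # assumed edit types: "insert", "delete", or "paste"
    timestamp: float  # seconds since the document was created


def writing_pattern_features(edits: list[Edit]) -> list[float]:
    """Summarize an edit history as a fixed-length feature vector.

    The four features are illustrative choices, not the patent's:
    edit count, mean gap between edits, largest edit, share of pasted text.
    """
    if not edits:
        return [0.0, 0.0, 0.0, 0.0]
    ordered = sorted(edits, key=lambda e: e.timestamp)
    # Time elapsed between each pair of consecutive edits.
    gaps = [b.timestamp - a.timestamp for a, b in zip(ordered, ordered[1:])]
    total_chars = sum(len(e.content) for e in ordered)
    pasted_chars = sum(len(e.content) for e in ordered if e.kind == "paste")
    return [
        float(len(ordered)),
        mean(gaps) if gaps else 0.0,
        float(max(len(e.content) for e in ordered)),
        pasted_chars / total_chars if total_chars else 0.0,
    ]


# A document typed in small steps versus one pasted in a single edit.
typed = [
    Edit("The ", "insert", 0.0),
    Edit("quick ", "insert", 2.0),
    Edit("fox", "insert", 5.0),
]
pasted = [Edit("x" * 400, "paste", 0.0)]

typed_vector = writing_pattern_features(typed)
pasted_vector = writing_pattern_features(pasted)
```

In this sketch the two histories produce clearly different vectors: the typed document has several small edits spread over time, while the pasted document has one large edit and a pasted-text share of 1.0. A real system would likely need richer features and per-user baselines.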
20 Claims
1. A computer-implemented method comprising:
generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
training the predictive model using the training data;
after training the predictive model, identifying a particular document stored in a database;
receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
Dependent claims: 2, 3, 4, 5, 6, 7
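The steps recited in claim 1 (labeled training vectors, model training, then a probability for a new document's feature vector) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the claim does not name a machine learning technique, so logistic regression trained by batch gradient descent is swapped in here as one possible choice, and the two numeric features (share of pasted text, normalized mean gap between edits), the toy training data, and the function names `train` and `plagiarism_probability` are all hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train(vectors, labels, epochs=2000, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    vectors: list of equal-length feature vectors (assumed scaled to [0, 1]).
    labels:  1 for a plagiarized training document, 0 otherwise.
    Returns (weights, bias).
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(vectors, labels):
            # Prediction error for this training document.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def plagiarism_probability(model, vector) -> float:
    """Estimated probability that the document behind `vector` is plagiarized."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, vector)) + bias)


# Toy training data (invented): [share of pasted text, normalized mean edit gap].
training_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.0, 0.9]]
training_labels = [1, 1, 0, 0]

model = train(training_vectors, training_labels)

# Feature vectors for two new documents retrieved after training.
p_mostly_pasted = plagiarism_probability(model, [0.95, 0.05])
p_mostly_typed = plagiarism_probability(model, [0.05, 0.80])
```

With this toy data the mostly pasted document receives a probability above 0.5 and the mostly typed one a probability below 0.5. The output is a probability rather than a hard decision, which matches the claim's wording; any threshold applied on top of it would be a separate design choice.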
8. A system comprising:
a processor configured to execute computer program instructions; and
a computer storage medium encoded with the computer program instructions that, when executed by the processor, cause the system to perform operations comprising:
generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
training the predictive model using the training data;
after training the predictive model, identifying a particular document stored in a database;
receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer-readable storage device encoded with a computer program, the computer program comprising instructions that, when executed by one or more computers, cause the one or more computers to perform operations comprising:
generating training data for training a predictive model using a machine learning technique to estimate a probability that a given document is plagiarized or is not plagiarized, the training data including, for each of a plurality of training documents, a feature vector that includes (i) data referencing a content of an edit to the training document, (ii) data referencing a type of the edit to the training document, (iii) data referencing a time associated with the edit to the training document, and (iv) a label indicating whether the training document is or is not plagiarized;
training the predictive model using the training data;
after training the predictive model, identifying a particular document stored in a database;
receiving data referencing (i) a content of an edit to the particular document stored in the database, and (ii) a time associated with the edit to the particular document;
generating a feature vector based at least on the data referencing (i) the content of the edit to the particular document stored in the database, and (ii) the time associated with the edit to the particular document; and
determining a probability that the particular document is plagiarized or is not plagiarized based on classifying the feature vector by the predictive model that is trained using the machine learning technique.
Dependent claims: 16, 17, 18, 19, 20
Specification