Variables and method for authorship attribution
Abstract
A method uses linguistic units of analysis to identify the authorship of a document. The method is useful for determining the authorship of brief documents and in situations where fewer than ten documents are available per known author, i.e., when text is scarce. The method analyzes parameters such as syntax, punctuation, and, optionally, average word and paragraph length; when these parameters are analyzed using statistical methods, the method achieves a high degree of reliability (>90% accuracy). The method may be applicable to numerous languages other than English because the selected variables are characteristic of most languages. The reliability of the method is verified by subjecting it to a cross-validation statistical analysis.
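The optional surface parameters named in the abstract (punctuation, average word length, average paragraph length) and the cross-validation check can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names `surface_features`, `distance`, and `leave_one_out_accuracy`, the choice of comma and period rates, the blank-line paragraph convention, and the nearest-neighbour classifier are all assumptions introduced here, since the abstract does not name a specific statistical method.

```python
import math
import string
from collections import Counter


def surface_features(text: str) -> dict[str, float]:
    """Illustrative surface parameters: punctuation rates and average lengths."""
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    # Strip surrounding punctuation so "end." counts as a 3-letter word.
    cleaned = [w.strip(string.punctuation) for w in text.split()]
    cleaned = [w for w in cleaned if w]
    punct = Counter(ch for ch in text if ch in string.punctuation)
    n_words = max(len(cleaned), 1)
    return {
        "avg_word_len": sum(len(w) for w in cleaned) / n_words,
        "avg_paragraph_len": len(cleaned) / max(len(paragraphs), 1),
        "commas_per_word": punct[","] / n_words,
        "periods_per_word": punct["."] / n_words,
    }


def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between two feature dicts with the same keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))


def leave_one_out_accuracy(samples: list[tuple[str, dict[str, float]]]) -> float:
    """Leave-one-out cross-validation with a nearest-neighbour rule.

    Each sample is (author, features). Every document is held out in turn
    and assigned the author of its closest remaining document.
    """
    correct = 0
    for i, (author, feats) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        predicted = min(rest, key=lambda s: distance(feats, s[1]))[0]
        correct += predicted == author
    return correct / len(samples)
```

In use, each known document would be passed through `surface_features`, paired with its author label, and the resulting list handed to `leave_one_out_accuracy`. In practice the features would probably need rescaling before distances are compared, because average paragraph length is on a much larger scale than the per-word punctuation rates.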
56 Citations
5 Claims
1. A method to determine whether an unidentified author of a textual work corresponds to a known author, the method comprising the steps of:
obtaining a known sample of text of the known author;
selecting from the known sample a known grammatical unit;
parsing and analyzing the known grammatical unit to produce known grammatical unit level data;
selecting from the textual work an unknown grammatical unit;
parsing and analyzing the unknown grammatical unit to produce unknown grammatical unit level data; and
comparing the unknown grammatical unit level data to the known grammatical unit level data.
2. A set of characteristics of a textual work comprising a syntactic feature and a graphemic feature.
3. A computer-aided method to determine whether an unidentified author of a textual work belongs to a group comprising the textual work of a known author, the method comprising the steps of:
obtaining a sample of the textual work of the unidentified author;
obtaining a sample of the textual work of the known author;
entering the samples into a computer system, the computer system including a memory, a means for analyzing documents, and a means for determining belonging, stored within the memory;
utilizing the means for analyzing documents, splitting the entered samples into individual sentences, the sentences each including a head, a plurality of words and punctuation, the punctuation defining a syntactic edge within the individual sentence, and the punctuation defining a discursive function emphatic within the individual sentence;
categorizing the punctuation by determining the syntactic edge;
indicating the discursive function emphatic, a graphemic feature being generated by the steps of categorizing and indicating;
dividing each of the individual sentences into the words;
labeling each of the words as a part of speech;
listing the labeled words into phrases for each labeled word;
identifying phrases for each said head;
classifying the identified phrases as marked or unmarked;
characterizing the identified phrases by markedness, thereby producing a plurality of syntactic features; and
utilizing the means for determining belonging, inputting at least one of the syntactic features and at least one of the graphemic features for each said sample to determine whether the unidentified author of the textual work sample belongs to the known author group.
4. A system for determining whether an unidentified author of a textual work belongs to a group comprising the textual work of a known author, the system comprising:
a computer system including a memory, an input means, a means for analyzing documents, and a means for determining belonging, stored within the memory;
a sample of the textual work of the unidentified author;
a sample of the textual work of the known author, the samples being input into the computer system;
the means for analyzing documents splitting the entered samples into individual sentences, the sentences each including a head, a plurality of words and punctuation, the punctuation defining a syntactic edge within the individual sentences, and the punctuation defining a discursive function emphatic within the individual sentence;
the means for analyzing documents categorizing the punctuation by determining the syntactic edge; and
indicating the discursive function emphatic, thereby generating a graphemic feature;
the means for analyzing documents dividing each of the individual sentences into the words;
labeling each of the words as a part of speech;
listing the labeled words into phrases for each labeled word, identifying phrases for each said head, classifying the identified phrases as marked or unmarked, characterizing the identified phrases by markedness, thereby producing a plurality of syntactic features; and
inputting at least one of the syntactic features and at least one of the graphemic features into the means for determining belonging, thereby determining whether the unidentified author of the textual work sample belongs to the known author group.
5. A method to determine whether an unidentified author of a textual work belongs to a group comprising the textual work of a known author, the method comprising the steps of:
obtaining a sample of the textual work of the unidentified author;
obtaining a sample of the textual work of the known author;
analyzing the samples by splitting the entered samples into individual sentences, the sentences each including a head, a plurality of words and punctuation, the punctuation defining a syntactic edge within the individual sentence, and the punctuation defining a discursive function emphatic within the individual sentence;
categorizing the punctuation by determining the syntactic edge;
indicating the discursive function emphatic, a graphemic feature being generated by the steps of categorizing and indicating;
dividing each of the individual sentences into the words;
labeling each of the words as a part of speech;
listing the labeled words into phrases for each labeled word;
identifying phrases for each said head;
classifying the identified phrases as marked or unmarked;
characterizing the identified phrases by markedness, thereby producing a plurality of syntactic features;
utilizing a means for determining belonging, inputting at least one of the syntactic features and at least one of the graphemic features to determine whether the unidentified author of the textual work sample belongs to the known author group.
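The graphemic half of the claimed steps (splitting samples into sentences, categorizing punctuation by syntactic edge, indicating emphatic use, and comparing the unknown sample with the known one) can be sketched as below. This is a rough illustration under stated assumptions, not the claimed implementation: the `EDGE_OF` mapping from marks to edges, the rule that an exclamation mark or a repeated mark counts as emphatic, the regex sentence splitter, and the L1 `profile_distance` standing in for the "means for determining belonging" are all hypothetical. The syntactic features (part-of-speech labeling, phrase identification, markedness) are omitted because they would need a tagger and parser that the claims do not specify.

```python
import re
from collections import Counter

# Assumed mapping from a punctuation mark to the syntactic edge it closes.
EDGE_OF = {".": "sentence", "?": "sentence", "!": "sentence",
           ";": "clause", ":": "clause", ",": "phrase"}


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break at whitespace that follows ., ? or !."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


def graphemic_features(text: str) -> Counter:
    """Count punctuation by assumed syntactic edge, plus an emphatic count."""
    counts: Counter = Counter()
    for sentence in split_sentences(text):
        # Treat a run such as "?!" or "..." as a single punctuation event.
        for run in re.findall(r"[.?!;:,]+", sentence):
            counts[EDGE_OF[run[0]]] += 1
            # Assumption: "!" or any repeated mark signals emphasis.
            if "!" in run or len(run) > 1:
                counts["emphatic"] += 1
    return counts


def profile_distance(a: Counter, b: Counter) -> float:
    """L1 distance between relative-frequency profiles (0 means identical)."""
    total_a = sum(a.values()) or 1
    total_b = sum(b.values()) or 1
    keys = set(a) | set(b)
    return sum(abs(a[k] / total_a - b[k] / total_b) for k in keys)


known = graphemic_features("Wait, what?! Stop it; now.")
unknown = graphemic_features("Fine. Leave it, then.")
print(profile_distance(known, unknown))
```

A smaller distance suggests a closer match between the punctuation profiles of the two samples. Any threshold for deciding that the unidentified author belongs to the known author group would have to be calibrated on real data, for example by cross-validation.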
Specification