Method and system for discovering significant subsets in collection of documents
Abstract
A method (and system) of discovering a significant subset in a collection of documents includes identifying a set of documents from a plurality of documents based on a likelihood that documents in the set of documents carry an instance of information that is characteristic to the documents in the set of documents.
16 Claims
1. A method of discovering a subset in a collection of documents, the method being performed on a computer programmed to perform the method, the method comprising:
obtaining a collection of documents, the documents being arranged in sequence and the subset arranged in sequence within the collection of documents;
analyzing a first document in the collection of documents to determine characteristic features of the first document, a plurality of the characteristic features including human created indicia;
generating a profile based on the characteristic features of the first document;
assigning a variable weight to each of the characteristic features;
comparing subsequent documents in the collection of documents to the profile to identify the subset, said comparing comprising considering the characteristic features based on the variable weight assigned to the characteristic features;
during said comparing, redistributing the variable weight when it is determined that one or more of the characteristic features is more reliable than other characteristic fields; and
preselecting a subset of users, said users having created at least one document in the collection of documents, said analyzing is conducted based on said preselecting a subset of users.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
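The weighting and redistribution steps of claim 1 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the feature names, the equal starting weights, the rule that a feature whose value is unreadable (`None`) in the candidate document counts as less reliable, and the `boost` and `threshold` parameters. The user preselection step is omitted.

```python
from typing import Dict, List, Optional

# A document is modeled as feature name -> observed value (None = unreadable).
Features = Dict[str, Optional[str]]


def redistribute(weights: Dict[str, float], reliable: List[str],
                 boost: float = 0.5) -> Dict[str, float]:
    """Move a fraction `boost` of the weight held by the other features
    onto the reliable ones, keeping the total weight unchanged."""
    reliable = [f for f in reliable if f in weights]
    others = [f for f in weights if f not in reliable]
    if not reliable or not others:
        return dict(weights)
    moved = sum(weights[f] * boost for f in others)
    new = {f: weights[f] * (1.0 - boost) for f in others}
    for f in reliable:
        new[f] = weights[f] + moved / len(reliable)
    return new


def score(profile: Features, doc: Features, weights: Dict[str, float]) -> float:
    """Weighted share of profile features that the document reproduces."""
    return sum(w for f, w in weights.items()
               if profile[f] is not None and doc.get(f) == profile[f])


def find_subset(docs: List[Features], threshold: float = 0.8) -> List[int]:
    """Return indices of documents that match the first document's profile."""
    profile = dict(docs[0])                       # profile from the first document
    base = {f: 1.0 / len(profile) for f in profile}  # assumed equal weights
    subset = [0]
    for i, doc in enumerate(docs[1:], start=1):
        # Hypothetical reliability rule: readable features are more reliable.
        reliable = [f for f in profile if doc.get(f) is not None]
        weights = redistribute(base, reliable)    # applied per comparison
        if score(profile, doc, weights) >= threshold:
            subset.append(i)
    return subset


docs = [
    {"signature": "A", "header": "H1", "stamp": "S"},
    {"signature": "A", "header": None, "stamp": "S"},   # header unreadable
    {"signature": "B", "header": "H1", "stamp": "X"},
]
subset = find_subset(docs)
```

In this example the second document would score only 2/3 under equal weights; once half of the unreadable header's weight is shifted to the signature and stamp, it clears the assumed 0.8 threshold. The third document matches on the header alone and is left out.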
14. A system for discovering a significant subset in a collection of documents, comprising:
an analyzing unit that determines characteristic features from a document in the collection of documents;
a profile-generating unit that generates a profile for said document based on said characteristic features; and
a comparison unit that compares a subsequent document with said profile, wherein when said subsequent document matches said profile, said subsequent document is included in said set of documents and a next subsequent document is compared at least to said subsequent document.
15. A method, comprising:
analyzing a first document in a collection of documents to determine a characteristic feature of the first document;
generating a profile of the first document based on the characteristic feature; and
comparing a subsequent document in the collection of documents to the profile, wherein when the subsequent document matches the profile, the subsequent document is included in a set of documents and a next subsequent document is compared at least to the subsequent document.
Dependent claims: 16
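The chained comparison in claim 15, where each newly matched document becomes a reference for the next one, might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: documents are reduced to sets of feature tokens, similarity is measured with the Jaccard index, and the `threshold` value and the name `chain_subset` are hypothetical.

```python
from typing import FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two feature sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def chain_subset(docs: List[FrozenSet[str]], threshold: float = 0.5) -> List[int]:
    """Collect documents in sequence, comparing each candidate to the most
    recently matched document rather than only to the first one."""
    if not docs:
        return []
    members = [0]
    reference = docs[0]          # profile of the first document
    for i in range(1, len(docs)):
        if jaccard(reference, docs[i]) >= threshold:
            members.append(i)
            reference = docs[i]  # the matched document becomes the reference
    return members


docs = [
    frozenset({"a", "b", "c"}),
    frozenset({"b", "c", "d"}),
    frozenset({"c", "d", "e"}),
    frozenset({"x", "y"}),
]
members = chain_subset(docs)
```

Here the third document shares only one feature with the first (similarity 0.2) but half of its features with the second, so it joins the set through the chain. Comparing against the latest member lets the set follow features that drift gradually along the sequence.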
Specification