Detecting novel document content
Abstract
A system determines an ordered sequence of documents and determines an amount of novel content contained in each document of the ordered sequence of documents. The system assigns a novelty score to each document based on the determined amount of novel content.
32 Claims
1. A computer-implemented method, comprising:
determining an ordered sequence of documents;
determining an amount of novel content contained in each document of the ordered sequence of documents by identifying one or more pairs of textual sequences that occur in close proximity to one another in a respective document;
assigning a novelty score to each document based on the determined amount of novel content; and
providing the documents of the ordered sequence of documents based on the assigned novelty scores.
Dependent claims: 2-11
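The steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes that textual sequences are whitespace-separated word tokens, that "close proximity" means within a fixed token window, and that the amount of novel content is the number of pairs not seen in any earlier document. The names `WINDOW`, `proximity_pairs`, and `novelty_scores` are hypothetical.

```python
from typing import FrozenSet, List, Set

WINDOW = 3  # assumed proximity threshold in tokens; the claim does not fix a value


def proximity_pairs(text: str, window: int = WINDOW) -> Set[FrozenSet[str]]:
    """Return unordered pairs of distinct tokens that occur within `window` tokens."""
    tokens = text.lower().split()
    pairs: Set[FrozenSet[str]] = set()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs.add(frozenset((left, right)))
    return pairs


def novelty_scores(ordered_docs: List[str]) -> List[int]:
    """Score each document by the number of pairs absent from all earlier documents."""
    seen: Set[FrozenSet[str]] = set()
    scores = []
    for doc in ordered_docs:
        pairs = proximity_pairs(doc)
        scores.append(len(pairs - seen))  # assumed measure of novel content
        seen |= pairs
    return scores


# Ordered sequence of documents (for example, by publication time).
docs = [
    "acme acquires widgetco",
    "acme acquires widgetco today",
    "acme acquires widgetco",
]
scores = novelty_scores(docs)
# Provide the documents by novelty score, highest first; ties keep sequence order.
ranked = sorted(range(len(docs)), key=lambda k: scores[k], reverse=True)
```

In this sketch the third document repeats the first verbatim, so it contributes no unseen pairs and ranks last.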
12. A computer-implemented method, comprising:
identifying textual sequences that carry information within a document of a plurality of documents;
determining an importance value Na(t, A) of each of the identified textual sequences as equal to TF(t, A)*WTF(t, A), where t is an identified textual sequence contained in the document, A is the document, TF(t, A) is the term frequency of textual sequence t in document A, and
assigning a score to the document of the plurality of documents based on the importance value of each of the identified textual sequences to the document, the assigned score indicating the presence of content in the document that is novel relative to content in other documents of the plurality of documents;
ranking the document among the other documents of the plurality of documents based on the assigned score; and
displaying the document among other documents based on the ranking.
Dependent claims: 13-25
17. The computer-implemented method of claim 12, wherein assigning a score to a document further comprises:
determining an indication of importance of each of the identified textual sequences to the plurality of documents.
18. The computer-implemented method of claim 17, wherein determining an indication of importance of each of the identified textual sequences to the plurality of documents comprises:
determining an importance value Ns(t, A) that measures the importance of an identified textual sequence t to the plurality of documents, wherein Ns(t, A) is computed based on one or more of the following quantities:
a) a number of documents in the plurality of documents that contain the identified textual sequence t;
b) a sum of Na(t, A) for all documents A in the plurality of documents;
c) a sum of log(Na(t, A)) for all documents A in the plurality of documents;
d) a sum of Ia(i, S) over all interactions i in the plurality of documents that involve identified textual sequence t; or
e) a sum of Ma(i, S) over all interactions i in the plurality of documents that involve the identified textual sequence t, wherein Ma(i, S) is the maximum Ia(i, A) over all documents A in the plurality of documents.
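Quantities a) through c) of claim 18 can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the per-document values Na(t, A) are taken as given inputs in a hypothetical table `na`, and documents where Na is not positive are skipped in the logarithmic sum because the logarithm is undefined there.

```python
import math
from typing import Dict

# Hypothetical table of Na(t, A) values: document id -> {textual sequence: Na}.
# How Na itself is obtained is outside this sketch.
NaTable = Dict[str, Dict[str, float]]


def ns_doc_count(t: str, na: NaTable) -> int:
    """Quantity a): number of documents that contain textual sequence t."""
    return sum(1 for values in na.values() if t in values)


def ns_sum(t: str, na: NaTable) -> float:
    """Quantity b): sum of Na(t, A) over all documents A."""
    return sum(values.get(t, 0.0) for values in na.values())


def ns_log_sum(t: str, na: NaTable) -> float:
    """Quantity c): sum of log(Na(t, A)); non-positive values are skipped (assumption)."""
    return sum(math.log(v[t]) for v in na.values() if v.get(t, 0.0) > 0.0)


na = {
    "A1": {"merger": 2.0, "acme": 1.0},
    "A2": {"merger": 4.0},
}
count = ns_doc_count("merger", na)  # "merger" appears in both documents
total = ns_sum("merger", na)        # 2.0 + 4.0
log_total = ns_log_sum("merger", na)  # log(2.0) + log(4.0)
```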
19. The computer-implemented method of claim 14, wherein assigning a score to a document further comprises:
determining an indication of importance of each of the identified one or more pairs to the plurality of documents.
20. The method of claim 19, wherein determining the indication of importance of each of the identified one or more pairs to the plurality of documents comprises:
determining an importance value Is(i, S) that measures the importance of a pair i of the one or more pairs to the plurality of documents, wherein Is(i, S) is computed based on one or more of the following quantities:
a) a number of documents in the plurality of documents that contain the pair i;
b) a sum of Ia(i, A) for all documents A in the plurality of documents;
c) a sum of log(Ia(i, A)) for all documents A in the plurality of documents;
d) Ns(t1, S)*Ns(t2, S) for pair i comprising identified textual sequences t1 and t2.
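Quantities a) and d) of claim 20 can be sketched in the same spirit. The sketch assumes that a pair is stored as a sorted tuple of its two textual sequences and that Ns(t, S) values are already available in a dictionary; `make_pair`, `is_doc_count`, and `is_product` are hypothetical names.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # a pair i, stored with its two sequences in sorted order


def make_pair(t1: str, t2: str) -> Pair:
    """Normalize a pair so that (t1, t2) and (t2, t1) compare equal."""
    a, b = sorted((t1, t2))
    return (a, b)


def is_doc_count(pair: Pair, pairs_by_doc: Dict[str, Set[Pair]]) -> int:
    """Quantity a): number of documents that contain the pair."""
    return sum(1 for pairs in pairs_by_doc.values() if pair in pairs)


def is_product(pair: Pair, ns: Dict[str, float]) -> float:
    """Quantity d): Ns(t1, S) * Ns(t2, S) for the pair's two sequences."""
    t1, t2 = pair
    return ns[t1] * ns[t2]


pairs_by_doc = {
    "A1": {make_pair("acme", "widgetco")},
    "A2": {make_pair("widgetco", "acme"), make_pair("acme", "merger")},
}
ns_values = {"acme": 3.0, "widgetco": 2.0, "merger": 6.0}  # assumed Ns(t, S) values

pair = make_pair("acme", "widgetco")
pair_count = is_doc_count(pair, pairs_by_doc)
pair_product = is_product(pair, ns_values)
```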
21. The computer-implemented method of claim 20, wherein assigning a score to a document further comprises:
measuring an indication of importance of the document relative to the plurality of documents.
22. The computer-implemented method of claim 19, wherein measuring an importance of the document comprises:
determining sigma (Ns(t, S)) for all identified textual sequences t that the document introduced for the first time in the plurality of documents.
23. The computer-implemented method of claim 21, wherein measuring an indication of importance of the document comprises:
determining sigma (Is(i, S)) for all of the one or more pairs that the document introduced for a first time in the plurality of documents.
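The sums in claims 22 and 23 can be sketched with one helper, since both add up importance values over the items a document introduces for the first time. The sketch assumes each document is represented as a set of item keys (textual sequences or pairs), that documents are processed in sequence order, and that the importance values are supplied in a dictionary; `first_introduction_scores` is a hypothetical name.

```python
from typing import Dict, List, Set


def first_introduction_scores(
    ordered_docs: List[Set[str]], importance: Dict[str, float]
) -> List[float]:
    """For each document, sum the importance of items not seen in earlier documents."""
    seen: Set[str] = set()
    scores = []
    for items in ordered_docs:
        introduced = items - seen  # items this document introduces for the first time
        scores.append(sum((importance[t] for t in introduced), 0.0))
        seen |= items
    return scores


ordered_docs = [{"acme", "merger"}, {"acme", "widgetco"}, {"merger"}]
importance = {"acme": 3.0, "merger": 2.0, "widgetco": 5.0}  # assumed Ns(t, S) values
intro_scores = first_introduction_scores(ordered_docs, importance)
```

The third document only repeats an item already introduced, so its sum is zero.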
24. The computer-implemented method of claim 21, wherein measuring an indication of importance of the document comprises:
determining a sum of improvement values for all of the identified textual sequences contained in the document, wherein an improvement for an identified textual sequence t of the document A comprises one or more of the following:
a) Ns(t, {S1, A}) − Ns(t, S1) over all identified textual sequences contained in the document A;
b) (Ns(t, {S1, A}) − Ns(t, S1))/Ns(t, {S1, A}); or
c) ((Ns(t, {S1, A}) − Ns(t, S1))/Ns(t, {S1, A}))*Ns(t, S),
where S1 comprises documents in the sequence of documents that are temporally earlier in the plurality of documents than document A.
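The three improvement variants of claim 24 can be worked through on a small example. The sketch below is an assumption-laden illustration: it takes Ns to be the sum of Na(t, A) over a set of documents (one of several quantities the claims allow), treats each document as a dictionary of hypothetical Na values, and skips sequences whose Ns over {S1, A} is zero to avoid dividing by zero.

```python
from typing import Dict, List

Doc = Dict[str, float]  # hypothetical Na(t, A) values for one document


def ns(t: str, docs: List[Doc]) -> float:
    """Assumed Ns: sum of Na(t, A) over the given documents."""
    return sum(d.get(t, 0.0) for d in docs)


def improvement_sums(sequence: List[Doc], k: int) -> Dict[str, float]:
    """Sum improvement variants a), b), and c) over the sequences in document k."""
    doc, earlier = sequence[k], sequence[:k]  # A and S1
    totals = {"a": 0.0, "b": 0.0, "c": 0.0}
    for t in doc:
        before = ns(t, earlier)          # Ns(t, S1)
        after = ns(t, earlier + [doc])   # Ns(t, {S1, A})
        if after == 0.0:
            continue  # no contribution, and the ratio would be undefined
        gain = after - before
        totals["a"] += gain
        totals["b"] += gain / after
        totals["c"] += gain / after * ns(t, sequence)  # weighted by Ns(t, S)
    return totals


sequence = [
    {"acme": 1.0},
    {"acme": 1.0, "merger": 2.0},
    {"merger": 2.0},
]
totals = improvement_sums(sequence, k=1)
```

For the middle document, "acme" goes from 1.0 to 2.0 and "merger" from 0.0 to 2.0, so variant a) sums to 3.0, variant b) to 0.5 + 1.0, and variant c) to 0.5*2.0 + 1.0*4.0.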
25. The computer-implemented method of claim 21, wherein measuring an indication of importance of the document comprises:
determining a sum of improvement values for all of the pairs of the one or more pairs contained in the document, wherein an improvement for a pair of the one or more pairs for the document A comprises one or more of the following:
a) Is(i, {S1, A}) − Is(i, S1) over all pairs contained in the document A;
b) (Is(i, {S1, A}) − Is(i, S1))/Is(i, {S1, A}); or
c) ((Is(i, {S1, A}) − Is(i, S1))/Is(i, {S1, A}))*Is(i, S),
where S1 comprises documents in the sequence of documents that are temporally earlier in the plurality of documents than document A.
26. A computer-implemented method, comprising:
identifying one or more textual sequences that carry information in a document of a plurality of documents;
determining an indication of importance of each of the textual sequences relative to the plurality of documents;
assigning a score to the document based on the determined indication of importance of each of the textual sequences; and
providing the document among other documents based on the assigned score.
Dependent claims: 27-30
31. A computer-readable memory device that stores computer-executable instructions, comprising:
instructions for obtaining an ordered sequence of documents;
instructions for determining an amount of novel content contained in each document of the ordered sequence of documents by identifying one or more pairs of textual sequences that occur in close proximity to one another in a respective document;
instructions for assigning novelty scores to the documents based on the determined amount of novel content; and
instructions for providing the documents of the ordered sequence of documents based on the assigned novelty scores.
32. A computer-implemented system, comprising:
means for identifying one or more textual sequences that carry information in a document of a plurality of documents;
means for determining an indication of importance of each of the textual sequences relative to the plurality of documents;
means for assigning a score to the document based on the determined indication of importance of each of the textual sequences; and
means for providing the document among other documents based on the assigned score.
Specification