Determination of a semantic snapshot

US 20030221160A1
Filed: 05/22/2003
Published: 11/27/2003
Est. Priority Date: 05/24/2002
Status: Abandoned Application

Abstract

A method and apparatus for characterizing a document are described, particularly for the recognition, organization or relating of documents, for which purpose a series of statistical properties of the text in the document is determined. A list of words occurring in the document is determined and a frequency of occurrence is determined for each word in the list. The series is then built up of pairs respectively of one word from the list and the frequency of that word, where the series forms a semantic snapshot of the document. The semantic snapshot is used for comparing documents with one another or for comparing with a semantic snapshot of a specific area of attention or subject, so that the relevance of the document to that subject is determined.
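
The steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function name `semantic_snapshot`, the regular-expression tokenizer, and the lowercasing step are all assumptions, since the source does not say how words are delimited.

```python
import re
from collections import Counter

def semantic_snapshot(text):
    """Return a list of (word, frequency) pairs in order of first occurrence."""
    # Assumption: a word is a run of letters, digits, or apostrophes; case is ignored.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    counts = Counter(words)
    # Counter keeps insertion order (Python 3.7+), so pairs follow first occurrence.
    return list(counts.items())

snapshot = semantic_snapshot("The printer prints; the scanner scans the page.")
```

Here `snapshot` holds one pair per distinct word, for example `("the", 3)`.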

29 Claims

1. A method of characterizing a document wherein a series of statistical properties of text in the document is determined, the method comprising:
- determining a list of words occurring in the document;
- determining a frequency of occurrence for each word in the list; and
- building up the series with pairs, each pair having one word from the list and the frequency of that word, wherein the series forms a semantic snapshot of the document.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
  - 2. A method according to claim 1, wherein, in the step of determining the list of words, the list of words is processed by omitting words shorter than a predetermined length.
  - 3. A method according to claim 1, wherein, in the step of determining the list of words, the list of words is processed by sorting by at least one of the following criteria:
    - sequence of occurrence;
    - alphabetical sequence;
    - sequence of word length; and
    - sequence of frequency.
  - 4. A method according to claim 1, wherein, in the step of determining the list of words, the list of words is processed by combining or replacing words based on correcting incorrectly or differently spelled words, on reduction of verbs or nouns to a basic form, on recognition of homonyms or synonyms, and/or on a database of technical terms.
  - 5. A method according to claim 1, wherein, in the step of determining the list of words, the list of words is processed by translating words into another language.
  - 6. A method according to claim 1, wherein, in the building step, the semantic snapshot is processed by normalizing the frequencies in the pairs.
  - 7. A method according to claim 1, wherein, in the building step, the semantic snapshot is processed by adding data concerning a semantic structure.
  - 8. A method according to claim 7, wherein the data concerning the semantic structure include author, department, keywords and/or subject.
  - 9. A method according to claim 1, further comprising:
    - determining a relationship between the document and other documents by comparing semantic snapshots, so as to group related documents by subject or to arrange closely related documents.
  - 10. A method according to claim 1, further comprising:
    - determining a relationship between the document and a specific subject by comparing the semantic snapshot of the document and a semantic snapshot specific to the subject and on the basis of a set of known documents and/or a list of words relating to the subject.
  - 11. A method according to claim 1, wherein the document is a document delivered by an application program, an e-mail, or a document scanned by a scanner.
  - 12. A method according to claim 1, further comprising:
    - transmitting the semantic snapshot of the document over a network.
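
Claims 2, 3, and 6 describe filtering by word length, sorting, and normalizing the frequencies. The sketch below shows one way these could be combined. The function name `process_snapshot`, its parameters, and the choice to normalize so that the frequencies sum to 1 are assumptions; the claims do not fix a threshold or a normalization scheme.

```python
def process_snapshot(pairs, min_length=3, sort_by="frequency"):
    """Filter, sort, and normalize a list of (word, count) pairs."""
    # Claim 2: omit words shorter than a predetermined length (assumed default: 3).
    kept = [(w, c) for w, c in pairs if len(w) >= min_length]
    # Claim 3: sort by one of the listed criteria; "occurrence" keeps input order.
    if sort_by == "alphabetical":
        kept.sort(key=lambda p: p[0])
    elif sort_by == "length":
        kept.sort(key=lambda p: len(p[0]))
    elif sort_by == "frequency":
        kept.sort(key=lambda p: p[1], reverse=True)
    # Claim 6: normalize the frequencies (assumed here: divide by the total count).
    total = sum(c for _, c in kept)
    return [(w, c / total) for w, c in kept] if total else []

pairs = [("the", 3), ("of", 1), ("printer", 2), ("scanner", 1)]
result = process_snapshot(pairs)
```

With the assumed defaults, the two-letter word "of" is dropped and the remaining pairs are ordered from most to least frequent.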

13. A computer program product embodied on at least one computer-readable medium, for characterizing a document, the computer program product comprising computer-executable instructions for:
- determining a list of words occurring in the document;
- determining a frequency of occurrence for each word in the list; and
- building up the series with pairs, each pair having one word from the list and the frequency of that word, wherein the series forms a semantic snapshot of the document.
- Dependent claims (14, 15, 16, 17):
  - 14. A computer program product according to claim 13, wherein the list of words is processed by omitting words shorter than a predetermined length.
  - 15. A computer program product according to claim 13, wherein the list of words is processed by sorting by at least one of the following criteria:
    - sequence of occurrence;
    - alphabetical sequence;
    - sequence of word length; and
    - sequence of frequency.
  - 16. A computer program product according to claim 13, wherein the list of words is processed by combining or replacing words based on correcting incorrectly or differently spelled words, on reduction of verbs or nouns to a basic form, on recognition of homonyms or synonyms, and/or on a database of technical terms.
  - 17. A computer program product according to claim 13, wherein the list of words is processed by translating words into another language.
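
Claims 4 and 16 mention combining or replacing words, for example by reducing them to a basic form or merging spelling variants. A minimal sketch of that idea follows, assuming a hand-written lookup table. The function name `combine_words` and the `canonical` mapping are hypothetical; a real system might draw on a stemmer, a thesaurus, or the database of technical terms the claims refer to.

```python
from collections import Counter

def combine_words(pairs, canonical):
    """Merge (word, count) pairs whose words map to the same canonical form."""
    merged = Counter()
    for word, count in pairs:
        # Words missing from the table are kept unchanged.
        merged[canonical.get(word, word)] += count
    return list(merged.items())

# Hypothetical table: verb forms reduced to a base form, spelling variant unified.
canonical = {"prints": "print", "printed": "print", "colour": "color"}
pairs = [("prints", 2), ("printed", 1), ("colour", 1), ("color", 3)]
combined = combine_words(pairs, canonical)
```

After merging, the counts of the variants are added together under a single entry.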

18. A data signal, wherein the signal represents a data structure of a semantic snapshot as formed by:
- determining a list of words occurring in a document;
- determining a frequency of occurrence for each word in the list; and
- building up a series of statistical properties of text in the document with pairs, each pair having one word from the list and the frequency of the word, wherein the series forms the semantic snapshot of the document.
- Dependent claims (19):
  - 19. A data signal according to claim 18, wherein the signal is stored on a data support.
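
Claims 12, 18, and 19 treat the snapshot as a data structure that can be transmitted or stored. The sketch below shows one possible encoding, assuming JSON over UTF-8. The source names no wire format, so the format and the function names `encode_snapshot` and `decode_snapshot` are illustrative assumptions only.

```python
import json

def encode_snapshot(pairs):
    """Serialize (word, frequency) pairs to bytes (assumed format: JSON, UTF-8)."""
    return json.dumps({"pairs": [[w, f] for w, f in pairs]}).encode("utf-8")

def decode_snapshot(payload):
    """Rebuild the list of (word, frequency) pairs from the encoded bytes."""
    data = json.loads(payload.decode("utf-8"))
    return [(w, f) for w, f in data["pairs"]]

original = [("printer", 2), ("scanner", 1)]
payload = encode_snapshot(original)   # bytes suitable for a socket or a file
restored = decode_snapshot(payload)
```

Keeping the pairs in a list rather than a JSON object preserves whatever ordering the snapshot was given.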

20. An apparatus for processing documents, the apparatus comprising:
- a module for characterizing a document by using a series of statistical properties of text of the document, wherein the module determines a list of words occurring in the document, determines a frequency of occurrence for each word in the list, and builds up the series from pairs, each pair having one word from the list and the frequency of that word, wherein the series forms a semantic snapshot of the document.
- Dependent claims (21, 22, 23, 24, 25, 26, 27, 28, 29):
  - 21. An apparatus according to claim 20, further comprising:
    - a document input unit to extract the text.
  - 22. An apparatus of claim 20, wherein the module processes the list of words by omitting words shorter than a predetermined length.
  - 23. An apparatus of claim 20, wherein the module processes the list of words by sorting by at least one of the following criteria:
    - sequence of occurrence;
    - alphabetical sequence;
    - sequence of word length; and
    - sequence of frequency.
  - 24. An apparatus of claim 20, wherein the module processes the list of words by combining or replacing words based on correcting incorrectly or differently spelled words, on reduction of verbs or nouns to a basic form, on recognition of homonyms or synonyms, and/or on a database of technical terms.
  - 25. An apparatus of claim 20, wherein the module processes the list of words by translating words into another language.
  - 26. An apparatus of claim 20, wherein the module processes the semantic snapshot by normalizing the frequencies in the pairs.
  - 27. An apparatus of claim 20, wherein the module processes the semantic snapshot by adding data concerning a semantic structure.
  - 28. An apparatus of claim 20, further comprising:
    - means for determining a relationship between the document and other documents by comparing semantic snapshots, so as to group related documents by subject or to arrange closely related documents.
  - 29. An apparatus of claim 20, further comprising:
    - means for determining a relationship between the document and a specific subject by comparing the semantic snapshot of the document and a semantic snapshot specific to the subject and on the basis of a set of known documents and/or a list of words relating to the subject.
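
Claims 9, 10, 28, and 29 rely on comparing semantic snapshots, either between documents or against a subject-specific snapshot. The claims do not name a comparison measure. The sketch below uses cosine similarity as an assumed stand-in, and the function name `snapshot_similarity` is hypothetical.

```python
import math

def snapshot_similarity(snap_a, snap_b):
    """Cosine similarity between two lists of (word, frequency) pairs."""
    a, b = dict(snap_a), dict(snap_b)
    # Only words present in both snapshots contribute to the dot product.
    dot = sum(f * b.get(w, 0) for w, f in a.items())
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty snapshot is treated as unrelated to everything
    return dot / (norm_a * norm_b)

document = [("printer", 2), ("toner", 1)]
subject = [("printer", 1), ("toner", 1), ("paper", 1)]
unrelated = [("recipe", 3)]
score = snapshot_similarity(document, subject)
```

A score near 1 suggests the document is closely related to the subject, while snapshots with no words in common score 0. Grouping related documents could then be done by thresholding or clustering on these scores.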

Current Assignee: Océ-Technologies B.V. (Canon Inc.)
Original Assignee: Océ-Technologies B.V. (Canon Inc.)
Inventors: Van Den Tillaart, Robertus Cornelis Willibrordus Theodorus Maria
Application Number: US10/443,229
Publication Number: US 20030221160A1
US Class Current: 715/500
CPC Class Codes: G06F 16/319 Inverted lists
