Automatic extraction of metadata using a neural network

US 6,044,375 A
Filed: 04/30/1998
Issued: 03/28/2000
Est. Priority Date: 04/30/1998
Status: Expired due to Term

First Claim

1. A method of automatically extracting metadata from a document, the method comprising:

(a) providing:

a computer readable document including blocks comprised of words,
an authority list, including common uses of a set of words, and
a neural network trained to extract metadata from compounds;

(b) locating authority information associated with the words by comparing the words with the authority list;

(c) creating compounds, a first of the compounds describing a first of the blocks and including:

first-block words,
descriptive information associated with one of the first-block and the first block words, and
authority information associated with one first-block word;

(d) processing the compounds through the neural network to generate metadata guesses; and

(e) deriving the metadata from the metadata guesses.

Abstract

A method of automatically extracting metadata from a document. The method of the invention provides a computer readable document that includes blocks comprised of words, an authority list that includes common uses of a set of words, and a neural network trained to extract metadata from groupings of data called compounds. Compounds are created with one compound describing each of the blocks. Each compound includes the words making up the block, descriptive information about the block, and authority information associated with some of the words. The descriptive information may include such items as bounding box information, describing the size and position of the block, and font information, describing the size and type of font the words of the block use. The authority information is located by comparing each of the words from the block to the authority list. The compounds are processed through the neural network to generate metadata guesses, including word guesses, compound guesses and document guesses, along with confidence factors associated with the guesses indicating the likelihood that each of the guesses is correct. The method may additionally include providing a document knowledge base of positioning information and size information for metadata in known documents. If the document knowledge base is provided, then the method includes deriving analysis data from the metadata guesses and comparing the analysis data to the document knowledge base to determine metadata output.
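
The compound described in the abstract can be pictured as a small record that bundles a block's words, its descriptive information, and any authority-list matches. The following Python sketch is illustrative only: the class names, field names, and the sample authority list are assumptions made for this example and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical authority list: maps a lowercased word to its common uses.
AUTHORITY_LIST: Dict[str, List[str]] = {
    "smith": ["surname"],
    "january": ["month"],
    "street": ["address"],
}


@dataclass
class BoundingBox:
    """Size and position of a block on the page (units are an assumption)."""
    x: float
    y: float
    width: float
    height: float


@dataclass
class Compound:
    """One compound per block: words, descriptive info, authority info."""
    words: List[str]
    bounding_box: BoundingBox
    font_name: str
    font_size: float
    authority: Dict[str, List[str]] = field(default_factory=dict)


def locate_authority(words: List[str],
                     authority_list: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Compare each word with the authority list and keep the matches."""
    found: Dict[str, List[str]] = {}
    for word in words:
        key = word.lower().strip(".,;:")  # simple normalization (assumption)
        if key in authority_list:
            found[word] = authority_list[key]
    return found


def create_compound(words: List[str], box: BoundingBox, font_name: str,
                    font_size: float,
                    authority_list: Dict[str, List[str]] = AUTHORITY_LIST) -> Compound:
    """Build the compound that describes a single block."""
    return Compound(words, box, font_name, font_size,
                    locate_authority(words, authority_list))


compound = create_compound(["John", "Smith"], BoundingBox(72, 90, 200, 14),
                           "Times", 12.0)
print(compound.authority)
```

In this sketch only "Smith" matches the sample list, so the compound carries authority information for that one word. How compounds are actually encoded as neural network input is not specified in the text above and is left out here.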

322 Citations

16 Claims

1. A method of automatically extracting metadata from a document, the method comprising:
- (a) providing:
  
  a computer readable document including blocks comprised of words,
  an authority list, including common uses of a set of words, and
  a neural network trained to extract metadata from compounds;
  
  (b) locating authority information associated with the words by comparing the words with the authority list;
  
  (c) creating compounds, a first of the compounds describing a first of the blocks and including:
  
  first-block words,
  descriptive information associated with one of the first-block and the first block words, and
  authority information associated with one first-block word;
  
  (d) processing the compounds through the neural network to generate metadata guesses; and
  
  (e) deriving the metadata from the metadata guesses.
  - 2. A method as in claim 1, in which step (a) additionally includes providing a document knowledge base including positioning information and size information for metadata in known documents; and
    - the method additionally comprises before step (e):
      
      deriving analysis data from the metadata guess; and
      
      comparing the analysis data to the document knowledge base to improve the metadata guesses.
  - 3. A method as in claim 1, in which the descriptive information includes bounding box information describing the size and position of the first of the blocks.
  - 4. A method as in claim 1, in which the descriptive information includes font information for the first-block words.
  - 5. A method as in claim 1, in which the metadata guesses include:
    - compound guesses, a first of the compound guesses indicating a possible block type for the first of the blocks, and
    - document guesses, a first of the document guesses indicating a possible document type for the computer readable document.
  - 6. A method as in claim 5, in which the metadata guesses additionally include:
    - word guesses, a first of the word guesses indicating a possible word type for the one first-block word.
  - 7. A method as in claim 5, in which the first of the compound guesses includes a compound confidence factor indicating a likelihood that the possible block type is correct.
  - 8. A method as in claim 5, in which the first of the document guesses includes a document confidence factor indicating a likelihood that the possible document type is correct.
  - 9. A method as in claim 2, in which the metadata guesses includes:
    - compound guesses, a first of the compound guesses including:
      
      a possible block type for the first of the blocks, and
      a compound confidence factor indicating a likelihood the possible block type is correct; and
      
      document guesses, a first of the document guesses including:
      
      a possible document type for the computer readable document, and
      a document confidence factor indicating a likelihood that the possible document type is correct.
  - 10. A method as in claim 9, in which the analysis data includes:
    - the first of the compound guesses and the first of the document guesses.
  - 11. A method as in claim 9, in which the analysis data of step (d) includes:
    - proximate block type data derived by comparing the first of the compound guesses against a second of the compound guesses, the second of the compound guesses including a possible block type for a second of the blocks located on the computer readable document proximate to the first of the blocks.
  - 12. A method as in claim 3, in which the bounding box is a first bounding box and the analysis data includes:
    - proximate block position data derived by comparing the first bounding box information with a second bounding box information, the second bounding box information describing the size and position of a second of the blocks located on the computer readable document proximate to the first of the blocks.
  - 13. A method as in claim 3, in which the analysis data of step (d) includes:
    - page position data derived from the bounding box information.
  - 14. A method as in claim 4, in which the analysis data of step (d) includes:
    - font data derived from the font information.
  - 15. A method as in claim 1, in which providing a computer readable document includes:
    - scanning a paper document to create scanner output; and
      
      performing an optical character recognition operation on the scanner output.
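
Claims 5 through 9 describe metadata guesses that pair a possible type with a confidence factor. A minimal sketch of that pairing follows, assuming confidence factors are numbers between 0 and 1 and that the most confident guess is preferred; the `Guess` class, the `best_guess` helper, and the sample values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Guess:
    """A possible type plus a confidence factor (assumed to lie in [0, 1])."""
    label: str
    confidence: float


def best_guess(guesses: List[Guess]) -> Guess:
    """Pick the guess with the highest confidence factor (an assumption;
    the claims do not say how a final guess is selected)."""
    return max(guesses, key=lambda g: g.confidence)


# Illustrative values only, standing in for neural network output.
compound_guesses = [Guess("title", 0.81), Guess("author", 0.12), Guess("body", 0.07)]
document_guesses = [Guess("article", 0.64), Guess("letter", 0.36)]

print(best_guess(compound_guesses).label)
print(best_guess(document_guesses).label)
```

Keeping every guess, rather than only the top one, matters for claims 10 and 11, where guesses for neighboring blocks are compared with one another.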

16. A method of automatically extracting metadata from a document, the method comprising:
- (a) providing:
  
  a computer readable document including blocks comprised of words,
  an authority list, including common uses of a set of words,
  a neural network trained to extract metadata from compounds, and
  a document knowledge base including positioning information and size information for metadata in known documents;
  
  (b) locating authority information associated with the words by comparing the words with the authority list;
  
  (c) creating compounds, a first compound describing a first of the blocks and including:
  
  first-block words,
  descriptive information associated with one of the first of the blocks and the first-block words, the descriptive information including:
  
  a first bounding box information describing the size and position of the first of the blocks, and
  font information describing one of the first-block words, and
  authority information associated with one of the first-block words;
  
  (d) processing the compounds through the neural network to generate metadata guesses including:
  
  word guesses, a first of the word guesses indicating a possible word type for the one of the first-block words,
  compound guesses, a first of the compound guesses indicating a possible block type for the first of the blocks and including a compound confidence factor indicating a likelihood that the possible block type is correct, and
  document guesses, a first of the document guesses indicating a possible document type for the computer readable document and including a document confidence factor indicating the likelihood that the possible document type is correct;
  
  (e) deriving analysis data from the metadata guesses, the analysis data including:
  
  the first of the compound guesses and the first of the document guesses,
  proximate block type data derived by comparing the first of the compound guesses against a second of the compound guesses, the second of the compound guesses including a possible block type for a second of the blocks located on the computer readable document proximate to the first of the blocks,
  proximate block position data derived by comparing the first bounding box information against a second bounding box information, the second bounding box information describing the size and position of the second of the blocks,
  page position data derived from the first bounding box information and the second bounding box information, and
  font data derived from the font information;
  
  (f) comparing the analysis data to the document knowledge base to improve the metadata guesses; and
  
  (g) deriving the metadata from the metadata guesses.
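
Steps (e) and (f) of claim 16 can be illustrated with a small sketch that derives page position and proximate block position data from bounding boxes and then checks a guess against a knowledge base. Everything specific here is an assumption: the knowledge base layout, the normalized ranges, and the rule of scaling confidence by a fixed factor are invented for illustration, since the claim does not state how the comparison improves the guesses.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Box:
    """Bounding box with y measured downward from the top of the page."""
    x: float
    y: float
    width: float
    height: float


# Hypothetical knowledge base: per block type, the typical range of vertical
# page position (0.0 = top, 1.0 = bottom) and of block height (as a fraction
# of page height) observed in known documents.
KNOWLEDGE_BASE: Dict[str, Dict[str, Tuple[float, float]]] = {
    "title": {"y": (0.0, 0.3), "height": (0.02, 0.15)},
    "footer": {"y": (0.9, 1.0), "height": (0.01, 0.05)},
}


def page_position(box: Box, page_height: float) -> float:
    """Page position data: top edge of the block as a fraction of page height."""
    return box.y / page_height


def vertical_gap(first: Box, second: Box) -> float:
    """Proximate block position data: distance from the bottom edge of the
    first block to the top edge of the second block."""
    return second.y - (first.y + first.height)


def adjust_confidence(block_type: str, confidence: float, box: Box,
                      page_height: float, factor: float = 1.2) -> float:
    """Raise confidence when the block fits the knowledge base entry for the
    guessed type and lower it otherwise (illustrative rule only)."""
    entry = KNOWLEDGE_BASE.get(block_type)
    if entry is None:
        return confidence  # no known documents to compare against
    y = page_position(box, page_height)
    h = box.height / page_height
    fits = (entry["y"][0] <= y <= entry["y"][1]
            and entry["height"][0] <= h <= entry["height"][1])
    adjusted = confidence * factor if fits else confidence / factor
    return min(adjusted, 1.0)


block = Box(72, 80, 400, 40)
print(adjust_confidence("title", 0.5, block, page_height=792))
print(adjust_confidence("footer", 0.6, block, page_height=792))
```

A block near the top of the page fits the sample "title" entry, so its confidence rises, while the same block guessed as "footer" is penalized. Font data and proximate block type data could be checked in the same way but are omitted to keep the sketch short.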

Current Assignee: HTC Corporation
Original Assignee: Hewlett-Packard Company (HP Inc.)
Inventors: Greig, Darryl; Staelin, Carl; Tamir, Tami; Shmueli, Oded
Primary Examiner(s): Kulik, Paul V.
Application Number: US09/070,439
Time in Patent Office: 698 Days
Field of Search: 706/20, 706/934, 382/159, 382/161, 382/229, 707/100-102, 707/200
US Class Current: 1/1

CPC Class Codes

G06F 16/355   Class or cluster creation o...
Y10S 706/934   Information retrieval or In...
Y10S 707/99942   Manipulating data structure...
Y10S 707/99943   Generating database or data...
