Identifying a property of a document

US 8,380,488 B1
Filed: 04/19/2007
Issued: 02/19/2013
Est. Priority Date: 04/19/2006
Status: Active Grant

Abstract

Methods, systems and apparatus, including computer program products, for identifying properties of an electronic document. In one aspect, a sequence of bytes representing text in a document is received. A plurality of byte-n-grams are identified from the bytes. For multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings is identified. A respective encoding score for each of the multiple encodings is determined. A most likely encoding of the document is identified based on a highest encoding score among the encoding scores. In another aspect, a sequence of characters having an encoding is identified in a document. The sequence is segmented into features, each corresponding to two or more characters. A respective score for each of multiple languages is determined based on the features and a respective language model. A language of the document is identified based on the scores.
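The second aspect of the abstract (segmenting a character sequence into features of two or more characters and scoring each language against a language model) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the feature size of two characters, the toy `LANG_MODELS` tables and their probabilities, the `UNSEEN_PROB` floor, and the choice to sum log-probabilities. The abstract only says the scores are "based on" the features and a language model, so this is one possible reading, not the patented implementation. The input is assumed to be already decoded text.

```python
import math

# Hypothetical character-bigram language models; the probabilities are made up.
LANG_MODELS = {
    "en": {"th": 0.03, "he": 0.03, "ch": 0.005},
    "de": {"ch": 0.03, "sc": 0.02, "he": 0.01},
}
UNSEEN_PROB = 1e-6  # assumed floor for features missing from a model


def segment_features(text, size=2):
    """Split text into overlapping features of `size` characters each."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def identify_language(text):
    """Return (best_language, scores) using summed log-probabilities."""
    features = segment_features(text.lower())
    scores = {
        lang: sum(math.log(model.get(f, UNSEEN_PROB)) for f in features)
        for lang, model in LANG_MODELS.items()
    }
    return max(scores, key=scores.get), scores
```

Summing logarithms rather than multiplying raw probabilities is a common way to avoid numeric underflow on long documents; it does not change which language ranks highest.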

121 Citations

15 Claims

1. A computer-implemented method, comprising:
- receiving a sequence of bytes representing text in a document;
  
  identifying a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
  
  identifying, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
  
  receiving multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
  
  determining a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
  
  identifying a most likely encoding of the document based on a highest encoding score among the encoding scores.
- Dependent claims (2, 3, 4, 5):
  - 2. The method of claim 1, where n is two.
  - 3. The method of claim 1, further comprising:
    - removing from the plurality of byte-n-grams every byte-n-gram in which all bytes have a value less than 127.
  - 4. The method of claim 1, further comprising:
    - identifying a language based on the most likely encoding of the document.
  - 5. The method of claim 1, further comprising:
    - identifying a plurality of words represented in the sequence of bytes representing text in the document;
      
      determining a language score for each of multiple languages based on a respective per-word likelihood that each word occurs in each of the multiple languages; and
      
      identifying a most likely language of the document based on a highest language score among the language scores for the multiple languages.
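The steps of claims 1 to 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the tables `NGRAM_LOGP` and `TLD_PRIOR`, their values, the `UNSEEN_LOGP` floor, and all function names are hypothetical, and combining the byte-n-gram likelihoods with the top-level-domain likelihood by adding logarithms is only one way to make a score "based on" both.

```python
import math

# Hypothetical log-likelihoods of byte bigrams under each encoding (made-up values).
NGRAM_LOGP = {
    "UTF-8": {b"\xc3\xa9": math.log(0.05), b"\xe3\x81": math.log(0.001)},
    "Shift_JIS": {b"\x82\xa0": math.log(0.04), b"\xc3\xa9": math.log(0.0005)},
}
# Hypothetical likelihood that a document from a top-level domain uses each encoding.
TLD_PRIOR = {
    "fr": {"UTF-8": 0.9, "Shift_JIS": 0.1},
    "jp": {"UTF-8": 0.4, "Shift_JIS": 0.6},
}
UNSEEN_LOGP = math.log(1e-6)  # assumed floor for bigrams missing from a table


def byte_ngrams(data: bytes, n: int = 2):
    """Every run of n adjacent bytes in the sequence (claim 2 uses n = 2)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def drop_low_ngrams(ngrams):
    """Claim 3: remove n-grams in which all bytes have a value less than 127."""
    return [g for g in ngrams if not all(b < 127 for b in g)]


def encoding_scores(data: bytes, tld: str, n: int = 2):
    """Score each encoding from the n-gram likelihoods plus the TLD likelihood."""
    grams = drop_low_ngrams(byte_ngrams(data, n))
    scores = {}
    for enc, table in NGRAM_LOGP.items():
        score = math.log(TLD_PRIOR[tld][enc])
        score += sum(table.get(g, UNSEEN_LOGP) for g in grams)
        scores[enc] = score
    return scores


def most_likely_encoding(data: bytes, tld: str, n: int = 2):
    """Pick the encoding with the highest score."""
    scores = encoding_scores(data, tld, n)
    return max(scores, key=scores.get)
```

For example, `most_likely_encoding(b"caf\xc3\xa9", "fr")` selects `"UTF-8"` with these toy tables, because the two pure-ASCII bigrams are filtered out and the remaining bigram `b"\xc3\xa9"` is far more likely under the UTF-8 table.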

6. A computer program product, encoded on a non-transitory computer-readable medium, operable to cause data processing apparatus to:
- receive a sequence of bytes representing text in a document;
  
  identify a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
  
  identify, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
  
  receive multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
  
  determine a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
  
  identify a most likely encoding of the document based on a highest encoding score among the encoding scores.
- Dependent claims (7, 8, 9, 10):
  - 7. The program product of claim 6, where n is two.
  - 8. The program product of claim 6, being further operable to:
    - remove from the plurality of byte-n-grams every byte-n-gram in which all bytes have a value less than 127.
  - 9. The program product of claim 6, being further operable to:
    - identify a language based on the most likely encoding of the document.
  - 10. The program product of claim 6, being further operable to:
    - identify a plurality of words represented in the sequence of bytes representing text in the document;
      
      determine a language score for each of multiple languages based on a respective per-word likelihood that each word occurs in each of the multiple languages; and
      
      identify a most likely language of the document based on a highest language score among the language scores for the multiple languages.

11. A system comprising:
- means for receiving a sequence of bytes representing text in a document;
  
  means for identifying a plurality of byte-n-grams occurring in the sequence of bytes, each byte-n-gram comprising n adjacent bytes from the sequence of bytes, where n is an integer greater than 1;
  
  means for identifying, for multiple encodings, a respective likelihood of each byte-n-gram occurring in each of the respective multiple encodings;
  
  means for receiving multiple likelihoods associated with a top-level-domain of the document, each of the multiple likelihoods indicating a likelihood that documents from the top level domain are in a respective one of the multiple encodings;
  
  means for determining a respective encoding score for each of the multiple encodings, each respective encoding score being based on the likelihood of the plurality of byte-n-grams occurring in the corresponding one of the multiple encodings and based on the likelihoods associated with the top-level-domain the corresponding one of the multiple encodings; and
  
  means for identifying a most likely encoding of the document based on a highest encoding score among the encoding scores.
- Dependent claims (12, 13, 14, 15):
  - 12. The system of claim 11, where n is two.
  - 13. The system of claim 11, further comprising:
    - means for removing from the plurality of byte-n-grams every byte-n-gram in which all bytes have a value less than 127.
  - 14. The system of claim 11, further comprising:
    - means for identifying a language based on the most likely encoding of the document.
  - 15. The system of claim 11, further comprising:
    - means for identifying a plurality of words represented in the sequence of bytes representing text in the document;
      
      means for determining a language score for each of multiple languages based on a respective per-word likelihood that each word occurs in each of the multiple languages; and
      
      means for identifying a most likely language of the document based on a highest language score among the language scores for the multiple languages.
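The per-word language scoring recited in claims 5, 10, and 15 might look something like the sketch below. The `WORD_PROB` table, its values, the `UNSEEN_PROB` floor, whitespace tokenization, and the use of summed log-likelihoods are all assumptions for illustration; the claims do not specify how words are found or how the per-word likelihoods are combined. The bytes are decoded with an encoding supplied by the caller, for instance the most likely encoding found earlier.

```python
import math

# Hypothetical per-word likelihoods for each language (made-up values).
WORD_PROB = {
    "en": {"the": 0.05, "cat": 0.001, "chat": 0.0002},
    "fr": {"le": 0.05, "chat": 0.001, "the": 0.00005},
}
UNSEEN_PROB = 1e-7  # assumed floor for words missing from a table


def language_scores(words):
    """Sum the log of each word's likelihood under each language."""
    return {
        lang: sum(math.log(table.get(w, UNSEEN_PROB)) for w in words)
        for lang, table in WORD_PROB.items()
    }


def most_likely_language(data: bytes, encoding: str):
    """Decode the bytes, split into words, and pick the top-scoring language."""
    text = data.decode(encoding, errors="replace")
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    scores = language_scores(words)
    return max(scores, key=scores.get)
```

Whitespace splitting would not work for languages written without spaces; a real system would need a tokenizer suited to each script.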

Current Assignee: Google Inc. (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Liu, Xin; Yang, Stewart
Primary Examiner(s): Saint Cyr, Leonard
Application Number: US11/737,603
Time in Patent Office: 2,133 Days
Field of Search: 704/4-10
US Class Current: 704/4
CPC Class Codes:
- G06F 40/126: Character encoding
- G06F 40/129: Handling non-Latin characte...
- G06F 40/263: Language identification
