Methods and systems for determining a language of a document

US 7,191,116 B2
Filed: 06/19/2001
Issued: 03/13/2007
Est. Priority Date: 06/19/2001
Status: Active Grant

Abstract

A system and method for determining the language of an unknown document is provided. For a set of candidate languages, a negative assumption is set for each candidate language that the document is not that language and the system attempts to prove the negative assumption is wrong. If the negative assumption fails for one language, then the document is identified as being in that language. The present system and method provides a higher degree of accuracy when determining the language of a document.

48 Citations

19 Claims

1. A system for automatically determining a language of a document from a set of candidate languages, the system comprising:
- a database containing probability data for a plurality of text strings each having a predetermined length equal to each other, each text string of the plurality of text strings having an associated probability value indicating a probability that the text string occurs within a language based on occurrences of the text string in all of the candidate languages;
  
  logic for setting a negative assumption value for each of the candidate languages indicating the document is not one of the candidate languages;
  
  an extractor for extracting a character string from the document, the character string having a length equal to the predetermined length of the plurality of text strings contained in the database; and
  
  a language analyzer for determining a probability value for each of the candidate languages that the character string does not belong to the candidate languages by retrieving the probability value associated to the character string from the database for each of the candidate languages, and includes logic for adjusting the negative assumption value based on the probability value, the language analyzer determining that the document is one language of the candidate languages when the negative assumption value passes a threshold value.
- Dependent claims (2, 3, 4):
  - 2. The system as set forth in claim 1 further including an information retrieval engine for retrieving documents in response to a search request, the documents retrieved being analyzed by the language analyzer.
  - 3. The system as set forth in claim 1 wherein the logic for adjusting includes logic for combining the negative assumption value with the probability value.
  - 4. The system as set forth in claim 1 wherein the language analyzer further includes iteration logic for causing the extractor to extract another character string from the document if the negative assumption value fails to pass the threshold value.
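
The extractor element of claim 1 can be illustrated with a minimal Python sketch. It assumes a sliding window that advances one character at a time and a predetermined length of 3; the claims specify neither, and the name `extract_strings` is hypothetical.

```python
from typing import Iterator


def extract_strings(document: str, length: int = 3) -> Iterator[str]:
    """Yield every character string of exactly `length` characters.

    Assumption: a one-character sliding window. The claim only requires
    that each extracted string match the length of the database entries.
    """
    for start in range(len(document) - length + 1):
        yield document[start:start + length]


if __name__ == "__main__":
    for s in extract_strings("hello"):
        print(repr(s))
```

A document shorter than the predetermined length yields no strings, so a caller would have nothing to look up in the database for it.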

5. A method of determining a language of a document from a set of candidate languages, the method comprising the steps of:
- setting a null hypothesis to a true value for each candidate language indicating the document is not in the candidate language and setting a false value;
  
  extracting a text string from the document, the text string having a predetermined length;
  
  determining a contrary probability for each candidate language that the text string does not belong to the candidate language based on probabilities that the text string belongs to each of the candidate languages where the probabilities are retrieved from a database that stores probability values for a plurality of text strings each having the predetermined length, each text string of the plurality of text strings having an associated probability value for each candidate language indicating a probability that the text string occurs within a language from the candidate languages based on occurrences of the text string in all of the candidate languages;
  
  adjusting the null hypothesis for each candidate language with the contrary probability corresponding to the candidate language; and
  
  determining the document is one language from the candidate languages when the null hypothesis for the one language is disproved by approaching the false value.
- Dependent claims (6, 7, 8, 9, 10, 11, 12):
  - 6. The method as set forth in claim 5 further includes setting a threshold value indicating that the null hypothesis is disproved.
  - 7. The method as set forth in claim 6 further includes repeating the extracting step for a different text string from the document and repeating the method until the null hypothesis is disproved for one of the candidate languages by passing the threshold value.
  - 8. The method as set forth in claim 5 further includes pregenerating probability data corresponding to each candidate language, the probability data including a probability value for a text string that is normalized based on an occurrence probability of the text string in all the candidate languages.
  - 9. The method as set forth in claim 5 further includes identifying the document based on a search request.
  - 10. The method as set forth in claim 5 wherein the extracting step includes extracting a plurality of sequential characters that form the text string.
  - 11. The method as set forth in claim 5 wherein the setting step includes setting the true value to 1 and setting the false value to 0.
  - 12. The method as set forth in claim 5 wherein the contrary probability for a first candidate language is determined based on a number of occurrences of the text string found in a sample set of documents from the first candidate language which is normalized by a sum of occurrences of the text string found in a sample set of documents from all the candidate languages.
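
One possible reading of the method in claims 5 to 12 is sketched below in Python. Several points are assumptions rather than anything the claims state: the contrary probability is taken as 1 minus the stored probability, "adjusting" is done by multiplication, the threshold is 0.05, and the values in `PROB_DB` are invented for illustration. All names are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy database: probability that a trigram belongs to each
# language, normalized across the candidate languages (each row sums to 1).
PROB_DB: Dict[str, Dict[str, float]] = {
    "the": {"en": 0.90, "fr": 0.05, "de": 0.05},
    "he ": {"en": 0.80, "fr": 0.10, "de": 0.10},
    " le": {"en": 0.20, "fr": 0.70, "de": 0.10},
    "sch": {"en": 0.10, "fr": 0.05, "de": 0.85},
}


def identify_language(
    text: str,
    languages: List[str],
    prob_db: Dict[str, Dict[str, float]],
    n: int = 3,
    threshold: float = 0.05,
) -> Optional[str]:
    # Null hypothesis per language: "the document is NOT in this language".
    # It starts at the true value 1 and is disproved as it approaches 0.
    null_hypothesis = {lang: 1.0 for lang in languages}
    for i in range(len(text) - n + 1):
        row = prob_db.get(text[i:i + n])
        if row is None:
            continue  # string not in the database: no evidence either way
        for lang in languages:
            contrary = 1.0 - row.get(lang, 0.0)  # assumed form of the contrary probability
            null_hypothesis[lang] *= contrary    # assumed form of "adjusting"
        best = min(null_hypothesis, key=null_hypothesis.get)
        if null_hypothesis[best] < threshold:
            return best  # null hypothesis disproved for this language
    return None  # no hypothesis passed the threshold


if __name__ == "__main__":
    print(identify_language("the the", ["en", "fr", "de"], PROB_DB))
```

With multiplication as the update, a string that never occurs in a language leaves that language's hypothesis unchanged, while a string seen only in one language drives that hypothesis to 0 in a single step.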

13. A process of determining that a document is in a selected language, the process comprising the steps of:
- setting a probability assumption indicating that the document is not in the selected language;
  
  extracting a character string from the document; and
  
  disproving the probability assumption based on a contrary probability that the character string does not belong to the selected language such that if the contrary probability fails to support the probability assumption, then the document is determined as being in the selected language.
- Dependent claims (14, 15, 16, 17, 18, 19):
  - 14. The process as set forth in claim 13 further includes determining the document is the selected language from a set of candidate languages.
  - 15. The process as set forth in claim 14 further including generating a probability database having a contrary probability for each of a plurality of character strings for each of the candidate languages, where the contrary probability of a character string in one language is determined based on an occurrence frequency of the character string in the one language influenced by a total occurrence frequency of the character string in all the candidate languages.
  - 16. The process as set forth in claim 15 further including determining the occurrence frequency of each character string based on a sample set of documents provided for each of the candidate languages.
  - 17. The process as set forth in claim 15 wherein the contrary probability of the character string in one language is normalized by the total occurrence frequency of the character string in all the candidate languages.
  - 18. The process as set forth in claim 13 further including identifying the document in response to a search request.
  - 19. A computer program product configured to perform the process of claim 13.
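
The database generation described in claims 15 to 17 might look like the following Python sketch. It assumes trigrams, raw occurrence counts, and no smoothing, none of which the claims specify. It stores the normalized occurrence probability; under the further assumption that the contrary probability is the complement, a caller would compute it as 1 minus the stored value. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_probability_db(
    samples: Dict[str, List[str]], n: int = 3
) -> Dict[str, Dict[str, float]]:
    """Map each n-character string to its per-language probability.

    `samples` maps a language code to its sample set of documents.
    """
    # Occurrence frequency of each string within each language's samples.
    counts = {lang: Counter() for lang in samples}
    for lang, docs in samples.items():
        for doc in docs:
            for i in range(len(doc) - n + 1):
                counts[lang][doc[i:i + n]] += 1

    # Total occurrence frequency of each string across all languages.
    totals: Counter = Counter()
    for per_lang in counts.values():
        totals.update(per_lang)

    # Normalize each language's count by the total across all languages.
    return {
        gram: {lang: counts[lang][gram] / total for lang in samples}
        for gram, total in totals.items()
    }


if __name__ == "__main__":
    db = build_probability_db({"en": ["the"], "fr": ["the", "les"]})
    print(db)
```

Because raw counts are used, a language with a much larger sample set will dominate the strings it shares with other languages; balancing the sample sizes is left out of this sketch.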

Current Assignee: Oracle International Corporation (Oracle Corporation)
Original Assignee: Oracle International Corporation (Oracle Corporation)
Inventors: Alpha, Shamim A
Primary Examiner(s): Dorvil, Richemond
Assistant Examiner(s): Spooner, Lamont M
Application Number: US09/884,403
Publication Number: US 20030009324A1
Time in Patent Office: 2,093 Days
Field of Search: 704/2-8
US Class Current: 704/8
CPC Class Codes: G06F 40/263 Language identification
