Method and system for document indexing and data querying

US 9,275,128 B2
Filed: 07/20/2010
Issued: 03/01/2016
Est. Priority Date: 07/23/2009
Status: Expired due to Fees

Abstract

Generating a document index comprises: obtaining a document to be indexed; and determining whether each monadic partition obtained from the document is a filter character and, if so, forming a polynary partition with the monadic partition and at least one adjacent monadic partition and indexing the polynary partition; otherwise, indexing the monadic partition.

Querying data comprises: receiving a data query; determining whether each monadic partition obtained from the data query is a filter character and, if so, forming a polynary partition with the monadic partition and at least one adjacent monadic partition and using the polynary partition to obtain search results; otherwise, using the monadic partition to obtain search results; and combining the search results to form a final query search result.
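
As a rough illustration of the indexing half of the abstract, the Python sketch below treats each whitespace-separated word as a monadic partition and pairs every filter partition with each neighbour it has. The word-level partition unit, the `FILTER` set, and the `index_document` helper are assumptions made for this example; the abstract does not prescribe them.

```python
from collections import defaultdict

# Hypothetical filter list; the patent derives it from appearance frequencies.
FILTER = {"the", "of"}

def index_document(doc_id, text, index):
    """Add one document's partitions to an inverted index (term -> doc ids)."""
    parts = text.lower().split()  # assumed monadic partition: one word
    for i, part in enumerate(parts):
        if part not in FILTER:
            index[part].add(doc_id)  # ordinary partition: indexed directly
            continue
        # Filter partition: indexed only in combination with a neighbour.
        if i > 0:
            index[parts[i - 1] + " " + part].add(doc_id)
        if i + 1 < len(parts):
            index[part + " " + parts[i + 1]].add(doc_id)
    return index

index = index_document("d1", "the history of search", defaultdict(set))
```

In this sketch the index gains entries such as "history of" and "of search" but never a standalone entry for "the" or "of", which is the behaviour the abstract describes for filter characters.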

12 Claims

1. A method for generating a document index, comprising:
- generating a preset filter character list, wherein generating includes:
  - determining monadic partitions from a sample set of documents, wherein monadic partitions comprise character text;
  - determining an appearance frequency for each of at least a subset of the monadic partitions among the sample set of documents; and
  - including a subset of the monadic partitions into the preset filter character list based at least in part on appearance frequencies corresponding to respective ones of the monadic partitions;
- obtaining a document to be indexed;
- performing a monadic partition operation on the document to obtain a plurality of monadic partitions associated with the document;
- for a first monadic partition in the plurality of monadic partitions associated with the document:
  - determining that the first monadic partition is a first filter character monadic partition based at least in part on matching the first monadic partition with the first filter character monadic partition of the preset filter characters list; and
  - in response to the determination that the first monadic partition is the first filter character monadic partition:
    - not adding a first entry in the document index corresponding to the first filter character monadic partition;
    - forming a polynary partition by combining the first filter character monadic partition with at least one other monadic partition in the plurality of monadic partitions associated with the document, wherein the polynary partition comprises a binary partition, wherein the at least one other monadic partition is adjacent to the first filter character monadic partition in the document; and
    - adding the first entry in the document index corresponding to the polynary partition; and
- for a second monadic partition in the plurality of monadic partitions associated with the document:
  - determining that the second monadic partition is not a second filter character monadic partition based at least in part on not matching the second monadic partition with the second filter character monadic partition of the preset filter characters list; and
  - in response to the determination that the second monadic partition is not the second filter character monadic partition, adding a second entry in the document index corresponding to the second monadic partition.
- - 2. The method of claim 1, wherein forming the binary partition further comprises:
    - forming the binary partition by combining the first filter character monadic partition with a subsequent monadic partition in the document in the event that the first filter character monadic partition is a first monadic partition in the document;
      
      forming the binary partition by combining the first filter character monadic partition with a previous monadic partition in the document in the event that the first filter character monadic partition is a last monadic partition in the document; and
      
      forming a first binary partition by combining the first filter character monadic partition with the previous monadic partition and forming a second binary partition by combining the first filter character monadic partition with the subsequent monadic partition in the event the first filter character monadic partition is neither the first monadic partition in the document nor the last monadic partition in the document.
  - 3. The method of claim 1, further comprising:
    - determining that the binary partition is not a first filter character binary partition based at least in part on not matching the binary partition with the first filter character binary partition of the preset filter characters list;
      
      wherein the first entry added in the document index corresponds to the binary partition.
  - 4. The method of claim 1, further comprising:
    - determining that the binary partition is a first filter character binary partition based at least in part on matching the binary partition with the first filter character binary partition of the preset filter characters list;
      
      in response to the determination that the binary partition is the first filter character binary partition, forming a ternary partition by combining the first filter character binary partition with at least one other monadic partition that is adjacent to the binary partition in the document; and
      
      determining that the ternary partition is not a first filter character ternary partition based at least in part on not matching the ternary partition with the first filter character ternary partition of the preset filter characters list;
      
      wherein the first entry added in the document index corresponds to the ternary partition.
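
Claims 1 through 4 can be read together as a small pipeline: derive the filter list from appearance frequencies, pair each filter partition with its neighbours (claim 2), and widen a binary partition that is itself on the filter list into a ternary one (claim 4). The Python sketch below is one possible reading, not the patented implementation. It assumes word-level partitions, a hypothetical `min_ratio` document-frequency threshold (the claims say only "based at least in part on appearance frequencies"), and widening to the right where a right-hand neighbour exists; the names `build_filter_list`, `binary_spans`, and `index_terms` are invented for the example.

```python
from collections import Counter

def build_filter_list(sample_docs, min_ratio=0.8):
    """Partitions appearing in at least min_ratio of the sample documents."""
    doc_freq = Counter()
    for doc in sample_docs:
        doc_freq.update(set(doc.split()))  # count once per document
    return {p for p, n in doc_freq.items() if n / len(sample_docs) >= min_ratio}

def binary_spans(parts, i):
    """Claim 2: slice bounds pairing partition i with each existing neighbour."""
    spans = []
    if i > 0:
        spans.append((i - 1, i + 1))  # previous + current
    if i + 1 < len(parts):
        spans.append((i, i + 2))      # current + subsequent
    return spans

def index_terms(parts, filters):
    """Terms that would receive index entries for one partitioned document."""
    terms = set()
    for i, part in enumerate(parts):
        if part not in filters:
            terms.add(part)  # non-filter partition is indexed as-is
            continue
        for start, end in binary_spans(parts, i):
            # Claim 4 (sketched): a binary partition that is itself on the
            # filter list is widened by one adjacent partition.
            if " ".join(parts[start:end]) in filters:
                if end < len(parts):
                    end += 1
                elif start > 0:
                    start -= 1
            term = " ".join(parts[start:end])
            if term not in filters:
                terms.add(term)
    return terms

samples = ["the cat of the town", "the dog of the farm", "the end of the road"]
monadic_filters = build_filter_list(samples)
# Assumption: a binary entry is added by hand to exercise the ternary path.
terms = index_terms(samples[0].split(), monadic_filters | {"of the"})
```

Here "of the" is treated as a filter binary partition, so it is indexed only as the ternary partition "of the town", while "cat of" and "the town" are indexed as ordinary binary partitions.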

5. A method for querying data, comprising:
- generating a preset filter characters list, wherein generating includes:
  - determining monadic partitions from a sample set of documents, wherein monadic partitions comprise character text;
  - determining an appearance frequency for each of at least a subset of the monadic partitions among the sample set of documents; and
  - including a subset of the monadic partitions into the preset filter characters list based at least in part on appearance frequencies corresponding to respective ones of the monadic partitions;
- receiving a data query;
- performing a monadic partition operation on the data query to obtain a first plurality of monadic partitions associated with the data query;
- for a first monadic partition in the first plurality of monadic partitions associated with the data query:
  - determining that the first monadic partition is a first filter character monadic partition based at least in part on matching the first monadic partition with the first filter character monadic partition of the preset filter characters list; and
  - in response to the determination that the first monadic partition is the first filter character monadic partition:
    - not searching a preset index using the first filter character monadic partition;
    - forming a polynary partition by combining the first filter character monadic partition with at least one other monadic partition in the first plurality of monadic partitions associated with the data query, wherein the polynary partition comprises a binary partition, wherein the at least one other monadic partition is adjacent to the first filter character monadic partition in the data query; and
    - searching the preset index using the polynary partition to obtain a search result corresponding to the polynary partition; and
- for a second monadic partition in the first plurality of monadic partitions associated with the data query:
  - determining that the second monadic partition is not a second filter character monadic partition based at least in part on not matching the second monadic partition with the second filter character monadic partition of the preset filter characters list;
  - in response to the determination that the second monadic partition is not the second filter character monadic partition, searching the preset index using the second monadic partition to obtain a search result corresponding to the second monadic partition; and
- combining the search results to form a final query search result.
- - 6. The method of claim 5, wherein the preset index is established by:
    - obtaining a document to be indexed;
    - performing an indexing monadic partition operation on the document to obtain a second plurality of monadic partitions associated with the document;
    - for a third monadic partition in the second plurality of monadic partitions associated with the document:
      - determining that the third monadic partition is a third filter character monadic partition based at least in part on matching the third monadic partition with the third filter character monadic partition of the preset filter characters list; and
      - in response to the determination that the third monadic partition is the third filter character monadic partition:
        - not adding a first entry in the preset index corresponding to the third filter character monadic partition;
        - forming a second polynary partition by combining the third filter character monadic partition with at least one other monadic partition in the second plurality of monadic partitions associated with the document, wherein the at least one other monadic partition is adjacent to the third filter character monadic partition in the document; and
        - adding the first entry in the preset index corresponding to the second polynary partition; and
    - for a fourth monadic partition in the second plurality of monadic partitions associated with the document:
      - determining that the fourth monadic partition is not a fourth filter character monadic partition based at least in part on not matching the fourth monadic partition with the fourth filter character monadic partition of the preset filter characters list; and
      - in response to the determination that the fourth monadic partition is not the fourth filter character monadic partition, adding a second entry in the preset index corresponding to the fourth monadic partition.
  - 7. The method of claim 5, wherein forming the binary partition further comprises:
    - forming the binary partition by combining the first filter character monadic partition and a subsequent monadic partition in a document in the event that the first filter character monadic partition is a first monadic partition in the document;
      
      forming the binary partition by combining the first filter character monadic partition and a previous monadic partition in the document in the event that the first filter character monadic partition is a last monadic partition in the document; and
      
      forming a first binary partition by combining the first filter character monadic partition with the previous monadic partition and forming a second binary partition by combining the first filter character monadic partition with the subsequent monadic partition in the event that the first filter character monadic partition is neither the first monadic partition in the document nor the last monadic partition in the document.
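
The query-side steps of claims 5 through 7 mirror the indexing steps, with a final combination of the per-term results. The Python sketch below is illustrative only. The claims do not say how the search results are combined, so set intersection is an assumption here, as are the word-level partitions, the sample `FILTERS` and `PRESET_INDEX` data, and the helper names `query_terms` and `run_query`.

```python
# Hypothetical preset filter characters list.
FILTERS = {"of"}

# Hypothetical preset index: term -> ids of the documents containing it.
PRESET_INDEX = {
    "history": {"d1", "d2"},
    "history of": {"d1"},
    "of search": {"d1", "d3"},
    "search": {"d1", "d3"},
}

def query_terms(query):
    """Turn a query into the terms actually looked up in the index."""
    parts = query.lower().split()  # assumed monadic partition: one word
    terms = []
    for i, part in enumerate(parts):
        if part not in FILTERS:
            terms.append(part)  # non-filter partition is searched directly
            continue
        # A filter partition is never searched alone (claim 5); it is
        # paired with each neighbour it has in the query (claim 7).
        if i > 0:
            terms.append(f"{parts[i - 1]} {part}")
        if i + 1 < len(parts):
            terms.append(f"{part} {parts[i + 1]}")
    return terms

def run_query(query, index=PRESET_INDEX):
    """Look up every term and combine the results (assumed: intersection)."""
    results = [index.get(term, set()) for term in query_terms(query)]
    return set.intersection(*results) if results else set()

hits = run_query("history of search")
```

With the sample data, only document "d1" contains every looked-up term, so it is the single hit. A union or a ranked merge would satisfy the claim language equally well; intersection is chosen only to keep the example short.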

8. A document indexing system, comprising:
- one or more processors coupled to an interface, configured to:
  - generate a preset filter characters list, wherein to generate includes to:
    - determine monadic partitions from a sample set of documents, wherein monadic partitions comprise character text;
    - determine an appearance frequency for each of at least a subset of the monadic partitions among the sample set of documents; and
    - include a subset of the monadic partitions into the preset filter characters list based at least in part on appearance frequencies corresponding to respective ones of the monadic partitions;
  - obtain a document to be indexed;
  - perform a monadic partition operation on the document to obtain a plurality of monadic partitions associated with the document;
  - for a first monadic partition in the plurality of monadic partitions:
    - determine that the first monadic partition is a first filter character monadic partition based at least in part on matching the first monadic partition with the first filter character monadic partition of the preset filter characters list; and
    - in response to the determination that the monadic partition is the first filter character monadic partition:
      - do not add a first entry in a document index corresponding to the first filter character monadic partition;
      - form a polynary partition by combining the first filter character monadic partition with at least one other monadic partition in the plurality of monadic partitions associated with the document, wherein the polynary partition comprises a binary partition, wherein the at least one other monadic partition is adjacent to the first filter character monadic partition in the document; and
      - add the first entry in the document index corresponding to the polynary partition; and
  - for a second monadic partition in the plurality of monadic partitions associated with the document:
    - determine that the second monadic partition is not a second filter character monadic partition based at least in part on not matching the second monadic partition with the second filter character monadic partition of the preset filter characters list; and
    - in response to the determination that the second monadic partition is not the second filter character monadic partition, add a second entry in the document index corresponding to the second monadic partition; and
- one or more memories coupled to the one or more processors, configured to provide the processors with instructions.
- - 9. The system of claim 8, wherein forming the binary partition further comprises:
    - forming the binary partition by combining the first filter character monadic partition with a subsequent monadic partition in the document in the event that the first filter character monadic partition is a first monadic partition in the document;
      
      forming the binary partition by combining the first filter character monadic partition with a previous monadic partition in the document in the event that the first filter character monadic partition is a last monadic partition in the document; and
      
      forming a first binary partition by combining the first filter character monadic partition with the previous monadic partition and forming a second binary partition by combining the first filter character monadic partition with the subsequent monadic partition in the event the first filter character monadic partition is neither the first monadic partition in the document nor the last monadic partition in the document.

10. A data querying system, comprising:
- one or more processors coupled to an interface, configured to:
  - generate a preset filter characters list, wherein to generate includes to:
    - determine monadic partitions from a sample set of documents, wherein monadic partitions comprise character text;
    - determine an appearance frequency for each of at least a subset of the monadic partitions among the sample set of documents; and
    - include a subset of the monadic partitions into the preset filter characters list based at least in part on appearance frequencies corresponding to respective ones of the monadic partitions;
  - receive a data query;
  - perform a monadic partition operation on the data query to obtain a first plurality of monadic partitions associated with the data query;
  - for a first monadic partition in the first plurality of monadic partitions associated with the data query:
    - determine that the first monadic partition is a first filter character monadic partition based at least in part on matching the first monadic partition with the first filter character monadic partition of the preset filter characters list; and
    - in response to the determination that the first monadic partition is the first filter character monadic partition:
      - not search a preset index using the first filter character monadic partition;
      - form a polynary partition by combining the first filter character monadic partition with at least one other monadic partition in the first plurality of monadic partitions associated with the data query, wherein the polynary partition comprises a binary partition, wherein the at least one other monadic partition is adjacent to the first filter character monadic partition in the data query; and
      - search the preset index using the polynary partition to obtain a search result corresponding to the polynary partition;
  - for a second monadic partition in the first plurality of monadic partitions associated with the data query:
    - determine that the second monadic partition is not a second filter character monadic partition based at least in part on not matching the second monadic partition with the second filter character monadic partition of the preset filter characters list; and
    - in response to the determination that the second monadic partition is not the second filter character monadic partition, search the preset index using the second monadic partition to obtain a search result corresponding to the second monadic partition; and
  - combine the search results to form a final query search result; and
- one or more memories coupled to the one or more processors, configured to provide the processors with instruction.
- - 11. The system of claim 10, wherein the preset index is established by:
    - obtaining a document to be indexed;
    - performing an indexing monadic partition operation on the document to obtain a second plurality of monadic partitions associated with the document;
    - for a third monadic partition in the second plurality of monadic partitions associated with the document:
      - determining that the third monadic partition is a third filter character monadic partition based at least in part on matching the third monadic partition with the third filter character monadic partition of the preset filter characters list; and
      - in response to the determination that the third monadic partition is the third filter character monadic partition:
        - not adding a first entry in the preset index corresponding to the third filter character monadic partition;
        - forming a second polynary partition by combining the third filter character monadic partition with at least one other monadic partition in the second plurality of monadic partitions associated with the document, wherein the at least one other monadic partition is adjacent to the third filter character monadic partition in the document; and
        - adding the first entry in the preset index corresponding to the second polynary partition; and
    - for a fourth monadic partition in the second plurality of monadic partitions associated with the document:
      - determining that the fourth monadic partition is not a fourth filter character monadic partition based at least in part on not matching the fourth monadic partition with the fourth filter character monadic partition of the preset filter characters list; and
      - in response to the determination that the fourth monadic partition is not the fourth filter character monadic partition, adding a second entry in the preset index corresponding to the fourth monadic partition.
  - 12. The method of claim 11, wherein forming the binary partition further comprises:
    - forming the binary partition by combining the first filter character monadic partition and a subsequent monadic partition in the document in the event that the first filter character monadic partition is a first monadic partition in the document;
      
      forming the binary partition by combining the first filter character monadic partition and a previous monadic partition in the document in the event that the first filter character monadic partition is a last monadic partition in the document; and
      
      forming a first binary partition by combining the first filter character monadic partition with the previous monadic partition and forming a second binary partition by combining the first filter character monadic partition with the subsequent monadic partition in the event that the first filter character monadic partition is neither the first monadic partition in the document nor the last monadic partition in the document.

Current Assignee: Alibaba Group Holding Ltd.
Original Assignee: Alibaba Group Holding Ltd.
Inventors: Wei, Lei; Shen, Jiaxiang
Primary Examiner(s): Pulliam, Christyann
Assistant Examiner(s): Ohba, Mellissa M
Application Number: US12/804,441
Publication Number: US 20110022596A1
Time in Patent Office: 2,051 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06F 16/2228   Indexing structures
- G06F 16/2379   Updates performed during on...
- G06F 16/24554   Unary operations; Data part...
- G06F 16/31   Indexing; Data structures t...
- G06F 16/316   Indexing structures
- G06F 16/3337   Translation of the query la...
- G06F 16/93   Document management systems
