Discovery engine

US 9,507,867 B2
Filed: 03/06/2014
Issued: 11/29/2016
Est. Priority Date: 04/06/2012
Status: Active Grant

Abstract

A method that is relatively inexpensive to implement and that permits a user to conduct searches of electronically stored documents using an entire document, multiple documents or portions of a document as the search criteria and to collect, store and to share the relevant documents from the search.

10 Citations

27 Claims

1. A system for semantically searching a group of documents containing words, exclusive of stop words of the documents, thereby improving efficiency by flatly looking at the words being searched without attempting to understand the meaning of the words, comprising:
- a memory containing a set of instructions; and
  
  a processor for processing the set of instructions, wherein the instructions cause the processor to perform a method comprising:
  
  receiving by the processor a current instance of a search criteria containing words;
  
  determining by the processor a first total number of the words, exclusive of stop words, in the current instance of the search criteria;
  
  storing in the memory by the processor the first total number;
  
  for each of the words, exclusive of stop words, respectively, in the current instance of the search criteria, determining by the processor a respective first number of times that the word appears in the current instance of the search criteria;
  
  storing in the memory by the processor the respective first number of times;
  
  for each of the words, exclusive of stop words, respectively, in the current instance of the search criteria, calculating by the processor a first uniqueness score, respectively, for the word, respectively, based on the respective first number and the first total number;
  
  storing in the memory by the processor the first uniqueness score, respectively, for the word, respectively;
  
  for each of the words, exclusive of stop words, respectively, of the current instance of the search criteria and the documents, determining by the processor a respective second number of times that the word appears in the current instance of the search criteria and the documents;
  
  storing in the memory by the processor the respective second number of times, as a first frequency score, respectively;
  
  for each of the words, exclusive of stop words, of the current instance of the search criteria and each of the documents, respectively, calculating by the processor a respective first significance magnitude factor based on the first frequency score, respectively, and the first uniqueness score, respectively;
  
  storing in the memory by the processor the respective first significance magnitude factor;
  
  determining by the processor a second total number of the words, exclusive of stop words, in the documents of the group;
  
  storing in the memory by the processor the second total number;
  
  for each of the words, exclusive of stop words, respectively, of the documents, respectively, determining by the processor a respective third number of times that the word appears in the documents of the group;
  
  storing in the memory by the processor the respective third number of times;
  
  for each of the words, exclusive of stop words, respectively, of the documents, calculating by the processor a second uniqueness score, respectively, for the word, respectively, based on the respective third number and the second total number;
  
  storing in the memory by the processor the second uniqueness score, respectively, for the word, respectively;
  
  for each of the words, exclusive of stop words of the documents, respectively, in each of the documents, respectively, determining by the processor a respective fourth number of times that the word appears in the document;
  
  storing in the memory by the processor the respective fourth number, as a second frequency score, respectively;
  
  for each of the words, exclusive of stop words, of the documents, calculating by the processor a respective second significance magnitude factor based on the second frequency score, respectively, and the second uniqueness score, respectively;
  
  storing in the memory by the processor the respective second significance magnitude factor; and
  
  for each document of the group, generating by the processor a respective similarity score of contents of the document to the current instance of the search criteria, wherein generating the respective similarity score includes characterizing each document based on the respective second significance magnitude factor compared to the respective first significance magnitude factor.
- - 2. The system of claim 1 wherein:
    - the current instance of the search criteria includes a uniform resource locator (URL); and
      
      receiving the current instance of search criteria includes accessing information residing at a location designated by the URL, extracting at least a portion of the information that is in a format native to the location designated by the URL and generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 3. The system of claim 2 wherein the method further comprises:
    - sorting the respective similarity scores for at least a portion of the documents of the group for creating a set of the respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the respective similarity scores of the set to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 4. The system of claim 1 wherein the method further comprises:
    - sorting the respective similarity scores for at least a portion of the documents of the group for creating a set of the respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the respective similarity scores of the set to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 5. The system of claim 1 wherein generating the respective similarity score includes normalizing at least a portion of the respective second significance magnitude factor as a function of page count of the respective one of the documents with respect to one or more other documents in the documents of the group.
  - 6. The system of claim 5 wherein:
    - the current instance of the search criteria includes a uniform resource locator (URL); and
      
      receiving the current instance of search criteria includes accessing information residing at a location designated by the URL, extracting at least a portion of the information that is in a format native to the location designated by the URL and generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 7. The system of claim 6 wherein the method further comprises:
    - sorting the respective similarity scores for at least a portion of the documents of the group for creating a set of the respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the respective similarity scores of the set to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 8. The system of claim 1 wherein the method further comprises:
    - generating respective similarity scores for documents from a first group and for documents from a second group; and
      
      normalizing the respective similarity scores of each one of the documents from the first group and each one of the documents from the second group with respect to all documents of the first and second groups.
  - 9. The system of claim 8 wherein normalizing the respective similarity scores includes:
    - for each one of the groups, determining an arithmetic mean of the respective similarity scores for all of the documents in the one of the groups;
      
      for each one of the groups, generating a respective group normalized similarity score for each document of the one of the groups dependent upon the arithmetic mean of the respective similarity scores for all of the documents of the one of the groups; and
      
      for each one of the documents of each one of the groups, determining relevance of each one of the documents dependent upon the respective group normalized similarity score of each one of the documents of each one of the groups.
  - 10. The system of claim 8 wherein:
    - the current instance of the search criteria includes a uniform resource locator (URL); and
      
      receiving the current instance of search criteria includes accessing information residing at a location designated by the URL, extracting at least a portion of the information that is in a format native to the location designated by the URL and generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 11. The system of claim 8 wherein the method further comprises:
    - sorting the similarity scores for at least a portion of the documents of at least one of the groups for creating a set of respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the similarity scores of the at least one of the groups to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 12. The system of claim 8 wherein generating the respective similarity score includes normalizing at least a portion of the respective second significance magnitude factor as a function of page count of the respective one of the documents with respect to one or more other documents in the first and second groups.
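
The scoring steps recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patentee's implementation: the claims say only that each score is "based on" certain counts, so the logarithmic uniqueness formula, the product used for the significance magnitude factor, and the cosine comparison are all choices made here for illustration. The stop-word list and every function name are hypothetical.

```python
from collections import Counter
import math

# Hypothetical stop-word list; the claims do not enumerate one.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def uniqueness(counts):
    """Assumed formula: log(1 + total / count), so rarer words score higher."""
    total = sum(counts.values())
    return {w: math.log(1 + total / c) for w, c in counts.items()}

def cosine(a, b):
    """Assumed way of comparing two sets of significance factors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_scores(query, documents):
    q_counts = Counter(tokenize(query))                      # first numbers (sum = first total)
    doc_counts = [Counter(tokenize(d)) for d in documents]   # fourth numbers, per document
    group_counts = sum(doc_counts, Counter())                # third numbers (sum = second total)

    q_uniq = uniqueness(q_counts)       # first uniqueness scores
    g_uniq = uniqueness(group_counts)   # second uniqueness scores

    # First frequency score: occurrences in the search criteria plus the documents.
    # First significance magnitude factor: assumed to be frequency * uniqueness.
    q_sig = {w: (q_counts[w] + group_counts[w]) * q_uniq[w] for w in q_counts}

    scores = []
    for counts in doc_counts:
        # Second significance magnitude factor, using the same assumed product.
        d_sig = {w: c * g_uniq[w] for w, c in counts.items()}
        scores.append(cosine(q_sig, d_sig))
    return scores
```

Calling `similarity_scores("solar panel", docs)` returns one score per document. A document that shares no non-stop words with the search criteria scores 0.0 under these assumptions.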

13. A non-transitory computer-readable medium having tangibly embodied thereon and accessible therefrom processor-executable instructions that, when executed by at least one data processing device of at least one computer, causes said at least one data processing device to perform a method comprising:
- receiving a current instance of search criteria of words;
  
  determining a first total number of words in the current instance of the search criteria;
  
  for each of the words in the current instance of the search criteria, determining a respective first number of times that the word appears in the current instance of the search criteria;
  
  for each of the words in the current instance of the search criteria, calculating a first uniqueness score, respectively, for the word in the search criteria based on the respective first number and the first total number;
  
  for each of the words of the search criteria and each document of at least one dataset, determining a respective second number of times that the word appears in the search criteria and the document;
  
  for each of the words of the current instance of the search criteria and the documents, calculating a respective first significance magnitude factor based on the respective second number and the first uniqueness score, respectively;
  
  determining a second total number of words in the documents;
  
  for each of the words, respectively, of each of the documents, respectively, determining a respective third number of times that the word appears in the document;
  
  for each of the words, respectively, of the documents, calculating a second uniqueness score, respectively, for the word in the documents;
  
  for each of the words of each document, determining a fourth number of times that the word appears in the document;
  
  for each of the words of the documents, calculating a respective second significance magnitude factor based on the respective fourth number and the second uniqueness score, respectively;
  
  for each document of the at least one dataset, generating a respective similarity score of contents of the document to the current instance of the search criteria, wherein generating the respective similarity score includes characterizing each document based on the respective second significance magnitude factor compared to the respective first significance magnitude factor;
  
  thereby improving efficiency of data processing by flatly looking at the words being searched without attempting to understand the meaning of the words.
- - 14. The non-transitory computer-readable medium of claim 13 wherein:
    - the current instance of the search criteria includes a uniform resource locator (URL); and
      
      receiving the current instance of search criteria includes accessing information residing at a location designated by the URL, extracting at least a portion of the information that is in a hypertext markup language (HTML) format and generating search criteria in an extensible markup language (XML) format from at least a portion of the identified information in the HTML format.
  - 15. The non-transitory computer-readable medium of claim 13 wherein the method further comprises:
    - sorting the respective similarity scores for at least a portion of the documents of the at least one dataset for creating a set of the respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the respective similarity scores of the set to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 16. The non-transitory computer-readable medium of claim 13 wherein generating the respective similarity score includes normalizing at least a portion of the respective second significance magnitude factor as a function of page count of the respective one of the documents with respect to one or more other documents in the documents of the at least one dataset.
  - 17. The non-transitory computer-readable medium of claim 16 wherein:
    - the current instance of the search criteria includes a uniform resource locator (URL); and
      
      receiving the current instance of search criteria includes accessing information residing at a location designated by the URL, extracting at least a portion of the information that is in a format native to the location designated by the URL and generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 18. The non-transitory computer-readable medium of claim 13 wherein the method further comprises:
    - generating the respective similarity scores for documents from a first dataset and for documents from a second dataset; and
      
      normalizing the respective similarity scores of each one of the documents from the first dataset and each one of the documents from the second dataset with respect to all documents of the first and second datasets.
  - 19. The non-transitory computer-readable medium of claim 18 wherein normalizing the respective similarity scores includes:
    - for each one of the datasets, determining an arithmetic mean of the respective similarity scores for all of the documents in the one of the datasets;
      
      for each one of the datasets, generating a respective dataset normalized similarity score for each document of the one of the datasets dependent upon the arithmetic mean of the respective similarity scores for all of the documents of the one of the datasets; and
      
      for each one of the documents of each one of the datasets, determining relevance of each one of the documents dependent upon the respective dataset normalized similarity score of each one of the documents of each one of the datasets.
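
The per-dataset normalization recited in claims 18 and 19 can be illustrated with a short sketch. The claims state only that the normalized score is "dependent upon" the arithmetic mean of the scores in each dataset; dividing each score by its dataset mean is an assumption made here, and the function name and input layout are hypothetical.

```python
from statistics import mean

def normalize_by_dataset_mean(scores_by_dataset):
    """Rescale each dataset's similarity scores by that dataset's arithmetic mean.

    Assumption: normalized score = score / mean, so scores from datasets with
    different typical magnitudes become comparable around 1.0.
    """
    normalized = {}
    for name, scores in scores_by_dataset.items():
        m = mean(scores) if scores else 0.0
        # Guard against an empty dataset or an all-zero dataset.
        normalized[name] = [s / m if m else 0.0 for s in scores]
    return normalized
```

With this choice, a document scoring 6.0 in a dataset whose mean is 4.0 and a document scoring 0.45 in a dataset whose mean is 0.3 both normalize to 1.5, so they can be ranked against each other.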

20. A non-transitory computer-readable medium having tangibly embodied thereon and accessible therefrom processor-executable instructions that, when executed by at least one data processing device of at least one computer, causes said at least one data processing device to perform a method comprising:
- receiving a current instance of search criteria, wherein the current instance of the search criteria includes a uniform resource locator (URL);
  
  determining a first total number of words in the current instance of the search criteria;
  
  for each of the words in the current instance of the search criteria, determining a respective first number of times that the word appears in the current instance of the search criteria;
  
  for each of the words in the current instance of the search criteria, calculating a first uniqueness score, respectively, for the word in the search criteria based on the respective first number and the first total number;
  
  for each of the words of the search criteria and each document of at least one source of documents, determining a respective second number of times that the word appears in the search criteria and the document of the at least one source of documents;
  
  for each of the words of the current instance of the search criteria and the documents, calculating a respective first significance magnitude factor based on the respective second number and the first uniqueness score, respectively;
  
  determining a second total number of words in the documents;
  
  for each of the words, respectively, of each of the documents, respectively, determining a respective third number of times that the word appears in the document;
  
  for each of the words, respectively, of the documents, calculating a second uniqueness score, respectively, for the word in the documents;
  
  for each of the words of each document, determining a fourth number of times that the word appears in the document;
  
  for each of the words of the documents, calculating a respective second significance magnitude factor based on the respective fourth number and the second uniqueness score, respectively; and
  
  for each document in the at least one source of documents, generating a respective similarity score between the text used as the current instance of the search criteria and the document, wherein the similarity score is a function of the respective second significance magnitude factor and the respective first significance magnitude factor for the document;
  
  thereby improving efficiency of data processing by flatly looking at the words being searched without attempting to understand the meaning of the words.
- - 21. The non-transitory computer-readable medium of claim 20 wherein receiving the current instance of search criteria includes:
    - accessing information residing at a location designated by the URL;
      
      extracting at least a portion of the information that is in a format native to the location designated by the URL; and
      
      generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 22. The non-transitory computer-readable medium of claim 20 wherein the method further comprises:
    - sorting the respective similarity scores for at least a portion of the documents of the at least one source of documents for creating a set of the respective similarity scores associated with the current instance of the search criteria;
      
      enabling a document corresponding to one of the respective similarity scores of the set to be designated as a next instance of the search criteria; and
      
      causing the method to be performed for the next instance of the search criteria as the current instance of the search criteria.
  - 23. The non-transitory computer-readable medium of claim 20 wherein generating the respective similarity score includes normalizing at least a portion of the respective second significance magnitude factor as a function of page count of the respective one of the documents with respect to one or more other documents in the at least one source of documents.
  - 24. The non-transitory computer-readable medium of claim 23 wherein receiving the current instance of search criteria includes:
    - accessing information residing at a location designated by the URL;
      
      extracting at least a portion of the information that is in a format native to the location designated by the URL; and
      
      generating search criteria in a text-based format from at least a portion of the identified information in the format native to the location designated by the URL.
  - 25. The non-transitory computer-readable medium of claim 20 wherein the method further comprises:
    - generating the respective similarity scores for documents from a first source of documents and for documents from a second source of documents; and
      
      normalizing the respective similarity scores of each one of the documents from the first source of documents and each one of the documents from the second source of documents with respect to all documents of the first and second sources of documents.
  - 26. The non-transitory computer-readable medium of claim 25 wherein normalizing the respective similarity scores includes:
    - for each one of the sources of documents, determining an arithmetic mean of the respective similarity scores for all of the documents in the one of the sources of documents;
      
      for each one of the sources of documents, generating a respective dataset normalized similarity score for each document of the one of the sources of documents dependent upon the arithmetic mean of the respective similarity scores for all of the documents of the one of the sources of documents; and
      
      for each one of the documents of each one of the sources of documents, determining relevance of each one of the documents dependent upon the respective dataset normalized similarity score of each one of the documents of each one of the sources of documents.
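
Claims 20 and 21 recite receiving a URL, extracting information in the format native to that location, and generating text-based search criteria from it. The sketch below illustrates only the extraction step, assuming the native format is HTML and using Python's standard `html.parser` module. The class and function names are hypothetical, and skipping `<script>` and `<style>` content is a design choice made here, not something the claims require.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_search_text(html):
    """Turn an HTML page into plain text usable as search criteria."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)
```

The page itself could be fetched first, for example with `urllib.request.urlopen(url).read().decode()`, and the resulting string passed to `html_to_search_text`. Fetching is left out of the sketch so that it runs without network access.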

27. A method of semantically searching a group of documents containing words, the words are exclusive of stop words of the documents, by a computer including at least a processor and memory, thereby improving the efficiency of computer resources by flatly looking at the words being searched without attempting to understand the meaning of the words, comprising:
- (A) indexing by the processor each document of the group by (a) counting a first count of a total number of the words contained in the documents of the group, (b) storing in the memory the first count, (c) for each of the words, respectively, of the documents of the group, respectively, counting a second count, respectively, of a number of times that the word appears in the documents of the group, (d) for each of the words, respectively, of the documents of the group, storing in the memory the second count, respectively, (e) for each of the words, respectively, of the documents of the group, calculating a uniqueness score, respectively, based on the second count, respectively, for the word, and the first count, and (f) for each of the words, respectively, of the documents of the group, storing in the memory the uniqueness score for the word;
  
  (B) indexing by the processor each document of the group by (a) for each of the words and for each of the documents, counting a third count, respectively, of a number of times the word appears in the document, (b) for each of the words for each of the documents, respectively, storing in the memory the third count, respectively, as a frequency score, respectively, (c) for each of the words and for each of the documents, calculating a first significance magnitude factor, respectively, based on the frequency score, respectively, and the uniqueness score, respectively, and (d) for each of the words for each of the documents, respectively, storing in the memory the first significance magnitude factor, respectively;
  
  (C) receiving by the processor a search criteria, the search criteria selected from the group consisting of: any of the documents, any search words, any other document not in the group, any URL, and combinations;
  
  (D) indexing by the processor the search criteria and the documents of the group using the same steps set forth in (A) and (B) above using only words, exclusive of stop words, of the search criteria, to obtain a second significance magnitude factor, respectively, for each of the words, respectively, of the search criteria;
  
  (E) comparing the second significance factor for each of the words in the search criteria to the first significance factor for the words, respectively, in each of the documents of the group;
  
  (F) for each of the documents of the group, aggregating results of the comparing, for each of the words of the search criteria, into a similarity score, respectively, for the document in comparison to the search criteria;
  
  (G) presenting the similarity scores, respectively, for the documents, respectively, so that the documents in the group having significance in respect of the similarity scores can be utilized by a person looking for documents in the group that are similar to the search criteria.
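
Steps (A) through (G) of claim 27 describe indexing the group once and then scoring search criteria against the stored index. A minimal sketch of that index-then-query structure follows. It rests on several assumptions that are not in the claim: the logarithmic uniqueness formula, the product used for the significance magnitude factor, the per-word minimum used for the comparison in step (E), and the simplification of step (D) to indexing the search criteria on their own. The class name, the method names, and the stop-word list are hypothetical.

```python
from collections import Counter
import math

# Hypothetical stop-word list; the claim does not enumerate one.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

def words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

class Index:
    def __init__(self, documents):
        # (B)(a): per-document word counts (the frequency scores).
        self.doc_counts = [Counter(words(d)) for d in documents]
        # (A)(c): per-word counts over the whole group; (A)(a): total word count.
        group = sum(self.doc_counts, Counter())
        total = sum(group.values())
        # (A)(e): uniqueness score, assumed to be log(1 + total / count).
        self.uniq = {w: math.log(1 + total / c) for w, c in group.items()}
        # (B)(c): first significance magnitude factor, assumed frequency * uniqueness.
        self.sig = [{w: c * self.uniq[w] for w, c in counts.items()}
                    for counts in self.doc_counts]

    def search(self, criteria):
        # (D): index the search criteria with the same assumed formulas.
        q_counts = Counter(words(criteria))
        q_total = sum(q_counts.values())
        q_sig = {w: c * math.log(1 + q_total / c) for w, c in q_counts.items()}
        results = []
        for doc_id, d_sig in enumerate(self.sig):
            # (E), (F): compare per word and aggregate; min() is an assumed comparison.
            score = sum(min(q_sig[w], d_sig.get(w, 0.0)) for w in q_sig)
            results.append((doc_id, score))
        # (G): present the scores, highest first.
        return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Building the `Index` once and calling `search` repeatedly avoids recounting the group for every query, which matches the claim's separation of the indexing steps from the receiving and comparing steps.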

Current Assignee: Enlyton, Inc.
Original Assignee: Enlyton, Inc.
Inventors: Johns, Mark Ellingham; McKinzie, Chris
Primary Examiner(s): Perveen, Rehana
Assistant Examiner(s): Hoffler, Raheem

Application Number: US14/199,985
Publication Number: US 20140236941A1
Time in Patent Office: 999 Days
Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/334   Query execution G06F16/335 ...
G06F 16/93   Document management systems
G06F 16/951   Indexing; Web crawling tech...
G06F 16/953   Querying, e.g. by the use o...
G06F 16/9535   Search customisation based ...
G06F 16/954   Navigation, e.g. using cate...
G06F 16/9566   URL specific, e.g. using al...
