Text processing system and methods for automated topic discovery, content tagging, categorization, and search

US 9,483,532 B1
Filed: 05/24/2015
Issued: 11/01/2016
Est. Priority Date: 01/29/2010
Status: Expired due to Fees

Abstract

A computer system and methods are disclosed for automatically discovering topics and building a hierarchical topic structure, and for tagging and categorizing contents in a document or other natural language contents. The disclosed methods include steps for obtaining terms that best represent the topics in a text content, and building a hierarchical representation of topics of different levels or topic-comment relationships, and folder-subfolder structures. The methods further include obtaining, identifying, and selecting terms representing different degrees of informational importance based on the grammatical roles, parts of speech, and semantic attributes associated with the terms, using the terms to represent topics in the document, to automatically tag the document, to rank search results, and to build a category structure based on the selected terms.

66 Citations

20 Claims

1. A computer system, comprising:
- a processor operable to receive a text content comprising a plurality of terms, each term comprising one or more words or phrases;
  
  tokenize the text content into a plurality of terms, each term comprising one or more words or phrases;
  
  identifying a first semantic attribute or a first part of speech, wherein the first semantic attribute is selected from the group of semantic attributes consisting of at least an action, a thing, a person, an agent of an action, a recipient of an action or a thing, a state of an object, a mental state of a person, a physical state of a person, a positive or negative opinion, a name of a product, a name of a service, a name of an organization, wherein the first part of speech is selected from the group of parts of speech consisting of at least a noun or a pronoun, a transitive or intransitive verb or modal verb or link verb, an adjective, an adverb, a preposition, an article, a conjunction;
  
  identify a first term in the text content, wherein the first term is associated with the first semantic attribute or the first part of speech;
  
  identify a second term in the text content, wherein the second term is not associated with the first semantic attribute or the first part of speech;
  
  associate an importance measure to the first term, based at least on the first semantic attribute or the first part of speech, to mark the first term as bearing more importance than the second term in representing a topic or an information focus in the text content;
  
  extract the first term based on the importance measure; and
  
  output the first term;
  
  when the first term is output, the function of the first term includes being a tag or a label to represent a topic or a summary of the text content, or a category node;
  
  when the first term is output and displayed, the display format includes the font type, size, color, shape, position, or orientation of the first term based on the importance measure;
  
  when the text content containing the first term is made searchable using a query or is associated with a search index to produce a search result, the search result is ranked based at least on the importance measure.
  - 2. The system of claim 1, wherein the processor is further operable to:
    - associate a first weighting co-efficient to the first semantic attribute or the first part of speech, and determine the importance measure based on the first weighting co-efficient.
  - 3. The system of claim 1, wherein the processor is further operable to:
    - count a first frequency of the first term in the text content, and determine the importance measure based on the first frequency.
  - 4. The system of claim 1, wherein the processor is further operable to output an element selected from the group of elements consisting of the importance measure and a first text unit that includes the first term.
  - 5. The system of claim 4, wherein the processor is further operable to create a link between two of the elements selected from the group of elements consisting of the first term, the first text unit, and at least a portion of the text content associated with the first term.
  - 6. The system of claim 5, wherein the processor is further operable to detect an action on the link;
    - and display, based on the action, at least one of the elements selected from the group of elements consisting of the first term, the first text unit, and at least a portion of the text content associated with the first term.
  - 7. The system of claim 1, wherein the processor is further operable to receive, via a user interface object, a user indication;
    - and highlight the first term in the text content according to the user indication.
  - 8. The system of claim 1, wherein the processor is further operable to receive a search query comprising a keyword;
    - match the keyword with the first term;
      
      return the text content as a search result or part of a search result; and
      
      rank the search result based at least on the importance measure.
  - 9. The system of claim 1, wherein the text content is a sub-segment text unit in a collection of sub-segment text units, wherein the sub-segment text unit includes a sentence or a paragraph when the collection is a document comprising a plurality of sentences or paragraphs, or the sub-segment text unit includes an individual document when the collection is a document collection containing a plurality of documents.
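
The steps recited in claims 1 to 3 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `POS_LEXICON` lookup table stands in for a real part-of-speech tagger, the `POS_WEIGHTS` coefficients are invented for illustration, and combining weight and frequency by multiplication is an assumption, since the claims only say the importance measure is "based on" each of them. All function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical part-of-speech lexicon; a real system would use a tagger or parser.
POS_LEXICON = {
    "camera": "noun", "battery": "noun", "lens": "noun", "phone": "noun",
    "is": "link_verb", "has": "verb", "lasts": "verb",
    "great": "adjective", "sharp": "adjective", "long": "adjective",
    "the": "article", "a": "article", "and": "conjunction",
}

# Assumed weighting coefficients per part of speech (claim 2); values are illustrative.
POS_WEIGHTS = {"noun": 1.0, "verb": 0.5, "adjective": 0.4, "link_verb": 0.1}


def tokenize(text):
    """Split the text content into lowercase single-word terms."""
    return re.findall(r"[a-z]+", text.lower())


def importance_measures(text):
    """Score each term as (part-of-speech weight) x (frequency in the text)."""
    scores = {}
    for term, freq in Counter(tokenize(text)).items():
        weight = POS_WEIGHTS.get(POS_LEXICON.get(term), 0.0)
        if weight > 0:  # unweighted terms (articles, unknown words) get no score
            scores[term] = weight * freq
    return scores


def extract_tags(text, top_n=3):
    """Return the top_n terms by importance, ties broken alphabetically."""
    ranked = sorted(importance_measures(text).items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


def rank_documents(docs, keyword):
    """Return documents containing the keyword, highest importance first."""
    hits = []
    for doc in docs:
        score = importance_measures(doc).get(keyword.lower(), 0.0)
        if score > 0:
            hits.append((score, doc))
    hits.sort(key=lambda hit: -hit[0])
    return [doc for _, doc in hits]


review = "The camera has a sharp lens and the camera battery lasts long"
print(extract_tags(review))  # ['camera', 'battery', 'lens']
```

In this sketch "camera" outranks "lens" only because it occurs twice, which corresponds to the frequency-based variant of claim 3; the same scores drive the search ranking in `rank_documents`, loosely following claim 8.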

10. A method implemented on a computer comprising a processor, the method comprising:
- receiving a text content containing text;
  
  tokenizing the text content into a plurality of terms each comprising an element selected from the group of elements consisting at least of a word, a phrase, a sentence, a paragraph;
  
  identifying a first term in the text content;
  
  identifying, in at least a portion of the text content, a second term, wherein the portion of the text content contains the first term or is grammatically or semantically associated with the first term;
  
  identifying a first grammatical attribute associated with the second term, or identifying a first semantic attribute associated with the second term;
  
  selecting the second term as a term related to the first term based at least on the first grammatical attribute or the first semantic attribute;
  
  marking the first term for use as a first-level entity in a hierarchical format, and marking the second term for use as a second-level entity in the hierarchical format, wherein the second-level entity is marked as an element under or subordinate to the first-level entity; and
  
  outputting the first term and the second term to be used for at least providing a relational or hierarchical representation of the informational elements in the text content;
  
  when the first term is used to represent a first-level category node, and the second term is used to represent a second-level category node or the content of the first-level category, an embodiment format of at least one of the category nodes includes a text element, a folder or a directory, or a link name associated with the linked contents on a device selected from the group of devices consisting at least of a computer file system, an email system, a web-based or cloud-based system, a mobile or handheld computing or communication device;
  
  when the first term and the second term are displayed, a display format comprises at least representing the first term as a topic or information focus in the text content, and the second term as a comment or attribute associated with the topic or the information focus;
  
  or representing the first term as a folder or directory in an electronic content management system, and the second term as a sub-folder or sub-directory in the electronic content management system;
  
  when the text content or the first term is made searchable using a query or is associated with a search index to produce a search result, a display format of the search result comprises the first term with one or more of its corresponding second terms if the first term matches a keyword in the search query.
  - 11. The method of claim 10, when the first grammatical attribute is identified, the grammatical relationship between the first term and the second term includes a relationship between a subject and a non-subject of a sentence, or a modifier and a head of a complex phrase.
  - 12. The method of claim 10, when the first semantic attribute is identified, the semantic relationship between the first term and the second term includes a relationship between an object represented by the first term and a semantic attribute or attribute value represented by the second term and associated with the object, wherein the object includes at least a person, an organization, a product or service, a physical thing or a concept.
  - 13. The method of claim 10, further comprising:
    - comparing the first term and the second term, and selecting the second term if the second term is different from the first term.
  - 14. The method of claim 10, wherein the first term is obtained from an input source.
  - 15. The method of claim 10, when the first term is selected from the text content, the method further comprising:
    - identifying a second grammatical attribute or a second semantic attribute;
      
      identifying a term associated with the second grammatical attribute or the second semantic attribute; and
      
      selecting the term as the first term based at least on the second grammatical attribute or the second semantic attribute.
  - 16. The method of claim 15, further comprising:
    - counting the occurrence of the first term in the text content; and
      
      selecting the first term if the occurrence of the first term in the text content is above a threshold.
  - 17. The method of claim 15, further comprising:
    - associating a first importance measure with the first term based on the second grammatical attribute or the second semantic attribute associated with the first term;
      
      selecting the first term if the first importance measure is above a threshold.
  - 18. The method of claim 10, wherein the text content includes user reviews of a product or a service, or comments on social or political or financial or other topics, wherein the first term represents a major topic in the text content, and the second term represents a minor topic or a description or comment about the major topic;
    - wherein the first semantic attribute or the second semantic attribute includes at least a positive and a negative opinion, the method further comprising:
      
      providing a user interface object to selectively display, hide, or highlight the first term or the second term based on whether the first term or the second term indicates a positive or a negative opinion;
      
      or to selectively highlight one or more terms in the text content associated with the first term or the second term, according to whether the one or more terms indicate a positive or negative opinion.
  - 19. The method of claim 10, when the first grammatical attribute is identified and used, the first grammatical attribute includes a subject, a predicate, a sub-phrase of a multi-word phrase, a modifier in a multi-word phrase, a head of a multi-word phrase, a direct or indirect object, a predicative, a complement;
    - and the first grammatical attribute further includes parts of speech, wherein the parts of speech include at least a noun or a pronoun, a transitive or intransitive verb or modal verb or link verb, an adjective, an adverb, a preposition, an article, a conjunction.
  - 20. The method of claim 10, when the first semantic attribute is identified and used, the first semantic attribute includes at least an action, a thing or a person, an agent of an action, a recipient of an action or a thing, a state of an object, a mental or physical state of a person, a positive or negative opinion, a name of product or service or an organization.
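
The topic and comment hierarchy of claim 10, together with the filters of claims 13 and 16, can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented method: splitting each sentence at a link verb is a crude stand-in for the grammatical analysis (subject versus non-subject) that the claims refer to, and `LINK_VERBS`, `STOP_WORDS`, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical link verbs used as a crude subject/predicate boundary.
LINK_VERBS = {"is", "are", "was", "were"}
# Hypothetical words dropped before terms are formed.
STOP_WORDS = {"the", "a", "an", "very"}


def split_sentence(sentence):
    """Return (subject, predicate) term strings, or None if no link verb splits the sentence."""
    words = [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]
    for i, word in enumerate(words):
        if word in LINK_VERBS and 0 < i < len(words) - 1:
            return " ".join(words[:i]), " ".join(words[i + 1:])
    return None


def build_topic_tree(text, threshold=0):
    """Map first-level terms (topics) to their second-level terms (comments)."""
    tree = defaultdict(list)
    counts = defaultdict(int)
    for sentence in re.split(r"[.!?]+", text):
        parts = split_sentence(sentence)
        if parts is None:
            continue
        topic, comment = parts
        counts[topic] += 1
        # Claim 13: keep the second term only if it differs from the first term.
        if comment != topic and comment not in tree[topic]:
            tree[topic].append(comment)
    # Claim 16: keep a topic only if its occurrence count is above the threshold.
    return {t: c for t, c in tree.items() if counts[t] > threshold}


def search(tree, keyword):
    """Return each matching first-level term together with its second-level terms."""
    keyword = keyword.lower()
    return {t: c for t, c in tree.items() if keyword in t.split()}


reviews = "The room was clean. The room was very quiet. The staff were friendly."
tree = build_topic_tree(reviews)
print(tree)  # {'room': ['clean', 'quiet'], 'staff': ['friendly']}
```

The resulting dictionary can be rendered as a folder and sub-folder structure or as a topic with its comments, and `search(tree, "room")` returns the matched topic with its second-level terms, which mirrors the search result display format described at the end of claim 10.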

Current Assignee: Linfo IP LLC (Pueblo Nuevo LLC)
Original Assignee: Chizhong Zhang, Guangsheng Zhang
Inventors: Zhang, Chizhong; Zhang, Guangsheng
Primary Examiner(s): Ly, Anh
Application Number: US14/720,789
Time in Patent Office: 527 Days

Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 16/24553   of query operations
G06F 16/24578   using ranking
G06F 16/3334   Selection or weighting of t...
G06F 16/3346   using probabilistic model
G06F 16/355   Class or cluster creation o...
G06F 16/951   Indexing; Web crawling tech...
G06F 40/10   Text processing natural lan...
G06F 40/205   Parsing
G06F 40/30   Semantic analysis
