Identifying entities in semi-structured content
Abstract
Identifying entities in semi-structured content is described. A system assigns a corresponding entity type based on a corresponding entity type score for each token in a sequence of tokens in semi-structured content, based on multiple entity types, wherein each token is a corresponding character set. The system assigns a corresponding boundary type based on a corresponding boundary type score for each token in the sequence of tokens, based on a begin boundary type or a continue boundary type. The system identifies an entity based on a corresponding entity type score and a corresponding boundary type for each token in the sequence of tokens. The system outputs the sequence of tokens as an identified set of entities based on the identified entity.
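The grouping step described in the abstract, where per-token entity types and begin/continue boundary types are combined into entities, can be illustrated with a minimal sketch. The `Token` class, the `group_entities` function, the sample tokens, and the choice to join token text with spaces are all assumptions made for illustration; the source does not specify how tokens are represented or merged.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    entity_type: str   # e.g. "name", "company" (hypothetical labels)
    boundary: str      # "begin" or "continue"


def group_entities(tokens):
    """Merge tokens into (entity_type, text) pairs.

    Assumption: a "continue" token extends the previous entity only when
    the entity types agree; anything else starts a new entity.
    """
    entities = []
    for tok in tokens:
        extends_previous = (
            bool(entities)
            and tok.boundary == "continue"
            and entities[-1][0] == tok.entity_type
        )
        if extends_previous:
            entities[-1][1].append(tok.text)
        else:
            entities.append((tok.entity_type, [tok.text]))
    return [(etype, " ".join(parts)) for etype, parts in entities]


# Hypothetical tokens from an email signature block.
tokens = [
    Token("Jane", "name", "begin"),
    Token("Doe", "name", "continue"),
    Token("Acme", "company", "begin"),
    Token("Corp", "company", "continue"),
    Token("jane@acme.example", "email", "begin"),
]
entities = group_entities(tokens)
```

Here a "begin" token always opens a new entity, so two adjacent entities of the same type stay separate as long as the second one starts with a "begin" boundary.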
17 Claims
1. A system for identifying entities in semi-structured content, the system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
identify a sequence of tokens in the semi-structured content based on assigning an information score to each token;
assign, by a first layer of classifications that is executed by a machine learning model of a database system, an entity type for each token in the sequence of tokens based on an entity score representing a probability that the token corresponds to the entity type, the entity type being one of a plurality of entity types and the entity score being a maximum score for correspondence between the token and any of the plurality of entity types;
assign, by a second layer of classifications that is executed by the machine learning model, a structure score for each token in the sequence of tokens based on the token matching one of a plurality of structure types;
re-assign, by the second layer, the corresponding entity type and corresponding entity score for each token in the sequence of tokens matching one of the structure types;
assign, by a third layer of classifications that is executed by the machine learning model, a boundary type for each token in the sequence of tokens based on a boundary type score, the boundary type being one of a begin boundary type and a continue boundary type;
identify, by a fourth layer of classifications that is executed by the machine learning model, an entity based on:
i) the entity type and the boundary type for each token, and
ii) the structure score for each token; and
output the sequence of tokens as an identified set of entities based on the identified entity.
Dependent claims: 2, 3, 4, 5, 6.
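The first classification layer in claim 1 assigns each token the entity type whose score is the maximum over all candidate entity types. A minimal sketch of that selection follows; the function name `assign_entity_type`, the dictionary representation of scores, and the example probabilities are hypothetical, and the claim does not say how the scores themselves are produced.

```python
def assign_entity_type(scores):
    """Return (entity_type, entity_score) for one token.

    `scores` maps each candidate entity type to the probability that the
    token corresponds to it (assumed to be supplied by some upstream model).
    """
    entity_type = max(scores, key=scores.get)
    return entity_type, scores[entity_type]


# Hypothetical per-type probabilities for the token "Jane".
first_layer_result = assign_entity_type({"name": 0.7, "company": 0.2, "title": 0.1})
```

The returned score is by construction the maximum over all entity types, which matches the claim's description of the entity score.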
7. A computer program product comprising a non-transitory computer-readable medium storing computer-readable program code which when executed by one or more processors, cause the one or more processors to:
identify, by a database system, a sequence of tokens in semi-structured content based on assigning an information score to each token;
assign, by a first layer of classifications that is executed by a machine learning model of the database system, an entity type for each token in the sequence of tokens based on an entity score representing a probability that the token corresponds to the entity type, the entity type being one of a plurality of entity types and the entity score being a maximum score for correspondence between the token and any of the plurality of entity types;
assign, by a second layer of classifications that is executed by the machine learning model, a structure score for each token in the sequence of tokens based on the token matching one of a plurality of structure types;
re-assign, by the second layer, the corresponding entity type and corresponding entity score for each token in the sequence of tokens matching one of the structure types;
assign, by a third layer of classifications that is executed by the machine learning model, a boundary type for each token in the sequence of tokens based on a boundary type score, the boundary type being one of a begin boundary type and a continue boundary type;
identify, by a fourth layer of classifications that is executed by the machine learning model, an entity based on:
i) the entity type and the boundary type for each token, and
ii) the structure score for each token; and
output, by the database system, the sequence of tokens as an identified set of entities based on the identified entity.
Dependent claims: 8, 9, 10, 11, 12.
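The second layer recited in the claims scores each token against a set of structure types and re-assigns the entity type and entity score of tokens that match. The sketch below is one possible reading, not the claimed implementation: the structure types (`email`, `phone`), the regular expressions, the fixed scores of 1.0 and 0.0, and the function name `apply_structure_layer` are all assumptions chosen for illustration.

```python
import re

# Hypothetical structure types; the claims do not enumerate them.
STRUCTURE_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\-\s().]{6,}\d"),
}


def apply_structure_layer(token, entity_type, entity_score):
    """Return (entity_type, entity_score, structure_score) for one token.

    Assumption: a token that fully matches a structure pattern gets a
    structure score of 1.0 and has its entity type and score overwritten;
    any other token keeps its first-layer values with a structure score of 0.0.
    """
    for structure_type, pattern in STRUCTURE_PATTERNS.items():
        if pattern.fullmatch(token):
            return structure_type, 1.0, 1.0
    return entity_type, entity_score, 0.0


matched = apply_structure_layer("jane@acme.example", "name", 0.4)
unmatched = apply_structure_layer("Jane", "name", 0.7)
```

A trained model could replace the regular expressions with learned structure scores; pattern matching is used here only because it keeps the example self-contained.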
13. A computer-implemented method for identifying entities in semi-structured content in an on-demand database system, the method comprising:
identifying, by a database system, a sequence of tokens in the semi-structured content based on assigning an information score to each token;
assigning, by a first layer of classifications that is executed by a machine learning model of the database system, an entity type for each token in the sequence of tokens based on an entity score representing a probability that the token corresponds to the entity type, the entity type being one of a plurality of entity types and the entity score being a maximum score for correspondence between the token and any of the plurality of entity types;
assigning, by a second layer of classifications that is executed by the machine learning model of the database system, a structure score for each token in the sequence of tokens based on the token matching one of a plurality of structure types;
re-assigning, by the second layer, the corresponding entity type and the corresponding entity score for each token in the sequence of tokens matching one of the structure types;
assigning, by a third layer of classifications that is executed by the machine learning model of the database system, a boundary type for each token in the sequence of tokens based on a boundary type score, the boundary type being one of a begin boundary type and a continue boundary type;
identifying, by a fourth layer of classifications that is executed by the machine learning model of the database system, an entity based on:
i) the entity type and the boundary type for each token, and
ii) the structure score for each token; and
outputting, by the database system, the sequence of tokens as an identified set of entities based on the identified entity.
Dependent claims: 14, 15, 16, 17.
Specification