Identification of content in an electronic document
Abstract
In some embodiments, a method includes receiving an electronic document that comprises a plurality of sections. The method includes marking the plurality of sections as a content section or a non-content section using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section. The method also includes storing the marking of the plurality of sections of the electronic document in a machine-readable medium.
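As a rough illustration of the marking step described in the abstract, the sketch below labels each section from the four named attributes. The `Section` fields, the threshold values, and the way the attributes are combined are all assumptions made for this example; the abstract only requires that at least one of the attributes be used and does not give a specific rule.

```python
from dataclasses import dataclass


@dataclass
class Section:
    text: str
    width: int        # rendered width, assumed to be in pixels
    link_words: int   # number of words that sit inside hyperlinks
    font_size: int    # font size of the section text, assumed in points
    is_content: bool = False


def link_density(section: Section) -> float:
    """Fraction of the section's words that are hyperlink text."""
    words = len(section.text.split())
    return section.link_words / words if words else 1.0


def overlaps_title(section: Section, title: str) -> bool:
    """True if the section shares at least one word with the document title."""
    title_words = {w.lower() for w in title.split()}
    section_words = {w.lower() for w in section.text.split()}
    return bool(title_words & section_words)


def mark_sections(sections, title, min_width=400,
                  max_link_density=0.5, min_font_size=12):
    """Mark each section as content or non-content.

    The thresholds and the AND/OR combination are hypothetical choices.
    """
    for s in sections:
        s.is_content = (
            s.width >= min_width
            and link_density(s) <= max_link_density
            and (s.font_size >= min_font_size or overlaps_title(s, title))
        )
    return sections


nav = Section("Home About Contact Login", width=200, link_words=4, font_size=10)
body = Section("Solar panels convert light into electricity for home use",
               width=600, link_words=1, font_size=12)
marked = mark_sections([nav, body], title="How solar panels work")
```

Here a narrow, link-heavy navigation bar ends up marked as non-content, while a wide, mostly link-free text section is marked as content.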
22 Claims
1. A method comprising:
receiving an electronic document that comprises a plurality of sections;
marking the plurality of sections as a content section or a non-content section using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section, and whether a title of the electronic document overlaps with text in the section;
comparing a value of a different attribute of two adjacent sections of the plurality of sections, a first section of the two adjacent sections being marked to include content and a second section of the two adjacent sections being marked not to include content;
changing a mark of the second section from a mark not to include content to a mark to include content in response to the comparison resulting in a determination that the value of the different attribute of the first section is the same as the value of the different attribute of the second section; and
storing the mark of the plurality of sections of the electronic document in a machine-readable medium.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
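The comparing and changing steps of claim 1 can be sketched as a pass over adjacent pairs of sections. This is a minimal sketch under stated assumptions: the `background` field stands in for the "different attribute" (the claim does not say which attribute that is), and the choice to compare against a snapshot of the initial marks, so that a promotion does not cascade further down the page, is a design decision of this example rather than something the claim specifies.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    is_content: bool   # result of the earlier marking step
    background: str    # hypothetical "different attribute" to compare


def promote_matching_neighbors(sections, attribute="background"):
    """Re-mark a non-content section as content when it is adjacent to a
    content section and both have the same value for `attribute`."""
    # Snapshot the initial marks so only sections next to an originally
    # marked content section can be promoted (no cascading).
    original = [s.is_content for s in sections]
    for i in range(len(sections) - 1):
        first, second = sections[i], sections[i + 1]
        if original[i] == original[i + 1]:
            continue  # both content or both non-content: nothing to compare
        if getattr(first, attribute) == getattr(second, attribute):
            first.is_content = True
            second.is_content = True
    return sections


page = [
    Section("nav", False, "gray"),
    Section("article", True, "white"),
    Section("caption", False, "white"),
    Section("footer", False, "gray"),
]
promote_matching_neighbors(page)
marks = {s.name: s.is_content for s in page}
```

In this example the caption shares a background with the adjacent article and is re-marked as content, while the navigation bar and footer keep their non-content marks.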
10. A non-transitory machine-readable medium including instructions which when executed by a machine causes the machine to perform operations comprising:
receiving a web page that comprises a plurality of blocks; and
marking the plurality of blocks as a content block or a non-content block based on a value of an attribute of the blocks, the plurality of blocks including a first block adjacent to a second block, the first block being marked as a content block and the second block being marked as a non-content block;
comparing a value of a different attribute for the first block to a value of the different attribute for the second block;
changing a mark of the second block to a content block in response to a determination that the value of the different attribute of the first block is the same as the value of the different attribute of the second block; and
storing the mark of the plurality of blocks of the web page in a machine-readable medium.
Dependent claims: 11, 12, 13, 14, 15
16. A system comprising:
a hardware implemented identification module to mark a plurality of sections of a electronic document as having content or having non-content using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section, the plurality of sections comprising a first section adjacent to a second section, the first section being marked as having content and the second section being marked as having non-content, the identification module operating to change a mark of the second section to having content based on a comparison of a different attribute of the first section and the different attribute of the second section; and
a search engine to parse the sections having content to index for a subsequent search.
Dependent claims: 17, 18, 19, 20, 21, 22
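To illustrate the search-engine element of claim 16, the sketch below indexes only the sections already marked as having content. The inverted-index structure, the whitespace tokenizer, and the function names `build_index` and `search` are assumptions chosen for brevity; the claim does not describe how the index is built or queried.

```python
from collections import defaultdict


def build_index(documents):
    """Build a word -> set of document ids index from content sections only.

    `documents` maps a document id to a list of (text, is_content) pairs,
    where is_content is the mark produced by the identification step.
    """
    index = defaultdict(set)
    for doc_id, sections in documents.items():
        for text, is_content in sections:
            if not is_content:
                continue  # non-content sections are never indexed
            for word in text.lower().split():
                index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents whose content sections contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


documents = {
    "a": [("Home About Contact", False), ("Solar panels convert light", True)],
    "b": [("Solar news sidebar", False), ("Wind turbines spin", True)],
}
index = build_index(documents)
hits = search(index, "solar")
```

Because document "b" mentions "solar" only in a non-content section, a query for that word matches document "a" alone, which is the practical benefit of indexing only the sections marked as content.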
Specification