IDENTIFICATION OF CONTENT IN AN ELECTRONIC DOCUMENT
First Claim
1. A method comprising:
receiving an electronic document that comprises a plurality of sections; and
marking the plurality of sections as a content section or a non-content section using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section; and
storing the marking of the plurality of sections of the electronic document in a machine-readable medium.
Abstract
In some embodiments, a method includes receiving an electronic document that comprises a plurality of sections. The method includes marking the plurality of sections as a content section or a non-content section using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section. The method also includes storing the marking of the plurality of sections of the electronic document in a machine-readable medium.
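The marking step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Section` fields, every threshold value, and the rule that a section counts as content when at least three of the four attribute checks pass are all hypothetical choices made for the example. The source only says that at least one of the four attributes is used.

```python
import json
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical representation of one section of an electronic document."""
    text: str
    width_px: int        # rendered width of the section
    linked_words: int    # number of words that sit inside hyperlinks
    font_size_pt: float  # dominant font size of the section text


def link_density(section: Section) -> float:
    """Fraction of the section's words that are hyperlinked (1.0 if empty)."""
    words = len(section.text.split())
    return section.linked_words / words if words else 1.0


def title_overlap(title: str, section: Section) -> float:
    """Fraction of the title's words that also appear in the section text."""
    title_words = {w.lower() for w in title.split()}
    if not title_words:
        return 0.0
    section_words = {w.lower() for w in section.text.split()}
    return len(title_words & section_words) / len(title_words)


def mark_sections(title, sections, min_width_px=400, max_link_density=0.3,
                  min_font_pt=10.0, min_title_overlap=0.5):
    """Mark each section as "content" or "non-content".

    Assumption: a section is content when at least 3 of the 4 checks pass.
    All thresholds are illustrative defaults, not values from the source.
    """
    marks = []
    for s in sections:
        votes = sum([
            s.width_px >= min_width_px,
            link_density(s) <= max_link_density,
            s.font_size_pt >= min_font_pt,
            title_overlap(title, s) >= min_title_overlap,
        ])
        marks.append("content" if votes >= 3 else "non-content")
    return marks


def store_marks(marks) -> str:
    """Serialize the markings to JSON, standing in for the storing step."""
    return json.dumps({"marks": marks})
```

In this sketch a narrow, link-heavy navigation bar fails most checks and is marked as non-content, while a wide article body with few links and words shared with the title is marked as content. A real system would likely tune or learn the thresholds rather than fix them by hand.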
25 Claims
1. A method comprising:
receiving an electronic document that comprises a plurality of sections; and
marking the plurality of sections as a content section or a non-content section using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section; and
storing the marking of the plurality of sections of the electronic document in a machine-readable medium.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A machine-readable medium including instructions which, when executed by a machine, cause the machine to perform operations comprising:
receiving a web page that comprises a plurality of blocks; and
marking the plurality of blocks as a content block or a non-content block based on a value of an attribute of the blocks, wherein the plurality of blocks includes a first block adjacent to a second block, wherein the first block is marked as a content block and the second block is marked as a non-content block;
comparing a value of a different attribute for the first block to a value of the different attribute for the second block;
changing the second block to be marked to include content, in response to a determination that the value of the different attribute of the first block is the same as the value of the different attribute of the second block; and
storing the marking of the plurality of blocks of the web page in a machine-readable medium.
Dependent claims: 12, 13, 14, 15, 16
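The adjacent-block correction in claim 11 can be illustrated with a short sketch. The choices here are assumptions made for the example only: block width serves as the first attribute, font family as the "different attribute", the `Block` type and threshold are hypothetical, and promotion is decided against the initial markings so that a newly promoted block does not in turn promote its own neighbor.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical representation of one block of a web page."""
    width_px: int          # first attribute, used for the initial marking
    font_family: str       # "different attribute", used for the comparison
    is_content: bool = False


def mark_blocks(blocks, min_width_px=400):
    """Initial marking from a single attribute (threshold is illustrative)."""
    for block in blocks:
        block.is_content = block.width_px >= min_width_px


def promote_matching_neighbors(blocks):
    """Re-mark a non-content block as content when an adjacent content block
    has the same value of the different attribute.

    The comparison uses a snapshot of the initial markings, so promotions
    do not cascade along a run of blocks (an assumption of this sketch).
    """
    initial = [block.is_content for block in blocks]
    for i, block in enumerate(blocks):
        if initial[i]:
            continue
        for j in (i - 1, i + 1):
            if (0 <= j < len(blocks) and initial[j]
                    and blocks[j].font_family == block.font_family):
                block.is_content = True
                break
```

For example, a narrow block that directly follows a wide article block and shares its font family would be promoted to content, while a sidebar in a different font would stay marked as non-content.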
17. A system comprising:
an identification module to mark a plurality of sections of an electronic document as having content or non-content using an attribute of the sections that includes at least one of a width of the section, a density of the plurality of hyperlinks in the section, a size of a font of text in the section and whether a title of the electronic document overlaps with text in the section; and
a search engine to parse the sections having content to index for a subsequent search.
Dependent claims: 18, 19, 20, 21, 22, 23, 24, 25
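To illustrate how a search engine might index only the sections marked as having content, the following sketch builds a small inverted index. It is an assumption-laden example rather than the system described in the claim: the input format (a list of `(text, is_content)` pairs), the tokenizer, and the all-terms-must-match query rule are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(marked_sections):
    """Build an inverted index from (text, is_content) pairs.

    Sections marked as non-content are skipped, so their words never
    reach the index. Section ids are positions in the input list.
    """
    index = defaultdict(set)
    for section_id, (text, is_content) in enumerate(marked_sections):
        if not is_content:
            continue
        for token in tokenize(text):
            index[token].add(section_id)
    return index


def search(index, query):
    """Return ids of indexed sections that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

With this arrangement, a word that appears only in a navigation bar marked as non-content returns no results, which is the practical benefit of marking sections before indexing.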
Specification