Method and apparatus for building sales tools by mining data from websites
First Claim
1. A method for characterizing a plurality of extensible markup language documents, each of the plurality of extensible markup language documents comprising a link to another extensible markup language document of the plurality of extensible markup language documents, the method comprising:
traversing the plurality of extensible markup language documents by following each link of the plurality of extensible markup language documents;
parsing the plurality of extensible markup language documents to determine a structural hierarchy of the plurality of extensible markup language documents;
associating a task complexity with the plurality of extensible markup language documents based on the structural hierarchy, the associating comprising:
extracting a plurality of blocks of information from the plurality of extensible markup language documents; and
assigning a block of information in the plurality of blocks of information to a category in a plurality of categories, the plurality of categories comprising a task complexity category, the block of information comprising a value indicative of a number of extensible markup language documents associated with the plurality of extensible markup language documents and a value indicative of a number of links associated with the plurality of extensible markup language documents; and
characterizing the plurality of extensible markup language documents based on the task complexity.
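The steps of the claim can be illustrated with a minimal sketch. It is not the patented implementation: the in-memory corpus `DOCS`, the `<link href="...">` element convention, the function names, and the thresholds in `task_complexity` are all assumptions made for illustration. The sketch follows every link, parses each document to measure its nesting depth, builds a block of information holding the document count and the link count, and assigns that block to a task complexity category.

```python
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical in-memory corpus: document name -> XML text.
# Links are assumed to be <link href="..."/> elements.
DOCS = {
    "home.xml": '<page><section><link href="products.xml"/><link href="about.xml"/></section></page>',
    "products.xml": '<page><section><item><link href="home.xml"/></item></section></page>',
    "about.xml": '<page><link href="home.xml"/></page>',
}


def depth(elem):
    """Nesting depth of the tree rooted at elem (a childless root has depth 1)."""
    return 1 + max((depth(child) for child in elem), default=0)


def traverse(docs, start):
    """Breadth-first traversal that follows every link once per document."""
    seen, queue = set(), deque([start])
    n_links, max_depth = 0, 0
    while queue:
        name = queue.popleft()
        if name in seen or name not in docs:
            continue  # skip already visited or unknown documents
        seen.add(name)
        root = ET.fromstring(docs[name])
        max_depth = max(max_depth, depth(root))
        for link in root.iter("link"):
            n_links += 1
            queue.append(link.get("href"))
    # The block of information: document count, link count, hierarchy depth.
    return {"documents": len(seen), "links": n_links, "max_depth": max_depth}


def task_complexity(block, doc_threshold=10, link_threshold=50):
    """Assumed rule: the thresholds are illustrative and not taken from the claims."""
    if block["documents"] > doc_threshold or block["links"] > link_threshold:
        return "high"
    return "low"


block = traverse(DOCS, "home.xml")
category = task_complexity(block)
print(block, category)
```

A real crawler would fetch documents over a network and would need error handling for malformed markup; the dictionary lookup here stands in for that step.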
Abstract
A website mining tool is disclosed that extracts information from, for example, a company's website and presents the extracted information in a graphical user interface (GUI). In one embodiment, web pages from a website are stored in, for example, computer memory and a structure of the web pages is identified. A plurality of blocks of information is then extracted as a function of this structure and a category is assigned to each block of information. The elements in the blocks of information are then displayed, for example to a salesperson, as a function of these categories. In another embodiment, Document Object Modeling parsing is used to identify the structure of the web pages. In yet another embodiment, a support vector machine is used to categorize each block of information.
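The pipeline described in the abstract (identify page structure, extract blocks, assign a category to each block, group for display) can be sketched as follows. This is an assumption-laden illustration only: the sample `PAGE`, the choice of `<div>` as the block boundary, and the `KEYWORDS` table are hypothetical, and a simple keyword-overlap rule is used here in place of the support vector machine that the abstract mentions for one embodiment.

```python
from xml.dom import minidom

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
<div><h2>Products</h2><p>Routers and switches for enterprise networks.</p></div>
<div><h2>Contact</h2><p>Call our sales office or email support.</p></div>
</body></html>"""

# Hypothetical categories and keywords; a stand-in for a trained classifier.
KEYWORDS = {
    "products": {"routers", "switches", "products"},
    "contact": {"call", "email", "contact"},
}


def text_of(node):
    """Concatenate the text found under a DOM node, skipping blank pieces."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    parts = (text_of(child).strip() for child in node.childNodes)
    return " ".join(filter(None, parts))


def extract_blocks(xml_text):
    """Use the DOM structure to split a page into blocks (one per <div>)."""
    dom = minidom.parseString(xml_text)
    return [text_of(div) for div in dom.getElementsByTagName("div")]


def categorize(block):
    """Pick the category whose keywords overlap the block the most."""
    words = {w.strip(".,").lower() for w in block.split()}
    scores = {cat: len(words & kws) for cat, kws in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "uncategorized"


blocks = extract_blocks(PAGE)
by_category = {}
for b in blocks:
    by_category.setdefault(categorize(b), []).append(b)

# Display the blocks grouped by category, as a GUI might.
for cat, items in sorted(by_category.items()):
    print(cat, "->", items)
```

Real web pages are often not well-formed XML, so a tolerant HTML parser would be needed in practice; `minidom` is used here only because the sample input is well-formed.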
14 Claims
1. A method for characterizing a plurality of extensible markup language documents, each of the plurality of extensible markup language documents comprising a link to another extensible markup language document of the plurality of extensible markup language documents, the method comprising:
traversing the plurality of extensible markup language documents by following each link of the plurality of extensible markup language documents;
parsing the plurality of extensible markup language documents to determine a structural hierarchy of the plurality of extensible markup language documents;
associating a task complexity with the plurality of extensible markup language documents based on the structural hierarchy, the associating comprising:
extracting a plurality of blocks of information from the plurality of extensible markup language documents; and
assigning a block of information in the plurality of blocks of information to a category in a plurality of categories, the plurality of categories comprising a task complexity category, the block of information comprising a value indicative of a number of extensible markup language documents associated with the plurality of extensible markup language documents and a value indicative of a number of links associated with the plurality of extensible markup language documents; and
characterizing the plurality of extensible markup language documents based on the task complexity.
Dependent claims: 2, 3, 4, 5, 6
7. A system for characterizing a plurality of extensible markup language documents, each of the plurality of extensible markup language documents comprising a link to another extensible markup language document of the plurality of extensible markup language documents, the system comprising:
means for traversing the plurality of extensible markup language documents by following each link of the plurality of extensible markup language documents;
means for parsing the plurality of extensible markup language documents to determine a structural hierarchy of the plurality of extensible markup language documents;
means for associating a task complexity with the plurality of extensible markup language documents based on the structural hierarchy, the means for associating comprising:
means for extracting a plurality of blocks of information from the plurality of extensible markup language documents; and
means for assigning a block of information in the plurality of blocks of information to a category in a plurality of categories, the plurality of categories comprising a task complexity category, the block of information comprising a value indicative of a number of extensible markup language documents associated with the plurality of extensible markup language documents and a value indicative of a number of links associated with the plurality of extensible markup language documents; and
means for characterizing the plurality of extensible markup language documents based on the task complexity.
Dependent claims: 8, 9, 10
11. A non-transitory computer readable medium storing computer program instructions for characterizing a plurality of extensible markup language documents, each of the plurality of extensible markup language documents comprising a link to another extensible markup language document of the plurality of extensible markup language documents, the computer program instructions, which when executed on a processor, cause the processor to perform operations comprising:
traversing the plurality of extensible markup language documents by following each link of the plurality of extensible markup language documents;
parsing the plurality of extensible markup language documents to determine a structural hierarchy of the plurality of extensible markup language documents;
associating a task complexity with the plurality of extensible markup language documents based on the structural hierarchy, the associating comprising:
extracting a plurality of blocks of information from the plurality of extensible markup language documents; and
assigning a block of information in the plurality of blocks of information to a category in a plurality of categories, the plurality of categories comprising a task complexity category, the block of information comprising a value indicative of a number of extensible markup language documents associated with the plurality of extensible markup language documents and a value indicative of a number of links associated with the plurality of extensible markup language documents; and
characterizing the plurality of extensible markup language documents based on the task complexity.
Dependent claims: 12, 13, 14
Specification