Product pages at online stores contain product records, which are inserted into the pages using a template. The records are stored in a database or index that is accessed when a user navigates to the product page. The web site is essentially an online database whose product records are encoded in HTML, XML, microdata, RDL, … Semantic elements in product pages generated from the same site template sit at the same relative position in the HTML structure. Extracting data from store web pages therefore requires knowing the HTML path (XPath) and the semantic type of each extraction template element.
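As a minimal sketch of this idea, an extraction template can be modeled as a mapping from semantic type to a path in the page structure. The template contents, the sample page, and the function name `extract_record` below are illustrative assumptions, not the DRS implementation. The sketch also assumes well-formed markup, because Python's standard `xml.etree.ElementTree` supports only a limited XPath subset and cannot parse arbitrary real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction template: semantic type -> path relative to <html>.
TEMPLATE = {
    "name": "./body/div[@id='product']/h1",
    "price": "./body/div[@id='product']/span[@class='price']",
    "sku": "./body/div[@id='product']/span[@class='sku']",
}

# Hypothetical, well-formed product page produced by a site template.
PAGE = (
    "<html><body><div id='product'>"
    "<h1>Blue Widget</h1>"
    "<span class='price'>19.99</span>"
    "<span class='sku'>BW-100</span>"
    "</div></body></html>"
)


def extract_record(page_source, template):
    """Return one product record by reading each template path from the page."""
    root = ET.fromstring(page_source)
    record = {}
    for semantic_type, path in template.items():
        node = root.find(path)
        # A missing element yields None rather than raising an error.
        has_text = node is not None and node.text
        record[semantic_type] = node.text.strip() if has_text else None
    return record


record = extract_record(PAGE, TEMPLATE)
print(record)
```

Because every page built from the same site template shares the same structure, the same `TEMPLATE` could in principle be reused across all product pages of that store.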
Extracting data records from product pages requires custom parsers, extraction templates, or deep learning systems. The extraction process is subject to error: the data on the product page may itself contain errors, the wrong data may be extracted from the page, or no data may be extracted at all.
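One way to catch such errors is to check each extracted record before it is accepted. The following is only a sketch under stated assumptions: the field names, the price pattern, and the function `validate_record` are hypothetical and are not described in the source.

```python
import re

# Assumed price format: digits with an optional one- or two-digit fraction.
PRICE_PATTERN = re.compile(r"^\d+(\.\d{1,2})?$")


def validate_record(record, required=("name", "price")):
    """Return a list of problems found in an extracted record."""
    problems = []
    for field in required:
        # Catches data that was not extracted at all.
        if not record.get(field):
            problems.append(f"{field}: missing")
    price = record.get("price")
    # Catches the wrong data being extracted into the price field.
    if price and not PRICE_PATTERN.match(price):
        problems.append("price: not a decimal number")
    return problems


good = validate_record({"name": "Blue Widget", "price": "19.99"})
bad = validate_record({"name": "", "price": "Blue Widget"})
print(good, bad)
```

A check like this cannot detect a value that is plausible but wrong on the source page itself, so it reduces rather than removes the error sources listed above.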
Reverse engineering the product databases behind web pages, while maintaining a high level of accuracy, is necessary in order to create a single product record that faithfully represents the record in each source product page.
The DRS system creates extraction templates for product web pages at online stores. A single extraction system uses the templates to extract product records from web pages.
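The split between per-store templates and a single extraction system might look like the sketch below. The host names, template paths, and the `extract` function are assumptions made for illustration; the source does not describe how DRS stores or selects its templates. The sample pages are well-formed so that the standard library parser can read them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical templates, one per store, keyed by host name.
TEMPLATES = {
    "shop-a.example": {
        "name": "./body/h1",
        "price": "./body/p[@class='price']",
    },
    "shop-b.example": {
        "name": "./body/div/h2",
        "price": "./body/div/span[@id='cost']",
    },
}


def extract(url, page_source):
    """Single extraction routine: pick the store's template, then apply it."""
    host = urlparse(url).hostname
    template = TEMPLATES.get(host)
    if template is None:
        raise KeyError(f"no extraction template for {host}")
    root = ET.fromstring(page_source)
    # findtext returns None when the path matches nothing.
    return {field: root.findtext(path) for field, path in template.items()}


page_a = "<html><body><h1>Blue Widget</h1><p class='price'>19.99</p></body></html>"
page_b = "<html><body><div><h2>Red Gadget</h2><span id='cost'>5.00</span></div></body></html>"

record_a = extract("https://shop-a.example/item/1", page_a)
record_b = extract("https://shop-b.example/p?id=2", page_b)
print(record_a, record_b)
```

With this design, supporting a new store means adding a template entry rather than writing a new parser.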
Data Record Science has patented product data extraction technology.
USPTO patent: Intelligent data search engine US 20080059486 A1