Bad Product Data on the Internet

Product data records are created by brands and manufacturers. The data records contain the brand/manufacturer name, the product name, and some combination of the UPC, the model number, the manufacturer part number, the description, and the specifications.

Stores acquire product data records from brands, manufacturers, and wholesalers via data feeds and, in some cases, printed catalogs, then import them into their internal databases. The stores add store-specific information to the records, such as prices, discounts, rebates, sale prices, availability, shipping time, and shipping costs. Stores also acquire images from brands and manufacturers. Some stores take their own product images, crop the originals, or add watermarks. Stores alter product names and descriptions, and sometimes use the UPC in the SKU. There are many more transformations that stores perform.

Moreover, stores transform key information such as the UPC. A UPC or model number can be inserted into the wrong product record. Key information such as the manufacturer part number may not be shown on the web pages or in the data feeds. A default image may be shown in lieu of the product image. Variant images may show the wrong color. In short, stores transform and alter data.
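One narrow kind of bad UPC can be caught mechanically: a 12-digit UPC-A code carries a check digit, so a mistyped or truncated value often fails the check. The sketch below is only an illustration of that idea, not the method described in this post, and the function names are made up for the example. Note that a passing check digit only says the code is well formed; it does not prove the UPC belongs to the product record it was found in.

```python
def upc_a_check_digit(first_eleven: str) -> int:
    """Compute the UPC-A check digit from the first 11 digits."""
    digits = [int(c) for c in first_eleven]
    odd_sum = sum(digits[0::2])   # 1st, 3rd, 5th, ... digits
    even_sum = sum(digits[1::2])  # 2nd, 4th, 6th, ... digits
    return (10 - (odd_sum * 3 + even_sum) % 10) % 10


def is_well_formed_upc_a(value: str) -> bool:
    """Return True if value is 12 ASCII digits with a correct check digit."""
    # Hypothetical cleanup step: stores sometimes add spaces or dashes.
    cleaned = value.strip().replace("-", "").replace(" ", "")
    if len(cleaned) != 12 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    return upc_a_check_digit(cleaned[:11]) == int(cleaned[11])


if __name__ == "__main__":
    for candidate in ["036000291452", "036000291453", "36000291452"]:
        print(candidate, is_well_formed_upc_a(candidate))
```

A record that fails this check can be flagged so that its UPC is not used as a matching key.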

Once the data is in the database, a product page template and a web server, working with the database, generate a static or dynamic web page for each product. Some values in the page may be loaded by JavaScript. If the page is downloaded by a program that does not execute JavaScript, those values will not appear. As a result, crawling a site or downloading its pages can show different content than what is shown in a browser.
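A crawler can at least notice when this has happened. The following is a minimal sketch, using only the Python standard library, that checks which expected fields are empty or absent in the raw HTML a non-JavaScript downloader would receive. The element ids ("name", "price", "upc") and the sample page are hypothetical; real product pages mark up their fields in many different ways.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class FieldTextParser(HTMLParser):
    """Collect the text inside elements whose id is one of field_ids."""

    def __init__(self, field_ids):
        super().__init__()
        self.text = {fid: "" for fid in field_ids}
        self._current = None  # id of the field element we are inside, if any
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.text:
            self._current = element_id
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.text[self._current] += data


def fields_missing_from_static_html(html, field_ids):
    """Return the field ids that have no text in the raw, unrendered HTML."""
    parser = FieldTextParser(field_ids)
    parser.feed(html)
    parser.close()
    return [fid for fid in field_ids if not parser.text[fid].strip()]


if __name__ == "__main__":
    page = ('<div><h1 id="name">Widget</h1><span id="price"></span>'
            '<script>document.getElementById("price").textContent = "9.99";'
            '</script></div>')
    print(fields_missing_from_static_html(page, ["name", "price", "upc"]))
```

Fields reported as missing are candidates for values that only appear after JavaScript runs, so the page may need to be rendered in a browser before its data is trusted.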

Data downloaded from stores is incomplete, transformed, altered, and contains mistakes. Matching the same product, including its variants, across different stores is a difficult task. Many startups with tens of millions in funding, as well as large companies, have not been able to crack the problems of recognizing bad data in product records, handling mall store (sold by) information, and comprehensively matching product records, including variants. Recognizing bad data in product records is of paramount importance when matching product records.

When a product record contains the wrong information, the product will be matched incorrectly unless the bad data is recognized. If the brand/manufacturer information has been altered at a store, it becomes difficult to match the record. If the image has been replaced with an image taken by the store, matching becomes difficult. These are just a few of the issues that a matching system has to handle correctly.

Overcoming all of the obstacles to product matching requires a new approach. Matching must be exact. Matching must only use known good data. Relying on probabilities and statistics to determine if data is good works to a certain degree. However, more advanced methods are required.
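To make "exact" and "known good data" concrete, here is a deliberately simple sketch. It is not the approach this post describes, which calls for more advanced methods; it only illustrates the principle. It assumes a hypothetical upstream step has already listed each record's untrusted fields in a "bad_fields" entry, and it groups records by exact UPC only when the UPC is trusted. Records without a trusted key are left unmatched instead of being guessed at.

```python
from collections import defaultdict


def match_on_trusted_upc(records):
    """Group records by exact UPC, using only UPCs not flagged as bad.

    Each record is a dict. The "bad_fields" key is a hypothetical
    convention for this sketch: a collection of field names that an
    earlier bad-data check decided not to trust.
    """
    groups = defaultdict(list)
    unmatched = []
    for record in records:
        upc = record.get("upc")
        if upc and "upc" not in record.get("bad_fields", ()):
            groups[upc].append(record)
        else:
            unmatched.append(record)

    matched = {}
    for upc, group in groups.items():
        if len(group) > 1:
            matched[upc] = group
        else:
            # A trusted UPC seen only once has nothing to match against.
            unmatched.extend(group)
    return matched, unmatched


if __name__ == "__main__":
    records = [
        {"store": "A", "upc": "036000291452", "name": "Widget"},
        {"store": "B", "upc": "036000291452", "name": "WIDGET (Blue)"},
        {"store": "C", "upc": "036000291452", "name": "Gadget",
         "bad_fields": ["upc"]},
        {"store": "D", "upc": "", "name": "Widget"},
    ]
    matched, unmatched = match_on_trusted_upc(records)
    print(sorted(r["store"] for r in matched["036000291452"]))
    print(sorted(r["store"] for r in unmatched))
```

In this example the record from store C shares the UPC but is excluded because its UPC was flagged as bad, which is the behavior the paragraph above asks for.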

Data Record Science has created a product data processing pipeline that identifies and reports on bad data and matches product records, including variants, and produces a matched record database and matched record reports. is coming soon.
