Data collection practices are receiving more attention and growing more sophisticated. Web scraping, and automated acquisition in general, has changed the nature of data collection so much that old challenges have been solved and new problems have emerged.
One of them is selecting data with regard to its dynamicity. Since we can now collect previously unthinkable volumes of information in mere seconds, getting a particular sample is no longer an issue. Additionally, in business, we often scour the same sources over and over to monitor competition, brands, and anything else that’s relevant to the industry.
Data dynamicity is, as such, a question of optimization. Refreshing all data on every visit is unnecessary when certain fields are rarely updated or when their changes have no bearing on the use case.
Static vs dynamic data
Static data can be defined in two ways: as an object of information and relative to its intended use. As an object of information, it’s data that doesn’t change, or does so only rarely. Examples include editorial articles, country or city names, and descriptions of events and locations. A factual news report, once published, is unlikely to be changed.
Dynamic data, on the other hand, is something that is constantly in flux, often due to external factors. Frequently encountered types of dynamic data might be product pricing, stock numbers, reservation counts, etc.
Somewhere in the middle lies the twilight zone of both definitions, as is the case when you try to put everything into neat little boxes. There are objects of information such as product descriptions, meta titles of articles, and commercial pieces of content that change with some frequency.
Whether these fall under static or dynamic data depends on the intended use. Regardless of the type of data, projects will have more or less use for specific sources of information. SEO tools, for example, might find less value in pricing data, but will want to refresh meta titles, descriptions, and many other features.
Pricing models, on the other hand, will scarcely have use for frequently updated product descriptions. They might need to grab a description once for product-matching purposes. If it gets updated for SEO purposes down the line, there’s still no reason to revisit it.
Mapping out your data
Every data analysis and collection project has its own requirements. Going back to the pricing model example, two technical features are necessary: product matching and pricing data.
Products need to be matched because any automated pricing implementation needs accuracy. Mismatching products and then changing prices could cause enormous damage to revenue, especially if the changes go unaddressed.
Most of the matching happens through product titles, descriptions, and specifications. The first two will change often, especially on ecommerce platforms, where optimizing for keywords is an important ranking factor. Those changes, however, have no impact on the ability to match product identities, as the fundamental features do not change (e.g., an iPhone will always remain an iPhone).
As such, descriptions and titles can be treated as static data, even if they are somewhat dynamic. For the project’s purposes, the changes are not impactful enough to warrant continued monitoring.
Pricing data, as may already be obvious, is not only constantly in flux by nature; catching changes as they happen is also essential to the project. As such, it would certainly be considered dynamic data.
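The mapping described above can be written down as a small lookup table. The following is only a minimal sketch: the field names, the two project maps, and the `fields_to_refresh` helper are all assumptions made for illustration, not part of any particular tool.

```python
from enum import Enum


class Volatility(Enum):
    STATIC = "static"    # collect once, then leave alone
    DYNAMIC = "dynamic"  # refresh on every scheduled visit


# Hypothetical map for a pricing project: titles and descriptions do
# change, but the changes do not matter here, so they count as static.
PRICING_FIELD_MAP = {
    "title": Volatility.STATIC,
    "description": Volatility.STATIC,
    "specifications": Volatility.STATIC,
    "price": Volatility.DYNAMIC,
}

# Hypothetical map for an SEO project: the same kind of fields, but
# classified the other way around because the use case differs.
SEO_FIELD_MAP = {
    "meta_title": Volatility.DYNAMIC,
    "description": Volatility.DYNAMIC,
    "price": Volatility.STATIC,
}


def fields_to_refresh(field_map):
    """Return the names of the fields that need regular refreshing."""
    return sorted(
        name for name, kind in field_map.items()
        if kind is Volatility.DYNAMIC
    )


print(fields_to_refresh(PRICING_FIELD_MAP))
print(fields_to_refresh(SEO_FIELD_MAP))
```

Keeping the map per project, rather than per source, reflects the point that the same field can be static for one use case and dynamic for another.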
Reducing costs with mapping
Regardless of the integration method, whether internal or external, data collection and storage are costly. Additionally, most companies use cloud-based storage solutions, which can count every write toward the overall cost, meaning that refreshing data cuts into the budget.
Mapping out data types (i.e., static or dynamic) can optimize data collection through several avenues. First, pages can be categorized as static, dynamic, or mixed. While the first category might be relatively small, it would still indicate that there’s no need to revisit those pages frequently, if at all.
Mapping mixed pages might also make it easier to reduce write and storage costs. Reducing the amount of data transferred from one place to another is, by itself, a form of optimization, but it becomes more relevant once bandwidth, read/write, and storage costs are taken into account.
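As a rough illustration of the page categories, the sketch below labels a page by the fields a project takes from it. The function name, the field names, and the revisit intervals are hypothetical; real intervals would depend on the project.

```python
def classify_page(page_fields, dynamic_fields):
    """Label a page 'static', 'dynamic', or 'mixed'.

    page_fields: names of the fields the project extracts from the page.
    dynamic_fields: names the project has mapped as dynamic.
    """
    fields = set(page_fields)
    if not fields:
        raise ValueError("a page needs at least one field of interest")
    dynamic_on_page = fields & set(dynamic_fields)
    if not dynamic_on_page:
        return "static"
    if dynamic_on_page == fields:
        return "dynamic"
    return "mixed"


# Illustrative revisit schedule (assumed values). None means the page
# is collected once and not put back on the schedule.
REVISIT_INTERVAL_HOURS = {
    "static": None,
    "dynamic": 1,
    "mixed": 1,
}

category = classify_page({"title", "description", "price"}, {"price"})
print(category, REVISIT_INTERVAL_HOURS[category])
```

Mixed pages are revisited as often as dynamic ones in this sketch; the saving on them comes from writing only the dynamic fields.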
Since, however, scrapers usually download the entire HTML, any visit to a URL will have the entire object stored in memory. With external providers, costs are usually charged per request, so there’s no difference in collection cost between updating all data fields or only the dynamic ones.
Yet, in some applications, historical data might be necessary. Downloading and rewriting the same field with the same data at every interval would run up write and storage costs without good reason. A simple comparison function can check whether anything has changed and perform a write only if it has.
Finally, with internal scraping pipelines, all of the above still applies, only to a much greater degree. Costs can be optimized by reducing unnecessary scrapes, limiting the number of writes, and parsing only the necessary parts of the HTML.
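One way such a comparison could look is sketched below. It assumes records are JSON-serializable dictionaries and uses an in-memory list as a stand-in for the real database; `ChangeAwareStore` and `save_if_changed` are hypothetical names.

```python
import hashlib
import json


def fingerprint(record):
    """Return a stable hash of a JSON-serializable record."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ChangeAwareStore:
    """Appends to history only when a record differs from the last one."""

    def __init__(self):
        self._last_fingerprint = {}  # key -> hash of the last written record
        self.history = []            # stand-in for the real (billed) storage

    def save_if_changed(self, key, record):
        digest = fingerprint(record)
        if self._last_fingerprint.get(key) == digest:
            return False  # nothing changed, so skip the write
        self._last_fingerprint[key] = digest
        self.history.append((key, dict(record)))
        return True


store = ChangeAwareStore()
store.save_if_changed("sku-1", {"price": 199.0})  # first sighting: written
store.save_if_changed("sku-1", {"price": 199.0})  # unchanged: skipped
store.save_if_changed("sku-1", {"price": 189.0})  # changed: written
print(len(store.history))
```

Comparing hashes rather than full records keeps the lookup side small, since only one short digest per key has to be held or fetched before deciding whether to write.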
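Parsing only the necessary parts could, for instance, mean extracting just the fields mapped as dynamic and ignoring the rest of the document. The sketch below uses Python's standard `html.parser` module; the markup, the class names, and the `FieldExtractor` class are assumptions, and it further assumes that each wanted element contains plain text with no nested tags.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text of elements whose class is in wanted_classes."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.values = {}
        self._current = None  # class name currently being captured

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            # Only the first element seen for each wanted class is kept.
            if name in self.wanted and name not in self.values:
                self._current = name
                self.values[name] = ""
                break

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: capture stops at the first closing tag.
        if self._current is not None:
            self.values[self._current] = self.values[self._current].strip()
            self._current = None


page = (
    '<html><body><h1 class="title">Phone</h1>'
    '<p class="description">A long description.</p>'
    '<span class="price"> 199.00 </span></body></html>'
)
extractor = FieldExtractor({"price"})
extractor.feed(page)
print(extractor.values)
```

The title and description are still downloaded with the page, but they are never extracted, compared, or written.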
In the end, developing frameworks is the first step towards true optimization. They may start out overly theoretical, as this one may be, but frameworks give us a lens for interpreting processes that are already in place.