In the past decade, more and more companies have launched big data programs to enhance their business intelligence capabilities, often with a focus on customer behavior analysis. Having access to a large amount of customer behavior data can allow businesses to infer relationships, perform predictions, and better inform product and marketing decisions.

For organizations with a large existing user base, setting up a big data system involves many challenges. One is collecting the data itself: existing customer devices may be old and lack the remote update capabilities, communication bandwidth, or processing power needed to support new data collection processes.

Cardinal Peak worked with a large cable television company to determine whether useful data could be collected from existing set-top box installations via existing communication channels. This approach was considered as an alternative to developing dedicated big data support on those devices, which would involve a large effort spread out over many months.

In our case, we had access to existing system logs from the set-top boxes, which were periodically sent over the network to the data center. These logs were never intended to track user behavior. They contained a blend of remote control key presses, channel changes, user interface screen changes, and other system events, logged with varying levels of consistency. Our challenge was to take this anonymized data and extract useful customer behavioral information from it.

We used Python to develop a data processing pipeline that acted on aggregated log streams and reduced them to multiple streams of events split by behavior category: channel changes, screens visited, navigation paths through the user interface, session durations, keypresses, etc. We were able to calculate statistical distributions for metrics such as time spent watching each TV channel, time spent in each user interface screen, frequency of customers watching their video-on-demand search results, and numerous others.
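As a rough illustration of this kind of reduction, the sketch below splits a mixed log stream into per-category event streams and then derives time spent per channel. It is not the pipeline used on the project. The log line format, the event type names, the `CATEGORY_BY_EVENT` mapping, and the sample lines are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical log line format (an assumption, not the real set-top box format):
#   "<epoch_seconds> <device_id> <EVENT_TYPE> <detail>"
LINE_RE = re.compile(r"^(\d+) (\S+) (\S+) (.*)$")

# Hypothetical mapping from raw event types to behavior categories.
CATEGORY_BY_EVENT = {
    "CHANNEL": "channel_changes",
    "SCREEN": "screens_visited",
    "KEY": "keypresses",
}


def split_by_category(lines):
    """Group parsed events into one list per behavior category."""
    streams = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        timestamp, device_id, event_type, detail = match.groups()
        category = CATEGORY_BY_EVENT.get(event_type)
        if category is None:
            continue  # drop system events unrelated to user behavior
        streams[category].append((int(timestamp), device_id, detail))
    return dict(streams)


def channel_watch_times(channel_events):
    """Return seconds spent on each channel, summed over all devices.

    Dwell time is the gap between consecutive channel changes on the same
    device. The last channel seen on each device has no known end time, so
    it is ignored.
    """
    by_device = defaultdict(list)
    for timestamp, device_id, channel in channel_events:
        by_device[device_id].append((timestamp, channel))

    totals = defaultdict(int)
    for events in by_device.values():
        events.sort()  # order each device's events by timestamp
        for (start, channel), (end, _next_channel) in zip(events, events[1:]):
            totals[channel] += end - start
    return dict(totals)


# Made-up sample input for illustration only.
SAMPLE_LOG = [
    "1000 box1 CHANNEL 5",
    "1010 box1 KEY GUIDE",
    "1012 box1 SCREEN guide",
    "1300 box1 CHANNEL 7",
    "1900 box1 CHANNEL 5",
    "1100 box2 CHANNEL 7",
    "1400 box2 CHANNEL 9",
    "garbled line",
    "1500 box2 THERMAL fan=on",
]

if __name__ == "__main__":
    streams = split_by_category(SAMPLE_LOG)
    print(channel_watch_times(streams["channel_changes"]))
```

Keeping the categorization step separate from the metric step means each behavior category can get its own downstream calculation without re-parsing the raw logs. Real logs with inconsistent formatting would likely need more than one pattern and more careful handling of missing events.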

Once the data was collected, Cardinal Peak performed statistical analysis to determine confidence intervals and statistical significance for the results. Simply put, these measure how closely statistics calculated from a sample of set-top boxes are likely to match those of the underlying population of all set-top boxes, and how likely it is that a particular result could have occurred by chance alone.
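To make these two ideas concrete, here is a minimal sketch using only the Python standard library. It computes a normal-approximation confidence interval for a mean and a two-sided p-value for the difference between two sample means. The article does not say which statistical methods were used, so the choice of a z-based approach is an assumption, and the numbers in `GROUP_A` and `GROUP_B` are made up for illustration. For samples this small, a t-distribution would be the more careful choice.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(samples, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    center = mean(samples)
    standard_error = stdev(samples) / math.sqrt(len(samples))
    # z is the standard normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (z-test, unequal variances)."""
    standard_error = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / standard_error
    # Probability of a difference at least this large if the true means were equal.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up session durations in minutes for two hypothetical groups of boxes.
GROUP_A = [30, 32, 35, 31, 33, 34, 29, 36]
GROUP_B = [33, 35, 38, 34, 36, 37, 32, 39]

if __name__ == "__main__":
    low, high = mean_confidence_interval(GROUP_A)
    print(f"95% CI for group A mean: ({low:.2f}, {high:.2f})")
    print(f"p-value for A vs. B: {two_sample_p_value(GROUP_A, GROUP_B):.4f}")
```

A narrow interval suggests the sample mean is a reliable estimate of the population value, and a small p-value suggests the observed difference between groups is unlikely to be due to chance alone.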