Segmenting Demographics Using Telecommunication Behavior
Our goal for this analysis is to find the distinct telecommunication patterns in and around the city of Milan via k-means clustering. Having these patterns gives us the ability to segment communication tendencies by block characteristics. K-means clustering partitions the observations into a chosen number of groups, or clusters, so that each observation belongs to the cluster whose mean (centroid) is closest to it.
Establishing clusters allows us to identify similar patterns in telecommunication activity across geographical checkpoints, referred to as blocks. Each block is a geographical area of Milan, defined by a simple grid laid over the city. Most geospatial data can be handled the same way: divide the area into subregions and aggregate behavior by region.
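To make the idea concrete, here is a minimal k-means sketch in plain Python. It is not the R script used for this analysis. The toy points, the function names, and the evenly spaced starting centroids are all assumptions made for illustration; real implementations usually start from random or k-means++ centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    # Assumption: take starting centroids evenly spaced through the input list
    # so the example is deterministic (requires k <= len(points)).
    step = len(points) // k
    centroids = [tuple(points[i * step]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical (talk, text) volumes for six blocks forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters, k, has to be supplied up front, which is why choosing it becomes its own question later in the analysis.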
The R script for this analysis can be found on my GitHub.
The jumping-off point for this analysis came from here. The original code has been tweaked, and a substantial amount of new code has been added.
The Milan Dataset
The data were sanitized and made public for open use. The blocks, and subjects of our analysis, are large geographic squares created by a grid overlaid on the city of Milan, Italy. Each observation is one instance of telecommunication via call, text, or internet. The original file contains the following columns:
$ square_id : ID of the square or "block" on the Milano Grid
$ time_interval : Start time of communication interval (milliseconds)
$ country_code : Phone country code of inbound or outbound communication
$ sms_in_activity : SMS received inside the specified square, sent from the nation of the country code
$ sms_out_activity : SMS sent from inside the specified square to the nation identified by the country code
$ call_in_activity : Call received inside the specified square, during the time interval, from the nation of the country code
$ call_out_activity : Call issued inside the specified square, during the time interval, to the nation identified by the country code
$ internet_traffic_activity : Internet traffic performed inside the specified square, during the time interval, by the nation of the user(s) performing the connection, identified by the country code ☨☨
☨☨ A new CDR (call detail record) is generated at the start and end of an internet connection, OR 15 minutes after the last generated CDR, OR 5 MB after the last generated CDR.
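Because each row is split by country code and time interval, a natural first step is to roll the rows up to one total per block and hour. The sketch below shows one way to do that in plain Python. It is not the R script for this analysis: the sample rows are made up, the timestamps are toy values, and the hour is taken in UTC, so real data would need an offset to Milan local time.

```python
from collections import defaultdict

HOUR_MS = 3_600_000  # milliseconds in one hour

# Hypothetical rows in the column order listed above:
# (square_id, time_interval, country_code, sms_in, sms_out, call_in, call_out, internet)
rows = [
    (1, 9 * HOUR_MS, 39, 1.0, 2.0, 3.0, 4.0, 10.0),
    (1, 9 * HOUR_MS + 600_000, 44, 0.5, 0.5, 1.0, 0.0, 5.0),
    (1, 10 * HOUR_MS, 39, 2.0, 0.0, 0.0, 1.0, 0.0),
    (2, 9 * HOUR_MS, 39, 0.0, 1.0, 2.0, 2.0, 7.0),
]


def aggregate_by_square_and_hour(rows):
    """Sum activity per (square_id, hour), collapsing country codes and intervals."""
    totals = defaultdict(lambda: {"sms": 0.0, "call": 0.0, "internet": 0.0})
    for square_id, t_ms, _country, sms_in, sms_out, call_in, call_out, internet in rows:
        hour = (t_ms // HOUR_MS) % 24  # hour of day in UTC (assumption)
        bucket = totals[(square_id, hour)]
        bucket["sms"] += sms_in + sms_out
        bucket["call"] += call_in + call_out
        bucket["internet"] += internet
    return dict(totals)


hourly = aggregate_by_square_and_hour(rows)
print(hourly[(1, 9)])  # {'sms': 4.0, 'call': 8.0, 'internet': 15.0}
```

Inbound and outbound activity are added together here to get one talk figure and one text figure per block and hour; keeping them separate is an equally reasonable choice.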
This dataset is certainly nuanced. Most datasets, even those built for the same kind of customer segmentation, are unique and have their own challenges. Teams with a comparative advantage in their analytical approach cut the time it takes to identify and test supporting assumptions. A focus on results helps break down the roadblocks that arise from big data.
The variety of data issues is why a robust data exploration routine is key to successful completion of segmentation projects.
Initial Data Exploration
The initial exploration of the block-level data shows similar trends in activity and in the times at which activity occurs. Below is a graphic of four different blocks showing that activity is consistent during business hours and tapers off until the end of the day. The spike at midnight may have several explanations, but given the open data source and the level of sanitization, it's unlikely the true cause can be found. The spike is consistent across blocks, so the anomaly wasn't explored further; the methods used minimize the bias it introduces into the outcome.
Separating the volume of communication by channel and by time of day helped distinguish call trends from SMS trends.
Based on the analysis, it was decided that two observations are similar if they have the same volume pattern of total telecommunication throughout the day and a similar ratio of talk to text communication.
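One way to encode that definition of similarity is to turn each block into a feature vector: the share of the day's volume that falls in each hour, plus a single talk-versus-text number. The sketch below is an illustration in plain Python, not the code behind this analysis. The function name and numbers are hypothetical, only three time slots are used instead of 24 hours to keep it short, and the talk share talk / (talk + text) stands in for the raw ratio so the value stays between 0 and 1.

```python
def block_features(hourly_calls, hourly_sms):
    """Build [share of volume per time slot..., talk share] for one block."""
    totals = [calls + sms for calls, sms in zip(hourly_calls, hourly_sms)]
    day_total = sum(totals)
    if day_total == 0:
        # A block with no activity has no pattern; return all zeros.
        return [0.0] * (len(totals) + 1)
    # Dividing by the day total keeps the shape of the day and drops the scale.
    shape = [t / day_total for t in totals]
    talk_share = sum(hourly_calls) / day_total
    return shape + [talk_share]


# Hypothetical blocks: B is ten times busier than A but follows the same pattern;
# C has the same daily shape but leans more heavily toward calls.
block_a = block_features([2, 4, 2], [2, 4, 2])
block_b = block_features([20, 40, 20], [20, 40, 20])
block_c = block_features([6, 12, 6], [2, 4, 2])

print(block_a)             # [0.25, 0.5, 0.25, 0.5]
print(block_a == block_b)  # True
print(block_c[-1])         # 0.75
```

With features like these, a quiet block and a busy block that behave the same way land on the same point, while a call-heavy block is pulled away from them along the last coordinate.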
Below is a graph that compares talk and text activity for one hour (11:00 pm) for the first 500 square IDs, colored by cluster.
Why this graph?
This graph was one of the clearest representations of the different clusters. It raises the question of why four clusters were chosen: looking at this graph, one could argue for segmenting the data in a couple of different ways.
*Four clusters is the “correct” number of clusters, mathematically, but is that the best solution for the kind of analysis we’re trying to do here?*
If you look at the above graphs, each point is a square ID and they are colored based on the generated k-means clustering. Would you argue there needs to be a fifth cluster to accommodate more of the “outliers”? Maybe only three clusters are necessary.
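A common way to put numbers behind this question is to compare the within-cluster sum of squares (WCSS) for different cluster counts and look for the point where adding a cluster stops helping much, often called the elbow. Whether that criterion is what produced four clusters here is an assumption. The sketch below uses made-up points and hand-written labelings in place of real k-means output.

```python
def wcss(points, labels):
    """Within-cluster sum of squares: squared distance of each point to its cluster mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        for point in members:
            total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Hypothetical (talk, text) volumes for four blocks.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Hand-written labelings standing in for k-means results at k = 1, 2, 3.
labelings = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
}

scores = {k: wcss(points, labels) for k, labels in labelings.items()}
print(scores)  # {1: 104.0, 2: 4.0, 3: 2.0}
```

In this toy case the drop from one cluster to two is large and the drop from two to three is small, so two would be the elbow. WCSS always falls as k grows, which is why the final choice still depends on what the segments will be used for.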
Not every observation is the same, so how many communication types do we have?
The overall pattern of communication throughout the day is similar across the grid, which makes sense. The interesting differences lie in the ratios between communication media, i.e. talk and SMS. Is the ratio an acceptable degree of separation between observations? What would be a better way to categorize communication types?
What would the next steps be for this analysis?
Integrating more of the available data, so the analysis can branch out from a purely volume-based approach.