Looking around for some data to play with, I stumbled-upon a small dataset that hopper is using as a teaser puzzle in their job posting for a data scientist. Taking a quick look in R…

options(scipen=999)
lapply(c("plyr","dplyr","ggplot2","scales","Hmisc","YaleToolkit","gpairs","RColorBrewer"),require,character.only=T,quietly=T)

system("cat ~/Downloads/puzzle.csv | (head -n 5 ; tail -n 5)")
0.3971497,2.1136286
0.3971497,2.1136286
0.3971497,2.1136286
0.3971497,2.1136286
0.3971497,2.1136286
0.6589491,-1.5535368
0.8439924,0.6126211
0.7515614,1.2998793
0.9068190,0.0774441
1.0608076,-2.8246059

hop <- read.csv("~/Downloads/puzzle.csv",header = F)

str(hop)
'data.frame':   1024 obs. of  2 variables:
 $ V1: num  0.397 0.397 0.397 0.397 0.397 ...
 $ V2: num  2.11 2.11 2.11 2.11 2.11 ...

sum(is.finite(hop$v1))
[1] 0

sum(is.finite(hop$v2))
[1] 0

summary(hop)
       V1                V2
 Min.   :-0.9251   Min.   :-2.8379
 1st Qu.: 0.3930   1st Qu.:-1.5649
 Median : 0.5914   Median :-1.1465
 Mean   : 0.5057   Mean   :-0.3965
 3rd Qu.: 0.7462   3rd Qu.: 0.5701
 Max.   : 1.2213   Max.   : 3.0970

The dataset seems to be clean and straightforward. 1024 obs: Kilobyte? A few plots…

qplot(as.numeric(row.names(hop)),V1,data=hop,xlab = "Index")

scatterplot

qplot(as.numeric(row.names(hop)),V2,data=hop,xlab = "Index")

scatterplot

Patterns here. And weird groups of duplicates in both vars between index 0 and 125.

gpairs(hop)

pairs plots

Looks like a world map!

with(hop,qplot(V2,V1))

pairs plot

Clearly. Values of 1 on the Y axis appear to correspond to 60 degrees N/S latitude. Maybe all values have been divided by 60?

hop$lon <- hop$V2 * 60
hop$lat <- hop$V1 * 60
lapply(c("ggmap","maps","maptools","mapproj"),require,character.only=T,quietly=T)
basemap <- borders("world")
m <- ggplot() + basemap
m + geom_point(aes(x=hop$lon,y=hop$lat))

map plot

Not quite right. Easting goes beyond 180. Another projection? Looking at different options in Snyder’s manual I don’t think so. But I am reminded that formulas for working with projections are typically done in radians…

names(hop) <- c("v1","v2","lon60","lat60")

180/pi
[1] 57.29578

hop$lonrad2deg <- hop$v2 * 57.29578
hop$latrad2deg <- hop$v1 * 57.29578
m + geom_point(aes(x=hop$lonrad2deg,y=hop$latrad2deg))

map plot

Nice! What about the duplicates?

coordsn <- count(hop,lon = lonrad2deg, lat = latrad2deg)

summary(coordsn)
      lon               lat               n
 Min.   :-162.60   Min.   :-53.00   Min.   :  1.000
 1st Qu.: -92.83   1st Qu.: 19.75   1st Qu.:  1.000
 Median : -73.10   Median : 36.08   Median :  1.000
 Mean   : -37.97   Mean   : 29.63   Mean   :  1.119
 3rd Qu.:  13.26   3rd Qu.: 43.46   3rd Qu.:  1.000
 Max.   : 177.44   Max.   : 69.98   Max.   :101.000

unique(coordsn$n)
[1]   1   2 101

dupset <- data.frame(subset(coordsn,n>1))

str(dupset)
'data.frame':   10 obs. of  3 variables:
 $ lon: num  -150 -122.4 -96.2 -86.3 -82.6 ...
 $ lat: num  61.2 37.6 19.1 39.7 38.4 ...
 $ n  : int  2 2 2 2 2 2 2 2 2 101

dupset[dupset$n > 2,]
       lon    lat   n
10 121.102 22.755 101

m + geom_point(aes(x=dupset$lon,y=dupset$lat))

map plot

Weird. Seemingly no connection between locations with duplicates. Coming from hopper, I assume these are airports. I can plot points against OSM or similar basemap to confirm. But what is going on with the one in what looks like Taiwan with 101 of them? Maybe something like search traffic on hopper’s servers at a given moment. China’s population is so large that it’s possible.

Unfortunately, no more time for this right now. A future post…