Aggregating geodata using geohashes

What we want to do

The situation is the following: we have a set of points with GPS coordinates and want to visualize them in the browser. With large datasets (e.g. millions of points), displaying all of them is not an option. Instead we want to aggregate them, so that multiple points near each other are displayed as one single point.

The following example uses MongoDB and a geocoding system called geohash.

What is Geohash

Geohash is a representation of latitude/longitude coordinates as a short alphanumeric string.

An example: The geohash of 57.64911,10.40744 would be u4pruydqqvj.

The geohash has a useful characteristic: points close to each other usually have similar geohashes, meaning their hashes share a common prefix. The longer the shared prefix, the closer the points usually are.
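
To see this in code, here is a quick sketch (assuming the ngeohash npm package, which is not part of our stack and is only used for illustration):

var geohash = require('ngeohash');

// encode(latitude, longitude, precision) returns the geohash string
var a = geohash.encode(57.64911, 10.40744, 11); // 'u4pruydqqvj'
var b = geohash.encode(57.64908, 10.40750, 11); // a point a few meters away

// both hashes share a long common prefix because the points are close
console.log(a, b);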

How we use it

The fact that nearby points usually share a geohash prefix allows us to create a database query which aggregates multiple points into one.
In order to aggregate our data points we will be “adding” a field which contains the geohash shortened to a certain number of characters. This number of characters determines how big an area gets aggregated into one point.
For example, a precision of 5 characters represents an area of about 4.9km x 4.9km, whereas 9 characters cover about 4.8m x 4.8m (according to elastic.co).

  • Let’s say our stored data has the following structure and is stored in a collection called ‘point’:
      {
          '_id' : 'somemongodbobjectID',
          'gps' : [57.64911,10.40744],
          'geohash' : 'u4pruydqqvj'
      }
    
  • In order to aggregate all points we use MongoDB’s aggregation pipeline.
    1. The first step is to add the shortened geohash as a field, which we will simply call shortGeohash. This is done using MongoDB’s $project pipeline stage and the $substr operator, which creates a substring of an existing field. $substr takes 3 arguments: the field to shorten, the start index, and the length of the substring.
       db.point.aggregate([
           { $project : {shortGeohash: {$substr: ["$geohash", 0, 9]}}},
       ])
      
    2. The second step is to group the points which have the same shortened geohash, meaning they are located in the same area of 4.8m x 4.8m (in this example). In order to know which documents were grouped together, we also $push the full documents into an array using the $$ROOT variable, which refers to each document as it enters the $group stage.
       db.point.aggregate([
           { $project : {shortGeohash: {$substr: ["$geohash", 0, 9]}}},
           { $group: {_id: "$shortGeohash", count: {$sum: 1}, originalDoc: {$push: "$$ROOT"}}}
       ])
      
    3. This will result in an array of documents grouped by their shortened geohash. We now have the points aggregated by their location and, for each group, an array of the original documents and their IDs, which allows us to do further work on our aggregated data (see the sketch after this list).
       [{ _id: "u4pruydqq",
           count: 2,
           originalDoc: [{
               "_id": "5579b75416b8101ca37d9ab0",
               "shortGeohash": "u4pruydqq"
           }, {
               "_id": "5579b75416b8101ca37d9ab1",
               "shortGeohash": "u4pruydqq"
           }] 
       }, { _id: "u4pruydqr",
           count: 5,
           originalDoc:
           [{
               "_id": "5579b75416b8101ca37d9ab2",
               "shortGeohash": "u4pruydqr"
           }, {
               "_id": "5579b75416b8101ca37d9ab3",
               "shortGeohash": "u4pruydqr"
           }, {
               "_id": "5579b75416b8101ca37d9ab4",
               "shortGeohash": "u4pruydqr"
           }, {
               "_id": "5579b75416b8101ca37d9ab5",
               "shortGeohash": "u4pruydqr"
           }, {
               "_id": "5579b75416b8101ca37d9ab",
               "shortGeohash": "u4pruydqr" 
           }]
       }]
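
To display each group on the map, we can decode the shortened geohash back into a center coordinate. A minimal sketch, again assuming the ngeohash package (groups is the aggregation result from above):

var geohash = require('ngeohash');

groups.forEach(function(group) {
    // decode returns the center of the geohash cell as {latitude, longitude}
    var center = geohash.decode(group._id);
    console.log('cluster of ' + group.count + ' points at',
                center.latitude, center.longitude);
});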
      

Uploading measured data using Socket.IO

Task

The current task is to upload a .csv file to the web server using Socket.IO. Socket.IO is a realtime framework for web applications, but it can also be used for communication between two NodeJS instances. It provides a communication layer on top of websockets and offers several fallback transports as well.

What we want to implement consists of a client, which reads data from a .csv file and transmits it row by row to the web server, and the server, which receives the data, parses and transforms it into JSON and then saves it in the database.

The Client

The client consists of only a few lines of code. First, a connection to the server is established. Then the file is read and its content is split into rows (by splitting on each newline: ‘\n’). Each row is sent via the websocket to the server. After all rows have been sent, a ‘done’ marker is sent as a simple workaround so the server can tell that the whole file has arrived.

var io = require('socket.io-client'),
    fs = require('fs'),
    _ = require('lodash');

var socket = io.connect('http://localhost:3002');

fs.readFile('test_data/newFormat.csv', function(err, data) {
    if (err) throw err;
    data = data.toString('utf-8');
    data = data.split('\n');
    console.log('Scanned file with ' + data.length + ' rows');
    // send the file row by row
    _.each(data, function(row) {
        socket.emit('upload', row);
    });
    // tell the server that the whole file has been sent
    socket.emit('upload', "####done####");
});

The Server

The server consists of a simple http server, on which the Socket.IO framework listens for incoming connections. For each connection, a new socket is opened, which receives the data row by row. The data is then parsed into JSON and buffered in an array. Only once 1000 elements are in the buffer is its content saved to the database. This is because MongoDB’s maximum bulk insert size is 1000 objects, and these settings resulted in the fastest execution of both receiving and storing the objects.

'use strict';
var app = require('express')();
var server = require('http').Server(app);
var io = require('socket.io')(server);
var _ = require('lodash');
var Promise = require('bluebird');
var mongoose = Promise.promisifyAll(require('mongoose')),
    Schema = mongoose.Schema,
    Measurement = Promise.promisifyAll(require('./model/Measurement'));

// only start listening once the database connection is established
mongoose.connectAsync('mongodb://localhost/roadstar_csv')
.then(function() {
    server.listen(3002);
});

io.on('connection', function (socket) {
    socket.on('upload', receiveData);
});

var buffer = [];
var ops = [];
var firstTime = true;

function receiveData(chunk) {
    var op;

    if (firstTime) {
        console.time('receiving rows');
        console.time('writing to db');
        firstTime = false;
    }

    if (chunk === '####done####') {
        // flush the remaining rows and wait for all inserts to finish
        op = Measurement.collection.insertAsync(buffer);
        ops.push(op);
        buffer = [];

        console.timeEnd('receiving rows');
        Promise.all(ops)
        .then(function() {
            console.timeEnd('writing to db');
            firstTime = true;
        });
    } else {
        try {
            // dataToJSON (defined elsewhere) parses one csv row into a JSON object
            chunk = dataToJSON(chunk);
            if (buffer.length < 1000) {
                buffer.push(chunk);
            } else {
                // buffer is full: write it to the database and start a new one
                op = Measurement.collection.insertAsync(buffer);
                ops.push(op);
                buffer = [chunk];
            }
        } catch (err) {
            console.log(err);
        }
    }
}

Node-osrm – An alternative map-matching solution

The alternative

After testing and trying to patch the graphhopper map-matching framework, I decided to have a second look at alternative map-matching frameworks. After googling for a while and trying various search terms, I found Project-OSRM on GitHub.

The Open Source Routing Machine (OSRM) is a C++ implementation of a high-performance routing engine for shortest paths in road networks. – project-osrm.com

Thankfully this project also comes with a repository containing NodeJS bindings for OSRM. It gives us various wrappers to access the routing functionality. Many examples can be found in the readme on GitHub. The one especially useful for us, as it enables us to do point-by-point map matching, is:

osrm.locate([52.4224,13.333086], function (err, result) {
  console.log(result);
  // Output: {"status":0,"mapped_coordinate":[52.422442,13.332101]}
});

Setup

The setup is described in the project’s readme as well, however there are a few challenges to tackle:

  • Use the current official node version, as there seem to be bugs under iojs
  • You seem to have to run the make command first before being able to create the supporting files.
  • The node module needs various files, such as *.edges, *.nodes, … to work, which are generated after running the make command as shown in the readme. In order to support map-matching for a bigger area, you need to change the default area (Berlin) to the desired area. Be aware that this might take a lot of time for bigger areas, as the whole .pbf file is downloaded from the web (e.g. Germany: ~2.6 GB).
  • There were a few more challenges; I will update this list as I remember 😉

Results

After setting everything up, some test runs with existing data were made. In general, the map-matching worked pretty well and reasonably fast. However, as this does the map-matching point by point, there are a few outliers visible. Some pictures of the results can be seen below. Note that only the pink points are the map-matched ones; all other ones are raw data (e.g. yellow, green, red, …).

[Screenshot: map-matched points (pink) next to the raw data]

Patching Graphhopper

For displaying our collected data on a map, we have to make sure that the GPS data lies right on the road. The individual points need to snap to the streets, so that we can then show the quality information there. This process is called map-matching. We looked into current solutions and stumbled upon Graphhopper, a routing framework with this functionality built right in.
Graphhopper would fit perfectly in our pipeline, but there is a major problem.
After some tests we realized that the map-matching part works flawlessly, but after processing, all added extra data, like our quality measurements, is lost. The timestamps are gone as well. This is a serious problem we need to work around: we basically lose the ability to map the quality data to the street.
Graphhopper is open source, so we contacted the developer and looked into the GitHub repository. It shouldn’t be too hard to patch Graphhopper ourselves. After working through the code, we saw another problem.
The algorithm does not only snap existing GPS waypoints to the road, it also creates new ones. And for these new waypoints there is no quality data available. The timestamps are lost for the same reason.

In the end, we created a more or less working hack that takes the data for one waypoint and spreads it to all new waypoints created from it. That works okay-ish, and we will have to test further how it behaves with large datasets.

The issue with the timestamps is still a work in progress. We will try to interpolate between the waypoints and calculate the new timestamps.
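
A minimal sketch of what this interpolation could look like (written in JavaScript for brevity; the actual patch lives in Graphhopper’s Java code, and all names here are made up):

// Estimate the timestamp of a waypoint the map matcher created between
// two original waypoints, by linear interpolation along the distance.
function interpolateTime(prev, next, created, distance) {
    var total = distance(prev, next);
    if (total === 0) return prev.time;
    var fraction = distance(prev, created) / total;
    return prev.time + fraction * (next.time - prev.time);
}

// hypothetical usage, with times in milliseconds and a haversine
// distance function provided elsewhere:
// created.time = interpolateTime(prev, next, created, haversine);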

All of this will be an ongoing task. We decided that it is more important to finish the gathering of data, its storage and interpretation. We will get in contact with the developer of Graphhopper again and ask for help.

Working with MongoDB and GeoJSON

What is MongoDB

“MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.

A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.”

http://docs.mongodb.org/manual/core/introduction/

What is GeoJSON

GeoJSON is a format for encoding a variety of geographic data structures.

– http://geojson.org

GeoJSON allows storing geographic data as Point, LineString, Polygon and many more geometry types. Each geometry can be enriched with properties and is then called a Feature:

{ 
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
    "properties": {"someproperty": "somevalue"}
}

These Features can in turn be grouped into collections, called FeatureCollections, as shown below.
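
For example, a minimal FeatureCollection wrapping the Feature from above:

{
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"someproperty": "somevalue"}
    }]
}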

Combine best of both worlds

Working with GeoJSON and MongoDB in NodeJS is very simple, because MongoDB’s JSON-like documents allow us to store GeoJSON as is.

A short example using the mongoose ODM for MongoDB would look like the following.
(Attention: this example is shortened and missing some boilerplate code.)

var GeojsonfeatureSchema = new Schema({
    type: {type: String},
    'geometry' : {
        type: {type: String},
        'coordinates' : {
            'type' : [Number],
            'index' : '2dsphere',
            'required' : true
        }
    },
    'properties' : {
        'speed' : Number,
        'measurement' : Number,
        'quality' : String
    }
});

mongoose.model('GeojsonFeature', GeojsonfeatureSchema);
var GeojsonFeature = mongoose.model('GeojsonFeature');

new GeojsonFeature({
    'type' : 'Feature',
    'geometry' : {
        'type' : 'Point',
        'coordinates' : [50.2, 9.7]
    },
    'properties' : {
        'speed' : 10,
        'measurement' : 9.06,
        'quality' : "very bad"
    }
}).save(function(err, doc) {
    //...
});

Create an index and query geo data

After storing data in our database, it is now about time to think about how to get the data out again. Because we are working with geo data, it would be nice to retrieve entries in a “show me all entries near a certain coordinate” way. So let’s find a way to do this.
An index over the coordinates in our document collection was automatically created because we added the property 'index' : '2dsphere' to the schema. Details about these indexes can be found in the MongoDB documentation.

Because MongoDB is awesome, it now lets us query our data in a very intuitive way, with queries like this:

Find all data near a coordinate

Note: GeoJSON defines the first of the coordinates to be longitude!

var query = {
    'geometry.coordinates' : {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [ lng, lat ]
            },
            $maxDistance: distance,
            $minDistance: 0
        }
    }
};
Geojson.find(query, '-__v -_id', function(err, doc) {
    //hooray we've got our documents 
});

Find all data in a given bounding box

var query = {
    'geometry.coordinates': {
        $geoWithin: {
            $box: [
                [ swlng, swlat ],
                [ nelng , nelat ]
            ]
        }
    }
};
Geojson.find(query, '-__v -_id', function(err, doc) {
    //hooray we've got our documents 
});

That’s all about that. No complex calculations, just some simple queries 🙂

Planning a NodeJS backend architecture for the RoadStar project

After evaluating the task of creating a backend for storage of measurement and gps data and displaying it on a map, it is time to think about how to put the pieces together.

What we want to use

  • MongoDB for storage
  • GraphHopper/mapmatching for matching the measured position of the GPS sensor to the nearest street
  • Express/NodeJS API enabling access to the stored data from the outside
  • Leaflet for displaying the data on a map
  • GPS and gyro sensors on a Raspberry Pi

MongoDB allows us to store the JSON which contains the measured data as is. This comes in handy, because no conversion needs to be done internally. But wait… GraphHopper only accepts data in the GPX format, an XML standard for storing geo data. And there is also Leaflet, which works best with GeoJSON, another predefined standard, this time JSON again. This problem of different data formats can only be solved by creating and/or using existing converters. This adds another piece to our architecture, the

  • custom converter (a first sketch follows below)
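
To give a first idea of what such a converter could look like, here is a minimal sketch (not the final implementation; it assumes our documents carry a gps array in [lat, lng] order and a timestamp field):

// Convert one stored measurement document into a GPX track point,
// which is the format GraphHopper expects.
function toGpxTrackpoint(doc) {
    return '<trkpt lat="' + doc.gps[0] + '" lon="' + doc.gps[1] + '">' +
           '<time>' + new Date(doc.timestamp).toISOString() + '</time>' +
           '</trkpt>';
}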

Having all these pieces and different formats, the need for a first overview arises. After sketching and refining, I came up with the following concept:

[Diagram: a first concept of the architecture]

The next steps will be creating some of the parts and checking whether this architecture stands the test 🙂

MapView & Map-Matching

Last week I decided to choose Polymaps as the mapping framework to present our results. While trying to add some more gimmicks to the map, I found out for myself that it is pretty uncomfortable to add, for example, small popups to the map.

Because of that I tried another framework, leaflet.js. Instead of using an SVG layer, it uses its own vector layer to display markers and lines. All in all, that framework is way easier to use and has a lot more documentation on the web. The only disadvantage is the strange zooming behaviour of the geoJson layer data: markers always keep the same size, which looks bad on very high/low zoom levels. We’ll find a workaround for that (a first idea is sketched below)! Moreover, I got some nice, more minimalistic (in comparison to the standard OSM) map tiles from Mapbox.
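
A first idea for such a workaround (a sketch, untested; it assumes map and data already exist) is to render the points as circle markers and adjust their radius whenever the zoom level changes:

// hypothetical helper: scale the marker radius with the zoom level
function radiusForZoom(zoom) {
    return Math.max(2, zoom - 8);
}

var layer = L.geoJson(data, {
    pointToLayer: function(feature, latlng) {
        return L.circleMarker(latlng, { radius: radiusForZoom(map.getZoom()) });
    }
}).addTo(map);

map.on('zoomend', function() {
    layer.eachLayer(function(marker) {
        marker.setRadius(radiusForZoom(map.getZoom()));
    });
});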

In addition to that, I tried out the map-matching framework from graphhopper, which seems to work pretty well, but has the big disadvantage of losing all the metadata from GPX files. We’ll have to rewrite some Java classes so that it fits our GPX structure and keeps the important sensor data. This will be one of the main tasks for next week.

REST API for Roadstar Project

The Roadstar Project requires a REST API for storing and accessing the measured data on a separate server.

Using NodeJS, MongoDB and the Express framework, setting up a server stack is relatively easy. Express enables easy routing of API requests as well as easy integration of various middleware for purposes such as validating requests and much more. Due to MongoDB’s non-relational nature we can then store the received JSON measurements directly in the database.
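
A minimal sketch of such an endpoint (the route and the body-parser middleware are assumptions; Measurement stands for one of our Mongoose models):

var express = require('express');
var bodyParser = require('body-parser');
var Measurement = require('./model/Measurement');

var app = express();
app.use(bodyParser.json());

// store a measurement sent as JSON directly in MongoDB
app.post('/api/measurements', function(req, res) {
    Measurement.create(req.body, function(err, doc) {
        if (err) return res.status(400).json({ error: err.message });
        res.status(201).json(doc);
    });
});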

Using mocha and chai for TDD, we test the API endpoints and controller functions to verify their functionality.

The first app skeleton, controllers and models were carried over from previous projects, but could also be scaffolded with the popular Yeoman.