Preparing presentation....
Loading

Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.


Spatial Partitioning and Querying for Apache Spark
Hariharan Gandhi (hariharan.gandhi@de.ibm.com)

In this presentation



What's the big deal

What do we want to solve

How did we Solve :)

Trending Scenario



  • Very large volume of Geo-spatial data

  • Businesses need location aware services
  • Seek support from Parallel

    computing platforms



  • To handle such large volumes

  • But unfortunately, no native support (yet))

  • Needs modifying internal implementations
  • Why partition ?



  • Spatial Indexing for retrieval

  • Range Queries, grouping – Network cost due to shuffling between clusters/nodes


  • When to partition ?



  • GeoSpatial operations intense application,. Eg: Location aware services

  • Major interests of queries are Geospatial:

  • Multiple reads
  • The plan



    Introduce a partitioning scheme in Apache Spark that partitions data-sets considering Geo-spatial info:

  • Communication in distributed environment:

  • Retained Spatial Locality:

  • Minimize data points shuffling during Geo-spatial queries

  • Optimized for Search, Range queries on Geo-points
  • Implementing on Spark



  • GeoHash Partitioner

  • GeoHash Query Layer


  • Space Filling Curves



    Z-order Curve


    Hilbert Curve


    Variable Length Indexing



  • We need a Mapping from grid to spark nodes

  • Variable resolution grids mapping is preserved in a data structure
  • Custom Partitioning



  • Step1: Load Aware

  • Step2: Variable length to generate Hash

  • Step3: Map these hash to RDD partitions
  • Custom Partitioner Code


    Implementing on Spark



  • GeoHash Partitioner


  • GeoHash Query Layer

  • Query affects all nodes



  • Incoming query is executed on all nodes

  • Large scans

  • GeoHash Query Layer



  • Leverage the knowledge of previous partitioning and efficiently query on subset of data:

  • Identify the type of Query

  • Determine the areas of interest from the overall world

  • Figure out partitions that contain our interested region

  • Slice out the partitions from the RDD

  • Push the query on the final RDD based on the Query predicates

  • Thank you



  • Prof. Alejandro Buchmann Ph.D.

  • Robert Rehner, M.Sc.

  • Dr. Gregor Möhler (IBM, Germany)

  • Dr. Raghu Kiran Ganti (IBM Lab, USA)

  • Master Thesis Student:


  • Hariharan Gandhi

  • My Contact: