Spatial Partitioning and Querying for Apache Spark

Hariharan Gandhi (hariharan.gandhi@de.ibm.com)

In this presentation

What's the big deal

What do we want to solve

How did we Solve :)

Trending Scenario

Very large volume of Geo-spatial data

Businesses need location aware services

- Like tweets in an particular area

- Weather information of a holiday spot which you plan to visit over the vacation

- Attractions nearby

- Take me through the shortest route

- many more.....

Seek support from Parallel

computing platforms

To handle such large volumes

But unfortunately, no native support (yet))

Needs modifying internal implementations

Why partition ?

Spatial Indexing for retrieval

Range Queries, grouping – Network cost due to shuffling between clusters/nodes

When to partition ?

GeoSpatial operations intense application,. Eg: Location aware services

- Ex: Maps, Uber, Jodel, AirBnB, Emergency Services

Major interests of queries are Geospatial:

Multiple reads

The plan

Introduce a partitioning scheme in Apache Spark that partitions data-sets considering Geo-spatial info:

Communication in distributed environment:

- minimize network traffic by allowing spark program to control RDD partitioning

Retained Spatial Locality:

- nearby points are probably in the same partition / cluster

Minimize data points shuffling during Geo-spatial queries

Optimized for Search, Range queries on Geo-points

Implementing on Spark

GeoHash Partitioner

GeoHash Query Layer

Space Filling Curves

Z-order Curve

Hilbert Curve

Variable Length Indexing

We need a Mapping from grid to spark nodes

Variable resolution grids mapping is preserved in a data structure

Custom Partitioning

Step1: Load Aware

Step2: Variable length to generate Hash

Step3: Map these hash to RDD partitions

Custom Partitioner Code

Implementing on Spark

GeoHash Partitioner

GeoHash Query Layer

Query affects all nodes

Incoming query is executed on all nodes

Large scans

GeoHash Query Layer

Leverage the knowledge of previous partitioning and efficiently query on subset of data:

Identify the type of Query

Determine the areas of interest from the overall world

Figure out partitions that contain our interested region

Slice out the partitions from the RDD

Push the query on the final RDD based on the Query predicates

Thank you

Prof. Alejandro Buchmann Ph.D.

Robert Rehner, M.Sc.

Dr. Gregor Möhler (IBM, Germany)

Dr. Raghu Kiran Ganti (IBM Lab, USA)

Master Thesis Student:

Hariharan Gandhi

My Contact:

Preparing presentation....

In this presentation

Trending Scenario

Seek support from Parallel computing platforms

Why partition ?

When to partition ?

The plan

Implementing on Spark

Space Filling Curves

Variable Length Indexing

Custom Partitioning

Custom Partitioner Code

Implementing on Spark

Query affects all nodes

GeoHash Query Layer

Thank you

Seek support from Parallel

computing platforms