
Latest revision as of 21:44, 12 January 2020
Title: Finding hotspots in geospatial data using spatial statistics
Mentor: Dr. Satish Puri
Approach: This project deals with compute-intensive spatial data mining algorithms and uses parallel computing to speed up analytics tasks. It applies spatial correlation methods to mine interesting patterns in geospatial data.
Summary: Spatial data mining is the process of discovering interesting and potentially useful patterns from spatial data sources. The complexity of spatial data and its implicit spatial relationships limit the usefulness of standard data mining algorithms for extracting spatial patterns. Although standard algorithms can be applied under assumptions such as independent and identically distributed data, they often perform poorly on geospatial data because such data is self-correlated (spatially autocorrelated). Here we focus on one spatial data mining task, namely hotspot detection. Hotspots are statistically significant clusters. In other words, given a set of weighted data points, the hotspots are those clusters of points whose values are higher than would be expected by random chance. The Centers for Disease Control and Prevention (CDC) uses hotspot analysis to find disease outbreaks. Another example is finding traffic accident hotspots in a region. A compute-intensive algorithm known as Getis-Ord is used to find such hotspots. The output of the algorithm is a Z-score for each location, which represents the statistical significance of clustering for a specified distance. P-values are calculated to test the null hypothesis of no spatial clustering. The New York taxi trips data set, containing about 100 million records of pickup and drop-off locations and times, will be used in the project. The size of the data and the computational complexity of the algorithm motivate exploring parallel computing methods in this project.
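To make the Z-score and P-value description concrete, here is a minimal Python sketch of one common form of the statistic, the Getis-Ord Gi* variant. It rests on several assumptions that the project description does not specify: binary distance-band weights (a neighbor counts as 1 if it lies within the given distance, and each point counts as its own neighbor), a dense O(n^2) distance matrix that only suits small inputs, and a two-sided P-value from the normal approximation. The function name and the toy grid are illustrative only.

```python
import math
import numpy as np

def getis_ord_gi_star(coords, values, distance):
    """Return (z_scores, p_values) for the Gi* statistic at every point.

    Assumes binary distance-band weights that include the point itself.
    The result is undefined (division by zero) if all values are equal
    or if some point has every other point within `distance`.
    """
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = x.size

    # Dense pairwise distances: O(n^2) memory, fine for a small sketch only.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = (dist <= distance).astype(float)

    mean = x.mean()
    s = math.sqrt((x ** 2).mean() - mean ** 2)  # population standard deviation

    w_sum = w.sum(axis=1)
    w_sq_sum = (w ** 2).sum(axis=1)
    numer = w @ x - mean * w_sum
    denom = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    z = numer / denom

    # Two-sided P-value under the standard normal approximation.
    p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return z, p

# Toy data: a 5x5 grid with a 3x3 block of high values in one corner.
grid_coords = [(i, j) for i in range(5) for j in range(5)]
grid_values = [10.0 if i < 3 and j < 3 else 0.0 for i, j in grid_coords]

z, p = getis_ord_gi_star(grid_coords, grid_values, distance=1.5)
hottest = int(np.argmax(z))
print(grid_coords[hottest], round(float(z[hottest]), 3), float(p[hottest]))
```

In this toy grid the largest Z-score falls on the center of the high-value block, since every neighbor within the distance band also has a high value. For a data set the size of the taxi trips, the dense distance matrix would have to be replaced by a spatial index or by grid aggregation.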
Student Research Activities: The REU fellows will perform the following major tasks:
 Perform a systematic literature review of data mining techniques in geospatial data.
 Understand hotspot detection and the associated data mining algorithms.
 Implement and evaluate spatial data mining algorithms, and apply parallel computing methods to speed up analytics tasks on the publicly available New York taxi trips data set.
Student Background: Students need basic computing knowledge and introductory programming skills in Python or C/C++. Students will be introduced to compiler pragma-based methods for quickly parallelizing sequential code on multi-core CPUs and many-core GPUs.
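The idea of parallelizing a sequential loop can be illustrated in Python as well, although this is only an analogue of the pragma-based approach and not the project's method. The sketch below assumes the expensive step is the per-point sum of values within a distance band, splits the points into independent row chunks, and hands the chunks to a thread pool from the standard library. The function names, chunk size, and worker count are hypothetical, and any speedup depends on how much of the NumPy work runs outside the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def neighbor_sums_chunk(coords, values, distance, start, stop):
    """Sum of values within `distance` for points start..stop-1."""
    diff = coords[start:stop, None, :] - coords[None, :, :]
    within = (diff ** 2).sum(axis=2) <= distance ** 2
    return within.astype(float) @ values

def neighbor_sums_parallel(coords, values, distance, chunk_size=1024, workers=4):
    """Compute all neighbor sums by processing independent row chunks in parallel."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    if n == 0:
        return np.empty(0)

    # Each chunk reads shared inputs and writes nothing shared,
    # so the chunks can run in any order.
    bounds = [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda b: neighbor_sums_chunk(coords, values, distance, b[0], b[1]),
            bounds,
        )
        # map() yields results in submission order, so concatenation is safe.
        return np.concatenate(list(parts))

# Synthetic points stand in for real pickup locations.
rng = np.random.default_rng(0)
pts = rng.random((2000, 2))
vals = rng.random(2000)

serial = neighbor_sums_chunk(pts, vals, 0.05, 0, len(vals))
parallel = neighbor_sums_parallel(pts, vals, 0.05, chunk_size=256)
print(np.allclose(serial, parallel))
```

Because each chunk is independent, the same decomposition maps directly onto a parallel loop over points in C or C++.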