Efficient algorithms for substring near neighbor problem:为子串近邻问题的高效算法.ppt
1 What ’ s SNN? ? SNN ≈ Text Indexing with mismatches ? Text Indexing: ? Construct a data structure on a text T[1..n], . ? Given query P[1..m], finds occurrences of P in T ? Text indexing with mismatches: ? Given P, find the substrings of T that are equal to P except ≤ R chars. ? Motivation: ., computational bio (BLAST) T= G AGTA ACTCAATA P= AGTA T= G AGTA ACTC AA TA 2 Outline ? General approach ? View: Near Neighbor in Hamming ? Focus: reducing space ? Background ? Locality-Sensitive Hashing (LSH) ? Solution ? Reducing query & preprocessing ? Redesign LSH ? Concluding remarks 3 Approach (Or, why SNN?) ? SNN = a near neighbor problem in Hamming metric with m dimensions: ? Construct data structure on D={all substrings of T of length m}, . ? Given P, find a point in D that is at distance ≤ R from P ?? Use a NN data structure for Hamming D={GAGT, AGTA , GTAA, ….AA TA } T= G AGTA ACTC AA TA P= AGTA 4 Approximate NN ? Exact NN problem seems hard (., hard w/o exponential space or O(n) query time) ? Approximate NN is easier ? Defined for approximation c=1+ ε as ? OK to report a point at distance ≤ cR (when there is a point at distance ≤ R) Query Space [KOR98, IM98] poly(log n, m) n O(1/ ε^2) LSH [IM98] n 1/c +m n 1+1/c R cR q5 Our contribution ? Problem: need m in advance for NN ? Have to construct a data structure for each m ≤M ? Here: approx SNN data structure for unknown m ? Without degradation in space or query time ? Our algorithm for SNN based on LSH: ? Supports patterns of length m ≤M ? Optimal * space: n 1+1/c ? Optimal * query time: n 1/c ? Slightly worse preprocessing time if c>3 ?(* Optimal . LSH, modulo subpoly factors) ? Also extends to l 1 6 Outline ? General approach ? View: Near Neighbor in Hamming ? Focus: reducing space ? Background ? Locality-Sensitive Hashing (LSH) ? Solution ? Reducing query & preprocessing ? Redesign LSH ? Concluding remarks 7 Locality-Sensitive Hashing ? Based on a family of hash functions {
Efficient algorithms for substring near neighbor problem:为子串近邻问题的高效算法 来自淘豆网m.daumloan.com转载请标明出处.