Managing and Mining Uncertain Data

von: Charu C. Aggarwal

Springer-Verlag, 2010

ISBN: 9780387096902 , 494 Seiten

Format: PDF, OL

Kopierschutz: DRM

Windows PC,Mac OSX für alle DRM-fähigen eReader Apple iPad, Android Tablet PC's Online-Lesen für: Windows PC,Mac OSX,Linux

Preis: 96,29 EUR

Mehr zum Inhalt

Managing and Mining Uncertain Data


 

Preface

6

Contents

8

List of Figures

16

List of Tables

21

Chapter 1 AN INTRODUCTION TO UNCERTAIN DATA ALGORITHMS AND APPLICATIONS

22

1. Introduction

22

2. Algorithms for Uncertain Data

24

3. Conclusions

27

Acknowledgements

28

References

28

Chapter 2 MODELS FOR INCOMPLETE AND PROBABILISTIC INFORMATION

30

1. Introduction

30

2. Incomplete Information and Representation Systems

34

3. RA-Completeness and Finite Completeness

35

4. Closure Under Relational Operations

39

5. Algebraic Completion

40

6. Probabilistic Databases and Representation Systems

42

7. Probabilistic ?-Tables and Probabilistic Or-Set Tables

43

8. Probabilistic c-tables

45

9. Queries on Annotated Relations

46

10. K-Relations

48

11. Polynomials for Provenance

51

12. Query Containment

54

13. RelatedWork

55

14. Conclusion and Further Work

55

Acknowledgments

56

References

57

15. Appendix

61

Chapter 3 RELATIONAL MODELS AND ALGEBRA FOR UNCERTAIN DATA

64

1. Introduction

64

2. Different Probabilistic Data Models

67

2.1 Point-Valued Probability Measures Assigned to Each Tuple

67

2.2 Point-Valued Probability Measures Assigned to Attributes and Attribute Sets

71

2.3 Interval-Valued Probability Measures Assigned to Attribute-Values

73

2.4 Interval-valued Probability Measures Assigned to Tuples

75

3. Probabilistic Relational Algebra

75

3.1 Basic Definitions

75

3.2 Primary and Foreign Keys

77

3.3 Relational Operations

78

3.4 Relational Algebra

81

3.5 Incomplete Distribution and Null Values

82

4. Algebraic Implications of the Different Representations and Associated Assumptions

86

4.1 Models Assigning Point-Valued Probability Measures at the Tuple Level

87

4.2 Models Assigning Point-Valued Probability Measures at the Attribute Level

88

4.3 Models Assigning Interval-Valued Probability Measures at the Attribute Level

89

4.4 Models Assigning Interval-Valued Probability Measures to Tuples

90

4.5 Some Observations on the Independence Assumption Across Tuples

91

5. Concluding Remarks

91

References

93

Chapter 4 GRAPHICAL MODELS FOR UNCERTAIN DATA

95

1. Introduction

96

2. Graphical Models: Overview

98

2.1 Directed Graphical Models: Bayesian Networks

100

2.2 Undirected Graphical Models: Markov Networks

102

2.3 Inference Queries

103

3. Representing Uncertainty using Graphical Models

105

3.1 PossibleWorld Semantics

106

3.2 Shared Factors

109

3.3 Representing Probabilistic Relations

109

4. Query Evaluation over Uncertain Data

110

4.1 Example

112

4.2 Generating Factors during Query Evaluation

114

4.3 Query Evaluation as Inference

117

4.4 Optimizations

118

5. RelatedWork and Discussion

119

5.1 Safe Plans

119

5.2 Representing Uncertainty using Lineage

120

5.3 Probabilistic Relational Models

120

5.4 Lifted Inference

122

5.5 Scalable Inference using a Relational Database

122

6. Conclusions

123

Acknowledgments

124

References

125

Chapter 5 TRIO:ASYSTEMFORDATA,UNCERTAINTY,AND LINEAGE

131

Introduction

131

1. ULDBs: Uncertainty-Lineage Databases

134

1.1 Alternatives

135

1.2 ‘?’ (Maybe) Annotations

135

1.3 Confidences

136

1.4 Lineage

137

1.5 Relational Queries

139

2. TriQL: The Trio Query Language

140

2.1 Operational Semantics

140

2.2 Querying Confidences

142

2.3 Querying Lineage

142

2.4 Duplicate Elimination

143

2.5 Aggregation

144

2.6 Reorganizing Alternatives

145

2.7 Horizontal Subqueries

146

2.8 Query-Defined Result Confidences

146

2.9 Other TriQL Query Constructs

147

3. Data Modifications in Trio

148

3.1 Inserts

148

3.2 Deletes

149

3.3 Updates

149

3.4 Data Modifications and Versioning

151

4. Confidence Computation

151

5. Additional Trio Features

153

6. The Trio System

154

6.1 Encoding ULDB Data

156

6.2 Basic Query Translation Scheme

158

6.3 Duplicate Elimination

159

6.4 Aggregation

159

6.5 Reorganizing Alternatives

160

6.6 Horizontal Subqueries

160

6.7 Built-In Predicates and Functions

161

6.8 Query-Defined Result Confidences

162

6.9 Remaining Constructs

162

References

164

Chapter 6 MAYBMS: A SYSTEM FOR MANAGING LARGE UNCERTAIN AND PROBABILISTIC DATABASES

166

1. Introduction

166

2. Probabilistic Databases

168

3. Query Language Desiderata

169

4. The Algebra

170

5. Representing Probabilistic Data

176

6. Conceptual Query Evaluation, Rewritings, and Asymptotic Efficiency

181

7. The MayBMS Query and Update Language

187

8. The MayBMS System

192

9. Conclusions and Outlook

195

References

197

Chapter 7 UNCERTAINTY IN DATA INTEGRATION

200

1. Introduction

200

2. Overview of the System

202

2.1 Uncertainty in data integration

202

2.2 System architecture

203

2.3 Source of probabilities

204

2.4 Outline of the chapter

205

3. Uncertainty in Mappings

205

3.1 Motivating probabilistic mappings

205

3.2 Definition and Semantics

207

3.3 Query Answering

212

3.4 Creating P-mappings

216

3.5 Broader Classes of Mappings

219

3.6 Other Types of Approximate Schema Mappings

221

4. Uncertainty in Mediated Schema

222

4.1 P-Med-Schema Motivating Example

222

4.2 Probabilistic Mediated Schema

225

4.3 P-med-schema Creation

227

4.4 Consolidation

229

4.5 Other approaches

231

5. Future Directions

232

References

233

Chapter 8 SKETCHING AGGREGATES OVER PROBABILISTIC STREAMS

236

1. Introduction

236

1.1 Aggregates over probabilistic streams

238

1.2 Organization

239

2. The Probabilistic Stream Model

239

2.1 Problem definitions

241

2.2 Frequency Moments and Quantiles

242

3. Overview of techniques and summary of results

244

4. Universal Sampling

248

5. Frequency moments: DISTINCT and REPEAT-RATE

249

5.1 DISTINCT

249

5.2 REPEAT-RATE

251

6. Heavy-Hitters, Quantiles, and MEDIAN

253

7. A Binning Technique for MIN and MAX

254

8. Estimating AVG using generating functions

257

8.1 Generating functions

257

8.2 Estimating AVG

258

8.3 Approximating AVG by SUM/COUNT

261

9. Discussion

265

References

266

Chapter 9 PROBABILISTIC JOIN QUERIES IN UNCERTAIN DATABASES

269

1. Introduction

269

2. Traditional Join Approaches

270

2.1 Simple Nested-Loop Join

271

2.2 Nested-Block-Loop Join

272

2.3 Sort-Merge-Loop Join

272

2.4 Other Join Methods

273

2.5 Spatial Join Algorithms

273

2.6 Spatial Join using a spatial index structure for both relations

274

2.7 Spatial Join using a spatial index structure on one relation

274

2.8 Spatial Join using no Spatial-Index Structure

275

3. Uncertainty Models and Join Predicates

275

3.1 The Continuous Uncertainty Model

276

3.2 The Discrete Uncertainty Model

278

3.3 Join Predicates and Score

283

3.4 Probabilistic Join Query Types

284

3.5 Example

285

3.6 Overview of UncertaintyModels and Probabilistic Join Queries

286

4. Approaches for Efficient Join Processing on Uncertain Data

289

4.1 Confidence-Based Join Methods

289

4.2 Probabilistic Similarity Joins

292

4.3 Probabilistic Spatial Join

300

5. Summary

301

References

304

Chapter 10 INDEXING UNCERTAIN DATA

310

1. Introduction

311

2. Data Models and Query Semantics

313

2.1 Uncertain Attribute types

314

3. Uncertainty Index for Continuous Domains

314

3.1 Probability Threshold Indexing

315

3.2 Special Case: Uniform PDFS

318

3.3 2D mapping of intervals

318

3.4 Join Queries

320

3.5 Multi-dimensional Indexing

322

4. Uncertainty Index for discrete domains

322

4.1 Data Model and Problem Definition

323

4.2 Probabilistic Inverted Index

324

4.3 Probabilistic Distribution R-tree

327

5. Indexing for Nearest Neighbor Queries

329

References

333

Chapter 11 RANGEANDNEARESTNEIGHBORQUERIESON UNCERTAIN SPATIOTEMPORAL DATA

336

1. Introduction

336

2. Range Search

340

2.1 Query Definitions

340

2.2 Filter and Refinement

342

2.3 Nonfuzzy Range Search

343

2.4 Fuzzy Range Search

345

2.5 Indexing

348

3. Nearest Neighbor Retrieval

349

3.1 Query Definition

349

3.2 Query Processing

350

3.3 Variations of Nearest Neighbor Retrieval

352

4. Summary

356

References

357

Chapter 12 PROBABILISTIC XML

360

1. Introduction

360

2. Sources of Uncertainty in XML Data

361

3. Modeling Uncertainty using Tags

362

4. Modeling Uncertainty using Semi-structured Data

365

5. Modeling Uncertainty in XML with Independent or Mutually Exclusive Distribution in Point Probability

367

6. Formal Model of Probabilistic Semi-structured Data with Arbitrary Probabilistic Distributions

370

6.1 Motivating Examples

372

6.2 Probabilistic Semi-structured Data Model

374

6.3 Semantics

379

6.4 PXML Algebra and Comparison with PreviousWork

382

6.5 Probabilistic Aggregate Operations

384

6.6 Modeling Interval Uncertainty in Semi-structured Data

386

7. Summary

389

Acknowledgements

390

References

391

Chapter 13 ON CLUSTERING ALGORITHMS FOR UNCERTAIN DATA

394

1. Introduction

394

2. Density Based Clustering Algorithms

396

3. The UK-means and CK-means Algorithms

399

4. UMicro: Streaming Algorithms for Clustering Uncertain Data

400

4.1 The UMicro Algorithm: Overview

402

4.2 Computing Expected Similarity

403

4.3 Computing the Uncertain Boundary

407

4.4 Further Enhancements

407

5. Approximation Algorithms for Clustering UncertainData

408

6. Conclusions and Summary

408

Acknowledgements

409

References

409

Chapter 14 ON APPLICATIONS OF DENSITY TRANS FORMSFOR UNCERTAIN DATA MINING

412

1. Introduction

412

2. Kernel Density Estimation with Errors

414

2.1 Scalability for Large Data Sets

417

3. Leveraging Density Estimation for Classification

421

4. Application of Density Based Approach to Outlier Detection

424

4.1 Outlier Detection Approach

425

4.2 Subspace Exploration for Outlier Detection

426

5. Conclusions

428

Acknowledgements

428

References

429

Chapter 15 FREQUENT PATTERN MINING ALGORITHMS WITH UNCERTAIN DATA

431

1. Introduction

432

2. Frequent Pattern Mining of Uncertain Data Sets

433

3. Apriori-style Algorithms

434

3.1 Pruning Methods for Apriori-Style Algorithms

435

4. Set-Enumeration Methods

438

5. Pattern Growth based Mining Algorithms

438

5.1 Extending the H-mine algorithm

439

5.2 Extending the FP-growth Algorithm

440

5.3 Another Variation of the FP-growth Algorithm

450

6. A Comparative Study on Challenging Cases

450

6.1 Performance Comparison

454

6.2 Scalability Comparison

456

7. Generalization to the PossibleWorlds Model

458

8. Discussion and Conclusions

459

Acknowledgements

460

References

461

Chapter 16 PROBABILISTIC QUERYING AND MINING OF BIOLOGICAL IMAGES

464

1. Introduction

464

1.1 An Illustrative Example

465

2. RelatedWork

468

3. Probabilistic Image Analyses

469

3.1 Probabilistic Segmentation

469

3.2 Measuring Neurite Thickness

472

3.3 Ganglion Cell Features

473

4. Querying Probabilistic Image Data

474

4.1 Range Queries on Uncertain Data

475

4.2 k-NN Queries on Uncertain Data

475

4.3 Adaptive, Piecewise-Linear Approximations

477

4.4 Indexing the APLA

477

4.5 Experimental Results

478

5. Mining Probabilistic Image Data

479

5.1 Defining Probabilistic Spatial Join (PSJ)

480

5.2 Threshold PSJ Query

481

5.3 Top-k PSJ Query

482

5.4 Experimental Results

483

6. Conclusion

484

Acknowledgements

485

References

486

Index

492