Confidence intervals for maximum unseen probabilities, with application to sequential sampling design

Stefano Favaro

Confidence intervals for maximum unseen probabilities, with application to sequential sampling design

Abstract

Discovery problems often require deciding whether additional sampling is needed to detect all categories whose prevalence exceeds a prespecified threshold. We study this question under a Bernoulli product (incidence) model, where categories are observed only through presence--absence across sampling units. Our inferential target is the maximum unseen probability, the largest prevalence among categories not yet observed. We develop nonasymptotic, distribution-free upper confidence bounds for this quantity in two regimes: bounded alphabets (finite and known number of categories) and unbounded alphabets (countably infinite under a mild summability condition). We characterise the limits of data-independent worst-case bounds, showing that in the unbounded regime no nontrivial data-independent procedure can be uniformly valid. We then propose data-dependent bounds in both regimes and establish matching lower bounds demonstrating their near-optimality. We compare empirically the resulting procedures in both simulated and real datasets. Finally, we use these bounds to construct sequential stopping rules with finite-sample guarantees, and demonstrate robustness to contamination that introduces spurious low-prevalence categories.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…