fil/tools/benchmark-harness/vendored/docling/md/nougat_040.md at 5304f72106aa45fb047fda0a6005d1aa932f6a94

hjess/fil

Fork 0

Files

Henrik Jess Nielsen b4c07d3693

Deploy fil (kreuzberg) / deploy (push) Successful in 49s

Details

Nomad changes

2026-06-01 23:40:55 +02:00

17 KiB

Raw Blame History

Differentially Private Set Union: Supplementary Materials

Sivakanth Gopi 1 Pankaj Gulhane 1 Janardhan Kulkarni 1 Judy Hanwen Shen 1 2 Milad Shokouhi 1 Sergey Yekhanin 1

A. Proofs of Policy Algorithms (Theorems 3.1 and 4.1)

Let D denote the collection of all databases. We say that D,D ′ ∈ D are neighboring databases, denoted by D ∼ D ′ , if they differ in exactly one user.

Definition A.1. For p ≥ 0 , the glyph[lscript] p -sensitivity of f : D → R k is defined as sup D ∼ D ′ ‖ f ( D ) -f ( D ′ ) ‖ glyph[lscript] p where the supremum is over all neighboring databases D,D ′ .

The noise that we add is sampled either from Laplace or Gaussian ( Normal ) distribution. The probability density functions of these distributions are given by:

We will need the following standard DP mechanisms.

Proposition A.1 (The Laplace Mechanism (Dwork et al., 2006)) . Suppose f : D → R k is a function with glyph[lscript] 1 sensitivity ∆ 1 . For any ε ≥ 0 , the Laplace mechanism M ( x ) = f ( x ) + ( Y 1 , Y 2 , . . . , Y k ) is ( ε, 0) -DP when Y 1 , Y 2 , . . . , Y k are i.i.d. random variables drawn from Lap (0 , ∆ 1 /ε ) .

Proposition A.2 (Gaussian Mechanism (Balle & Wang, 2018)) . Suppose f : D → R d is a function with glyph[lscript] 2 -sensitivity ∆ 2 . For any ε ≥ 0 and δ ∈ [0 , 1] , the Gaussian mechanism M ( x ) = f ( x ) + Z with Z ∼ N (0 , σ 2 I ) is ( ε, δ ) -DP if and only if

where Φ is the cdf of standard normal distribution.

Equal contribution 1 Microsoft 2 Work done as part of the Microsoft AI Residency Program. Correspondence to: Sivakanth Gopi < sigopi@microsoft.com > , Janardhan Kulkarni < jakul@microsoft.com > , Judy Hanwen Shen < hashe@microsoft.com > .

Definition A.2. We say that two distributions P, Q on a domain X are ( ε, δ ) -close to each other, denoted by P ≈ ε,δ Q , if for every S ⊂ X , we have

We say that two random variables X,Y are ( ε, δ ) -close to each other, denoted by X ≈ ε,δ Y , if their distributions are ( ε, δ ) -close to each other.

We will need the following lemmas which is useful to prove ( ε, δ ) -DP.

Lemma A.1. Let P, Q be probability distributions over a domain X . If there exists an event E s.t. P ( E ) = 1 -δ ′ and P | E ≈ ε,δ Q , then P ≈ ε,δ + δ ′ Q .

Proof. Fix some subset S ⊆ X .

We now prove the other direction.

Now if e ε Pr x ∼ P [ x ∈ S ] ≤ 1 -δ ′ , then we have Pr x ∼ Q [ x ∈ S ] ≤ e ε Pr x ∼ P [ x ∈ S ] + δ ′ + δ. Otherwise, trivially

We will also need the fact that if X ≈ ε,δ Y , then after postprocessing they also remain ( ε, δ ) -close.

Lemma A.2. If two random variables X,Y are ( ε, δ ) -close and M is any randomized algorithm, then M ( X ) ≈ ε,δ M ( Y ) .

Proof. Let M ( z ) = F ( z, R ) for some function F where R is the random bits used by M . For any subset S of the possible outputs of M ,

The other direction holds by symmetry.

Proof of Theorem 3.1. Suppose D 1 and D 2 are neighboring databases where D 1 has one extra user compared to D 2 . Let P and Q denote the distribution of output of the algorithm when the database is D 1 and D 2 respectively. We want to show that P ≈ ε,δ Q . Let E be the event that A ⊂ supp( H 2 ) .

Claim A.1. P | E ≈ ε, 0 Q

Proof. Let H 1 and H 2 be the histograms generated by the algorithm from databases D 1 and D 2 respectively. And ˆ H 1 and ˆ H 2 be the histograms obtained by adding Lap (0 , 1 /ε ) noise to each entry of H 1 and H 2 respectively. For any possible output A of Algorithm 4, we have

So A ∼ P | E is obtained by postprocessing ˆ H 1 | E and A ∼ Q is obtained by postprocessing ˆ H 2 . Since postprocessing only makes two distributions closer (Lemma A.2), it is enough to show that the distributions of the ˆ H 1 | supp( H 2 ) and ˆ H 2 are ( ε, 0) -close to each other. Because the histogram building algorithm (Algorithm 2) has glyph[lscript] 1 -sensitivity of at most 1 by hypothesis, ∥ ∥ H 1 | supp( H 2 ) -H 2 ∥ ∥ glyph[lscript] 1 ≤ 1 . Therefore P | E ≈ ε, 0 Q by the properties of Laplace mechanism (Proposition A.1).

By Lemma A.1, it is enough to show that P ( E ) ≥ 1 -δ . Let T = supp( H 1 ) \ supp( H 2 ) . Note that | T | ≤ ∆ 0 and H 1 [ u ] ≤ 1 | T | for u ∈ T.

Thus for

we have P ( ¯ E ) ≤ δ . Therefore the DP Set Union algorithm (Algorithm 1) is ( ε, δ ) -DP.

Proof of Theorem 4.1. Suppose D 1 and D 2 are neighboring databases where D 1 has one extra user compared to D 2 . Let P and Q denote the distribution of output of the algorithm when the database is D 1 and D 2 respectively. We want to show that P ≈ ε,δ Q . Let E be the event that A ⊂ supp( H 2 ) .

Claim A.2. P | E ≈ ε,δ/ 2 Q

Proof. Let H 1 and H 2 be the histograms generated by the algorithm from databases D 1 and D 2 respectively. And ˆ H 1 and ˆ H 2 be the histograms obtained by adding N (0 , σ 2 ) noise to each entry of H 1 and H 2 respectively. By the postprocessing lemma (Lemma A.2), it is enough to show that the distributions of the ˆ H 1 | supp( H 2 ) and ˆ H 2 are ( ε, δ/ 2) -close to each other. Because the histogram building algorithm (Algorithm 2) has glyph[lscript] 2 -sensitivity of at most 1 by hypothesis, ∥ ∥ H 1 | supp( H 2 ) -H 2 ∥ ∥ glyph[lscript] 2 ≤ 1 . Therefore by properties of Gaussian mechanism (Proposition A.2), it is enough to choose σ as in the statement of the theorem.

By Lemma A.1, it is enough to show that P ( E ) ≥ 1 -δ/ 2 . Let T = supp( H 1 ) \ supp( H 2 ) . Note that | T | ≤ ∆ 0 and H 1 [ u ] ≤ 1 √ | T | for u ∈ T.

Thus for

we have P ( ¯ E ) ≤ δ/ 2 . Therefore the DP Set Union algorithm (Algorithm 1) is ( ε, δ ) -DP.

B. Bounded Sensitivity implies DP (Proof of Theorem 1.2)

We will now prove a formal version of Theorem 1.2, i.e., if the histogram output by Algorithm 2 has bounded glyph[lscript] p -sensitivity (for p ∈ { 1 , 2 } ), then by adding appropriate noise and setting an appropriate threshold, Algorithm 1 for DP set union can be made differentially private. The lower bounds on the threshold ( ρ ) that we obtain in this generality are only slightly worse compared to the corresponding bounds in Theorems 3.1 and 4.1.

Theorem B.1. Suppose the histogram output by Algorithm 2 has glyph[lscript] 1 -sensitivity 1. Then Algorithm 1 is ( ε, δ ) -DP when the Noise distribution is Lap (0 , λ ) where λ = 1 /ε and the threshold

Proof. Proof of Theorem B.1 is extremely similar to the proof of Theorem 3.1. The only place where it differes is in Equation (1) where we bound H 1 [ u ] ≤ 1 instead of H 1 [ u ] ≤ 1 / | T | .

Theorem B.2. Suppose the histogram output by Algorithm 2 has glyph[lscript] 2 -sensitivity 1. Then Algorithm 1 is ( ε, δ ) -DP when the Noise distribution is N (0 , σ 2 ) where σ and the threshold ρ are chosen s.t.

Proof. Proof of Theorem B.2 is extremely similar to the proof of Theorem 4.1. The only place where it differes is in Equation (2) where we bound H 1 [ u ] ≤ 1 instead of H 1 [ u ] ≤ 1 / √ | T | .

C. Proof of Lemma 4.1

Figure 1. Geometric explanation of Lemma 4.1 when | AB | , | AC | > 1 .

Proof of Lemma 4.1. Let us first assume that both | AB | , | AC | > 1 . Let θ be the angle at A and let | AB ′ | = x, | AC ′ | = y as shown in Figure 1. Then by the cosine formula,

If | AB | , | AC | ≤ 1 , then B ′ = C ′ = A and then the

Figure 2. Geometric explanation of Lemma 4.1 when | AB | > 1 , | AC | ≤ 1 .

Mass

H'i- 1)

Items

Mass claim is trivially true. Suppose | AB | > 1 , | AC | ≤ 1 . Now C ′ = A. Let | AB ′ | = x, | AC | = z ≤ 1 and θ be the angle at A as shown in Figure 2. Then by the cosine formula,

Mass from us

Mass from us ... uN

By symmetry, the claim is also true when | AC | > 1 , | AB | ≤ 1 .

D. Privacy analysis of Weighted Laplace and Guassian Algorithms

D.1. Weighted Laplace

Algorithm 1 LAPLACE weighted update

Input: H : Current histogram

W : A subset of U of size at most ∆ 0

Output: H : Updated histogram for u in W do H[ u ] ← H[ u ] + 1 | W |

end for

Theorem D.1. The WEIGHTED LAPLACE algorithm (Algorithm 1) is ( ε, δ ) -DP when

Proof. Proof is exactly the same as that of Theorem 3.1.

D.2. Weighted Gaussian

Algorithm 2

GAUSSIAN weighted update Input: H : Current histogram W : A subset of U of size at most ∆ 0 Output: H : Updated histogram for u in W do H[ u ] ← H[ u ] + √ 1 | W | end for

Theorem D.2. The WEIGHTED GAUSSIAN algorithm (Algorithm 2) is ( ε, δ ) -DP if σ, ρ Gauss are chosen s.t.

Proof. Proof is exactly the same as that of Theorem 4.1.

· Mass from us

H(NI

Mass from us -. UN

E. Greedy Policy

Figure 3. Visualization of greedy update example where the final l 1 sensitivity is larger than 1.

In this section, we give a simple counter example to illustrate how the sensitivity of a greedy policy algorithm can be unbounded.

Algorithm 3 GREEDY POLICY update

Input: H : Current histogram W : A subset of U of size at most ∆ 0 Γ : cutoff parameter Output: H : Updated histogram // Build cost dictionary G G = {} // Empty dictionary for u ∈ W do if H [ u ] < Γ then // Gap to cutoff for items below cutoff Γ G [ u ] ← Γ -H [ u ] end if end for budget ← 1 // Each user gets a total budget of 1 // Sort in increasing order of the gap Γ -H [ u ] G ← sort( G ) // Let u 1 , u 2 , . . . , u | G | be the sorted order for j = 1 to | G | do if G [ u j ] ≤ budget then H [ u j ] ← H [ u j ] + G [ u j ] budget ← budget G [ u j ] else H [ u j ] ← H [ u j ]+ budget break end if end for

Suppose there are N user let u 1 and u 2 be two items in the universe. We will denote the weight of item u after user

10%

· 101

10°

log-log plot of word rank by user count (row sums)

10°

i 's contribution as H ( i ) [ u ] . Suppose user i has only item u 1 while users i +1 , i +2 , . . . , N have both items. Let H 1 be the histogram generated with all N users while H 2 be the histogram generated without user i . Let ∆ 0 = 2 and H ( i -1) [ u 1 ] < H ( i -1) [ u 2 ] < 1 + H ( i -1) [ u 1 ] . According to the greedy update described in Algorithm 3, in H 1 , user i will add weight 1 to u 1 and users i +1 , i +2 , . . . , N will also to u 1 since H ( i ) [ u 1 ] > H ( i ) [ u 2 ] . In H 2 , users i +1 , i + 2 , . . . , N will add to u 2 since H ( i -1) [ u 1 ] < H ( i -1) [ u 2 ] . This process is described in figure 3. Therefore the glyph[lscript] 1 -sensitivity of the histogram built using Greedy Policy update (Algorithm 3) can be Ω(Γ , N ) .

F. Dataset Details

Using a log-log scale, the frequency of users for each unigram vs. the rank of the unigram is linear (Figure 4). In other words, the lowest ranked (most common) unigrams are used by almost all users while the highest ranked (least common) unigrams are used by very few users.

Figure 4. Frequency (i.e. number of users who use the unigram) vs. rank of the unigram (based on frequency) on a log-log scale. This linear relationship shows that the frequency of unigrams among users also follows Zipf's law (power law), i.e., count ∝ 1 / rank α for some constant α > 0 . The α in this case is ≈ 1 .

The distribution of how many unigrams each user uses also follows a long tail distribution. While the top 10 users contribute between 850 and 2000 unique unigrams, most users (93.1%) contribute less than 100 unique unigrams. Table 1 summarizes the percentage of users with a unique vocabulary smaller than each threshold T provided.

Table 1. Percentage of users with unique unigram count of less than or equal to T. The vast majority of user have less than 100 unique unigrams.

THRESHOLD (T) USERS WITH \| W i \| ≤ T	THRESHOLD (T) USERS WITH \| W i \| ≤ T
1	2.78%
10	29.82%
50	79.16%
100	93.13%
300	99.59%

G. Additional Experiments

G.1. Multiple passes through each user

In the experiments described thus far, each user contributes items once within the budget constraints. We also investigate whether the output of set union increases in size when each user contributes the same budget over multiple passes (e.g. user 1 contributes half of their budget each time over 2 passes), we compare POLICY LAPLACE and POLICY GAUSSIAN outputs. Table 2 summarizes the results showing that there is not strong evidence suggesting that running multiple passes through the users improves the size of the output set.

G.2. Selecting α : parameter to set threshold Γ

Figure 5 shows the number of unigrams released by POLICY LAPLACE and POLICY GAUSSIAN for various values of α . We observe that the number of unigrams released increases sharply until α = 4 , then remains nearly constant and then slowly decreases. This choice of α only affects the policy algorithms since the weighted and count algorithms do not use a threshold.

Figure 5. Number of unigrams released for various values of α . The number of unigrams released increases sharply until about α = 2 , then remains nearly constant and then decreases. Here we fixed ∆ 0 = 100 and ε = 3 .

Table 2. Count of unigrams released POLICY LAPLACE and POLICY GAUSSIAN algorithms for single and double passes over users. Results are averaged and rounded across 5 shuffles of user order. The privacy parameters are ε = 3 and δ = exp( -10) . α = 2 is chosen for the threshold parameter. Significant p-values for a two-sided independent t-test are bolded.

POLICY LAPLACE	POLICY LAPLACE	POLICY LAPLACE	POLICY GAUSSIAN	POLICY GAUSSIAN	POLICY GAUSSIAN	POLICY GAUSSIAN
∆ 0	1 PASS	2 PASSES	P-VAL	1 PASS	2 PASSES	P-VAL
1	4236 ± 14	4257 ± 17	0.083	3135 ± 25	3131 ± 20	0.829
10	12452 ± 31	12389 ± 17	0.008	10784 ± 22	10817 ± 54	0.293
50	15056 ± 35	15080 ± 21	0.262	15763 ± 33	15809 ± 45	0.139
100	14562 ± 50	14567 ± 24	0.846	14562 ± 50	14568 ± 24	0.846
200	14005 ± 33	13979 ± 31	0.271	14005 ± 33	13979 ± 31	0.271
300	13702 ± 37	13678 ± 47	0.448	13702 ± 37	13678 ± 47	0.447

THE EFFECT OF ε

Figure 6. Number of unigrams released for various values of ε . Here we fixed ∆ 0 = 100 and α = 5 .

We use ε = 3 for the experiments in table 1. At this value of ε our policy algorithms perform much better than previous count and weighted algorithms. To check whether this result holds with smaller ε , we also run these algorithms on various values of ε . Figure 6 shows that for ε ≥ 1 our policy algorithms always perform better.

References

Balle, B. and Wang, Y.-X. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning , pp. 403-412, 2018.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference , pp. 265-284. Springer, 2006.

17 KiB Raw Blame History Unescape Escape