tools/benchmark-harness/vendored/docling/md/pdfa_015.md

## Differentially Private Set Union: Supplementary Materials

## Sivakanth Gopi 1 Pankaj Gulhane 1 Janardhan Kulkarni 1 Judy Hanwen Shen 1 2 Milad Shokouhi 1 Sergey Yekhanin 1

## A. Proofs of Policy Algorithms (Theorems 3.1 and 4.1)

Let D denote the collection of all databases. We say that D,D ′ ∈ D are neighboring databases, denoted by D ∼ D ′ , if they differ in exactly one user.

Definition A.1. For p ≥ 0 , the glyph[lscript] p -sensitivity of f : D → R k is defined as sup D ∼ D ′ ‖ f ( D ) -f ( D ′ ) ‖ glyph[lscript] p where the supremum is over all neighboring databases D,D ′ .

The noise that we add is sampled either from Laplace or Gaussian ( Normal ) distribution. The probability density functions of these distributions are given by:

<!-- formula-not-decoded -->

We will need the following standard DP mechanisms.

Proposition A.1 (The Laplace Mechanism (Dwork et al., 2006)) . Suppose f : D → R k is a function with glyph[lscript] 1 sensitivity ∆ 1 . For any ε ≥ 0 , the Laplace mechanism M ( x ) = f ( x ) + ( Y 1 , Y 2 , . . . , Y k ) is ( ε, 0) -DP when Y 1 , Y 2 , . . . , Y k are i.i.d. random variables drawn from Lap (0 , ∆ 1 /ε ) .

Proposition A.2 (Gaussian Mechanism (Balle &amp; Wang, 2018)) . Suppose f : D → R d is a function with glyph[lscript] 2 -sensitivity ∆ 2 . For any ε ≥ 0 and δ ∈ [0 , 1] , the Gaussian mechanism M ( x ) = f ( x ) + Z with Z ∼ N (0 , σ 2 I ) is ( ε, δ ) -DP if and only if

<!-- formula-not-decoded -->

where Φ is the cdf of standard normal distribution.

- Equal contribution 1 Microsoft 2 Work done as part of the Microsoft AI Residency Program. Correspondence to: Sivakanth Gopi &lt; <sigopi@microsoft.com> &gt; , Janardhan Kulkarni &lt; <jakul@microsoft.com> &gt; , Judy Hanwen Shen &lt; <hashe@microsoft.com> &gt; .

Proceedings of the 37 th International Conference on Machine Learning , Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Definition A.2. We say that two distributions P, Q on a domain X are ( ε, δ ) -close to each other, denoted by P ≈ ε,δ Q , if for every S ⊂ X , we have

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

We say that two random variables X,Y are ( ε, δ ) -close to each other, denoted by X ≈ ε,δ Y , if their distributions are ( ε, δ ) -close to each other.

We will need the following lemmas which is useful to prove ( ε, δ ) -DP.

Lemma A.1. Let P, Q be probability distributions over a domain X . If there exists an event E s.t. P ( E ) = 1 -δ ′ and P | E ≈ ε,δ Q , then P ≈ ε,δ + δ ′ Q .

Proof. Fix some subset S ⊆ X .

<!-- formula-not-decoded -->

We now prove the other direction.

<!-- formula-not-decoded -->

Now if e ε Pr x ∼ P [ x ∈ S ] ≤ 1 -δ ′ , then we have Pr x ∼ Q [ x ∈ S ] ≤ e ε Pr x ∼ P [ x ∈ S ] + δ ′ + δ. Otherwise, trivially

<!-- formula-not-decoded -->

We will also need the fact that if X ≈ ε,δ Y , then after postprocessing they also remain ( ε, δ ) -close.

Lemma A.2. If two random variables X,Y are ( ε, δ ) -close and M is any randomized algorithm, then M ( X ) ≈ ε,δ M ( Y ) .

Proof. Let M ( z ) = F ( z, R ) for some function F where R is the random bits used by M . For any subset S of the possible outputs of M ,

<!-- formula-not-decoded -->

The other direction holds by symmetry.

Proof of Theorem 3.1. Suppose D 1 and D 2 are neighboring databases where D 1 has one extra user compared to D 2 . Let P and Q denote the distribution of output of the algorithm when the database is D 1 and D 2 respectively. We want to show that P ≈ ε,δ Q . Let E be the event that A ⊂ supp( H 2 ) .

Claim A.1. P | E ≈ ε, 0 Q

Proof. Let H 1 and H 2 be the histograms generated by the algorithm from databases D 1 and D 2 respectively. And ˆ H 1 and ˆ H 2 be the histograms obtained by adding Lap (0 , 1 /ε ) noise to each entry of H 1 and H 2 respectively. For any possible output A of Algorithm 4, we have

<!-- formula-not-decoded -->

So A ∼ P | E is obtained by postprocessing ˆ H 1 | E and A ∼ Q is obtained by postprocessing ˆ H 2 . Since postprocessing only makes two distributions closer (Lemma A.2), it is enough to show that the distributions of the ˆ H 1 | supp( H 2 ) and ˆ H 2 are ( ε, 0) -close to each other. Because the histogram building algorithm (Algorithm 2) has glyph[lscript] 1 -sensitivity of at most 1 by hypothesis, ∥ ∥ H 1 | supp( H 2 ) -H 2 ∥ ∥ glyph[lscript] 1 ≤ 1 . Therefore P | E ≈ ε, 0 Q by the properties of Laplace mechanism (Proposition A.1).

By Lemma A.1, it is enough to show that P ( E ) ≥ 1 -δ . Let T = supp( H 1 ) \ supp( H 2 ) . Note that | T | ≤ ∆ 0 and H 1 [ u ] ≤ 1 | T | for u ∈ T.

<!-- formula-not-decoded -->

Thus for

<!-- formula-not-decoded -->

we have P ( ¯ E ) ≤ δ . Therefore the DP Set Union algorithm (Algorithm 1) is ( ε, δ ) -DP.

Proof of Theorem 4.1. Suppose D 1 and D 2 are neighboring databases where D 1 has one extra user compared to D 2 . Let P and Q denote the distribution of output of the algorithm when the database is D 1 and D 2 respectively. We want to show that P ≈ ε,δ Q . Let E be the event that A ⊂ supp( H 2 ) .

Claim A.2. P | E ≈ ε,δ/ 2 Q

Proof. Let H 1 and H 2 be the histograms generated by the algorithm from databases D 1 and D 2 respectively. And ˆ H 1 and ˆ H 2 be the histograms obtained by adding N (0 , σ 2 ) noise to each entry of H 1 and H 2 respectively. By the postprocessing lemma (Lemma A.2), it is enough to show that the distributions of the ˆ H 1 | supp( H 2 ) and ˆ H 2 are ( ε, δ/ 2) -close to each other. Because the histogram building algorithm (Algorithm 2) has glyph[lscript] 2 -sensitivity of at most 1 by hypothesis, ∥ ∥ H 1 | supp( H 2 ) -H 2 ∥ ∥ glyph[lscript] 2 ≤ 1 . Therefore by properties of Gaussian mechanism (Proposition A.2), it is enough to choose σ as in the statement of the theorem.

By Lemma A.1, it is enough to show that P ( E ) ≥ 1 -δ/ 2 . Let T = supp( H 1 ) \ supp( H 2 ) . Note that | T | ≤ ∆ 0 and H 1 [ u ] ≤ 1 √ | T | for u ∈ T.

<!-- formula-not-decoded -->

Thus for

<!-- formula-not-decoded -->

we have P ( ¯ E ) ≤ δ/ 2 . Therefore the DP Set Union algorithm (Algorithm 1) is ( ε, δ ) -DP.

## B. Bounded Sensitivity implies DP (Proof of Theorem 1.2)

We will now prove a formal version of Theorem 1.2, i.e., if the histogram output by Algorithm 2 has bounded glyph[lscript] p -sensitivity (for p ∈ { 1 , 2 } ), then by adding appropriate noise and setting an appropriate threshold, Algorithm 1 for DP set union can be made differentially private. The lower bounds on the threshold ( ρ ) that we obtain in this generality are only slightly worse compared to the corresponding bounds in Theorems 3.1 and 4.1.

Theorem B.1. Suppose the histogram output by Algorithm 2 has glyph[lscript] 1 -sensitivity 1. Then Algorithm 1 is ( ε, δ ) -DP when the Noise distribution is Lap (0 , λ ) where λ = 1 /ε and the threshold

<!-- formula-not-decoded -->

Proof. Proof of Theorem B.1 is extremely similar to the proof of Theorem 3.1. The only place where it differes is in Equation (1) where we bound H 1 [ u ] ≤ 1 instead of H 1 [ u ] ≤ 1 / | T | .

Theorem B.2. Suppose the histogram output by Algorithm 2 has glyph[lscript] 2 -sensitivity 1. Then Algorithm 1 is ( ε, δ ) -DP when the Noise distribution is N (0 , σ 2 ) where σ and the threshold ρ are chosen s.t.

<!-- formula-not-decoded -->

Proof. Proof of Theorem B.2 is extremely similar to the proof of Theorem 4.1. The only place where it differes is in Equation (2) where we bound H 1 [ u ] ≤ 1 instead of H 1 [ u ] ≤ 1 / √ | T | .

## C. Proof of Lemma 4.1

Figure 1. Geometric explanation of Lemma 4.1 when | AB | , | AC | &gt; 1 .

<!-- image -->

Proof of Lemma 4.1. Let us first assume that both | AB | , | AC | &gt; 1 . Let θ be the angle at A and let | AB ′ | = x, | AC ′ | = y as shown in Figure 1. Then by the cosine formula,

<!-- formula-not-decoded -->

If | AB | , | AC | ≤ 1 , then B ′ = C ′ = A and then the

Figure 2. Geometric explanation of Lemma 4.1 when | AB | &gt; 1 , | AC | ≤ 1 .

<!-- image -->

Mass

41

H'i- 1)

Items

U2

Mass claim is trivially true. Suppose | AB | &gt; 1 , | AC | ≤ 1 . Now C ′ = A. Let | AB ′ | = x, | AC | = z ≤ 1 and θ be the angle at A as shown in Figure 2. Then by the cosine formula,

Mass from us

Mass from us ... uN

<!-- formula-not-decoded -->

<!-- formula-not-decoded -->

By symmetry, the claim is also true when | AC | &gt; 1 , | AB | ≤ 1 .

## D. Privacy analysis of Weighted Laplace and Guassian Algorithms

## D.1. Weighted Laplace

## Algorithm 1 LAPLACE weighted update

Input: H : Current histogram

W : A subset of U of size at most ∆ 0

```text
Output: H : Updated histogram for u in W do H[ u ] ← H[ u ] + 1 | W |
```

end for

Theorem D.1. The WEIGHTED LAPLACE algorithm (Algorithm 1) is ( ε, δ ) -DP when

<!-- formula-not-decoded -->

Proof. Proof is exactly the same as that of Theorem 3.1.

## D.2. Weighted Gaussian

Algorithm 2

```text
GAUSSIAN weighted update Input: H : Current histogram W : A subset of U of size at most ∆ 0 Output: H : Updated histogram for u in W do H[ u ] ← H[ u ] + √ 1 | W | end for
```

Theorem D.2. The WEIGHTED GAUSSIAN algorithm (Algorithm 2) is ( ε, δ ) -DP if σ, ρ Gauss are chosen s.t.

<!-- formula-not-decoded -->

Proof. Proof is exactly the same as that of Theorem 4.1.

· Mass from us

H(NI

Mass from us -. UN

## E. Greedy Policy

Figure 3. Visualization of greedy update example where the final l 1 sensitivity is larger than 1.

<!-- image -->

In this section, we give a simple counter example to illustrate how the sensitivity of a greedy policy algorithm can be unbounded.

## Algorithm 3 GREEDY POLICY update

```text
Input: H : Current histogram W : A subset of U of size at most ∆ 0 Γ : cutoff parameter Output: H : Updated histogram // Build cost dictionary G G = {} // Empty dictionary for u ∈ W do if H [ u ] < Γ then // Gap to cutoff for items below cutoff Γ G [ u ] ← Γ -H [ u ] end if end for budget ← 1 // Each user gets a total budget of 1 // Sort in increasing order of the gap Γ -H [ u ] G ← sort( G ) // Let u 1 , u 2 , . . . , u | G | be the sorted order for j = 1 to | G | do if G [ u j ] ≤ budget then H [ u j ] ← H [ u j ] + G [ u j ] budget ← budget G [ u j ] else H [ u j ] ← H [ u j ]+ budget break end if end for
```

Suppose there are N user let u 1 and u 2 be two items in the universe. We will denote the weight of item u after user

10%

· 101

10°

log-log plot of word rank by user count (row sums)

10°

i 's contribution as H ( i ) [ u ] . Suppose user i has only item u 1 while users i +1 , i +2 , . . . , N have both items. Let H 1 be the histogram generated with all N users while H 2 be the histogram generated without user i . Let ∆ 0 = 2 and H ( i -1) [ u 1 ] &lt; H ( i -1) [ u 2 ] &lt; 1 + H ( i -1) [ u 1 ] . According to the greedy update described in Algorithm 3, in H 1 , user i will add weight 1 to u 1 and users i +1 , i +2 , . . . , N will also to u 1 since H ( i ) [ u 1 ] &gt; H ( i ) [ u 2 ] . In H 2 , users i +1 , i + 2 , . . . , N will add to u 2 since H ( i -1) [ u 1 ] &lt; H ( i -1) [ u 2 ] . This process is described in figure 3. Therefore the glyph[lscript] 1 -sensitivity of the histogram built using Greedy Policy update (Algorithm 3) can be Ω(Γ , N ) .

## F. Dataset Details

Using a log-log scale, the frequency of users for each unigram vs. the rank of the unigram is linear (Figure 4). In other words, the lowest ranked (most common) unigrams are used by almost all users while the highest ranked (least common) unigrams are used by very few users.

Figure 4. Frequency (i.e. number of users who use the unigram) vs. rank of the unigram (based on frequency) on a log-log scale. This linear relationship shows that the frequency of unigrams among users also follows Zipf's law (power law), i.e., count ∝ 1 / rank α for some constant α &gt; 0 . The α in this case is ≈ 1 .

<!-- image -->

The distribution of how many unigrams each user uses also follows a long tail distribution. While the top 10 users contribute between 850 and 2000 unique unigrams, most users (93.1%) contribute less than 100 unique unigrams. Table 1 summarizes the percentage of users with a unique vocabulary smaller than each threshold T provided.

Table 1. Percentage of users with unique unigram count of less than or equal to T. The vast majority of user have less than 100 unique unigrams.

| THRESHOLD (T) USERS WITH &#124; W i &#124; ≤ T | THRESHOLD (T) USERS WITH &#124; W i &#124; ≤ T |
| ---------------------------------------------- | ---------------------------------------------- |
| 1                                              | 2.78%                                          |
| 10                                             | 29.82%                                         |
| 50                                             | 79.16%                                         |
| 100                                            | 93.13%                                         |
| 300                                            | 99.59%                                         |

## G. Additional Experiments

## G.1. Multiple passes through each user

In the experiments described thus far, each user contributes items once within the budget constraints. We also investigate whether the output of set union increases in size when each user contributes the same budget over multiple passes (e.g. user 1 contributes half of their budget each time over 2 passes), we compare POLICY LAPLACE and POLICY GAUSSIAN outputs. Table 2 summarizes the results showing that there is not strong evidence suggesting that running multiple passes through the users improves the size of the output set.

## G.2. Selecting α : parameter to set threshold Γ

Figure 5 shows the number of unigrams released by POLICY LAPLACE and POLICY GAUSSIAN for various values of α . We observe that the number of unigrams released increases sharply until α = 4 , then remains nearly constant and then slowly decreases. This choice of α only affects the policy algorithms since the weighted and count algorithms do not use a threshold.

Figure 5. Number of unigrams released for various values of α . The number of unigrams released increases sharply until about α = 2 , then remains nearly constant and then decreases. Here we fixed ∆ 0 = 100 and ε = 3 .

<!-- image -->

Table 2. Count of unigrams released POLICY LAPLACE and POLICY GAUSSIAN algorithms for single and double passes over users. Results are averaged and rounded across 5 shuffles of user order. The privacy parameters are ε = 3 and δ = exp( -10) . α = 2 is chosen for the threshold parameter. Significant p-values for a two-sided independent t-test are bolded.

| POLICY LAPLACE | POLICY LAPLACE | POLICY LAPLACE | POLICY GAUSSIAN | POLICY GAUSSIAN | POLICY GAUSSIAN | POLICY GAUSSIAN |
| -------------- | -------------- | -------------- | --------------- | --------------- | --------------- | --------------- |
| ∆ 0            | 1 PASS         | 2 PASSES       | P-VAL           | 1 PASS          | 2 PASSES        | P-VAL           |
| 1              | 4236 ± 14      | 4257 ± 17      | 0.083           | 3135 ± 25       | 3131 ± 20       | 0.829           |
| 10             | 12452 ± 31     | 12389 ± 17     | 0.008           | 10784 ± 22      | 10817 ± 54      | 0.293           |
| 50             | 15056 ± 35     | 15080 ± 21     | 0.262           | 15763 ± 33      | 15809 ± 45      | 0.139           |
| 100            | 14562 ± 50     | 14567 ± 24     | 0.846           | 14562 ± 50      | 14568 ± 24      | 0.846           |
| 200            | 14005 ± 33     | 13979 ± 31     | 0.271           | 14005 ± 33      | 13979 ± 31      | 0.271           |
| 300            | 13702 ± 37     | 13678 ± 47     | 0.448           | 13702 ± 37      | 13678 ± 47      | 0.447           |

## THE EFFECT OF ε

Figure 6. Number of unigrams released for various values of ε . Here we fixed ∆ 0 = 100 and α = 5 .

<!-- image -->

We use ε = 3 for the experiments in table 1. At this value of ε our policy algorithms perform much better than previous count and weighted algorithms. To check whether this result holds with smaller ε , we also run these algorithms on various values of ε . Figure 6 shows that for ε ≥ 1 our policy algorithms always perform better.

## References

- Balle, B. and Wang, Y.-X. Improving the gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning , pp. 403-412, 2018.
- Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference , pp. 265-284. Springer, 2006.