A neat data sharing technique


We needed to share some data with another company, and this related to Credit card transactions. But we did not want to share the actual card numbers (PANs), what to do. What we came up with is quite neat, and can probably be used by others.

The external company collects the card numbers they want information on, they encrypt these with RSA with a key they generate and do not share. They send these encrypted numbers to us, we further encrypt them with our own RSA key and jumble the order of the entries, then send them back. So now they have a set of PANs double encrypted.

We perform an extract of the relevant transactions and encrypt the PANs with our RSA key, and send these as well. Now the recipients of these can encrypt these with their key and because RSA is a commutative function, can match up the two sets to see if the PANs they sent to us were used in the extracted transactions.

We have added a daily salt to these encryptions so that correlations can’t be used to work out which encrypted PANs map to the original PANs, and we bulk up the transactions so that individual transactions cannot be identified.

A friend of mine wrote up the proof of this:

Let Nv = our Key Modulus

Let E = our Operand (a number greater than 1, which is carefully chosen…)

So our Public Key, Vp = the couplet (Nv, E)

[Ignoring the Private Key as it’s not important…]

Similarly

Let Ng = their Key modulus

F = their operand

So their public key Gp = couplet (Ng, F)

Let X = a PAN

Let Encrypt(K,M) be the RSA encryption algorithm of encrypting message M using key K

To encrypt the PAN using our Public Key: Cv = Encrypt( Vp, X)

This is actually Cv = X^E mod Nv

Then encrypt again using their Public Key: Cvg = Encrypt(Gp, Cv)

This is actually Cvg = Cv^F mod Ng

Similarly, encrypt the PAN using their Public Key: Cg = Encrypt (Gp, X)

This is actually Cg = X^F mod Ng

Then encrypt again using our Public Key: Cgv = Encrypt(Vp, Cg)

This is actually Cgv = Cg^E mod Nv

So Cvg = CvF mod Ng

= (X^E mod Nv)F mod Ng

= X^EF mod Nv mod Ng

= X^FE mod Ng mod Nv

= (X^F mod Ng)^E mod Ny

= Cg^E mod Ny

= Cgv
i.e.

1. it doesn’t matter whether we encrypt with our key first or second, we get the same answer.

2. this means that the RSA algorithm is a commutative encryption algorithm

And so if we produce a value Cvg and they produce value Cgv and the values are the same, we can deduce that both organisations encrypted the same PAN, and nobody is actually sharing any PANs in the process.

Advertisements

9 Responses to A neat data sharing technique

  1. Dale says:

    Could not PANs be inferred if sufficiently small number of account numbers were sent in? Or a series of requests with slightly different account number sets?

    • Stuart says:

      For the application of finding the real world impact of online advertising, the number of ad views is going to be quite large, so reversing that is unfeasibly large problem. The responses will be a much smaller set, so that is why we have bulked up these responses, so we only indicate that this group of PANs made a total of this spend, not listing individual transactions. Remember that there might be multiple items purchased in a single transaction as well, so it is a inference that these are the item advertised.

  2. Cryptographer says:

    You must be very careful with homomorphic encryption.

    1) What does the shuffling add here? What happens if I send the E(card1), E(card2), E(card2), E(card3), E(card3), E(card3)?

    Remove duplicated entries.

    2) What prevents either party from pretending to hold a few hundred additional cards per round? The scan can occur over a long period of time and may be targeted to a specific bank.

    Rate limitations. Do not permit too many queries.

    3) The RSA public key may be recovered which enables the adversary to perform the search for (2) offline. I do not know how many samples the adversary must have to recover the public key.

    Use distinct keys per round. This also eliminates the need for the salt / deterministic-padding. A new RSA key per day, or per hour, should not be an issue.

    Finally RSA is slow, ElGamal is fast.

    https://en.wikipedia.org/wiki/Homomorphic_encryption

    • Stuart says:

      1) yes we are using a random shuffle, and if you did include multiple entries for a single card, then you could determine the originating card number from a frequency analysis. The obvious fix for that would be to sort the encrypted values with the unique option.
      2) well because we are adding a daily salt, results are only useful for a specific day run.
      3) reversing a RSA encryption is a well studied problem, and we believe we are using long enough keys for that to not be feasible.

      • Cryptographer says:

        1) That “obvious fix” was my suggestion. I asked instead if you are removing these duplicates? And in either case, what security does the shuffle provide? If I have just a single entry, you cannot shuffle it. It may also be possible to leak the position through using different RSA keys when encrypting.
        2) You misunderstand the problem. This scan does not care about old results – it only scans new guesses against the pretty-constant list the innocent party produces. Imagine if a service was compromised where the last 4 digits are revealed (or were revealed during an account recovery; Amazon or Apple). If the exact bank is known (or guessed from a region) we have 6 digits. 6 + 4 – 1 (checksum) digits are known.
        This leaves only 8 digits left and is reduced to 7 thanks to the checksum (10^8/10 = 10^7). 10 million queries (per bank prefix guessed). For an emulated set of [only] 1 million users this would take only 10 days to iterate for a single needle – an adversary is probably interested in many needles which is conveniently only increasing their odds of finding a common card of their interest.
        You may detect common cards disappearing from the tested sets, but you cannot tell the difference of two failed attempts between days.
        The scan may be used to recover candidate card numbers for attacks against other systems or to simply gather information about cards the other party does not yet know about.
        Another reason to using distinct RSA keys is forward secrecy. If the key is later recovered – all recorded encrypted card numbers may be decrypted. Mitigated later with (a) and (b).
        3) Reversing the *secret* key is hard. Reversing the *public* key is much easier. If the public key is recovered then (2) can be accelerated offline. To make (3) more difficult I recommend using a different key per day. This is MUCH stronger than the daily salt. The online-security of your protocol relies on secrecy of a public number. This is not a good property – even if kept secret.
        The protocol you have described may be better than revealing the card numbers directly; but it does not protect against active attacks – especially at the scale we’re talking about. The primary problem here is that the credit card number-space is too small and guessable.
        Below I cover some alternatives that provide either better security or increase the costs to adversaries.
        Credit cards are effectively passwords. Weak and valuable. You should avoid sending and storing them in a recoverable fashion (such as RSA encrypted).
        a) Harden the card number to some preshared (possibly public) key. A password-hashing function such as scrypt or argon2i will increase the costs of an exhaustive search significantly.
        A good party only needs to perform the expensive computation once, cache it and throw away the raw card number.
        This computation may occur offline where the raw card number is never shared with the network-attached service for added isolation.
        `hardened_card = argon2i(salt = preshared key, card number, high RAM, high CPU)`
        If it takes 1 second per card assuming a single-core. It will take 11 days per million cards. Or 33 days for 3-seconds. This may be tuned for the chosen security margin and many cores may be used. The one-time cost to bootstrap this hardened cache may take a month, but it is relatively cheap for each new card.
        b) Establish a forward-secret session key (TripleDh/ECDH). HMAC the hardened card against this key.
        `session_key = H(dh(alice ephemeral, bob static), dh(bob ephemeral, alice static), dh(alice ephemeral, bob ephemeral))`
        With regard to forward secrecy using an ephemeral RSA key will also provide this. Both RSA and ECDH are vulnerable to quantum adversaries. ECDH is significantly cheaper.
        c) If false positives are allowed – as for statistical advertising it wouldn’t hurt to throw in a little noise. You may use a bloom-filter over the results from (a). This may be compressed and deltas may be applied over time.
        If combined with (b), deltas may not be applied.
        d) If there may exist a trusted third party. The first and second parties may combine (a) and (b) and send the per-session challenge to a third party who reveals the intersection or count to the first and third parties.
        The third party cannot reproduce the inputs as they do not know the session key, nor the card numbers. The third party knows the true intersection and may inject false positives and/or false negatives. They must return the same intersection result to both parties, otherwise, they may detect the inconsistency.
        The trust in this third party lays entirely in the secrecy of the difference of the two sets.
        Even if the third party colludes with either party, sharing the difference, they must still perform the expensive exhaustive search, hardened by (a).
        `challenge = HMAC(session_key, hardened_card)`
        Both parties must already trust the DNS and PKI systems to secure the credit card information from the clients.
        It is important to note that although your homomorphic intersection solution works, considering this as set-intersection may lend to better alternatives than partial-homomorphic encryption.
        (a) and (b) may be implemented on top of your existing RSA solution. (c) and (d) are complete alternatives with some different properties that continue to rely on the hardening of (a) and (b).
        I personally prefer (a) + (b) + (c) which does not rely on RSA. Is entirely symmetric (modulo forward-secret handshake), has reduced network load and the offline exhaustive search is expensive, while the online attack is slower.

      • Cryptographer says:

        Recap – Stuart, please do not TL;DR. Updated addition below.
        (1) The security of the shuffle.
        (2) Online exhaustive search.
        (3) Offline exhaustive search.
        (a) Make (3) hard, such that (2) is slower than (3).
        (b) Add forward secrecy such that the key compromise does not reveal the card numbers directly.
        (c) Use a bloom-filter to compress the resulting hashes from (a), maybe unique per session as per (b).
        (d) Use a set-intersection service. Either a trusted third party, or the existing RSA implementation described in this article, or some alternative better suited to solving this exact problem – possibly secret sharing.
        (d. continued) One simple solution is to truncate the hashes and have each party disclose one segment. I.e. Alice sends the first N bits and Bob sends the N bits starting from the 1+Nth bit. Finally, optionally (to mitigate false positives) reveal the hash of the hash to provide a commitment to the complete entry.

  3. Cryptographer says:

    4) I would be surprised if RSA is not vulnerable to chosen plaintext or chosen ciphertext attacks under this usage.

  4. Stuart says:

    Steve Gibson talked about this blog post on his show Security Now, the discussion starts at 1:48:25

  5. Stuart says:

    It seems that the Electronic Privacy Information Center want to understand this. https://www.cnet.com/google-amp/news/google-told-to-fess-up-about-how-it-tracks-what-you-buy-in-stores/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: