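Is there an encryption algorithm/method that both encrypts and compresses text?
So I am trying to find an encryption method that not only obfuscates text, but also compresses the result.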
For example, if I encrypted `ninechars`, the ideal result would be less than nine characters.

Otherwise, if this is not possible, is there a method where I can set a limit on the number of characters in the result? In other words, no matter how long the input is, the result will always be 50 characters long.
The closest results that I have found are Caesar Encrypt and XOR Encrypt, but they seem to simply result in the same number of characters as the original input. The others I have found increase the size.

TL;DR I pretty much need a “hash function” that is reversible.
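> So I am trying to find an encryption method that not only obfuscates text, but also compresses the result. **For example, if I encrypted `ninechars`, the ideal result would be less than nine characters.**

Even without encryption, no reversible data compression scheme can shorten all of its inputs. This is easy to prove using the pigeonhole principle: for any given length $ \ell $, there are fewer strings of length strictly less than $ \ell $ than there are strings of length at most $ \ell $. Thus, if a compression scheme maps all strings of length at most $ \ell $ to strings of length less than $ \ell $, it must map some distinct input strings to the same output string, and is therefore not reversible.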
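(In fact, a similar argument can be used to show the stronger result that, if a lossless compression scheme makes any input string shorter, then it must also make some input string longer. I'll leave proving this as an exercise.)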
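The counting behind the pigeonhole argument is easy to check numerically. The following is only an illustrative sketch: the function names and the choice of a 26-letter alphabet are assumptions made for the example, not something taken from the question.

```python
def count_strings_up_to(max_len: int, alphabet_size: int) -> int:
    """Number of strings of length 0..max_len over the given alphabet."""
    return sum(alphabet_size ** n for n in range(max_len + 1))


def collision_is_forced(length: int, alphabet_size: int) -> bool:
    """True if a scheme that shortens every input of length <= `length`
    has more possible inputs than outputs, so two inputs must collide."""
    inputs = count_strings_up_to(length, alphabet_size)       # length <= l
    outputs = count_strings_up_to(length - 1, alphabet_size)  # length <  l
    return inputs > outputs


if __name__ == "__main__":
    # Assumed example: 26 lowercase letters, nine characters as in `ninechars`.
    print(count_strings_up_to(9, 26), "possible inputs of length <= 9")
    print(count_strings_up_to(8, 26), "possible outputs of length < 9")
    print("collision forced:", collision_is_forced(9, 26))
```

The gap between the two counts is exactly the number of strings of length $ \ell $ itself, so the inputs always outnumber the outputs regardless of the alphabet size.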
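> Otherwise, if this is not possible, is there a method where I can set a limit on the amount of characters in the result? In other words, no matter how long the input is, the result will always be 50 characters long. TL;DR I pretty much need a “hash function” that is reversible.
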
By a similar argument, this is also impossible. There is a finite, fixed number of strings of length 50 (namely, $ k^{50} $ for a $ k $-letter alphabet), whereas the number of possible input strings of unbounded length is infinite. Thus, not only does any function mapping arbitrary unbounded inputs to 50-character outputs need to map some distinct inputs to the same output, but it actually has to map infinitely many inputs to some output.
In fact, to show that such a function cannot be reversible, it’s enough to consider only inputs that are 51 characters long. Clearly, there are more 51-character inputs than there are 50-character outputs, so some distinct inputs have to map to the same output.
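To make this concrete on a scale that can be enumerated, here is a small sketch that shrinks the problem to a two-letter alphabet and a 2-character output limit (standing in for 50). The function `toy_fixed_length` is a hypothetical stand-in for any fixed-length "reversible hash" one might propose; whatever function is plugged in, a collision among the 3-character inputs is unavoidable.

```python
import hashlib
from itertools import product

ALPHABET = "ab"   # assumed toy alphabet (k = 2)
OUTPUT_LEN = 2    # stands in for the 50-character limit


def toy_fixed_length(text: str) -> str:
    """Hypothetical scheme: maps any input to exactly OUTPUT_LEN letters."""
    digest = hashlib.sha256(text.encode()).digest()
    return "".join(ALPHABET[b % len(ALPHABET)] for b in digest[:OUTPUT_LEN])


def find_collision(fn, input_len: int):
    """Return two distinct inputs of length `input_len` with equal outputs."""
    seen = {}
    for chars in product(ALPHABET, repeat=input_len):
        text = "".join(chars)
        out = fn(text)
        if out in seen:
            return seen[out], text
        seen[out] = text
    return None  # only possible if the inputs do not outnumber the outputs


if __name__ == "__main__":
    # 2**3 = 8 inputs but only 2**2 = 4 outputs, so a pair must exist.
    first, second = find_collision(toy_fixed_length, OUTPUT_LEN + 1)
    print(first, second, "->", toy_fixed_length(first))
```

Since both colliding inputs produce the same output, no decoder can tell which of them was meant.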
Of course, if you allow some inputs to increase in size, then you can make other inputs shorter. This is basically what ordinary data compression algorithms like LZW do — they rely on the fact that most of the possible inputs to the compressor look essentially like random noise, whereas most of the typical inputs (like plain text, program code, uncompressed image or audio data, etc.) have a lot of repetitive, non-random structure. Such repetitive data can be encoded more compactly, making it a lot shorter; meanwhile, if the input does not happen to be repetitive enough to compress well, the compressor will just insert a short marker (at a minimum, a single bit, but in practice usually a few bytes) indicating that the data could not be compressed, and then include the input verbatim. Thus, the compressor can compress some common types of inputs significantly, at the cost of very slightly increasing the length of the random-looking inputs that make up the bulk of the full input space.
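This trade-off is easy to observe with a general-purpose compressor. The sketch below uses Python's `zlib` module, which implements DEFLATE rather than the LZW mentioned above, so treat it as an illustration of the same principle and not of LZW itself. The sample text and the seeded pseudo-random "noise" are assumptions chosen for the example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data))


# Highly repetitive input, typical of structured real-world data.
repetitive = b"plain text tends to repeat itself. " * 100

# Random-looking input of the same length (seeded so the run is repeatable).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

if __name__ == "__main__":
    print("repetitive:", len(repetitive), "->", compressed_size(repetitive))
    print("noise:     ", len(noise), "->", compressed_size(noise))
```

The repetitive input shrinks to a small fraction of its size, while the noise comes out a few bytes longer than it went in, which is the small overhead described above.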
Anyway, none of this really has anything to do with cryptography. If you want to both compress and encrypt data, the standard way is to first compress it, and then encrypt it. (You cannot compress it after encryption, since encrypted data does look random to anyone who doesn’t know the key, and has no apparent structure that a compressor could exploit.)
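The ordering can be sketched as follows. Note the assumptions: `toy_xor` is a deliberately simplistic stand-in for a real cipher (a SHA-256 counter keystream with no nonce and no authentication), used only because it needs nothing beyond the standard library. It is not secure and should not be used for real data; in practice a vetted authenticated cipher would take its place.

```python
import hashlib
import zlib


def toy_keystream(key: bytes, length: int) -> bytes:
    """TOY keystream: SHA-256 over key + counter. Illustration only, not secure."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_xor(key: bytes, data: bytes) -> bytes:
    """XOR data with the toy keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, toy_keystream(key, len(data))))


def compress_then_encrypt(key: bytes, plaintext: bytes) -> bytes:
    return toy_xor(key, zlib.compress(plaintext))


def decrypt_then_decompress(key: bytes, ciphertext: bytes) -> bytes:
    return zlib.decompress(toy_xor(key, ciphertext))


if __name__ == "__main__":
    key = b"example key"                      # assumed demo key
    text = b"the quick brown fox " * 100      # assumed demo plaintext
    good = compress_then_encrypt(key, text)
    bad = zlib.compress(toy_xor(key, text))   # wrong order: encrypt, then compress
    print(len(text), "->", len(good), "(compress, then encrypt)")
    print(len(text), "->", len(bad), "(encrypt, then compress)")
```

Compressing first keeps the size benefit, whereas compressing the random-looking ciphertext gains nothing and adds a few bytes of overhead.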
As other answers have pointed out, though, even this standard method has risks. One risk comes from the fact that any encryption scheme that can handle arbitrarily long messages must necessarily reveal at least some information about the length of the message. This may allow an attacker to, say, distinguish the message `YES` (three characters) from the message `NO` (two characters).

This is true even without compression, and needs to be kept in mind when designing any cryptosystem, but adding compression to the mix makes it harder to predict and defend against. For example, even if you were careful to always use fixed-length messages (like, say, `POSITIVE` and `NEGATIVE` instead of `YES` and `NO`), it’s quite likely that running such messages through a compressor will produce output whose length varies.

More importantly, if the attacker can control some of the data being compressed and encrypted, they may be able to learn information about the other parts by observing the length of the compressed message. For example, let’s say the attacker can make you generate, compress and encrypt messages of the form `TO <ID>: STATUS <POSITIVE/NEGATIVE>`, where `<ID>` can be controlled by the attacker. The attacker may then be able to request two encrypted messages, one for `<ID>` = `POSITIVE` and one for `<ID>` = `NEGATIVE`, and see which one compresses better.

Of course, these are all just toy examples, but similar weaknesses have led to real attacks. For example, as mentioned in otus’s answer, the CRIME and BREACH attacks on SSL are based on the same principle as the chosen-plaintext toy attack described in the previous paragraph. Even the passive attack described earlier above, based on just observing naturally occurring message lengths, may be used to e.g. eavesdrop on encrypted VoIP conversations.
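The toy chosen-plaintext attack can be sketched in a few lines. Some assumptions are baked in: the encryption step is left out entirely on the premise that the cipher preserves the length of the compressed message, `zlib` stands in for whatever compressor the victim uses, and the function names are made up for this example.

```python
import zlib

CANDIDATES = ("POSITIVE", "NEGATIVE")


def observed_length(attacker_id: str, secret_status: str) -> int:
    """Length an eavesdropper would see, assuming a length-preserving cipher
    (encryption itself is omitted; only the compressed length matters here)."""
    message = f"TO {attacker_id}: STATUS {secret_status}".encode()
    return len(zlib.compress(message))


def guess_status(length_oracle) -> str:
    """Inject each candidate as the ID and pick the one that compresses best."""
    lengths = {candidate: length_oracle(candidate) for candidate in CANDIDATES}
    return min(lengths, key=lengths.get)


if __name__ == "__main__":
    for secret in CANDIDATES:
        # The attacker controls only the ID; the status stays hidden from them.
        guess = guess_status(lambda chosen_id, s=secret: observed_length(chosen_id, s))
        print("secret:", secret, "guessed:", guess)
```

When the injected ID equals the secret status, the compressor can replace the second occurrence with a back-reference to the first, so that message comes out shorter and the attacker learns the status without ever decrypting anything.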
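All that said, the lesson to take away here is not just that data compression is dangerous and should be avoided. Rather, it is that leakage of plaintext length is dangerous and that, since it cannot be completely avoided, it must be kept in mind when analyzing any cryptographic protocol. Beyond that, compression is an extra complicating factor, since it can make message length depend on its content in non-trivial and non-local ways.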