I built a small encrypted VoIP system based on SIP. After a few experiments I settled on a Freeswitch server and Linphone clients and configured the use of ZRTP, probably the only reasonably documented interoperable protocol for encrypted calls. ZRTP offers what is now commonly called "end-to-end" encryption: the call is encrypted and the key is only available to the clients. Getting the system up and running was surprisingly easy; making it secure turned out to be a challenge. The way SIP and ZRTP are implemented in Linphone and other free software clients allows the server to listen to the calls.
The protocols
Suppose Alice and Bob want to talk. They both register their phones with a SIP server. Whenever Alice wants to call Bob, she asks her SIP server to connect the call. Her server looks up how Bob can be reached and talks to the SIP server Bob is registered with. Together they set the call up: they exchange addresses and negotiate all the details, Bob's phone starts ringing and the call may begin. There may be several servers involved. I only consider the case where both Alice and Bob are registered with a single server; the number of servers does not really matter.
As the call is negotiated, several other protocols come into play. In particular, the actual voice packets are transmitted using RTP. The data only need to travel between Alice and Bob and can bypass the server in between. Because the server is setting everything up, it may choose a path that lets it see all the traffic. It is reasonable to assume that the SIP server involved in a call always sees all the data exchanged between Alice and Bob. It follows that the server can listen to the call unless the traffic is encrypted. This is what SRTP offers. How much security it provides depends on the way encryption keys are generated. One option is for the server to generate (or relay) the encryption keys while setting up the call (SDES). This variant is colloquially referred to as "SRTP". If the key exchange is performed over TLS and all the parties are authenticated, only Alice, Bob and the server(s) in between have access to the call in plain. This setup offers privacy roughly comparable to "ordinary" phone calls. Everything stays between Alice, Bob and the trusted (telecommunications) provider in the middle.
ZRTP
Public-key cryptography can help eliminate the need for the trusted party. Diffie-Hellman key exchange within ZRTP allows Alice and Bob to jointly compute encryption keys only known to them. If an adversary interferes with the key exchange, Alice and Bob end up with different keys. To make sure that no adversary was present, Alice and Bob compare Short Authentication Strings (SAS) derived from the respective keys.
If the SAS match, the presence of an attacker in the middle is unlikely. The SAS is 16 bits long, which translates to 65536 possible authentication strings. That is a very small number; there is no such thing as "16-bit security".
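The idea can be illustrated with a toy sketch. It is not ZRTP: the group parameters are far too weak, the SAS derivation (a truncated SHA-256 of the shared secret) is my own stand-in, and the hash commitment and everything else that real ZRTP adds is left out. All function names are made up for the illustration.

```python
import hashlib
import secrets

# Toy group: a 127-bit Mersenne prime and a small base.
# Assumption for illustration only; never use parameters like these.
P = 2**127 - 1
G = 3


def keypair():
    """Return a random private exponent and the matching public value."""
    private = secrets.randbelow(P - 3) + 2  # in the range [2, P - 2]
    return private, pow(G, private, P)


def shared_secret(own_private, peer_public):
    """Classic Diffie-Hellman: both sides arrive at G^(a*b) mod P."""
    return pow(peer_public, own_private, P)


def toy_sas(secret):
    """Stand-in for a 16-bit SAS: the first two bytes of a hash of the secret."""
    digest = hashlib.sha256(secret.to_bytes(16, "big")).digest()
    return int.from_bytes(digest[:2], "big")


def direct_call():
    """Alice and Bob exchange public values without interference."""
    a_priv, a_pub = keypair()
    b_priv, b_pub = keypair()
    return toy_sas(shared_secret(a_priv, b_pub)), toy_sas(shared_secret(b_priv, a_pub))


def intercepted_call():
    """Mallory substitutes her own public value on both legs."""
    a_priv, a_pub = keypair()
    b_priv, b_pub = keypair()
    m_priv, m_pub = keypair()
    alice_key = shared_secret(a_priv, m_pub)  # Alice believes this is Bob's value
    bob_key = shared_secret(b_priv, m_pub)    # Bob believes this is Alice's value
    mallory_keys = (shared_secret(m_priv, a_pub), shared_secret(m_priv, b_pub))
    return alice_key, bob_key, mallory_keys


if __name__ == "__main__":
    print("direct SAS pair:", direct_call())
    alice_key, bob_key, _ = intercepted_call()
    print("intercepted SAS pair:", toy_sas(alice_key), toy_sas(bob_key))
```

In the intercepted case Mallory knows both keys, while Alice and Bob hold different ones. Their 16-bit strings can then only agree by chance.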
The reason it works is that ZRTP gives the attacker at most one attempt at an attack. Their chances of success are 1/65536 and the odds of being detected are over 99.998%. Alice and Bob only need to compare the SAS once, namely during the first call. Subsequent calls mix in shared secrets retained from earlier calls. This also means that if the attacker is successful during the first call, they must be present during all the later calls. Alice and Bob find out once the attacker leaves.
Any call that has been switched to ZRTP stays that way until it ends. Although the protocol does support downgrading encrypted calls back to plaintext, this can only be initiated by Alice and Bob, not the server. The feature is optional and I have never seen it implemented.
An example call
I equipped Alice with Linphone Desktop 4.4.9 and Bob with Linphone for Android 4.6.13, both built on top of Linphone SDK 5.1.57, all versions current at the time of writing. After Alice & Bob enable ZRTP and make a first call, this is what they see:
Once they confirm that the SAS match, they are typically never prompted again. Subsequent calls remain encrypted. I captured a typical experience of Alice and Bob on video. The user interface Alice sees is on the left; Bob's screen is on the right. The sound recorded is the output played to Alice and Bob. Note that the avatars (and sound) appear to be swapped. Alice (left) hears Bob, he (right) in turn hears Alice.
ZRTP stripping
ZRTP is sometimes described as "opportunistic encryption". It is used if supported by both endpoints, otherwise the call goes ahead unencrypted. As the various protocol downgrade attacks have taught us (see HSTS), aiming for the "best encryption method available" may leave the system vulnerable to a familiar pattern of attack by active adversaries in the middle.
If Mallory, an attacker controlling the SIP server in the middle, can cause any trace of ZRTP to be dropped from the communication between Alice and Bob, the call will proceed in plain:
The above call took place after the SAS had been confirmed and with no change in configuration, i.e. ZRTP remains enabled on both sides. The endpoints attempt a ZRTP handshake and give up after not hearing back.
In order to mount the above "attack" it is sufficient to configure Freeswitch appropriately. Often ZRTP "stripping" corresponds to default behaviour of intermediate SIP servers and reliable ZRTP pass-through actually needs to be explicitly configured. Alice and Bob are likely to be familiar with ZRTP simply not working via a particular server. They are unlikely to expect this behaviour to change between calls.
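The difference between opportunistic and mandatory encryption can be sketched as a small policy function. This is a simplified model under my own assumptions: the method labels and function names are invented, and real ZRTP discovery happens inside the media stream rather than through a capability set. The set stands for whatever Mallory lets through.

```python
# Strongest method first. The names are labels for this sketch, not protocol tokens.
PREFERENCE = ("zrtp", "sdes", "none")


class EncryptionUnavailable(Exception):
    """Raised when a required method cannot be negotiated."""


def choose_encryption(local, remote_as_seen, required=None):
    """Pick a method from what we support and what the peer *appears* to support.

    remote_as_seen is whatever arrived over the network, i.e. after any
    filtering by a server in the middle.
    """
    common = [m for m in PREFERENCE if m in local and m in remote_as_seen]
    if required is not None:
        if required not in common:
            raise EncryptionUnavailable(required)
        return required
    # Opportunistic: best method available, silently falling back to plain.
    return common[0] if common else "none"


if __name__ == "__main__":
    alice = {"zrtp", "sdes", "none"}
    bob = {"zrtp", "sdes", "none"}
    print("untouched:", choose_encryption(alice, bob))
    # Mallory drops every trace of encryption support from Bob's side.
    print("stripped:", choose_encryption(alice, {"none"}))
    try:
        choose_encryption(alice, {"none"}, required="zrtp")
    except EncryptionUnavailable:
        print("mandatory policy: call refused")
```

With the opportunistic policy the stripped call goes ahead in plain and nothing fails. With the mandatory policy the same interference makes the call fail instead.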
The attacks
I modified Freeswitch to demonstrate how a malicious SIP server can wiretap calls between Linphone clients configured to use ZRTP. In particular, the server can downgrade established ZRTP calls and also abort or repeat ZRTP handshakes.
Call transfers
During a chat between Alice and Bob, it may turn out that Alice should be talking to Carol instead. Alice can terminate the call with Bob and call Carol. It can be convenient to automate this and SIP does indeed do so. Bob can initiate a transfer that results in a SIP message relayed to Alice via the SIP server. Alice's endpoint is instructed to call Carol and hang up the original call once the replacement is set up. Alice is then on the phone with Carol, Bob's leg of the call terminates.
If a client receives the Refer message, they dial a new destination and hang up their original leg. A related Replaces message accompanies an incoming call and suggests the client answer it and hang up the original leg. Both methods allow the SIP server to replace either leg of any call at will.
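For orientation, the two mechanisms correspond to the SIP REFER method with its Refer-To header and to the Replaces header carried by an INVITE. The sketch below formats only those header lines. The addresses, call ID and tags are made-up placeholders, the helper names are mine, and a real request needs further mandatory headers (Via, From, To, Call-ID, CSeq and so on).

```python
def refer_lines(target_uri, refer_to_uri):
    """Request line and Refer-To header asking target_uri to call refer_to_uri."""
    return [
        f"REFER {target_uri} SIP/2.0",
        f"Refer-To: <{refer_to_uri}>",
    ]


def replaces_header(call_id, to_tag, from_tag):
    """Replaces header naming the existing dialog the new call should take over."""
    return f"Replaces: {call_id};to-tag={to_tag};from-tag={from_tag}"


if __name__ == "__main__":
    # Placeholder values, not taken from any real capture.
    for line in refer_lines("sip:alice@example.org", "sip:carol@example.org"):
        print(line)
    print(replaces_header("abc123@example.org", "1111", "2222"))
```

Neither header carries anything that lets the receiving client tell a transfer requested by the other party from one made up by the server.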
It is a feature of SIP that the server has considerable control over the calls. In most scenarios, the server is trusted and it has access to the voice data in plain anyway. An adversary within the SIP server has little more to gain. Calls encrypted using ZRTP are a notable exception.
ZRTP to plain RTP
Imagine Mallory gets hold of the Freeswitch instance Alice and Bob are registered with and observes a ZRTP encrypted call. He can instruct Alice and Bob to call each other again by "transferring" them to the partner they are already talking to. Linphone on both ends will comply without asking the user. During the fresh calls, Mallory prevents a ZRTP handshake and listens to the call in plain. This is simply the ZRTP stripping attack executed after a delay. Even if Alice and Bob check that ZRTP encryption was established at the beginning of the call before discussing anything sensitive, Mallory can "tune in" later. If he concludes that nothing interesting is being discussed, he can turn ZRTP on again to avoid detection. In the video, the downgrade happens 10 seconds into the call and ZRTP is enabled again 10 seconds later:
The two legs are restarted using Refer, later a new ZRTP handshake is triggered by a re-INVITE. Instead of a single call, a sequence of two calls is established. Unless they watch their screens closely, Alice and Bob are unlikely to notice. Observe the effect the attack has on audio. Given no VoIP call is perfect, the short gaps are unlikely to cause much suspicion. While the first gap may interrupt one of the speakers, the second one can be timed to coincide with silence, because at that point Mallory is listening.
ZRTP to SRTP (SDES)
The encryption setting in Linphone applies to outgoing calls. If Alice enables ZRTP, Linphone still accepts incoming calls with SRTP encryption and SDES key agreement. This allows Mallory to force the use of SRTP instead of ZRTP. If the downgrade happens during a ZRTP call, this requires the use of Replaces messages towards both Alice and Bob. The risk of detection is almost non-existent here, because the user interface of Linphone does not differentiate between ZRTP and SRTP. All Alice and Bob see is that encryption is active. Mallory may as well force SRTP for the whole duration of the call. This makes the attack simpler, because Bob's leg does not need to be flipped at all. All Mallory has to do is negotiate SRTP with Bob right at the beginning. Once Bob answers, Mallory flips the direction of Alice's leg and offers her SRTP. This is what it looks like:
Here Mallory has the encryption keys and listens to the call. Observe how little the experience differs from the very first video of ZRTP working normally.
Man in the middle
Both attacks are easily prevented if ZRTP is made mandatory. Linphone does offer such a setting and it does indeed prevent both the attacks discussed so far. Yet Mallory can still listen to the call between Alice and Bob; he simply has to negotiate ZRTP with both of them. The short authentication strings displayed to Alice and Bob will almost certainly differ, but unless one of them decides to check the screen in the middle of a call, they may not notice the attack. Even if Mallory were lucky and ended up with identical SAS on both sides, the SAS prompt itself is a clear sign of an attack. It might still be worth the risk. Mallory can give up at any time and trigger a new direct ZRTP handshake between Alice and Bob. In the video this happens 10 seconds after the prompts are displayed.
The SAS prompt disappears from the desktop but persists on the mobile. Bob sees it once he hangs up the call. Mallory may as well keep listening to the entire call and terminate it once Alice and Bob are done talking. The attacker can hope that the SAS prompt will not be seen and both parties will think the other one is hanging up.
If ZRTP encryption is mandatory, the attack is no harder to execute. It is only harder to hide. Even if Alice and Bob do find out, they do so after the attack.
Abort & retry
In the above example, multiple ZRTP handshakes take place. The call starts as usual, Alice and Bob use ZRTP to establish a key hidden from Mallory. Then the attacker "restarts" both legs of the call and establishes separate keys with Alice and Bob. Imagine these two phases executed in the reverse order. First attempt a man-in-the-middle attack, then let Alice and Bob establish a direct (and secure) ZRTP session.
In particular, this can be attempted during the very first call between Alice and Bob. Mallory's chances of success (matching SAS) remain 1/65536, but the odds of getting caught drop to zero. The ZRTP handshakes can be arranged in a way that allows Mallory to compute the short authentication strings before Alice and Bob. If the two strings turn out not to match, the attacker can abort the attempt, "restart" the call and let Alice and Bob establish a secure ZRTP session. Mallory can of course also repeat the attack to increase the chances of success while keeping detection almost impossible, especially if it all happens at the beginning of a call.
Imagine Mallory attempts a man in the middle attack twice at the beginning of every single ZRTP call he observes. He will get lucky on average once per 32768 calls. Alice and Bob will be prompted to compare the short authentication strings and will see a match despite the active attack underway. What one ends up with is essentially 16-bit security (i.e. none), because the odds of getting caught went from over 99.998% to (almost) zero. With some patience, the man in the middle attack can be mounted such that only one additional SAS confirmation dialog is ever displayed to Alice and Bob. If they confirm the match, they will each have established shared keys with Mallory that can be used to wiretap future calls (of his choice) without Alice or Bob ever noticing.
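The arithmetic can be checked with a few lines of Python. The sketch assumes each attempt matches independently with probability 1/65536 and that failed attempts are never noticed, as described above. The function names are mine.

```python
from fractions import Fraction

SAS_VALUES = 2 ** 16  # number of distinct 16-bit authentication strings


def success_per_call(attempts_per_call):
    """Probability that at least one of the attempts yields matching SAS."""
    miss = Fraction(SAS_VALUES - 1, SAS_VALUES)
    return 1 - miss ** attempts_per_call


def expected_calls_until_success(attempts_per_call):
    """Mean number of calls before the first success (geometric distribution)."""
    return 1 / success_per_call(attempts_per_call)


if __name__ == "__main__":
    for attempts in (1, 2, 4):
        calls = float(expected_calls_until_success(attempts))
        print(f"{attempts} attempt(s) per call: about {calls:.0f} calls on average")
```

With a single attempt per call the expected wait is exactly 65536 calls. With two attempts it drops to roughly 32768, and every further retry shortens it again without making detection more likely.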
Other softphones
I first tried the trick on Linphone, because that is the software I have been happily using for years. I tested other programs I was familiar with, notably baresip, Jitsi Desktop (legacy) and CSipSimple (the last two have not been updated for years). All three programs are susceptible to at least one of the attacks.
I have also found several softphones that appear not to be affected. Either they do not support the Replaces header or require user confirmation for call transfers (by default). I do wonder whether the attacks were prevented by mere coincidence or due to deliberate design decisions. If you ever implemented ZRTP, do let me know whether you considered the scenarios I describe.
Impact
The main benefit ZRTP offers over SRTP is "end-to-end" encryption. The way the protocol is implemented in Linphone means that the benefit mostly disappears. Where such ZRTP remains marginally better is the resistance against (passive) eavesdropping. In the case of SRTP, even an attacker with read-only access to the plaintext within the (TLS encrypted) signalling channel sees the key and can decrypt the media. This weaker attacker cannot do much if ZRTP is being used, thanks to the use of public-key cryptography.
Security against an active man in the middle is precisely what ZRTP was designed for. If only passive eavesdropping were of concern, the protocol could be made a lot simpler. There would, for example, be no need for the short authentication strings.
The Linphone version of ZRTP is not much better than SRTP: it is only secure if the intermediate SIP server is trusted. In a sense, this kind of ZRTP can be worse than SRTP, because the security properties Alice and Bob expect are simply not present. Nobody would use SRTP via an untrusted server. ZRTP is believed to be different; it should reliably detect (and therefore prevent) active man in the middle attacks. Alice and Bob may end up paying little attention to the trust they have in the server(s), as it is not something ZRTP would rely on.
Conclusion
In theory, SIP combined with ZRTP allows Alice and Bob to talk in private. In practice, many details matter. SIP predates ZRTP and ZRTP is not specific to SIP. Even if both are implemented correctly, the interaction of the two protocols can compromise security. I have not "broken" any of the protocols involved, in particular not ZRTP. I simply managed to circumvent some implementations of ZRTP using (automated) call transfers and a sequence of separate calls (as far as the protocols and software are concerned) perceived by Alice and Bob as a single call.
Getting cryptographic protocols to work is no easy task. Achieving any meaningful security is harder. Consider talking to a cryptography professional.