Voice over IP (VoIP) is the transmission of voice information across an IP network, for example the Internet. Its classical application is as an alternative to the telephone. This thesis, however, is about using VoIP in networked virtual environments.
The Internet Protocol (IP) is part of the TCP/IP architecture. The protocol itself offers only a best-effort service: packets can be delivered out of order, corrupted, duplicated or not at all. In addition, the time a packet needs to reach its destination varies from packet to packet. Applications normally do not use IP directly but rely on higher-level protocols: TCP, which offers a reliable byte stream service, and UDP, which offers an unreliable datagram service similar to that of IP.
The speech signal is transmitted by digitising tiny pieces of it at regular intervals and sending these to the destination, where an analogue signal is reconstructed. For good quality communication, the overall delay should be below 200 ms. Variation in delay, known as jitter, should be removed through buffering at the receiver. Speech communication is fairly tolerant of lost or corrupted packets.
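The buffering mentioned above can be illustrated with a minimal playout buffer. This is only a sketch under stated assumptions, not the buffering scheme used in this thesis: the class name `JitterBuffer`, the `depth` parameter and the use of a plain integer sequence number (without wrap-around) are all hypothetical choices made for the example.

```python
import heapq


class JitterBuffer:
    """Reorders packets by sequence number and delays playout by `depth` packets.

    Assumption: sequence numbers are plain increasing integers (no wrap-around).
    """

    def __init__(self, depth=3):
        self.depth = depth          # packets to collect before playout starts
        self._heap = []             # min-heap of (seq, payload)
        self._queued = set()        # sequence numbers currently in the heap
        self._next_seq = None       # lowest sequence number still acceptable
        self._primed = False        # True once the initial buffering is done

    def push(self, seq, payload):
        """Store an arriving packet; drop duplicates and packets that are too late."""
        if seq in self._queued:
            return
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))
        self._queued.add(seq)

    def pop(self):
        """Return the next payload in sequence order, or None while buffering."""
        if not self._primed:
            if len(self._heap) < self.depth:
                return None
            self._primed = True
        if not self._heap:
            self._primed = False    # underrun: start buffering again
            return None
        seq, payload = heapq.heappop(self._heap)
        self._queued.discard(seq)
        self._next_seq = seq + 1
        return payload
```

A larger `depth` absorbs more jitter but adds directly to the overall delay, so the value has to be weighed against the 200 ms budget.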
When the digitised speech signal is left uncompressed, a bandwidth of 64 kbps is needed for telephone quality communication. Various compression techniques can reduce this amount. The most successful among them model how the speech was produced rather than the signal itself. Standardised compression schemes allow interoperability between applications.
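The 64 kbps figure can be checked with a short calculation. The sketch below assumes the usual telephone-quality parameters of 8000 samples per second and 8 bits per sample; these values, the 20 ms packet interval and the function names are assumptions made for the example and are not stated in the text above.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM audio in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def payload_bytes_per_packet(bitrate_bps, packet_ms):
    """Number of payload bytes needed to carry `packet_ms` milliseconds of audio."""
    return bitrate_bps * packet_ms // (8 * 1000)


telephone_bps = pcm_bitrate_bps(8000, 8)
print(telephone_bps)                                # 64000
print(payload_bytes_per_packet(telephone_bps, 20))  # 160
```

Under these assumptions, each 20 ms packet carries 160 bytes of speech data before any protocol headers are added.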
TCP is not a good choice for transmitting the speech data: it has many features which are unnecessary for VoIP but which increase the overall delay. UDP by itself is too simple, but it can be extended: this is how the Real-time Transport Protocol (RTP) is used in the TCP/IP architecture. RTP provides the information needed for synchronisation, for flow and congestion control, and for identification of the source. To provide some quality of service (QoS) guarantees, resources can be reserved by using RSVP, the Resource Reservation Protocol.
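To make the idea of extending UDP concrete, the following sketch builds and parses the 12-byte fixed RTP header defined in RFC 3550, which is placed in front of the speech data inside a UDP datagram. It is an illustration only and not the RTP library developed in this thesis: the function names are hypothetical, and header extensions, padding and contributing sources are left out.

```python
import struct

RTP_VERSION = 2
RTP_HEADER_FORMAT = "!BBHII"   # network byte order, 12 bytes in total


def build_rtp_packet(payload, seq, timestamp, ssrc, payload_type=0, marker=False):
    """Prefix `payload` with a fixed RTP header (no padding, extension or CSRCs)."""
    byte0 = RTP_VERSION << 6                              # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)    # marker bit + payload type
    header = struct.pack(
        RTP_HEADER_FORMAT,
        byte0,
        byte1,
        seq & 0xFFFF,              # 16-bit sequence number: detects loss and reordering
        timestamp & 0xFFFFFFFF,    # 32-bit sampling timestamp: used for synchronisation
        ssrc & 0xFFFFFFFF,         # 32-bit synchronisation source: identifies the sender
    )
    return header + payload


def parse_rtp_header(packet):
    """Decode the fixed header fields of an RTP packet into a dictionary."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(RTP_HEADER_FORMAT, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }
```

The resulting bytes would be handed to an ordinary UDP socket. The sequence number and timestamp give the receiver what it needs to reorder packets and to schedule their playout.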
For VoIP in virtual environments, speech data has to be sent to several destinations. This can be done efficiently by using multicasting. When a packet arrives at the receiver, the voice signal is extracted and a 3D effect is added to it, corresponding to the position of the sender. A sound appears to be localised because of differences between the signals reaching the two ears (interaural differences). These differences can be captured in Head-Related Transfer Functions (HRTFs), which can then be used to recreate localised sounds.
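One of the interaural differences, the difference in arrival time, can be approximated without measured HRTFs. The sketch below uses Woodworth's spherical-head formula, ITD = (a / c)(θ + sin θ), which is a swapped-in simplification and not the HRTF-based method described above. The head radius, the speed of sound, the sign convention (positive azimuth means the source is to the right) and the function names are all assumptions made for the example.

```python
import math

HEAD_RADIUS_M = 0.0875        # assumed average head radius
SPEED_OF_SOUND_M_S = 343.0    # assumed speed of sound in air


def itd_seconds(azimuth_rad):
    """Interaural time difference for a source at the given azimuth (Woodworth)."""
    theta = min(abs(azimuth_rad), math.pi / 2)   # formula holds up to 90 degrees
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


def localise_by_delay(mono, azimuth_rad, sample_rate_hz):
    """Return (left, right) channels in which the far ear is delayed by the ITD."""
    delay = round(itd_seconds(azimuth_rad) * sample_rate_hz)
    near = list(mono) + [0.0] * delay    # ear facing the source: no delay
    far = [0.0] * delay + list(mono)     # opposite ear: delayed by `delay` samples
    if azimuth_rad > 0:                  # source on the right: left ear is the far ear
        return far, near
    return near, far
```

This reproduces only the time difference. The level differences and the direction-dependent filtering that an HRTF also captures are not modelled here.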
To be able to create VoIP applications myself, I first developed an RTP library which performs quite well. I also developed a VoIP framework in which different VoIP components can easily be tested. The applications I created with this framework include an Internet Telephony application and a 3D environment. Both allow good quality communication when sufficient bandwidth is available.