Before we start to discuss Voice over IP (VoIP) related topics, it is probably best to give a brief explanation of what it is. This way, the essence of what is discussed here will be clear throughout the document and the details can be worked out at the appropriate time.
Voice over IP is an extensive subject, but at the core it comes down to trying to transport speech signals in an acceptable way from sender to destination over an IP network. An Internet Protocol (IP) network is a computer network which uses the IP protocol to transmit information. I will give a more detailed explanation of this protocol in the next chapter, but for now it might be helpful to know that this is the basic protocol used on the Internet.
The definition of `acceptable' depends on the particular situation we are dealing with. If, for example, speech signals are being transported as part of a real-time communication between two persons, it will mean that the real-time aspects of this conversation must be respected: the overall delay between sending and receiving should be low to avoid irritatingly long gaps of silence. If, however, speech signals are being transmitted as part of a one-way process - e.g. an on-line radio show or a lecture - the delay constraints are less strict since the interactive aspect is no longer present.
Here, I will give the exact formulation of my thesis subject. This way there will be some clarity about what you may or may not expect to find in this document. The subject I have chosen is this one:
``A conventional way to communicate with each other using IP-networks, is through the use of textual chat facilities. The purpose of this thesis proposal is to take this one step further by using voice communication instead of these textual facilities. The goal of this proposal is to perform research and development in order to let persons which are in the same virtual environment talk to each other as they would do in reality. Their positions and orientations can be used to vary the intensity of the words: persons close to each other will hear each other clearly; persons which are moving away from each other will understand each other less and less as their distance increases. The proposal encloses technical components (like grabbing, compression, buffering, transmission, decompression and regeneration of the signal) and also a study of what is happening in the Voice over IP world today. Also, a number of experiments will have to be conducted to justify the chosen techniques.''
It should be clear from this description that the real-time aspects of VoIP will be very important. We are talking about a virtual environment in which persons can communicate with each other and so the overall delay between talking at one end and hearing what is said at the other end should be as small as possible. Because of this, I will pay less attention to those types of VoIP that do not have this constraint, but the same principles can be applied to them.
Currently, when you look at what literature can be found about VoIP, you will find that most of it is about VoIP as a telephone alternative. This type of use is described first in this section, followed by a discussion about using VoIP in virtual environments.
The first kind of use is the `telephone alternative'. This means that you would use some kind of VoIP system to make a voice call to another person. This can be done in several ways.
First of all, if a PC that can be connected to some kind of network is available, it can be used to make a call to somebody else who is also connected to that network. This PC would then be equipped with speakers and a microphone and some VoIP application would be used to make the call. The PC could have a direct connection to a computer network, like in figure 1.1, but a connection through a dial-up link is also possible.
The second case is a slight variation of the first one. In this case, a telephone is connected to the PC and used in a similar way as you would when making a normal call. The PC does all the necessary work to set up the call and to transmit the speech signals. This also means that the PC has to be switched on before the call can be made. This type of configuration might be easier to use for people who do not work with computers often. As with the previous case, the connection to the network can be either direct, like in figure 1.2, or through a dial-up link.
Finally, the use of a PC and the requirement of a network could be omitted by the use of a VoIP gateway. This is a special device that connects the public telephone network with a computer network and performs the necessary actions and conversions to make the call possible. To make a call to somebody, you would call the gateway and specify the destination for the call. The call will then be set up and if the other end is available, the conversation can start. This configuration would be best for persons who do not have a PC. It is probably also the easiest to use, since most people are familiar with using a telephone and there does not have to be a PC around. This configuration is illustrated in figure 1.3.
There are probably a lot of variations to these configurations, but I believe that these three give a good idea of the possibilities. Combinations of these cases can also be worked out. A person could, for example, use his telephone to reach somebody through a VoIP gateway, while the latter uses a telephone to PC configuration.
Now, you may ask yourself: why use VoIP as a telephone alternative while the telephone itself is quite handy? Well, there are several arguments that can be made in favour of VoIP.
Suppose that somewhere - in a company or university for example - a computer network is needed. In that case, there are certain benefits to using Voice over IP instead of installing extra facilities to use telephones. The only requirement is that the IP protocol must be used, but nowadays this is almost always the case.
First of all, there is less cabling and equipment required. All the internal calls can be made using VoIP utilities. For outgoing and incoming calls, however, there still has to be some connection to the telephone network. This can be solved by installing a gateway that is connected to the computer network and the telephone network. This gateway will then perform the necessary signalling and conversions to make these calls possible.
Second, the capacity of the computer network will be better utilised. The available bandwidth of a network within an organisation is usually quite large and rarely fully used. By using VoIP, more of the network's capacity will be used.
At home, there is also an advantage in favour of VoIP. If Voice over IP could be used over a large distance, it would be much cheaper than making that same long distance call using the telephone network. For example, you could try to make the call by using the Internet.
With VoIP, not only the normal telephone features can be made possible, but also a wide range of new features could be created, especially when using VoIP on a PC. Whiteboarding could be used to make working together easier, a log book with information about incoming and outgoing calls could be kept, conversations could easily be recorded and security could be enhanced by using encryption algorithms.
When using VoIP over a Local Area Network (LAN), there is usually plenty of bandwidth available and the delay between sending and receiving is usually very low. Here, VoIP can often be used without problems. But when a Wide Area Network (WAN) is used - the Internet for example - problems can arise. One problem is the delay: while the delay on a LAN is usually very low, on a WAN this is not necessarily true. If the delay gets too large, the conversation will not be very pleasant. Another problem is the quality of the speech signals. When certain routes get too heavily loaded, packets on the WAN will be lost. These lost packets cause interruptions in the speech signal. In turn, these interruptions, when large enough, can also disturb the conversation. To alleviate the load, a lot of VoIP programs use compression techniques. However, compression often causes a certain degradation of the signal. This may or may not be disturbing to the listener, but with heavy compression, telephone quality will rarely be achieved.
The use of VoIP for virtual environments can be seen as a replacement of the textual interface of chat facilities like Internet Relay Chat (IRC). The virtual environment can be made quite abstract by using the same kind of interface as IRC chat programs, but using voice input instead of text. There is, however, also the possibility of a three dimensional interface. This kind of application probably fits the term `virtual environment' best. When you are using this kind of program, there will be some notion of a virtual world and the use of voice communication is very appropriate in this case. Note that now we are talking about facilities that do require a PC.
Because we are dealing with a virtual environment, several voice signals can be expected to go to several destinations, all at the same time. This means that considerable attention should be paid to limiting the required bandwidth. This is especially true when people can access the virtual environment through a dial-up link which has a very small capacity compared to a LAN for example.
Using VoIP this way is a rather new concept. This also means that currently, there is very little specific literature about it. However, much of what was said in the previous section also applies to VoIP in virtual environments.
VoIP techniques can be used for a wide variety of other applications which require voice or sound in general to be transmitted over a computer network and where timing and synchronisation are important issues. The same techniques also work when it is not sound, but video information which has to be transmitted.
Several other applications can be thought of. One is the use of VoIP techniques to create an on-line radio station, or perhaps even an on-line jukebox, where you can select the song you want to hear, which is then played almost immediately. If enough bandwidth is available, it would even be possible to add video data to all this. This way, television broadcasts and video on demand over IP networks could be made possible. In a similar way, we could extend a VoIP telephone conversation with video information about the persons involved in the call, creating a videophone application.
Another kind of application would be fax over IP. This is a bit different since we are no longer transmitting speech data, but a digitised image. Like with VoIP, this service could be made possible by connecting a computer network to the telephone network using a gateway. For fax over IP, this gateway would perform similar functions as with voice over IP.
Note that the list of applications presented here is certainly not complete. A wide range of applications using VoIP related techniques are conceivable, but many of them will resemble the ones discussed above.
Here, the core components of a VoIP system for virtual environments will be illustrated. With `core components' I mean the parts of the VoIP system that are at work during the conversations, so when the VoIP connection has already been established.
The entire process of the core VoIP system for virtual environments is depicted in figure 1.4. The arrows that point downward define the path which is followed when sending speech signals; the arrows that point upward define the processing sequence when speech signals are received. When the label of a box contains two items, the left one is about the sending of speech signals and the right one about the reception of such signals. They are grouped together because they operate at the same level: the right item does approximately the opposite of the left one.
This diagram can easily be adapted for VoIP applications which are not intended for virtual environments. The only thing that needs to be changed is the `3D effects' step. In those applications the 3D effects are not needed, so the entire step can just be left out.
All these components will be described in more detail in the rest of this thesis, but below I will give a general description of each component of the system. This will create a general image of the workings of the VoIP system, which is useful to keep in mind when explaining each component in detail.
To be able to send speech information across a computer network, the speech signal has to be encoded into a digital representation. In general, the signal will be detected by a microphone and transformed into a digital one by a special device, a soundcard for example. This process is called `grabbing' or digitisation and it is often also referred to as sampling.
To maintain the real-time aspects of the conversation, it is necessary for the receiver to start receiving the signal as soon as possible after the sender has started it. To accomplish this, at regular small intervals blocks of digitised speech information are sent across the network, where they can be processed by the receiver.
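To get a feeling for the sizes involved, the following minimal Python sketch computes how large one block of digitised speech is and how many blocks are sent per second. The sampling rate (8000 Hz), the sample size (16 bits) and the block length (20 ms) are assumptions chosen only for illustration; they are not values prescribed in this chapter, and the function name is hypothetical.

```python
def block_size_bytes(sample_rate_hz, bytes_per_sample, block_ms):
    """Size in bytes of one block of uncompressed, single-channel speech."""
    samples_per_block = sample_rate_hz * block_ms // 1000
    return samples_per_block * bytes_per_sample


def blocks_per_second(block_ms):
    """Number of blocks the sender has to emit every second."""
    return 1000 // block_ms


# Assumed example values: 8000 Hz, 16-bit samples, 20 ms blocks.
size = block_size_bytes(8000, 2, 20)
rate = blocks_per_second(20)
print(size, "bytes per block,", rate, "blocks per second")
```

Shorter blocks reduce the time the sender waits before it can transmit anything, but they also mean that more blocks have to be sent per second.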
When a digitised block is received, it has to be transformed back into an audio signal. The output of the process will usually go to speakers, so that the receiver will be able to hear what the sender is saying. Like the digitisation step, this process is also done by a special device. In essence, regeneration is the reverse operation of grabbing.
Several things have to be considered before transforming the digitised signal. First of all, if multiple persons are allowed to talk at the same time, like in a virtual environment, the speech signals of those persons have to be mixed together at the receiver.
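One simple way to picture the mixing step is to add the corresponding samples of each speaker's block and to limit the result to the valid sample range. The sketch below assumes 16-bit signed samples held in plain Python lists; both the representation and the choice of hard clipping are assumptions made for illustration, not necessarily the method used later in this thesis.

```python
def mix_blocks(blocks):
    """Mix several blocks of 16-bit signed samples into one block.

    Samples at the same position are summed; the sum is clipped to the
    16-bit range so that loud passages do not wrap around.
    """
    length = max(len(block) for block in blocks)
    mixed = []
    for i in range(length):
        # Blocks shorter than the longest one simply contribute nothing.
        total = sum(block[i] for block in blocks if i < len(block))
        mixed.append(max(-32768, min(32767, total)))
    return mixed


speaker_a = [1000, -2000, 30000]
speaker_b = [500, -500, 10000]
print(mix_blocks([speaker_a, speaker_b]))
```

The last sample in this example exceeds the 16-bit range after summation and is therefore clipped, which is audible as distortion when it happens often.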
Second, when sending blocks of data across a network, there will be tiny variations in the time it takes each block to get to the destination. If we are unlucky, these variations, which are called jitter, can even be rather large. The importance of jitter is this: suppose we start playing back the voice signal in a block as soon as we have received it. Because of the jitter, it is possible that the next block has not yet arrived when the output of the first one is finished. To overcome this problem, some buffering will have to be performed to make sure that when we are finished with one block, the next will be available. However, this buffering will introduce a certain amount of delay, so care must be taken to keep the overall delay from becoming too large.
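The trade-off between buffering and delay can be made concrete with a small simulation. The Python sketch below assumes that playback of the first block is postponed by a fixed playout delay and that every later block is due one block length after the previous one; a block that arrives after its due time counts as late. The arrival times and the function name are invented for this example.

```python
def count_late_blocks(arrival_ms, block_ms, playout_delay_ms):
    """Count blocks that arrive after the moment they should be played.

    Block i is scheduled at: first arrival + playout delay + i * block length.
    """
    start = arrival_ms[0] + playout_delay_ms
    late = 0
    for i, arrived in enumerate(arrival_ms):
        if arrived > start + i * block_ms:
            late += 1
    return late


# Assumed arrival times (ms) of five consecutive 20 ms blocks.
arrivals = [0, 25, 38, 75, 80]
print(count_late_blocks(arrivals, 20, 0))   # no buffering
print(count_late_blocks(arrivals, 20, 20))  # one block of buffering
```

In this made-up trace, playing immediately leaves two blocks late, while delaying playback by a single block length removes the problem at the cost of 20 ms of extra delay.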
To give the virtual environment a more realistic impression, it is important that some three dimensional (3D) effects are added to the voice signal. A participant should be able to determine roughly where the source of the voice signal is located.
Two general approaches can be thought of. Either the sender processes its own voice signal to appear as coming from a certain position, or the receiver adds the three dimensional effect to the sound. We will discuss later which one can best be used.
With the first approach the sender does the necessary transformations of the digitised signal. This signal can then be used by the receiver without any additional processing. The second approach requires that the receiver knows the position of the sender to modify the digitised signal accordingly. If necessary, this information can be added by the sender to the block containing the voice data.
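To illustrate how positions and orientations could influence what a participant hears, the following Python sketch derives a left and a right gain for one voice source. It is only a rough model under stated assumptions: positions are two dimensional, loudness falls off with the inverse of the distance beyond a reference distance, and direction is rendered with constant-power stereo panning. None of these choices is taken from this chapter, and the function name is hypothetical.

```python
import math


def stereo_gains(listener_pos, listener_heading, source_pos, ref_distance=1.0):
    """Return (left_gain, right_gain) for a source heard by a listener.

    listener_heading is the direction the listener faces, in radians,
    measured counterclockwise from the positive x axis.
    """
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)

    # Assumed attenuation model: full volume inside ref_distance,
    # inverse-distance falloff outside it.
    gain = ref_distance / max(distance, ref_distance)

    if distance == 0.0:
        pan = 0.0  # source at the listener's position: centre it
    else:
        relative = math.atan2(dy, dx) - listener_heading
        pan = -math.sin(relative)  # -1 = fully left, +1 = fully right

    # Constant-power panning between the two channels.
    theta = (pan + 1.0) * math.pi / 4.0
    return gain * math.cos(theta), gain * math.sin(theta)


# Listener at the origin facing along the x axis.
print(stereo_gains((0.0, 0.0), 0.0, (0.0, -1.0)))  # source to the right
print(stereo_gains((0.0, 0.0), 0.0, (4.0, 0.0)))   # source ahead, farther away
```

The receiver would multiply the samples of a block by these gains before mixing. Note that this simple model cannot distinguish a source in front of the listener from one directly behind.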
The digitised information requires a certain amount of the available bandwidth of the connection. Very often compression schemes are used to reduce the required bandwidth for voice communication.
Several types of compression exist. Some of them use general compression techniques which are also used on other kinds of data; other types try to exploit the fact that we are dealing with voice information to achieve large compression ratios. Of course, combinations are also possible.
Once the compressed blocks with speech data reach the destination, they have to be decompressed. This means that given the compressed signal, the original digitised signal has to be reconstructed as well as possible. The decompression is very closely related to compression as it must be the inverse operation of the compression scheme that was used.
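As an example of a scheme that exploits properties of speech, and of a decompression step that only approximately restores the original, the sketch below implements mu-law companding in Python. It maps a sample in the range -1 to 1 onto an 8-bit code using a logarithmic curve, so that quiet samples keep more precision than loud ones. This is a sketch of the continuous mu-law curve with a simple uniform quantiser, not the bit-exact G.711 codec, and it is not necessarily one of the schemes examined later in this thesis.

```python
import math

MU = 255.0  # companding parameter commonly used with 8-bit codes


def mulaw_encode(x):
    """Compress a sample in [-1, 1] to an integer code in 0..255."""
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1.0) / 2.0 * 255.0))


def mulaw_decode(code):
    """Expand an integer code in 0..255 back to a sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign((math.pow(1.0 + MU, abs(y)) - 1.0) / MU, y)


for sample in (0.01, 0.5, -0.5):
    restored = mulaw_decode(mulaw_encode(sample))
    print(sample, "->", mulaw_encode(sample), "->", round(restored, 4))
```

Compared with 16-bit samples, storing one 8-bit code per sample halves the amount of data, and the restored value is close to, but not identical to, the original.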
Compression is very important when the connection is slow, like with dial-up links for example. It is also an important issue when using VoIP in virtual environments, since the bandwidth requirements get larger as the number of senders increases.
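A quick calculation shows why this matters. The Python sketch below estimates the bandwidth a receiver needs when several senders talk at once. The default values (8000 samples per second, 8 bits per sample) and the compression ratio used in the example are assumptions for illustration, and the estimate ignores the overhead of packet headers.

```python
def required_kbps(senders, sample_rate_hz=8000, bits_per_sample=8,
                  compression_ratio=1.0):
    """Estimated payload bandwidth in kbit/s for a number of active senders."""
    bits_per_second = senders * sample_rate_hz * bits_per_sample / compression_ratio
    return bits_per_second / 1000.0


print(required_kbps(1))                         # one uncompressed stream
print(required_kbps(5))                         # five uncompressed streams
print(required_kbps(5, compression_ratio=8.0))  # five streams, compressed 8:1
```

Under these assumptions a single uncompressed stream already needs 64 kbit/s, which exceeds what a 56 kbit/s dial-up modem can carry, whereas five streams compressed by an assumed factor of eight would fit.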
Finally, the blocks have to be sent from source to destination, across the network. Some timing information should probably be added to the data, to make it possible for the receiver to reconstruct the exact order of the blocks. This is necessary because blocks may be lost, delayed or duplicated during the transfer. There are ways to assure a certain quality of the VoIP communication and to make the transfer more efficient when working with multiple destinations, but they will be explained later.
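The effect of such extra information can be sketched as follows. The Python example assumes that the sender attaches an increasing sequence number to every block; the receiver then uses it to restore the order, discard duplicates and notice which blocks are missing. The sequence number field and the function name are assumptions made for this example, and wrap-around of the counter is ignored.

```python
def reorder_blocks(received):
    """Sort received (sequence_number, payload) pairs and report gaps.

    Returns a tuple (ordered, missing): the blocks in sending order with
    duplicates removed, and the sequence numbers that never arrived.
    """
    by_seq = {}
    for seq, payload in received:
        by_seq.setdefault(seq, payload)  # keep the first copy only
    if not by_seq:
        return [], []
    ordered = [(seq, by_seq[seq]) for seq in sorted(by_seq)]
    first, last = ordered[0][0], ordered[-1][0]
    missing = [seq for seq in range(first, last + 1) if seq not in by_seq]
    return ordered, missing


# Block 3 is duplicated, block 4 is lost, block 2 arrives late.
received = [(3, "c"), (1, "a"), (3, "c"), (5, "e"), (2, "b")]
ordered, missing = reorder_blocks(received)
print(ordered)
print(missing)
```

In a real-time setting the receiver would not wait for all blocks before sorting; it would apply the same idea incrementally to the blocks currently held in its buffer.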
This thesis is organised in four major parts. First, there is this introductory chapter. Next there are a number of chapters which can be categorised as research. Following these, there are some chapters in which I will discuss the development part of my thesis. Finally, the last chapter will contain an overall conclusion. Here is a short overview of the research and development parts.
The next chapter is about IP networks. Since we are talking about Voice over IP, it is important to know some features of IP networks and the protocols used there. Therefore, that chapter will only discuss such items, without talking much about using IP for voice data.
In chapter three, we will talk about voice communication in general. Features which are important when using IP networks will be clarified here. The chapter will also include more information about grabbing and regeneration of voice signals.
Next, there is a chapter which contains a discussion about compression techniques. Several techniques will be explained and their use for VoIP will be clarified.
In chapter five the actual transmission of voice data is covered. Here, we will see a very useful protocol to transmit the data. Also, some techniques to provide quality of service (QoS) are discussed.
These four chapters were mostly about Voice over IP in general. Chapter six is about using VoIP in networked virtual environments. Here, techniques for the generation of localised sound will be detailed.
The last chapter of the research part is about subjects which do not belong to the core VoIP problem. However, to make sure that this thesis will produce a good image about what is going on in the VoIP world, some related topics will be discussed. These subjects include some related protocols and standards.
The development part contains three chapters. The first one, chapter eight, contains information about the Real-time Transport Protocol (RTP) library that I wrote. Chapter nine describes the VoIP framework I created. Finally, chapter ten is about the VoIP test applications I developed. Since I have learned a lot while I was working on these programs, I will also discuss some of the design decisions I made and explain why certain changes were made.
Voice over IP (VoIP) is about transmitting a voice signal across an IP network (the Internet for example). The context of this voice signal determines constraints for this transmission. For instance, if this voice signal is a part of a conversation between two people, care must be taken to preserve its real-time characteristics: the delay between one person talking and the other person hearing what was said should be as low as possible to avoid irritating gaps in the communication. Other applications of VoIP - like an on-line lecture - do not have this delay constraint.
This thesis is about VoIP in networked virtual environments. It contains information about VoIP in general and its application in virtual environments. I will also describe the applications which I developed to test aspects of VoIP in virtual environments.
The classical use for VoIP is as a replacement for a telephone call. Using VoIP like this can reduce costs in various ways, but the quality of the conversation is usually lower than that of a normal telephone call. Using VoIP in virtual environments is relatively new. Such applications would allow users to chat with each other, like on IRC, but instead of typing messages to each other they could simply talk with other users. Adding a 3D effect to the speech signal of a user helps to create a more natural environment. Many other applications, for example the transmission of a video signal, use techniques similar to those of VoIP.
Several components are required to make VoIP in virtual environments possible. The speech signal is split into small pieces which are transmitted separately. To be able to transmit a piece of the speech signal, it must first be digitised. At the other end, this digitised signal must be reconstructed into a continuous speech signal which can then be sent to some speakers. Note that several signals may have to be mixed together if several persons are talking at the same time. Also, either at the sender or at the receiver, 3D effects will have to be added to a speech signal. To reduce the amount of bandwidth required to transmit the signal, the digitised speech signal should be compressed. Of course, at the other end it must be decompressed before it can be processed. Finally, there must also be a component which handles the transmission and reception of packets containing speech data.