Waseda Frontline Research Vol. 23: Information theory and data science (Part 1 of 3)

Toshiyasu Matsushima
Professor, Faculty of Science and Engineering; Director, Center for Data Science
Field of expertise: Information theory

Part 1: Approaching the limit of information communications

Information is indispensable in the lives of humankind. As such, there is a research field called information theory that aims to clarify the essence of information using a mathematical approach. Professor Toshiyasu Matsushima of the Department of Applied Mathematics, School of Fundamental Science and Engineering refers to information theory as being like air for society. Of the many information-related research fields, information theory is the most fundamental and essential. In three parts, the Center for Research Strategy asks Professor Matsushima about the history, significance, and appeal of information theory. In addition, we will introduce the Waseda University Center for Data Science, where he has served as director since 2017. Part 1 will provide an overview and specific examples of information theory and its usefulness in society. It is said that information theory researchers are currently ramping up the competition to get closer to the accepted limits. (Date of interview: September 6, 2018)

Claude Shannon’s Information Theory

I study information theory. The essence of information theory is solving the question of what information is mathematically. The term “information” may conjure up images of computers and information networks. However, the human gene sequence is information, and logical thinking and judgment can be regarded as processes for information processing. Information theory covers all information, including these things.

The reason why all kinds of information can be targeted is because we can formulate information using mathematics. When working with length and weight, you can express them physically in units such as meters and grams. It is also possible to use mathematical expressions in units such as bits when working with information.

Information theory was established by Claude Shannon (1916 -2001) of the United States. Shannon first formulated the idea that any information can be represented by the combination of 0 and 1. I think everyone is aware that digital information such as words, sound, video, and everything else could be expressed in 0s and 1s. Shannon built up the fundamental idea in 1937 while he was a master’s degree student. Shannon also worked on formulating information mathematically, and in 1948, he published a paper entitled “A Mathematical Theory of Communication.” That was the beginning of the field now known as information theory.

The theoretical limits of data compression sought by researchers

Shannon sought for a limit. When trying to understand what information or what its essence is, you are trying to find the limit of how much information data could be compressed. This is because information data compressed to its limit can be interpreted as having been distilled to its essential amount. If information data is compressed beyond its limit, rendering it so that the compressed data cannot be fully restored, then the essence of the information will be lost.

So, if information data is compressed to its limit, how large will its size be? Shannon calculated this quantity mathematically and proved it as a theorem. The theorem is called the source coding theorem, or Shannon’s first theorem, and the amount of information contained in information data compressed to the limit is called entropy.

Shannon figured out the limit of data compression, however, he did not show the implementation of the algorithm as a procedure. The problem of how to approach this limit was left to researchers who came after him, such as myself. While Shannon showed, figuratively speaking, that Mount Everest is the highest mountain in the world, later generations of researchers have been competing to reach the summit.

Shannon mathematically showed the limits of basic problems in information communication and information processing, and tried to elucidate what information is, or in other words, what its essence is. First, the basic problem of information communication was to compress digital information, such as text and images, and represent them in as short a way as possible in sets of 0 and 1. By expressing information as a sequence of short 0s and 1s, it will save time and memory in internet communications and computer memory, but how short could it be? It is of course obvious that if the sequence is too short, you will not be able to restore the original information. Furthermore, on the assumption that they can be completely restored, when considering compression of images of landscapes, for example, an image of a white wall and an image of the scramble intersection in Shibuya, the former can be compressed into an overwhelmingly shorter sequence of 0s and 1s than the latter.

Shannon mathematically formulated the problem of compressing information and proved that when information is compressed into sequences of 0 and 1s, there is a length limit beyond which it cannot be restored, and that such length can be expressed in terms of entropy. This theorem is called the source coding theorem.

This quantity called entropy is a very important index from an engineering point of view, and it can also be considered as a measure of the quantity of intrinsic information contained in a given piece of information. For example, a sentence with many characters may appear to contain a large amount of information, but the sentence may be compressed if it is repetitive. In other words, such a sentence has low entropy (information) and contains only a small amount of essential information.

The fact that entropy as measurement of an amount of information has been formulated mathematically and accurately by Shannon’s information theory has established the mathematical basic theories not only problems related to information communication and information processing but also for all fields that handle information (e.g. data science, etc.), and it has made significant contribution to the development of all subsequent information fields.

Allow me to explain another aspect of information theory’s contribution. Since the limitation of the compression of text and image information has become clear, there can be no better method if it is possible to create a method (algorithm) of compression to the length of the entropy, which is the limit. Even if maximum compression performance cannot be achieved, it is important to be able to clearly evaluate how much the algorithm could compress the length compared to the limit, making the entropy, the limit value, an indispensable unit for engineering.

Communication model and the limits of information data compression. In information communication, the data in an information source is compressed (coded), but no matter how hard you try, you can only compress to an amount called entropy. Shannon showed this in his source coding theorem. (Source: Matsushima Laboratory, partially modified)

Error-free information data communication

Another example of fundamental theorem of information theory is the noisy-channel coding theorem. This was also derived by Shannon. It is a theory regarding the problem of correcting or detecting errors in information caused by noise when sending information such as data. Because of this theory, information can be exchanged by mobile phones, digital terrestrial broadcasting, the Internet, and so on, without any worries.

Information is sent via fiber optic cables, radio waves, and other communication channels, but noise may enter along the way and cause errors in the information. To prevent this, for example, a code word is created by adding extra information before sending out information through a channel. By transmitting the code word via the communication channel, even if an error occurs during transmission, there is a way to make it difficult for an error to occur so that the original information can be decoded and restored properly. This is the problem of channel coding.

Having said that, however, if the information data sent to the communication path contains too much noise, it will no longer be recoverable. In other words, there is a limit to how much noise there can be until information data cannot be recovered. Shannon formulated this limitation as the noisy-channel coding theorem. From this theorem, when information is transmitted without errors, one can mathematically determine the limit of how efficiently information such as data can be transmitted.

A method to achieve the limit of the source coding theorem in the problem of compressing information described above was already found. But no method to achieve the limitations of the noisy-channel coding theorem in the problem of sending information without errors had been found until recent years. Although Shannon gave the limit, he showed no algorithm as a feasible procedure to achieve it. The problem of how to approach this limit was left to later researchers such as myself. While Shannon showed, figuratively speaking, that Mount Everest is the highest mountain in the world, later generations of researchers have been competing to reach the summit.

Fortunately, in response to the theoretical limits of the noisy-channel coding theorem derived by Shannon, communication companies and research institutes have continued work toward finding ways to get even closer to the limits. We are in a situation where achievable methods may be found in the next few years.

Although sending “010” information data is desirable, because an error may occur in combining, the same symbol is repeated three times and a longer code is sent, such as “000111000” (channel coding). The noisy-channel coding theorem shows that there is a coding method that can be performed to bring the decoding error rate as close as possible to 0 by increasing the code length if the channel coding capability is smaller than the channel capacity (the upper limit of the amount of information that can be sent per unit time). (Source: Matsushima Laboratory, partially modified)

Information theory in terms of social contribution

Regarding information encryption, Shannon also theorized undecipherable encryption. As you can see in the discussion thus far, the information theory Shannon established and passed on to future researchers has become the foundation for today’s information and telecommunications society.

For example, the ability to compress huge amounts of data and transmit it efficiently without errors is thanks to the establishment of theorem of information theory and research into algorithms. As mentioned, the limit that Shannon had shown has not yet been reached. However, as the limit is approached, for example, the number of optical fiber lines needed for data volume can be reduced and power savings can be achieved.

There are various reasons why mathematically proven possibilities have not been achieved in the real world. For example, although reaching these limits using a computer that could handle an infinite amount of theoretical computations may be possible, such an algorithm that would require an infinite or immense amount of computation cannot be processed by computers today. If we were to process the ideal encoding algorithm with a computer, it would take 20 to 30 years to complete data transmission.

“Personally, I like sports,” says Professor Matsushima, who is the manager of Waseda University’s rugby club.

Fun, similar to sports

Still, researchers are striving to get closer to Shannon’s limits. I am also working on research to approach and reach various theoretical limits. The first step in reaching theoretical limits is modeling. Social problems are replaced with mathematical models, and input, output and constraints are identified. Also, whether the model is probabilistic or deterministic is also considered. All these points will be reviewed.

The next step is the derivation of the theoretical limit. After defining the model, theoretical limit values are derived to develop methods to maximize the evaluation criteria in the assumed model. Although there are limits already indicated by Shannon, we need to derive the limits under various evaluation criteria.

Then, we must derive an algorithm that achieves the theoretical limit. Since it is not always possible to create a method that reaches a theoretical limit, we must develop a suboptimal method in such cases. Also, because it must be practical, we need to create new algorithms based on computation power and memory capacity.

This type of research to approach limits has a sense of fun, like competing for new records in sports. The information theory introduced at this time can be said to be a useful engineering aspect for people because it leads to the improvement of information communication efficiency. On the other hand, I am advancing information theory research from the scientific aspect. I would like to begin with that discussion next time.

Part 2 will introduce the scientific aspects of information theory and how actual research is conducted.

☞Click here for Part 2
☞Click here for Part 3

Profile

Toshiyasu Matsushima
Toshiyasu Matsushima graduated from Graduate School of Science and Engineering, Waseda University with a Ph.D. in Management System Engineering in 1991. He worked at NEC Corp, and held positions as lecturer at Yokohama College of Commerce; associate professor at the Department of Industrial Management and professor at the Department of Industrial and Management Systems Engineering, School of Engineering, Waseda University, before being appointed professor at the Department of Applied Mathematics, School of Fundamental Science and Engineering in 2007.

Professor Matsushima concurrently serves as director of Waseda University Center for Data Science, which was established in December 2017. His research field is information theory and its applications, and his research topics are those of theoretical research, such as various entropy; machine learning using statistical information; statistical processing; communications; information security; optimality such as in control; theoretical research in performance limits; and design of optimal algorithm and their performance evaluations.

He was a visiting research fellow at the Department of Electrical Engineering, University of Hawaii; visiting faculty member at the Department of Statistics, University of California, Berkeley; chairman of IEICE Engineering Sciences Society; chairman of the Special Committee on Information Theory Research, IEICE Engineering Sciences Society; deputy chairperson, Society of Information Theory and its Applications; and administrator of the Society for Quality Control Management. Professor Matsushima also served as member of the editorial committee for papers published by the Japanese Society for Artificial Intelligence; Institute of Electronics; Information and Communication Engineers; and the Society for Quality Control Management. Additionally, he is the manager of Waseda University’s rugby club.

For further details, visit the Matsushima Laboratory website at http://www.matsu.mgmt.waseda.ac.jp/?la=en

Research ActivitiesWaseda University

News

쀣 Tags