Language on the Net: What We've Got Here Is a Failure to Communicate
In this TechNewsWorld podcast, we chat with John McNaught, a language expert from England who was part of a recent report that forecast a bleak future for many languages on the Internet. McNaught talks about why the language problem is something that eludes both the free market and Google, and how devising language tools is like particle physics.
Oct 5, 2012 5:00 AM PT
Because of the prevalence of English on the Internet, as well as language technology such as speech recognition and translation software, smaller languages may be falling by the wayside, according to a recent study.
Languages such as Icelandic, Latvian and Lithuanian don't have enough speakers to gain traction as popular languages on the Web, and even German, Italian, Spanish and French could be at risk because of a dearth of resources to power translation tools, speech-to-text technology and voice-controlled devices.
The research was conducted by the National Centre for Text Mining at the University of Manchester in England. Among the researchers on the study was John McNaught, the center's deputy director. He joins us for this podcast.
Here are some excerpts:
Hello, and welcome to a TechNewsWorld podcast. My name is David Vranicar, I'm a reporter for TechNewsWorld, and today we are talking about language extinction on the Internet.
With us is John McNaught, the deputy director of the National Centre for Text Mining. He joins us from Manchester and the University of Manchester in England. Professor McNaught was part of a study titled "Europe's Languages in the Digital Age." It was published by META-NET, a group that promotes technological foundations for a multilingual European information society.
The report has gotten some traction recently for its findings, which do not bode well for a slew of European languages and their future prospects on the Web. So Professor McNaught is going to break down what they found and what it means for the Internet moving forward.
TechNewsWorld: It seems like it might be kind of a self-reinforcing phenomenon, where the strong languages would just become more prevalent and the ones that don't have as big of populations or that don't have as much usage -- that they would continue to slide further and further. Is that what you're finding, that the strong ones are getting stronger and vice versa?
John McNaught: Well, I think it's fair to say that the strong ones are strong. I'm not sure to what extent we can say the strong ones are getting stronger for the simple reason that there is a very large proportion of people who do not wish to communicate in any language but their own -- or cannot communicate in any language but their own. So I think there's a finding that only about 10 percent of people would be prepared to use English for online services, so they wouldn't want to start ordering things online or engage in a bank or whatever in English if that wasn't their own language.
We have to be very careful about just assuming that certain languages will just get stronger and stronger and stronger, and that other ones will just slide away. We also have to be careful about maintaining balance -- maintaining cultural balance, balance in terms of heritage, balance in terms of societies getting along together. Because when you look at many of the trends over many, many centuries, often it's language that is at the root of many of the world's problems: People don't understand each other. And they need help in order to bring people to understand, to celebrate our diversity, to celebrate our cultural heritage and so on. And we can do this through language technology if we have the means.
TNW: Yeah, I thought that was one of the more interesting things about the write-up that you all did about your findings, that this really transcends not being able to buy the newest pair of shoes on the Internet and goes beyond simple things. There really is a cultural significance to it.
McNaught: Yes, I wholly agree with that view.
TNW: Is this gap -- the gap between the languages or the slow progression of these language tools -- is it something that the "free market" could rectify? Can you apply normal terms of supply and demand to this issue? Or does that kind of fall by the wayside when you get into this language landscape?
McNaught: Well, that's an interesting question. I think it's only natural for big players in the market to focus on big returns. And there may not be much traction in investing in, say, machine translation systems between two minor languages. That's not going to bring big returns. So you tend to find the big players concentrate on the major languages and indeed on translation into English as opposed to translation into other languages. So there's that phenomenon.
On the other end of the scale, Europe is quite active in the small and medium-sized enterprises sector. However, it becomes very expensive for an SME to develop a language application on its own simply because it needs access to very large language resources, it needs the language expertise and so on to develop the tools. So it can become a significant struggle for SMEs to do this even if they wish to do so, especially in the minority-language area.
TNW: Is this problem too big to be solved by Google and Google Translate? Is it too deep for that?
McNaught: I think it is. Google Translate is good as far as it goes. I use it myself regularly. However, it remains at a certain level of achievement and we know we can do better.
The thing about machine translation -- I think it is one of the hardest things that one can attempt to achieve on a computer. And machine translation, in common with many natural language processing tasks, relies in many cases on deeper understanding and on the content of a message. And many of the techniques that are around at the moment are based on statistics instead of on meaning. So if you've seen something so many times before, you might propose it as matching something you're seeing now. This gets you so far, but these programs really have no knowledge of semantics, or meaning. And we're trying to get to this, but it's a hard task. We can do certain things, we can get so far.
I think you can make a comparison with the efforts going on in particle physics trying to find the Higgs boson, that kind of thing. People got together on a large scale, there was massive funding. We are after something equally elusive, which is understanding, which is meaning and getting a computer to be able to handle this. And it's that kind of large-scale effort that's needed to bring people together, to provide them with the frameworks to meet the challenge.