Nearly any time that two disciplines meet, some new group will emerge. As the group begins to formulate the problems and approaches that they find interesting, they’ll often label themselves.
- A chemist who is interested in living things might study biochemistry.
- Economists who study the effects of psychology on decision-making are called behavioral economists.
- Linguists who use computational techniques to define language models are called…what, exactly?
Meet the labels
So what do academics and engineers call themselves when they study both linguistics and computer science? There are lots of labels that get thrown around. Here are a few of the most popular ones:
- Computational Linguistics
- Natural Language Processing
- Text Mining (or Text Data Mining)
- Text Processing
- Corpus Linguistics
Unfortunately, few people are trained as both linguists and computer scientists. As a result, a researcher in one field will often develop an interest in the other and begin working on problems without first figuring out what label the project falls under. A linguist might be tempted to say “I’m using Python to make lists of collocates for some nouns to try to get at their meaning” while a computer scientist would say “My new algorithm tries to maximize the probability of a word occurring given the context words around it.” Both the linguist and the computer scientist are dealing with a common problem: contextual semantic representation (or, in layman’s terms, figuring out what a word means based on context).
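The linguist's half of that example can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function name `collocates`, the whitespace tokenization, and the two-word window are all choices made for this example, and the sentence is invented.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        # Words to the left and right of the node word, excluding the node itself.
        neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbors)
    return counts


# Hypothetical toy text; a real study would draw on a corpus.
text = "the black dog chased the cat and the small dog barked at the cat"
tokens = text.lower().split()
dog_collocates = collocates(tokens, "dog")
```

Calling `dog_collocates.most_common()` lists the neighbors of "dog" by frequency. The computer scientist's version would replace these raw counts with a model that learns probabilities from the same kind of context window.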
So as the same types of problems get worked on over and over again, the label for that group is determined largely based on how the researchers perceive their approach. For example, “sentiment analysis” and “opinion mining” mean nearly the same thing (figuring out what people’s opinions are based on what they write). But a linguist interested in pragmatics, semantics, etc. is much more likely to consider it analysis whereas a programmer would call it mining.
Some loose definitions
Computational Linguistics (CL) is considered to be the umbrella term for any subject that encompasses both the fields of Linguistics/Language and Computer Science/Mathematics. This definition is not universally agreed upon, but tends to be a fairly good representation of what people mean when they say it. For whatever reason, CL tends to be thrown around a lot more in the Linguistics department than in the Computer Science department.
Computational linguists might deal with a problem such as “How can we combine parse trees and Bayesian models to account for grammatical variations in written language?” or “What did the common ancestor of English and Danish look and sound like?”
Natural Language Processing (NLP) deals a lot more with engineering solutions to language problems. Although it’s technically a subdomain of CL, it seems to be much less concerned with the actual linguistic principles than it is with results. NLP researchers spend a lot more time with machine learning/artificial intelligence than your run-of-the-mill computational linguist.
One example NLP problem might be “How can I classify a group of 400,000 documents according to topic and ensure the labels make sense?”
Text Mining or Text Data Mining almost always ignores linguistic principles in its search for nuggets or patterns within large bodies of text.
A good text mining problem might involve a question like “When Reddit users say ‘Microsoft’ or ‘Google,’ what kinds of words appear in those sentences that aren’t common words?” or “Which language has the longest words?”
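A question like the first one might be approached along these lines. This is a minimal sketch, assuming a tiny hand-written list of common words, a crude regular-expression tokenizer, and made-up sentences in place of real Reddit data; the names `COMMON_WORDS` and `distinctive_words` are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop list; real projects use a much longer one.
COMMON_WORDS = {"the", "a", "an", "is", "and", "of", "to", "for", "my", "i", "on", "it"}


def distinctive_words(sentences, target):
    """Count the non-common words in sentences that mention `target`."""
    counts = Counter()
    target = target.lower()
    for sentence in sentences:
        words = re.findall(r"[a-z']+", sentence.lower())
        if target not in words:
            continue  # Only sentences that mention the target are of interest.
        counts.update(w for w in words if w != target and w not in COMMON_WORDS)
    return counts


# Hypothetical example sentences, not real Reddit data.
sentences = [
    "Google search is fast",
    "I switched to Google for search",
    "The weather is nice",
]
google_words = distinctive_words(sentences, "Google")
```

Notice that nothing here depends on grammar or meaning. The code only counts strings, which is typical of the text mining mindset.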
Text Processing is NLP-lite. These jobs tend to be filled by either (1) linguists with beginner skills in programming or (2) intermediate programmers with less knowledge of linguistics. Unfortunately, there are very few good text processing engineers, at least in the companies and organizations I’ve visited. The good ones tend to use text processing as a stepping stone to mining or NLP.
Text processing problems might include descriptions such as “Go through these documents and store each XML tag body into its own file” or “Run the Stanford POS tagger on this text and identify all noun phrases.”
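The first of those tasks could look something like this, using only Python's standard library. Treat it as a sketch under assumptions: the function name `split_tag_bodies`, the tag name `doc`, the file naming scheme, and the sample XML are all invented for the example, and the output goes to a temporary directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def split_tag_bodies(xml_text, tag, out_dir):
    """Write the text of every <tag> element to its own numbered file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    root = ET.fromstring(xml_text)
    paths = []
    for index, element in enumerate(root.iter(tag), start=1):
        # itertext() also collects text nested inside child elements.
        body = "".join(element.itertext()).strip()
        path = out_dir / f"{tag}_{index}.txt"
        path.write_text(body, encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical input; the tag name "doc" is made up for the example.
sample = "<docs><doc>First text.</doc><doc>Second text.</doc></docs>"
written = split_tag_bodies(sample, "doc", tempfile.mkdtemp())
```

Here `written` holds the paths of the files that were created, one per `doc` element.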
Corpus Linguistics is a specialized field of text gathering. Of the five labels given above, corpus linguistics is probably the most different. A corpus linguist’s job is to gather collections of language (usually text) that represent a language segment. Corpora, the collections that these linguists produce, are an invaluable tool for answering many varied questions in linguistics.
One of the most used corpora in the world is BYU’s Corpus of Contemporary American English (COCA), which is considered fairly representative of modern American English. In fact, BYU has one of the best collections of free corpora in the world, including historical English corpora, Web-based language, and even Spanish/Portuguese corpora. See http://corpus.byu.edu/ for the full offerings.
What’s my field?
Despite there being a lot of labels, it may be hard to figure out exactly what to call your particular realm of research, especially when you’re working mostly on your own. I can’t really offer you any solid advice on how to name your work, but here are some important questions to ask yourself:
- How much knowledge of machine learning, statistics, and mathematical models is required to work on my research?
- Do my questions deal more with understanding human language or just getting results on a problem dealing with language?
- What are the tools I use to get answers to the questions I ask?
- Does a researcher in my field need programming skills in order to be successful? How advanced should these programming skills be?
- Could my tasks be automated?