Concordia's Thursday Report

Sabine Bergler and Leila Kosseim launched their laboratory in July.

Photo by Andrew Dobrowolskyj

by James Martin

In the world of electronic information, he that seeketh findeth — but not until he hath sorted through a whole lot of extraneous stuff.

The Computational Linguistics at Concordia (CLaC) research group is out to speed things along.
“Our work is related to what Internet search engines do right now,” explained Dr. Sabine Bergler, CLaC co-founder and associate professor of computer science, “but we’re working at a more fundamental level of going towards the content of a document and finding ways of expressing at least parts of that content.”

CLaC was formed last September when Bergler joined forces with newly arrived assistant professor Dr. Leila Kosseim. Working with a handful of graduate students (plus three undergrads working on summer NSERC scholarships), Bergler and Kosseim are taking computational linguistics beyond what is termed “the bag of words approach,” which doesn’t take into account that the order of words may change the meaning.

For example, a search for information about a sandwich-eating contest (“Brothers eat four hundred heroes”) may yield shocking revelations about bravery gone horribly awry (“Four heroes eat hundred brothers”). “If you’re using Google and not taking care with your double quotes,” Bergler said, “you’ll get all kinds of results and then have to sift through pages and pages of material, like back in the Dark Ages.”
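The word-order problem can be shown in a few lines of Python. This is only an illustrative sketch, not CLaC's software: the function name bag_of_words and the variable names are made up for the example, and it assumes a "bag" is simply a count of lowercase words split on spaces.

```python
from collections import Counter


def bag_of_words(text):
    """Count each word in the text, ignoring case and word order (hypothetical helper)."""
    return Counter(text.lower().split())


# The two headlines from the example above.
sandwich = "Brothers eat four hundred heroes"
horror = "Four heroes eat hundred brothers"

# A bag-of-words comparison sees identical word counts...
same_bag = bag_of_words(sandwich) == bag_of_words(horror)

# ...while comparing the words in sequence shows the headlines differ.
same_order = sandwich.lower().split() == horror.lower().split()

print(same_bag, same_order)
```

Because both headlines use exactly the same five words, a system that only counts words cannot tell the sandwich contest from the tale of bravery gone wrong; only the ordered comparison separates them.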

Developing a base technology

At the core of CLaC’s research is the idea of “noun phrase co-referencing,” which Dr. Bergler dubs “the base technology that drives the rest.” Co-referencing strives for semantic understanding of the text by attempting to link multiple appearances of the same concept (or person or place) in a group of electronic documents. It may sound easy, but meaning is slippery; Bergler warns that “serious issues arise when you look across several documents.”

Even something as seemingly benign as a text’s date can prove problematic: a group of documents may, for example, all make reference to the “President of the United States” — but those same words, as used in documents from the 1970s, don’t necessarily refer to the same person as documents from the 1990s.

Understanding the text

Another co-referencing challenge lies in identifying the differences between referents which may share the same name (as in the case of “President George Bush”). Ultimately, CLaC believes co-referencing research will result in more meaningful, useful text retrieval.

“If you have any kind of understanding of the text of a document, rather than just frequency of key-word occurrence,” said Kosseim of the importance of co-referencing, “then the hope is that you’ll improve the accuracy of the retrieval. That’s why we’re really focusing on the deeper semantic and syntactic analysis of the text, where we’re trying to represent a sense of the meaning.”

Related CLaC research includes question-answering (the goal is to semantically and syntactically break down user queries by compiling annotated corpora; the reverse is also being researched, wherein the software returns answers using full, grammatically correct sentences), summarization (in which users are quickly and accurately briefed on the content of relevant documents), and user evaluations of whether the returned answers are, in fact, accurate or useful.

In just one year, CLaC has hosted the inaugural workshop on Computational Linguistics in the North East (CLiNE) in May, and the researchers celebrated the opening of their own computer lab in July.

After years of limited growth, computational linguistics is now experiencing a huge burst of international interest, largely because governments are under great pressure to find efficient ways to manage an ever-increasing amount of electronic text. (The European Union, for example, must produce each of its documents in seven languages; the Canadian Department of National Defence electronically archives any document with a signature.) This avalanche of virtual paperwork, Kosseim cheerily notes, means “there’s a lot of work to be done.”

Bergler admits there’s no end to the challenges presented by natural (colloquial) language’s quirks — ambiguity, irony, figures of speech — that often leave us perplexed. “Two people don’t necessarily agree about what they read in the same document, even if it’s a factual text,” she said, “so we know that we probably won’t have full semantic and syntactic understanding within our lifetime. Not unless — ”

Laughing, Kosseim finishes the thought: “Not unless someone changes natural language.”