by James Martin
In the world of electronic information, he that seeketh findeth,
but not until he hath sorted through a whole lot of extraneous
stuff.
The Computational Linguistics at Concordia (CLaC) research group is out
to speed things along.
"Our work is related to what Internet search engines do right now,"
explained Dr. Sabine Bergler, CLaC co-founder and associate professor
of computer science, "but we're working at a more fundamental
level of going towards the content of a document and finding ways of expressing
at least parts of that content."
CLaC was formed last September when Bergler joined forces with newly arrived
assistant professor Dr. Leila Kosseim. Working with a handful of graduate
students (plus three undergrads working on summer NSERC scholarships),
Bergler and Kosseim are taking computational linguistics beyond what is
termed the "bag of words" approach, which doesn't take
into account that the order of words may change the meaning.
For example, a search for information about a sandwich-eating contest
("Brothers eat four hundred heroes") may yield shocking revelations
about bravery gone horribly awry ("Four heroes eat hundred brothers").
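
The confusion between the two headlines can be reproduced in a few lines. What follows is a minimal sketch, not CLaC's software: the function name `bag_of_words` and the whitespace tokenization are assumptions made for illustration. It shows that once word order is thrown away, the two headlines become indistinguishable.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split on whitespace, and count each word.

    The result records how often each word occurs but not where,
    which is exactly the information a bag-of-words model keeps.
    """
    return Counter(text.lower().split())


query = "Brothers eat four hundred heroes"
other = "Four heroes eat hundred brothers"

# Same words, same counts: the two bags are equal.
same_bag = bag_of_words(query) == bag_of_words(other)

# Comparing the ordered word lists tells the headlines apart.
same_order = query.lower().split() == other.lower().split()

print(same_bag)    # True
print(same_order)  # False
```

A system that compares only the bags has no way of knowing who is eating whom, which is the gap the deeper analysis described here is meant to close.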
"If you're using Google and not taking care with your double
quotes," Bergler said, "you'll get all kinds of results
and then have to sift through pages and pages of material, like back in
the Dark Ages."
Developing a base technology
At the core of CLaC's research is the idea of noun phrase co-referencing,
which Dr. Bergler dubs "the base technology that drives the rest."
Co-referencing strives for semantic understanding of the text by attempting
to link multiple appearances of the same concept (or person or place)
in a group of electronic documents. It may sound easy, but meaning is
slippery; Bergler warns that "serious issues arise when you look
across several documents."
Even something as seemingly benign as a text's date can prove problematic:
a group of documents may, for example, all make reference to "the President
of the United States," but those same words, as used in documents
from the 1970s, don't necessarily refer to the same person as they do in
documents from the 1990s.
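
The date problem can be made concrete with a toy example. This is a rough sketch under stated assumptions, not the group's method: the `Mention` record, the document identifiers, and the two linking functions are all hypothetical, and bucketing by decade is only a crude stand-in for real temporal reasoning.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    doc_id: str   # hypothetical document identifier
    year: int     # year the document was written
    phrase: str   # noun phrase as it appears in the text


def link_by_string(mentions):
    """Naive linking: identical phrases are assumed to co-refer."""
    chains = defaultdict(list)
    for m in mentions:
        chains[m.phrase.lower()].append(m.doc_id)
    return dict(chains)


def link_by_string_and_decade(mentions):
    """Crude date-aware linking: identical phrases co-refer
    only if their documents fall in the same decade."""
    chains = defaultdict(list)
    for m in mentions:
        decade = m.year // 10 * 10
        chains[(m.phrase.lower(), decade)].append(m.doc_id)
    return dict(chains)


mentions = [
    Mention("doc_a", 1978, "the President of the United States"),
    Mention("doc_b", 1979, "the President of the United States"),
    Mention("doc_c", 1995, "the President of the United States"),
]

naive = link_by_string(mentions)
dated = link_by_string_and_decade(mentions)

print(len(naive))  # 1 chain: all three mentions treated as one person
print(len(dated))  # 2 chains: the 1970s mentions kept apart from the 1990s one
```

Matching on the words alone merges all three mentions into one referent. Even the dated version is too coarse for real use, since the office can change hands within a decade, but it shows why a document's date has to be part of the evidence.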
Understanding the text
Another co-referencing challenge lies in identifying the differences between
referents which may share the same name (as in the case of President
George Bush). Ultimately, CLaC believes co-referencing research
will result in more meaningful, useful text retrieval.
"If you have any kind of understanding of the text of a document,
rather than just frequency of key-word occurrence," said Kosseim
of the importance of co-referencing, "then the hope is that you'll
improve the accuracy of the retrieval. That's why we're really
focusing on the deeper semantic and syntactic analysis of the text, where
we're trying to represent a sense of the meaning."
Related CLaC research includes question-answering (the goal is to semantically
and syntactically break down user queries by compiling annotated corpora;
the reverse is also being researched, wherein the software returns answers
using full, grammatically correct sentences), summarization (in which
users are quickly and accurately briefed on the content of relevant documents),
and user evaluations of whether the returned answers are, in fact, accurate
or useful.
In just one year, CLaC hosted the inaugural workshop on Computational
Linguistics in the North East (CLiNE) in May, and the researchers celebrated
the opening of their own computer lab in July.
After years of limited growth, computational linguistics is now experiencing
a huge burst of international interest, largely because governments are
under great pressure to find efficient ways to manage an ever-increasing
amount of electronic text. (The European Union, for example, must produce
each of its documents in seven languages; the Canadian Department of National
Defence electronically archives any document with a signature.) This avalanche
of virtual paperwork, Kosseim cheerily notes, means "there's
a lot of work to be done."
Bergler admits there's no end to the challenges presented by natural
(colloquial) languages' quirks (ambiguity, irony, figures of
speech) that often leave us perplexed. "Two people don't
necessarily agree about what they read in the same document, even if it's
a factual text," she said, "so we know that we probably won't
have full semantic and syntactic understanding within our lifetime. Not
unless..."
Laughing, Kosseim finishes the thought: "Not unless someone changes
natural language."