        
        by James Martin 
         
        In the world of electronic information, he that seeketh findeth,
        but not until he hath sorted through a whole lot of extraneous
        stuff.
         
        The Computational Linguistics at Concordia (CLaC) research group is out 
        to speed things along. 
        "Our work is related to what Internet search engines do right now,"
        explained Dr. Sabine Bergler, CLaC co-founder and associate professor
        of computer science, "but we're working at a more fundamental
        level of going towards the content of a document and finding ways of expressing
        at least parts of that content."
         
        CLaC was formed last September when Bergler joined forces with newly arrived 
        assistant professor Dr. Leila Kosseim. Working with a handful of graduate 
        students (plus three undergrads working on summer NSERC scholarships),
        Bergler and Kosseim are taking computational linguistics beyond what is
        termed the "bag of words" approach, which doesn't take
        into account that the order of words may change the meaning.
         
        For example, a search for information about a sandwich-eating contest 
        ("Brothers eat four hundred heroes") may yield shocking revelations
        about bravery gone horribly awry ("Four heroes eat hundred brothers").
        "If you're using Google and not taking care with your double
        quotes," Bergler said, "you'll get all kinds of results
        and then have to sift through pages and pages of material, like back in
        the Dark Ages."
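
        The limitation is easy to reproduce. The sketch below is an illustration
        only, not CLaC's software: it assumes a deliberately naive tokenizer
        (lowercase, split on whitespace) and a hypothetical helper named
        bag_of_words, and it shows that the two headlines above become
        indistinguishable once word order is thrown away.

```python
from collections import Counter


def bag_of_words(text):
    """Return a word-count multiset, ignoring case and word order.

    Hypothetical helper for illustration; real systems tokenize more carefully.
    """
    return Counter(text.lower().split())


sandwich_contest = "Brothers eat four hundred heroes"
bravery_gone_awry = "Four heroes eat hundred brothers"

# Both headlines contain exactly the same words, so their bags are equal.
same_bag = bag_of_words(sandwich_contest) == bag_of_words(bravery_gone_awry)

# Comparing the token sequences themselves keeps the order, and they differ.
same_order = sandwich_contest.lower().split() == bravery_gone_awry.lower().split()

print(same_bag, same_order)  # prints: True False
```

        Because the two bags compare equal, a system that relies on word counts
        alone has no basis for preferring one story over the other.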
         
         Developing a base technology 
         
        At the core of CLaC's research is the idea of noun phrase co-referencing,
        which Dr. Bergler dubs "the base technology that drives the rest."
        Co-referencing strives for semantic understanding of the text by attempting
        to link multiple appearances of the same concept (or person or place)
        in a group of electronic documents. It may sound easy, but meaning is
        slippery; Bergler warns that "serious issues arise when you look
        across several documents."
         
        Even something as seemingly benign as a text's date can prove problematic:
        a group of documents may, for example, all make reference to "the President
        of the United States," but those same words, as used in documents
        from the 1970s, don't necessarily refer to the same person as in documents
        from the 1990s.
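
        One way to picture the problem is to treat a document's date as part of
        the reference. The sketch below is an illustration under stated
        assumptions, not CLaC's method: the lookup table TERMS and the function
        resolve_president are hypothetical names, and the table covers only
        1969 to 2001.

```python
from bisect import bisect_right
from datetime import date

# Illustrative table of (first day in office, office holder), 1969 to 2001.
# The final entry only marks where the table's coverage ends.
TERMS = [
    (date(1969, 1, 20), "Richard Nixon"),
    (date(1974, 8, 9), "Gerald Ford"),
    (date(1977, 1, 20), "Jimmy Carter"),
    (date(1981, 1, 20), "Ronald Reagan"),
    (date(1989, 1, 20), "George H. W. Bush"),
    (date(1993, 1, 20), "Bill Clinton"),
    (date(2001, 1, 20), None),
]
START_DATES = [start for start, _ in TERMS]


def resolve_president(document_date):
    """Return who "the President of the United States" denotes on a date.

    Returns None when the date falls outside the table's coverage.
    """
    # Find the last term that began on or before the document's date.
    index = bisect_right(START_DATES, document_date) - 1
    if index < 0:
        return None
    return TERMS[index][1]


memo_1970s = resolve_president(date(1975, 6, 1))
memo_1990s = resolve_president(date(1995, 3, 15))
print(memo_1970s)
print(memo_1990s)
```

        Run as written, this prints Gerald Ford and then Bill Clinton: the same
        five words, two different referents. A real co-referencing system has
        to infer such context from the text itself rather than from a
        hand-built table.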
         
         Understanding the text 
         
        Another co-referencing challenge lies in identifying the differences between 
        referents which may share the same name (as in the case of "President
        George Bush"). Ultimately, CLaC believes co-referencing research
        will result in more meaningful, useful text retrieval.  
         
        "If you have any kind of understanding of the text of a document,
        rather than just frequency of key-word occurrence," said Kosseim
        of the importance of co-referencing, "then the hope is that you'll
        improve the accuracy of the retrieval. That's why we're really
        focusing on the deeper semantic and syntactic analysis of the text, where
        we're trying to represent a sense of the meaning."
         
        Related CLaC research includes question-answering (the goal is to semantically
        and syntactically break down user queries by compiling annotated corpora;
        the reverse is also being researched, wherein the software returns answers
        using full, grammatically correct sentences), summarization (in which
        users are quickly and accurately briefed on the content of relevant documents),
        and user evaluations of whether the returned answers are, in fact, accurate
        or useful.
         
        In just one year, CLaC hosted the inaugural workshop on Computational
        Linguistics in the North East (CLiNE) in May, and the researchers celebrated
        the opening of their own computer lab in July.
         
        After years of limited growth, computational linguistics is now experiencing 
        a huge burst of international interest, largely because governments are 
        under great pressure to find efficient ways to manage an ever-increasing 
        amount of electronic text. (The European Union, for example, must produce 
        each of its documents in seven languages; the Canadian Department of National 
        Defence electronically archives any document with a signature.) This avalanche 
        of virtual paperwork, Kosseim cheerily notes, means "there's
        a lot of work to be done."
         
        Bergler admits there's no end to the challenges presented by natural
        (colloquial) language's quirks (ambiguity, irony, figures of
        speech) that often leave us perplexed. "Two people don't
        necessarily agree about what they read in the same document, even if it's
        a factual text," she said, "so we know that we probably won't
        have full semantic and syntactic understanding within our lifetime. Not
        unless ..."
         
        Laughing, Kosseim finishes the thought: "Not unless someone changes
        natural language."
         
         