Connecting the dots: Concordia PhD student wins top prize for world's fastest FIM data mining program
Legend has it that after analyzing huge amounts of data from its checkout counters, Wal-Mart found a statistically significant correlation between beer and diaper purchases. As the story goes, the discount chain then placed the diapers next to the beer, with potato chips in between — resulting in an increase in sales for all three products.
Concordia graduate student Jianfei Zhu uses this colourful tale to illustrate the kind of work he does with data mining, the computerized process of discovering hidden patterns and regularities in large data collections.
A PhD candidate, Zhu recently won first prize for the fastest frequent itemset mining (FIM) program at an international workshop, FIMI '03. His code beat out submissions from several top U.S. schools, an achievement his advisor, Concordia Computer Science Professor Gosta Grahne, calls exceptional. In fact, the pair is looking into patenting the method to secure the exclusive right to use or sell it.
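To make the idea concrete: a frequent itemset is a group of items that turns up together in at least some minimum number of transactions. The short Python sketch below is a brute-force illustration only and is not Zhu's program or method. The checkout data, the function name frequent_itemsets and the min_support threshold are all invented for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return every itemset found in at least min_support transactions.

    Brute force: enumerates all subsets of each basket, so it is only
    practical for tiny baskets. Competitive FIM programs avoid this.
    """
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # de-duplicate and fix an item order
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Made-up checkout data, for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips", "diapers"],
    ["bread", "diapers"],
]

result = frequent_itemsets(baskets, min_support=3)
for itemset, n in sorted(result.items()):
    print(", ".join(itemset), "appears in", n, "baskets")
```

With this toy data and a threshold of three baskets, the only pair that qualifies is beer and diapers. The number of candidate subsets grows exponentially with basket size, which is why the speed of a FIM program matters on real databases.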
Zhu admits that while his research might seem obscure to some, it actually touches many different industries, from helping banks detect fraudulent credit card use to helping doctors identify the likelihood of a patient developing a particular illness.
Thanks to the mining of cancer databases, doctors in Australia have achieved improvements in the diagnosis and prognosis of the disease.
Dr. Grahne says the field of data mining is only now emerging and that Zhu has made a significant contribution with his latest data mining program.
“The situation is similar to when relational database technology was developed, starting in the 1970s. It took the scientific community, together with companies like IBM, more than 10 years of research and development to reach industrial-strength database systems,” Grahne said.
“Jianfei's program is an important breakthrough, and I think it will have a definite influence on the development of data mining software.”
After completing his master's degree in China, Zhu opted to continue his studies at Concordia, where he says he learned efficient coding and, ultimately, problem-solving.
“Writing code and research is really problem-solving,” he said. “I consider good hacking skills essential for a good researcher. If you have good ideas, you have to be able to write code to implement them. Otherwise you are like a person that has all the ingredients for a delicious dinner, but cannot cook!”
Zhu and his team are currently looking into building large-scale versions of the data mining method to process massive amounts of information. No one has yet successfully mined databases of this magnitude.
“To give a comparison, 100 terabytes (the scale we are using) is enough to store the entire collection of the U.S. Library of Congress, four times over,” Grahne said.
Zhu, who will graduate in the fall, plans on continuing his research in databases and data mining after obtaining his doctorate.
“I would be equally happy in an academic environment as in a serious research lab,” he explained, adding that what matters is that the emphasis stays on data mining. “In the future, we will see more and more of a data-exploration-oriented approach in decision-making, and data mining will be at the core of the required technology.”