30.8.07

Aggregators

Last semester, during one of my Thesis Seminar classes, Professor Stuart Madnick gave a lecture on data reuse and repurposing. One of his central themes was aggregation, particularly with respect to online portals, and within that context he asserted that a portal is either an aggregator or an aggregatee. Examples include mysimon.com and evenbetter.com, which aggregate content from bn.com, borders.com and amazon.com. Other examples include iship.com, etc.

I've been looking around for more examples within the news media space and found TextMap and News Explorer.

TextMap temporally aggregates 1000s of news sources and categorizes the results using NLP entity referencing.
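To make the idea concrete, here is a rough sketch of what temporal aggregation by entity could look like. I don't know how TextMap actually does it; the sample stories, source names and the `mentions_per_day` function are all made up for illustration, and the entity lists are assumed to come from some earlier NLP step.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample data: (publication date, source, entities found by an NLP step).
STORIES = [
    (date(2007, 8, 29), "source-a", ["Person X", "City Y"]),
    (date(2007, 8, 30), "source-b", ["Person X"]),
    (date(2007, 8, 30), "source-c", ["Person X", "Company Z"]),
]


def mentions_per_day(stories):
    """Count how many stories mention each entity on each day."""
    counts = defaultdict(Counter)
    for day, _source, entities in stories:
        # set() so that an entity is counted at most once per story.
        counts[day].update(set(entities))
    return counts


daily = mentions_per_day(STORIES)
for day in sorted(daily):
    print(day.isoformat(), daily[day].most_common())
```

Grouping by month or year instead of by day would only mean changing the key used for `counts`.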




It even provides various useful metrics based on daily, monthly or yearly news stories. For example, the Person section of the Daily Sentiment report for August 30th looks like:





News Explorer likewise aggregates individual news sites but presents the data in a more intuitive way. On its front page, the stories most written about are displayed geographically on a Google map...



Probably the coolest feature is its Timeline application, which shows the most widely reported news stories over the past x number of days:



The highest spike in the graph above is Saddam Hussein's execution, btw.

One implication of all this is that, in one way or another, all portals could be negatively affected: your website's content could be used for the benefit of others and contribute to their success and, conversely, your failure. Accordingly, portals have launched lawsuits against competitors that tried to leverage their investments in development cycles, database hardware, etc. The most notable case is eBay v. Bidder's Edge (2000). In the wake of these types of lawsuits, the EU has passed Database Re-use directives and there are four similar bills before the US Congress.

I doubt these types of laws will curtail aggregation, spiders, etc. It's in our nature to make the best use of the information available to us. The challenge, of course, is deriving value in a way that doesn't violate intellectual property or infringe on others' investments, while at the same time providing a level field for competition.


29.8.07

Thesis

Between the classes I've taken and conversations with a few professors and fellow students, my thesis topic is starting to solidify. I'm considering an online application to find complex target material using data mining, specifically natural language processing (NLP). I've found a worthy freeware application and a few other references that seem promising, but much more research is needed.

The past few years of my career involved extraction and reporting, sometimes with sophisticated regexes. This new application goes further: ideally it will find historical target material, given complex sets of characteristics. With the interesting information found, I then plan on using statistics to determine the likelihood of an event happening again in the future.
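I haven't settled on the statistics yet, but as a first rough sketch, one simple option would be to treat the events found in the historical material as arriving at a constant average rate (a Poisson model, which is purely my assumption here) and estimate the chance of at least one recurrence within some horizon. The function name and the numbers below are hypothetical.

```python
import math


def recurrence_probability(event_count, observed_years, horizon_years):
    """Chance of at least one event within the horizon, assuming a
    constant-rate Poisson process (an illustrative assumption)."""
    if observed_years <= 0:
        raise ValueError("observed_years must be positive")
    rate = event_count / observed_years  # average events per year
    return 1.0 - math.exp(-rate * horizon_years)


# Hypothetical numbers: 4 matching events found in 20 years of material.
p = recurrence_probability(event_count=4, observed_years=20, horizon_years=5)
print(f"{p:.3f}")  # prints 0.632
```

A real analysis would need to check whether a constant rate is even plausible for the events in question.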

The application will have to be built robustly enough to have multiple potential applications (e.g., maybe apply it to both engineering and scientific domains). I do have a target market in mind already, and since an SDM thesis has to be half technology and half business oriented, what I have at this point is a good start. More later.
