Monday, January 31, 2011

How to Separate Water from Milk?

I don't want our application to run like Google or any other search engine. I don't think the Butterfly search engine should crawl every web page on the internet. It's a waste of time.

I would like to crawl only pages which are useful (milk). We don't want our users to reach useless/fake/false/misleading pages (water) on any website.

How could this be done?

I like websites such as Wikipedia, BBC, Gizmag, Mashable, Alibaba and Dmoz, which are reliable & original content generators. If we crawl similar sites to a depth of just 2 levels, we could have a huge database. I'm not a great fan of social bookmarking sites or vertical search engines.
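As a rough illustration of the "trusted sites, 2 levels deep" idea, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not the Butterfly crawler itself. The names `crawl`, `LinkExtractor` and the `fetch` callable are hypothetical, and I am assuming that "2 levels" means following links at most two hops from a seed page and staying on the seed's own host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_depth=2):
    """Breadth-first crawl from trusted seeds, staying on the seeds' hosts.

    `fetch` is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict mapping each discovered URL to the depth it was found at.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    seen = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # page is recorded, but its links are not followed
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]  # absolute URL, no fragment
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen[link] = depth + 1
                queue.append(link)
    return seen
```

Passing `fetch` in from outside keeps the sketch independent of any particular HTTP library; a real crawler would also need politeness delays and robots.txt handling, which are left out here.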

Butterfly crawlers will continuously crawl these kinds of sites. At this point, don't think that the Butterfly project is just like any other search engine. What we are discussing is just the tip of the iceberg. So be patient!

Another way of refining what we crawl is to ask people to submit content and moderate it (automatically, of course). We don't want each & everything that anybody on the net submits, so we run an algorithm to decide the popularity of the content. Note that I said "popularity", not "higher page rank" or "a page with a nice set of keywords". I just want to check whether people like the page, and how much. I don't want pages that are cleverly coded, stuffed with keywords, with thousands of incoming & outgoing links.. no!! We are not Google :)
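The post does not say how "popularity" would be computed, so the following is only one possible sketch. It assumes each submission collects like and dislike votes (my assumption, not something decided yet) and scores it with the lower bound of the Wilson score interval, a standard statistical technique that rewards both a high share of likes and a large number of votes. The function name `likability` is hypothetical.

```python
import math


def likability(likes, dislikes, z=1.96):
    """Lower bound of the Wilson score interval for the share of likes.

    Returns a value between 0 and 1. z=1.96 corresponds to roughly 95%
    confidence. Submissions with no votes score 0.0.
    """
    n = likes + dislikes
    if n == 0:
        return 0.0
    p = likes / n  # observed share of likes
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)
```

With this scoring, a page with 90 likes and 10 dislikes ranks above a page with 9 likes and 1 dislike, even though both have the same share of likes, because more people have confirmed the verdict. Keywords and link counts play no part in it.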

Now we want our own pages, like Twitter. Twitter creates a web page of its own for each tweet, which you reach by clicking on the tweet's timestamp.

Why do we want Butterfly pages?

Let's first describe a butterfly. A butterfly is a set of information you want to share with others, e.g. Joe's Pizza in Dyersville, IA. A typical butterfly will have a location map, contact details, a menu card image, review links/text, what happened there last Friday (related news) etc. Of course, you could investigate further by clicking on links on this page. That's pure milk :)
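To make the description above concrete, here is a minimal sketch of how a butterfly might be represented as a record. The class and field names (`Butterfly`, `Link`, `menu_image_url` and so on) are hypothetical, chosen only to mirror the items listed above, and the sample data is placeholder data.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    title: str
    url: str


@dataclass
class Butterfly:
    """One page of user-contributed information about a single subject."""

    title: str
    location: str = ""                                # place shown on the map
    contact: dict = field(default_factory=dict)       # e.g. {"phone": "..."}
    menu_image_url: str = ""
    reviews: list = field(default_factory=list)       # list of Link
    related_news: list = field(default_factory=list)  # list of Link

    def links(self):
        """All URLs a reader could follow from this page to investigate further."""
        urls = [self.menu_image_url] if self.menu_image_url else []
        urls += [item.url for item in self.reviews + self.related_news]
        return urls


# Placeholder data only; the URLs and phone number are made up.
joes = Butterfly(
    title="Joe's Pizza",
    location="Dyersville, Iowa",
    contact={"phone": "555-0100"},
    menu_image_url="https://example.com/joes/menu.jpg",
    reviews=[Link("A visitor's review", "https://example.com/reviews/1")],
    related_news=[Link("Last Friday at Joe's", "https://example.com/news/1")],
)
print(joes.links())
```

Keeping everything about one subject in a single record is what lets one page show the map, contacts, menu, reviews and news together.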

Google does the same thing with local searches. What are we doing differently? I say we don't use Google's algorithm. We are just creating one page with all the user-contributed information.

It's not always possible to get the information you are looking for from your favorite search engine (in my case it's Google, for now). Let's call that information Caterpillars of Google.

I should conclude this post by saying we are not building Google.com; we want to convert Google's caterpillars into information Butterflies & make our users happy :)
