splogs

splogs

My blog search engine SproutSearch is now indexing over 8 million blogs. I am now working on changing the way the blogs are ranked. For now, they are sorted by the sheer amount of content they contain. I noticed a big problem with this method is that many spam blogs contain masses of content. I don't like SproutSearch linking to so much spam, so I need to find a way to remove a lot of these listings. It is not practical for me to read 8 million blogs, so I need to come up with an automated method to detect spam. Many spam blogs use the same words over and over. So I wrote a program to count the number of repeated words. Most spam blogs seem to use a similar number of words per post. I made another program that computes the standard deviation of the number of words in a post. Using these metrics, I will make a program that flags potential spam so I can review and delete it.