HTML parsing - Distinguishing features of a blog, i.e. the difference between a blog and a normal site
I'm looking for features that distinguish a blog from a general website. These have to be things that can be identified from the site's HTML, or special features that a blog supports, such as pings. The same goes for news websites.
I am working on a blog/news monitoring program that will automatically examine indexed sites to determine whether each one is a blog or a news site, and then monitor the posts, user comments, etc. from the sites identified as blogs or news sites.
That is why I want to know how I can identify these sites.
This is going to be a desktop app written in Java, so if you have code specific to Java, that would be great.
Thanks in advance
You can search the page for the word "blog", because it will probably be present. In particular, you can search for it only in certain parts of the HTML page, or skip parts such as links. This will give you a good starting point.
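As a rough illustration of the keyword check, here is a minimal Java sketch. It assumes the page's HTML is already available as a string, and it uses plain regular expressions rather than a real HTML parser, so it will misjudge malformed markup. The class and method names (BlogKeywordCheck, mentionsBlogOutsideLinks) are made up for this example.

```java
import java.util.regex.Pattern;

class BlogKeywordCheck {
    // Hypothetical helper patterns; a real HTML parser would be more robust.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1>");
    private static final Pattern ANCHORS =
            Pattern.compile("(?is)<a\\b[^>]*>.*?</a>");
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    // Substring match, so "blogs" and "weblog" also count.
    private static final Pattern BLOG =
            Pattern.compile("blog", Pattern.CASE_INSENSITIVE);

    /** Returns true if "blog" appears in the page text outside of links. */
    static boolean mentionsBlogOutsideLinks(String html) {
        // Drop scripts and styles, then whole links (text and href alike).
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = ANCHORS.matcher(text).replaceAll(" ");
        // Strip the remaining tags so attribute values are not searched.
        text = TAGS.matcher(text).replaceAll(" ");
        return BLOG.matcher(text).find();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>My travel blog</h1>"
                + "<a href=\"/about\">About</a></body></html>";
        System.out.println(mentionsBlogOutsideLinks(page));
    }
}
```

A hit here is only a hint, not proof: a news site can mention blogs in its own text, and a blog may never use the word.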
In the end, however, this is something that must be done manually. You should create an interface where people specify, when a site is submitted, whether it is a blog or a news site, along with its various features. After that, create a database of sites and features, and flag the entries so that you or another administrator can review them and make changes. Once you have done this for a site, you will not need to do it again; for example, all http://*.wordpress.com/ sites are going to be blogs.
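The "classify once, reuse afterwards" idea could look something like the sketch below. It keeps reviewed rules in memory, keyed by host suffix, so that a rule for wordpress.com covers every subdomain. The names SiteTypeRegistry, SiteType, addRule and lookup are hypothetical, and a real application would store the rules in a database rather than a map.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class SiteTypeRegistry {
    enum SiteType { BLOG, NEWS }

    // Reviewed rules: host suffix -> site type (hypothetical in-memory store).
    private final Map<String, SiteType> suffixRules = new LinkedHashMap<>();

    /** Registers a rule such as "wordpress.com" -> BLOG after manual review. */
    void addRule(String hostSuffix, SiteType type) {
        suffixRules.put(hostSuffix.toLowerCase(Locale.ROOT), type);
    }

    /**
     * Looks up a URL against the rules. Only subdomains of a suffix match,
     * mirroring the *.wordpress.com pattern. Note that URI.create throws
     * IllegalArgumentException for a malformed URL.
     */
    Optional<SiteType> lookup(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        host = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, SiteType> rule : suffixRules.entrySet()) {
            if (host.endsWith("." + rule.getKey())) {
                return Optional.of(rule.getValue());
            }
        }
        // Unknown site: this is where it would be queued for manual review.
        return Optional.empty();
    }

    public static void main(String[] args) {
        SiteTypeRegistry registry = new SiteTypeRegistry();
        registry.addRule("wordpress.com", SiteType.BLOG);
        System.out.println(registry.lookup("http://example.wordpress.com/"));
        System.out.println(registry.lookup("http://example.com/"));
    }
}
```

An empty result means no reviewed rule applies yet, so the site would go to the submission or review interface instead of being guessed at.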
There are some features that you can detect automatically, or at least with a high degree of probability, but in the end you will still need manual review.