HTML parser for a web spider

My recent project needs a web spider: it automatically fetches web content and recursively follows the links it finds.
But it also needs to know the content's structure exactly, down to the tag level.
It has to run on Linux and Windows. Do you know of any open-source projects that fit these needs, or do you have any other suggestions?
Thanks.

Here is a StackOverflow question showing how to use a number of XML/HTML parsers in different languages. If you tell us what language you're using, I can be more specific, but your answer may already be in there.

It depends on what language you are developing in; try googling:
html parser languagename
hpricot is a good one for Ruby, for example.

I think the subject you need to learn is regular expressions.
Regular expressions are available on every platform and in every major language (Java, PHP, Python, C#, Ruby, JavaScript).
Using a regular expression, you can easily extract the content in whatever form you prefer.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Matches every anchor tag and captures the value of its href attribute.
Pattern p = Pattern.compile("<a\\s[^>]*href=\"([^\"]+?)\"[^>]*>");
Matcher m = p.matcher(pageContent);
while (m.find()) {
    System.out.println(m.group(1));
}
The code block above, written in Java, will find every anchor tag in a page and hand you each URL.
If you don't have enough time to learn regular expressions, the following reference may help you:
http://htmlparser.sourceforge.net/

How to add hyperlink on specific word

I am planning to develop a web application that will be used to publish articles. Now I want to know how content is written, something like a wiki, where certain specific words contain hyperlinks.
In the content below you can see that Hashtable, Java programmer and the JSTL foreach tag are marked as hyperlinks. Does any tool exist for this, or do I have to associate each hyperlink manually?
HashMap in JSP, or any other Map implementation e.g. <a>Hashtable</a>, I personally prefer (JSTL foreach tag for this). As a <a>Java programmer</a>, I often have urge to use Java code directly in JSP using scriptlet, but that's a bad coding practice, and one should always avoid that. Infact by smart use of expression language and JSTL core tag library, you can reduce lot of Java code from JSP. In our last post, we have seen example of JSTL foreach tag to loop over List, but not a Map, and that creates a doubt in one of m
The short answer is that you need to implement this in Javascript. The problem is horribly complicated because you need to iterate over a list of search_words while simultaneously walking through the body searching for longest matches. The longest match part is what makes this tricky: You want to match the text "Java Programmer", not just the word "Java".
You also want to ignore whitespace and capitalisation problems. I strongly recommend you build on an existing solution such as the open-source MediaWiki project to save reinventing a complicated wheel.
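To make the longest-match idea concrete, here is a minimal Python sketch (the keyword table and URLs are invented for illustration; a real site would load them from its database):

```python
import re

# Hypothetical keyword -> URL table; a real wiki/CMS would store this in its DB.
LINKS = {
    "java programmer": "/tags/java-programmer",
    "java": "/tags/java",
}

def auto_link(text, links):
    # Sort keywords longest-first so "Java programmer" beats plain "Java".
    keywords = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    # Replace each match with an anchor, preserving the original capitalisation.
    return pattern.sub(
        lambda m: '<a href="%s">%s</a>' % (links[m.group(0).lower()], m.group(0)),
        text)

print(auto_link("As a Java programmer I write Java.", LINKS))
```

This sketch still ignores the whitespace problem mentioned above (a phrase split across a newline won't match); replacing the spaces inside each keyword with \s+ before compiling would handle that.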

Which programming language is CSS / HTML defined by?

How did the developers of HTML and CSS define them? I mean, which programming language did they use?
Imagine I want to define a new HTML tag or a new CSS property, or even a new language.
For example, instead of using angle-bracket tags <>, I want to define a language that uses brackets [] and a new CSS-resembling syntax:
[foo style(bar: .....)]
How do they (the developers) do this? Which programming language do they use, and which approach do they follow?
P.S. 1: I'm not going to develop a new language; it is just a question.
P.S. 2: I couldn't find appropriate tags, so please be patient if this question doesn't fit the css, html & xml contexts.
Consider this:
You are reading English. You are able to understand the punctuation and the meaning of the words, and from those you can extract what the text means.
The same goes for HTML and browsers. HTML is itself a "markup language" (not a programming language), like English in our example. The web browser is like our brain: it understands the syntax (grammar and punctuation), extracts the meaning, and shows it to us.
So your question is more like asking "In what language is English written?".
As for how you would create something like HTML/CSS, you first need to understand the basics of the "Theory of Computation" and "Compiler Construction".
But to answer your question in brief: you need to create a dictionary (which defines the meaning of each and every word in your language) and then create a "parser" (like a web browser) which understands those meanings.
This being a very wide topic, I would like you to search the web for the two topics I mentioned above.
Hope I helped!
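To make the dictionary-plus-parser idea concrete, here is a toy Python sketch that recognises the bracket syntax from the question; the grammar ([tag style(property: value)]) is invented purely for illustration:

```python
import re

# Toy grammar: [tag style(property: value)]
TAG = re.compile(r"\[(\w+)\s+style\((\w+):\s*([^)]*)\)\]")

def parse(source):
    """Turn one bracket-tag into a small tree (a dict), like a browser builds a DOM."""
    m = TAG.fullmatch(source)
    if not m:
        raise SyntaxError("not a valid tag: %r" % source)
    name, prop, value = m.groups()
    return {"tag": name, "style": {prop: value.strip()}}

print(parse("[foo style(bar: red)]"))
```

A real language would of course need a full grammar and a proper parser rather than one regular expression, which is exactly what the compiler-construction material covers.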

How to extract meaningful text from HTML

I would like to parse an HTML page and extract the meaningful text from it. Does anyone know some good algorithms for this?
I develop my applications on Rails, but I think Ruby is a bit slow at this, so if a good C library for it exists, that would be appropriate.
Thanks!!
PS: Please do not recommend anything in Java
UPDATE:
I found this link text
Sadly, it is in Python.
Use Nokogiri for Ruby; it is fast and written in C.
(Using regexp to parse recursive expressions like HTML is notoriously difficult and error prone and I would not go down that path. I only mention this in the answer as this issue seems to crop up again and again.)
With a real parser like for instance Nokogiri mentioned above, you also get the added benefit that the structure and logic of the HTML document is preserved, and sometimes you really need those clues.
Solutions integrating with Ruby
Use Nokogiri, as recommended by Amigable Clark Kant
Use Hpricot
External Solutions
If your HTML is well-formed, you could use the Expat XML Parser for this.
For something more targeted toward HTML-only, the W3C actually released the code for the LibWWW, which contains a simple HTML parser (documentation).
Lynx is able to do this. This is open source if you want to take a look at it.
You should strip every angle-bracketed part from the text and then collapse the whitespace.
In theory, literal < and > should not appear outside of tags; pages should contain the entities &lt; and &gt; instead.
Collapsing whitespace: convert every tab, newline, etc. to a space, then replace each sequence of spaces with a single space.
UPDATE: and you should only start after finding the <body> tag.
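A sketch of that recipe in Python (stdlib only; as noted elsewhere in this thread, a regex-based approach is fragile compared to a real parser):

```python
import re
from html import unescape

def meaningful_text(html):
    # Start only after the <body> tag.
    m = re.search(r"<body[^>]*>", html, re.IGNORECASE)
    if m:
        html = html[m.end():]
    # Drop script/style blocks entirely; their contents are not "meaningful" text.
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Strip every angle-bracketed part, then decode entities such as &lt;.
    text = unescape(re.sub(r"<[^>]+>", " ", html))
    # Convert tabs/newlines to spaces and collapse runs of spaces to one.
    return re.sub(r"\s+", " ", text).strip()

print(meaningful_text("<html><body>\t<p>Hello,\n <b>world</b>!</p></body></html>"))
```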

Parsing Random Web Pages

I need to parse a bunch of random pages and add them to a DB. I am thinking of using regular expressions, but I was wondering if there are any 'special' techniques (other than looking for content between known text/tags). The content is usually (but not always) shaped like:
Some Title
Text related to Title
I guess I don't need to extract the complete text, but I need some way to know where the title/paragraph starts and to extract the content from there. The content itself may contain images/links that I would like to retain.
Thanks!
Please see this answer: RegEx match open tags except XHTML self-contained tags
Use Python. http://www.python.org/
Use Beautiful Soup. http://www.crummy.com/software/BeautifulSoup/
You need to use a proper HTML parser, and extract the elements you’re interested in via the parser’s API (or via the DOM).
Since I don’t know what language you’re programming in, it’s rather difficult to recommend a parser, but some well known ones are Jericho for Java, and Beautiful Soup for Python.
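For example, with Python's standard-library html.parser (the h2/p structure below is my guess at the "Title / Text related to Title" shape from the question, not something the question states):

```python
from html.parser import HTMLParser

class TitleTextExtractor(HTMLParser):
    """Collects (title, related-text) pairs; assumes <h2>/<p> markup."""
    def __init__(self):
        super().__init__()
        self.in_tag = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            self.in_tag = tag
            if tag == "h2":
                self.pairs.append(["", ""])  # start a new (title, text) record

    def handle_endtag(self, tag):
        if tag == self.in_tag:
            self.in_tag = None

    def handle_data(self, data):
        if self.in_tag and self.pairs:
            self.pairs[-1][0 if self.in_tag == "h2" else 1] += data

extractor = TitleTextExtractor()
extractor.feed("<h2>Some Title</h2><p>Text related to Title</p>")
print(extractor.pairs)
```

Beautiful Soup gives you the same thing with far less ceremony, but the sketch shows the idea: the parser, not a regex, finds the element boundaries for you.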

Win32: How to scrape HTML without regular expressions?

A recent blog entry by Jeff Atwood says that you should never parse HTML using regular expressions - yet it doesn't give an alternative.
I want to scrape search results, extracting these values:
<div class="used_result_container">
...
...
<div class="vehicleInfo">
...
...
<div class="makemodeltrim">
...
<a class="carlink" href="[Url]">[MakeAndModel]</a>
...
</div>
<div class="kilometers">[Kilometers]</div>
<div class="price">[Price]</div>
<div class="location">
<span class='locationText'>Location:</span>[Location]
</div>
...
...
</div>
...
...
</div>
...and it repeats
You can see the values I want to extract, [enclosed in brackets]:
Url
MakeAndModel
Kilometers
Price
Location
Assuming we accept the premise that parsing HTML with regular expressions:
is generally a bad idea
rapidly devolves into madness
What's the way to do it?
Assumptions:
native Win32
loose html
Assumption clarifications:
Native Win32
.NET/CLR is not native Win32
Java is not native Win32
perl, python, ruby are not native Win32
assume C++, in Visual Studio 2000, compiled into a native Win32 application
Native Win32 applications can call library code:
copied source code
DLLs containing function entry points
DLLs containing COM objects
DLLs containing COM objects that are COM-callable wrappers (CCW) around managed .NET objects
Loose HTML
xml is not loose HTML
xhtml is not loose HTML
strict HTML is not loose HTML
Loose HTML implies that the HTML is not well-formed XML (strict HTML is not well-formed XML anyway), and so an XML parser cannot be used. In reality I am presenting the assumption that any HTML parser must be generous in the HTML it accepts.
Clarification#2
Assuming you like the idea of turning the HTML into a Document Object Model (DOM), how then do you access repeating structures of data? How would you walk a DOM tree? I need a DIV node that has a class of used_result_container, which has a child DIV with a class of vehicleInfo. But the nodes don't necessarily have to be direct children of one another.
It sounds like I'm trading one set of regular expression problems for another. If they change the structure of the HTML, I will have to re-write my code to match - as I would with regular expressions. And assuming we want to avoid those problems, because those are the problems with regular expressions, what do I do instead?
And would I not just be writing a regular-expression parser for DOM nodes? I would be writing an engine to parse a string of objects, using an internal state machine and forward and back capture. No, there must be a better way - the way that Jeff alluded to.
I intentionally kept the original question vague, so as not to lead people down the wrong path. I didn't want to imply that the solution, necessarily, had anything to do with:
walking a DOM tree
xpath queries
Clarification#3
The sample HTML I provided was trimmed down to the important elements and attributes. The mechanism I used to trim it was based on my internal bias toward regular expressions: I naturally think that I need various "sign-posts" in the HTML to look for.
So don't confuse the presented HTML for the entire HTML. Perhaps some other solution depends on the presence of all the original HTML.
Update 4
The only proposed solutions seem to involve using a library to convert the HTML into a Document Object Model (DOM). The question then would have to become: then what?
Now that I have the DOM, what do I do with it? It seems that I still have to walk the tree with some sort of regular-expression-like DOM matcher, capable of forward matching and capture.
In this particular case I need all the used_result_container DIV nodes which contain vehicleInfo DIV nodes as children. Any used_result_container DIV nodes that do not contain a vehicleInfo child are not relevant.
Is there a DOM regular expression parser with capture and forward matching? I don't think XPath can select higher level nodes based on criteria of lower level nodes:
\\div[#class="used_result_container" && .\div[#class="vehicleInfo"]]\*
Note: I use XPath so infrequently that I cannot make up hypothetical xpath syntax very goodly.
Python:
lxml - faster, perhaps better at parsing bad HTML
BeautifulSoup - if lxml fails on your input try this.
Ruby: (heard of the following libraries, but never tried them)
Nokogiri
hpricot
Though if your parsers choke, and you can roughly pinpoint what is causing the choking, I frankly think it's okay to use a regex hack to remove that portion before passing it to the parser.
If you do decide on using lxml, here are some XPath tutorials that you may find useful. The lxml tutorials kind of assume that you know what XPath is (which I didn't when I first read them.)
Edit: Your post has really grown since it first came out... I'll try to answer what I can.
i don't think XPath can select higher level nodes based on criteria of lower level nodes:
It can. Try //div[@class='vehicleInfo']/parent::div[@class='used_result_container']. Use ancestor if you need to go up more levels. lxml also provides a getparent() method on its search results, and you could use that too. Really, you should look at the XPath sites I linked; you can probably solve your problems from there.
how then do you access repeating structures of data?
It would seem that DOM queries are exactly suited to your needs. XPath queries return you a list of the elements found -- what more could you want? And despite its name, lxml does accept 'loose HTML'. Moreover, the parser recognizes the 'sign-posts' in the HTML and structures the whole document accordingly, so you don't have to do it yourself.
Yes, you still have to do a search on the structure, but at a higher level of abstraction. If the site designers decide to do a page overhaul and completely change the names and structure of their divs, then that's too bad: you have to rewrite your queries, but it should take less time than rewriting your regex. Nothing will do it automatically for you, unless you want to write some AI capabilities into your page-scraper...
I apologize for not providing 'native Win32' libraries, I'd assumed at first that you simply meant 'runs on Windows'. But the others have answered that part.
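As one concrete answer to "then what?": with an event-driven parser you walk the tree once, keyed off those class-name sign-posts. A sketch with Python's stdlib html.parser, purely to show the shape of the traversal (the record field names are mine, and this is obviously not a native Win32 solution):

```python
from html.parser import HTMLParser

class ResultScraper(HTMLParser):
    """Collects one record per used_result_container div in the sample markup."""
    def __init__(self):
        super().__init__()
        self.stack = []      # class attributes of the currently open divs
        self.current = None  # record being built
        self.field = None    # record field the next text chunk belongs to
        self.results = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if tag == "div":
            self.stack.append(cls)
            if cls == "used_result_container":
                self.current = {}
            elif cls in ("kilometers", "price") and self.current is not None:
                self.field = cls
        elif tag == "a" and cls == "carlink" and self.current is not None:
            self.current["url"] = dict(attrs).get("href", "")
            self.field = "make_and_model"

    def handle_data(self, data):
        if self.current is not None and self.field and data.strip():
            self.current[self.field] = self.current.get(self.field, "") + data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            cls = self.stack.pop()
            if cls in ("kilometers", "price"):
                self.field = None
            elif cls == "used_result_container" and self.current:
                self.results.append(self.current)
                self.current = None
        elif tag == "a":
            self.field = None
```

scraper = ResultScraper(); scraper.feed(page) leaves one dict per listing in scraper.results. If the site redesigns its markup you still have to update the class names, but that is the whole change - which is the practical advantage over a regex.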
Native Win32
You can always use IHTMLDocument2. This is built into Windows at this point. With this COM interface, you get native access to a powerful DOM parser (IE's DOM parser!).
Use Html Agility Pack for .NET
Update
Since you need something native/antique, and the markup is likely bad, I would recommend running the markup through Tidy and then parsing it with Xerces.
Use Beautiful Soup.
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. There's also a Ruby port called Rubyful Soup.
If you are really on Win32 you can use a tiny, fast COM object to do it.
Example code in VBScript:
Set dom = CreateObject("htmlfile")
dom.write("<div>Click for <a href='http://www.google.com'><img src='http://www.google.com/images/srpr/logo1w.png'>Google</a></div>")
WScript.Echo(dom.Images.item(0).src)
You can also do this in JScript, or in VB/Delphi/C++/C#/Python etc. on Windows. It uses the mshtml.dll DOM layout engine and parser directly.
The alternative is to use an HTML DOM parser. Unfortunately, it seems that most of them have problems with poorly formed HTML, so you additionally need to run the page through HTML Tidy or something similar first.
If a DOM parser is out of the question - for whatever reason,
I'd go for some variant of PHP's explode() or whatever is available in the programming language that you use.
You could, for example, start by splitting on <div class="vehicleInfo">, which would give you one chunk per result (remember to ignore the first piece). After that you could loop over the results and split each one on <div class="makemodeltrim">, etc.
This is by no means an optimal solution, and it will be quite fragile (almost any change in the layout of the document would break the code).
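In Python the same split-based hack looks like this (the page content is a made-up two-result fragment; as said above, any layout change breaks it):

```python
# Made-up fragment with two results; str.split plays the role of PHP's explode().
page = ('<div class="vehicleInfo"><div class="makemodeltrim">A</div></div>'
        '<div class="vehicleInfo"><div class="makemodeltrim">B</div></div>')

chunks = page.split('<div class="vehicleInfo">')[1:]  # ignore everything before the first result
models = [c.split('<div class="makemodeltrim">')[1].split('</div>')[0] for c in chunks]
print(models)
```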
Another option would be to go after some CSS selector library like phpQuery or similar for your programming language.
Use a DOM parser
e.g. for java check this list
Open Source HTML Parsers in Java (I like to use cobra)
Or, if you are sure that you only want to parse a certain subset of your HTML, one which ideally is also valid XML, you could use an XML parser to parse only the fragment you pass in, and then even use XPath to request the values you are interested in.
Open Source XML Parsers in Java (e.g. dom4j is easy to use)
I think libxml2, despite its name, also does its best to parse tag soup HTML. It is a C library, so it should satisfy your requirements. You can find it here.
BTW, another answer recommended lxml, which is a Python library, but is actually built on libxml2. If lxml worked well for him, chances are libxml2 is going to work well for you.
How about using Internet Explorer as an ActiveX control? It will give you a fully rendered structure as it viewed the page.
The HTML::Parser and HTML::Tree modules in Perl are pretty good at parsing most typical so-called HTML on the web. From there, you can locate elements using XPath-like queries.
What do you think about IHTMLDocument2?
I think it should help.