I am well aware that parsing HTML with regex has many caveats and vociferous opponents. So rather than trying to re-invent the wheel, I'm looking for a tool I can point at a web page and say "Get me the comments, b*tch".
Anyone able to advise?
I was reading some OWASP documentation or a security blog, and I'm almost certain I saw a tool that performs this task. Google has been zero help, unfortunately.
Cheers
If you want a Java solution, try HTMLParser and look for RemarkNodes.
Hmmm... I think a Google search with the OS you use and some clever keywords will give you all you want. For UNIX-based systems, look at: parse HTML with sed and Perl.
For Windows, I think you can find something with VBS (VBScript).
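If a short script is acceptable instead of a dedicated tool, a minimal sketch using Python's standard-library html.parser (the file name is a placeholder) will pull the comments out of a page:

from html.parser import HTMLParser

class CommentExtractor(HTMLParser):
    # Collects the text of every <!-- comment --> in the markup.
    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data.strip())

extractor = CommentExtractor()
with open("page.html", encoding="utf-8") as f:  # placeholder file name
    extractor.feed(f.read())
print(extractor.comments)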
I want to write a little program that transforms my TeX files into HTML. I want to parse the documents and turn the macros (the built-in ones and of course my own) into HTML pieces. Here are my requirements:
predefined rules (e.g. \begin{itemize} \item text \end{itemize} => <br> <p>text</p> <br/>)
defining my own CSS style
ability to convert formulas (extract the formulas, load them into an image creator, and then save the JPG/PNG)
easy to maintain and concise
I know there are several technologies out there, but I don't know exactly which is best for me. Here are the ones that come to mind:
Ruby (I/O is easy, formula loading via Webrat),
XML/XSLT (I don't think I need it; it's just overhead),
Perl (there are many libraries out there, but I'm not very familiar with it),
bash (I worked with sed and was surprised how easy regular expressions were to work with),
latex2html etc. (these converters won't work for me, and they don't give me freedom in parsing)
Any suggestions, hints and comments are welcome.
Thanks for your time, folks.
Have a look at pandoc; it can also be installed on Linux or OS X, though it won't handle your custom macros. The only thing I've seen that does a decent job with custom macros is tex4ht, but to work well it really needs you to be producing .DVI files. If you have a ton of custom macros, writing your own converter is going to take a huge amount of time; even with only a few, it's still going to be a pain. Good luck!
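To see why, here is a toy sketch in Python of the rule-based substitution approach the question describes (purely illustrative; the rule table and the single-word \item limitation are my own simplifications, not a real converter):

import re

# Illustrative rule table in the spirit of the question's example.
# To keep the toy short, \item only handles single-word bodies.
RULES = [
    (re.compile(r"\\begin\{itemize\}"), "<ul>"),
    (re.compile(r"\\end\{itemize\}"), "</ul>"),
    (re.compile(r"\\item\s+(\S+)"), r"<li>\1</li>"),
    (re.compile(r"\\textbf\{([^}]*)\}"), r"<b>\1</b>"),
]

def tex_to_html(source):
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source

print(tex_to_html(r"\begin{itemize} \item one \item two \end{itemize}"))
# -> <ul> <li>one</li> <li>two</li> </ul>

Every new macro means another rule, and anything with nesting or arguments quickly outgrows regular expressions, which is where the real work begins.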
Six: TeX
Seven: Haskell
(I gave up trying to persuade SO to start numbering my list from 6).
I wanted to use YAML, but there is not a single mature YAML library for Erlang. I know there are a few JSON libraries, but I was wondering which is the most mature?
Have a look at the one from mochiweb: mochijson.erl
1> mochijson:decode("{\"Name\":\"Tom\",\"Age\":10}").
{struct,[{"Name","Tom"},{"Age",10}]}
I prefer Jiffy. It works with binaries and is really fast.
1> jiffy:decode(<<"{\"Name\":\"Tom\",\"Age\":10}">>).
{[{<<"Name">>,<<"Tom">>},{<<"Age">>,10}]}
Can encode as well:
2> jiffy:encode({[{<<"Name">>,<<"Tom">>},{<<"Age">>,10}]}).
<<"{\"Name\":\"Tom\",\"Age\":10}">>
Also check out jsx. "An erlang application for consuming, producing and manipulating json. Inspired by Yajl." I haven't tried it myself yet, but it looks promising.
As a side note, I found this library through Jesse, a JSON schema validator by Klarna.
I use the JSON library provided by Yaws.
Edit: I actually switched over to Jiffy, see Konstantin's answer.
Trapexit offers a really cool search feature for Erlang projects.
Look up JSON there; you'll find about a dozen results. Check the dates of the latest revisions, the user ratings, and the project activity status.
UPDATE: I've just found a similar question on Stack Overflow. Apparently, they are quite happy with the erlang-json-eep-parser parser.
My favourite is mochijson2. The API is straightforward, it's fast enough for me (to be honest, I never actually bothered to benchmark it; I'm mostly encoding and decoding small packets), and I've been using it on a stable production server for about a year now. Just remember to install mochinum as well; mochijson2 uses it to encode large numbers, and if it's missing and you try to encode a large number, you'll get an exception.
See also: mochijson2 examples (stackoverflow)
I am fairly proficient in PHP, but just starting out in ASP.NET and JSP/Java.
I would like to learn JSP/ASP.NET XML-to-HTML transformation with some simple practical examples. I'm not looking to learn how to edit XML, just how to display it, but I'm having trouble finding definitive examples/tutorials.
I've spent quite a while studying JSP/ASP.NET, but I quickly found how vast they are and how many different ways there are to do this (quite frankly, I'm a bit overwhelmed). I would be really grateful for advice before I embark on this journey (and perhaps I will be saved from going in the completely wrong direction). If there are any tutorials, or especially example apps, you could point me towards, that would really help (I like hands-on learning).
For this I expect I need to do the following:
1) Set up a server for each technology (I'm using Tomcat and IIS at the moment - are these the best?)
2) Use some parameter-based routing system (MVC? But that is surely overkill for me)
3) Parse the XML and create some variables/objects
4) Display the HTML (use template libraries - JSTL? Not sure what the ASP.NET equivalent is)
Any tutorials/example apps you could point me towards to help me through the above steps will be truly appreciated.
Thank you
Ke
By the sounds of your skill set, carefully working through this developerWorks tutorial on JSTL looks like a good place for you to start. It covers the XML handling libraries around part 4, and it'll also help you avoid the mistake of using scriptlets where JSTL would give cleaner, less error-prone, and much more readable code.
You'll also most likely want IDE support, so that you get documentation, syntax checking, and autocomplete. I personally use Eclipse (the EE download will have everything you need and more), but NetBeans might be the most straightforward to get you started.
Tomcat will be fine, but these IDEs tend to have built-in web containers that save you time in deploying and testing.
I'm searching for a free application for commercial use that can find/replace across multiple files (regular expressions are nice but not a must) and supports opening and saving in UTF-8.
I tried a few, like BKReplaceEm, but it ends up saving all the files as ASCII, which causes some problems with web rendering.
Please advise.
[UPDATE] To further clarify, I am searching for a Windows utility.
[UPDATE #2] This is going to be used to run through our 450-page site and replace all French characters with the much-needed HTML entities.
Notepad++ supports this feature, and is a great little editor in its own right.
Edit: Actually, Notepad++ does support replace in files. Click Search -> Find in Files, then select "Replace in Files" in the dialog.
In the spirit of the previous answer, you can use Perl, which has seamless native Unicode support and whose regex capabilities are unparalleled. There are Windows Perl versions available (ActivePerl, Strawberry Perl, or you can use Cygwin), and you can even slap a GUI on top of it; for the latter, see the answers to my very recent SO question :)
Plus, Perl can grab pretty much any collection of files: globs for simple things, File::Find for more complicated cases, and grep on the resulting file list to refine further if you need something fancier, e.g. filtering by content or modification time.
UPDATE: For a Windows editor, you can use UltraEdit. It has a free evaluation period and, to be perfectly honest, I find the purchase price WELL worth paying for this very nice and powerful editor. Among its other features, it supports Unicode and has pretty fancy search/replace abilities, including Perl regex support and search/replace in multiple files.
Use sed.
jEdit has a feature called "HyperSearch" (just open the find dialog). You can specify a directory and a file name pattern, and jEdit (being based on Java) supports lots of different encodings (and is often smart enough to figure out the correct one).
You could try my editor, Code Trowel
If it doesn't do what you want I'd probably fix it :-)
For windows, Notepad++ is awesome. It's licensed under the GPL. It does search and replace in files and does support regular expressions.
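If a small script is an option rather than an editor, the French-characters-to-entities job from the update above is a few lines of Python (a sketch; the directory and file pattern are placeholders):

from pathlib import Path

# Placeholder root directory and glob; adjust to the real site layout.
for path in Path("site").rglob("*.html"):
    text = path.read_text(encoding="utf-8")
    # "xmlcharrefreplace" escapes every non-ASCII character as a numeric
    # HTML entity, e.g. "é" becomes "&#233;".
    escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
    path.write_text(escaped, encoding="utf-8")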
I would like to get data from different web pages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
If you're using Python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).
An extremely capable library, makes scraping a breeze.
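As a minimal sketch of the kind of thing it does (the URL and class name below are placeholders; the modern package installs as beautifulsoup4):

import urllib.request
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Placeholder URL and class name; point these at the real listing page.
html = urllib.request.urlopen("http://example.com/restaurants").read()
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(class_="address"):
    print(tag.get_text(strip=True))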
For .NET programmers, the HTML Agility Pack is awesome. It parses web pages into a DOM that can be queried with XPath.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every anchor that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
You can find it here: http://www.codeplex.com/htmlagilitypack
I think the general answer here is to use any language + HTTP library + HTML/XPath parser. I find that using Ruby + Hpricot gives a nice clean solution:
require 'rubygems'
require 'hpricot'
require 'open-uri'
sites = %w(http://www.google.com http://www.stackoverflow.com)
sites.each do |site|
  doc = Hpricot(open(site))
  # iterate over each div in the document (or use XPath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end
For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object modeled after a typical web browser (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).
For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.
Basically, you need to parse HTML, but more importantly, to automate a browser, to the point of being able to move the mouse and click, really mimicking a user. To get past captchas, you need a screen-capture program to grab them and send them off to decaptcha.com (which solves them for a fraction of a cent each); forget about saving the captcha file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen scraping, not HTTP-request scraping.
Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys into fields; sometimes this is necessary to set off the right JavaScript events) and a simple screen-capture utility for the captchas. This way you will be most successful; it's quite useless to write a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the JPEG; text will be returned.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong), and you can get to any tag in the page, although sometimes you need to tweak it.
What I did miss was 'reverse templates' (Selenium's robot framework has this). Perl has this in the CPAN module Template::Extract; very handy.
The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work every time.
Also, libraries that merely pretend to be user agents are of little use; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get beyond the captchas, and also to trigger the JavaScript events that need to fire for information to appear.
Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again, it's a Russian product from a company that makes no secret of its intentions.
I also think that one of these weeks Wiley has a new book coming out on scraping; that should be interesting. Good luck...
I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.
The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html
It can be useful as a base; you'd probably want to create your own module that fits your restaurant-mining needs.
LWP would give you a basic crawler for you to build on.
There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends that run on top of Ruby Mechanize and make things even easier.
What language do you want to use?
curl with awk might be all you need.
You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
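For instance, a sketch of that pipeline in Python, assuming the tidy command-line tool is installed (the file name is a placeholder):

import subprocess
import xml.etree.ElementTree as ET

# -asxhtml makes tidy emit well-formed XHTML; -numeric converts named
# entities into numeric ones that a plain XML parser understands.
result = subprocess.run(
    ["tidy", "-asxhtml", "-numeric", "-quiet", "page.html"],
    capture_output=True, text=True,
)
root = ET.fromstring(result.stdout)
# XHTML elements live in a namespace, so queries need a prefix mapping.
ns = {"x": "http://www.w3.org/1999/xhtml"}
for link in root.findall(".//x:a", ns):
    print(link.get("href"), link.text)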
I'd recommend BeautifulSoup. It isn't the fastest, but it copes really well with the malformed (X)HTML pages that most parsers choke on.
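For instance, a quick illustration of that tolerance (made-up sloppy markup):

from bs4 import BeautifulSoup

# Deliberately sloppy markup: unquoted attribute, unclosed <b> and <div>.
broken = "<div class=listing><b>Chez Marcel<br>12 Rue Cler"
soup = BeautifulSoup(broken, "html.parser")
print(soup.get_text(" "))  # -> Chez Marcel 12 Rue Cler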
What someone said: use ANY LANGUAGE. As long as you have a good parser library and an HTTP library, you are set. The tree-building approaches are slower than just using a good parsing library.