I often find myself needing to do simple screen scraping for internal purposes (e.g. a third-party service I use only publishes reports via HTML). I have at least two or three cases of this now. I could use Apache HttpClient and write all the necessary screen-scraping code, but it takes a while. Here is my usual process:
Open up Charles Proxy, browse the web site, and see what's going on.
Start writing Java code with Apache HttpClient, dealing with cookies and multiple requests.
Use Jericho HTML Parser to handle parsing of the HTML (a sketch of what this ends up looking like follows).
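For a sense of scale, here is a minimal sketch of what steps 2 and 3 usually turn into; the URLs, the login step, and the table-based report layout are placeholder assumptions, not any particular service:

    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Source;

    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class ReportScraper {
        public static void main(String[] args) throws Exception {
            // The default client keeps a cookie store, so the session
            // cookie picked up by the first request is replayed on the second.
            try (CloseableHttpClient client = HttpClients.createDefault()) {
                // Request 1: hit the landing page to pick up session cookies.
                client.execute(new HttpGet("https://example.com/login")).close();

                // Request 2: fetch the report page itself.
                String html = EntityUtils.toString(
                        client.execute(new HttpGet("https://example.com/report"))
                              .getEntity());

                // Step 3: let Jericho pull the table cells out of the HTML.
                Source source = new Source(html);
                List<Element> cells = source.getAllElements(HTMLElementName.TD);
                for (Element cell : cells) {
                    System.out.println(cell.getTextExtractor().toString());
                }
            }
        }
    }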
I wish I could just "record my session" quickly and then parametrize the things that vary from session to session. Imagine just using Charles to grab all the HTTP requests and then parametrizing the relevant query-string or POST params. Voilà, I have a reusable HTTP script.
Is there anything that does this already? I remember when I used to work at a big company there was a tool we used called LoadRunner by Mercury Interactive that essentially had a nice way to record an HTTP session and make it reusable (for testing purposes). That tool, unfortunately, is very expensive.
HtmlUnit is a scriptable, headless browser written in Java. We use it for some extremely fault-heavy, complex web pages and it usually does a very good job.
To simplify things even more you can run it in Jython. The resultant program reads more like a transcript of how one might use a browser than hard work.
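As a rough illustration (the URL, form, and field names below are invented), driving a login form with HtmlUnit looks something like this; the Jython version reads as a near line-for-line translation:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlForm;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitDemo {
        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                // Load the page; HtmlUnit handles cookies and JavaScript.
                HtmlPage page = webClient.getPage("https://example.com/login");

                // Fill in and submit the (hypothetical) login form.
                HtmlForm form = page.getFormByName("login");
                form.getInputByName("username").setValueAttribute("me");
                form.getInputByName("password").setValueAttribute("secret");
                HtmlPage report = form.getInputByName("submit").click();

                // Dump the resulting page as plain text.
                System.out.println(report.asText());
            }
        }
    }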
You don't mention what you want to use this for. One solution is to simply "script" your web browser using a tool like Selenium, if having a web browser repeat your actions is an acceptable solution. You can use the Selenium IDE to record what you do and then alter the parameters, as in the sketch below.
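A hedged sketch of what such a recorded-then-parametrized session looks like with Selenium's Java WebDriver binding (the site and selectors are made up):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.firefox.FirefoxDriver;

    public class SeleniumReplay {
        public static void main(String[] args) {
            WebDriver driver = new FirefoxDriver();
            try {
                driver.get("https://example.com/reports");

                // The parametrized part: vary the query per run.
                driver.findElement(By.name("q")).sendKeys("march-report");
                driver.findElement(By.name("q")).submit();

                // Hand the resulting HTML to whatever parser you prefer.
                String html = driver.getPageSource();
                System.out.println(html.length() + " bytes of report HTML");
            } finally {
                driver.quit();
            }
        }
    }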
I wish I could just "record my session" quickly and then parametrize the things that vary from session to session.
If you have Visual Studio Test Edition, its web test feature does exactly that. If you aren't using VS or want a standalone tool, I have had great success with OpenSpan. It does more than just the web; it handles Windows apps and Java too!
Selenium would be my first pick, as the IDE lets you do a lot of things the easy way by "recording" a session for you. But if you're not happy with what it provides, you can also use the Python module Beautiful Soup to programmatically walk through a website.
Python and Perl both have a module called Mechanize (WWW::Mechanize for Perl) that makes it easy to drive browser behavior programmatically (filling out forms, handling cookies, etc.).
So, Python + BeautifulSoup (a great HTML/XML parser) + Mechanize (browser functions) = a super easy/fast scraper.
I used DOM Inspector to manually inspect the site of interest and parametrize its structure. Then simple Apache HttpClient and a hand-made parser driven by that parametrized structure. Basically I could extract any info from any site automatically with a little tweak of the parameters. It's similar to how a SAX parser works: all you need to tell it is at which sequence of tags it should start grabbing the data. For example, Google has a pretty standard format for search results, so you just run to the third occurrence of 'table' and start taking text from the first 'div' up until the closing '/div'.
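That positional "run to the Nth tag" idea can be sketched with Jericho in a few lines; the third-table/first-div offsets below simply mirror the example above, and a real site would need its own offsets:

    import java.util.List;

    import net.htmlparser.jericho.Element;
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Source;

    public class PositionalExtract {
        // Return the text of the first <div> inside the third <table>,
        // mirroring the "third 'table', first 'div'" recipe above.
        static String extract(String html) {
            Source source = new Source(html);
            List<Element> tables = source.getAllElements(HTMLElementName.TABLE);
            if (tables.size() < 3) return null;
            List<Element> divs = tables.get(2).getAllElements(HTMLElementName.DIV);
            return divs.isEmpty() ? null : divs.get(0).getTextExtractor().toString();
        }
    }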
Internet Explorer supports Browser Helper Objects (BHOs). They can access IE's HWND (window handle), and it's easy to scrape the pixels from there. The IWebBrowser2 COM interface also gives you access to the HTTP requests, and you can get back the parsed HTML document via IWebBrowser2::Document, which yields IHTMLDocument / IHTMLDocument2 / IHTMLDocument3.
Using Firefox, it should be possible to implement much of this with its powerful support for add-ons and extensions; however, that wouldn't really mean running "headless" but rather being a real scripted browser. Also, I seem to recall having read that Google's Chrome browser uses a similar technique for automated regression testing.
I can't personally vouch for it, but there is a free Firefox plugin: DejaClick
I installed it the other day and did some rudimentary recording, playback, and script-editing activities with it. It pulled them off without much of a learning curve. If your end goal is to show something in a web browser, then it should suffice.
They offer web transaction monitoring services, implying that you can export the scripts for other uses, but they may be too proprietary to use outside of your web browser / their paid service.
http://www.dejaclick.com/
How can I programmatically (through the API or through a maintenance script) perform an action on every page in a specified namespace?
For example, I would like to add every page in the main namespace to a specified category, for maintenance purposes.
I've had to do this on a few occasions.
It looks like the way to do things now is through the Pywikibot scripts. These are tools written in Python for automating tasks on MediaWiki sites. Unlike most of the MediaWiki documentation, the PWB docs are actually pretty thorough.
When I last had to add some text to every page, I couldn't find a bot that worked with my private instance running on my intranet, since most bots were written to work with Wikipedia.
What I ended up doing was just generating a list of page URLs and then feeding them to a custom script that used Selenium to automate a browser running on my machine. You can use Selenium from Java, Python, C#, and Ruby.
It's a pretty heavyweight approach though, since you actually have to be running a real browser to do the work.
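For the "generate a list of pages" step, the plain MediaWiki web API is enough on its own. A rough sketch in Java (the wiki URL is a placeholder, and a real script would parse the JSON and follow apcontinue to fetch the next batch):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class ListMainNamespacePages {
        public static void main(String[] args) throws Exception {
            // list=allpages with apnamespace=0 enumerates the main namespace.
            URL url = new URL("https://wiki.example.org/w/api.php"
                    + "?action=query&list=allpages&apnamespace=0"
                    + "&aplimit=500&format=json");
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // raw JSON containing the page titles
                }
            }
        }
    }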
Take a look at Pywikibot.
I want to develop a Delphi application with an HTML/CSS graphical user interface, not necessarily running from within a web browser. I want to do this to create a richer graphical user interface with animations etc. and break away from the normal VCL components / Windows look. Any suggestions?
HTML and CSS won't deliver animations or a rich user interface to your feet; far from it, in fact. You will need to invest in a toolkit to provide that sort of functionality, and it will almost certainly involve JavaScript. And even if you don't want your eventual application hosted in a web browser, your application will itself have to host a web browser to render your HTML/CSS/JavaScript UI, and you will then have a much more difficult job of connecting your GUI to your application logic (unless you do actually embrace a web application architecture).
Delphi (or any Windows application development language, for that matter) gets you much, much further down the road toward a simply, effectively, and quickly implemented rich user interface than HTML or CSS.
If you don't like the look and feel of the standard Delphi controls (which in essence is what you are saying) there are numerous alternative libraries available.
Also bear in mind, however, that when someone uses a Windows application they expect it to look and behave a certain way, to a large extent. Using fancy, web-based paradigms in a desktop application simply for the sake of it is likely to confuse and frustrate users if taken too far.
I'm all for user interfaces breaking with convention where it leads to a more intuitive user interface, but simply being "prettier" does not necessarily lead in that direction and is just as misguided as dogmatically adhering to convention.
In one of my applications I have an embedded browser, and I have implemented the IDocHostUIHandler interface. This allows me to expose a COM object via the "GetExternal" method. I simply have a COM object that exposes methods and properties of my application, which makes them available to the web pages hosted inside the embedded browser.
So the script in my web pages has lines like "external.DoSomething()" and "i=external.GetThisValue()". So, for example, behind button onclick events you can run a method of your application (implemented in the main form, in the COM object itself, or wherever you like).
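(For readers outside Delphi: the same host-object bridge pattern exists in other embedded-browser stacks. Purely as an analogous illustration, not the IDocHostUIHandler mechanism itself, here is a sketch of the idea using Java's JavaFX WebView:)

    import javafx.application.Application;
    import javafx.concurrent.Worker;
    import javafx.scene.Scene;
    import javafx.scene.web.WebEngine;
    import javafx.scene.web.WebView;
    import javafx.stage.Stage;
    import netscape.javascript.JSObject;

    public class HostBridgeDemo extends Application {
        // The host object exposed to page script, playing the role of the
        // COM object returned from GetExternal in the Delphi setup above.
        public static class External {
            public void doSomething() { System.out.println("called from page"); }
            public int getThisValue() { return 42; }
        }

        private final External external = new External(); // keep a strong reference

        @Override
        public void start(Stage stage) {
            WebView view = new WebView();
            WebEngine engine = view.getEngine();
            engine.getLoadWorker().stateProperty().addListener((obs, old, now) -> {
                if (now == Worker.State.SUCCEEDED) {
                    // Publish the host object to the page as window.external.
                    JSObject window = (JSObject) engine.executeScript("window");
                    window.setMember("external", external);
                }
            });
            engine.loadContent("<button onclick=\"external.doSomething();"
                    + " alert(external.getThisValue())\">Run</button>");
            stage.setScene(new Scene(view, 320, 120));
            stage.show();
        }

        public static void main(String[] args) { launch(args); }
    }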
This site has lots of info on embedding a browser in your Delphi app:
http://www.delphidabbler.com/articles?article=22
It can certainly be cumbersome to implement a lot of this stuff and in many cases there are probably better options. But for my specific purpose I am able to offer a "home page" which can easily be modified to change its layout, look and even expose more (or less) functionality as required by myself or my users.
If you want a Delphi program with a better-looking interface, HTML is really not what you're looking for. What you really need are better-looking VCL controls.
Take a look at TMS Smooth Controls, for example. If you're on Delphi 2009 or 2010, you can get it as a free download here. That's one of many component libraries that can bring a slicker user interface to your program.
HTML / CSS offers some nice features which are (not yet?) available in Delphi and the VCL. They are also a good starting point for client/server programming; separating the user interface from the business logic is a key factor here.
One popular library for Delphi is the extpascal project:
ExtPascal is an Object Pascal (Delphi, FreePascal/Lazarus) wrapper/binding for Ext JS, a complete GUI Ajax framework, made in JavaScript, for Rich Internet Application (RIA) development. ExtPascal lets you use Ext JS from Object Pascal commands issued by the server. It brings the structure and strict syntax of Object Pascal to programming the web browser. ExtPascal will wrap Draw2d in future releases.
Some demos are online here and here.
P.S. I really like the HTML / CSS support for element and font sizes in relative units (for example, percent). Combined with browser zoom in / zoom out and WCAG, user interface ergonomics can hardly be better.
HTML Option 1
If you really want to use HTML+CSS(+JavaScript) to build a GUI, you can have a look at HTML Applications, a very fascinating concept from Microsoft. HTML Applications, .hta files, have been supported since Windows ME, if I remember correctly, and they are still supported on Windows 7.
You could create an HTML Application (i.e. an HTA file), thereby creating a GUI using only HTML, CSS, and JavaScript. When the user double-clicks the HTA file, it will open like a program, but the GUI is entirely based on HTML; in fact, the entire window is an Internet Explorer window in disguise.
And now comes the important part: you could create non-GUI Delphi applications (i.e., Delphi applications that are not console applications, but that have no forms either), and start them via hyperlinks (or JavaScript) from within your HTA GUI. (Well, it is probably better to create one such Delphi application and use command-line arguments (ParamStr) to communicate the desired action.)
Just an idea...
HTML Option 2
Alternatively, you could create a normal Delphi GUI application, but fill the entire main form with a TWebBrowser (an IE control), using Align := alClient. Then you could either load static HTML pages (stored in the Program Files folder or on the Internet), or you could use Delphi to dynamically create HTML pages to show. I think it is possible to intercept links from the control, so that you could respond to links using Delphi code.
What about OpenGL?
If you want to "break away" from the normal Windows look and feel, then I would recommend creating your GUI using OpenGL. It is very easy to make a Delphi application with OpenGL (as long as you are familiar with OpenGL): just add "OpenGL" to your uses list.
First this: I completely agree with Deltics' answer.
Having said that, if you master HTML and CSS (and JavaScript and AJAX, etc.) and you are looking for a way to use the power and speed of the Delphi compiler to run the dynamics of a website, this may be of interest.
I've created a project that uses the Delphi compiler to build a library that runs a website. The source files combine HTML and Delphi, much like other web-scripting tools out there, but get processed on a page refresh and compiled automatically. It uses a 'library handler' that plugs the website library into pretty much anything you like: IIS, Apache, a stand-alone HTTP server (for hosting), or directly into Internet Explorer or Firefox (which is great for developing).
http://xxm.sourceforge.net/
Newer versions of Qt include the ability to use HTML/JS for interfaces. I don't know whether there are Qt bindings for Delphi, but Qt is exactly what you want.
For a rich GUI and animation, have you looked at KSDev's DXScene and VGScene?
If you want to keep your Delphi/Object Pascal background and have a web-like RIA, you could also have a look at Morfik.
I was pretty surprised to find out that the raw sources of my little open source project are getting downloaded more often than the compiled, ready-to-use library (a jar file in this case, platform independent). I wonder what the reasons behind that are. Lack of trust? Curiosity? Compiling with custom settings? Attaching sources for debugging?
Personally I usually don't bother downloading and looking at sources unless something is not working or I don't understand how it works.
I often download sources just to see how other people have implemented certain things. Reading (and understanding) other people's source code is a good way of becoming a better programmer yourself.
As for the relatively high number of downloads, perhaps your library is included in other projects, like a Linux distribution? Such projects usually download and build from source themselves so that they can properly package it.
The first reason would be for customizing applications.
Also, it's not good practice to download some code and use it straight away without looking at how it works. There will be something for you to learn from the code.
Also, you might not need the whole functionality of the project. If the project is too big and you only need some of its functionality, it can be a great idea to trim the project to your needs and then use it.
For every piece of software of long term interest for my company, I look at the sources to assess the quality. The rationale behind it is that badly written software is usually also bad to use and maintain and thus a business risk in the long term.
Even with most commercial software like ERP systems, it is no problem to get a look at the source. Only for COTS products (say, MS Office) is it hard to get the source.
I also check source for every hiring decision.
Another reason why you see so many source downloads might be automated build systems like FreeBSD Ports, which download and compile automatically.
I look at the source just to learn how the program works.
As silly as it might seem, open source software (such as open source CRMs) is notorious for its lack of documentation. Often the only way to find out how it works is to experiment with it. When even experimenting fails, it's time to fire up your IDE and read the source!
Maybe the answer will be disappointing, but the relatively high number of source downloads could mean that the application is packaged in a port-based distribution like Gentoo, FreeBSD or MacPorts where every package is downloaded and compiled on a local machine during installation.
If it's a framework, I always download sources. I use them for debugging and to see how they've implemented certain things. If it's a standalone application, I generally don't look at the source unless there is a problem or the application does something unique.
As you say your binary is a jar, it sounds like it is a Java library (rather than an application). Developers often use the source to include it in the IDE, so they can debug into the library and look up certain functions. Also, many developers include the sources in their build process to compile the dependencies as well. That may be an explanation.
The number one reason is compiler settings. You can't imagine the amount of pain caused by linking a static library compiled with incompatible settings. Compiling it yourself with verified settings simplifies life greatly. Plus, when you decide to switch to a better compiler, you don't need the old static library; it will be compiled by the new compiler too.
The number two reason could be that people want to see how some things work inside. For example, they want the same or similar functionality in their commercial closed-source project and can't just borrow the code because of a viral license. However, they can see how it works and get inspired; that's why they download the source and read it.
I have downloaded libraries and compiled them myself without actually looking at the code. When I use a library it is good to know that I can make changes and have the source on hand. I have on occasion taken just a file or two when it is a massive library and I only need a single piece of functionality from it.
Some reasons could be:
Distrust of binary downloads due to Trojans, etc.
Taking a look at how you've implemented something
Checking out the quality of your code :)
Since this is a library, the need for comprehensive documentation is much higher than for a standalone app. I often find myself looking at a library's code to figure out things sometimes left out of the docs, e.g. the time/space complexity of certain functions.
We use some open source packages for our commercial application. I always download and build from source.
If our hosting platform changes in the future, it might change to something that does not have a precompiled binary. I want to be able to use the same package/version on the new platform.
If the package goes dormant or becomes unsupported, I want to be able to apply a change or fix if absolutely necessary.
If something is going wrong on the server (memory leak, CPU spike, etc.), I want to be able to add logging or instrumentation code to identify or eliminate the package as the source of the problem.
I can of course only answer for myself, but it is not seldom that I download the binaries (assuming I trust the project, which is usually the case) and then download the sources when I debug. But I have a tendency to delete the sources when I think I'm done with them, and since you are never really done, I may have to re-download them later, which inflates the source download count.
Let's say I woke up today and wanted to create a clone of StackOverflow.com and reap the financial windfall of millions of $0.02 ad clicks. Where do I start?
My understanding of web technologies is:
HTML is what is ultimately displayed
CSS is a mechanism for making HTML look pleasing
ASP.NET lets you add functionality using .NET(?)
JavaScript does stuff
AJAX does asynchronous stuff
... and the list goes on!
To write a good website, do I just need to buy seven books and read them all? Are Web 2.0 sites really the synergy of all these technologies?
Where does someone go to get started down the path to creating professional-looking web sites, and what steps are there along the way?
While I have built my knowledge largely by using the internet to search out what I want to know (w3schools.com helped a lot, as did A List Apart), a few good books have helped me along the way, though they have been platform/language-specific, so I'll avoid mentioning them unless someone is curious. For me, at least, having a book open so that I don't have to resize windows or switch between them is very valuable.
The first part of your list is OK, but the last few items need tweaking. ASP.NET adds server-side functionality (for the most part) to your application. That functionality lives outside the browser and is thus quite powerful and easily shared with a variety of end users.
The problem (some say) with server-side processing is that your application must make a new HTTP request when you ask for an action to be performed. So if you click on a link to a page that yields a new set of data, you don't get instant results: the page reloads, or a separate page loads.
Javascript solves this to a degree--it allows you to respond to user input instantaneously. Do you want to display the sum of two numbers when the user clicks a button? You can do it with Javascript.
The problem with Javascript is that it can't talk directly to databases, or explore your server's file system, or do other stuff like that. It lives in the browser, period.
AJAX bridges the gap between your user's browser and your server. With AJAX, Javascript makes the HTTP request without refreshing your page or loading a new one. Javascript talks to a server-side script (not necessarily ASP, either; it works with PHP, Rails, ColdFusion, etc.) and sends and receives information. And because Javascript isn't dependent on page loads, a quick, snappy AJAX script can almost give the feeling of a common desktop application, in which you don't have to wait for HTTP requests when performing simple actions on your application's data.
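To make the sum-of-two-numbers example concrete, here is a minimal sketch of the server half as a Java servlet (the /add mapping and parameter names are invented); the page's Javascript would request /add?a=2&b=3 via XMLHttpRequest and insert the response text into the page without a reload:

    import java.io.IOException;

    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Responds to GET /add?a=2&b=3 with "5" as plain text.
    @WebServlet("/add")
    public class AddServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            int a = Integer.parseInt(req.getParameter("a"));
            int b = Integer.parseInt(req.getParameter("b"));
            resp.setContentType("text/plain");
            resp.getWriter().print(a + b);
        }
    }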
I think that this series of Opera Articles will give you a good idea of web standards and basic concepts of web development.
2014 update: the Opera docs were relocated in 2012 to this section of webplatform.org:
http://docs.webplatform.org/wiki/Main_Page
Ian's answer has a lot of weight. You could buy all those books and read them all and know nothing about web development. What you really need to do is start with something that is not nearly as big as Stack Overflow. Start with your personal site. Read some web dev/CSS articles on A List Apart. Learn about doctypes and why to use them. Add some CSS and change the colors around. Go over to QuirksMode and peruse the site. Add some JS. Follow some links on Crockford's site. You will probably stumble across his awesome video lectures, which you should watch. Then after that go back to all the JS that you wrote and rewrite it. Then pick a server-side language that you want to learn. Python is pretty easy, but it really doesn't matter what you pick. Then come back and integrate all those together in your site. At this point you will at least be getting started with web development and will have worked with several different technologies.
EDIT: I forgot to mention. READ BOOKS.
Many developers that I have worked with in the past have gotten through their careers without really advancing after a certain point. I could be totally wrong, but I attribute it to not reading enough books and relying on reusing their same bad code over and over.
You could go out and buy a bunch of books and start reading them and quickly get overwhelmed in the seemingly massive learning curve it takes to go from nowhere, which is where it appears you are, to a rich internet entrepreneur, which is where you want to be.
Alternatively, and what I would suggest is, you could define a problem you want to solve, and then go about finding the solution to that problem. Start with something small: "I have a problem: I don't have a web site about myself." Define what you need to do to solve that problem, learn the basics, and do it. Then define a new problem, which probably relies on the solution to the first problem, find what you need to do, and do it.
This is how all technology professionals evolve. My first website was a personal site with nothing but text. Then I added some jokes and some movie quotes. Then I got tired of man-handling all the updates, so I learned how to put them into a database and retrieve them from the database for display. It goes on and on.
Call me when you've got more money from your financial windfall than you know what to do with.
If you really just want to jump in with both feet, I would suggest looking at ColdFusion from Adobe. The developer edition is free and runs on Windows, OS X, and Linux. The documentation is authoritative and extensive, there is a very active developer community, and there are only a few books you might want to dig into. The definitive guide is a series of books that can be found on Amazon.
The nice thing about ColdFusion is that you can use it as a stepping stone to other languages and remain productive along the way. You can even mix it with Java, since it is itself written in Java. There are also lots of goodies built in that you would have to scour the web for, or pay more for, in other languages: things like full-text indexing, graphing, server monitoring, Ajax-based controls, Flash/Flex integration, asynchronous OS calls, etc.
You even have the choice of building object oriented code or procedural code, although some people would not count that as a benefit. Those people rarely agree on which style should win, though.
Cheers!
I think SitePoint is the best resource for learning best practices in web development. They have great articles, good references, and probably one of the best forums. However, the people there can be a bit grumpy. ;)
If you are a real nerd, reading the specs for HTML 5 and CSS is also a good way to learn.
I'm with Ian on this one. Reading books is all well and good, but nothing beats getting stuck in. I actually started with a Dummies Guide to ASP (that'd be "classic" ASP), back in 1999.
If I was going to start from scratch today I'd be looking at something that covered a full stack solution, whether Apache/PHP/MySQL, RoR or whatever.
At the moment I have no experience of Rails, but it might be a pretty good place to start, as it includes a lot of stuff that you'd otherwise have to figure out early on (such as integration with Scriptaculous, a JS framework); you can always learn what's going on under the hood at a later date.
.NET is always an option, and if you're comfortable with Visual Studio it may be the way to go, but it's not the easiest thing to pick up otherwise.
If you know a bit of HTML but are basically new to server-side programming you might look at ColdFusion. It's actually extremely powerful and like Rails includes lots of "out of the box" benefits. There's a Swiss company called Railo who are currently in the process of releasing an Open Source ColdFusion engine that is affiliated with JBoss.
Last but not least, don't forget databases! Sooner or later you'll need to get to grips with some pretty serious SQL...
CFML (aka "ColdFusion" even though that's really an Adobe product, not the language) is definitely easy to learn, and if you want FOSS for CFML, in addition to Railo you can use Open BlueDragon which is a GPL CFML engine.
Designing with Web Standards is a great first read!
http://www.zeldman.com/dwws/
I would recommend this book:
http://www.amazon.com/MCTS-Self-Paced-Training-Exam-70-528/dp/0735623341/ref=sr_1_1?ie=UTF8&s=books&qid=1218830714&sr=8-1
I have just read it to take the exam, and although I knew the web theory part, I found it to be of great value.
This of course is an ASP.NET-specific book, but that is what I would recommend learning anyway.
After you learn all the ASP.NET stuff, I would suggest reading up on jQuery.
Happy coding :)
TIBCO General Interface looks interesting and I've played around with it some, but the development IDE in a web browser seems like it would eventually become a nightmare.
Does anyone have experience using it and what are your thoughts?
We evaluated GI a few months ago for a project but didn't end up selecting it.
The IDE-in-a-browser (which is itself built with GI) actually works surprisingly well, though there are some features you normally expect from an editor that it lacks, most notably (and irritatingly) an Undo command. It's also impossible to do things like subdocument includes (practically a necessity for team development) from the IDE, though you can do them manually in the underlying XML and the IDE will respect them.
In the end the main reason we didn't go with it was that it was difficult to make the resulting web application look as good as the designers really wanted. It was relatively easy to build functionality, but the components were very restrictive in look and feel. The way GI renders its own document model to HTML involves a lot of style attributes which makes skinning in CSS all but impossible. It seems to prefer making web applications that look like applications, instead of web applications that look like websites.
So it would probably be great for building intranet type applications where look and feel isn't a huge issue, but I probably wouldn't use it to make a public facing site.
By the way, for those who don't know, TIBCO GI is a completely separate product from the rest of TIBCO's SOA business integration stuff; General Interface was a separate company that was acquired by TIBCO a couple of years ago.
From a coworker who used to work at TIBCO:
TIBCO is a complicated, hard-to-use system because it's used for complicated, hard-to-solve problems.
Kieron does a good job of summarizing GI. It's really for enterprise web applications, not consumer-y widgets. The overhead of loading the entire GI framework and waiting a second or two for it to load doesn't seem like much if you're firing up a call center or an employee-provisioning application you're going to use for the next few hours. But it seems like forever if you're waiting for a widget to load into an existing web page. And even though GI supports some nice functional and performance QA tools, they really are overkill unless you're working on something important and complex. So if all you want is to toss a sexy-looking date picker on screen, use something else for sure.
Yup, couldn't agree more. I have developed a few applications with TIBCO GI and integrated it with TIBCO CIM. I work for TIBCO, and GI is something I have been working with quite heavily, doing some complicated stuff. While doing so, I came across the odd sides of GI: things you sometimes can't explain but that are just the way they are; working with JavaScript and dealing with multithreading issues can be a nightmare, etc. It's good for creating something quickly without being too fussy about the sexiness of the application, hence good for internal apps but not for consumers, unless you want to get lost in a jungle of crazy CSS styling. The XML Mapping utility is a great feature, saving you lots of time when implementing SOA applications. The other good part is that deployment is really easy: GI apps use a combination of XML, XSLT, XPath, and JavaScript. In GI 3.8 there are also a couple of testing tools. Unfortunately, development inside GI's editor is slow and painful, so I recommend using an external editor like Notepad++.
You don't need to run TIBCO GI from a web browser; you just need to run the program file GI_Builder.exe, which is an ActiveX application. Just double-click it and run it.