How to parse HTML into .txt format using C

I need to parse HTML into .txt format using C.
For example, it has to detect each of
1. <p>
2. <tr>
3. <ul> etc.
and convert them into text (in a document).
Can somebody help please?

I think the easiest way to download an HTML web page in C is to use libcurl. Assuming you have already set up your development environment, follow these steps:
Visit the download page of libcurl and download its latest version.
Take a look at the install page and learn how to install the library. On Linux the installation is pretty straightforward: just type ./configure && make && make install in the terminal.
Download the url2file.c example of libcurl. The <curl/curl.h> header included in this file provides the functions you need to communicate with the web server.
Next, compile url2file.c with gcc -o url2file url2file.c -lcurl.
Finally, test it with ./url2file http://example.com. The result will be stored in the file page.out, which is plain text.
NOTES:
You need to install libcurl before you can compile url2file.c; otherwise the build will fail with a fatal error.
If you already have the curl program installed on your machine, you can download web pages with curl http://example.com > page.out in the terminal.
Also, wget lets you download and store web pages: wget http://example.com.
This answer stores a web page as plain text. It doesn't perform any HTML-tag-specific processing.

Related

How to download or list all files on a website directory

I have a PDF link like www.xxx.org/content/a.pdf, and I know that there are many PDF files in the www.xxx.org/content/ directory, but I don't have a list of the filenames. And when I access www.xxx.org/content/ in a browser, it redirects to www.xxx.org/home.html.
I tried to use wget like "wget -c -r -np -nd --accept=pdf -U NoSuchBrowser/1.0 www.xxx.org/content", but it returns nothing.
So does anyone know how to download or list all the files in the www.xxx.org/content/ directory?
If the site www.xxx.org disables directory listing (for example via its .htaccess configuration), you can't do it.
If the server also exposes the directory over FTP, you can download and access all the files that way: find the FTP path corresponding to www.xxx.org/content/ and use an FTP client or a small script to fetch everything.
WARNING: This may be illegal without permission from the website owner, so get permission first. A tool like this can create a Denial of Service (DoS) on a website if not properly configured (or if the site cannot handle your requests). It can also cost the website owner money if they have to pay for bandwidth.
You can use tools like dirb or dirbuster to search a web site for folders/files using a wordlist. You can get a wordlist file by searching for a "dictionary file" online.
http://dirb.sourceforge.net/
https://sectools.org/tool/dirbuster/

How to extract the whole HTML with complete styling after a user has designed their page on my website?

Like Weebly or Wix, I want to make a website on which users can design their web pages with predefined controls and styling. So how can I get or extract the whole web page's HTML with complete styling? Please mention any link or solution.
If you are using Linux or macOS, then I'd suggest using wget. As long as the website isn't blocking these types of download requests, wget will download the entire website, including resource files (-r), and create a sensible folder structure.
wget -r -p -e robots=off http://www.example.com
If the URL you want to retrieve blocks this sort of download request, you'll only receive the index.html from wget.
On Windows I use https://www.httrack.com/. It's free and downloads the website just fine. I believe someone has created a Windows version of wget as well.

Installing JSON.pm on a web host without shell access

My host (iPage) does not have JSON.pm installed. I don't want to use the modules they have installed (XML) to transfer data from a CGI script back to a web page. Is there any way I can use JSON without them installing it on Perl?
The reason I ask is that when I downloaded the JSON zip, I noticed I had to run a make command to build JSON.pm, but I don't have access to a Unix shell or an SSH terminal.
If your Perl is new enough (5.14 and up), it will come with JSON::PP, a pure-Perl implementation of the JSON parser. Confusingly, it does not come with JSON.pm. So try use JSON::PP and see if it works.
Otherwise, follow Ilmari's instructions. If you switch to a host with shell access, you can use local::lib to manage CPAN modules.
You should be able to install a local copy of the pure Perl version of the JSON module without shell access. Just download the .tar.gz archive to your own computer, unpack it and copy everything under the lib subdirectory to a suitable location on your webhost.
You'll also need to tell Perl where to find the module, for which you need to know the filesystem path to which you copied the module. For example, if you copied the contents of the lib directory to /home/username/perl-lib on your webhost, then you would include in your code the lines:
use lib '/home/username/perl-lib';
use JSON;
Depending on how your webhost is configured, you might also be able to use $ENV{HOME} to obtain the path to your home directory, so that you can write:
use lib "$ENV{HOME}/perl-lib";
use JSON;
or you could try using the FindBin module to find the path to the directory containing your script, and locate the lib directory from there (see the example in the FindBin documentation).

Jenkins/Hudson Upload to Testflight

I have a Jenkins job using Xcode to build my IPA file. That is all working great. Right now I just have the marketing version set to ${BUILD_ID} and the technical version set to ${BUILD_NUMBER}. I also have the Release configuration specified, and my job is set to archive the IPA files as a post-build action. I believe that combination of settings causes my resulting IPA file to be the following:
Target-Configuration-BUILD_NUMBER.ipa
So if my target was named BillyBob and this was the 23rd successful build, my resulting .ipa file is: BillyBob-Release-23.ipa
I want to setup a job or post-build action to upload my file to testflight on a successful build.
I cannot figure out what to set the file parameter of the TestFlight API to so that it will always find the latest build file. I don't think there is a wildcard available, or if there is, I don't know how to set it.
Originally, when I wasn't setting the technical version as part of the build, I just pointed it at the -1.0.ipa version of the file it was creating, and that would get uploaded fine.
I've tried using both the testflight plugin for jenkins and just a curl shell script command.
I will also point out that I'm not an iOS developer, I've just been trying to help the project by setting up the automated build, so my guess as to how that file is getting generated could be way off.
UPDATE:
So it looks like this current open issue is kind of what I am looking for
jenkins issue section
For now, I just had my job specify an output path that is the workspace of my upload to testflight job.
It looks like, with the TestFlight plugin, if you don't specify anything for the IPA file, it looks for one in the workspace directory of that job. So I could probably also put in a request for the TestFlight plugin to allow you to specify a path in the IPA setting and have it find the .ipa file in that path; that currently does not work.
If I were better at scripting, I could probably also handle it in a shell command, using curl to upload to TestFlight.
Leaving those fields empty in Jenkins fixed it for me.
If you do not specify the ipa/dsym file(s) they are searched for
automatically in the workspace.
As you can see from:
https://wiki.jenkins-ci.org/display/JENKINS/Testflight+Plugin
Version 1.3.1 (Jan 12 2012)
* Default IPA upload

Browser-based app framework for Ruby

I have a Ruby script running from the command line. I want to provide a local GUI for it (for my own use). As I have some exposure to Sinatra and other web frameworks, I want to use HTML pages as my front end. But I don't want to start a server and type in a URL every time I want to launch my app.
My solution would be to write a shell script which starts a Sinatra-based server and then launches Chromium (the browser) in app mode at that URL.
Is there some framework which can do it better/cleaner?
I'm not interested in learning a non-HTML framework like Shoes or Ruby-Gnome2.
#!/bin/sh
# Start the Sinatra app in the background, then point Chromium
# (in app mode) at Sinatra's default port.
ruby "$1" &
sleep 1   # give the server a moment to boot
chromium --app=http://localhost:4567
Put that somewhere in your $PATH (or change your PATH to contain $HOME/bin with export PATH=$HOME/bin:$PATH and put it there), make it executable with chmod +x <file>, and have fun by calling <file> <sinatra startup file>.
You could extend this to read the port from the Sinatra app, but that would require another Ruby startup, and this should do in most cases (the 80%, as people call it).