download links from a web page with renaming - html

I'm trying to find a way to automatically download all links from a web page, but I also want to rename them as I go. For example, given:
<a href = fileName.txt> Name I want to have </a>
I want to end up with a file named 'Name I want to have' (I'm not worried about the extension).
I am aware that I could get the page source, parse all the links, and download them manually, but I'm wondering whether there are any built-in tools for this.
lynx --dump <url> | grep http:// | cut -d ' ' -f 4
will print all the links, which can then be batch-fetched with wget - but is there a way to rename the downloaded files on the fly?

I doubt anything does this out of the box. I suggest you write a script in Python or similar to download the page and parse the source (try the Beautiful Soup library for tolerant parsing). Then it's a simple matter of traversing the parsed document to capture the links with their attributes and text, and downloading the files under the names you want. With the exception of Beautiful Soup (if you need to be able to parse sloppy HTML), everything you need is built into Python.
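For illustration, here is a minimal sketch of that approach using the standard-library urllib together with Beautiful Soup; the page URL is a placeholder and there is no error handling:
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

page_url = "http://example.com/downloads.html"  # placeholder URL
soup = BeautifulSoup(urlopen(page_url).read(), "html.parser")

for a in soup.find_all("a", href=True):
    name = a.get_text(strip=True)            # the link text becomes the file name
    if not name:
        continue
    file_url = urljoin(page_url, a["href"])   # resolve relative hrefs
    urlretrieve(file_url, name)               # save under the chosen name
In practice you would also want to sanitise the link text (strip characters that aren't valid in file names) and skip links that don't point at downloadable files.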

I solved the problem by first converting the web page entirely to Unicode (using Notepad++'s built-in conversion).
Then I wrote a small shell script that used cat, awk and wget to fetch all the data.
Unfortunately, I couldn't fully automate the process, since I didn't find any tool for Linux that would convert an entire page from KOI8-R to Unicode.

Related

Converting several HTML files into one Word file

I received web-service documentation in HTML format, but it is very unfriendly when it comes to searching for a specific word. Using the index file, it displays a list of the names of each request on the left, and when you click on a particular one it displays the description and content of that request on the right.
Unfortunately I have to do some mapping against web services that we already have. When searching with Ctrl+F it only goes through the left-hand list; it doesn't matter if you place the cursor over the description on the right, click, and try to search that way - it doesn't work.
My idea is to extract all the HTML files that have been provided to us into one Word document (this way I can search through the descriptions, not only through the list of names). Unfortunately, all I can achieve is that these files open as separate Word files (one Word file per HTML file). There are almost 1000 requests to be mapped, and working this way is going to take forever...
So the question is: how do I combine more than one HTML file into one Word file?
There are two ways to merge HTML files.
Using the command line
Copy all the HTML files that you want to merge into a folder.
Navigate to that folder using the terminal or command prompt.
Execute the following command:
On Mac/Linux:
cat *.html > output.html
On Windows:
type *.html > output.html
Using already available tools
https://www.sobolsoft.com/howtouse/combine-html-files.htm, html-merge (Windows Only)
To convert the merged HTML file to a Word document, read here.

How to automate getting a CSV file from this website?

I've never worked with web pages before and I'd like to know how best to automate the following through programming/scripting:
1. Go to http://financials.morningstar.com/ratios/r.html?t=GMCR&region=USA&culture=en_US
2. Invoke the 'Export to CSV' button near the top right
3. Save this file into a local directory
4. Parse the file
Part 4 doesn't need to use the same language as parts 1-3, but ideally I would like to do everything in one shot using one language.
I noticed that if I hover my mouse over the button it says: javascript:exportKeyStat2CSV(); Is this a java function I could call somehow?
Any suggestions are appreciated.
It's a JavaScript function, which is not Java!
At first glance, it may seem like you need to execute JavaScript to get this done, but if you look at the source of the document, you can see the function is simply implemented like this:
function exportKeyStat2CSV() {
    var orderby = SRT_keyStuts.getOrderFromCookie("order");
    var urlstr = "//financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order=" + orderby;
    document.location = urlstr;
}
So, it builds a URL, which is completely fixed except for the order-by part, which is taken from a cookie. Then it simply navigates to that URL by setting document.location. A small test shows you even get a CSV file if you leave the order-by part empty, so you can probably just download the CSV from the base URL that is in the code.
Downloading can be done using various tools, for instance Wget for Windows. See Super User for more possibilities. Anyway, 'steps 1 to 3' are actually just a single command.
After that, you just need to parse the file. Parsing CSV files can be done from a batch script, and there are several examples available. I won't go into details, since you didn't provide any in your question.
PS. I'd check their terms of use before you actually implement this.
The button directs me to this link:
http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order=asc
You could use the Python 3 module urllib to fetch the file, save it using the os or shutil modules, and then parse it using one of the many CSV-parsing modules, or by writing your own.
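As a rough sketch using only the standard library (the URL is the export link quoted above; the local file name is arbitrary):
import csv
import urllib.request

url = ("http://financials.morningstar.com/ajax/exportKR2CSV.html"
       "?&callback=?&t=XNAS:GMCR&region=usa&culture=en-US&cur=&order=asc")
local_path = "gmcr_key_ratios.csv"  # arbitrary local file name

urllib.request.urlretrieve(url, local_path)  # steps 1-3: fetch and save the CSV

with open(local_path, newline="") as f:
    for row in csv.reader(f):                # step 4: each row is a list of fields
        print(row)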

Mass-upload many text files to MediaWiki

I have many text files that I want to upload to a wiki running MediaWiki.
I don't even know if this is really possible, but I want to give it a shot.
Each text file's name will be the title of the wiki page.
One wiki page for one file.
I want to upload all text files from the same folder as the program is in.
Perhaps asking you to code it all is asking too much, so could you at least tell me which language I should look into to give it a shot?
What you probably want is a bot to create the articles for you using the MediaWiki API. Probably the best known bot framework is pywikipedia for Python, but there are API libraries and bot frameworks for many other languages too.
In fact, pywikipedia comes with a script called pagefromfile.py that does something pretty close to what you want. By default, it creates multiple pages from a single file, but if you know some Python, it shouldn't be too hard to change that.
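For illustration, here is a minimal sketch of a home-grown bot along those lines using Pywikibot (the successor to pywikipedia); it assumes the target wiki is already configured in user-config.py, and it creates one page per .txt file in the current folder, named after the file:
import glob
import os
import pywikibot

site = pywikibot.Site()  # uses the wiki configured in user-config.py

for path in glob.glob("*.txt"):
    title = os.path.splitext(os.path.basename(path))[0]  # file name becomes the page title
    with open(path, encoding="utf-8") as f:
        text = f.read()
    page = pywikibot.Page(site, title)
    page.text = text
    page.save(summary="Importing text file")  # creates the page (or overwrites it)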
Actually, if the files are on the same server your wiki runs on (or you can upload them there), then you don't even need a bot at all: there's a MediaWiki maintenance script called importTextFile.php that can do it for you. You can run it for all files in a given directory with a simple shell script, e.g.:
for file in directory/*.txt; do
    php /path/to/your/mediawiki/maintenance/importTextFile.php "$file";
done
(Obviously, replace directory with the directory containing the text files and /path/to/your/mediawiki with the actual path of your MediaWiki installation.)
By default, importTextFile.php will base the name of the created page on the filename, stripping any directory prefixes and extensions. Also, per standard MediaWiki page naming rules, underscores will be replaced by spaces and the first letter will be capitalized (unless you've turned that off in your LocalSettings.php); thus, for example, the file directory/foo_bar.txt would be imported as the page "Foo bar". If you want finer control over the page naming, importTextFile.php also supports an explicit --title parameter. Or you could always copy the script and modify it yourself to change the page naming rules.
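If you want to script that finer control, a rough sketch of driving importTextFile.php from Python and passing --title explicitly might look like this (the paths and the naming rule are placeholders):
import glob
import os
import subprocess

MAINTENANCE_SCRIPT = "/path/to/your/mediawiki/maintenance/importTextFile.php"

for path in glob.glob("directory/*.txt"):
    # Example naming rule: file name without extension, underscores turned into spaces.
    title = os.path.splitext(os.path.basename(path))[0].replace("_", " ")
    subprocess.run(["php", MAINTENANCE_SCRIPT, "--title", title, path], check=True)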
PS. There's also another MediaWiki maintenance script called edit.php that does pretty much the same thing as importTextFile.php, except that it reads the page text from standard input and doesn't have the convenient default page-naming rules of importTextFile.php. It can be quite handy for automated edits using Unix pipelines, though.
Addendum: The importTextFile.php script expects the file names and contents to be in the UTF-8 encoding. If your files are in some other encoding, you'll have to either fix them first or modify the script to do the conversion, e.g. using mb_convert_encoding().
In particular, the following modifications to the script ought to do it:
To convert the file names to UTF-8, edit the titleFromFilename() function, near the bottom of the script, and replace its last line:
return $parts[0];
with:
return mb_convert_encoding( $parts[0], "UTF-8", "your-encoding" );
where your-encoding should be the character encoding used for your file names (or auto to attempt auto-detection).
To also convert the contents of the files, make a similar change higher up, inside the main code of the script, replacing the line:
$text = file_get_contents( $filename );
with:
$text = file_get_contents( $filename );
$text = mb_convert_encoding( $text, "UTF-8", "your-encoding" );
In MediaWiki 1.27, there is a new maintenance script, importTextFiles.php, which can do this. See https://www.mediawiki.org/wiki/Manual:ImportTextFiles.php for information. It improves on the old (now removed) importTextFile.php script in that it can handle file wildcards, so it allows the import of many text files at once.

How might I write a program to extract my data from Google Code?

I'm about to start writing a program which will attempt to extract data from a Google Code site so that it may be imported in to another project management site. Specifically, I need to extract the full issue detail from the site (description, comments, and so on).
Unfortunately Google doesn't provide an API for this, nor do they have an export feature, so to me the only option looks to be extracting the data from the actual HTML (yuck). Does anyone have any suggestions on "best practice" for parsing data out of HTML? I'm aware that this is less than ideal, but I don't think I have much choice. Can anyone think of a better way, or has someone already done this?
Also, I'm aware of the CSV export feature on the issue page; however, this does not give complete data about issues (but it could be a useful starting point).
I just finished a program called google-code-export (hosted on GitHub). It allows you to export your Google Code project to an XML file. For example:
>main.py -p synergy-plus -s 1 -c 1
parse: http://code.google.com/p/synergy-plus/issues/detail?id=1
wrote: synergy-plus_google-code-export.xml
... will create a file named synergy-plus_google-code-export.xml.

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is linux.
wget -O - <url> | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
Python's Beautiful Soup library allows you to build a nice extractor.
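A minimal sketch of such an extractor (the URL list and output file names are placeholders):
from urllib.request import urlopen
from bs4 import BeautifulSoup

urls = ["http://example.com/page1.html", "http://example.com/page2.html"]  # placeholder list

for i, url in enumerate(urls):
    soup = BeautifulSoup(urlopen(url).read(), "html.parser")
    with open("page_%d.txt" % i, "w", encoding="utf-8") as out:
        out.write(soup.get_text(separator="\n"))  # text content only, no markup
Threading the downloads (for example with concurrent.futures) and deriving the output names from the URLs would be straightforward extensions.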
I know that w3m can be used to render an HTML document and put the text content in a text file, for example:
w3m www.google.com > file.txt
For the rest, I'm sure that wget can be used.
Look for the Simple HTML DOM Parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I have been very successful with this combination in a lot of applications for quite some time.
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules with the required functionality.
Use wget to download the required HTML, then run html2text on the output files.