I have a directory with more than 1000 .html files and would like to check all of them for bad links, preferably from the console. Can you recommend a tool for such a task?
You can use wget, e.g.:
wget -r --spider -o output.log http://somedomain.com
At the bottom of the output.log file, it will indicate whether wget found broken links. You can parse that using awk/grep.
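For instance, a rough sketch of that parsing step (the exact log wording varies between wget versions, so treat these grep patterns as assumptions to check against your own output.log):
# Lines wget explicitly flags as broken, with a little surrounding context:
grep -i -B2 'broken link' output.log
# HTTP errors recorded in the log:
grep -E 'ERROR (404|403|500)' output.log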
I'd use checklink (a W3C project).
You can extract links from HTML files using the Lynx text browser. Bash scripting around this should not be difficult.
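For example, here is a minimal sketch of that idea (it only checks absolute http(s) links, and the lynx -nonumbers flag and curl options assume reasonably recent versions of both tools):
for f in *.html; do
  lynx -dump -listonly -nonumbers "$f"
done | grep -E '^https?://' | sort -u | while read -r url; do
  # HEAD request; flag anything that fails to connect or returns >= 400
  code=$(curl -o /dev/null -s -I -w '%{http_code}' "$url")
  if [ "$code" = "000" ] || [ "$code" -ge 400 ]; then
    echo "BROKEN ($code): $url"
  fi
done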
Try the webgrep command line tools or, if you're comfortable with Perl, the HTML::TagReader module by the same author.
Is it possible to format HTML automatically with a tool, similar to the way ESLint formats JavaScript? Why does it seem that there aren't many customizable options that you can integrate into your development pipeline?
I would like to format HTML in the following way automatically, with a command run from the terminal:
<input
class="input-style"
placeholder="Replace me!"
/>
So, for example, I could run npm run html-lint and it would fix the syntax in HTML files and warn about cases it can't fix.
js-beautify also works on HTML.
npm install js-beautify
js-beautify --type html file.html
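A hedged example of running it over a whole directory tree (assuming a js-beautify version whose CLI supports -r/--replace for in-place editing):
find . -name '*.html' -exec js-beautify --type html --replace {} \;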
Note that all this beautifying makes the file size increase substantially. The indentation is great for revision and editing, not so much for hosting. For that reason, you might find html-minifier equally useful.
I personally think tidy is a fantastic option for tidying up HTML files. Check out Tidy.
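For example, something like the following re-indents a file in place (standard HTML Tidy flags; tidy-html5 accepts the same ones):
tidy -indent -modify -quiet file.html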
Maybe what you are looking for is Prettier; it also has a CLI, and you can even create a config file. See the complete documentation here: Prettier CLI.
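For example (assuming a Prettier version recent enough to include HTML support, i.e. 1.19 or later):
npx prettier --write "**/*.html"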
I hope this helps.
I Googled for "Package json pretty print html" and got the following:
https://www.npmjs.com/package/pretty
(It's not clear whether this can be included in package.json)
There's also this (appears to be a command-line tool):
https://packagecontrol.io/packages/HTML-CSS-JS%20Prettify
I have a legacy C project which uses *.hx as a customized header file suffix. I'm trying to use OpenGrok to read the code, but it doesn't support this file extension.
I tried to modify the SUFFIX in
OpenGrok-0.12-stable\src\org\opensolaris\opengrok\analysis\c\CAnalyzerFactory.java
and recompiled to get opengrok.jar, but it doesn't help.
Check the CLI options to opengrok.jar, especially the -A option.
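For example, a hedged sketch (the exact -A value format and the analyzer class name are assumptions based on the 0.12-era source layout; run java -jar opengrok.jar with no arguments to see the usage text for your build):
# Map the .hx extension to the C analyzer at indexing time instead of patching the source.
# The ".ext:AnalyzerClass" format below is an assumption; verify it against your build's usage output.
java -jar opengrok.jar \
  -A .hx:org.opensolaris.opengrok.analysis.c.CAnalyzerFactory \
  -s /path/to/src -d /path/to/data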
I want to get all URLs from a specific page in Bash.
This problem is already solved here: Easiest way to extract the urls from an html page using sed or awk only
The trick, however, is to parse relative links into absolute ones. So if http://example.com/ contains links like:
<a href="about.html">About us</a>
<script type="text/javascript" src="media/blah.js"></script>
I want the results to have following form:
http://example.com/about.html
http://example.com/media/blah.js
How can I do so with as little dependencies as possible?
Simply put, there is no simple solution. Having few dependencies leads to unsightly code, and vice versa: code robustness brings higher dependency requirements.
With this in mind, below I describe a few solutions and sum them up with the pros and cons of each one.
Approach 1
You can use wget's -k option together with some regular expressions (read more about parsing HTML that way).
From Linux manual:
-k
--convert-links
After the download is complete, convert the links in the document to
make them suitable for local viewing.
(...)
The links to files that have not been downloaded by Wget will be
changed to include host name and absolute path of the location they
point to.
Example: if the downloaded file /foo/doc.html links to /bar/img.gif
(or to ../bar/img.gif), then the link in doc.html will be modified to
point to http://hostname/bar/img.gif.
An example script:
#wget needs a file in order for -k to work
tmpfil=$(mktemp);
#-k - convert links
#-q - suppress output
#-O - redirect output to given file
wget http://example.com -k -q -O "$tmpfil";
#-o - print only matching parts
#you could use any other popular regex here
grep -o "http://[^'\"<>]*" "$tmpfil"
#remove unnecessary file
rm "$tmpfil"
Pros:
Works out of the box on most systems, assuming you have wget installed.
In most cases, this will be a sufficient solution.
Cons:
Relies on regular expressions, which are bound to break on some exotic pages, since HTML's nested structure sits above regular languages in the Chomsky hierarchy and cannot be reliably parsed with regexes.
You cannot pass a location in your local file system; you must pass a working URL.
Approach 2
You can use Python together with BeautifulSoup. An example script:
#!/usr/bin/python
import sys
import urllib
import urlparse
import BeautifulSoup
if len(sys.argv) <= 1:
    print >>sys.stderr, 'Missing URL argument'
    sys.exit(1)

content = urllib.urlopen(sys.argv[1]).read()
soup = BeautifulSoup.BeautifulSoup(content)
for anchor in soup.findAll('a', href=True):
    print urlparse.urljoin(sys.argv[1], anchor.get('href'))
And then:
dummy:~$ ./test.py http://example.com
Pros:
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Exotic markup is very likely to be handled well.
With small modifications, this approach works for files, not URLs only.
With small modifications, you might even be able to give your own base URL.
Cons:
It needs Python.
It needs Python with custom package.
You need to manually handle tags and attributes like <img src>, <link href>, <script src>, etc. (which isn't handled in the script above).
Approach 3
You can use some features of lynx. (This one was mentioned in the answer you provided in your question.) Example:
lynx http://example.com/ -dump -listonly -nonumbers
Pros:
Very concise usage.
Works well with all kinds of HTML.
Cons:
You need Lynx.
Although you can extract links from files as well, you cannot control the base URL, and you end up with file://localhost/ links. You can fix this with ugly hacks like manually inserting a <base href=""> tag into the HTML (a sketch of that hack follows below).
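A minimal sketch of that hack, assuming the page contains a plain lowercase <head> tag and that lynx honors the injected <base> element (it generally does):
base='http://example.com/'
# Inject a <base> right after <head> into a temporary copy, then let lynx resolve links against it.
sed "s|<head>|<head><base href=\"$base\">|" page.html > /tmp/page-with-base.html
lynx /tmp/page-with-base.html -dump -listonly -nonumbers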
Another option is my Xidel (XQuery/Webscraper):
For all normal links:
xidel http://example.com/ -e '//a/resolve-uri(@href)'
For all links and srcs:
xidel http://example.com/ -e '(//@href, //@src)/resolve-uri(.)'
With rr-'s format:
Pros:
Very concise usage.
Works well with all kinds of HTML.
It's the correct way to handle HTML, since it properly uses a fully-fledged parser.
Works for files and URLs.
You can give your own base URL, with resolve-uri(@href, "baseurl") (see the example below).
No dependencies except Xidel (apart from OpenSSL, if you also have HTTPS URLs).
Cons:
You need Xidel, which is not contained in any standard repository.
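As an illustration of the file-plus-base-URL point from the Pros above (a hedged sketch; resolve-uri with an explicit base is standard XPath 2.0, which Xidel supports):
xidel local-page.html -e '//a/resolve-uri(@href, "http://example.com/")'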
Why not simply this?
re='(src|href)='
baseurl='http://example.com'
wget -qO- "$baseurl" | awk -F'"' -v base="$baseurl" -v re="$re" '$0 ~ re { print base "/" $2 }'
You just need wget and awk.
Feel free to improve the snippet a bit if you have both relative and absolute URLs at the same time (one possible refinement follows below).
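For instance, a rough refinement along those lines (still regex-based, so the caveats from Approach 1 apply): leave absolute URLs alone and only prefix the base to relative ones.
baseurl='http://example.com'
# Pull out src="..."/href="..." attribute values, strip the quotes, then prefix
# the base URL only when the value is not already an absolute http(s) URL.
wget -qO- "$baseurl" | grep -Eo '(src|href)="[^"]*"' | sed -E 's/^(src|href)="//; s/"$//' |
  awk -v base="$baseurl" '{ if ($0 ~ /^https?:\/\//) print $0; else print base "/" $0 }'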
I tend to write a good amount of documentation, so the MediaWiki format is easy for me to understand, and it saves me a lot of time compared to writing traditional HTML. However, I also write a blog, and I find that constantly switching from keyboard to mouse to input the correct HTML tags adds a lot of time. I'd like to be able to write my articles in MediaWiki syntax and then convert them to HTML for use on my blog.
I've tried Googling, but I must need better nomenclature, as surprisingly I haven't been able to find anything.
I use Linux and would prefer to do this from the command line.
Anyone have any thoughts or ideas?
The best approach would be to use the MediaWiki parser itself. The good news is that MediaWiki 1.19 will provide a command-line tool just for that!
Disclaimer: I wrote that tool.
The script is maintenance/parse.php; here are some usage examples straight from the source code:
Entering text yourself, ending it with Control + D:
$ php maintenance/parse.php --title foo
''[[foo]]''^D
<p><i><strong class="selflink">foo</strong></i>
</p>
$
The usual file input method:
$ echo "'''bold'''" > /tmp/foo.txt
$ php maintenance/parse.php /tmp/foo.txt
<p><b>bold</b>
</p>$
And of course piping to stdin:
$ cat /tmp/foo | php maintenance/parse.php
<p><b>bold</b>
</p>$
As of today, you can get the script from http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/maintenance/parse.php and place it in your maintenance directory. It should work with MediaWiki 1.18.
The script will be made available with MediaWiki 1.19.0.
Looked into this a bit and think that a good route to take here would be to learn a general markup language like reStructuredText or Markdown and then convert from there. Discovered a program called pandoc that can convert either of these to HTML and MediaWiki. Appreciate the help.
Example:
pandoc -f mediawiki -s myfile.mediawiki -o myfile.html
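And a small batch-conversion sketch in the same spirit (assumes a pandoc build with the mediawiki reader, which recent versions include):
for f in *.mediawiki; do
  pandoc -f mediawiki -t html -s "$f" -o "${f%.mediawiki}.html"
done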
This page lists tons of MediaWiki parsers that you could try.
I have nearly 100 HTML files that use the <tt> tag to mark up inline code, which I'd like to change to the more meaningful <code> tag. I was thinking of doing something on the order of a massive sed -i 's/<tt>/<code>/g' command, but I'm curious whether there's a more appropriate industrial mechanism for changing tag types across a large HTML tree.
The nicest thing you can do is use xmlstarlet:
xml ed -r '//tt' -v code file.html
It is freakishly powerful. See http://xmlstar.sourceforge.net/, http://www.ibm.com/developerworks/library/x-starlet.html
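A hedged sketch of applying it to a whole tree (xmlstarlet's ed -L edits files in place; this assumes well-formed, non-namespaced XHTML, since tag-soup HTML or a default xmlns will need extra handling):
# Rename every <tt> element to <code>, editing each file in place.
find . -name '*.html' -exec xml ed -L -r '//tt' -v code {} \;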
If you are in a Linux environment, then sed is a very easy, short, and fast way to do it.
Corrected command:
SAVEIFS=$IFS
IFS=$'\n'
for f in $(find . -name "*.html"); do sed -i 's/tt>/code>/g' "$f"; done
IFS=$SAVEIFS
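A more robust variant of the same idea, which avoids the IFS juggling and copes with spaces in filenames (assumes GNU sed for -i):
find . -name '*.html' -exec sed -i 's/tt>/code>/g' {} +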
Some text editors or IDEs also allow you to do a search and replace across directories, with a filter on filename.
For one time performance of such tasks I use UltraEdit on Windows. UE has a find and replace in files function that works great for this. I point it at the top of the directory tree containing the files I want to change, tell it to process sub-directories, give it the extension of the files I want to change, tell it what to change and what to change it to and go.
If you have to script this on Linux, then I think the sed solution or a Perl or PHP script will work great.