Fixing malformed html that html tidy doesn't fix - html

Okay, so I've been using HTML Tidy to convert regular HTML web pages into XHTML suitable for parsing. The problem is that the test page I saved in Firefox apparently had its HTML partly cleaned up by Firefox during saving; call this file F. HTML Tidy works fine on file F, but fails on the raw data written to a file via .NET (file N). HTML Tidy complains about form tags being intermixed with table tags. The HTML isn't mine, so I can't just fix the source.
How do I clean up file N enough that it can be run through HTML Tidy? Is there a standard way of hooking into Firefox (completely programmatically, without having to use the mouse or keyboard), or another tool that will apply extra fixes to the HTML?

I had been using HTML tidy for some time, but then found that I was getting better results from TagSoup.
It can be used as a JAXP parser, converting non-well-formed HTML on the fly. I usually let it parse the input for Saxon XQuery transformations.
But it can also be used as a stand-alone utility, as an executable jar.
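For the stand-alone route, a minimal sketch of calling the jar from a script might look like the following. It is written in Python purely for illustration. The jar file name and the input/output paths are placeholders, and it assumes that TagSoup prints the cleaned document to standard output when given a file name and no other options.

```python
import subprocess
from pathlib import Path


def tagsoup_command(jar_path: str, html_path: str) -> list:
    """Build the argument list for running TagSoup as an executable jar."""
    return ["java", "-jar", jar_path, html_path]


def clean_with_tagsoup(jar_path: str, html_path: str, out_path: str) -> None:
    """Run TagSoup on html_path and save whatever it prints to stdout.

    Assumption: the cleaned XHTML is written to standard output.
    Raises CalledProcessError if the java process exits with an error.
    """
    result = subprocess.run(
        tagsoup_command(jar_path, html_path),
        capture_output=True,
        check=True,
    )
    Path(out_path).write_bytes(result.stdout)
```

Something like clean_with_tagsoup("tagsoup.jar", "fileN.html", "fileN.xhtml") would then produce a file that can be handed to the XHTML parser.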

I wound up using SendKeys in C# and importing functions from user32.dll to set Firefox as the active window after launching it to the website I wanted (file:///myfilepathhere/).
SendKeys seemed to require running a windowed program, so I also added another executable which performs actions in its form_load() method.
By using alt+f, down six times, enter, wait for a bit, type full path file name, enter (twice) and then killing firefox, I was able to automate firefox's ability to clean some html up.

Related

start browser at an anchor possibly using a tag

I am developing interactive software for thermodynamic calculations that uses an HTML help file with "anchor/target" features to select the appropriate part of the help file when the user types a ? in answer to a question.
This works well but at present a new browser window is opened each time the user types a "?". I would prefer to start a new tag if there already is a browser window opened.
At present my program activates the help by creating a character variable with the content: browser "file:helpfile#target"
and then calling the Fortran subroutine execute_system_command(character).
"browser" can be firefox or whatever browser the user prefers (on Mac including the path);
"helpfile" contains the path and name of my HTML help file;
"target" is a text which depends on the question asked by the software and locates the appropriate help text.
How can I modify this so I open a new tag in the browser (if it is already opened) rather than starting a new browser window?
Maybe something like "target=_blank" can be added?
My program is written in the new Fortran standard so I have no facilities that might be available in Java or Python. It must work using different browsers on different OS.
As @Vladimir pointed out, I meant a "tab"; there are so many terms. But the answers I received made me reconsider which browser I could use, and I made some new discoveries.
On Windows I used the old Explorer because the path to Firefox contains a space, and when I tried to start Firefox to open a file
I had to enclose "C:\Program ...\firefox" in double quotes. That works to start the browser, but if I want the browser to open a file I must enclose that in double quotes as well, and that did not work. I am not sure whether the problem was the Fortran intrinsic EXECUTE_COMMAND_LINE(txt) or something deeper down. But the old Explorer I could start without "" and just enclose "file:/help.html" in "".
So now I tried to be smart and wrote a test program that encloses just the directories containing a space in "", i.e.
C:"Program Files\Mozilla Firefox"\firefox.exe "file:/help.html"
in the call to EXECUTE...
and that worked and opened the helpfile in a tab as is my default.
Problem solved? No: when I had replaced the browser and path in the program and tested it, it did not find the browser. The reason was that I test that the browser exists using another Fortran intrinsic, INQUIRE, and as I understand it double quotes are not legal inside file names, so INQUIRE did not find firefox when there are " inside the path. It only worked if the " are placed around the whole file name. So back to square one? No, I simply removed the " from the path+browser before calling INQUIRE, then used the path with "" inside when calling EXECUTE ...
and now everything works as I wanted!
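The two-step idea (strip the quotes for the existence check, keep them for the command) can be sketched as follows. This is Python used only to illustrate the string handling; the actual program would do the equivalent in Fortran with INQUIRE and EXECUTE_COMMAND_LINE. The function names are made up for this example.

```python
import os


def path_for_exists_check(browser: str) -> str:
    """Remove embedded double quotes so the path can be tested as a file name."""
    return browser.replace('"', "")


def build_command(browser: str, url: str) -> str:
    """Keep the partly quoted browser path as given and quote the URL."""
    return f'{browser} "{url}"'


def browser_available(browser: str) -> bool:
    """Existence test on the unquoted path (the role INQUIRE plays above)."""
    return os.path.isfile(path_for_exists_check(browser))


# Path as written in the post, with quotes around the part containing a space.
browser = r'C:"Program Files\Mozilla Firefox"\firefox.exe'
print(path_for_exists_check(browser))
print(build_command(browser, "file:/help.html"))
```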

Way To Modify HTML Before Display using Cocoa Webkit for Internationalization

In Objective C to build a Mac OSX (Cocoa) application, I'm using the native Webkit widget to display local files with the file:// URL, pulling from this folder:
MyApp.app/Contents/Resources/lang/en/html
This is all well and good until I start to need a German version. That means I have to copy en/html as de/html, then have someone replace the wording in the HTML (and some in the Javascript (like with modal dialogs)) with German phrasing. That's quite a lot of work!
Okay, that might seem doable until this creates a headache where I have to constantly maintain multiple versions of the html folder for each of the languages I need to support.
Then the thought came to me...
Why not just replace the phrasing with template tags like %CONTINUE%
and then, before the page is rendered, intercept it and swap it out
with strings pulled from a language plist file?
Through some API with this widget, is it possible to intercept HTML before it is rendered and replace text?
If it is possible, would it be noticeably slow such that it wouldn't be worth it?
Or, do you recommend a strategy where I build a generator that I keep on my workstation, which builds each of the HTML folders for me from a main template, and then I deploy the already completed folders with my setup application once I have determined the user's language from the setup application?
Through a lot of experimentation, I found an ugly way to do templating. Like I said, it's not desirable and has some side effects:
You'll see a flash on the first window load. On first load of the application window that has the WebKit widget, you'll want to hide the window until the second time the page content is displayed. I guess you'll have to use a property for that.
When you navigate, each page loads twice. It's almost not noticeable, but not good enough for good development.
I found an odd quirk with Bootstrap CSS where it made my table grid rows very large and didn't apply CSS properly for some strange reason. I might be able to tweak the CSS to fix that.
Unfortunately, I found no other event I could intercept on this except didFinishLoadForFrame. However, by then, the page has already downloaded and rendered at least once for a microsecond. It would be great to intercept some event before then, where I have the full HTML, and do the swap there before display. I didn't find such an event. However, if someone finds such an event -- that would probably make this a great templating solution.
- (void)webView:(WebView *)sender didFinishLoadForFrame:(WebFrame *)frame
{
    DOMHTMLElement *htmlNode =
        (DOMHTMLElement *)[[[frame DOMDocument] getElementsByTagName:@"html"] item:0];
    NSString *s = [htmlNode outerHTML];
    // Skip pages that were already rewritten; otherwise the reload below would loop forever.
    if ([s containsString:@"<!-- processed -->"]) {
        return;
    }
    NSURL *oBaseURL = [[[frame dataSource] request] URL];
    // Swap the template tag, then mark the page as processed.
    s = [s stringByReplacingOccurrencesOfString:@"%EXAMPLE%" withString:@"ZZZ"];
    s = [s stringByReplacingOccurrencesOfString:@"</head>" withString:@"<!-- processed -->\n</head>"];
    // Load the rewritten HTML into the frame (this is the second load of the page).
    [frame loadHTMLString:s baseURL:oBaseURL];
}
The above will look at HTML that contains %EXAMPLE% and replace it with ZZZ.
In the end, I realized that this is inefficient because of the page flash, and pages with long text that needs a lot of replacing may show quite a noticeable delay. The better way is to create a compile-time generator: make one HTML folder with %PARAMETERIZED_TAGS% inside instead of English text, then create a "Run Script" in your "Build Phase" that runs a program or script, written in whatever language you want, which generates each HTML folder from the lang-XX.plist files you have in a directory, where XX is a language code like 'en', 'de', etc. It reads each HTML file, looks up every parameterized tag in the lang-XX.plist file, and replaces the tag with the text for that language.
That way, after compilation, you have one HTML folder per language, already using your translated strings. This is efficient because you handle your code in a single HTML folder and don't have to go through the extremely tedious process of creating each HTML folder in each language, nor maintain that mess. The compile-time generator does that for you. However, you'll have to build that compile-time generator.
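A minimal sketch of such a generator, in Python, could look like this. It assumes each lang-XX.plist is a flat dictionary whose keys are the tag names without the percent signs (for example CONTINUE), that only .html and .js files contain tags, and that the output goes to <out_root>/XX/html. The names render and generate are made up for this example.

```python
import plistlib
import re
from pathlib import Path

# A tag is an upper-case name between percent signs, e.g. %CONTINUE%.
TAG = re.compile(r"%([A-Z0-9_]+)%")


def render(template: str, strings: dict) -> str:
    """Replace each %TAG% with its translation; unknown tags are left as-is."""
    return TAG.sub(lambda m: str(strings.get(m.group(1), m.group(0))), template)


def generate(template_dir: Path, plist_dir: Path, out_root: Path) -> list:
    """Write out_root/<lang>/html for every lang-XX.plist found in plist_dir."""
    langs = []
    for plist_path in sorted(plist_dir.glob("lang-*.plist")):
        lang = plist_path.stem[len("lang-"):]
        with plist_path.open("rb") as fh:
            strings = plistlib.load(fh)
        for src in template_dir.rglob("*"):
            if not src.is_file():
                continue
            dest = out_root / lang / "html" / src.relative_to(template_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            if src.suffix in {".html", ".js"}:
                text = src.read_text(encoding="utf-8")
                dest.write_text(render(text, strings), encoding="utf-8")
            else:
                # Images, CSS and other assets are copied unchanged.
                dest.write_bytes(src.read_bytes())
        langs.append(lang)
    return langs
```

Leaving unknown tags untouched makes a missing translation easy to spot in the generated page. A stricter build script might instead fail the build when a tag has no entry in the plist.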

in PhpStorm, is it possible to reformat injected code within a PHP file?

PhpStorm can apply code style rules for specific languages with the Reformat Code command. PhpStorm can also recognize a language embedded within a file of another language (known in PhpStorm as 'Language Injection'). So, I expect that a language would be subject to its code style rules wherever the language is used -- whether embedded or in its own file.
I've found that this works as expected for css/js within an html file, but not for language injections within PHP files. PhpStorm will recognize css within a heredoc, and html as a heredoc and in single- and double- quoted strings -- yet reformatting does not work in any of these cases.
Short of using an intermediary file to reformat the code, how can I get PhpStorm to reformat these sections of code? I am using PhpStorm 6.0.3 for Mac.
Their documentation states:
PhpStorm supports full coding assistance for:
CSS and JavaScript in an HTML or XML file.
CSS, JavaScript, and SQL outside PHP code blocks and inside PHP string literals.
The second bullet seems only half true, as css/js/sql are recognized but not subjected to code styles inside PHP string literals. And injected html is not specified; but between PhpStorm recognizing the language injections and its capability to apply code styles to an arbitrary selection, all the pieces for formatting embedded languages seem to be there. What am I missing?
To reformat injected code according to PhpStorm's code style preferences, select the injected code, open the Intention Actions list (Alt+Enter), and select "Edit __ Fragment" to edit it in its own dedicated window (documentation). In this window, code formatting will work as expected.

MathType Word Document export using MathPage MathML

I need to convert MS Word 2003 documents to HTML with MathML included if there are math equations. The quick solution I have found so far is using the MathType add-in to export the whole document to HTML with MathML using its "Publish to MathPage" function.
However, it doesn't do the conversion properly. Most of the equations in the document are still in image format instead of MathML. The strange thing is that it converts the commas into MathML, but not the equations.
The original word document:
https://dl.dropbox.com/u/4625393/test12.doc
The key part of the converted html source:
https://gist.github.com/katat/5091021
Is this a bug in MathType?
Kata, I'm not sure what versions of Word and MathType you are using, but I was able to successfully create the MathPage with MathML. I am using Word 2013 and MathType 6.9. This is the page I created: http://dl.dropbox.com/u/17008533/187.xht
Not sure what could have gone wrong with yours. It does seem that you chose an appropriate "target" for the MathPage; it looks like you chose XHTML+MathML.
If you can give me some more details about what steps you're taking from start to finish, I'll try to help more. Also let me know what versions of the software you're using.

Automatically tidy up JSP/JSF files

I am working on a web application and I do most of the XHTML stuff in an editor.
Every once in a while I forget to close a tag or mess up the nesting (we all get distracted sometimes ;-)).
So I compile, package and run my webapp (using Maven: mvn clean package jetty:run-war), only to notice that displaying the view (where I messed up the JSP) fails with an exception while trying to render.
So I wondered:
Is there some tool that I can include into my build-cycle that automatically catches and rectifies those careless mistakes?
There is the Maven CheckStyle plugin that looks at certain style rules in Java and other languages. It is customisable so you can add other rules. I can't say for sure that it will catch unclosed tags but this may be the place to start.
Using an IDE like Eclipse or NetBeans will also highlight any invalid code, so you can actually see a red mark on the page as you type. That may be even more effective.
Maybe a regular XML checker would do the trick. After all, a JSP file is, if properly written, valid XML.
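As a rough sketch of what such a check could look like as a script run before packaging, here is a small well-formedness test using Python's standard XML parser. It only detects problems; it does not repair them. It also assumes the view files really are XML (JSP documents or Facelets XHTML): classic JSP syntax such as <%@ ... %> directives will be reported as errors, as will entities like &nbsp; that are not declared in the file. The function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(data):
    """Return None if data (str or bytes) parses as XML, else the parser message."""
    try:
        ET.fromstring(data)
    except ET.ParseError as exc:
        return str(exc)
    return None


def check_files(paths) -> int:
    """Print one line per file that is not well-formed; return the failure count."""
    failures = 0
    for path in paths:
        with open(path, "rb") as fh:
            error = well_formedness_error(fh.read())
        if error is not None:
            print(f"{path}: {error}")
            failures += 1
    return failures
```

A build could call check_files on the view files and stop when the returned count is non-zero, so an unclosed tag is reported with its line and column before the webapp is started.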