Translate HTML files to another language

Translate HTML files to another language - html

I have a website with Dutch text which I want to translate to English. Is there a fast way of doing this with keeping the HTML tags(<strong>,<span>) in tact. I know I can just copy the parsed TEXT into a translator but this will remove the formatting.
I also know that at the end I have to go trough the text manually to fix some minor spelling and grammar.

Online translators are good to turn foreign text into something that can be understood, but they are useless for producing quality translations. Even if you fix obvious problems at the end, you will get an amateurish word-by-word translation. If you want your visitors to take you seriously, you should translate from scratch.

If you want to preserve the HTML formatting at the same time as translating, you will have to work directly with the HTML source and update the text yourself without touching the formatting.
You may be able to use an XML editor like XmlSpy that will let you edit text nodes directly without touching the tagging, but this requires that the HTML is actually XHTML. You may still need to translate some attributes (such as title and alt attributes).

Is a virtual traslate a good option for you? Because if you paste google translato script into your page source, it will translate your text on the site, and the formating will stay there too. http://translate.google.com/translate_tools

Related

is it possible to change text at the touch of a button in HTML

I have a problem where I need to change the displayed text between two different writing styles and wanted to ask if such a thing is even possible in HTML.
so like I my head I'm thinking about putting the text as a variable and saying
if output = 1:
display simple_text:
else display complex_text:
Here, the complex and simple text would be the variables the text needed to change is set to.
Thanks in advance for answering and reading my question

HTML is a markup language intended to describe the structure of a document in a both machine- and human-readable way.
As such, HTML doesn't have any logic like if...else or loops.
So to do what you want you will either need a template engine (which would decide at serve-time which text would be displayed, on the server), or Javascript, to implement the logic on the client-side (browser). Note that Javascript can be used on the server as well if the server runs Node.js.
To decide which one to go for, here's some cornerstones:
If the decision which text to display must only be made once - and won't change after that, going for a template engine on the server-side is probably the best approach.
If what is to displayed depends on some actions the user can perform (like you mentioned, clicking a button), go for a Javascript-based approach in the browser.

allowing users to add html formatted notes

We want to allow the users of our web application, to leave notes formatted with html.
On client side we are providing them with ckeditor [http://ckeditor.com/] which is a wisywig editor that generates html, that is then submitted to the server via a form
We then want to display the notes created by the users, with exactly the same formatting as they submitted them
My concerns are:
Putting attacks and bad intentions aside, how can I encapsulate the note when displayed on the site, so that
a. They don't inherit the design from the rest of the page
b. They don't influence the rest of the page, for example by opening and not closing a tag accidentally, or closing without opening.
Malicious code injection attacks
At the moment, the first is much more important, as it's an in house product for our clients, and is not open to the wide public. But security comments are very wellcome as well
Possible solutions that I consider are:
Ideally, I look for a way to encapsulate this pieces of user html, like : inside this area I show what you submitted (rendered, not source), you cannot influence and are not influenced by the code on other parts of the page
Specifically, we thought of displaying the notes inside iframes.
Other natural direction is dealing with parsing the inserted contents, and stripping out stuff.
Any inputs are welcome, and mainly:
How can I "encapsulate" the inserted contents, if I can?
Any comments on the iframe direction
Do I have to parse the contents anyway? What do I absolutely have to strip out?

How can I "encapsulate" the inserted contents, if I can?
The truth is unless you 'fix' their code (via some kind of check) you will get issues (think broken divs, etc). I don't see how you can encapsulate HTML FROM HTML. I would however only let them put in content like bold, italicize, center, etc;
Any comments on the iframe direction
Personally I wouldn't go that route, new can of worms for security and not a 'clean' way of doing this.
Do I have to parse the contents anyway? What do I absolutely have to strip out?
Yes don't be lazy, some devs always say "well I dont need it, its internal" and then it becomes an external thing, and at that point its so big that ONLY a full re-write will set it right, and it keeps chugging along until something is broken, then shit hits the fan and the big boss cries out why hasn't this been done. Long story short.
Yes you have to parse / validate / check all your input, wether internal or external. Anything other than that is just lazy.
In closing I would do it by using an editor like here on SO, which only allows some types of selective formatting. After all a broken <b> will not kill your whole layout, a <div> will...

Markdown formatting
You could use exactly the same type of intermediary solution that this site (StackOverflow) uses in it's user-generated-content (questions, answers, comments).
It's not the complete solution that could replace WYSIWYG solutions like the code editor, but it's just what a usual user-generated-content woudl require. It even allows you to include images.
For a complete guide:
https://www.markdownguide.org/cheat-sheet

Having the HTML of a webpage, how to obtain the visible words of that webpage?

Having the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the correspondent page? I have thought of getting everything that's between the <a>..</a> and <p>...</p> but that is not working that well.
Keep in mind as that this is for a school project, I am not allowed to use any kind of external library (the idea is to have to do the parsing myself). Also, this will be implemented as the HTML of the page is downloaded, that is, I can't assume I already have the whole HTML page downloaded. It has to be showing up the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL the cases, just to be satisfatory most of the times.

I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly—certainly way outside the bounds of a course exercise. Any naïve approach you come up involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.

Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.

i'd consider writing regex to remove all html tags and you should be left with your desired text. This can be done in Javascript and doesn't require anything special.

I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.

HTML 5 includes a detailed description of how to build a parser. It is probably more complicated then you are looking for, but it is the recommended way.

You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.

Why do I need Markdown?

Why do I need a Markdown with a front edit editor like WMD? What does the markdown do to the content that’s sent from the WMD editor?
How does Markdown store the content in the backend? Is it the same way like *bold* or in some other format? Why can’t I just do an html encode?
Sorry if I sounded very naïve.

It's probably helpful to take a step back and ask some of the larger questions. The issue Markdown is trying to solve is that of rich editing in the browser. Consider this: At some point, for any piece of software to enable rich text it has to describe the richness in a some manner, however that may be.
We could call that description of richness (by description of richness I mean like "this bit of text is bold" or "this bit of text is a hyperlink), we could call that description of richness "markup" -- it marks up the text with meta "richness".
Implementations of rich text can take on two approaches, either a.) hide the markup from the user or b.) let them have access to the markup.
For those who choose to hide it, the end result is very often WYSIWYG. The user is oblivious to what is happening behind the scenes. The editor takes care of the details. Think MS Word as an example. No one manipulates the Word markup format as a regular end user.
For implementations which choose to expose the markup, a markup language is then in order to allow users to interacat with it. Such markup languages would be things like HTML doing <tag> or BB code for example, doing things like [tag].
Markdown is one such of these languages.
As opposed to the former types I mentioned, Markdown has tried to design itself so that the markup renders common ASCII people already use. For example, it's common for people to asterisk their text to set it off, *important*, and this notation in Markdown is an indicator of italic.
In regards to storage, as Stephan pointed out, the system will most likely store the raw markdown, because the user will most likely need to have the possibility of editing, and the original markdown can be recalled for that purpose.
In most of the systems I've built, I store the markdown, and then normalize it to a 2nd field which caches the HTML rendering of the markdown. This way I don't have to do markdown->HTML rendering for every markdown field. It takes a little more space, but I'd rather the user have a faster response than use less DB storage space.
Care should also be taken when accepting Markdown from the browser, as it can easily contain <script> tags which need to be filtered out. Most markdown implementations will also recognize HTML intermingled with Markdown formatting, as so to be safe, you need to make sure your inputs and caches are sanitized properly.

The reason for using an alternate encoding system other than HTML is for security
Markdown and other such wiki style encoding systems do not usually support scripting languages
HTML supports scripting languages in many ways (
The two main security issues are:
Malware criminals use scripts in user generated content to attempt malware actions on the content readers computer by scripting to access known security holes
Free loaders using scripts to subvert the rest of the site by changing the content frame or styles i.e. ads, menu's, logos etc. This can also be criminal behaviour if not just annoying
By using an intermediate language such as Markdown you have total control on the rendered output
Filtering HTML is possible, but is also complex and risky
The other significant reason for an alternate encoding system is enforcement of style. Normal HTML has too many options. By limiting the available options, users can only use certain styles. The usually makes for cleaner looking and more readable content (compare SO to Ebay)

The main reason for using Markdown is the readability of a marked text. For instance, you can send it in a plain-text email and the reader will still understand the emphiasis, bullets, the text will be divided in paragraphs et cetera.
When you ask about storing data, it depends. If you enable Markdown in the WordPress blog engine, it stores data as the user has input it - in Markdown. In Stack Overflow, however, it seems like the data is stored as HTML. At least, the "Stack Overflow data dumps" contain HTML, not Markdown (I've seen people complaining) that they have to convert it back).
If you use the WMD editor, you can show the user how the outputs will look like after being converted to HTML. Even though Markdown syntax is really simple, it is not hard to make mistakes. Hence, it is best to show users the output.
Another reason for using Markdown instead of a WYSIWIG control - a WYSIWIG control allows the user to use HTML in data you are displaying on your web page. So, you have to be the one who decides when there is simply incorrect HTML and when it is an evil XSS/CSRF/whatever injection. In Markdown, you simply convert *something* to <b>something</b>, remove any unknow HTML elements and you're done.

A WYSIWYG Markdown control for Windows Forms?

[We have a Windows Forms database front-end application that, among other things, can be used as a CMS; clients create the structure, fill it, and then use a ASP.NET WebForms-based site to present the results to publicly on the Web. For added flexibility, they are sometimes forced to input actual HTML markup right into a text field, which then ends up as a varchar in the database. This works, but it's far from user-friendly.]
As such… some clients want a WYSIWYG editor for HTML. I'd like to convince them that they'd benefit from using simpler language (namely, Markdown). Ideally, what I'd like to have is a WYSIWYG editor for that. They don't need tables, or anything sophisticated like that.
A cursory search reveals a .NET Markdown to HTML converter, and then we have a Windows Forms-based text editor that outputs HTML, but apparently nothing that brings the two together. As a result, we'd still have our varchars with markup in there, but at least it would be both quite human-readable and still easily parseable.
Would this — a WYSIWYG editor that outputs Markdown, which is then later on parsed into HTML in ASP.NET — be feasible? Any alternative suggestions?

I think the best approach for this is to combine
Converting Markdown to HTML &
Displaying HTML in WinForms
The most up to date Markdown Library seems to be markdig which you can install via nuget
A simple implementation might be to:
Add a SplitContainer to a Form control, set Dock = Fill
Add a TextBox, set Dock = Fill and set to Multiline = True
Add a WebBrowser, set Dock = Fill
Then handle the TextChanged event, parse the text into html and set to DocumentText like this:
private void textBox1_TextChanged(object sender, EventArgs e)
{
var md = textBox1.Text;
var html = Markdig.Markdown.ToHtml(md);
webBrowser1.DocumentText = html;
}
Here's a recorded demo:

#Soeren,
You can most definitely embed IE with the Javascript Markdown editor inside a Windows Forms application.

the RichTextBox control
So you want to use Markdown but you want the user not to know it? This might not be an achievable goal. I think the point of Markdown is that it is geared toward writers that are willing to learn a little bit of fairly natural syntax and edit everything in plain text (like Wikipedia? are there pure WYSIWYG editors for that? probably... and probably some other Wikipedia editor person has to come and clean up the resulting markup and formatting...). If you want it to be transparent to the user (like MS Word) Markdown may not be what you want or give you the advantages it advertises in that situation.
The input happens in Windows Forms
Oops! Now I understand better your question. I guess it depends on how your Windows Forms app looks whether the embedded IE control sticks out like a sore thumb. If you try it you might find that you can get it to work.[1]
In your position, I would try something like this [2]:
http://wmd-editor.com/examples/splitscreen
If you don't think that sort of arrangement will go over well with your users, (especially all the editing in the text-only window) then once again I don't think Markdown is the answer for your specific application. If you think your users are keen on the idea of editing pure text, then I bet we can find a solution. Please clarify?
Jared.
[1] I had success dropping an IE HTML control into a project strictly to display some generated results as a PDF (using an IE Reader plugin like Adobe Reader or Foxit). The user has no idea that that part of the GUI is an IE control, it just shows the PDF, allows printing and saving, etc.
[2] ...but remove the borders and make the two split controls touch all four edges of the embedded IE control, or get very close... keep it light grey or white, for example, and eliminate any borders of the IE control so it blends into the surrounding controls. Maybe put this on its own tab page and I challenge a non-technical user to tell/care if it's an HTML control or native.
I could totally be wrong about all this (one would have to see this in action to determine if it would work) but it might be easier than writing your own interactive Markdown editor...
...actually to implement your own C# Markdown editor, you could just put a text edit box next to an embedded IE control and run the current Markdown through the .NET Markdown->HTML converter on a separate thread, and replace the HTML in the IE control (assuming the Markdown->HTML converter is very liberal and robust against throwing ANY exceptions).

Can't you just use the same control I'm Stack Overflow uses (that we're all typing into)---WMD, and just store the Markdown in the VARCHAR. Then use the .NET Markdown to HTML converter, as you mentioned, to display the HTML as needed. Jeff talks about this in more detail in a StackOverflow podcast (don't know the episode number).

"WYSIWYG Markdown" is really an oxymoron since the whole point of Markdown is to allow you to write markup syntax naturally and intuitively which is then post-processed into html, unless you mean actually taking for example **text** and rendering it as **text** for example. That would actually be kind of cool, but it would get very difficult for things like numbered and bulleted lists, since you would have to do all the positioning, yet keep everything based on actual textual characters (e.g. '*' instead of the bullet symbol) and support proper textual input positioning, backspace, etc.
For example,
in this bullet list,
the bullets would actually have to be asterisks,
and the spacing would not really be there.
That would certainly be worth paying attention to, if someone did tackle it.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008