I'm trying to implement my own little reader view app (an app that would do the same thing as reader mode in Safari), and there are a few things I find myself asking:
Is there a technical term for this feature (reader-view doesn't really cut it)?
Is there a standard that websites are supposed to follow in order to indicate the content they would like to have in their reader views?
Is there an open-source set of HTML parsing rules to pull the "readable" content from a website?
Is the effort to implement such a thing simply too big for a single person in a few weeks and if so should I opt for services such as Instaparser?
I believe the original was implemented by arc90, who called it Readability. You can check out their page here.
It's been ported to many different languages over time, so you could take a look at the different implementations to learn more about how it's done.
Python readability
JReadability
JavaScript
Ruby
This is just a small sample; there are many more implementations if you'd like to look further.
Edit: Oops, after some more Googling I found this question with an answer that explains it very well.
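To give a feel for what these ports do, here is a minimal sketch of the general idea, not arc90's actual algorithm: collect the text of each block element, penalise text that sits inside links, and keep the highest-scoring block. The names `BlockScorer` and `extract_main_text`, the tag lists, and the scoring weight are all assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Assumed tag lists for this sketch; real implementations use richer heuristics.
BLOCK_TAGS = {"div", "article", "section", "td"}
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class BlockScorer(HTMLParser):
    """Collects text per block element and counts how much of it is link text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # block candidates that are currently open
        self.blocks = []     # block candidates that have been closed
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append({"text": [], "link_chars": 0})

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth or not self.stack:
            return
        block = self.stack[-1]  # credit only the innermost open block
        block["text"].append(text)
        if self.link_depth:
            block["link_chars"] += len(text)


def extract_main_text(html):
    """Return the text of the block with the most non-link text, or ''."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()

    def score(block):
        total = sum(len(t) for t in block["text"])
        return total - 2 * block["link_chars"]  # link-heavy blocks score low

    best = max(parser.blocks, key=score, default=None)
    return " ".join(best["text"]) if best else ""
```

Calling `extract_main_text(page_source)` on a page with a navigation menu and one article should return the article text. Real pages need far more heuristics than this, which is a good reason to start from one of the existing ports.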
Does anyone know whether HTML5 has multilingual support?
I mean: if I design my website in English, will HTML5 convert the page into another language using some lang, translate, or other supporting tag?
Any guidance or bit of code would help me a lot.
Thanks a ton!
HTML5 doesn't translate anything. It’s simply a markup language that displays the contents you have put inside it without applying any logic to it.
However, such translations are usually done by modern browsers like Chrome, Firefox etc. These browsers detect that the website you are viewing is in some other language and offer you the choice to convert it to your preferred language.
The only thing that HTML5 has in connection with this whole translation thing is a new attribute called translate. By using this attribute with a value of "no" you can flag elements that you don't want to be translated by browsers:
<span translate="no">衝撞撞沒輸贏」應是「衝衝撞撞沒輸贏</span>
<!--Now no browser will dare to translate this.-->
The question is - do you want to auto-translate the content, or do you want to translate the user interface of your website.
By itself HTML5 doesn't really provide complete support for either, only some hints, and even they are not really implemented everywhere.
You can add a Google Translate button to your website, as described at http://www.labnol.org/internet/google-translation-widgets/10135/ , for example. Google Chrome has it built in and it works for all websites. Other browsers may get such functionality in the future, but that is a feature of the browser application and not of HTML.
For translating your site's interface you can use one of the many internationalization libraries. Many CMS's and web frameworks come with such an i18n library. You can also do it client-side, using a library such as jquery.i18n ( https://github.com/wikimedia/jquery.i18n/ ).
(Disclaimer: I am one of the developers of jquery.i18n.)
Finally, just a tip about good practice: Do use the lang attribute on all the relevant HTML elements. Even if it's just "en". This is useful for spell checking, picking the correct fonts, translation, etc. Whenever you know the language, specify it. It may seem redundant, but it is needed.
There is no integrated translation in HTML5. HTML provides support for the lang attribute which indicates what language a given tag's content is written in, but this is merely an indicative attribute which does not modify the content in any way.
The more philosophical reasoning behind why there is no automatic translation is that HTML is a structural language... it defines the framework upon which your web page is built. Not only does it not modify content, but if it did you as the developer would lose control over the quality and accuracy of the translation. This could be a very bad thing...
As you must already know, however, Chrome will translate pages in foreign languages for you into your preferred language. The difference here is that Chrome is controlled by the user, who has the power to decide in what language to view the content as well as whether they wish to view the original content which is explicitly defined by you, the developer.
Translation is up to the browser. To translate, you need a translation engine which will actually do the translation for you. As you know, you can't simply convert words one by one and end up with a proper translation in the other language. Chrome leverages Google Translate to translate its pages, and this is how it offers this functionality. Your best bet for multi-language support is to offer your pages in as many languages as you accurately can, and then hope that users who don't use Chrome will plug your URL into Google Translate or a similar service to get an approximation of your content in their native tongue.
How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!
I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.
One way I could go would be through regular expressions. I used Expression Web to analyse the problems and the regular expressions needed to correct them. So one way would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.
However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages and I'm afraid that future pages will contain patterns I've not yet solved, and I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, this would be great. The problem is that except for a few mentions here at SO I could not find any how-to-use documentation for this tool, except for the object-by-object API help file.
So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)
Can you tell me what kind of annoying problems you are having?
You don't need regex to clean the HTML, though: HAP will let you access the elements of malformed HTML using XPath queries.
Basically, you need to learn XPath to know how to get the HTML elements you want.
It really depends on the kind of HTML you are parsing with HAP.
There are several ways to get the elements:
by id, by class, or even by finding the element that follows another element containing a given text, like "name:" for example.
You can go to the W3Schools XPath Tutorial for a nice introduction to XPath.
What I took from the answers here:
1) If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes.
2) If you are limited to this known site, why not write your scraper to adjust for the problems?
So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:
I use Webius's SWExplorerAutomation to detect scenes in Web pages. The idea is that a Scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. the page title is "Account Login" and the page contains a "Login" text box and a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer: some changes in the web page can translate to changes in the scene file, saving the code from having to change. Additionally, this shields me from IE's event driven model: I call "scene. I'm evaluating this product but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is Watin, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against Watin.
Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it makes you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
Finally, after all the errors I know of have been corrected, I use HTML Agility Pack to convert to XHTML and cross the t's and dot the i's, so to speak: all lower case, quotes around attributes, and so on.
Hope this helps!
Avi
Regex can't be used for HTML cleaning.
Does http://tidy.sourceforge.net/ help?
If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter if you're using the regex <td color="red">\d+</td> to get the big red number from a page or if you're using a DOM parser to get the 3rd cell in the 2nd row in the table with id numbers to get the same. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.
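A small sketch of both failure modes, using only Python's standard library rather than HAP. The table markup, the `CellCollector` class, and the helper names are invented for the illustration; the regex is the one quoted above with a capture group added.

```python
import re
from html.parser import HTMLParser

RED_NUMBER = re.compile(r'<td color="red">(\d+)</td>')


class CellCollector(HTMLParser):
    """Collects the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


def by_regex(html):
    """Find the number by matching the exact attribute markup."""
    match = RED_NUMBER.search(html)
    return match.group(1) if match else None


def by_position(html, row=1, col=2):
    """Find the number as the 3rd cell of the 2nd row (zero-based 1, 2)."""
    collector = CellCollector()
    collector.feed(html)
    try:
        return collector.rows[row][col]
    except IndexError:
        return None


original = (
    '<table id="numbers">'
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    '<tr><td>d</td><td>e</td><td color="red">42</td></tr>'
    "</table>"
)
# The webmaster swaps the color attribute for a class attribute.
restyled = original.replace('color="red"', 'class="red"')
# The webmaster adds another row at the top of the table.
extra_row = original.replace("<tr>", "<tr><td>x</td><td>y</td><td>z</td></tr><tr>", 1)

for name, page in [("original", original), ("restyled", restyled), ("extra row", extra_row)]:
    print(name, by_regex(page), by_position(page))
# original 42 42
# restyled None 42
# extra row 42 c
```

Each approach survives the change that breaks the other, and neither survives both.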
If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.
Since you're using Html Agility Pack and know which problems occur, and if you are limited to this known site, why not write your scraper to adjust for the problems once you've loaded the HtmlDocument?
i.e.:
If you know the element always appears after the , insert the element into the first child position of the tag.....
In my Delphi program I want to display some information generated by the application. Nothing fancy, just 2 columns of text with parts of words color-coded.
I think I basically have two options:
HTML in a TWebBrowser
RTF in a TRichEdit.
HTML is more standard, but seems to load slower, and I had to deal with The Annoying Click Sound.
Is RTF still a good alternative these days?
Note: The documents will be discarded after viewing.
I would vote for HTML.
I think it is more future oriented. The speed would not concern me.
The question of HTML or RTF may be irrelevant. If they are just used for display purposes, then the file format doesn't matter. It's really just an internal representation. (Are any files even being saved to disk?) I think the question to ask is which one solves the problem with the least amount of work.
I would be slightly concerned that the browser control is changing all the time. I doubt the richedit control will change much. I would lean towards the richedit control because I think there is less that could go wrong with it. But it's probably not a big deal either way.
Have you considered doing an ownerdraw TListView?
I'd also use HTML. Besides, you just got an answer for the clicking sound in TWebBrowser.
If you'd rather not use TWebBrowser, take a look at Dave Baldwin's free HTML Display Components.
I would vote for HTML, too.
We started an app a while ago...
We wanted to
display some information generated by the application. Nothing fancy, just...
(do you hear the bells ring???)
Then we wanted to display more information and style it even more....
...someone decided that RTF wasn't enough anymore, but for backwards compatibility we moved on to MS Word over OLE-Server. That was the end of any talk about performance.
I think if we had done that in HTML it would be much faster now.
RTF is much easier to deal with, as the TRichEdit control is part of every single Windows installation, and has much less overhead than TWebBrowser (which is basically embedding an ActiveX version of Internet Explorer into your app).
TRichEdit is also much easier to use to programmatically add text and formatting. Using the SelStart and SelLength, along with the text Attributes, makes adding bolding and italics, setting different fonts, etc. simple. And, as Re0sless said, TRichEdit can easily be printed while TWebBrowser makes it more complicated to do so.
I would vote RTF, as I don't like the fact that TWebBrowser uses Internet Explorer; we have had trouble with this in the past on tightly locked-down computers.
Also, TRichEdit has a print method built in, whereas you have to do all sorts of messing about to get TWebBrowser to print.
Nobody seems to have mentioned a reporting component yet. Yes, it is overkill right now, but if you use it anyway (and maybe you already have got some reporting to do in your app, so the component is already included) you can just display the preview and allow to print / export to pdf later, if it makes any sense. Also if you later decide that you want to have a fancier display there is nothing holding you back.
If both HTML and RTF won't satisfy your need, you could also use an open source text/edit component that supports coloring words or create your own edit component based on a Delphi component.
Another alternative to the HTML browser is the "Embedded Web Browser" components, which I used in a few projects for displaying HTML documents to the user. You have complete control over the embedded browser, and I don't recall any clicks when a page is loaded.
I vote for HTML also
RTF is good only for its editor; otherwise you'd better go standard.
RTF offers some useful text editing options like horizontal tabulator which are not available in HTML. Automatic hyperlink detection is also a nice extra. But I think I would prefer HTML, if these features are not required.
I vote for HTML.
Easier to generate programmatically.
Widely supported.
Since you don't need WYSIWYG capabilities, I think HTML's advantages trump RTF. Moreover, should the need arise to export generated data for further, WP-like editing, remember that major word processors can open and convert HTML files.
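As a rough illustration of the "easier to generate programmatically" point, here is a sketch of building the two-column, colour-coded markup as a plain string. It is written in Python only to keep it short; the same string building carries over to Delphi. The function names and the `(start, end, color)` range format are assumptions for this example.

```python
from html import escape


def colorize(text, highlights):
    """Escape text and wrap (start, end, color) character ranges in <span> tags.

    The ranges are assumed to be sorted and non-overlapping.
    """
    parts = []
    pos = 0
    for start, end, color in highlights:
        parts.append(escape(text[pos:start]))
        parts.append(f'<span style="color:{color}">{escape(text[start:end])}</span>')
        pos = end
    parts.append(escape(text[pos:]))
    return "".join(parts)


def two_column_table(rows):
    """rows: iterable of (left, right) pairs; each side is (text, highlights)."""
    lines = ["<table>"]
    for left, right in rows:
        lines.append(f"<tr><td>{colorize(*left)}</td><td>{colorize(*right)}</td></tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

For example, `two_column_table([(("unhappy", [(0, 2, "red")]), ("happy", []))])` colours only the "un" prefix in the left column. The resulting string can be loaded into whichever HTML display component you pick.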
Use HTML, but with 'Delphi Wrapper for Chromium Embedded' by Henri Gourvest. Chromium Embedded uses the core that powers Google Chrome.
Don't use TWebBrowser. I suffer with every program that uses IE's web control: the font is too small on my 22" monitor at 1920x1080. I use Windows 7 with the system DPI set to 150% (XP mode), and I've tried every tweak I could find to fix it, with no luck.
I work on a web application product which allows mnemonics (i.e. an underscore below the character 'C', to allow a keyboard combination and the key C to trigger the "Close" button).
Forms are created by different developers and they can each statically set mnemonics for buttons.
Forms can be nested, so it is not necessarily known at design time the exact mnemonics which will be required for one page.
There can be at most one mnemonic using any character on a page containing many forms.
And here's the kicker, the forms must be able to be localised into any language, meaning that the 'C' for close may not even appear in the... [insert language] word used for "Close".
The ideal solution would be some algorithm where developers didn't have to manually specify a mnemonic, instead they would be worked out at run-time, they would be localised, and they would be both convenient and consistent (I did say the ideal solution ;-D).
So I was wondering, are there any good strategies for achieving something anywhere near the ideal solution?
EDIT: To clarify,
I'm not talking about keyboard accelerators, such as Ctrl+S for save, which is hidden on a menu. The mnemonics are only used for actions which are presented on the screen, under button labels for example. Not hidden keyboard shortcuts that would change on localisation (there are none anyway, we run in a web browser, so the only accelerators are those which are part of whichever browser is being used).
The problem with attempting to choose the mnemonics at design time is that the people responsible for developing the UI are not aware of the localisation, as it could be done months later. Also, the problem of using nested and modular forms means that even without the localisation, there could still be conflict.
Some of the ideas I've batted around include having a global mnemonic registry which forms could use to apply for a certain mnemonic based on its localised label; the registry would then calculate the best use of the available characters. Somehow it would have to maintain that state, so that the same form does not appear with different mnemonic sets over the course of the application's use; it could possibly even be done statically and persisted.
Surely if I was looking to do something like that it would fit a more general algorithm - I just have no idea which one! :-)
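One general algorithm this maps onto is bipartite matching: labels on one side, the characters they contain on the other, and each character usable by at most one label. Below is a minimal sketch of such a registry using augmenting paths (Kuhn's algorithm). The function name is hypothetical, and it assumes labels are unique strings and that any letter or digit in a label is an acceptable mnemonic, which ignores the keyboard-layout and consistency concerns.

```python
def assign_mnemonics(labels):
    """Map each label to a distinct character it contains, where possible.

    Bipartite matching via augmenting paths: labels on one side, candidate
    characters on the other. Labels are assumed to be unique strings.
    Labels that cannot get any character are left out of the result.
    """

    def candidates(label):
        # Letters and digits in order of appearance, without repeats.
        seen = []
        for ch in label.lower():
            if ch.isalnum() and ch not in seen:
                seen.append(ch)
        return seen

    owner = {}  # character -> label currently holding it

    def try_assign(label, visited):
        options = [ch for ch in candidates(label) if ch not in visited]
        # Prefer a character nobody holds yet.
        for ch in options:
            if ch not in owner:
                owner[ch] = label
                return True
        # Otherwise ask a current holder to move to another of its characters.
        for ch in options:
            if ch in visited:
                continue
            visited.add(ch)
            if try_assign(owner[ch], visited):
                owner[ch] = label
                return True
        return False

    for label in labels:
        try_assign(label, set())
    return {label: ch for ch, label in owner.items()}


print(assign_mnemonics(["Close", "Cancel", "C"]))
# {'C': 'c', 'Cancel': 'a', 'Close': 'l'}
```

The result depends on the order in which labels are registered, so to keep a form from showing different mnemonics over time the computed mapping would have to be persisted, as suggested above.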
I tried to do something similar on a past project, and abandoned it. It was too complicated to get done in any reasonable amount of time.
One of the challenges is that some languages don't have a single displayable "letter" that maps to a single key on the keyboard. Another challenge, in English, was that usability standards required the mnemonic letters to be consistent with those in similar buttons/menus in other apps. This can be difficult if you are dynamically choosing the letters.
I don't know if it could be called "best practice," but consider what Microsoft Internet Explorer does in Japanese. Note the familiar F, E, V, A, and D mnemonics on the menu and the toolbar. I imagine that it follows the same convention, where appropriate, for buttons on forms and such.
(source: sidenet.ddo.jp)
(I snagged that screenshot from a google image search. If it goes stale, you can find other pictures of jp-localized IE pretty easily.)
This is really a design problem, not an algorithmic problem. It turns out that most applications don't localize keyboard accelerators, including most Microsoft ones, although there are some exceptions in certain markets. Not every keyboard shortcut is a mnemonic; really, only a few of the most common ones are.
I should note that this election not to localize accelerators is a rather recent trend; prior to 2000 or so, it was still quite common to localize shortcuts in some products (examples being ctrl-F for "Fett" instead of "bold" in German and Swedish products). But the pendulum has swung in the opposite direction, perhaps as a consequence of MUI and similar features.
A few localization tools will help you on this; I saw this feature as a bullet point on a product I've never used called Visual Localize. I'm not sure how useful automatic assignment is, as it's a fairly hard problem to automatically decide which character is the best mnemonic representation anyway, without domain knowledge of a particular product.
Generally, it only makes sense to localize the underlined mnemonic characters on dialogs, and maybe in menus. Most localization service firms are familiar with this process, and some have tools to detect duplicates in any build-time resources before handing back the localized resource package. You might actually want to invest in locating or building a tool that can do this duplicate check at runtime, and run the tool as part of acceptance criteria.
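A duplicate check of this kind can be quite small. Here is a sketch, assuming labels mark the mnemonic with an ampersand before the character (as Windows resource files do) and that "&&" is an escaped literal ampersand; the function name is made up for the example.

```python
import re
from collections import defaultdict

MNEMONIC = re.compile(r"&(\w)")


def find_duplicate_mnemonics(labels):
    """Return {character: [labels...]} for every mnemonic used more than once.

    Assumes the mnemonic is the character following the first single '&';
    an escaped '&&' is treated as a literal ampersand and ignored.
    """
    users = defaultdict(list)
    for label in labels:
        match = MNEMONIC.search(label.replace("&&", ""))
        if match:
            users[match.group(1).lower()].append(label)
    return {ch: names for ch, names in users.items() if len(names) > 1}


print(find_duplicate_mnemonics(["&Close", "&Cancel", "&Save", "Fish && Chips", "He&lp"]))
# {'c': ['&Close', '&Cancel']}
```

Run over the labels of every form that can appear on one page, an empty result means no conflicts for that page.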
For regular menu items or keyboard command sequences, it can be more confusing than helpful, unless you have a fully baked keyboard to command mapping customization feature.
The problem I see with doing this at runtime is what happens when you deploy a version which has new forms and changes Close from alt-c to ctrl-c. Or, when you have two actions on two different pages that are both Close, you want to make sure Close is always alt-c. Even worse would be if the algorithm were based on something non-deterministic and could change over time without a deploy.
It just seems like you might spend more time trying to build an algorithm for something that should be decided upon at design time.