I am creating a Windows Forms application to compare two HTML documents. The first one is retrieved from an external source and contains some structural mistakes. An algorithm transforms that HTML text into an optimized document, which is the second document to compare. After that, I want to visually compare the optimized document with the first one and display the differences, if there are any.
I created a form with two WebView2 controls, where the first one displays the first HTML document. A button triggers the HTML transformation, and I would like to know the best way to perform a visual comparison. Ideally, the second WebView2 would display the transformed HTML document with the differences highlighted, for example in a special color.
My first approach was to use this call:
await webview2.CoreWebView2.ExecuteScriptAsync("window.print();");
and then perform a visual comparison of the two screenshots, but I saw that controlling the print popup window is not currently possible with the WebView2 component.
So do you think there is a better way to accomplish that? Are there any more suitable components or tools to perform this comparison?
Thanks in advance for your help!
In the algorithm, could you add the ids of the changed elements to a list, then use that list to change the CSS via JavaScript and highlight those elements?
I'm working on an implementation of it now to see if it's viable.
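As a rough sketch of that idea (not a tested WebView2 integration): the helper below is hypothetical, and it assumes the transformation algorithm can hand over the ids of the elements it changed. It takes any object that exposes getElementById, which would be the page's own document in practice.

```javascript
// Hypothetical helper: highlights every element whose id is in `changedIds`.
// `doc` is any object exposing getElementById (the page's `document` in practice).
function highlightChanged(doc, changedIds, color = "yellow") {
  const missing = [];
  for (const id of changedIds) {
    const el = doc.getElementById(id);
    if (el) {
      el.style.backgroundColor = color;
    } else {
      missing.push(id); // id reported by the algorithm but absent from the page
    }
  }
  return missing; // lets the caller see which ids could not be highlighted
}
```

The function source, followed by a call such as highlightChanged(document, ["intro", "price"]) with your own ids, could then be passed as a string to ExecuteScriptAsync on the second WebView2.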
I have a problem where I need to switch the displayed text between two different writing styles, and I wanted to ask whether such a thing is even possible in HTML.
In my head, I'm thinking about putting the text in variables and saying something like:
if output == 1:
    display simple_text
else:
    display complex_text
Here, complex_text and simple_text would be the variables holding the text that needs to change.
Thanks in advance for answering and reading my question
HTML is a markup language intended to describe the structure of a document in a way that is both machine- and human-readable.
As such, HTML doesn't have any logic like if...else or loops.
So to do what you want you will need either a template engine (which decides on the server, at serve time, which text is displayed) or JavaScript, to implement the logic on the client side (in the browser). Note that JavaScript can be used on the server as well if the server runs Node.js.
To decide which one to go for, here are some guidelines:
If the decision about which text to display only has to be made once and won't change after that, a template engine on the server side is probably the best approach.
If what is displayed depends on actions the user can perform (like clicking a button), go for a JavaScript-based approach in the browser.
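A minimal sketch of the browser-side variant, assuming the page contains an element with id "story" for the text and a button with id "toggle" (both ids and both sample texts are made up for illustration):

```javascript
// Hypothetical texts; in a real page these might come from data attributes.
const texts = {
  simple: "The cat sat on the mat.",
  complex: "The feline positioned itself upon the woven floor covering.",
};

// Pure decision logic: 1 selects the simple text, anything else the complex one.
function chooseText(output) {
  return output === 1 ? texts.simple : texts.complex;
}

// Browser wiring, skipped when no DOM is available (e.g. under Node.js).
if (typeof document !== "undefined") {
  let output = 1;
  const target = document.getElementById("story");  // assumed element id
  const button = document.getElementById("toggle"); // assumed element id
  if (target && button) {
    target.textContent = chooseText(output);
    button.addEventListener("click", () => {
      output = output === 1 ? 2 : 1; // flip between the two styles
      target.textContent = chooseText(output);
    });
  }
}
```

Keeping the if/else in its own function means the same decision could also run on the server if you later move to a template engine.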
I want to have a couple of HTML websites in a single file. Is this possible?
A website is not just an html "page".
An html file represents the document structure of one page.
Saying that you want to represent multiple websites in one HTML file is like saying that you want to write different documents (your tax files, a book, a ticket for a movie, etc.) based on one single template.
While you can theoretically change the structure of such a document dynamically, there is absolutely no point in doing so.
HTML describes the structure of Web pages using markup.
So why would you use a single HTML file to represent different web pages?
Sorry, but you can't. It's not possible. Why would you even do it?
The only thing that comes to my mind is to use the <embed> tag, for example. But it's probably not what you really want.
You must be more specific; the question is vague. In general, you can write code that dynamically changes the website's appearance in response to inputs or actions from the user. For example, JavaScript code can show or hide something (or the complete website) while the mouse is over an element, or when an element is selected or deselected. It can all be in a single HTML document (HTML5, CSS3, JavaScript/jQuery).
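For illustration only, here is a small sketch of that show/hide idea. It assumes each "page" is a section element with its own id inside the single HTML document; the function name and the ids are hypothetical.

```javascript
// Shows the section whose id is `activeId` and hides the other ones.
// `doc` is any object exposing getElementById (the page's `document` in practice).
function showPage(doc, pageIds, activeId) {
  for (const id of pageIds) {
    const section = doc.getElementById(id);
    if (section) {
      section.hidden = id !== activeId; // `hidden` is a standard HTML attribute
    }
  }
}
```

In a browser, a call such as showPage(document, ["home", "about", "contact"], "about") could be attached to the click handlers of the navigation links.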
I'm trying to write a crawler that gets raw HTML data, finds the title, price, update date, photo, etc. fields and writes them to a database. This is the classic, old way to crawl data.
I think I can do this job another way.
If I crawl all the pages of the website (maybe more than 1,000) and compare them all, I can find the specific areas.
I mean, the HTML tags will always be the same. Only specific areas such as the title, image, etc. will change.
So, what is the best way to determine changed areas?
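To make the idea concrete, here is a naive sketch of the comparison. It rests on a strong assumption: every page has already been reduced to a list of node texts of the same length and in the same order, which real pages will not always satisfy. The function name is made up.

```javascript
// Returns the positions whose value differs between at least two pages.
// Assumption: every page was already reduced to a same-length list of node texts.
function findChangingSlots(pages) {
  if (pages.length === 0) return [];
  const [first, ...rest] = pages;
  const changing = [];
  for (let i = 0; i < first.length; i++) {
    // A slot is "changing" as soon as one page disagrees with the first page.
    if (rest.some((page) => page[i] !== first[i])) {
      changing.push(i);
    }
  }
  return changing;
}
```

For example, with pages like ["Shop", "Red shoe", "$10"] and ["Shop", "Blue hat", "$7"], the positions of the product name and the price would be reported, while the constant "Shop" header would not.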
compare them all I can find the specific areas
what is the best way to determine changed areas?
In your question you propose a scraping/crawling approach that compares parts of pages to get the data of specific areas. This smells of a regex approach. Do not use it, as it is very inefficient. Instead, use XPath, which operates on XML structures.
So, keep it simple:
Get the HTML
Turn it into a DOM
Make the DOM valid XML
Apply XPath queries to the XML
Believe me, XML libraries are well able to handle huge structures (including idle HTML tags) and traverse them. A classic example of using XPath is in this post of mine.
To determine the paths of the data nodes, just use the web inspector tools (F12 in Chrome and IE, Ctrl+Shift+I in Firefox) to see the HTML tags containing the useful info.
Given the HTML of a webpage, what would be the easiest strategy to get the text that's visible on the corresponding page? I have thought of getting everything that's between <a>...</a> and <p>...</p>, but that is not working very well.
Keep in mind that this is for a school project, so I am not allowed to use any kind of external library (the idea is to do the parsing myself). Also, this will run while the HTML of the page is being downloaded, that is, I can't assume I already have the whole HTML page. It has to show the extracted visible words as the HTML is being downloaded.
Also, it doesn't have to work for ALL cases, just be satisfactory most of the time.
I am not allowed to use any kind of external library
This is a poor requirement for a ‘software architecture’ course. Parsing HTML is extremely difficult to do correctly, certainly way outside the bounds of a course exercise. Any naïve approach you come up with involving regex hacks is going to fall over badly on common web pages.
The software-architecturally correct thing to do here is use an external library that has already solved the problem of parsing HTML (such as, for .NET, the HTML Agility Pack), and then iterate over the document objects it generates looking for text nodes that aren't in ‘invisible’ elements like <script>.
If the task of grabbing data from web pages is of your own choosing, to demonstrate some other principle, then I would advise picking a different challenge, one you can usefully solve. For example, just changing the input from HTML to XML might allow you to use the built-in XML parser.
Literally all the text that is visible sounds like a big ask for a school project, as it would depend not only on the HTML itself, but also any in-page or external styling. One solution would be to simply strip the HTML tags from the input, though that wouldn't strictly meet your requirements as you have stated them.
Assuming that near enough is good enough, you could make a first pass to strip out the content of entire elements which you know won't be visible (such as script, style), and a second pass to remove the remaining tags themselves.
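A minimal sketch of those two passes, under the "near enough" assumption: it uses simple regular expressions, so attribute values containing ">" will confuse it, entities such as &amp; are left undecoded, and styling that hides elements is ignored. The function name is hypothetical.

```javascript
// Two-pass approximation of "visible text"; not a real HTML parser.
function stripToText(html) {
  return html
    // Pass 1: drop <script> and <style> elements together with their content.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Pass 2: drop the remaining tags, keeping the text between them.
    .replace(/<[^>]*>/g, " ")
    // Tidy up: collapse the whitespace left behind by the removed tags.
    .replace(/\s+/g, " ")
    .trim();
}
```

Tags are replaced with a space rather than an empty string so that words in adjacent block elements do not run together; the price is that a word split by an inline tag is also split in the output.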
I'd consider writing a regex to remove all HTML tags; you should be left with your desired text. This can be done in JavaScript and doesn't require anything special.
I know this is not exactly what you asked for, but it can be done using Regular Expressions:
//javascript code
//should (could) work in C# (needs escaping for quotes) :
h = h.replace(/<(?:"[^"]*"|'[^']*'|[^'">])*>/g,'');
This RegExp will remove HTML tags, notice however that you first need to remove script,link,style,... tags.
If you decide to go this way, I can help you with the regular expressions needed.
HTML 5 includes a detailed description of how to build a parser. It is probably more complicated than you are looking for, but it is the recommended way.
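To give a flavour of the state-machine style, here is a heavily simplified sketch that accepts the HTML in chunks as it is downloaded. It is not the HTML 5 parsing algorithm: it only knows a "text" and a "tag" state, skips the content of script and style elements, and ignores comments, entities and stray "<" characters. The class and method names are made up.

```javascript
// Tiny streaming text extractor; a simplification, NOT the HTML 5 algorithm.
class StreamingTextExtractor {
  constructor() {
    this.state = "text";   // "text" or "tag"
    this.tagBuffer = "";   // unquoted characters of the tag being read
    this.quote = null;     // quote character currently open inside a tag
    this.skipUntil = null; // closing tag to wait for, e.g. "</script>"
    this.tail = "";        // last few characters seen while skipping
    this.output = "";
  }

  // Consumes one chunk; tags may be split across chunk boundaries.
  feed(chunk) {
    for (const ch of chunk) {
      if (this.skipUntil !== null) {
        // Inside <script>/<style>: ignore everything until the closing tag.
        this.tail = (this.tail + ch.toLowerCase()).slice(-this.skipUntil.length);
        if (this.tail === this.skipUntil) {
          this.skipUntil = null;
          this.tail = "";
        }
      } else if (this.state === "text") {
        if (ch === "<") {
          this.state = "tag";
          this.tagBuffer = "";
        } else {
          this.output += ch;
        }
      } else if (this.quote !== null) {
        if (ch === this.quote) this.quote = null; // end of attribute value
      } else if (ch === '"' || ch === "'") {
        this.quote = ch; // a ">" inside quotes must not end the tag
      } else if (ch === ">") {
        this.closeTag();
      } else {
        this.tagBuffer += ch;
      }
    }
    return this;
  }

  closeTag() {
    const name = this.tagBuffer.trim().split(/[\s/]/)[0].toLowerCase();
    this.state = "text";
    this.output += " "; // keep words in adjacent elements apart
    if (name === "script" || name === "style") {
      this.skipUntil = `</${name}>`;
    }
  }

  // Text extracted so far, with whitespace collapsed.
  get text() {
    return this.output.replace(/\s+/g, " ").trim();
  }
}
```

Each downloaded chunk would be passed to feed(), and the text property can be read at any time to show the words extracted so far.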
You'll need to parse every DOM element for text, and then detect whether that DOM element is visible (el.style.display == 'block' or 'inline'), and then you'll need to detect whether that element is positioned in such a manner that it isn't outside of the viewable area of the page. Then you'll need to detect the z-index of each element and the background of each element in order to detect if any overlapping is hiding some text.
Basically, this is impossible to do within a month's time.
I was wondering about, and wanted to gather together in one place, all the missing features of our beloved HTML form elements.
One example is the missing horizontal scrollbar in a list box. But I am sure there are a lot of features we would like to see in our form elements by default.
One missing feature per answer please.
Thank you.
Date/Time picker controls, rather than always trying to manipulate a textbox, selects, or some other controls to create them.
Hell, they miss so many features, I wouldn't know where to begin! But here goes:
(Missing in HTML 4, don't know about 5)
Full visual customizability (background colours, borders, and text colours) for all elements (including checkboxes, radio buttons, and select elements)
Native input validation (without needing JS) for text inputs: Numeric only, alphabetic characters only, regular expression
An open enumeration, a "SELECT you can type in" would be handy in some situations.
If pretty much everyone, but not quite, answers the question in one of ten or 15 different ways, you have to either force everyone to type in the answer or have an "other" option with a separate text field.
The lack of intrinsic support for multiple windows (or even just modal dialogs) is ridiculous.
Think of the tens of thousands of programmer-hours wasted on acrobatic manipulation of div elements just to implement a UI that would be trivially easy in a desktop app.
It's somewhat pointless to list what is missing in HTML 4 since so much has been fixed in HTML 5. And then, most of us can't list what is missing from HTML 5 because we are not familiar enough with it yet.