Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to be aware of?
I don't think an image should really be a last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet), there is no guarantee that tomorrow the site doesn't decide to change its layout.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or a "transclusion". There is a library out there for it, but it's not exactly actively developed. Here are a couple of Ajaxian articles on it.
I'd recommend a server-side solution in Python: use urllib2 to request the page, then BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection API with which you can craft heuristics for the section you are interested in.
To illustrate:
import urllib2
from BeautifulSoup import BeautifulSoup
html = urllib2.urlopen(url).read()  # url: the page you want to scrape
soup = BeautifulSoup(html)
text = soup.find(text="Some text on the page that is unlikely to change")
print text.parent.prettify()  # the parent of the matched text node, not soup.parent
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On the client side an <iframe> is the only practical option. It is possible to scroll it to the part you want, but that might not keep working in the long term, because it's technically close to a clickjacking attack.
There's also cross-site XHR, but it requires opt-in from the destination site, and today it works only in a few of the latest browsers.
Getting the HTML on the server side is easy (every decent web framework can download a page and parse the HTML, and you can use XPath/XSLT or the DOM to extract the bit you want).
Getting the styles, however, is going to be tricky – CSS rules may not work on an HTML fragment taken out of context. You'd have to parse the CSS and extract and transform the relevant rules, or use a browser and read the currentStyle of every node.
Obviously you have to heavily filter the HTML you extract to avoid XSS. It's harder than it seems.
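As a rough sketch of that server-side approach with XPath rather than a text search, here is what the fetch-and-extract step could look like in C# with the HTML Agility Pack (any server-side stack works equally well; the URL and XPath below are placeholders for the page and element you actually want):

using System;
using System.Net.Http;
using HtmlAgilityPack;

class FragmentExtractor
{
    static void Main()
    {
        // Placeholder URL: the page containing the fragment you want to embed.
        string html = new HttpClient().GetStringAsync("http://example.com/some-page").Result;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Placeholder XPath: the element you want to pull out of the page.
        HtmlNode fragment = doc.DocumentNode.SelectSingleNode("//div[@id='answer-123']");
        if (fragment != null)
        {
            // OuterHtml is only that node's markup; as noted above, its CSS is lost
            // and the fragment still needs filtering before you embed it anywhere.
            Console.WriteLine(fragment.OuterHtml);
        }
    }
}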
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract the content fragment together with its styles.
That sounds like something that IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement it for you to be able to take advantage of it.
Related
In my web application, a user may make a post with images, embedded videos and text with different styles. I want to generate a preview of the post to show on the front page of the web application. It shouldn't take too much space, yet it should be as clear as possible.
I know that I need to parse the post HTML first and perhaps extract the image elements first. For the text, my idea is simply to extract all of the plain text and show part of it.
Could someone provide other advice, methods or resources about this problem?
As I understand it, you want to render an HTML page into an image? You should use a layout engine, such as WebKit or Gecko, on the server side.
Another option is to use some third-party online tool for these previews. But rendering pages is a pretty heavy process, because of its time complexity and the memory needed to store the images. I have found these services:
http://www.thumboo.com/
http://www.thumbalizr.com/
http://www.zubrag.com/scripts/website-thumbnail-generator.php
I am affiliated with Expedia and I am using their API system. One of their requirements for launching the site is adding the terms and agreements to my page, and they give us this page: http://travel.ian.com/index.jsp?pageName=userAgreement&locale=en_US&cid=xxx. I do not want to send visitors to a different site, and I cannot copy and paste the information because it gets updated. I also prefer not to use an iframe. Does anyone have any ideas on how to do this? Here is a webpage using this on their site with their domain: http://www.helloweekends.com/terms.htm. Does anyone know how they did this? Any help would be greatly appreciated!
Since it originates from another domain, it wouldn't be possible to use JavaScript, due to the same origin policy. Also, relying on JavaScript for the update would be trouble for users who have JavaScript disabled, as they wouldn't see the terms. Since you don't want to use an iframe or copy the content, I guess your best shot would be to scrape their page with a server-side language of your choice and then display it on your page.
Scraping can be a bit tricky though, if you rely on their markup. If they change their markup, there is a chance that your script will break and thus stop updating the terms.
There are various tutorials available on how to scrape sites. Here are a few PHP examples:
Web scrape with PHP
PHP Screen Scraping Tutorial
Note: Make sure that they allow you to scrape the page before implementing this, so that you don't violate their rules.
Do you know if their API serves anything as JSON? A JSONP call could get the values to you, but it would make your page rely on JavaScript for users to see the updated content.
Another option is to use PHP or any other server-side language to get the contents of the URL, process it, and return the block you require.
I would suggest the load() function offered by jQuery. It makes a simple AJAX call to retrieve a file, and you can even use a selector to grab only part of the page (note that the same origin policy mentioned above applies here too, so a page from another domain would have to be proxied through your own server first). For example, load the contents of an HTML page into a div:
$('#div_id').load('my_file.html');
Or just load a part of the page:
$('#div_id').load('my_file.html #main_text_id');
I have a site that wraps some user-generated content, and I want to be able to separate the markup for the layout from the markup of the user-generated content, so the user-generated content can't break the site layout.
The user-generated content is trusted, as it is coming from a known group of users on my network, but nonetheless only a small subset of html tags are allowed (p, ul/ol/li, em, strong, and a couple more). However, the user-generated content is not guaranteed to be well-formed, and we have had some instances of malformed user-generated content breaking the layout of the site.
We are working with our users to keep the content well-formed, but in the meantime I am trying to find a good way to separate the content from the layout. I have been looking into namespaces, but have been unable to find good documentation about CSS support for embedded namespaces.
Anyone have any good ideas?
EDIT
I have seen some really good suggestions here, but I should probably clarify that I have absolutely no control over the input mechanism that the users use. They are entering content into one system, and my page uses that system's API to pull content out of it. That system is using TinyMCE, but like I said, we are still getting some malformed content.
Why not use markdown
If your users are HTML-literate or people who can grasp the concept of markdown syntax, I suggest you go with that instead. Stack Overflow works great with it. I can't imagine having a usual rich-text editor on Stack Overflow. Markdown editors are much simpler and faster to use and provide enough formatting capabilities for most situations. If you need some special additional features, you can always add those in, but for starters the out-of-the-box capabilities will suffice.
Real-time view for self validation
But don't forget to include a real-time preview of what users are writing. Self-validation works miracles, so they correct their own mistakes before posting the data.
Instead of parsing the result or forcing the user to use a structured format, just display the content within an iframe:
<iframe id="user_html"></iframe>
<script>
// encodeURIComponent (rather than the deprecated escape) keeps the UTF-8 data URI intact
document.getElementById("user_html").src = "data:text/html;charset=utf-8," + encodeURIComponent(content);
</script>
I built custom CMS systems exclusively for several years and always had great luck with a combination of a quality WYSIWYG, strong front-end validation, and relentless back-end validation.
I always gravitate toward CKEditor because it's the only front-end editor that can deal with Microsoft Word output on the front end... that's a must-have in my books. Sure, others have a paste-from-Word solution, but good luck getting users to use it. I've actually had a client overload a DB insert thanks to Microsoft Word markup that didn't get scrubbed in TinyMCE. HTML Tidy is a great solution to clean things up prior to validation on the back end.
CK has built-in templates and classes, so I used those to help my users format without going overboard. On the back-end I checked to ensure they hadn't tried any funny business with CSS, but it was never a concern with that group of users. Give them enough (safe) features and they'll never HAVE to go rogue.
Maybe overkill, but HTML Tidy could help if you can use it.
Use a WYSIWYG like TinyMCE or CKEditor that has built-in cleanup methods.
Robert Koritnik's suggestion to use markdown seems brilliant, especially considering that you only allow a few harmless formatting tags.
I don't think there's anything you can do with CSS to stop layouts from breaking due to open HTML tags, so I would probably forget that idea.
I have to migrate a static HTML website to TYPO3. I know I could read the docs first, but I believe I would need to read for days just to recognize which direction to run...
Do I have to learn TypoScript like
# Default PAGE object
page = PAGE
page.typeNum = 0
page.20 = TEXT
page.20.value = HELLO UNIVERSE!
page.10 = TEXT
page.10.value = HELLO WORLD!
or is there another way to do it quickly? With markers?
thank you guys!
You will have to learn a little bit of TypoScript to do what you want. Sorry :-( But you won't have to learn that much, and what you do learn you'll be able to reuse when building other TYPO3 sites.
First thing: skip markers. Markers are a remnant of an old, deprecated templating system. The way you should be doing this is with TemplaVoila.
TemplaVoila works by giving you an interface to map TYPO3 content (or instructions to generate content) to blocks of markup in your HTML file. In other words, you take your static HTML file, then go through it and tell TemplaVoila "OK, that DIV is my sidebar, so put a list of all the site pages in there... that P is the footer, put a link to the privacy policy there... that DIV is the main content area, fill it with blocks of content created by the user," and so forth. This is a very powerful approach, because it means that if you work with other Web designers or graphic designers, they don't have to learn any special "magic tags" or markers; they can just give you well-formed HTML and with a few clicks you can turn it into a live template for a site. Pretty nifty.
There's a piece of TYPO3 documentation called "Futuristic Template Building" that explains pretty clearly how to go from a static HTML page to a TYPO3-ized site with TemplaVoila. Here's a direct link to the section of that doc that walks you through the process. (Don't be scared by the word "futuristic" into thinking that TemplaVoila isn't fully baked yet -- that doc was written six years ago, when TemplaVoila was pretty futuristic, but today it's quite mature and in use on TYPO3 sites all over the world.)
This should be enough to get you started, but if you hit roadblocks or can't wrap your head around it feel free to post your questions back to this thread and I'll help you out.
I'm reviving this, since a lot has happened since 2010.
There are multiple ways in TYPO3 to do the templating. All of them involve TypoScript, but in some there is only a minimal amount of TS needed.
Use "the old built-in way", doing all rendering in TypoScript and some HTML templates with markers in them. In this approach, you'd use the content elements provided by the core. Their rendering is defined with TypoScript in the core-extension "CSS Styled Content".
Use "the new built-in way". Here you'd also use the content elements provided by the core, and optionally self-defined ones. The rendering happens using the Fluid templating engine. You would do this using the core-extension "Fluid Styled Content". This is available since version 7.5.
Use a third party extension for content element rendering. I know of these:
TemplaVoilà - You probably should not use it, since it is not actively developed anymore, although there is a version claiming compatibility with TYPO3 7 LTS; I don't know much about that, though.
FluidTYPO3 - This is a whole ecosystem of extensions with which you can define page templates and content element templates completely using the Fluid Templating engine (backend forms, backend preview and frontend rendering). It also provides a mechanism for nesting content elements.
DCE - Dynamic content elements. I don't know anything about them, you would need to read the docs.
Mask - A wizard close to the TYPO3 core for your own content elements and page templates. Uses database fields, not FlexForms.
More extensions I don't know of.
It would be a bit much to explain all of these here in detail.
My current personal favorite is the FluidTYPO3 ecosystem, but I'm considering a switch to using Fluid Styled Content, because it is directly integrated into the core. I'm not sure if it supports nested content elements, so maybe one would need a separate solution for that (e.g. the extension gridelements).
How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!
I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.
One way I could go would be through regular expressions. I used Expression Web to analyse the problems and the regular expressions needed to correct them. So one way would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.
However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages and I'm afraid that future pages will contain patterns I've not yet solved, and I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, this would be great. The problem is that except for a few mentions here at SO I could not find any how-to-use documentation for this tool, except for the object-by-object API help file.
So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)
Can you tell me what kind of annoying problems you are having?
You don't need to use regex to clean the HTML, though; HAP will let you access the elements of a malformed HTML document using XPath queries.
Basically, you need to learn XPath to know how to get the HTML elements you want.
It really depends on the kind of HTML you are parsing with HAP,
but there are several ways to get the elements:
by id or class, or you can even get the element that follows another element containing a given text such as "name:", for example.
You can go to the W3Schools XPath tutorial for a nice introduction.
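To make that concrete, here is a minimal HTML Agility Pack sketch along those lines (the URL, the id and the XPath expressions are placeholders; adapt them to the pages you are scraping):

using System;
using HtmlAgilityPack;

class HapXPathSample
{
    static void Main()
    {
        // HtmlWeb downloads and parses the page, tolerating malformed markup.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/page.html");  // placeholder URL

        // Select by id (placeholder id).
        HtmlNode byId = doc.DocumentNode.SelectSingleNode("//*[@id='content']");

        // Select the element that follows an element containing a given text, e.g. "name:".
        HtmlNode afterLabel = doc.DocumentNode.SelectSingleNode(
            "//td[contains(., 'name:')]/following-sibling::td[1]");

        if (byId != null) Console.WriteLine(byId.InnerText.Trim());
        if (afterLabel != null) Console.WriteLine(afterLabel.InnerText.Trim());
    }
}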
What I took from the answers here:
1) If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes.
2) If you are limited to this known site, why not write your scraper to adjust for the problems?
So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:
I use Webius's SWExplorerAutomation to detect scenes in Web pages. The idea is that a Scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. the page title is "Account Login" and the page contains a "Login" text box and a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer - some changes in the web page can translate into changes in the scene file, saving the code from having to change. Additionally, this shields me from IE's event-driven model: I call "scene. I'm evaluating this product but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is WatiN, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against WatiN.
Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it makes you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
Finally, after all the errors I know of have been corrected, I use HTML Agility Pack to convert to XHTML - cross the t's and dot the i's, so to speak: all lower case, quotes around attributes, and so on.
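That last conversion step with the Agility Pack can be sketched roughly like this (the file names are placeholders; OptionFixNestedTags and OptionOutputAsXml are the relevant HAP options, though the exact output details may differ a little from what is described above):

using HtmlAgilityPack;

class XhtmlCleanup
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;   // let HAP repair badly nested tags
        doc.OptionOutputAsXml = true;     // write well-formed, XML-style output (quoted attributes, closed tags)
        doc.Load("input.html");           // placeholder input file
        doc.Save("output.xhtml");         // placeholder output file
    }
}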
Hope this helps!
Avi
Regex can't be used for HTML cleaning.
Does http://tidy.sourceforge.net/ help?
If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter if you're using the regex <td color="red">\d+</td> to get the big red number from a page or if you're using a DOM parser to get the 3rd cell in the 2nd row of the table with id "numbers" to get the same. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.
If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.
Since you're using Html Agility Pack and know of the problems that occur, if you are limited to this known site, why not write your scraper to adjust for the problems once you've loaded the HtmlDocument?
i.e.:
If you know the element always appears after the , insert the element into the first child position of the tag.....
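Since the element and tag names were lost from the answer above, here is only a generic sketch of that kind of fix-up after loading the HtmlDocument (all names and markup below are hypothetical placeholders):

using System;
using HtmlAgilityPack;

class FixupSample
{
    static void Main()
    {
        string rawHtml = "<div id=\"content\"><p>scraped markup goes here</p></div>";  // placeholder input

        var doc = new HtmlDocument();
        doc.LoadHtml(rawHtml);

        // Hypothetical rule: if a known parent is missing an expected child,
        // create the element and insert it as the parent's first child.
        HtmlNode parent = doc.DocumentNode.SelectSingleNode("//div[@id='content']");
        if (parent != null && parent.SelectSingleNode("./h1") == null)
        {
            HtmlNode heading = HtmlNode.CreateNode("<h1>Untitled</h1>");
            parent.PrependChild(heading);
        }

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}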