I'm writing a special crawler-like application that needs to retrieve the main content of various pages. To clarify: I need the real "meat" of the page (provided there is one, naturally).
I have tried various approaches:
Many pages have RSS feeds, so I can read the feed and get the page-specific content.
Many pages use "content" meta tags
In a lot of cases, the object presented in the middle of the screen is the main "content" of the page
However, these methods don't always work, and I've noticed that Facebook does a mighty fine job of exactly this (when you attach a link, they show you the content they've found on the linked page).
So, do you have any tips on an approach I've overlooked?
Thanks!
There really is no standard way for web pages to mark "this is the meat". Most pages don't even want this because it makes stealing their core business easier. So you really have to write a framework which can use per-page rules to locate the content you want.
Well, your question is a little bit vague still. In most cases, a "crawler" is going to just find data on the web in a text-format, and process it for storage, parsing, etc. The "Facebook Screenshot" thing is a different beast entirely.
If you're just looking for a web based crawler, there are several libraries that can be used to traverse the DOM of a web page very easily, and can grab content that you're looking for.
If you're using Python, try Beautiful Soup
If you're using Ruby, try hpricot
If you want the entire contents of a webpage for processing at a later date, simply get and store everything underneath the "html" tag.
Here's an Hpricot example to get all the links off a page:
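require 'hpricot'
require 'open-uri'

# Fetch the page and parse it into an Hpricot document
doc = Hpricot(open("http://www.stackoverflow.com"))
# Print the href attribute of every anchor element
(doc/"a").each do |link|
  puts link.attributes['href']
end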
Edit: If you're going to primarily be grabbing content from the same sites (e.g. the comments section of Reddit, questions from StackOverflow, Digg links, etc.) you can hardcode their format so your crawler can say, "OK, I'm on Reddit, get everything with the class of 'thing'." You can also give it a list of default things to look for, such as divs with a class/id of "main", "content", "center", etc.
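To make that concrete, here is a rough sketch of the "site rule first, defaults second" idea using only the Python standard library. The `SITE_RULES` table, the `ContentGrabber` class and the `extract_content` function are names invented for this illustration, and the Reddit rule simply repeats the example above; treat it as a starting point, not a robust extractor.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> class/id name that wraps the content.
SITE_RULES = {"www.reddit.com": "thing"}
# Fallback class/id names tried when no site rule matches.
DEFAULT_NAMES = ["main", "content", "center"]
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContentGrabber(HTMLParser):
    """Collects the text inside the first element whose id or class matches."""

    def __init__(self, names):
        super().__init__()
        self.names = set(names)
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done:
            attrs = dict(attrs)
            labels = set((attrs.get("class") or "").split())
            labels.add(attrs.get("id"))
            if labels & self.names:
                self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_content(url, html):
    """Return the text of the first element matching the site rule or a default."""
    host = urlparse(url).netloc
    names = [SITE_RULES[host]] if host in SITE_RULES else DEFAULT_NAMES
    grabber = ContentGrabber(names)
    grabber.feed(html)
    return " ".join(grabber.chunks)
```

This assumes reasonably well-formed markup; pages with unbalanced tags will confuse the depth counter, and script or style text inside the matched element is not filtered out.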
Related
I'm working on a site to help students with ACT prep, and I want to have a page where I can post explanations to questions that people submit. I want to be able to put a few tags on each post so that site visitors can click on or search whatever's relevant for them in the archives ("semicolons", "geometry", etc.) and all the relevant posts will come up, blog style. I'm very new to this, though, and I don't know how to do it or even what to search - when I search for tags I keep getting SEO recommendations, and that doesn't seem like the right thing.
Here's a solution (but it's not great)
It might be the only way to make what you want happen with a static HTML site.
You could, by hand, create pages that you fill with links to all of the posts that fit a certain category or "tag". For example, you could make a page that has links to all of your posts concerning geometry. Let's call this your archive page for geometry.
Then, when you include tags in a post, you would make each tag link to its corresponding archive page.
Why do I say it's not the best solution?
Virtually every blog that you see has a "back end" with a database that stores posts. When someone comes to your website and looks at a post, that post's data is inserted into a template and displayed to the user. You do not have to rewrite the entire web page every time. Things like the header, sidebar, footer, main page background, etc. are all in a template.
Having a database also lets you search the database and return relevant results. And a blog with a back end will typically let you write rules (or have them already written) that say, when you add a "tag" to a post, a link to that post should be automatically added to an archive page etc.
As far as I can tell you don't have a database, so you'll just be linking static HTML pages. That means that every time you make a new post, you'll have to add a link to all of its relevant archive pages by hand. Maybe you don't mind that now, but eventually it will be a nightmare to maintain.
I would strongly encourage you to look into a blogging platform like WordPress to make your site. It will be more complicated to learn at first, but technology that's meant to do what you want it to do will ultimately be easier to use and maintain than technology that's simply meant to mark up a page.
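Whichever route you take, the core of the feature is an index from each tag to the posts that carry it. Here is a minimal sketch of that idea in Python, assuming you keep a list of posts with their tags somewhere; the function names and the post data in the usage note are made up for illustration. A blogging platform does the equivalent for you from its database.

```python
from collections import defaultdict
from html import escape


def build_tag_archives(posts):
    """Map each tag to the (title, url) pairs of the posts that carry it."""
    archives = defaultdict(list)
    for post in posts:
        for tag in post["tags"]:
            archives[tag].append((post["title"], post["url"]))
    return dict(archives)


def render_archive_page(tag, entries):
    """Return a bare-bones HTML archive listing every post for one tag."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(escape(url, quote=True), escape(title))
        for title, url in entries
    )
    return "<h1>Posts tagged {}</h1>\n<ul>\n{}\n</ul>".format(escape(tag), items)
```

For example, with `posts = [{"title": "Triangles", "url": "triangles.html", "tags": ["geometry"]}]`, calling `build_tag_archives(posts)` gives one "geometry" entry, and `render_archive_page` turns it into the archive page you would otherwise edit by hand.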
Are iframes still widely in use today?
I am coding a site with divs, and I want everything to appear in the container div. Is it possible to do it without coding the header + nav into each page and have the content show at the exact same spot without using iframes?
I did a quick Google search and found a post that said it's not possible, but my site will have quite a few links.
As of right now, I am coding it with Tumblr, and the hashtags in the posts would act as links to a section of posts (Ex: #blog would retrieve every post under the "blog" link). What are some widely used ways to target links on a website?
If you are creating a multi-page website, it would be helpful to have the HTML content be generated dynamically or be built statically from template files. You don't want to manually update the same content across multiple HTML files.
Dynamic Pages
There are several options for dynamically generating HTML content depending on the software available to you. For example, PHP is a popular language for web development and is available through many web hosts.
Static Pages
It is possible to build static HTML documents from templates using something like Jekyll.
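To illustrate the idea behind building pages from a template, here is a small Python sketch using the standard library's `string.Template`. This is not how Jekyll itself works (Jekyll uses its own layout and templating system); the template text, the `build_page` and `build_site` names, and the page data are all assumptions made for the example.

```python
from string import Template

# One shared layout: the header and nav are written once, not once per page.
PAGE_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head><title>$title</title></head>
<body>
<div id="header">My Site</div>
<div id="nav"><a href="index.html">Home</a> <a href="blog.html">Blog</a></div>
<div id="container">
$content
</div>
</body>
</html>
""")


def build_page(title, content):
    """Wrap one page's content in the shared layout."""
    return PAGE_TEMPLATE.substitute(title=title, content=content)


def build_site(pages):
    """pages maps filename -> (title, content HTML); returns filename -> full HTML."""
    return {name: build_page(title, body) for name, (title, body) in pages.items()}
```

Each page's content lands in the same container div, and changing the header or nav means editing the template once and rebuilding.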
I'm not sure if I'm interpreting what you mean by "coding it with Tumblr" correctly or not, but I think you mean you're making a Tumblr site with their built-in HTML editing capability.
I think you'll have a very difficult time achieving the behavior you desire there. I think you're trying to create something resembling a single-page application. Tumblr probably just allows basic static HTML with little JavaScript. The suggestion Kyle made about using PHP or something like that won't work because that code must be executed on a server, and Tumblr doesn't provide that capability to my knowledge.
If you really want this kind of functionality, you probably should get some paid web hosting and develop your web development skills. It's not a simple task, but it's fun!
Sorry if I underestimated you or anything. Just trying to read between the lines. It seems to me that you may be relatively new to web development given the content of your post, and I'm trying to nudge you in the right direction constructively.
This is a rephrasing of my original question https://stackoverflow.com/questions/14516983/google-sites-trying-to-script-announcements-page-on-steroids:
I've been looking into ways to make subpages of a parent page appear in a grid like "articles" on the home page of my Google Site — like on a Joomla home page and almost like a standard "Announcements" template, except:
The articles should appear in a configurable order, not chronologically (or alphabetically).
The first two articles should be displayed full-width and the ones beneath in two columns.
All articles will contain one or more images, and at least the first one should be displayed.
The timestamp and author of each subpage/article shouldn't be displayed.
At the moment I don't care if everything except the ordering is hardcoded, but ideally there should be a place to input prefs like the number of articles displayed, image size, snippet length, css styling etc.
My progress so far:
I tried using an iframe with externally hosted JavaScript (using google.feeds.Feed) that pulls the RSS feed from the "Announcements" template, but I can't configure the order of the articles. One possibility would be to put a number at the beginning of every subpage title and parse it, but that is going to get messy over time, and the number would also be visible on the standalone article page. Or could the number be hidden with JavaScript?
I tried making a spreadsheet with a row for each article and the columns "OrderId", "Title", "Content", "Image", then processing and formatting the data with a Google Apps Script (using createHTML and createImage), but a) there doesn't seem to be a way to get a spreadsheet image to show up inside the web app and b) these articles are not "real" pages that can be linked to easily in the menus.
This feature would be super-useful for lots of sites, and to me it just seems odd that it isn't a standard gadget (edit: or template). Ideas, anyone?
I don't know if this is helpful, but I wanted something similar and used the RSS XML announcements feed within a Google Gadget embedded into my Sites page.
Example gadget / site:
http://hosting.gmodules.com/ig/gadgets/file/105840169337292240573/CBC_news_v3_1.xml
http://www.cambridgebridgeclub.org
It is badly written and messy, and I'm sure someone could do better than me, but it seems to work fairly reliably. The XML seems to have all the necessary data to be able to chop up articles, and I seem to remember it has image URLs as well, so you can play with them (although that is not implemented in my gadget).
Apologies if I am missing the point. I agree with your feature request - it would be great not to have to get so low-level to implement stuff like this in sites....
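For what it's worth, the "number at the start of the title" idea from the question can be combined with the feed approach. Below is a rough Python sketch, assuming a plain RSS 2.0 feed with `item`, `title`, `link` and `description` elements (the real announcements feed may be structured differently, for example as Atom). The `ordered_articles` function and the prefix pattern are my own invention, not part of any Google API.

```python
import re
import xml.etree.ElementTree as ET

# Matches a leading order number such as "03 - " or "3. " in a title.
PREFIX = re.compile(r"^\s*(\d+)[\s.\-:]+")
# Pulls the first image URL out of the HTML stored in <description>.
FIRST_IMG = re.compile(r'<img[^>]+src="([^"]+)"')


def ordered_articles(rss_xml):
    """Return the feed items sorted by title prefix, with the prefix stripped."""
    root = ET.fromstring(rss_xml)
    articles = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        match = PREFIX.match(title)
        # Items without a number sort last, keeping their feed order.
        order = int(match.group(1)) if match else float("inf")
        clean = title[match.end():] if match else title
        img = FIRST_IMG.search(item.findtext("description", default=""))
        articles.append({
            "order": order,
            "title": clean.strip(),
            "link": item.findtext("link", default=""),
            "image": img.group(1) if img else None,
        })
    return sorted(articles, key=lambda a: a["order"])
```

The number still shows on the standalone article page, and a title that genuinely starts with a number (say a year) would be misread as an order prefix, so this only hides the number in the generated grid.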
I have to migrate a static HTML website to TYPO3. I know I could read the docs first, but I believe I would need to read for some days just to recognize which direction to run in...
Do I have to learn TypoScript like
# Default PAGE
page = PAGE
page.typeNum = 0
page.20 = TEXT
page.20.value = HELLO UNIVERSE!
page.10 = TEXT
page.10.value = HELLO WORLD!
or is there another way to do it quickly? With markers?
thank you guys!
You will have to learn a little bit of TypoScript to do what you want. Sorry :-( But you won't have to learn that much, and what you do learn you'll be able to reuse when building other TYPO3 sites.
First thing: skip markers. Markers are a remnant of an old, deprecated templating system. The way you should be doing this is with TemplaVoila.
TemplaVoila works by giving you an interface to map TYPO3 content (or instructions to generate content) to blocks of markup in your HTML file. In other words, you take your static HTML file, then go through it and tell TemplaVoila "OK, that DIV is my sidebar, so put a list of all the site pages in there... that P is the footer, put a link to the privacy policy there... that DIV is the main content area, fill it with blocks of content created by the user," and so forth. This is a very powerful approach, because it means that if you work with other Web designers or graphic designers, they don't have to learn any special "magic tags" or markers; they can just give you well-formed HTML and with a few clicks you can turn it into a live template for a site. Pretty nifty.
There's a piece of TYPO3 documentation called "Futuristic Template Building" that explains pretty clearly how to go from a static HTML page to a TYPO3-ized site with TemplaVoila. Here's a direct link to the section of that doc that walks you through the process. (Don't be scared by the word "futuristic" into thinking that TemplaVoila isn't fully baked yet -- that doc was written six years ago, when TemplaVoila was pretty futuristic, but today it's quite mature and in use on TYPO3 sites all over the world.)
This should be enough to get you started, but if you hit roadblocks or can't wrap your head around it feel free to post your questions back to this thread and I'll help you out.
I'm reviving this, since a lot has happened since 2010.
There are multiple ways in TYPO3 to do the templating. All of them involve TypoScript, but in some there is only a minimal amount of TS needed.
Use "the old built-in way", doing all rendering in TypoScript and some HTML templates with markers in them. In this approach, you'd use the content elements provided by the core. Their rendering is defined with TypoScript in the core-extension "CSS Styled Content".
Use "the new built-in way". Here you'd also use the content elements provided by the core, and optionally self-defined ones. The rendering happens using the Fluid templating engine. You would do this using the core-extension "Fluid Styled Content". This is available since version 7.5.
Use a third party extension for content element rendering. I know of these:
Templavoilà - You probably should not use it, since it is not actively developed anymore. There is a version claiming compatibility with TYPO3 7 LTS, but I don't know much about that.
FluidTYPO3 - This is a whole ecosystem of extensions with which you can define page templates and content element templates completely using the Fluid Templating engine (backend forms, backend preview and frontend rendering). It also provides a mechanism for nesting content elements.
DCE - Dynamic content elements. I don't know anything about them, you would need to read the docs.
Mask - A wizard that stays close to the TYPO3 core for creating your own content elements and page templates. Uses database fields, not flexforms.
More extensions I don't know of.
It would be a bit much to explain all of these here in detail.
My current personal favorite is the FluidTYPO3 ecosystem, but I'm considering a switch to using Fluid Styled Content, because it is directly integrated into the core. I'm not sure if it supports nested content elements, so maybe one would need a separate solution for that (e.g. the extension gridelements).
Is there a way to embed only a section of a website in another HTML page?
Example: I see an answer I want to blog about, so I grab the HTML content, and splat it in somewhere, and show only that, styled like it is on stackoverflow. Basically, I want to blockquote the section of the page with original styling, if that makes sense. Is that something the site itself has to provide, or can I use an iframe and tell it to show only a certain element or something crazy? Open to all options, but I want it to show up as HTML, not as an image (that's really a last resort).
If this is even possible, are there security concerns I need to be aware of?
I don't think an image should really be the last resort. You have no control over the HTML/CSS of the source page, so even if you craft a solution (probably by using JavaScript to parse out the desired snippet) there is no guarantee that the site won't decide to change its layout tomorrow.
Even Jeff, who has control over the layout of stackoverflow.com, still prefers to screen-capture the site, rather than pull in the contents live.
Now if your goal was to have the contents auto-update, that would be a different story. But still, unless you use some agreed-upon method of sharing content, such as RSS, your solution would be very fragile.
The concept you are describing is roughly what is called a "purple include" or "transclusion". There is a library out there for it, but it's not exactly actively developed. Here are a couple of Ajaxian articles on it.
I'd recommend using a server-side solution with Python: use urllib2 to request the page, then BeautifulSoup to parse out the bit that you need. BeautifulSoup has a very flexible selection API with which you can craft heuristics for the section you are interested in.
To illustrate:
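# html holds the page source fetched earlier (e.g. with urllib2)
soup = BeautifulSoup(html)
# Find the text node that matches a phrase you expect to stay stable
text = soup.find(text="Some text on the page that is unlikely to change")
# Print the element that contains that text node
print(text.parent.prettify())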
That way if the webmaster later changes the markup on the page, your scraping script should still work.
On the client side, an <iframe> is the only practical option. It is possible to scroll it to the part you want, but that might not work in the long term, because it is technically close to a clickjacking attack.
There's also cross-site XHR, but it requires opt-in from the destination site and today works only in a few of the latest browsers.
Getting the HTML on the server side is easy (every decent web framework can download a page and parse its HTML, and you can use XPath/XSLT or the DOM to extract the bit you want).
Getting the styles, however, is going to be tricky: CSS rules may not work with an HTML fragment taken out of context. You'd have to parse the CSS and extract and transform the rules, or use a browser and read the currentStyle of every node.
Obviously you have to heavily filter the HTML you extract to avoid XSS. It's harder than it seems.
If you don't need to automate this, a good HTML+CSS WYSIWYG editor might be able to extract a content fragment with its styles.
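To give a feel for the filtering step, here is a deliberately small allow-list sketch in Python using only the standard library. The allowed tags, the `Sanitizer` class and the `sanitize` function are assumptions made for this example; it drops all styling and is not a substitute for a maintained sanitizing library, for the reason given above.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allow-list: anything not named here is dropped.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "code", "pre",
                "blockquote", "ul", "ol", "li", "a"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://")
DROP_CONTENT = {"script", "style"}  # removed together with their text


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip += 1
            return
        if self.skip or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            # Keep only allow-listed attributes whose URL uses a safe scheme.
            if name in ALLOWED_ATTRS.get(tag, ()) and value:
                if value.strip().lower().startswith(SAFE_SCHEMES):
                    kept.append(' {}="{}"'.format(name, escape(value, quote=True)))
        self.out.append("<{}{}>".format(tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip = max(0, self.skip - 1)
        elif not self.skip and tag in ALLOWED_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if not self.skip:
            self.out.append(escape(data))  # text is always re-escaped


def sanitize(fragment):
    """Return the fragment with only allow-listed tags and attributes kept."""
    parser = Sanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

The design choice is to rebuild the output from parsed events instead of deleting "bad" patterns from the input string, so anything the parser does not recognize as an allowed tag ends up as escaped text or is discarded.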
That sounds like something that IE8's Web Slices would be perfect for. However, it's only available in IE8, and the site of origin would have to implement it for you to be able to take advantage of it.