Markdown or HTML

I have a requirement for users to create, modify and delete their own articles. I plan on using the WMD editor that SO uses to create the articles.
From what I can gather SO stores the markdown and the HTML. Why does it do this - what is the benefit?
I can't decide whether to store the Markdown, the HTML, or both. If I store both, which one do I retrieve and convert to display to the user?
UPDATE:
OK, I think from the answers so far I should be storing both the Markdown and the HTML. That seems cool. I have also been reading a blog post from Jeff regarding XSS exploits. Because the WMD editor allows you to input any HTML, this could cause me some headaches.
The blog post in question is here. I am guessing that I will have to follow the same approach as SO - and sanitize the input on the server side.
Is the sanitize code that SO uses available as Open Source or will I have to start this from scratch?
Any help would be much appreciated.
Thanks

Storing both is extremely useful in terms of performance and compatibility (and also social control).
If you store only Markdown (or whatever non-HTML markup), then there's a performance cost in parsing it into HTML on every request. This is not always negligibly cheap.
If you store only HTML, you risk bugs silently creeping into the generated HTML, which leads to a lot of maintenance and bugfixing headaches. You also lose social control, because you no longer know what the user actually entered; as an admin, for example, you'd like to know which users are trying XSS with <script> and so on. Also, the end user won't be able to edit the data in Markdown format; you'd need to convert it back from HTML.
To keep the HTML up to date whenever the Markdown parser changes, just add one extra field recording the parser version that was used to generate the HTML output. Whenever that version has changed on the server side, re-parse the Markdown with the new version at the moment you retrieve the row and update the row in the DB. This is only a one-time extra cost per row.
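A minimal sketch of that idea in Python, assuming a hypothetical articles table with markdown, html and md_version columns, SQLite as the store, and the PyPI markdown package as the renderer:

# Re-render cached HTML once when the Markdown parser version has changed.
import sqlite3
import markdown  # PyPI "markdown" package, used here as the example renderer

CURRENT_MD_VERSION = markdown.__version__

def fetch_article_html(conn: sqlite3.Connection, article_id: int) -> str:
    md_source, html, stored_version = conn.execute(
        "SELECT markdown, html, md_version FROM articles WHERE id = ?",
        (article_id,),
    ).fetchone()

    # If the HTML was generated by an older parser, regenerate it once and
    # write it back; later reads get the cached HTML for free.
    if stored_version != CURRENT_MD_VERSION:
        html = markdown.markdown(md_source)
        conn.execute(
            "UPDATE articles SET html = ?, md_version = ? WHERE id = ?",
            (html, CURRENT_MD_VERSION, article_id),
        )
        conn.commit()
    return html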

By storing both you only have to process the markdown once (when it is posted). You would then retrieve the HTML so that you can load your pages faster.
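For the save path, a rough Python sketch under the same assumptions (a made-up articles table, the markdown package for rendering, bleach for whitelist sanitizing): render once when the article is posted, sanitize the result, and store both columns.

# Render the Markdown once at posting time, sanitize the HTML, store both.
import bleach
import markdown

ALLOWED_TAGS = {"p", "a", "strong", "em", "ul", "ol", "li", "blockquote",
                "code", "pre", "h1", "h2", "h3"}
ALLOWED_ATTRS = {"a": ["href", "title"]}

def save_article(conn, article_id, md_source):
    raw_html = markdown.markdown(md_source)
    # Whitelist-based sanitizing on the server side, since the editor lets
    # users type arbitrary HTML alongside the Markdown.
    safe_html = bleach.clean(raw_html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRS)
    conn.execute(
        "UPDATE articles SET markdown = ?, html = ? WHERE id = ?",
        (md_source, safe_html, article_id),
    )
    conn.commit()

On read you then serve the stored html column directly, and only hand the markdown column back to the editor.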

If you only stored one, you'd forever have to recreate the other for either the display view or the edit view.


GEDCOM to HTML and RDF

I was wondering if anyone knew of an application that would take a GEDCOM genealogy file and convert it to HTML format for viewing and publishing on the web. I'd like to have separate HTML files for each individual and perhaps additional files for other content as well. I know there are some tools out there, but I was wondering if anyone has used any of them and could advise. I'm not sure what format to look for in such applications. They could be Python or PHP files that one can edit, or even JavaScript (maybe), or just executable files.
The next issue might be appropriate as a topic in itself: export of GEDCOM to RDF. My interest here would be to align the information with specific vocabularies, such as BIO or REL, both of which extend FOAF.
Thanks,
Bruce
Like Rob Kam said, Ged2Html was the most popular such program for a long time.
GRAMPS can also create static HTML sites and has the advantage of being free software and having a native XML format which you could easily modify to fit your needs.
Several years ago, I created a simple Java program to turn GEDCOM into XML. I then used XSLT to generate HTML and RDF. The HTML I generate is pretty rudimentary, so it would probably be better to look elsewhere for that, but the RDF might be useful to you:
http://jay.askren.net/Projects/SemWeb/
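For anyone curious what that first step looks like, here is a rough Python illustration (not the Java tool linked above) of walking GEDCOM's flat level/tag/value lines and emitting simple XML for individuals; only INDI records and their NAME lines are handled:

# Turn GEDCOM INDI records into a trivial XML document.
import xml.etree.ElementTree as ET

def gedcom_individuals_to_xml(lines):
    root = ET.Element("individuals")
    person = None
    for line in lines:
        parts = line.rstrip("\n").split(" ", 2)
        level = int(parts[0])
        if level == 0 and len(parts) == 3 and parts[2] == "INDI":
            person = ET.SubElement(root, "person", id=parts[1].strip("@"))
        elif person is not None and level == 1 and len(parts) == 3 and parts[1] == "NAME":
            ET.SubElement(person, "name").text = parts[2].replace("/", "").strip()
        elif level == 0:
            person = None  # some other record type (FAM, SOUR, ...)
    return ET.tostring(root, encoding="unicode")

print(gedcom_individuals_to_xml([
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "0 @I2@ INDI",
    "1 NAME Mary /Jones/",
]))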
There are a number of these. All listed at http://www.cyndislist.com/gedcom/gedcom-to-web-page-conversion/
Ged2html used to be the most popular and most versatile, but is now no longer being developed. It's an executable, with output customisable through its own scripting syntax.
Family Historian http://www.family-historian.co.uk will create exactly what you are looking for, e.g. one file per person, using the built-in Web Site creator. As will a couple of the other major genealogy packages. I have not seen anything for the RDF part of your question.
I have since tried to produce a genealogy application using Semantic MediaWiki (MediaWiki, the software behind Wikipedia, plus various extensions related to the Semantic Web). I thought it was very easy to use with the forms and the ability to upload a GEDCOM, but some feedback from people into genealogy was that it appeared too technical and didn't seem to offer anything new.
So now the issue is whether to stay with MediaWiki and make it more user friendly, or to create an entirely new application that allows adding and updating data in a triple store as well as displaying it. I'm not sure how to generate a graphical family-tree view of the data, like on sites such as ancestry.com, where one can click on a box to see and update details about the person, or click a left or right arrow beside a box to navigate the tree. The data would come from SPARQL queries sent to the data set/triple store, both when displaying the initial view and when navigating the tree, where an Ajax call is needed to get more data.
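For the SPARQL round trip, a hypothetical Python sketch using SPARQLWrapper; the endpoint URL, the person URI and the rel:/foaf: property choices are assumptions for illustration, not an existing system:

# Ask the triple store for a person's children so the tree view can expand a node.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:3030/genealogy/sparql")
sparql.setQuery("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rel:  <http://purl.org/vocab/relationship/>
    SELECT ?child ?name WHERE {
        ?child rel:childOf <http://example.org/person/I1> ;
               foaf:name ?name .
    }
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["child"]["value"], row["name"]["value"])

An Ajax endpoint on the server would run a query like this and hand the bindings back as JSON for the tree widget to render.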
Bruce

html tags in mysql value fields, is that right?

I was looking at the status.net source code and MySQL tables, and they seem to have HTML tags in their MySQL field values. I was just wondering: is that the right thing to do, or is it going to cause problems in the future?
It depends on where it will be used. It isn't an issue if the intention is to have arbitrary HTML there, especially if the developers and admins are the only ones who can put it in there.
On the other hand, if, for example, a user of your system managed to put it there and used the opportunity to put in a script tag and a reference to their own scripts, you might very well be in big trouble (if you don't escape the strings before you render them on your site).
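A tiny Python illustration of that point, using the standard library's html.escape: stored HTML is harmless as long as it is escaped (or run through a whitelist sanitizer) before it is written into a page.

import html

untrusted = '<script src="http://evil.example/x.js"></script>'

# Rendered as inert text rather than executed markup:
print(html.escape(untrusted))
# &lt;script src=&quot;http://evil.example/x.js&quot;&gt;&lt;/script&gt;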
I would like to take the opportunity to quote the favorite sentence of my old IT teacher:
Oh, it depends.
Without knowing where and why the tags are stored in the DB, it's hard to say whether this is a good idea...
A database can be used for storage just like the filesystem, so in most cases it's not a problem to store HTML.
Let's take the articles of a WordPress blog as an example: it's definitely OK to store them in the database.
Short answer: Depends
Long answer: This practice is quite common and often unavoidable.
Think about blog posts: the HTML markup inside them cannot be separated from the content itself.
Possible issues:
JavaScript injection. If I can inject malicious HTML into your database, I could create links to malware, or JavaScript that helps install viruses or trojans.
There's always a trade-off.

Using Semantic MediaWiki for tabular data

Am I completely off-track to think about using Semantic MediaWiki to store (and organise, report on, etc.) 'tabular' data such as financial transactions or weather readings that would usually live in a spreadsheet or database?
It seems that one would need a separate, tiny, page for each tuple; but then, that's by design and perhaps it's perfectly okay.
I ask, simply because SMW seems like such a quick and easy way to get a collaborative data repository up and running.
Semantic MediaWiki is better suited to keeping track of factual or encyclopedic data, where you can have pages about everything you need to know about a certain topic.
For tabular or numerical data such as measurements, financial figures, or sensor data, you would indeed need to create little pages about each data point, which is not practical in many cases.
However, there are extensions to MediaWiki that allow you to integrate external data sources (in MySQL databases or CSV files somewhere) with MediaWiki pages. This can give you the best of both worlds: dynamic access and queries of tabular data, plus semantic annotations of the pages around them.
Take a look at:
http://www.mediawiki.org/wiki/Extension:External_Data
No, I don't think it's such a bad idea.
Using SemanticForms you could enter lots of little data pages quickly and easily (for example, an invoice might require additional pages for each line item, but they could all be entered from one form using the 'multiple' feature of the 'for template' form tag). So although I've never tried logging weather data in SMW, I think it would be pretty easy. I don't see what the problem would be with storing data across so many pages; it's easy enough to combine it in whatever format you require.
Give it a go and let us know how it goes!
You can use either the Semantic Internal Objects extension (SIO) or SMW's built-in subobjects (the former works well with the already mentioned External Data extension) to store multiple semantic objects, which could be the rows of your spreadsheet, in one page.
However, unless you are really looking for a collaborative tool with semantic capabilities, I doubt SMW is the best suited piece of software for your task.
Edit (November 2015): since SMW version 1.9, there's nothing that SIO can do that the built-in subobjects can't, so I would recommend the latter.

Screen scraping gotchas

When screen-scraping, what are the "gotcha"s to look out for?
The inspiration for this is: my spouse's co-worker asked me to scrape all the pages from a Blogger-hosted blog that her friend with cancer kept in her final months and this lady wanted to keep all of the posts in case the blog were ever deleted. I eventually found a free tool that was barely good enough.
One issue with scraping many Blogger pages is that there's often a navigation menu where you can click on the triangles to expand the post lists by year or month. These little buggers created insane amounts of duplicate content because you'd have the same page over and over again with different combinations of the menus being expanded/collapsed. In Blogger's case I'm not sure this is avoidable since the links are all formatted as real http links and not obvious JavaScript calls. Still, it got me thinking:
If you were to scrape a website, what kinds of potentially non-obvious things would you compensate for?
Do not use regex to scrape
While regular expressions can be good for a large variety of tasks, I find they usually fall short when parsing HTML. The problem with HTML is that the structure of your document is so variable that it is hard to accurately extract a tag (and by accurately I mean with a 100% success rate and no false positives).
What I recommend you do is use a DOM parser such as BeautifulSoup or equivalent (SimpleHTMLDom in PHP).
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility.
A regular expression could be devised to achieve the same goal, but it would be limited. For example, developing a regex to get the src and alt attributes would force the alt attribute to come either before or after src, and overcoming this limitation would add more complexity to the regular expression.
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+?[^>]*?\s*?src\s*?=\s*?(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
The attribute or tag name is in uppercase and the i modifier is not used.
Quotes are not used around the src attribute.
An attribute other than src uses the > character somewhere in its value.
Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a DOM document.
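For comparison, a minimal BeautifulSoup sketch (Python) of the same task the regex above attempts: pull the src attribute out of every <img>, regardless of attribute order, quoting style or tag-name case.

from bs4 import BeautifulSoup

html_doc = """
<p>Example</p>
<IMG alt='logo' src=/images/logo.png>
<img src="photo.jpg" alt="a > b">
"""

soup = BeautifulSoup(html_doc, "html.parser")
for img in soup.find_all("img"):
    print(img.get("src"))   # /images/logo.png, then photo.jpg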
I screen scrape a lot. Some advice:
Emulate the User-Agent string of some browser you want to impersonate. Different websites frequently return very different results depending on what your user agent is. If they don't recognize the User-Agent they will often revert to the lowest common denominator, so it's usually best to start with some recent browser. (For example, the World of Warcraft Armory returns beautiful, easy-to-parse XML if it thinks you're a recent Firefox. If it doesn't know what you are, it sends terrible HTML.)
Be polite to the site you're scraping; don't hit it too hard. Your scraper will go faster if you multi-thread it, making many requests at once, but that will annoy the site owner.
Be smart about error handling. Do not write code like while (1) { makeRequest(); }. If your code or the server throws an error a loop like this will immediately fetch another request, generating another error. It can get ugly quickly. Handle errors well and consider putting in sleeps or exits if you see a lot of errors.
When developing your parsing code, test against a cached version rather than hitting the server every time. This will make your development go faster and is the basis of a simple test suite.
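A rough Python sketch pulling those points together with the requests library; the User-Agent string, delays and cache layout are arbitrary choices for illustration.

import hashlib
import os
import time

import requests

HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0"}
CACHE_DIR = "cache"

def fetch(url, delay=2.0, retries=3):
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
    if os.path.exists(cache_file):              # cached copy: no network hit
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=30)
            resp.raise_for_status()
            with open(cache_file, "w", encoding="utf-8") as f:
                f.write(resp.text)
            time.sleep(delay)                   # be polite between requests
            return resp.text
        except requests.RequestException:
            time.sleep(delay * (attempt + 1))   # back off instead of looping hot
    raise RuntimeError("giving up on " + url)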
First, I'd check for an RSS feed. On Blogger, you just have to add /rss to the root URL, if I remember correctly.
Then I'd check whether there isn't already some tool to scrape Blogger.
Then if there's no RSS feed, and no existing tool, I'd give up and do it by hand with copy/paste. Unless we're talking 5000 pages, it's much faster and easier that way. Take it from someone who's tried.
If you have access to the actual account, Blogger has an export function.
Edit: Or of course, you could try Mechanical Turk.
As far as gotchas are concerned, it's usually a good idea to limit the number of requests made over a certain period of time. Smashing a site with a lot of requests in a short space of time is a good way to have your requests rejected.
Aside from the technical considerations, make sure you're not putting yourself at legal risk. Most large sites have specific language in their terms of use that disallows programmatic access to their services via an automated program, and there are also the obvious copyright concerns.
From a technical standpoint, definitely use a DOM parser library and you'll save loads of time. Many provide the ability to read HTML into an XML structure that can be queried using XPath to find exactly what you need.
If you know someone who has access to the account, they can use Blogger's "Export blog" feature.

PDF Report generation [closed]

EDIT: I completed this project using ABCpdf. For anyone interested, I love this product and their support is A+. Everything I listed as a 'con' for the HTML -> PDF solution was easily doable in ABCpdf.
I've been charged with creating a data-driven PDF report. After reviewing the plethora of options, I have narrowed it down to two. I need you all to help me decide, or to offer alternatives I haven't considered. Here are the requirements:
100% Data driven
Eventually PDF (a stop in HTML is fine, so long as it is converted)
Can be run with multiple sets of data (the layout is always the same, the data is variable)
Contains normal analysis-style copy (saved in the DB with HTML markup)
Contains tables (data for tables is generated at run-time)
Header/Page # on each page
Table of Contents
.NET (VB or C#)
Done quickly
Now, because the report is going to be generated with multiple sets of data, I don't think a stamped PDF template will work, since I won't know how long a given piece of the report will be or how many pages it could require.
So, I think my best options are:
Programmatic creation using an iText-like solution.
Generate in HTML and convert to PDF using a third-party application (ABCPdf is the tool I have played with so far)
Both solutions have their pros and cons.
Programmatic solution:
Pros:
Flexible
Easy page numbering/page header/table of contents
Free
Cons:
Time consuming (writing a layer on top of iText to do what I need and keeping it maintainable)
Since the copy is already stored in the DB with HTML markup, I would have to parse through the data before placing it into the PDF, breaking each paragraph into chunks so I can apply bold, italic, underline, etc. to specific phrases. This seems like a huge PITA, and I hope I am wrong about that assumption.
HTML -> PDF
Pros:
Easy to generate from db (no parsing necessary)
Many tools for conversion
Uses technology I am already familiar with
Built-in "Print Preview" - not a req, but nice
Cons:
(Edited after project completion. All of my assumptions were incorrect and ABCpdf is awesome)
1. Almost impossible to generate page headers - Not true
2. Very difficult to generate page numbers - Not true
3. Nearly impossible to generate a table of contents - Not true
4. (Cross-browser support isn't a con; since it's internal, I can dictate which browser to use)
5. Conversion tool quirks - may not convert exactly as rendered in the browser - Not true
6. Overall, I think it would be very hard to format the HTML exactly as I would want it to appear/convert to PDF - Not true
That's it - I need the community's help in deciding which way I should go. I might be wrong about some of my pro/con assumptions. If I am, please tell me. All thoughts and suggestions are welcome and appreciated.
Thanks
Decided on using an approach similar to the one used at
http://alistapart.com/articles/boom
Using ABCPdf instead of Prince for the eventual HTML -> PDF generation.
Anyone who is interested in the same thing, feel free to message me about this approach.
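The approach above uses ABCPdf from .NET; purely as an illustration of the same CSS paged-media idea (running headers, page numbers), here is a rough equivalent with the Python WeasyPrint library, which is a substitution for illustration and not what the answer used.

from weasyprint import HTML

html_doc = """
<html>
<head><style>
  @page {
    margin: 2cm;
    @top-center   { content: "Quarterly Analysis"; }
    @bottom-right { content: "Page " counter(page) " of " counter(pages); }
  }
  h1 { page-break-before: always; }
</style></head>
<body>
  <h1>Section 1</h1><p>Analysis copy pulled from the database goes here.</p>
  <h1>Section 2</h1><p>Tables generated at run time go here.</p>
</body>
</html>
"""

HTML(string=html_doc).write_pdf("report.pdf")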
I think that if you have a full version of Adobe Acrobat Pro, it comes with Adobe LiveCycle. You should be able to produce reports generated from a database with it. It will give you everything you need in formatting, since you will create the report from scratch.
You can create a database connection to an OLE database that will feed data to your form fields. You select the tables to be used, any stored procedures that will run, any queries, and then the data will appear on one of the palettes in the designer.
You can also use Web Services (WSDL) to receive and process commands and return the results to the form.
Either way, you would bind fields to your data source and then the data would be displayed in your form.
If you're willing to do a little .NET work there's this:
http://www.dotnetvj.com/2009/05/populating-pdf-from-aspnet-using.html
Depending on which platform you are using and targeting, you might want to consider a reporting solution. These are not perfect, but the one thing they do give you is the ability to write a report once and then render it as HTML, PDF, or even Excel.
Usually they also provide an editor that helps you design the report and make it look just right. They provide things like paging, headers, footers, graphs, etc. They also provide an API that you can use to programmatically create and run the reports.
I've used Reporting Services in a MS environment and Jasper Reports in a Java environment with good results in both. I'm sure there are other options but these are the ones I've been able to use successfully.
For the HTML→PDF step, I really love Prince. It looks like you can call it from VB.
My recommendation is to use SQL Reporting Services.
Can design every page & table of your report
Include Header and Footer
Include Page Numbers
Table of Contents
Can span multiple pages
Supports Images & Charts
Can be rendered to PDF without the need for any third-party PDF converters