How to programmatically convert HTML to epub? [closed]

Can I do this conversion with any programming language or library?

The short answer is yes, it can be done in any programming language.
Basic steps:
Convert your HTML to XHTML (+ CSS). This can be done in your program or through an XSLT file.
Copy your files (XHTML, CSS, any images and fonts) into a directory structure that follows the EPUB format.
Zip the directory structure up and name the archive with a ".epub" extension.
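The final packaging step can be scripted in just about any language. Here's a minimal sketch in Python (the directory and file names are illustrative, and it assumes the directory already contains a valid META-INF/container.xml and OPF package file):
import os
import zipfile

def pack_epub(source_dir, epub_path):
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry must come first and be stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full = os.path.join(root, name)
                rel = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if rel == "mimetype":
                    continue  # already written
                zf.write(full, rel, compress_type=zipfile.ZIP_DEFLATED)

pack_epub("my_book", "my_book.epub")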
Some web sites to help you get started:
A good tutorial for what's in an epub file (and how to create one yourself) can be found here: http://www.jedisaber.com/eBooks/Introduction.shtml. I used this to get started myself.
Specs for the .epub standard are here: http://www.idpf.org/
A validator for .epubs can be downloaded from here: https://github.com/IDPF/epubcheck
June 2015 Note: The epubcheck validator has moved from google code to GitHub; note the new URL.

Calibre supports a wide variety of input formats, including HTML, and a wide variety of output formats, including EPUB, but it's not "a programming language or library". Are there specific reasons you desire a programming-based approach rather than a free-standing tool? If so, maybe Python and ebookmaker.py, for example, could help you.
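If a standalone tool is acceptable but you still want to drive it from code, note that Calibre also ships a command-line converter, ebook-convert, which can be called from a script. A minimal sketch in Python (the file names and title are placeholders, and it assumes Calibre's command-line tools are on your PATH):
import subprocess

# Convert an HTML file to EPUB by shelling out to Calibre's ebook-convert tool.
subprocess.run(
    ["ebook-convert", "notes.html", "notes.epub", "--title", "My Notes"],
    check=True,
)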

A late reply, but I found the Python 3-based ebookmaker to be of value, at least after I contributed a pull request to remove a UTF-8 BOM. One problem with it appears to be that it uses brittle regular expressions to parse HTML, but I guess I'll have to report it there.

Here's PDF to EPUB; I know that's not what you're after, but it's a start.
The calibre package may have what you want.

I am using the following library from Aspose - http://www.aspose.com/categories/.net-components/aspose.words-for-.net/default.aspx
In just two lines of code I am able to do HTML-to-EPUB conversions. I'm currently using this in a production system.
// Requires a reference to Aspose.Words (using Aspose.Words;)
Document doc = new Document(_sourceFilePath);
doc.Save(_destinationFilePath, SaveFormat.Epub);

I just started to implement such a tool in Java (OpenJDK compatible): html2epub. To get rid of manually editing the config file, I'll probably start a separate tool to generate the config file from any given directory (it would still be necessary to determine the order of the XHTML files in the EPUB; for non-programmatic use, a GUI helper tool could be considered, but for a fully flexible programmatic solution I haven't come up with an idea yet). Before that, I implemented shell-script-based converters for custom XML input (the hag2epub tools); in case you're interested, I would probably port them to XHTML input, with a config file for the EPUB metadata or with metadata obtained from the topmost index.html of the directory, if it exists.

I had the same issue previously, because I wanted to read some webpage content offline on my iPad. I had no idea how, and I am not computer savvy. There are Calibre, Stanza and the like...
But for me those are just format converters, and I needed an EPUB book creator that would let me combine many documents together to read. Then I found a bookish HTML-to-EPUB converter; I save the HTML page from the web and then convert it with that. It's quite a good tool for me now.

Related

Is there a preprocessor for json files? [closed]

I have some configuration files in which I store complex object values as serialized JSON. Currently there is a configuration file for each environment (localhost, dev, prod, etc.) and for each installation by client. Most of the values are identical between environments, but not all. So for three environments and four clients I currently have 12 files in total to manage.
If this were a web.config file, there would be web.config transforms to solve the problem. If this were C#, I'd have compiler preprocessor directives that could be used to substitute the different values based on the current build configuration.
Does anyone know of anything that works basically this way, or have good suggestions on tried and true ways to proceed? What I would like is to reduce the number of files down to a single instance for each installation that can suffice for every environment.
Configuration of configuration always seems a bit overdone to me, but you could use a properties file for the parts that change, and Apache Ant's <replace> task to do the substitutions. Something like this:
<replace file="configure.json"
         propertyFile="config-of-config.properties">
  <replacefilter token="#token1#"
                 property="property.key"/>
</replace>
Jsonnet from Google is a language with a superset syntax based on JSON, adding high-level language features that help to model data in JSON format. The compilation step produces JSON. I have used it in a project to describe complex deployment environments that at times inherit from one another and that share domain attributes, albeit utilizing them differently from one instance to another.
As an example, an instance contains applications, tenant subscriptions for those applications, contracts, destinations and so forth. The values for all of these attributes are objects that recur throughout environments.
Their docs are very thorough; don't miss the std functions, because they make for some very powerful data-rendering capabilities.
I wrote a Squirrelistic JSON Preprocessor, which uses the Golang text/template syntax to generate JSON files based on the parameters provided.
A JSON template can include references to other templates and can use conditional logic, comments, variables and everything else the Golang text/template package provides.
This really comes down to your full stack.
If you're talking about some application that runs solely client-side, with no server-side processing, whatsoever, then there's really no such thing as pre-processing.
You can process the data further before actually using it, but that won't mean that it will be processed prior to the page being served -- it means that people have to sit around, waiting for that to happen before the apps which need that data can be initialized.
The benefit of using JSON in the first place is that it's just a data store; it is quite language-agnostic and quite widely supported now. So if it's not 100% client-side, there's nothing stopping you from pre-processing in whatever language you're using on the server, and caching those versions of those files to serve (and cache) to users, based on their needs.
If you really, really need a system to do live processing of config-files, on the client-side, and you've gone through the work of creating app-views which load early, but show the user that they're deferring initialization (ie: "loading..."/spinners), then download a second JSON file, which holds all of the needed implementation-specific data (you'll have 12 of these tiny little files, which should be simple to manage), parse both JSON files into JS objects, and extend the large config object with the additional data in the secondary file.
Please note: use localStorage or some other storage facility to cache this, so that for HTML5 browsers this longer load only happens once.
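To illustrate the merge idea from that last paragraph: here is a minimal sketch in Python of overlaying a small per-installation override onto a larger base config (the same logic would apply in JS on the client; base.json and prod.json are placeholder names):
import json

def deep_merge(base, override):
    # Recursively overlay 'override' onto 'base' and return the merged result.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

with open("base.json") as f:
    base = json.load(f)
with open("prod.json") as f:  # one small override file per environment
    override = json.load(f)

print(json.dumps(deep_merge(base, override), indent=2))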
There is one: https://www.npmjs.com/package/json-variables
Conceptually, it is a function which takes a string of JSON contents sprinkled with specially marked variables and produces a string with those variables resolved, just as Sass or Less does for CSS: it's used to DRY up the source code.
Here's an example.
You'd put something like this in JSON file:
{
  "firstName": "customer.firstName",
  "message": "Hi %%_firstName_%%",
  "preheader": "%%_firstName_%%, look what's inside"
}
Notice how it's DRY — single source of truth for the firstName value.
json-variables would process it into:
{
  "firstName": "customer.firstName",
  "message": "Hi customer.firstName",
  "preheader": "customer.firstName, look what's inside"
}
That is, Hi %%_firstName_%% would look for firstName at the root level (but equally, it could be a deeper path, for example data1.data2.firstName). Resolving also "bubbles up" to the root level, and you can use custom data structures and more.
Missing pieces of a JSON-processing task puzzle are:
Means to merge multiple JSON files, various ways (object-merge-advanced)
Means to orchestrate actions — Gulp is good if your preferred programming language is JS
Means to get/set values by path (object-path - its notation uses dots only, no brackets key1.key2.array.2 instead of key1.key2.array[2])
Means to maintain the same set of keys across set of JSON files - you add a key in one, it's added on all others (object-fill-missing-keys)
In the described case, we can take at least two approaches: one-to-many, or many-to-many.
The former — Gulp could be "baking" many JSON files from one or more JSON-like source files, with json-variables DRY-ing up the references.
The latter — alternatively, a "managed" set of JSON files could be rendered into a set of distribution files: Gulp watches the src folder, runs object-fill-missing-keys to normalise the schemas, maybe even sorting the objects (yes, it's possible, see sorted-object).
It all depends on how similar the desired set of JSON files is, how the values are customised, and whether that is done manually or programmatically.

Activating HTML with Haskell

I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises. I have some experience implementing online exercises as cgi-bin executables compiled from Haskell code running on the server, interacting with a student record file and sending suitable HTML back to the browser, using Text.Xhtml to generate the content. Now I plan to integrate the notes and the exercises.
The trouble is that I don't want to spend ages manually transforming my raw HTML into Haskell code to generate exactly the raw HTML I started with. Instead, I'd like to put my Haskell code and my HTML in the same source file, with placeholders in the latter for content generated by the former. A suitable tool should then transform this file into Haskell source code for (e.g.) a cgi-bin executable which generates the corresponding page.
Before I go hacking up such a piece of kit, I thought I'd ask if there's better technology out there already. The fixed points are the large legacy lump of HTML, the need to implement the assessment of the exercises in Haskell, and the need to interact with student records on the server. The handicap is that I need to use the departmental web server and I can't reconfigure it (ok, maybe I could ask nicely): that's one of the reasons I currently use cgi-bin executables, which are just fine on our server already, but I'm open to other possibilities.
My current plan is to write a (I mean adapt an existing) preprocessor to support a special syntax for defining functions of type
Html -> ... -> Html -> Html
that looks a lot like raw HTML with splice points. Then what I do with my existing raw HTML is indent it a bit and mark the holes.
But would that be a waste of time? Please, please tell me that this question is a duplicate!
There are Haskell frameworks like Yesod and Happstack which use templating engines like you describe.
Have you looked at the haskell wiki at http://www.haskell.org/haskellwiki/HSP or
http://www.haskell.org/haskellwiki/Web/Libraries/Templating ?
They may do what you need.
You might find something to do the job here: Templating packages for Haskell.
And you should probably look into Snap, Yesod or Happstack for serving the content.
I have a large pile of lecture notes in raw HTML format. I would like to add interactive content to these notes, in particular incorporating online exercises.
There is already a system (called "ActiveHs"), written in Haskell, that allows you to put lecture notes and interactive exercises in one file.
See:
http://pnyf.inf.elte.hu/fp/UsersGuide_en.xml
http://pnyf.inf.elte.hu/fp/Constructive_en.xml
I can really say that it is very well written code and completely open source!

Choosing an open source license for a standard based on XSD [closed]

I've got a question about an open source project I'm planning to put in place (to be hosted either at CodePlex or SourceForge).
In short, the project will consist of an XSD file to define a schema for XML files to adhere to, and some C# code to work with those XML files.
But I'm not sure what license I should give it, especially the XSD file. The project will be mostly a class library, so I'm tempted to go with LGPL, so it can be used by both free and proprietary software.
But the one thing is, I don't want the XSD file to be changeable, because I'm trying to put up a standard for data-sharing in a specific problem domain, and IMO there's no point in making a standard open source. Or is there?
Or should I release the XSD as a separate project? I'm not sure what the right way to go is...
Thanks for any advice on the matter.
Mathieu
I think you are making a big mistake. In fact, I think that the XSD files should be released under a more liberal license.
If someone wants to create a proprietary version of HTML and fork Firefox to work with it they are only creating needless work for themselves. For the most part, it doesn't cause any problems for Mozilla or the W3C because nobody is going to care or use it. Granted, at one time, both Netscape and Microsoft tried to add proprietary HTML extensions. Microsoft eventually realized the value in browser interoperability. Netscape didn't last long enough for it to matter.
If you put a restrictive license on the schema, you will decrease the likelihood that anyone will adopt your standard. Many developers are constrained by the licenses of components they can use in their projects. What is the point of having a standard, unless it is open to all developers?
Keep in mind, an XSD file is not a standard or a schema. It is only a representation of a standard.
For example, if you have an XHTML XSD, changing the XSD does not change the XHTML schema. The XHTML schema is defined by an English document published by the W3C. The only way to change the XHTML schema is to get the W3C to publish an updated version of the document. If you change an XHTML XSD, you have created a representation of a different schema.
Putting the XSD file under a restrictive license doesn't do anything to protect your schema; it only forces someone to code a new XSD file from scratch for their proprietary extensions.
Have you considered that your standard might have flaws, or not cover certain use cases you haven't considered? If your standard can't meet all the needs of a developer they won't use it. You could promise to incorporate improvements in to the standards, but what happens if you get hit by a bus? If you are the only person who can legally change the standard it will eventually stagnate and become irrelevant.

How to highlight source code in HTML? [closed]

I want to highlight C/C++/Java/C# etc. source code on my website.
How can I do this?
Is it a CPU intensive job to highlight the source code?
You can do this either server-side or client-side. It's not very processor-intensive, but if you do it client-side (using JavaScript) there will be a noticeable lag. Most client-side solutions revolve around Google Code's syntax highlighting engine. This seems to be the most popular one: SyntaxHighlighter
Server-side solutions tend to be more flexible, especially in the way of defining new languages and configuring how they are highlighted (e.g. colors used). I use GeSHi, which is a PHP solution with a moderately nice plugin for Wordpress. There are also a few libraries built for Java, and even some that are based upon VIM (usually requiring a Perl module to be installed from CPAN).
In short: you have quite a few options, what are your criteria? It's hard to make a solid recommendation without knowing your requirements.
I use GeSHi ("Generic Syntax Highlighter") on pastebin.com
pastebin has high traffic, so I do cache the results of the transformation, which certainly reduces the load.
Personally, I prefer offline tools: I don't see the point of parsing the code (particularly large samples) over and over, for each served page, or even worse, in each browser (for JS libraries), because as pointed out above, these libraries often lag (you often see the raw source before it is formatted).
There are a number of tools to do this job, some pointed above. I just use the export feature of my favorite editor (SciTE) because it just respects the choices of color I carefully set up... :-) And it can output XML, PDF, RTF and LaTeX too.
Pygments is a good Python library for generating HTML, RTF, ANSI (terminal-style) or LaTeX output. It supports a large range of languages (C, C++, Lua, Erlang, ...) and you can even write your own output formatter.
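A minimal server-side sketch with Pygments (the code string is just a placeholder):
from pygments import highlight
from pygments.lexers import CppLexer
from pygments.formatters import HtmlFormatter

source = "int main() { return 0; }"

# Produce an HTML fragment with span-based highlighting...
html = highlight(source, CppLexer(), HtmlFormatter())
# ...and the matching CSS rules to style it.
css = HtmlFormatter().get_style_defs(".highlight")
print(html)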
I use google-code-prettify. It is the simplest to set up and works great with all C-style languages.
If you use jEdit, you might want to use the Code2HTML plugin.
I use SyntaxHighligher on my blog.
Just run it through a tool like: http://www.gnu.org/software/src-highlite/
If you are using PHP, you can use GeSHi to highlight many different languages. I've used it before and it works quite well. A quick googling will also uncover GeSHi plugins for wordpress and drupal.
I wouldn't consider highlighting to be CPU intensive unless you are intending to display megabytes of it all at once. And even then, the CPU load would be minimal and your main problem would be transfer speed for it all.

best library to do web-scraping [closed]

I would like to get data from different webpages, such as addresses of restaurants or dates of different events for a given location, and so on. What is the best library I can use for extracting this data from a given set of sites?
If using python, take a good look at Beautiful Soup (http://crummy.com/software/BeautifulSoup).
An extremely capable library, makes scraping a breeze.
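A minimal sketch of what that looks like (the URL is a placeholder, and it assumes the requests and beautifulsoup4 packages are installed):
import requests
from bs4 import BeautifulSoup

# Fetch a page and pull out every link's text and target.
response = requests.get("http://example.com/restaurants")
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.find_all("a", href=True):
    print(link.get_text(strip=True), "->", link["href"])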
The HTML Agility Pack is awesome for .NET programmers. It turns webpages into XML documents that can be queried with XPath.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every anchor that has an href attribute and rewrite its target.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = FixLink(att);
}
doc.Save("file.htm");
You can find it here. http://www.codeplex.com/htmlagilitypack
I think the general answer here is to use any language + http library + html/xpath parser. I find that using ruby + hpricot gives a nice clean solution:
require 'rubygems'
require 'hpricot'
require 'open-uri'

sites = %w(http://www.google.com http://www.stackoverflow.com)

sites.each do |site|
  doc = Hpricot(open(site))
  # iterate over each div in the document (or use xpath to grab whatever you want)
  (doc/"div").each do |div|
    # do something with divs here
  end
end
For more on Hpricot see http://code.whytheluckystiff.net/hpricot/
I personally like the WWW::Mechanize Perl module for these kinds of tasks. It gives you an object that is modeled after a typical web browser, (i.e. you can follow links, fill out forms, or use the "back button" by calling methods on it).
For the extraction of the actual content, you could then hook it up to HTML::TreeBuilder to transform the website you're currently visiting into a tree of HTML::Element objects, and extract the data you want (the look_down() method of HTML::Element is especially useful).
I think Watir or Selenium are the best choices. Most of the other libraries mentioned are actually HTML parsers, and that is not what you want... You are scraping; if the owner of the website wanted you to get at his data, he'd put a dump of his database or site on a torrent and avoid all the HTTP requests and expensive traffic.
Basically, you need to parse HTML, but more importantly you need to automate a browser, to the point of being able to move the mouse and click, basically really mimicking a user. You need to use a screen-capture program to get at the captchas and send them off to decaptcha.com (which solves them for a fraction of a cent) to circumvent that. Forget about saving that captcha file by parsing the HTML without rendering it in a browser 'as it is supposed to be seen'. You are screen-scraping, not HTTP-request-scraping.
Watir did the trick for me in combination with AutoItX (for moving the mouse and entering keys in fields; sometimes this is necessary to set off the right JavaScript events) and a simple screen-capture utility for the captchas. This way you will be most successful; it's quite useless to write a great HTML parser only to find out that the owner of the site has turned some of the text into graphics. (Problematic? No, just get an OCR library and feed it the JPEG; the text will be returned.) Besides, I have rarely seen them go that far, although on Chinese sites there is a lot of text in graphics.
XPath saved my day all the time; it's a great domain-specific language (IMHO, I could be wrong) and you can get to any tag in the page, although sometimes you need to tweak it.
What I did miss was 'reverse templates' (Selenium's robot framework has this). Perl had this in the CPAN module Template::Extract; very handy.
The HTML parsing, or the creation of the DOM, I would leave to the browser; yes, it won't be as fast, but it'll work all the time.
Also, libraries that pretend to be user agents are useless; sites are protected against scraping nowadays, and rendering the site on a real screen is often necessary to get past the captchas, but also to trigger the JavaScript events needed for information to appear, etc.
Watir if you're into Ruby, Selenium for the rest, I'd say. The 'Human Emulator' (or 'Web Emulator' in Russia) is really made for this kind of scraping, but then again it's a Russian product from a company that makes no secret of its intentions.
I also think that one of these weeks Wiley has a new book coming out on scraping; that should be interesting. Good luck...
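If you go the browser-automation route, a minimal Selenium sketch in Python looks something like this (the URL and CSS selector are placeholders, and it assumes a matching WebDriver such as geckodriver is installed):
from selenium import webdriver
from selenium.webdriver.common.by import By

# Drive a real browser so JavaScript runs exactly as a user would see it.
driver = webdriver.Firefox()
try:
    driver.get("http://example.com/events")
    # Grab the text of every element matching a CSS selector.
    for element in driver.find_elements(By.CSS_SELECTOR, ".event-title"):
        print(element.text)
finally:
    driver.quit()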
I personally find http://github.com/shuber/curl/tree/master and http://simplehtmldom.sourceforge.net/ awesome for use in my PHP spidering/scraping projects.
The Perl WWW::Mechanize library is excellent for doing the donkey work of interacting with a website to get to the actual page you need.
I would use LWP (Libwww for Perl). Here's a good little guide: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
WWW::Scraper has docs here: http://cpan.uwinnipeg.ca/htdocs/Scraper/WWW/Scraper.html
It can be useful as a base, you'd probably want to create your own module that fits your restaurant mining needs.
LWP would give you a basic crawler for you to build on.
There have been a number of answers recommending Perl Mechanize, but I think that Ruby Mechanize (very similar to Perl's version) is even better. It handles some things like forms in a much cleaner way syntactically. Also, there are a few frontends which run on top of Ruby Mechanize which make things even easier.
What language do you want to use?
curl with awk might be all you need.
You can use tidy to convert it to XHTML, and then use whatever XML processing facilities your language of choice has available.
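For example, a small sketch of that approach in Python, shelling out to the tidy command-line tool and parsing the result with the standard library (the file name is a placeholder, and it assumes tidy is installed):
import subprocess
import xml.etree.ElementTree as ET

# Ask tidy to clean the page up into well-formed XHTML (numeric entities
# keep the XML parser from tripping over named entities like &nbsp;).
result = subprocess.run(
    ["tidy", "-quiet", "-asxhtml", "-numeric", "page.html"],
    capture_output=True, text=True,
)

# Parse the cleaned markup and pull out every anchor's href.
root = ET.fromstring(result.stdout)
ns = {"x": "http://www.w3.org/1999/xhtml"}
for a in root.findall(".//x:a", ns):
    print(a.get("href"), a.text)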
I'd recommend BeautifulSoup. It isn't the fastest, but it handles the not-well-formedness of (X)HTML pages, which most parsers choke on, really well.
As someone said above: use any language. As long as you have a good parser library and an HTTP library, you are set. The tree-based approaches are slower than just using a good parsing library.