How can I convert HTML to Textile? - html

I'm scraping a static html site and moving the content into a database-backed CMS. I'd like to use Textile in the CMS.
Is there a tool out there that converts HTML into Textile, so I can scrape the existing site, convert the HTML to Textile, and insert that data into the database?

I know this is an old question, but I found myself trying to do this the other day and not finding anything useful, until I found Pandoc. It can convert loads of other markup formats as well - it's quite brilliant.

Here is a c# lib converting html 2 textile. Though it is textile with their additions. Not pure textile.

Since there was no javascript implementation, I wrote one:
https://github.com/cmroanirgo/to-textile
It's a little primitive at the moment, as it's a blind port of the 'to-markdown' equivalent, but should get the job done.

This is a simple markup replacement, nothing a good regex could not fix.
I recommend Perl, LWP::Simple and some regexes to do the whole thing (spidering, stripping design and menus, converting to textile, and then posting to the database.)

try this simple java code hope it work for you
import java.net.*;
import java.io.*;
class Crawle
{
public static void main(String ar[])throws Exception
{
URL url = new URL("https://www.google.co.in/#q=i+am+happy");
InputStream io = url.openStream();
BufferedReader br = new BufferedReader(new InputStreamReader(io));
FileOutputStream fio = new FileOutputStream("crawler/file.txt");
PrintWriter pr = new PrintWriter(fio,true);
String data = "";
while((data=br.readLine())!=null)
{
pr.println(data);
System.out.println(data);
}
}
}
}

Related

How to parse html from javafx webview and transfer this data to Jsoup Document?

I am trying to parse sidebar TOC(Table of Components) of some documentation site.
Jsoup
I have tried Jsoup. I can not get TOC elements because the HTML content in this tag is not part of initial HTML but is set by JavaScript after the page is loaded.
You can see my previous question here:JSoup cannot parse child elements after depth 2
The suggested solution is to examine what connections are made manually from the Browser Dev Tools menu find the last version of the website. Parsing sidebar TOC of some documentation site is just one component of my java program so I cannot do this manually.
JavaFX Webview(not Android Webview)
I have tried JavaFX Webview because I need a browser that executes javascript code and fills Toc tag components.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
webEngine.load("https://learn.microsoft.com/en-us/ef/ef6/");
But I don't know how can I retrieve HTML code of the loaded website and transfer this data to Jsoup Document?
ANy advice appreciated.
WebView browser = new WebView();
WebEngine webEngine = browser.getEngine();
String url = "https://learn.microsoft.com/en-us/ef/ef6/";
webEngine.load(url);
//get w3c document from webEngine
org.w3c.dom.Document w3cDocument = webEngine.getDocument();
// use jsoup helper methods to convert it to string
String html = new org.jsoup.helper.W3CDom().asString(webEngine.get);
// create jsoup document by parsing html
Document doc = Jsoup.parse(url, html);
I can't promise this is the best way as I've not used Jsoup before and I'm not an expert on the XML API.
The org.jsoup.Jsoup class has a method for parsing HTML in String form: Jsoup.parse(String). This means we need to get the HTML from the WebView as a String. The WebEngine class has a document property that holds a org.w3c.dom.Document. This Document is the HTML content of the currently showing web page. We just need to convert this Document into a String, which we can do with a Transformer.
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.jsoup.Jsoup;
public class Utils {
private static Transformer transformer;
// not thread safe
public static org.jsoup.nodes.Document convert(org.w3c.dom.Document doc)
throws TransformerException {
if (transformer == null) {
transformer = TransformerFactory.newDefaultInstance().newTransformer();
}
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(doc), new StreamResult(writer));
return Jsoup.parse(writer.toString());
}
}
You would call this every time the document property changes. I did some "tests" by browsing Google and printing the org.jsoup.nodes.Document to the console and everything seems to be working.
There is a caveat, though; as far as I understand it the document property does not change when there are changes within the same web page (the Document itself may be updated, however). I'm not a web person, so pardon me if I don't make sense here, but I believe that this includes things like a frame changing its content. There may be a way around this by interfacing with the JavaScript using WebEngine.executeStript(String), but I don't know how.

What language to use?

What language should I use if i were to write a script to for example download every video from a specific youtube channel as a mp3 file via http://www.youtube-mp3.org/ ? This has been on my mind for a very long time and it would be nice if anyone could answer this question. I know it should be possible with html to use the tag and do some things with JS but i want a simpler and more reliable way. Thnx. PS. examples appreciated
you can easily use c# with webclient class
using (WebClient Client = new WebClient ())
{
Client.DownloadFile("http://google.com", "a.mpeg");
}

Is there an example on using Razor to generate a static HTML page?

I want to generate a static HTML page by RAZOR, basically by using includes of partial sub pages.
I have tried T4 as well and do look for an alternative: see here and here
This answer says it is possible - but no concrete example
I have installed Razor generator because I thought this is the way to go, but I do not get how to generate static HTML with this.
Best would be a complete extension which behaves like the T4 concept, but allows me to use the RAZOR syntax and HTML formatting (the formatting issue is basically the reasons why I am not using T4).
If you are trying to take a Razor view and compile it and generate the HTML then you can use something like this.
public static string RenderViewToString(string viewPath, object model, ControllerContext context)
{
var viewEngineResult = ViewEngines.Engines.FindView(context, viewPath, null);
var view = viewEngineResult.View;
context.Controller.ViewData.Model = model;
string result = String.Empty;
using (var sw = new StringWriter())
{
var ctx = new ViewContext(context, view,
context.Controller.ViewData,
context.Controller.TempData,
sw);
view.Render(ctx, sw);
result = sw.ToString();
}
return result;
}
Or outside of ControllerContext http://razorengine.codeplex.com/
The current version of Razor Generator has the "Generator" option which when used with the "MvcHelper" generator produces a static method for the helpers too.
For example, add this line at the top of your CSHTML file (with the Custom Tool Visual Studio property set to RazorGenerator of course):
#* Generator: MvcHelper, GeneratePrettyNames : true *#
The pretty names option is not strictly necessary but is something I feel should be default, to avoid those crazy long class names with underscores :-)
As you may know already, the main benefit of this method is you can share your helpers in separate assemblies. That is why I use Razor Generator in the first place.
Even within the same assembly, you could now leave your code outside App_Code folder. However that is not the best practice (at least for security) and the Visual Studio designer gets confused. It thinks the method is still not static, but it isn't and works fine.
I'm prototyping my helpers in the App_Code folder of the same site/assembly for speed then copying them to shared components when they're tested. The reason I needed this solution was to create generic Bootstrap helpers without hand-coding every piece of HTML in a HtmlHelper, i.e. used together with this solution from #chrismilleruk.
I guess later I may have to convert the CSHTML helpers to a hand-coded HtmlHelper for speed. But to start with see a great development speed increase at the beginning, from the ability to copy and paste blocks of HTML code I want to automate, then perfect and debug them quickly in the same format/editor.

Link in swt browser to file inside jar file [duplicate]

I want to implement a help system for my tiny SWT desktop application.
I've thought about a SWT browser widget containing a single html markup page and a set of anchors to navigate (there are only very few things to explain).
Everything works fine, but how do I load the html file from a jar?
I know aboutgetClass().getClassLoader().getResourceAsStream("foo");, but what is the best practice when reading from the input stream? The answer to Load a resource contained in a jar dissuades using a FileInputStream.
Thanks in advance
Well, I found a rather simple solution that obviously just works:
InputStream in = getClass().getClassLoader().getResourceAsStream("html/index.html");
Scanner scanner = new Scanner(in);
StringBuffer buffer = new StringBuffer();
while(scanner.hasNextLine()) {
buffer.append(scanner.nextLine());
}
browser.setText(buffer.toString());
i tend to use commons-io for such a task giving me simple abstraction methods like IOUtils.toString(InputStream in); and leaving the choice of best implementations to the able people at apache ;)
commons-io: http://commons.apache.org/io/
apidocs: http://commons.apache.org/io/api-release/index.html

ASP.NET MVC: render view to generate PDF: use iTextSharp or better solution?

I display receipt in both HTML and printer-friendly version. HTML version does jQuery tabs, etc, while printer-friendly has zero scripts and external dependencies, no master layout, no additional buttons, inline CSS, and can be saved as HTML without problems.
Since I use Spark View Engine, I though maybe it's a good idea to generate PDF using iTextSharp engine. But after few paragraphs I decided it's too cumbersome, because a) I would have to rewrite entire receipt (source Spark view is about 5 pages long) b) I had problems with iTextSharp from beginning - for example, numbered lists kept bulleted, with no indentation, and indentationLeft="20" didn't work - maybe because of lack of documentation, but see (a).
So, my requirements for PDF are very simple: I want to keep the same HTML but insert page breaks between individual receipts (yes I have several ones in a single document).
Is there a simple way to generate PDF from view/HTML without rewriting the view using a strange half-documented engine?
UPDATE: tried community HTMLDoc version; didn't use my inline CSS styles, incorrectly displayed Unicode symbols for currencies. wkhtmltopdf did pick the CSS but failed for currency symbol; I suppose there's problem with encoding solved by setting charset to utf-8. wkhtmltopdf seems to be nice but I'm yet to figure out how to set page breaks...
If you can have the HTML in memory then you can convert it to PDF. I've once did something similar using xhtmlrenderer. It is a JAVA framework that bundles iText and that is capable of converting an HTML stream into PDF. As it is written in JAVA I've used the ikvmc.exe to convert the jar file into a .NET assembly and use it directly from managed code.
public class Pdf : IPdf
{
public FileStreamResult Make(string s)
{
using (var ms = new MemoryStream())
{
using (var document = new Document())
{
PdfWriter.GetInstance(document, ms);
document.Open();
using (var str = new StringReader(s))
{
var htmlWorker = new HTMLWorker(document);
htmlWorker.Parse(str);
}
document.Close();
}
HttpContext.Current.Response.ContentType = "application/pdf";
HttpContext.Current.Response.AddHeader("content-disposition", "attachment;filename=MyPdfName.pdf");
HttpContext.Current.Response.Buffer = true;
HttpContext.Current.Response.Clear();
HttpContext.Current.Response.OutputStream.Write(ms.GetBuffer(), 0, ms.GetBuffer().Length);
HttpContext.Current.Response.OutputStream.Flush();
HttpContext.Current.Response.End();
return new FileStreamResult(HttpContext.Current.Response.OutputStream, "application/pdf");
}
}
}
Finally used wkhtmltopdf which works fine when I set encoding, I found out how to setup page breaks, and it processes my CSS very nice. On issue is that it can't correctly process stdin/out in Windows version (don't remember if it's in or out that doesn't work) - may be fixed in recent versions, but I'm ok with temp files.