How to parse a PDF using AS3? (air) - actionscript-3

Is there a way to parse a PDF using AS3 via AIR on mobile?
I don't need the full content of the PDF, only some data. Is that possible?
Edit for clarification:
I have a PDF file that was originally generated from an XML document, and I need to retrieve that XML. Failing that, I need to find a string inside the PDF so I can make a call to a web service.

Original:
There's nothing native in AS3 for this kind of thing; the closest library is AlivePDF. It won't let you traverse a PDF the way you would XML, which seems to be what you're trying to do by pulling out a small piece of the document, but it will let you create PDFs, add pages, change fonts, etc.
You weren't entirely clear on what you're attempting to achieve. If you update your question with a bit more detail, we may be able to help more.
Edit:
From the refined question, AlivePDF is not what you're after, as it's really only for PDF generation. I'm assuming you want to traverse the document as you would XML, looking for a tag and extracting the information. I've not found a way to do this other than iterating through the document and searching manually, which probably isn't what you're after.
After some searching I found as3-pdfreader, which doesn't seem to be complete at the moment. However, the roadmap on its Project Home says that parsing PDF files is complete. I've not been able to try it out yet, though.
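For what it's worth, the "iterate through the document and search manually" approach can be sketched in a few lines. This is shown in Python rather than AS3, purely as an illustration; the function name `find_in_pdf` is made up for this example. It assumes the string you want (for example a fragment of the original XML) sits either in plain bytes or inside a Flate-compressed stream. Text that is drawn on the page is often encoded or split up, so this will not find everything.

```python
import re
import zlib

# Captures everything between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def find_in_pdf(data: bytes, needle: bytes) -> bool:
    """Return True if needle occurs in the raw bytes or in an inflated stream."""
    if needle in data:
        return True
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream"
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # stream is not Flate-compressed (or uses another filter)
        if needle in inflated:
            return True
    return False
```

Typical use would be `find_in_pdf(open("file.pdf", "rb").read(), b"<order")`, where the file name and the marker are placeholders.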

Related

How do I retrieve text and images from websites (in HTML or JSON) on iOS Swift?

I looked around the Internet to figure out how to get data from websites using Swift, and have narrowed it down to roughly JSON or HTML, neither of which I am familiar with.
As far as I know, there are third-party libraries I can use to parse the data. I have been following Dani Arnaout’s Working with JSON in Swift Tutorial as a reference. However, I have not been able to find a way to retrieve JSON from an arbitrary website; only the iTunes JSON page, which is included in the tutorial, works.
What I want to do: Make an app that downloads images and also some text from many different websites, either by HTML or JSON. The problem right now is that I have no idea how to start doing it. A simple demo would be helpful.
Questions
How do I get the JSON of any random website on the Internet?
How do I make use of the HTML data from websites to turn it into a readable format? I’m retrieving the HTML using Google Chrome, and it seems to be gibberish: I can’t find the text anywhere.
Question 1: Not all websites expose their content formatted as JSON.
Question 2: I think you should look at a couple of resources. First, Ray Wenderlich has a tutorial on how to parse HTML. Although it uses Objective-C, you should be able to learn quite a lot from it.
Once you have read that tutorial, I would recommend you look at the Swift library Alamofire. There is another tutorial on the Wenderlich site covering this library.
Happy coding!
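To show why raw HTML looks like gibberish and how a parser turns it into something readable, here is a minimal sketch. It is written in Python with only the standard library, not in Swift, and it is not taken from either tutorial mentioned above. The names `PageContent` and `extract` are made up for this example, and the HTML string is assumed to have been downloaded already.

```python
from html.parser import HTMLParser


class PageContent(HTMLParser):
    """Collects visible text and <img> sources from an HTML document."""

    SKIPPED = {"script", "style"}  # their contents are code, not page text

    def __init__(self):
        super().__init__()
        self.text = []
        self.images = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script> and <style>.
        if not self._skip_depth and data.strip():
            self.text.append(data.strip())


def extract(html: str):
    """Return (list of text fragments, list of image URLs)."""
    parser = PageContent()
    parser.feed(html)
    parser.close()
    return parser.text, parser.images
```

The image URLs collected this way may be relative, so they would still need to be resolved against the page URL before downloading.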

How to embed or convert PDF to support reading on mobile browser with offline support?

Before you downvote, please read the full post. It is a legitimate question, for which I have googled and found some answers, but they all fall short, so I am asking the community for advice.
The requirement is to be able to read catalogs in PDF format inside a mobile browser. The files also need to be readable offline, which rules out a few options such as the Google PDF viewer.
Faced with this requirement, I have not found an easy way to embed a PDF file, so converting to HTML5 or images is the route I am thinking of taking.
In terms of HTML5 conversion I have found Flexpaper, crocodoc, Prizm, serverPDF and others, but almost all of them require the user to be online to read the files. Is there a client-side-only way to read and display PDF files? Or an intermediate browser/JS-friendly format?
If you optimize this project, maybe it will work:
JS and HTML PDF viewer

PDF to Structured Format

I have tons of PDFs that I need to convert to some structured format that I can interpret (HTML, XML, etc.).
The PDFs are in this format:
http://img840.imageshack.us/img840/5407/pdfv.png
So far I have tried a lot of software that converts to HTML, but none of it can separate out the images. It just takes something like a screenshot of the page without the text, uses that image as a background in the HTML, and positions the text over it with CSS.
Like this: http://img37.imageshack.us/img37/5015/examplelp.jpg
I have a lot of PDFs, so processing each one's images manually is not an option. Does anyone know of a solution for this (even paid software)?
I had a similar problem a while back and ended up writing my own solution. It's called PDFX and it's free to use. It converts PDF to a structured-format XML and also renders any bitmap images (not vector graphics) found in the PDF separately.
Example input/output can be found here. You might want to give it a try.

Android: How do I retrieve problematic data from a specific webpage?

I have used .NET and ShDocVw for years to grab data off webpages without any issues I couldn't overcome. This website has me beat, though. It seems like such an easy task to grab the titles and other information off a library search page, but I can't see the data to be able to grab it. Usually, I just look in the DOM, but the data wasn't there. I did a view source, but the data wasn't there either. I am so confused.
I am learning Android right now and that is how I would like to solve my problem, but if .NET would be easier... Right now I will take any help, in any form.
The URL is http://catalog.kcls.org/opac/en-US/skin/kcls/xml/rresult.xml?if=&it=h&bl=&lf=&a=&la=&cl=&d=1&l=1&s=pubdate&sd=desc&adt=ml&tp=&t=bibcn%3ADVD%20FIC%20ON%20ORDER&av=&rt=multi
For this specific website, if you disable JavaScript in your browser, you will see that they give you a link to a plain HTML search portal:
http://catalog.kcls.org/opac/en-US/extras/slimpac/start.html

How do I convert PDF to HTML programmatically?

Are there any classes, COM objects, command line utilities, or anything else that I can make an API for that can convert a PDF to an HTML document? Obviously the conversion might be a little rough, since PDFs can contain a lot more than HTML can describe. I found a utility called pdftohtml on SourceForge, but quite honestly it does a horrible job with the conversion. I don't care if the software is free or commercial, but is there anything out there at all that I can incorporate into my own software to do this sort of conversion at least decently? I know Google has developed its own method of doing this, since you can click "View as HTML" on a PDF attached to an email in Gmail, but I was hoping there was something available to the public.
Remember, PDF to HTML. I'm NOT worried about HTML to PDF.
Well, one solution I can think of is to write a small program that reads the PDF text using a library called iText and then generates HTML files.
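The second half of that idea, generating HTML from text that has already been extracted, could look roughly like the sketch below. It is in Python rather than Java and does not use iText. The function name `pages_to_html` is made up, and the input is assumed to be one plain-text string per page, with blank lines between paragraphs. Layout, fonts and images are lost, which is the rough conversion the question already anticipates.

```python
from html import escape


def pages_to_html(pages, title="Converted PDF"):
    """Wrap extracted page text in a minimal HTML document, one <section> per page."""
    sections = []
    for number, page_text in enumerate(pages, start=1):
        # Blank lines separate paragraphs; escape() keeps <, > and & literal.
        paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
        body = "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)
        sections.append(f'<section id="page-{number}">\n{body}\n</section>')
    return (
        "<!DOCTYPE html>\n<html>\n<head>\n"
        f'<meta charset="utf-8">\n<title>{escape(title)}</title>\n'
        "</head>\n<body>\n" + "\n".join(sections) + "\n</body>\n</html>\n"
    )
```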
As for Java-based PDF solutions, I don't think we have a clean way yet. All the solutions are primitive workarounds. There is no easy solution for:
1. Designing a template of a PDF
2. Then, at runtime in Java, populating data into this template, using either XML or other data sources
Such a simple requirement, and none has a good "open-source and free" solution yet!
Eclipse BIRT comes close, but it does not handle barcode elements out of the box.
You were looking for pdf2htmlEX (C++), which converts PDF to HTML without losing text or format.
To convert further to semantic HTML, you can process pdf2htmlEX output using my project Transcript (Python). However, that step is no longer lossless, and it works best on documents that do not deviate too much from a conventional visual layout.