I have built a struct and am parsing the JSON data correctly, and the data shows up as expected, with one exception: the text I receive also includes an image URL. How do I parse both the text and the image URL?
Here is an example of the data:
"text": "<p>How would you rate your knowledge about investing in general and more specifically, the relationship between risk and return as shown in this chart?</p>\r\n\r\n<p>\r\n<img src=\"https://services.website.com/Charts/$versions/2180/2180.png\" />\r\n</P>\r\n"
I have a label called question text that the text is parsed into. Do I need to add a UIImageView below the text label? My concern is that some questions have text, then an image, then more text.
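One way to handle the "text, then image, then more text" case is to split the HTML into an ordered list of segments before deciding which views to create. The sketch below is only an illustration, written in Python with the standard library rather than in the app's own language. The names `SegmentParser` and `split_segments` are made up for this example, and it assumes the markup is as simple as the sample above (paragraphs and `img` tags only).

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collects ("text", str) and ("image", url) segments in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._buffer = []

    def _flush_text(self):
        # Collapse whitespace so "\r\n" runs between tags do not become segments.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(("text", text))
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self._flush_text()
            src = dict(attrs).get("src")
            if src:
                self.segments.append(("image", src))

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <img ... />.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush_text()

    def handle_data(self, data):
        self._buffer.append(data)


def split_segments(html_text):
    """Return the text and image segments of html_text in reading order."""
    parser = SegmentParser()
    parser.feed(html_text)
    parser.close()
    parser._flush_text()
    return parser.segments
```

Each segment could then map to its own view (a label for "text", an image view for "image"), for example stacked vertically, so the order of text and images in the question is preserved.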
Like this question, "Extract text from XML tags in an XML file using Apache Tika parser",
I want to extract all text from text-based files, including tagged content, the tags themselves, and other text in XML/HTML elements.
I've tried XML (application/xml) and HTML (text/html), and seen that AutoDetectParser returns less than the full text content.
I've also tried YAML (text/plain) and JSON (text/plain), which do return the full text content.
I understand that I can't do XML or HTML using the AutoDetectParser. What I can't find documented is a list of the file types that need special handling.
To get full text content (even if that means a complete 'raw' copy of the file):
1. Which MIME types should be parsed using a TXTParser?
2. Which MIME types should be parsed using other parsers?
Basically, I'm asking: for which MIME types does the AutoDetectParser return less than the full text content?
Thanks
EDIT
My use case is to be able to extract text and metadata from a wide variety of input file formats including txt, xml, html, doc(x), ppt(x), pdf, ...
Essentially, I want to be able to handle any file type Tika can handle.
I am using code like this:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
try (InputStream stream = new FileInputStream(fileToExtract)) {
    parser.parse(stream, handler, metadata, context);
} catch ... {
}
I see the same results for XML files as the question referenced above.
What I am trying to find out is: where is it documented when the combination of AutoDetectParser and BodyContentHandler will return less than the full text of the input file?
When, or for what MIME types, do I need to switch the Parser and/or ContentHandler?
I don't see this information clearly documented, and I am hoping to avoid a trial-and-error approach.
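To make the distinction in the question concrete, here is a small sketch of the difference between "text content" and a "raw copy" of a markup file. It does not use Tika at all; it is plain Python with the standard library, and `text_content` and `raw_content` are names invented for this illustration. It only assumes that a markup-aware extractor hands back character data and drops tags and attributes, which matches the behaviour described above for XML and HTML.

```python
import xml.etree.ElementTree as ET


def text_content(xml_source):
    """Return only the character data, dropping tags and attributes."""
    root = ET.fromstring(xml_source)
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


def raw_content(xml_source):
    """Return the document unchanged, tags and attributes included."""
    return xml_source


sample = '<note id="7"><to>Ann</to><body>Call me</body></note>'
print(text_content(sample))  # Ann Call me
print(raw_content(sample))   # the original markup, unchanged
```

If the goal is the second form, the file probably has to be treated as plain text for those types, whichever parser ends up doing it.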
I have written an API that translates text from English to Hindi. However, if the text that is passed in is HTML, my code fails.
How do I get this right?
I have tried using py-translate, but this package is not able to convert HTML text properly.
I have also tried using the googleclient package; it is able to convert HTML text but can handle only one request at a time.
My API should be able to handle multiple requests and also deal with HTML text translation.
Any help is appreciated.
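One possible approach, sketched below under some assumptions, is to translate only the text nodes and pass the tags through untouched, so the translation backend never sees markup. This uses only the Python standard library. `TextNodeTranslator` and `translate_html` are hypothetical names, and `translate` stands for whatever callable you already have that turns a plain string into its translation (it is not a real package API).

```python
from html import escape
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Re-emits HTML, passing each non-blank text node through `translate`."""

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self._translate = translate
        self._out = []

    def handle_starttag(self, tag, attrs):
        # Copy the start tag exactly as it appeared in the input.
        self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim as well.
        self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._out.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            # Entities were decoded by the parser, so re-escape the result.
            self._out.append(escape(self._translate(data), quote=False))
        else:
            self._out.append(data)

    def result(self):
        return "".join(self._out)


def translate_html(html_text, translate):
    """Translate the text of html_text while keeping its tags in place."""
    parser = TextNodeTranslator(translate)
    parser.feed(html_text)
    parser.close()
    return parser.result()
```

For example, `translate_html('<p>Hello <b>world</b></p>', str.upper)` keeps the `<p>` and `<b>` tags and only changes the text. Each call builds its own parser, so nothing is shared between requests. Note that this sketch drops comments and doctype declarations, and it would also pass script or style content to `translate`, so real input may need extra handling.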
I'm working with Twitter's API to retrieve a list of tweets from a Twitter account.
I'm using this: https://dev.twitter.com/rest/reference/get/statuses/user_timeline
I get the JSON response (http://pastebin.com/raw/zqyUuXcG), but the text field also contains, at the end, the URL of the tweet itself.
I'd like to avoid that because I want to keep the text clean and put the URL in a hyperlink (for example on the date or on the container div).
I couldn't find a way to keep the URL out of the text field. Is there a way to do it?
Thanks
It depends on what you do with the JSON response once you have it, but it doesn't look like you have direct control over URL placement straight from the API.
You could use a regular expression to either extract the URL or filter it out of the text field.
For example, in R, this line of code would remove "https" and everything after it:
gsub("https.*$", "", textfield)
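The same idea can be sketched in Python, with one difference from the R line: the pattern below only strips a URL that sits at the very end of the text, and it returns the URL so it can be reused for the hyperlink. `split_trailing_url` is a name made up for this example, and it assumes the link is separated from the text by whitespace.

```python
import re

# A URL (http or https) at the very end of the string, with any surrounding whitespace.
TRAILING_URL = re.compile(r"\s*https?://\S+\s*$")


def split_trailing_url(text):
    """Return (clean_text, url); url is None when the text does not end with a link."""
    match = TRAILING_URL.search(text)
    if not match:
        return text, None
    return text[:match.start()], match.group().strip()
```

A link in the middle of the tweet is left alone, which may or may not be what you want.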
I have a JSON payload containing some HTML content from an external system. We have a Rich Text field for storing this HTML data. But I noticed that, when storing it, some HTML tags show up literally in the field, because the HTML content arrives as a JSON string. So my question is: how can I store the received JSON string data as HTML in the NetSuite field? Is it possible?
Jdata = dataIn.desc; // getting something like : Guest(s) benifit's surcharges <br><p>test benifit desc 25% discountpop "test"</p>
Thanks for your interest !
Maybe try unescape or decodeURI, which are long-standing standard JS functions.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/unescape
If not, just do a string replacement on the known character sequences (&quot; being the same as ").
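To show what the decoding step amounts to, here is a minimal sketch in Python rather than in the script language used on the NetSuite side. It assumes the tags arrive as HTML entities (`&lt;`, `&gt;`, `&quot;`), and the `escaped` string is a made-up sample in that form, not the real payload. If that assumption holds, functions that decode percent escapes would leave the entities unchanged, and an entity decoder or the string replacement described above is what is needed.

```python
import html


def decode_entities(value):
    """Turn HTML entities such as &lt; and &quot; back into the characters they stand for."""
    return html.unescape(value)


# Hypothetical sample, assuming the external system sends entity-escaped HTML.
escaped = "Guest(s) surcharges &lt;br&gt;&lt;p&gt;25% discount &quot;test&quot;&lt;/p&gt;"
print(decode_entities(escaped))
# Guest(s) surcharges <br><p>25% discount "test"</p>
```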
In a smartphone app I receive HTML text from a server, and I have to parse it using regexes because I can't display it as HTML (I can't use a webview). There are many regexes and they are expensive to run, and the result is not always as good as its website counterpart (the original text is entered by users who cut and paste from any source: PDF, RTF, etc.). So I want to suggest to my boss that we change the format the mobile client receives, so that I don't have to parse HTML. The question is: what could this format be?
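One candidate, offered only as a sketch, is a JSON array of typed blocks that the server produces once, so the client just walks a list instead of running regexes. The `Block` type, its `kind` and `value` fields, and the two helper functions are all invented for this illustration; a real format would need more block kinds (lists, links, emphasis) depending on what the users paste.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Block:
    kind: str   # "paragraph" or "image" in this sketch
    value: str  # paragraph text, or the image URL


def encode_blocks(blocks):
    """Serialize a list of Block objects to the JSON the server would send."""
    return json.dumps([asdict(block) for block in blocks])


def decode_blocks(payload):
    """Rebuild the list of Block objects on the receiving side."""
    return [Block(**item) for item in json.loads(payload)]
```

The client then renders each block with whatever native widget fits its kind, and all HTML clean-up stays on the server.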