How to display non-ASCII characters from an XML output - html

I get this output in an XML element:
&#163;111.00
It should be £111.00.
How can I sort this out so that all Unicode characters are displayed rather than the entity code? I am using the Linux tool wget to fetch the XML file from the Internet. Perhaps some sort of converter?
I am viewing the file in PuTTY. I am parsing the file, and I want to clean the input before parsing.
I am using xml_grep2 to get the elements I want and then cat filename | while read .....

OK, I'm going to close this question now.
After parsing the file with xml_grep2 I was able to get clean output; however, I was seeing a stray Â character in the file. I changed the PuTTY character-set setting from ISO-8859 to UTF-8 to resolve that.

You can use HTML::Entities to replace the entities with the literal characters. I don't know how good its coverage is, though. There are bound to be similar tools for other languages if you are not comfortable with Perl. http://metacpan.org/pod/HTML::Entities
sh$ echo '&#163;111.00' | perl -CSD -MHTML::Entities -pe 'decode_entities($_)'
£111.00
This won't work if the HTML::Entities module is not installed. If you need to install it, there are numerous tutorials about CPAN on the Internet.
Edit: Added a usage example. The -CSD option might not be necessary on your system, but on OS X at least, I got garbage output without it.
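To clean the whole file before your own parsing loop, you can run the decoding step in the middle of the pipeline. A minimal sketch, assuming xml_grep2 is invoked with an XPath and a file name; the //price expression and prices.xml are illustrative, not from the original question:
sh$ xml_grep2 '//price' prices.xml \
      | perl -CSD -MHTML::Entities -pe 'decode_entities($_)' \
      | while read -r value; do
          echo "got: $value"
        done
This decodes the entities once, up front, so the while read loop only ever sees literal characters.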


What is the difference between a .JSON file and .JL file?

I have both a JSON file and a JL file on my computer, but when I open them in Notepad their structure looks the same. What is the difference between them? Where should I use each one?
Actually, at the time I asked this question I didn't know that "the file type is no guarantee of what is inside it". In other words, I thought that for every file extension there is a separate specification, and that if a file's name ends in ".something", there is a unique specification for it. But now I know that I can create a file, write anything I want into it, and name it ".peyman", and there is nothing special about it!
What was that file? It was in the JSON Lines file format.
Where did I find it? In Scrapy: besides writing scrapy crawl name -o file.json, I saw that somebody wrote scrapy crawl name -o file.jl. I tried that, and the file was 99% like a JSON file, so I wondered and asked this question here.
So:
What is the difference between a .JSON file and a .JL file? Now I know that the better question is "What is the difference between a .JSON file and a .JL file in Scrapy?"
JSON Lines is like JSON but without the "[" and "]" at the beginning and the end. Each item is a complete JSON value on its own line, which is why Scrapy uses it: items can be appended to the file one at a time as they are scraped.
There are quite a few things that a .jl file extension could refer to. If I remember correctly, it originally had something to do with the window manager Sawfish.
Sawfish was developed in Lisp, and a .jl file was a Lisp source file for Sawfish. However, I'm guessing (because you said there was JSON-like sauce inside) that's not what you're asking about.
In that case, I do recall a few projects on GitHub: JSON lambda and Julia.
Either of those may be the reason you're seeing JSON in a .jl file. Without more information on where you got that file, or what it was part of, though, we won't be able to help you much.
That said, file extensions rarely matter on Linux. On Windows they're far more important, but on Linux you could literally append anything to a file as an "extension" (i.e. thisfile.whatever) and still open it in an editor. The same is true for most editors on Windows.
Likely the packager of that file decided on .jl for their own reasons, rather than following the convention of using .json.
I guess the JL extension is used for many purposes, but JL is also one of the few extensions used for JSON Lines (also known as NDJSON or JSONL).
This format can contain multiple JSON values, one JSON value (with "compact" formatting) per line, and is useful for e.g. streaming or logging.
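To make the difference concrete, here is the same data in both layouts (the file names are illustrative):
$ cat items.json
[{"name": "first"}, {"name": "second"}]
$ cat items.jl
{"name": "first"}
{"name": "second"}
Because each line of the .jl file is a complete JSON value, a producer can append new items with a plain echo ... >> items.jl and never needs to rewrite the surrounding "[" and "]".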

Messed up encoding using htlatex

I just used tex4ht and htlatex to convert a LaTeX document into HTML, and now I am having serious trouble integrating this HTML document into a web site I'm making (I'm using Laravel, though). I think one of the reasons is that the htlatex output files are not UTF-8 encoded. If I just include the file through Laravel views and controllers without any modification, the UTF-8 characters are not displayed, and if I convert the file to UTF-8, all the non-ASCII characters turn weird in Notepad and I would have to rewrite them one at a time (the HTML file contains 2000+ lines, so I can't do that).
I'm wondering how I can solve this problem. Is putting the input HTML in an iframe tag any good as a solution? Or is there a way to encode this file to UTF-8 without messing with its content? I'm so lost...
tex4ht uses Latin 1 as the default encoding; characters unsupported by this encoding are output as XML entities. You can request UTF-8 output using the following command:
htlatex filename.tex "xhtml,charset=utf-8" " -cunihtf -utf8"
As an alternative, you can use make4ht with the -u option:
make4ht -u filename.tex
make4ht is a replacement for htlatex with many more features.
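If you already have a Latin-1 file from an earlier run and just want to re-encode it, iconv can convert it without touching the markup; a minimal sketch with an illustrative file name (remember to also update any <meta charset=...> declaration inside the file afterwards):
$ iconv -f ISO-8859-1 -t UTF-8 filename.html > filename-utf8.html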

notepad++ handling / converting binary command characters

I'm using Notepad++. I want to copy my code and then paste it into a simple textarea of a little program (which obfuscates variables, removes blank lines & comments) and returns the result.
The problem is that my code contains binary control characters (like the NUL shown in white writing on a black background) which the program can't handle.
My question is: is there a simple way to convert these control characters into something safe, run the program, and then convert them back?
Thanks
The SynWrite editor can do this conversion of the NUL character: it has text converters (Run menu), described in a help-file topic.
PSPad has a similar text-converter feature (Tools menu).
Or you can use a regex to replace [\x00-\x19] with a new string.
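For the round trip the question asks for, a minimal sketch using Perl one-liners; the <CTRL-XX> placeholder format is an arbitrary choice, assumed not to occur anywhere in your code:
$ perl -pe 's/([\x00-\x19])/sprintf("<CTRL-%02X>", ord($1))/ge' code.txt > safe.txt
(run safe.txt through the obfuscator, save the result as obfuscated.txt)
$ perl -pe 's/<CTRL-([0-9A-F]{2})>/chr(hex($1))/ge' obfuscated.txt > restored.txt
The first command rewrites every control character as a visible <CTRL-XX> token; the second turns the tokens back into the original bytes.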

Convert JSON data to BSON on the command line

I'm on an Ubuntu system, and I'm trying to write a testing framework that has to (among other things) compare the output of a mongodump command. This command generates a bunch of BSON files, which I can compare. However, for human readability, I'd like to convert these to nicely formatted JSON instead, which I can do using the provided bsondump command. The issue is that this appears to be a one-way conversion.
While I can work around this if I absolutely need to, it would be a lot easier if there were a way to convert back from JSON to BSON on the command line. Does anyone know of a command-line tool to do this? Google seems to have come up dry.
I haven't used them, but bsontools can convert from JSON, XML, or CSV.
As @WiredPrairie points out, the conversion from BSON to JSON is lossy, and it makes no sense to want to go back the other way. Workarounds include using mongoimport instead of mongorestore, or just using the original BSON. See the comments for more details. (Adding this answer mainly so I can close the question.)
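A minimal mongoimport sketch, with illustrative database, collection, and file names:
$ mongoimport --db test --collection items --file items.json
By default mongoimport expects one JSON document per line, which is roughly the shape bsondump emits, so it slots naturally into the workflow described in the question.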
You can try beesn; it converts data both ways. For your direction - JSON -> BSON - use the -x switch.
Example:
$ beesn -x -i test-data/01.json -o my.bson
Disclaimer: I am an author of this tool.
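Since the goal is a testing framework, one sanity check is to round-trip the data and compare; a hedged sketch with illustrative file names (the diff will only be clean if both sides use the same extended-JSON formatting):
$ beesn -x -i test-data/01.json -o my.bson
$ bsondump my.bson > roundtrip.json
$ diff test-data/01.json roundtrip.json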

HTML downloading and text extraction

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content?
Spidering is not required, but control over the downloaded file names and threading would be a bonus.
The platform is Linux.
wget -O - URL | html2ascii
Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).
See also: lynx.
Python's Beautiful Soup library lets you build a nice extractor.
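A minimal sketch, assuming the beautifulsoup4 package is installed (e.g. via pip) and using an illustrative URL and file name:
$ wget -O page.html http://example.com/
$ python3 - page.html <<'EOF'
import sys
from bs4 import BeautifulSoup

# Parse the downloaded page and print only its text content.
with open(sys.argv[1], encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
print(soup.get_text())
EOF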
I know that w3m can be used to render an HTML document and put the text content in a text file:
w3m -dump www.google.com > file.txt, for example.
For the rest, I'm sure that wget can be used.
Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element will have a "plaintext" attribute which should give you only the text. I was very successful using this combination in a lot of applications for quite some time.
Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.
Use wget to download the required HTML and then run html2text on the output files.
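A minimal sketch tying those two steps together; urls.txt (one URL per line) and the naming scheme are illustrative assumptions:
$ while read -r url; do
    name=$(printf '%s' "$url" | tr -c 'A-Za-z0-9' '_')
    wget -q -O "$name.html" "$url"
    html2text "$name.html" > "$name.txt"
  done < urls.txt
The tr step gives you control over the output file names; for the threading bonus, the same loop body could be fed to xargs -P instead.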