JTidy fails to parse HTML

So I was trying to evaluate a couple of the HTML parsers and gave JTidy a try. Trying to parse this URL:
http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/TagNode.html
Gives these errors:
line 1 column 56,258 - Error: missing '>' for end of tag
line 1 column 56,258 - Error: is not recognized!
It reports line 1 because it reads the whole page in as a single line, but this is the line that JTidy fails on:
<li>//div[last() >= 4]//./div[position() = last()])[position() > 22]//li[2]//a</li>
My code is pretty simple:
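import java.io.ByteArrayInputStream;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;
// Parse the fetched page into a DOM and collect its anchor elements.
Tidy tidy = new Tidy();
Document document = tidy.parseDOM(new ByteArrayInputStream(this.getHtml().getBytes()), null);
NodeList anchorTags = document.getElementsByTagName("A");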
Is this just a bug in JTidy or am I doing something wrong? I've evaluated about 6 others so far and none of them have had a problem on this page.

Related

How to parse invalid JSON containing an invalid number

I work with a legacy customer who sends me webhook events. Sometimes their system sends me a value that looks like this
[{"id":"LXKhRA3RHtaVBhnczVRJLdr","ecc":"0X6","cph":"X1X4X77074", "ts":16XX445656000}]
I am using Python's json.loads to parse the data sent to me. Here ts is an invalid number, and Python raises json.decoder.JSONDecodeError whenever I try to parse this string.
It is okay with me to get None in the ts field if it cannot be parsed.
What would be a smart (& possibly generic) way to solve this problem?
This may not be so generic, but you can try using yaml to load:
import yaml
s = '[{"id":"LXKhRA3RHtaVBhnczVRJLdr","ecc":"0X6","cph":"X1X4X77074","ts":16XX445656000}]'
yaml.safe_load(s)
Output:
[{'id': 'LXKhRA3RHtaVBhnczVRJLdr',
'ecc': '0X6',
'cph': 'X1X4X77074',
'ts': '16XX445656000'}]
If the problem is always in the ts key, and this value is always a string of numbers and letters, you could just remove it before trying to parse:
import json
import re
jstr = """[{"id":"LXKhRA3RHtaVBhnczVRJLdr","ecc":"0X6","cph":"X1X4X77074", "ts":16XX445656000}]"""
jstr_sanitized = re.sub(r',?\s*\"ts\":[A-Z0-9]+', "", jstr)
jobj = json.loads(jstr_sanitized)
# [{'id': 'LXKhRA3RHtaVBhnczVRJLdr', 'ecc': '0X6', 'cph': 'X1X4X77074'}]
Regex explanation:
,?\s*\"ts\":[A-Z0-9]+
,? Zero or one commas
\s* Any number of whitespace characters
\"ts\": Literally "ts":
[A-Z0-9]+ One or more uppercase letters or numbers
Alternatively, you could catch the JSONDecodeError and look at its pos attribute for the offending character. Then, you could either remove just that character and try again, or look for the next space, comma, or bracket and remove characters until that point before you try again.
import json
jstr = """[{"id":"LXKhRA3RHtaVBhnczVRJLdr","ecc":"0X6","cph":"X1X4X77074", "ts":16XX445656000}]"""
while True:
    try:
        jobj = json.loads(jstr)
        break
    except json.JSONDecodeError as ex:
        # Drop the single character the parser choked on, then retry.
        jstr = jstr[:ex.pos] + jstr[ex.pos+1:]
This mangles the output so that the ts key is now a valid integer (after removing the Xs) but since you don't care about that anyway, it should be fine:
[{'id': 'LXKhRA3RHtaVBhnczVRJLdr',
'ecc': '0X6',
'cph': 'X1X4X77074',
'ts': 16445656000}]
Since you'd end up repeatedly re-parsing the initial valid part, this is probably not a great idea if you have a huge JSON string or many places that could throw an error. It should be fine for the kind of example you have shown.
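The other variant mentioned above (removing everything up to the next comma or bracket instead of one character at a time) could look roughly like this. Since getting None for an unparseable field is acceptable, this sketch replaces the whole offending token with null rather than deleting characters. The name loads_nulling_bad_values and the max_fixes limit are made up for this illustration, and it assumes the broken value is an unquoted token that contains no commas, colons, or brackets.

```python
import json

def loads_nulling_bad_values(text, max_fixes=100):
    """Parse JSON, replacing unparseable unquoted values with null (assumed helper)."""
    for _ in range(max_fixes):
        try:
            return json.loads(text)
        except json.JSONDecodeError as ex:
            # Walk back to the delimiter that precedes the bad token...
            start = ex.pos
            while start > 0 and text[start - 1] not in ":,[":
                start -= 1
            # ...and forward to the delimiter that follows it.
            end = ex.pos
            while end < len(text) and text[end] not in ",}]":
                end += 1
            if start == end:
                raise  # nothing to replace, so this is not a "bad value" error
            text = text[:start] + " null" + text[end:]
    raise ValueError("gave up after %d repair attempts" % max_fixes)

raw = '[{"id":"LXKhRA3RHtaVBhnczVRJLdr","ecc":"0X6","cph":"X1X4X77074", "ts":16XX445656000}]'
events = loads_nulling_bad_values(raw)
print(events[0]["ts"])  # None
```

The retry limit is there because some structural errors (a missing colon, for example) cannot be repaired this way and would otherwise loop forever.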

JSONDecodeError: Expecting value: line 1 column 1 (char 0) while getting data from Pokemon API

I am trying to scrape the pokemon API and create a dataset for all pokemon. So I have written a function which looks like this:
import requests
import json
import pandas as pd
def poke_scrape(x, y):
    '''
    A function that takes in a range of pokemon (based on pokedex ID) and returns
    a pandas dataframe with information related to the pokemon using the Poke API
    '''
    # GATHERING THE DATA FROM API
    url = 'https://pokeapi.co/api/v2/pokemon/'
    ids = range(x, (y+1))
    pkmn = []
    for id_ in ids:
        url = 'https://pokeapi.co/api/v2/pokemon/' + str(id_)
        pages = requests.get(url).json()
        # content = json.dumps(pages, indent = 4, sort_keys=True)
        if 'error' not in pages:
            pkmn.append([pages['id'], pages['name'], pages['abilities'], pages['stats'], pages['types']])
    # MAKING A DATAFRAME FROM GATHERED API DATA
    cols = ['id', 'name', 'abilities', 'stats', 'types']
    df = pd.DataFrame(pkmn, columns=cols)
    return df
The code works fine for most pokemon. However, when I run poke_scrape(229, 229) (so trying to load ONLY the 229th pokemon), it raises the JSONDecodeError quoted in the title.
So far I have tried using json.loads() instead, but that has not solved the issue. What is even more perplexing is that this specific pokemon has loaded before, and the same issue previously occurred with another ID; otherwise I could just manually enter the stats for the pokemon that fails to load into my dataframe. Any help is appreciated!
Because of the way the PokeAPI works, some links to the JSON data for a pokemon only load when the link ends with a '/' (such as https://pokeapi.co/api/v2/pokemon/229/ vs https://pokeapi.co/api/v2/pokemon/229: the first link works and the second returns "not found"). Other IDs, however, respond with an error because of the added '/', so I fixed the issue with a few if statements right after the start of the for loop near the beginning of the function.
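The "few if statements" are not shown above, so the following is only a sketch of one way to do it: try the URL with a trailing slash first, then fall back to the URL without it. The helper name get_json_with_fallback and the injected fetch callable are assumptions made so the example runs without network access; with requests, fetch could be a small wrapper that returns (response.status_code, response.text).

```python
import json

def get_json_with_fallback(url, fetch):
    """Return parsed JSON from url, trying it with and without a trailing '/'.

    fetch is any callable taking a URL and returning (status_code, body_text).
    Returns None if neither form of the URL yields valid JSON.
    """
    base = url.rstrip("/")
    for candidate in (base + "/", base):
        status, body = fetch(candidate)
        if status != 200:
            continue  # e.g. "Not Found" for this form of the URL
        try:
            return json.loads(body)
        except json.JSONDecodeError:
            continue  # body was not JSON, so try the other form
    return None

# Stand-in for the real API: only the URL ending in '/' answers with JSON.
def fake_fetch(url):
    if url.endswith("/229/"):
        return 200, '{"id": 229, "name": "placeholder"}'
    return 404, "Not Found"

data = get_json_with_fallback("https://pokeapi.co/api/v2/pokemon/229", fake_fetch)
print(data)
```

Checking the status code before decoding also avoids calling .json() on a plain-text error page, which is what produces "Expecting value: line 1 column 1 (char 0)".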

Python 3: replacing special characters in a .csv file after converting it from JSON

I am trying to develop a program using Python 3.6.4 that converts a JSON file into a CSV file; we also need to clean the data in the CSV file. For example:
My JSON File:
{"emp":[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},
{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}
Problem 1:
After converting it to CSV, a blank row appears between every two rows, like this:
**Name email Des**
[<BLANK ROW>]
Bo#b bob#gmail.com Unknown
[<BLANK ROW>]
Martin mar#tin#gmail.com D#eveloper
Problem 2:
In my code I hard-code the key (emp), but I need to determine it dynamically.
import json
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read()
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
emp_data = employee_parsed['employee']
This is because we will not know the structure or content of upcoming JSON files in advance.
Problem 3:
I also need to remove all # characters from the CSV file.
For solving Problem 3, you can use .replace (https://www.tutorialspoint.com/python/string_replace.htm).
For problem 2, you can use the dictionary keys and then get the zeroth item out of it.
import json
fobj = open("D:/Users/shamiks/PycharmProjects/jsonSamle.txt")
jsonCont = fobj.read().replace("#", "")
print(jsonCont)
fobj.close()
employee_parsed = json.loads(jsonCont)
# In Python 3, dict.keys() returns a view that cannot be indexed, so convert it to a list first.
first_key = list(employee_parsed.keys())[0]
emp_data = employee_parsed[first_key]
I can't solve problem 1 without more code to see how you are exporting the result. It may be that your data has newlines in it, in which case you could add .replace("\n","") and/or .replace("\r","") after the previous replace, so the line would read fobj.read().replace("#", "").replace("\n", "").replace("\r", "").
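Another common cause of a blank row after every record is opening the output file on Windows without newline="" before handing it to the csv module; the csv documentation recommends always passing it. Without the export code this is only a guess at the cause. The sketch below puts the three pieces together under that assumption. The function name json_to_csv is invented for the example, it assumes the first top-level key holds a list of flat records that share the keys of the first record, and it writes to a temporary directory only so that it can be run anywhere.

```python
import csv
import json
import os
import tempfile

def json_to_csv(json_text, csv_path):
    """Write the list stored under the first top-level key of json_text to csv_path."""
    parsed = json.loads(json_text.replace("#", ""))  # problem 3: drop every '#'
    first_key = next(iter(parsed))                    # problem 2: no hard-coded key
    rows = parsed[first_key]
    # problem 1: newline="" keeps Python from translating the writer's "\r\n"
    # line endings again, which shows up as blank rows on Windows.
    with open(csv_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

sample = ('{"emp":[{"Name":"Bo#b","email":"bob#gmail.com","Des":"Unknown"},'
          '{"Name":"Martin","email":"mar#tin#gmail.com","Des":"D#eveloper"}]}')
path = os.path.join(tempfile.mkdtemp(), "employees.csv")
json_to_csv(sample, path)
with open(path, newline="") as f:
    print(f.read())
```

Note that stripping '#' from the raw text also changes the e-mail addresses (bob#gmail.com becomes bobgmail.com), which matches the request to remove every '#'.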

How do I pass pandoc_options as output_options to rmarkdown::render()

I have an Rmd file that renders into html correctly almost all of the time. However, it does not render correctly when pandoc (used in the rendering process) finds 4 spaces in the html and at that point, interprets that I want to render a markdown code snippet instead of html.
I have been told that I can turn off the markdown_in_html_blocks feature by doing something like this:
pandoc -f markdown-markdown_in_html_blocks
I have tried calling pandoc directly rather than it being called implicitly by
rmarkdown::render()
but couldn't get that syntax to work. Being able to specify this option (-markdown_in_html_blocks) directly when I call render() is preferred anyway. Here is the latest of what I have tried, without success:
Base case: works but HTML output file is malformed / has a code block instead of the data that I want to display in the table.
render("reports/Pacing.Rmd")
Attempted fix: not working
rmdFmt <- rmarkdown_format("-markdown_in_html_blocks")
pandocOpts <- pandoc_options(to = "html", from = rmdFmt)
render("reports/Pacing.Rmd", output_format = "html_document", output_file = NULL, output_dir = NULL, output_options = pandocOpts)
Error message: Error in (function (toc = FALSE, toc_depth = 3, toc_float = FALSE, number_sections = FALSE, :
argument 1 matches multiple formal arguments
I have tried other syntax to express that I want to turn off markdown_in_html_blocks but no luck.
Given the following document test.Rmd...
---
title: Test
output: html_document
---
<table>
<tr>
<td>*one*</td>
<td>[a link](https://google.com)</td>
</tr>
</table>
...you can disable the markdown_in_html_blocks extension via
rmarkdown::render("test.Rmd",
output_options = list(md_extensions = "-markdown_in_html_blocks"))
md_extensions is one of the arguments that can be passed to rmarkdown::html_document (see ?rmarkdown::html_document for other arguments).
That seems to be an open issue, but a simpler way to turn such a feature on or off is to update the YAML header in the Rmd file directly. This should work in your case:
output:
  html_document:
    pandoc_args: [
      "-f", "markdown-markdown_in_html_blocks"
    ]

Converting from HTML::Template to PDF::FromHTML says invalid XML

I create an HTML file using HTML::Template. The resulting code is valid XML/HTML (checked against an XML validator), but when I convert it to PDF using PDF::FromHTML, I get an "invalid token in xml file" message.
I tried changing the first declaration line from doctype to xml, and also removing it, but nothing works. XML::Simple, PDF::API2, and XML::Writer are at their latest versions.
Any idea what is happening?
# create template object and store to verify
shout('s',"create template from $str_filepath") if ($bool_DEBUG);
$str_mytemplate = HTML::Template->new(filename => $str_filepath, case_sensitive => 0, no_includes => 1 );
$str_mytemplate->param(\%strct_toreplace);
$str_filepath = envDir('temp').newID().'.html';
shout('',"template created, storing to : $str_filepath") if ($bool_DEBUG);
if (open(FILE, '>', $str_filepath)) {
    print FILE $str_mytemplate->output;
    close (FILE);
}
# generate pdf from created file
shout('p',"Creating PDF ") if ($bool_DEBUG);
$pdf_this = PDF::FromHTML->new( encoding => 'utf-8' );
$pdf_this->load_file($str_filepath);
$pdf_this->convert( LineHeight => 10, Landscape => 1, PageSize => 'Letter', );
shout('p',"Display PDF") if ($bool_DEBUG);
print header(-type=>'application/pdf', -charset=>'UTF-8');
print $pdf_this->write_file();
$bool_DEBUG and shout() are a variable and a procedure used to set and display messages in debugging mode.
Html code generated via template: http://www.etoxica.com/examplecode.html
Template used: http://www.etoxica.com/exampletemplate.tmpl
Message displayed:
SECTION: Creating PDF
Software error:
not well-formed (invalid token) at line 19, column 13, byte 430 at /usr/local/lib64/perl5/XML/Parser.pm line 187.
at /home/grupo/perl/usr/share/perl5/PDF/FromHTML.pm line 141.
Summary: Found the problem (I guess) ;)
Consider the following lines:
<td>
Some line of data
<br/>
A second line of data
</td>
When PDF::FromHTML reads this, it reports a malformed token on the 5th line, specifically at the slash '/' of the </td> tag. That is not the real problem, though: the error is caused by the <br/> tag inside the <td></td>.
If it is changed to <br> or <br />, no error is reported. I don't know whether using <br> is good HTML practice for XML compatibility, even though that is how the W3C defines the br element.