Generating an iXBRL file from JSON and XLSX

This is my first question here.
I am working on a project for an accounting site that needs to generate an iXBRL file from account details stored in JSON and XLSX format.
Has anyone worked on something similar who can walk me through how to go about it?

Welcome, @Abiola Aribisala.
An iXBRL file, also known as Inline XBRL or the XHTML syntax of XBRL, requires two things:
The "print friendly" part, in "raw" XHTML, that a human user can look at;
Extra tags within this XHTML (they are in a namespace specific to XBRL), which are the machine-readable part.
Thus, in order to produce Inline XBRL syntax, you first need to have a print friendly version in a format that can be converted to XHTML (like Word, etc.), as this cannot be automated just by reading from JSON. I imagine that if the Excel file is nicely formatted, it might be possible to convert it to some "raw" XHTML in some way, too.
Second, for the tags, you need a data source with all the contexts, characteristics, etc. for each fact value. If your JSON data is in xBRL-JSON format, it should contain this information. Otherwise, it requires extra work.
Finally, a challenge is knowing which tag to put where in the XHTML, i.e. "merging" the print version with the data. In a regular setup, both come from a common source that generated the print version as well as the machine-readable data. That way, this common source can directly generate the Inline XBRL file, which is best for quality and correctness.
If the binding between the print version and the data is not available, one could in theory put all the tags in an ix:hidden section in the XHTML; however, this defeats the purpose of tagging the data exactly where it appears on the XHTML page, i.e., it makes it less interactive.
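To make the "merging" step concrete, here is a minimal Python sketch of wrapping one displayed number in an Inline XBRL tag. The fact dictionary, the concept name ex:Revenue, the context id c2023 and the helper tag_fact are all hypothetical; real values have to come from your taxonomy and your data source.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace of Inline XBRL 1.1; it has to be declared on the XHTML root element.
IX_NS = "http://www.xbrl.org/2013/inlineXBRL"
XHTML_ROOT_OPEN = f'<html xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="{IX_NS}">'


def tag_fact(fact):
    """Wrap the displayed value of one numeric fact in an ix:nonFraction element."""
    attrs = {
        "name": fact["concept"],       # taxonomy concept (hypothetical here)
        "contextRef": fact["context"],  # id of a context defined elsewhere
        "unitRef": fact["unit"],        # id of a unit defined elsewhere
        "decimals": fact["decimals"],
    }
    attr_text = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<ix:nonFraction {attr_text}>{escape(fact['value'])}</ix:nonFraction>"


# Hypothetical fact, e.g. one entry read from the JSON data source.
fact = {
    "concept": "ex:Revenue",
    "context": "c2023",
    "unit": "GBP",
    "decimals": "0",
    "value": "1250",
}

# The tag goes exactly where the number appears in the print-friendly XHTML.
cell = f"<td>{tag_fact(fact)}</td>"
print(cell)
```

This only covers a single table cell. A complete file also needs an ix:header section with the contexts, units and schema reference, and numbers displayed with thousands separators additionally need a format attribute, both of which are left out here.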

Related

Cutting a config file down to size

Morning!
I've got an app with a config file that's become unwieldy - many switches with no intuition as to which combinations are valid. Right now, all the switches are stored in an XML file. The config file specifies inputs for a large HPC job.
I'm thinking of writing a formal grammar for a run, that is, for the combinations that are acceptable. From parsing it, the switches needed would be inferred automatically. The values would still be read from the XML file, but only when needed.
Is this sort of approach reasonable? How would I go about implementing a grammar without a parser?
If I understand you correctly, you want to implement a Domain Specific Language (DSL), the purpose of which is to specify validation rules for the contents of an XML-based configuration file.
Some people implement a DSL by defining a parser specific to the needs of the DSL. However, some other people shoehorn the semantics of their DSL into the syntax of an existing file format, such as XML or JSON. So if you want to avoid having to write a parser, you could express your DSL in XML syntax.
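As a rough illustration of the second option, here is a Python sketch in which the validation rules are themselves written in XML, so the standard XML parser does all the parsing. The rule vocabulary (when, equals, requires) and the switch names are invented for this example and are not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule file: "if switch X has value Y, switch Z must be present".
RULES_XML = """
<rules>
  <rule when="solver" equals="implicit" requires="max_iterations"/>
  <rule when="output" equals="checkpoint" requires="checkpoint_dir"/>
</rules>
"""

# Hypothetical configuration file for one run.
CONFIG_XML = """
<config>
  <switch name="solver" value="implicit"/>
  <switch name="output" value="none"/>
</config>
"""


def load_switches(config_xml):
    """Return the switches of a config document as a name -> value dict."""
    root = ET.fromstring(config_xml)
    return {s.get("name"): s.get("value") for s in root.iter("switch")}


def check(rules_xml, switches):
    """Return a list of human-readable rule violations."""
    problems = []
    for rule in ET.fromstring(rules_xml).iter("rule"):
        if switches.get(rule.get("when")) == rule.get("equals"):
            needed = rule.get("requires")
            if needed not in switches:
                problems.append(
                    f"{rule.get('when')}={rule.get('equals')} requires {needed}"
                )
    return problems


problems = check(RULES_XML, load_switches(CONFIG_XML))
print(problems)
```

Only the first rule fires here, because output is set to none. Richer rules (mutual exclusion, value ranges) would need more attributes or nested elements, at which point a dedicated grammar may become the cleaner choice.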

JSON and HTML trying to understand

According to a post on Stack Overflow called "What is JSON and why would I use it?", "web services used XML as their primary data format for transmitting back data, but since JSON appeared, it is preferred method." Why do most web services use JSON over XML? Is it because it's a better method for data interchange?
XML was designed primarily for document formats, e.g. papers in scientific journals. It contains many features that aren't needed for simple data interchange, and these features can get in the way when you are processing XML because they can't be easily represented in JavaScript. So the code for processing the XML ends up a lot more complicated than it could be. By contrast, JSON is an exact match for the data structures JavaScript can handle natively. Of course, that problem could in principle be solved by using a language with better XML support than JavaScript (XSLT, for example), but unfortunately XSLT in the browser has never had the same level of investment put into it.
Additionally, for reasons I have never understood, the browser security folks decided that reading JSON from alien web sites (i.e. from a different domain from your HTML page) is safe, but reading XML from alien sites isn't. So if you switch from XML to JSON, you get rid of a lot of cross-site-scripting hassle.
JSON is less verbose and it is sufficient for simple data transmission, i.e. if you do not need any transformations (XSLT).
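The point about JSON matching native data structures is not limited to JavaScript. Here is a small Python sketch, with made-up sample data, that reads the same record from JSON and from XML:

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
json_text = '{"name": "Ada", "tags": ["admin", "dev"], "age": 36}'
xml_text = (
    '<user age="36"><name>Ada</name>'
    "<tags><tag>admin</tag><tag>dev</tag></tags></user>"
)

# JSON: one call yields dicts, lists, strings and numbers directly.
user = json.loads(json_text)

# XML: the tree has to be walked, and every mapping decision
# (attribute or child element, list or single item, string or number)
# is made by hand.
root = ET.fromstring(xml_text)
user_from_xml = {
    "name": root.findtext("name"),
    "tags": [tag.text for tag in root.iter("tag")],
    "age": int(root.get("age")),
}

print(user == user_from_xml)
```

The XML version is not hard, but the mapping code grows with every field, while the JSON version stays a single call.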

Should HTML be encoded before being persisted?

Should HTML be encoded before being stored in say, a database? Or is it normal practice to encode on its way out to the browser?
Should all my text based field lengths be quadrupled in the database to allow for extra storage?
Looking for best practice rather than a solid yes or no :-)
Is the data in your database really HTML or is it application data like a name or a comment that you just happen to know will end up as part of an HTML page?
If it's application data, I think it's best to:
represent it in a form that is native to the environment (e.g. unencoded in the database), and
make sure it's properly translated as it crosses representational boundaries (encode when you generate the HTML page).
If you're a fan of MVC, this also helps separate the view/controller from the model (and from the persistent storage format).
Representation
For example, assume someone leaves the comment "I love M&Ms". It's probably easiest to represent it in the code as the plain-text String "I love M&Ms", not as the HTML-encoded String "I love M&amp;Ms". Technically, the data as it exists in the code is not HTML yet, and life is easiest if the data is represented as simply and accurately as possible. This data may later be used in a different view, e.g. a desktop app. It may be stored in a database, a flat file, or an XML file, and perhaps later be shared with another program. It's simplest for the other program to assume the string is in the "native" representation for the format: "I love M&Ms" in a database or flat file and "I love M&amp;Ms" in the XML file. I would cringe to see the HTML-encoded value encoded again in an XML file ("I love M&amp;amp;Ms").
Translation
Later, when the data is about to cross a representation boundary (e.g. displayed in HTML, stored in a database, plain-text file, or XML file), it's important to make sure it is properly translated so it is represented accurately in a format native to that next environment. In short, when you go to display it on an HTML page, make sure it's translated to properly-encoded HTML (manually or through a tool) so the value is accurately displayed on the page. When you go to store it in the database or use it in a query, use escaping and/or prepared statements and bound variables to ensure the same conceptual value is accurately represented to the database. When you go to store it in an XML file, ensure it's XML-encoded.
Failure to translate properly when crossing representation boundaries is the source of injection attacks such as SQL injection. Be mindful of that whenever you are working with multiple representations/languages (e.g. Java, SQL, HTML, JavaScript, XML, etc.).
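A minimal Python sketch of translating at each boundary, using only the standard library. The table name and the sample comment are made up for the example:

```python
import html
import sqlite3
from xml.sax.saxutils import escape as xml_escape

comment = "I love M&Ms <3"  # plain text, exactly as the user typed it

# Boundary 1: the database. A bound parameter lets the driver handle
# the translation, and the stored value stays unencoded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (body TEXT)")
conn.execute("INSERT INTO comments (body) VALUES (?)", (comment,))
stored = conn.execute("SELECT body FROM comments").fetchone()[0]

# Boundary 2: HTML output. Encode only when generating the page.
html_fragment = f"<p>{html.escape(stored)}</p>"

# Boundary 3: XML output. Encode for XML when writing the XML file.
xml_fragment = f"<comment>{xml_escape(stored)}</comment>"

print(stored)
print(html_fragment)
print(xml_fragment)
```

The value read back from the database is identical to what was entered, so it can still be reused in a non-HTML view later.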
--
On the other hand, if you are really trying to save HTML page fragments to the database, then I am unclear on what you mean by "encoded before being stored". If it is strictly valid HTML, all the necessary values should already be encoded (e.g. &amp;, &lt;, etc.).
The practice is to HTML encode before display.
If you are consistent about encoding before displaying, you have done a good bit of XSS prevention.
You should save the original form in your database. This preserves the original, and you may want to do other processing on that and not on the encoded version.
Database vendor specific escaping on the input, html escaping on the output.
I disagree with everyone who thinks it should be encoded at display time. If it is encoded before it reaches the database, an attack is only possible if a developer purposely decodes it before displaying it. However, if you rely on encoding it when presenting it, there is always a chance that the step gets missed by some other newbie developer, like a new hire, or by a bad implementation. If it's sitting there unencoded, it's just waiting to pop out on the internet and spread like herpes. Losing the original data shouldn't be a concern: encode + decode should produce the same data every time. Just my two cents.
For security reasons, yes, you should first convert the HTML to entities and then insert it into the database. Attacks such as XSS are initiated when you allow users (or rather bad guys) to use HTML tags and then process/insert them into the database. XSS is one of the root causes of most security holes. So you definitely need to encode your HTML before storing it.

Converting a large set of Word documents automatically into XML, modifying them, and then converting them into LaTeX, PDF, HTML

I have a set of about 400 documents in Word that are part of a Quality Management System. Word is causing me a lot of grief because a) it handles images in large documents poorly, b) the layout sometimes gets busted, and c) it is cumbersome to configure the documentation for different clients.
I can convert single documents by saving them as XML/HTML or text and converting them manually into LaTeX, but that is not feasible for 400 documents. I know that I can print Word documents directly to PDF with tools like PrimoPDF, but that is not flexible enough because I need to modify the content.
Is there a way to keep the structure of the document (plain text, headings, tables, images) and transform it into XML? Afterwards I would like to transform the XML into HTML, LaTeX and PDF according to the choices of our clients, and also modify the content. Is XSLT the way to go for transforming the XML to the other formats?
Thanks for any advice.
You could convert your documents to Word 2007 format. Office 2007 documents are ZIP packages of XML files: just change the file extension to .zip and unzip. Also, Microsoft publishes an API for working with Office 2007 documents that is higher-level than working with the XML tags.
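As a sketch of what working with the raw XML looks like, the following Python function opens a .docx file as a ZIP archive and pulls the text of each paragraph out of word/document.xml. The function name is made up, and it deliberately ignores headings, tables and images; it is only meant to show that the content is reachable with standard tools.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs(docx_path):
    """Return the plain text of every paragraph in a .docx file."""
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    result = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        result.append(text)
    return result
```

Looping this over a directory of converted files would give a starting point for the XML-to-LaTeX or XML-to-HTML step, for example via XSLT on the same document.xml.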
For batch converting MS Word to something else you might have a look at OpenOffice.org.
OpenOffice has a (command line) batch mode for mass conversions. You can also have a look at JodConverter which converts documents using just that mechanism.
That way you could mass convert Microsoft Word to some other format OpenOffice.org supports. Perhaps text, perhaps RTF, perhaps OpenOffice XML.
You then have a hopefully easier format to convert to LaTeX.
Have a search for Word and OpenOffice right here at Stack Overflow, you'll find results like this one about Word to HTML conversion.
There is advice on Word <--> LaTeX conversions at TUG (TeX User Group):
http://www.tug.org/utilities/texconv/pctotex.html
that may be worth having a look at to see if any of the suggestions and methods meet your requirements.
Not sure how well it works, but there is Word2tex.

What does the URL in an XML file mean

I always find this URL (or a similar one) in HTML files, XML, XSD...
Like "http://www.w3.org/2001/XMLSchema" or "http://www.w3.org/2001/XMLSchema-instance"
I always wonder what those URLs mean.
Even offline, the XML or HTML document works without changes. What is the benefit of linking to those URLs?
Thanks
Those URLs do not necessarily point to any website/server. They are a convenient naming mechanism. The idea is since every company will have a unique website, using that as their namespace will avoid clashes. Hence better interoperability. Hence the custom.
Namespaces in XML 1.0 Specification
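A short Python sketch showing that the URL is only a name: the parser never fetches it, and what identifies an element is the namespace URI plus the local name, not the prefix. The sample documents are made up.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Parsing works offline: the namespace URI is never downloaded.
doc = f'<xs:schema xmlns:xs="{XS}"><xs:element name="price"/></xs:schema>'
root = ET.fromstring(doc)

# ElementTree exposes the qualified name as {namespace-uri}local-name.
print(root.tag)

# A different prefix bound to the same URI names the same element.
other = f'<foo:schema xmlns:foo="{XS}"/>'
same_name = ET.fromstring(other).tag == root.tag
print(same_name)
```

Two vocabularies can therefore both define an element called schema without clashing, as long as their namespace URIs differ.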
It's the XML Schema.
"An XML schema provides a view of the document type at a relatively high level of abstraction."