How to convert a huge Elisp data structure to JSON?

org-element-parse-buffer returns a huge tree even for a small Org file. I want to transform this tree into JSON. Apparently, json.el uses recursive functions to traverse cons cells, and as Elisp doesn't support tail recursion, invocation of json-encode quickly runs out of stack. If I increase max-lisp-eval-depth and max-specpdl-size, Emacs crashes.
How do I work around that and transform a huge tree structure into JSON? In general, what do I do when I have a huge data structure and a recursive function that may run out of stack?

Yes, json.el functions are recursive, but recursive functions called on Org-Element cause stack overflow not because org-element-parse-buffer returns a huge AST, but because it returns a circular list. A tree-recursive function on a circular list is like a squirrel in a cage.
I guess the idea behind the self-references in the AST it returns is that, while traversing it, you can go back to the parent at any point by simply running plist-get on the keyword :parent. I imagine this usage for traversing the AST up and down:
(let ((xs '#1=(:text "foo" :child (:text "bar" :parent #1#))))
(plist-get
(plist-get
xs
:child) ; (:text "bar" :parent (:text "foo" :child #0))
:parent)) ; (:text "foo" :child (:text "bar" :parent #0))
But JSON doesn't support circular lists, so you need to remove these self-references from the AST before trying to convert to any data serialization format. I haven't found the way to elegantly remove circular references in the AST, so I resorted to a dirty hack:
Convert the AST to a string
Remove references with regular expressions
Convert the string back to an Elisp data structure
Suppose I have an Org file called test.org with the following content:
* Heading
** Subheading
Text
Then variable tree contains the parsed Org data from this buffer: (setq tree (with-current-buffer "test.org" (org-element-parse-buffer))). Then to prepare this data for JSON export, I just run:
(car (read-from-string (replace-regexp-in-string ":parent #[0-9]+" "" (prin1-to-string tree))))
Even with all mentions of :parent removed, the new AST is still valid, so if the new AST is in variable tree2, then the following 3 expressions are equivalent:
(org-element-interpret-data tree2)
(with-current-buffer "test.org" (buffer-substring-no-properties 1 (buffer-end 1)))
"* Heading\n** Subheading\nText\n"
Note that for some reason org-element-interpret-data removes leading whitespace, so the above is not strictly true when your Org file contains indented lines.
Now all you need to do is to encode the new non-circular AST into JSON and write it into a file:
(f-write (json-encode tree2) 'utf-8 "test.json")
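The same idea, dropping the parent back-references before serializing, can be sketched without the string round-trip. This is a minimal illustration in Python rather than Elisp; the dictionary shape and the key name "parent" are assumptions made for the example, not the real Org AST layout.

```python
import json

def strip_parents(node, key="parent"):
    """Return a copy of a nested dict/list tree without any `key` entries."""
    if isinstance(node, dict):
        # The filter runs before the recursive call, so the back-reference
        # is never followed and the walk terminates.
        return {k: strip_parents(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_parents(child, key) for child in node]
    return node

# A tiny tree whose child points back at its parent (hypothetical shape).
root = {"text": "foo", "children": []}
child = {"text": "bar", "parent": root, "children": []}
root["children"].append(child)

try:
    json.dumps(root)
    circular = False
except ValueError:  # the json module refuses circular structures
    circular = True

encoded = json.dumps(strip_parents(root))
print(circular, encoded)
```

The copy is built bottom-up and the original tree is left untouched, so the parent links stay available for navigation while only the stripped copy is serialized.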
Notes
Elisp's cons cells are pairs of 2 slots: car and cdr. If the cdr of each cell points to another cons cell, we get a linked list. If both car and cdr hold plain values, we get a dotted pair. Therefore (1 . (2 . (3 . nil))) is equivalent to (1 2 3). But a cdr (or a car, for that matter) may point at any other cons cell, including one that appeared earlier in the list, giving rise to a circular linked list.
Exercise: create a complex tree data structure with several self-references to different subtrees. Then try traversing this tree and jumping by the self-references to get the idea.
With the ->> threading macro from the dash list manipulation library, the expression is equivalent to:
(->> tree prin1-to-string (replace-regexp-in-string ":parent #[0-9]+" "") read-from-string car)
(buffer-substring-no-properties 1 (buffer-end 1)) is like (buffer-string), but without annoying text properties attached.
f-write is a function from the third-party file manipulation library f that writes text to a file.

tl;dr
Here's how to remove references to parent and structure in an Org tree before encoding it to JSON:
(let* ((tree (org-element-parse-buffer 'object nil)))
(org-element-map tree (append org-element-all-elements
org-element-all-objects '(plain-text))
(lambda (x)
(if (org-element-property :parent x)
(org-element-put-property x :parent "none"))
(if (org-element-property :structure x)
(org-element-put-property x :structure "none"))))
(json-encode tree))

Related

How exactly does Clojure process function definitions?

I'm studying Clojure, and I've read that in Clojure a function definition is just data, i.e. parameters vector is just an ordinary vector. If that's the case, why can I do this
(def add (fn [a b]
(+ a b)))
but not this
(def vector-of-symbols [a b])
?
I know I normally would have to escape symbols like this:
(def vector-of-symbols [`a `b])
but why don't I have to do it in fn/defn? I assume this is due to fn/defn being macros. I tried examining their source, but they are too advanced for me so far. My attempts to recreate defn also fail, and I'm not sure why (I took example from a tutorial):
(defmacro defn2 [name param & body]
`(def ~name (fn ~param ~@body)))
(defn2 add [a b] (+ a b)) ;;I get "Use of undeclared Var app.core/defn2"
Can someone please explain, how exactly does Clojure turn data structures, especially symbols, into code? And what am I missing about the macro example?
Update: Apparently, the macro does not work because my project is actually in ClojureScript (in Clojure it does work). I did not think it mattered, but as I progress I discover more and more things that somehow don't work for me in ClojureScript.
Update 2 This helps: https://www.clojurescript.org/about/differences
A function is a first-class citizen as other data in Clojure.
To define a vector you use (vector ...) or the reader's syntactic sugar [...]. For a list it's (list ...) or '(...), where the quote keeps the list from being evaluated as a function call. For a set it's (set ...) or #{...}.
So the factory for a function is fn (in fact fn*, which comes from the Java core of Clojure; fn is a macro on top of it that handles destructuring and so on).
(fn args body)
is a form that returns a function, where args is a (possibly empty) vector of arguments and body is a series of Clojure expressions to be evaluated with args bound in the environment. If there is nothing to evaluate, it returns nil. There is also the syntactic sugar #(...), with %n as the nth argument and % as the first argument.
(fn ...) returns a value that is a function. So
(def my-super-function (fn [a b c d] (println "coucou") (+ a b c d)))
binds the symbol my-super-function with the anonymous function returned by (fn [a b c d] (println "coucou") (+ a b c d)).
(def my_vector [1 2 3])
binds the symbol my_vector with the vector [1 2 3]
List of learning resources: https://github.com/io-tupelo/clj-template#documentation
As @jas said, your defn2 macro looks fine.
The main point is that macros are an advanced feature that one almost never needs. A macro is equivalent to a compiler extension, and that is almost never the best solution to a problem. Also keep in mind that functions can do some things macros can't.
Another point: the syntax-quote (aka backquote) ` is very different from a single quote '. In your example you want the single quote for ['a 'b]. Even better would be to quote the entire vector form '[a b].
As to your primary question, it is poorly explained how source-file text is converted into code. This is a 2-step process. The Clojure Reader consumes text string data (from a file or a literal string) and produces data structures like lists, vectors, strings, numbers, symbols. The Clojure compiler takes these data structures as input and produces Java bytecode that can be executed.
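The two steps (text to data structure, then data structure to executable code) can be seen in isolation in other languages too. The following is only a loose analogy in Python, not Clojure's actual machinery: ast.parse plays the role of the reader and compile the role of the compiler. The source string is made up for the example.

```python
import ast

source = "[1, 2, 3][0] + 40"   # plain text, as it would sit in a file

# Step 1 ("reader"): text -> data structure (here a syntax tree object).
tree = ast.parse(source, mode="eval")

# Step 2 ("compiler"): data structure -> executable code object.
code = compile(tree, filename="<sketch>", mode="eval")

# Only now is anything actually run.
result = eval(code)
print(type(tree).__name__, result)
```

A macro, in these terms, is code that runs between step 1 and step 2 and rewrites the data structure before the compiler sees it.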
It is confusing because, when printed, one can't tell the difference between the text representation of a vector [1 2 3] and the text string that is input to the reader [1 2 3]. Ideally it would be color-coded or something. This problem doesn't exist in Java, etc since they don't have macros and hence there is no confusion between the source code (text) and the data structures used by a macro (not text).
For a more detailed answer on creating macros in Clojure, please see this answer.

Clojure: generating files containing clojure breaks with persistent lists

I asked a related question here: Clojure: How do I turn clojure code into a string that is evaluatable? It mostly works but lists are translated to raw parens, which fails
The answer was great but I realized that is not exactly what I needed. I simplified the example for stackoverflow, but I am not just writing out datum, I am trying to write out function definitions and other things which contain structures that contain lists. So here is a simple example (co-opted from the last question).
I want to generate a file which contains the function:
(defn aaa []
(fff :update {:bbb "bbb" :xxx [1 2 3] :yyy (3 5 7)}))
Everything after the :update is a structure I have access to when writing the file, so I call str on it and it emerges in that state. This is fine, but the list, when I load-file on this generated function, tries to call 3 as a function (as it is the first element in the list).
So I want a file which contains my function definition that I can then call load-file and call the functions defined in it. How can I write out this function with associated data so that I can load it back in without clojure thinking what used to be lists are now function calls?
You need to quote the structure prior to obtaining the string representation:
(list 'quote foo)
where foo is the structure.
Three additional remarks:
traversing the code to quote all lists / seqs would not do at all, since the top-level (defn ...) form would also get quoted;
lists are not the only potentially problematic type -- symbols are another one (+ vs. #<core$_PLUS_ clojure.core$_PLUS_@451ef443>);
rather than using (str foo) (even with foo already quoted), you'll probably want to print out the quoted foo -- or rather the entire code block with the quoted foo inside -- using pr / prn.
The last point warrants a short discussion. pr explicitly promises to produce a readable representation if *print-readably* is true, whereas str only produces such a representation for Clojure's compound data structures "by accident" (of the implementation) and still only if *print-readably* is true:
(str ["asdf"])
; => "[\"asdf\"]"
(binding [*print-readably* false]
(str ["asdf"]))
; => "[asdf]"
The above behaviour is due to clojure.lang.RT/printString's (that's the method Clojure's data structures ultimately delegate their toString needs to) use of clojure.lang.RT/print, which in turn chooses output format depending on the value of *print-readably*.
Even with *print-readably* bound to true, str may produce output inappropriate for clojure.lang.Reader's consumption: e.g. (str "asdf") is just "asdf", while the readable representation is "\"asdf\"". Use (with-out-str (pr foo)) to obtain a string object containing the representation of foo, guaranteed readable if *print-readably* is true.
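The distinction between a display string and a readable representation exists in other languages as well. As a rough analogy only (Python's str/repr are not Clojure's str/pr, and the sample value is made up), the sketch below shows that the readable form round-trips through a parser while the display form of a string loses its quotes.

```python
import ast

value = ["asdf", 3, (3, 5, 7)]

display = str("asdf")     # no quotes: fine for people, ambiguous for a parser
readable = repr("asdf")   # quotes included: can be parsed back into a string

# Writing the readable form and parsing it again yields an equal value.
round_tripped = ast.literal_eval(repr(value))
print(display, readable, round_tripped == value)
```

When generating source files, the readable form is the one to write out, for the same reason the answer recommends pr over str.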
Try this instead...
(defn aaa []
(fff :update {:bbb "bbb" :xxx [1 2 3] :yyy (list 3 5 7)}))
Wrap it in a call to quote to read it without evaluating it.

is this swing tablemodel code badly designed?

Context: I have a clojure-based crossword app whose main ui is a JTabbedPane with two tabs, a grid and a clue table. The clue table is a view over a vector of clues, but the vector itself is not the authoritative store of the data, but dynamically generated from a couple of internal data structures via an (active-cluelist) function, triggered by the clue tab being selected.
So this is the implementation of the clue table:
(def cluelist [])
(def update-cluelist)
(def model)
(defn make []
(let [column-names ["Sq" "Word" "Clue"]
column-widths [48 200 600]
table-model (proxy [AbstractTableModel] []
(getColumnCount [] (count column-names))
(getRowCount [] (count cluelist))
(isCellEditable [row col] (= col 2))
(getColumnName [col] (nth column-names col))
(getValueAt [row col] (get-in cluelist [row col]))
(setValueAt [s row col]
(let [word (get-in cluelist [row 1])]
(add-clue word s) ; editing a cell updates the main clue data
(def cluelist (assoc-in cluelist [row 2] s))
(. this fireTableCellUpdated row col))))
table (JTable. table-model)
]
; some pure display stuff elided
(def model table-model)
))
(defn update-cluelist []
(def cluelist (active-cluelist))
(.fireTableDataChanged model))
Someone in another discussion noted that it is a major code smell for (update-cluelist) to be manually calling fireTableDataChanged, because nothing outside the TableModel class should ever be calling that method. However, I feel this is an unavoidable consequence of the table being dynamically generated from an external source. The docs aren't too helpful - they state that
Your custom class simply needs to invoke one the following
AbstractTableModel methods each time table data is changed by an
external source.
which implicitly assumes that the CustomTableModel class is the authoritative source of the data.
Also there is a bit of a clojure/java impedance mismatch here - in java I would have had cluelist and update-cluelist be a private member and method of my TableModel, whereas in clojure cluelist and the table model are dynamically scoped vars that update-cluelist has access to.
My main problem is that there is not a lot of clojure/swing code around that I can look to for best practices. Does anyone have any advice as to the best way to do this?
Suggestion: use an atom for cluelist. Constantly redefining cluelist is not the right way to represent mutable data. Honestly, I would expect it to throw an exception the second time you define cluelist.
If you use an atom for cluelist, you can call the fireTableDataChanged method from a watcher instead of calling it manually. This would mean that anytime (and anywhere) you change the atom, fireTableDataChanged will be called automatically, without an explicit call.
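The watcher idea can be sketched independently of Swing. The Atom class below is a hypothetical, single-threaded stand-in written in Python for illustration; it is not Clojure's atom (which is also thread-safe), and the list named fired merely stands in for calls to fireTableDataChanged.

```python
class Atom:
    """Hypothetical stand-in for an atom: one mutable reference plus watchers."""

    def __init__(self, value):
        self._value = value
        self._watchers = {}

    def deref(self):
        return self._value

    def add_watch(self, key, fn):
        # fn is called as fn(key, atom, old_value, new_value) after each change.
        self._watchers[key] = fn

    def reset(self, new_value):
        old_value, self._value = self._value, new_value
        for key, fn in self._watchers.items():
            fn(key, self, old_value, new_value)
        return new_value


fired = []  # records each notification instead of touching a real table model
cluelist = Atom([])
cluelist.add_watch("table", lambda key, ref, old, new: fired.append(len(new)))

cluelist.reset([[1, "WORD", "a clue"]])  # notifies the watcher
cluelist.reset([])                       # notifies it again
print(fired)
```

With this arrangement the code that changes the data never needs to know that a table exists; the table registers its own interest once.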
The issue with def is that calling def multiple times doesn't work well in a multi-threaded environment and Clojure tries to make everything default to fairly threadsafe. As I understand it, the "proper" way to use a var is to leave its root binding alone (ie, don't call def again) and use binding if you need to locally change it. def may work the way you are using it, but the language is set up to support atoms, refs, or agents in this sort of situation and these will probably work better most of the time (ie you get watchers). Also, you don't need to worry at all about threads if you add them later.

Haskell: List Comprehension to Combinatory

Inspired by this article, I was playing with translating functions from list comprehension to combinatory style, and I found something interesting.
-- Example 1: List Comprehension
*Main> [x|(x:_)<-["hi","hello",""]]
"hh"
-- Example 2: Combinatory
*Main> map head ["hi","hello",""]
"hh*** Exception: Prelude.head: empty list
-- Example 3: List Comprehension (translated from Example 2)
*Main> [head xs|xs<-["hi","hello",""]]
"hh*** Exception: Prelude.head: empty list
It seems strange that example 1 does not throw an exception, because (x:_) pattern matches one of the definitions of head. Is there an implied filter (not . null) when using list comprehensions?
See the section on list comprehensions in the Haskell report. So basically
[x|(x:_)<-["hi","hello",""]]
is translated as
let ok (x:_) = [ x ]
    ok _     = [ ]
in concatMap ok ["hi","hello",""]
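For readers who do not write Haskell, the shape of that translation can be mimicked in Python. This is only an illustration of the desugaring: the helper names ok and concat_map are chosen to mirror the Haskell above, and Python's own comprehensions do no pattern matching of this kind.

```python
def ok(xs):
    # Mirrors `ok (x:_) = [x]` and `ok _ = []`:
    # input that does not match the pattern contributes nothing.
    return [xs[0]] if xs else []

def concat_map(f, items):
    # Apply f to every item and concatenate the resulting lists.
    return [y for item in items for y in f(item)]

heads = "".join(concat_map(ok, ["hi", "hello", ""]))
print(heads)
```

The empty string yields an empty list, which disappears on concatenation, so no exception is raised.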
P.S. Since list comprehensions can be translated into do expressions, a similar thing happens with do expressions, as detailed in the section on do expressions. So the following will also produce the same result:
do (x:_) <- ["hi","hello",""]
   return x
Pattern match failures are handled specially in list comprehensions. In case the pattern fails to match, the element is dropped. Hence you just get "hh" but nothing for the third list element, since the element doesn't match the pattern.
This is due to the definition of the fail function which is called by the list comprehension in case a pattern fails to match some element:
fail _ = []
The correct parts of this answer are courtesy of kmc of #haskell fame. All the errors are mine, don't blame him.
Yes. When you qualify a list comprehension by pattern matching, the values which don't match are filtered out, getting rid of the empty list in your Example 1. In Example 3, the empty list matches the pattern xs so is not filtered, then head xs fails. The point of pattern matching is the safe combination of constructor discrimination with component selection!
You can achieve the same dubious effect with an irrefutable pattern, lazily performing the selection without the discrimination.
Prelude> [x|(~(x:_))<-["hi","hello",""]]
"hh*** Exception: <interactive>:1:0-30: Irrefutable pattern failed for pattern (x : _)
List comprehensions neatly package uses of map, concat, and thus filter.

Managing updates to nested immutable data structures in functional languages

I've noticed while on my quest to learn functional programming that there are cases when parameter lists start to become excessive when using nested immutable data structures. This is because when making an update to an object state, you need to update all the parent nodes in the data structure as well. Note that here I take "update" to mean "return a new immutable object with the appropriate change".
e.g. the kind of function I have found myself writing (Clojure example) is:
(defn update-object-in-world [world country city building object property value]
(update-country-in-world world
(update-city-in-country country
(update-building-in-city building
(update-object-in-building object property value)))))
All this to update one simple property is pretty ugly, but in addition the caller has to assemble all the parameters!
This must be a fairly common requirement when dealing with immutable data structures in functional languages generally, so is there a good pattern or trick to avoid this that I should be using instead?
Try
(update-in
world
[country city building]
(update-object-in-building object property value))
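Clojure's update-in takes a path of keys and a function, and returns a new structure in which the function has been applied to the value at that path (assoc-in is the variant that sets a plain value). The mechanics can be sketched in Python; the update_in helper and the sample world below are made up for illustration and are not Clojure's implementation.

```python
def update_in(mapping, path, fn):
    """Return a copy of nested dict `mapping` with fn applied to the value at `path`."""
    key, *rest = path
    updated = dict(mapping)  # shallow copy: untouched branches are shared, not copied
    if rest:
        updated[key] = update_in(mapping.get(key, {}), rest, fn)
    else:
        updated[key] = fn(mapping.get(key))
    return updated

world = {"fr": {"paris": {"louvre": {"visitors": 1}}}, "de": {"berlin": {}}}
new_world = update_in(world, ["fr", "paris", "louvre", "visitors"], lambda n: n + 1)

print(new_world["fr"]["paris"]["louvre"]["visitors"])  # updated copy
print(world["fr"]["paris"]["louvre"]["visitors"])      # original is unchanged
```

Only the dictionaries along the path are rebuilt; everything else is shared between the old and the new world, which is what keeps this kind of update cheap.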
A classic general-purpose solution to this problem is what's called a "zipper" data structure. There are a number of variations, but the basic idea is simple: Given a nested data structure, take it apart as you traverse it, so that at each step you have a "current" element and a list of fragments representing how to reconstruct the rest of the data structure "above" the current element. A zipper can perhaps be thought of as a "cursor" that can move through an immutable data structure, replacing pieces as it goes, recreating only the parts it has to.
In the trivial case of a list, the fragments are just the previous elements of the list, stored in reverse order, and traversal is just moving the first element of one list to the other.
In the nontrivial but still simple case of a binary tree, the fragments each consist of a value and a subtree, identified as either right or left. Moving the zipper "down-left" involves adding to the fragment list the current element's value and right child, making the left child the new current element. Moving "down-right" works similarly, and moving "up" is done by combining the current element with the first value and subtree on the fragment list.
While the basic idea of the zipper is very general, constructing a zipper for a specific data structure usually requires some specialized bits, such as custom traversal or construction operations, to be used by a generic zipper implementation.
The original paper describing zippers (warning, PDF) gives example code in OCaml for an implementation storing fragments with an explicit path through a tree. Unsurprisingly, plenty of material can also be found on zippers in Haskell. As an alternative to constructing an explicit path and fragment list, zippers can be implemented in Scheme using continuations. And finally, there seems to even be a tree-oriented zipper provided by Clojure.
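The trivial list case described above is small enough to sketch in full. This is an illustrative Python version, not taken from any of the linked material: all names are hypothetical, the list is assumed to be non-empty, and moving past either end is not checked.

```python
from typing import NamedTuple

class ListZipper(NamedTuple):
    left: tuple     # elements before the focus, nearest first (reversed order)
    focus: object   # the "current" element
    right: tuple    # elements after the focus, in order

def from_list(items):
    return ListZipper((), items[0], tuple(items[1:]))

def move_right(z):
    # The old focus becomes the nearest element on the left.
    return ListZipper((z.focus,) + z.left, z.right[0], z.right[1:])

def move_left(z):
    # The old focus becomes the first element on the right.
    return ListZipper(z.left[1:], z.left[0], (z.focus,) + z.right)

def replace(z, value):
    return z._replace(focus=value)  # new zipper; the old one is untouched

def to_list(z):
    return list(reversed(z.left)) + [z.focus] + list(z.right)

start = from_list([1, 2, 3, 4])
edited = replace(move_right(move_right(start)), 30)
result = to_list(move_left(edited))
print(result)
```

Every operation returns a new zipper, so earlier versions such as start remain valid and unchanged after the edit.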
There are two approaches that I know of:
Collect multiple parameters in some sort of object that is convenient to pass around.
Example:
; world is a nested hash, the rest are keys
(defstruct location :world :country :city :building)
(defstruct attribute :object :property)
(defn do-update[location attribute value]
(let [{:keys [world country city building]} location
{:keys [object property]} attribute ]
(assoc-in world [country city building object property] value)))
This brings you down to two parameters that the caller needs to care about (location and attribute), which may be fair enough if those parameters do not change very often.
The other alternative is a with-X macro, which sets variables for use by the code body:
(defmacro with-location [location & body] ; run body in location context
(concat
(list 'let ['{:keys [world country city building] :as location} `~location])
`(~@body)))
Example use:
(with-location location (println city))
Then whatever the body does, it does to the world/country/city/building set for it, and it can pass the entire thing off to another function using the "pre-assembled" location parameter.
Update: Now with a with-location macro that actually works.