R: Speed up API calls - json

I am writing code to collect data from the Steam API (documentation: https://developer.valvesoftware.com/wiki/Steam_Web_API).
In particular, I use the fromJSON() function from the jsonlite package in R.
However, I have the feeling that the code is slow and that the bottleneck is the actual calling of the API. I am currently able to do around 7,500-10,000 calls an hour, i.e. around 2-3 calls per second. This feels slow. Is it possible to speed this up, and if so, how?
Two things I have found already are that it may be necessary to close the connection after opening it (cf. http://www.firaja.cc/steam-web-api-right-way.html), and that the API can return not only JSON (what I use now) but also XML and CSV output; maybe it would be better to use one of the latter two? Any other possible solutions?

Related

Dash client-side callback vs dcc.store

I have a Dash app connected to an AWS RDS database. I have a live-updated graph that triggers a callback with an n_intervals of 5 minutes to query the database and do some expensive formatting. I store the transformed data (~500 data points) in a dcc.Store, from which another 6 graphs and a datatable use this data (no further processing required). My question is: to further improve the efficiency of the dashboard, should I use client-side callbacks instead of dcc.Store? From what I've read, the client side only uses the client browser and doesn't need to communicate back to the Dash server on callback? Thank you. (I'm secretly hoping it makes little difference, as I don't want to learn JavaScript.)
The answer depends a lot on your architecture, how complex the graphs are, and how many server-side callbacks are currently used.
You have already mentioned the main advantage of clientside callbacks: since the data is on the client side, they save some time and network traffic by updating the components directly in the browser. The major inconvenience is that these are synchronous calculations and will block the app's main thread (Dash does not support promises or asynchronous clientside callbacks for now), which can lead to a bad UX if these functions take too much time.
For very simple plots and tables, I can guarantee that there is a significant improvement in using clientside callbacks, especially because it avoids the Plotly API, which can be much slower than just defining a JSON object with traces and layout.
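As a rough sketch of what that split can look like (assuming Dash 2.x; the component ids and the query_and_format helper are made up for illustration), one server-side callback fills the dcc.Store on each interval, and a clientside callback builds a figure from the stored data without another round trip to the server:

from dash import Dash, dcc, html, Input, Output

def query_and_format():
    # hypothetical placeholder for the RDS query and the expensive formatting
    return {"x": list(range(500)), "y": [i * i for i in range(500)]}

app = Dash(__name__)
app.layout = html.Div([
    dcc.Interval(id="poll", interval=5 * 60 * 1000),  # every 5 minutes
    dcc.Store(id="store"),
    dcc.Graph(id="graph"),
])

# Server-side callback: hit the database once per interval and cache the result.
@app.callback(Output("store", "data"), Input("poll", "n_intervals"))
def refresh(_):
    return query_and_format()

# Clientside callback: build the figure in the browser from the stored data.
app.clientside_callback(
    """
    function(data) {
        if (!data) { return window.dash_clientside.no_update; }
        return {data: [{x: data.x, y: data.y, type: "scatter"}],
                layout: {margin: {t: 30}}};
    }
    """,
    Output("graph", "figure"),
    Input("store", "data"),
)

The other six graphs and the datatable can hang off the same Store with their own clientside callbacks, so only the query and formatting stay on the server.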

Need help downloading and reading a zipped CSV file in memory with Clojure

I have an external site from which I want to download a zipped CSV file. Currently, I'm downloading it zipped, saving it to disk, then unzipping it, saving the unzipped file to disk, and then reading the unzipped file with the CSV reader. There are a lot of useless steps in this process that can be trimmed out, so I went on my way to do so.
This amazing answer helped me to get myself going. I tried to use the first option linked there (GZIPInputStream), but I get a "Not GZIP format" error, so I suppose I have to go to the second option.
This is my current code, and it does what I want it to do:
;; assumes clj-http.client, clojure.data.csv and clojure.java.io are required,
;; and java.util.zip.ZipInputStream is imported
(defn download-zipped-stream! []
  (:body (clj-http.client/get "www.example.com" {:as :stream})))

(with-open [stream (ZipInputStream. (download-zipped-stream!))]
  (.getNextEntry stream) ;; position the stream at the first entry in the archive
  (doall (clojure.data.csv/read-csv (clojure.java.io/reader stream) :separator \;)))
I literally got to this by trial and error. There are mainly three things I'd like to change / understand about this code.
Ideally, I would want to break my code in two parts: one to download and unzip the content, returning a stream - the reason being that I want to decide later whether I want to read it as a csv directly, or write to disk (I don't want to lose this option, because, during development, it is much easier to read a pre-downloaded csv file than downloading the big content every single time). Turns out that, if I try to access the stream outside of the with-open call, I get a "stream closed" error (which, from what I understand, makes total sense).
On the above code, I have to call this .getNextEntry, or I get an empty list. As someone who is striving to write functional code, this bothers me, because, from what I can understand, I'm dealing with states here - my stream object looks mutable, which is something I really don't want. Isn't there a way to work around this step and straight-up not have it there?
I tried to call the read-csv method directly on the stream object, but read-csv doesn't really know how to handle ZipInputStreams, apparently. Seeing this, I simply and hopefully threw an io/reader call in between, and it worked. I don't know if this is the best approach, though. Is it correct?
I'm quite new to Clojure, and I'm completely clueless about Java in general, so, as you can see, my knowledge about these stream objects is pretty limited. I tried to read something about them on the Java side, but I quit because I was not sure how much of it would be useful for someone learning Clojure, so any pointers are also appreciated.
I think you are on the right approach. Suggestions to consider:
Consider using wget to manually download the *.csv.gz file to your local disk. Then, just open that local file instead of using clj-http.client/get.
I haven't played much with ZipInputStream, but if using .getNextEntry() seems to be required, just go with it.
The examples for read-csv show using a Reader to give access to the input file, so this is the expected behavior.
This template project shows how I like to organize a Clojure project & source code. Be sure to peruse the list of documentation provided.
Don't forget to utilize cljdoc.org for looking up Clojure library API docs. For example, see the API docs for data.csv.
Update
You may also want to review this answer.
Use https://github.com/techascent/tech.ml.dataset, optionally with https://scicloj.github.io/tablecloth/index.html (a dplyr-like API for TMD).
It also has the advantage of being extremely fast and able to handle datasets that can't fit in memory, and it talks SQL, Arrow, et al. Join the conversation about it here:
https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset

How to convert a large JSON file into XML?

I have a large JSON file; its size is 5.09 GB. I want to convert it to an XML file. I tried online converters, but the file is too large for them. Does anyone know how to do that?
The typical way to process XML as well as JSON files is to load them completely into memory. You then have a so-called DOM, which allows various kinds of data processing. But neither XML nor JSON is really designed for storing as much data as you have here. In my experience you will typically run into memory problems as soon as you exceed a limit of around 200 MB. This is because the DOM is built from many individual objects, an approach that results in a huge memory overhead far exceeding the amount of data you actually want to process.
The only way for you to process files like that is basically to take a streaming approach. The basic idea: instead of parsing the whole file and loading it into memory, you parse and process the file "on the fly". As data is read, it is parsed and events are triggered, to which your software can react and perform actions as needed. (For details, have a look at the SAX API to understand this concept in more depth.)
As you stated, you are processing JSON, not XML. Streaming APIs for JSON should be available in the wild as well. In any case, you could implement one fairly easily yourself: JSON is a pretty simple data format.
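As a rough illustration of the streaming idea in Python (assuming the top level of the JSON file is one large array of flat records; ijson is one third-party streaming parser, and the file and element names here are made up):

import ijson                                 # third-party streaming JSON parser
from xml.sax.saxutils import escape

# Read the root array one record at a time and write XML incrementally,
# so only a single record is ever held in memory.
with open("big.json", "rb") as src, open("big.xml", "w", encoding="utf-8") as dst:
    dst.write("<records>\n")
    for record in ijson.items(src, "item"):  # "item" = each element of the root array
        dst.write("  <record>\n")
        for key, value in record.items():
            # assumes the JSON keys are usable as XML element names
            dst.write(f"    <{key}>{escape(str(value))}</{key}>\n")
        dst.write("  </record>\n")
    dst.write("</records>\n")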
Nevertheless, such an approach is not optimal: typically this concept results in very slow data processing because of the millions of method invocations involved. For every item encountered you typically need to call a method in order to perform some data-processing task. This, together with the additional checks about what kind of information you have currently encountered in the stream, slows down processing considerably.
You really should consider using a different kind of approach: first split your file into many small ones, then perform the processing on them. This approach might not seem very elegant, but it helps to keep your task much simpler. It also gives you a main advantage: it will be much easier for you to debug your software. Unfortunately you are not very specific about your problem, so I can only guess, but large files typically imply that the data model is pretty complex. You will therefore probably be much better off having many small files instead of a single huge one. Later it also lets you dig into individual aspects of your data and your processing as needed. You will probably fail to get any detailed insight while working with a single 5 GB file, and on errors you will have trouble identifying which part of the huge file is causing the problem.
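A sketch of that splitting step under the same assumptions as above (root-level array, ijson, hypothetical file names), writing 10,000 records per output file:

import json
import ijson

CHUNK = 10_000                               # records per output file (arbitrary)

def flush(records, index):
    # ijson yields Decimal for numbers, so fall back to float when serialising
    with open(f"part-{index:05d}.json", "w", encoding="utf-8") as dst:
        json.dump(records, dst, default=float)

with open("big.json", "rb") as src:
    chunk, index = [], 0
    for record in ijson.items(src, "item"):
        chunk.append(record)
        if len(chunk) == CHUNK:
            flush(chunk, index)
            chunk, index = [], index + 1
    if chunk:                                # leftover records
        flush(chunk, index)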
As I already stated, you are unfortunately not very specific about your problem. Sorry, but without more details about your problem (and your data in particular) I can only give you these general recommendations about data processing. I do not know any details about your data, so I cannot tell you which approach will work best in your case.

Is d3.text faster than d3.json?

I'm just wondering if d3.text is faster than d3.json?
The reason behind my question is that I'm reading the source code behind cubism.js, and I'm just curious to know whether it's done with d3.text because it's faster.
Not really.
The reason the Graphite metric uses d3.text is that Graphite doesn’t reply with JSON-formatted data; it has its own raw format. Cubism does use d3.json when the server replies with JSON, as for example with Cube metrics.
Under the hood, d3.text and d3.json both use d3.xhr, so they are going to download the file in exactly the same way (via asynchronous XMLHttpRequest). Sure, d3.text doesn’t subsequently run the response through JSON.parse, but you still have to parse the reply somehow, and more often than not I would expect the native JSON.parse to be faster, though it depends on the exact format.

best practices for writing to a file from multiple methods

I have a class that contains a bunch of methods for checking data I scrape every week (for things like well-formedness and other errors in gathering the data). Each of these methods performs a test, and then prints out a summary of the test.
I want to print out the output from these tests to a file, but I'm not sure what the best way to do it is. For example...
Should the class hold an instance variable for the file, with each method opening/appending to/closing the file? (A problem is that methods sometimes call other methods, so this seems kind of messy?)
Should each method get passed the file as a parameter? (Seems messy as well.)
Should each method return a string, with a "central" method that calls all the other tests and writes all these strings to a file?
I'm not really familiar with using logger libraries -- would that be a solution?
My particular context
I have a scraper that pulls data from various websites and stores them in a database. Websites change all the time, so I'm writing a "scrape checker" program that checks my scrapes for various things, like:
number of empty results
length of results
weird characters in results
and so on
So I have methods like:
check_num_empty_results
check_weird_characters
check_scrape (calls a bunch of other checks)
check_scrape_pair (sometimes I want to check pairs of scrapes together, e.g., to match results against each other, so this is different from checking each one in isolation)
etc.
I want my "scrape checker" program to print out a file that summarizes all the checks.
Separation of concerns. Write code that focuses on the scraping activity and returns the value(s) scraped. Then use aspect-oriented programming for logging, which can simplify the problem greatly, as the aspect holds the reference to the file or logging API.
Ultimately, it depends on what language you're using.
The first solution makes the most sense if your language permits it. For each instance of the logging class, have a field for the file object that you're reading from/writing to. This is basically equivalent to passing the file object as a parameter to every method.
That said, most mature languages have modules that will do a lot of this work for you; off the top of my head, sh/awk, Perl, and Python all come to mind as being suited to this task (though if you want to, you could use Java or something else).
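A rough sketch of that first approach in Python (the class name and report path are made up; the check names come from the question):

class ScrapeChecker:
    def __init__(self, report_path):
        # one shared handle; every check method writes its summary through it
        self._report = open(report_path, "a", encoding="utf-8")

    def _write(self, line):
        self._report.write(line + "\n")

    def check_num_empty_results(self, results):
        self._write(f"empty results: {sum(1 for r in results if not r)}/{len(results)}")

    def check_weird_characters(self, results):
        bad = sum(1 for r in results if not str(r).isprintable())
        self._write(f"results with odd characters: {bad}")

    def check_scrape(self, results):
        # composite check: just calls the other checks, which share the same handle
        self.check_num_empty_results(results)
        self.check_weird_characters(results)

    def close(self):
        self._report.close()

Because the handle lives on the instance, nested checks like check_scrape need no file passed around explicitly.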
Seems like a logging framework would be a perfect solution for this. If you are using Java or .NET, log4j and log4net are pretty much the de-facto standards for that.
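If the checker happens to be written in Python, the standard library's logging module covers the same ground; a minimal sketch (the file name and format are arbitrary):

import logging

# route every check's output to one report file
logging.basicConfig(
    filename="scrape_report.log",
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("scrape_checker")

def check_num_empty_results(results):
    log.info("empty results: %d of %d", sum(1 for r in results if not r), len(results))

def check_weird_characters(results):
    bad = [r for r in results if not str(r).isprintable()]
    log.warning("results with weird characters: %d", len(bad))

Handlers can later redirect the same calls to the console, a rotating file, or email without touching the check methods.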