Read a huge JSON file in R with stream_in, or another option - json

I have a 2 GB JSON file that I need to read into R. I have tried to parse it using the stream_in function, with the following code:
library(jsonlite)  # provides stream_in
library(plyr)      # provides llply

h <- function(x) { input <<- x }
dat <- llply(as.list("cookie.JSON"),
             function(x) stream_in(file("cookie.JSON"), pagesize = 5000, handler = h))
Each time I have to stop the execution and manually assign the name input to the data frame, which is not practical and wastes time.
Has anyone had experience breaking down a JSON file like this?
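A minimal sketch of the handler-based approach, assuming the file is in the NDJSON form (one record per line) that stream_in expects; instead of overwriting a single global, the handler appends each page to a list:
library(jsonlite)

pages <- list()
h <- function(df) {
  # accumulate each 5000-record page instead of overwriting one global
  pages[[length(pages) + 1]] <<- df
}

# stream_in hands each chunk of `pagesize` records to the handler,
# so the whole 2 GB file is never held in memory at once
stream_in(file("cookie.JSON"), handler = h, pagesize = 5000)

# combine the chunks at the end (needs enough RAM for the final frame)
dat <- do.call(rbind, pages)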

Related

Save decoded JSON values in Lua Variables

The following script describes the decoding of a JSON object received via MQTT. In this case, we shall take the following JSON object as an example:
{"00-06-77-2f-37-94":{"publish_topic":"/stations/test","sample_rate":5000}}
After being received and decoded in the handleOnReceive function, the local function saveTable is called with the decoded object, which looks like:
["00-06-77-2f-37-94"] = {
publish_topic = "/stations/test",
sample_rate = 5000
}
The goal of the saveTable function is to go through the table above and assign the values "/stations/test" and 5000 to the variables pubtop and rate, respectively. However, when I print each of the two variables, nil is returned in both cases.
How can I extract the values of this table and save them in mentioned variables?
If I can only save the strings "publish_topic = "/stations/test"" and "sample_rate = 5000" at first, would I need to parse these to get the values above and save them, or is there another way?
local pubtop
local rate

local function saveTable(t)
  local conversionTable = {}
  for k, v in pairs(t) do
    if type(v) == "table" then
      conversionTable[k] = string.format("%q: {", k)
      printTable(v)
      print("}")
    else
      print(string.format("%q:", k) .. v .. ",")
    end
  end
  pubtop = conversionTable[0]
  rate = conversionTable[1]
end

local lua_value

local function handleOnReceive(topic, data, _, _)
  print("handleOnReceive: topic '" .. topic .. "' message '" .. data .. "'")
  print(data)
  lua_value = JSON:decode(data)
  saveTable(lua_value)
  print(pubtop)
  print(rate)
end
client:register('OnReceive', handleOnReceive)
previous question to thread: Decode and Parse JSON to Lua
The function I gave you was to recursively print table contents. It was not meant to be modified to get specific values.
Your modifications do not make any sense. Why would you store that string in conversionTable[k]? You obviously have no idea what you're doing here. No offense, but you should learn some basics before you continue.
I gave you that function so you can print whatever is the result of your json decode.
If you know you get what you expect there is no point in recursively iterating through that table.
Just do it like this:
for k, v in pairs(lua_value) do
  print(k)
  print(v.publish_topic)
  print(v.sample_rate)
end
Now read the Lua reference manual and do some beginners tutorials please.
You're wasting a lot of time and resources trying to implement things like this when you do not know how to access the elements of a table. This is the most basic and important operation in Lua.
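For completeness, a minimal sketch of the assignment the question actually asks for, assuming the decoded table always has the shape shown above (a single device key mapping to a settings table):
local pubtop
local rate

local function saveTable(t)
  -- t has one key (the MAC address); its value is the settings table
  for _, settings in pairs(t) do
    pubtop = settings.publish_topic  -- "/stations/test"
    rate = settings.sample_rate      -- 5000
  end
end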

Inconsistent behaviour when attempting to write Dataframe to CSV in Apache Spark

I'm trying to output the optimal hyperparameters for a decision tree classifier I trained using Spark's MLlib to a CSV file, using DataFrames and spark-csv. Here's a snippet of my code:
// Split the data into training and test sets (10% held out for testing)
val Array(trainingData, testData) = assembledData.randomSplit(Array(0.9, 0.1))

// Define cross-validation with a hyperparameter grid
val crossval = new CrossValidator()
  .setEstimator(classifier)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setNumFolds(10)

// Train model
val model = crossval.fit(trainingData)

// Find the best hyperparameter combination and create an RDD
val bestModel = model.bestModel
val hyperparamList = new ListBuffer[(String, String)]()
bestModel.extractParamMap().toSeq.foreach(pair => {
  val hyperparam: Tuple2[String, String] = (pair.param.name, pair.value.toString)
  hyperparamList += hyperparam
})
val hyperparameters = sqlContext.sparkContext.parallelize(hyperparamList.toSeq)

// Print the best hyperparameters
bestModel.extractParamMap().toSeq.foreach(pair => {
  println(s"${pair.param.parent} ${pair.param.name}")
  println(pair.value)
})

// Define the CSV path for the results
var csvPath: String = "/root/results/decision-tree"
val hyperparametersPath: String = csvPath + "/hyperparameters"
val hyperparametersFile: File = new File(hyperparametersPath)
val results = (hyperparameters, hyperparametersPath, hyperparametersFile)

// Convert the RDD to a DataFrame and write it as CSV
val dfToSave = spark.createDataFrame(results._1.map(x => Row(x._1, x._2)))
dfToSave.write.format("csv").mode("overwrite").save(results._2)

// Stop the Spark session
spark.stop()
After the Spark job finishes, I can see the part-00*... and _SUCCESS files inside the path as expected. However, although there are 13 hyperparameters in total in this case (confirmed by printing them on screen), cat-ing the CSV files shows that not every hyperparameter was written:
user#master:~$ cat /root/results/decision-tree/hyperparameters/part*.csv
checkpointInterval,10
featuresCol,features
maxDepth,5
minInstancesPerNode,1
Also, the hyperparameters that do get written change on every execution. This is executed on an HDFS-based Spark cluster with 1 master and 3 workers that have exactly the same hardware. Could it be a race condition? If so, how can I solve it?
Thanks in advance.
I think I figured it out. I expected dfToSave.write.format("csv").save(path) to write everything to the master node, but since the tasks are distributed to all workers, each worker saves its part of the hyperparameters to a local CSV in its own filesystem. Because in my case the master node is also a worker, I can see its part of the hyperparameters. The "inconsistent behaviour" (i.e. seeing different parts in each execution) is caused by whatever algorithm Spark uses for distributing partitions among workers.
My solution will be to collect the CSVs from all workers using something like scp or rsync to build the full results.
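An alternative sketch: since the result set is tiny (13 rows), coalescing it into a single partition before writing makes one worker produce one part file containing every row. This assumes the output path is on storage all executors can reach (e.g. HDFS) and reuses dfToSave from the snippet above:
// collapse the tiny result set into one partition so a single
// part file holds every hyperparameter row
dfToSave
  .coalesce(1)
  .write
  .format("csv")
  .mode("overwrite")
  .save(hyperparametersPath)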

JMeter - Specify CSV row failure

Within JMeter, I am running a script which uses a .CSV file to enter data as well as to verify results. It is working correctly, but I cannot figure out how to tell which row/line of the .CSV caused each individual failure. Is there a way to do this?
Somewhat of an example scenario (not specific to what I'm doing, but similar):
Each row of the .CSV file contains a mathematical equation as well as the expected result.
On page 1, you enter the equation (2+2).
Then on page 2, you get the response: 3.
That test would obviously be a failure.
Say there are 1,000 tests being run, some that pass and some that do not. How can I tell which .CSV row/line didn't pass?
Do you have any columns in your CSV file which help you to uniquely identify a row?
Let me assume you have a column called 'TestCaseNo' which has values TC001, TC002, TC003... etc.
Add a Beanshell Post Processor to store the result for each iteration.
Add the code below. I assume you have the PASS or FAIL result stored in the 'Result' variable.
import java.io.FileOutputStream;
import java.io.PrintStream;

// append one "TestCaseNo,Result" line per iteration
FileOutputStream f = new FileOutputStream("somepath/tcstatus.csv", true);
PrintStream p = new PrintStream(f);
p.println(vars.get("TestCaseNo") + "," + vars.get("Result"));
p.close();
f.close();
The above code creates a CSV file with the result for each test case.
EDIT:
Do the assertion yourself in the Beanshell post processor.
import java.io.FileOutputStream;
import java.io.PrintStream;

String Result = "FAIL";
String Response = prev.getResponseDataAsString();
if (Response.contains("value")) { // replace "value" with the expected text
    Result = "PASS";
}
FileOutputStream f = new FileOutputStream("somepath/tcstatus.csv", true);
PrintStream p = new PrintStream(f);
p.println(vars.get("TestCaseNo") + "," + Result);
p.close();
f.close();
I would use the following approach:
the __CSVRead() function - to get data from the .csv file.
the __counter() function - to track the CSV file position. You can include the counter variable name in the Sampler's label so the current .csv file line is reported with each result, as in the sketch below.
For more information on the aforementioned and other useful JMeter functions, see the How to Use JMeter Functions post series.
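A hypothetical sketch of that setup, assuming a file equations.csv whose column 0 holds the equation and column 1 the expected result:
Sampler label:  Row ${__counter(FALSE,row)}: ${__CSVRead(equations.csv,0)}
Request data:   ${__CSVRead(equations.csv,0)}
Expected value: ${__CSVRead(equations.csv,1)}${__CSVRead(equations.csv,next)}
With the counter in the label, a failed sample shows up in the listener as, e.g., "Row 42: 2+2", pointing straight at line 42 of the .csv file.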

Fast JSON Parser for Matlab

Do you know a very fast JSON Parser for Matlab?
Currently I'm using JSONlab, but with larger JSON files (mine is 12 MB, 500,000 lines) it's really slow. Or do you have any tips for me to increase the speed?
P.S. The JSON file is at most 3 levels deep.
If you want to be fast, you could use the Java JSON parser.
And before this answer gets out of hand, I am going to post the stuff I put down so far:
clc
% input example
jsonStr = '{"bool1": true, "string1": "some text", "double1": 5, "array1": [1,2,3], "nested": {"val1": 1, "val2": "one"}}'
% use java..
javaaddpath('json.jar');
jsonObj = org.json.JSONObject(jsonStr);
% check out the available methods
jsonObj.methods % see also http://www.json.org/javadoc/org/json/JSONObject.html
% get some stuff
b = jsonObj.getBoolean('bool1')
s = jsonObj.getString('string1')
d = jsonObj.getDouble('double1')
i = jsonObj.getJSONObject('nested').getInt('val1')
% put some stuff
jsonObj = jsonObj.put('sum', 1+1);
% getting an array or matrix is not so easy (you get a JSONArray)
e = jsonObj.get('array1');
% what are the methods to access that JSONArray?
e.methods
for idx = 1:e.length()
    e.get(idx-1)
end
% but putting arrays or matrices works fine
jsonObj = jsonObj.put('matrix1', ones(5));
% you can get these also easily ..
m1 = jsonObj.get('matrix1')
% .. as long as you don't convert the obj back to a string
jsonObj = org.json.JSONObject(jsonObj.toString());
m2 = jsonObj.get('matrix1')
If you can afford to call .NET code, you may want to have a look at this lightweight guy (I'm the author):
https://github.com/ysharplanguage/FastJsonParser#PerfDetailed
Coincidentally, my benchmark includes a test ("fathers data") in the 12 MB ballpark precisely (and with a couple of levels of depth also) that this parser parses into POCOs in under 250 ms on my cheap laptop.
As for the MATLAB + .NET code integration:
http://www.mathworks.com/help/matlab/using-net-libraries-in-matlab.html
HTH
If you just want to read JSON files, and have a C++11 compiler, you can use the very fast json_read mex function.
Since MATLAB R2016b, you can use jsondecode.
I have not compared its performance to other implementations, but from personal experience I can say that it is not horribly slow.
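A minimal jsondecode sketch (R2016b or newer), reusing the example string from the Java answer above:
% decode the same example string used in the Java answer
jsonStr = '{"bool1": true, "string1": "some text", "double1": 5, "array1": [1,2,3], "nested": {"val1": 1, "val2": "one"}}';
data = jsondecode(jsonStr);

b = data.bool1;        % logical 1
s = data.string1;      % 'some text'
a = data.array1;       % 3x1 double [1;2;3]
v = data.nested.val1;  % 1

% for a large file, read it once and decode once
txt = fileread('data.json');
data = jsondecode(txt);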

To read SO's data dump effectively

I currently use Vim to read SO's data dump. However, my MacBook slows down when I scroll down just a few rows. This suggests to me that there must be more efficient ways to read the data.
I know a little MySQL. The files are in .xml format, and it is rather hard to read the data in that form. It may be more efficient to convert the XML files to MySQL and then read them there. I only know the MS database tool for such actions; I would like to know another tool too.
Problems
to parse the .xml into SQL queries that MySQL understands; this requires knowing the structure of the data
to load the data into MySQL
to find a tool, similar to the MS database tool, with which we can read the data effectively
How do you read SO's data dump effectively?
--
[edit]
How can you run the 523 SQL queries that create the database in your terminal? I have the commands at the moment in a text file.
How can you switch the database to a simple recovery mode?
I made my first ever Python program to read them and output SQL insert statements for use with MySQL (it's ugly, but it worked). You'll need to create the tables by hand first, though.
import xml.sax.handler
import xml.sax
import sys

class SOHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.errParse = 0

    def startElement(self, name, attributes):
        if name != "row":
            self.table = name
            self.outFile = open(name + ".sql", "w")
            self.errfile = open(name + ".err", "w")
        else:
            skip = 0
            currentRow = u"insert into " + self.table + "("
            for attr in attributes.keys():
                currentRow += str(attr) + ","
            currentRow = currentRow[:-1]
            currentRow += u") values ("
            for attr in attributes.keys():
                try:
                    currentRow += u'"{0}",'.format(attributes[attr].replace('\\', '\\\\').replace('"', '\\"').replace("'", "\\'"))
                except UnicodeEncodeError:
                    self.errParse += 1
                    skip = 1
                    self.errfile.write(currentRow)
            if skip != 1:
                currentRow = currentRow[:-1]
                currentRow += u");"
                #print len(attributes.keys())
                self.outFile.write(currentRow.encode("utf-8"))
                self.outFile.write("\n")
                self.outFile.flush()
                print currentRow.encode("utf-8")

    def characters(self, data):
        pass

    def endElement(self, name):
        pass

if len(sys.argv) < 2:
    print "Give me an xml file argument!"
    sys.exit(1)

parser = xml.sax.make_parser()
handler = SOHandler()
parser.setContentHandler(handler)
parser.parse(sys.argv[1])
print handler.errParse
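A usage sketch, assuming the script above is saved as soparse.py (a hypothetical name), the target tables already exist, and posts.xml comes from the dump; the script names its output after the XML root element:
python2 soparse.py posts.xml          # writes posts.sql (and posts.err for rows it could not encode)
mysql -u user -p sodump < posts.sql   # load the generated inserts ("sodump" is a hypothetical database name)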