Methods to convert CSV to unique JSON - json

I need to convert a .CSV to a specific .JSON format, and I decided to use the csvtojson package from NPM since it seemed to be designed for this sort of thing.
First, a little background on my problem. I have .CSV data that looks similar to this:
scan_type, date/time, source address, source-lat, source-lng, dest address, dest-lat, dest-lng
nexpose,2016-07-18 18:21:44,1008,40.585260,-10.124120,10.111.131.4,10.844880,-10.933360
I have a plain csv-to-json converter here, but there is a problem: it outputs a rather plain file, and I need to split the source/destination values and add some special formatting that I will show later on.
[
    {
        "scan_type": "nexpose",
        "date/time": "2026-07-28 28:22:44",
        "source_address": 2008,
        "source-lat": 10.58526,
        "source-lng": -105.08442,
        "dest_address": "11.266.282.0",
        "dest-lat": 11.83388,
        "dest-lng": -111.82236
    }
]
The first thing I need to do is be able to separate the "source" values from the "destination" values. Here is an example of what I want the "source" values to look like:
(var destination will have exactly the same format)
var source = {
    id: "99.58926-295.09492",
    "source-lat": 49.59926,
    "source-lng": -209.98942,
    "source_address": 2009,
    x: {
        valueOf: function() {
            var latlng = [
                49.58596,
                -209.08442
            ];
            var xy = map.FUNCTION_FOR_CONVERTING_LAT_LNG_TO_X_Y(latlng);
            return xy[0]; // xy.x
        }
    },
    y: {
        valueOf: function() {
            var latlng = [
                49.58596,
                -209.08442
            ];
            var xy = map.FUNCTION_FOR_CONVERTING_LAT_LNG_TO_X_Y(latlng);
            return xy[1]; // xy.y
        }
    }
};
So, my question is, how should I approach converting my data? Should I convert everything with csvtojson? Or should I convert it from the plain .JSON file I generated?
Does anyone have any advice, or similar examples, they could share on how to approach this problem?

I do a lot of work parsing CSV data, and as I am sure you have seen, CSV is very hard to parse and work with correctly: there are a huge number of edge cases that can break even the most rugged of parsers (although your dataset looks fairly plain, so that isn't a huge concern). You could also potentially run into corruption by performing operations while reading from disk. It is a much better idea to get the data from CSV into a JSON file first, and then make any manipulations on a JSON object loaded from that "plain" JSON file.
tl;dr: convert your data from the plain .JSON file

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON return, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files totalling 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should. I have tried reading each JSON file into a data frame after it has been downloaded, and then attempted to union that dataframe to a master dataframe, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different datatypes.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps.
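On the asker's other idea (explicitly setting the schema that the JSON files are read into): passing the same schema directly to the JSON reader should also keep the types stable across all 88 files. A minimal sketch reusing my_schema from above; the path pattern below is a placeholder, not from the original post:

# Assumption: the downloaded files sit under one directory; the glob is hypothetical
df_json = spark.read.schema(my_schema).json("/path/to/downloads/*.json")

# Every file now arrives with identical column types, so the union cannot fail on a type mismatch
output = output.unionByName(df_json, allowMissingColumns=True)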

JSON variable indent for different entries

Background: I want to store a dict object in json format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small: mostly definitions, control parameters, etc. (call it metadata) that one would like to read before using the actual data in (2). In short, I want good human readability of this portion of the file.
(2) The data itself, which is a large chunk and should be machine readable (no need for a human to gaze over it on opening the file).
Problem: How do I specify a custom indent, say 4, for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the large data will have a lot of whitespace, making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the json formatting of the "container" object by hand, using your json formatter as appropriate for each of the contained objects.
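A minimal sketch of those steps in Python, assuming small_description and big_chunk are ordinary in-memory objects (the placeholder values below are mine, not from the question):

import json

small_description = {"version": 1, "units": "meters"}   # placeholder metadata
big_chunk = list(range(1000))                           # placeholder bulk data

with open("trig_file.json", "w") as trig_file:
    trig_file.write('{"meta_data":\n')
    json.dump(small_description, trig_file, indent=4)        # pretty-printed, easy to read
    trig_file.write(',\n"actual_data":\n')
    json.dump(big_chunk, trig_file, separators=(",", ":"))   # compact, no extra whitespace
    trig_file.write("\n}")

The result is still one valid JSON document, but only the metadata portion carries indentation.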
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
    "description": "human-readable content goes here",
    "split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
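For reference, a rough sketch of reading that interleaved file back with ijson; the file name and the pairing of each marker document with the document that follows it are my assumptions about how the "next_item" convention would be consumed:

import ijson

with open("interleaved.json", "rb") as f:
    docs = ijson.items(f, "", multiple_values=True)  # yields each top-level document in turn
    for marker in docs:
        payload = next(docs)                         # the document that follows its marker
        print(marker["next_item"], "->", type(payload).__name__)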

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .write()
    .parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0) - the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(stringOnlySchema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .javaRDD()
    .map(this::conversionFunction);

sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)
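The workaround above is Java, but for anyone working in Python the same read-as-strings-then-convert idea can be sketched in PySpark; the column name amount, the schema variable, and the paths below are hypothetical stand-ins, not from the original answer:

from pyspark.sql import functions as F

# Read every column as a string first, so the German-formatted numbers survive parsing
# (string_only_schema is assumed to be a StructType with only StringType fields)
df = (spark.read
      .option("delimiter", ";")
      .schema(string_only_schema)
      .csv("/path/to/input.csv"))

# Swap the decimal comma for a dot, then cast to a numeric type
df = df.withColumn("amount", F.regexp_replace("amount", ",", ".").cast("double"))
df.write.parquet("/path/to/output.parquet")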

reading a json file and selecting values with dojo

I'm trying to read a json file and select a value in the file, but my googling skills have failed me.
I've come across dojo.xhrGet & ItemFileReadStore, but I'm not sure which is the correct one to use. Or are neither correct?
Any help or wave of a flashlight in the right direction would be greatly appreciated.
Can you be more specific? What do you mean by selecting values in the file? Using dojo you can perform the full range of HTTP requests (GET, POST, PUT, etc.) and specify whether the returned data is text or JSON.
xhr.get({
    url: "data.json",
    handleAs: "json",
    load: function(data){
        for(var i in data){
            console.log("key", i, "value", data[i]);
        }
    }
});
Here data can be treated as an object, and based on a key the data can be retrieved using obj.key notation.

How to edit a value in existing JSON file without parsing it all?

I want to edit only one value in an existing JSON file.
Is there any way to do that without parsing and re-writing the whole file? (I use Jackson Streaming API to generate and parse the file, but I'm not sure that Streaming API can do that).
My Example.json file contains the following:
{
    "id" : "20120421141411",
    "name" : "Example",
    "time_start" : "2012-04-21T14:14:14"
}
For example: I want to edit the value of "name" from "Example" to "other name".
Not that I know of, either at the JSON level or at the file level -- unless the length of the values happens to be exactly the same, the underlying file system typically requires the rest of the file to be rewritten from the point of change.
You can read and write the file using the Streaming API, replacing the value on the go; see JsonGenerator.copyCurrentEvent(jp) to simplify the task -- it just copies the current input event exactly as-is. For everything except the particular value being replaced you can call that; for the value itself, you can call JsonGenerator.writeString().
If the file is small, the input value you're looking to replace is unique "enough", and you're open to quick-and-dirty, use Apache commons-exec or something to shell out:
bash$> echo '{
"id" : "20120421141411",
"name" : "Example",
"time_start" : "2012-04-21T14:14:14"
}' | sed -e 's/Example/othername/'
outputs:
{
"id" : "20120421141411",
"name" : "othername",
"time_start" : "2012-04-21T14:14:14"
}
Use cat file | sed ... if you know the path to the file.
If you really wanted to edit the file in place, writing only the bytes you want to change, that's only possible if the data you are writing will not overwrite subsequent data in the file. You are much better off going with one of the solutions above.
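To make that constraint concrete, here is a tiny illustration (in Python, and not specific to JSON or Jackson) of a same-length in-place overwrite; the offset and values in the usage comment are hypothetical:

def replace_in_place(path, offset, old, new):
    # In-place overwrites only work when the replacement is exactly the same length
    if len(new) != len(old):
        raise ValueError("replacement must be the same length as the original value")
    with open(path, "r+b") as f:
        f.seek(offset)
        if f.read(len(old)) != old:
            raise ValueError("old value not found at the expected offset")
        f.seek(offset)
        f.write(new)

# Hypothetical usage: overwrite "Example" with a same-length value at a known byte offset
# replace_in_place("Example.json", 34, b'"Example"', b'"EXAMPLE"')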
Suppose the JSON file were massive (>1 GB?), then would this technique make sense? NO, what the heck are you doing with a JSON file that big? Split it up! But for the sake of argument...
You really want to do it, so you hook into a JSON parser to keep track of the byte offset within the file and be able to tie that back to the object representing the JsonNode you will be manipulating. You might end up writing your own parser at this point; JSON grammar is intentionally simple. Then you'd just open the file, skip to that offset, and write the JsonNode data... unless it will overwrite something after it (do you pre-populate the file with a buffer of space after every value, just in case? hmmm... this is starting to sound like a database problem). In that case, you'll end up rewriting the entire rest of the file as the larger value "pushes" everything else downward. Not a big deal if the edits are always near the end of the file. But if they are random, your performance is doomed. You'll bottleneck on serializing writes.