How to read a CSV from a string? - csv

I have some data in CSV format. However, it is already a string, since I got it from an HTTP request.
I would like to use DataFrames in order to view the data.
However, I don't know how to parse it, because the CSV package only accepts files, not strings.
One solution would be to write the content of the string into a file and then read it back in. But there has to be a better way!

Use IOBuffer(your_string):
using CSV, DataFrames
df = CSV.read(IOBuffer(your_string), DataFrame)

Related

Random JSON file to DataStruct unmarshalling

I want to create a data struct (DS) in Go given an arbitrary JSON file. That is, take the JSON file's content and unmarshal it into the DS.
Looking around, I have found solutions on how to create such a DS, but they require knowing the JSON format beforehand (key:value pairs, types of the values, etc.). To do that, you would also have to enter the fields of the struct 'manually' and then unmarshal the JSON content into it. Of course, you can always write a small script that does that, but it seems a bit impractical, though not impossible.
Do you know a more straightforward way to achieve this?
I think I also found something about unmarshalling the JSON's content into an interface, but I am fairly sure (not 100%, though) that we will want to keep these data in a more static format, i.e. a DS. Is there a way to transform this hypothetical interface into a DS?
Maybe you can try to do this using https://github.com/golang/go/blob/e7f2e5697ac8b9b6ebfb3e0d059a8c318b4709eb/src/encoding/json/stream.go#L371
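If it helps, here is a minimal sketch of the generic-interface route mentioned in the question (not necessarily what the linked decoder code does); the Person struct and the sample payload are made up for illustration:

package main

import (
	"encoding/json"
	"fmt"
)

// Person is a hypothetical struct; in practice you would model
// whatever fields your JSON actually contains.
type Person struct {
	Name string `json:"name"`
	Age  int    `json:"age"`
}

func main() {
	payload := []byte(`{"name":"Ada","age":36}`)

	// Option 1: unknown shape -- unmarshal into a generic map and
	// inspect the values at runtime (JSON numbers come back as float64).
	var generic map[string]interface{}
	if err := json.Unmarshal(payload, &generic); err != nil {
		panic(err)
	}
	fmt.Println(generic["name"], generic["age"])

	// Option 2: known shape -- unmarshal straight into a struct.
	var p Person
	if err := json.Unmarshal(payload, &p); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}

For truly unknown schemas there are also third-party tools that generate Go struct definitions from a sample JSON document, but the standard-library route above is often enough.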

RavenDB - HTTP request to return data in format other than CSV or JSON

I'm running RavenDB v3.0. According to the RavenDB documentation, you are able to access an HTTP link to export a list of documents in CSV format. I've followed the instructions and can generate the export by connecting to an address similar to their example:
http://my-server/databases/db-name/streams/query/DocumentsForExtract?resultsTransformer=TransformForExtract&format=excel
The above URL returns the extract in CSV format. If I remove the format parameter from the request, or change it to anything else, it returns JSON. I want to know whether there are any other formats available. I'd like to get it in XML if possible, but I can't seem to find any documentation about this, which is why I'm asking here on SO.
Thanks in advance.
No, that endpoint supports only CSV and JSON.
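For reference, consuming the CSV export from code is just a plain HTTP GET against that URL; here is a rough Go sketch, where the server, database, and transformer names are the placeholders from the question's example URL:

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// URL taken from the question's example; server and database names are placeholders.
	url := "http://my-server/databases/db-name/streams/query/DocumentsForExtract" +
		"?resultsTransformer=TransformForExtract&format=excel"

	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("extract.csv")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Stream the CSV response straight to disk.
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}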

Apache Spark Read One Complex JSON File Per Record RDD or DF

I have an HDFS directory full of the following JSON file format:
https://www.hl7.org/fhir/bundle-transaction.json.html
What I am hoping to do is find an approach to flatten each individual file so that it becomes one DataFrame record or RDD tuple. I have tried everything I could think of using read.json(), wholeTextFiles(), etc.
If anyone has any best practices advice or pointers, it would be sincerely appreciated.
Load via wholeTextFiles, something like this:
sc.wholeTextFiles(...)                              // RDD[(fileName, jsonString)]
  .map { case (_, json) => processJSON(json) }      // RDD[JsonObject]
Then, you can simply call the .toDF method so that it will infer the schema from your JsonObject.
As far as the processJSON method goes, you could just use something like the Play JSON parser.
mapPartitions is used when you have to deal with data that is structured in such a way that different elements can be on different lines. I've worked with both JSON and XML using mapPartitions.
mapPartitions works on an entire block of data at a time, as opposed to a single element. While you should be able to use the DataFrameReader API with JSON, mapPartitions can definitely do what you'd like. I don't have the exact code to flatten a JSON file, but I'm sure you can figure it out. Just remember the output must be an iterable type.

Javascript in place of json input step

I am loading data from a MongoDB collection into a MySQL table through a Kettle transformation.
First I extract the data using MongodbInput, and then I use the JSON Input step.
But since the JSON Input step has very low performance, I wanted to replace it with a JavaScript script.
I am a beginner in JavaScript, and even though I tried some things, the Kettle JavaScript step is not recognizing any keywords.
Can anyone give me sample code to convert JSON data into different columns using JavaScript?
To solve your problem you need to look at three aspects:
Reading from MongoDB
Reading from JSON
Reading from (probably) String
Reading from MongoDB: unless you changed the interface, MongoDB returns not JSON but BSON (~binary JSON). You need to check the MongoDB documentation about reading and writing BSON: probably something like BSON.to() and BSON.from(), but I don't know it by heart.
Reading from JSON: once you have your BSON as a JSON object, you can serialize it with JSON.stringify(), which returns a String.
Reading from (probably) String: if you want to use the capabilities of JSON (why else would you use JSON?), you also want to use JSON.parse(), which returns a JSON object.
My experience is that using a String is not a bad way to send a JSON object from one step to the other, i.e. at the end of a JavaScript step you write your JSON object to a String, and at the beginning of the next JavaScript step (which can be further down the stream) you parse it back to JSON to work with it.
I hope this answers your question.
PS: writing JavaScript steps requires you to learn JavaScript. You don't have to be a master, but the basics are required. There is no way around it.
You could use the JSON Input step to get the values of this JSON and put them into regular rows.

Logging Multiple JSON Objects to a Single File - File Format

I have a solution where I need to be able to log multiple JSON objects to a file, essentially one log file per day. What is the easiest way to write (and later read) them from a single file?
How does MongoDB handle this with BSON? What does it use as a separator between "records"?
Do Protocol Buffers, BSON, MessagePack, etc. offer compression and the record concept? Compression would be a nice benefit.
With protocol buffers you could define the message as follows:
message JSONObject {
  required string JSON = 1;
}

message DailyJSONLog {
  repeated JSONObject JSON = 1;
}
This way you would just read the file into memory and deserialize it. It's essentially the same for serializing them as well. Once you have the file (a serialized DailyJSONLog) on disk, you can easily append serialized JSONObjects to the end of that file (since the DailyJSONLog message is simply a repeated field).
The only issue with this is if you have a LOT of messages each day, or if you want to start at a certain location during the day (you can't easily seek to the middle, or to an arbitrary position, of the repeated list).
I've gotten around this by taking a JSONObject, serializing it, and then base64 encoding it. I'd store these in a file, one per line. This lets you very easily see how many records are in each file, access any arbitrary JSON object within the file, and trivially keep expanding the file (you can expand the above 'repeated' message pretty trivially as well, but it is a one-way easy operation...).
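This is only a hedged sketch of that base64-plus-newline layout, not the original code; the appendRecord/readRecords helpers and the file name are made up, and the records here are plain JSON bytes standing in for serialized JSONObject messages:

package main

import (
	"bufio"
	"encoding/base64"
	"fmt"
	"os"
)

// appendRecord base64-encodes one serialized record (e.g. a serialized
// JSONObject message) and appends it to the log file as a single line.
func appendRecord(path string, record []byte) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, base64.StdEncoding.EncodeToString(record))
	return err
}

// readRecords reads the file back, decoding one base64 line per record.
func readRecords(path string) ([][]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var records [][]byte
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		raw, err := base64.StdEncoding.DecodeString(scanner.Text())
		if err != nil {
			return nil, err
		}
		records = append(records, raw)
	}
	return records, scanner.Err()
}

func main() {
	// Append two records, then read the whole day's log back.
	_ = appendRecord("daily.log", []byte(`{"event":"login"}`))
	_ = appendRecord("daily.log", []byte(`{"event":"logout"}`))

	recs, err := readRecords("daily.log")
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Println(string(r))
	}
}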
Compression is a different topic. Protocol Buffers will not compress strings. If you were to define a pb message that matches your JSON message, then you would get the benefit of pb possibly 'compressing' any integers into their varint encoded format. You will get less compression if you take the base64 encoding route above as well.