How to pass data from a FSharp.Data.CsvProvider to a Deedle.Frame?

I'm trying to pass data from a FSharp.Data.CsvProvider to a Deedle.Frame.
I'm fairly new to F#, and I need to convert some CSV files from the "it-IT" culture to "en-US" so I can use the data.
I found Deedle and I want to learn how to use it, but I was not able to get Deedle to convert the data from a CSV file directly (at least, that's what F# Interactive prints).
I noticed that the CsvProvider handles the conversion, but after several days of attempts I am still not able to pass the data across.

I believe that Deedle should be able to deal with CSV files that use a non-US culture:
let frame = Frame.ReadCsv("C:\\test.csv", culture="it-IT")
That said, if you want to use the CSV type provider for some reason, you can use:
let cs = new CsvProvider<"C:/data/fb.csv">()
cs.Rows
|> Frame.ofRecords
|> Frame.indexColsWith cs.Headers.Value
This uses Frame.ofRecords, which creates a data frame from any .NET collection and expands the properties of the objects as columns. The CSV provider represents data as tuples, so this does not name the headers correctly, but the Frame.indexColsWith function lets you name the headers explicitly.

Related

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and, using a token within the JSON response, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that data frame to a master data frame, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is then that the unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, but reading a 758 MB JSON file would be a huge undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])
# start from an empty DataFrame that carries the schema
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.json('...your JSON file...')
# unionByName returns a new DataFrame, so keep the result
output = output.unionByName(df_json, allowMissingColumns=True)
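Alternatively, as the question suggests at the end, you can pass the schema straight to the reader so that every file is read with the same types no matter what values happen to appear; a minimal sketch, reusing my_schema from above (the glob path is a hypothetical example):
# read every file with the fixed schema instead of letting Spark infer it
df_all = spark.read.schema(my_schema).json('/mnt/landing/*.json')
Reading all the files in one pass this way also avoids the per-file union loop entirely.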
I'm not sure this is what you are looking for, but I hope it helps.

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
    format = 'csv',
    fieldterminator = '0x0b',
    fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file in doc is not well formed as a single JSON document. Using SINGLE_CLOB would force us to do more work after the openrowset, before we could use the content as JSON (since it is not valid JSON, we would need to parse the value). It can be done, but it would probably require more work.
The format of the file is multiple JSON-like strings, each on a separate line: "line-delimited JSON", as the document calls it.
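For illustration, a line-delimited JSON file looks something like this (the field names and values here are made-up examples, not the actual columns of ecdc_cases.jsonl):
{"date_rep":"2020-07-24","country":"Afghanistan","cases":112}
{"date_rep":"2020-07-25","country":"Afghanistan","cases":89}
Each line is a complete JSON document on its own, which is why the fieldterminator/fieldquote trick above can read every line as a single "csv" value and hand it to the JSON functions afterwards.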
By the way, if you check the history of the document on GitHub, you will find that originally this was not the case. As far as I remember, the file originally included a single JSON document with an array of objects (it was wrapped with [] after loading). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of the authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type named JSON. What we have in SQL Server are tools, like functions, which work on text and provide support for strings in a JSON-like format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated values file. You can (and should) use it whenever your file is well formatted as comma-separated values (also commonly known as a CSV file).
In this specific sample in the document, the values in the csv file are strings, each of which has a valid JSON format. Only after you read the file using openrowset do we start to parse the content of the text as JSON.
Notice that only after the title "Parse JSON documents" does the document start to speak about parsing the text as JSON.

How do I read a Large JSON Array File in PySpark

Issue
I recently encountered a challenge in Azure Data Lake Analytics when I attempted to read in a large UTF-8 JSON array file, and switched to HDInsight PySpark (v2.x, not 3) to process the file. The file is ~110 GB and has ~150M JSON objects.
HDInsight PySpark does not appear to support the array-of-JSON file format for input, so I'm stuck. Also, I have "many" such files, with a different schema in each and hundreds of columns each, so creating the schemas for those is not an option at this point.
Question
How do I use out-of-the-box functionality in PySpark 2 on HDInsight to enable these files to be read as JSON?
Thanks,
J
Things I tried
I used the approach at the bottom of this page from Databricks, which supplied the below code snippet:
import json
df = sc.wholeTextFiles('/tmp/*.json').flatMap(lambda x: json.loads(x[1])).toDF()
display(df)
I tried the above, not understanding how "wholeTextFiles" works, and of course ran into OutOfMemory errors that killed my executors quickly.
I attempted loading into an RDD and other open methods, but PySpark appears to support only the JSON Lines file format, and I have an array of JSON objects due to ADLA's requirement for that file format.
I tried reading it in as a text file, stripping the array characters, splitting on the JSON object boundaries, and converting to JSON like the above, but that kept giving errors about being unable to convert unicode and/or str(ings).
I found a way through the above, and converted it to a dataframe containing one column with rows of strings that were the JSON objects. However, I did not find a way to output only the JSON strings from the dataframe rows to an output file by themselves. They always came out as
{'dfColumnName':'{...json_string_as_value}'}
I also tried a map function that accepted the above rows, parsed them as JSON, extracted the values (the JSON I wanted), then parsed the values as JSON. This appeared to work, but when I tried to save, the RDD was of type PipelineRDD and had no saveAsTextFile() method. I then tried the toJSON method, but kept getting errors about "found no valid JSON Object", which admittedly I did not understand, and of course other conversion errors.
I finally found a way forward. I learned that I could read JSON directly from an RDD, including a PipelineRDD. I found a way to remove the Unicode byte-order mark and the wrapping array square brackets, split the JSON objects on a fortunate delimiter, and get a distributed dataset for more efficient processing. The output dataframe now has columns named after the JSON elements, infers the schema, and dynamically adapts to other file variants.
Here is the code, hope it helps!
# Spark considers arrays of JSON objects to be an invalid format,
# and unicode files are prefixed with a byte-order marker.
# 'partitions' is the desired minimum number of partitions.
thanksMoiraRDD = sc.textFile('/a/valid/file/path', partitions).map(
    lambda x: x.encode('utf-8', 'ignore').strip(u",\r\n[]\ufeff")
)
df = sqlContext.read.json(thanksMoiraRDD)

convert excel to json file

I'm new to creating JSON and have basic knowledge of Java.
I'm trying to convert database table data to JSON.
I have the option to store the table data in any file format and then convert that into JSON.
Here is my table data.
Table: PKGS
Price, pd, Id, Level
1 , 266 , 59098 , 5
2 , 247 , 59098 , 5
I want my table data in this JSON format. It's just an example, to show the level in JSON:
"Id":59098
"pd":266
"Level":5
"price":1
"Id":59098
"pd":247
"Level":5
"price":2
If I am not wrong, there are two loops going on in this JSON. I was able to do it for one loop in ETL, but couldn't do it for two loops.
I am not getting values for reimbursementId and packageId.
I have googled a lot but couldn't find any code I could properly understand, or an approach for the same.
Here is the code I tried a little bit:
FileInputStream inp = new FileInputStream("D:/json.xlsx");
Workbook workbook = WorkbookFactory.create(inp);
Sheet sheet = workbook.getSheetAt(0);
JSONObject json = new JSONObject();
JSONArray rows = new JSONArray();
but I don't know what to do next!
Can anyone tell me how to do this?
I advise you to use a specific tool. This is not a brand-new problem.
Try Talend Open Studio. It is not so complicated to use if you want to convert a file (CSV, JSON, database directly, etc.) to another format. Please see TalendForge for the basics.
In your case, you can connect to your database and send all the data as JSON.
Edit:
Your representation does not follow the logic of JSON. Here is how I see it (and this is probably wrong, because I can't fully understand your example).
If you just want Excel to JSON without any changes:
{
  "rows": [
    {
      "Price": "1",
      "pd": "266",
      "Id": "59098",
      "Level": "5"
    },
    {
      "Price": "1",
      "pd": "266",
      "Id": "59098",
      "Level": "5"
    },
    //and again and again
    {
      "Price": "2",
      "pd": "247",
      "Id": "59098",
      "Level": "5"
    }
  ]
}
If you want to reorganize, then define what you want. Try to model a sample of your data as a Java object using ArrayList, int, String and subclasses, or even better as a JavaScript object.
For the last example it will give you:
public class myJson {
    ArrayList<myObject> rows;
}
with
public class myObject {
    String Price;
    String pd;
    String Id;
    String Level; // or int, or date, or whatever
}
If you want to reorganize your data model, please give us this model.
Converting Excel file data to JSON format is a bit of a complex process; it depends on the structure of the data, and we do not have an exact online tool for it so far...
Custom code is required. There are various technologies available, but the best suited is probably VBA, because VBA runs within Excel and can generate the JSON file quickly, and the code is flexible to edit compared to any other technology that would have to automate Excel, import the data and then process it.
There are various websites that provide code to generate JSON from Excel data; here is one such site that looks to have expertise in this area: http://www.xlvba.net/tools/excel-automation-to-convert-excel-data-to-json-format.html

Json-Opening Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's Dataset Challenge data set; however, I cannot open it, since it is in JSON format and almost 2 GB. On its website it is said that the dataset can be opened in Python using mrjob, but I am also not very good with programming. I searched online and looked at some of the code Yelp provided on GitHub, but I couldn't find an article or anything that clearly explains how to open the dataset.
Can you please tell me, step by step, how to open this file and maybe how to convert it to CSV?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format. When you extract it, it has another file inside; rename that to .tar and extract it again. You will get all the JSON files.
Yes, you can use pandas. Take a look:
import pandas as pd

# read the entire file into a python list, one JSON document per line
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]

# wrap the documents in a JSON array so the whole string is valid JSON
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the yelp data ;)
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it can help you.
To process huge JSON files, use a streaming parser.
Many of these files aren't a single JSON document, but a stream of JSON documents, one per line (known as "JSON Lines" format). A regular JSON parser will then consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and write them to the desired output, then continue reading.
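As a minimal sketch of that idea in Python (the file name and the selected fields business_id, stars and text are illustrative assumptions, not a fixed mapping):
import csv
import json

# stream the file one JSON document per line instead of loading ~2 GB at once
with open('yelp_academic_dataset_review.json', 'r') as src, \
        open('reviews.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['business_id', 'stars', 'text'])  # hypothetical columns
    for line in src:
        record = json.loads(line)  # parse a single JSON document
        writer.writerow([record.get('business_id'),
                         record.get('stars'),
                         record.get('text')])
Because only one line is held in memory at a time, this scales to files far larger than RAM.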
There is no single JSON-to-CSV conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that a JSON document is a tree, but a CSV file is not. There is no universal and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
Start coding, to become a better programmer. To succeed with such amounts of data, you need to become a better programmer.