Python loading multiple JSON values on one line - json

I have a file (not a valid JSON file) that looks similar to this:
[[0,0,0],[0,0,0],[0,0,0]]["testing", "foo", "bar"]
These are two (or more), non-delimited, valid JSON values that I need to load in from STDIN. I tried just using the following (in Python 3.7):
import sys
from json import loads

for line in sys.stdin:
    stripped = line.strip()
    if not stripped: break
    x = loads(stripped)
But that gave the error
json.decoder.JSONDecodeError: Extra data: line 1 column 118 (char 117)
which makes sense, as it can only load one JSON value at a time. How would I go about loading in multiple of these values from STDIN when they are not delimited? Is there a way to check if the JSON loader successfully completed a load and then start another one from the same line?
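One way to do this is json.JSONDecoder.raw_decode, which parses a single JSON value from a string and also returns the index where that value ended, so you can keep decoding the rest of the same line. A minimal sketch (the print is just illustrative):
import sys
import json

decoder = json.JSONDecoder()

for line in sys.stdin:
    stripped = line.strip()
    if not stripped:
        break
    idx = 0
    while idx < len(stripped):
        # raw_decode returns the parsed value and the index where parsing stopped
        value, end = decoder.raw_decode(stripped, idx)
        print(value)
        # skip any whitespace between concatenated values
        while end < len(stripped) and stripped[end].isspace():
            end += 1
        idx = end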

Related

How can I get ImageDatasetImportDataOp to update labels?

In a Vertex AI pipeline I am updating an image dataset, thus:
ds_op = gcc_aip.ImageDatasetImportDataOp(
    project=project,
    dataset=get_dataset_id_op.outputs['dataset'],
    gcs_source=DATASET_PATH,
    import_schema_uri=aiplatform.schema.dataset.ioformat.image.single_label_classification
)
I have tried adding images, updating the CSV file with their paths and labels, and uploading it to GCS. Then I run the pipeline; the images are uploaded to the dataset, but their labels are ignored and they are classed as Unlabeled. What am I doing wrong? TIA!
UPDATE: I am trying to use 'data_item_labels (JsonObject): Labels that will be applied to newly imported DataItems.' but I don't know what format is expected. I have tried JSON, CSV, JSON Lines, etc., but keep getting
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
errors.
UPDATE 2: I finally figured out I should be passing a JSON object, not a file URI, but I have tried everything I can think of and I either get JSON errors or "Invalid data_item_labels.".

Reading a .dat file in Julia, issues with variable delimiter spacing

I am having issues reading a .dat file into a dataframe. I think the issue is with the delimiter. I have included a screen shot of what the data in the file looks like below. My best guess is that it is tab delimited between columns and then new-line delimited between rows. I have tried reading in the data with the following commands:
df = CSV.File("FORCECHAIN00046.dat"; header=false) |> DataFrame!
df = CSV.File("FORCECHAIN00046.dat"; header=false, delim = ' ') |> DataFrame!
My result either way is just a DataFrame with only one column, with all the data from each column concatenated into one string. I even tried to specify the types with the following code:
df = CSV.File("FORCECHAIN00046.dat"; types=[Float64,Float64,Float64,Float64,
Float64,Float64,Float64,Float64,Float64,Float64,Float64,Float64]) |> DataFrame!
And I received the following error:
┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
I can work around this by uploading it into google sheets and then downloading a csv, but I would like to find a way to make the original .dat file work.
Part of the issue here is that .dat is not a proper file format—it's just something that seems to be written out in a somewhat human-readable format with columns of numbers separated by variable numbers of spaces so that the numbers line up when you look at them in an editor. Google Sheets has a lot of clever tricks built in to "do what you want" for all kinds of ill-defined data files, so I'm not too surprised that it manages to parse this. The CSV package on the other hand supports using a single character as a delimiter or even a multi-character string, but not a variable number of spaces like this.
Possible solutions:
if the files aren't too big, you could easily roll your own parser that splits each line and then builds a matrix
you can also pre-process the file turning multiple spaces into single spaces
That's probably the easiest way to do this and here's some Julia code (untested since you didn't provide test data) that will open your file and convert it to a more reasonable format:
function dat2csv(dat_path::AbstractString, csv_path::AbstractString)
    open(csv_path, write=true) do io
        for line in eachline(dat_path)
            join(io, split(line), ',')
            println(io)
        end
    end
    return csv_path
end

function dat2csv(dat_path::AbstractString)
    base, ext = splitext(dat_path)
    ext == ".dat" ||
        throw(ArgumentError("file name doesn't end with `.dat`"))
    return dat2csv(dat_path, "$base.csv")
end
You would call this function as dat2csv("FORCECHAIN00046.dat") and it would create the file FORCECHAIN00046.csv, which would be a proper CSV file using commas as delimiters. That won't work well if the files contain any values with commas in them, but it looks like they are just numbers, in which case it should be fine. So you can use this function to convert the files to proper CSV and then load that file with the CSV package.
A little explanation of the code:
the two-argument dat2csv method opens csv_path for writing and then calls eachline on dat_path to read one line from it at a time
eachline strips any trailing newline from each line, so each line will be a bunch of numbers separated by whitespace, possibly with some leading and/or trailing whitespace
split(line) does the default splitting of line which splits it on whitespace, dropping any empty values—this leaves just the non-whitespace entries as strings in an array
join(io, split(line), ',') joins the strings in the array together, separated by the , character and writes that to the io write handle for csv_path
println(io) writes a newline after that—otherwise everything would just end up on a single very long line
the one-argument dat2csv method calls splitext to split the file name into a base name and an extension, checking that the extension is the expected .dat and calling the two-argument version with the .dat replaced by .csv
Try using the readdlm function from the DelimitedFiles standard library, and convert to a DataFrame afterwards:
using DelimitedFiles, DataFrames
df = DataFrame(readdlm("FORCECHAIN00046.dat"), :auto)

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column

I have a json file:
{
  "a": {
    "b": 1
  }
}
I am trying to read it:
val path = "D:/playground/input.json"
val df = spark.read.json(path)
df.show()
But getting an error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Since Spark 2.3, the queries from raw JSON/CSV files are disallowed
when the referenced columns only include the internal corrupt record
column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and
spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the
same query. For example, val df =
spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
So I tried to cache it as they suggest:
val path = "D:/playground/input.json"
val df = spark.read.json(path).cache()
df.show()
But I keep getting the same error.
You may try either of these two ways.
Option-1: JSON in single line as answered above by #Avishek Bhattacharya.
Option-2: Add the option to read multiline JSON in the code as follows. You can also read the nested attribute, as shown below.
val df = spark.read.option("multiline","true").json("C:\\data\\nested-data.json")
df.select("a.b").show()
Here is the output for Option-2.
20/07/29 23:14:35 INFO DAGScheduler: Job 1 finished: show at NestedJsonReader.scala:23, took 0.181579 s
+---+
| b|
+---+
| 1|
+---+
The problem is with the JSON file. The file "D:/playground/input.json" looks like this, as you described:
{
  "a": {
    "b": 1
  }
}
This is not right. Spark, while processing JSON data, considers each new line to be a complete JSON document, so it fails here.
You should keep your complete JSON on a single line in compact form, removing all whitespace and newlines (see the short conversion sketch after this answer).
Like
{"a":{"b":1}}
If you want multiple JSON objects in a single file, keep them like this, one per line:
{"a":{"b":1}}
{"a":{"b":2}}
{"a":{"b":3}} ...
For more info, see the Spark documentation on reading JSON files.
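As a side note, here is a minimal Python sketch of that conversion (the file names are placeholders, not from the question): read the pretty-printed file with json.load and write it back with json.dumps using compact separators.
import json

# Placeholder paths; adjust to your own files.
with open("input.json") as src:
    obj = json.load(src)

with open("input_compact.json", "w") as dst:
    # separators=(",", ":") drops the spaces json.dumps adds by default,
    # producing one compact line that Spark can read without the multiline option
    dst.write(json.dumps(obj, separators=(",", ":")))
    dst.write("\n")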
This error means one of two things:
1- either your file format isn't what you think it is (and you are using the wrong method for it, e.g. it's plain text but you mistakenly used the JSON method), or
2- your file doesn't follow the standard for the format you are using (even though you used the correct method for that format); this usually happens with JSON.

Python - Tweet data(Json): How to filter out broken tweet? (the line is not in json format)

I am working on an assignment to load tweet data (JSON) into Python.
One of the tasks is to filter out broken tweets (lines that are not in JSON format).
My first code:
with open('./hw2-files-10mb.txt') as json_file:
    data = json.load(json_file)
Output: JSONDecodeError: Extra data: line 2 column 1 (char 3979)
I added a try/except; what I want the program to do is, when a line is not in JSON format, skip/filter that line and move on to load the next one.
with open('./hw2-files-10mb.txt') as json_file:
    lines = json_file.readlines()
    for line in lines:
        try:
            data = json.load(json_file)
        except ValueError:
            pass
However, my output is only one tweet. It seems like the code stops running when an error occurs. Please advise. Thanks and have a nice weekend.
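The likely fix is to parse each line with json.loads instead of calling json.load on the file handle inside the loop. A minimal sketch, assuming each tweet sits on its own line:
import json

tweets = []
with open('./hw2-files-10mb.txt') as json_file:
    for line in json_file:
        line = line.strip()
        if not line:
            continue
        try:
            # json.loads parses this one line; json.load would try to
            # consume the whole (already partially read) file handle
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            # broken tweet: skip it and move on to the next line
            continue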

How to open and read a JSON file?

I have a JSON file, but it weighs 186 MB. I am trying to read it with Python.
import json
f = open('file.json','r')
r = json.loads(f.read())
ValueError: Extra data: line 88 column 2 -...
FILE
How to open it? Help me
Your JSON file isn't a JSON file, it's several JSON files mashed together.
The first instance of this occurs in the 1630070th character:
'шова"}]}]}{"response":[{"count'
           ^ here
That said, jq appears to be able to handle it, so the individual parts are fine.
You'll need to split the file at the boundaries of the individual JSON objects. Try catching the JSONDecodeError and use its .colno to slice the text into correct chunks.
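A minimal sketch of that idea, assuming the only problem is several documents mashed together (it uses the decode error's .pos attribute, the absolute character offset, rather than .colno):
import json

# Read the whole file as one string (fine for a sketch, even at ~186 MB).
with open('file.json') as f:
    text = f.read()

objects = []
while text:
    try:
        objects.append(json.loads(text))
        break  # the remaining text was a single valid JSON document
    except json.JSONDecodeError as err:
        # For "Extra data" errors, err.pos is where the next document starts,
        # so text[:err.pos] is one complete document on its own.
        objects.append(json.loads(text[:err.pos]))
        text = text[err.pos:].lstrip()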
It should be:
r = json.load(f)