How can I read JSON files inside a tar.gz archive?

I have a huge tar.gz file. Inside this file there are 1000 JSON files. How can I read them? I did this:
import tarfile
file = tarfile.open('normalizer_output.tar.gz')
Then if I run file.list() I see the names of these JSON files, like the following:
?rw-rw-r-- ubuntu/ubuntu 6471545 2022-06-02 09:25:53 output/normalized-1054.json
?rw-rw-r-- ubuntu/ubuntu 6535150 2022-06-02 09:26:06 output/normalized-1055.json
?rw-rw-r-- ubuntu/ubuntu 6690476 2022-06-02 09:26:15 output/normalized-1056.json
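A minimal sketch of one way to read them, iterating over the archive's members in memory and parsing each JSON entry (the archive name is taken from the snippet above):
import json
import tarfile

parsed = []
with tarfile.open('normalizer_output.tar.gz', 'r:gz') as tar:
    for member in tar.getmembers():
        # Keep only regular files ending in .json, as in the listing above.
        if member.isfile() and member.name.endswith('.json'):
            # extractfile() returns a file object; nothing is written to disk.
            with tar.extractfile(member) as fh:
                parsed.append(json.load(fh))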

Related

Spark can't get delimiter for CSV file

I have a CSV file. Read with pandas it parses correctly (screenshot: "CSV read by pandas"), but when I read it with PySpark the columns come out wrong (screenshot: "CSV read by PySpark").
What's wrong with the delimiter in Spark and how can I fix it?
From the posted images, %2C, which is the URL-encoded equivalent of ,, seems to be your delimiter.
Set the delimiter to %2C and also use the header option:
df = spark.read.option("header",True).option("delimiter", "%2C").csv(path)
Input CSV File:
date%2Copening%2Chigh%2Clow%2Cclose%2Cadjclose%2Cvolume
2022-12-09%2C100%2C101%2C99%2C99.5%2C99.5%2C10000000
2022-12-09%2C200%2C202%2C199%2C199%2C199.1%2C20000000
2022-12-09%2C300%2C303%2C299%2C299%2C299.2%2C30000000
Output dataframe:
+----------+-------+----+---+-----+--------+--------+
|date |opening|high|low|close|adjclose|volume |
+----------+-------+----+---+-----+--------+--------+
|2022-12-09|100 |101 |99 |99.5 |99.5 |10000000|
|2022-12-09|200 |202 |199|199 |199.1 |20000000|
|2022-12-09|300 |303 |299|299 |299.2 |30000000|
+----------+-------+----+---+-----+--------+--------+

Make a list of lists out of the header from a CSV file

I want to put the header of the csv file in a nested list.
It should have an output like this:
[[name], [age], [""], [""]]
How can I do this without reading the line again? (I am not allowed to, and I am also not allowed to use the csv module or pandas; all imports except os are forbidden.)
Just map each item of the list to a list. Check below:
def value_to_list(tlist):
    # Wrap each element in its own single-item list, in place.
    for i in range(len(tlist)):
        tlist[i] = [tlist[i]]
    return tlist

headers = []
with open(r"D:\my_projects\DemoProject\test.csv", "r") as file:
    # strip() drops a trailing newline, if any, before splitting on commas.
    headers = value_to_list(file.readline().strip().split(","))
The test.csv file contains "col1,col2,col3".
Output:
> python -u "run.py"
[['col1'], ['col2'], ['col3']]
>
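For what it's worth, the same transformation can be written as a list comprehension with no helper function (still no imports beyond what the constraints allow):
with open(r"D:\my_projects\DemoProject\test.csv", "r") as file:
    headers = [[h] for h in file.readline().strip().split(",")]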

Dump a list into a JSON file acceptable by Athena

I am creating a JSON file in an s3 bucket using the following code -
def myconverter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()

s3.put_object(
    Bucket='sample-bucket',
    Key="sample.json",
    Body=json.dumps(whole_file, default=myconverter)
)
Here, the whole_file variable is a list.
Sample of the "whole_file" variable -
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
The output "sample.json" file that I get should be in the following format -
{"sample_column1": "abcd","sample_column2": "efgh"}
{"sample_column1": "ijkl","sample_column2": "mnop"}
The output "sample.json" that I am getting is -
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
What changes should be made to get each JSON object in a single line?
You can write each entry to a file, then upload the file object to S3:
import json

whole_file = [{"sample_column1": "abcd", "sample_column2": "efgh"},
              {"sample_column1": "ijkl", "sample_column2": "mnop"}]

# Write one JSON object per line (newline-delimited JSON).
with open("temp.json", "w") as temp:
    for record in whole_file:
        temp.write(json.dumps(record, default=str))
        temp.write("\n")
The output should look like this:
~ cat temp.json
{"sample_column1": "abcd", "sample_column2": "efgh"}
{"sample_column1": "ijkl", "sample_column2": "mnop"}
Then upload the file:
import boto3
s3 = boto3.client("s3")
s3.upload_file("temp.json", bucket, object_name="whole_file.json")

How do you parse JSON files that have incomplete lines?

I have a bunch of files in one directory, each with many entries like this:
{"DateTimeStamp":"2017-07-20T21:52:00.767-0400","Host":"Server","Code":"test101","use":"stats"}
I need to be able to read each file and form a data frame from the JSON entries. Sometimes the lines in a file may be incomplete, and my script fails. How can I modify this script to account for incomplete lines in the files:
path<-c("C:/JsonFiles")
filenames <- list.files(path, pattern="*Data*", full.names=TRUE)
dflist <- lapply(filenames, function(i) {
jsonlite::fromJSON(
paste0("[",
paste0(readLines(i),collapse=","),
"]"),flatten=TRUE
)
})
mq<-rbindlist(dflist, use.names=TRUE, fill=TRUE)

Weka: file not recognized as csv data files

My line 1 is:
column0,column1,column2,column3,column4,column5,column6,column7,column8,column9,column10,column11,column12,column13,column14,column15,column16,column17,column18,column19,column20
Line 2 is:
225,1,9d36efa8d56c724ceb5b8834873d5457,38.69.182.103,,,,,,3,62930,0,,,,,6f4b457b6044ccd205dcf5531582af54,Apache-HttpClient%2fUNAVAILABLE%20%28java%201.4%29,1646,,160807,1
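One possible culprit (an assumption; the question does not include Weka's error message) is the empty fields: Weka's CSVLoader treats ? as its missing-value string by default. A minimal pre-processing sketch that rewrites empty fields as ? before loading the file into Weka (file names are hypothetical):
import csv

with open("input.csv", newline="") as src, \
     open("cleaned.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Replace empty fields with Weka's missing-value marker.
        writer.writerow([field if field else "?" for field in row])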