Dump a list into a JSON file acceptable to Athena

I am creating a JSON file in an S3 bucket using the following code:
def myconverter(o):
    if isinstance(o, datetime.datetime):
        return o.__str__()

s3.put_object(
    Bucket='sample-bucket',
    Key="sample.json",
    Body=json.dumps(whole_file, default=myconverter)
)
Here, the whole_file variable is a list.
A sample of whole_file:
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
The output "sample.json" file that I get should be in the following format -
{"sample_column1": "abcd","sample_column2": "efgh"}
{"sample_column1": "ijkl","sample_column2": "mnop"}
The output "sample.json" that I am getting is -
[{"sample_column1": "abcd","sample_column2": "efgh"},{"sample_column1": "ijkl","sample_column2": "mnop"}]
What changes should be made to get each JSON object on its own line?

You can write each entry to a local file, one JSON object per line, then upload that file to S3:
import json

whole_file = [{"sample_column1": "abcd", "sample_column2": "efgh"},
              {"sample_column1": "ijkl", "sample_column2": "mnop"}]

with open("temp.json", "w") as temp:
    for record in whole_file:
        temp.write(json.dumps(record, default=str))
        temp.write("\n")
The output should look like this:
~ cat temp.json
{"sample_column1": "abcd", "sample_column2": "efgh"}
{"sample_column1": "ijkl", "sample_column2": "mnop"}
Then upload the file:
import boto3

s3 = boto3.client("s3")
bucket = "sample-bucket"  # the bucket name from the question
s3.upload_file("temp.json", bucket, "whole_file.json")  # arguments: local file, bucket, object key
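If you would rather skip the temporary file, the same newline-delimited layout can be built in memory and sent with put_object. This is only a sketch of the idea above; the bucket name and key are placeholders:
import json
import boto3

s3 = boto3.client("s3")

# One JSON object per line, which is the layout Athena expects.
body = "\n".join(json.dumps(record, default=str) for record in whole_file)

s3.put_object(
    Bucket="sample-bucket",  # placeholder bucket name
    Key="sample.json",
    Body=body,
)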

Related

Make a list of lists out of the header of a CSV file

I want to put the header of the CSV file in a nested list.
It should produce an output like this:
[[name], [age], [""], [""]]
How can I do this without reading the line again? (I am not allowed to, and I also cannot use the csv module or pandas; all imports except os are forbidden.)
Just map each item of the list to a list. See below:
def value_to_list(tlist):
    l = len(tlist)
    for i in range(l):
        tlist[i] = [tlist[i]]
    return tlist

headers = []
with open(r"D:\my_projects\DemoProject\test.csv", "r") as file:
    headers = value_to_list(file.readline().split(","))
The test.csv file contains "col1,col2,col3".
Output:
> python -u "run.py"
[['col1'], ['col2'], ['col3']]
>
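A slightly more compact sketch of the same idea, using only a list comprehension (the file path here is just an example, and the trailing newline is stripped in case the header line ends with one):
# Wrap each header field in its own single-element list.
with open("test.csv", "r") as file:
    headers = [[column] for column in file.readline().rstrip("\n").split(",")]

print(headers)  # [['col1'], ['col2'], ['col3']]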

Loading Multiple CSV files across all subfolder levels with Wildcard file name

I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, creating a list of file names, loading those CSVs into a list of dataframes, and then concatenating that list.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
|-- Subfolder1
|   |-- Subfolder 2
|       |-- X01.csv
|       |-- Y01.csv
|       |-- Z01.csv
|-- Subfolder3
|   |-- Subfolder4
|       |-- X01.csv
|       |-- Y01.csv
|-- Subfolder5
    |-- X01.csv
    |-- Y01.csv
I want to read all "X01.csv" files while reading from Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFilelookup","true").option("header","true").load(filepath)
This code works fine for a single level of subfolders; is there any equivalent of this for multi-level folders? I thought the "recursiveFilelookup" option would look across all levels of subfolders, but apparently that is not how it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help, please.
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv's ability to read a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
to return a list of all CSV files inside any subdirectory of rootpath.
Additionally, to match only the CSV files directly inside rootpath (ignoring its subdirectories), you could use something like:
glob.glob(rootpath + "*.csv")
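If you are on Spark 3.0 or later, you may not need glob at all: the DataFrame reader has a recursiveFileLookup option that scans every subfolder level, and a pathGlobFilter option that filters on file names at any depth. A minimal sketch, with the root path as a placeholder:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# recursiveFileLookup walks all subfolder levels; pathGlobFilter keeps only
# file names matching the pattern (X01.csv, X02.csv, ...).
df = (spark.read
      .option("recursiveFileLookup", "true")
      .option("pathGlobFilter", "X*.csv")
      .option("header", "true")
      .csv("./Root Folder/"))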
Edit
Based on your comments on this answer, does something like this work on Databricks?
from notebookutils import mssparkutils as ms
# databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
from py4j.protocol import Py4JJavaError  # raised by mssparkutils when a path is missing

def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = "",
):
    """Scan a directory and its subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str
        The name of the account used to access the container folders.
        This value is only used when `initial_path` doesn't conform
        with the format: "abfss://<initial_path>@<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}@{account_name}.dfs.core.windows.net/'
    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc
    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            found = [*found, *scan_dir(p, search_str)]
        if search_str.lower() in path.name.lower():
            # print(p.split('.net')[-1])
            found = [*found, p.replace(path.name, "")]
    return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME@ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
The method above worked for me on Azure Synapse.

How to write all the returned data into a JSON file using Python?

How do I write the returned dict into a JSON file?
So far I am able to return the correct data and print it, but when I try to write it into a JSON file it only writes the last record.
I would appreciate any help to fix this.
Example:
Printed data:
[{"file Name": "test1.txt", "searched Word": "th", "number of occurence": 1}]
[{"file Name": "test2.txt", "searched Word": "th", "number of occurence": 1}]
JSON file:
[{"file Name": "test2.txt", "searched Word": "th", "number of occurence": 1}]
code:
import re
import json
import os.path
import datetime
for counter, myLine in enumerate(textList):
    thematch = re.sub(searchedSTR, RepX, myLine)
    matches = re.findall(searchedSTR, myLine, re.MULTILINE | re.IGNORECASE)
    if len(matches) > 0:
        # add one record for the match (add one because line numbers start with 1)
        d[matches[0]].append(counter + 1)
        self.textEdit_PDFpreview.insertHtml(str(thematch))

'''
loop over the selected file and extract 3 values:
==> name of file
==> searched expression
==> number of occurence
'''
for match, positions in d.items():
    listMetaData = [{"file Name": fileName, "searched Word": match, "number of occurence": len(positions)}]
    jsondata = json.dumps(listMetaData)
    print("in the for loop ==>jsondata: \n{0}".format(jsondata))
    '''
    create a folder called 'search_result' that includes all the results of the search as a JSON file;
    the system checks whether the folder exists, and if not it creates the folder;
    inside the folder the file will be created as ==> today_searchResult.js
    '''
    if not os.path.exists("./search_result"):
        try:
            #print(os.mkdir("./search_result"))
            #print(searchResultFoder)
            today = datetime.date.today()
            fileName = "{}_searchResult.json".format(today)
            #fpJ = os.path.join(os.mkdir("./search_result"),fileName)
            #print(fpJ)
            with open(fileName, "w") as jsf:
                jsf.write(jsondata)
            print("finish writing")
        except Exception as e:
            print(e)
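Only the last record survives because jsondata is rebuilt on every loop iteration and the file is opened in write mode each time, so each write replaces the previous content. A minimal sketch of one way to fix it, assuming d and fileName are defined as in the code above: collect all the records in one list, then write the file once after the loop.
import json
import os
import datetime

# Collect one record per match instead of overwriting jsondata each time.
listMetaData = []
for match, positions in d.items():
    listMetaData.append({
        "file Name": fileName,            # name of the searched file, as in the question
        "searched Word": match,
        "number of occurence": len(positions),
    })

# Create the folder if needed, then write the whole list once.
os.makedirs("./search_result", exist_ok=True)
today = datetime.date.today()
outPath = os.path.join("./search_result", "{}_searchResult.json".format(today))
with open(outPath, "w") as jsf:
    json.dump(listMetaData, jsf)
print("finish writing")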

How do you parse JSON files that have incomplete lines?

I have a bunch of files in one directory, each with many entries like this:
{"DateTimeStamp":"2017-07-20T21:52:00.767-0400","Host":"Server","Code":"test101","use":"stats"}
I need to be able to read each file and form a data frame from the JSON entries. Sometimes the lines in a file may not be complete, and my script fails. How can I modify this script to account for incomplete lines in the files?
library(data.table)  # provides rbindlist

path <- c("C:/JsonFiles")
filenames <- list.files(path, pattern = "*Data*", full.names = TRUE)
dflist <- lapply(filenames, function(i) {
  jsonlite::fromJSON(
    paste0("[", paste0(readLines(i), collapse = ","), "]"),
    flatten = TRUE
  )
})
mq <- rbindlist(dflist, use.names = TRUE, fill = TRUE)

Json Parsing in Apache Pig

I have a JSON:
{"Name":"sampling","elementInfo":{"fraction":"3"},"destination":"/user/sree/OUT","source":"/user/sree/foo.txt"}
I found that we are able to load JSON into a Pig script:
A = LOAD 'data.json'
    USING PigJsonLoader();
But how do I parse the JSON in Apache Pig?
--Sampling.pig
--pig -x mapreduce -f Sampling.pig -param input=foo.csv -param output=OUT/pig -param delimiter="," -param fraction='0.05'
--Load data
inputdata = LOAD '$input' using PigStorage('$delimiter');
--Group data
groupedByAll = group inputdata all;
--output into hdfs
sampled = SAMPLE inputdata $fraction;
store sampled into '$output' using PigStorage('$delimiter');
Above is my Pig script.
How do I parse the JSON (each element) in Apache Pig?
I need to take the JSON above as input, parse out its source, delimiter, fraction, and output values, and pass them in as $input, $delimiter, $fraction, and $output respectively.
How can I parse it? Please suggest.
Try this :
--Load data
inputdata = LOAD '/input.txt' using JsonLoader('Name:chararray,elementinfo:(fraction:chararray),destination:chararray,source:chararray');
--Group data
groupedByAll = group inputdata all;
store groupedByAll into '/OUT/pig' using PigStorage(',');
Now your output looks like:
all,{(sampling1,(4),/user/sree/OUT1,/user/sree/foo1.txt),(sampling,(3),/user/sree/OUT,/user/sree/foo.txt)}
In the input file the fraction value, {"fraction":"3"}, is wrapped in double quotes, so I loaded fraction as a chararray and could not run the SAMPLE command directly; that is why I used the script above to get the result.
If you want to perform the SAMPLE operation, cast the fraction data to int and then you will get the result.