I have made a connection to my HDFS using the following code:
import pyarrow as pa
import pyarrow.parquet as pq
fs = pa.hdfs.connect(self.namenode, self.port, user=self.username, kerb_ticket = self.cert)
I'm using the following command to read a Parquet file:
fs.read_parquet()
but there is no read method for regular text files (e.g. a CSV file). How can I read a CSV file using pyarrow?
You need to create a file-like object and use pyarrow's CSV module directly; see pyarrow.csv.read_csv.
You can set up a Spark session to connect to HDFS, then read the file from there:
from pyspark.sql import SparkSession
ss = SparkSession.builder.appName(...).getOrCreate()
csv_file = ss.read.csv('/user/file.csv')
Another way is to open the file first, then read it using csv.read_csv.
Here is what I used in the end:
from pyarrow import csv
file = 'hdfs://user/file.csv'
with fs.open(file, 'rb') as f:
    csv_file = csv.read_csv(f)
I am parsing 350 .txt files containing JSON data using Python. I am able to retrieve 62 of those objects and store them in a MySQL database, but after that I get an error saying JSONDecodeError: Extra data.
Python:
import os
import ast
import json
import mysql.connector as mariadb
from mysql.connector.constants import ClientFlag
mariadb_connection = mariadb.connect(user='root', password='137800000', database='shaproject',client_flags=[ClientFlag.LOCAL_FILES])
cursor = mariadb_connection.cursor()
sql3 = """INSERT INTO shaproject.alttwo (alttwo_id,responses) VALUES """
os.chdir('F:/Code Blocks/SEM 2/DM/Project/350/For Merge Disqus')
current_list_dir=os.listdir()
print(current_list_dir)
cur_cwd=os.getcwd()
cur_cwd=cur_cwd.replace('\\','/')
twoid=1
for every_file in current_list_dir:
    file = open(cur_cwd + "/" + every_file)
    utffile = file.read()
    data = json.loads(utffile)
    for i in range(0, len(data['response'])):
        data123 = json.dumps(data['response'][i])
        tup = (twoid, data123)
        print(sql3 + str(tup))
        twoid += 1
        cursor.execute(sql3 + str(tup) + ";")
        print(tup)
mariadb_connection.commit()
I have searched online and found that multiple dumped JSON objects in one file can result in this error, but I am unable to resolve it.
You want to use glob. Rather than os.listdir(), which is too permissive, use glob to focus on just the *.json files. Print out the name of the file before asking .loads() to parse it. Rename any badly formatted files to .txt rather than .json, in order to skip them.
Note that you can pass the open file directly to .load(), if you wish. Closing open files would be a good thing. Rather than a direct assignment (with no close()!) you would be better off with with:
with open(cur_cwd + "/" + every_file) as file:
    data = json.load(file)
Talking about current current working directory seems
both repetitive and redundant.
It would suffice to call it cwd.
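Putting those suggestions together, a minimal sketch might look like the following (load_responses is a hypothetical helper name, not from the original code):

```python
import glob
import json

def load_responses(pattern):
    """Collect every 'response' entry from the JSON files matching pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        print(path)                 # know which file json.load() is parsing
        with open(path) as f:       # the file is closed automatically
            data = json.load(f)     # pass the open file directly to .load()
        for response in data["response"]:
            rows.append(json.dumps(response))
    return rows
```

Badly formatted files renamed to .txt are skipped automatically, since the *.json pattern no longer matches them.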
In JMeter, using HTTP requests, I'm posting some JSON bundles, and from the responses I'm using a JSR223 PostProcessor to extract data and store it in CSV files, one entry per line. Now, for 10 POST requests I'm getting duplicate data in the CSV file. Is there a way to read back the CSV files and remove duplicate lines using JMeter? The number of lines in the CSV files can be almost 200,000.
e.g. the CSV file looks like:
csvFile1.csv:
line1
line2
duplicateline
...........so on
You can read the file into an ArrayList as
new File('/path/to/file').readLines()
You can remove the duplicate entries using the unique() method as
def lines = file.readLines().unique()
You can write the unique lines back using Writer
Putting everything together:
def file = new File('/path/to/file')
def lines = file.readLines().unique()
file.withWriter { writer ->
    lines.each { line ->
        writer.writeLine(line)
    }
}
Just in case: The Groovy Templates Cheat Sheet for JMeter
Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv and not part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which later on I will read together in Python, but my Python code makes use of the actual CSV names and also needs to have all the single CSV files in a folder (and not a folder of folders).
Any help is appreciated.
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka or snark suggest, this works for small dataframes that fit into the driver's memory. It works for real cases where you want to save aggregated data or a sample of the dataframe. Don't use this method for big datasets.
If you want to use only the Python standard library, this is an easy function that writes to a single file. You don't have to mess with temp files or go through another directory.
import csv

def spark_to_csv(df, file_path):
    """Converts a Spark dataframe to a CSV file"""
    with open(file_path, "w") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
If the result size is comparable to the Spark driver node's free memory, you may have problems converting the dataframe to pandas.
I would tell Spark to save to some temporary location, and then copy the individual CSV file into the desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with Databricks, Spark operates with files like dbfs:/mnt/...; to use Python's file operations on them, you need to change the path to /dbfs/mnt/... or (more native to Databricks) replace shutil.copyfile with dbutils.fs.cp.
A more Databricks-native solution:
TEMPORARY_TARGET="dbfs:/my_folder/filename"
DESIRED_TARGET="dbfs:/my_folder/filename.csv"
spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
temporary_csv = next(f.path for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith('part-'))
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note: if you are working with a Koalas dataframe, you can replace spark_df with koalas_df.to_spark().
For pyspark, you can convert to pandas dataframe and then save it.
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark dataframe API which writes/creates a single file instead of a directory as the result of a write operation.
Both options below will create one single data file inside a directory, along with the standard files (_SUCCESS, _committed, _started).
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1), and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need to write a function in the driver which combines all the data file parts into a single file (cat part-00000* > singlefilename) once the write operation is done.
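A minimal sketch of such a driver-side helper (merge_csv_parts is a hypothetical name; it assumes each part file repeats the header line when the header option was enabled):

```python
import glob
import os

def merge_csv_parts(part_dir, out_path, header=True):
    """Combine Spark's part-* CSV files from part_dir into one file.

    With option("header", "true") each part file carries its own
    header line, so only the first file's header is kept.
    """
    parts = sorted(glob.glob(os.path.join(part_dir, "part-*")))
    with open(out_path, "w") as out:
        for i, part in enumerate(parts):
            with open(part) as f:
                if header and i > 0:
                    next(f, None)  # skip the repeated header line
                out.writelines(f)
```

This runs on the driver's local filesystem; on HDFS or S3 you would swap the open/glob calls for the corresponding filesystem API.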
I had the same problem and used Python's NamedTemporaryFile to solve this.
from tempfile import NamedTemporaryFile

import boto3

s3 = boto3.resource('s3')
with NamedTemporaryFile() as tmp:
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().
Create a temp folder inside the output folder. Copy the file part-00000* under the desired file name to the output folder, then delete the temp folder. Python code snippet to do the same in Databricks:
fpath = output + '/' + 'temp'

def file_exists(path):
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

if file_exists(fpath):
    dbutils.fs.rm(fpath)

df.coalesce(1).write.option("header", "true").csv(fpath)

fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
dbutils.fs.rm(fpath, True)
You can go with pyarrow, as it provides a file pointer for the HDFS file system. You can write your content to that file pointer as with usual file writing. Code example:
import pyarrow.fs as fs
HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initiate pyarrow you should define the CLASSPATH environment variable properly: set it to the output of hadoop classpath --glob (e.g. export CLASSPATH=$(hadoop classpath --glob)).
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to give the name of the CSV every time, you can write a UDF or create an array of the CSV file names and pass it in; it will work.
I've run a Spark job via Databricks on AWS, and by calling
big_old_rdd.saveAsTextFile("path/to/my_file.json")
have saved the results of my job into an S3 bucket on AWS. The result of that spark command is a directory path/to/my_file.json containing portions of the result:
_SUCCESS
part-00000
part-00001
part-00002
and so on. I can copy those part files to my local machine using the AWS CLI with a relatively simple command:
aws s3 cp s3://my_bucket/path/to/my_file.json local_dir --recursive
and now I've got all those part-* files locally. Then I can get a single file with
cat $(ls part-*) > result.json
The problem is that this two-stage process is cumbersome and leaves file parts all over the place. I'd like to find a single command that will download and merge the files (ideally in order). When dealing with HDFS directly this is something like hadoop fs -cat "path/to/my_file.json/*" > result.json.
I've looked around through the AWS CLI documentation but haven't found an option to merge the file parts automatically, or to cat the files. I'd be interested in either some fancy tool in the AWS API or some bash magic that will combine the above commands.
Note: Saving the result into a single file via spark is not a viable option as this requires coalescing the data to a single partition during the job. Having multiple part files on AWS is fine, if not desirable. But when I download a local copy, I'd like to merge.
This can be done with a relatively simple function using boto3, the AWS python SDK.
The solution involves listing the part-* objects in a given key, and then downloading each of them and appending to a file object. First, to list the part files in path/to/my_file.json in the bucket my_bucket:
import boto3
bucket = boto3.resource('s3').Bucket('my_bucket')
keys = [obj.key for obj in bucket.objects.filter(Prefix='path/to/my_file.json/part-')]
Then, use Bucket.download_fileobj() with a file opened in append mode to write each of the parts. The function I'm now using, with a few other bells and whistles, is:
from os.path import basename

import boto3

def download_parts(base_object, bucket_name, output_name=None, limit_parts=0):
    """Download all file parts into a single local file"""
    base_object = base_object.rstrip('/')
    bucket = boto3.resource('s3').Bucket(bucket_name)
    prefix = '{}/part-'.format(base_object)
    output_name = output_name or basename(base_object)
    with open(output_name, 'ab') as outfile:
        for i, obj in enumerate(bucket.objects.filter(Prefix=prefix)):
            bucket.download_fileobj(obj.key, outfile)
            if limit_parts and i >= limit_parts:
                print('Terminating download after {} parts.'.format(i))
                break
        else:
            print('Download completed after {} parts.'.format(i))
Downloading the parts may take an extra line of code, but as far as cat'ing them in order, you can do it by time created or alphabetically.
Combined in order of time created (oldest first): cat $(ls -tr part-*) > outputfile
Combined & Sorted alphabetically: cat $(ls part-* | sort) > outputfile
Combined & Sorted reverse-alphabetically: cat $(ls part-* | sort -r) > outputfile