I have an app that receives CSV-formatted data in 256-byte chunks.
My goal is to have the data imported into MySQL. For now, I save the entire CSV to a temp file and then import it into MySQL using LOAD DATA INFILE ....
It works as expected; my only doubt is whether there is a more efficient way to import the data.
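For reference, here is a minimal sketch of the temp-file-plus-LOAD DATA LOCAL INFILE approach described above, assuming pymysql, hypothetical connection details and table name, and a receive_chunks() generator standing in for the app's 256-byte stream:

import tempfile
import pymysql

# Accumulate the incoming 256-byte chunks into a temporary CSV file
with tempfile.NamedTemporaryFile(mode="wb", suffix=".csv", delete=False) as tmp:
    for chunk in receive_chunks():  # hypothetical generator yielding 256-byte chunks
        tmp.write(chunk)
    tmp_path = tmp.name

# Bulk-load the temp file into MySQL in one statement (assumed connection details)
conn = pymysql.connect(host="localhost", user="user", password="secret",
                       database="mydb", local_infile=True)
with conn.cursor() as cur:
    cur.execute(
        f"LOAD DATA LOCAL INFILE '{tmp_path}' INTO TABLE my_table "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
    )
conn.commit()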
Related
I have got a Parquet file/folder (about 1 GB) that I would like to load into my local Cassandra DB. Unfortunately I could not find any way (except via Spark, in Scala) to load this file directly into Cassandra. If I expand the Parquet file into CSV, it gets far too big for my laptop.
I am setting up a Cassandra DB for a big-data analytics case (I've got about 25 TB of raw data that we need to make searchable fast). Right now I am running some local tests on how to optimally design the keyspaces, indices and tables before moving to Cassandra as a Service on a hyperscaler. Converting the data to CSV is not an option, as it inflates the data too much.
COPY firmographics.company (col1,col2,col3.....) FROM 'C:\Users\Public\Downloads\companies.csv' WITH DELIMITER='\t' AND HEADER=TRUE;
It turns out, as Alex Ott said, it's easy enough to just write this up in Spark. Below is my code:
import findspark
findspark.init()  # must run before importing pyspark so the Spark installation is found

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Spark Exploration App")\
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.11:2.3.2')\
    .getOrCreate()

# Read the Parquet file/folder into a Spark DataFrame
df = spark.read.parquet("/PATH/TO/FILE/")

import time
start = time.time()

# Write the DataFrame to Cassandra, dropping the 'filename' column if present
# (DataFrame.drop silently ignores columns that do not exist)
df.drop('filename').write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="few_com", keyspace="bmbr")\
    .save()

end = time.time()
print(end - start)
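As an optional sanity check (not part of the original snippet), the same connector can read the table back to verify the row count:

# Read the target table back through the Cassandra connector
check_df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="few_com", keyspace="bmbr")\
    .load()
print(check_df.count())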
Suppose I have multiple CSV files in the same directory, and these files all share the same schema:
/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv
I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?
You have two options I can think of:
1) Use the Input File name
Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use it for any grouping/aggregation operations you need to do. This is probably your best option, as it is more aligned with Spark's parallel-processing intent: you tell it what to do and let it figure out how. You can do this with code like the following:
SQL:
SELECT input_file_name() as fname FROM dataframe
Or Python:
from pyspark.sql.functions import input_file_name
newDf = df.withColumn("filename", input_file_name())
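For instance, a hypothetical per-file aggregation on top of that column might look like this:

# Count rows per source file using the added filename column
newDf.groupBy("filename").count().show()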
2) Gzip your CSV files
Gzip is not a splittable compression format. This means that when loading gzipped files, each file will be its own partition.
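As a rough sketch (assuming the CSVs have already been compressed to .csv.gz files), you can verify the resulting partitioning like this:

# Spark decompresses gzip transparently; each .csv.gz file becomes one partition
df = spark.read.csv("/tmp/data/*.csv.gz", header=True)
print(df.rdd.getNumPartitions())  # should equal the number of input files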
I've recently started working with HBase and don't know much about it yet. I have multiple CSV files (around 20,000) and I want to import them into an HBase table in such a way that each file becomes a row in HBase, with the file name as the row key. That means each row of a CSV file is a cell in HBase, which I need to store in a struct data type (of 25 fields). Unfortunately, I have no clue how to approach this. If anyone would be kind enough to give me a tip to get started, I'd appreciate it.
Here is a sample of the csv files:
time, a, b, c, d, ..., x
0.000,98.600,115.700,54.200,72.900,...,0.000
60.000,80.100,113.200,54.500,72.100,...,0.000
120.000,80.000,114.200,55.200,72.900,...,0.000
180.000,80.000,118.400,56.800,75.500,...,0.000
240.000,80.000,123.100,59.600,79.200,...,0.000
300.000,80.000,130.100,61.600,82.500,...,0.000
Thanks,
ImportTsv is a utility that loads data in TSV or CSV format into HBase.
ImportTsv has two distinct usages:
Loading data from TSV or CSV format in HDFS into HBase via Puts.
Preparing StoreFiles to be loaded via completebulkload.
Load data from TSV or CSV format in HDFS into HBase
Below is an example that loads data from an HDFS file into an HBase table. You must first copy the local file to an HDFS folder; then you can load it into the HBase table.
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' -Dimporttsv.columns=HBASE_ROW_KEY,personal_data:name,personal_data:city,personal_data:age personal /test
The above command will run a MapReduce job to load the data from the CSV file into the HBase table.
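If you specifically need the layout from the question (one HBase row per CSV file, keyed by the file name), ImportTsv does not map directly onto that. Below is a rough sketch of one alternative using the happybase Python client, where the host name, table name 'files', column family 'cf', and directory path are all assumptions:

import csv
import os
import happybase

CSV_DIR = '/path/to/csvs'                        # assumed directory holding the ~20000 CSV files
connection = happybase.Connection('hbase-host')  # assumed HBase Thrift server host
table = connection.table('files')                # assumed table with column family 'cf'

for fname in os.listdir(CSV_DIR):
    if not fname.endswith('.csv'):
        continue
    cells = {}
    with open(os.path.join(CSV_DIR, fname), newline='') as f:
        for record in csv.DictReader(f, skipinitialspace=True):
            # One cell per CSV line, qualified by that line's 'time' value,
            # storing the whole line as a comma-joined string
            cells[('cf:' + record['time']).encode()] = ','.join(record.values()).encode()
    # The file name becomes the row key; all of the file's cells go in with a single put
    table.put(fname.encode(), cells)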
Suppose that df is a dataframe in Spark. The way to write df into a single CSV file is
df.coalesce(1).write.option("header", "true").csv("name.csv")
This will write the dataframe into a CSV file contained in a folder called name.csv, but the actual CSV file will be called something like part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv.
I would like to know if it is possible to avoid the folder name.csv and to have the actual CSV file called name.csv instead of part-00000-af091215-57c0-45c4-a521-cd7d9afb5e54.csv. The reason is that I need to write several CSV files which I will later read together in Python, and my Python code relies on the actual CSV file names and also needs all the single CSV files in one folder (not a folder of folders).
Any help is appreciated.
A possible solution is to convert the Spark dataframe to a pandas dataframe and save it as CSV:
df.toPandas().to_csv("<path>/<filename>")
EDIT: As caujka and snark point out, this works only for small dataframes that fit into the driver's memory. It is fine for real cases where you want to save aggregated data or a sample of the dataframe; don't use this method for big datasets.
If you want to use only the Python standard library, this is an easy function that writes to a single file. You don't have to mess with temp files or go through another directory.

import csv

def spark_to_csv(df, file_path):
    """Write a Spark DataFrame to a single CSV file via the driver."""
    with open(file_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=df.columns)
        writer.writeheader()
        # Stream rows through the driver one at a time instead of collecting them all
        for row in df.toLocalIterator():
            writer.writerow(row.asDict())
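Hypothetical usage, with an assumed local output path:

spark_to_csv(df, "/tmp/output/name.csv")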
If the result size is comparable to the Spark driver node's free memory, you may have problems converting the dataframe to pandas.
I would tell Spark to save to some temporary location and then copy the individual CSV file into the desired folder. Something like this:
import os
import shutil
TEMPORARY_TARGET="big/storage/name"
DESIRED_TARGET="/export/report.csv"
df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)
part_filename = next(entry for entry in os.listdir(TEMPORARY_TARGET) if entry.startswith('part-'))
temporary_csv = os.path.join(TEMPORARY_TARGET, part_filename)
shutil.copyfile(temporary_csv, DESIRED_TARGET)
If you work with Databricks, Spark operates on files like dbfs:/mnt/..., and to use Python's file operations on them you need to change the path to /dbfs/mnt/..., or (more natively to Databricks) replace shutil.copyfile with dbutils.fs.cp.
A more Databricks-style solution is here:
TEMPORARY_TARGET = "dbfs:/my_folder/filename"
DESIRED_TARGET = "dbfs:/my_folder/filename.csv"

spark_df.coalesce(1).write.option("header", "true").csv(TEMPORARY_TARGET)

# Find the part file in the temporary folder rather than relying on a fixed listing index
temporary_csv = next(f.path for f in dbutils.fs.ls(TEMPORARY_TARGET) if f.name.startswith('part-'))
dbutils.fs.cp(temporary_csv, DESIRED_TARGET)
Note that if you are working with a Koalas DataFrame, you can replace spark_df with koalas_df.to_spark().
For PySpark, you can convert to a pandas dataframe and then save it:
df.toPandas().to_csv("<path>/<filename.csv>", header=True, index=False)
There is no Spark DataFrame API that writes/creates a single file instead of a directory as the result of a write operation.
Both options below will create one single data file inside a directory, along with the standard files (_SUCCESS, _committed, _started):
1. df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
2. df.repartition(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
If you don't use coalesce(1) or repartition(1), and instead take advantage of Spark's parallelism for writing files, it will create multiple data files inside the directory.
You then need to write a function in the driver that combines all the data file parts into a single file (e.g. cat part-00000* > singlefilename) once the write operation is done, as sketched below.
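A rough sketch of such a driver-side merge in Python (the paths are assumptions, and it accounts for each part file carrying its own header row):

import glob

# Assumed paths: Spark wrote many part files into this directory
PART_DIR = "PATH/FOLDER_NAME/x.csv"
SINGLE_FILE = "PATH/FOLDER_NAME/combined.csv"

with open(SINGLE_FILE, "w") as out:
    for i, part in enumerate(sorted(glob.glob(PART_DIR + "/part-*"))):
        with open(part) as src:
            lines = src.readlines()
        # Keep the header only from the first part file; skip it in the rest
        out.writelines(lines if i == 0 else lines[1:])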
I had the same problem and used Python's NamedTemporaryFile to solve it.
from tempfile import NamedTemporaryFile
import boto3

s3 = boto3.resource('s3')

with NamedTemporaryFile() as tmp:
    # Write the single-partition CSV to the temporary path, then upload it to S3
    df.coalesce(1).write.format('csv').options(header=True).save(tmp.name)
    s3.meta.client.upload_file(tmp.name, S3_BUCKET, S3_FOLDER + 'name.csv')
See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html for more info on upload_file().
Create a temp folder inside the output folder, copy the part-00000* file to the output folder under the desired file name, and delete the temp folder. Below is a Python snippet that does this in Databricks.
fpath = output + '/' + 'temp'

def file_exists(path):
    """Return True if the DBFS path exists, False if it does not."""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as e:
        if 'java.io.FileNotFoundException' in str(e):
            return False
        else:
            raise

# Remove any leftover temp folder, then write the single-partition CSV there
if file_exists(fpath):
    dbutils.fs.rm(fpath, True)
df.coalesce(1).write.option("header", "true").csv(fpath)

# Copy the part file to the output folder under the desired name, then clean up
fname = [x.name for x in dbutils.fs.ls(fpath) if x.name.startswith('part-00000')]
dbutils.fs.cp(fpath + "/" + fname[0], output + "/" + "name.csv")
dbutils.fs.rm(fpath, True)
You can use pyarrow, which provides a file handle for the HDFS file system; you can write your content to that handle just as you would to a regular file. Code example:
import pyarrow.fs as fs

HDFS_HOST: str = 'hdfs://<your_hdfs_name_service>'
FILENAME_PATH: str = '/user/your/hdfs/file/path/<file_name>'

# Connect to HDFS and stream the content directly into a single file
hadoop_file_system = fs.HadoopFileSystem(host=HDFS_HOST)
with hadoop_file_system.open_output_stream(path=FILENAME_PATH) as f:
    f.write("Hello from pyarrow!".encode())
This will create a single file with the specified name.
To initialize pyarrow you must set the CLASSPATH environment variable properly: set it to the output of hadoop classpath --glob.
df.write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("PATH/FOLDER_NAME/x.csv")
You can use this, and if you don't want to specify the name of the CSV every time, you can write a small helper function or create an array of CSV file names and pass them to it.
We can load a large CSV file in row chunks, for example:

import pandas as pd

tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000)  # gives a TextFileReader, which is iterable with chunks of 1000 rows
One might argue that usecols is the solution; however, in my experience usecols is qualitatively not as fast as chunksize. I presume this is because the entire file is still read into memory when usecols is used, whereas chunksize iterates through the file instead, and is therefore faster.
How could we load a large CSV file in column chunks?
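One possible approach (a minimal sketch, assuming each column group fits in memory on its own) is to read only the header first and then load groups of columns with usecols in a loop:

import pandas as pd

CSV_PATH = 'large_dataset.csv'
COLS_PER_CHUNK = 100  # assumed group size

# Read only the header row to discover the column names
all_columns = pd.read_csv(CSV_PATH, nrows=0).columns.tolist()

# Load the file in column groups; each pass re-reads the file but keeps memory bounded
for i in range(0, len(all_columns), COLS_PER_CHUNK):
    col_group = all_columns[i:i + COLS_PER_CHUNK]
    df_part = pd.read_csv(CSV_PATH, usecols=col_group)
    # process df_part here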