How to convert a PyArrow table to an in-memory CSV - pyarrow

I'm searching for a way to convert a PyArrow table to a CSV in memory so that I can dump the CSV object directly into a database. With pyarrow.csv.write_csv() it is possible to create a CSV file on disk, but is it somehow possible to create a CSV object in memory? I'm having difficulty understanding the documentation. Thanks a lot in advance for the help!

Yes, it is possible. You can use the Python io module to write to memory:
>>> import pyarrow as pa
>>> from pyarrow import csv
>>> import io
# Create a Table
>>> t = pa.Table.from_arrays([[1, 2, 3], ["a", "b", "c"]], ["c1", "c2"])
# Write to memory
>>> buf = io.BytesIO()
>>> csv.write_csv(t, buf, csv.WriteOptions(include_header=True))
>>> buf.seek(0)
0
# Read from memory for demo purposes
>>> csv.read_csv(buf)
pyarrow.Table
c1: int64
c2: string
----
c1: [[1,2,3]]
c2: [["a","b","c"]]

Related

faster way to save multi-dim numpy array in self-describing format such as json

I am looking for a fast way to save a multi-dimensional array in a self-describing format such as JSON. As the article below says, the format can be self-describing (e.g. JSON, YAML, CSV) or not self-describing (e.g. pickle, protobuf, HDF5).
https://medium.com/#shmulikamar/python-serialization-benchmarks-8e5bb700530b
The first format that came to my mind was JSON, so I compared the dump/load times of JSON against pickle and npy. The results below show that JSON (at least via the standard library) is far too slow.
Does anybody know another self-describing format, or a faster way to serialize/deserialize?
pickle dump: 0.011399507522583008
pickle load: 0.01577591896057129
npy dump: 0.006514072418212891
npy load: 0.004297971725463867
json dump: 4.027008533477783
json load: 0.6741242408752441
import json
import time
import pickle
import uuid
import numpy as np

# my data is like 100 image sequence
image_seq = np.random.randint(0, high=255, size=(100, 3, 224, 224), dtype=np.uint8)

# benchmark for pickle
ts = time.time()
pickle_file = "/tmp/{}.pickle".format(str(uuid.uuid4()))
with open(pickle_file, "wb") as f:
    pickle.dump(image_seq, f)
print("pickle dump:", time.time() - ts)
with open(pickle_file, "rb") as f:
    pickle.load(f)
print("pickle load:", time.time() - ts)

# benchmark for npy
npy_file = "/tmp/{}.npy".format(str(uuid.uuid4()))
ts = time.time()
np.save(npy_file, image_seq)
print("npy dump:", time.time() - ts)
ts = time.time()
np.load(npy_file)
print("npy load:", time.time() - ts)

# benchmark for json
ts = time.time()
json_file = "/tmp/{}.json".format(str(uuid.uuid4()))
with open(json_file, "w") as f:
    json.dump(image_seq.flatten().tolist(), f)
print("json dump:", time.time() - ts)
ts = time.time()
with open(json_file, "r") as f:
    json.load(f)
print("json load:", time.time() - ts)
np.save() already gives you a self-describing file, and as you have shown it is very fast. If you want speed and self-description, it is a great choice.
Text-based formats like JSON or CSV will always be slower than binary ones like the one np.save() produces.
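If you want to see that self-description for yourself, the .npy header is stored as readable text at the start of the file and records the dtype, memory order and shape; a small sketch, assuming the npy_file written in the benchmark above:
# peek at the .npy header; it prints something like
# b"\x93NUMPY... {'descr': '|u1', 'fortran_order': False, 'shape': (100, 3, 224, 224)} ..."
with open(npy_file, "rb") as f:
    print(f.read(128))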

How do I split / chunk large JSON files with AWS GlueContext before converting them to Parquet?

I'm trying to convert a 20GB JSON gzip file to parquet using AWS Glue.
I've set up a job using PySpark with the code below.
I got this log WARN message:
LOG.WARN: Loading one large unsplittable file s3://aws-glue-data.json.gz with only one partition, because the file is compressed by unsplittable compression codec.
I was wondering if there was a way to split / chunk the file? I know I can do it with pandas, but unfortunately that takes far too long (12+ hours).
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
import pyspark.sql.functions
from pyspark.sql.functions import col, concat, reverse, translate
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

test = glueContext.create_dynamic_frame_from_catalog(
    database="test_db",
    table_name="aws-glue-test_table")

# Create Spark DataFrame, remove timestamp field and re-name other fields
reconfigure = test.drop_fields(['timestamp']).rename_field('name', 'FirstName').rename_field('LName', 'LastName').rename_field('type', 'record_type')

# Create pyspark DF
spark_df = reconfigure.toDF()
# Filter and only return 'a' record types
spark_df = spark_df.where("record_type == 'a'")
# Once filtered, remove the record_type column
spark_df = spark_df.drop('record_type')
spark_df = spark_df.withColumn("LastName", translate("LastName", "LName:", ""))
spark_df = spark_df.withColumn("FirstName", reverse("FirstName"))

spark_df.write.parquet("s3a://aws-glue-bucket/parquet/test.parquet")
Spark does not parallelize reading a single gzip file; however, you can split it into chunks.
Also, Spark is really slow at reading gzip files (since the read is not parallelized). You can do something like this to speed it up:
import gzip

# list_of_files holds the paths of the pre-split gzip files; 100 is the partition count
file_names_rdd = sc.parallelize(list_of_files, 100)
lines_rdd = file_names_rdd.flatMap(lambda _: gzip.open(_).readlines())
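If splitting the file up front is not practical, another common pattern (not specific to Glue) is to accept the single-partition read and repartition right after it, so that the transformations and the Parquet write at least run in parallel; a minimal sketch, with the path and partition count as placeholders:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# the gzipped JSON still lands in a single partition...
df = spark.read.json("s3://aws-glue-data.json.gz")
# ...so spread it across many partitions before doing the heavy work
df = df.repartition(100)
df.write.parquet("s3a://aws-glue-bucket/parquet/test.parquet")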

pandas reading JSON with chunksize

I have a JSON file which I need to load into memory in chunks.
Consider this file.json example:
[{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1},
{"weekly_report_day":"2019-12-30T00:00:00","project_id":"2","actions":1,"users":1,"event_id":1}
]
Then I want to use the pandas read_json function combined with chunksize:
import pandas as pd

file = "file.json"
dtype = {"weekly_report_day": str, "project_id": str, "actions": int, "event_id": int}

chunked = pd.read_json(file, orient='records', dtype=dtype, chunksize=1, lines=True)
for df in chunked:
    print(df)
This, however, returns an error.
I would like to ask for suggestions (I need to use chunks, as the original data is very large).
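One detail worth knowing here: pandas only honours chunksize when lines=True, and lines=True expects newline-delimited JSON (one object per line, no enclosing array), which is not what file.json above contains. A minimal sketch, assuming a hypothetical one-off conversion to file.jsonl first (the conversion itself loads the array into memory, so for truly huge data you would stream that step as well):
import json
import pandas as pd

# hypothetical one-off conversion: JSON array -> JSON Lines
with open("file.json") as src, open("file.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")

# chunksize is only accepted together with lines=True
for df in pd.read_json("file.jsonl", lines=True, chunksize=2):
    print(df)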

Reading a big JSON file with multiple objects in Python

I have a big GZ-compressed JSON file where each line is a JSON object (i.e. a Python dictionary).
Here is an example of the first two lines:
{"ID_CLIENTE":"o+AKj6GUgHxcFuaRk6/GSvzEWRYPXDLjtJDI79c7ccE=","ORIGEN":"oaDdZDrQCwqvi1YhNkjIJulA8C0a4mMZ7ESVlEWGwAs=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.0023907284768211919,"RESERVA":"2015-05-20","SALIDA":"2015-07-26","LLEGADA":"2015-07-27","DISTANCIA":0.48962542317352847,"EDAD":"19","sexo":"F"}{"ID_CLIENTE":"WHDhaR12zCTCVnNC/sLYmN3PPR3+f3ViaqkCt6NC3mI=","ORIGEN":"gwhY9rjoMzkD3wObU5Ito98WDN/9AN5Xd5DZDFeTgZw=","DESTINO":"OOcb8QTlctDfYOwjBI02hUJ1o3Bro/ir6IsmZRigja0=","PRECIO":0.001103046357615894,"RESERVA":"2015-04-08","SALIDA":"2015-07-24","LLEGADA":"2015-07-24","DISTANCIA":0.21382548869717155,"EDAD":"13","sexo":"M"}
So, I'm using the following code to read each line into a Pandas DataFrame:
import json
import gzip
import pandas as pd
import random

with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = pd.DataFrame()
    for line in fin:
        data_lan = pd.DataFrame([json.loads(line.decode('utf-8'))]).append(data_lan)
But it's taking ages.
Any suggestions for reading the data faster?
EDIT:
Finally, this is what solved the problem:
import json
import gzip
import pandas as pd

with gzip.GzipFile('data/000000000000.json.gz', 'r',) as fin:
    data_lan = []
    for line in fin:
        data_lan.append(json.loads(line.decode('utf-8')))

data = pd.DataFrame(data_lan)
I've worked on a similar problem myself; append() is quite slow. I generally load the JSON file into a list of dicts and then create the DataFrame in one go. That way you keep the flexibility lists give you, and you only convert to a DataFrame once you're sure about the data in the list. Below is an implementation of the concept:
import json
import gzip
import pandas as pd


def get_contents_from_json(file_path) -> dict:
    """
    Reads the contents of the json file into a dict
    :param file_path:
    :return: A dictionary of all contents in the file.
    """
    try:
        with gzip.open(file_path) as file:
            contents = file.read()
        return json.loads(contents.decode('UTF-8'))
    except json.JSONDecodeError:
        print('Error while reading json file')
    except FileNotFoundError:
        print(f'The JSON file was not found at the given path: \n{file_path}')


def main(file_path: str):
    file_contents = get_contents_from_json(file_path)
    if not isinstance(file_contents, list):
        # I've considered you have a JSON Array in your file
        # if not let me know in the comments
        raise TypeError("The file doesn't have a JSON Array!!!")
    all_columns = file_contents[0].keys()
    data_frame = pd.DataFrame(columns=all_columns, data=file_contents)
    print(f'Loaded {int(data_frame.size / len(all_columns))} Rows', 'Done!', sep='\n')


if __name__ == '__main__':
    main(r'C:\Users\carrot\Desktop\dummyData.json.gz')
A pandas DataFrame fits into a contiguous block of memory, which means that pandas needs to know the size of the data set when the frame is created. Since append changes the size, new memory must be allocated and the original plus the new data are copied in. As your data set grows, each copy gets bigger and bigger.
You can use from_records to avoid this problem. First, you need to know the row count, and that means scanning the file. You could potentially cache that number if you do it often, but it's a relatively fast operation. With the size known, pandas can allocate the memory efficiently.
# count rows
with gzip.GzipFile(file_to_test, 'r',) as fin:
    row_count = sum(1 for _ in fin)

# build dataframe from records
with gzip.GzipFile(file_to_test, 'r',) as fin:
    data_lan = pd.DataFrame.from_records(fin, nrows=row_count)
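As a further alternative, pandas can read gzipped newline-delimited JSON directly, which skips the manual line loop entirely; a minimal sketch, assuming the file really is one JSON object per line as described:
import pandas as pd

# one call, decompressing and parsing in one go
df = pd.read_json('data/000000000000.json.gz', lines=True, compression='gzip')

# or stream it in chunks if it does not fit comfortably in memory
chunks = pd.read_json('data/000000000000.json.gz', lines=True,
                      compression='gzip', chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)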

How to use binary data in SQLAlchemy?

How does one use binary data (a BLOB-type column) in SQLAlchemy?
I just created a table with fields key and val, where val is a BLOB, and when I query the table, SQLAlchemy returns:
<read-only buffer for 0x83c3040, size -1, offset 0 at 0x83c3120>
How do I use this read-only buffer?
You can iterate over it (e.g. for streaming), or convert it to a string/bytes object if you want the whole binary in memory (which shouldn't be a problem as long as you are not storing movies in the database...):
>>> from sqlalchemy.util import buffer
>>> var = buffer('foo')
>>> var
<read-only buffer for 0xb727fb00, size -1, offset 0 at 0xb727fa80>
>>> str(var)
'foo'
>>> for i in var:
... print i
...
f
o
o
In SQLAlchemy 1.3, binary data is returned as str under Python 2 and as bytes under Python 3:
# -*- coding: utf-8 -*-
import zlib

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import orm

Base = declarative_base()


class Blobby(Base):
    __tablename__ = 'blobby'

    id = sa.Column(sa.Integer, primary_key=True)
    blob = sa.Column(sa.BLOB)


engine = sa.create_engine('mysql+pymysql:///test', echo=True)
Base.metadata.drop_all(bind=engine, checkfirst=True)
Base.metadata.create_all(bind=engine)

Session = orm.sessionmaker(bind=engine)
session = Session()

data = zlib.compress('Hello world'.encode('ascii'))
session.add(Blobby(blob=data))
session.commit()

blob, = session.query(Blobby.blob).first()
print(type(blob), blob)
session.close()
Python 2 output:
(<type 'str'>, 'x\x9c\xf3H\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x18\xab\x04=')
Python 3 output:
<class 'bytes'> b'x\x9c\xf3H\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x18\xab\x04='
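Since the example compresses the payload before storing it, the read side is just the reverse zlib step applied to the bytes (or str on Python 2) that the query returns; a minimal sketch continuing from the session above:
import zlib

# blob is the value fetched by session.query(Blobby.blob).first() above
print(zlib.decompress(blob))                  # b'Hello world'
print(zlib.decompress(blob).decode('ascii'))  # Hello world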