pandas find constant variables in a huge csv file - csv

I have a large csv file that I can not load into memory. I need to find which variables are constant. How can I do that?
I am reading the csv as
d = pd.read_csv(load_path, header=None, chunksize=10)
Is there an elegant way to solve the problem?
The data contains string and numerical variables

This is my current slow solution that does not use pandas
constant_variables = [True for i in range(number_of_columns)]
with open(load_path) as f:
line0 = next(f).split(',')
for num, line in enumerate(f):
line = line.split(',')
for i in range(n_col):
if line[i] != line0[i]:
constant_variables[i] = False
if num % 10000 == 0:
print(num)

You have 2 methods I can think of iterate over each column and check for uniqueness:
col_list = pd.read_csv(path, nrows=1).columns
for col in range(len(col_list)):
df = pd.read_csv(path, usecols=col)
if len(df.drop_duplicates()) == len(df):
print("all values are constant for: ", df.column[0])
or iterate over the csv in chunks and check again the lengths:
for df in pd.read_csv(path, chunksize=1000):
t = dict(zip(df, [len(df[col].value_counts()) for col in df]))
print(t)
The latter will read in chunks and tell you how unique each columns data is, this is just rough code which you can modify for your needs

Related

AWS Athena export array of structs to JSON

I've got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):
CREATE EXTERNAL TABLE IF NOT EXISTS test (
timestamp double,
stats array<struct<time:double, mean:double, var:double>>,
dets array<struct<coords: array<double>, header:struct<frame:int,
seq:int, name:string>>>,
pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'
Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query and then they will give me the CSV results.
If we just do a straight select * we get back the struct/array columns in a format that isn't quite JSON.
Here's a sample input file entry:
{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}
And example output:
select * from test
"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.
Unfortunately it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:
SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"
"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"
Is there a good way to convert to JSON (or another easy-to-import format) or should I just go ahead and do a custom parsing function?
I have skimmed through all the documentation and unfortunately there seems to be no way to do this as of now. The only possible workaround is
converting a struct to a json when querying athena
SELECT
my_field,
my_field.a,
my_field.b,
my_field.c.d,
my_field.c.e
FROM
my_table
Or I would convert the data to json using post processing. Below script shows how
#!/usr/bin/env python
import io
import re
pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)
with io.open("test.csv") as f:
headers = list(map(lambda f: f.strip(), f.readline().split(",")))
for line in f.readlines():
orig_line = line
data = []
for i, l in enumerate(line.split('","')):
data.append(headers[i] + ":" + re.sub('^"|"$', "", l))
line = "{" + ','.join(data) + "}"
line = pattern1.sub(r'"\1":', line)
line = pattern2.sub(r':"\1"', line)
print(line)
The output on your input data is
{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}
}
Which is a valid JSON
The python code from #tarun almost got me there, but I had to modify it in several ways due to my data. In particular, I have:
json structures saved in Athena as strings
Strings that contain multiple words, and therefore need to be in between double quotes. Some of them contain "[]" and "{}" symbols.
Here is the code that worked for me, hopefully will be useful for others:
#!/usr/bin/env python
import io
import re, sys
pattern1 = re.compile(r'(?<={)([a-z]+)=', re.I)
pattern2 = re.compile(r':([a-z][^,{}. [\]]+)', re.I)
pattern3 = re.compile(r'\\"', re.I)
with io.open(sys.argv[1]) as f:
headers = list(map(lambda f: f.strip(), f.readline().split(",")))
print(headers)
for line in f.readlines():
orig_line = line
#save the double quote cases, which mean there is a string with quotes inside
line = re.sub('""', "#", orig_line)
data = []
for i, l in enumerate(line.split('","')):
item = re.sub('^"|"$', "", l.rstrip())
if (item[0] == "{" and item[-1] == "}") or (item[0] == "[" and item[-1] == "]"):
data.append(headers[i] + ":" + item)
else: #we have a string
data.append(headers[i] + ": \"" + item + "\"")
line = "{" + ','.join(data) + "}"
line = pattern1.sub(r'"\1":', line)
line = pattern2.sub(r':"\1"', line)
#restate the double quotes to single ones, once inside the json
line = re.sub("#", '"', line)
print(line)
This method is not by modifying the Query.
Its by Post Processing For Javascript/Nodejs we can use the npm package athena-struct-parser.
Detailed Answer with Example
https://stackoverflow.com/a/67899845/6662952
Reference - https://www.npmjs.com/package/athena-struct-parser
I used a simple approach to get around the struct -> json Athena limitation. I created a second table where the json columns were saved as raw strings. Using presto json and array functions I was able to query the data and return the valid json string to my program:
--Array transform functions too
select
json_extract_scalar(dd, '$.timestamp') as timestamp,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.time')) as arr_stats_time,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.mean')) as arr_stats_mean,
transform(cast(json_extract(json_parse(dd), '$.stats') as ARRAY<JSON>), x -> json_extract_scalar(x, '$.var')) as arr_stats_var
from
(select '{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}' as dd);
I know the query will take longer to execute but there are ways to optimize.
I worked around this by creating a second table using the same S3 location, but changed the field's data type to string. The resulting CSV then had the string that Athena pulled from the object in the JSON file and I was able to parse the result.
I also had to adjust the #tarun code, because I had more complex data and nested structures. Here is the solution I've got, I hope it helps:
import re
import json
import numpy as np
pattern1 = re.compile(r'(?<=[{,\[])\s*([^{}\[\],"=]+)=')
pattern2 = re.compile(r':([^{}\[\],"]+|()(?![{\[]))')
pattern3 = re.compile(r'"null"')
def convert_metadata_to_json(value):
if type(value) is str:
value = pattern1.sub('"\\1":', value)
value = pattern2.sub(': "\\1"', value)
value = pattern3.sub('null', value)
elif np.isnan(value):
return None
return json.loads(value)
df = pd.read_csv('test.csv')
df['metadata_json'] = df.metadata.apply(convert_metadata_to_json)

Python Spark- How to output empty DataFrame to csv file (Only output header)?

I want to output empty dataframe to csv file. I use these codes:
df.repartition(1).write.csv(path, sep='\t', header=True)
But due to there is no data in dataframe, spark won't output header to csv file.
Then I modify the codes to:
if df.count() == 0:
empty_data = [f.name for f in df.schema.fields]
df = ss.createDataFrame([empty_data], df.schema)
df.repartition(1).write.csv(path, sep='\t')
else:
df.repartition(1).write.csv(path, sep='\t', header=True)
It works, but I want to ask whether there are a better way without count function.
df.count() == 0 will make your driver program retrieve the count of all your dataframe partitions across the executors.
In your case I would use df.take(1).isEmpty (Spark >= 2.1). Still slow, but preferable to a raw count().
Only header:
cols = '\t'.join(df.columns)
with open('./cols.csv', 'w') as f:
f.write(cols)

Reading NaN values from .csv files with decode_csv()

I have .csv file with integer values, that can have NA value which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws following error later on sess.run(_) on the 2nd iteration
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret string "NA" while reading csv as NaN or similar value in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, then converting it to float
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
But long term I plan switching to tfrecords because reading csv is too slow for my needs

import chosen columns of json file in r

Can't find any solution how to load huge JSON. I try with well known Yelp dataset. It's 3.2 GB and I want to analyse 9 out of 10 columns. I need to skip import $text column, which will give me much lighter file to load. Probably about -70%. I don't want to manipulate the file.
I tried many libraries and stuck. I've found a solution for data.frame to apply pipe function:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
from this thread: Ways to read only select columns from a file into R? (A happy medium between `read.table` and `scan`?)
How to do it for JSON? I'd like to do:
json <- read.table(pipe("cut -text yelp_academic_dataset_review.json"))
but this of course throws an error due to wrong format. Is there any possibilities without parsing whole file with regex?
EDIT
Structure of one row: (even can't count them all)
{"review_id":"vT2PALXWX794iUOoSnNXSA","user_id":"uKGWRd4fONB1cXXpU73urg","business_id":"D7FK-xpG4LFIxpMauvUStQ","stars":1,"date":"2016-10-31","text":"some long text here","useful":0,"funny":0,"cool":0,"type":"review"}
SECOND EDIT
Finally, I've created a loop to convert all data into csv file, omitted unwanted column. It's slow but I've got 150 mb (zipped) from 3.2 gb.
# files to process
filepath <- jfile1
fcsv <- "c:/Users/cp/Documents/R DATA/Yelp/tests/reviews.csv"
write.table(x = paste(colns, collapse=","), file = fcsv, quote = F, row.names = F, col.names = F)
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
# regex process
d <- NULL
for (i in rcols) {
pst <- paste(".*\"",colns[i],"\":*(.*?) *,\"",colns[i+1],"\":.*", sep="")
w <- sub(pst, "\\1", line)
d <- cbind(d, noquote(w))
}
# save on the fly
write.table(x = paste(d, collapse = ","), file = fcsv, append = T, quote = F, row.names = F, col.names = F)
}
close(con)
It can be save to json also. I wonder if it's the most efficient way, but other scripts I tested was slow and often had some encoding issues.
Try this:
library(jsonlite)
df <- as.data.frame(fromJSON('yelp_academic_dataset_review.json', flatten=TRUE))
Then once it is a dataframe delete the column(s) you don't need.
If you don't want to manipulate the file in advance of importing it, I'm not sure what options you have in R. Alternatively, you could make a copy of the file, then delete the text column with this script, then import the copy to R, then delete the copy.

Memory error when using read_csv

I'd like to convert the csv files into hdf5 format,which are used for caffe training.Because the csv files is 80G,it will report memory error.The machine memory is 128G.So can it possbile to improve my code?handle it one by one?Below is my code,it reported memory error when run in np.array
if '__main__' == __name__:
print 'Loading...'
day = sys.argv[1]
file = day+".xls"
data = pd.read_csv(file, header=None)
print data.iloc[0,1:5]
y = np.array(data.iloc[:,0], np.float32)
x = np.array(data.iloc[:,1:], np.float32)
patch = 100000
dirname = "hdf5_" + day
os.mkdir(dirname)
filename = dirname+"/hdf5.txt"
modelname = dirname+"/data"
file_w = open(filename, 'w')
for idx in range(int(math.ceil(y.shape[0]*1.0/patch))):
with h5py.File(modelname + str(idx) + '.h5', 'w') as f:
d_begin = idx*patch
d_end = min(y.shape[0], (idx+1)*patch)
f['data'] = x[d_begin:d_end,:]
f['label'] = y[d_begin:d_end]
file_w.write(modelname + str(idx) + '.h5\n')
file_w.close()
The best approach would be to read n lines and then write these to the HDF5 file, extending it be n elements each time. This way the amount of memory needed is not dependent on the size of the csv file. You could read a line at a time as well, but that would be slightly less efficient.
Here's code that applies this process for reading weather station data:
https://github.com/HDFGroup/datacontainer/blob/master/util/ghcn/convert_ghcn.py.
Actually, since you treat chunks of size 100000 separately, there is no need to load the whole CSV at one. The chunksize option in the read_csv is exactly for this case.
When specifying chunksize, read_csv will become an iterator, returning DataFrames's of size chunksize. You can iterate over instead of slicing the arrays each time.
Minus all the lines setting the different variables, your code should look more like this:
chuncks = pd.read_csv(file, header=None, chunksize=100000)
for chunk_number, data in enumerate(chunks):
y = np.array(data.iloc[:,0], np.float32)
x = np.array(data.iloc[:,1:], np.float32)
file_w = open(filename, 'w')
with h5py.File(modelname + str(idx) + '.h5', 'w') as f:
f['data'] = x
f['label'] = y
file_w.write(modelname + str(chunk_number) + '.h5\n')
file_w.close()