How do I compute the number of unique values in a pyarrow array?

I have a pyarrow int32 ChunkedArray
containing 18 chunks that I got from an ORC file:
import pyarrow.dataset
import pyarrow.compute
t = pyarrow.dataset.dataset("my/orc/file", format="orc").to_table()
a = t.column("a_column")
print(type(a))
print(a.num_chunks)
print(a.type)
<class 'pyarrow.lib.ChunkedArray'>
18
int32
I want to compute the number of unique values in the column. Seems like a straightforward job for count_distinct:
>>> print(pyarrow.compute.count_distinct(a))
36
However, unique() indicates that there are only two non-null values:
>>> print(pyarrow.compute.unique(a))
[
null,
100,
250
]
This suggests that count_distinct() is summed over the chunks. Is this the expected behavior? What is the correct way to get the number of unique non-null values in an array (either chunked or non-chunked)?


Parse nested json data in dataframe

I have a delimited file in which one column holds key=value pairs, one of them a JSON payload. I need to parse this data into a DataFrame.
Below is the record format:
trx_id|name|service_context|status
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information in these records into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I parse the key=value part manually: I split with sep=';|[|]', remove the text before '=', and update the column names.
For the JSON, I run the command below; however, the result replaces the existing table and contains only the parsed JSON.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to process this type of data without the manual steps?
The below hint will be sufficient to solve the problem.
Do it partwise for each column and then merge them together (you will need to remove the columns once you are able to split into multiple columns):
import ast
import pandas as pd
# Assumes df3['service_context'] holds strings of the form 'payload={...}'.
x = pd.json_normalize(df3['service_context'].apply(lambda x: (ast.literal_eval(x.split('=')[1])))).add_prefix('payload.')
y = pd.DataFrame(x['payload.counter'].apply(lambda x:[i['counter_type'] for i in x]).to_list())
y = y.rename(columns={0: 'counter_type', 1:'counter_info'})
for row in x['payload.product']:
    z1 = pd.json_normalize(row)
    z2 = pd.json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
It's really a 3-step approach:
use the primary pipe | delimiter
extract the key/value pairs
normalize the JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
    sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
    df,
    # split out the ;-delimited key=value sub-columns in the 3rd column
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)
# extract the json payload into columns. This leaves embedded lists as they are, since these are
# many-to-many relationships that need to be worked out by the data owner
df3 = pd.concat([df2,
                 pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
use with caution - explode() embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
    pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
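The three steps can also be sketched per record with only the standard library, which avoids the positional `split("=")[1]` and copes with `=` or `;` appearing inside the JSON. This is a sketch, not a drop-in replacement: it assumes payload is always the last key in the service_context column, the helper names flatten and parse_record are hypothetical, and the sample line is an abbreviated version of the first record.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into dotted keys; lists are left untouched."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def parse_record(line):
    """Parse 'trx_id|name|k=v;...;payload={json}|status' into a flat dict."""
    # Step 1: the outer pipe-delimited fields (status is taken from the right).
    trx_id, name, rest = line.split("|", 2)
    context, status = rest.rsplit("|", 1)
    # Step 2: key=value pairs; everything after ';payload=' is the JSON text.
    head, payload = context.split(";payload=", 1)
    row = {"trx_id": trx_id, "name": name, "status": status}
    row.update(pair.split("=", 1) for pair in head.split(";"))
    # Step 3: flatten the JSON under a 'payload.' prefix.
    row.update(flatten(json.loads(payload), prefix="payload."))
    return row


line = ('abc123|order|type=cdr;payload={"trx_id":"abc123",'
        '"price":{"transaction":1800,"discount":0},'
        '"counter":[{"counter_type":"product"}]}|success')
row = parse_record(line)
print(row["type"], row["payload.price.transaction"], row["status"])
```

A list of such dicts can be passed straight to pd.DataFrame; the embedded lists (counter, product) still need the same case-by-case handling discussed above.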

Pyarrow table memory compared to raw csv size

I have a 2GB CSV file that I read into a pyarrow table with the following:
from pyarrow import csv
tbl = csv.read_csv(path)
When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data is in Arrow memory than as a CSV file. Maybe I have a fundamental misunderstanding of what pyarrow is doing under the hood, but I thought that, if anything, it would be smaller due to its columnar nature (I probably could have squeezed out more gains using ConvertOptions, but I wanted a baseline). I definitely wasn't expecting an increase of almost 75%. Also, when I convert the Arrow table to a pandas df, the df takes up roughly the same amount of memory as the CSV, which was expected.
Can anyone help explain the difference in memory for Arrow tables compared to a CSV / pandas df?
Thx.
UPDATE
Full code and output below.
In [2]: csv.read_csv(r"C:\Users\matth\OneDrive\Data\Kaggle\sf-bay-area-bike-shar
...: e\status.csv")
Out[2]:
pyarrow.Table
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [3]: tbl = csv.read_csv(r"C:\Users\generic\OneDrive\Data\Kaggle\sf-bay-area-bik
...: e-share\status.csv")
In [4]: tbl.schema
Out[4]:
station_id: int64
bikes_available: int64
docks_available: int64
time: string
In [5]: tbl.nbytes
Out[5]: 3419272022
In [6]: tbl.to_pandas().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71984434 entries, 0 to 71984433
Data columns (total 4 columns):
# Column Dtype
--- ------ -----
0 station_id int64
1 bikes_available int64
2 docks_available int64
3 time object
dtypes: int64(3), object(1)
memory usage: 2.1+ GB
There are two problems:
The integer columns are using int64, but int32 would be more appropriate (unless the values are big).
The time column is interpreted as a string. It doesn't help that the input format isn't following any standard (%Y/%m/%d %H:%M:%S).
The first problem is easy to solve, using ConvertOptions:
import pyarrow as pa
from pyarrow import csv

tbl = csv.read_csv(
    <path>,
    convert_options=csv.ConvertOptions(
        column_types={
            'station_id': pa.int32(),
            'bikes_available': pa.int32(),
            'docks_available': pa.int32(),
            'time': pa.string()
        }))
The second one is a bit more complicated because as far as I can tell the read_csv API doesn't let you provide a format for the time column, and there's no easy way to convert string columns to datetime in pyarrow. So you have to use pandas instead:
import pandas as pd

series = tbl.column('time').to_pandas()
series_as_datetime = pd.to_datetime(series, format='%Y/%m/%d %H:%M:%S')
tbl2 = pa.table(
    {
        'station_id': tbl.column('station_id'),
        'bikes_available': tbl.column('bikes_available'),
        'docks_available': tbl.column('docks_available'),
        'time': pa.chunked_array([series_as_datetime])
    })
tbl2.nbytes
>>> 1475683759
1475683759 is roughly the number you would expect; you can't do much better. Each row is 20 bytes (4 + 4 + 4 + 8).
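The per-row arithmetic can be checked against both reported sizes. This is a back-of-the-envelope sketch: the row count comes from the info() output above, while the 19-character time string and the 4-byte string offset per value are assumptions about the data and the string layout, not figures taken from the question.

```python
ROWS = 71_984_434   # RangeIndex length from the info() output
TIME_CHARS = 19     # assumed: len("YYYY/MM/DD HH:MM:SS")

# Before: three int64 columns plus a string column, which stores the
# characters themselves and (assumed) a 4-byte offset per value.
before = ROWS * (3 * 8 + 4 + TIME_CHARS)

# After: three int32 columns plus one 8-byte timestamp.
after = ROWS * (3 * 4 + 8)

print(f"before: about {before:,} bytes (reported: 3,419,272,022)")
print(f"after:  about {after:,} bytes (reported: 1,475,683,759)")
```

Both estimates land within a few percent of the reported nbytes. The remainder is presumably bookkeeping such as validity bitmaps and buffer padding, which this sketch does not model.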

Count number of users per window using PySpark

I'm using Kafka to stream a JSON file, sending each line as a message. One of the keys is the user's email.
Then I use PySpark to count the number of unique users per window, using their email to identify them. The command
def print_users_count(count):
    print 'The number of unique users is:', count
print_users_count((lambda message: message['email']).distinct().count())
Gives me the error below. How can I fix this?
AttributeError Traceback (most recent call last)
<ipython-input-19-311ba744b41f> in <module>()
2 print 'The number of unique users is:', count
3
----> 4 print_users_count((lambda message: message['email']).distinct().count())
AttributeError: 'function' object has no attribute 'distinct'
Here is my PySpark code:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
try:
    sc.stop()
except:
    pass
sc = SparkContext(appName="KafkaStreaming")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
# Define the PySpark consumer.
kafkaStream = KafkaUtils.createStream(ssc, bootstrap_servers, 'spark-streaming2', {topicName:1})
# Parse the incoming data as JSON.
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
# Count the number of messages per batch.
parsed.count().map(lambda x:'Messages in this batch: %s' % x).pprint()
You're not applying the lambda function to anything. What is message referencing? Right now the lambda function is just that, a function. That is why you're getting AttributeError: 'function' object has no attribute 'distinct'. It is not being applied to any data, so it is not returning any data. You need to reference the dataframe that contains the email key.
See the pyspark docs for pyspark.sql.functions.countDistinct(col, *cols) and pyspark.sql.functions.approx_count_distinct. This should be a simpler solution for getting a unique count.
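To make the point concrete, here is a minimal sketch in plain Python 3 of what "applying the function to data" means: the key lookup runs once per message, and the distinct count is taken over the results. The helper name unique_user_count and the sample batch are made up; the commented lines at the end are an untested guess at the equivalent on the DStream from the question.

```python
def unique_user_count(messages):
    """Count distinct 'email' values in an iterable of parsed messages."""
    return len({m["email"] for m in messages if "email" in m})


# Hypothetical batch of already-parsed JSON messages.
batch = [
    {"email": "a@example.com"},
    {"email": "b@example.com"},
    {"email": "a@example.com"},
]
print("The number of unique users is:", unique_user_count(batch))

# On the DStream `parsed` from the question, the same idea would look roughly
# like this (untested sketch; a DStream has no distinct() of its own, so the
# call goes through transform() to reach each batch's RDD):
#
#   emails = parsed.map(lambda message: message["email"])
#   emails.transform(lambda rdd: rdd.distinct()).count().pprint()
```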

Reading NaN values from .csv files with decode_csv()

I have a .csv file with integer values that can contain NA, which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws the following error later, on sess.run(_), during the 2nd iteration:
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret string "NA" while reading csv as NaN or similar value in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, then converting it to float:
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
But long term I plan to switch to tfrecords, because reading csv is too slow for my needs.
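The data flow of that answer (read every field as a string, substitute the missing marker, convert to float) can be illustrated outside TensorFlow. The sketch below uses only the standard csv module, so it is not the TensorFlow pipeline itself; the helper name parse_row and the -1.0 fill value are illustrative choices.

```python
import csv
import io


def parse_row(fields, missing="NA", fill=-1.0):
    """Convert string fields to floats, substituting `fill` for `missing`."""
    return [fill if field == missing else float(field) for field in fields]


# Two rows from the example file in the question.
sample = io.StringIO("-9882,-9585,-9179\n-9883,-9587,NA\n")
rows = [parse_row(fields) for fields in csv.reader(sample)]
print(rows)
```

Passing fill=float("nan") instead keeps the missing entries distinguishable from real measurements, at the cost of having to handle NaN downstream.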

Is there a built-in method in the Python csv module to enumerate all possible values for a specific column?

I have a csv file with many columns. I need to find all the values that are present in one specific column.
Is there a built-in function in Python that helps me get these values?
You can use pandas.
Example file many_cols.csv:
col1,col2,col3
1,10,100
1,20,100
2,10,100
3,30,100
Find unique values per column:
>>> import pandas as pd
>>> df = pd.read_csv('many_cols.csv')
>>> df.col1.drop_duplicates().tolist()
[1, 2, 3]
>>> df['col2'].drop_duplicates().tolist()
[10, 20, 30]
>>> df['col3'].drop_duplicates().tolist()
[100]
For all columns:
import pandas as pd
df = pd.read_csv('many_cols.csv')
for col in df.columns:
    print(col, df[col].drop_duplicates().tolist())
Output:
col1 [1, 2, 3]
col2 [10, 20, 30]
col3 [100]
I would use a set() for this.
Let's say the csv file is this and we want only the unique values from the second column.
foo,1,bar
baz,2,foo
red,3,blue
git,3,foo
Here is the code that would accomplish this. I am simply printing out the unique values to test that it worked.
import csv
def parse_csv_file(rawCSVFile):
    fileLineList = []
    with open(rawCSVFile, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            fileLineList.append(row)
    return fileLineList
def main():
    uniqueColumnValues = set()
    fileLineList = parse_csv_file('sample.csv')
    for row in fileLineList:
        uniqueColumnValues.add(row[1]) # Selecting 2nd column here.
    print(uniqueColumnValues)
if __name__ == '__main__':
    main()
Overly "clever" approach to figuring out unique values for all the columns at once (assumes all rows have the same number of fields, though it ignores empty lines seamlessly):
# Assumes somefile was opened properly earlier
csvin = filter(None, csv.reader(somefile))
for i, vals in enumerate(map(sorted, map(set, zip(*csvin)))):
    print("Unique values for column", i)
    print(vals)
It uses zip(*csvin) to do a table rotation (converting the normal one row at a time output to one column at a time), then uniquifies each column with set, and (for nice output) sorts it.
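A self-contained version of the same rotation, using an in-memory file so it can be run as-is. The sample data is the four-row file from the set() answer with an empty line added to show that it is skipped; the variable names are illustrative.

```python
import csv
import io

# In-memory stand-in for an opened csv file, including one empty line.
sample = io.StringIO("foo,1,bar\nbaz,2,foo\n\nred,3,blue\ngit,3,foo\n")

rows = filter(None, csv.reader(sample))  # drop empty rows
columns = zip(*rows)                     # rotate rows into columns
unique_per_column = [sorted(set(col)) for col in columns]

for i, vals in enumerate(unique_per_column):
    print("Unique values for column", i, vals)
```

Note that zip() stops at the shortest row, so a ragged file silently loses its trailing columns; itertools.zip_longest is an option if that matters.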