I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
I am trying to read it as a pandas Series (using the pandas daily snapshot for Python 2.7).
I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But neither works: the first one gives a seemingly random result, and the second just imports a DataFrame without squeezing it.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the feature keys are in the first row instead of the first column, pandas does not want to squeeze it.
Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.ix[0,:]
This seems a bit silly to me, as squeeze should already do this. Is this a bug, or am I missing something?
/EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.transpose()[0] # here we convert the DataFrame into a Series
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless here, because as of today (April 2013) it only works when the parsed data contains a single column (the f1,value layout above), see the official doc:
http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
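On newer pandas releases the squeeze argument of read_csv has been deprecated and later removed, so a rough modern equivalent for this row-oriented layout (reusing the file name from the question) is simply to take the first row of the parsed DataFrame:
import pandas as pd
df = pd.read_csv('csvfile.txt', header=0)
serie = df.iloc[0]  # the single data row as a Series, indexed by the feature names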
This works. Squeeze does work, but not on its own: index_col also needs to be set to zero, as below.
series = pd.read_csv('csvfile.csv', header=None, index_col=0, squeeze=True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature 0
anotherfeature 0
f3 0
f4 1
f5 1
f6 2
f7 4
lastfeature 5
Name: 0, dtype: int64
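Note that .ix has since been removed from pandas; on recent versions the positional equivalent that returns the same Series is:
df.iloc[0]  # same row as df.ix[0] above, selected by position
# df.loc[0] also works here, selecting by the index label 0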
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0);
X = ds.iloc[:, :10]  # .ix is deprecated
As pandas' value selection logic is:
DataFrame -> Series=DataFrame[Column] -> Values=Series[Index]
So I suggest:
df = pandas.read_csv("csvfile.csv")
s = df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
import io
import pandas as pd

# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()
# create csv
series_csv = series.to_csv()
print(series_csv)
# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a 0
b 1
c 2
dtype: int64
idx,0
a,0
b,1
c,2
idx
a 0
b 1
c 2
dtype: int64
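A shorter equivalent, assuming the same io.StringIO round-trip as above, is to take the first column of the parsed DataFrame instead of rebuilding the Series by hand:
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
series_again = df.iloc[:, 0]  # first column as a Series, index preserved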
Related
I have a delimited file that has JSON as well as key=value pairs in one of its columns. I need to parse this data into a DataFrame.
Below is the record format
**trx_id|name|service_context|status**
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
I need to convert all the information from these records into this format:
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently I parse the key=value data manually with sep=';|[|]', splitting on '=' and updating the column names.
For the JSON, I run the command below; however, the result replaces the existing table and only contains the parsed JSON result.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to avoid this manual process for this type of data?
The below hint will be sufficient to solve the problem.
Do it partwise for each column and then merge them together (you will need to remove the columns once you are able to split into multiple columns):
import ast
from pandas.io.json import json_normalize
x = json_normalize(df3['service_context'].apply(lambda x: (ast.literal_eval(x.split('=')[1])))).add_prefix('payload.')
y = pd.DataFrame(x['payload.counter'].apply(lambda x:[i['counter_type'] for i in x]).to_list())
y = y.rename(columns={0: 'counter_type', 1:'counter_info'})
for row in x['payload.product']:
    z1 = json_normalize(row)
    z2 = json_normalize(z1['customer_spec.resource_pecification'][0])
    ### Write your own code.
It's really a 3-step approach:
use the primary pipe | delimiter
extract key/value pairs
normalize the JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
    df,
    # split out sub-columns from the ;-delimited 3rd column
    pd.DataFrame(
        [[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
        columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
    )
], axis=1)
# extract json payload into columns. This will leave embedded lists as these are many-many
# that needs to be worked out by data owner
df3 = pd.concat([df2,
    pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
Use with caution: explode() the embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
I have a .csv file with integer values that can contain the value NA, which represents missing data.
Example file:
-9882,-9585,-9179
-9883,-9587,NA
-9882,-9585,-9179
When trying to read it with
import tensorflow as tf
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read_up_to(filename_queue, 1)
record_defaults = [[0], [0], [0]]
data, ABL_E, ABL_N = tf.decode_csv(value, record_defaults=record_defaults)
It throws the following error later, on sess.run(_), on the 2nd iteration:
InvalidArgumentError (see above for traceback): Field 5 in record 32400 is not a valid int32: NA
Is there a way to interpret the string "NA" as NaN or a similar value while reading the CSV in TensorFlow?
I recently ran into the same problem. I solved it by reading the CSV as strings, replacing every occurrence of "NA" with some valid value, then converting it to float:
# Set up reading from CSV files
filename_queue = tf.train.string_input_producer([filename])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
NUM_COLUMNS = XX # Specify number of expected columns
# Read values as string, set "NA" for missing values.
record_defaults = [[tf.cast("NA", tf.string)]] * NUM_COLUMNS
decoded = tf.decode_csv(value, record_defaults=record_defaults, field_delim="\t")
# Replace every occurrence of "NA" with "-1"
no_nan = tf.where(tf.equal(decoded, "NA"), ["-1"]*NUM_COLUMNS, decoded)
# Convert to float, combine to a single tensor with stack.
float_row = tf.stack(tf.string_to_number(no_nan, tf.float32))
But in the long term I plan on switching to TFRecords, because reading CSV is too slow for my needs.
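A minimal sketch of that conversion, assuming TensorFlow 1.x and a CSV small enough to stream row by row; the file names and the single "values" feature key are placeholders:
import csv
import tensorflow as tf

# Write each CSV row as a tf.train.Example into a TFRecord file.
with tf.python_io.TFRecordWriter("data.tfrecords") as writer:
    with open("data.csv") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        for row in reader:
            # Mirror the approach above: treat "NA" as a sentinel value.
            values = [-1.0 if v == "NA" else float(v) for v in row]
            example = tf.train.Example(features=tf.train.Features(feature={
                "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
            }))
            writer.write(example.SerializeToString())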
I have seen similar questions being asked and responded to. However, no answer seems to address my specific needs.
The following code, which I took and adapted to suit my needs, successfully imports the files and the relevant columns. However, it appends rows onto the df and does not merge the columns based on keys.
import glob
import pandas as pd
import os
path = r'./csv_weather_data'
all_files = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat(pd.read_csv(f, skiprows=47, skipinitialspace=True, usecols=['Year','Month','Day','Hour','DBT'],) for f in all_files)
Typical data structure is the following:
Year Month Day Hour DBT
1989 1 1 0 7.8
1989 1 1 100 8.6
1989 1 1 200 9.2
I would like to achieve the following:
import all csv files contained in a folder into a pandas df
merge the first 4 columns into 1 column of datetime values
merge all imported csv files, using the newly created datetime value as an index, and add the DBT columns to that, with each DBT column taking the name of its imported csv (it is the Dry Bulb Temperature, DBT, of that weather file)
Any advice?
You should divide the problem into two steps:
First, define your import function. Here you need to build the datetime and set it as the index.
def my_import(f):
    df = pd.read_csv(f, skiprows=47, skipinitialspace=True, usecols=['Year', 'Month', 'Day', 'Hour', 'DBT'])
    df.loc[:, 'Date'] = pd.to_datetime(df.apply(lambda x: str(int(x['Year'])) + str(int(x['Month'])) + str(int(x['Day'])) + str(int(x['Hour'])), axis=1), format='%Y%m%d%H')
    df.drop(['Year', 'Month', 'Day', 'Hour'], axis=1, inplace=True)
    df = df.set_index('Date')  # set_index returns a new frame, so assign it back
    return df
Then you concatenate by columns (axis=1):
df = pd.concat({f : my_import(f) for f in all_files}, axis = 1)
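The dict keys make a MultiIndex of (file name, 'DBT') columns; a rough sketch of flattening that so each DBT column carries just its file's base name (assuming my_import leaves only the DBT column):
import os

# df comes from the pd.concat call above; its columns are (file name, 'DBT') tuples
df.columns = [os.path.splitext(os.path.basename(f))[0] for f, _ in df.columns]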
I have a csv file which has many columns. My requirement is to find all possible values that are present for a specific column.
Is there any built-in function in Python that helps me to get these values?
You can use pandas.
Example file many_cols.csv:
col1,col2,col3
1,10,100
1,20,100
2,10,100
3,30,100
Find unique values per column:
>>> import pandas as pd
>>> df = pd.read_csv('many_cols.csv')
>>> df.col1.drop_duplicates().tolist()
[1, 2, 3]
>>> df['col2'].drop_duplicates().tolist()
[10, 20, 30]
>>> df['col3'].drop_duplicates().tolist()
[100]
For all columns:
import pandas as pd
df = pd.read_csv('many_cols.csv')
for col in df.columns:
    print(col, df[col].drop_duplicates().tolist())
Output:
col1 [1, 2, 3]
col2 [10, 20, 30]
col3 [100]
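A compact variant of the same idea, collecting the unique values of every column into a dict (Series.unique keeps the order of first appearance):
unique_per_col = {col: df[col].unique().tolist() for col in df.columns}
print(unique_per_col)  # {'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100]}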
I would use a set() for this.
Let's say the csv file is this and we want only the unique values from the second column:
foo,1,bar
baz,2,foo
red,3,blue
git,3,foo
Here is the code that would accomplish this. I am simply printing out the unique values to test that it worked.
import csv

def parse_csv_file(rawCSVFile):
    fileLineList = []
    with open(rawCSVFile, newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            fileLineList.append(row)
    return fileLineList

def main():
    uniqueColumnValues = set()
    fileLineList = parse_csv_file('sample.csv')
    for row in fileLineList:
        uniqueColumnValues.add(row[1])  # Selecting 2nd column here.
    print(uniqueColumnValues)

if __name__ == '__main__':
    main()
Overly "clever" approach to figuring out the unique values for all columns at once (it assumes every row has the same number of fields, though it ignores empty lines seamlessly):
import csv

# Assumes somefile was opened properly earlier
csvin = filter(None, csv.reader(somefile))
for i, vals in enumerate(map(sorted, map(set, zip(*csvin)))):
    print("Unique values for column", i)
    print(vals)
It uses zip(*csvin) to do a table rotation (converting the normal one row at a time output to one column at a time), then uniquifies each column with set, and (for nice output) sorts it.
I have a large csv file that I cannot load into memory. I need to find which variables are constant. How can I do that?
I am reading the csv as
d = pd.read_csv(load_path, header=None, chunksize=10)
Is there an elegant way to solve the problem?
The data contains string and numerical variables
This is my current slow solution that does not use pandas:
constant_variables = [True for i in range(number_of_columns)]
with open(load_path) as f:
    line0 = next(f).split(',')
    for num, line in enumerate(f):
        line = line.split(',')
        for i in range(number_of_columns):
            if line[i] != line0[i]:
                constant_variables[i] = False
        if num % 10000 == 0:
            print(num)
There are 2 methods I can think of. One is to iterate over each column and check whether it only contains a single value:
col_list = pd.read_csv(path, nrows=1).columns
for col in range(len(col_list)):
    df = pd.read_csv(path, usecols=[col])
    if len(df.drop_duplicates()) == 1:  # a single distinct value means the column is constant
        print("all values are constant for: ", df.columns[0])
Or iterate over the csv in chunks and check the number of distinct values per column again:
for df in pd.read_csv(path, chunksize=1000):
    t = dict(zip(df, [len(df[col].value_counts()) for col in df]))
    print(t)
The latter will read in chunks and tell you how many distinct values each column's data has; this is just rough code which you can modify for your needs.
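A rough sketch combining both ideas, tracking the set of values seen per column across chunks so constant columns can be identified without loading the whole file (the chunk size is a placeholder; load_path is the path from the question):
import pandas as pd

seen = None
for chunk in pd.read_csv(load_path, header=None, chunksize=10000):
    if seen is None:
        seen = {col: set() for col in chunk.columns}
    for col in chunk.columns:
        seen[col].update(chunk[col].dropna().unique())
constant_columns = [col for col, vals in seen.items() if len(vals) <= 1]
print(constant_columns)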