Related
I have a .txt file with 683,500 rows, every 7 rows its a different person that contain:
ID
Name
Work position
Date 1 (year - month)
Date 2 (year - month)
Gross payment
Service time
I would like to read that .txt and output (could be a json, csv, txt, or even in a database) every person in a 7 column, for example:
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
ID Name Work position Date 1 Date 2 Gross payment Service time
Example in the txt:
00000000886
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-08
2021-09
30,556.04
15.7
00000000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
00100000086
MANUEL DE JESUS SUBERVI PEÑA
MAESTRO MEDIA GENERAL
2006-01
2021-09
30,556.04
15.7
import csv
#opening file
file = open (r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt") #open file
counter = 0
total_lines = len(file.readlines()) #count lines
#print('Total lines:', x)
#reading from file
content = file.read()
colist = content.split ()
print(colist)
#read data from data1.txt and write in data2.txt
lines = open (r"C:\Users\Redford\Documents\Proyecto automatizacion\data1.txt")
arr = []
with open('data2.txt', 'w') as f:
for line in lines:
#arr.append(line)
f.write (line)
I'm new to programing and I don't know how to translate my logic to code.
Your code does not collect multiple lines to write them into one.
Use this approach:
read your file line by line
collect each line without a \n into a list
if list reaches 7 length, write into csv and clear list
repeat until done
Create data file:
with open ("t.txt","w") as f:
f.write("""00000000886\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-08\n2021-09\n30,556.04\n15.7
00000000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7
00100000086\nMANUEL DE JESUS SUBERVI PEÑA\nMAESTRO MEDIA GENERAL\n2006-01\n2021-09\n30,556.04\n15.7""")
Program:
import csv
with open("t.csv","w",newline="") as wr, open("t.txt") as r:
# create a csv writer
writer = csv.writer(wr)
# uncomment if you want a header over your data
# h = ["ID","Name","Work position","Date 1","Date 2",
# "Gross payment","Service time"]
# writer.writerow(h)
person = []
for line in r: # could use enumerate as well, this works ok
# collect line data minus the \n into list
person.append(line.strip())
# this person is finished, write, clear list
if len(person) == 7:
# leveraged the csv module writer, look it up if you need
# to customize it further regarding quoting etc
writer.writerow(person)
person = [] # reset list for next person
# something went wrong, your file is inconsistent, write remainder
if person:
writer.writerow(person)
print(open("t.csv").read())
Output:
00000000886,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-08,2021-09,"30,556.04",15.7
00000000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
00100000086,MANUEL DE JESUS SUBERVI PEÑA,MAESTRO MEDIA GENERAL,2006-01,2021-09,"30,556.04",15.7
Readup: csv module - writer
The "Gross payment" needs to be quoted because it contain s a ',' wich is the delimiter for csv - the module does this automagically.
On top of the excellent answer from #PatrickArtner, I would like to propose an itertools-based solution:
import csv
import itertools
def file_grouper_itertools(
in_filepath="t.txt",
out_filepath="t.csv",
size=7):
with open(in_filepath, 'r') as in_file,\
open(out_filepath, 'w') as out_file:
writer = csv.writer(out_file)
args = [iter(in_file)] * size
for block in itertools.zip_longest(*args, fillvalue=' '):
# equivalent, for the given input, to:
# block = [x.rstrip('\n') for x in block]
block = ''.join(block).rstrip('\n').split('\n')
writer.writerow(block)
The idea there is to loop in blocks of the required size.
For larger group sizes this gets faster simply because of the fewer cycles the main loop is being executed.
Running some micro-benchmarking shows that your use case should benefit from this approach compared to the manual looping (adapted into a function):
import csv
def file_grouper_manual(
in_filepath="t.txt",
out_filepath="t.csv",
size=7):
with open(in_filepath, 'r') as in_file,\
open(out_filepath, 'w') as out_file:
writer = csv.writer(out_file)
block = []
for line in in_file:
block.append(line.rstrip('\n'))
if len(block) == size:
writer.writerow(block)
block = []
if block:
writer.writerow(block)
Benchmarking:
n = 100_000
k = 7
with open ("t.txt", "w") as f:
for i in range(n):
f.write("\n".join(["0123456"] * k))
%timeit file_grouper_manual()
# 1 loop, best of 5: 325 ms per loop
%timeit file_grouper_itertools()
# 1 loop, best of 5: 230 ms per loop
Alternatively, you could use Pandas, which is very convenient, but requires that all the input fit into available memory (which should not be a problem in your case, but can be for larger inputs):
import numpy as np
import pandas as pd
def file_grouper_pandas(in_filepath="t.txt", out_filepath="t.csv", size=7):
with open(in_filepath) as in_filepath:
data = [x.rstrip('\n') for x in in_filepath.readlines()]
df = pd.DataFrame(np.array(data).reshape((-1, size)), columns=list(range(size)))
# consistent with the other solutions
df.to_csv(out_filepath, header=False, index=False)
%timeit file_grouper_pandas()
# 1 loop, best of 5: 666 ms per loop
If you do a lot of work with tables and data, NumPy and Pandas are really useful libraries to get comfortable with.
import numpy as np
import pandas as pd
columns = ['ID', 'Name' , 'Work position', 'Date 1 (year - month)', 'Date 2 (year - month)',
'Gross payment', 'Service time']
with open('oldfile.txt', 'r') as stream:
# read file into a list of lines
lines = stream.readlines()
# remove newline character from each element of the list.
lines = [line.strip('\n') for line in lines]
# Figure out how many rows there will be in the table
number_of_people = len(lines)/7
# Split data into rows
data = np.array_split(lines, number_of_people)
# Convert data to pandas dataframe
df = pd.DataFrame(data, columns = columns)
Once you have converted the data to a Pandas Dataframe, you can easily output it to any of the formats you listed. For example to output to csv you can do:
df.to_csv('newfile.csv')
Or for json it would be:
df.to_json('newfile.csv')
I have delimited file that have JSON also keyvalues matching in the column. I need to parse this data into dataframe.
Below is the record format
**trx_id|name|service_context|status**
abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success
i need to convert all information from this record to have this format
trx_id|name |type|payload.trx_id|payload.name|payload.counter.counter_type|payload.counter.counter_info|.....|payload.renewal.flag|status
abc123|order|cdr |abc123 |abs |product |transfer |.....|0 |success
abc456|order|cdr |abc456 |abs |product | |.....|1 |success
Currently i've done manual parsing the data for key_value with sep=';|[|] and remove behind '=' and update the column name.
for Json, i do the below command, however the result is replacing the existing table and only contain parsing json result.
test_parse = pd.concat([pd.json_normalize(json.loads(js)) for js in test_parse['payload']])
Is there any way to do avoid any manual process to process this type of data?
The below hint will be sufficient to solve the problem.
Do it partwise for each column and then merge them together (you will need to remove the columns once you are able to split into multiple columns):
import ast
from pandas.io.json import json_normalize
x = json_normalize(df3['service_context'].apply(lambda x: (ast.literal_eval(x.split('=')[1])))).add_prefix('payload.')
y = pd.DataFrame(x['payload.counter'].apply(lambda x:[i['counter_type'] for i in x]).to_list())
y = y.rename(columns={0: 'counter_type', 1:'counter_info'})
for row in x['payload.product']:
z1 = json_normalize(row)
z2 = json_normalize(z1['customer_spec.resource_pecification'][0])
### Write your own code.
x:
y:
It's realy a 3-step approach
use primary pipe | delimiter
extract key / value pairs
normlize JSON
import pandas as pd
import io, json
# overall data structure is pipe delimited
df = pd.read_csv(io.StringIO("""abc123|order|type=cdr;payload={"trx_id":"abc123","name":"abs","counter":[{"counter_type":"product"},{"counter_type":"transfer"}],"language":"id","type":"AD","can_replace":"yes","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period":"0","period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0}}]}}],"renewal_flag":"0"}|success
abc456|order|type=cdr;payload={"trx_id":"abc456","name":"abs","counter":[{"counter_type":"product"}],"language":"id","price":{"transaction":1800,"discount":0},"product":[{"flag":"0","identifier_flag":"0","customer_spec":{"period_unit":"month","resource_pecification":[{"amount":{"ssp":0.0,"discount":0.0},"bt":{"service_id":"500_USSD","amount":"65000"}}]}}],"renewal_flag":"1"}|success"""),
sep="|", header=None, names=["trx_id","name","data","status"])
df2 = pd.concat([
df,
# split out sub-columns ; delimted columns in 3rd column
pd.DataFrame(
[[c.split("=")[1] for c in r] for r in df.data.str.split(";")],
columns=[c.split("=")[0] for c in df.data.str.split(";")[0]],
)
], axis=1)
# extract json payload into columns. This will leave embedded lists as these are many-many
# that needs to be worked out by data owner
df3 = pd.concat([df2,
pd.concat([pd.json_normalize(json.loads(p)).add_prefix("payload.") for p in df2.payload]).reset_index()], axis=1)
output
trx_id name data status type payload index payload.trx_id payload.name payload.counter payload.language payload.type payload.can_replace payload.product payload.renewal_flag payload.price.transaction payload.price.discount
0 abc123 order type=cdr;payload={"trx_id":"abc123","name":"ab... success cdr {"trx_id":"abc123","name":"abs","counter":[{"c... 0 abc123 abs [{'counter_type': 'product'}, {'counter_type':... id AD yes [{'flag': '0', 'identifier_flag': '0', 'custom... 0 1800 0
use with caution - explode() embedded lists
df3p = df3["payload.product"].explode().apply(pd.Series)
df3.join(df3.explode("payload.counter")["payload.counter"].apply(pd.Series)).join(
pd.json_normalize(df3p.join(df3p["customer_spec"].apply(pd.Series)).explode("resource_pecification").to_dict(orient="records"))
)
I am trying to import a list of stock-tickers (the line that is #symbols_list...read_csv..), and fetch stock-info on that date into a pandas.
import datetime
import pandas as pd
from pandas import DataFrame
from pandas.io.data import DataReader
#symbols_list = [pd.read_csv('Stock List.csv', index_col=0)]
symbols_list = ['AAPL', 'TSLA', 'YHOO','GOOG', 'MSFT','ALTR','WDC','KLAC']
symbols=[]
start = datetime.datetime(2014, 2, 9)
#end = datetime.datetime(2014, 12, 30)
for ticker in symbols_list:
r = DataReader(ticker, "yahoo",
start = start)
#start=start, end)
# add a symbol column
r['Symbol'] = ticker
symbols.append(r)
# concatenate all the dfs
df = pd.concat(symbols)
#define cell with the columns that i need
cell= df[['Symbol','Open','High','Low','Adj Close','Volume']]
#changing sort of Symbol (ascending) and Date(descending) setting Symbol as first column and changing date format
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
The input file Stock list.csv
has the following content with these entries on each their separate row:
Index
MMM
ABT
ABBV
ACE
ACN
ACT
ADBE
ADT
AES
AET
AFL
AMG
and many more tickers of interest.
When run with the manually coded list
symbols_list = ['AAPL', 'TSLA', 'YHOO','GOOG', 'MSFT','ALTR','WDC','KLAC']
It all works fine and processes the input and stores it to a file,
But whenever I run the code with the read_csv from file, I get the following error:
runfile('Z:/python/CrystallBall/SpyderProject/getstocks3.py', wdir='Z:/python/CrystallBall/SpyderProject') Reloaded modules: pandas.io.data, pandas.tseries.common Traceback (most recent call last):
File "<ipython-input-32-67cbdd367f48>", line 1, in <module>
runfile('Z:/python/CrystallBall/SpyderProject/getstocks3.py', wdir='Z:/python/CrystallBall/SpyderProject')
File "C:\Program Files (x86)\WinPython-32bit-3.4.2.4\python-3.4.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 601, in runfile
execfile(filename, namespace)
File "C:\Program Files (x86)\WinPython-32bit-3.4.2.4\python-3.4.2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 80, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "Z:/python/CrystallBall/SpyderProject/getstocks3.py", line 35, in <module>
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
File "C:\Users\Morten\AppData\Roaming\Python\Python34\site-packages\pandas\core\generic.py", line 1947, in __getattr__
(type(self).__name__, name))
AttributeError: 'Panel' object has no attribute 'reset_index'
Why can I only process the symbol_list manually laid out, and not the imported tickers from file?
Any takers? Any help greatly appreciated!
Your code has numerous issues which the following code has fixed and works:
In [4]:
import datetime
import pandas as pd
from pandas import DataFrame
from pandas.io.data import DataReader
temp='''Index
MMM
ABT
ABBV
ACE
ACN
ACT
ADBE
ADT
AES
AET
AFL
AMG'''
df = pd.read_csv(io.StringIO(temp), index_col=[0])
symbols=[]
start = datetime.datetime(2014, 2, 9)
for ticker in df.index:
r = DataReader(ticker, "yahoo",
start = start)
#start=start, end)
# add a symbol column
r['Symbol'] = ticker
symbols.append(r)
# concatenate all the dfs
df = pd.concat(symbols)
#define cell with the columns that i need
cell= df[['Symbol','Open','High','Low','Adj Close','Volume']]
#changing sort of Symbol (ascending) and Date(descending) setting Symbol as first column and changing date format
cell.reset_index().sort(['Symbol', 'Date'], ascending=[1,0]).set_index('Symbol').to_csv('stock.csv', date_format='%d/%m/%Y')
cell
Out[4]:
Symbol Open High Low Adj Close Volume
Date
2014-02-10 MMM 129.65 130.41 129.02 126.63 3317400
2014-02-11 MMM 129.70 131.49 129.65 127.88 2604000
... ... ... ... ... ... ...
2015-02-06 AMG 214.35 215.82 212.64 214.45 424400
[3012 rows x 6 columns]
So firstly this: symbols_list = [pd.read_csv('Stock List.csv', index_col=0)]
This will create a list with a single entry which will be a df with no columns and just an index of your ticker values.
This: for ticker in symbols_list:
won't work because the iterable object that is returned from the df is the column and not each entry, in your case you need to iterate over the index which is what my code does.
I'm not sure what you wanted to achieve, it isn't necessary to specify that index_col=0 if there is only one column, you can either create a df with just a single column, or if you pass squeeze=True this will create a Series which just has a single column.
I've been trying for weeks to plot 3 sets of (x, y) data on the same plot from a .CSV file, and I'm getting nowhere. My data was originally an Excel file which I have converted to a .CSV file and have used pandas to read it into IPython as per the following code:
from pandas import DataFrame, read_csv
import pandas as pd
# define data location
df = read_csv(Location)
df[['LimMag1.3', 'ExpTime1.3', 'LimMag2.0', 'ExpTime2.0', 'LimMag2.5','ExpTime2.5']][:7]
My data is in the following format:
Type mag1 time1 mag2 time2 mag3 time3
M0 8.87 41.11 8.41 41.11 8.16 65.78;
...
M6 13.95 4392.03 14.41 10395.13 14.66 25988.32
I'm trying to plot time1 vs mag1, time2 vs mag2 and time3 vs mag3, all on the same plot, but instead I get plots of time.. vs Type, eg. for the code:
df['ExpTime1.3'].plot()
I get 'ExpTime1.3' (y-axis) plotted against M0 to M6 (x-axis), when what I want is 'ExpTime1.3' vs 'LimMag1.3', with x-labels M0 - M6.
How do I get 'ExpTime..' vs 'LimMag..' plots, with all 3 sets of data on the same plot?
How do I get the M0 - M6 labels on the x-axis for the 'LimMag..' values (also on the x-axis)?
Since trying askewchan's solutions, which did not return any plots for reasons unknown, I've found that I can get a plot of ExpTimevs LimMagusing df['ExpTime1.3'].plot(),if I change the dataframe index (df.index) to the values of the x axis (LimMag1.3). However, this appears to mean that I have to convert each desired x-axis to the dataframe index by manually inputing all the values of the desired x-axis to make it the data index. I have an awful lot of data, and this method is just too slow, and I can only plot one set of data at a time, when I need to plot all 3 series for each dataset on the one graph. Is there a way around this problem? Or can someone offer a reason, and a solution, as to why I I got no plots whatsoever with the solutions offered by askewchan?\
In response to nordev, I have tried the first version again, bu no plots are produced, not even an empty figure. Each time I put in one of the ax.plotcommands, I do get an output of the type:
[<matplotlib.lines.Line2D at 0xb5187b8>], but when I enter the command plt.show()nothing happens.
When I enter plt.show()after the loop in askewchan's second solution, I get an error back saying AttributeError: 'function' object has no attribute 'show'
I have done a bit more fiddling with my original code and can now get a plot of ExpTime1.3vs LimMag1.3 with the code df['ExpTime1.3'][:7].plot(),by making the index the same as the x axis (LimMag1.3), but I can't get the other two sets of data on the same plot. I would appreciate any further suggestions you may have. I'm using ipython 0.11.0 via Anaconda 1.5.0 (64bit) and spyder on Windows 7 (64bit), python version is 2.7.4.
If I have understood you correctly, both from this question as well as your previous one on the same subject, the following should be basic solutions you could customize to your needs.
Several subplots:
Note that this solution will output as many subplots as there are Spectral classes (M0, M1, ...) vertically on the same figure. If you wish to save the plot of each Spectral class in a separate figure, the code needs some modifications.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
plt.figure(figsize=(7,5)) # Set the size of your figure, customize for more subplots
for i in range(len(df)):
xs = np.array(df[df.columns[0::2]])[i] # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]])[i] # Use values from even numbered columns as y-values
plt.subplot(len(df), 1, i+1)
plt.plot(xs, ys, marker='o') # Plot circle markers with a line connecting the points
for j in range(len(xs)):
plt.annotate(df.columns[0::2][j][-3:] + '"', # Annotate every plotted point with last three characters of the column-label
xy = (xs[j],ys[j]),
xytext = (0, 5),
textcoords = 'offset points',
va = 'bottom',
ha = 'center',
clip_on = True)
plt.title('Spectral class ' + df.index[i])
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.tight_layout()
plt.show()
All in same Axes, grouped by rows (M0, M1, ...)
Here is another solution to get all the different Spectral classes plotted in the same Axes with a legend identifying the different classes. The plt.yscale('log') is optional, but seeing as how the values span such a great range, it is recommended.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
for i in range(len(df)):
xs = np.array(df[df.columns[0::2]])[i] # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]])[i] # Use values from even numbered columns as y-values
plt.plot(xs, ys, marker='o', label=df.index[i])
for j in range(len(xs)):
plt.annotate(df.columns[0::2][j][-3:] + '"', # Annotate every plotted point with last three characters of the column-label
xy = (xs[j],ys[j]),
xytext = (0, 6),
textcoords = 'offset points',
va = 'bottom',
ha = 'center',
rotation = 90,
clip_on = True)
plt.title('Spectral classes')
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.yscale('log')
plt.legend(loc='best', title='Spectral classes')
plt.show()
All in same Axes, grouped by columns (1.3", 2.0", 2.5")
A third solution is as shown below, where the data are grouped by the series (columns 1.3", 2.0", 2.5") rather than by the Spectral class (M0, M1, ...). This example is very similar to
#askewchan's solution. One difference is that the y-axis here is a logarithmic axis, making the lines pretty much parallel.
import pandas as pd
from pandas import DataFrame, read_csv
import numpy as np
import matplotlib.pyplot as plt
# Here you put your code to read the CSV-file into a DataFrame df
xs = np.array(df[df.columns[0::2]]) # Use values from odd numbered columns as x-values
ys = np.array(df[df.columns[1::2]]) # Use values from even numbered columns as y-values
for i in range(df.shape[1]/2):
plt.plot(xs[:,i], ys[:,i], marker='o', label=df.columns[0::2][i][-3:]+'"')
for j in range(len(xs[:,i])):
plt.annotate(df.index[j], # Annotate every plotted point with its Spectral class
xy = (xs[:,i][j],ys[:,i][j]),
xytext = (0, -6),
textcoords = 'offset points',
va = 'top',
ha = 'center',
clip_on = True)
plt.title('Spectral classes')
plt.xlabel('Limiting Magnitude')
plt.ylabel('Exposure Time')
plt.grid(alpha=0.4)
plt.yscale('log')
plt.legend(loc='best', title='Series')
plt.show()
You can call pyplot.plot(time, mag) three different times in the same figure. It would be wise to give a label to them. Something like this:
import matplotlib.pyplot as plt
...
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(df['LimMag1.3'], df['ExpTime1.3'], label="1.3")
ax.plot(df['LimMag2.0'], df['ExpTime2.0'], label="2.0")
ax.plot(df['LimMag2.5'], df['ExpTime2.5'], label="2.5")
plt.show()
If you want to loop it, this would work:
fig = plt.figure()
ax = fig.add_subplot(111)
for x,y in [['LimMag1.3', 'ExpTime1.3'],['LimMag2.0', 'ExpTime2.0'], ['LimMag2.5','ExpTime2.5']]:
ax.plot(df[x], df[y], label=y)
plt.show()
I have a CSV file formatted as follows:
somefeature,anotherfeature,f3,f4,f5,f6,f7,lastfeature
0,0,0,1,1,2,4,5
And I try to read it as a pandas Series (using pandas daily snapshot for Python 2.7).
I tried the following:
import pandas as pd
types = pd.Series.from_csv('csvfile.txt', index_col=False, header=0)
and:
types = pd.read_csv('csvfile.txt', index_col=False, header=0, squeeze=True)
But both just won't work: the first one gives a random result, and the second just imports a DataFrame without squeezing.
It seems like pandas can only recognize as a Series a CSV formatted as follows:
f1, value
f2, value2
f3, value3
But when the features keys are in the first row instead of column, pandas does not want to squeeze it.
Is there something else I can try? Is this behaviour intended?
Here is the way I've found:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.ix[0,:]
Seems like a bit stupid to me as Squeeze should already do this. Is this a bug or am I missing something?
/EDIT: Best way to do it:
df = pandas.read_csv('csvfile.txt', index_col=False, header=0);
serie = df.transpose()[0] # here we convert the DataFrame into a Serie
This is the most stable way to get a row-oriented CSV line into a pandas Series.
BTW, the squeeze=True argument is useless for now, because as of today (April 2013) it only works with row-oriented CSV files, see the official doc:
http://pandas.pydata.org/pandas-docs/dev/io.html#returning-series
This works. Squeeze still works, but it just won't work alone. The index_col needs to be set to zero as below
series = pd.read_csv('csvfile.csv', header = None, index_col = 0, squeeze = True)
In [28]: df = pd.read_csv('csvfile.csv')
In [29]: df.ix[0]
Out[29]:
somefeature 0
anotherfeature 0
f3 0
f4 1
f5 1
f6 2
f7 4
lastfeature 5
Name: 0, dtype: int64
ds = pandas.read_csv('csvfile.csv', index_col=False, header=0);
X = ds.iloc[:, :10] #ix deprecated
As Pandas value selection logic is :
DataFrame -> Series=DataFrame[Column] -> Values=Series[Index]
So I suggest :
df=pandas.read_csv("csvfile.csv")
s=df[df.columns[0]]
from pandas import read_csv
series = read_csv('csvfile.csv', header=0, parse_dates=[0], index_col=0, squeeze=True
Since none of the answers above worked for me, here is another one, recreating the Series manually from the DataFrame.
# create example series
series = pd.Series([0, 1, 2], index=["a", "b", "c"])
series.index.name = "idx"
print(series)
print()
# create csv
series_csv = series.to_csv()
print(series_csv)
# read csv
df = pd.read_csv(io.StringIO(series_csv), index_col=0)
indx = df.index
vals = [df.iloc[i, 0] for i in range(len(indx))]
series_again = pd.Series(vals, index=indx)
print(series_again)
Output:
idx
a 0
b 1
c 2
dtype: int64
idx,0
a,0
b,1
c,2
idx
a 0
b 1
c 2
dtype: int64