I have a csv file with 1461 attributes that I want to load into a pandas data frame. The problem is that many rows do not have values for the trailing consecutive columns, so pandas raises a parsing error due to the irregular row lengths. How can I fill the missing trailing columns with NaN and load the csv file into the data frame?
Edit 1:
The data set csv file looks like this:
a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5
I want a pandas data frame like this:
a b c d e f g h i
1 2 4 5 ? ? ? ? ?
1 0 9 8 7 6 5 4 7
1 3 5 6 7 ? ? ? ?
6 7 8 8 9 4 5 3 5
NaN in place of ? is fine. The short rows simply don't have enough commas, which is what causes the unequal-length problem.
You can use the names parameter of read_csv to supply column names from a range (assuming the attributes are the columns):
import pandas as pd
from io import StringIO

temp=u"""
a,v
c,v,f,r
b,g
y"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
#in the real data change 4 to 1461
names = range(4)
df = pd.read_csv(StringIO(temp), names=names)
print(df)
0 1 2 3
0 a v NaN NaN
1 c v f r
2 b g NaN NaN
3 y NaN NaN NaN
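For the real file, the same pattern scales up. A minimal sketch, assuming the file is called filename.csv and has no header row of its own:
import pandas as pd

#names=range(1461) fixes the table width at 1461 columns, so
#shorter rows are padded with NaN instead of raising a parser error
df = pd.read_csv('filename.csv', names=range(1461))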
EDIT:
temp=u"""a,b,c,d,e,f,g,h,i
1,2,4,5
1,0,9,8,7,6,5,4,7
1,3,5,6,7
6,7,8,8,9,4,5,3,5"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print(df)
a b c d e f g h i
0 1 2 4 5 NaN NaN NaN NaN NaN
1 1 0 9 8 7.0 6.0 5.0 4.0 7.0
2 1 3 5 6 7.0 NaN NaN NaN NaN
3 6 7 8 8 9.0 4.0 5.0 3.0 5.0
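One side effect visible above: the padded NaNs force columns e through i to float. If you need integer dtypes back, recent pandas versions offer a nullable integer type; a small sketch, assuming such a version is available:
#'Int64' (capital I) is the nullable integer dtype; it keeps whole
#numbers as integers while still allowing missing values
df = df.astype('Int64')
print(df.dtypes)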
Here we have a link to a page that contains a table:
http://pitzavod.ru/products/upakovka/
When I read it with pd.read_html I do get a list, but it 1) is not nested, so when converted to a dataframe it is not readable, and 2) contains integers from 0 up to the number of rows in the table on the website.
The list I get looks like:
[ 0 1 \
0 Показатели Марка целлюлозы
1 ОСН NaN
2 Механическая прочность при размоле в мельнице ... 10 000 740 520
3 Степень делигнификации, п.е. 28 - 45
4 Сорность - число соринок в условной массе 500г... 6500
5 Влажность, % не более 20
2
0 Методы испытаний
1 NaN
2 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 ГОСТ 10070
4 ГОСТ 14363.3
5 ГОСТ 16932 ]
Is there a way to easily clean this pandas output, or do I need to parse the website properly? Thank you.
That's because read_html always returns a list (even if there is only one table).
From the docs, pandas.read_html: "Read HTML tables into a list of DataFrame objects."
You need to index it with [0]:
df = pd.read_html("http://pitzavod.ru/products/upakovka/")[0]
Output (showing the last two columns):
1 2
0 Марка целлюлозы Методы испытаний
1 ОСН Методы испытаний
2 10 000 740 520 ГОСТ13525.1 ГОСТ 13525.3 ГОСТ 13525.8
3 28 - 45 ГОСТ 10070
4 6500 ГОСТ 14363.3
5 20 ГОСТ 16932
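If you are not sure which table on a page you want, it can help to inspect the whole list first; a short sketch along the same lines:
import pandas as pd

#read_html returns one DataFrame per <table> element on the page
tables = pd.read_html("http://pitzavod.ru/products/upakovka/")
print(len(tables))  #how many tables were found
df = tables[0]      #pick the first (here, the only) one
print(df.shape)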
Category/Date    jul  aug  sep  oct  nov  dec  ...  jan
AA on 2020Jun      5    6    3    8    2    7  ...    4
AA on 2020May      7    3    2    6    5    5  ...    7
Difference        -2    1    1    2   -3    2  ...   -3
I am using a WCF query to fetch the data. The dataset looks like:
Category - AA
Date - 2020Jun
Month1  4
Month12 7
Month11 2
Month10 8
I am using a switch statement like the one below to display each column value:
switch(Reportdate.Month(jun 2020) = 1, Fields.Month1.Value,
Reportdate.Month = 12, Fields.Month12.Value)
The problem I am facing is finding the difference in each column.
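Setting the report expression aside, the Difference row itself is just an element-wise subtraction of the two period rows. A minimal sketch of that computation in pandas, with made-up data purely to illustrate the arithmetic:
import pandas as pd

#hypothetical recreation of the two report rows from the table above
months = ['jul', 'aug', 'sep', 'oct', 'nov', 'dec']
jun = pd.Series([5, 6, 3, 8, 2, 7], index=months)  #AA on 2020Jun
may = pd.Series([7, 3, 2, 6, 5, 5], index=months)  #AA on 2020May
print(jun - may)  #element-wise difference, one value per month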
I'm really new to working with JSON data, so I had a question about formatting.
Here's the link to the data I was trying to work with.
I was using jsonlite and did this:
shot <- paste0("http://stats.nba.com/stats/playerdashptshotlog?DateFrom=&DateTo=&",
  "GameSegment=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&",
  "Outcome=&Period=0&PlayerID=202322&Season=2014-15&SeasonSegment=&",
  "SeasonType=Regular+Season&TeamID=0&VsConference=&VsDivision=")
I then did fromJSON:
json_data <- fromJSON(paste(readLines(shot), collapse=""))
This gives me the data in a list. My issue (although I may have messed something up along the way) is creating a data frame out of this info. I was able to make a data frame with code I read under similar questions on the site, but all of the data ends up in just one column. Any recommendations would be appreciated!
Thanks
Normally, the first thing to do when you get a JSON is to look at its structure:
str(json_data)
Doing so will reveal that your data has a very simple structure: it is a dataframe of rows plus a line of headers, wrapped in some metadata about what each column means. Using $ will allow you to address those specific components. In other words, your specific JSON is already a data frame structure; all you have to do is take it out of the JSON:
library(jsonlite)
json_data <- fromJSON(paste(readLines(shot), collapse=""))
str(json_data)
#the row data and the column headers live in separate components of resultSets
mydf <- data.frame(json_data$resultSets$rowSet)
colnames(mydf) <- unlist(json_data$resultSets$headers)
You ought to get something like this:
head(mydf)
GAME_ID MATCHUP LOCATION W FINAL_MARGIN SHOT_NUMBER PERIOD
1 0021401215 APR 14, 2015 - WAS # IND A L -4 1 1
2 0021401215 APR 14, 2015 - WAS # IND A L -4 2 1
3 0021401215 APR 14, 2015 - WAS # IND A L -4 3 1
4 0021401215 APR 14, 2015 - WAS # IND A L -4 4 1
5 0021401215 APR 14, 2015 - WAS # IND A L -4 5 1
6 0021401215 APR 14, 2015 - WAS # IND A L -4 6 1
GAME_CLOCK SHOT_CLOCK DRIBBLES TOUCH_TIME SHOT_DIST PTS_TYPE SHOT_RESULT
1 10:33 7.7 0 1 25 3 missed
2 8:41 14 10 9.6 10.7 2 made
3 6:42 14.9 11 9.7 18.2 2 missed
4 5:16 19 3 3.5 4.2 2 made
5 4:45 19.8 3 3.7 3.3 2 missed
6 3:08 13.5 10 9.7 18 2 missed
CLOSEST_DEFENDER CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST FGM PTS
1 Hill, George 201588 4.3 0 0
2 Hill, George 201588 5.7 1 2
3 Hill, George 201588 3 0 0
4 Miles, CJ 101139 4 1 2
5 Hill, Solomon 203524 3 0 0
6 Hill, George 201588 4.5 0 0
I have a file with data in 2 columns, X and Y. The data comes in blocks separated by a blank line. I want to join the points (given by their coordinates x and y in the file) in each block using vectors. I'm trying to use these functions:
prev_x = NaN
prev_y = NaN
dx(x) = (x_delta = x-prev_x, prev_x = ($0 > 0 ? x : 1/0), x_delta)
dy(y) = (y_delta = y-prev_y, prev_y = ($0 > 0 ? y : 1/0), y_delta)
which I've taken from Plot lines and vector in graphical gnuplot (first answer). The command to plot would be plot for [i=0:5] 'Field_lines.txt' every :::i::i u (prev_x):(prev_y):(dx($1)):(dy($2)) with vectors.
The problem with the resulting plot (image omitted) is that the point (0,0) is being included even though it's not in the file. I don't think I understand what the functions dx and dy do exactly, or how they are used with the option using (prev_x):(prev_y):(dx($1)):(dy($2)), so an explanation of this would help me a lot in trying to fix it.
This is the file:
#1
0 5
0 4
0 3
0.4 2
0.8 1
0.8 1
#2
2 5
2 4
2 3
2 2
2 1
2 0
#3
4 5
4.2 4
4.5 3
4.6 2
4.7 1
4.7 0
#4
7 5
7.2 4
7.5 3
7.9 2
7.9 1
7.9 0
#5
9 5
9 4
9.2 3
9.5 2
9.5 1
9.5 0
#6
11 7
12 6
13 5
13.3 4
13.5 3
13.5 2
13.6 1
14 0
Thanks!
I'm not completely sure what the real problem is, but I think you cannot rely on the columns in the using statement being evaluated from left to right, and your check $0 > 0 in dx and dy comes too late in my opinion.
I usually put all the assignments and conditionals in the first column, and that works fine in your case as well:
set offsets 1,1,1,1
unset key
prev_x = prev_y = 1
plot for [i=0:5] 'Field_lines.txt' every :::i::i \
u (x_delta = prev_x-$1, prev_x=$1, y_delta=prev_y-$2, prev_y=$2, ($0 == 0 ? 1/0 : prev_x)):(prev_y):(x_delta):(y_delta) with vectors backhead
Also, to draw a vector from the j-th row to the point in the following row, you must invert the definition of x_delta and use backhead to draw the vectors in the correct direction.
It often happens that data will be given to you with wrapped columns. Consider, for example:
CCY Decimals CCY Decimals CCY Decimals
AUD/CAD 5 EUR/CZK 4 GBP/NOK 5
AUD/CHF 5 EUR/DKK 5 GBP/NZD 5
AUD/DKK 5 EUR/GBP 5 GBP/PLN 5
AUD/JPY 3 EUR/HKD 5 GBP/SEK 5
AUD/NOK 5 EUR/HUF 3 GBP/SGD 5
...
Which should be parsed as a dataframe of two columns (CCY and Decimals), not six. My question is, what is the most idiomatic way of achieving this?
I would have wanted something like the following:
data = pd.read_csv("file.csv")
data.groupby(axis=1,by=data.columns.map(lambda s: s.replace("\..",""))).\
apply(lambda df : df.values.flatten())
When reading the csv file we end up with columns CCY, Decimals, CCY.1, Decimals.1, etc. The groupby operation returns a group-by object (conceptually a collection of data frames):
<pandas.core.groupby.DataFrameGroupBy object at 0x3a52b10>
which we would then flatten using numpy functionality. So we would be converting the DataFrames with repeating columns into Series, and then merging these into a resulting DataFrame.
However, this doesn't work. I've tried passing different key arguments to groupby, but it always complains about being unable to reindex non-unique columns.
There are a number of existing questions that deal with flattening groups of columns (e.g. "Flattening" output of group.nth in Pandas), but I can't find any that do this for repeating columns.
To use groupby, I'd do:
>>> groups = df.groupby(axis=1,by=lambda x: x.rsplit(".",1)[0])
>>> pd.DataFrame({k: v.values.flat for k,v in groups})
CCY Decimals
0 AUD/CAD 5
1 EUR/CZK 4
2 GBP/NOK 5
3 AUD/CHF 5
4 EUR/DKK 5
5 GBP/NZD 5
6 AUD/DKK 5
7 EUR/GBP 5
8 GBP/PLN 5
9 AUD/JPY 3
10 EUR/HKD 5
11 GBP/SEK 5
12 AUD/NOK 5
13 EUR/HUF 3
14 GBP/SGD 5
[15 rows x 2 columns]
and then sort to restore the original row order.
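If the trailing sort feels fragile, a variant that avoids it is to flatten each group column-major with ravel(order="F"), so the values come out in the original top-to-bottom, left-to-right reading order; a sketch under the same assumptions as above:
>>> groups = df.groupby(axis=1, by=lambda x: x.rsplit(".", 1)[0])
>>> pd.DataFrame({k: v.values.ravel(order="F") for k, v in groups})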