I have a data frame with over 800,000 rows and 8 columns that looks like this:
ID Index Var1 Var2 Var3 Var4 Var5 Var6
1 0 106 114 24 25 1 0
1 1 705 79 19 21 1 0
1 2 661 361 30 37 1 0
1 3 212 332 30 37 1 0
I am trying to get this dataframe into a JSON format for a small piece of a larger machine learning project.
I need the JSON-formatted object to look like this:
{'source-ref':'s3://sagemaker...jpg',
'name':{'annotations':[{'ID':1,
'var1': 106,
'var2': 114,
'var3': 24,
'var4': 25}]}}
I've tried using the to_json function combined with reset_index on ID and Index. Can someone please point me in the right direction?
800k records will generate a very large JSON object...
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""ID Index Var1 Var2 Var3 Var4 Var5 Var6
1 0 106 114 24 25 1 0
1 1 705 79 19 21 1 0
1 2 661 361 30 37 1 0
1 3 212 332 30 37 1 0"""), sep=r"\s+")
{'source-ref':'s3://sagemaker...jpg',
'name':{'annotations':df.to_dict(orient="records")}}
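To turn that structure into actual JSON, it can be passed to json.dumps (or json.dump to write a file). A minimal sketch, assuming the df built above; it round-trips the records through to_json so the values come back as plain Python numbers rather than numpy scalars:

import json

# build the annotation records as plain Python objects
annotations = json.loads(df.to_json(orient="records"))

payload = {'source-ref': 's3://sagemaker...jpg',   # placeholder S3 key from the question
           'name': {'annotations': annotations}}

json_str = json.dumps(payload)   # JSON string
# json.dump(payload, fh) would write it straight to an open file instead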
I have a table of wind directions and strengths over a 24-hour period; sample data is at the bottom of this question.
Only directions that have strengths are stored in the database. I'm currently using the following SQL:
SELECT winddirection, avespeed
FROM wp_weather_data
WHERE ID%10 = 0
What I would like to return is an entry for every degree (a 0 value for any degree not in the db), and where there are multiple entries for a given degree, only the highest value should be returned. They also need to be in ascending order of degrees.
Is this possible?
This is so I can plot a wind distribution chart on a polar chart plugin in WordPress.
Sample data returned from the above SQL:
294 2
271 3
269 2
285 3
289 2
123 1
130 1
144 1
160 0
168 0
161 0
135 0
138 0
331 0
115 0
136 0
161 0
267 0
114 0
265 0
204 0
248 1
206 0
199 1
250 2
244 3
257 3
272 5
267 5
208 3
221 3
223 4
253 6
233 5
First run the function b(n):
? b(n) = lcm(vector(n, i, i))/n
Then run the function c(n):
? c(n)=sum(j=1,n,sum(i=1,n,(-1)^(i+j)/(i+j-1)))
Finally run d(n):
? d(n)=factor(denominator(c(n))/b(n))~
and test with 202
? d(202)
The result is:
%8 =
[3 7 17 19 31 211 223 227 229 233 239 241 251 257 263 269 271 277 281 283 293
307 311 313 317 331 337 347 349 353 359 367 373 379 383 389 397 401]
[1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
What does the -1 in the factoring result indicate?
You are trying to factor a rational number: type(denominator(c(202))/b(202)) is t_FRAC rather than t_INT, because denominator(c(202))/b(202) = <some big number>/31. When PARI factors a rational number, primes in the denominator get negative exponents (factoring 2/3, for example, would list 2 with exponent 1 and 3 with exponent -1), so the -1 just means that 31 occurs to the power -1; there is no bug here.
I'm trying to scrape the data from every table at the hockey-reference awards page. I can scrape the first table for the Hart Memorial Trophy, but when I try the rest of them, I end up with empty vectors. I used Selector Gadget and the rvest package to produce the following code.
library(rvest)
url="https://www.hockey-reference.com/awards/voting-2017.html"
byng<-read_html(url)
byng_node<-html_nodes(byng, "#byng_stats .right , #byng_stats a")
byng_text<-html_text(byng_node)
However, once I run this code, I get no data in the byng variables:
> byng_node
{xml_nodeset (0)}
> byng_text
character(0)
What's happening here? Does Selector Gadget not work for pages with multiple tables? Or does it have nothing to do with that, and there is something about the HTML that I don't understand? Any help is greatly appreciated!
@neilfws was right: if you look at the source code of the HTML page, you will see that all but the first table are commented out, so rvest treats them as comments rather than as part of the document itself. Let's do a dirty hack and remove the character sequences that are used to comment out our precious tables:
library(rvest)
url="https://www.hockey-reference.com/awards/voting-2017.html"
byng<-read_html(url)
# Remove commenting sequences
byng <- gsub("<!--", "", byng)
byng <- gsub("-->", "", byng)
byng<-read_html(byng)
#Get tables as a list of dataframes
tables <- html_table(byng)
# Last table
tables[7]
[[1]]
Scoring Scoring Scoring Scoring Goalie Stats Goalie Stats
1 Place Player Age Tm Pos Votes Vote% 1st 2nd 3rd 4th 5th G A PTS +/- W L
2 1 Connor McDavid 20 EDM C 762 94.07 141 18 3 0 0 30 70 100 27
3 2 Sidney Crosby 29 PIT C 526 64.94 20 142 0 0 0 44 45 89 17
4 3 Nicklas Backstrom 29 WSH C 127 15.68 1 2 116 0 0 23 63 86 17
5 4 Mark Scheifele 23 WPG C 21 2.59 0 0 21 0 0 32 50 82 18
6 5 Auston Matthews 19 TOR C 10 1.23 0 0 10 0 0 40 29 69 2
7 6 Evgeni Malkin 30 PIT C 4 0.49 0 0 4 0 0 33 39 72 18
8 7 John Tavares 26 NYI C 2 0.25 0 0 2 0 0 28 38 66 4
9 8 Jonathan Toews 28 CHI C 1 0.12 0 0 1 0 0 21 37 58 7
10 8 Brad Marchand 28 BOS C 1 0.12 0 0 1 0 0 39 46 85 18
11 8 Ryan Kesler 32 ANA C 1 0.12 0 0 1 0 0 22 36 58 8
12 8 Ryan Getzlaf 31 ANA C 1 0.12 0 0 1 0 0 15 58 73 7
I have a long (csv) file with "column-name x value" pairs which I would like to read into a pandas.DataFrame
user_id col val
00008901 1 55
00008901 2 66
00011501 1 77
00011501 3 88
00011501 4 99
The result should look like this:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I tried to read it into a list and create a DataFrame from it, but pandas crashed as I have 4.5 million elements.
What's the best way to do that? Ideally directly with read_csv.
First use read_csv to create the DataFrame:
df = pd.read_csv('file.csv')
Then use set_index with unstack:
df1 = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
Another solution is pivot, replacing NaN with 0 via fillna and finally casting to int:
df1 = df.pivot(index='user_id', columns='col', values='val').fillna(0).astype(int)
print (df1)
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
If you get the error:
"ValueError: Index contains duplicate entries, cannot reshape"
it means you have some duplicates, so the fastest solution is groupby with unstack and some aggregate function like mean or sum:
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 77 0 88 99
It is easier to see with a slightly changed csv:
print (df)
user_id col val
0 8901 1 55
1 8901 2 66
2 11501 1 77 > duplicates -> 11501 and 1
3 11501 1 151 > duplicates -> 11501 and 1
4 11501 3 88
5 11501 4 99
print (df.groupby(['user_id','col'])['val'].mean().unstack(fill_value=0))
col 1 2 3 4
user_id
8901 55 66 0 0
11501 114 0 88 99
Actually I thought I had no duplicates, but found out that I really have some ...
I could not use .mean as it is a categorical value, but I solved the problem by first looking at the sorted table and then just keeping the last entry ... and then applying the (great!) solution above ... which I still have to fully understand ;-)
df.sort_values(by=['user_id','col'])  # optional, for debugging
df.drop_duplicates(subset=['user_id','col'], keep='last', inplace=True)
df_table = df.set_index(['user_id','col'])['val'].unstack(fill_value=0)
You can't directly read in the required structure using read_csv, but you can use the pivot_table function to convert to the desired structure.
df = pd.read_csv('filepath/your.csv')
df = pd.pivot_table(df, index='user_id', columns='col', values='val', aggfunc='mean').fillna(0).reset_index()
The output will be like
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
I don't think it is possible to get the required structure directly with read_csv.
You can create a data structure such as dictionary and use it to create a dataframe:
import pandas as pd
from collections import defaultdict
import csv

columns = 4
delimiter = ','
# every new user_id starts with a row of zeros, one slot per column
data_dict = defaultdict(lambda: [0] * columns)

with open("my_csv.csv") as csv_file:
    reader = csv.DictReader(csv_file, delimiter=delimiter)
    for row in reader:
        row_id = row["user_id"]
        col = int(row["col"]) - 1   # "col" is 1-based in the file
        val = int(row["val"])
        data_dict[row_id][col] = val

df = pd.DataFrame(list(data_dict.values()), index=list(data_dict.keys()), columns=range(1, columns + 1))
For a csv file that contains:
user_id,col,val
00008901,1,55
00008901,2,66
00011501,1,77
00011501,3,88
00011501,4,99
The output is:
1 2 3 4
00008901 55 66 0 0
00011501 77 0 88 99
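As a side note, the csv.DictReader approach keeps user_id as a string, which is why the leading zeros ("00008901") survive in this output, while a plain read_csv parses the column as an integer (hence "8901" in the earlier answers). If the zero-padded IDs matter, a small variation on the first answer, sketched here assuming the same my_csv.csv file, would be:

import pandas as pd

# dtype keeps user_id as text so the leading zeros are preserved
df = pd.read_csv('my_csv.csv', dtype={'user_id': str})
print(df.set_index(['user_id', 'col'])['val'].unstack(fill_value=0))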
I am using SQL Server 2008 R2.
I have a database table that contains some user data, as given below:
Id UserId Sys Dia ReadingType DataId IsDeleted
1 10 98 65 last 1390556024216 0
2 10 99 69 average 1390556024216 0
3 10 102 96 last 1390562788540 0
4 10 102 96 average 1390562788540 0
5 11 130 98 last 1390631241547 0
6 11 130 98 average 1390631241547 0
7 2 285 199 first 1390770562374 0
8 2 250 180 last 1390770562374 0
9 2 267 189 average 1390770562374 0
10 1 258 180 first 1391191009457 0
11 1 258 180 last 1391191009457 0
12 1 258 180 average 1391191009457 0
13 1 285 199 additional 1391191009457 0
14 22 110 78 last 1391549208338 0
15 22 123 83 last 1391549208349 0
In this table, there are records that have the same DataId but different ReadingType.
I want to set IsDeleted=1 for the records having ReadingType='last' for which there is another record with ReadingType='average' and the same DataId, Sys, Dia and UserId.
So the desired result should be:
Id UserId Sys Dia ReadingType DataId IsDeleted
1 10 98 65 last 1390556024216 0
2 10 99 69 average 1390556024216 0
3 10 102 96 last 1390562788540 1
4 10 102 96 average 1390562788540 0
5 11 130 98 last 1390631241547 1
6 11 130 98 average 1390631241547 0
7 2 285 199 first 1390770562374 0
8 2 250 180 last 1390770562374 0
9 2 267 189 average 1390770562374 0
10 1 258 180 first 1391191009457 0
11 1 258 180 last 1391191009457 1
12 1 258 180 average 1391191009457 0
13 1 285 199 additional 1391191009457 0
14 22 110 78 last 1391549208338 0
15 22 123 83 last 1391549208349 0
Here the records with Id 3, 5 and 11 should be marked as deleted because they have ReadingType='last' and there is another record with ReadingType='average' that has the same UserId, Sys, Dia and DataId.
Can anyone help me find such records and update them?
Just use UPDATE with an EXISTS subquery:
UPDATE T
SET IsDeleted=1
WHERE
ReadingType='last'
AND
EXISTS(SELECT * FROM T as T1
WHERE T1.ReadingType='average'
AND T1.DataId=T.DataId
AND T1.Sys=T.Sys
AND T1.Dia=T.Dia
AND T1.UserId=T.UserId
)
SQLFiddle demo
You can solve this in many ways, but here I am using a sub-query to solve your problem:
UPDATE T SET IsDeleted=1
WHERE ReadingType='last'
  AND DataId IN (SELECT DataId FROM T WHERE ReadingType='average')
Note that this only matches on DataId; unlike the EXISTS version above, it does not also require Sys, Dia and UserId to match.