I have a database dump with approx. 6.0000 lines.
They all look like this:
{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,”certificate”:”STRING WITH A LOT OF CHARACTERS”,”certificate_code”:”ABCDE,”certificate_date":1546297200}
"STRING WITH A LOT OF CHARACTERS" is a string with around 600.000 characters (!)
I need those characters on each line removed... I tried with:
sed 's/certificate\":\"*","certificate_code//'
But it seems it did not do the trick.
I couldn't find an existing answer that works for this, so I'm reaching out. Is this best done with sed, or is another method better?
For now I don't care whether all the characters in "STRING WITH A LOT OF CHARACTERS" are removed or replaced by, e.g., a 0; either would make the file workable for me ;)
The output for od -xc filename | head is:
0000000 2d2d 4d20 5379 4c51 6420 6d75 2070 3031
- - M y S Q L d u m p 1 0
0000020 312e 2033 4420 7369 7274 6269 3520 372e
. 1 3 D i s t r i b 5 . 7
0000040 322e 2c39 6620 726f 4c20 6e69 7875 2820
. 2 9 , f o r L i n u x (
0000060 3878 5f36 3436 0a29 2d2d 2d0a 202d 6f48
x 8 6 _ 6 4 ) \n - - \n - - H o
0000100 7473 203a 3231 2e37 2e30 2e30 2031 2020
s t : 1 2 7 . 0 . 0 . 1
hope you can help me!
When I run the od command on the sample text you've supplied, the output includes:
0000520 454d 4f43 4544 e22c 9d80 6563 7472 6669
M E C O D E , ” ** ** c e r t i f
0000540 6369 7461 e265 9d80 e23a 9d80 5453 4952
i c a t e ” ** ** : ” ** ** S T R I
0000560 474e 5720 5449 2048 2041 4f4c 2054 464f
N G W I T H A L O T O F
0000600 4320 4148 4152 5443 5245 e253 9d80 e22c
C H A R A C T E R S ” ** ** , ”
0000620 9d80 6563 7472 6669 6369 7461 5f65 6f63
** ** c e r t i f i c a t e _ c o
0000640 6564 80e2 3a9d 80e2 419d 4342 4544 e22c
d e ” ** ** : ” ** ** A B C D E , ”
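So you can see that the "quotes" are the byte sequence e2 80 9d, which is the UTF-8 encoding of Unicode U+201D, RIGHT DOUBLE QUOTATION MARK (see https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128 ). Note that od -x prints 16-bit words in host byte order, so on a little-endian machine each pair of bytes appears swapped: e22c 9d80 is the byte sequence 2c e2 80 9d.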
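To double-check that reading of the dump, here is a small Python sketch (Python is used purely for illustration, and the variable names are mine). It encodes the fragment DE,”ce from the sample and regroups the bytes into little-endian 16-bit words, which is how od -x would display them on a typical x86 host (that host byte order is an assumption):

```python
import struct

# The fragment DE,”ce from the sample line (\u201d is the curly quote).
fragment = "DE,\u201dce"
raw = fragment.encode("utf-8")

# The curly quote occupies three bytes in UTF-8.
quote_bytes = "\u201d".encode("utf-8")
print(quote_bytes.hex(" "))  # e2 80 9d

# od -x groups bytes into 16-bit words in host byte order; on a
# little-endian host each pair of bytes therefore appears swapped.
words = [f"{w:04x}" for (w,) in struct.iter_unpack("<H", raw)]
print(" ".join(words))  # 4544 e22c 9d80 6563
```

The second line of output matches the words shown at offset 0000520 in the dump above.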
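Probably the simplest approach is to skip these Unicode characters with the single-character wildcard . (in a UTF-8 locale, . should match the whole multi-byte character; in the C locale it matches a single byte and this will not work):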
sed "s/certificate.:.*.certificate_code/certificate_code/"
Unfortunately, sed doesn't appear to accept the Unicode \u201d syntax, so some other answers suggest using the hex byte sequence (\xe2\x80\x9d) instead, e.g. "Escaping double quotation marks in sed". I haven't got that to work yet, though, and I have to sign off now.
This answer explains why it could have happened, with some remedial action if that's possible in your situation: "Unknown UTF-8 code units closing double quotes".
If you are working with bash, would you please try the following:
q=$'\xe2\x80\x9d'
sed "s/certificate${q}:${q}.*${q},${q}certificate_code//" file
Result:
{"student”:”12345”,”achieved_date":1576018800,"expiration_date":1648677600,"course_code”:”SOMECODE,””:”ABCDE,”certificate_date":1546297200}
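If sed turns out to be awkward with lines this long, the same idea can be sketched in Python. This is only an illustration under a few assumptions: the function name strip_certificate is mine, the quote around each key may be either a plain " or the curly U+201D, and the long value is replaced by a single 0 (which the question says is acceptable) so that both keys survive:

```python
import re

# Either quote style may appear in the dump: plain '"' or curly U+201D.
Q = '["\u201d]'

# Group 1: the certificate key and opening quote; group 2: the closing
# quote, the comma and the start of the next key. The value in between
# is matched non-greedily.
PATTERN = re.compile(rf'(certificate{Q}:{Q}).*?({Q},{Q}certificate_code)')


def strip_certificate(line: str) -> str:
    """Replace the long certificate value with a single 0, keeping both keys."""
    return PATTERN.sub(r'\g<1>0\g<2>', line)


# Shortened version of the sample line from the question.
sample = (
    '{"student\u201d:\u201d12345\u201d,\u201dcourse_code\u201d:\u201dSOMECODE,'
    '\u201dcertificate\u201d:\u201dSTRING WITH A LOT OF CHARACTERS\u201d,'
    '\u201dcertificate_code\u201d:\u201dABCDE,\u201dcertificate_date":1546297200}'
)
cleaned = strip_certificate(sample)
print(cleaned)
```

To process the real dump, the function could be applied line by line while reading the file with encoding="utf-8" and writing each result to a new file.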
I want to get the current day of the week using Dart.
Code I want:-
//I want something like this
var day = DateTime.getCurrentDay();
Output:-
Tuesday
Use this pattern (DateFormat is provided by the intl package):
DateFormat('EEEE').format(yourDate);
or
DateFormat('EEEE').format(DateTime.now());
See all available patterns here
ICU Name                     Skeleton
--------                     --------
DAY                          d
ABBR_WEEKDAY                 E
WEEKDAY                      EEEE
ABBR_STANDALONE_MONTH        LLL
STANDALONE_MONTH             LLLL
NUM_MONTH                    M
NUM_MONTH_DAY                Md
NUM_MONTH_WEEKDAY_DAY        MEd
ABBR_MONTH                   MMM
ABBR_MONTH_DAY               MMMd
ABBR_MONTH_WEEKDAY_DAY       MMMEd
MONTH                        MMMM
MONTH_DAY                    MMMMd
MONTH_WEEKDAY_DAY            MMMMEEEEd
ABBR_QUARTER                 QQQ
QUARTER                      QQQQ
YEAR                         y
YEAR_NUM_MONTH               yM
YEAR_NUM_MONTH_DAY           yMd
YEAR_NUM_MONTH_WEEKDAY_DAY   yMEd
YEAR_ABBR_MONTH              yMMM
YEAR_ABBR_MONTH_DAY          yMMMd
YEAR_ABBR_MONTH_WEEKDAY_DAY  yMMMEd
YEAR_MONTH                   yMMMM
YEAR_MONTH_DAY               yMMMMd
YEAR_MONTH_WEEKDAY_DAY       yMMMMEEEEd
YEAR_ABBR_QUARTER            yQQQ
YEAR_QUARTER                 yQQQQ
HOUR24                       H
HOUR24_MINUTE                Hm
HOUR24_MINUTE_SECOND         Hms
HOUR                         j
HOUR_MINUTE                  jm
HOUR_MINUTE_SECOND           jms
HOUR_MINUTE_GENERIC_TZ       jmv   (not yet implemented)
HOUR_MINUTE_TZ               jmz   (not yet implemented)
HOUR_GENERIC_TZ              jv    (not yet implemented)
HOUR_TZ                      jz    (not yet implemented)
MINUTE                       m
MINUTE_SECOND                ms
SECOND                       s
// DateTime.now() or any DateTime format can be used.
DateFormat('EEEE').format(DateTime.now());
To use other Date formats/patterns like "EEEE" you can visit the Official Documentation
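For readers who want to cross-check what the EEEE pattern yields outside Dart, here is a minimal Python sketch of the same lookup. It is not Dart and not part of the intl package; the helper name weekday_name and the hard-coded English names are my own assumptions, chosen so the result does not depend on the system locale:

```python
from datetime import date, datetime

# English weekday names, indexed the way date.weekday() counts (Monday = 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def weekday_name(d: date) -> str:
    """Return the full English weekday name for a date or datetime."""
    return WEEKDAYS[d.weekday()]


today = weekday_name(datetime.now())
print(today)
print(weekday_name(date(2024, 1, 2)))  # Tuesday
```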
I have data in csv format as shown below.
The data has the below format
"first_name","last_name","company_name","address","city","county","postal","phone1","phone2","email","web"
The sample data is stored in User.csv, which contains the following:
"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk"
"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk"
"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk"
When I load it using PigStorage:
user = LOAD '/home/abhijit/Downloads/User.csv' USING PigStorage(',');
DUMP user;
The output of it is like :
("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")
("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")
("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")
I want to do a group by on city. So I have written
grp = group user by $4;
dump grp;
I get the output as :
( Binney St",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("8 Moor Place",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14 Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
The company_name and address fields cause the problem because they contain ',' as part of the value, for example "14, Taylor St" in address or "Elliott, John W Esq" in company_name.
So $4 resolves to "Taylor St" rather than "St. Stephens Ward".
Because of the extra delimiter inside the address or company_name data, the fields are not separated properly and the group by does not give the correct result.
How can I get the group by output shown below?
("Abbey Ward",{("Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire","HP11 2AX","01937-864715","01714-737668","evan.zigomalas#gmail.com","http://www.capgeminiamerica.co.uk")})
("St. Stephens Ward",{("Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent","CT2 7PP","01835-703597","01944-369967","atomkiewicz#hotmail.com","http://www.alandrosenburgcpapc.co.uk")})
("East Southbourne and Tuckton W",{("France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth","BH6 3BE","01347-368222","01935-821636","france.andrade#hotmail.com","http://www.elliottjohnwesq.co.uk")})
grp = group a by $5 ;
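That won't be a solution for me; I already thought of it. The columns only shift on rows that contain an embedded comma, so no single index works for every row.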
The problem is that PigStorage does not take quoting into account: it splits on every comma, so it creates extra columns whenever a quoted field contains a comma.
Using CSVExcelStorage (org.apache.pig.piggybank.storage.CSVExcelStorage, from the piggybank library) solves this, because that loader understands quoted fields and therefore produces the right number and order of columns.
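The difference between the two loaders can be illustrated outside Pig. The sketch below is Python, not Pig Latin, and only mimics the behaviour: a plain split on ',' stands in for PigStorage(','), and the standard csv module stands in for a quote-aware loader such as CSVExcelStorage. The rows are the first six columns of the sample data, truncated for brevity:

```python
import csv
from collections import defaultdict
from io import StringIO

# First six columns of the three sample rows (truncated for brevity).
SAMPLE = (
    '"Aleshia","Tomkiewicz","Alan D Rosenburg Cpa Pc","14, Taylor St","St. Stephens Ward","Kent"\n'
    '"Evan","Zigomalas","Cap Gemini America","5, Binney St","Abbey Ward","Buckinghamshire"\n'
    '"France","Andrade","Elliott, John W Esq","8 Moor Place","East Southbourne and Tuckton W","Bournemouth"\n'
)

# Naive splitting: quotes are ignored, so embedded commas shift the columns.
naive_keys = [line.split(",")[4] for line in SAMPLE.splitlines()]

# Quote-aware parsing: commas inside "..." stay part of the field.
by_city = defaultdict(list)
for row in csv.reader(StringIO(SAMPLE)):
    by_city[row[4]].append(row)

print(naive_keys)       # the wrong group keys
print(sorted(by_city))  # the three city names
```

The naive keys reproduce the broken group keys from the question, while the quote-aware version groups on the city column for every row.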
I am fairly new to scraping/parsing HTML in R. I am trying to get data from the Career Receiving Statistics and Career Rushing Statistics tables from http://totalfootballstats.com/PlayerWR.asp?id=1218565.
I know about the readHTMLTable function, but both of these tables are embedded in so much junk that I can't seem to get past the child nodes of the root.
EDIT: the above problem has been solved. However, for the website http://www.sports-reference.com/cfb/players/a-index.html I am trying to loop through all players and access their data. I'm running into trouble accessing their respective URLs. I have tried:
fb=htmlParse("http://www.sports-reference.com/cfb/players/a-index.html")
p1=getNodeSet(fb,'//pre')
con = textConnection(xmlValue(p1[[100]]))
players100 = read.table(con)
But this results in the error "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 3 did not have 5 elements"
The other thing I tried is:
links <- xpathSApply(fb, "//a/@href")
But I feel like there should be a better way to do this?
Well here's the same player from a different website, much much cleaner. The data doesn't match though, so someone got it wrong. My money's on totalfootballstats.com. Choose your resources wisely!
readHTMLTable(
"http://www.sports-reference.com/cfb/players/doyle-aaron-1.html"
)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 12 12.0 0 1 34 34.0 0 2 46 23.0 0
# 2 1989 Miami (FL) Ind WR 11 8 93 11.6 1 8 93 11.6 1
# $kick_ret
# Year School Conf Class Pos G Ret Yds Avg TD Ret Yds Avg TD
# 1 1988 Miami (FL) Ind WR 11 1 8 8.0 0
# 2 1989 Miami (FL) Ind WR 11
For specific requests, it looks like you can construct a valid URL like this. Because sprintf() is vectorized, the same call will also construct the paths for multiple players at once.
## base URI
u <- "http://www.sports-reference.com"
## player first and last names
first <- "bill"
last <- "adams"
## use sprintf() to make all the paths at once
fullPath <- sprintf("%s/cfb/players/%s-%s-1.html", u, first, last)
## read the table - I think you'll need to loop readHTMLTable() though
readHTMLTable(fullPath)
# $receiving
# Year School Conf Class Pos G Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
# 1 1969 Dayton Ind WR 10 1 3 3.0 1 1 3 3.0 1
# 2 1970 Dayton Ind WR 10 4 42 10.5 1 4 42 10.5 1
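The link-collection step from the question can also be sketched in a language-neutral way. The Python below is not R and does not use the XML package; it shows the general idea of pulling href attributes out of the index page and turning them into full player URLs. The HTML snippet is a made-up stand-in for the real index page, and the class name PlayerLinkParser is hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.sports-reference.com"


class PlayerLinkParser(HTMLParser):
    """Collect absolute URLs of links that point at player pages."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links that look like player pages.
        if href and href.startswith("/cfb/players/") and href.endswith(".html"):
            self.links.append(urljoin(BASE, href))


# Made-up markup standing in for the real index page.
SAMPLE_HTML = (
    '<pre><a href="/cfb/players/bill-adams-1.html">Bill Adams</a>\n'
    '<a href="/cfb/players/doyle-aaron-1.html">Doyle Aaron</a>\n'
    '<a href="/cfb/schools/dayton/">Dayton</a></pre>'
)

parser = PlayerLinkParser()
parser.feed(SAMPLE_HTML)
player_urls = parser.links
print(player_urls)
```

Each collected URL could then be passed to the table reader in a loop, as suggested in the comment in the R code above.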
I am doing an analysis in Stata and I have a lot of different panel regressions (within, first-difference and random trend) and to see the results properly, I am using eststo and esttab.
My problem now is that to get the difference for first difference and the double difference for the random trend, I use d.varname and d.d.varname.
Stata then thinks that the differences are new variables and puts them in their own rows which becomes very difficult to read.
Does anyone have an idea how I can get a regression table in which Stata treats varname, d.varname and d.d.varname as the same variable?
My regression looks like this:
foreach v in a aa aaa aaaa{
qui eststo: xtreg `v' b b1 b2 b3 b4 b5 i.year, fe cluster(xy)
qui eststo: xtreg `v' b b1 b2 b3 b4 b5 i.year if c>d, fe cluster(xy)
qui eststo: reg d.`v' d.b d.b1 d.b2 d.b3 d.b4 d.b5 i.year, cluster(xy)
qui eststo: reg d.`v' d.b d.b1 d.b2 d.b3 d.b4 d.b5 i.year if c>d, cluster(xy)
qui eststo: reg d.d.`v' d.d.b d.d.b1 d.d.b2 d.d.b3 d.d.b4 d.d.b5 i.year, cluster(xy)
qui eststo: reg d.d.`v' d.d.b d.d.b1 d.d.b2 d.d.b3 d.d.b4 d.d.b5 i.year if c>d, cluster(xy)
esttab using output.tex, wide
}
In my table I then get my estimates for
b
b1
b2
b3
b4
b5
d.b1
d.d.b1
d.b2
d.d.b2
and so on..
This is a bit hacked together--ultimately it doesn't do anything fancy but just automates the changing of variable names across specifications. It seems a bit too much code for such a simple question, but I don't know of an easier way to do this.
***create dummy data
set seed 99
webuse xtsetxmpl, clear
foreach i in "" 1 2 3 4 5 {
    g b`i' = uniform()
}
foreach i in a aa aaa aaaa c d {
    g `i'= y*uniform()
}
*next two lines just so the differencing works
g xy = pid
replace tod = (tod-1609570800000)/(36*100000)
xtset pid tod
***end of data creation
cap program drop diff
program define diff
    syntax anything
    cap drop *_adj *adjDV
    if "`anything'" == "orig" {
        foreach i in "" 1 2 3 4 5 {
            g b`i'_adj = b`i'
        }
        foreach i in a aa aaa aaaa {
            g `i'_adjDV = `i'
        }
    }
    else {
        foreach i in "" 1 2 3 4 5 {
            g b`i'_adj = `anything'b`i'
        }
        foreach i in a aa aaa aaaa {
            g `i'_adjDV = `anything'`i'
        }
    }
end
**************************************
*run original regression (excluding year term not necessary to example)
**************************************
eststo clear
foreach v in a aa aaa aaaa {
    diff orig
    eststo: xtreg `v'_adjDV *adj , fe cluster(xy)
    eststo: xtreg `v'_adjDV *adj if c>d, fe cluster(xy)
    diff d.
    eststo: reg `v'_adjDV *adj , cluster(xy)
    eststo: reg `v'_adjDV *adj if c>d, cluster(xy)
    diff d.d.
    eststo: reg `v'_adjDV *adj , cluster(xy)
    eststo: reg `v'_adjDV *adj if c>d, cluster(xy)
    esttab _all, wide
}
You are new here, so just for the future, try to post a MWE (minimal working example)---it makes things a bit quicker on this end. You can see that I have given an example of how to do this in the first section of the code.
It actually works just using the esttab command. One example:
esttab, compress nogaps booktabs drop(_cons) ///
    indicate("State specific effects = var_dummy_*") ///
    rename(D.b b D2.b b D.b2 b2 D2.b2 b2)
The rename gets the job done. The rest is included to show more possibilities.
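With many regressors, typing the rename() pairs by hand gets tedious. As a rough sketch of how the list could be generated, here is a small Python helper (not Stata; the function names are mine). It assumes the coefficient labels follow the D. / D2. style used in the rename() list above and strips those leading difference operators to find the base name:

```python
import re

# One or more leading difference operators such as "D.", "D2." or "d.d."
DIFF_PREFIX = re.compile(r"^(?:[dD]\d*\.)+")


def base_name(coef: str) -> str:
    """Strip leading difference operators from a coefficient label."""
    return DIFF_PREFIX.sub("", coef)


def rename_pairs(coefs):
    """Build the 'old new old new ...' string for differenced labels only."""
    pairs = []
    for coef in coefs:
        base = base_name(coef)
        if base != coef:
            pairs.append(f"{coef} {base}")
    return " ".join(pairs)


coefs = ["b", "D.b", "D2.b", "b2", "D.b2", "D2.b2"]
rename_arg = rename_pairs(coefs)
print(f"rename({rename_arg})")  # rename(D.b b D2.b b D.b2 b2 D2.b2 b2)
```

The printed string can be pasted into the esttab call; undifferenced labels are left out because they need no renaming.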