Webscraping Pokemon Data - html
I am trying to find out the number of moves each Pokemon (first generation) could learn.
I found the following website that contains this information: https://pokemondb.net/pokedex/game/red-blue-yellow
There are 151 Pokemon listed here - and for each of them, their move set is listed on a template page like this: https://pokemondb.net/pokedex/bulbasaur/moves/1
Since I am using R, I tried to get the website addresses for each of these 150 Pokemon (https://docs.google.com/document/d/1fH_n_BPbIk1bZCrK1hLAJrYPH2d5RTy9IgdR5Ck_lNw/edit#):
names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
"Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
"Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
"Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
"Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
"Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
"Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")
template_1 = rep("https://pokemondb.net/pokedex/",150)
template_2 = rep("/moves/1",150)
pokemon_websites = data.frame(template_1, names, template_2)
pokemon_websites$full_website = paste(pokemon_websites$template_1, pokemon_websites$names, pokemon_websites$template_2)
Next, I remove all spaces:
library(stringr)
pokemon_websites$full_website = str_remove_all( pokemon_websites$full_website," ")
Now, I have a column with all the website names:
head(pokemon_websites)
template_1 names template_2 full_website
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1
I would like to count the number of moves each of these 150 Pokemon can learn. For example, the first Pokemon, "Bulbasaur", can learn 24 moves.
In the end, I would like to add a column to the earlier data frame that contains the number of moves each Pokemon can learn. For example, something that looks like this:
> head(pokemon_websites)
template_1 names template_2 full_website number_of_moves
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1 24
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1 ???
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1 ???
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1 ???
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1 ???
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1 ???
Is there a way to webscrape this data in R, count the number of moves for each of the 150 Pokemon, and then place this move count into a column?
Right now I am doing this by hand and it is taking a long time! Also, I have heard some websites do not allow for automated webscraping - if this website (https://pokemondb.net/pokedex/game/red-blue-yellow) does not allow webscraping, I can try to find another website that might allow it.
Thank you!
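You can scrape all the tables for each of the Pokemon using something like this:
library(rvest)  # provides read_html(), html_nodes(), html_table() and the %>% pipe
# Fetch every <table> on each page; a page that raises an error or warning yields NULL
tables = lapply(pokemon_websites$full_website, function(link) {
  tryCatch(
    read_html(link) %>% html_nodes("table") %>% html_table(),
    error = function(e) {}, warning = function(w) {}
  )
})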
However, note that the number of tables returned differs for each of the pokemon. For example the first has 6 tables - the first three of those are for Red/Blue, the second three of those are for Yellow.
lengths(tables)
[1] 6 6 6 6 6 6 6 6 6 2 4 7 2 4 8 6 6 6 4 4 6 6 6 6 6 8 6 6 0 4 8 4 8 6 8 4 6 6 8 4 4 6 6 8 6 6 5 5 5 5 4 4 6 6 6
[56] 6 4 6 6 6 8 6 6 6 6 6 6 6 6 8 6 6 6 6 6 4 4 6 6 6 6 0 6 6 6 6 4 4 6 8 4 4 6 6 6 6 6 6 6 6 4 8 6 7 6 6 6 4 4 6
[111] 6 6 6 6 6 6 6 6 6 8 0 6 4 6 6 6 6 2 8 6 2 4 8 8 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
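The zeros in the output above sit at positions 29, 82 and 121, which are "Nidoran", "Farfetch’d" and "Mr.Mime" in the names vector, so those three URLs most likely did not resolve. A minimal sketch of one way to build cleaner URL slugs follows. The helper name make_slug and the slug rules (lowercase, no apostrophes, dots and spaces turned into hyphens) are assumptions about the site's URL scheme, not something confirmed in this thread, so check the results against the real pages. "Nidoran" would presumably also need to be split into its two forms, each with its own page.

```r
# Hypothetical helper: turn a display name into a URL slug.
# The rules below are assumptions about the site's URL scheme.
make_slug <- function(name) {
  slug <- tolower(name)
  slug <- gsub("\u2019", "", slug, fixed = TRUE)  # drop curly apostrophes
  slug <- gsub("'", "", slug, fixed = TRUE)       # drop straight apostrophes
  slug <- gsub("[. ]+", "-", slug)                # dots and spaces become one hyphen
  sub("-+$", "", slug)                            # strip any trailing hyphen
}

# Two of the names that returned no tables, plus one that worked
problem_names <- c("Farfetch\u2019d", "Mr.Mime", "Bulbasaur")
slugs <- make_slug(problem_names)
urls <- paste0("https://pokemondb.net/pokedex/", slugs, "/moves/1")
print(urls)
```

Using paste0() also avoids the extra step of stripping the spaces that paste() inserts by default.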
Since the OP wants to count only the moves in the Red/Blue tab, we can do the following. (If you need moves from both tabs, follow @langtang's answer.)
library(rvest)
# Keep only the tables inside the active (Red/Blue) tab; pages that fail yield NULL
tables1 = lapply(pokemon_websites$full_website, function(x) {
  tryCatch(
    x %>% read_html() %>% html_nodes('.active') %>% html_nodes('.resp-scroll') %>% html_table(),
    error = function(e) NULL
  )
})
# Number of rows in each table of each page
moves = lapply(tables1, function(x) lapply(x, function(x) dim(x)[1]))
moves = lapply(moves, unlist, use.names = FALSE)
# Total per Pokemon; pages with no tables sum to 0
moves = lapply(moves, sum) %>% unlist()
moves
[1] 24 25 27 32 33 37 32 33 37 2 3 30 2 3 26 22 23 25 24 27 21 23 24 26 29 30 27 29 0 28 43 28 44 41 42 22 23 40 41 19 22 21 23 25 23 26 22 29 20 23 24
[52] 27 31 34 31 34 23 25 25 36 37 25 34 35 29 31 32 23 24 26 28 31 30 31 33 22 26 37 46 23 26 0 23 26 25 28 22 24 27 29 20 20 32 25 33 36 25 27 24 27 24 29
[103] 32 35 26 26 37 19 21 27 42 44 24 36 22 24 25 27 33 34 0 21 34 35 30 23 27 2 34 34 1 19 32 32 28 30 22 28 22 30 24 43 25 25 22 30 32 36 45 60
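Because lapply() keeps the input order, the moves vector lines up row by row with the data frame, so with the real data the requested column can be added with pokemon_websites$number_of_moves = moves. The sketch below shows the same idea on mock data so it runs without network access. The names count_moves, mock_tables and mock_df are hypothetical, and returning NA instead of 0 for a failed page is a design choice that differs from the code above.

```r
# Count the rows across all move tables scraped from one page.
# Returns NA when the page failed or had no tables, so that a failed
# request is not mistaken for "zero moves".
count_moves <- function(page_tables) {
  if (length(page_tables) == 0) return(NA_integer_)
  sum(vapply(page_tables, nrow, integer(1)))
}

# Mock stand-ins for the scraped results: one page with two tables,
# one page that failed and therefore produced NULL
mock_tables <- list(
  list(data.frame(Move = c("Tackle", "Growl")), data.frame(Move = "Cut")),
  NULL
)

mock_df <- data.frame(names = c("Bulbasaur", "Nidoran"))
mock_df$number_of_moves <- vapply(mock_tables, count_moves, integer(1))
print(mock_df)
```

Rows that come back as NA are the ones whose URL should be checked by hand.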
Related
Can you explain this function and how does this work with examples?
I can't understand this function, and I have checked the tutorial website for the functions I don't know. Here is the code:
def print_formatted(n):
    # your code goes here
    for i in range(1, n + 1):
        pad = n.bit_length()
        dec = str(i).rjust(pad)
        octs = str(oct(i)[2:]).rjust(pad)
        hexx = str(hex(i)[2:]).rjust(pad).upper()
        bina = str(bin(i)[2:]).rjust(pad)
        print(f'{dec} {octs} {hexx} {bina}')
Thanking you in advance!
This is the output it gave when called with n = 17:
    1     1     1     1
    2     2     2    10
    3     3     3    11
    4     4     4   100
    5     5     5   101
    6     6     6   110
    7     7     7   111
    8    10     8  1000
    9    11     9  1001
   10    12     A  1010
   11    13     B  1011
   12    14     C  1100
   13    15     D  1101
   14    16     E  1110
   15    17     F  1111
   16    20    10 10000
   17    21    11 10001
So how does the above code make sure that the output is aligned like this, and does not look like this:
1 1 1 1
2 2 2 10
3 3 3 11
4 4 4 100
5 5 5 101
6 6 6 110
7 7 7 111
8 10 8 1000
9 11 9 1001
10 12 A 1010
11 13 B 1011
12 14 C 1100
13 15 D 1101
14 16 E 1110
15 17 F 1111
16 20 10 10000
17 21 11 10001
Find if a value exists in a Google sheet on a certain column, in all the rows above the current row based on 2 criteria
I have the following scenario: columns A-Z and 100 rows.
In each row of column Z, I want to find whether the value in column A of the current row exists in column A of the rows above. If it exists, I would like to find whether column B of the matching rows has a value for all the rows that match.
I would like to receive the matching rows in an array list, not as rows, or at least to be able to put a value like "matching"/"not matching". This should be an array formula.
I've tried something like this, only for the first criterion, but somehow it checks only the current row:
=ARRAYFORMULA( IF(ROW(Z2:Z)>2, IF(MATCH(A2:A,$A$2:A&ROW(A2:A)-1),"matching","not matching"),"not matching"))
I check whether it's the first row (as it has headers); if it is, then surely it can't have any matching data above.
It would be great to have this as a Google Sheets formula, but if that's not possible it could also be a Google Apps Script.
Try this:
function myfunk() {
  const ss = SpreadsheetApp.getActive();
  const sh = ss.getSheetByName("Sheet0");
  const osh = ss.getSheetByName("Sheet1");
  osh.clearContents();
  const dsr = 2;
  const vs = sh.getRange(dsr, 1, sh.getLastRow() - dsr + 1, sh.getLastColumn()).getDisplayValues();
  let o = [];
  vs.forEach((r, i) => {
    if (i > 0) {
      let as = vs.map(r => r[0]).slice(0, i); // suggested by DoubleUnary
      let bs = vs.map(r => r[1]).slice(0, i); // suggested by DoubleUnary
      let idx = as.indexOf(r[25]);
      if (~idx && bs[idx]) {
        o.push(['yes', dsr + i, dsr + idx, r[25], bs[idx]])
      } else {
        o.push(['no', dsr + i, ~idx ? as[idx] : '', r[25], ~idx ? bs[idx] : '']);
      }
    }
  });
  o.unshift(['Value', 'Test Row', 'Result Row', 'Z value', 'B value'])
  Logger.log(JSON.stringify(o));
  osh.getRange(1, 1, o.length, o[0].length).setValues(o);
}
My Data:
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 COL13 COL14 COL15 COL16 COL17 COL18 COL19 COL20 COL21 COL22 COL23 COL24 COL25 COL26
4 4 8 18 3 15 15 6 6 18 2 10 19 14 5 16 3 6 0 13 15 14 10 13 19 7 14 5 18 12 12 3 5 5 12 0 0 4 19 17 13 14 2 6 2 0 18 15 16 1 1 15 14 8 18 19 18 19 14 11 9 2 12 4 19 8 7 17 2 5 17 12 3 18 6 15 12 17 12 15 1 11 2 14 4 12 15 4 2 7 13 12 4 10 0 2 9 2 15 12 18 7 10 6 15 8 3 11 3 11 8 2 0 12 18 12 17 3 3 10 5 18 0 6 19 12 11 2 3 5 16 16 7 14 12 3 1 9 0 1 9 4 17 11 18 2 4 16 13 4 1 3 4 13 9 8 11 18 9 9 10 17 6 16 8 10 15 10 18 1 2 9 10 18 13 0 11 4 7 2 0 18 3 5 1 5 18 17 4 8 2 4 10 13 7 10 9 6 3 7 5 7 12 12 6 0 3 7 3 3 19 4 2 5 0 9 5 14 0 2 15 9 18 6 1 15 5 5 1 12 4 7 9 3 19 19 15 16 12 18 13 0 12 4 12 4 1 8 19 2 1 1 8 14 6 10 0 16 14 14 10 8 3 15 5 13 9 13 10 6 16 2 15 3 2 16 19 2 14 1 10 1 1 5 5 10 8 3 8 17 13 15 8 9 6 4 2 14 6 4 1 6 14 8 9 11 12 3 18 5 14 9 18 2 12 17 2 17 10 0 11 7 11 2 0 11 15 6 7 13 10 18 17 6 19 12 14 15 7 12 5 0 17 15 2 2 18 6 7 13 1 10 19 9 7 13 15 13 7 18 11 13 10 8 1 10 5 17 9 9 5 14 3 3 1 19 7 13 0 5 10 2 12 17 3 12 9 0 10 9 15 6 14 18 1 3 6 4 9 19 4 9 15 11 0 3 10 19 5 18 16 10 4 4 4 1
1 6 8 10 9 8 19 4 11 18 12 14 8 4 5 11 8 17 5 7 13 13 16 14 8 7 14 7 18 9 3 11 0 1 7 19 8 6 3 4 4 2 4 11 3 7 5 5 9 16 15 7 6 4 6 7 17 8 13 10 2 9 18 0 13 12 4 13 9 4 19 4 7 10 17 1 5 5 3 7 12 3 19 19 7 1 11 9 9 9 7 5 6 8 7 0 11 19 6 17 12 1 18
Results:
Value Test Row Result Row Z value B value
no 3 15
no 4 17
no 5 6
no 6 5
no 7 8
no 8 18
no 9 7
yes 10 9 3 5
yes 11 3 14 5
no 12 10
yes 13 3 14 5
yes 14 3 14 5
yes 15 12 10 8
yes 16 12 10 8
yes 17 2 4 4
no 18 8 8
yes 19 6 15 8
no 20 5
no 21 18
Google script to append multiple rows - select columns
How can I append multiple rows from a source sheet to a destination sheet, selecting only certain columns such as A, C and H? The source sheet has columns A-Z, with multiple rows. The destination sheet should receive the appended rows from the source sheet, with columns A, C and H only. Thank you
Transfer data from one sheet to another, selecting only columns A, C and H:
function xferdata() {
  const ss = SpreadsheetApp.getActive();
  const ssh = ss.getSheetByName("Sheet0");
  const dsh = ss.getSheetByName("Sheet1");
  const vs = ssh.getRange(2, 1, ssh.getLastRow() - 1, 8).getValues();
  let o = vs.map(([a,,c,,,,,h]) => [a,c,h]);
  //Logger.log(JSON.stringify(o));
  dsh.getRange(dsh.getLastRow() + 1, 1, o.length, o[0].length).setValues(o);
}
Sheet0:
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8 9
3 4 5 6 7 8 9 10
4 5 6 7 8 9 10 11
5 6 7 8 9 10 11 12
6 7 8 9 10 11 12 13
7 8 9 10 11 12 13 14
8 9 10 11 12 13 14 15
9 10 11 12 13 14 15 16
10 11 12 13 14 15 16 17
11 12 13 14 15 16 17 18
12 13 14 15 16 17 18 19
13 14 15 16 17 18 19 20
14 15 16 17 18 19 20 21
15 16 17 18 19 20 21 22
16 17 18 19 20 21 22 23
17 18 19 20 21 22 23 24
18 19 20 21 22 23 24 25
19 20 21 22 23 24 25 26
20 21 22 23 24 25 26 27
Sheet1:
A B C
1 3 8
2 4 9
3 5 10
4 6 11
5 7 12
6 8 13
7 9 14
8 10 15
9 11 16
10 12 17
11 13 18
12 14 19
13 15 20
14 16 21
15 17 22
16 18 23
17 19 24
18 20 25
19 21 26
20 22 27
Timed out script
I'm using the following code in order to avoid using an IMPORTRANGE() formula. The data I'm getting is over 50k rows, which is why I'm using Apps Script. But now I'm getting the following error:
Exception: Service Spreadsheets timed out while accessing document with id
function Live_Data_importing() {
  var raw_live = SpreadsheetApp.openByUrl("someURL").getSheetByName("Sheet1");
  var selected_columns = raw_live.getRange("A:Q").getValues().map(([a,,,,e,f,g,,i,j,,l,,n,o,]) => [a,e,f,g,i,j,l,n,o]);
  var FF_filtered = selected_columns.filter(row => row[8] == "somedata");
  var FF_Hourly_live_data = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("somesheet");
  FF_Hourly_live_data.getRange(2, 1, FF_filtered.length, FF_filtered[0].length).setValues(FF_filtered);
}
How can I fix this error? I also made a copy of the file, but the script still throws the same error.
I tested this on a smaller set of data and it works okay:
function Live_Data_importing() {
  const ss = SpreadsheetApp.getActive();
  var raw_live = ss.getSheetByName("Sheet3");
  var selected_columns = raw_live.getRange("A1:Q" + raw_live.getLastRow()).getValues().map(([a, , , , e, f, g, , i, j, , l, , n, o,]) => [a, e, f, g, i, j, l, n, o]); //fix this range
  var FF_filtered = selected_columns.filter(row => row[8] == 0); //My data is all integers
  var FF_Hourly_live_data = ss.getSheetByName("Sheet4"); //output sheet
  FF_Hourly_live_data.getRange(2, 1, FF_filtered.length, FF_filtered[0].length).setValues(FF_filtered);
}
Data:
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 COL13 COL14 COL15 COL16 COL17 COL18 COL19 COL20
0 1 18 13 0 7 5 11 15 2 1 17 0 8 11 18 0 16 1 8 0 16 16 9 5 0 16 8 14 0 7 4 3 15 8 19 18 4 18 10 9 14 7 16 17 13 17 4 8 6 19 18 9 18 12 1 8 19 12 10 13 17 4 2 10 3 7 16 14 13 1 5 16 12 2 3 14 3 0 9 4 17 14 5 9 13 18 0 4 1 14 1 12 7 10 1 15 4 14 5 1 17 13 17 3 1 17 8 19 7 14 18 2 17 1 16 2 19 13 15 7 4 9 0 11 14 11 7 12 14 3 19 14 13 18 12 16 19 19 6 7 6 10 5 15 8 0 8 2 16 4 14 18 14 2 16 16 5 15 16 4 5 9 6 6 2 1 15 14 8 19 8 4 10 12 12 4 6 10 12 11 15 3 4 1 3 17 19 10 4 11 2 10 16 12 1 6 3 0 3 11 0 14 8 13 13 4 2 16 18 10 14 5 3 7 4 7 9 5 15 0 6 5 1 2 18 1 1 11 13 7 13 5 15 5 13 17 18 14 19 0 17 10 6 10 16 16 0 18 19 12 8 15 1 11 4 19 4 17 14 4 14 14 6 16 8 15 4 5 2 4 5 14 14 16 9 0 16 0 4 8 3 8 5 12 15 5 19 14 0 1 6 12 2 19 10 10 19 13 19 0 5 14 7 2 17 3 10 2 14 2 5 0 18 5 1 13 13 3 13 10 18 18 4 18 11 0 7 13 9 18 13 9 2 0 16 15 3 9 14 12 15 15 7 16 13 14 1 4 12 18 9 6 14 18 11 16 2 18 15 15 0 2 4 6 16 5 18 2 6 14 18 11 1 9 15 13 8 8 19 11 19 14 0 7 15 1 10 8 9 14 14 15 3 11 13 4 10 5 16
Selecting or Updating with 2 group criteria for a max value on mysql [duplicate]
This question already has answers here: SQL select only rows with max value on a column [duplicate] (27 answers)
I'm trying to figure out how to use GROUP BY and max() on joins correctly while parsing a Moodle (open-source school software) MySQL database. Students are allowed to retake the quizzes indefinitely in this particular program, but I need to update the course completion date to reflect the last time they took the test, because a lot of other things depend on the completion fields.
The mdl_quiz_attempts table stores all attempts for all quizzes. A userid appears in many rows, and the attempt number is not unique to the table; it is only unique in combination with the student AND the key for the row, meaning students have multiple entries. On the mdl_course_modules table, the instance field is the key for the mdl_quiz table, and the coursemoduleid field of mdl_course_modules_completion is the key for mdl_course_modules.
So what I want to do is this: given a student id, UPDATE mdl_course_completion.timemodified to mdl_quiz_attempts.timemodified WHERE the row in mdl_quiz_attempts is the max attempt by userid for each quiz. (The quiz field on mdl_quiz_attempts has to be looked up through the instance field of the course modules table to get the course module id used for course completion.)
Here are example partial tables.
mdl_quiz_attempts
id quiz userid attempt timemodified
2 1 3 6 1365408901
6 1 4 1 1369873688
7 2 4 1 1369877532
8 7 4 1 1369881431
9 7 4 2 1369882897
12 5 4 1 1505165504
13 6 4 1 1369887643
17 8 4 1 1369958105
18 1 4 2 1374557701
22 7 4 3 1374639901
23 6 4 7 1374640202
24 5 4 2 1374639901
25 8 4 2 1374639901
26 2 4 2 1374639301
27 2 6 1 1376620469
29 2 12 1 1389915486
30 1 23 1 1390978667
31 1 23 2 1391030924
32 2 23 1 1392113103
33 2 23 2 1392696602
34 2 23 3 1392767435
35 7 12 1 1398914256
36 8 43 1 1405281193
37 1 50 1 1405522411
38 5 43 1 1505165504
mdl_course_modules
id course module instance section
3 2 9 2 3
5 2 17 2 4
7 2 17 3 5
8 2 17 4 6
9 2 17 5 7
10 2 17 6 8
11 2 17 7 9
12 2 17 8 10
13 2 17 9 11
14 2 17 10 12
15 2 17 11 13
25 2 16 1 14
26 2 23 1 4
28 2 7 1 14
30 4 9 4 26
42 4 23 3 33
45 4 23 6 38
46 4 23 7 37
47 4 23 8 36
48 4 23 9 35
49 4 23 10 32
50 4 23 11 34
51 5 9 5 27
53 5 23 12 43
55 5 23 13 44
mdl_quiz
id name
10 Unit 10 Quiz
11 Unit 2 Quiz
12 Unit 3 Quiz
13 Unit 5 Quiz
14 Unit 1 Quiz
15 Unit 8 Quiz
16 Unit 9 Quiz
17 Unit 7 Quiz
18 Unit 4 Quiz
mdl_course_modules_completion
id coursemoduleid userid completionstate viewed timemodified
14 25 2 0 1 0
15 25 6 0 1 0
67 25 4 1 1 1369873688
68 28 4 1 0 1369874483
69 192 4 1 0 1369875233
70 184 4 1 1 1369877532
Something like this?
update mdl_course_modules_completion c
join mdl_quiz_attempts a on a.userid = c.userid
join (select max(attempt) max_attempts
      from mdl_quiz_attempts
      group by userid) max on max.max_attempts = a.attempt
set c.timemodified = a.timemodified
where c.userid = :<USER_ID>