How to extract the same column (renaming it to the name of the file) from multiple Excel files and put them in a new data table

General idea: extract the same column from several Excel files and put the columns together in a new data table.
I have multiple Excel files with different names (for example cell1, cell2, cell3, cell4, etc., or they may not have sequential names at all, just different ones).
Each Excel file has the same columns (for example: speed, displacement, mean, median, std, etc.).
I want to extract a specific column, for example speed, and rename it to the name of the Excel file it came from (for example, from the file cell1, extract the "speed" column and rename it "cell1").
Then put each renamed column into a new data table. Purpose: to end up with one data table that contains all the "speed" columns extracted from the files, each under its file's name.
excel name "cell1"
Speed
Displace
0.2
0.2
0.23
0.30
0.23
0.25
0.30
0.28
excel name "cell2"
Speed
Displace
0.1
0.2
0.13
0.30
0.33
0.25
0.30
0.28
out put new data table (could be named speed)
Cell1
Cell2
0.2
0.1
0.23
0.13
0.23
0.33
0.30
0.30
This is my starting code (I'm new to RStudio):
library(tidyverse)
library(data.table)

# Find all CSV files below the working directory
csv.files <- list.files(pattern = "\\.csv$", recursive = TRUE)

bg.df <- NULL
for (csv.file in csv.files) {
  # Read the file and record which file each row came from
  current <- readr::read_csv(csv.file) %>%
    dplyr::mutate(file = csv.file)
  # Stack the rows onto what has been read so far
  if (is.null(bg.df)) {
    bg.df <- current
  } else {
    bg.df <- dplyr::bind_rows(bg.df, current)
  }
}
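A sketch of one way to get from here to the wide "speed" table, assuming the files are CSVs in the working directory, every file has a Speed column, and all files have the same number of rows (the file-handling details are illustrative, not part of the original code):

library(tidyverse)

# All CSV files below the working directory, one per cell
csv.files <- list.files(pattern = "\\.csv$", recursive = TRUE)

speed.df <- csv.files %>%
  # Name each element after its file, without path or extension: "cell1", "cell2", ...
  purrr::set_names(tools::file_path_sans_ext(basename(csv.files))) %>%
  # Pull just the Speed column out of each file
  purrr::map(~ readr::read_csv(.x)$Speed) %>%
  # One column per file; assumes every file has the same number of rows
  tibble::as_tibble()

speed.df

If the files really are .xlsx rather than .csv, readxl::read_excel() should drop in for read_csv() here.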

Related

Create a new column in a dataframe using a function

So I think I am having a simple problem.
I have a dataframe (my_data) that looks like this:

Treatment  Amount  Duration
a          5       3000
b          8       2000
c          6       1000
d          2       5000

Now I want to create a new dataframe (my_data_1) which adds a new column based on a simple calculation, Duration / Amount.
my_data_1 should look like this:

Treatment  Amount  Duration  Mean duration
a          5       3000      600
b          8       2000      250
c          6       1000      167
d          2       5000      2500

I tried to write a function and apply it to my dataframe:
mean_duration <- function(md){my_data$Duration / my_data$Amount}
my_data_1$md <- with(my_data, ave(Duration, Amount, FUN = mean_duration))
Where did I go wrong?
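For what it's worth, the usual fix here is plain element-wise division; a minimal sketch (no custom function or ave() needed, and the new column name is only illustrative):

library(dplyr)

# dplyr: add the new column in one step
my_data_1 <- my_data %>%
  mutate(mean_duration = Duration / Amount)

# Base R equivalent
my_data_1 <- my_data
my_data_1$mean_duration <- my_data$Duration / my_data$Amount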

Import CSV File from Dynamic Path using SQL query

File path:
D:\TEST\Sales Data-20Nov2017.csv

Table1:

EMP  NAME  SALES Till Date  SALES 2015-16
A    Sam   50               30
B    Bob   40               60
C    Cat   30               20
D    Doll  20               50
E    Eric  10               25
F    Fed   15               10
How do I import the CSV file from the above path, where the date in the file name changes, using a SQL query? The CSV file contains the data above.
Kindly suggest a SQL query that handles the above path with a dynamic date.

web scraping - No records found

I'm trying to rbind a series of HTML tables (from different pages with the same column names), but some pages have "no records". I want to skip such pages or assign NULL to their dataframes.
Example: Dataframe 1
url="http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match"
Batting=readHTMLTable(url)
Batting$"Match by match list"
Batting<-Batting$"Match by match list"
Dataframe 2
url="http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
Batting=readHTMLTable(url)
Batting$"Match by match list"
Batting<-Batting$"Match by match list"
There are several such dataframes that have records in tabular form, and some that don't have any records.
When I rbind them, the one with no records causes an error for the final dataframe:
final_DF<-rbind(Dataframe1,Dataframe2)
How do I resolve this?
PS: For each URL query I'm also adding a certain set of columns (say 5 additional columns, using cbind) to the dataframe, based on my requirements.
You can do the following:
require(rvest)
require(tidyverse)

urls <- c(
  "http://stats.espncricinfo.com/ci/engine/player/28081.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=1;template=results;type=batting;view=match",
  "http://stats.espncricinfo.com/ci/engine/player/625383.html?class=2;filter=advanced;floodlit=1;innings_number=1;orderby=start;result=2;template=results;type=batting;view=match"
)

# One tibble of extra columns per url, cbind-ed onto the scraped table later
extra_cols <- list(
  tibble("Team" = "IND", "Player" = "MS.Dhoni", "won" = 1, "lost" = 0, "D" = 1, "D/N" = 0, "innings" = 1, "Format" = "ODI"),
  tibble("Team" = "IND", "Player" = "MS.Dhoni", "won" = 1, "lost" = 0, "D" = 1, "D/N" = 0, "innings" = 1, "Format" = "ODI")
)

# Read each page and grab the results table via its CSS selector
doc <- map(urls, read_html) %>%
  map(html_node, ".engineTable:nth-child(5)")

# TRUE for pages that actually contain the table, FALSE for "no records" pages
keep <- map_lgl(doc, ~ class(.) != "xml_missing")

# Parse the remaining tables, attach the extra columns, and row-bind everything
map(doc[keep], html_table, fill = TRUE) %>%
  map2_df(extra_cols[keep], cbind)
The critical part is discarding the empty elements: the keep index built with map_lgl drops all list elements of class "xml_missing", i.e. the pages with no records.
In comparison to your code, I use a CSS selector to specify the html_node that contains the table. See http://selectorgadget.com/
Also, your rbind is done internally by map2_df (the last line).
This results in (shown after %>% {head(.[, c("Bat1", "Runs", "Team")])}):
  Bat1 Runs Team
1    0    0  IND
2    3    3  IND
3  148  148  IND
4   56   56  IND
5   38   38  IND
6   20   20  IND
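As a side note, purrr also has discard() for exactly this kind of filtering (a small sketch reusing the doc list from the code above); the answer builds an explicit keep index instead so that extra_cols can be subset in parallel with doc:

# Drop list elements whose node is missing (class "xml_missing")
doc_ok <- purrr::discard(doc, ~ inherits(.x, "xml_missing"))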

Count number of rows when using dplyr to access sql table/query

What would be an efficient way to count the number of rows when using dplyr to access an SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. Basically, dim() is not very consistent.
I used dim(). This works for a table in the database (first case), but is not consistent when I create a tbl from an SQL query on the same table (second case). My number of rows is in the millions, but I see this even with a small table of 1000 rows. I get NA or ??. Is there anything I am missing?
# MWE
library(dplyr)
library(nycflights13)

test_db <- src_sqlite("test_db.sqlite3", create = TRUE)
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE,
                          indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
flights_postgres <- tbl(test_db, "flights")
First case (table from direct schema)
flights_postgres
> flights_postgres
Source: postgres 9.3.5 []
From: flights [336,776 x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
#using dim()
> dim(flights_postgres)
[1] 336776 16
The above works and gives the number of rows.
Second case (table from SQL query)
## uses the flights table from above, but the same pattern can also create other variables (like lag, lead) at run time
flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))
>flight_postgres_2
Source: postgres 9.3.5 []
From: <derived table> [?? x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
>
> dim(flight_postgres_2)
[1] NA 16
As you can see, it prints ?? or NA, so it is not very helpful.
I got around this by either using collect() or converting the output to a dataframe with as.data.frame() and then checking the dimensions. But these two methods may not be ideal, given the time they can take for a larger number of rows.
I think the answer is what #alistaire suggests: Do it in the database.
> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]
n()
(int)
1 336776
.. ...
Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
Note that this is doing #alistaire's approach underneath:
> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"
<PLAN>
selectid order from detail
1 0 0 0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
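If what is wanted is a plain integer on the R side rather than a one-row remote table, one option is to compute the count in the database and then pull it back (a small sketch using standard dplyr verbs; the variable name is illustrative):

# Count in the database, then bring back just the single number
n_rows <- flight_postgres_2 %>%
  summarise(n = n()) %>%
  pull(n)

n_rows
# [1] 336776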

How to keep variable amount of numbers after decimals

I am using Excel 2007. I have multiple data sets with a varying number of characters in each row. For example:
A1 1.60
A2 0.008
A3 0.900
A4 1.0
A5 0.56
A6 1.703
I need to rearrange it into a different order on a different sheet, such as:
A1 1.60
A2 0.900
A3 1.0
A4 0.56
A5 1.703
A6 0.008
Unfortunately, whenever I move it to a new sheet (for example, with =Page1!A1), the numbers revert to:
A1 1.6
A2 0.9
A3 1
A4 0.56
A5 1.703
A6 0.008
So I lose the zeros.
To complicate things, the number of characters in each entry/row/column varies between data sets. This means that using =TEXT(A1,"#.#0") can't work; sometimes my A1 could be 1498 or something else entirely.
I am (potentially) looking for code that will 'count' the number of decimals shown and then reproduce that automatically. Any other way to get the number of decimals (or lack thereof) right for my variable data will also do. VBA/Macros/Functions?
You can adapt this code to your situation I think...
Public Sub MaintainTrailingZeroFormat()
    Dim s As String
    Dim lngLength As Long, lngDecimal As Long, lngDecimalPlaces As Long
    ''Grab the cell value as displayed text
    s = ActiveWorkbook.ActiveSheet.Range("A2").Text
    ''Find the position of the decimal point (0 if there is none)
    lngDecimal = InStr(s, ".")
    ''Find the total length of the string
    lngLength = Len(s)
    ''Number of digits shown after the decimal point
    If lngDecimal = 0 Then
        lngDecimalPlaces = 0
    Else
        lngDecimalPlaces = lngLength - lngDecimal
    End If
    ''Set the destination cell format to Text so trailing zeros are kept
    ActiveWorkbook.Sheets(2).Range("A1").NumberFormat = "@"
    ''Populate the cell with the value formatted to the same number of decimals
    ActiveWorkbook.Sheets(2).Range("A1") = FormatNumber(s, lngDecimalPlaces)
End Sub