Filling a Matrix with "For Loop" Taking Too Long - json

I'm trying to create a data frame that is about 1,000,000 x 5 by using a for loop, but it's been running for 5+ hours and I don't think it will finish anytime soon. I'm using the rjson library to read in the data from a large JSON file. Can someone help me fill this data frame in a faster way?
library(rjson)
# read in data from json file
file <- "/filename"
c <- file(file, "r")
l <- readLines(c, -1L)
data <- lapply(X=l, fromJSON)
# specify variables that i want from this data set
myvars <- c("url", "time", "userid", "hostid", "title")
newdata <- matrix(data[[1]][myvars], 1, 5, byrow=TRUE)
# here's where it goes wrong
for (i in 2:length(l)) {
  newdata <- rbind(newdata, data[[i]][myvars])
}
newestdata <- data.frame(newdata)

This is taking forever because each iteration of your loop is creating a new, bigger object. Try this:
slice <- function(field, data) unlist(lapply(data, `[[`, field))
data.frame(Map(slice, myvars, list(data)))
This will create a data.frame and preserve your original data types (character, numeric, etc.), if that matters, whereas forcing everything into a matrix would coerce every column to character.
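A quick sanity check of that claim (stringsAsFactors = FALSE is only needed on pre-4.0 versions of R, where data.frame() turns character columns into factors by default):
newestdata <- data.frame(Map(slice, myvars, list(data)), stringsAsFactors = FALSE)
str(newestdata)  # columns keep their original classes instead of all becoming character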

Without the data, it's hard to be sure, but there are a couple of things you are doing that are relatively slow. This should be faster, but again, without the data, I can't test:
newdata <- vapply(data, `[`, character(5L), myvars)
I'm also assuming that your data is character, which I think it has to be based on title.
Also, as others have noted, the reason yours is slow is that you are growing an object, which requires R to keep re-allocating memory. vapply will allocate the memory ahead of time because it knows the size of each iteration's result and how many items there are.
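One wrinkle, without having seen the data: since fromJSON() usually returns each record as a list rather than an atomic vector, the `[` extraction may hand vapply a list instead of the character(5L) template it expects. A minimal sketch that unlists inside the function (assuming all five fields are single character values):
# Sketch, assuming each parsed record is a list of character scalars
newdata <- vapply(data, function(x) unlist(x[myvars]), character(5L))
# vapply returns a 5 x n matrix, so transpose it before building the data frame
newestdata <- as.data.frame(t(newdata), stringsAsFactors = FALSE)
names(newestdata) <- myvars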

Related

Read every nth batch in pyarrow.dataset.Dataset

In Pyarrow now you can do:
a = ds.dataset("blah.parquet")
b = a.to_batches()
first_batch = next(b)
What if I want the iterator to return every Nth batch instead of every batch? It seems like this could be something in FragmentScanOptions, but that's not documented at all.
No, there is no way to do that today. I'm not sure what you're after, but if you are trying to sample your data there are a few choices, though none achieve quite this effect.
To load only a fraction of your data from disk you can use pyarrow.dataset.head.
There is a request in place for randomly sampling a dataset, although the proposed implementation would still load all of the data into memory (and just drop rows according to some random probability).
Update: If your dataset is only parquet files then there are some rather custom parts and pieces that you can cobble together to achieve what you want.
a = ds.dataset("blah.parquet")
all_fragments = []
for fragment in a.get_fragments():
    for row_group_fragment in fragment.split_by_row_group():
        all_fragments.append(row_group_fragment)
sampled_fragments = all_fragments[::2]
# Have to construct the sample dataset manually
sampled_dataset = ds.FileSystemDataset(sampled_fragments, schema=a.schema, format=a.format)
# Iterator which will only return some of the batches
# of the source dataset
sampled_dataset.to_batches()

R: jsonlite's stream_out function producing incomplete/truncated JSON file

I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data first in chunks without loading it, write the subset data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain in further detail.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems until I try to read the result back with myData <- stream_in(file(tmp)), upon which I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
System info
I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II 4-core 2.6 GHz.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want; the extra stream_out/stream_in round trip is not necessary.
library(jsonlite)
library(dplyr)  # for %>% and bind_rows()

myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()
(I created some mock data in Mockaroo: I generated 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used was even IDs because I was too lazy to create a Var column.)
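For reference, if you do want to keep the original two-step stream_out/stream_in approach, one thing worth checking (my assumption being that the truncation comes from an unflushed write buffer) is whether explicitly closing the output connection before re-reading the temporary file fixes it, as the documented jsonlite example does:
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
close(con_out)  # flush and close the connection before reading the file back
myData <- stream_in(file(tmp))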

Importing Data into Elastic Search from R

Hi dear community,
I am currently trying to import data from an API call (with the JSON output processed in R) into an index in Elasticsearch.
"stored" is a data frame containing 20 obs. of 113 variables. However, Elasticsearch copies only 7 out of the 20 obs. into the index. Those are transferred correctly in terms of values.
Still, I cannot explain where and why I am missing the other 13 observations. The code I am using is below:
library(jsonlite)  # for fromJSON()
library(elastic)   # for connect(), connection(), docs_bulk()

stored <- fromJSON(API_URL)
stored <- stored[['results']]
connect(es_base = "xxx.xxx.x.xx", es_port = xxxx)
connection()
docs_bulk(stored, index="data", raw = FALSE, chunk_size = 100000)
Thanks in advance :-)
Thanks to Sckott, we were able to solve the problem.
The JSON file from the API call was not 100% UTF-8 encoded. Calling fromJSON directly on the URL introduced additional characters into the data; wrapping the call in readLines avoids the problem. The final code I used was:
Output_FT <- fromJSON(readLines(BWURL_x), flatten = TRUE)
stored <- Output_FT[['results']]
connect(es_base = "xxx.xxx.x.xx", es_port = xxxx)
connection()
docs_bulk(stored, index="data")
Best,

R jsonlite filter records before loading

I have many large JSON files (3 GB each) which I want to load efficiently onto a powerful R server machine; however, loading all records from all files would be redundant and exhausting (50M records multiplied by 40 files). So I thought of using the jsonlite package because I heard it's efficient. The thing is that I do not need all records, only the subset of records where an embedded element ("source") has an existing field by the name "duration".
This is currently my code:
library(jsonlite)
library(curl)
url <- "https://s3-eu-west-1.amazonaws.com/es-export-data/logstash-2016.02.15.json"
test <- stream_in(url(url))
It's only one extract of many. Now, the jsonlite package has a 'flatten' function to flatten embedded elements and create one wide, flat data frame, which I could then filter. However, that seems inefficient; I think pre-filtering the records as the data is loaded would be much more efficient.
here a dput of one record:
> dput(test_data)
"{\"_index\":\"logstash-2016.02.15\",\"_type\":\"productLogs\",\"_id\":\"AVLitaOtp4oNFTVKv9tZ\",\"_score\":0,\"_source\":{\"EntryType\":\"Event\",\"queryType\":\"clientQuery\",\"status\":\"success\",\"cubeName\":\"Hourly Targets Operations by Model\",\"cubeID\":\"aHourlyIAAaTargetsIAAaOperationsIAAabyIAAaModel\",\"startQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"endQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"queryResponeLengthBytes\":0,\"duration\":0,\"concurrentQuery\":14,\"action\":\"finishQueryJaql\",\"#timestamp\":\"2016-02-15T02:14:23.253Z\",\"appTypeName\":\"dataserver\",\"#version\":\"1\",\"host\":\"VDED12270\",\"type\":\"productLogs\",\"tags\":[],\"send_type\":\"PullGen1\",\"sisenseuid\":\"janos.kopecek#regenersis.com\",\"sisenseOwnerid\":\"janos.kopecek#regenersis.com\",\"sisenseVersion\":\" 5.8.1.29\",\"sisenseMonitoringVersion\":\"3.0.0.6\",\"inputType\":\"sqs\",\"token\":\"fTdyoSwaFZTalBlnFIlTsqvvzfKZVGle\",\"logstash_host\":\"vpc_cluster_1\"}}"
>
Any help appreciated.
You have to add a handler function and specify which elements you need:
stream_in(url(url), handler = function(x) x$"_source"$duration)
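Since stream_in() does not, as far as I can tell, collect results when a custom handler is supplied, one way to actually keep the filtered rows is to accumulate them yourself. A minimal sketch, assuming the file is NDJSON and that, without flatten = TRUE, the nested "_source" object arrives as a nested data frame (the duration check and pagesize are illustrative):
library(jsonlite)

url <- "https://s3-eu-west-1.amazonaws.com/es-export-data/logstash-2016.02.15.json"

kept <- new.env()
stream_in(url(url), handler = function(df) {
  src <- df[["_source"]]
  # keep only records whose _source actually carries a duration value
  keep <- !is.na(src[["duration"]])
  kept[[as.character(length(kept) + 1)]] <- src[keep, , drop = FALSE]
}, pagesize = 5000)

filtered <- do.call(rbind, as.list(kept))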

R: parse JSON/XML exported compound properties from Pubchem

I would like to parse all chemical properties of a given compound as given in Pubchem in R, using the JSON (or XML) export facility.
Example: ALPHA-IONONE, pubchem compound ID 5282108
https://pubchem.ncbi.nlm.nih.gov/compound/5282108
library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
or
library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")
will get me a tree of nested lists, but how do I go from this rather complicated list of nested lists to a nice dataframe or list of dataframes?
In this case, what I am after is everything under
3.1 Computed Descriptors
3.2 Other Identifiers
3.3 Synonyms
4.1 Computed Properties
in a single row of a data frame, with each element in a separate named column and multiple items per element (e.g. multiple synonyms) pasted together with "|" as a delimiter, e.g. in this case something like
pubchemid IUPAC_Name InChI InChI_Key Canonical SMILES Isomeric SMILES CAS EC Number Wikipedia MeSH Synonyms Depositor-Supplied Synonyms Molecular_Weight Molecular_Formula XLogP3 Hydrogen_Bond_Donor_Count ...
5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....
Fields with multiple items, such as Depositor-Supplied Synonyms, could be pasted together with a "|"; e.g. the value could be ALPHA-IONONE|Iraldeine|...
Second, I would also like to import section
4.2.2 Kovats Retention Index
as a dataframe
pubchemid column_class kovats_ri
5282108 Standard non-polar 1413
5282108 Standard non-polar 1417
...
5282108 Semi-standard non-polar 1427
...
(section 4.3.1 GC-MS would have been nice too, but since it only displays the 3 top peaks this is a little useless right now, so I'll skip that)
Does anybody have any idea how to achieve this in an elegant way?
PS Note that not all these fields will necessarily exist for any given query.
2D structure and some properties can also be obtained from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display
and 3D structure from
https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display
Data can also be exported as XML, using
https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display
if that would be any easier
Note: I also tried the R package rpubchem, but that one only seems to import a small amount of the available info:
library("rpubchem")
get.cid(5282108)
CID IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one C13H20O 192.297300 0 3 0 1 14 17 5282108
My proposal works on XML files, because (thanks to XPath) I find them more convenient to traverse and select nodes from.
Please note that this is neither fast (it took a few seconds while testing) nor optimal (I parse each file twice: once for the names and the like, and once for the Kovats Retention Index). But I guess that you will want to parse a set of files once and get on with your real business, and premature optimization is the root of all evil.
I have put the main tasks into separate functions. If you want to get data for one specific pubchem record, they are ready to use. But if you want to get data from a few pubchem records at once, you can define a vector of pointers to the data and use the examples at the bottom to merge the results together. In my case, the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them (remember that each site will be requested twice, and with a greater number of records you probably want to handle a faulty network somehow).
The compound you have linked to has multiple entries for "EC Number". They differ by ReferenceNumber, but not by Name. I wasn't sure why that is or what I should do with it (your sample output contains only one entry for EC Number), so I left this to R. R added suffixes to the duplicated values and created EC.Number.1, EC.Number.2, etc. These suffixes do not match the ReferenceNumber in the file, and the same column in the master data frame will probably refer to different ReferenceNumbers for different compounds.
It seems that pubchem uses the following format for tags: <type>Value[List]. In a few places I have hardcoded StringValue, but maybe some compound has different types in the same fields. I usually haven't handled lists, except where it was requested. So further modifications might be needed as more data is thrown at this code.
If you have any questions, please post them in comments. I am not sure whether I should explain this code further.
library("xml2")
library("data.table")
compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)

  properties <- sapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    value <- ifelse(length(xml_find_all(x, "./d1:StringValueList", ns)) > 0,
                    paste(sapply(
                      xml_find_all(x, "./d1:StringValueList", ns),
                      xml_text, trim=TRUE), sep="", collapse="|"),
                    xml_text(
                      xml_find_one(x, "./*[contains(name(),'Value')]", ns),
                      trim=TRUE)
    )
    names(value) <- name
    return(value)
  })
  rm(compound, information)

  properties <- as.list(properties)
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  return(data.frame(properties))
}

compound.retention.index <- function(file=NULL) {
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)

  indexes <- lapply(information, function(x) {
    name <- xml_text(xml_find_one(x, "./d1:Name", ns))
    values <- as.numeric(sapply(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns),
      xml_text))
    data.frame(pubchemid=pubchemid,
               column_class=name,
               kovats_ri=values)
  })
  return( do.call("rbind", indexes) )
}

compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")

cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill=TRUE
)

rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))
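For a quick end-to-end check, one option (the local file name and download step are my own; the URL is the pug_view XML export quoted in the question) is to save the XML locally and then point the functions at the saved file:
# Sketch: fetch one record to disk, then reuse the functions above
xml_url <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display"
download.file(xml_url, "./5282108.xml", mode = "wb")

compound.attributes("./5282108.xml")
compound.retention.index("./5282108.xml")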