I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data in chunks without loading the whole file, write the subset to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain in further detail.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df) {
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems, until I try to read the temporary file with myData <- stream_in(file(tmp)), at which point I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
System info
I'm running R 3.3.3 x64 on Windows 7 x64, with 6 GB of RAM and an AMD Athlon II 4-Core 2.6 GHz.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want; it is not necessary to do the extra stream_out/stream_in round trip.
library(jsonlite)
library(dplyr)

myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df) {
  idx <- as.character(length(myData) + 1)
  myData[[idx]] <- df[which(df$id %% 2 == 0), ]  ## change back to your filter
}, pagesize = 200)                               ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()
(I created some mock data in Mockaroo: I generated 1000 lines, hence the small pagesize, to check that everything worked with more than one chunk. The filter I used was even IDs because I was too lazy to create a Var column.)
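For completeness, if you do want to keep the intermediate-file approach from the question, note that the combined stream_in/stream_out example in the jsonlite documentation closes the output connection before streaming the temp file back in. A minimal sketch of that variant (untested here, reusing the same paths as in the question):

con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df) {
  df <- df[which(df$Var > 0), ]
  stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
close(con_out)                    # flush and close before reading the file back
myData <- stream_in(file(tmp))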
Related
I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets, one for each schema. How can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. When you add this file into your repo (alongside your transformation logic), you will then be able to push it into a dataframe, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify txt files as part of your package contents so that your tests will discover them (this is covered in the documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, edit setup.py in your Python repository:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code does a couple of important things:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
Keeping the logic separate lets you make the actual row-by-row parsing you want to do its own testable function, instead of having it live purely inside your my_compute_function.
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.
I got a successful knitted document from my R Markdown. Then I modified two cells that are completely unrelated to the error. Suddenly, I am getting an issue with my second call to jsonlite tripping up the knitting process, as shown in this error:
line 165: Error in open.connection(con, "rb"): Couldn't connect to server
Calls: <anonymous ... fromJSON_String -> parseJSON -> parse_con -> open ->
open.connection
Checking the source code, an earlier cell has a function that makes the JSON call. Cell one, which tests the function, knits fine. Cell two, which re-uses the function in a loop to make a bunch of JSON calls (with a 3.7-second delay between them), does not knit. The code works, and it even knitted previously; now it throws an error.
My gut feeling is that this is a random performance issue, but it now fails repeatedly and the behavior defies logic to me. I can run the code and it works fine in every cell. I have checked my cell declaration syntax looking for glitches that can trip up knitting and I can't find anything wrong. The code is too big to present in its entirety here, so below are just the relevant cells from the markdown doc: the one that works when knitting (the one that has jsonlite in it) and the cell immediately after it that triggered the knitting error, even though the code itself runs fine.
The document now won't knit to get my latest changes. Any ideas or helpful suggestions on how to shake out the kinks to reknit this document would be appreciated:
This cell runs and knits fine:
```{r p5jLiteAns, message=FALSE, warning=FALSE}
# sample ids: tt0120737, tt0468569
library(curl)
library(jsonlite)

get_json_movieRecord <- function(movieID, showRequest=FALSE) {
  movieURL_start <- "http://www.omdbapi.com/?i="
  movieURL_end <- "&plot=short&r=json&tomatoes=true"
  moovURL <- paste0(movieURL_start, movieID, movieURL_end)
  if (showRequest == TRUE) {
    print(paste("Sending request: ", moovURL))
  }
  as.data.frame(jsonlite::fromJSON(moovURL), stringsAsFactors=FALSE)
}

# build initial data frame from first record:
movE_data <- get_json_movieRecord(IMDB_mvIDs[1,], TRUE)
movE_data[ ,1:6]
print(paste0("Num Cols: ", NCOL(movE_data)))
```
This cell triggers the error. Line 165 correlates to the line that begins the for loop, so it is probably choking on the function call that makes the json call within it:
```{r p5getDatawDelay}
for (i in 2:250) {
  # movE_data is declared ahead of the loop and then added to within the loop;
  # in theory, this should yield better performance:
  movE_data <- rbind(movE_data, get_json_movieRecord(IMDB_mvIDs[i,]))
  if (i %% 2 == 0) { # Add delay to every other request ...
    cDelayRtn <- causeDelay(3.7)
  }
}

write.csv(movE_data, file='IMDB_Top250_OMDB_Detailed.csv')
movE_data[1:6, 1:6]
paste0("Number Rows: ", NROW(movE_data))
paste0("Number Cols: ", NCOL(movE_data))
```
For completeness... the above code makes a function call to an earlier cell that has this delay function within it:
```{r p5addTimeDelayFunc, echo=T, eval=T}
causeDelay <- function(x, showDelay = FALSE) {
  p1 <- proc.time()
  Sys.sleep(x) # nothing happens for x seconds
  if (showDelay == TRUE) {
    proc.time() - p1
  }
}

print("Text")
causeDelay(3.7, TRUE)
print("Text after delay.")
print("")
print("Text2")
causeDelay(3.7)
print("Text after delay.")
```
I think I may see the answer here. I got the same error on 2 or 3 knitting attempts, and as each one takes more than 10 minutes to hit the point of error, this is not a process you want to redo many times. But after making a superficial edit (deletion of comments), hitting the knit button again resulted in a new error, something different than what is shown in this post.
The full code of this markdown document would take a good hour or more to click through and re-run every cell, plus the time to re-knit. Even though every cell was run in a previous successful attempt (and the output is still visible in the markdown document), I now suspect that something is no longer properly cached. The solution is probably to do just that: re-run every cell and then re-try the RStudio knit button.
If anyone knows something else to try in situations like these, please do post. Otherwise, I am looking at a lengthy process I will initiate later after I catch up on some other work.
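One more thing that might be worth trying in situations like this is wrapping the request in tryCatch() with a small number of retries, so that a transient "Couldn't connect to server" does not abort the whole knit. A rough sketch reusing the get_json_movieRecord() defined above (get_json_with_retry is a hypothetical helper name, not part of the original code, and the retry counts are arbitrary):

get_json_with_retry <- function(movieID, retries = 3, wait = 5) {
  for (attempt in seq_len(retries)) {
    result <- tryCatch(get_json_movieRecord(movieID),
                       error = function(e) NULL)   # swallow the error and retry
    if (!is.null(result)) return(result)
    Sys.sleep(wait)                                # back off before the next attempt
  }
  stop("Request failed after ", retries, " attempts")
}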
I have many large JSON files (3 GB each) which I want to load efficiently onto a powerful R server machine; however, loading all records from all files would be redundant and exhausting (50M records multiplied by 40). So I thought of using the jsonlite package because I heard it's efficient. The thing is that I do not need all records, only the subset of records where an embedded element ("_source") has an existing field named "duration".
This is currently my code:
library(jsonlite)
library(curl)
url <- "https://s3-eu-west-1.amazonaws.com/es-export-data/logstash-2016.02.15.json"
test <- stream_in(url(url))
It's only one extract of many. Now, the jsonlite package has a 'flatten' function to flatten embedded elements into one wide, flat data frame, which I could then filter. However, that seems inefficient; I think pre-filtering while the data is loaded would be much more efficient.
Here is a dput of one record:
> dput(test_data)
"{\"_index\":\"logstash-2016.02.15\",\"_type\":\"productLogs\",\"_id\":\"AVLitaOtp4oNFTVKv9tZ\",\"_score\":0,\"_source\":{\"EntryType\":\"Event\",\"queryType\":\"clientQuery\",\"status\":\"success\",\"cubeName\":\"Hourly Targets Operations by Model\",\"cubeID\":\"aHourlyIAAaTargetsIAAaOperationsIAAabyIAAaModel\",\"startQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"endQueryTimeStamp\":\"2016-02-15T02:14:23+00:00\",\"queryResponeLengthBytes\":0,\"duration\":0,\"concurrentQuery\":14,\"action\":\"finishQueryJaql\",\"#timestamp\":\"2016-02-15T02:14:23.253Z\",\"appTypeName\":\"dataserver\",\"#version\":\"1\",\"host\":\"VDED12270\",\"type\":\"productLogs\",\"tags\":[],\"send_type\":\"PullGen1\",\"sisenseuid\":\"janos.kopecek#regenersis.com\",\"sisenseOwnerid\":\"janos.kopecek#regenersis.com\",\"sisenseVersion\":\" 5.8.1.29\",\"sisenseMonitoringVersion\":\"3.0.0.6\",\"inputType\":\"sqs\",\"token\":\"fTdyoSwaFZTalBlnFIlTsqvvzfKZVGle\",\"logstash_host\":\"vpc_cluster_1\"}}"
>
any help appreciated
You have to add a handler function and specify which elements you need:
stream_in(url(url), handler = function(x) x$"_source"$duration)
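A fuller sketch of the pre-filtering idea, reusing the url object from the question and assuming the flattened column is named "_source.duration" as suggested by the dput above (rbind_pages() is jsonlite's helper for combining the per-chunk data frames):

library(jsonlite)

results <- new.env()
stream_in(url(url), handler = function(df) {
  chunk <- flatten(df)                 # flatten only this chunk
  dur   <- chunk[["_source.duration"]]
  if (!is.null(dur)) {                 # keep only rows where duration exists
    results[[as.character(length(results) + 1)]] <- chunk[!is.na(dur), , drop = FALSE]
  }
}, pagesize = 5000)
filtered <- rbind_pages(as.list(results))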
I'm trying to write a small app that retrieves a JSON file (it contains a list of items, which all have some properties), saves its contents to the DB and then displays some of it later on. I have Zotonic up and running, and generating some HTML is no problem.
At the moment I'm stuck trying to figure out how to define a custom resource and how to get the data from the JSON into the DB. Once the data is there I should be fine; that part seems covered well enough by the documentation.
I wrote some standalone erlang scripts that fetch the data and I noticed that Zotonic has a library for decoding JSON so that part should be fine. Any tips on where to put which code or where to look further?
The z_db module allows for creating custom tables by using:
z_db:create_table(Table, Cols, Context).
The Table variable is your table name which can be either an atom or a list containing a single atom.
Cols is a list of column definitions, which are defined by records. Currently the record definition (you can find it in include/zotonic.hrl) is:
-record(column_def, {name, type, length, is_nullable=true, default, primary_key}).
See Erlang docs on records for more info on records
Example code which I put in users/sites/[sitename]/models/m_[sitename].erl:
init(Context) ->
    case z_db:table_exists(?table, Context) of
        false ->
            z_db:create_table(tablename,
                [
                    #column_def{name=id, type="serial"},
                    #column_def{name=gid, type="integer", is_nullable=false},
                    #column_def{name=magnitude, type="real"},
                    #column_def{name=depth, type="real"},
                    #column_def{name=location, type="character varying"},
                    #column_def{name=time, type="integer"},
                    #column_def{name=date, type="integer"}
                ], Context);
        true -> ok
    end,
    ok.
Pay attention to which options of the record you specify; most of the errors I got came from, e.g., specifying a length on the integer fields.
The models/m_sitename:init/1 function does not get called on site start, but sitename:init/1 does, so I call the init function there to ensure the table exists. Example:
init(Context) ->
    m_sitename:init(Context).
It is called by Zotonic with the Context variable of the site automatically. You can get this variable manually as well with z:c(sitename). So if you call m_sitename:init/1 from somewhere else you would do:
m_sitename:init(z:c(sitename)).
Next, insertion in the DB can be done with:
z_db:insert(Table, PropList, Context).
Where Table is again an atom or a list containing a single atom representing the table name. Context is the same as above.
PropList is a property list: a list of two-element tuples where the first element is an atom and the second is its associated value/property. Example:
PropList = [
    {row, Value},
    {anotherrow, AnotherValue}
].
Table = tablename.
Context = z:c(sitename).

z_db:insert(Table, PropList, Context).
See Erlang docs on Property Lists for more info on property lists.
=== The dependencies have been updated so if you build from source the step directly below is no longer needed ===
The JSON part is a bit more tricky. Included with Zotonic are mochijson2 and, as a secondary dependency, also jiffy. The latest version of jiffy contains jiffy:decode/2, which allows you to specify maps as a return type; this is much more readable than the standard {struct, {struct, <<"">>}} monster. To update to the latest version, edit the line in deps/twerl/rebar.config that says
{jiffy, ".*", {git, "https://github.com/davisp/jiffy.git", {tag, "0.8.3"}}},
to
{jiffy, ".*", {git, "https://github.com/davisp/jiffy.git", {tag, "0.14.3"}}},
Now run z:m(). in the Zotonic shell (you must do this after every change to your code).
Now check in the Zotonic shell whether jiffy:decode/2 is available by typing jiffy: followed by <tab>; it will show a list of available functions and their arities.
To retrieve a JSON file from the internet run:
{ok, {{_, 200, _}, _, Body}} = httpc:request(get, {"url-to-JSON-here", []}, [], [])
Which will yield the variable Body with the contents. See Erlang docs on http client for more info on this call.
Next convert the contents of Body to Erlang terms with:
JsonData = jiffy:decode(Body, [return_maps]).
What you have to do next depends a lot on the structure of your JSON resource. Keep in mind that everything is now in binary UTF-8 encoded strings! If you print JsonData to screen (just enter JsonData. in your Zotonic/Erlang shell) you will see a lot of output like #{<<"key">> => <<"Value">>}.
My data was nested so I had to extract the needed data like this:
[{_,ItemList}|_] = ListData.
This gave me a list of maps, and in order to deal with them as individual items I used the following function:
get_maps([]) ->
    done;
get_maps([First|Rest]) ->
    Map = maps:get(<<"properties">>, First),
    case is_map(Map) of
        true ->
            map_to_proplist(Map),
            get_maps(Rest);
        false -> done
    end,
    done;
get_maps(_) ->
    done.
As you might remember, the z_db:insert/3 function needs a property list to populate rows, so that is what the call to map_to_proplist/1 is for. How this function looks is completely dependent on how your data looks, but as an example here is what worked for me:
map_to_proplist(Map) ->
    case is_map(Map) of
        true ->
            {Value1,_} = string:to_integer(binary_to_list(maps:get(<<"key1">>, Map))),
            {Value2,_} = string:to_float(binary_to_list(maps:get(<<"key2">>, Map))),
            {Value3,_} = string:to_float(binary_to_list(maps:get(<<"key3">>, Map))),
            Value4 = binary_to_list(maps:get(<<"key4">>, Map)),
            {Value5,_} = string:to_integer(binary_to_list(maps:get(<<"key5">>, Map))),
            {Value6,_} = string:to_integer(binary_to_list(maps:get(<<"key6">>, Map))),
            PropList = [{rowname1, Value1}, {rowname2, Value2}, {rowname3, Value3},
                        {rowname4, Value4}, {rowname5, Value5}, {rowname6, Value6}],
            m_sitename:insert_items(PropList, z:c(sitename)),
            ok;
        false ->
            ok
    end.
See the documentation on string:to_integer/1 and string:to_float/1 as to why the tuples are needed when converting. The call to m_sitename:insert_items(PropList, z:c(sitename)) calls z_db:insert/3 in models/m_sitename.erl, but wrapped in a catch:
insert_items(PropList, Context) ->
    (catch z_db:insert(?table, PropList, Context)).
Ok, quite a long post but this should get you up and running if you were looking for this answer.
The above was done with Zotonic 0.13.2 on Erlang/OTP 18.
A repost (except the JSON part) of my post in the Zotonic Developers group.
I'm trying to create a data frame that is about 1,000,000 x 5 by using a for-loop, but it's been 5+ hours and I don't think it will finish very soon. I'm using the rjson library to read in the data from a large json file. Can someone help me with filling up this data frame in a faster way?
library(rjson)

# read in data from json file
file <- "/filename"
c <- file(file, "r")
l <- readLines(c, -1L)
data <- lapply(X=l, fromJSON)

# specify variables that i want from this data set
myvars <- c("url", "time", "userid", "hostid", "title")
newdata <- matrix(data[[1]][myvars], 1, 5, byrow=TRUE)

# here's where it goes wrong
for (i in 2:length(l)) {
  newdata <- rbind(newdata, data[[i]][myvars])
}
newestdata <- data.frame(newdata)
This is taking forever because each iteration of your loop is creating a new, bigger object. Try this:
slice <- function(field, data) unlist(lapply(data, `[[`, field))
data.frame(Map(slice, myvars, list(data)))
This will create a data.frame and preserve your original data types (character, numeric, etc.), if that matters, whereas forcing everything into a matrix would coerce everything to character.
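A quick way to sanity-check this approach on a couple of hand-made records (the field values below are made up, not from the asker's file):

data <- list(
  list(url = "a.example", time = 1, userid = "u1", hostid = "h1", title = "first"),
  list(url = "b.example", time = 2, userid = "u2", hostid = "h2", title = "second")
)
myvars <- c("url", "time", "userid", "hostid", "title")
slice <- function(field, data) unlist(lapply(data, `[[`, field))
data.frame(Map(slice, myvars, list(data)))   # two rows, time stays numeric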
Without the data, it's hard to be sure, but there are a couple of things you are doing that are relatively slow. This should be faster, but again, without the data, I can't test:
newdata <- vapply(data, `[`, character(5L), myvars)
I'm also assuming that your data is character, which I think it has to be based on title.
Also, as others have noted, the reason yours is slow is that you are growing an object, which requires R to keep re-allocating memory. vapply will allocate the memory ahead of time because it knows the size of each iteration's result and how many items there are.
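To see the growing-object effect in isolation (the numbers here are arbitrary, just to illustrate the point this answer is making):

n <- 1e5

# growing: R re-allocates and copies the vector on every iteration
grow <- c()
system.time(for (i in seq_len(n)) grow <- c(grow, i))

# preallocating: the memory is reserved once up front
pre <- numeric(n)
system.time(for (i in seq_len(n)) pre[i] <- i)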