Importing Data into Elastic Search from R - json

Hej dear community,
I am right now trying to import data from an API call (and processed the JSON Output in R) into an index at elastic search.
"stored" is a dataframe containing 20 obs. along 113 variables. However, elastic search copies only 7 out of 20 obs. into the index. Those are correctly transfered in terms of values.
Still, I cannot explain where and why I am missing the other 13 observations. The code I am using, see below
stored <- fromJSON(API_URL)
stored <- stored[['results']]
connect(es_base = "xxx.xxx.x.xx", es_port = xxxx)
connection()
docs_bulk(stored, index="data", raw = FALSE, chunk_size = 100000)
Thanks in advanced :-)

Thanks to Sckott, we were able to solve the problem.
The Json file from the API Call were not 100% - UTF8 encoded. By using fromJSON for the URL-Call it entered additional characters to the data. However, adding a readLines avoids the problem. The final code i used were:
Output_FT <- fromJSON(readLines(BWURL_x), flatten = TRUE)
stored <- Output_FT[['results']]
connect(es_base = "xxx.xxx.x.xx", es_port = xxxx)
connection()
docs_bulk(stored, index="data")
Best,

Related

R: jsonlite's stream_out function producing incomplete/truncated JSON file

I'm trying to load a really big JSON file into R. Since the file is too big to fit into memory on my machine, I found that using the jsonlite package's stream_in/stream_out functions is really helpful. With these functions, I can subset the data first in chunks without loading it, write the subset data to a new, smaller JSON file, and then load that file as a data.frame. However, this intermediary JSON file is getting truncated (if that's the right term) while being written with stream_out. I will now attempt to explain with further detail.
What I'm attempting:
I have written my code like this (following an example from documentation):
con_out <- file(tmp <- tempfile(), open = "wb")
stream_in(file("C:/User/myFile.json"), handler = function(df){
df <- df[which(df$Var > 0), ]
stream_out(df, con_out, pagesize = 1000)
}, pagesize = 5000)
myData <- stream_in(file(tmp))
As you can see, I open a connection to a temporary file, read my original JSON file with stream_in and have the handler function subset each chunk of data and write it to the connection.
The problem
This procedure runs without any problems, until I try to read it in myData <- stream_in(file(tmp)), upon which I receive an error. Manually opening the new, temporary JSON file reveals that the bottom-most line is always incomplete. Something like the following:
{"Var1":"some data","Var2":3,"Var3":"some othe
I then have to manually remove that last line after which the file loads without issue.
Solutions I've tried
I've tried reading the documentation thoroughly and looking at the stream_out function, and I can't figure out what may be causing this issue. The only slight clue I have is that the stream_out function automatically closes the connection upon completion, so maybe it's closing the connection while some other component is still writing?
I inserted a print function to print the tail() end of the data.frame at every chunk inside the handler function to rule out problems with the intermediary data.frame. The data.frame is produced flawlessly at every interval, and I can see that the final two or three rows of the data.frame are getting truncated while being written to file (i.e., they're not being written). Notice that it's the very end of the entire data.frame (after stream_out has rbinded everything) that is getting chopped.
I've tried playing around with the pagesize arguments, including trying very large numbers, no number, and Inf. Nothing has worked.
I can't use jsonlite's other functions like fromJSON because the original JSON file is too large to read without streaming and it is actually in minified(?)/ndjson format.
System info
I'm running R 3.3.3 x64 on Windows 7 x64. 6 GB of RAM, AMD Athlon II 4-Core 2.6 Ghz.
Treatment
I can still deal with this issue by manually opening the JSON files and correcting them, but it's leading to some data loss and it's not allowing my script to be automated, which is an inconvenience as I have to run it repeatedly throughout my project.
I really appreciate any help with this; thank you.
I believe this does what you want, it is not necessary to do the extra stream_out/stream_in.
myData <- new.env()
stream_in(file("MOCK_DATA.json"), handler = function(df){
idx <- as.character(length(myData) + 1)
myData[[idx]] <- df[which(df$id %% 2 == 0), ] ## change back to your filter
}, pagesize = 200) ## change back to 1000
myData <- myData %>% as.list() %>% bind_rows()
(I created some mock data in Mockaroo: generated 1000 lines, hence the small pagesize, to check if everything worked with more than one chunk. The filter I used was even IDs because I was lazy to create a Var column.)

How to make a complex list into a dataframe in R?

I have a complex list which is get from a json file.
The json file was get from a map service api in China.
I searched the website to solve the problem but I can't find a proper solution to my question, so I put it in this question and hope it can be solved.
If I missing something that I didn't find in the website, I apologize for that.
The code to get the list are as follows:`
library(rjson)
library(RCurl)
key<-"fd5a14632c36aecd2e759a0cc91a3b4a"
origin<-"大润发东环店"
urlorigin <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",origin,"&city=苏州",sep = "")
dataorigin<-readLines(urlorigin,encoding="UTF-8")
origininfo<-fromJSON(dataorigin)
originpoi<-origininfo$geocodes[[1]]$location
destination<-"苏州大学本部北门"
urldest <- paste("http://restapi.amap.com/v3/geocode/geo?key=",key,"&address=",destination,"&city=苏州",sep = "")
datadest<-readLines(urldest,encoding="UTF-8")
destinfo<-fromJSON(datadest)
destpoi<-destinfo$geocodes[[1]]$location
urlpath <- paste("http://restapi.amap.com/v3/direction/driving?key=",key,"&origin=",originpoi,"&destination=",destpoi, "&originid=&destinationid=&extensions=all&strategy=0&waypoints=&avoidpolygons=&avoidroad=",sep = "")
pathjson<-paste(readLines(urlpath,encoding = "UTF-8"),collapse = "")
pathinfo<-fromJSON(pathjson)
The pathinfo was the list I get at last and I want to convert it into a dataframe that I can work with.
Thank you for your time.
I'm from China and my English is not that good, I apologize for that.
My Chinese is very limited as well. But your code to get the data is working (with some warnings).
pathinfo_df <- as.data.frame(lapply(pathinfo,rbind))
pathinfo_df is now a data_frame.
summary(pathinfo_df)
status info infocode count
1:1 OK:1 10000:1 1:1
route.origin.Length route.origin.Class route.origin.Mode
1 -none- character
route.destination.Length route.destination.Class route.destination.Mode
1 -none- character
route.taxi_cost.Length route.taxi_cost.Class route.taxi_cost.Mode
1 -none- character
route.paths.Length route.paths.Class route.paths.Mode
1 -none- list
So, there's plenty to select and play with. Read up on selecting from lists. see also:
str(pathinfo_df)
Then map it on Google Earth. Looks like the taxi might be costly. Have a good trip!

How to get all resource record sets from Route 53 with boto?

I'm writing a script that makes sure that all of our EC2 instances have DNS entries, and all of the DNS entries point to valid EC2 instances.
My approach is to try and get all of the resource records for our domain so that I can iterate through the list when checking for an instance's name.
However, getting all of the resource records doesn't seem to be a very straightforward thing to do! The documentation for GET ListResourceRecordSets seems to suggest it might do what I want and the boto equivalent seems to be get_all_rrsets ... but it doesn't seem to work as I would expect.
For example, if I go:
r53 = boto.connect_route53()
zones = r53.get_zones()
fooA = r53.get_all_rrsets(zones[0][id], name="a")
then I get 100 results. If I then go:
fooB = r53.get_all_rrsets(zones[0][id], name="b")
I get the same 100 results. Have I misunderstood and get_all_rrsets does not map onto ListResourceRecordSets?
Any suggestions on how I can get all of the records out of Route 53?
Update: cli53 (https://github.com/barnybug/cli53/blob/master/cli53/client.py) is able to do this through its feature to export a Route 53 zone in BIND format (cmd_export). However, my Python skills aren't strong enough to allow me to understand how that code works!
Thanks.
get_all_rrsets returns a ResourceRecordSets which derives from Python's list class. By default, 100 records are returned. So if you use the result as a list it will have 100 records. What you instead want to do is something like this:
r53records = r53.get_all_rrsets(zones[0][id])
for record in r53records:
# do something with each record here
Alternatively if you want all of the records in a list:
records = [r for r in r53.get_all_rrsets(zones[0][id]))]
When iterating either with a for loop or list comprehension Boto will fetch the additional records (up to) 100 records at a time as needed.
this blog post from 2018 has a script which allows exporting in bind format:
#!/usr/bin/env python3
# https://blog.jverkamp.com/2018/03/12/generating-zone-files-from-route53/
import boto3
import sys
route53 = boto3.client('route53')
paginate_hosted_zones = route53.get_paginator('list_hosted_zones')
paginate_resource_record_sets = route53.get_paginator('list_resource_record_sets')
domains = [domain.lower().rstrip('.') for domain in sys.argv[1:]]
for zone_page in paginate_hosted_zones.paginate():
for zone in zone_page['HostedZones']:
if domains and not zone['Name'].lower().rstrip('.') in domains:
continue
for record_page in paginate_resource_record_sets.paginate(HostedZoneId = zone['Id']):
for record in record_page['ResourceRecordSets']:
if record.get('ResourceRecords'):
for target in record['ResourceRecords']:
print(record['Name'], record['TTL'], 'IN', record['Type'], target['Value'], sep = '\t')
elif record.get('AliasTarget'):
print(record['Name'], 300, 'IN', record['Type'], record['AliasTarget']['DNSName'], '; ALIAS', sep = '\t')
else:
raise Exception('Unknown record type: {}'.format(record))
usage example:
./export-zone.py mydnszone.aws
mydnszone.aws. 300 IN A server.mydnszone.aws. ; ALIAS
mydnszone.aws. 86400 IN CAA 0 iodef "mailto:hostmaster#mydnszone.aws"
mydnszone.aws. 86400 IN CAA 128 issue "letsencrypt.org"
mydnszone.aws. 86400 IN MX 10 server.mydnszone.aws.
the output can be saved as a file and/or copied to the clipboard:
the Import zone file page allows to paste the data:
at the time of this writing the script was working fine using python 3.9.

How to upload multiple JSON files into CouchDB

I am new to CouchDB. I need to get 60 or more JSON files in a minute from a server.
I have to upload these JSON files to CouchDB individually as soon as I receive them.
I installed CouchDB on my Linux machine.
I hope some one can help me with my requirement.
If possible can someone help me with pseudo code.
My Idea:
Is to write a python script for uploading all JSON files to CouchDB.
Each and every JSON file must be each document and the data present in
JSON must be inserted same into CouchDB
(the specified format with values in a file).
Note:
These JSON files are Transactional, every second 1 file is generated
so I need to read the file upload as same format into CouchDB on
successful uploading archive the file into local system of different folder.
python program to parse the json and insert into CouchDb:
import sys
import glob
import errno,time,os
import couchdb,simplejson
import json
from pprint import pprint
couch = couchdb.Server() # Assuming localhost:5984
#couch.resource.credentials = (USERNAME, PASSWORD)
# If your CouchDB server is running elsewhere, set it up like this:
couch = couchdb.Server('http://localhost:5984/')
db = couch['mydb']
path = 'C:/Users/Desktop/CouchDB_Python/Json_files/*.json'
#dirPath = 'C:/Users/VijayKumar/Desktop/CouchDB_Python'
files = glob.glob(path)
for file1 in files:
#dirs = os.listdir( dirPath )
file2 = glob.glob(file1)
for name in file2: # 'file' is a builtin type, 'name' is a less-ambiguous variable name.
try:
with open(name) as f: # No need to specify 'r': this is the default.
#sys.stdout.write(f.read())
json_data=f
data = json.load(json_data)
db.save(data)
pprint(data)
json_data.close()
#time.sleep(2)
except IOError as exc:
if exc.errno != errno.EISDIR: # Do not fail if a directory is found, just ignore it.
raise # Propagate other kinds of IOError.
I would use CouchDB bulk API, even though you have specified that you need to send them to db one by one. For example, by implementing a simple queue that gets sent out every say 5 - 10 seconds via a bulk doc call will greatly increase performance of your application.
There is obviously a quirk in that and that is you need to know the IDs of the docs that you want to get from the DB. But for the PUTs it is perfect. (it is not entirely true, you can get ranges of docs using bulk operation if the IDs you are using for your docs can be sorted nicely).
From my experience working with CouchDB, I have a hunch that you are dealing with Transactional documents in order to compile them into some sort of sum result and act on that data accordingly (maybe creating next transactional doc in series). For that you can rely on CouchDB by using 'reduce' functions on the views you create. It takes a little practice to get reduce function working properly and is highly dependent on what it is you actually what to achieve and what data you are prepared to emit by the view so I can't really provide you with more detail on that.
So in the end the app logic would go something like that:
get _design/someDesign/_view/yourReducedView
calculate new transaction
add transaction to queue
onTimeout
send all in transaction queue
If I got that first part of why you are using transactional docs wrong all that would really change is the part where you getting those transactional docs in my app logic.
Also, before writing your own 'reduce' function, have a look at buil-in ones (they are alot faster then anything outside of db engine can do)
http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
EDIT:
Since you are starting, I strongly recommend to have a look at CouchDB Definitive Guide.
NOTE FOR LATER:
Here is one hidden stone (well maybe not so much a hidden stone but not an obvious thing to look out for for the new-comer in any case). When you write reduce function make sure that it does not produce too much output for the query without boundaries. This will extremely slow down the entire view even when you provide reduce=false when getting stuff from it.
So you need to get JSON documents from a server and send them to CouchDB as you receive them. A Python script would work fine. Here is some pseudo-code:
loop (until no more docs)
get new JSON doc from server
send JSON doc to CouchDB
end loop
In Python, you could use requests to send the documents to CouchDB and probably to get the documents from the server as well (if it is using an HTTP API).
You might want to checkout the pycouchdb module for python3. I've used it myself to upload lots of JSON objects into couchdb instance. My project does pretty much the same as you describe so you can take a look at my project Pyro at Github for details.
My class looks like that:
class MyCouch:
""" COMMUNICATES WITH COUCHDB SERVER """
def __init__(self, server, port, user, password, database):
# ESTABLISHING CONNECTION
self.server = pycouchdb.Server("http://" + user + ":" + password + "#" + server + ":" + port + "/")
self.db = self.server.database(database)
def check_doc_rev(self, doc_id):
# CHECKS REVISION OF SUPPLIED DOCUMENT
try:
rev = self.db.get(doc_id)
return rev["_rev"]
except Exception as inst:
return -1
def update(self, all_computers):
# UPDATES DATABASE WITH JSON STRING
try:
result = self.db.save_bulk( all_computers, transaction=False )
sys.stdout.write( " Updating database" )
sys.stdout.flush()
return result
except Exception as ex:
sys.stdout.write( "Updating database" )
sys.stdout.write( "Exception: " )
print( ex )
sys.stdout.flush()
return None
Let me know in case of any questions - I will be more than glad to help if you will find some of my code usable.

Filling a Matrix with "For Loop" Taking Too Long

I'm trying to create a data frame that is about 1,000,000 x 5 by using a for-loop, but it's been 5+ hours and I don't think it will finish very soon. I'm using the rjson library to read in the data from a large json file. Can someone help me with filling up this data frame in a faster way?
library(rjson)
# read in data from json file
file <- "/filename"
c <- file(file, "r")
l <- readLines(c, -1L)
data <- lapply(X=l, fromJSON)
# specify variables that i want from this data set
myvars <- c("url", "time", "userid", "hostid", "title")
newdata <- matrix(data[[1]][myvars], 1, 5, byrow=TRUE)
# here's where it goes wrong
for (i in 2:length(l)) {
newdata <- rbind(newdata, data[[i]][myvars])
}
newestdata <- data.frame(newdata)
This is taking forever because each iteration of your loop is creating a new, bigger object. Try this:
slice <- function(field, data) unlist(lapply(data, `[[`, field))
data.frame(Map(slice, myvars, list(data)))
This will create a data.frame and preserve your original data types: character, numeric, etc., if it matters. While forcing everything into a matrix will coerce everything into character class.
Without the data, it's hard to be sure, but there are a couple of things you are doing that are relatively slow. This should be faster, but again, without the data, I can't test:
newdata <- vapply(data, `[`, character(5L), myvars)
I'm also assuming that your data is character, which I think it has to be based on title.
Also, as others have noted, the reason yours is slow is because you are growing an object, which requires R to keep re-allocating memory. vapply will allocate the memory ahead of time because it knows the size of each iterations result, and how many items there are.