I have a CSV file with 2 columns and 20,000 rows I would like to import into Google Cloud Datastore. I'm new to the Google Cloud and NoSQL databases. I have tried using dataflow but need to provide a Javascript UDF function name. Does anyone have an example of this? I will be querying this data once it's in the datastore.
Any advice or guidance on how to create this would be appreciated.
Using Apache Beam, you can read a CSV file using the TextIO class. See the TextIO documentation.
Pipeline p = Pipeline.create();
p.apply(TextIO.read().from("gs://path/to/file.csv"));
Next, apply a transform that will parse each row in the CSV file and return an Entity object. Depending on how you want to store each row, construct the appropriate Entity object. This page has an example of how to create an Entity object.
.apply(ParDo.of(new DoFn<String, Entity>() {
#ProcessElement
public void processElement(ProcessContext c) {
String row = c.element();
// TODO: parse row (split) and construct Entity object
Entity entity = ...
c.output(entity);
}
}));
Lastly, write the Entity objects to Cloud Datastore. See the DatastoreIO documentation.
.apply(DatastoreIO.v1().write().withProjectId(projectId));
Simple in python, but can easily adapt to other langauges. Use the split() method to loop through the lines and comma-separated values:
from google.appengine.api import urlfetch
from my.models import MyModel
csv_string = 'http://someplace.com/myFile.csv'
csv_response = urlfetch.fetch(csv_string, allow_truncated=True)
if csv_response.status_code == 200:
for row in csv_response.content.split('\n'):
row_values = row.split(',')
# csv values are strings. Cast them if they need to be something else
new_entry = MyModel(
property1 = row_values[0],
property2 = row_values[1]
)
new_entry.put()
else:
print 'cannot load file: {}'.format(csv_string)
Related
import com.jayway.jsonpath.JsonPath
def idCSV = new File('id.csv')
def index = [fileOne.json, fileTwo.json]
def jsonString
index.each { file ->
jsonString = ________
def ids = JsonPath.read(jsonString, '$..id')
ids.each { id ->
idCSV << id << newLine
}
}
How to fill the jsonString = ____, so that I can json file into string and parse the string to extract ids and some information from the json string.
And I don't to do it in http request-> GET-> file format.
Previously i have extraced jsonString from http response and it worked well now I want to do it this way.
Use JsonSlurper:
def jsonString = new groovy.json.JsonSlurper().parseText(new File("json.txt").text)
My expectation is that you're looking for File.getText() function
jsonString = file.text
I have no full vision why do you need to store the values from JSON in a CSV file, however there is an alternative way of achieving this which doesn't require scripting as your approach will work with 1 concurrent thread only, if you will add more users attempting writing into the same file - you'll run into a race condition :
You can read the files from the folder into JMeter Variables via Directory Listing Config
The file can be read using HTTP Request sampler
The values cane be fetched using JSON Extractor, they will be automatically stored into JMeter Variables so you will able to use them later on
If you need the values to be present in the file (although I wouldn't recommend this approach cause it will cause massive disk IO and potentially can run your test) you can go for the Flexible File Writer
I store a blob of Json in the datastore using JsonProperty.
I don't know the structure of the json data.
I am using endpoints proto datastore in order to retrieve my data.
The probleme is the json property is encoded in base64 and I want a plain json object.
For the example, the json data will be:
{
first: 1,
second: 2
}
My code looks something like:
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel
class Model(EndpointsModel):
data = ndb.JsonProperty()
#endpoints.api(name='myapi', version='v1', description='My Sample API')
class DataEndpoint(remote.Service):
#Model.method(path='mymodel2', http_method='POST',
name='mymodel.insert')
def MyModelInsert(self, my_model):
my_model.data = {"first": 1, "second": 2}
my_model.put()
return my_model
#Model.method(path='mymodel/{entityKey}',
http_method='GET',
name='mymodel.get')
def getMyModel(self, model):
print(model.data)
return model
API = endpoints.api_server([DataEndpoint])
When I call the api for getting a model, I get:
POST /_ah/api/myapi/v1/mymodel2
{
"data": "eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ=="
}
where eyJzZWNvbmQiOiAyLCAiZmlyc3QiOiAxfQ== is the base64 encoded of {"second": 2, "first": 1}
And the print statement give me: {u'second': 2, u'first': 1}
So, in the method, I can explore the json blob data as a python dict.
But, in the api call, the data is encoded in base64.
I expeted the api call to give me:
{
'data': {
'second': 2,
'first': 1
}
}
How can I get this result?
After the discussion in the comments of your question, let me share with you a sample code that you can use in order to store a JSON object in Datastore (it will be stored as a string), and later retrieve it in such a way that:
It will show as plain JSON after the API call.
You will be able to parse it again to a Python dict using eval.
I hope I understood correctly your issue, and this helps you with it.
import endpoints
from google.appengine.ext import ndb
from protorpc import remote
from endpoints_proto_datastore.ndb import EndpointsModel
class Sample(EndpointsModel):
column1 = ndb.StringProperty()
column2 = ndb.IntegerProperty()
column3 = ndb.StringProperty()
#endpoints.api(name='myapi', version='v1', description='My Sample API')
class MyApi(remote.Service):
# URL: .../_ah/api/myapi/v1/mymodel - POSTS A NEW ENTITY
#Sample.method(path='mymodel', http_method='GET', name='Sample.insert')
def MyModelInsert(self, my_model):
dict={'first':1, 'second':2}
dict_str=str(dict)
my_model.column1="Year"
my_model.column2=2018
my_model.column3=dict_str
my_model.put()
return my_model
# URL: .../_ah/api/myapi/v1/mymodel/{ID} - RETRIEVES AN ENTITY BY ITS ID
#Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
if not my_model.from_datastore:
raise endpoints.NotFoundException('MyModel not found.')
dict=eval(my_model.column3)
print("This is the Python dict recovered from a string: {}".format(dict))
return my_model
application = endpoints.api_server([MyApi], restricted=False)
I have tested this code using the development server, but it should work the same in production using App Engine with Endpoints and Datastore.
After querying the first endpoint, it will create a new Entity which you will be able to find in Datastore, and which contains a property column3 with your JSON data in string format:
Then, if you use the ID of that entity to retrieve it, in your browser it will show the string without any strange encoding, just plain JSON:
And in the console, you will be able to see that this string can be converted to a Python dict (or also a JSON, using the json module if you prefer):
I hope I have not missed any point of what you want to achieve, but I think all the most important points are covered with this code: a property being a JSON object, store it in Datastore, retrieve it in a readable format, and being able to use it again as JSON/dict.
Update:
I think you should have a look at the list of available Property Types yourself, in order to find which one fits your requirements better. However, as an additional note, I have done a quick test working with a StructuredProperty (a property inside another property), by adding these modifications to the code:
#Define the nested model (your JSON object)
class Structured(EndpointsModel):
first = ndb.IntegerProperty()
second = ndb.IntegerProperty()
#Here I added a new property for simplicity; remember, StackOverflow does not write code for you :)
class Sample(EndpointsModel):
column1 = ndb.StringProperty()
column2 = ndb.IntegerProperty()
column3 = ndb.StringProperty()
column4 = ndb.StructuredProperty(Structured)
#Modify this endpoint definition to add a new property
#Sample.method(request_fields=('id',), path='mymodel/{id}', http_method='GET', name='Sample.get')
def MyModelGet(self, my_model):
if not my_model.from_datastore:
raise endpoints.NotFoundException('MyModel not found.')
#Add the new nested property here
dict=eval(my_model.column3)
my_model.column4=dict
print(json.dumps(my_model.column3))
print("This is the Python dict recovered from a string: {}".format(dict))
return my_model
With these changes, the response of the call to the endpoint looks like:
Now column4 is a JSON object itself (although it is not printed in-line, I do not think that should be a problem.
I hope this helps too. If this is not the exact behavior you want, maybe should play around with the Property Types available, but I do not think there is one type to which you can print a Python dict (or JSON object) without previously converting it to a String.
I use MergeContent 1.3.0 in order to merge FlowFiles from 2 sources: 1) from ListenHTTP and 2) from QueryElasticsearchHTTP.
The problem is that the merging result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing data structure in Elastic, I was able to come up with the following output result of MergeContent. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which operator should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge files. However I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback
class ModJSON(StreamCallback):
def __init__(self):
pass
def process(self, inputStream, outputStream):
text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
obj = json.loads(text)
newObj = {
"eid": obj['eid'],
"zid": obj['zid'],
...
}
outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))
flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 != None && flowFile2 != None):
# WHAT SHOULD I PUT HERE??
flowFile = session.write(flowFile, ModJSON())
flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0]+'_translated.json')
session.transfer(flowFile, REL_SUCCESS)
session.commit()
The example how to read multiple files from incoming queue using filtering
Assume you have multiple pairs of flow files with following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of eid field provides a link between pairs.
Before merging we have to extract the value of eid field and put it into na attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with properties:
Destination : flowfile-attribute
eid : $.eid
After this you'll have new eid attribute of the flow file.
Then use ExecuteScript processor with groovy language and with following code:
import org.apache.nifi.processor.FlowFileFilter;
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
//get first flow file
def ff0 = session.get()
if(!ff0)return
def eid = ff0.getAttribute('eid')
//try to find files with same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
public FlowFileFilterResult filter(FlowFile ff) {
if( eid == ff.getAttribute('eid') )return FlowFileFilterResult.ACCEPT_AND_CONTINUE
return FlowFileFilterResult.REJECT_AND_CONTINUE
}
})
//let's assume you require two additional files in queue with the same attribute
if( !ffList || ffList.size()<1 ){
//if less than required
//rollback current session with penalize retrieved files so they will go to the end of the incoming queue
//with pre-configured penalty delay (default 30sec)
session.rollback(true)
return
}
//let's put all in one list to simplify later iterations
ffList.add(ff0)
if( ffList.size()>2 ){
//for example unexpected situation. you have more files then expected
//redirect all of them to failure
session.transfer(ffList, REL_FAILURE)
return
}
//create empty map (aka json object)
def json = [:]
//iterate through files parse and merge attributes
ffList.each{ff->
session.read(ff).withStream{rawIn->
def fjson = new JsonSlurper().parse(rawIn)
json.putAll(fjson)
}
}
//create new flow file and write merged json as a content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
rawOut.withWriter("UTF-8"){writer->
new JsonBuilder(json).writeTo(writer)
}
} as OutputStreamCallback )
//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")
session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.
I'm using SOAPUI to mock out a web-api service, I'm reading the contents of a static json response file, but changing the contents of a couple of the nodes based on what the user has passed through in the request.
I can't create multiple responses as surcharge is calculated from the amount passed.
The toString() method of the object that gets return by the slurper is replacing { with [ with invalidates my JsonResponse. I've included the important bits of the code below, has anyone got a way around this or is JsonSlurper not the right thing to use here?
def json=slurper.parseText(new File(path).text)
// set the surcharge to the two credit card nodes
// these are being set fine
json.AvailableCardTypeResponse.PaymentCards[0].Surcharge="${sur_charge}"
json.AvailableCardTypeResponse.PaymentCards[1].Surcharge="${sur_charge}"
response.setContentType("application/json;charset=utf-8" );
response.setContentLength(length);
Tools.readAndWrite( new ByteArrayInputStream(json.toString().getBytes("UTF-8")), length,response.getOutputStream() )
return new com.eviware.soapui.impl.wsdl.mock.WsdlMockResult(mockRequest)
You're slurping the json into a List/Map structure, then writing this List/Map structure out.
You need to convert your lists and maps back to json.
Change the line:
Tools.readAndWrite( new ByteArrayInputStream(json.toString().getBytes("UTF-8")), length,response.getOutputStream() )
to
Tools.readAndWrite( new ByteArrayInputStream( new JsonBuilder( json ).toString().getBytes("UTF-8")), length,response.getOutputStream() )
I want to parse a json file into objects, and save it to database. I just create a groovy script that runs in grails console(typing grails console in cmd line). I did not create grails app or domain class. Inside this small script, When I call save, I have
groovy.lang.MissingMethodException: No signature of method: Blog.save()
is applicable for argument types: () values: []
Possible solutions: wait(), any(), wait(long), isCase(java.lang.Object),
sleep(long), any(groovy.lang.Closure)
Am I missing something?
I'm also confused that if I do save, is it going to save data to a table called Blog? Should I build any database connection here? (Because I grails domain class, we don't need to. But is it different using pure groovy?)
Many Thanks!
import grails.converters.*
import org.codehaus.groovy.grails.web.json.*;
class Blog {
String title
String body
static mapping = {
body type:"text"
attachment type:"text"
}
Blog(title,body,slug){
this.title = title
this.body=body
}
}
here parse the json
// parse json
List parsedList =JSON.parse(new FileInputStream("c:/ning-blogs.json"), "UTF-8")
def blogs = parsedList.collect {JSONObject jsonObject ->
new Blog(jsonObject.get("title"),jsonObject.get("description"),"N/A");
}
loop blogs and save each object
for (i in blogs){
// println i.title; I'll get the information needed.
i.save();
}
I don't have large experience with grails, but from a quick googling seems like that for a class be treated like a model class, it will need to be either on the correct convention-package/dir or a legacy jar with hibernate mapping/JPA annotation. Thus your example can't work. Why not define that model in your model package?