I can't seem to find it, hence I turn to Slack to ask: is there a way to write a CSV file with its headers using Akka Streams Alpakka?
The only thing I see is https://doc.akka.io/docs/alpakka/current/data-transformations/csv.html#csv-formatting
But there is no reverse of the CSV-to-map operation, i.e. nothing that emits the headers for me.
My use case is that I need to read a few CSV files, filter their content, and write the cleaned content to a corresponding file originalcsvfilename-cleaned.csv.
If it is not directly supported, any recommendations?
You can do something like this:
import akka.stream.scaladsl.Source
import akka.stream.alpakka.csv.scaladsl.CsvFormatting

// Produce the header cells and the data cells for an element of type T
def csv_header(elem: T): List[String] = ???
def csv_line(elem: T): List[String] = ???

// A source that emits true once and then false forever: it marks the first element
def firstTrueIterator(): Iterator[Boolean] = (Iterator single true) ++ (Iterator continually false)
def firstTrueSource: Source[Boolean, _] = Source fromIterator firstTrueIterator

// For the first element emit the header row followed by the data row;
// for every later element emit only the data row
def processData(elem: T, firstRun: Boolean): List[List[String]] = {
  if (firstRun) {
    List(
      csv_header(elem),
      csv_line(elem)
    )
  } else {
    List(csv_line(elem))
  }
}

val finalSource = source
  .zipWith(firstTrueSource)(processData)
  .mapConcat(identity)
  .via(CsvFormatting.format())
I use MergeContent 1.3.0 in order to merge FlowFiles from two sources: 1) ListenHTTP and 2) QueryElasticsearchHTTP.
The problem is that the merging result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would like to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing the data structure in Elasticsearch, I was able to come up with the following output from MergeContent. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which processor should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge the files. However, I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ModJSON(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        obj = json.loads(text)
        newObj = {
            "eid": obj['eid'],
            "zid": obj['zid'],
            ...
        }
        outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))

flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 is not None and flowFile2 is not None):
    # WHAT SHOULD I PUT HERE??
    flowFile = session.write(flowFile, ModJSON())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
session.transfer(flowFile, REL_SUCCESS)
session.commit()
An example of how to read multiple files from the incoming queue using filtering.
Assume you have multiple pairs of flow files with the following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of the eid field provides the link between the pairs.
Before merging, we have to extract the value of the eid field and put it into an attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with properties:
Destination : flowfile-attribute
eid : $.eid
After this you'll have a new eid attribute on the flow file.
Then use the ExecuteScript processor with Groovy as the language and the following code:
import org.apache.nifi.processor.FlowFileFilter;
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.io.OutputStreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get first flow file
def ff0 = session.get()
if(!ff0)return

def eid = ff0.getAttribute('eid')

//try to find files with same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
    public FlowFileFilterResult filter(FlowFile ff) {
        if( eid == ff.getAttribute('eid') )return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})

//let's assume you require one additional file in the queue with the same attribute
if( !ffList || ffList.size()<1 ){
    //if less than required
    //rollback current session with penalize retrieved files so they will go to the end of the incoming queue
    //with pre-configured penalty delay (default 30sec)
    session.rollback(true)
    return
}

//let's put all in one list to simplify later iterations
ffList.add(ff0)

if( ffList.size()>2 ){
    //for example unexpected situation. you have more files than expected
    //redirect all of them to failure
    session.transfer(ffList, REL_FAILURE)
    return
}

//create empty map (aka json object)
def json = [:]

//iterate through files, parse and merge attributes
ffList.each{ff->
    session.read(ff).withStream{rawIn->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}

//create new flow file and write merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut,{rawOut->
    rawOut.withWriter("UTF-8"){writer->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback )

//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")

session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.
I have a webpage where I take the RSS links from. The links point to XML, and I would like to use the XMLFeedSpider functionality to simplify the parsing.
Is that possible?
This would be the flow:
GET example.com/rss (returns HTML)
Parse the HTML and get the RSS links
For each link, parse the XML
I found a simple way based on the existing example in the documentation and on looking at the source code. Here is my solution:
import scrapy
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']  # superseded by start_requests below
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def start_requests(self):
        urls = ['http://www.example.com/get-feed-links']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse_main)

    def parse_main(self, response):
        # collect the RSS links from the HTML page and hand each feed
        # to XMLFeedSpider's built-in parse, which calls parse_node per itertag
        for el in response.css("li.feed-links"):
            yield scrapy.Request(el.css("a::attr(href)").extract_first(),
                                 callback=self.parse)

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
I am trying to write a Django REST Framework API handler that can receive a file as well as a JSON payload. I've set MultiPartParser as the handler's parser.
However, it seems I cannot do both. If I send the payload with the file as a multipart request, the JSON payload arrives mangled in request.data (the first text part up to the first colon becomes the key, and the rest becomes the data). I can send the parameters as standard form parameters just fine, but the rest of my API accepts JSON payloads and I wanted to be consistent. The request.body cannot be read, as it raises RawPostDataException: You cannot access body after reading from request's data stream.
For example, a file and this payload in the request body:
{"title":"Document Title", "description":"Doc Description"}
Becomes:
<QueryDict: {u'fileUpload': [<InMemoryUploadedFile: 20150504_115355.jpg (image/jpeg)>, <InMemoryUploadedFile: Front end lead.doc (application/msword)>], u'{%22title%22': [u'"Document Title", "description":"Doc Description"}']}>
Is there a way to do this? Can I eat my cake, keep it and not gain any weight?
Edit:
It was suggested that this might be a duplicate of Django REST Framework upload image: "The submitted data was not a file". It is not. The upload and request are done in multipart, and keep in mind that the file and its upload are fine. I can even complete the request with standard form variables. But I want to see if I can get a JSON payload in there instead.
For someone who needs to upload a file and send some data, there is no straightforward way to get it to work. There is an open issue in the JSON API spec for this. One possibility I have seen is to use multipart/related as shown here, but I think it's very hard to implement in DRF.
Finally, what I implemented was to send the request as form data. You send each file as a file and all other data as text.
Now, for sending the data as text, you can have a single key called data and send the whole JSON as a string value.
models.py
import uuid
from django.db import models

class Posts(models.Model):
    id = models.UUIDField(default=uuid.uuid4, primary_key=True, editable=False)
    caption = models.TextField(max_length=1000)
    media = models.ImageField(blank=True, default="", upload_to="posts/")
    tags = models.ManyToManyField('Tags', related_name='posts')
serializers.py -> no special changes needed. I'm not showing my serializer here as it's too lengthy because of the writable ManyToMany field implementation.
views.py
class PostsViewset(viewsets.ModelViewSet):
    serializer_class = PostsSerializer
    # MultipartJsonParser is the custom parser defined in utils.py below
    parser_classes = (MultipartJsonParser, parsers.JSONParser)
    queryset = Posts.objects.all()
    lookup_field = 'id'
You will need a custom parser, as shown below, for parsing the JSON.
utils.py
from django.http import QueryDict
import json
from rest_framework import parsers

class MultipartJsonParser(parsers.MultiPartParser):
    def parse(self, stream, media_type=None, parser_context=None):
        result = super().parse(
            stream,
            media_type=media_type,
            parser_context=parser_context
        )
        # find the 'data' field and parse it as JSON
        data = json.loads(result.data["data"])
        qdict = QueryDict('', mutable=True)
        qdict.update(data)
        return parsers.DataAndFiles(qdict, result.files)
Example request in Postman: each file is sent under its own key, and the JSON payload is sent as a string under the data key.
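For readers without Postman, here is a minimal client-side sketch of the same request using the Python requests library (the endpoint URL, file name, and field values here are assumptions for illustration, not from the original answer):

import json
import requests

# Everything except the files travels as one stringified JSON blob under
# the "data" key, which MultipartJsonParser above unpacks server-side.
payload = {"caption": "My first post"}

with open("photo.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/posts/",       # hypothetical endpoint
        data={"data": json.dumps(payload)},   # JSON as a plain text field
        files={"media": f},                   # the file part
    )
print(response.status_code)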
EDIT:
See this extended answer if you want to send each data field as a key-value pair.
I know this is an old thread, but I just came across this. I had to use MultiPartParser in order to get my file and extra data to come across together. Here's what my code looks like:
# views.py
from rest_framework import views
from rest_framework.parsers import MultiPartParser
from rest_framework.response import Response

class FileUploadView(views.APIView):
    parser_classes = (MultiPartParser,)

    def put(self, request, filename, format=None):
        file_obj = request.data['file']
        ftype = request.data['ftype']
        caption = request.data['caption']
        # ...
        # do some stuff with uploaded file
        # ...
        return Response(status=204)
My AngularJS code using ng-file-upload is:
file.upload = Upload.upload({
    url: "/api/picture/upload/" + file.name,
    data: {
        file: file,
        ftype: 'final',
        caption: 'This is an image caption'
    }
});
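For comparison, a rough Python client doing the same multipart PUT might look like this (a sketch only; the host and file name are assumptions):

import requests

# The file goes in the multipart body together with the extra text fields,
# matching what the view above reads from request.data.
with open("photo.jpg", "rb") as f:
    response = requests.put(
        "http://localhost:8000/api/picture/upload/photo.jpg",  # hypothetical host/route
        data={"ftype": "final", "caption": "This is an image caption"},
        files={"file": f},
    )
print(response.status_code)  # the view above returns 204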
I send JSON and an image to create/update a product object. Below is a create APIView that works for me.
Serializer
class ProductCreateSerializer(serializers.ModelSerializer):
    class Meta:
        model = Product
        fields = [
            "id",
            "product_name",
            "product_description",
            "product_price",
        ]

    def create(self, validated_data):
        return Product.objects.create(**validated_data)
View
from rest_framework import generics, status
from rest_framework.parsers import FormParser, MultiPartParser

class ProductCreateAPIView(generics.CreateAPIView):
    queryset = Product.objects.all()
    serializer_class = ProductCreateSerializer
    permission_classes = [IsAdminOrIsSelf,]
    parser_classes = (MultiPartParser, FormParser,)

    def perform_create(self, serializer, format=None):
        owner = self.request.user
        if self.request.data.get('image') is not None:
            product_image = self.request.data.get('image')
            serializer.save(owner=owner, product_image=product_image)
        else:
            serializer.save(owner=owner)
Example test:
def test_product_creation_with_image(self):
    url = reverse('products_create_api')
    self.client.login(username='testaccount', password='testaccount')
    data = {
        "product_name": "Potatoes",
        "product_description": "Amazing Potatoes",
        "image": open("local-filename.jpg", "rb")
    }
    response = self.client.post(url, data)
    self.assertEqual(response.status_code, status.HTTP_201_CREATED)
@Nithin's solution works, but essentially it means you are sending the JSON as a string, and hence not using actual application/json inside the multipart segments.
What we want is to make the backend accept data in the below format:
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="file"; filename="1x1_noexif.jpeg"
Content-Type: image/jpeg
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="myjson"; filename="blob"
Content-Type: application/json
{"hello":"world"}
------WebKitFormBoundaryrga771iuUYap8BB2
Content-Disposition: form-data; name="isDownscaled"; filename="blob"
Content-Type: application/json
false
------WebKitFormBoundaryrga771iuUYap8BB2--
MultiPartParser works with the above format but will treat those JSONs as files. So we simply unmarshal those JSONs by moving them into data.
parsers.py
from rest_framework import parsers

class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)
        # Any 'file' found having application/json as its type will be moved to data
        mutable_data = data.data.copy()
        unmarshaled_blob_names = []
        json_parser = parsers.JSONParser()
        for name, blob in data.files.items():
            if blob.content_type == 'application/json' and name not in data.data:
                mutable_data[name] = json_parser.parse(blob)
                unmarshaled_blob_names.append(name)
        for name in unmarshaled_blob_names:
            del data.files[name]
        data.data = mutable_data
        return data
settings.py
REST_FRAMEWORK = {
    ..
    'DEFAULT_PARSER_CLASSES': [
        ..
        'myproject.parsers.MultiPartJSONParser',
    ],
}
This should work now.
The final bit is testing. Since the test client that ships with Django and DRF doesn't support multipart JSON, we work around that by wrapping any JSON data:
import io
import json

def JsonBlob(obj):
    stringified = json.dumps(obj)
    blob = io.StringIO(stringified)
    blob.content_type = 'application/json'
    return blob

def test_simple(client, png_3x3):
    response = client.post('http://localhost/files/', {
        'file': png_3x3,
        'metadata': JsonBlob({'lens': 'Sigma 35mm'}),
    }, format='multipart')
    assert response.status_code == 200
If you're getting an error along the lines of Incorrect type. Expected pk value, received list. with @nithin's solution, it's because Django's QueryDict is getting in the way: it's specifically structured to use a list for each entry in the dictionary, and thus:
{ "list": [1, 2] }
when parsed by MultipartJsonParser yields
{ 'list': [[1, 2]] }
which trips up your serializer.
Here is an alternative which handles this case, specifically expecting the _data key for your JSON:
from rest_framework import parsers
import json

class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)

        # The JSON payload may arrive as a plain text field named '_data' ...
        json_data_field = data.data.get('_data')
        if json_data_field is not None:
            parsed = json.loads(json_data_field)
            mutable_data = {}
            for key, value in parsed.items():
                mutable_data[key] = value
            mutable_files = {}
            for key, value in data.files.items():
                if key != '_data':
                    mutable_files[key] = value
            return parsers.DataAndFiles(mutable_data, mutable_files)

        # ... or as an application/json blob in the files section
        json_data_file = data.files.get('_data')
        if json_data_file:
            parsed = parsers.JSONParser().parse(json_data_file)
            mutable_data = {}
            for key, value in parsed.items():
                mutable_data[key] = value
            mutable_files = {}
            for key, value in data.files.items():
                # drop the '_data' blob itself so it isn't passed on as a file
                if key != '_data':
                    mutable_files[key] = value
            return parsers.DataAndFiles(mutable_data, mutable_files)

        return data
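A client would then put the whole JSON payload under the _data key, roughly like this (a sketch using the requests library; the endpoint and fields are made up):

import json
import requests

# Because this parser bypasses QueryDict, lists inside _data survive as real lists.
payload = {"title": "Document Title", "list": [1, 2]}

with open("doc.pdf", "rb") as f:
    response = requests.post(
        "http://localhost:8000/api/documents/",  # hypothetical endpoint
        data={"_data": json.dumps(payload)},
        files={"file": f},
    )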
It is very simple to use a multipart POST and a regular view, if this is an option.
You send the JSON as a field and the files as files, then process everything in one view.
Here is a simple Python client and a Django server.
Client - sending multiple files and an arbitrary JSON-encoded object:
import json
import requests

payload = {
    "field1": 1,
    "manifest": "special cakes",
    "nested": {"arbitrary": 1, "object": [1, 2, 3]},
    "hello": "word"}

filenames = ["file1", "file2"]
request_files = {}
url = "example.com/upload"

for filename in filenames:
    request_files[filename] = open(filename, 'rb')

r = requests.post(url, data={'json': json.dumps(payload)}, files=request_files)
Server - consuming the JSON and saving the files:

import json
import os
from django.conf import settings
from django.http import HttpResponse, HttpResponseNotFound, HttpResponseServerError
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def upload(request):
    if request.method == 'POST':
        data = json.loads(request.POST['json'])
        try:
            manifest = data['manifest']
            # process the json data
        except KeyError:
            return HttpResponseServerError("Malformed data!")
        dir = os.path.join(settings.MEDIA_ROOT, "uploads")
        os.makedirs(dir, exist_ok=True)
        for file in request.FILES:
            path = os.path.join(dir, file)
            if not os.path.exists(path):
                save_uploaded_file(path, request.FILES[file])
    else:
        return HttpResponseNotFound()
    return HttpResponse("Got json data")

def save_uploaded_file(path, f):
    with open(path, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)
I'd just like to add to @Pithikos's answer by modifying the parser to accept lists as well, in line with how DRF parses lists in serializers in utils/html#parse_html_list:
from rest_framework import parsers

class MultiPartJSONParser(parsers.MultiPartParser):
    def parse(self, stream, *args, **kwargs):
        data = super().parse(stream, *args, **kwargs)
        # Any 'file' found having application/json as its type will be moved to data
        mutable_data = data.data.copy()
        unmarshaled_blob_names = []
        json_parser = parsers.JSONParser()
        for name, blob in data.files.items():
            if blob.content_type == 'application/json' and name not in data.data:
                parsed = json_parser.parse(blob)
                if isinstance(parsed, list):
                    # need to break it out into [0], [1] etc
                    for idx, item in enumerate(parsed):
                        mutable_data[name + f"[{str(idx)}]"] = item
                else:
                    mutable_data[name] = parsed
                unmarshaled_blob_names.append(name)
        for name in unmarshaled_blob_names:
            del data.files[name]
        data.data = mutable_data
        return data
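With that change, a JSON list posted as an application/json blob is expanded into indexed keys. A rough test sketch, reusing the JsonBlob helper from the earlier answer (the endpoint and field names are invented for illustration):

def test_list_blob(client, png_3x3):
    # 'tags' is an application/json blob containing a list; the parser above
    # expands it into tags[0], tags[1], ... the way parse_html_list expects
    response = client.post('http://localhost/files/', {
        'file': png_3x3,
        'tags': JsonBlob(['travel', 'food']),
    }, format='multipart')
    assert response.status_code == 200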
The following code worked for me.
from django.core.files.uploadedfile import SimpleUploadedFile
import requests
from typing import Dict

# file_path, url and headers are assumed to be defined elsewhere
with open(file_path, 'rb') as f:
    file = SimpleUploadedFile('Your-Name', f.read())

data: Dict[str, str] = {}  # put your extra form fields here
files: Dict[str, SimpleUploadedFile] = {'model_field_name': file}

requests.put(url, headers=headers, data=data, files=files)
requests.post(url, headers=headers, data=data, files=files)
'model_field_name' is the name of the FileField or ImageField in your Django model. You can pass other data, such as name or location, as usual by using the data parameter.
Hope this helps.
This works for me:
class FileUpload(APIView):
    parser_classes = [MultiPartParser]
    authentication_classes = [JWTAuthentication]

    def post(self, request, filename, format=None):
        file = request.data['file']
        data = json.loads(request.POST['form'])
        # .... just do .....
        # .
        # .
        # .
Frontend part: an example with fetch (Vue frontend):
let data = new FormData(); // creates a new FormData object (no await needed: the constructor is synchronous)
data.append("file", this.files); // add your file to the form data
data.append('form', JSON.stringify(body)); // add your json

fetch(`https://endpoint/FileUpload/${body.nombre}`, {
    method: 'POST',
    body: data,
    headers: {Authorization: `Bearer ${accessToken}`}
})
I hope this helps.