When integrating data with DFO365 we use Data Projects to bring in data from other systems. Definitions for these projects can be exported, giving a zip file that contains multiple XML files: one per data entity in the project, a PackageHeader.xml, and a Manifest.xml. Within the manifest file is an element called QueryData, which appears to be a byte array stored as a string.
The QueryData field looks like this:
<QueryData>4a012f270000110001e649010000000a4de9030000862b00008c2b0000882b00
008b2b0000000084045400610078005600410054004e0075006d005400610062
006c00650045006e0074006900740079000000110001e8032e00540061007800
5600410054004e0075006d005400610062006c00650045006e00740069007400
79005f0031000000e2092a005400610078005600410054004e0075006d005400
610062006c00650045006e0074006900740079000000094de8030000f3190000
00920402001100010000ffffffffffffffff9b04ffff9a04ffff000000000000
01ffffffff009005000000000000000000000000000000000000000000000000
000000000000</QueryData>
I tried treating this as a byte-encoded string with some success; i.e. decoding it in PowerShell converts the data above to:
Ŋ✯ ʼn 蘀+谀+蠀+謀+ 萀各愀砀嘀䄀吀一甀洀吀愀戀氀攀䔀渀琀椀琀礀 ᄀĀϨ.TaxVATNumTableEntity_1 ৢ*TaxVATNumTableEntity 䴉Ϩ ᧳ 鈀ȄᄀĀ
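For context, a minimal sketch of the kind of PowerShell involved, assuming the hex string is simply split into byte pairs and decoded as UTF-16 (the $hex value here is truncated, not the full QueryData):

# Sketch: decode the QueryData hex string as UTF-16 text
$hex = '4a012f270000110001e649010000000a4de9030000862b0000'   # paste the full QueryData value here
$bytes = [byte[]]::new($hex.Length / 2)
for ($i = 0; $i -lt $bytes.Length; $i++) {
    $bytes[$i] = [Convert]::ToByte($hex.Substring($i * 2, 2), 16)
}
[System.Text.Encoding]::Unicode.GetString($bytes)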
I'm working on a document-based SwiftUI application, and my file contents include large arrays of doubles. The application imports some data from CSV and then saves that data in a data structure in the file.
However, looking at the file sizes, the custom document type I declare in my application produces files 2-3 times larger than the CSV file. The other bits of metadata could not possibly be that large.
Here's the JSON output that is saved to file:
Here's the CSV file that was imported:
On looking at the raw text of the JSON output and comparing it with the original CSV, it became obvious that the file format I declare was using a lot of unnecessary precision. How do I make Swift's JSONEncoder only use, for example, 4 decimal places of precision?
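For illustration, a minimal sketch of one way to cap the precision, assuming it is acceptable to trim the stored values themselves: convert each Double to a 4-decimal-place Decimal in a custom encode(to:). The type and property names below are illustrative, not from the original document type:

import Foundation

// Sketch: push each Double through a 4-decimal-place Decimal so that
// JSONEncoder writes e.g. 0.1235 instead of the full round-trip precision.
struct Samples: Encodable {
    var values: [Double]

    enum CodingKeys: String, CodingKey { case values }

    func encode(to encoder: Encoder) throws {
        var container = encoder.container(keyedBy: CodingKeys.self)
        let trimmed = values.map { Decimal(string: String(format: "%.4f", $0)) ?? Decimal($0) }
        try container.encode(trimmed, forKey: .values)
    }
}

let data = try! JSONEncoder().encode(Samples(values: [0.123456789, 42.0000001]))
print(String(data: data, encoding: .utf8)!)   // e.g. {"values":[0.1235,42]}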
I have an ADF pipeline exporting from an XML dataset (ADLS) to a JSON dataset (ADLS) with a Copy Data activity. Due to the complex XML structure, I need to parse the nested XML to nested JSON and then use T-SQL to parse the nested JSON into a Synapse table.
However, the nested output has double backslashes (they look like escape characters) at nodes that contain a comma. You can check a sample of the XML input and JSON output below:
xml input
<Address2>test, test</Address2>
json output
"Address2":"test\\, test"
How can I remove the double backslash in the output JSON with the Copy Data activity in Azure Data Factory?
Unfortunately, there is no such provision in the Copy Data activity.
However, I tried just the lines you provided as sample source and sink with a Copy Data activity, and it copies them as is; I don't see any \\. Perhaps you could share the exact pipeline you have, with details of the nested XML, JSON, and T-SQL you are using.
Repro: (with all default settings and properties)
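Roughly, the repro amounts to a Copy activity along these lines (a sketch with illustrative dataset names and default settings, not the exact pipeline JSON):

{
    "name": "CopyXmlToJson",
    "type": "Copy",
    "inputs": [ { "referenceName": "XmlSourceDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "JsonSinkDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": {
            "type": "XmlSource",
            "storeSettings": { "type": "AzureBlobFSReadSettings", "recursive": true },
            "formatSettings": { "type": "XmlReadSettings" }
        },
        "sink": {
            "type": "JsonSink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" },
            "formatSettings": { "type": "JsonWriteSettings" }
        }
    }
}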
I'm trying to understand the code for reading JSON files in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
        bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
        format = 'csv',
        fieldterminator = '0x0b',
        fieldquote = '0x0b'
    ) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file in doc is not well formed as a single JSON document. Using SINGLE_CLOB would make us do more work after the openrowset before we could use the content as JSON (since it is not valid JSON, we would need to parse the value ourselves). It can be done, but it would probably require more work.
The file consists of multiple JSON-formatted strings, each on a separate line: "line-delimited JSON", as the document calls it.
By the way, if you check the history of the document on GitHub, you will find that this was not originally the case. As far as I remember, the file originally contained a single JSON document with an array of objects (it was wrapped with []). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server; there is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in JSON format. Therefore, we do not convert to or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated values file. You can (and should) use it whenever your file is well formatted as a comma-separated values file (also commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which is valid JSON. Only after reading the file with openrowset do we start to parse the content of the text as JSON.
Notice that only after the heading "Parse JSON documents" does the document start to talk about parsing the text as JSON.
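To make that concrete, a sketch of that follow-up step: once each line has been read into doc, functions such as JSON_VALUE can pull values out of it (the JSON property names below are assumed, not quoted from the document):

select top 10
    JSON_VALUE(doc, '$.date_rep') as date_rep,
    JSON_VALUE(doc, '$.countries_and_territories') as country,
    JSON_VALUE(doc, '$.cases') as cases
from openrowset(
        bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
        format = 'csv',
        fieldterminator = '0x0b',
        fieldquote = '0x0b'
    ) with (doc nvarchar(max)) as rows
go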
I have a requirement to gather certain JSON documents from my database and save them in an outside drive as one file for a downstream consumer.
Using server-side JavaScript I can combine the documents into a JSON object or array. However, they need to be saved to this single file in NDJSON format.
Is there any way to do this using xdmp.save in MarkLogic? I thought of saving the documents as a sequence but that throws an error.
xdmp.save() expects a node() for the second parameter.
You could serialize the JSON docs, delimiting them with newlines to produce the newline-delimited JSON, and then create a text() node from that string.
const ndjson = new NodeBuilder()
.addText(cts.search(cts.collectionQuery("json")).toArray().join("\n"))
.toNode();
xdmp.save("/temp/ndjson.json", ndjson);
I have 2 directories: one with txt files and the other with corresponding JSON (metadata) files (around 90,000 of each). There is one JSON file for each txt file, and they share the same name (they don't share any other fields). I am trying to index all these files in Apache Solr.
The txt files just have plain text. I mapped each line to a field called 'sentence' and included the file name as a field using the Data Import Handler. No problems here.
The JSON files have metadata: three tags: a URL, an author, and a title (for the content in the corresponding txt file).
When I index the JSON files (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name; it is the only way I can associate the metadata with each sentence in the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
As #MatsLindh mentioned in the comments, I used Pysolr to do the indexing and get the filename. It's pretty basic, but I thought I'd post what I did as Pysolr doesn't have much documentation.
So, here's how you use Pysolr to index multiple JSON files, while also indexing the file name of the files. This method can be used if you have your files and your metadata files with the same filename (but different extensions), and you want to link them together somehow, like in my case.
1. Open a connection to your Solr instance using the pysolr.Solr command.
2. Loop through the directory containing your files, get the filename of each file using os.path.basename, and store it in a variable (after removing the extension, if necessary).
3. Read the file's JSON content into another variable.
4. Pysolr expects whatever is to be indexed to be stored in a list of dictionaries, where each dictionary corresponds to one record.
5. Store all the fields you want to index in a dictionary (solr_content in my code below), making sure the keys match the field names in your managed-schema file.
6. Append the dictionary created in each iteration to a list (list_for_solr in my code).
7. Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
That's all there is to it! Here's the code.
import json
import os
from glob import iglob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
folderpath = 'directory-where-the-files-are-present'
list_for_solr = []
for filepath in iglob(os.path.join(folderpath, '*.meta')):
    with open(filepath, 'r') as file:
        filename = os.path.basename(filepath)
        # filename is xxxx.yyyy.meta
        filename_without_extension = '.'.join(filename.split('.')[:2])
        content = json.load(file)
        solr_content = {}
        solr_content['authors'] = content['authors']
        solr_content['title'] = content['title']
        solr_content['url'] = content['url']
        solr_content['filename'] = filename_without_extension
        list_for_solr.append(solr_content)
solr.add(list_for_solr)