I have 2 directories: one with txt files and the other with corresponding JSON (metadata) files (around 90,000 of each). There is one JSON file for each txt file, and they share the same name (they don't share any other fields). I am trying to index all these files in Apache Solr.
The txt files just have plain text. I mapped each line to a field called 'sentence' and included the file name as a field using the Data Import Handler. No problems here.
The JSON files have metadata: 3 tags: a URL, an author, and a title (for the content in the corresponding txt file).
When I index the JSON files (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name; it is the only way I can associate the metadata with each sentence in the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
As @MatsLindh mentioned in the comments, I used Pysolr to do the indexing and get the filename. It's pretty basic, but I thought I'd post what I did, as Pysolr doesn't have much documentation.
So, here's how you use Pysolr to index multiple JSON files while also indexing each file's name. This method can be used if you have your data files and your metadata files with the same filename (but different extensions), and you want to link them together somehow, like in my case.
Open a connection to your Solr instance using the pysolr.Solr command.
Loop through the directory containing your files, and get the filename of each file using os.path.basename and store it in a variable (after removing the extension, if necessary).
Read the file's JSON content into another variable.
Pysolr expects whatever is to be indexed to be stored in a list of dictionaries where each dictionary corresponds to one record.
Store all the fields you want to index in a dictionary (solr_content in my code below) while making sure the keys match the field names in your managed-schema file.
Append the dictionary created in each iteration to a list (list_for_solr in my code).
Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
That's all there is to it! Here's the code.
import json
import os
from glob import iglob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
folderpath = 'directory-where-the-files-are-present'
list_for_solr = []
for filepath in iglob(os.path.join(folderpath, '*.meta')):
    with open(filepath, 'r') as file:
        filename = os.path.basename(filepath)
        # filename is xxxx.yyyy.meta
        filename_without_extension = '.'.join(filename.split('.')[:2])
        content = json.load(file)
        solr_content = {}
        solr_content['authors'] = content['authors']
        solr_content['title'] = content['title']
        solr_content['url'] = content['url']
        solr_content['filename'] = filename_without_extension
        list_for_solr.append(solr_content)
solr.add(list_for_solr)  # depending on your Solr config you may also need solr.commit()
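Since the goal is to join the metadata back to the sentences in downstream Python code, a query by filename might look like the following sketch (assuming the filename field is stored in your managed-schema; the filename value here is hypothetical):
results = solr.search('filename:"xxxx.yyyy"')   # hypothetical filename value
for doc in results:
    print(doc['authors'], doc['title'], doc['url'])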
I basically have a procedure where I make multiple calls to an API and, using a token within the JSON that is returned, pass that back to a function to call the API again and get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least they should. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that data frame to a master data frame, so essentially I'll have one big data frame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and then, I'm guessing, when a value actually appears in that field it changes to a boolean. The problem is that unionByName then fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema, so you can do a unionByName with allowMissingColumns=True:
something like:
from pyspark.sql.types import *

my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
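If you would rather enforce the schema at read time (the alternative raised in the question), here is a minimal sketch that reuses my_schema from above; the file path is a made-up placeholder:
df_json = (spark.read
    .schema(my_schema)                # skip schema inference, so the field type never flips
    .json('/mnt/raw/page_66.json'))   # hypothetical path
output = output.unionByName(df_json, allowMissingColumns=True)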
I'm not sure this is what you are looking for, but I hope it helps.
Background: I want to store a dict object in JSON format that has, say, 2 entries:
(1) Some object that describes the data in (2). This is small data: mostly definitions, control parameters, etc., and things (maybe called metadata) that one would like to read before using the actual data in (2). In short, I want good human readability of this portion of the file.
(2) The data itself is a large chunk; it should be more machine readable (no need for a human to gaze over it on opening the file).
Problem: How do I specify a custom indent, say 4 for (1) and None for (2)? If I use something like json.dump(data, trig_file, indent=4) where data = {'meta_data': small_description, 'actual_data': big_chunk}, the indent applies to the whole dict, meaning the large data will also have a lot of whitespace, making the file large.
Assuming you can append json to a file:
Write {"meta_data":\n to the file.
Append the json for small_description formatted appropriately to the file.
Append ,\n"actual_data":\n to the file.
Append the json for big_chunk formatted appropriately to the file.
Append \n} to the file.
The idea is to do the JSON formatting of the "container" object by hand, and use your JSON formatter as appropriate for each of the contained objects.
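A minimal sketch of that by-hand approach in Python (the sample data is made up; the trig_file name comes from the question):
import json

small_description = {'version': 1, 'notes': 'human readable metadata'}  # placeholder metadata
big_chunk = list(range(1000))                                           # placeholder bulk data

with open('trig_file.json', 'w') as trig_file:
    trig_file.write('{"meta_data":\n')
    trig_file.write(json.dumps(small_description, indent=4))        # pretty-printed part
    trig_file.write(',\n"actual_data":\n')
    trig_file.write(json.dumps(big_chunk, separators=(',', ':')))   # compact part, no whitespace
    trig_file.write('\n}')
The result is still one valid JSON object, with only the metadata portion indented.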
Consider a different file format, interleaving keys and values as distinct documents concatenated together within a single file:
{"next_item": "meta_data"}
{
"description": "human-readable content goes here",
"split over": "several lines"
}
{"next_item": "actual_data"}
["big","machine-readable","unformatted","content","here","....."]
That way you can pass any indent parameters you want to each write, and you aren't doing any serialization by hand.
See How do I use the 'json' module to read in one JSON object at a time? for how one would read a file in this format. One of its answers wisely suggests the ijson library, which accepts a multiple_values=True argument.
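A short sketch of both sides of that idea (assuming ijson 3.x; the content is made up):
import json
import ijson  # third-party: pip install ijson

# Write each document with its own formatting, concatenated into one file.
with open('data.json', 'w') as f:
    json.dump({'next_item': 'meta_data'}, f)
    f.write('\n')
    json.dump({'description': 'human-readable content goes here',
               'split over': 'several lines'}, f, indent=4)
    f.write('\n')
    json.dump({'next_item': 'actual_data'}, f)
    f.write('\n')
    json.dump(['big', 'machine-readable', 'unformatted', 'content'], f)
    f.write('\n')

# Read the concatenated documents back one at a time.
with open('data.json', 'rb') as f:
    for doc in ijson.items(f, '', multiple_values=True):
        print(doc)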
I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
format = 'csv',
fieldterminator ='0x0b',
fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file in doc is not well formed as a single JSON document. Using SINGLE_CLOB would make us do more work after using openrowset before we could use the content as JSON (since it is not valid JSON, we would need to parse the value). It can be done, but it would probably require more work.
The format of the file is multiple JSON-like strings, each on a separate line: "line-delimited JSON", as the document calls it.
By the way, if you check the history of the document on GitHub, you will find that this was not originally the case. As far as I remember, the file originally included a single JSON document with an array of objects (it was wrapped with [] when loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in a JSON-like format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is a comma-separated values file. You can (and should) use it whenever your file is well formatted as a comma-separated values file (also commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which is in a valid JSON format. Only after the file is read using openrowset do we start to parse the content of the text as JSON.
Notice that only after the heading "Parse JSON documents" does the document start to talk about parsing the text as JSON.
Using JMeter, I need to take a UUID extracted from JSON and add it to a CSV file in the same column (multiple values) to feed into a Delete request (REST). This is to test multiple delete calls, each of which has a unique UUID generated by a POST call. Or is there any other way I can test multiple delete calls after extracting the UUIDs from the POST calls? Let's say 50 POST calls and then 50 DELETE calls.
I don't think you need to do anything: given that the requests reside in the same Thread Group, you should be able to use the same variable for the Delete request.
JMeter Variables are local to a thread so different threads will have different variable values.
If you are still looking for a file-based solution, be aware of the fact that you can write an arbitrary JMeter Variable into a file using Groovy code like this:
Add a JSR223 PostProcessor after the JSON Extractor
Make sure you have groovy selected in the "Language" dropdown
Make sure you have the "Cache compiled script if available" box ticked
Put the following code into the "Script" area
def csvFile = new File('test.csv')
csvFile << vars.get('your_variable')
csvFile << System.getProperty('line.separator')
This way you will get every extracted UUID written into the test.csv file in JMeter's "bin" folder, and you can then use it in the CSV Data Set Config for your Delete request.
More information:
Groovy Goodness: Working with Files
Apache Groovy - Why and How You Should Use It
I have many folders, each named after a series. Every series folder contains its own chapter folders, and each chapter folder has some images in it. My website is a manga (comic) site, so I am going to record these folder and image paths in MySQL and return them as JSON data for use with AngularJS. How should I save these folder paths or names in MySQL so that I get proper JSON data to use with AngularJS?
My table is like this (it can change):
id  series_name  folder  path
1   Dragon Ball  788     01.jpg02.jpg03.jpg04.jpg05.jpg06.jpg..........
2   One Piece    332     01.jpg...................
3   One Piece    333     01.jpg02.jpg...........
My current website:
Link to Reader Part of My WebSite
I'm assuming you're using PHP on a LAMP stack. So first you would need to grab all your SQL fields and change them into JSON keys.
Create JSON-object the correct way
Then you can create your JSON object like this and pass it to Angular when it does an AJAX request. Make sure you create the array before placing it into your JSON object (for path).
{
id: Number,
series_name: String,
folder: Number,
path: [
String, String, String, ...
]
}
Here is the Angular documentation for an Angular GET request.
https://docs.angularjs.org/api/ng/service/$http
EDIT:
It's difficult because of how your filenames are formatted. If they were formatted like "01.jpg,02.jpg,03.jpg" it would be easier.
You can use preg_split with the regex:
$string = "01.jpg02.jpg03.jpg04.jpg05.jpg06.jpg";
$keywords = preg_split("/(\.jpg|\.png|\.bmp)/", $string); // escape the dots so "." doesn't match any character
but you would need them all to have the same extension, and then you would need to re-append the extension to each element after you split it.
There may be a better way.