Is it possible to restrict values or property names in a JSON Schema according to data defined in another JSON file (not a schema, just a data file)? Or even take files from a folder and process their names?
For example, in YAML:
file 1:
Attributes:
  - Attribute1
  - Attribute2
file 2:
Influence:
  Attribute1: 1
  Attribute2: -3
I want syntax help in the second file that depends on the data defined in the first file. How can I do that?
And a harder case: there is a folder with YAML/JSON files that describe some events, like:
Events/event1.yaml
Events/subfolder/event2.yaml
Another file should use only the file names defined in that folder.
For example:
DefaultEvents:
  - event1
  - event2
Is it possible, and how can I get autocomplete with JSON Schema in such a case? It's not about validation; I need syntax help and autocomplete while creating such files.
The only possibility I found is to add all the possible values to the JSON Schema dynamically with whatever programming language you use.
This solution is sufficient when the JSON Schema is stored locally in your project.
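Nothing in JSON Schema itself can pull values out of another data file, so you generate that part of the schema from the file. A minimal sketch in Python (the file names, the schema layout, and the integer type for the values are assumptions based on the example above):

import json
import yaml  # PyYAML

# Read the data file that defines the allowed attribute names (assumed file name)
with open('attributes.yaml') as f:
    attributes = yaml.safe_load(f)['Attributes']

# Build a schema that only allows those names as properties of "Influence"
schema = {
    '$schema': 'http://json-schema.org/draft-07/schema#',
    'type': 'object',
    'properties': {
        'Influence': {
            'type': 'object',
            'properties': {name: {'type': 'integer'} for name in attributes},
            'additionalProperties': False
        }
    }
}

with open('influence.schema.json', 'w') as f:
    json.dump(schema, f, indent=2)

The folder case works the same way: walk the Events directory, strip the extensions from the file names, and emit them as an enum for the DefaultEvents items. Re-run the script whenever the source data changes (for example from a build step or pre-commit hook), and point your editor at the generated schema (for example via the yaml.schemas setting if you use the VS Code YAML extension) to get the completion.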
I basically have a procedure where I make multiple calls to an API and, using a token within the returned JSON, pass that back to a function to call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758 MB. The JSON files are all formatted the same way and have the same "schema", or at least should do. I have tried reading each JSON file into a DataFrame after it has been downloaded, and then attempted to union that DataFrame to a master DataFrame, so essentially I'll have one big DataFrame with all 88 JSON files read into it.
However, the problem I encounter is that at roughly file 66 the system (Python/Databricks/Spark) decides to change the data type of a field. It is always a string, and I'm guessing that when a value actually appears in that field it changes to a boolean. The problem is then that unionByName fails because of the different data types.
What is the best way for me to resolve this? I thought about using "extend" to merge all the JSON files into one big file, however a 758 MB JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into, so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty DataFrame with that schema, so that you can do a unionByName with allowMissingColumns=True.
Something like:
from pyspark.sql.types import *

# Reference schema that every JSON file should conform to
my_schema = StructType([
    StructField('file_name', StringType(), True),
    StructField('id', LongType(), True),
    StructField('dataset_name', StringType(), True),
    StructField('snapshotdate', TimestampType(), True)
])

# Empty DataFrame carrying the reference schema
output = spark.createDataFrame([], my_schema)

df_json = spark.read.json("...")  # read your JSON file here
output = output.unionByName(df_json, allowMissingColumns=True)
I'm not sure if this is exactly what you are looking for, but I hope it helps.
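As for the last part of your question: yes, you can also pass the schema explicitly when reading, so Spark never infers (and never flips) the type of that field. A minimal sketch reusing my_schema from above (the path is a placeholder):

# An explicit schema disables inference, so the problematic field keeps
# the same type across all 88 files.
df_json = spark.read.schema(my_schema).json("/path/to/json/files/*.json")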
I need to delete certain entries from nested JSON files. As far as I know, I can't just delete them from the JSON file directly, so my next choice would be to load them into a PySpark DataFrame, delete the entries there, create a new JSON file with the same schema (and preferably the same name), and replace the old JSON file. I have extracted the schema into a JSON file; is there a way to write the DataFrame back into a JSON file, somehow parsing the extracted schema?
Thanks!
Spark's DataFrameWriter also has a mode() method to specify the SaveMode; the argument takes one of the strings below (or, in Scala, a constant from the SaveMode class).
overwrite – overwrites the existing file; alternatively, you can use SaveMode.Overwrite.
append – adds the data to the existing file; alternatively, you can use SaveMode.Append.
ignore – ignores the write operation when the file already exists; alternatively, you can use SaveMode.Ignore.
errorifexists or error – the default; when the file already exists, it returns an error. Alternatively, you can use SaveMode.ErrorIfExists.
In PySpark you pass the string form, e.g.:
df2.write.mode("overwrite").json("/tmp/spark_output/zipcodes.json")
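And to reuse the schema you already extracted into a JSON file, a rough sketch in PySpark (the paths and the filter condition are placeholders, not taken from your project):

import json

from pyspark.sql.types import StructType

# Rebuild the schema object from the exported schema JSON (placeholder path)
with open('/tmp/extracted_schema.json') as f:
    schema = StructType.fromJson(json.load(f))

df = spark.read.schema(schema).json('/data/original.json')
df_clean = df.filter(df.id != 42)  # hypothetical delete condition

# "overwrite" replaces the existing output location
df_clean.write.mode('overwrite').json('/data/original_cleaned')

Note that Spark writes a directory of part files rather than a single JSON file, so keeping exactly the same file name would need an extra rename or merge step afterwards.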
I have an API called VerifyIdentity which returns true or false for an ID.
I also have a CSV file, and all the IDs in the file are valid IDs that should be returned as True by the VerifyIdentity API.
I want to create a feature file to test all the IDs. Is there a way to loop over that CSV file? I know that a Cucumber scenario outline can do a very similar thing, but I can't manually type those IDs into my tests since there are too many of them.
Thank you!
By the way, the IDs in the CSV are all the numbers between 1 and 100000. It would also work if there were a way to create a loop-like scenario.
Reading CSV files from a Scenario written in Gherkin is not supported.
However, this feature is supported in Gherkin with QAF, where you can have examples in CSV/Excel/XML/JSON/DB:
Scenario Outline: Search Keyword using data from file
    When I search for "<searchKey>"
    Then I get at least "<number>" results
    Then it should have "<searchResult>" in search results

    Examples: {'datafile':'resources/testdata.csv'}
where your CSV file may look like the one below:
searchKey,searchResult,number,TestCaseId
https://qmetry.github.io/qaf/latest/gherkin_client.html
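Since your IDs are simply the numbers 1 to 100000, you can also generate the data file itself instead of typing it. A small sketch in Python (the resources/testdata.csv path and the single 'id' column are assumptions to adapt to your step definitions):

import csv

# Write one row per ID so the scenario outline runs once for each value
with open('resources/testdata.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id'])
    for i in range(1, 100001):
        writer.writerow([i])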
I have 2 directories: one with txt files and the other with the corresponding JSON (metadata) files (around 90,000 of each). There is one JSON file for each txt file, and they share the same name (they don't share any other fields). I am trying to index all these files in Apache Solr.
The txt files just have plain text; I mapped each line to a field called 'sentence' and included the file name as a field using the Data Import Handler. No problems here.
The JSON files have metadata, 3 tags: a URL, an author and a title (for the content in the corresponding txt file).
When I index the JSON files (I just used the _default schema and posted the fields to the schema, as explained in the official Solr tutorial), I don't know how to get the file name into the index as a field. As far as I know, there's no way to use the Data Import Handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name; it is the only way I can associate the metadata with each sentence in the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
As @MatsLindh mentioned in the comments, I used pysolr to do the indexing and to get the filename in as a field. It's pretty basic, but I thought I'd post what I did, since pysolr doesn't have much documentation.
So, here's how you use pysolr to index multiple JSON files while also indexing their file names. This method can be used if you have your data files and your metadata files under the same file name (but different extensions) and you want to link them together somehow, as in my case.
Open a connection to your Solr instance using the pysolr.Solr command.
Loop through the directory containing your files, and get the filename of each file using os.path.basename and store it in a variable (after removing the extension, if necessary).
Read the file's JSON content into another variable.
Pysolr expects whatever is to be indexed to be stored in a list of dictionaries where each dictionary corresponds to one record.
Store all the fields you want to index in a dictionary (solr_content in my code below) while making sure the keys match the field names in your managed-schema file.
Append the dictionary created in each iteration to a list (list_for_solr in my code).
Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
That's all there is to it! Here's the code.
import json
import os
from glob import iglob

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
folderpath = 'directory-where-the-files-are-present'  # placeholder
list_for_solr = []

for filepath in iglob(os.path.join(folderpath, '*.meta')):
    with open(filepath, 'r') as file:
        filename = os.path.basename(filepath)
        # filename is xxxx.yyyy.meta, so keep only xxxx.yyyy
        filename_without_extension = '.'.join(filename.split('.')[:2])
        content = json.load(file)
        solr_content = {}
        solr_content['authors'] = content['authors']
        solr_content['title'] = content['title']
        solr_content['url'] = content['url']
        solr_content['filename'] = filename_without_extension
        list_for_solr.append(solr_content)

# Send everything to Solr in one request
solr.add(list_for_solr)
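Depending on your pysolr version you may also need a commit (either solr.commit() afterwards, or always_commit=True when constructing the Solr object) before the documents become searchable. Once indexed, you can look records up by that field, for example (the filename value here is just a placeholder):

# Fetch the metadata document for one file by its stored filename (placeholder value)
results = solr.search('filename:"xxxx.yyyy"')
for doc in results:
    print(doc['title'], doc['url'])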
I have a question. I know that Logstash lets us input csv/log files and filter them using separators and columns, and that it outputs into Elasticsearch so the data can be used by Kibana. However, after writing the conf file, do I need to specify an index pattern using a command like:
curl -XPUT 'http://localhost:9200/test' -d '...'
Because I know that when you have a JSON file, you have to define the mapping, etc. Do I need to do this step for CSV files and other non-JSON files? Sorry for asking, I just want to clear this up.
When you insert documents into a new elasticsearch index, a mapping is created for you. This may not be a good thing, as it's based on the initial value of each field. Imagine a field that normally contains a string, but the initial document contains an integer - now your mapping is wrong. This is a good case for creating a mapping.
If you insert documents through logstash into an index named logstash-YYYY-MM-DD (the default), logstash will apply its own mapping. It will use any pattern hints you gave it in grok{}, e.g.:
%{NUMBER:bytes:int}
and it will also make a "raw" (not analyzed) version of each string, which you can access as myField.raw. This may also not be what you want, but you can make your own mapping and provide it as an argument in the elasticsearch{} output stanza.
You can also make templates, which elasticsearch will apply when an index pattern matches the template definition.
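For example, a rough sketch of creating such a template from Python with the elasticsearch client; the template name, index pattern, field, and the legacy (pre-6.x) mapping syntax are all assumptions to adapt to your Elasticsearch version:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Legacy-style template: applied to every new index whose name matches "logstash-*".
template_body = {
    "template": "logstash-*",
    "mappings": {
        "_default_": {
            "properties": {
                # Force this field to integer regardless of the first value indexed
                "bytes": {"type": "integer"}
            }
        }
    }
}

es.indices.put_template(name="force_bytes_integer", body=template_body)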
So, you only need to create a mapping if you don't like the default behaviors of elasticsearch or logstash.
Hope that helps.