I have a csv file containing labels where VideoID is the index.
I have a set of extracted audio features (saved in .pkl) that needs to be vstack in order to get one master feature file. How can i make sure that i stack the pkl files following the index (VideoID) in the csv file? The end objective is to concatenate the audio features with the text feature to perform classification. I need them to point to correct labels.
Related
There is a nested json with very deep structure. File is of the format json.gz size 3.5GB. Once this file is uncompressed it is of size 100GB.
This json file is of the format, where Multiline = True (if this condition is used to read the file via spark.read_json then only we get to see the proper json schema).
Also, this file has a single record, in which it has two columns of Struct type array, with multilevel nesting.
How should I read this file and extract information. What kind of cluster / technique to use to extract relevant data from this file.
Structure of the JSON (multiline)
This is a single record. and the entire data is present in 2 columns - in_netxxxx and provider_xxxxx
enter image description here
I was able to achieve this in a bit different way.
Use the utility - Big Text File Splitter -
BigTextFileSplitter - Withdata Softwarehttps://www.withdata.com › big-text-file-splitter ( as the file was huge and multiple level nested) the split record size I kept was 500. This generated around 24 split files of around 3gb each. Entire process took 30 -40 mins.
Processed the _corrupt_record seperately - and populated the required information.
Read the each split file in a using - this option removes the _corrupt_record and also removes the null rows.
spark.read.option("mode", "DROPMALFORMED").json(file_path)
Once the information is fetched form each file, we can merge all the files into a single file, as per standard process.
I have an API called VerifyIdentity which returns true or false for an ID.
I also have a CSV file and all the IDs in the file are valid IDs and should be returned True by VerifyIdentity API.
I want to create a feature file to test all the IDs. Is there a way to loop on that CSV file? I know that the cucumber outline can do very similar thing, but I can't manually type those IDs in my tests since there are too many IDs.
Thank you!
By the way, the IDs in the CSV are all the numbers between 1 and 100000. It should also work if there is a way to create a loop-like scenario
Reading CSV files from a Scenario written in Gherkin is not supported.
However this feature is supported in gherkin with qaf. You can have examples in CSV/Excel/XML/json/DB
Scenario Outline: Search Keyword using data from file
When I search for "<searchKey>"
Then I get at least "<number>" results
Then it should have "<searchResult>" in search results
Examples: {'datafile':'resources/testdata.csv'}
where your csv file may look like below:
searchKey,searchResult,number,TestCaseId
https://qmetry.github.io/qaf/latest/gherkin_client.html
I have 2 directories: 1 with txt files and the other with corresponding JSON (metadata) files (around 90000 of each). There is one JSON file for each CSV file, and they share the same name (they don't share any other fields). I am trying to index all these files in Apache solr.
The txt files just have plain text, I mapped each line to a field call 'sentence' and included the file name as a field using the data import handler. No problems here.
The JSON file has metadata: 3 tags: a URL, author and title (for the content in the corresponding txt file).
When I index the JSON file (I just used the _default schema, and posted the fields to the schema, as explained in the official solr tutorial), I don't know how to get the file name into the index as a field. As far as i know, that's no way to use the Data import handler for JSON files. I've read that I can pass a literal through the bin/post tool, but again, as far as I understand, I can't pass in the file name dynamically as a literal.
I NEED to get the file name, it is the only way in which I can associate the metadata with each sentence in the txt files in my downstream Python code.
So if anybody has a suggestion about how I should index the JSON file name along with the JSON content (or even some workaround), I'd be eternally grateful.
As #MatsLindh mentioned in the comments, I used Pysolr to do the indexing and get the filename. It's pretty basic, but I thought I'd post what I did as Pysolr doesn't have much documentation.
So, here's how you use Pysolr to index multiple JSON files, while also indexing the file name of the files. This method can be used if you have your files and your metadata files with the same filename (but different extensions), and you want to link them together somehow, like in my case.
Open a connection to your Solr instance using the pysolr.Solr command.
Loop through the directory containing your files, and get the filename of each file using os.path.basename and store it in a variable (after removing the extension, if necessary).
Read the file's JSON content into another variable.
Pysolr expects whatever is to be indexed to be stored in a list of dictionaries where each dictionary corresponds to one record.
Store all the fields you want to index in a dictionary (solr_content in my code below) while making sure the keys match the field names in your managed-schema file.
Append the dictionary created in each iteration to a list (list_for_solr in my code).
Outside the loop, use the solr.add command to send your list of dictionaries to be indexed in Solr.
That's all there is to it! Here's the code.
solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
folderpath = directory-where-the-files-are-present
list_for_solr = []
for filepath in iglob(os.path.join(folderpath, '*.meta')):
with open(filepath, 'r') as file:
filename = os.path.basename(filepath)
# filename is xxxx.yyyy.meta
filename_without_extension = '.'.join(filename.split('.')[:2])
content = json.load(file)
solr_content = {}
solr_content['authors'] = content['authors']
solr_content['title'] = content['title']
solr_content['url'] = content['url']
solr_content['filename'] = filename_without_extension
list_for_solr.append(solr_content)
solr.add(list_for_solr)
When using the Model Derivative API I successfully generate an obj representation from a step file. But within that process are some quirks that I do not fully understand:
The Post job has a output.advanced.exportFileStructure property which can be set to "multiple" and a output.advanced.objectIds property which lets you specify the which parts of the model you would like to extract. From the little that the documentation states, I would expect to receive one obj file per requested objectid. Which from my experience is not the case. So does this only work for compressed files like .iam and .ipt?
Well, anyway, instead I get one obj file for all objectIds with one polygon group per objectId. The groups are named (duh!), so I would expect them to be named like their objectId but it seams like the numbers are assigned in a random way. So how should I actually map an objectId to its corresponding 3d part? Is there any way to link the information from GET :urn/metadata/:guid/properties back to their objects?
I hope somebody can shine light on this. If you need more information I can provide you with the original step file, the obj and my server log.
You misunderstood the objectIds property of the derivatives API: specifying that field allows you to export only specific components to a single obj, for example your car model has 1000 different components, but you just want to export components that represent the engine: [34, 56, 76] (I just made those up...). If you want to export each objectId to a separate obj file, you need to fire multiple jobs. the "exportFileStructure" option only applies to composite designs (i.e. assemblies) single: creates one OBJ file for all the input files (assembly file), multiple: creates a separate OBJ file for each object. A step file is not a composite design.
As you noticed the obj groups are named randomly. As far as I know there is no easy reliable way to map a component in the obj file to the original objectId because .obj is a very basic format and it doesn't support metadata. You could use a geometric approach (finding where is the component in space, use bounding boxes, ...) to achieve the mapping but it could be challenging with complex models.
I'm using MongoDB because Meteor doesn't support-officially-anything else.
The main goal is to upload CSV files, parse them in Meteor and import the data to database.
The inserted data size can be 50-60GB or maybe more per file but I can't even insert anything bigger than 16MB due to document size limit. Also, even 1/10 of the insertion takes a lot of time.
I'm using CollectionFS for CSV file upload on the client. Therefore, I tried using CollectionFS for the data itself as well but it gives me an "unsupported data" error.
What can I do about this?
Edit: Since my question creates a confusion about storing data techniques, I want to clear something: I'm not interested in uploading the CSV file; I'm interested in storing the data in the file. I want to collect all user's data in one place and I want to fetch data with the lowest resources.
You could insert the csv file as a collection (the filename can become the collection name), with each row of the csv as a document. This will get around the 16 MB per document size limit. You may end up with a lot of collections, but that is okay. Another collection could keep track of the filename to collection name mapping.
In CollectionFS you can save file directly filesystem, just add the proper package and create your collection like these:
Csv = new FS.Collection("csv", {
stores: [
new FS.Store.FileSystem("csv","/home/username/csv")
],
filter: {
allow: {
extensions: ['csv']
}
}
});