I want to do relation extraction using doccano. I have already annotated the entities and relations in doccano, and the exported data is in JSONL format. I want to convert it into spaCy's training format so that I can train a BERT-based model with spaCy on the annotated data.
Drop this annotation and re-annotate the data with NER Annotator for spaCy.
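If re-annotating is not practical, the exported JSONL can be reshaped in plain Python first. The sketch below is only an illustration under assumptions: it assumes each exported line carries "text", "entities" (with "id", "label", "start_offset", "end_offset") and "relations" (with "from_id", "to_id", "type"), which should be checked against the actual export. The function name doccano_to_spacy is hypothetical. The (start, end, label) entity tuples follow the character-offset form spaCy accepts for entities; the "relations" key is a custom addition, since relation extraction in spaCy needs a custom component.

```python
import json


def doccano_to_spacy(jsonl_lines):
    """Convert doccano relation-export lines to (text, annotations) tuples.

    Assumed input fields (verify against your export): "text", "entities",
    "relations", with the sub-fields used below.
    """
    examples = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)

        # Remember where each doccano entity id lands in the entity list,
        # so relations can refer to entities by position.
        index_by_id = {}
        entities = []
        for ent in record.get("entities", []):
            index_by_id[ent["id"]] = len(entities)
            entities.append((ent["start_offset"], ent["end_offset"], ent["label"]))

        # Each relation becomes (head entity index, tail entity index, type).
        relations = [
            (index_by_id[rel["from_id"]], index_by_id[rel["to_id"]], rel["type"])
            for rel in record.get("relations", [])
        ]
        examples.append((record["text"], {"entities": entities, "relations": relations}))
    return examples
```

The function takes any iterable of lines, so an open file handle on the exported .jsonl file can be passed directly.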
I don't want to build a GeoMesa datastore; I just want to use the GeoMesa Spark Core/SQL module to run some spatial analysis on Spark. My data sources are GeoJSON files on HDFS. However, I have to create a SpatialRDD through a SpatialRDDProvider.
There is a Converter RDD Provider example in the GeoMesa documentation:
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val exampleConf = ConfigFactory.load("example.conf").root().render()
val params = Map(
  "geomesa.converter"        -> exampleConf,
  "geomesa.converter.inputs" -> "example.csv",
  "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
  "geomesa.sft.name"         -> "example")
val query = new Query("example")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
I can use GeoMesa's JSON converter to create the SpatialRDD. However, it seems necessary to specify all field names and types in both the geomesa.sft parameter and a converter config file. If I have many GeoJSON files, I have to do this manually for each one, which is obviously very inconvenient.
Is there any way for the GeoMesa converter to infer the field names and types from the file?
Yes, GeoMesa can infer the type and converter. For Scala/Java, see this unit test. Alternatively, the GeoMesa CLI tools can be used ahead of time to persist the type and converter to reusable files, using the convert command (type inference is described in the linked ingest command).
What I'm trying to do is similar to the Stack Overflow question here: basically, converting .seq.gz JSON files to Parquet files with a proper schema defined.
I don't want to infer the schema. I would rather define my own, ideally as Scala case classes so that they can be reused as models by other jobs.
I'm not sure whether I should deserialise my JSON into a case class and let toDS() implicitly convert my data, as below:
spark
  .sequenceFile(input, classOf[IntWritable], classOf[Text])
  .mapValues(
    json => deserialize[MyClass](json.toString) // json to case class instance
  )
  .toDS()
  .write.mode(SaveMode.Overwrite)
  .parquet(outputFile)
...or rather use a Spark DataFrame schema instead, or even a Parquet schema, but I don't know how to do that.
My objective is to have full control over my models and, where possible, to map JSON types (JSON being the poorer format) to Parquet types.
Thanks!
I want to use the dumped weights and model architecture in another framework for testing.
I know that:
model.get_config() can give the configuration of the model
model.to_json() returns a representation of the model as a JSON string, but that representation includes only the architecture, not the weights
model.save_weights(filepath) saves the weights of the model as an HDF5 file
I want to save the architecture as well as the weights in a JSON file.
Keras does not have any built-in way to export the weights to JSON.
Solution 1:
For now, you can do it by iterating over the weights and saving them to a JSON file.
weights_list = model.get_weights()
will return a list of all weight tensors in the model, as Numpy arrays.
Then, all you have to do next is to iterate over this list and write to the file:
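import json
for i, weights in enumerate(weights_list):
    with open("weights_%d.json" % i, "w") as f:  # hypothetical file name, one file per array
        json.dump(weights.tolist(), f)  # NumPy arrays are not JSON serializable, so convert to lists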
Solution 2:
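import json
weights_list = model.get_weights()
# get_weights() returns a plain list of NumPy arrays, so convert each array individually
print(json.dumps([w.tolist() for w in weights_list]))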
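To get the architecture and the weights into one JSON document, the two pieces can be bundled into a single object. The following is a minimal sketch, not a Keras feature: dump_model and to_jsonable are hypothetical helper names, and the "architecture" and "weights" keys are an arbitrary layout chosen for illustration. It only assumes that model.to_json() returns a JSON string and that model.get_weights() returns a list of NumPy arrays, as described above.

```python
import json


def to_jsonable(weights):
    """Turn a NumPy array (or anything with .tolist()) into nested lists."""
    return weights.tolist() if hasattr(weights, "tolist") else weights


def dump_model(architecture_json, weights_list):
    """Bundle an architecture JSON string and a list of weight arrays
    into one JSON string (hypothetical layout, not a Keras format)."""
    payload = {
        "architecture": json.loads(architecture_json),
        "weights": [to_jsonable(w) for w in weights_list],
    }
    return json.dumps(payload)


# Usage with a Keras model (not executed here):
# text = dump_model(model.to_json(), model.get_weights())
```

The other framework then has to rebuild the layers from "architecture" and assign the arrays in "weights" in the same order, since JSON does not preserve array dtypes.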
I want to convert my nested JSON into CSV. I used:
df.write.format("com.databricks.spark.csv").option("header", "true").save("mydata.csv")
This works for flat JSON but not for nested JSON. Is there any way to convert my nested JSON to CSV? Any help will be appreciated, thanks!
When you ask Spark to convert a JSON structure to a CSV, Spark can only map the first level of the JSON.
This happens because CSV is a very simple format: it just assigns a value to a name. That is why {"name1":"value1", "name2":"value2"...} can be represented as a CSV with this structure:
name1,name2, ...
value1,value2,...
In your case, you are converting a JSON with several levels, so the Spark exception is saying that it cannot figure out how to convert such a complex structure into a CSV.
If you try to add only a second level to your JSON, it will work, but be careful. It will remove the names of the second level to include only the values in an array.
You can have a look at this link to see an example for JSON datasets.
As I have no information about the nature of the data, I can't say much more about it. But if you need to write the information as a CSV you will need to simplify the structure of your data.
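One way to simplify the structure is to flatten nested objects into dotted column names before writing. The sketch below shows the idea in plain Python rather than Spark, as an illustration only; flatten and records_to_csv are hypothetical helper names, and the dotted naming scheme is an assumption. Nested objects are expanded recursively; arrays are left untouched and would need their own handling.

```python
import csv
import io


def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def records_to_csv(records):
    """Write flattened records as CSV text with a header row."""
    rows = [flatten(r) for r in records]
    # Use the union of all keys so records with missing fields still fit.
    columns = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, a record such as {"name": "a", "address": {"city": "x"}} ends up with the columns address.city and name.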
Read the JSON file in Spark and create a DataFrame:
val path = "examples/src/main/resources/people.json"
val people = sqlContext.read.json(path)
Save the DataFrame using spark-csv:
people.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("newcars.csv")
Source :
read json
save to csv
I have to develop a MapReduce program that performs a join on two different data sets.
One of them is a CSV file and the other is an Avro file.
I am using MultipleInputs to process both sources. However, to process both data sets in a single reducer, I am converting the Avro data to Text using
new Text(key.datum.toString())
My challenge is to convert the JSON string generated above back into an Avro record in the reducer, as the final output needs to be in Avro format.
Is there a particular function or class that can be used to do this?
If yes, can you please provide an example as well?