When I label images with the labelme interface, the output I get is a JSON file, but I need the result in an image format like PNG, BMP, or JPEG after labeling. Can anyone suggest any code?
import json
from PIL import Image
with open('your.json') as f:
    data = json.load(f)
# Load the file path from the json
imgpath = data['yourkey']
# Place the image path into the open method
img = Image.open(imgpath)
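If the JSON does not store a usable file path, note that labelme normally embeds the image itself, base64-encoded, under the imageData key; a minimal sketch assuming that key is present (it can be null if the file was saved without image data):
import base64
import io
import json

from PIL import Image

with open('your.json') as f:
    data = json.load(f)

# Decode the embedded image and save it in whatever format you need.
img = Image.open(io.BytesIO(base64.b64decode(data['imageData'])))
img.save('your.png')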
Based on the tutorial of the original repository, you can use labelme_json_to_dataset <<JSON_PATH>> -o <<OUTPUT_FOLDER_PATH>>.
To run it from Python or Jupyter, you can use:
import os
def labelme_json_to_dataset(json_path):
    # Output folder name: the JSON path with dots replaced, e.g. "img1.json" -> "img1_json"
    os.system("labelme_json_to_dataset " + json_path + " -o " + json_path.replace(".", "_"))
If you need to do it for multiple images, just loop the function; for example:
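A minimal sketch of such a loop (the annotations directory is just a placeholder):
import glob

# Run the conversion for every labelme JSON file in a folder.
for json_path in glob.glob("path/to/annotations/*.json"):
    labelme_json_to_dataset(json_path)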
Based on the issue, the labelme_json_to_dataset behavior can be reimplemented by using either labelme2voc.py or labelme2coco.py.
You could also use another implementation like labelme2Datasets.
You can also implement your own modification of labelme_json_to_dataset using the labelme library. Basically, you use label_file = labelme.LabelFile(filename=filename) followed by img = labelme.utils.img_data_to_arr(label_file.imageData). An example of such a process would look like this:
import glob
import os
import os.path as osp
import shutil

import labelme
from PIL import Image

def _makedirs(path, force=False):
    # Assumed stand-in for the helper referenced below: create the directory,
    # recreating it from scratch when force=True.
    if force and osp.exists(path):
        shutil.rmtree(path)
    os.makedirs(path, exist_ok=True)

def labelme2images(input_dir, output_dir, force=False, save_img=False, new_size=False):
    """Extract the embedded images from the labelme JSON files in input_dir.

    new_size, if given, is a (new_size_width, new_size_height) tuple.
    """
    if save_img:
        _makedirs(path=osp.join(output_dir, "images"), force=force)
    if new_size:
        new_size_width, new_size_height = new_size
    print("Generating dataset")
    filenames = glob.glob(osp.join(input_dir, "*.json"))
    for filename in filenames:
        # Base name of the JSON file, reused as the image file name.
        base = osp.splitext(osp.basename(filename))[0]
        label_file = labelme.LabelFile(filename=filename)
        img = labelme.utils.img_data_to_arr(label_file.imageData)
        h, w = img.shape[0], img.shape[1]
        if save_img:
            if new_size:
                # PIL expects the target size as (width, height).
                img_pil = Image.fromarray(img).resize((new_size_width, new_size_height))
            else:
                img_pil = Image.fromarray(img)
            img_pil.save(osp.join(output_dir, "images", base + ".jpg"))
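A hypothetical call of the helper above (the directory names and the target size are only placeholders):
labelme2images("path/to/json_dir", "path/to/output_dir", force=True, save_img=True)
# or, resizing every extracted image to 640x480 (width, height):
labelme2images("path/to/json_dir", "path/to/output_dir", save_img=True, new_size=(640, 480))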
I have a CSV file in Amazon S3 which is 62 MB in size (114,000 rows). I am converting it into a Spark Dataset and taking the first 500 rows from it. The code is as follows:
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set=df.load("s3n://"+this.accessId.replace("\"", "")+":"+this.accessToken.replace("\"", "")+"#"+this.bucketName.replace("\"", "")+"/"+this.filePath.replace("\"", "")+"");
set.take(500);
The whole operation takes 20 to 30 sec.
Now I am trying the same thing, but instead of a CSV I am using a MySQL table with 119,000 rows. The MySQL server is on Amazon EC2. The code is as follows:
String url ="jdbc:mysql://"+this.hostName+":3306/"+this.dataBaseName+"?user="+this.userName+"&password="+this.password;
SparkSession spark=StartSpark.getSparkSession();
SQLContext sc = spark.sqlContext();
DataFrameReader df = new DataFrameReader(spark).format("csv").option("header", true);
Dataset<Row> set = sc
.read()
.option("url", url)
.option("dbtable", this.tableName)
.option("driver","com.mysql.jdbc.Driver")
.format("jdbc")
.load();
set.take(500);
This is taking 5 to 10 minutes.
I am running Spark inside the JVM and using the same configuration in both cases.
I can use partitionColumn, numPartitions, etc., but I don't have any numeric column, and another issue is that the schema of the table is unknown to me.
My issue is not how to decrease the required time, as I know that in the ideal case Spark will run in a cluster; what I cannot understand is why there is such a big time difference in the above two cases.
This problem has been covered multiple times on StackOverflow:
How to improve performance for slow Spark jobs using DataFrame and JDBC connection?
spark jdbc df limit... what is it doing?
How to use JDBC source to write and read data in (Py)Spark?
and in external sources:
https://github.com/awesome-spark/spark-gotchas/blob/master/05_spark_sql_and_dataset_api.md#parallelizing-reads
so just to reiterate - by default DataFrameReader.jdbc doesn't distribute data or reads. It uses a single thread on a single executor.
To distribute reads:
use ranges with lowerBound / upperBound:
Dataset<Row> set = sc
    .read()
    .option("partitionColumn", "foo")
    .option("numPartitions", "3")
    .option("lowerBound", 0)
    .option("upperBound", 30)
    .option("url", url)
    .option("dbtable", this.tableName)
    .option("driver", "com.mysql.jdbc.Driver")
    .format("jdbc")
    .load();
or use predicates:
Properties properties = new Properties();
// user / password / driver can be set here if they are not already part of the URL.
Dataset<Row> set = sc
    .read()
    .jdbc(
        url, this.tableName,
        new String[]{"foo < 10", "foo BETWEEN 10 AND 20", "foo > 20"},
        properties
    );
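Since the asker mentioned having no numeric column, note that predicates can be arbitrary SQL conditions; a minimal pyspark sketch of the same idea (not from the original answer; the connection details and the created_at column are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-predicates-sketch").getOrCreate()

# Placeholder connection details.
url = "jdbc:mysql://your-host:3306/your_db"
properties = {"user": "your_user", "password": "your_password", "driver": "com.mysql.jdbc.Driver"}

# Each predicate becomes one partition; "created_at" is a hypothetical non-numeric column.
predicates = [
    "created_at < '2017-01-01'",
    "created_at >= '2017-01-01' AND created_at < '2017-07-01'",
    "created_at >= '2017-07-01'",
]

df = spark.read.jdbc(url=url, table="your_table", predicates=predicates, properties=properties)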
Please follow the steps below.
1. Download a copy of the JDBC connector for MySQL. I believe you already have one.
wget http://central.maven.org/maven2/mysql/mysql-connector-java/5.1.38/mysql-connector-java-5.1.38.jar
2. Create a db-properties.flat file in the format below:
jdbcUrl=jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}
user=<username>
password=<password>
3. Create an empty table first where you want to load the data.
Invoke the Spark shell with the driver class path:
spark-shell --driver-class-path <your path to mysql jar>
Then import all the required packages:
import java.io.{File, FileInputStream}
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
Initialize a Hive context or an SQL context:
val sQLContext = new HiveContext(sc)
import sQLContext.implicits._
import sQLContext.sql
Set some of the properties:
sQLContext.setConf("hive.exec.dynamic.partition", "true")
sQLContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Load the MySQL DB properties from the file:
val dbProperties = new Properties()
dbProperties.load(new FileInputStream(new File("your_path_to/db-properties.flat")))
val jdbcurl = dbProperties.getProperty("jdbcUrl")
Create a query to read the data from your table and pass it to the read method of sqlContext. This is where you can manage your WHERE clause:
val df1 = "(SELECT * FROM your_table_name) as s1"
Pass the jdbcurl, the select query, and the DB properties to the read method:
val df2 = sQLContext.read.jdbc(jdbcurl, df1, dbProperties)
Write it to your table:
df2.write.format("orc").partitionBy("your_partition_column_name").mode(SaveMode.Append).saveAsTable("your_target_table_name")
I have got a CSV file along with a header which has to be read through Spark (2.0.0 and Scala 2.11.8) as a dataframe.
Sample csv data:
Item,No. of items,Place
abc,5,xxx
def,6,yyy
ghi,7,zzz
.........
I'm facing a problem when I try to read this CSV data in Spark as a dataframe, because the header contains a column (No. of items) with the special character ".".
The code with which I try to read the CSV data is:
val spark = SparkSession.builder().appName("SparkExample").getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("file:///INPUT_FILENAME")
Error I'm facing:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to resolve No. of items given [Item,No. of items,Place];
If I remove the "." from the header, I won't get any error. I even tried escaping the character, but it escapes all the "." characters, even those in the data.
Is there any way to escape the special character "." only in the CSV header using Spark code?
#Pooja Nayak, Not sure if this was solved; answering this in the interest of community.
sc: SparkContext
spark: SparkSession
sqlContext: SQLContext
// Read the raw file from localFS as-is.
val rdd_raw = sc.textFile("file:///home/xxxx/sample.csv")
// Drop the first line in first partition because it is the header.
val rdd = rdd_raw.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
// A function to create schema dynamically.
def schemaCreator(header: String): StructType = {
  StructType(header
    .split(",")
    .map(field => StructField(field.trim, StringType, true))
  )
}
// Create the schema for the csv that was read and store it.
val csvSchema: StructType = schemaCreator(rdd_raw.first)
// As the input is CSV, split it at "," and trim away the whitespaces.
val rdd_curated = rdd.map(x => x.split(",").map(y => y.trim)).map(xy => Row(xy:_*))
// Create the DF from the RDD.
val df = sqlContext.createDataFrame(rdd_curated, csvSchema)
The imports that are necessary:
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark._
I am giving you an example that works with pyspark; hopefully the same will work for you, just with some language-related syntax changes.
file =r'C:\Users\e5543130\Desktop\sampleCSV2.csv'
conf = SparkConf().setAppName('FICBOutputGenerator')
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
# Without the deprecated API:
df = sqlContext.read.options(delimiter=",", header="true").csv("cars.csv")
# Or using the older com.databricks.spark.csv format:
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", ",").load("cars.csv")
I am trying to use SparkSession to convert JSON data of a file to RDD with Spark Notebook. I already have the JSON file.
val spark = SparkSession
.builder()
.appName("jsonReaderApp")
.config("config.key.here", configValueHere)
.enableHiveSupport()
.getOrCreate()
val jread = spark.read.json("search-results1.json")
I am very new to spark and do not know what to use for config.key.here and configValueHere.
SparkSession
To get all the "various Spark parameters as key-value pairs" for a SparkSession, "the entry point to programming Spark with the Dataset and DataFrame API," run the following (this uses the Spark Python API; Scala would be very similar).
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
SparkConf().getAll()
or without importing SparkConf:
spark.sparkContext.getConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SparkSession.html
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/spark_session.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html
You can get a deeper level list of SparkSession configuration options by running the code below. Most are the same, but there are a few extra ones. I am not sure if you can change these.
spark.sparkContext._conf.getAll()
SparkContext
To get all the "various Spark parameters as key-value pairs" for a SparkContext, the "Main entry point for Spark functionality," ... "connection to a Spark cluster," ... and "to create RDDs, accumulators and broadcast variables on that cluster," run the following.
import pyspark
from pyspark import SparkConf, SparkContext
spark_conf = SparkConf().setAppName("test")
spark = SparkContext(conf = spark_conf)
SparkConf().getAll()
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html
Spark parameters
You should get a list of tuples that contain the "various Spark parameters as key-value pairs" similar to the following:
[(u'spark.eventLog.enabled', u'true'),
(u'spark.yarn.appMasterEnv.PYSPARK_PYTHON', u'/<yourpath>/parcels/Anaconda-4.2.0/bin/python'),
...
...
(u'spark.yarn.jars', u'local:/<yourpath>/lib/spark2/jars/*')]
Depending on which API you are using, see one of the following:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkConf.html
https://spark.apache.org/docs/latest//api/python/reference/api/pyspark.SparkConf.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html
For a complete list of Spark properties, see:
http://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Setting Spark parameters
Each tuple is ("spark.some.config.option", "some-value") which you can set in your application with:
SparkSession
spark = (
SparkSession
.builder
.appName("Your App Name")
.config("spark.some.config.option1", "some-value")
.config("spark.some.config.option2", "some-value")
.getOrCreate())
sc = spark.sparkContext
SparkContext
spark_conf = (
SparkConf()
.setAppName("Your App Name")
.set("spark.some.config.option1", "some-value")
.set("spark.some.config.option2", "some-value"))
sc = SparkContext(conf = spark_conf)
spark-defaults
You can also set the Spark parameters in a spark-defaults.conf file:
spark.some.config.option1 some-value
spark.some.config.option2 "some-value"
then run your Spark application with spark-submit (pyspark):
spark-submit \
--properties-file path/to/your/spark-defaults.conf \
--name "Your App Name" \
--py-files path/to/your/supporting/pyspark_files.zip \
--class Main path/to/your/pyspark_main.py
This is how it worked for me to add Spark or Hive settings in my Scala code:
{
val spark = SparkSession
.builder()
.appName("StructStreaming")
.master("yarn")
.config("hive.merge.mapfiles", "false")
.config("hive.merge.tezfiles", "false")
.config("parquet.enable.summary-metadata", "false")
.config("spark.sql.parquet.mergeSchema","false")
.config("hive.merge.smallfiles.avgsize", "160000000")
.enableHiveSupport()
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.sql.orc.impl", "native")
.config("spark.sql.parquet.binaryAsString","true")
.config("spark.sql.parquet.writeLegacyFormat","true")
//.config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/dev01_landing_initial_area.db")
.getOrCreate()
}
The easiest way to set some config:
spark.conf.set("spark.sql.shuffle.partitions", 500).
Here spark refers to a SparkSession; that way you can set configs at runtime. This is really useful when you want to change configs again and again to tune some Spark parameters for specific queries.
In simple terms, values set via the config method are automatically propagated to both the SparkConf and the SparkSession's own configuration.
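A minimal pyspark sketch of that propagation (the option value and app name below are only illustrative):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("config-propagation-sketch")
    .config("spark.sql.shuffle.partitions", "8")
    .getOrCreate()
)

# The value set through .config() is visible on the session...
print(spark.conf.get("spark.sql.shuffle.partitions"))
# ...and on the underlying SparkContext's SparkConf.
print(spark.sparkContext.getConf().get("spark.sql.shuffle.partitions"))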
For example, you can refer to
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-settings.html to understand how Hive warehouse locations are set for a SparkSession using the config option.
To learn more about this API, you can refer to: https://spark.apache.org/docs/2.0.1/api/java/org/apache/spark/sql/SparkSession.Builder.html
Every Spark config option is explained at: http://spark.apache.org/docs/latest/configuration.html
You can set these at runtime as in your example above, or through the config file given to spark-submit.
I have a JSON file, nodes, that looks like this:
[{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
,{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
,{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
,{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}]
I am able to read and manipulate this record with Python.
I am trying to read this file in scala through the spark-shell.
From this tutorial, I can see that it is possible to read json via sqlContext.read.json
val vfile = sqlContext.read.json("path/to/file/nodes.json")
However, this results in a corrupt_record error:
vfile: org.apache.spark.sql.DataFrame = [_corrupt_record: string]
Can anyone shed some light on this error? I can read and use the file with other applications, and I am confident it is not corrupt and is sound JSON.
Spark expects the "JSON Lines" format rather than a typical JSON document, but we can tell Spark to read typical JSON by specifying the multiline option (available from Spark 2.2 onwards):
val df = spark.read.option("multiline", "true").json("<file>")
Spark cannot read a top-level JSON array into records, so you have to pass:
{"toid":"osgb4000000031043205","point":[508180.748,195333.973],"index":1}
{"toid":"osgb4000000031043206","point":[508163.122,195316.627],"index":2}
{"toid":"osgb4000000031043207","point":[508172.075,195325.719],"index":3}
{"toid":"osgb4000000031043208","point":[508513,196023],"index":4}
As it's described in the tutorial you're referring to:
Let's begin by loading a JSON file, where each line is a JSON object
The reasoning is quite simple. Spark expects you to pass a file with many JSON entities (one entity per line), so it can distribute their processing (per entity, roughly speaking).
To shed more light on it, here is a quote from the official docs:
Note that the file that is offered as a json file is not a typical
JSON file. Each line must contain a separate, self-contained valid
JSON object. As a consequence, a regular multi-line JSON file will
most often fail.
This format is called JSONL. Basically it's an alternative to CSV.
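A minimal sketch (not from the original answer) of converting an array-style file such as nodes.json into that JSON Lines layout; the file names are placeholders:
import json

# Write one self-contained JSON object per line, which is what Spark expects.
with open("nodes.json") as src, open("nodes.jsonl", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")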
To read the multi-line JSON as a DataFrame:
val spark = SparkSession.builder().getOrCreate()
val df = spark.read.json(spark.sparkContext.wholeTextFiles("file.json").values)
Reading large files in this manner is not recommended; from the wholeTextFiles docs:
Small files are preferred, large file is also allowable, but may cause bad performance.
I ran into the same problem. I used the SparkContext and Spark SQL with the same configuration:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("Simple Application")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.config(conf)
.getOrCreate()
Then, using the Spark context, I read the whole JSON file (JSON here is the path to the file):
val jsonRDD = sc.wholeTextFiles(JSON).map(x => x._2)
You can create a schema for future selects, filters...
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("toid", StringType, nullable = true),
  StructField("point", ArrayType(DoubleType), nullable = true),
  StructField("index", DoubleType, nullable = true)
))
Create a DataFrame using spark sql:
var df: DataFrame = spark.read.schema(schema).json(jsonRDD).toDF()
For testing use show and printSchema:
df.show()
df.printSchema()
sbt build file:
name := "spark-single"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "2.0.2"