I am using a Spark cluster and I want to write a string to a file and save it to the master node using Scala
I looked at this topic and tried some of the suggestions, but I can't find the saved file
You just need to call collect on the RDD to bring the data back to the driver. Then you can write it to the local file system on the driver.
That said, for large datasets it is generally better to write the output in a parallelized fashion for performance.
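For example, a minimal sketch in Scala, assuming an existing SparkContext `sc`; the RDD contents and the output path on the driver are placeholders:

import java.io.PrintWriter

val rdd = sc.parallelize(Seq("first line", "second line"))   // stand-in for your RDD
val collected = rdd.collect()                                 // pulls every element back to the driver
val out = new PrintWriter("/tmp/output.txt")                  // a path on the driver's local file system
try collected.foreach(line => out.println(line)) finally out.close()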
My application requires some kind of data backup and some kind of data exchange between users, so what I want to achieve is the ability to export an entity but not the entire database.
I have found some help, but only for the full database, like this post:
Backup core data locally, and restore from backup - Swift
This applies to the entire database.
I tried to export a JSON file; this might work, except that the entity I'm trying to export contains images as binary data.
So I'm stuck.
Any help with exporting just one entity rather than the full database, or with writing JSON that includes binary data, would be appreciated.
Take a look at protobuf. Apple has an official Swift library for it:
https://github.com/apple/swift-protobuf
Protobuf is an alternate encoding to JSON that has direct support for serializing binary data. There are client libraries for any language you might need to read the data in, or command-line tools if you want to examine the files manually.
How can I batch load 1,000 XML files and convert to JSON with Talend? Is it possible to connect to Neo4J with Talend or maybe connect to an RDF Graph Database like Fluree?
Use the following components: tFileInputXML and tFileOutputJSON.
I'll leave you a link to the documentation; look at the scenarios, which show how to configure each component.
I have worked with these components and reading the XML works; you may have to read the file with several inputs, depending on the structure of the XML.
You can also validate it using an XSD.
Regards.
The technologies I am using to fetch data from my MySQL database are Spark 2.4.4 and Scala. I want to display that data in my Angular8 project. Any help on how to do it? I could not find any documentation on this.
I am not sure this is a Scala/Spark question; it sounds more like a question about the system design of your project.
One solution is to have your Angular8 app read directly from MySQL. There are tons of tutorials online.
Another solution is to use Spark/Scala to read the data and dump it to a CSV/JSON file somewhere, then have Angular8 read that file. The pro is that you can do some transformation before displaying your data; the con is the latency between transformation and display. Once the flat file is read in as JSON, it's up to you how to render that data on the user's screen.
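A rough sketch of that second option in Scala, assuming Spark 2.4.x with the MySQL JDBC driver on the classpath; the connection details, table name and output path are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mysql-export").getOrCreate()

// Read the table over JDBC into a DataFrame.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "my_table")
  .option("user", "user")
  .option("password", "password")
  .load()

// Write one JSON object per line; the Angular8 app can then fetch and render this file.
df.coalesce(1).write.mode("overwrite").json("/data/export/my_table_json")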
Pardon my simple question but I'm relatively new to Spark/Hadoop.
I'm trying to load a bunch of small CSV files into Apache Spark. They're currently stored in S3, but I can download them locally if that simplifies things. My goal is to do this as efficiently as possible. It seems like it would be a shame to have some single-threaded master downloading and parsing a bunch of CSV files while my dozens of Spark workers sit idly. I'm hoping there's an idiomatic way to distribute this work.
The CSV files are arranged in a directory structure that looks like:
2014/01-01/fileabcd.csv
2014/01-01/filedefg.csv
...
I have two years of data, with directories for each day, and a few hundred CSVs inside of each. All of those CSVs should have an identical schema, but it's of course possible that one CSV is awry and I'd hate for the whole job to crash if there are a couple problematic files. Those files can be skipped as long as I'm notified in a log somewhere that that happened.
It seems that every Spark project I have in mind is in this same form and I don't know how to solve it. (e.g. trying to read in a bunch of tab-delimited weather data, or reading in a bunch of log files to look at those.)
What I've Tried
I've tried both SparkR and the Scala libraries. I don't really care which language I need to use; I'm more interested in the correct idioms/tools to use.
Pure Scala
My original thought was to enumerate and parallelize the list of all year/mm-dd combinations so that my Spark workers would each process a day independently (download and parse all CSV files, then stack them on top of each other (unionAll()) to reduce them). Unfortunately, downloading and parsing the CSV files using the spark-csv library can only be done in the "parent"/master job, and not from each child, as Spark doesn't allow job nesting. So that won't work as long as I want to use the Spark libraries to do the importing/parsing.
Mixed-Language
You can, of course, use the language's native CSV parsing to read in each file then "upload" them to Spark. In R, this is a combination of some package to get the file out of S3 followed by a read.csv, and finishing off with a createDataFrame() to get the data into Spark. Unfortunately, this is really slow and also seems backwards to the way I want Spark to work. If all my data is piping through R before it can get into Spark, why bother with Spark?
Hive/Sqoop/Phoenix/Pig/Flume/Flume Ng/s3distcp
I've started looking into these tailored tools and quickly got overwhelmed. My understanding is that many/all of these tools could be used to get my CSV files from S3 into HDFS.
Of course it would be faster to read my CSV files in from HDFS than S3, so that solves some portion of the problem. But I still have tens of thousands of CSVs that I need to parse and am unaware of a distributed way to do that in Spark.
So right now (Spark 1.4) SparkR has support for JSON and Parquet file structures. CSV files can be parsed, but then the Spark context needs to be started with an extra jar (which needs to be downloaded and placed in the appropriate folder; I've never done this myself, but my colleagues have).
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
There is more information in the docs. I expect that a newer Spark release will have more support for this.
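For what it's worth, in newer Spark releases (2.x and later) CSV reading is built into the DataFrame API; a rough Scala sketch, where the bucket path and options are only illustrative (DROPMALFORMED skips rows that don't parse instead of failing the job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-ingest").getOrCreate()

val df = spark.read
  .option("header", "true")              // first line of each file is a header
  .option("inferSchema", "true")         // infer column types from the data
  .option("mode", "DROPMALFORMED")       // drop rows that do not match the schema
  .csv("s3a://my-bucket/2014/*/*.csv")   // glob over the whole directory tree, read in parallel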
If you don't add the spark-csv package, you'll need to either resort to a different file structure or use Python to convert all your files from .csv to .parquet. Here is a snippet from a recent Python talk that does this.
from pyspark.sql import Row

data = sc.textFile(s3_paths, 1200).cache()

# Turn each split line into a Row with named columns.
def caster(x):
    return Row(colname1=x[0], colname2=x[1])

df_rdd = data \
    .map(lambda x: x.split(',')) \
    .map(caster)

ddf = sqlContext.inferSchema(df_rdd).cache()
ddf.write.save('s3n://<bucket>/<filename>.parquet')
Also, how big is your dataset? You may not even need Spark for analysis. Note that, as of right now:
SparkR has only DataFrame support.
no distributed machine learning yet.
for visualisation you will need to convert a distributed dataframe back into a normal one if you want to use libraries like ggplot2.
if your dataset is no larger than a few gigabytes, then the extra bother of learning spark might not be worthwhile yet
it's modest now, but you can expect more from the future
I've run into this problem before (though with reading a large quantity of Parquet files) and my recommendation would be to avoid DataFrames and use RDDs.
The general idiom used was:
Read in a list of the files, with each file as one entry (in the driver). The expected output here is a list of strings.
Parallelize the list of strings and map over them with a custom CSV reader, with the return value being a list of case classes.
You can also use flatMap if at the end of the day you want a data structure like List[weather_data] that could be written out to Parquet or a database; a sketch of the whole idiom follows below.
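A sketch of that idiom, assuming an existing SparkContext `sc`; WeatherData and parseCsvFile are hypothetical names, and the parser below reads local paths (in practice each worker would fetch its file from S3 first):

import org.apache.spark.rdd.RDD
import scala.util.Try

case class WeatherData(station: String, temperature: Double)   // hypothetical record schema

// Custom CSV reader: returns the good rows and skips malformed ones
// (this is also where you would log any file or line that fails to parse).
def parseCsvFile(path: String): List[WeatherData] = {
  val src = scala.io.Source.fromFile(path)
  try src.getLines().flatMap { line =>
    line.split(",") match {
      case Array(station, temp) => Try(WeatherData(station, temp.toDouble)).toOption
      case _                    => None
    }
  }.toList
  finally src.close()
}

// 1. List the files on the driver: a plain list of strings.
val filePaths = List("/data/2014/01-01/fileabcd.csv", "/data/2014/01-01/filedefg.csv")

// 2-3. Parallelize the list and let each worker parse its own files; flatMap gives
// one flat RDD[WeatherData] that can be written to Parquet or a database.
val records: RDD[WeatherData] = sc.parallelize(filePaths).flatMap(parseCsvFile)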
I want to process a ~300 GB JSON file in Hadoop. As far as I understand, a JSON file consists of a single string with data nested in it. Now, if I want to parse the JSON string using Google's GSON, won't Hadoop have to put the entire load on a single node, since the JSON is not logically divisible for it?
How do I partition the file (I can make out the partitions logically by looking at the data) if I want it to be processed in parallel on different nodes? Do I have to break the file up before I load it onto HDFS? Is it absolutely necessary that the JSON is parsed by one machine (or node) at least once?
Assuming you can logically parse the JSON into separate components, you can accomplish this just by writing your own InputFormat.
Conceptually, you can think of each of the logically divisible JSON components as one "line" of data, where each component contains the minimal amount of information that can be acted on independently.
Then you will need to make a class extending FileInputFormat, where you return each of these JSON components.
public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> { ... }
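Fleshing that stub out a bit, a skeleton in Scala, assuming the newer org.apache.hadoop.mapreduce API; JSONComponent and JSONRecordReader are hypothetical, and the reader logic is left as ??? for you to fill in:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class JSONComponent                         // placeholder value type; a real job would typically use a Writable here

class JSONInputFormat extends FileInputFormat[Text, JSONComponent] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, JSONComponent] =
    new JSONRecordReader()
}

// The record reader is where each logically divisible JSON component becomes one record;
// it also has to cope with components that straddle split boundaries, much like LineRecordReader does for lines.
class JSONRecordReader extends RecordReader[Text, JSONComponent] {
  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = ???
  override def nextKeyValue(): Boolean = ???                    // advance to the next JSON component
  override def getCurrentKey(): Text = ???
  override def getCurrentValue(): JSONComponent = ???
  override def getProgress(): Float = ???
  override def close(): Unit = ???
}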
If you can logically divide your giant JSON into parts, do so, and save these parts as separate lines in a file (or as records in a sequence file). Then, if you feed this new file to Hadoop MapReduce, the mappers will be able to process the records in parallel.
So yes, the JSON should be parsed by one machine at least once. This preprocessing phase doesn't need to be performed in Hadoop, though; a simple script can do the work. Use a streaming API to avoid loading a lot of data into memory.
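A sketch of such a preprocessing script, assuming the giant file is a single top-level JSON array; the input and output paths are placeholders. Jackson's streaming parser keeps only one element in memory at a time and writes each element as its own line, so the resulting file splits cleanly for MapReduce:

import java.io.{File, PrintWriter}
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.{JsonNode, ObjectMapper}

object JsonToLines {
  def main(args: Array[String]): Unit = {
    val mapper = new ObjectMapper()
    val parser = mapper.getFactory.createParser(new File("/data/huge.json"))
    val out    = new PrintWriter("/data/huge.jsonl")
    try {
      require(parser.nextToken() == JsonToken.START_ARRAY, "expected a top-level JSON array")
      while (parser.nextToken() != JsonToken.END_ARRAY) {
        val element: JsonNode = mapper.readTree(parser)          // reads exactly one array element
        out.println(mapper.writeValueAsString(element))          // one JSON object per line
      }
    } finally {
      out.close()
      parser.close()
    }
  }
}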
You might find this JSON SerDe useful. It allows Hive to read and write in JSON format. If it works for you, it will be a lot more convenient to process your JSON data with Hive, as you won't have to worry about a custom InputFormat that reads your JSON data and creates splits for you.