I am facing a problem:
I want to parse a bunch of HTML files in HDFS, and I want to use Spark to process them. For convenience, I want to use pyspark, and I would also like to use the powerful BeautifulSoup package to parse the HTML files. Is it possible to do that? And how can I read the files from HDFS so that BeautifulSoup can parse them?
Assuming the HTML files are in HDFS, you could pipe a list of the file names into the driver, assigning each name a random number between 0 and the number of executors in your job. Put this list into an RDD with parallelize, groupBy the random number, and then do a mapPartitions. The function you pass to mapPartitions processes each row of its partition by reading the corresponding file from HDFS as text, handing the text to BeautifulSoup, and doing your processing; the returned results are captured in a new RDD. Keep in mind that BeautifulSoup must be installed on each of your worker nodes or this will not work. A rough sketch of this approach is shown below.
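A rough sketch of that idea, assuming BeautifulSoup and a Python HDFS client (here pyarrow's HadoopFileSystem, one possible choice) are installed on every worker node; the file list, executor count and extracted field (the page title) are purely illustrative:
# Sketch only: paths, executor count and the HDFS client choice are assumptions
from bs4 import BeautifulSoup
from pyarrow import fs
from pyspark import SparkContext

sc = SparkContext(appName="html-parsing")

file_paths = ["/data/html/page1.html", "/data/html/page2.html"]  # list piped into the driver
num_executors = 4

# One partition per executor; Spark's partitioning here stands in for the
# manual "random number + groupBy" bucketing described above.
paths_rdd = sc.parallelize(file_paths, num_executors)

def parse_partition(paths):
    hdfs = fs.HadoopFileSystem("default")  # connect once per partition
    for path in paths:
        html = hdfs.open_input_stream(path).read().decode("utf-8")
        soup = BeautifulSoup(html, "html.parser")
        yield (path, soup.title.string if soup.title else None)

results = paths_rdd.mapPartitions(parse_partition).collect()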
I am trying to save a data frame into a document, but it fails with the error below:
java.lang.ClassNotFoundException: Failed to find data source: docx. Please find packages at http://spark.apache.org/third-party-projects.html
My code is below:
#f_data is my dataframe with data
f_data.write.format("docx").save("dbfs:/FileStore/test/test.csv")
display(f_data)
Note that I can save files in CSV, text and JSON formats, but is there any way to save a DOCX file using pyspark?
My question: is there support for saving data in DOC/DOCX format?
If not, is there any way to store the file, for example by writing a file stream object into a particular folder/S3 bucket?
In short: no, Spark does not support the DOCX format out of the box. You can still collect the data onto the driver node (e.g. into a pandas DataFrame) and work from there.
Long answer:
A document format like DOCX is meant for presenting information in small tables with style metadata. Spark focuses on processing large amounts of data at scale, and it does not support the DOCX format out of the box.
If you want to write DOCX files programmatically, you can (see the sketch after this list):
Collect the data into a pandas DataFrame: pd_f_data = f_data.toPandas()
Use a Python package to create the DOCX document and save it into a stream. See this question: Writing a Python Pandas DataFrame to Word document
Upload the stream to an S3 blob, for example using boto: Can you upload to S3 using a stream rather than a local file?
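A minimal sketch of those three steps, assuming the python-docx and boto3 packages are available on the driver; the bucket name and object key are hypothetical:
import io

import boto3
from docx import Document

# 1. Collect the Spark DataFrame (f_data from the question) onto the driver
pd_f_data = f_data.toPandas()

# 2. Build the DOCX document in memory with python-docx
doc = Document()
table = doc.add_table(rows=1, cols=len(pd_f_data.columns))
for i, col in enumerate(pd_f_data.columns):
    table.rows[0].cells[i].text = str(col)
for _, row in pd_f_data.iterrows():
    cells = table.add_row().cells
    for i, value in enumerate(row):
        cells[i].text = str(value)

buffer = io.BytesIO()
doc.save(buffer)   # python-docx can write to a file-like object
buffer.seek(0)

# 3. Upload the in-memory stream to S3 (bucket/key are hypothetical)
boto3.client("s3").upload_fileobj(buffer, "my-bucket", "reports/test.docx")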
Note: if your data has more than a hundred rows or so, ask the recipients how they are going to use it. Use DOCX for reporting, not as a file transfer format.
I have a gzipped JSON file that contains an array of JSON objects, something like this:
[{"Product":{"id"1,"image":"/img.jpg"},"Color":"black"},{"Product":{"id"2,"image":"/img1.jpg"},"Color":"green"}.....]
I know this is not the ideal data format to read into Scala; however, there is no alternative but to process the feed in this form.
I have tried :
spark.read.json("file-path")
which seems to take a long time (it processes very quickly for MBs of data, but takes very long for GBs' worth), probably because Spark is not able to split the file and distribute it across the other executors.
I wanted to see if there is any way to preprocess this data and load it into Spark as a DataFrame.
The functionality I want seems to be similar to: Create pandas dataframe from json objects. But I wanted to see if there is a Scala alternative that could do something similar and convert the data into a Spark RDD/DataFrame.
You can read the "gzip" file using spark.read().text("gzip-file-path"). Since Spark API's are built on top of HDFS API , Spark can read the gzip file and decompress it to read the files.
https://github.com/mesos/spark/blob/baa30fcd99aec83b1b704d7918be6bb78b45fbb5/core/src/main/scala/spark/SparkContext.scala#L239
However, gzip is not splittable, so Spark creates an RDD with a single partition. Hence, reading gzip files with Spark alone does not make much sense.
You may decompress the gzip file and read the decompressed files to get the most out of the distributed processing architecture.
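For illustration, a minimal preprocessing sketch using only the Python standard library (the same idea applies in Scala or any other preprocessing tool): it decompresses the gzip, parses the single JSON array and rewrites it as line-delimited JSON that Spark can split; the file paths are hypothetical.
import gzip
import json

# The gzipped file holds one large JSON array, so load it in one go
with gzip.open("input/feed.json.gz", "rt") as src:
    records = json.load(src)

# Rewrite as line-delimited JSON, which is splittable across executors
with open("output/feed.ndjson", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")

# Afterwards: df = spark.read.json("output/feed.ndjson")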
It appeared to be a problem with the data format being handed to Spark for processing. I had to preprocess the data into a Spark-friendly format and run the Spark jobs over that. This is the preprocessing I ended up doing: https://github.com/dipayan90/bigjsonprocessor/blob/master/src/main/java/com/kajjoy/bigjsonprocessor/Application.java
How can I make JMeter read the second sheet of my CSV?
I want to use CSV Data Set Config.
Normally, it reads the first line of the first sheet, but is there any way to be a bit more flexible?
The CSV file format doesn't have "sheets"; it is a plain text file that uses delimiters to represent structured data.
If you are trying to get data from, for example, a Microsoft Excel file, unfortunately you won't be able to do it using the CSV Data Set Config. The easiest option would be to export the data as separate plain-text CSV files.
If you don't have the possibility to do the export, you can still access the data in Excel files, but it will be a little bit more tricky, as you will have to use JSR223 Test Elements, the Groovy language and the Apache POI libraries.
More information:
Busy Developers' Guide to HSSF and XSSF Features
How to Extract Data From Files With JMeter
Currently you can't do this with the CSV Data Set Config alone; you should add external code, for example using Apache Commons CSV.
Download the jar file, place it in the JMETER_HOME/lib folder, and then write the code in a JSR223 element.
Examples exist; for instance, the following code gets the second record:
import java.io.FileReader;
import java.util.Iterator;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

FileReader reader = new FileReader("path/to/file.csv");
// parse() returns a CSVParser; use its iterator to step through the records
Iterator<CSVRecord> records = CSVFormat.RFC4180.parse(reader).iterator();
records.next();                          // skip the first record
CSVRecord secondRecord = records.next();
// String columnOne = secondRecord.get(0);
I have a very difficult situation:
I need to parse a bunch of HTML files in pyspark, but I still want to use BeautifulSoup to parse the HTML. The dilemma is:
If I save these HTML files in HDFS and use pyspark to read them, I can only read them as an RDD, and I cannot pass an RDD as the input to BeautifulSoup;
If I save these HTML files locally and use BeautifulSoup to parse them, the power of pyspark is not being used.
How can I do it?
I would suggest writing a UDF in PySpark that takes the HTML column and returns the fields extracted from the HTML, and then calling it against a DataFrame or RDD, whichever best fits your problem.
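A minimal sketch of that suggestion, assuming BeautifulSoup is installed on every worker; the HDFS path, column names and the extracted field (the page title) are illustrative:
from bs4 import BeautifulSoup
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=StringType())
def extract_title(html):
    # Runs on the workers, one HTML string at a time
    if html is None:
        return None
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

# wholeTextFiles yields (path, content) pairs, so the HTML text ends up in a column
df = spark.sparkContext.wholeTextFiles("hdfs:///data/html/*.html").toDF(["path", "html"])
df.select("path", extract_title("html").alias("title")).show()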
I need to read a bunch of JSON files from an HDFS directory. After I'm done processing, Spark needs to place the files in a different directory. In the meantime, there may be more files added, so I need a list of files that were read (and processed) by Spark, as I do not want to remove the ones that were not yet processed.
The function read.json converts the files immediately into DataFrames, which is cool, but it does not give me the file names the way wholeTextFiles does. Is there a way to read JSON data while also getting the file names? Or is there a conversion from an RDD (with JSON data) to a DataFrame?
From version 1.6 on, you can use input_file_name() to get the name of the file each row came from. The names of all the files that were read can then be obtained via a distinct on that column.
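A minimal sketch using the SparkSession API (input_file_name itself is available from 1.6 on); the HDFS path and column name are illustrative:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Tag every row with the file it came from
df = spark.read.json("hdfs:///incoming/json/*.json") \
          .withColumn("source_file", input_file_name())

# ... process df ...

# Distinct file names that were actually read, e.g. to move them afterwards
processed_files = [r.source_file for r in df.select("source_file").distinct().collect()]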