How to let Spark 2.0 read multiple parquet folders like CSV

I have some daily data to save to multiple folders (mostly partitioned by time). I have two formats to store the files, Parquet and CSV; I would like to use the Parquet format to save some space.
The folder structure is as follows:
[root@hdp raw]# tree
.
├── entityid=10001
│   └── year=2017
│       └── quarter=1
│           └── month=1
│               ├── day=6
│               │   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
│               └── day=7
│                   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
├── entityid=100055
│   └── year=2017
│       └── quarter=1
│           └── month=1
│               ├── day=6
│               │   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
│               └── day=7
│                   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
├── entityid=100082
│   └── year=2017
│       └── quarter=1
│           └── month=1
│               ├── day=6
│               │   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
│               └── day=7
│                   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
└── entityid=10012
    └── year=2017
        └── quarter=1
            └── month=1
                ├── day=6
                │   └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
                └── day=7
                    └── part-r-00000-84f964ec-f3ea-46fd-9fe6-8b36c2433e8e.snappy.parquet
Now I have a Python list that stores all the folders to be read; each run needs to read only some of the folders, based on filter conditions.
folderList=df_inc.collect()
folderString=[]
for x in folderList:
    folderString.append(x.folders)
In [44]: folderString
Out[44]:
[u'/data/raw/entityid=100055/year=2017/quarter=1/month=1/day=7',
u'/data/raw/entityid=10012/year=2017/quarter=1/month=1/day=6',
u'/data/raw/entityid=100082/year=2017/quarter=1/month=1/day=7',
u'/data/raw/entityid=100055/year=2017/quarter=1/month=1/day=6',
u'/data/raw/entityid=100082/year=2017/quarter=1/month=1/day=6',
u'/data/raw/entityid=10012/year=2017/quarter=1/month=1/day=7']
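As a side note, the collect-and-append loop above can be collapsed into a single list comprehension. A minimal plain-Python sketch, using a namedtuple as a stand-in for the Row objects that `df_inc.collect()` returns (the paths below are taken from the listing above):

```python
from collections import namedtuple

# Stand-in for the Row objects returned by df_inc.collect(); each row
# carries one partition path in its 'folders' field.
Row = namedtuple("Row", ["folders"])
folderList = [
    Row("/data/raw/entityid=100055/year=2017/quarter=1/month=1/day=7"),
    Row("/data/raw/entityid=10012/year=2017/quarter=1/month=1/day=6"),
]

# Equivalent to the append loop, in one line
folderString = [x.folders for x in folderList]
print(folderString)
```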
The files were written by:
df_join_with_time.coalesce(1).write.partitionBy("entityid","year","quarter","month","day").mode("append").parquet(rawFolderPrefix)
When I try to read the folders stored in folderString via df_batch=spark.read.parquet(folderString), the error java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String is thrown.
If I save the files in CSV format and read them with the code below, it works fine. Is there any way to read a list of parquet folders? Much appreciated!
In [46]: folderList=df_inc.collect()
...: folderString=[]
...:
...: for x in folderList:
...:     folderString.append(x.folders)
...: df_batch=spark.read.csv(folderString)
...:
In [47]: df_batch.show()
+------------+---+-------------------+----------+----------+
| _c0|_c1| _c2| _c3| _c4|
+------------+---+-------------------+----------+----------+
|6C25B9C3DD54| 1|2017-01-07 00:00:01|1483718401|1483718400|
|38BC1ADB0164| 3|2017-01-06 00:00:01|1483632001|1483632000|
|38BC1ADB0164| 3|2017-01-07 00:00:01|1483718401|1483718400|

You are facing a misunderstanding of partitioning in Hadoop and Parquet.
See, I have a simple file structure partitioned by year-month. It is like this:
my_folder
.
├── year-month=2016-12
│   └── my_files.parquet
└── year-month=2016-11
    └── my_files.parquet
If I read from my_folder without any filter in my DataFrame reader, like this:
df = spark.read.parquet("path/to/my_folder")
df.show()
If you check the Spark DAG visualization, you can see that in this case it will read all the partitions, as you said:
In the case above, each point in the first square is one partition of my data.
But if I change my code to this:
from pyspark.sql.functions import col, lit

df = spark.read.parquet("path/to/my_folder")\
       .filter((col('year-month') >= lit(my_date.strftime('%Y-%m'))) &
               (col('year-month') <= lit(my_date.strftime('%Y-%m'))))
The DAG visualization will show how many partitions I'm using:
So, if you filter by the partition column, you will not read all the files, just the ones you need, and you don't need the workaround of reading folder by folder.
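The pruning Spark performs can also be mimicked outside Spark: if you already hold candidate partition paths, you can filter them by parsing the Hive-style key=value segments before reading anything. A minimal plain-Python sketch (the paths are examples from the question; this is an illustration, not Spark's actual pruning code):

```python
def partition_values(path):
    """Parse Hive-style key=value segments from a partition path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

paths = [
    "/data/raw/entityid=10012/year=2017/quarter=1/month=1/day=6",
    "/data/raw/entityid=10012/year=2017/quarter=1/month=1/day=7",
]

# Keep only the day=7 partitions, analogous to filtering on the
# partition column before Spark lists the files.
selected = [p for p in paths if partition_values(p).get("day") == "7"]
print(selected)
```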

I got this solved by:
df = spark.read.parquet(folderString[0])
y = 0
for x in folderString:
    if y > 0:
        df = df.union(spark.read.parquet(x))
    y = y + 1
It's a very ugly solution; if you have a better idea, please let me know. Many thanks.
A few days later, I found the perfect way to solve the problem:
df=spark.read.parquet(*folderString)
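The `*` is what makes this work: the reader's parquet method accepts multiple path arguments, so `spark.read.parquet(*folderString)` unpacks the list into separate positional arguments instead of passing one list object. The difference can be shown without Spark at all, using a hypothetical stand-in function:

```python
def read_parquet(*paths):
    # Stand-in for a variadic reader: each path must arrive as its
    # own positional argument.
    return list(paths)

folders = ["/data/a", "/data/b"]

# Passing the list directly nests it as a single argument...
assert read_parquet(folders) == [["/data/a", "/data/b"]]
# ...while * unpacks it into separate path arguments.
assert read_parquet(*folders) == ["/data/a", "/data/b"]
```

This is why `spark.read.parquet(folderString)` raised a ClassCastException: the whole list arrived where a single string path was expected.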


Converting multiple Markdown files with multiple CSS files to PDF as one single page, using wkhtmltopdf and Pandoc

My attempts were inspired by the following (the solution from one of the questions didn't work):
Using CSS when converting Markdown to PDF with Pandoc
How to convert Markdown + CSS -> PDF?
How to compile all .md files in a directory into a single .pdf with pandoc, while preserving YAML header data?
And the tutorial:
Converting Markdown to Beautiful PDF with Pandoc
My platform:
Endeavour OS (derived from Arch Linux)
KDE Plasma
MikTeX
Terminator
I have Lua installed.
My goal is to:
convert multiple Markdown files into one single PDF page (not multiple PDF pages);
include multiple CSS files;
use LaTeX macros (for example: $\LaTeX$ in a Markdown file);
preserve YAML header data.
I tried to run the command below (the comments explain each flag; they are placed above the command because comment lines between `\` continuations would break it):
$ cd diários
$ # -f markdown: because of a few LaTeX macros in one Markdown file
$ # -t html5: I need to use HTML 5, but the result warned it is not compatible with the output format (xelatex)
$ # --pdf-engine=xelatex: according to https://jdhao.github.io/2019/05/30/markdown2pdf_pandoc/
$ # --css: multiple CSS files
$ # --lua-filter: to preserve YAML data
$ # *.md: multiple Markdown files
$ # -o: output as PDF
$ pandoc -s \
    -f markdown \
    -t html5 \
    --pdf-engine=xelatex \
    --self-contained \
    --css ../assets/css/colours.css,../assets/css/fonts.css,../assets/css/global.css \
    --lua-filter=../scripts/demote.lua \
    -V geometry:margin=1in \
    *.md \
    -o combined.pdf
Tree
.
├── assets
│   ├── css
│   │   ├── colours.css
│   │   ├── fonts.css
│   │   └── global.css
│   ├── fonts
│   │   └── Latin Modern
│   │       ├── LM-bold-italic.woff
│   │       ├── LM-bold-italic.woff2
│   │       ├── LM-bold.woff
│   │       ├── LM-bold.woff2
│   │       ├── LM-italic.woff
│   │       ├── LM-italic.woff2
│   │       ├── LM-regular.woff
│   │       └── LM-regular.woff2
│   ├── images
│   │   ├── 2021-12-27-camiseta-do-hackathon-do-itau.jpeg
│   │   └── emojis
│   │       └── 32
│   │           ├── 1-emoji-excited.png
│   │           ├── 2-emoji-happy.png
│   │           ├── 3-emoji-neutral.png
│   │           ├── 4-emoji-sad.png
│   │           └── 5-emoji-down.png
│   └── videos
│       └── 2021-12-27 – Comboio no temrinal em dois monitores.mp4
├── diários
│   ├── 2021-05-28.md
│   ├── 2021-12-25.md
│   ├── 2021-12-26.md
│   ├── 2021-12-27.md
│   ├── 2021-12-28.md
│   ├── 2021-12-29.md
│   ├── 2021-12-30.md
│   ├── 2021-12-31.md
│   ├── 2022-01-01.md
│   ├── 2022-01-02.md
│   ├── 2022-01-03.md
│   ├── 2022-01-04.md
│   ├── 2022-01-05.md
│   ├── 2022-01-06.md
│   ├── 2022-01-07.md
│   ├── 2022-01-10.md
│   └── 2022-01-11.md
└── scripts
    └── demote.lua
Errors
pdf-engine xelatex is not compatible with output format html5
And with -t html5, another error:
Error running filter ../scripts/demote.lua:
[string "--[[..."]:227: Constructor for Header failed: [string "--[[..."]:258: attempt to index a nil value (local 'x')
stack traceback:
[C]: in function 'error'
[string "--[[..."]:227: in field 'Header'
../scripts/demote.lua:11: in function 'Pandoc'
With --lua-filter=../scripts/demote.lua removed, it almost worked, but it did not pick up the CSS files. I believe the problem lies in --pdf-engine=xelatex.

Ansible, role not found error

I am trying to run the following playbook against localhost to provision a Vagrant machine:
---
- hosts: all
  become: yes
  roles:
    - base
    - jenkins
I have cloned the necessary roles from GitHub; they reside at the relative path roles/{role name}.
Executing the following command: ansible-playbook -i "localhost," -c local playbook.yml outputs this error:
==> default: ERROR! the role 'geerlingguy.java' was not found in /home/vagrant/provisioning/roles:/home/vagrant/provisioning:/etc/ansible/roles:/home/vagrant/provisioning/roles
==> default:
==> default: The error appears to have been in '/home/vagrant/provisioning/roles/jenkins/meta/main.yml': line 3, column 5, but may
==> default: be elsewhere in the file depending on the exact syntax problem.
==> default:
==> default: The offending line appears to be:
==> default:
==> default: dependencies:
==> default: - geerlingguy.java
==> default: ^ here
I cloned the missing dependency from GitHub and tried placing it at the relative paths roles/java and roles/geerlingguy/java, but neither solved the problem; the error stays the same.
I want to keep all roles locally in the synced provisioning folder, without using ansible-galaxy at runtime, to make the provisioning method as self-contained as possible.
Here is the provisioning folder structure as it is now:
.
├── playbook.yml
└── roles
    ├── base
    │   └── tasks
    │       └── main.yml
    ├── java
    │   ├── defaults
    │   │   └── main.yml
    │   ├── meta
    │   │   └── main.yml
    │   ├── README.md
    │   ├── tasks
    │   │   ├── main.yml
    │   │   ├── setup-Debian.yml
    │   │   ├── setup-FreeBSD.yml
    │   │   └── setup-RedHat.yml
    │   ├── templates
    │   │   └── java_home.sh.j2
    │   ├── tests
    │   │   └── test.yml
    │   └── vars
    │       ├── Debian.yml
    │       ├── Fedora.yml
    │       ├── FreeBSD.yml
    │       ├── RedHat.yml
    │       ├── Ubuntu-12.04.yml
    │       ├── Ubuntu-14.04.yml
    │       └── Ubuntu-16.04.yml
    └── jenkins
        ├── defaults
        │   └── main.yml
        ├── handlers
        │   └── main.yml
        ├── meta
        │   └── main.yml
        ├── README.md
        ├── tasks
        │   ├── main.yml
        │   ├── plugins.yml
        │   ├── settings.yml
        │   ├── setup-Debian.yml
        │   └── setup-RedHat.yml
        ├── templates
        │   └── basic-security.groovy
        ├── tests
        │   ├── requirements.yml
        │   ├── test-http-port.yml
        │   ├── test-jenkins-version.yml
        │   ├── test-plugins-with-pinning.yml
        │   ├── test-plugins.yml
        │   ├── test-prefix.yml
        │   └── test.yml
        └── vars
            ├── Debian.yml
            └── RedHat.yml
You should install or clone all required roles into the roles folder (or into the system roles folder);
ansible-galaxy install -p ROLES_PATH geerlingguy.java
should fix this specific problem.
However, the best practice is to use a requirements.yml file in which you list all the needed roles, and then install them with ansible-galaxy directly from your playbook:
- name: run ansible galaxy
  local_action: command ansible-galaxy install -r requirements.yml --ignore-errors
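For reference, a minimal requirements.yml for this playbook might look like the following (only geerlingguy.java is assumed here, taken from the error message; add any other roles you depend on):

```yaml
# requirements.yml (hypothetical minimal example)
- src: geerlingguy.java
```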
Simple symbolic link works like a charm without any installations:
$ mkdir /home/USER/ansible && ln -s /home/USER/GIT/ansible-root/roles
Here is the solution: the required path for the role is roles/geerlingguy.java/, not roles/geerlingguy/java/

Using Jekyll's Collection relative_directory for organizing pages/collections

I thought that setting the relative_directory property (Jekyll Collection Docs) (GitHub PR) would help me keep my files organized without compromising my desired output, but it seems to be ignored/not used when producing files. I don't want my collections to be in the root directory, because I find it confusing to have ~10 collection folders adjacent to _assets, _data, _includes, _layouts, and others.
Fixes or alternative solutions are welcome, as long as the output is the same and my pages are in their own directory, without needing to put a permalink in the front matter of every single page.
_config.yaml
collections:
  root:
    relative_directory: '_pages/root'
    output: true
    permalink: /:path.html
  root-worthy:
    relative_directory: '_pages/root-worthy'
    output: true
    permalink: /:path.html
  docs:
    relative_directory: '_pages/docs'
    output: true
    permalink: /docs/:path.html
Directory Structure:
├── ...
├── _layouts
├── _pages
│   ├── root
│   │   ├── about.html
│   │   └── contact.html
│   ├── root_worthy
│   │   ├── quickstart.html
│   │   └── seo-worthy-page.html
│   └── docs
│       ├── errors.html
│       └── api.html
├── _posts
└── index.html
Desired output:
├── ...
├── _site
│   ├── about.html
│   ├── contact.html
│   ├── quickstart.html
│   ├── seo-worthy-page.html
│   └── docs
│       ├── errors.html
│       └── api.html
└── ...
It seems that the PR you mention has still not been merged.
As of 3.1.6 and the upcoming 3.2, the Jekyll code is still:
@relative_directory ||= "_#{label}"
But the requester made a plugin that looks like this:
_plugins/collection_relative_directory.rb
module Jekyll
  class Collection
    def relative_directory
      @relative_directory ||= (metadata['relative_directory'] && site.in_source_dir(metadata['relative_directory']) || "_#{label}")
    end
  end
end

How do I get Lazybones to process sub-templates?

Playing with Lazybones for the first time. I've put together a simple project which attempts to include a single sub-template.
Here is the project structure:
.
├── build.gradle
├── gradlew
├── gradlew.bat
├── README.md
└── templates
    ├── groovy-lambda
    │   ├── build.gradle
    │   ├── lazybones.groovy
    │   ├── README.md
    │   ├── src
    │   │   ├── main
    │   │   │   ├── groovy
    │   │   │   │   └── .retain
    │   │   │   └── resources
    │   │   │       └── .retain
    │   │   └── test
    │   │       ├── groovy
    │   │       │   └── .retain
    │   │       └── resources
    │   │           └── .retain
    │   └── VERSION
    └── subtmpl-groovy-lambda-main-class
        ├── GroovyLambdaMainClass.groovy
        ├── lazybones.groovy
        └── VERSION
And I'm including the sub-template like so:
lazybones {
    template "groovy-lambda" includes "groovy-lambda-main-class"
}
The sub-template gets packaged in the main artefact archive:
.
├── build.gradle
├── .lazybones
│   ├── groovy-lambda-main-class-template-1.0-SNAPSHOT.zip
│   └── stored-params.properties
├── README.md
└── src
    ├── main
    │   ├── groovy
    │   └── resources
    └── test
        ├── groovy
        └── resources
However, the sub-template never gets processed at template execution time, i.e. the sub-template's lazybones.groovy script doesn't seem to get run.
The whole project is available here on GitHub. To reproduce the issue do:
git clone git@github.com:eddgrant/lazybones-template-aws-groovy-lambda.git
cd lazybones-template-aws-groovy-lambda.git
./gradlew installAllTemplates
cd /tmp
lazybones --verbose create groovy-lambda 1.0-SNAPSHOT groovy-lambda
I'm probably missing something trivial but can't quite figure it out. Most grateful for any pointers.
Everything is working as expected. Sub-templates are only used by the lazybones generate command, which in turn works only once you have created a Lazybones-based project.
The classic example is something like a Grails or Rails project in which you would use the generate command to create new controllers or domain classes.

How can I get HTML reports that I host on Github Pages to link to stylesheets in another directory?

I'm trying to create a set of project pages using Github Pages. My main page is a copy of my project's README, which I generated through Github's auto-page generator. Under Project Health Metrics, I link to two HTML reports: one is a CodeNarc report (at health/codenarc/main.html) and the other is a Jacoco report (at health/jacoco/index.html).
The CodeNarc report renders fine, but the Jacoco report doesn't as it's not able to load the stylesheet and other resources kept in another directory. I'm keeping everything on a gh-pages branch with a directory structure that looks like this:
.
├── Gemfile
├── Gemfile.lock
├── _config.yml
├── _site
├── bin
├── build
├── build.gradle
├── config
├── docs
├── gradle.properties
├── health
├── images
├── index.html
├── javascripts
├── params.json
├── src
└── stylesheets
My health directory tree appears like this:
health
├── codenarc
│   ├── integrationTest.html
│   ├── main.html
│   └── test.html
├── html
│   └── projectHealth.html
└── jacoco
    ├── .resources
    │   ├── branchfc.gif
    │   ├── branchnc.gif
    │   ├── branchpc.gif
    │   ├── bundle.gif
    │   ├── class.gif
    │   ├── down.gif
    │   ├── greenbar.gif
    │   ├── group.gif
    │   ├── method.gif
    │   ├── package.gif
    │   ├── prettify.css
    │   ├── prettify.js
    │   ├── redbar.gif
    │   ├── report.css
    │   ├── report.gif
    │   ├── session.gif
    │   ├── sort.gif
    │   ├── sort.js
    │   ├── source.gif
    │   └── up.gif
    ├── .sessions.html
    ├── com.github.tagc.semver
    │   ├── SemVerPlugin$_apply_closure1.html
    │   ├── SemVerPlugin.groovy.html
    │   ├── SemVerPlugin.html
    │   ├── SemVerPluginExtension.groovy.html
    │   ├── SemVerPluginExtension.html
    │   ├── Version$Builder.html
    │   ├── Version$Category.html
    │   ├── Version$Parser.html
    │   ├── Version.groovy.html
    │   ├── Version.html
    │   ├── index.html
    │   └── index.source.html
    └── index.html
If it helps, you can explore the tree and check out all the files from my Github repository.
I would like the Jacoco report to be able to access the resources in the .resources folder under health/jacoco, but it doesn't seem able to and I'm not quite sure why. I've tried playing around with this a lot on a private instance running on localhost through Jekyll.
Problem solved, thanks to help from some people in IRC and a lot of playing around.
Jekyll ignores any folders that are hidden (e.g. prepended with a dot or underscore), so it wasn't processing health/jacoco/.resources.
I got around this issue by adding include: ['.resources'] to _config.yml. Don't forget to push this file to the remote gh-pages branch on GitHub, as GitHub uses it to determine what it processes as well.
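For reference, the relevant _config.yml entry (Jekyll's include list whitelists paths that its dot-prefix rule would otherwise skip):

```yaml
# _config.yml
include: ['.resources']
```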
The Jacoco report now renders properly because it can access the stylesheets and other resources it depends on.