I am using Jinja2 templates to create configs. I am trying to reverse the process: read the contents of a rendered file and store the variable values.
For example, if I have a file containing the following:
!
interface gi0/0
**random stuff**
description first desc
**random stuff**
!
interface gi0/1
description second desc
**random stuff**
!
Can I create a template as below and use it to read the files and store the variables?
interface {{INTERFACE}}
description {{DESCRIPTION}}
So I would end up with something like this:
(interface:"gi0/0",description:"first desc"),
(interface:"gi0/1",description:"second desc")
I'm using the SparkKubernetesOperator, which has a template_field called application_file. Normally, on giving this field a file name, Airflow reads that file and templates the Jinja variables in it (just like the bash_command field in the BashOperator).
So this works, and the file contents are shown in the Rendered Template tab with the Jinja variables replaced with the correct values.
start_streaming = SparkKubernetesOperator(
    task_id='start_streaming',
    namespace='spark',
    application_file='user_profiles_streaming_dev.yaml',
    ...
    dag=dag,
)
I want to use different files in the application_file field for different environments, so I used a Jinja template in the field. But when I change application_file to user_profiles_streaming_{{ var.value.env }}.yaml, the rendered output is just user_profiles_streaming_dev.yaml and not the file contents.
I know that recursive Jinja variable replacement is not possible in Airflow, but I was wondering if there is any workaround for having different template files.
What I have tried:
I tried using a different operator and doing an XCom push to read the file contents and send them to the SparkKubernetesOperator. While this was good for reading different files based on the environment, it did not solve the issue of having the Jinja variables replaced.
I also tried making a custom operator which inherits from SparkKubernetesOperator and has a template field application_file_name, thinking that Jinja replacement would take place twice, but this didn't work either.
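For reference, such a custom operator would look roughly like the sketch below (not my exact code; the class name and extra field are illustrative, and the import path assumes the cncf-kubernetes provider). The extra field does get rendered by Jinja, but the rendered value is still just a file name string; Airflow never templates the contents of the file that string points to, which is why the second level of replacement never happens.
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

class SparkAppFileOperator(SparkKubernetesOperator):
    # extra templated field on top of the operator's own ones
    template_fields = (*SparkKubernetesOperator.template_fields, 'application_file_name')

    def __init__(self, *, application_file_name='', **kwargs):
        # the usual SparkKubernetesOperator arguments (application_file,
        # namespace, ...) still go through **kwargs
        super().__init__(**kwargs)
        self.application_file_name = application_file_name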
I made an env file which had the environment details (dev/prod). Then I added this code to the start of my DAG file:
ENV = None
with open('/home/airflow/env', 'r') as env_file:
    value = env_file.read().strip()  # strip any trailing newline
    if value == "":
        raise Exception("ENV FILE NOT PRESENT")
    ENV = value
and then accessed the environment in the code like this
submit_job = SparkKubernetesOperator(
    task_id='submit_job',
    namespace="spark",
    application_file=f"adhoc_{ENV}.yaml",
    do_xcom_push=True,
    dag=dag,
)
This way I could have separate dev and prod files.
I have a text file with the following content:
2020-10-19 12:12:00.001;alan;male;{"id":"255","val":"22","type":"1","location":"12530,95823","status":1}
2020-10-19 12:12:00.001;anna;female;{"id":"256","val":"12","type":"1","location":"12140,25630","status":2}
I want to convert this to a CSV file:
first split on ;, then the last column needs to be treated as a JSON object and its values extracted. The output needs to look like the below:
date,name,gender,id,val,type,location1,location2,status
2020-10-19 12:12:00.001,alan,male,255,22,1,12530,95823,1
2020-10-19 12:12:00.001,anna,female,256,12,1,12140,25630,2
I am a beginner in NiFi and I want to figure out the processors and their configuration to do this conversion. I have tried ConvertRecord and could only separate the content by ;.
It would be a great help if anyone could suggest a way to do this.
Not an easy task, but interesting!
I hope the structure is not changing, e.g. the JSON column getting more attributes!
So I would do this:
1 - SplitText by line (one line per split) - remove the header if any
2 - ExtractText (create an attribute called body with a value of (?s)(^.*$))
3 - UpdateAttribute with two properties:
csv = ${body:substringBefore(';{'):replace(';',',')}
json = ${body:substringAfter(';{')}
4 - ReplaceText - put {${json} as the replacement value, Replacement Strategy: Always Replace
5 - EvaluateJsonPath and extract all the JSON attributes
6 - AttributesToCSV with this attribute list:
csv,id,val,type,location,status
7 - MergeContent - add a header (your column names), Delimiter Strategy = Text and Demarcator = Shift+Enter (newline)
Quite a long walk and maybe not so optimal; you might want to look into Jolt for better performance, but I am too lazy to think about a Jolt spec :).
I have the template for this, but I cannot upload it here as it is too big, and I cannot use any file-sharing service, so I'm not sure how to share it.
Also, if you have a MySQL DB at hand, you could just load it as CSV and use the JSON_EXTRACT function (column and table names below are placeholders):
SELECT JSON_EXTRACT(json_col, "$.id") AS id
FROM my_table
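Outside NiFi, for comparison, the whole transformation fits in a short Python sketch (file names are just examples; the column names come from the sample above):
import csv
import json

with open('input.txt') as src, open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['date', 'name', 'gender', 'id', 'val', 'type',
                     'location1', 'location2', 'status'])
    for line in src:
        if not line.strip():
            continue
        # each line looks like: date;name;gender;{...json...}
        date, name, gender, payload = line.strip().split(';', 3)
        record = json.loads(payload)
        loc1, loc2 = record['location'].split(',')
        writer.writerow([date, name, gender, record['id'], record['val'],
                         record['type'], loc1, loc2, record['status']])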
I am looking for any library (in Java) that can help me generate a dummy JSON file to test my code, e.g. the JSON file can contain random user profile data: name, address, zipcode.
I searched Stack Overflow and found the following link: How to generate JSON string in Java?
I think the suggested library https://github.com/DiUS/java-faker seems to be useful; however, because of security constraints I cannot use this particular library. Are there any other recommendations?
Use, for instance, Faker, like this:
#!/usr/bin/env python3
from json import dumps
from faker import Faker

fake = Faker()

def user():
    return dict(
        name=fake.name(),
        address=fake.address(),
        bio=fake.text()
    )

print('[')
try:
    while True:
        print(dumps(user()))
        print(',')
except KeyboardInterrupt:
    # XXX: json array can not end with a comma
    print(dumps(user()))
    print(']')
You can use it like this:
python3 fake_user.py > users.json
Use Ctrl+C to stop it when the file is big enough.
I am using Pandoc to convert a large number of Markdown (.md) files to HTML. I'm using the following Windows command line:
for %%i in (*.md) do pandoc -f markdown -s %%~ni.md > html/%%~ni.html
On a test run, the HTML looks OK except for the title tag: it's empty. Here is an example of the beginning of a .md file:
#Topic Title
- [Anchor 1](#anchor1)
- [Anchor 2](#anchor2)
<a name="anchor1"></a>
## Anchor 1
Is there a way I can tell Pandoc to parse the
#Topic Title
so that, in the HTML output files, I will get:
<title>Topic Title</title>
?
There are other .md tags I'd like to parse, and I think solving this will help me solve the rest of it.
I don't believe Pandoc supports this out-of-the-box. The relevant part of the Pandoc documentation states:
Templates may contain variables. Variable names are sequences of alphanumerics, -, and _, starting with a letter. A variable name surrounded by $ signs will be replaced by its value. For example, the string $title$ in
<title>$title$</title>
will be replaced by the document title.
It then continues:
Some variables are set automatically by pandoc. These vary somewhat depending on the output format, but include metadata fields (such as title, author, and date) as well as the following:
And proceeds to list a bunch of variables (none of which are relevant to your question). However, the above quote indicates that the title variable is a metadata field. The metadata field can be defined in a pandoc_title_block, a yaml_metadata_block, or passed in as a command line option.
The docs note that:
... you may also keep the metadata in a separate YAML file and pass it to pandoc as an argument, along with your markdown files ...
So you have a couple of options:
Edit each document to add metadata defining the title for each document (this could possibly be scripted).
Write your script to extract the title (perhaps a regex which looks for a # header in the first line) and pass that in to Pandoc as a command-line option.
If you intend to start including the metadata in new documents you create going forward, then the first option is probably the way to go. Run a script once to batch edit your documents and then you're done. However, if you have no intention of adding metadata to any documents, I would consider the second option. You are already running a loop, so just get the title before calling Pandoc within your loop (although I'm not sure how to do that in a Windows script; a sketch of the idea follows below).
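A sketch of that second option, written in Python rather than a Windows batch script because that is easier to show here (the directory names match the command above, and the title extraction assumes the #Topic Title header is the first line of each file):
#!/usr/bin/env python3
import subprocess
from pathlib import Path

Path('html').mkdir(exist_ok=True)
for md in Path('.').glob('*.md'):
    first_line = md.read_text(encoding='utf-8').splitlines()[0]
    title = first_line.lstrip('#').strip()  # '#Topic Title' -> 'Topic Title'
    subprocess.run(
        ['pandoc', '-f', 'markdown', '-s',
         '-M', f'title={title}',  # becomes <title>Topic Title</title>
         str(md), '-o', f'html/{md.stem}.html'],
        check=True,
    )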
I want to build a documentation site using Jekyll and GitHub Pages. The problem is that Jekyll only accepts filenames under _posts with an exact pattern like YYYY-MM-DD-your-title-is-here.md.
How can I post a page in Jekyll without this filename pattern? Something like:
awesome-title.md
yet-another-title.md
etc.md
Thanks in advance.
Don't use posts; posts are things with dates. It sounds like you probably want to use collections instead: you get all the power of posts, but without the pesky date/naming requirements.
https://jekyllrb.com/docs/collections/
I use collections for almost everything that isn't a post; my own site is configured to use collections for 'pages' as well as for more specific sections of the site.
I guess that you are annoyed with the post url http://domaine.tld/category/2014/11/22/post.html.
You cannot bypass the filename pattern for posts, but you can use permalink (see documentation).
_posts/2014-11-22-other-post.md
---
title: "Other post"
date: 2014-11-22 09:49:00
permalink: anything-you-want
---
The file will be anything-you-want/index.html.
The URL will be http://domaine.tld/anything-you-want.
What I did without "abandoning" the posts (it looks like using collections or pages is a better and deeper solution) is a combination of what @igneousaur says in a comment plus using the same date as a prefix for all file names:
Use permalink: /:title.html in _config.yml (no dates in published URLs).
Use the format 0001-01-01-name.md for all files in _posts folder (jekyll is happy about the file names and I'm happy about the sorting of the files).
Of course, we can include any "extra information" in the name, maybe some incremental id or anything that helps us organize the files, e.g.: 0001-01-01-001-name.md.
The way I solved it was by adding _plugins/no_date.rb:
class Jekyll::PostReader
  # Don't use DATE_FILENAME_MATCHER so we don't need to put those stupid dates
  # in the filename. Also limit to just *.markdown, so it won't process binary
  # files from e.g. drafts.
  def read_posts(dir)
    read_publishable(dir, "_posts", /.*\.markdown$/)
  end

  def read_drafts(dir)
    read_publishable(dir, "_drafts", /.*\.markdown$/)
  end
end
This overrides ("monkey patches") the standard Jekyll functions; the defaults for these are:
# Read all the files in <source>/<dir>/_drafts and create a new
# Document object with each one.
#
# dir - The String relative path of the directory to read.
#
# Returns nothing.
def read_drafts(dir)
  read_publishable(dir, "_drafts", Document::DATELESS_FILENAME_MATCHER)
end

# Read all the files in <source>/<dir>/_posts and create a new Document
# object with each one.
#
# dir - The String relative path of the directory to read.
#
# Returns nothing.
def read_posts(dir)
  read_publishable(dir, "_posts", Document::DATE_FILENAME_MATCHER)
end
With the referenced constants being:
DATELESS_FILENAME_MATCHER = %r!^(?:.+/)*(.*)(\.[^.]+)$!.freeze
DATE_FILENAME_MATCHER = %r!^(?>.+/)*?(\d{2,4}-\d{1,2}-\d{1,2})-([^/]*)(\.[^.]+)$!.freeze
As you can see, DATE_FILENAME_MATCHER as used in read_posts() requires a date ((\d{2,4}-\d{1,2}-\d{1,2})); I put date: 2021-07-06 in the frontmatter.
I couldn't really get collections to work, and this also solves another problem I had where storing binary files such as images in _drafts would error out as it tried to process them.
Arguably a bit ugly, but it works well. Downside is that it may break on update, although I've been patching various things for years and never really had any issues with it thus far. This is with Jekyll 4.2.0.
I wanted to use posts but not have the date in the filenames. The closest I got was naming the posts with an arbitrary 'date' like 0001-01-01-cool-post.md and then using a different property to access the date.
If you use the last-modified-at plugin - https://github.com/gjtorikian/jekyll-last-modified-at - then you can use page.last_modified_at in your _layouts/post.html and whatever file you are running {% for post in site.posts %} in.
Now the dates are retrieved from the last git commit date (not author date) and the page.date is unused.
The JSON schema for the config file actually contains some useful information; see the code block below for some examples.
I have set it to /:categories/:title. That drops the date and file extension, while preserving the categories.
I still use a proper date for the file name because you can use that date in your templates. I.e. to display the date on a post using {{ page.date }}.
{
  "global-permalink": {
    "description": "The global permalink format\nhttps://jekyllrb.com/docs/permalinks/#global",
    "type": "string",
    "default": "date",
    "examples": [
      "/:year",
      "/:short_year",
      "/:month",
      "/:i_month",
      "/:short_month",
      "/:day",
      "/:i_day",
      "/:y_day",
      "/:w_year",
      "/:week",
      "/:w_day",
      "/:short_day",
      "/:long_day",
      "/:hour",
      "/:minute",
      "/:second",
      "/:title",
      "/:slug",
      "/:categories",
      "/:slugified_categories",
      "date",
      "pretty",
      "ordinal",
      "weekdate",
      "none",
      "/:categories/:year/:month/:day/:title:output_ext",
      "/:categories/:year/:month/:day/:title/",
      "/:categories/:year/:y_day/:title:output_ext",
      "/:categories/:year/:week/:short_day/:title:output_ext",
      "/:categories/:title:output_ext"
    ],
    "pattern": "^((/(:(year|short_year|month|i_month|short_month|long_month|day|i_day|y_day|w_year|week|w_day|short_day|long_day|hour|minute|second|title|slug|categories|slugified_categories))+)+|date|pretty|ordinal|weekdate|none)$"
  }
}