Dumping memory in a VCD file - chisel

My main problem is that I can't access my internal signals while I'm using peek/poke testing. For example, I'm trying to debug a cache design, and for that I want to see the contents of the cache memory. As it's not an IO, I can't inspect it in my PeekPokeTester. I tried to dump it using the --memVCD flag, but that flag wasn't recognized.
Does anyone know how I can see the contents or dump them to a VCD file?

If you are using the unit test framework chiseltest with the verilator backend:
import chiseltest.experimental.TestOptionBuilder._
...
test(new DUT).withFlags(Array("--t-use-verilator", "--t-write-vcd")) { dut =>
  ...
}
If you are using chisel iotesters, use:
iotesters.Driver.execute(
  Array("--backend-name", "verilator", "--generate-vcd-output", "on"),
  () => new DUT
) { c =>
  ...
}
If you are using the latest Chisel release, 3.4, then the default backend, treadle, can now log memories too.
Add the flags "--tr-mem-to-vcd", "<specifier>", where specifier is one of the following (a usage sketch follows this list):
"all" log all values at all locations of all memories
"mem1:all" log all values at all locations for memory mem1
"mem1:0-4" log values at locations 0-4 for memory mem1
"mem1:b0-b100" log values at locations 0-4 but show addresses in binary for memory mem1
"mem1:h0-hff" log values at locations 0-255 but show addresses in hex for memory mem1
"mem1:o0-o377" log values at locations 0-255 but show addresses in octal for memory mem1
If you are feeling really bold and your DUT is not too large, you can take the generated FIRRTL of your DUT and write a test directly in treadle, which allows you to peek and poke just about everything. There is example code for this in treadle's internal test suite.
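Since that example isn't reproduced here, the following is only a rough sketch of the idea (signal and memory names are illustrative, and the exact TreadleTester construction differs a bit between treadle versions):
import firrtl.stage.FirrtlSourceAnnotation
import treadle.TreadleTester

val firrtlText = scala.io.Source.fromFile("Cache.fir").mkString  // FIRRTL emitted for your DUT
val tester = TreadleTester(Seq(FirrtlSourceAnnotation(firrtlText)))

tester.poke("io_writeEnable", 1)
tester.poke("io_addr", 3)
tester.poke("io_dataIn", 42)
tester.step()

// peekMemory looks directly inside a memory, no IO port required
println(tester.peekMemory("cacheMem", 3))
tester.report()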

If you are using chisel3, change the following in Chick's answer
import chiseltest.experimental.TestOptionBuilder._
to:
import chisel3.tester.experimental.TestOptionBuilder._
Further details can be found here.


How many tasks are created when Spark reads from or writes to MySQL?

As far as I know, Spark executors handle many tasks at the same time to guarantee processing data in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? How about writing data to MySQL: how many connections are there?
Here is a piece of code to read or write data from/to MySQL:
def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String,
         user: String, password: String, numPartitions: Int): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
    "user" -> user,
    "password" -> password,
    "numPartitions" -> numPartitions.toString
  )).load
}

def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
  var dF1 = sparkSession.sqlContext.read.format("jdbc")
    .option("url", jdbc.jdbcUrl)
    .option("user", jdbc.user)
    .option("password", jdbc.passwd)
    .option("driver", jdbc.jdbcDriver)
    .option("dbtable", table)
    .load()
  // dF1.show(3)
  dF1.createOrReplaceTempView(s"${table}")
  dF1
}
Here is a good article which answers your question:
https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/
In simple words: the workers split the reading task into several parts, and each worker only reads a part of your input data. The number of tasks depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system such as HDFS, where the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
By default, data from a JDBC source is loaded by one thread, so you will have one task processed by one executor; that is what you can expect in your second function, mysqlToDF.
In the first function, jdbc, you are closer to a parallel read, but some parameters are still missing: numPartitions alone is not enough. Spark needs a numeric or date column plus lower and upper bounds to be able to read in parallel (it will execute numPartitions queries, each fetching a partial result).
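A sketch of what the read options could look like (the partition column and bounds are illustrative; pick an indexed numeric or date column from your table and its real min/max):
val df = sqlContext.read.format("jdbc").options(Map(
  "url"             -> url,
  "driver"          -> driver,
  "dbtable"         -> dbtable,
  "user"            -> user,
  "password"        -> password,
  "partitionColumn" -> "id",       // numeric, date or timestamp column
  "lowerBound"      -> "1",        // min value of that column
  "upperBound"      -> "1000000",  // max value of that column
  "numPartitions"   -> "10"        // number of partitions / parallel queries
)).load()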
Spark JDBC documentation
In this documentation you will find:
partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.
numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. (read/write)
Regarding writes:
How about writing data to MySQL: how many connections are there?
As stated in the documentation, it also depends on numPartitions: if the number of partitions when writing is higher than numPartitions, Spark will figure it out and call coalesce. Remember that coalesce may generate skew, so sometimes it may be better to repartition explicitly with repartition(numPartitions) to distribute the data equally before the write.
If you don't set numPartitions, the number of parallel connections on write may be the same as the number of active tasks at a given moment, so be aware that with too high parallelism and no upper bound you may choke the source server.
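A minimal sketch of the write side, bounding the number of concurrent JDBC connections by repartitioning explicitly before the write (all values are illustrative):
df.repartition(10)            // at most 10 concurrent connections on write
  .write
  .mode("append")
  .format("jdbc")
  .option("url", url)
  .option("driver", driver)
  .option("dbtable", "target_table")
  .option("user", user)
  .option("password", password)
  .save()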

Use libpcap to capture multiple interfaces to same file

I would like to use libpcap to capture on multiple specific interfaces (not 'any') and write the packets to the same file.
I have the following code (error handling and some args removed):
static gpointer pkt_tracing_thread(gpointer data)
{
    while (1)
    {
        pcap_dispatch(g_capture_device1, .., dump_file1);
        pcap_dispatch(g_capture_device2, .., dump_file2);
    }
}

fp1 = calloc(1, sizeof(struct bpf_program));
fp2 = calloc(1, sizeof(struct bpf_program));
cap_dev1 = pcap_open_live(interface1,...
cap_dev2 = pcap_open_live(interface2,...
pcap_compile(cap_dev1, fp1, ...
pcap_compile(cap_dev2, fp2, ...
pcap_setfilter(cap_dev1, fp1);
pcap_setfilter(cap_dev2, fp2);
dump_file1 = pcap_dump_open(g_capture_device1, filename);
dump_file2 = pcap_dump_open(g_capture_device2, filename);
g_thread_create_full(pkt_tracing_thread, (gpointer)fp1, ...
g_thread_create_full(pkt_tracing_thread, (gpointer)fp2, ...
This does not work. What I see in filename is just packets on one of the interfaces. I'm guessing there could be threading issues in the above code.
I've read https://seclists.org/tcpdump/2012/q2/18 but I'm still not clear.
I've read that libpcap does not support writing in pcapng format, which would be required for the above to work, although I'm not clear about why.
Is there any way to capture multiple interfaces and write them to the same file?
Is there any way to capture multiple interfaces and write them to the same file?
Yes, but 1) you have to open the output file only once, with one call to pcap_dump_open() (otherwise, as with your program, you may have two threads writing to the same file independently and stepping on each other) and 2) you would need to have some form of mutex to prevent both threads from writing to the file at the same time.
Also, you should have one thread reading only from one capture device and the other thread reading from the other capture device, rather than having both threads reading from both devices.
As user9065877 said, you have to open the output file only once and write to it from only one thread at a time.
However, since you'd be serializing everything anyway, you may prefer to ask libpcap for pollable file descriptors for the interfaces and poll in a round-robin fashion for packets, using a single thread and no mutexes.
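A minimal sketch of that single-threaded approach (error handling omitted; interface names and snaplen are illustrative, and it assumes both interfaces have the same link-layer header type, since a plain pcap file cannot mix them):
#include <pcap/pcap.h>
#include <poll.h>

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *caps[2];
    caps[0] = pcap_open_live("eth0", 65535, 1, 100, errbuf);
    caps[1] = pcap_open_live("eth1", 65535, 1, 100, errbuf);

    /* One dump file, opened exactly once; both handles write through it. */
    pcap_dumper_t *dumper = pcap_dump_open(caps[0], "capture.pcap");

    struct pollfd pfds[2];
    for (int i = 0; i < 2; i++) {
        pcap_setnonblock(caps[i], 1, errbuf);
        pfds[i].fd = pcap_get_selectable_fd(caps[i]);
        pfds[i].events = POLLIN;
    }

    for (;;) {
        poll(pfds, 2, -1);
        for (int i = 0; i < 2; i++) {
            if (pfds[i].revents & POLLIN) {
                /* Single thread, so no locking is needed around the dumper. */
                pcap_dispatch(caps[i], -1, pcap_dump, (u_char *)dumper);
            }
        }
    }
}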

Not all executors are being used when reading JSON (zipped .gz) in GCP from a Google Dataproc Spark cluster using spark-submit

I just got introduced to this wonderful world of Big Data and cloud technology, using GCP (Dataproc) and PySpark. I have a ~5 GB JSON file (zipped, .gz) containing ~5 million records. I need to read each row and process only those rows which satisfy a certain condition. I have my working code, and I issued a spark-submit with --num-partitions=5, but still only one worker is used to carry out the action.
This is the spark-submit command I am using:
spark-submit --num-executors 5 --py-files /home/user/code/dist/package-0.1-py3.6.egg job.py
job.py:
path = "gs://dataproc-bucket/json-files/data_5M.json.gz"
mi = spark.read.json(path)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Dataproc config:
(I am using the free account for now; once I get a working solution I will add more cores and executors)
(Debian 9, Hadoop 2.9, Spark 2.4)
Master node:2 vCPU, 7.50 GB memory
Primary disk size: 32 GB
5 Worker nodes: 1 vCPU, 3.75 GB memory
Primary disk size: 32 GB
After spark-submit I can see in the web UI that 5 executors were added, but then only 1 executor remains active and performs all the tasks while the other 4 are released.
I did my research and most of the questions talk about accessing data via JDBC.
Please suggest what I am missing here.
P.S. Eventually I will read 64 JSON files of 5 GB each, so I might use 8 cores * 100 workers.
Your best bet is to preprocess the input. Given a single input file, spark.read.json(...) will create a single task to read and parse the JSON data, as Spark cannot know ahead of time how to parallelize it. If your data is in line-delimited JSON format (http://jsonlines.org/), the best course of action would be to split it into manageable chunks beforehand:
path = "gs://dataproc-bucket/json-files/data_5M.json"
# read monolithic JSON as text to avoid parsing, repartition and *then* parse JSON
mi = spark.read.json(spark.read.text(path).repartition(1000).rdd)
inf_rel = mi.select(mi.client_id,
mi.user_id,
mi.first_date,
F.hour(mi.first_date).alias('hour'),
mi.notes).rdd.map(foo).filter(lambda x: x)
inf_relevance = inf_rel.map(lambda l: Row(**dict(l))).toDF()
save_path = "gs://dataproc-bucket/json-files/output_5M.json"
inf_relevance.write.mode('append').json(save_path)
print("END!!")
Your initial step here (spark.read.text(...)) will still bottleneck as a single task. If your data isn't line-delimited or (especially!) you anticipate you will need to work with this data more than once, you should figure out a way to turn your 5 GB JSON file into 1000 5 MB JSON files before getting Spark involved.
.gz files are not splittable, so they're read by one core and placed onto a single partition.
See Dealing with a large gzipped file in Spark for reference.
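A quick way to see this (illustrative PySpark sketch): a single gzipped input lands in one partition, so everything downstream runs as one task until you repartition.
df = spark.read.text("gs://dataproc-bucket/json-files/data_5M.json.gz")
print(df.rdd.getNumPartitions())  # prints 1 for a single .gz file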

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I do not find any option in the official documentation to enable bulk insert mode.
Wondering, if it's possible to set bulk insert mode in dataflow pipeline? If yes, please let me know what I need to change in below code.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
      public void setParameters(KV<Integer, String> element, PreparedStatement query) {
        query.setInt(1, element.getKey());
        query.setString(2, element.getValue());
      }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, in practice I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
// Count words in input file(s)
.apply(new CountWords())
// Format as text
.apply(MapElements.via(new FormatAsTextFn()))
// Make key-value pairs with the first letter as the key
.apply(ParDo.of(new FirstLetterAsKey()))
// Group the words by first letter
.apply(GroupByKey.<String, String> create())
// Get a PCollection of only the values, discarding the keys
.apply(ParDo.of(new GetValues()))
// Write the words to the database
.apply(JdbcIO.<String> writeIterable()
.withDataSourceConfiguration(
JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
.withStatement(INSERT_OR_UPDATE_SQL)
.withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference from the normal write method of JdbcIO is the new method writeIterable(), which takes a PCollection<Iterable<RowT>> as input instead of a PCollection<RowT>. Each Iterable is written as one batch to the database.
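The helper transforms FirstLetterAsKey and GetValues are not shown above; the following are only illustrative sketches of what they could look like (the real implementations may differ):
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class FirstLetterAsKey extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    String line = c.element();
    // Key each formatted word-count line by its first letter.
    c.output(KV.of(line.substring(0, 1), line));
  }
}

class GetValues extends DoFn<KV<String, Iterable<String>>, Iterable<String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Drop the key and keep the grouped values; each Iterable becomes one batch.
    c.output(c.element().getValue());
  }
}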
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)

How to use the 7z DLL to compress and append many small chunks of data to a file

I would like to use the 7z DLL to append small amounts of data to one compressed file. At the moment my best guess would be to uncompress the 7z file, append the data, and recompress it. Obviously, this is not a good solution performance-wise if the size of the 7z file becomes large (say 1 GB) and I want to save a new chunk every second. How could I do this in a better way?
I could use any compression format supported by the 7z DLL.
Have a look at the Python LZMA bindings (LZMA is the compression algorithm used by 7z); you should be able to do what you want without any ctypes stuff.
EDIT
To be confirmed, but a quick look at py7zlib.py shows only support for reading 7z files, not writing them. However, in the src dir there's a pylzma_compressfile.c, so maybe there's something that can be done.
EDIT 2
The pylzma.compressfile function seems to be there, so fine.
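A rough sketch of how incremental compression with pylzma could look (based on the compressfile API as described in its documentation, not verified against current releases; note this produces a raw LZMA stream, not a .7z container):
import pylzma

# Compress one chunk and append the compressed stream to a growing file.
with open("chunk.bin", "rb") as src, open("archive.lzma.part", "ab") as dst:
    compressor = pylzma.compressfile(src)  # returns a file-like object
    while True:
        data = compressor.read(8 * 1024)
        if not data:
            break
        dst.write(data)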
This is not my answer; it is taken from How can I use a DLL file from Python?
I think ctypes is the way to go.
The following example of ctypes is from actual code I've written (in Python 2.5). This has been, by far, the easiest way I've found for doing what you ask.
import ctypes
# Load DLL into memory.
hllDll = ctypes.WinDLL ("c:\\PComm\\ehlapi32.dll")
# Set up prototype and parameters for the desired function call.
# HLLAPI
hllApiProto = ctypes.WINFUNCTYPE (ctypes.c_int,ctypes.c_void_p,
ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p)
hllApiParams = (1, "p1", 0), (1, "p2", 0), (1, "p3",0), (1, "p4",0),
# Actually map the call ("HLLAPI(...)") to a Python name.
hllApi = hllApiProto (("HLLAPI", hllDll), hllApiParams)
# This is how you can actually call the DLL function.
# Set up the variables and call the Python name with them.
p1 = ctypes.c_int (1)
p2 = ctypes.c_char_p (sessionVar)
p3 = ctypes.c_int (1)
p4 = ctypes.c_int (0)
hllApi (ctypes.byref (p1), p2, ctypes.byref (p3), ctypes.byref (p4))
The ctypes stuff has all the C-type data types (int, char, short, void*, ...) and can pass by value or reference. It can also return specific data types although my example doesn't do that (the HLL API returns values by modifying a variable passed by reference).