Are you up to date ?

October 23rd, 2016 | self-taught, training |

A few weeks ago, I’ve created a repo on github resuming how I stay up to date with the technology: chtefi/how-i-stay-updated.

I thought it would be nice to write a blog post on that, for you to find it, and for me to go a bit more deeper.

I’ll start explaining my situation in the IT world, why I follow all the things I’m following, then I’ll list of the resources I’m using to stay up-to-date so you could get some ideas. ;-)

Feel free to comment about what YOU are using to stay up-to-date. I’m eagerly impatient to know.

Frontend, backend, what’s the difference?

Those days, I’m working as a back-end and bigdata guy. Before that, I did a lot of front-end, so I also keep my front-end knowledge up to date. It’s still helpful to help and talk with the frontend teams, and it’s always useful when I want some quick UI to simplify my workflow. I even did a repo where I put everything I know about how to write a good website. Don’t hesitate to PR ! I may be a little rusty over some things.

As I was working mostly on the frontend side (and some backend but nothing fancy), I was bored and wanted to try something new. Reading all those news and stuff about bigdata made me wanting to go there and see what was the deal with that.

I’ve challenged myself, decided to resign of my current job (5 years there), and found a job in an amazing company in Paris, Powerspace. The boss was kind enough to let me in, despite the fact I don’t had a lot of experiences with their current stack. He accepted the challenge, because he knew I was able to learn quickly and that I definitely wanted to learn, making me a good resource.

Being good is being a good learner.

There is also this version: being an engineer (what most of us are considered today in IT) is being confortable with the ongoing changes and evolutions.

I was, and he trusted me.

Anyway, it’s been a year, and now, I have a way better knowledge of the available backend technologies we can use, how to use and link them, understand why they exist. I also like working on the architecture side thanks to the bigdata constraints. Then finally, I’m caring a lot more about the monitoring and alerting piece, those are truly something to be taken into account to know what’s going on each second in your company heart.

Working in the IT domain, you probably know Continuous Delivery, Continuous Integration, so you must be aware of the Continuous Learning. Here are the resources I’m using to do that.

Twitter

Of course.

It’s rare that tech people don’t have a twitter. This is often where the things happen at first. It’s an incredible source of informations. The characters limit is like a continuous TLDR feed, because we don’t have time to read everything. We just want to be aware it exists, and refer to it when we need to.

We don’t know if Twitter is still going to be there in 10 years. I sometimes wonder how we were doing before that. :-)

I’m posting a lot of links myself, everything I find interesting, don’t hesitate to say hi and follow me for your Continous Learning!

@chtefid

Chrome

My “new tab” is using tabbie since it was created. I never removed it even if it’s not perfect. It always answered to my needs, so, why change?

I’m using:

Bookmarks

I have a special folder “TO READ” where I bookmark for later.

Months ago, I was able to be stable around 10.
Right now, I have like >100 links in this folder. Always read some on evenings.
Once, I tried pocket, but, for some reason, I didn’t like it and stayed with my manual method. There is also delicious but I never gave a try.

I admit it sucks for a few reasons:

  • No tag system (i’m using Chrome)
  • You can resize the bookmark popup (Windows)
  • No search in this popup
  • The whole experience is not great

Well, to summarize : because the bookmarks system is just too damn basic. But it does the job.

There is a nice platform to git clone and install ourself, to save the bookmarks and shares them automatically, it’s called Shaarli.

Newsletters

Easy to subscribe and forget. The emails are where I got a tons of links. I don’t spend the same time on every one of them, it would be damn too time consuming!

  • http://highscalability.com/: stuff-the-internet-says-on-scalability, quotes and numbers. A LOT of links to blogs and articles.
  • http://www.confluent.io/blog/: top quality articles about confluent platform (kafka, streams, connect..)
  • https://scala.libhunt.com: Scala is the best language, framework, with the best libraries ever
  • Medium Daily Digest: I think I read 1/3 of the articles, good personal digest, technical and more worldwide news
  • Hadoop Weekly: ultra interesting for bigdata stuff, a must
  • Pony Foo: the ponyfoo’s articles are a must. the selection of the NL is quite good
  • DB Weekly: not reading everything, but interesting when i know the db
  • Dev Tips: one topic, one gif, one message. Good to know.

Github

I subscribed to some projects to read issues and PR and know where it’s going but it’s quite time consuming. I don’t recommend it except if you are really into open-source.

Videos

Of course, watching conferences and courses is a must.

It’s also useful when you don’t want to read something because it’s 3am.

Often watching in 2x.

  • MIT OpenCourseWare
  • Compressor Head: (not the robot metal band!) still need to finish this series, very interesting, by Colt McAnlis
  • Facebook Developers: Facebook has nice conferences: Reactjs, F8, @Scale
  • InfoQ: mass amount of talks with slides, can find great stuff from confs and around the world
  • Slideshare: sometimes, you just want to get details from slides because it’s more succinct (a video is a sequence of images right?)

Company

Professionally, Slack is very useful (even if too noisy sometimes..). Having a dedicated channel where people shares links about technology is a big plus. People likes to share, but don’t share too much, otherwise it’s lost.

Training

Of course, personal training is essential.

I spent some time on:

  • Coursera: perfect when I want to get into unknown
  • Hackerrank: always nice a have some challenges and discover new algorithms, mostly using Scala
  • Pluralsight: less than coursera but I followed some stuff a while ago
  • Egghead: nice free videos with react, redux, cycle and so on. But not all are free.

Books

I have a dozen of pdfs on my desktop about interesting stuff (scala, tdd, hadoop, elastic, akka, architecture, some softwares…). I’ve finished some, and started some others. Barely have time to read them.

I’m often reading a physical book aside. It’s more practical to stay focus, and “hopefully”, I have 2 hours of transit every day, so I try to not lose them.

Right now, it’s about DDD, super interesting Domain-Driven Design: Tackling Complexity in the Heart of Software. Links and tutorials are good, but books are really useful to show us the big picture and often go deeper.

More

Sometimes, you stumbled upon people sharing a lot of things. You keep them in case of (you ran out of things to read or do..), you never know.

Why all this ?

Because I love to know lots of things. It allows you to see the greater picture of things, and know what it’s possible, where we are in the domain.

Take the Javascript arrays.

  • Cool we can iterate using “for”.
  • Then do some functional programming and use “filter”, “.map”, “.concat”.
  • Then in more powerful languages such as Scala or Haskell, use some monads.
  • Then use scalaz to use category theory over your data (and lose your mind).
  • Use Kafka to act as an “array container” but partitioned and distributed.
  • Do some streaming processing to act as your “filter”, “map” or “reduce”.
  • Use Spark, Flink etc. to process streams in a distributed fashion, forever.
  • Be reactive, and use ReactiveX (rxjs/java/scala) to have a powerful API and deal with Observables.
  • Use an event-driven architecture.
  • Don’t forget backpressure.

I see that as the whole same thing at different levels of abstraction/simplicity.

Why it’s hard ?

  • Because my memory sucks and I’m sure I forget 90% of what I’m reading
  • Because I don’t have a lot of free time to read and learn all that. I’m trying, sleeping is overrated
  • Because most of things won’t be useful for me.
  • Because I could create stuff and write some code to make money instead

Why it’s useful to learn all that?

  • Because in about a year of that, I learned SO MANY things in a lot of domains and got better overall.

But we can’t learn everything

Despite all that, there are some domains I volontary don’t take a look, because I know it’s just too much or I know it’s just something I will never work with.

For instance:

  • mobile apps (except React-Native now, from far)
  • ops softwares (I mean chef, puppet, ansible, this kind): No problem to use them, but I don’t want to learn the internals, I just don’t care. I just love Docker. :-)
  • datascientism: I LOVE to work with datascientists, they are awesome people. They know the data better than anyone else in the company. But the ML piece, the Deep Learning piece is a huge world. I not the basics but it stops here. But we all should definitely learn a lot in this domain, this is the present, and it’s our future. Even Google wants to be a “ML Company First”, betting on the right horse.
  • design: I LOVE to work with designers, they are awesome people. Without them, we’ll still be working on green on black screens. I LOVE watch and study the work they are doing. I LOVE to criticize front-end stuff, even I have don’t have the answer.
  • networking: I know the basics and how to debug a bit chtefi/curated-system-tools but it’s a very complex world. I suck at understanding company’s architectures and acronyms, how it works etc.
  • The cloud: barely played with AWS, I want to know more, but waiting to have a use-case :)

Each job needs specialized people after all.

Blog

As you can see, I also “maintain” this blog.

I mostly pick a topic or a framework I want to learn and share my findings. (using a lot of gists too)
It’s useful because you try to not say stupid things and must verify your affirmations. See this another blog to understand why you should write blogs.

I have a tons of drafts and titles, but it’s very hard to find time to write the articles the way you want. I often takes several evenings doing only that to make an article good enough for me. I still must improve on that part.

Beside staying up-to-date

  • Don’t forget your wife, your kids, and your family.
  • Don’t forget yourself. Go to gym. Don’t try to be too geeky.

 

Feel free to comment about what YOU are using to stay up-to-date. I’m eagerly impatient to know.

Why it’s important to log using slf4j ?

September 7th, 2016 | code quality, hadoop, java, logging, scala |

You are a Java or Scala programmer. You are logging stuff with different levels of severity. And you probably already used slf4j even without noticing.

This post is global overview of its ecosystem, why it exists and how does it work. It’s not because you’re using something everyday that you know the details right?

Why does slf4j even exist?

Why do we need something complicated like a logging framework to do something simple as put a message on stdout? Because not everybody wants to use only stdout, and because of dependencies that have their own logging logic too.

slf4j needs love

slf4j is an API that exposes logging methods (logger.info, logger.error and so on). It’s just a facade, an abstraction, an interface. By itself, it can’t log anything. It needs an implementation, a binding, something that truly logs the message somewhere. slf4j is just the entry point, it needs an exit.

slf4j breathes in logs

But it can also serve as a exit for other logging systems. This is thanks to the logging adapters/bridges, that redirect others logging frameworks to slf4j. Hence, you can make all your application logs to go through the same pipe even if the origin logging system is different.

slf4j is magic

The magic in that ? You can do all that and update the implementation without altering the existing code.

 

We are going to see several logging implementations slf4j can be bound to.

I’m going to use Scala code because it’s more concise, but that’s exactly the same in Java.

Simple logging using JUL

JUL stands for java.util.logging. This is a package that exists since the JDK1.4 (JSR 47). It’s quite simple to use and does the job:

val l = java.util.logging.Logger.getLogger("My JUL")
l.info("coucou")

Output:

Aug 18, 2016 11:41:00 PM App$ delayedEndpoint$App$1
INFO: coucou

App is my class, delayedEndpoint is the method.

It’s configurable through its API:

// we create a logger that accept ALL levels
val l = java.util.logging.Logger.getLogger("My JUL")
l.setLevel(Level.ALL)
// we output ALL the logs to the console
val h = new ConsoleHandler
h.setLevel(Level.ALL)

// and to a file but only greather or equal to WARNING
val f = new FileHandler("warn.log", true)
f.setLevel(Level.WARNING)
f.setFormatter(new SimpleFormatter)

l.addHandler(h)
l.addHandler(f)

// log stuff
l.entering(classOf[App].toString, "myMethod")
l.info("hello there")
l.severe("badaboom")
l.exiting(classOf[App].toString, "myMethod")

That can output something like :

sept. 07, 2016 11:16:53 PM interface scala.App myMethod
FINER: ENTRY
sept. 07, 2016 11:16:53 PM com.App$ myMethod
INFO: hello there
sept. 07, 2016 11:16:53 PM com.App$ myMethod
INFO: hello there
sept. 07, 2016 11:16:53 PM com.App$ myMethod
SEVERE: badaboom
sept. 07, 2016 11:16:53 PM com.App$ myMethod
SEVERE: badaboom
sept. 07, 2016 11:16:53 PM interface scala.App myMethod
FINER: RETURN

The default format is horrible but we can see our logs. You’ll notice we have the INFO and SEVERE twice but not the FINER. It’s because, by default, there is already a console handler logging all INFO minimum.

It’s also configurable through a properties file often named “logging.properties”.

For instance, on OSX, you can find the JVM global JUL configuration here (that contains the default console handler we just talked about):

/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/logging.properties

You can use a file of yours by specifying its path in the system properties:

-Djava.util.logging.config.file=src/main/resources/logging.properties

Some values inside must be references (FQCN) that will be load dynamically, otherwise it’s simple properties (think beans).

.level = INFO
handlers=java.util.logging.ConsoleHandler
java.util.logging.ConsoleHandler.level = INFO
java.util.logging.ConsoleHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.SimpleFormatter.format=%4$s: %5$s [%1$tc]%n

We can get a reference to the global logger to change it’s minimum level:

java.util.logging.Logger.getGlobal.setLevel(Level.ALL)

The output is better:

FINER: ENTRY [Wed Sep 07 23:32:48 CEST 2016]
INFO: hello there [Wed Sep 07 23:32:48 CEST 2016]
SEVERE: badaboom [Wed Sep 07 23:32:48 CEST 2016]
FINER: RETURN [Wed Sep 07 23:32:48 CEST 2016]

Be careful, specifying a configuration file is not used as an override of the default! If you forget something (especially handlers=), you might not see any logging.

Note that we used the handler java.util.logging.ConsoleHandler but there is also available a FileHandler (if unconfigured, it logs into $HOME/java0.log).

LogManagers

All the Loggers created in the application are managed by a LogManager.

By default, there is a default instance created on startup. It’s possible to give another one, by specifying the property java.util.logging.manager.

It’s often used along with log4j that implements a custom LogManager (available in the package org.apache.logging.log4j:log4j-jul):

-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager

This way, any manager can have a hand on any Logger created in the application.

It can change their behavior and where do they read their configuration for instance. This is what we call a Logging Adapter or a bridge: you can log using JUL in the code and use log4j features to manipulate and save the logs. We’ll go into more details later in this post.

A smarter logging with slf4j-api

Let’s go into the main subject: slf4j.

The API

First, we need to add a dependency to its API:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("hello")

We are getting some logs, but not what we expect:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

slf4j is using a org.slf4j.helpers.NOPLogger to log, but unfortunately, as the name says, all methods are empty shells:

// org.slf4j.helpers.NOPLogger.java
final public void info(String msg, Throwable t) {
    // NOP
}

The application still works, but without logs. slf4j tries to find a class “org.slf4j.impl.StaticLoggerBinder” available in the classpath. If it does found one, it fallbacks to the NOPLogger.

A simple slf4j binding

Fortunately, there is a simple implementation of slf4j :

libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.21"

Now it can find a org.slf4j.impl.StaticLoggerBinder to create a Logger (a SimpleLogger in this case).

By default, this logger publishes messages to System.err, but it can actually write to System.out or any file.

val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("message from {}", "slf4j!")

Output:

[main] INFO My App — message from slf4j!

The style and destination can be configured using System variables or via a properties file.

-Dorg.slf4j.simpleLogger.showDateTime=true
-Dorg.slf4j.simpleLogger.dateTimeFormat="yyyy-MM-dd HH:mm:ss"
-Dorg.slf4j.simpleLogger.levelInBrackets=true
-Dorg.slf4j.simpleLogger.logFile=simpleLogger.log

Here, we say we want to log into a file “simpleLogger.log”.

For the sake of clarity and organization, it’s preferable to put those props in a dedicated file such as src/main/resources/simplelogger.properties:

org.slf4j.simpleLogger.showDateTime=true
org.slf4j.simpleLogger.dateTimeFormat="yyyy-MM-dd HH:mm:ss"
org.slf4j.simpleLogger.levelInBrackets=true
org.slf4j.simpleLogger.logFile=simpleLogger.log

This was our first sl4j logging implementation. But we already saw another one: JUL !

slf4j to JUL

slf4j can redirect its logs to JUL that provides the “writing” piece as we already saw.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"

The name “slf4j-jdk14″ is because JUL package appeared in the JDK1.4 as we said. Strange name to pick but well.

Output:

INFO: message from slf4j! [Thu Aug 18 23:45:15 CEST 2016]

The code is the same as previously, we just changed the implementation. Notice the output is different than the SimpleLogger’s.

This logger is actually an instance of JDK14LoggerAdapter. It’s using the style we defined at the beginning, in logging.properties, used by JUL, remember ?.

Note that you don’t have the full control on the Logger via the API as we had when using directly java.util.logging.Logger which exposes more methods. We just have access to the slf4j’s ones. This is why the configuration files comes in handy.

Multiple implementations

If we have multiple implementations available, slf4j will have to pick between them, and it will leave you a small warning about that.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.21"

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [.../slf4j-simple/jars/slf4j-simple-1.7.21.jar!...]
SLF4J: Found binding in [.../org.slf4j/slf4j-jdk14/jars/slf4j-jdk14–1.7.21.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
"2016-08-18 23:53:54" [main] [INFO] My App - message from slf4j!

As we said org.slf4j.impl.StaticLoggerBinder is the class slf4j-api is looking for in the classpath to get an implementation. This is the class that must exist in a slf4j implementation jar.

This message is just a warning, the logging will work. But slf4j will simply pick one available logging implementation and deal with it. But it’s a bad smell that should be fixed, because maybe it won’t pick the one you want.

It often happens when pom.xml or build.sbt imports dependencies that themselves depends on one of the slf4j implementation.

They have to be excluded and your own program should import a slf4j implementation itself. If you don’t, you could run in a no-logging issue.

A real case causing logs loss

For a real case, let’s import the hadoop client lib:

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"

If we restart our program, it’s getting more verbose and we’re getting a surprise:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [.../org.slf4j/slf4j-log4j12/jars/slf4j-log4j12–1.7.5.jar!...]
SLF4J: Found binding in [.../org.slf4j/slf4j-jdk14/jars/slf4j-jdk14–1.7.21.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
log4j:WARN No appenders could be found for logger (My App).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

We can see some log4j warnings that we never imported, and we don’t even see our own message! Where did it go?

It went into log4j that is not configured, meaning into a blackhole.

One way is to exclude the log4j impl from the dependencies:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" exclude("org.slf4j", "slf4j-log4j12")

If we restart our program, we can see that our JUL console logs are back.

Note that the hadoop logging will still be voided, because it still rely on log4j configuration we didn’t configured.

One way to fix this and get the hadoop logs, would be to redirect log4j api to slf4j api. It’s possible, we simply need to add a dependency to org.slf4j:log4j-over-slf4j.

Again, we’ll see that in details later in this article, but the point is: you shouldn’t have multiple logging implementations available in one program.

slf4j implementations should be declared as optional

A best practice when writing a library or any module that can be imported somewhere, is to set slf4j implementation dependency as “optional”:

libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21" % "optional"
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-jdk14</artifactId>
  <optional>true</optional>
</dependency>

With optional, the dependency won’t be imported by transitivity.

The program which depends on it can use anything, no need to exclude it. More details here https://maven.apache.org/guides/introduction/introduction-to-optional-and-excludes-dependencies.html.

JCL/ACL

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jcl" % "1.7.21"

JCL stands for Jakarta Commons Logging.

Jakarta is an old retired Apache project, basically, it’s known as ACL now, Apache Commons Logging. It’s not maintained anymore (since 2014), but we can find it in old projects.

It serves the same purpose as slf4j, meaning it’s an abstraction over different logging frameworks such as log4j or JUL.

slf4j’s getLogger() will return a JCLLoggerAdapter that will look for a specific “Log” implementation set by the System variable “org.apache.commons.logging.Log”.

If not set, it will try to fallback on any implementations it can find in the classpath (log4j, JUL..).

New projects should forget about it. Only, if they depends on an old project that depends on JCL, then it should be considered to add a bridge to redirect JCL logs to the implementation of the project.

log4j

log4j is a widely-used logging framework. v1.x has been refactored and improved a lot to create the v2.x called log4j2.

Again, it can be used as an abstraction over a logging implementation, but it can be used as an implementation as well.

log4j1.2

log4j1.2 has reached end of life in 2015.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-log4j12" % "1.7.21"

Note that this will pull log4j1.2 library too. Here is the dependency tree:


[info] +-org.slf4j:slf4j-log4j12:1.7.21
[info]   +-log4j:log4j:1.2.17
[info]   +-org.slf4j:slf4j-api:1.7.21

When calling slf4j’s getLogger(“My App”), it will use log4j API to create the logger:

org.apache.log4j.LogManager.getLogger(name);

Note that this LogManager has nothing to do with the JUL’s one.

When you don’t have slf4j but just log4j, this is the method you call to get a Logger. slf4j-log4j12 just does the same.

Anyway, that’s not enough:

log4j:WARN No appenders could be found for logger (My App).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

log4j needs a configuration file. We can create a simple properties file “src/main/resources/log4j.properties”:

log4j.rootLogger=DEBUG, STDOUT
log4j.appender.STDOUT=org.apache.log4j.ConsoleAppender
log4j.appender.STDOUT.layout=org.apache.log4j.PatternLayout
log4j.appender.STDOUT.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n

If we restart our program, we can see our message:

0 [main] INFO My App — message from slf4j!

Or if we like xml (nobody?), we can create a file “log4j.xml” (notice the lowercase tags):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration PUBLIC "-//APACHE//DTD LOG4J 1.2//EN" "log4j.dtd">
<log4j:configuration xmlns:log4j='http://jakarta.apache.org/log4j/'>

 <appender name="STDOUT" class="org.apache.log4j.ConsoleAppender">
   <layout class="org.apache.log4j.PatternLayout">
     <param name="ConversionPattern" value="%d %-5p [%t] %C{2} (%F:%L) - %m%n"/>
   </layout>
 </appender>

 <root>
   <priority value ="info" />
   <appender-ref ref="STDOUT" />
 </root>

</log4j:configuration>

Output:

2016–08–22 01:06:38,194 INFO [main] App$ (App.scala:11) — message from slf4j!

But you shouldn’t useWhen you don’t have slf4j but just log4j, this is the method you call to get a Logger. slf4j-log4j12 just does the same.

log4j2

Now, let’s say we want to use the latest version of log4j. It may be the most popular slf4j’s binding used nowadays.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"

Notice the organization of the binding is “org.apache.logging.log4j”, and not “org.slf4j” like log4j12’s.

Only adding this dependency is not enough :

Failed to instantiate SLF4J LoggerFactory
Reported exception:
java.lang.NoClassDefFoundError: org/apache/logging/log4j/spi/AbstractLoggerAdapter
...

We need to add log4j-api dependency ourselves:

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.6.2"

Not enough yet!

ERROR StatusLogger Log4j2 could not find a logging implementation. Please add log4j-core to the classpath. Using SimpleLogger to log to the console…

We need to add log4j-core dependency too

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-api" % "2.6.2"
libraryDependencies += "org.apache.logging.log4j" % "log4j-core" % "2.6.2"

We get another error message (!) :

ERROR StatusLogger No log4j2 configuration file found. Using default configuration: logging only errors to the console.

That’s better, we just need a configuration file, that’s the last step.

Let’s create a sample log4j2.xml (notice the caps):

<?xml version="1.0" encoding="UTF-8"?>
<Configuration>

 <Appenders>
   <File name="A1" fileName="A1.log" append="true">
     <PatternLayout pattern="%t %-5p %c{2} - %m%n"/>
   </File>
   <Async name="ASYNCFILE">
     <AppenderRef ref="A1"/>
   </Async>
   <Console name="STDOUT" target="SYSTEM_OUT">
     <PatternLayout pattern="%d %-5p [%t] %C{2} (%F:%L) - %m%n"/>
   </Console>
 </Appenders>

 <Loggers>
   <Root level="debug">
     <AppenderRef ref="STDOUT"/>
     <AppenderRef ref="ASYNCFILE"/>
   </Root>
 </Loggers>

</Configuration>

Our message is finally back and a file A1.log is created too:

2016–08–22 01:51:49,912 INFO [run-main-a] App$ (App.scala:8) — message from slf4j!

log4j2 is excellent because it has a vast collections of Appenders where to write the logs : https://logging.apache.org/log4j/log4j-2.4/manual/appenders.html

  • Console, File, RollingFile, MemoryMappedFile
  • Flume, Kafka, JDBC, JMS, Socket
  • SMTP (emails on errors, woo!)
  • Any Appender can be treated as Async too (doing the logging in another thread, to not block the main thread cause of i/o)

logback

logback has the same father as log4j, it was meant to be the successor of log4j.

The syntax of the configuration is therefore quite similar.

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.7"

“logback-classic” will pull-down “logback-core” as dependency, no need to add it.

It will run without configuration (finally!):

02:17:43.032 [run-main-1f] INFO My App — message from slf4j!

But of course, you can create a logback.xml to customize its behavior:

<configuration debug="true" scan="true" scanPeriod="30 seconds">

    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>

    <root level="debug">
        <appender-ref ref="STDOUT" />
    </root>
</configuration>
  • debug : to display some info about the logging system creation on startup
  • scan : modification are taken into account in live. This is particularly useful in production when you just want to get debug message for a short amount of time.
  • notice the xml style is log4j1.2’s

It’s also possible to use a custom config file

-Dlogback.configurationFile=src/main/resources/logback.xml

logback has the same collection of appenders as log4j. Some are not part of the official package, such as:

TLDR

Add a dependency to slf4j which is a logging interface: “org.slf4j” % “slf4j-api” and add an logging implementation:

Implementation Dependency(ies) Configuration / Note
to the console “org.slf4j” % “slf4j-simple” simplelogger.properties
to java.util.logging (JUL) “org.slf4j” % “slf4j-jdk14″ logging.properties
to JCL/ACL “org.slf4j” % “slf4j-jcl” (deprecated)
to log4j1.2 “org.slf4j” % “slf4j-log4j12″ (deprecated) log4.[properties|xml]
to log4j2 “org.apache.logging.log4j” % “log4j-[slf4j-impl|api|core]” log4j2.xml
to logback “ch.qos.logback” % “logback-classic” logback.xml

A very nice picture to resume what we just saw (we didn’t talked about sl4j-nop, it’s just a black hole):


http://www.slf4j.org/manual.html

 

So we learned about multiple implementations/bindings of slf4j’s api.

But if your project depends on other projects that are not using slf4j but directly JUL or log4j, it’s possible to redirect them to your own slf4j’s implementation, thanks to the bridges.

Bridges

Previously, we imported hadoop-client and our logs disappeared because it was using a log4j logger we never configured.

We excluded its implementation from the program and could see our logs again, but the logs of the hadoop-client library was still using log4j, and therefore its logs went into the void.

To avoid that, it’s possible to create a bridge to send log4j messages to slf4j, that we will dispatch where we want.

The bridge package generally contains both sides in the name, as “org.apache.logging.log4j” % “log4j-to-slf4j” % “2.6.2”.

For instance, with those dependencies :

libraryDependencies += "org.slf4j" % "slf4j-api" % "1.7.21"
libraryDependencies += "org.slf4j" % "slf4j-jdk14" % "1.7.21"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
libraryDependencies += "org.apache.logging.log4j" % "log4j-to-slf4j" % "2.6.2"

The path of the logs is:
hadoop’s log(…) → ACL → log4j → bridge → slf4j → JUL → System.err
Phew!

val sl: Logger = LoggerFactory.getLogger("My App")
sl.info("message from {}", "slf4j!")
// generate some hadoop logs
new DFSClient(new InetSocketAddress(1337), new Configuration)

We are actually “lucky” because 2 implementations were available for slf4j: log4j (provided in hadoop-client) and “slf4j-jdk14″.

Fortunately for us, slf4j pock “slf4j-jdk14″. Otherwise we would have get trap into an infinite loop :

hadoop’s log(…) → ACL → log4j → bridge → slf4j → log4j → log4j → bridge → slf4j → log4j→ log4j → bridge → slf4j → log4j…

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [...slf4j-jdk14–1.7.21.jar!...]
SLF4J: Found binding in [...slf4j-log4j12–1.7.5.jar!...]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.JDK14LoggerFactory]
INFO: message from slf4j! [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.use.legacy.blockreader.local = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.read.shortcircuit = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.client.domain.socket.data.traffic = false [Fri Aug 19 01:08:46 CEST 2016]
FINE: dfs.domain.socket.path = [Fri Aug 19 01:08:46 CEST 2016]
…

Another bridge supposedly doing the same exists : “org.slf4j” % “log4j-over-slf4j” % “1.7.21”. Unfortunately, it creates the infinite loop in our case, because slf4j pick “slf4j-log4j12″:

SLF4J: Found binding in [...slf4j-log4j12–1.7.5.jar!...]
SLF4J: Found binding in [...slf4j-jdk14–1.7.21.jar!...]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.StackOverflowError

But we can explicitely exclude the other implementation :

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0" exclude("org.slf4j", "slf4j-log4j12")

If we do, both bridges are working as expected.

As you can see, without altering anything in the hadoop library, we made it generate logs where and with the format we wanted.

Bridges between those common implementations are available (they couldn’t agree on the naming it seems..):

  • jcl-over-slf4j
  • log4j-over-slf4j
  • jul-to-slf4j

That’s the power of slf4j and its implementations. It’s completely decoupled from the source.

TLDR

Here’s a picture resuming the available bridges to slf4j:


http://www.slf4j.org/legacy.html

Performance

Some applications can generate a tremendous amount of logs. Some precautions should be taken care of:

  • async logging should always be preferred (another thread doing the logging, not the caller’s). This is often available in the logging configuration itself ()
  • you should not add level guards (if (logger.isDebugEnabled)…) before logging, that brings us to the next point:
  • do not concat strings yourself in the message: use the placeholder syntax such as log.info(“the values are {} and {}”, item, item2). The .toString() won’t be computed if it’s not needed (it can be cpu intensive, but basically, it’s just useless to call it if the log level is not enough).
  • In Scala, you generally use https://github.com/Log4s/log4s to avoid this and just use classic string interpolation. It’s based on macros and will automatically add guard conditions.

Some benchmarks and comparaison: https://logging.apache.org/log4j/2.x/performance.html

Last notes

slf4j is useful combined to a powerful implementation such as log4j2 or logback.

But be careful when the application is managed by another application, like supervisor, because they can handle the logging themselves too like file rolling, or logstash to somewhere. Often, keeping the logging configuration simple (stdout) is enough.

A lot of frameworks have traits or abstract classes or globals to provide the logging directly  :

  • Akka : provides LoggingReceive, ActorLogging, akka.event.Logging, akka-slf4j.jar
  • Spark : it’s using log4j and had a trait org.apache.spark.Logging (removed in 2.0.0)
  • Play Framework: it’s using logback and provides a Logger object/class on top of slf4j’s Logger

 

 

JNA—Java Native Access: enjoy the native functions

August 3rd, 2016 | java, jna, jni, native, scala |

Before JNA: JNI

If you’re into the Java world, you’ve probably heard of JNI: Java Native Interface.
It’s used to call the native functions of the system or of any native library.
Some good JNI explanations and examples here: http://www.ibm.com/developerworks/java/tutorials/j-jni/j-jni.html

Most of developers will never use it because it’s not often necessary to access the system resources, the windows, the volumes etc. That really depends of your business.

Sometimes, you want to use a library that’s not written in Java but in C. It’s very performant and battle-tested, you need to create a bridge. This is where JNI and JNA comes into play.

About resources, Java provides already some high-level API for some system aspects (memory, disks), such as:

  • Runtime.getRuntime().maxMemory()
  • Runtime.getRuntime().availableProcessors()
  • File.listRoots()(0).getFreeSpace()

But it’s pretty limited. Behind the scene, they are declared as native and rely on JNI.

You can use some projects that offers more options, such as oshi (Operating System & Hardware Information). It makes all possible information on the OS and hardware of the machine available (all memory and cpu metrics, network, battery, usb, sensors..).

It’s not using JNI: it’s using JNA!
JNA is JNI’s cousin: created to be simpler to use, and to write only Java code. (Scala in our case :) Note that there is a slight call overhead compared to JNI because of the dynamic bindings.

JNA

Basically, it dynamically links the functions of the native library to some functions declared in a Java/Scala interface/trait. Nothing more.

The difficulty comes with the signature of the functions you want to “import”.
You can easily find their native signatures (Google is our friend), but it’s not always obvious to find how to translate them using the Java/Scala types.

Hopefully, the documentation of JNA is pretty good to understand the subtle cases : Using the library, FAQ.

 

Let’s review how to use it using Scala and SBT (instead of Java).

How to use it

First, SBT:

libraryDependencies ++= Seq(
  "net.java.dev.jna" % "jna" % "4.2.2",
  "net.java.dev.jna" % "jna-platform" % "4.2.2")

The “jna” dependency is the core.

“jna-platform” is optional. It contains a lot of already written interfaces to access some standard libraries on several systems (Windows (kernel32, user32, COM..), Linux (X11), Mac). If you plan to use any system library, check out this package first.

Then, the Scala part.

Use the existing platform bindings

With jna-platform, you can use the existing bindings:

import com.sun.jna.platform.win32.Kernel32
import com.sun.jna.ptr.IntByReference

val cn = new Array[Char](256)
val success: Boolean = Kernel32.INSTANCE.GetComputerName(cn, new IntByReference(256))
println(if (success) Native.toString(cn) else Kernel32.INSTANCE.GetLastError())

You can feel the native way when calling this function (most native functions follows this style):

  • you provide a buffer and its length
  • you get a boolean as result to indicate success/failure
  • in case of a failure, you call GetLastError() to know the code of the error (such as 111 for ERROR_BUFFER_OVERFLOW)
  • in case of a success, the buffer contains the name

That’s very manual but that’s the way. (nowadays, we would return the String and throw an Exception on failure)

For information, the native signature is :

BOOL WINAPI GetComputerName(
  _Out_   LPTSTR  lpBuffer,
  _Inout_ LPDWORD lpnSize);

A pointer to some buffer to write into and its size (use as input and as output).

Listing the opened windows

Another more complex example to retrieve the list of opened windows :

import com.sun.jna.platform.win32.{User32, WinUser}

User32.INSTANCE.EnumWindows(new WinUser.WNDENUMPROC {
  override def callback(hWnd: HWND, arg: Pointer): Boolean = {
    val buffer = new Array[Char](256)
    User32.INSTANCE.GetWindowText(hWnd, buffer, 256)
    println(s"$hWnd: ${Native.toString(buffer)}")
    true
  }
}, null)

Output:

native@0xb0274: JavaUpdate SysTray Icon 
native@0x10342: GDI+ Window 
native@0x10180: Windows Push Notifications Platform 
(a lot more)...

The native signature of EnumWindows is :

BOOL WINAPI EnumWindows(
  _In_ WNDENUMPROC lpEnumFunc,
  _In_ LPARAM      lParam);
  • we use User32 because it contains the windows functions of Windows
  • a WNDENUMPROC  is a pointer to a callback. JNA already has an interface of the same name to be able to create this type in the JVM.
  • we call another function of User32 in get the title of each window

Create a custom binding

It’s time to fly with our own wings.

Let’s call a famous function of the Windows API: MessageBox. You know, the popups? It’s in User32.lib but JNA did not implemented it. Let’s do it ourselves.

First, we create an interface with the Java/Scala signature of the native function which is :

int WINAPI MessageBox(
  _In_opt_ HWND    hWnd,
  _In_opt_ LPCTSTR lpText,
  _In_opt_ LPCTSTR lpCaption,
  _In_     UINT    uType);

The Scala equivalence could be:

import com.sun.jna.Pointer
import com.sun.jna.win32.StdCallLibrary

trait MyUser32 extends StdCallLibrary {
  def MessageBox(hWnd: Pointer, lpText: String, lpCaption: String, uType: Int)
}
  • We use simple Strings and not Array[Char] because they are only used as inputs (_In_).
  • The name of the function must be exactly the native’s one (with caps)

Now, we need to instantiate the interface with JNA and call our function:

val u32 = Native.loadLibrary("user32", classOf[MyUser32], W32APIOptions.UNICODE_OPTIONS).asInstanceOf[MyUser32]
val MB_YESNO = 0x00000004
val MB_ICONEXCLAMATION = 0x00000030
u32.MessageBox(null, "Hello there!", "Hi", MB_YESNO | MB_ICONEXCLAMATION)

  • Always use W32APIOptions.UNICODE_OPTIONS or you’ll get into troubles when calling functions (that will automatically convert the input/output of the calls)

It was quite simple right? That’s the purpose of JNA. Just need an interface with the native method declaration, you can call it.

The difficulty could be to write the Java signature, but a tool can help: JNAerator. From the native language, it can generate Java signatures, pretty cool!

 

More examples of JNA usage on their github’s: https://github.com/java-native-access/jna/tree/master/contrib

 

Demystifying openssl

July 25th, 2016 | https, openssl, security, ssh |

I always saw openssl as a complicated beast.

I generally use openssl only to create a pair of private/public keys to be used with ssh.

But when I need to use it for some reasons, I’m always wondering what to do. I google some tutorials then copy/paste commands. But I never understood what’s the deal with all the files .key, .pem, .csr, .crt?! Then, when I succeed to do my thing, I move on. I never really tried to understand the flow.

I’m not an expert in openssl. I just want to demystify some of its features. How it is used to generate keys and certificates? What we can do beyond that? (random number generator, manually encrypt/decrypt files)

I’ll use a classic example: generate a self-signed certificate. Several commands will be used, several files will be generated. I’ll add the certificate into nginx to make it work. Finally, I will use some other commands unrelated to certificates.

Why a certificate?

It has two purposes :

  • ensure that you are on the website it claims to be

The browsers (or the operating systems) have a set of Certificate Authority root certificates installed. They use them to verify the certificates of the websites.

  • encrypt the data between you and the website server

The certificates contains a public key that the browser will use to encrypt the data to send. The server will be able to decrypt them using the associated private key stored on its disk.

Why a self-signed certificate?

Those kind of certificates were more useful before, when the HTTPS certificates were not free. To work with HTTPS in staging or development environments, it was the way to go.

Now, we have Let’s Encrypt and https://gethttpsforfree.com/ to get free HTTPS certificates truly signed by a Certificate Authority. But it’s still more complex to grab than simply generate a self-signed certificates.

Step 1: create a RSA key

$ openssl genrsa -des3 -passout pass:ThEpWd -out out.key 2048
Generating RSA private key, 2048 bit long modulus
.................................................+++
.......................+++
e is 65537 (0x10001)

We start by generating a RSA key (genrsa) :

  • We generate a key of 2048 bits. The more the safer. The more, the slower to generate. The process has to generate 2 primes numbers and do some security checks (represented by the “.” and “+”). It will also use some random signals to always get a different result.
  • We encrypt it with the Triple DES cipher using a password. Here, it is provided directly into the command line. It could also be passed from a file (“file:pwd.txt”). Any program trying to use the key will first need the password. It’s a secondary layer of security in case someone has access to the file.
  • Note that for automation, you could get into troubles. If a program wants you to type a password — because you are using an encrypted private key — such as nginx or any openssl commands, you could need a human interaction.
  • The output is by default the PEM format (it’s just the base64 of the DER format which is binary). PEM stands for “Privacy-Enhanced Mail”.https://tools.ietf.org/html/rfc1421. It’s a very old format.

A simpler command without encryption would be :

$ openssl genrsa -out out.key 2048

Encrypted, the key looks like this (“RSA PRIVATE KEY” with headers):

-----BEGIN RSA PRIVATE KEY-----
Proc-Type: 4,ENCRYPTED
DEK-Info: DES-EDE3-CBC,7FC8C1B4B0FAC94A
6wj6oRh3NDj4aI5U2tI3YxhyZX++a9nhQ3c6uznDidO9pH91wYmMznfeLTNWLA5f
Q/X5cDWFUY05XNVLDCuVosLumJ76fLi6yq/ZZ3rCgbglV415jdc3ozl9q1UANtp0
…
-----END RSA PRIVATE KEY-----

It’s possible to see the “content” of the key using the tool “rsa” — it’s all about mathematics — :

$ openssl rsa -noout -text -inform PEM -in out.key
Private-Key: (2048 bit)
modulus:
 00:b2:e0:a7:95:98:f6:26:8c:32:09:04:a6:ac:9a:
 42:b8:1e:43:de:fe:f6:c1:f1:1a:0e:b2:0c:86:47:
 35:6b:46:e4:46:36:bf:ef:cc:34:c0:09:9f:77:eb:
...
publicExponent: 65537 (0x10001)
...

The full synopsis of genrsa is :

# genrsa - generate an RSA private key
openssl genrsa [-out filename] [-passout arg] [-des] [-des3] [-idea] [-f4] [-3] [-rand file(s)] [-engine id] [numbits]

By default, openssl outputs the PEM format. It’s a plain ASCII file with some specific header and footer, and a big string in-between. It’s useful to send them through email (not the private keys of course!) along some text, or even to send a message encrypted (PGP).

Bonus: create a key from another key

“rsa” can also be used to convert any key to any other key format. For instance, we could generate another private key based on our first:

$ openssl rsa -passin pass:ThEpWd -in out.key -out out-next.key

I’m not sure why . Is that more secured? I already stumbled upon this technic to create a certificate, instead of using a password protected key generated by genrsa. Please, let me know!

The full synopsis of rsa is :

# rsa — RSA key processing tool
openssl rsa [-inform PEM|NET|DER] [-outform PEM|NET|DER] [-in filename] [-passin arg] [-out filename] [-passout arg] [-sgckey] [-des] [-des3] [-idea] [-text] [-noout] [-modulus] [-check] [-pubin] [-pubout] [-engine id]

Step 2: create a certificate request

Now that we have our private key, we are going to create a certificate request. Both the key and the certificate request are needed to create a certificate.

It’s necessary to request one because you are not supposed to be the one who signs the certificate. It is the role of the Certificate Authorities.

We use “openssl req” to generate a .csr (Certificate Signing Request).

$ openssl req -new -key out.key -out out.csr -sha256
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter ‘.’, the field will be left blank.
-----
Country Name (2 letter code) [AU]:FR
State or Province Name (full name) [Some-State]:
Locality Name (eg, city) []:
Organization Name (eg, company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (eg, section) []:
Common Name (e.g. server FQDN or YOUR name) []:
...
  • We provide the private key to use (out.key)
  • We fulfill several metadata about the company (if self-signed, we don’t care a lot). They will be identifiable in the certificate.
  • We use a SHA-256 hash. If unspecified, it’s SHA-1 that will be used. SHA-1 creates a hash of 160 bits, and is deprecated due to some weaknesses. For instance, Chrome clearly displays a warning if the certificate is still using SHA-1. The new “standard” is SHA-256: it creates a hash of 256 bits.

The certificate request looks like this (“CERTIFICATE REQUEST”):

-----BEGIN CERTIFICATE REQUEST-----
MIICxTCCAa0CAQAwaTELMAkGA1UEBhMCRlIxDjAMBgNVBAgMBVBhcmlzMQ0wCwYD
VQQKDAROb25lMQ0wCwYDVQQLDAROb25lMQ0wCwYDVQQDDARKb2huMR0wGwYJKoZI
...
-----END CERTIFICATE REQUEST-----

We can verify what’s inside:

$ openssl req -in out.csr -text -verify -noout
verify OK
Certificate Request:
 Data:
 Version: 0 (0x0)
 Subject: C=FR, ST=Some-State, L=Paris, O=Internet Widgits Pty Ltd, CN=John Doe
...

The short synopsis of req is :

# req - PKCS#10 certificate request and certificate generating utility.
 openssl req [-inform PEM|DER] [-outform PEM|DER] [-in filename] [-out filename] [-text] [-pubkey] [-noout] [-verify] [-new] [-key filename] [-keyform PEM|DER] [-keyout filename] [-x509] [-days n]

PKCS stands for “Public Key Cryptographic Standards”. PKCS#10 is the format for request a certificate (https://tools.ietf.org/html/rfc2986).

Faster alternative — TLDR

It’s possible to create a certificate request with a private key in one shot:

$ openssl req -new -out out.csr -sha256

By default, it will generate a RSA 2048 bits key, ask you for a pass-phrase, and the private key will be output to “privkey.pem”.

To get rid of the defaults, one can use :

$ openssl req -new -nodes -out out.csr -keyout out.key -sha256
  • -nodes to not encrypt the key (no pass-phrase)
  • -keyout [filename] instead of “privkey.pem”

All the defaults of openssl are configurable: /etc/ssl/openssl.cnf

Step 3: generate the self-signed certificate

Finally, we create a self-signed certificate from our certificate request.

$ openssl x509 -req -in out.csr -signkey out.key -out out.crt -days 365 -sha256
Signature ok
subject=/C=AU/ST=Some-State/O=Internet Widgits Pty Ltd
Getting Private key
  • x509 is all about certificates standards
  • We provide the certificate request (-req -in) and the private key (that makes it self-signed)
  • We set it valid for a year
  • Do not forget to use SHA-256 to avoid browsers’ deprecation warning

The certificate looks like this:

-----BEGIN CERTIFICATE-----
MIIDTjCCAjYCCQD5/NymfWIDMzANBgkqhkiG9w0BAQsFADBpMQswCQYDVQQGEwJG
UjEOMAwGA1UECAwFUGFyaXMxDTALBgNVBAoMBE5vbmUxDTALBgNVBAsMBE5vbmUx
...
-----END CERTIFICATE-----

Again, we can inspect the content:

$ openssl x509 -in out.crt -text -noout
Certificate:
 Data:
  Version: 1 (0x0)
  Serial Number: 14039925222936737604 (0xc2d7d59a8230f744)
 Signature Algorithm: sha1WithRSAEncryption
  Issuer: C=AU, ST=Some-State, O=Internet Widgits Pty Ltd
  Validity
   Not Before: Jul 22 00:07:52 2016 GMT
   Not After : Jul 22 00:07:52 2017 GMT
 ...

The short synopsis of x509 is:

# x509 — Certificate display and signing utility
openssl x509 [-inform DER|PEM|NET] [-outform DER|PEM|NET] [-keyform DER|PEM] [-in filename] [-out filename] [-startdate] [-enddate] 
[-days arg] [-signkey filename] [-req] ...

Faster alternative — TLDR

We did all that, but it is possible to generate a self-signed certificate with its private key in one shot.

$ openssl req -x509 -nodes -new -keyout out.key -out out.crt -sha256
  • We add “-x509” to generate a self signed certificate and not a certificate request (if unspecified).

Aparte

Let’s just present some other openssl’s tools before going back to nginx and the certificate.

Random generator

If you have some bash and need nice generated strings (instead of $RANDOM or /dev/urandom), you can use rand :

$ openssl rand -hex 50
c15b5469946564a704777286ad81f6ca0c9c49a36c45a159174cc028c953f3ab1434977927fc9f11b6b45f4194f5ed2a6090
$ openssl rand 100 | od -An
 062351 017220 011763 071125 020723 127621 022632 037276
 066565 114650 073202 106612 120052 111556 075054 033665

Encrypt/Decrypt anything

“enc” can be used to encrypt and decrypt anything using any encryption algorithm existing on Earth.

For instance, encoding/decoding in base64 :

$ openssl enc -base64
use base64 to encode me!
[Ctrl+D]
dXNlIGJhc2U2NCB0byBlbmNvZGUgbWUhCg==
$ openssl enc -base64 -d
dXNlIGJhc2U2NCB0byBlbmNvZGUgbWUhCg==
[Ctrl+D]
use base64 to encode me!

It’s a bit useless because anybody can decrypt it, it’s not password-protected.

aes-256-cbc (NSA approved) does a nicer job:

$ openssl enc -aes-256-cbc -out encrypted.txt
enter aes-256-cbc encryption password:
Verifying — enter aes-256-cbc encryption password:
encrypt me!
[Ctrl+D]
$ xxd encrypted.txt
0000000: 5361 6c74 6564 5f5f 9448 1158 e593 c2b7 Salted__.H.X….
0000010: 67fb 442d af21 75ad 80e3 483a ff37 d8f9 g.D-.!u…H:.7..
$ openssl enc -aes-256-cbc -in encrypted.txt -d
enter aes-256-cbc decryption password:
encrypt me!

Nginx

Back to our certificate, let’s make it run on our website !

A minimal subset of changes to nginx would be to add this to one’s configuration :

server {
  listen 443;
  ssl on;
  ssl_certificate /tmp/out.crt;
  ssl_certificate_key /tmp/out.key;
}

You should not use /tmp of course. Considering using /etc/ssl/{cert,private} or /usr/local/share/ca-certificates to store your files.

Then restart nginx and enter the passphrase if any (beware of automation):

$ sudo service nginx restart
Restarting nginx:
Enter PEM pass phrase:
Enter PEM pass phrase:
nginx.

Then you will get a scary error in your browser. Either you won’t be able to do anything, or you will have a link at the bottom to continue.

If you have no link to continue, try to manually trust the certificate. On Windows, you can double-click on the .crt file to install it, or go manually in the stores :

On Debian and such, you can copy it to the appropriate folder :

$ sudo cp out.crt /usr/local/share/ca-certificates/out.crt
$ sudo update-ca-certificates
Updating certificates in /etc/ssl/certs… 
1 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d….done.

A better nginx configuration

nginx’s ssl configuration is a very important topic. Except when dealing with self-signed certificates because they should not be exposed to the Internet. But otherwise, you should definitely follow the advises of the following gist.

https://gist.github.com/plentz/6737338

  • One common thing to add, as it suggests, is to generate and use a dhparam file. It adds more randomness to the process and reduce the risk of disclosures.

Moreover, to test a installed certificate and nginx’s configuration, a good start is to submit the page’s address to https://www.ssllabs.com/ssltest/.

Now, openssl is not scary anymore.

Note: this is a cross-post of my medium post: https://medium.com/@ChtefiD/demystifying-openssl-b7b8dbcdd90a

From Apache Flume to Apache Impala using HDFS

March 31st, 2016 | cloudera, flume, hadoop, hbase, hdfs, impala |

Let’s use HDFS as a database !

So, we have data coming from one of our service, which is the source of a flume agent. Now, we want to be able to query them in a scalable fashion without using hbase or any other database, to be lean.

One way is to use HDFS as a database (Flume has a HDFS sink that handle partitioning), create a Hive table on top to query its content, and because we want something performant and fast, to actually use Impala to query the data using a Apache Parquet format.

Here’s a little diagram of the stack :

flume to impala

  • Apache Oozie is used to regularly export the HDFS content to a parquet format.
  • We are storing our data into HDFS in an Apache Avro format (Snappy compressed) because of all its advantages and because we are already using it everywhere.

Let’s review the stack one piece at a time, starting with the Flume configuration.

Flume

Let’s say we have a avro source where our custom service is sending events :

agent1.sources.events_source.type = avro
agent1.sources.events_source.bind = 0.0.0.0
agent1.sources.events_source.port = 9876

Let’s first configure the Flume HDFS sink where we are going to export our events, the configuration is pretty long but every piece has its importance :

agent1.sinks = ... s

agent1.sinks.s.type = hdfs
agent1.sinks.s.hdfs.path = /user/flume/events/ymd=%Y-%m-%d/h=%H
agent1.sinks.s.hdfs.inUsePrefix = .
agent1.sinks.s.hdfs.fileType = DataStream
agent1.sinks.s.hdfs.filePrefix = events
agent1.sinks.s.hdfs.fileSuffix = .avro
agent1.sinks.s.hdfs.rollInterval = 300
agent1.sinks.s.hdfs.rollSize = 0
agent1.sinks.s.hdfs.rollCount = 0
agent1.sinks.s.serializer = com.company.CustomAvroSerializer$Builder
agent1.sinks.s.channel = events_channel

Let’s review this config.

Partition

Because of our volume of data, we want to partition them per year-month-day then by hour. The “ymd=” and “h=” in the path are important, it represents the “column” name of the time dimension that will be queryable later.

Note that your Flume source must have a “timestamp” in the header for Flume to know what is the time dimension. If you don’t have this info, you can simply add hdfs.useLocalTimeStamp = true to use the ingestion time, but it’s discouraged because that means you don’t have any timestamp column in your data, and you’re going to get stuck later when doing some Impala partitioning.

Roll interval

We decide to roll a new file every 5min, and not based on size nor count (they have to be explicitely set to 0 because they have another default value).

By default, Flume buffers into a .tmp file we can’t rely on, and because we want to access the fresh data quickly, 5min is a good start. This is going to generate a bunch of file (144 per day), but we don’t care because we are going to export them later into a hourly parquet format and clean up the old HDFS content.

Moreover, if Flume suddenly dies, you are going to lose maximum 5min of data, instead of the whole buffer. Stopping properly Flume flushes the buffer hopefully.

File name

The inUsePrefix to “.” is to hide the working files to Hive during a query (it ignores the aka hidden files). If you don’t, some MapReduces can fail because at first, Hive saw a file Flume was buffering into (a .tmp), then the time to execute the MR, it was not there anymore (because of some flush), and kaboom, the MR will fail :

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://hadoop01:8020/user/flume/events/ymd=2016–03–17/h=15/events.1458223202146.avro.tmp

File type

By default, the filetype is SequenceFile. We don’t want that, because that makes Flume convert the output stream to a SequenceFile that Hive will not be able to read because the avro schema won’t be inside. Setting it to DataStream let the data sent unaltered.

FYI, a typical SequenceFile body :

SEQ♠!org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable ▒c(g▒s▒▒►▒▒▒|TF..

A snappy compressed avro file body :

Obj♦avro.schema▒8{"type":"record","name":"Order",...}avro.codecsnappy ▒▒T▒▒▒▒♣

Avro serializer

Our custom serializer is doing some conversion of the original event, and simply emits some avro using a DataFileWriter with the snappyCodec:

DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
dataFileWriter = new DataFileWriter<>(writer)
                    .setCodec(CodecFactory.snappyCodec())
                    .create(schema, out);

Multiple Flume ?

Be careful if multiple Flume are writing to the same location, the buffer is not shareable and you could have name conflicts. By default, Flume named the file with a timestamp (in milliseconds). That’s fine most of the time but you never know if they are going to collide one day. Consider having two different configurations with a different prefix or suffix to the filename.

Performance consideration

Don’t forget to monitor your Flume when adding a HDFS sink. The overhead is noticeable; there are a lot more I/O threads (by default it’s 10 but I noticed way more threads with VisualVM), and the CPU slightly increases.

Our Flume is now properly sinking to HDFS, let’s check it out.

HDFS

First thing is to verify the data are correct and readable.

We check the partitioning is present :

$ hdfs dfs -ls /user/flume/events/ymd=2016-01-29
/user/flume/events/ymd=2016-01-29/h=00
/user/flume/events/ymd=2016-01-29/h=01
...

Then we check a file to see if it’s parsable :

$ hdfs dfs -cat /user/flume/events/ymd=2016–03–14/h=15/events.1458147890641.avro
Objavro.schema�8{“type”:”record”,”name”:”OrderRecord”,”namespace”:”com.company.avro”,”fields”:[...

Our schema is there, it’s not a SequenceFile, good! Let’s use the avro tools to deserialize the content. It’s simply a .jar with some useful functions (getschema, tojson), downloadable here.

$ curl -sLO http://apache.crihan.fr/dist/avro/avro-1.7.7/java/avro-tools-1.7.7.jar
$ java -jar avro-tools-1.7.7.jar getschema events.1458147890641.avro
{
  "type" : "record",
  "name" : "Order",
  "namespace" : "com.company",
  "fields" : [ {
    "name" : "type",
    "type" : "string"
  }, {
...
$ java -jar avro-tools-1.7.7.jar tojson logs.1458147890641.avro
{"type":"AD12","customer_id":2458189, ...
{"type":"AD12","customer_id":9515711, ...

Our HDFS is in place and get streamed new data. Let’s now configure Hive to create a table on top of the files.

Hive

Avro is hopefully standard in Hadoop and Hive has everything to read avro files to create a table.

AvroSerDe

The magic happens when using the AvroSerDe (Avro Serializer Deserializer). It is used to read avro files to create a table, and vice-versa, to create avro files from a table (with some INSERT). It also detects and uncompresses the files when compressed with Snappy.

Under the hood, it’s simply using DataFile[Writer|Reader]<GenericRecord> code to read/write avro content.

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe

Create the table

We create a Hive external table mapped onto our .avro files using AvroSerDe and specifying a avro schema to read the data (it will probably be the same that the schema used to write the data BUT it could be different, such as only a slice):

CREATE EXTERNAL TABLE avro_events
PARTITIONED BY (ymd STRING, h INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/flume/events/'
TBLPROPERTIES ('avro.schema.literal' = '
{
 "type" : "record",
 "name" : "Order",
 "namespace" : "com.company",
 "fields" : [ {
   "name" : "type",
   "type" : "string"
 } ]
}');

The PARTITIONED BY matches our structure and names :

$ hdfs dfs -ls /user/flume/events/ymd=2016–03–17/h=12 
/user/flume/events/ymd=2016–03–17/h=12/events.1458212422229.avro
/user/flume/events/ymd=2016–03–17/h=12/events.1458212456756.avro
...

It’s possible to externalize the schema into its proper file on HDFS and use it :

TBLPROPERTIES ('avro.schema.url' = '/user/flume/Order.avsc');

You can generate the schema if you don’t have it (if it’s generated on the fly by the code with some Scala macros for instance):

$ java -jar avro-tools-1.7.7.jar getschema events.avro &gt; Order.avsc
$ hdfs dfs -put Order.avsc /user/flume/

ERR: Long schemas

If you have a looong schema, therefore a long query, you could end up with this error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:javax.jdo.JDODataStoreException: Put request failed : INSERT INTO “TABLE_PARAMS” (“PARAM_VALUE”,”TBL_ID”,”PARAM_KEY”) VALUES (?,?,?)
org.datanucleus.store.rdbms.exceptions.MappedDatastoreException: INSERT INTO “TABLE_PARAMS” (“PARAM_VALUE”,”TBL_ID”,”PARAM_KEY”) VALUES (?,?,?)
Caused by: org.postgresql.util.PSQLException: ERROR: value too long for type character varying(4000)

You must use an external schema in this case, your query is too long.

If you’re using a UI like Hue to do the query, and notice some weird things, use the hive shell, the error will be properly displayed, Hue won’t.

ERR: Partition and field names conflicts

If you have a column in your data which the name is the same as the partition dimensions name (ymd and h in our case), you are going to get this error :

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org.apache.hadoop.hive.ql.metadata.HiveException: Partition column name hour conflicts with table columns.

You need to change the partition column name you are using.

Notify Hive of the new partitions

Our data are in HDFS, our Hive table is mapped onto it, good. But Hive need to discover the data now, it’s not automatic because we are short-circuiting it (we are inserting data without any Hive INSERT).

msck repair table to the rescue. It simply looks in the folder to discover new directories and add them to the metadata.

hive> msck repair table avro_events;
OK
Partitions not in metastore: avro_events:ymd=2016–03–17/hour=12
Repair: Added partition to metastore avro_events:ymd=2016–03–17/h=12
Time taken: 2.339 seconds, Fetched: 1 row(s)

Hopefully, we are not going to do that manually each time we need, we are going to use Oozie later in the process, and directly via Impala.

Drop partitions

If you’re playing a bit and saw some “Partitions missing from filesystem” when doing a msck repair, you can remove the partitions (metadata) manually :

hive> alter table avro_events DROP IF EXISTS PARTITION (ymd='2016–03–21') PURGE;
Dropped the partition ymd=2016–03–21/h=13
Dropped the partition ymd=2016–03–21/h=14
Dropped the partition ymd=2016–03–21/h=15
Dropped the partition ymd=2016–03–21/h=16
Dropped the partition ymd=2016–03–21/h=17
Dropped the partition ymd=2016–03–21/h=18
Dropped the partition ymd=2016–03–21/h=19
Dropped the partition ymd=2016–03–21/h=20
Dropped the partition ymd=2016–03–21/h=21
Dropped the partition ymd=2016–03–21/h=22
Dropped the partition ymd=2016–03–21/h=23

Querying

You can already query the Hive table to get proper records.

For instance, if you have a column “timestamp”, it’s nice to check the latest ingested data time :

select max(`timestamp`) from avro_events;

Hive has some magic column we can used to get more insights about the table content. To know how many records each partition contains, we can use the virtual column INPUT__FILE__NAME :

hive> select INPUT__FILE__NAME, COUNT(*)
FROM events
GROUP BY INPUT__FILE__NAME
ORDER BY INPUT__FILE__NAME;
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      910
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      1572
hdfs://hadoop:8020/events/ymd=2016-03-22/h=00/...avro      4884

That can come in handy if you just want to look what’s inside a specific file :

hive> SELECT COUNT(*) FROM avro_events
      WHERE INPUT__FILE__NAME = '...';

Because we are not crazy enough to query through Hive, let’s focus on querying the data through Impala to get blazing fast responses.

Impala

We can use Impala to query the avro table, but for performance reason, we are going to export it into a Parquet table afterwards. This step is mostly to be aware that it’s possible.

Query the Hive avro table

If right now, in Impala, you do:

> show tables;

You won’t see the Hive table yet. You need to make Impala aware it’s there:

> INVALIDATE METADATA avro_events;

When it’s done, you’ll be able to query it, with frozen data still. It’s the same problem that with Hive.

For Impala to see the latest data of the existing partitions, we can use REFRESH :

> REFRESH avro_events;

But that won’t discover the new partitions still (ie: if Flume just created a new hour partition).

We have to use the equivalent of Hive’s msck repair table, but for Impala:

> ALTER TABLE avro_events RECOVER PARTITIONS;

This RECOVER PARTITIONS will do what REFRESH does (see the latest data of the existing partitions), but will discover the new ones too. I don’t know the impact and process time on big tables with tons of partitions.

Query a Parquet table

Because we want to be taken seriously, we want to store our data into Parquet files to get fast queries. Parquet stores the data in columns rather in rows, supporting über-fast filtering because it doesn’t need to parse every rows.

First, we need to create a partitioned Impala table stored in Parquet :

CREATE TABLE avro_events_as_parquet (type STRING, ...)
PARTITIONED BY(ymd STRING, h INT)
STORED AS PARQUETFILE;

It doesn’t have to follow the same partitioning than the Hive table but for the sake of clarity, it does.

It’s empty, we are going to fill it with the avro table. To do that, we are going to base our partition logic on a “timestamp” column you should have in your data. We can’t retrieve the partition values from the Hive table avro folders name because it’s not queryable.

-- We ensure we're viewing the latest partitions of the Hive table
-- where Flume is sinking its content

ALTER TABLE avro_events RECOVER PARTITIONS;
REFRESH avro_events;

-- We insert the data overwriting the partitions we are going to
-- write into, to be sure we don't append stuff to existing (that
-- would do duplicates and wrong stats!)

INSERT OVERWRITE avro_events_as_parquet
PARTITION(ymd, h)
  SELECT type, ..., 
    FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'yyyy-MM-dd') AS ymd,
    CAST(FROM_UNIXTIME(FLOOR(`timestamp`/1000), 'HH') AS INT) AS h
  FROM avro_events
  [ WHERE timestamp < $min AND timestamp > $max ];

-- We compute the stats of the new partitions
COMPUTE INCREMENTAL STATS avro_events_as_parquet;

We specify the partition values doing some transformation on our “timestamp” column (note: FROM_UNIXTIME uses the current locale).

The WHERE is not necessary the first time, you just want to load everything. Later, when you have a scheduled job every hour for instance, you want to filter which partition you want to write, thus the WHERE.

The end result is something like :

$ hdfs dfs -ls .../avro_events_as_parquet/ymd=2016–03–23/h=13
78.5 M avro_events_as_parquet/ymd=2016–03–23/h=13/644a2fbc8ad76fb2–2bd5c46c0a9a2cba_497985827_data.0.parq

A nice .parq file combining all the 5min-range avro files Flume created.

Now, because Flume is streaming, and because you want to query Impala without doing all the manual updates yourself, you’re planning a recurring Oozie job to take care of them.

Oozie

It’s quite straight-forward, it’s just an encapsulation of the scripts we just used.

First, we define a coordinator running every hour that will write the previous hour partition. (the coordinator could trigger the workflow every 5mins to have less lag in Impala if that’s necessary; the same Parquet partition would just be overwritten 12 times per hour with more and more data each time)

We take care of adding a small lag before the exact new hour, 12:00, and instead run the workflow at 12:05, to be sure Flume has flushed its data (the 5min is not random, it’s the same value as Flume rollInterval).

That means we can define a property in the coordinator.xml like that :

<property>
  <name>partitionDate</name>
  <value>${coord:dateOffset(coord:dateOffset(coord:nominalTime(), -1, 'HOUR'), -5, 'MINUTE')}</value>
</property>

If the workflow runs at 12:05, then partitionDate=2016–03–31T11:00Z. The partition ymd=2016–03–31/h=11 will contain every data in [11:00, 12:00).

Then, in the worflow, we create a <shell> action where we pass this value :

<shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>

    <exec>${insertOverwriteScript}</exec>
    <argument>${partitionDate}</argument>
    ...

And the script uses it to create the WHERE condition we just talked about in the Impala part using some shell script transformations (because with what we did, we just have a plain TZ date, but we need 2 timestamps at least (min and max), or any other time value we could use from our data if we have them, such as “day”, “hour”).

I assume you know Oozie and what you do to not provide the full scripts in there.

Improvements

I think the stack could be simplified using a serializer to export into Parquet directly to HDFS, but that would still create a bunch of tiny parquet files (because we want to flush a lot) so we would still need to merge them automatically at the end. What do you think?

Conclusion

I hope I made myself clear and that the process makes sense.

Don’t forget we did that just because we didn’t want to use any database, but only HDFS to store our data, for some reasons. We wanted to be able to query quickly through Impala to have “almost-realtime” data (with a coordinator every 5min for instance that would do it).

Another solution would be to sink Flume into HBase then query over it, or create a Hive/Impala table on top.

Oh, this post is missing some gifs.

If you’re not me, come say hi on twitter : https://twitter.com/chtefid

What’s the password again?

February 20th, 2016 | security, ssh |

Damn, I forgot the password.
– Me

This is a crosspost of my medium post.

In your company network, or home network, you have access to several computers, and you probably want to interact with some of them, start some tasks, check some logs, reboot them and so on.

Of course, you are already using ssh to do so.

If you are still typing a username and password each time you log in somewhere, stop it.

Passwords are obsolete

Stop losing your time, looking at your excel to get the password or such, and use the power of the public-key cryptography. Smart people did that for a reason.

The principle is simple :

  • you have a (private) key in a file on your computer
  • the other computer has your (public) key in a file
  • those keys are strongly related

When you are going to connect to the other computer, both ssh and sshd (the daemon waiting for connections) will talk to each other and will check if they can make some kind of association using the keys each of them have. If they succeed, it means your are the one with the good private key (only your computer stores it, nothing else), it knows you, you are how you claim to be, and therefore you have access.

Can be used with GitHub, BitBucket, anything

Note that this approach does not only apply to the networks where you ssh, but to the whole Internet such as when you are using GitHub, BitBucket, or anything that has a login and password.

You are not doing pure ssh command-line into them but you are pushing data into them. Data that need to be authenticated. That’s why they offer the possibility to add some public keys for your account (through their UI) that will be in pair with the private key you have stored on your local computer.

Time to do some hacking

Here is my key

First, we need to generate those keys on our local computer :

# ssh-keygen -t rsa
 Generating public/private rsa key pair.
 Enter file in which to save the key (/root/.ssh/id_rsa):
 Enter passphrase (empty for no passphrase):
 Enter same passphrase again:
 Your identification has been saved in /root/.ssh/id_rsa.
 Your public key has been saved in /root/.ssh/id_rsa.pub.

By default, we won’t put any passphrase, even if that can be considered as a bad practice.

That will generate a private key in the file /root/.ssh/id_rsa :

-----BEGIN RSA PRIVATE KEY-----
 MIIEowIBAAKCAQEA/gV3aLUTDenLFw7hkkfNcJT4pbnt7gQVcjga4Rik4+hIU6a6
 ...
 -----END RSA PRIVATE KEY-----

Never share this, never ever.

And its related public key in /root/.ssh/id_rsa.pub :

ssh-rsa AAAAB3NzaC1yc2EAAAADA...slxyu9Ki8Hn6jWR root@computer

The keys are related to your user and hostname. You are going to copy that line to any server you want to access. It will use the user@hostname to find out if it has a public key from you.

A message can be encrypted using this public key, only by its private key.

Note that it’s popular to add some options to ssh-keygen, to generate a stronger key and set explicitly the user name in id_rsa.pub such as :

$ ssh-keygen -t rsa -b 4096 -C "email@example.com"

It’s dangerous to go alone, take this

Now, we are going to copy our public key to the server. That will make it be able to identify ourself and let us the access.

The default path used by sshd is .ssh/authorized_keys in the home folder of the user you want to connect with on the server :

root@local:~# ssh john@server

Path is : /home/john/.ssh/authorized_keys

You can create it manually if it doesn’t exist yet.

Its content is quite simple, one line per public key (several users could connect as john on the server) :

$ cat .ssh/authorized_keys
 ssh-rsa AAAAB3NzaC1yc2EAAAADA...slxyu9Ki8Hn6jWR root@computer
 ssh-rsa 2cvUZEP4ZuMtElv/Iu6M6...w8Qoa4A3b8a+YNl trainee@macos

As soon as you save the file, you’ll be able to connect from your local computer on the server, no question asked.

root@local:~# ssh john@server
 john@server:~$

One-way only

This config is a one-way only. You can’t connect from the server to the local. If you’d like to do so, you must perform the same actions but from the other side (generate keys on the server and add the public key to the local computer).

But, do not considerer this. A user on a server should never be able to connect to another computer. You should always get exit to your local computer, then connect to the other server. Do not connect nodes between them, it can introduce security issues.

One computer to rule them all

Once you put your public key somewhere, never generate your keys set on your computer. Otherwise you’ll lose the ability to connect on the servers you set up, because the public key won’t “match” the private key. ssh-keygen will warn you if that’s the case, but better not to try. ;-)

.ssh/config

Even the username can be optional

Another great feature is to avoid to type the username you want to login with on the server.

Instead of :

root@local:~# ssh john@staging

It would be nice to do :

root@local:~# ssh staging

And have automatically “john” as user (because this is the one set up with the keys).

To do so, edit the file .ssh/config and add something like :

Host staging
  HostName staging.host.lan
  User john

You can add as many hosts as you want in this file. It’s simply some kind of mapping between a name you give “Host” and User@HostName.

Beyond hosts

This file is not only useful to declare Host mappings. It can contain way more configuration bits, such as :

Host *
  ServerAliveInterval 60

This is to avoid the famous “broken pipe” you’ll get if you are inactive in a ssh session. That will send keep alive packets to be sure the connection stays up.

Beyond ssh

The keys are not only used by ssh.

scp is also compatible (to copy files from/to another host):

$ scp -r staging:/tmp/logs .

It follows the same rule as ssh : using .ssh/config and the set of keys on both sides.

But it’s not enough

Note that for further security, your keys should have a passphrase (that we set to blank in our example), and that you should use ssh-agent to, again, avoid to type it when needed.

Come say hi on twitter : https://twitter.com/chtefid

Java CLI, GC, memory, and tools overview

January 10th, 2016 | gc, java, performance |

Back to the Java world, I’ve made my mind and knew that I didn’t know enough, that I wasn’t enough confident.

Therefore I’ve looked around myself at some “simple” aspects of Java (CLI, GC, tools) to consolidate my knowledge, and made this post to give a global overview of what the Java CLI has to offer, how to configure the memory heap, what are the GC principles and most useful options, and introduce some tools to debug and profile the JVM.

I assume you already know Java and know what do the Garbage Collector with the Young Generation and the Old Generation. Hopefully, this post will learn you some new tricks.

I won’t talk about framework here, just about the basics :

  • Java command line and options
  • Concise summary of the Garbage Collectors and theirs logs
  • Memory tuning and its limits
  • UI tools to debug and profile a JVM
  • CLI tools shipped with the JVM

java version “1.8.0_66″
Java(TM) SE Runtime Environment (build 1.8.0_66-b18)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b18, mixed mode)

Java flags overview

I’ll introduce some of the most useful flags we can enable with java to get more info and understand a bit more what’s going on under the hood.

Quick note :

  • -XX:+[option] : enable the following option
  • -XX:-[option] : disable the following option
  • -XX:[property]= : give a value to the property

-XX:+PrintCommandLineFlags

First, it’s interesting to know what are the default options of the JVM.

$ java -XX:+PrintCommandLineFlags -version
-XX:InitialHeapSize=268055680 -XX:MaxHeapSize=4288890880 -XX:+PrintCommandLineFlags
-XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:-UseLargePagesIndividualAllocation
-XX:+UseParallelGC
  • -XX:InitialHeapSize=268055680 : 256m, default to 1/64 of the RAM (alias for -Xms)
  • -XX:MaxHeapSize=4288890880 : 4g, default to 1/4 of the RAM (alias for -Xmx)
  • -XX:+UseParallelGC : Parallel GC (PSYoungGen (Parallel Scavenge) + ParOldGen). Checkout the GC chapter if it’s scary.
  • the others are for misc optimisations

-XX:+PrintFlagsFinal

We can list every existing flags and their values. There are hundreds of them.
Below is just the list of all the Print* we can use to display more info in the logs and some GC related ones.

$ java -XX:+PrintFlagsFinal -version
[Global flags]
...
bool PrintAdaptiveSizePolicy             = false      {product}
bool PrintCMSInitiationStatistics        = false      {product}
intx PrintCMSStatistics                  = 0          {product}
bool PrintClassHistogram                 = false      {manageable}
bool PrintClassHistogramAfterFullGC      = false      {manageable}
bool PrintClassHistogramBeforeFullGC     = false      {manageable}
bool PrintCodeCache                      = false      {product}
bool PrintCodeCacheOnCompilation         = false      {product}
bool PrintCommandLineFlags               = false      {product}
bool PrintCompilation                    = false      {product}
bool PrintConcurrentLocks                = false      {manageable}
intx PrintFLSCensus                      = 0          {product}
intx PrintFLSStatistics                  = 0          {product}
bool PrintFlagsFinal                    := true       {product}
bool PrintFlagsInitial                   = false      {product}
bool PrintGC                             = false      {manageable}
bool PrintGCApplicationConcurrentTime    = false      {product}
bool PrintGCApplicationStoppedTime       = false      {product}
bool PrintGCCause                        = true       {product}
bool PrintGCDateStamps                   = false      {manageable}
bool PrintGCDetails                      = false      {manageable}
bool PrintGCID                           = false      {manageable}
bool PrintGCTaskTimeStamps               = false      {product}
bool PrintGCTimeStamps                   = false      {manageable}
bool PrintHeapAtGC                       = false      {product rw}
bool PrintHeapAtGCExtended               = false      {product rw}
bool PrintHeapAtSIGBREAK                 = true       {product}
bool PrintJNIGCStalls                    = false      {product}
bool PrintJNIResolving                   = false      {product}
bool PrintOldPLAB                        = false      {product}
bool PrintOopAddress                     = false      {product}
bool PrintPLAB                           = false      {product}
bool PrintParallelOldGCPhaseTimes        = false      {product}
bool PrintPromotionFailure               = false      {product}
bool PrintReferenceGC                    = false      {product}
bool PrintSafepointStatistics            = false      {product}
intx PrintSafepointStatisticsCount       = 300        {product}
intx PrintSafepointStatisticsTimeout     = -1         {product}
bool PrintSharedArchiveAndExit           = false      {product}
bool PrintSharedDictionary               = false      {product}
bool PrintSharedSpaces                   = false      {product}
bool PrintStringDeduplicationStatistics  = false      {product}
bool PrintStringTableStatistics          = false      {product}
bool PrintTLAB                           = false      {product}
bool PrintTenuringDistribution           = false      {product}
bool PrintTieredEvents                   = false      {product}
bool PrintVMOptions                      = false      {product}
bool PrintVMQWaitTime                    = false      {product}
bool PrintWarnings                       = true       {product}
...
bool UseParNewGC                         = false      {product}
bool UseParallelGC                      := true       {product}
bool UseParallelOldGC                    = true       {product}
...

The := means that the default value was overridden by something (you or the JVM Ergonomics).
You can see the JVM Ergonomics estimates java could use the ParallelGC on my PC.

Moreover, you can know what is the value of any flag the JVM handles.
For instance, you can find out what is the Young Generation size (“NewSize”) doing some | grep NewSize :

uintx MaxNewSize                       := 1430257664  {product}
uintx NewSize                          := 89128960    {product}

More details on how to read that on javaworld or codecentric.

Get more details in the logs

As a reminder :

-XX:+PrintGC / -verbose:gc

This is the first step to know what’s going on with your program and its GC.

[GC (Allocation Failure)  954K->896K(1536K), 0.0008951 secs]
[Full GC (Ergonomics)  896K->290K(1536K), 0.0026976 secs]
[GC (Allocation Failure)  778K->290K(1536K), 0.0006170 secs]

We can see the total heap going from 954K to 896K, with a max at 1536K for instance.

  • Allocation Failure : the JVM couldn’t find any more space in the Young Generation and had to clean it up. This is a normal behavior.
  • Ergonomics : the JVM decided to start a Full GC on its own.
  • Metadata GC Threshold : Metaspace size is exhausted. Raise the default MetaspaceSize and maybe the max MaxMetaspaceSize.

-XX:+PrintGCDetails

It’s more interesting, you still see the heap size changes, but you also see the young generation PSYoungGen, the old generation ParOldGen, and the Metaspace changes (because I was running with the Parallel GC, it’s different according to which GC is used).

[GC (Allocation Failure)
  [PSYoungGen: 465K->400K(1024K)]
  954K->896K(1536K), 0.0011948 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Ergonomics)
  [PSYoungGen: 400K->0K(1024K)]
  [ParOldGen: 496K->290K(512K)]
  896K->290K(1536K),
  [Metaspace: 2520K->2520K(1056768K)], 0.0032720 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC (Allocation Failure)
  [PSYoungGen: 488K->0K(1024K)]
  778K->290K(1536K), 0.0010046 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]

-XX:+PrintReferenceGC

This options works with the previous one.
It adds information about the different *Reference type variables (Soft, Weak, Final, Phantom, JNI) the program might used.

The PhantomReference are quite tricky to work with the GC, be careful. But if you’re using them, I’m pretty sure you know it, right ? plumbr has some nice tips about it.

[GC (Allocation Failure)
  [SoftReference, 0 refs, 0.0003665 secs]
  [WeakReference, 9 refs, 0.0001271 secs]
  [FinalReference, 7 refs, 0.0001104 secs]
  [PhantomReference, 0 refs, 0 refs, 0.0001707 secs]
  [JNI Weak Reference, 0.0002208 secs]
  [PSYoungGen: 465K->400K(1024K)]
  954K->896K(1536K), 0.0026939 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]

-XX:+PrintGCTimeStamps / -XX:+PrintGCDateStamps

It’s useful to know when things happen and how often.
The date is useful to be able to match easily with other logs.

2016-01-11T01:12:48.878+0100: 1.071: [GC (Allocation Failure)  954K->928K(1536K), 0.0020453 secs]
2016-01-11T01:12:48.878+0100: 1.072: [Full GC (Ergonomics)  928K->290K(1536K), 0.0031099 secs]
2016-01-11T01:12:49.883+0100: 2.075: [GC (Allocation Failure)  778K->290K(1536K), 0.0012529 secs]

-XX:+PrintGCApplicationStoppedTime

It’s useful to know how much time your application didn’t do anything, because the World was Stopped.
You really want to minimize those times.

Total time for which application threads were stopped: 0.0000492 seconds,
  Stopping threads took: 0.0000179 seconds
Total time for which application threads were stopped: 0.0033140 seconds,
  Stopping threads took: 0.0000130 seconds
Total time for which application threads were stopped: 0.0004002 seconds,
  Stopping threads took: 0.0000161 seconds

-XX:+PrintAdaptiveSizePolicy

This displays some metrics about survivals and promotions that the JVM Ergonomics is using to tune and optimize the GC behavior (by modifying space sizes).

[GC (Allocation Failure)
  AdaptiveSizePolicy::update_averages:  survived: 409616  promoted: 8192  overflow: false
  AdaptiveSizeStart: 1.087 collection: 1
  PSAdaptiveSizePolicy::compute_eden_space_size: costs minor_time: 0.000377 major_cost: 0.000000
    mutator_cost: 0.999623 throughput_goal: 0.990000 live_space: 268845056 free_space: 1048576
    old_eden_size: 524288 desired_eden_size: 524288
  AdaptiveSizeStop: collection: 1
 954K->896K(1536K), 0.0022971 secs]

-XX:-UseAdaptiveSizePolicy

The JVM Ergonomics tries to enhance the latency and the throughput of your application by tuning the GC behavior such as modifying the space sizes.
You can disable this behavior if you know you don’t need it.

And you can still have the details about the survivors and promotions if combined with the previous flag.

$ java -XX:+PrintAdaptiveSizePolicy -XX:-UseAdaptiveSizePolicy -XX:+PrintGC ...
[GC (Allocation Failure)
  AdaptiveSizePolicy::update_averages:  survived: 442384  promoted: 8192  overflow: false
  954K->928K(1536K), 0.0027480 secs]

Memory tuning

Heap size

Heap = Young Generation (Eden + Survivors) + Old Generation (Tenured)

This is the big part that you can impact for the better or for the worse.
If you think you need to change it, you need to be sure it’s necessary, know the existing GC cycles, know that you have reached the limit (or not).
Or you can just give a try and check the behavior, latency, and throughput of your application. ;-)

  • -Xms / -XX:InitialHeapSize : initial heap size
  • -Xmx / -XX:MaxHeapSize : maximum heap size

The MaxHeapSize influences the InitialHeapSize up until 256m.

if MaxHeapSize=2m   then InitialHeapSize=2m   (max)
if MaxHeapSize=256m then InitialHeapSize=256m (max)
if MaxHeapSize=512m then InitialHeapSize=256m (half)

Default size

As we already said, the default MaxHeapSize is 1/4 of the machine RAM, and the InitialHeapSize is 1/64.

For instance, on my machine, I have 16GB of RAM, that gives :

InitialHeapSize = 268435456  = 256m
MaxHeapSize     = 4290772992 = 4092m

Be careful with big numbers and PrintFlagsFinal, it won’t display them properly if greater than 4094m, because it displays them as uint, thus the limit is 4,294,967,295.

$ java -Xmx4g -XX:+PrintFlagsFinal -version |  grep "MaxHeapSize"
 uintx MaxHeapSize                              := 0                                   {product}

Minimum size

The minimum heap size you can set is 1M (your program doesn’t do much but it’s still possible!).

If you try to put less, you’ll end up with :

Too small initial heap

But actually, even if you ask for 1m, you’ll end with 2m :

$ java -Xmx1m -XX:+PrintFlagsFinal -version |  grep HeapSize
    uintx InitialHeapSize                          := 2097152                             {product}
    uintx MaxHeapSize                              := 2097152                             {product}

You will always get a MaxHeapSize divisible by 2.

Not enough heap ?

And if your program needs more heap and can’t find it with a lovely OOM :

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Young generation (in heap)

Young Generation = Eden + Survivors

This part of the heap is where all variables start their lifecycle. They are born here, and they will likely evolve to survivors, then ends up in the old generation; if alive long enough.

  • -XX:NewSize : young generation initial size
  • -XX:MaxNewSize : young generation maximum size
  • -Xmn : shortcut for both

The MaxHeapSize and InitialHeapSize influences the MaxNewSize and NewSize.

if MaxHeapSize=1g (InitialHeapSize=256m) then MaxNewSize=341m       and NewSize=85m
if MaxHeapSize=4g (InitialHeapSize=256m) then MaxNewSize=1365m (x4) and NewSize=85m
if MaxHeapSize=4g and InitialHeapSize=1g then MaxNewSize=1365m      and NewSize=341m (x4)

By default, the ratio is 3/1 between MaxHeap/MaxNewSize and InitialHeapSize/NewSize.

Default size

We just saw that NewSize/MaxNewSize are linked to InitialHeapSize/MaxHeapSize.

The default of MaxHeapSize is 1/4 of the machine RAM, and the InitialHeapSize is 1/64.
Therefore, the default of MaxNewSize is (1/4)/3 of the RAM, and NewSize is (1/4)/3 of InitialHeapSize.

On my machine, I have 16GB of RAM, that gives :

InitialHeapSize = 256m
MaxHeapSize     = 4092m

MaxNewSize = 1430257664 = 1364m (= 4092m/3)
NewSize    = 89128960   = 85m   (= 256m/3)

Minimum size

You can’t have MaxNewSize < NewSize : [code] Java HotSpot(TM) 64-Bit Server VM warning: NewSize (1536k) is greater than the MaxNewSize (1024k). A new max generation size of 1536k will be used. [/code] The 1536k will be equally separated between the Eden Space, the from Survivor space, and the to Survivor space. (512k each)

You can’t neither have the MaxNewSize >= HeapSize (young gen size can’t be greater than the total heap size) :

$ java -Xmx2m -XX:MaxNewSize=2m -XX:+PrintFlagsFinal -version | grep NewSize
Java HotSpot(TM) 64-Bit Server VM warning:
MaxNewSize (2048k) is equal to or greater than the entire heap (2048k).
A new max generation size of 1536k will be used.
    uintx MaxNewSize                               := 1572864                             {product}
    uintx NewSize                                  := 1572864                             {product}

Not enough space ?

Even if you have a MaxNewSize of 1m and your program tries to allocate 1g bytes, it will work if you have a big enough heap size. The allocation will just directly go into the old generation space.

Thread Stack (off heap)

Each and every threads in the program will allocate this size for their stack.

This is where they will store the function parameters values they are currently executing (and they are removed when the function exits). The deeper calls you do, the deeper you’re going into the stack. (FILO)

Recursive calls can go very deeply because of their intrinsic natures. This is where you have to be careful in your logic and maybe increase the default ThreadStackSize.

  • -Xss / -XX:ThreadStackSize : thread stack size

Default size

If you look yourself, you’ll find out it’s 0 :

$ java -XX:+PrintFlagsFinal -version | grep ThreadStackSize
intx ThreadStackSize                           = 0                                   {pd product}

0 means it will fallback to the OS default.
Check out Oracle website for unix or Windows, it’s between 320kB and 1MB.

Minimum size

The usage says you must specify at least 108k, but you can run *only* with 65k to start a simple program.

The stack size specified is too small, Specify at least 108k
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

StackOverflow

Be specially careful with recursion and its stop condition :

Exception in thread "main" java.lang.StackOverflowError

A simple program with a recursive function taking 2 int parameters can be called up to :

ThreadStackSize at 65k  : 888 times
ThreadStackSize at 130k : 1580 times
ThreadStackSize at 260k : 2944 times

But the more parameters to the function you add, the less times you’ll be able to call it.

Metaspace (off heap)

This is where the class definitions are. It’s a special space because class definitions are not mutable, they are loaded once and for all.

You will probably never touched the default configuration.

  • (Java < 8) -XX:MaxPermSize (it was a fixed size, difficult to estimate)
  • (Java >= 8) -XX:MaxMetaSpaceSize (is unlimited by default)

Default size

As we said, it’s unlimited.
Well, if we look closely, it’s defined at 4GB for me :

$ java -XX:+PrintFlagsFinal -version | grep Metaspace
{product}
uintx MaxMetaspaceSize                          = 4294901760                          {product}
uintx MetaspaceSize                             = 21807104                            {pd product}

MaxHeapSize has no impact on it, it’s a memory off-heap.

Minimum size

You can’t set a too small size (< 5m) otherwise you'll end up with those errors : [code] Error occurred during initialization of VM OutOfMemoryError: Metaspace - java.lang.OutOfMemoryError: Metaspace <<no stack trace available>> - Exception in thread "main" java.lang.OutOfMemoryError: Metaspace [/code] But it's a good idea to set a max size, to be sure the JVM will never take an “unlimited” memory (and breaks the other apps on the server) in case of some bugs.

But you can mess with it

I have a program that create dynamically classes on the fly and load them up. (https://gist.github.com/chtefi/018493089f4c75f36662)

$ java -XX:MaxMetaspaceSize=10m
-Djava.home="C:\Program Files\Java\jdk1.8.0_60\jre"
-classpath "C:\wip\out\production\test"
com.company.TestMetaspaceLimit

With 10m of metaspace, it crashes around 300 classes loaded (it’s not a lot in the Java world if you’re using some frameworks).

If you enable the GCDetails logs, you’ll see a lot of cool errors :

[Full GC (Metadata GC Threshold)
  [PSYoungGen: 64K->0K(1396224K)]
  [PSOldGen: 6548K->6548K(2793472K)] 6612K->6548K(4189696K),
  [Metaspace: 9954K->9954K(1058816K)], 0.0174232 secs]
  [Times: user=0.02 sys=0.00, real=0.02 secs]
[GC (Last ditch collection)
  [PSYoungGen: 0K->0K(1396224K)] 6548K->6548K(4189696K), 0.0006371 secs]
  [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC (Last ditch collection)
  [PSYoungGen: 0K->0K(1396224K)]
  [PSOldGen: 6548K->6548K(2793472K)] 6548K->6548K(4189696K),
  [Metaspace: 9954K->9954K(1058816K)], 0.0183340 secs]
  [Times: user=0.01 sys=0.00, real=0.02 secs]

Garbage Collectors

Each GC deals differently with the Young Generation Space (new objects) and the Old Generation Space (variables referenced since a while), because the Young is a very fast-paced space, not the old one.

The Young Generation space should never be too big, 2GB seems like a good limit. Otherwise, algorithms could not be as performant when processing it.

-XX:+UseSerialGC

It’s the basic GC : the Serial GC.

It’s using a single-core and Stop the World when processing.

  • Young Generation : Mark + Copy (using survivor spaces) / in the logs : DefNew
  • Old Generation : Mark + Sweep + Compact / in the logs : Tenured

Example of output :

[GC (Allocation Failure)
  [DefNew: 652352K->81514K(733888K), 0.2248788 secs]
  1630766K->1620630K(2364580K), 0.2255284 secs]
  [Times: user=0.19 sys=0.03, real=0.22 secs]
[GC (Allocation Failure)
  [DefNew: 733839K->81489K(733888K), 0.2495329 secs]
  [Tenured: 2180251K->1993562K(2180276K), 0.3855474 secs]
  2272954K->1993562K(2914164K),
  [Metaspace: 2765K->2765K(1056768K)], 0.6373276 secs]
  [Times: user=0.55 sys=0.09, real=0.64 secs]

-XX:+UseParallelGC -XX:+UseOldParallelGC

The Parallel GC.

It’s an evolution of the Serial one.
It’s doing the same, but faster because it’s using multiple cores to do the job.
And again, it’s Stopping the World when processing.

With Java 8, specifying -XX:+UseParallelGC automatically set -XX:+UseOldParallelGC.

  • Young Generation : Parallel Mark + Copy (using survivor spaces) / in the logs : PSYoungGen
  • Old Generation : Parallel Mark + Sweep + Compact / in the logs : ParOldGen

Example of output :

[GC (Allocation Failure)
  [PSYoungGen: 76221K->10729K(141824K)]
  127345K->126994K(316928K), 0.0173292 secs]
  [Times: user=0.05 sys=0.02, real=0.02 secs]
[Full GC (Ergonomics)
  [PSYoungGen: 10729K->0K(141824K)]
  [ParOldGen: 116265K->126876K(287744K)]
  126994K->126876K(429568K),
  [Metaspace: 2742K->2742K(1056768K)], 0.0224399 secs]
  [Times: user=0.03 sys=0.00, real=0.02 secs]
  • GC (Allocation Failure) : a minor GC (Young generation) was done because space was not available
  • Full GC (Ergonomics) : the JVM decided to do a Full GC (Young + Old generations) because of some thresholds

But you can force-disable it by doing -XX:-UseOldParallelGC : you’ll end up using the PSOldGen old generation collector. It’s not parallel anymore but serial (as the SerialGC). You should probably not used it.

You can control how many threads the parallel phases are using -XX:ParallelGCThreads=N.
It is by default the number of cores the computer has. (must be at least 1)

-XX:+UseParNewGC -XX:+UseConcMarkSweepGC

The Concurrent Mark and Sweep GC.

It’s an evolution of the Parallel GC. This time, it’s not a Stop The World algo everywhere.
It can collect the old generation concurrently while the application is still running, meaning you should have a better latency.

ParNewGC, while collecting the young generation, will send some stats to the ConcMarkSweepGC, that will estimate if it should run a GC (according to the trend of the promotion rates in the young generation). This is why the CMS works with this one and not the classic parallel UseParallelGC.

Moreover, while being mostly concurrent, it has just of few phases where it still must Stop The World but they are very short period, contrary to the previous algorithms.

With Java 8, specifying -XX:+UseConcMarkSweepGC automatically set -XX:+UseParNewGC.

  • Young Generation : Mark + Copy (using survivor spaces) / in the logs : ParNew
  • Old Generation : Mark + Sweep : do NOT Stop the World (mostly) / in the logs : CMS Initial Mark, CMS Final Remark

Example of output (times were removed for the sake of clarity) :

[GC (CMS Initial Mark) [1 CMS-initial-mark: 1446700K(1716408K)] 1456064K(1795064K), 0.0006139 secs]
[CMS-concurrent-mark-start]
[CMS-concurrent-mark: 0.014/0.014 secs]
[CMS-concurrent-preclean-start]
[CMS-concurrent-preclean: 0.003/0.003 secs]
[CMS-concurrent-abortable-preclean-start]
[CMS-concurrent-abortable-preclean: 0.021/0.381 secs]
[GC (CMS Final Remark)
  [YG occupancy: 14224 K (78656 K)]
  [Rescan (parallel) , 0.0013603 secs]
  [1 CMS-remark: 1585968K(1716408K)] 1600193K(1795064K), 0.0032058 secs]
[CMS-concurrent-sweep-start]
[CMS-concurrent-sweep: 0.003/0.003 secs]
[CMS-concurrent-reset-start]
[CMS-concurrent-reset: 0.004/0.004 secs]

The Stop The World events happen during the CMS Initial Mark and CMS Final Remark.

You can have notice that the Old Generation is not compacted at the end, meaning it can still exists holes in the memory that is not used because it’s too small.

If Java can’t find any more memory because of that, it will trigger a Full GC by calling the Parallel GC (well, a GC that does the compaction but Stop The World). Moreover, this can happen also when a CMS compaction is currently in progress (concurrently), then suddenly, a lot of survivors are promoted to the old generation and boom, no more space.
This is why the CMS must be triggered way before the space is filled.

It is the role of the flag +XX:CMSInitiatingOccupancyFraction, by default, it’s around 92% according to Oracle.

Moreover, you can control how many threads to use for the concurrent part using -XX:ConcGCThreads=N. (measure before change)

-XX:+UseG1GC

The latest Java HotSpot VM GC.

It handles the space differently compared to its predecessors (being closer to the ConcMarkSweepGC).

There is no more Young and Old regions only. There are a bunch of regions of different sizes (certains will be automatically resized on the fly by the GC to enhance performance), each of them deal with only one type of generation : an Eden, a Survivor, or a Old. (and some with Humongous objects : they are so big they span on several regions). It targets around 2000 regions, each of them between 1MB and 32MB.

It is oriented for quite big heaps (> 4GB), and for small latency environments : you specify how much pause time (max) you desire for GCs (default: 0.5s).

It is mostly concurrent (does not affect too much the latency of the application) and parallel (for the Stop the World phases), but is a bit more computing intensive (compute stats to enhance behavior, predict what to clean, to reach the desired pause time).

It’s a bit more complicated than the others, you can refer to those two great resources to get more details : Getting Started with the G1 Garbage Collector, and Garbage First Garbage Collector Tuning.

It’s a bit like the CMS GC :

  • you have a STW Mark
  • then a concurrent scan from the marked references
  • then a STW Remark (to take into account the updates since the Mark)
  • then the cleaning and copy of regions

Example of output :

[GC pause (G1 Evacuation Pause) (young) 1478M->1475M(3764M), 0.0540170 secs]
[GC pause (G1 Evacuation Pause) (young) 1767M->1766M(3830M), 0.0581689 secs]
[GC pause (G1 Evacuation Pause) (young) (initial-mark) 2105M->2106M(3830M), 0.0674928 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0004460 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0153593 secs]
[GC remark, 0.0065189 secs]
[GC cleanup 2126M->2114M(3830M), 0.0021820 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0001478 secs]
[GC pause (G1 Evacuation Pause) (young) 2483M->2484M(3830M), 0.0773962 secs]
[GC pause (G1 Evacuation Pause) (mixed) 2620M->2586M(3830M), 0.0467784 secs]
[GC pause (G1 Evacuation Pause) (young) 3029M->3023M(3830M), 0.0782551 secs]
[GC pause (G1 Evacuation Pause) (young) (initial-mark) 3248M->3237M(3830M), 0.0752451 secs]
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0003445 secs]
[GC concurrent-mark-start]
[GC concurrent-mark-end, 0.0189316 secs]
[GC remark, 0.0083292 secs]
[GC cleanup 3278M->2968M(3830M), 0.0026447 secs]
[GC concurrent-cleanup-start]
[GC concurrent-cleanup-end, 0.0004819 secs]
[GC pause (G1 Evacuation Pause) (young) 3082M->3078M(3830M), 0.0309070 secs]
[GC pause (G1 Evacuation Pause) (mixed) 3245M->3078M(3830M), 0.0408398 secs]
  • G1 Evacuation Pause : copy alive objects (Eden or Survivors) to another region(s) compacting them and promoting them if old enough (to an Old Generation region). It’s a Stop The World process
  • concurrent-* : marks and scan alive objects and do some cleaning while the application is still running
  • (mixed) : both young and old generations copied (“evacuated”) elsewhere at the same time

Profiling

ASCII profiling

If you’re an hardcore player, you can use the Java agent hpref to retrieve a human-readable heap dump with the Java profile of your application (when it ends).
It’s bundled by default in the HotSpot JVM.

$ java -agentlib:hprof=heap=sites com.company.MyApp

That will generate a file java.hprof.txt where you can easily find out what are the most expensive function calls :

SITES BEGIN (ordered by live bytes) Tue Jan 12 22:38:06 2016
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1 14.87% 14.87%   2103552 30499   2103552 30499 302579 char[]
    2 10.35% 25.21%   1463952 30499   1463952 30499 302580 com.sun.tools.javac.file.ZipFileIndex$Entry
    3  9.27% 34.48%   1311424   11   1311424    11 301304 com.sun.tools.javac.util.SharedNameTable$NameImpl[]

So, it seems I’ve allocated a tons of char[] (2M, 31000 objects).
To know the callstack, find the trace value in the file, you’ll end up with something like this :

TRACE 302579:
	java.lang.StringCoding$StringDecoder.decode(<Unknown Source>:Unknown line)
	java.lang.StringCoding.decode(<Unknown Source>:Unknown line)
	java.lang.String.<init>(<Unknown Source>:Unknown line)
	com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry(ZipFileIndex.java:665)

Et voilà, this is it. (it was not my fault!)

Another option is to collect function calls count and CPU usage using cpu=times :

$ java -agentlib:hprof=cpu=times com.company.MyApp
...
TRACE 312480:
	com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry(ZipFileIndex.java:Unknown line)
	com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.buildIndex(ZipFileIndex.java:Unknown line)
	com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.access$000(ZipFileIndex.java:Unknown line)
	com.sun.tools.javac.file.ZipFileIndex.checkIndex(ZipFileIndex.java:Unknown line)
...
CPU TIME (ms) BEGIN (total = 17046) Tue Jan 12 22:52:08 2016
rank   self  accum   count trace method
   1  3.64%  3.64%   30711 312480 com.sun.tools.javac.file.ZipFileIndex$ZipDirectory.readEntry
   2  2.53%  6.17%    7392 312914 java.io.WinNTFileSystem.normalize
   3  2.38%  8.54%    3984 301205 java.lang.String$CaseInsensitiveComparator.compare
   4  2.09% 10.64%  324312 301204 java.lang.String.charAt

In a few seconds :

  • 30711 calls to ZipDirectory.readEntry
  • 324312 calls to String.charAt

That’s quite straight-forward, can be processed by third-party tools, or gather for comparaisons.

If you want the live stats, this is not your tool. An IDE with a true Profiler will be a better solution.
But anyway, that can come in handy !

There are a few more options, check out hprof.

JMX

A nice and easy way to get in-depth of your live program (local or distant), is to enable JMX when starting the application. JMX can be secured, but if you don’t want to be bother with that, start the JVM with those settings :

  • -Dcom.sun.management.jmxremote.port=5432
  • -Dcom.sun.management.jmxremote.authenticate=false
  • -Dcom.sun.management.jmxremote.ssl=false

If will expose its internals through the protocol JMX on port 5432.

You need a program to read from it. Fortunately, there is one installed by default : jvisualvm.
Just start your Java program somewhere, then start jvisualvm.

If on the same computer, it will automatically find it.
Install the plugin VisualGC if you don’t have it, to monitor the GC in details it’s a win.
You can even do the CPU and Memory profiling live.


Alternatives exist (ofc) such as JProfiler and YourKit. Check out the difference here.

You can also use jconsole (shipped with java). You don’t even need to start your process with JMX, jconsole can just plug itself.

Java CLI tools

The HotSpot JVM has some useful console tools shipped within too.

If you encounter any odd errors, ensure you have access to folder /tmp/hsperfdata_<username> <username> being the user that started the Java process.

jps

List the java process running on the machine. (remember doing ps aux | grep java ?)

$ jps
11080 Launcher
11144 Jps
12140 TestMetaspaceLimit
$ jps -lvV
11080 org.jetbrains.jps.cmdline.Launcher -Xmx700m -D...
12140 com.company.TestMetaspaceLimit -Djava.home=C:\Program Files\Java\jdk1.8.0_60\jre -D...
6028 sun.tools.jps.Jps -Dapplication.home=C:\Program Files\Java\jdk1.8.0_60 -Xms8m

Official documentation.

jstat

Monitor some aspects of a running JVM (no JMX needed).

List of aspects :

-class
-compiler
-gc
-gccapacity
-gccause
-gcmetacapacity
-gcnew
-gcnewcapacity
-gcold
-gcoldcapacity
-gcutil
-printcompilation

Monitor the GC, show the timestamp in front, pull every 1s :

$ jstat -gc -t 7844 1s
Timestamp        S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
           14,0 40960,0 53248,0 40947,8  0,0   506880,0 343724,3  175104,0   62801,2   13440,0 12979,7 1664,0 1552,3      8    0,144   0      0,000    0,144
           15,1 40960,0 53248,0 40947,8  0,0   506880,0 454765,2  175104,0   62801,2   13440,0 12979,7 1664,0 1552,3      8    0,144   0      0,000    0,144
           16,1 77824,0 53248,0  0,0   53240,9 506880,0 40423,7   175104,0   104781,8  13952,0 13581,6 1664,0 1596,0      9    0,203   0      0,000    0,203

Official documentation.

jinfo

Get the value of any flag of a running Java process.

$ jinfo -flag MaxHeapSize 5044
-XX:MaxHeapSize=4290772992

Official documentation.

jstack

Get the current stack trace of all the threads of a running Java process.
Useful if you wonder what’s going on with a process.

$ jstack 1204
...
"main" #1 prio=5 os_prio=0 tid=0x0000000002c9e000 nid=0x2d88 runnable [0x000000000347e000]
   java.lang.Thread.State: RUNNABLE
        at java.io.RandomAccessFile.length(Native Method)
        at java.io.RandomAccessFile.skipBytes(Unknown Source)
        at com.sun.tools.javac.file.ZipFileIndex.readBytes(ZipFileIndex.java:381)
        ...
        at com.sun.tools.javac.api.JavacTaskImpl.call(JavacTaskImpl.java:138)
        at com.company.TestMetaspaceLimit.createClass(TestMetaspaceLimit.java:42)
        at com.company.TestMetaspaceLimit.main(TestMetaspaceLimit.java:28)
...

Official documentation.

There is also a nice tool to easily parse the content of jstack: mjprof

$ jstack -l 38515 | mjprof contains/name,Main/

Check its README, it has a lot of different operators that can be chained to get only what you want.

jmap

You can display the config and usage of the whole heap.
It’s useful but I think it’s more useful to use some UI to monitor that such as jvisualvm or jconsole.

$ jmap -heap 11080
Attaching to process ID 11080, please wait...          
Debugger attached successfully.                        
Server compiler detected.                              
JVM version is 25.60-b23                               
                                                       
using thread-local object allocation.                  
Parallel GC with 4 thread(s)                           
                                                       
Heap Configuration:                                    
   MinHeapFreeRatio         = 0                        
   MaxHeapFreeRatio         = 100                      
   MaxHeapSize              = 734003200 (700.0MB)      
   NewSize                  = 89128960 (85.0MB)        
   MaxNewSize               = 244318208 (233.0MB)      
   OldSize                  = 179306496 (171.0MB)      
   NewRatio                 = 2                        
   SurvivorRatio            = 8                        
   MetaspaceSize            = 21807104 (20.796875MB)   
   CompressedClassSpaceSize = 1073741824 (1024.0MB)    
   MaxMetaspaceSize         = 17592186044415 MB        
   G1HeapRegionSize         = 0 (0.0MB)                
                                                       
Heap Usage:                                            
PS Young Generation                                    
Eden Space:                                            
   capacity = 67108864 (64.0MB)                        
   used     = 8111152 (7.7353973388671875MB)           
   free     = 58997712 (56.26460266113281MB)           
   12.08655834197998% used                             
From Space:                                            
   capacity = 11010048 (10.5MB)                        
   used     = 6575688 (6.271064758300781MB)            
   free     = 4434360 (4.228935241699219MB)            
   59.72442626953125% used                             
To Space:                                              
   capacity = 11010048 (10.5MB)                        
   used     = 0 (0.0MB)                                
   free     = 11010048 (10.5MB)                        
   0.0% used                                           
PS Old Generation                                      
   capacity = 179306496 (171.0MB)                      
   used     = 81936 (0.0781402587890625MB)             
   free     = 179224560 (170.92185974121094MB)         
   0.04569605777138158% used                           
                                                       
6521 interned Strings occupying 524504 bytes.          

Official documentation.

Resources

I hope this overview was clear and wide enough to make you feel stronger about the basis of Java and that you learned some new tricks. I did.