Oracle JRockit: The Definitive Guide, reading notes, part 3

Runtime code generation

Total JIT compilation needs to be a lazy process. If any method or class referenced from another method were fully generated, depth first, at referral time, there would be significant code generation overhead. Also, just because a class is referenced from the code doesn't mean that every method of the class has to be compiled right away, or even that any of its methods will ever be executed. Control flow through the Java program might take a different path. This problem obviously doesn't exist in a mixed mode solution, in which everything starts out as interpreted bytecode with no need to compile ahead of execution.


JRockit solves this problem by generating stub code for newly referred but not yet generated methods. These stubs are called trampolines, and basically consist of a few lines of native code pretending to be the final version of the method. When the method is first called, and control jumps to the trampoline, all it does is execute a call that tells JRockit that the real method needs to be generated. The code generator fulfils the request and returns the starting address of the real method, to which the trampoline then dispatches control. To the user it looks like the Java method was called directly, when in fact it was generated only when it was actually called for the first time.

0x1000: method A                                    0x3000: method C
    call method B @ 0x2000                              call method B @ 0x2000

0x2000: method B (trampoline)                       0x4000: The "real" method B
    call JVM.Generate(B) -> start                       ...
    write trap @ 0x2000
    goto start @ 0x4000

Consider the previous example. method A, whose generated code resides at address 0x1000, is executing a call to method B, which it believes is placed at address 0x2000. This is the first call to method B ever. Consequently, all that is at address 0x2000 is a trampoline. The first thing the trampoline does is issue a native call to the JVM, telling it to generate the real method B. Execution then halts until this code generation request has been fulfilled, and a starting address for method B is returned, let's say 0x4000. The trampoline then dispatches control to method B by jumping to that address.

Note that there may be several calls to method B in the code already, also pointing to the trampoline address 0x2000. Consider, for example, the call in method C that hasn't been executed yet. These calls need to be updated as well, without method B being regenerated. JRockit solves this by writing an illegal instruction at address 0x2000, when the trampoline has run. This way, the system will trap if the trampoline is called more than once. The JVM has a special exception handler that catches the trap, and patches the call to the trampoline so that it points to the real method instead. In this case it means overwriting the call to 0x2000 in method C with a call to 0x4000. This process is called back patching.

Back patching is used for all kinds of code replacement in the virtual machine, not just for method generation. If, for example, a hot method has been regenerated to a more efficient version, the old version of the code is fitted with a trap at the start and back patching takes place in a similar manner, gradually redirecting calls from the old method to the new one.

If there are no more references to an older version of a method, its native code buffer can be scheduled for garbage collection by the runtime system so as to unclutter the memory. This is necessary in a world that uses a total JIT strategy, because the amount of code produced can be quite large.

In the example above, method A starts at address 0x1000. When it calls method B it assumes that B starts at 0x2000; this is the first time method B is ever called. The stub code at 0x2000 is the trampoline, which tells the JVM to generate code for method B. The program then waits until the code generator finishes its work and returns method B's real starting address, and finally jumps to that address to continue execution.

Note that there may be several call sites for method B, all pointing at the trampoline address 0x2000 -- for example method C in the example above. These calls to method B should be redirected to the real method B's address, rather than regenerating method B every time. JRockit's solution is to write a trap instruction at 0x2000 once the trampoline has run; if the trampoline is called again, JRockit catches the event and points the call at the real method B. This process is called back patching.
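The whole trampoline-and-back-patch dance can be modeled in a few lines. This is only a toy simulation under invented names (CodeCache, CallSite, make_trampoline), not JRockit's native mechanism:

```python
# Toy model of lazy code generation: call sites start out pointing at a
# trampoline; the first call triggers "generation" of the real method,
# and later calls that still go through the trampoline hit the armed
# trap and get back patched to the real target.

class CodeCache:
    def __init__(self):
        self.real = {}                    # method name -> "compiled" code

    def generate(self, name, body):
        self.real[name] = body            # pretend to emit native code
        return body

class CallSite:
    """A call site holding a direct target, initially the trampoline."""
    def __init__(self, target):
        self.target = target

    def call(self, *args):
        return self.target(*args)

def make_trampoline(cache, name, body, sites):
    trapped = [False]                     # the trap written after first run

    def trampoline(*args):
        if trapped[0]:
            for s in sites:               # back patch every stale call site
                s.target = cache.real[name]
            return cache.real[name](*args)
        compiled = cache.generate(name, body)   # first call: generate B
        trapped[0] = True                 # arm the trap for later callers
        return compiled(*args)

    return trampoline

cache, sites = CodeCache(), []
tramp = make_trampoline(cache, "methodB", lambda x: x * 2, sites)
site_in_A, site_in_C = CallSite(tramp), CallSite(tramp)
sites.extend([site_in_A, site_in_C])

print(site_in_A.call(21))   # 42: first call generates the real method B
print(site_in_C.call(5))    # 10: trap fires, both call sites get patched
print(site_in_C.target is cache.real["methodB"])   # True
```

After the simulated trap fires, both call sites hold the real method's address and the trampoline is never entered again.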




Code generation requests

In JRockit, code generation requests are passed to the code generator from the runtime when a method needs to be compiled. The requests can be either synchronous or asynchronous.

Synchronous code generation requests do one of the following:

  1. Quickly generate a method for the JIT, with a specified level of efficiency.

  2. Generate an optimized method, with a specified level of efficiency.

An asynchronous request does the following:

Act upon an invalidated assumption, for example, force regeneration of a method or patch the native code of a method.

Internally, JRockit keeps synchronous code generation requests in a code generation queue and an optimization queue, depending on request type. The queues are consumed by one or more code generation and/or optimization threads, depending on system configuration.

The code generation queue contains generation requests for methods that are needed for program execution to proceed. These requests, except for special cases during bootstrapping, are essentially generated by trampolines. The "generate me" call that each trampoline contains inserts a request in the code generation queue, and blocks until the method generation is complete. The return value of the call is the address in memory where the new method starts, to which the trampoline finally jumps.
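The blocking "generate me" call maps naturally onto a queue consumed by a code generator thread. Here is a sketch with invented names (CodeGenQueue, fake addresses), not JRockit internals:

```python
import queue
import threading

# A synchronous code generation request: the trampoline's "generate me"
# call enqueues a request and blocks until a code generator thread has
# "compiled" the method and published its starting address.

class CodeGenQueue:
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._consume, daemon=True).start()

    def _consume(self):
        next_addr = 0x4000
        while True:
            name, done, result = self.q.get()
            result["addr"] = next_addr      # pretend-compile the method
            next_addr += 0x1000
            done.set()                      # wake the blocked caller

    def generate(self, name):
        """Block until the method is generated; return its address."""
        done, result = threading.Event(), {}
        self.q.put((name, done, result))
        done.wait()
        return result["addr"]

cgq = CodeGenQueue()
addr_b = cgq.generate("methodB")    # blocks, like a trampoline would
addr_d = cgq.generate("methodD")
print(hex(addr_b), hex(addr_d))     # 0x4000 0x5000
```

The caller's `done.wait()` is the analogue of execution halting in the trampoline until the code generator fulfils the request.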

Optimization requests

Optimization requests are added to the optimization queue whenever a method is found to be hot, that is, when the runtime system has realized that we are spending enough time executing the Java code of that method for optimization to be warranted.

The optimization queue understandably runs at a lower priority than the code generation queue as its work is not necessary for code execution, but just for code performance. Also, an optimization request usually takes orders of magnitude longer than a standard code generation request to execute, trading compile time for efficient code.

On-stack replacement

Once an optimized version of a method is generated, the existing version of the code for that method needs to be replaced. As previously described, the method entry point of the existing code version of the method is overwritten with a trap instruction. Calls to the old method will be back patched to point to the new, optimized piece of code.

Some optimizers swap out code on the existing execution stack by replacing the code of a method with a new version in the middle of its execution. This is referred to as on-stack replacement and requires extensive bookkeeping. Though this is possible in a completely JIT-compiled world, it is easier to implement where there is an interpreter to fall back to.

JRockit doesn't do on-stack replacement, as the complexity required to do so is deemed too great. Even though the code for a more optimal version of the method may have been generated, JRockit will continue executing the old version of the method if it is currently running.

Once the optimization request for a method has completed, the existing version of that method must be replaced. As mentioned earlier, the entry point of the existing version is overwritten with a trap instruction, so that subsequent calls to the method are redirected, via back patching, to the new, optimized version.

Some optimizers replace the code of a method with an optimized version in the middle of its execution, which is what is known as on-stack replacement (OSR). Implementing OSR requires keeping track of a great deal of extra information. Moreover, although OSR is possible under a total JIT compilation strategy, it is easier to implement in an environment with an interpreter to fall back to, since execution can degrade to interpretation and switch over once the replacement is done. (Translator's note: that last clause is my own gloss; the original reads "Though this is possible in a completely JIT-compiled world, it is easier to implement where there is an interpreter to fall back to".)



Object information for GC

For various reasons, a garbage collector needs to keep track of which registers and stack frame locations contain Java objects at any given point in the program. This information is generated by the JIT compiler and is stored in a database in the runtime system. The JIT compiler is the component responsible for creating this data because type information is available "for free" while generating code. The compiler has to deal with types anyway. In JRockit, the object meta info is called livemaps, and a detailed explanation of how the code generation system works with the garbage collector is given in Chapter 3, Adaptive Memory Management.

Assumptions made about the generated code

An assumption database is another part of the JRockit runtime that communicates with the code generator.

A walkthrough of method generation in JRockit

The JRockit IR format

The first stage of the JRockit code pipeline turns the bytecode into an Intermediate Representation (IR). As it is conceivable that other languages may be compiled by the same frontend, and also for convenience, optimizers tend to work with a common internal intermediate format.

JRockit works with an intermediate format that differs from bytecode, looking more like classic textbook compiler formats. This is the common approach that most compilers use, but of course the format of IR that a compiler uses always varies slightly depending on implementation and the language being compiled.

Aside from the previously mentioned portability issue, JRockit also doesn't work with bytecode internally because of the issues with unstructured control flow and the execution stack model, which differs from any modern hardware register model.

Because we lack the information to completely reconstruct the ASTs, a method in JRockit is represented as a directed graph, a control flow graph, whose nodes are basic blocks. The definition of a basic block is that if one instruction in the basic block is executed, all other instructions in it will be executed as well. Since there are no branches in our example, the md5_F function will turn into exactly one basic block.

Data flow

A basic block contains zero to many operations, which in turn have operands. Operands can be other operations (forming expression trees), variables (virtual registers or atomic operands), constants, addresses, and so on, depending on how close to the actual hardware representation the IR is.

JIT compilation

This following figure illustrates the different stages of the JRockit code pipeline:

`BC2HIR -->  HIR2MIR  -->  MIR2LIR  -->  RegAlloc  -->  EMIT`

Generating HIR

The first module in the code generator, BC2HIR, is the frontend against the bytecode, and its purpose is to quickly translate bytecodes into IR. HIR in this case stands for High-level Intermediate Representation.

This is the output, the High-level IR, or HIR:

params: v1 v2 v3
block0: [first] [id=0]
   10 @9:49     (i32) return    {or {and v1 v2} {and {xor v1 -1} v3}}

In JRockit IR, the annotation @ before each statement identifies its program point in the code all the way down to assembler level. The first number following the @ is the bytecode offset of the expression and the last is the source code line number information. This is part of the complex meta info framework in JRockit that maps individual native instructions back to their Java program points.

The BC2HIR module that turns bytecodes into a control flow graph with expressions is not computationally complex.


Generating MIR

MIR, or Middle-level Intermediate Representation, is the transform domain where most code optimizations take place. This is because most optimizations work best with three-address code, or rather instructions that only contain atomic operands, not other instructions. Transforming HIR to MIR is simply an in-order traversal of the expression trees mentioned earlier and the creation of temporary variables. As no hardware deals with expression trees, it is natural that code turns into progressively simpler operations on the path through the code pipeline.

Our md5_F example would look something like the following code to the JIT compiler, when the expression trees have been flattened. Note that no operation contains other operations anymore. Each operation writes its result to a temporary variable, which is in turn used by later operations.

params: v1 v2 v3
block0: [first] [id=0]
    2 @2:49*    (i32) and       v1 v2 -> v4
    5 @5:49*    (i32) xor       v1 -1 -> v5
    7 @7:49*    (i32) and       v5 v3 -> v5
    8 @8:49*    (i32) or        v4 v5 -> v4
   10 @9:49*    (i32) return    v4
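The flattening step can itself be sketched in a few lines: an in-order traversal that emits one three-address operation per interior tree node. The tuple encoding and temporary naming below are mine, not JRockit's IR:

```python
# Flatten an HIR-style expression tree into three-address operations:
# an in-order walk that gives every interior node a fresh temporary.

def flatten(tree):
    ops, counter = [], [0]

    def walk(node):
        if not isinstance(node, tuple):   # variable or constant: atomic
            return node
        op, lhs, rhs = node
        a, b = walk(lhs), walk(rhs)
        counter[0] += 1
        tmp = f"t{counter[0]}"
        ops.append((op, a, b, tmp))       # op a b -> tmp
        return tmp

    return ops, walk(tree)

# return (v1 & v2) | ((v1 ^ -1) & v3), i.e. the md5_F expression
tree = ("or", ("and", "v1", "v2"),
              ("and", ("xor", "v1", -1), "v3"))
ops, result = flatten(tree)
for op, a, b, dst in ops:
    print(f"(i32) {op:<4} {a} {b} -> {dst}")
print("(i32) return", result)
```

The emitted sequence (and, xor, and, or, return) matches the shape of the MIR listing above, with `t1..t4` standing in for JRockit's `v4`/`v5` temporary reuse.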


Generating LIR

After MIR, it is time to turn platform-dependent as we are approaching native code. LIR, or Low-level IR, looks different depending on hardware architecture.

Following is the LIR for the md5_F method on a 32-bit x86 platform:

params: v1 v2 v3
block 0: [first] [id=0]
    2 @2:49*    (i32)   x86_and         v2 v1 -> v2
   11 @2:49*    (i32)   x86_mov         v2 -> v4
    5 @5:49*    (i32)   x86_xor         v1 -1 -> v1
   12 @5:49*    (i32)   x86_mov         v1 -> v5
    7 @7:49*    (i32)   x86_and         v5 v3 -> v5
    8 @9:49*    (i32)   x86_or          v4 v5 -> v4
   14 @9:49*    (i32)   x86_mov         v4 -> eax
   13 @9:49*    (i32)   x86_ret         eax

Register allocation

There can be any number of virtual registers (variables) in the code, but the physical platform only has a small number of them. Therefore, the JIT compiler needs to do register allocation, transforming the virtual variable mappings into machine registers. If at any given point in the program, we need to use more variables than there are physical registers in the machine at the same time, the local stack frame has to be used for temporary storage. This is called spilling, and the register allocator implements spills by inserting move instructions that shuffle registers back and forth from the stack. Naturally, spill moves incur overhead, so their placement is highly significant in optimized code.
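Spilling is easy to see with a tiny linear scan over live intervals. This is a deliberately different, simpler algorithm than JRockit's graph fusion, and the intervals and register set below are made up for illustration:

```python
# Greedy linear scan over live intervals: each virtual register gets a
# physical register for its lifetime [start, end]; when no register is
# free, the active interval that ends furthest away is spilled.

def linear_scan(intervals, regs):
    order = sorted(intervals.items(), key=lambda kv: kv[1][0])
    free = list(regs)
    active = []                  # (end, vreg, reg) currently in registers
    assignment, spilled = {}, []
    for v, (start, end) in order:
        for item in [a for a in active if a[0] < start]:
            active.remove(item)  # interval expired: register is free again
            free.append(item[2])
        if free:
            reg = free.pop(0)
        else:
            active.sort()
            far_end, far_v, far_reg = active[-1]
            if far_end <= end:
                spilled.append(v)     # v itself is the worst candidate
                continue
            active.pop()              # spill far_v, steal its register
            del assignment[far_v]
            spilled.append(far_v)
            reg = far_reg
        assignment[v] = reg
        active.append((end, v, reg))
    return assignment, spilled

intervals = {"v1": (0, 5), "v2": (1, 6), "v3": (2, 3), "v4": (3, 4)}
asg, spilled = linear_scan(intervals, ["eax", "ecx", "edx"])
print(asg)       # {'v1': 'eax', 'v3': 'edx', 'v4': 'ecx'}
print(spilled)   # ['v2']: four live values, three registers, one spill
```

At point 3 all four virtual registers are live but only three physical registers exist, so one value (`v2`, the one live longest) is demoted to the stack frame, which is exactly the situation that forces the allocator to insert spill moves.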

We can also note that the register allocator has added a prologue and epilogue to the method, in which stack manipulation takes place. This is because it has figured out that it needs two callee-save registers for storage. A callee-save register is a register that the callee must preserve for the caller: it is saved to the stack frame in the prologue and restored just before the method returns. By JRockit convention on x86, the callee-save registers for Java code are ebx and ebp. Any calling convention typically includes a few callee-save registers, since if every register were potentially destroyed over a call, the end result would be even more spill code.

Generating optimized code

At each stage, an optimization module is plugged into the JIT.

A general overview

MIR readily transforms into Static Single Assignment (SSA) form, a transform domain which makes sure that every variable has only one definition. SSA transformation is part of virtually every commercial compiler today and makes implementing many code optimizations much easier. Another added benefit is that code optimizations in SSA form can be potentially more powerful.

LIR is platform-dependent and initially not register allocated, so transformations that form more efficient native operation sequences can be performed here.

The JRockit optimizer contains a very advanced register allocator that is based on a technique called graph fusion, which extends the standard graph coloring approximation algorithm to work on subregions of the IR. Graph fusion has the attractive property that the edges in the flow graph processed early generate fewer spills than the edges processed later. Therefore, if we can pick hot subregions before cold ones, the resulting code will be more optimal. An additional penalty comes from the need to insert shuffle code when fusing regions in order to form a complete method. Shuffle code consists of sequences of move instructions that copy the contents of one local register allocation into another.

Finally, just before code emission, various peephole optimizations can be applied to the native code, replacing one to several register allocated instructions in sequence with more optimal ones.

How does the optimizer work

Generating optimized code for a method in JRockit generally takes 10 to 100 times as long as JITing it with no demands for execution speed. Therefore, it is important to only optimize frequently executed methods.

Similar issues exist with boxed types. Boxed types turn into hidden objects (for example instances of java.lang.Integer) on the bytecode level. Several traditional compiler optimizations, such as escape analysis, can often easily strip down a boxed type to its primitive value. This removes the hidden object allocation that javac put in the bytecode to implement the boxed type.



Oracle JRockit: The Definitive Guide, reading notes, part 2

Adaptive code generation

Java is dynamic in nature and certain code generation strategies fit less well than others. From the earlier discussion, the following conclusions can be drawn:

  1. Code generation should be done at runtime, not ahead of time

  2. All methods cannot be treated equally by the code generator. There needs to be a way to discern a hot method from a cold one. Otherwise, unnecessary optimization effort may be spent on cold methods, or not enough on hot ones.

  3. In a JIT compiler, bookkeeping needs to be in place in order to keep up with the adaptive runtime. This is because generated native code that is invalidated by changes to the running program must be thrown away and potentially regenerated.

Achieving code execution efficiency in an adaptive runtime, no matter what JIT or interpretation strategy it uses, all boils down to the equation:

Total Execution Time = Code Generation Time + Execution Time
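Plugging invented numbers into this equation shows why hotness matters: JIT compilation only pays for itself after enough invocations. The costs below are purely illustrative:

```python
# Total Execution Time = Code Generation Time + Execution Time.
# Compare interpreting a method against JIT-compiling it up front,
# using invented cost numbers purely for illustration.

COMPILE_COST = 1000.0    # one-time cost of JIT-compiling the method
T_INTERP = 20.0          # cost per interpreted invocation
T_NATIVE = 1.0           # cost per native (compiled) invocation

def total_interpreted(n):
    return n * T_INTERP                  # zero code generation time

def total_jit(n):
    return COMPILE_COST + n * T_NATIVE   # pay the compile cost once

# Break-even: n * 20 = 1000 + n * 1  =>  n = 1000 / 19
break_even = COMPILE_COST / (T_INTERP - T_NATIVE)
print(round(break_even, 1))                       # 52.6
print(total_interpreted(10) < total_jit(10))      # True: cold, don't JIT
print(total_interpreted(10_000) > total_jit(10_000))  # True: hot, JIT wins
```

Below roughly 53 invocations, compiling this hypothetical method is a net loss; far above it, the compile cost vanishes into the noise, which is exactly why the JVM wants to identify hot methods.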

The JVM needs to know precisely which methods are worth the extra time spent on more elaborate code generation and optimization efforts.

Determining "hotness"

As we have seen, "one size fits all" code generation that interprets every method, or JIT compiling every method with a high optimization level, is a bad idea in an adaptive runtime. The former, because although it keeps code generation time down, execution time goes way up. The latter, because even though execution is fast, generating the highly optimized code takes up a significant part of the total runtime. We need to know if a method is hot or not in order to know if we should give it lots of code generator attention, as we can't treat all methods the same.

The common denominator for all ways of profiling is that samples of where code spends its execution time are collected. These are used by the runtime to make optimization decisions--the more samples available, the better informed the decisions that are made.

Invocation counters

One way to sample hot methods is to use invocation counters. An invocation counter is typically associated with each method and is incremented when the method is called. This is done either by the bytecode interpreter or in the form of an extra add instruction compiled into the prologue of the native code version of the method.
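Conceptually an invocation counter is just an increment in the method prologue plus a threshold check. A sketch follows; the decorator, the tiny threshold, and the `optimized` list are invented stand-ins, since a real JVM does this in generated native code, not in the language runtime:

```python
import functools

HOT_THRESHOLD = 3        # invented: real JVMs use much larger values
optimized = []           # methods handed to the "optimization queue"

def counted(fn):
    """Simulate the extra 'add' compiled into a method's prologue."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        if wrapper.calls == HOT_THRESHOLD:
            optimized.append(fn.__name__)   # method is now "hot"
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@counted
def md5_F(x, y, z):
    return (x & y) | (~x & z)

for _ in range(5):
    md5_F(0b1100, 0b1010, 0b0110)
print(md5_F.calls, optimized)   # 5 ['md5_F']
```

The counter write on every call is also where the cache-miss overhead discussed next comes from: one memory location per method is hammered on each invocation.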

Invocation counters add some profiling overhead, especially in the JIT-compiled world, where code execution speed doesn't disappear into interpretation overhead. The cost usually takes the form of cache misses in the CPU, because a particular location in memory has to be frequently written to by the add at the start of each method.

Software-based thread sampling

Another, more cache friendly, way to determine hotness is to use thread sampling. This means periodically examining where in the program Java threads are currently executing and logging their instruction pointers. Thread sampling requires no code instrumentation.

Stopping threads, which is normally required in order to extract their contexts, is, however, quite an expensive operation. Thus, getting a large number of samples without disrupting anything at all requires a complete JVM-internal thread implementation, a custom operating system such as in Oracle JRockit Virtual Edition, or specialized hardware support.

Hardware-based sampling

Certain hardware platforms, such as Intel IA-64, provide hardware instrumentation mechanisms that may be used by an application.

Optimizing a changing program

In object-oriented languages, virtual method dispatch is usually compiled as an indirect call (that is, the destination has to be read from memory) to addresses in a dispatch table. This is because a virtual call can have several possible receivers depending on the class hierarchy. A dispatch table exists for every class and contains the receivers of its virtual calls. A static method or a virtual method that is known to have only one implementation can instead be turned into a direct call with a fixed destination. This is typically much faster to execute.

The JVM solves this by "gambling". It bases its code generation decisions on assumptions that the world will remain unchanged forever, which is usually the case. If it turns out not to be so, its bookkeeping system triggers callbacks if any assumption is violated. When this happens, the code containing the original assumption needs to be regenerated--in our example, the static dispatch needs to be replaced by a virtual one. Having to revert code generated on the basis of an assumption about a closed world is typically very costly, but if it happens rarely enough, the benefit of the original assumption will deliver a performance increase anyway.

Some typical assumptions that the JIT compiler and JVM, in general, might bet on are:

  1. A virtual method probably won't be overridden. As it exists in only one version, it can always be called with a fixed destination address, like a static method.

  2. A float will probably never be NaN. We can use hardware instructions instead of an expensive call to the native floating point library that is required for corner cases.

  3. The program probably won't throw an exception in a particular try block. Schedule the catch clause as cold code and give it less attention from the optimizer.

  4. The hardware instruction fsin probably has the right precision for most trigonometry. If it doesn't, cause an exception and call the native floating point library instead.

  5. A lock probably won't be too saturated and can start out as a fast spinlock.

  6. A lock will probably be repeatedly taken and released by the same thread, so the unlock operation and future reacquisitions of the lock can optimistically be treated as no-ops.
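Assumption 1 (a virtual method with only one loaded implementation) can be sketched as a guarded direct call plus an invalidation callback. Everything here (the call-site object, the class-load hook) is an illustrative model, not actual JVM machinery:

```python
# Gamble: while only one implementation of speak() is loaded, call it
# directly; loading an override invalidates the assumption and reverts
# the call site to ordinary virtual dispatch.

class CallSiteState:
    def __init__(self):
        self.direct_target = None    # the "static" fast path, if any

site = CallSiteState()

class Animal:
    def speak(self):
        return "generic"

site.direct_target = Animal.speak    # bet: no overrides exist yet

def on_class_loaded(cls):
    """Bookkeeping callback: a new override violates the assumption."""
    if "speak" in cls.__dict__:
        site.direct_target = None    # deoptimize: back to virtual dispatch

def call_speak(obj):
    if site.direct_target is not None:
        return site.direct_target(obj)   # devirtualized direct call
    return obj.speak()                   # regular virtual dispatch

print(call_speak(Animal()))          # 'generic', via the direct call

class Dog(Animal):
    def speak(self):
        return "woof"
on_class_loaded(Dog)                 # assumption violated here

print(call_speak(Dog()))             # 'woof', via virtual dispatch again
```

The expensive part in a real JVM is not the flag flip but regenerating and back patching the native code that baked in the direct call; the model only captures the decision logic.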

A static environment that was compiled ahead of time and runs in a closed world cannot, in general, make these kinds of assumptions. An adaptive runtime, however, can revert its wrong decisions if the criteria they were based on are violated. In theory, it can make any crazy assumption that might pay off, as long as the assumption can be reverted with small enough cost. Thus, an adaptive runtime is potentially far more powerful than a static environment, given that the "gambling" pays off.

Given that we make the right bets--and JRockit is based on runtime information feedback in all relevant areas to make the best decisions--an adaptive runtime has the potential to outperform a static environment every time.

Inside the JIT compiler

Working with bytecode

While compiled bytecode may sound low level, it is still a well-defined format that keeps its code (operations) and data (operands and constant pool entries) strictly separated from each other.

As we have seen, most bytecode operations pop operands from the stack and push results. No native platforms are stack machines; rather, they rely on registers for storing intermediate values. Mapping a language that uses local variables to native registers is straightforward, but mapping an evaluation stack to registers is slightly more complex. Java also defines plenty of virtual registers, local variables, but uses an evaluation stack anyway. It is the authors' opinion that this is less than optimal.

Another problem, one that in rare cases may be a design advantage, is the ability of Java bytecode to express more than Java source code.

Bytecode "optimizers"

Our advice is to not use bytecode optimizers, ever!

Abstract syntax trees

A bytecode to native compiler can't simply assume that the given bytecode is compiled Java source code, but needs to cover all eventualities. A compiler whose frontend reads source code usually works by first tokenizing the source code into known constructs and building an Abstract Syntax Tree (AST).

Perhaps, in retrospect, it would have been a better design rationale to directly use an encoded version of the compiler's ASTs as the bytecode format. Various academic papers have shown that ASTs can be represented in an equally compact or more compact way than Java bytecode, so space is not a problem. Interpreting an AST at runtime would also be only slightly more difficult than interpreting bytecode.

Where to optimize

However, as we have explained, explicit optimization on the bytecode level is probably a good thing to avoid.

Adaptive optimization can never substitute bad algorithms with good ones. At most, it can make the bad ones run a little bit faster.

Exceptions are very expensive operations and are assumed to be just that -- exceptions. The "gambling" behavior of the JVM, thinking that exceptions are rare, became a bad fit.

The JRockit code pipeline

Why JRockit has no bytecode interpreter

JRockit uses a code generation strategy known as total JIT compilation.

Later, as JRockit became a major mainstream JVM, known for its performance, the need to diversify the code pipeline into client and server parts was recognized. No interpreter was added, however. Rather, the JIT was modified to differentiate even further between cold and hot code, enabling faster "sloppy" code generation the first time a method was encountered. This improved startup time to a satisfying degree, but of course, getting to pure interpreter speeds with a compile-only approach is still very hard.

Another aspect that makes life easier with an interpreter is debuggability. Bytecode contains meta information about things like variable names and line numbers. These are needed by the debugger. Rather than adding an interpreter, the JRockit JIT has to propagate this kind of information all the way from the bytecode down to a per-native-instruction basis. This has the added benefit that, to our knowledge, JRockit is the only virtual machine that lets the user debug optimized code.

The main problems with the compile-only strategy in JRockit are code bloat (solved by garbage collecting the code buffers of methods no longer in use) and compilation time for large methods (solved by having a sloppy mode for the JIT).


The "brain" of the JRockit JVM is the runtime system itself. It keeps track of what goes on in the world that comprises the virtual execution environment. The runtime system is aware of which Java classes and methods make up the "world" and requests that the code generator compiles them at appropriate times with appropriate levels of code quality.

To simplify things a bit, the first thing the runtime wants to do when the JVM is started is to look up and jump to the main method of a Java program. This is done through a standard JNI call from the native JVM, just like any other native application would use JNI to call Java code.

Searching for main triggers a complex chain of actions and dependencies. A lot of other Java methods required for bootstrapping and fundamental JVM behavior need to be generated in order to resolve main. When main is finally ready and compiled to native code, the JVM can execute its first native-to-Java stub and pass control from the JVM to the Java program.

To study the bootstrap behavior of JRockit, try running a simple Java program with the command-line switch -Xverbose:codegen.


Cache: A Place for Concealment and Safekeeping





This post is mainly about how CPU caches are implemented in Intel processors. It is worth mentioning that explanations of caches usually muddle the basic concepts and lack vivid examples -- though that may just be down to this author's limited wits. Anyway, here is part 1 of how a dual-core CPU's L1 cache works:

A line, the unit of cached data, is simply a contiguous block of bytes in memory. As shown in the figure above, this cache uses 64-byte lines. The lines are organized into ways (also called cache banks), and each way is paired with a directory that stores information about its lines. A way plus its directory acts like a column in a spreadsheet, while a set can be seen as a row, so a line can be located through the directory. The cache in the figure has 64 sets, each containing 8 ways, hence 512 lines and 32 KB in total.

Given the cache in the figure, physical memory is divided into 4 KB physical pages, each containing 4 KB / 64 bytes = 64 lines. In a 4 KB page, bytes 0-63 form the first line, bytes 64-127 the second line, and so on. Every page is organized this way, so the third line of page 0 is distinct from the third line of page 1.


The cache in the figure is set associative, meaning that a particular line of memory can only be stored in a designated set (or row). So the first line (bytes 0-63) of every physical page must be stored in row 0, the second line in row 1, and so on. Each row in the figure has 8 cells available for storing its corresponding lines, making this an 8-way set associative cache. When addressing memory, bits 11-6 give the line number within the 4 KB page, and therefore determine which set the line goes in. Take the physical address 0x800010A0 (binary 1000 0000 0000 0000 0001 0000 1010 0000): bits 11-6 are 000010, so it must be stored in set 2.

So far we still cannot tell which cell within the row holds the line; this is where the directory comes in. Each line is tagged with a directory entry recording the number of the page the line came from. The processor in the figure can address 64 GB of physical RAM, so there are 64 GB / 4 KB = 2^24 pages, and the directory entry therefore needs 24 bits. The page number for our example physical address 0x800010A0 is 0x800010A0 / 4 KB = 524,289. Next, part 2 of how a dual-core CPU's L1 cache works:

Since each set has only 8 ways, the tag matching process is very fast. The figure uses arrows to show the tags being compared in parallel; if some valid line's tag matches, it counts as a cache hit, otherwise the lookup proceeds to the L2 cache, and failing that, to physical memory. Intel's L2 caches work on the same principle as the L1, just bigger and with more ways. For example, adding another 8 ways yields a 64 KB cache (= 4 KB x 16); growing the number of sets to 4096 raises the way size to 256 KB (= 4096 x 64 bytes), and with these two simple changes the L2 cache reaches 4 MB (= 256 KB x 16). Accordingly, the tag needs 18 bits (= 36 - 12 - 6) and the set index needs 12 bits (4096 = 2^12), and each way is now much larger than a physical page.


Memory is normally addressed with virtual addresses, so the L1 cache must rely on the paging unit to obtain the physical page address for use in its tags. By convention, the set index comes from the low-order bits of the virtual address (bits 11-6 in our example) and needs no translation. The L1 tag thus depends on the physical address while the set index depends on the virtual address, which lets the CPU carry out the two lookups in parallel. Because an L1 way is never larger than the memory management unit's page size, a given physical memory location is guaranteed to be associated with the same set index whether addressed virtually or physically. The L2 cache is a different story: since its ways can be larger than an MMU page, both its tags and its set index must be physical. By the time a request reaches the L2, however, the L1 has already computed the physical address, so the L2 cache always works just fine.






All the information we have is this: L1 cache -- 32 KB, 8-way set associative, 64-byte cache lines; plus the fact that physical memory addresses are 36 bits wide.


Cache entry structure

Cache row entries usually have the following structure:

tag | data block | flag bits

The data block (cache line) contains the actual data fetched from the main memory. The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below.

The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The number of tag and flag bits is irrelevant to this calculation, although it does affect the physical area of a cache.) An effective memory address is split (MSB to LSB) into the tag, the index and the block offset.

tag | index | block offset

The index describes which cache row (which cache line) the data has been put in. The index length is log2(r) bits for r cache rows. The block offset specifies the desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is log2(b) bits, where b is the number of bytes per data block. The tag contains the most significant bits of the address, which are checked against the current row (the row has been retrieved by index) to see if it is the one we need or another, irrelevant memory location that happened to have the same index bits as the one we want. The tag length in bits is address_length - index_length - block_offset_length.

Some authors refer to the block offset as simply the "offset" or the "displacement".

The block offset corresponds to "Offset into cache line" in the figure. From the given facts, a line is 64 bytes = 2^6, so the block offset needs 6 bits. Likewise, the index corresponds to "Set Index" in the figure: 32 KB / 64 bytes = 512 lines, and since each set contains 8 ways (the terminology is a mess), 512 lines / 8 ways = 64 sets = 2^6, so the index also needs 6 bits. The figure gives a 36-bit memory address, so the tag = 36 - 6 - 6 = 24 bits.
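The bit arithmetic above can be checked mechanically. Using the post's parameters (36-bit physical addresses, 64-byte lines, 64 sets, hence a 24-bit tag) and its example address 0x800010A0:

```python
# Split a physical address into tag / set index / block offset for an
# L1 cache with 64-byte lines and 64 sets (8-way, 32 KB total).

OFFSET_BITS = 6          # 64-byte line  -> 2**6
INDEX_BITS = 6           # 64 sets       -> 2**6
ADDR_BITS = 36
TAG_BITS = ADDR_BITS - INDEX_BITS - OFFSET_BITS   # 24

def split(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split(0x800010A0)
print(TAG_BITS)               # 24
print(index)                  # 2: bits 11-6 are 000010, so set 2
print(0x800010A0 // 4096)     # 524289, the physical page number
```

Note that because page size here is 4 KB = 2^(6+6) bytes, the 24-bit tag and the physical page number coincide, which is why the post can call the directory entry a page number.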

Finally, a rant about the GFW: it is disgusting, utterly and thoroughly disgusting, and it has left this kind-hearted blogger in a foul mood. I started out subscribing to blogs with Xianguo, but after I could no longer stand its beastly layout I switched to Digg, which, predictably enough, got blocked. At first I was in the dark, with no idea why I could never log into Digg (via my Google account), and privately concluded the thing was garbage; after some Googling it looked like it had been walled off. I use Green's free 200 MB VPN plan; it is a decent provider, one that the famous Chi Jianqiang has recommended, so I am happy to advertise for them (the real reason being that my blog gets so little traffic it hardly matters, haha). With the VPN connected, gravity returns to normal and everything flows again. Damnable GFW, fuck you!






Oracle JRockit: The Definitive Guide, reading notes, part 1


Chapter 1: Getting Started

Command-line options

There are three main types of command-line options to JRockit--system properties, standardized options (-X flags), and non-standard ones (-XX flags).

System properties

Arguments starting with -D are interpreted as directives to set system properties.

Standardized options

Configuration settings for the JVM typically start with -X for settings that are commonly supported across vendors.

Non-standard options

Vendor-specific configuration options are usually prefixed with -XX. These options should be treated as potentially unsupported and subject to change without notice. If any JVM setup depends on -XX-prefixed options, those flags should be removed or ported before an application is started on a JVM from a different vendor.

Once the JVM options have been determined, the user application can be started. Typically, moving an existing application to JRockit leads to an increase in runtime performance and a slight increase in memory consumption.

The JVM documentation should always be consulted to determine if non-standard command-line options have the same semantics between different JVMs and JVM versions.


JRockit accepts three kinds of command-line parameters: system properties (prefixed with `-D`), standardized options (prefixed with `-X`), and non-standard options (prefixed with `-XX`).


Chapter 2: Adaptive Code Generation

The Java Virtual Machine

The JVM is required to turn the bytecodes into native code for the CPU on which the Java application executes. This can be done in one of the following two ways (or a combination of both):

  1. The Java Virtual Machine specification fully describes the JVM as a state machine, so there is no need to actually translate bytecode to native code. The JVM can emulate the entire execution state of the Java program, including emulating each bytecode instruction as a function of the JVM state. This is referred to as bytecode interpretation. The only native code (barring JNI) that executes directly here is the JVM itself.

  2. The Java Virtual Machine compiles the bytecode that is to be executed to native code for a particular platform and then calls the native code. When bytecode programs are compiled to native code, this is typically done one method at a time, just before the method in question is to be executed for the first time. This is known as Just-In-Time compilation (JIT).

Naturally, a native code version of a program executes orders of magnitude faster than an interpreted one. The tradeoff is, as we shall see, bookkeeping and compilation time overhead.


1. The JVM specification fully describes the JVM as a state machine, so there is actually no need to translate bytecode into native code.

2. The JVM compiles bytecode into platform-specific native code and then executes that native code directly.


Stack machine

The Java Virtual Machine is a stack machine. All bytecode operations, with few exceptions, are computed on an evaluation stack by popping operands from the stack, executing the operation and pushing the result back to the stack.

In addition to the stack, the bytecode format specifies up to 65,536 registers or local variables.

An operation in bytecode is encoded by just one byte, so Java supports up to 256 opcodes, of which most available values are already claimed. Each operation has a unique byte value and a human-readable mnemonic.




Bytecode format

Slot 0 in an instance method is reserved for this, according to the JVM specification, and this particular example is an instance method.

Operations and operands

As we see, Java bytecode is a relatively compact format, the previous method being only four bytes in length (a fraction of the source code mass). Operations are always encoded with one byte for the opcode, followed by an optional number of operands of variable length. Typically, a bytecode instruction complete with operands is just one to three bytes.

Other, more complex constructs such as table switches also exist in bytecode, with an entire jump table of offsets following the opcode in the bytecode.

The constant pool

A program requires data as well as code. Data is used for operands. The operand data for a bytecode program can, as we have seen, be kept in the bytecode instruction itself. But this is only true when the data is small enough, or commonly used (such as the constant 0).

Larger chunks of data, such as string constants or large numbers, are stored in a constant pool at the beginning of the .class file. Indexes to the data in the pool are used as operands instead of the actual data itself. If the string aVeryLongFunctionName had to be separately encoded in a compiled method each time it was operated on, bytecode would not be compact at all.

Code generation strategies

Pure bytecode interpretation

Early JVMs contained only simple bytecode interpreters as a means of executing Java code. To simplify this a little, a bytecode interpreter is just a main function with a large switch construct on the possible opcodes. The function is called with a state representing the contents of the Java evaluation stack and the local variables. Interpreting a bytecode operation uses this state as input and output. All in all, the fundamentals of a working interpreter shouldn't amount to more than a couple of thousand lines of code.
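A minimal sketch of such a switch-based interpreter might look like the following. The instruction set and opcode values here are invented for illustration and are far simpler than real Java bytecode:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** A toy "big switch" interpreter over a handful of made-up opcodes. */
public class ToyInterpreter {
    static final int PUSH = 0x10; // push the following byte as an int constant
    static final int ADD  = 0x60; // pop two ints, push their sum
    static final int MUL  = 0x68; // pop two ints, push their product
    static final int RET  = 0xAC; // pop and return the top of stack

    public static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>(); // the evaluation stack
        int pc = 0;                                // program counter
        while (true) {
            switch (code[pc++]) {
                case PUSH: stack.push(code[pc++]); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                case RET:  return stack.pop();
                default:   throw new IllegalStateException("bad opcode");
            }
        }
    }

    public static void main(String[] args) {
        // Encodes (2 + 3) * 4
        int[] code = { PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, RET };
        System.out.println(run(code)); // prints 20
    }
}
```

The real thing adds local variables, a constant pool, control flow, and object operations, but the shape is the same: a loop around one big switch on the opcode.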


There are several simplicity benefits to using a pure interpreter. The code generator of an interpreting JVM just needs to be recompiled to support a new hardware architecture. No new native compiler needs to be written. Also, a native compiler for just one platform is probably much larger than our simple switch construct.


A pure bytecode interpreter also needs little bookkeeping. A JVM that compiles some or all methods to native code would need to keep track of all compiled code. If a method is changed at runtime, which Java allows, it needs to be scheduled for regeneration as the old code is obsolete. In a pure interpreter, its new bytecodes are simply interpreted again from the start the next time the method is executed.


It follows that the amount of bookkeeping in a completely interpreted model is minimal. This lends itself well to being used in an adaptive runtime such as a JVM, where things change all the time.


Naturally, there is a significant performance penalty to a purely interpreted language when comparing the execution time of an interpreted method with a native code version of the same code.


Static compilation

With static compilation, an entire Java program is compiled into native code before execution. This is known as ahead-of-time compilation.

The obvious disadvantage of static compilation for Java is that the benefits of platform independence immediately disappear. The JVM is removed from the equation.

Another disadvantage is that the automatic memory management of Java has to be handled more or less explicitly, leading to limited implementations with scalability issues.

Total JIT compilation

Another way to speed up bytecode execution is to not use an interpreter at all, and JIT compile all Java methods to native code immediately when they are first encountered. The compilation takes place at runtime, inside the JVM, not ahead-of-time.

Total JIT compilation has the advantage that we do not need to maintain an interpreter, but the disadvantage is that compile time becomes a factor in the total runtime. While we definitely see benefits in JIT compiling hot methods, we also unnecessarily spend expensive compile time on cold methods and methods that are run only once. Those methods might as well have been interpreted instead.

The main disadvantage of total JIT compilation is still low code generation speed. In the same way that an interpreted method executes hundreds of times slower than a native one, a native method that has to be generated from Java bytecodes takes hundreds of times longer to get ready for execution than an interpreted method. When using total JIT compilation, it is extremely important to spend clock cycles on optimizing code only where it will pay off in better execution time. The mechanism that detects hot methods has to be very advanced, indeed. Even a quick and dirty JIT compiler is still significantly slower at getting code ready for execution than a pure interpreter. The interpreter never needs to translate bytecodes into anything else.

Another issue that becomes more important with total JIT compilation is the large amount of throwaway code that is produced. If a method is regenerated, for example because assumptions made by the compiler are no longer valid, the old code takes up precious memory. The same is true for a method that has been optimized. Therefore, the JVM requires some kind of "garbage collection" for generated code, or a system with large amounts of JIT compilation would slowly run out of native memory as code buffers grow.

JRockit is an example of a JVM that uses an advanced variant of total JIT compilation as its code generation strategy.





The JRockit JVM's code generation strategy is an advanced, optimized variant of total JIT compilation.

Mixed mode interpretation

The first workable solution that was proposed, that would both increase execution speed and not compromise the dynamic nature of Java, was mixed mode interpretation.

In a JVM using mixed mode interpretation, all methods start out as interpreted when they are first encountered. However, when a method is found to be hot, it is scheduled for JIT compilation and turned into more efficient native code. This adaptive approach is similar to that of keeping different code quality levels in the JIT, described in the previous section.

Detecting hot methods is a fundamental functionality of every modern JVM, regardless of code execution model, and it will be covered to a greater extent later in this chapter. Early mixed mode interpreters typically detected the hotness of a method by counting the number of times it was invoked. If that number was large enough, optimizing JIT compilation would be triggered for the method.

Similar to total JIT compilation, if the process of determining whether a method is hot is good enough, the JVM spends compilation time only on the methods where it makes the most difference. If a method is seldom executed, the JVM wastes no time turning it into native code, but rather keeps interpreting it each time it is called.
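The invocation-counting scheme can be sketched as follows. The threshold value and names are made up for illustration; a real JVM does this bookkeeping per method in native code, often with more sophisticated heuristics than a bare counter:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of invocation-counter-based hot method detection. */
public class HotnessCounter {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold;

    public HotnessCounter(int threshold) { this.threshold = threshold; }

    /** Called on every invocation; returns true exactly when the method
     *  crosses the threshold and should be scheduled for JIT compilation. */
    public boolean invoked(String method) {
        int n = counts.merge(method, 1, Integer::sum);
        return n == threshold; // fire once, at the crossing point
    }

    public static void main(String[] args) {
        HotnessCounter profiler = new HotnessCounter(3);
        for (int i = 0; i < 5; i++) {
            if (profiler.invoked("Foo.bar()")) {
                System.out.println("Foo.bar() is hot after " + (i + 1) + " calls");
            }
        }
    }
}
```

Seldom-called methods never reach the threshold and simply stay interpreted.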

Sun Microsystems was the first vendor to embrace mixed mode interpretation in the HotSpot compiler, available both in a client version and a server-side version, the latter with more advanced code optimizations. HotSpot, in turn, was based on technology acquired from Longview Technologies LLC (which started out as Animorphic).

The first proposed solution that improved program execution speed without sacrificing Java's dynamic nature was to run programs in mixed mode.





Note: almost all of the Chinese in this post comes from here. I once overrated myself and tried to do the translation on my own, but after comparing my version with that gentleman's, mine was a sorry sight indeed. Translation really does belong to the realm of art; the likes of me can only look on with envy.


Hello BTrace




Next, extract it straight into the appropriate directory: tar -xvf btrace-bin.tar.gz

Then configure BTRACE_HOME. I use zsh, so edit the ~/.zshrc file and append export BTRCE_HOME=/export/servers/btrace at the end, then add BTRACE_HOME to PATH: export PATH=$PATH:$HADOOP_HOME/bin:$BTRCE_HOME/bin:$SBT_HOME/bin:$MAVEN_HOME/bin:$ANT_HOME/bin



➜ bin btrace
Usage: btrace <options> <pid> <btrace source or .class file> <btrace arguments>
where possible options include:
-classpath <path> Specify where to find user class files and annotation processors
-cp <path> Specify where to find user class files and annotation processors
-I <path> Specify where to find include files
-p <port> Specify port to which the btrace agent listens for clients



BTrace is a safe, dynamic tracing tool for Java. BTrace works by dynamically (bytecode) instrumenting classes of a running Java program. BTrace inserts tracing actions into the classes of a running Java program and hotswaps the traced program classes.


  1. May not create objects
  2. May not throw or catch exceptions
  3. May not use the synchronized keyword
  4. May not assign to instance or static variables of the target program
  5. May not call instance or static methods of the target program
  6. The script's fields and methods must all be static
  7. The script may not contain outer, inner, or nested classes
  8. The script may not contain loops, may not extend any class or implement any interface, and may not use assert statements




In safe mode, only BTrace's built-in functionality can be used; anything JDK-related is off limits, otherwise compilation fails. BTraceUtils provides plenty of static methods for this, for example str(), which calls the target object's toString() method. And that is where I stepped into a pit.


@BTrace
public class TestBtrace {
    @OnMethod(
            clazz = "com.xxxx.rpc.impl.ProductRPCImpl",
            method = "queryProduct",
            location = @Location(Kind.RETURN)
    )
    public static void queryProduct(@Self Object self, Set<Long> skuIds, @Return Object result) {
        println(strcat("args: ", str(skuIds)));
        println(strcat("return: ", str(result)));
    }
}



args: [10270495]
return: [[skuId=10270495,name=染整工艺实验教程(附赠光盘1张),category=1713;3282;3709,valueWeight=0.399,imagePath=17220/c1168702-51bd-4645-946b-74f0700b1300.jpg,venderId=0,venderType=0,venderName=,wstate=1,businessCode=bk0193,maxPurchQty=0,wyn=1,length=0,width=0,height=0,brandId=0,extFieldMap={},valuePayFirst=0,skuMark=]]



@BTrace
public class TestBtrace {
    @OnMethod(
            clazz = "com.xxxx.impl.PromotionProxyImpl",
            method = "calculate",
            location = @Location(Kind.RETURN)
    )
    public static void calculate(@Self Object self, @Return Object result) {
        println(strcat("return: ", str(result)));
    }
}





After all attempts failed, I went for broke and ran over to the BTrace site to post a thread in my broken Chinese English, asking the author what on earth was going on!! To see the original thread, click here.


Why I want to use unsafe mode is because I found a strange behavior in safe mode:

I want to use a business domain object in my BTrace script and override its toString() method for str(). I found that sometimes it does not use the toString() I overrode, and prints an address of the domain object; in another case, it prints the right result of toString(). Is that the right behavior? Somebody told me a BTrace script will always use the original toString() because of safe mode; I don't think so.
For the second part - toString() is only invoked for instances of the classes loaded by the bootstrap classloader (system classes). For all the other objects an identity string is returned to prevent the execution of unknown code.

Did you notice? Only instances of classes loaded by the bootstrap classloader have their toString() method invoked by a BTrace script; in every other case Object's identity string is used, not the overridden toString(). I felt a wave of realization wash over me: so that was it. The first example's return value is a Set, which is naturally loaded by the bootstrap classloader, and its generic element type benefited along with it; in the second example, Result is a custom object loaded by the application classloader, so it gets different treatment. That explains the initial puzzle, case closed. But I am no ordinary person; I am a master among masters at getting to the bottom of things, hmph, so I dug out the BTrace source code to see for myself.

/**
 * Returns a string representation of the object. In general, the
 * <code>toString</code> method returns a string that
 * "textually represents" this object. The result should
 * be a concise but informative representation that is easy for a
 * person to read. For bootstrap classes, returns the result of
 * calling Object.toString() override. For non-bootstrap classes,
 * default toString() value [className@hashCode] is returned.
 *
 * @param  obj the object whose string representation is returned
 * @return a string representation of the given object.
 */
public static String str(Object obj) {
    if (obj == null) {
        return "null";
    } else if (obj instanceof String) {
        return (String) obj;
    } else if (obj.getClass().getClassLoader() == null) {
        try {
            return obj.toString();
        } catch (NullPointerException e) {
            // NPE can be thrown from inside the toString() method we have no control over
            return "null";
        }
    } else {
        return identityStr(obj);
    }
}
With that, the truth was out. But my problem still was not solved: I need to print the field values of my business entities!! Fine, let's try unsafe mode. And then, then I went straight on into the next pit.

Pit two: unsafe mode

safe mode has far too many restrictions, which I likewise could not stand. It does have its reasons, but as long as safety can be guaranteed, I prefer unsafe mode, thank you very much. No more walking on eggshells: with all those restrictions gone, everything should be smooth sailing. Instead, poor me fell to my knees the moment I tried to enable unsafe mode.

Enabling unsafe mode takes two steps: first, edit the $BTRACE_HOME/bin/btrace file and change the startup parameter -Dcom.sun.btrace.unsafe to true; then add unsafe = true to the @BTrace annotation in the BTrace script.


DEBUG: btrace debug mode is set
DEBUG: btrace unsafe mode is set
DEBUG: assuming default port 2020
DEBUG: assuming default classpath '.'
DEBUG: attaching to 1625
DEBUG: checking port availability: 2020
DEBUG: attached to 1625
DEBUG: loading /export/servers/btrace/build/btrace-agent.jar
DEBUG: agent args: port=2020,debug=true,unsafe=true,systemClassPath=/export/servers/jdk1.6.0_25/lib/tools.jar,probeDescPath=.
DEBUG: loaded /export/servers/btrace/build/btrace-agent.jar
DEBUG: registering shutdown hook
DEBUG: registering signal handler for SIGINT
DEBUG: submitting the BTrace program
DEBUG: opening socket to 2020
DEBUG: sending instrument command
DEBUG: entering into command loop
DEBUG: received com.sun.btrace.comm.ErrorCommand@3c24c4a3
com.sun.btrace.VerifierException: Unsafe mode, requested by the script, not allowed
at com.sun.btrace.runtime.Verifier.reportError(
at com.sun.btrace.runtime.Verifier.reportError(
at com.sun.btrace.runtime.Verifier$1.visit(
at Source)
at Source)
at Source)
at Source)
at com.sun.btrace.runtime.InstrumentUtils.accept(
at com.sun.btrace.runtime.InstrumentUtils.accept(
at com.sun.btrace.agent.Client.verify(
at com.sun.btrace.agent.Client.loadClass(
at com.sun.btrace.agent.RemoteClient.<init>(
at com.sun.btrace.agent.Main.startServer(
at com.sun.btrace.agent.Main.access$000(
at com.sun.btrace.agent.Main$
DEBUG: received com.sun.btrace.comm.ExitCommand@11e9c82e

Magical, isn't it? The log above says unsafe mode is set. Right, it is set. Yet it still complains that unsafe mode is not allowed. What kind of nonsense is that...



Hi brother, there is a mistake here, I am using BTrace 1.2.4, not the 2. I am so sorry.
But it really happened in BTrace 1.2.4, and finally I found it works after restarting the application.
Then I tried to change -Dcom.sun.btrace.unsafe to false, and I can still run the BTrace script in unsafe mode.
So, must I restart the application if the mode is changed?
For the first part - once you start the agent it will keep its unsafe flag forever. You could start the application with BTrace Agent and a dummy script to allow unsafe BTrace scripts eg.





btrace -classpath ./a.jar 109776


-classpath can be followed by a string containing multiple paths joined together. On Windows the separator is a semicolon: -classpath ./a.jar;./b.jar. On Linux the separator is a colon: -classpath ./a.jar:./b.jar
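The platform's separator is also available programmatically as java.io.File.pathSeparator, which is handy when building such strings. A small illustrative snippet:

```java
import java.io.File;

public class SeparatorDemo {
    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems
        String joined = String.join(File.pathSeparator, "./a.jar", "./b.jar");
        System.out.println(joined); // e.g. "./a.jar:./b.jar" on Linux
    }
}
```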

The original article link is here.
