Oracle JRockit The Definitive Guide - Reading Notes (Part 3)

Runtime code generation

Total JIT compilation needs to be a lazy process. If any method or class referenced from another method were fully generated, depth first, at referral time, there would be significant code generation overhead. Also, just because a class is referenced from the code doesn't mean that every method of the class has to be compiled right away, or even that any of its methods will ever be executed. Control flow through the Java program might take a different path. This problem obviously doesn't exist in a mixed mode solution, in which everything starts out as interpreted bytecode with no need to compile ahead of execution.

Trampolines

JRockit solves this problem by generating stub code for methods that have been referred to but not yet generated. These stubs are called trampolines, and basically consist of a few lines of native code pretending to be the final version of the method. When the method is first called, and control jumps to the trampoline, all it does is execute a call that tells JRockit that the real method needs to be generated. The code generator fulfils the request and returns the starting address of the real method, to which the trampoline then dispatches control. To the user it looks like the Java method was called directly, when in fact it was generated just in time, the first time it was actually called.

0x1000: method A                                    0x3000: method C
    call method B @ 0x2000                              call method B @ 0x2000

0x2000: method B (trampoline)                       0x4000: The "real" method B
    call JVM.Generate(B) -> start                       ...
    write trap @ 0x2000
    goto start @ 0x4000

Consider the previous example. method A, whose generated code resides at address 0x1000, is executing a call to method B, which it believes is placed at address 0x2000. This is the first ever call to method B. Consequently, all that is at address 0x2000 is a trampoline. The first thing the trampoline does is issue a native call to the JVM, telling it to generate the real method B. Execution then halts until this code generation request has been fulfilled, and a starting address for method B is returned, let's say 0x4000. The trampoline then dispatches control to method B by jumping to that address.

Note that there may already be several calls to method B in the code, all pointing to the trampoline address 0x2000. Consider, for example, the call in method C that hasn't been executed yet. These calls need to be updated as well, without method B being regenerated. JRockit solves this by writing an illegal instruction at address 0x2000 when the trampoline has run. This way, the system will trap if the trampoline is called more than once. The JVM has a special exception handler that catches the trap, and patches the call to the trampoline so that it points to the real method instead. In this case it means overwriting the call to 0x2000 in method C with a call to 0x4000. This process is called back patching.
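The mechanism can be illustrated with a small conceptual sketch in Java (JRockit of course does this in native code, with real jump instructions and traps; the class and method names here are invented). A call site first resolves through a stub that requests generation and then patches the table entry:

import java.util.HashMap;
import java.util.Map;

// Conceptual model of trampolines and back patching, not JRockit code:
// call sites reach their target through a table slot that initially
// holds a trampoline stub.
public class TrampolineSketch {
    interface Code { int call(int x); }

    static final Map<String, Code> callSites = new HashMap<String, Code>();

    // Stands in for the JIT fulfilling a "generate me" request.
    static Code generate(String name) {
        System.out.println("generating " + name);
        return new Code() { public int call(int x) { return x * 2; } };
    }

    public static void main(String[] args) {
        // Initially, "methodB" points at a trampoline stub.
        callSites.put("methodB", new Code() {
            public int call(int x) {
                Code real = generate("methodB"); // ask for the real method
                callSites.put("methodB", real);  // back patch the call site
                return real.call(x);             // dispatch to the real code
            }
        });
        System.out.println(callSites.get("methodB").call(21)); // triggers generation
        System.out.println(callSites.get("methodB").call(21)); // goes straight to the real code
    }
}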

Back patching is used for all kinds of code replacement in the virtual machine, not just for method generation. If, for example, a hot method has been regenerated to a more efficient version, the old version of the code is fitted with a trap at the start, and back patching takes place in a similar manner, gradually redirecting calls from the old method to the new one.

If there are no more references to an older version of a method, its native code buffer can be scheduled for garbage collection by the runtime system so as to unclutter the memory. This is necessary in a world that uses a total JIT strategy, because the amount of code produced can be quite large.

Note that back patching is a workaround of necessity: there is not enough time to traverse all previously compiled code looking for every call that needs updating.


Code generation requests

In JRockit, code generation requests are passed to the code generator from the runtime when a method needs to be compiled. The requests can be either synchronous or asynchronous.

Synchronous code generation requests do one of the following:

  1. Quickly generate a method for the JIT, with a specified level of efficiency

  2. Generate an optimized method, with a specified level of efficiency.

An asynchronous request does the following:

Act upon an invalidated assumption, for example, force regeneration of a method or patch the native code of a method.

Internally, JRockit keeps synchronous code generation requests in a code generation queue and optimization queue, depending on request type. The queues are consumed by one or more code generation and / or optimization threads, depending on system configuration.

The code generation queue contains generation requests for methods that are needed for program execution to proceed. These requests, except for special cases during bootstrapping, are essentially generated by trampolines. The "generate me" call that each trampoline contains inserts a request in the code generation queue, and blocks until the method generation is complete. The return value of the call is the address in memory where the new method starts, to which the trampoline finally jumps.
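A hypothetical sketch of this producer/consumer arrangement (all names invented; JRockit's real queues live inside the JVM, not in Java code):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;

// Model of the code generation queue: a trampoline-side thread blocks on
// its request until a code generation thread hands back a start address.
public class CodegenQueueSketch {
    static class Request {
        final String method;
        final SynchronousQueue<Long> reply = new SynchronousQueue<Long>();
        Request(String method) { this.method = method; }
    }

    static final BlockingQueue<Request> codegenQueue = new LinkedBlockingQueue<Request>();

    static long compile(String method) {
        return 0x4000L; // pretend we emitted native code at this address
    }

    public static void main(String[] args) throws InterruptedException {
        Thread codegenThread = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        Request r = codegenQueue.take(); // consume a request
                        r.reply.put(compile(r.method));  // unblock the caller
                    }
                } catch (InterruptedException ignored) { }
            }
        });
        codegenThread.setDaemon(true);
        codegenThread.start();

        Request req = new Request("methodB"); // the "generate me" call
        codegenQueue.put(req);
        long start = req.reply.take();        // block until generation completes
        System.out.printf("methodB starts at 0x%x%n", start);
    }
}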

Optimization requests

Optimization requests are added to the optimization queue whenever a method is found to be hot, that is, when the runtime system has realized that we are spending enough time executing the Java code of that method for optimization to be warranted.

The optimization queue understandably runs at a lower priority than the code generation queue, as its work is not necessary for code execution, but only for code performance. Also, an optimization request usually takes orders of magnitude longer to execute than a standard code generation request, trading compile time for efficient code.

On-stack replacement

Once an optimized version of a method is generated, the existing version of the code for that method needs to be replaced. As previously described, the method entry point of the existing code version of the method is overwritten with a trap instruction. Calls to the old method will be back patched to point to the new, optimized piece of code.

Some optimizers swap out code on the existing execution stack by replacing the code of a method with a new version in the middle of its execution. This is referred to as on-stack replacement and requires extensive bookkeeping. Though this is possible in a completely JIT-compiled world, it is easier to implement where there is an interpreter to fall back to.

JRockit doesn't do on-stack replacement, as the complexity required to do so is deemed too great. Even though the code for a more optimal version of the method may have been generated, JRockit will continue executing the old version of the method if it is currently running.

Bookkeeping

Object information for GC

For various reasons, a garbage collector needs to keep track of which registers and stack frame locations contain Java objects at any given point in the program. This information is generated by the JIT compiler and is stored in a database in the runtime system. The JIT compiler is the component responsible for creating this data because type information is available "for free" while generating code. The compiler has to deal with types anyway. In JRockit, the object meta info is called livemaps, and a detailed explanation of how the code generation system works with the garbage collector is given in Chapter 3, Adaptive Memory Management.

Assumptions made about the generated code

An assumption database is another part of the JRockit runtime that communicates with the code generator.

A walkthrough of method generation in JRockit

The JRockit IR format

The first stage of the JRockit code pipeline turns the bytecode into an Intermediate Representation (IR). As it is conceivable that other languages may be compiled by the same frontend, and also for convenience, optimizers tend to work with a common internal intermediate format.

JRockit works with an intermediate format that differs from bytecode, looking more like classic textbook compiler formats. This is the common approach that most compilers use, though of course the format of IR that a compiler uses always varies slightly depending on implementation and the language being compiled.

Aside from the previously mentioned portability issue, JRockit also doesn't work with bytecode internally because of the issues with unstructured control flow and the execution stack model, which differs from any modern hardware register model.

Because we lack the information to completely reconstruct the ASTs, a method in JRockit is represented as a directed graph, a control flow graph, whose nodes are basic blocks. The defining property of a basic block is that if one instruction in it is executed, all other instructions in it will be executed as well. Since there are no branches in our example, the md5_F function will turn into exactly one basic block.
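The md5_F method used as the running example is not reproduced in these notes. Judging from the IR dumps that follow, it is the F function of MD5, presumably along these lines:

// MD5's F function: F(x, y, z) = (x & y) | (~x & z).
// Note that ~x shows up as (xor x -1) in the IR dumps below.
static int md5_F(int x, int y, int z) {
    return (x & y) | ((~x) & z);
}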

Data flow

A basic block contains zero to many operations, which in turn have operands. Operands can be other operations (forming expression trees), variables (virtual registers or atomic operands), constants, addresses, and so on, depending on how close to the actual hardware representation the IR is.

JIT compilation

The following figure illustrates the different stages of the JRockit code pipeline:

`BC2HIR -->  HIR2MIR  -->  MIR2LIR  -->  RegAlloc  -->  EMIT`

Generating HIR

The first module in the code generator, BC2HIR, is the frontend against the bytecode, and its purpose is to quickly translate bytecodes into IR. HIR in this case stands for High-level Intermediate Representation.

This is the output, the High-level IR, or HIR:

params: v1 v2 v3
block0: [first] [id=0]
   10 @9:49    (i32) return {or {and v1 v2} {and {xor v1 -1} v3}}

In JRockit IR, the annotation @ before each statement identifies its program point in the code all the way down to assembler level. The first number following the @ is the bytecode offset of the expression and the last is the source code line number information. This is part of the complex meta info framework in JRockit that maps individual native instructions back to their Java program points.

The BC2HIR module that turns bytecodes into a control flow graph with expressions is not computationally complex.

MIR

MIR, or Middle-level Intermediate Representation, is the transform domain where most code optimizations take place. This is because most optimizations work best with three address code, or rather instructions that only contain atomic operands, not other instructions. Transforming HIR to MIR is simply an in-order traversal of the expression trees mentioned earlier and the creation of temporary variables. As no hardware deals with expression trees, it is natural that code turns into progressively simpler operations on the path through the code pipeline.

Our md5_F example would look something like the following code to the JIT compiler, when the expression trees have been flattened. Note that no operation contains other operations anymore. Each operation writes its result to a temporary variable, which is in turn used by later operations.

params: v1 v2 v3
block0: [first] [id=0]
    2 @2:49*    (i32) and       v1 v2 -> v4
    5 @5:49*    (i32) xor       v1 -1 -> v5
    7 @7:49*    (i32) and       v5 v3 -> v5
    8 @8:49*    (i32) or        v4 v5 -> v4
   10 @9:49*    (i32) return    v4

LIR

After MIR, it is time to turn platform-dependent, as we are approaching native code. LIR, or Low-level IR, looks different depending on hardware architecture.

Following is the LIR for the md5_F method on a 32-bit x86 platform:

params: v1 v2 v3
block 0: [first] [id=0]
    2 @2:49*    (i32)   x86_and         v2 v1 -> v2
   11 @2:49*    (i32)   x86_mov         v2 -> v4
    5 @5:49*    (i32)   x86_xor         v1 -1 -> v1
   12 @5:49*    (i32)   x86_mov         v1 -> v5
    7 @7:49*    (i32)   x86_and         v5 v3 -> v5
    8 @8:49*    (i32)   x86_or          v4 v5 -> v4
   14 @9:49*    (i32)   x86_mov         v4 -> eax
   13 @9:49*    (i32)   x86_ret         eax

Register allocation

There can be any number of virtual registers (variables) in the code, but the physical platform only has a small number of them. Therefore, the JIT compiler needs to do register allocation, transforming the virtual variable mappings into machine registers. If, at any given point in the program, we need to use more variables than there are physical registers in the machine, the local stack frame has to be used for temporary storage. This is called spilling, and the register allocator implements spills by inserting move instructions that shuffle registers back and forth from the stack. Naturally, spill moves incur overhead, so their placement is highly significant in optimized code.

We can also note that the register allocator has added a prologue and epilogue to the method, in which stack manipulation takes place. This is because it has figured out that it needs to use two callee-save registers for storage. A callee-save register is a register that a called method guarantees to leave intact for its caller; if one is used, its contents must be saved on the stack frame in the prologue and restored just before the method returns. By JRockit convention on x86, callee-save registers for Java code are ebx and ebp. Any calling convention typically includes a few callee-save registers, since if every register was potentially destroyed over a call, the end result would be even more spill code.

Generating optimized code

At each stage, an optimization module is plugged into the JIT.

A general overview

MIR readily transforms into Static Single Assignment (SSA) form, a transform domain that makes sure that every variable has only one definition. SSA transformation is part of virtually every commercial compiler today and makes implementing many code optimizations much easier. Another added benefit is that code optimizations in SSA form can be potentially more powerful.
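A tiny illustration of the SSA property, written as plain Java rather than JRockit IR: every redefinition of a variable becomes a fresh name, so each name has exactly one definition.

class SsaShape {
    // Pre-SSA: x has two definitions.
    static int before(int a, int b) {
        int x = a + b;
        x = x * 2;
        return x;
    }

    // SSA form: each definition introduces a new name (x1, x2), which
    // makes def-use chains trivial for the optimizer to follow.
    static int after(int a, int b) {
        int x1 = a + b;
        int x2 = x1 * 2;
        return x2;
    }
}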

LIR is platform-dependent and initially not register allocated, so transformations that form more efficient native operation sequences can be performed here.

The JRockit optimizer contains a very advanced register allocator that is based on a technique called graph fusion, which extends the standard graph coloring approximation algorithm to work on subregions in the IR. Graph fusion has the attractive property that the edges in the flow graph processed early generate fewer spills than the edges processed later. Therefore, if we can pick hot subregions before cold ones, the resulting code will be more optimal. An additional penalty comes from the need to insert shuffle code when fusing regions in order to form a complete method. Shuffle code consists of sequences of move instructions that copy the contents of one local register allocation into another one.

Finally, just before code emission, various peephole optimizations can be applied to the native code, replacing one to several register allocated instructions in sequence with more optimal ones.

How the optimizer works

Generating optimized code for a method in JRockit generally takes 10 to 100 times as long as JITing it with no demands for execution speed. Therefore, it is important to only optimize frequently executed methods.

Similar issues exist with boxed types. Boxed types turn into hidden objects (for example, instances of java.lang.Integer) on the bytecode level. Several traditional compiler optimizations, such as escape analysis, can often easily strip down a boxed type to its primitive value. This removes the hidden object allocation that javac put in the bytecode to implement the boxed type.
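A minimal case (whether the allocation really disappears depends on the particular JIT and its escape analysis):

public class BoxSketch {
    // javac compiles the assignment below to Integer.valueOf(a + b), a hidden
    // allocation. The box never escapes sum(), so escape analysis may replace
    // it with the primitive value and drop the allocation entirely.
    static int sum(int a, int b) {
        Integer boxed = a + b;
        return boxed; // auto-unboxed via Integer.intValue()
    }

    public static void main(String[] args) {
        System.out.println(sum(40, 2));
    }
}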

The translations quoted in this post all come from here.



Oracle JRockit The Definitive Guide - Reading Notes (Part 2)

Adaptive code generation

Java is dynamic in nature and certain code generation strategies fit less well than others. From the earlier discussion, the following conclusions can be drawn:

  1. Code generation should be done at runtime, not ahead of time.

  2. All methods cannot be treated equally by the code generator. There needs to be a way to discern a hot method from a cold one; otherwise, unnecessary optimization effort is spent on cold methods, or not enough on the hot ones.

  3. In a JIT compiler, bookkeeping needs to be in place in order to keep up with the adaptive runtime. This is because generated native code that has been invalidated by changes to the running program must be thrown away and potentially regenerated.

Achieving code execution efficiency in an adaptive runtime, no matter what JIT or interpretation strategy it uses, all boils down to the equation:

Total Execution Time = Code Generation Time + Execution Time

The JVM needs to know precisely which methods are worth the extra time spent on more elaborate code generation and optimization efforts.

Determining "hotness"

As we have seen, "one size fits all" code generation that interprets every method, or JIT compiling every method with a high optimization level, is a bad idea in an adaptive runtime. The former, because although it keeps code generation time down, execution time goes way up. The latter, because even though execution is fast, generating the highly optimized code takes up a significant part of the total runtime. We need to know if a method is hot or not in order to decide if we should give it lots of code generator attention, as we can't treat all methods the same.

The common denominator for all ways of profiling is that samples of where code spends execution time are collected. These are used by the runtime to make optimization decisions--the more samples available, the better informed the decisions.

Invocation counters

One way to sample hot methods is to use invocation counters. An invocation counter is typically associated with each method and is incremented when the method is called. This is done either by the bytecode interpreter or in the form of an extra add instruction compiled into the prologue of the native code version of the method.

Invocation counters incur overhead, however, especially in the JIT compiled world, where code execution speed doesn't disappear into interpretation overhead. The overhead usually comes in the form of cache misses in the CPU, because a particular location in memory has to be frequently written to by the add instruction at the start of each method.
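Conceptually, the instrumented prologue amounts to something like the following sketch (entirely hypothetical; a real JIT injects a single add instruction, not a map lookup):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of prologue-based invocation counting. The shared, frequently
// written counter locations are exactly what causes the cache misses
// mentioned above.
public class InvocationCounterSketch {
    static final int HOT_THRESHOLD = 10000; // invented threshold
    static final ConcurrentHashMap<String, AtomicInteger> counters =
            new ConcurrentHashMap<String, AtomicInteger>();

    static void countInvocation(String method) {
        AtomicInteger c = counters.get(method);
        if (c == null) {
            AtomicInteger fresh = new AtomicInteger();
            AtomicInteger prev = counters.putIfAbsent(method, fresh);
            c = (prev != null) ? prev : fresh;
        }
        if (c.incrementAndGet() == HOT_THRESHOLD) {
            System.out.println(method + " is hot: queue it for optimization");
        }
    }

    static int target(int x) {
        countInvocation("target"); // the injected "prologue"
        return x + 1;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++) target(i);
    }
}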

Software-based thread sampling

Another, more cache friendly, way to determine hotness is to use thread sampling. This means periodically examining where in the program Java threads are currently executing and logging their instruction pointers. Thread sampling requires no code instrumentation.

Stopping threads, which is normally required in order to extract their contexts, is, however, quite an expensive operation. Thus, getting a large amount of samples without disrupting anything requires either a complete JVM-internal thread implementation, a custom operating system such as in Oracle JRockit Virtual Edition, or specialized hardware support.

Hardware-based sampling

Certain hardware platforms, such as Intel IA-64, provide hardware instrumentation mechanisms that may be used by an application.

Optimizing a changing program

In object-oriented languages, virtual method dispatch is usually compiled as an indirect call (that is, the destination has to be read from memory) to addresses in a dispatch table. This is because a virtual call can have several possible receivers depending on the class hierarchy. A dispatch table exists for every class and contains the receivers of its virtual calls. A static method, or a virtual method that is known to have only one implementation, can instead be turned into a direct call with a fixed destination. This is typically much faster to execute.

The JVM solves this by "gambling". It bases its code generation decisions on assumptions that the world will remain unchanged forever, which is usually the case. If it turns out not to be so, its bookkeeping system triggers callbacks when an assumption is violated. When this happens, the code containing the original assumption needs to be regenerated--in our example, the static dispatch needs to be replaced by a virtual one. Having to revert code generated on the basis of an assumption about a closed world is typically very costly, but if it happens rarely enough, the benefit of the original assumption will deliver a performance increase anyway.

Some typical assumptions that the JIT compiler and JVM, in general, might bet on are:

  1. A virtual method probably won't be overridden. As it exists in only one version, it can always be called with a fixed destination address, like a static method (see the sketch after this list).

  2. A float will probably never be NaN. We can use hardware instructions instead of an expensive call to the native floating point library that is required for corner cases.

  3. The program probably won't throw an exception in a particular try block. Schedule the catch clause as cold code and give it less attention from the optimizer.

  4. The hardware instruction fsin probably has the right precision for most trigonometry. If it doesn't, cause an exception and call the native floating point library instead.

  5. A lock probably won't be too saturated and can start out as a fast spinlock.

  6. A lock will probably be repeatedly taken and released by the same thread, so the unlock operation and future reacquisitions of the lock can optimistically be treated as no-ops.
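
A sketch of the first assumption in Java terms (the classes are invented; the devirtualization itself happens inside the JIT, not in source code):

// With only Base loaded, Base.size() has a single implementation, so the
// JIT can compile sizeOf() with a direct, possibly inlined, call. The
// moment a class loader defines Sub, the assumption is violated and the
// bookkeeping system must regenerate sizeOf() with true virtual dispatch.
class Base {
    int size() { return 16; }
}

class Sub extends Base {
    @Override int size() { return 32; }
}

public class AssumptionSketch {
    static int sizeOf(Base b) {
        return b.size(); // direct call while only one version exists
    }

    public static void main(String[] args) {
        System.out.println(sizeOf(new Base())); // monomorphic so far
        System.out.println(sizeOf(new Sub()));  // loading Sub invalidates the bet
    }
}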

A static environment that was compiled ahead of time and runs in a closed world cannot, in general, make these kinds of assumptions. An adaptive runtime, however, can revert its invalidated decisions if the criteria they were based on are violated. In theory, it can make any crazy assumption that might pay off, as long as it can be reverted with small enough cost. Thus, an adaptive runtime is potentially far more powerful than a static environment, given that the "gambling" pays off.

Given that the gambling pays off--and JRockit uses runtime information feedback in all relevant areas to make the best decisions--an adaptive runtime has the potential to outperform a static environment every time.

Inside the JIT compiler

Working with bytecode

While compiled bytecode may sound low level, it is still a well-defined format that keeps its code (operations) and data (operands and constant pool entries) strictly separated from each other.

As we have seen, most bytecode operations pop operands from the stack and push results. No native platforms are stack machines; rather, they rely on registers for storing intermediate values. Mapping a language that uses local variables to native registers is straightforward, but mapping an evaluation stack to registers is slightly more complex. Java also defines plenty of virtual registers, the local variables, but uses an evaluation stack anyway. It is the authors' opinion that this is less than optimal.

Another problem, which in rare cases may be a design advantage, is the ability of Java bytecode to express more than Java source code can.

Bytecode "optimizers"

Our advice is to not use bytecode optimizers, ever!

Abstract syntax trees

A bytecode to native compiler can't simply assume that the given bytecode is compiled Java source code, but needs to cover all eventualities. A compiler whose frontend reads source code usually works by first tokenizing the source code into known constructs and building an Abstract Syntax Tree (AST).

Perhaps, in retrospect, it would have been a better design rationale to directly use an encoded version of the compiler's ASTs as the bytecode format. Various academic papers have shown that ASTs can be represented in an equally compact or more compact way than Java bytecode, so space is not a problem. Interpreting an AST at runtime would also be only slightly more difficult than interpreting bytecode.

Where to optimize

However, as we have explained, explicit optimization on the bytecode level is probably a good thing to avoid.

Adaptive optimization can never substitute bad algorithms with good ones. At most, it can make the bad ones run a little bit faster.

Exceptions are very expensive operations and are assumed to be just that--exceptions. The "gambling" behavior of the JVM, which bets that exceptions are rare, becomes a bad fit for code that throws them frequently.

The JRockit code pipeline

Why JRockit has no bytecode interpreter

JRockit uses the code generation strategy total JIT compilation.

Later, as JRockit became a major mainstream JVM, known for its performance, the need to diversify the code pipeline into client and server parts was recognized. No interpreter was added, however. Rather, the JIT was modified to differentiate even further between cold and hot code, enabling faster "sloppy" code generation the first time a method was encountered. This greatly improved startup time, but of course, getting to pure interpreter speeds with a compile-only approach is still very hard.

Another aspect that makes life easier with an interpreter is debuggability. Bytecode contains meta information about things like variable names and line numbers, which the debugger needs. In order to support debuggability, the JRockit JIT had to propagate this kind of information from the bytecode all the way down to the native code it generates, rather than adding an interpreter. This has the added benefit that, to our knowledge, JRockit is the only virtual machine that lets the user debug optimized code.

The main problems with the compile-only strategy in JRockit are code bloat (solved by garbage collecting the code buffers of methods no longer in use) and compilation time for large methods (solved by having a sloppy mode for the JIT).

Bootstrapping

The "brain" of the JRockit JVM is the runtime system itself. It keeps track of what goes on in the world that comprises the virtual execution environment. The runtime system is aware of which Java classes and methods make up the "world" and requests that the code generator compiles them at appropriate times with appropriate levels of code quality.

To simplify things a bit, the first thing the runtime wants to do when the JVM is started is to look up and jump to the main method of a Java program. This is done through a standard JNI call from the native JVM, just like any other native application would use JNI to call Java code.

Searching for main triggers a complex chain of actions and dependencies. A lot of other Java methods required for bootstrapping and fundamental JVM behavior need to be generated in order to resolve the main function. When main is finally ready and compiled to native code, the JVM can execute its first native-to-Java stub and pass control from the JVM to the Java program.

To study the bootstrap behavior of JRockit, try running a simple Java program with the command-line switch -Xverbose:codegen.
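For example, with any class that has a main method (HelloWorld here is just a placeholder name):

java -Xverbose:codegen HelloWorld

This should print a line per generated method, making the burst of bootstrap compilations described above visible.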



Cache: A Place for Concealment and Safekeeping

Lately I've gone a bit mad and started blogging like crazy, which is probably a good thing.

First, where this article came from. The blog it lives on was recommended by some big names on Weibo, and after much fiddling I finally managed to subscribe to it on my iPad. Then one sunny morning I crawled out of bed like a madman, went downstairs for a run, came back, picked up the iPad, opened Digg, and happened upon this article in my subscriptions. I was thoroughly hooked and couldn't put it down. Then the tragedy: apart from the handful of points I already knew, everything in it was new to me (of course that was bound to happen). Reading it to the point where I (think I) understood it took a good chunk of time, roughly a day and a half.

I really don't want to translate the article's title; nothing I come up with feels right, so let it be.

-------------------------------------------------------------------call me the fabulous divider---------------------------------------------------------

This post is mainly about how CPU caches are implemented on Intel processors. It's worth mentioning that explanations of caches usually muddle the basic concepts and lack vivid examples, though I can't rule out that this is down to my own low IQ. Anyway, here is part 1 of how a dual-core CPU's L1 cache works:

A line, the unit of data in the cache, is just a contiguous block of bytes in memory. As shown in the figure, this cache uses 64-byte lines. The lines are stored in cache banks, also called ways, and each way has a dedicated directory that stores its bookkeeping information. You can picture each way with its directory as a column in a spreadsheet, with the sets as the rows; the directory is what lets you locate a particular line. The cache in the figure has 64 sets, each containing 8 ways, so it holds 512 lines, 32KB in total.

In the figure, physical memory is divided into 4KB physical pages, each containing 4KB / 64 bytes = 64 lines. Within a 4KB page, bytes 0-63 are the first line, bytes 64-127 the second line, and so on. Every page is organized this way, so the third line of page 0 is distinct from the third line of page 1.

In a fully associative cache, any line in memory can be stored in any of the cache's lines, which makes storage flexible but lookups exceedingly complex. Since L1 and L2 caches operate under harsh constraints on power, physical area, and speed, full associativity is rarely practical for them.

The cache in the figure is set associative, meaning a given line in memory can only be stored in one specific set (or row). The first line of every physical page (bytes 0-63) must be stored in row 0, the second line in row 1, and so on. Each row in the figure has 8 cells available to hold the lines mapped to it, making it 8-way set associative. When a memory address is looked up, bits 11-6 give the line number within its 4KB page, and therefore determine the set. For example, physical address 0x800010a0 (binary 1000 0000 0000 0000 0001 0000 1010 0000) has 000010 in bits 11-6, so it must be stored in set 2.

So far we still can't tell which cell within the row holds the line; that's the directory's job. Each line is tagged by a directory entry recording the number of the physical page the line came from. The processor in the figure can address 64GB of physical RAM, giving 64GB / 4KB = 2^24 pages, so the directory tag needs 24 bits. The page number for our example address is 0x800010a0 / 4KB = 524,289. Now for part 2 of how a dual-core CPU's L1 cache works:

Since each set has only 8 ways, tag matching is very fast. The figure uses arrows to show the tags being compared in parallel; if a valid line with a matching tag is found, that counts as a cache hit, otherwise the lookup goes to the L2 cache, and failing that, to physical memory. Intel's L2 caches work on the same principle as L1, just bigger and with more ways. For example, adding another 8 ways gives a 64KB cache (= 4KB x 16); growing the number of sets to 4096 makes each way 256KB (= 4096 x 64 bytes), and with these two simple increases the L2 cache reaches 4MB (= 256KB x 16). In that case the tag needs 18 bits (= 36 - 12 - 6) and the set index needs 12 bits (4096 = 2^12); the way size no longer matches the physical page size.

If a set fills up, a line must be evicted before the next one can be stored. To keep this from happening frequently, performance-sensitive programs organize their data so that memory accesses are spread evenly across lines. For example, a program may have an array of 512-byte objects scattered across different pages in memory. If the same field of each object always lands on the same line offset, those fields compete for the same set; if the program hammers that field (say, the vtable pointer, via frequent virtual calls), the set fills up, lines are constantly evicted and refilled, and cache utilization suffers badly. The example L1 cache can only hold the vtables of 8 such objects at a time. Low hit rates from set collisions, possibly dragging down the whole cache, are the price of set associativity; but given the relative speeds inside a computer, almost no application needs to worry about this.

Memory is usually addressed with virtual addresses, so the L1 cache relies on the paging unit to obtain the physical page address used for the tags. By convention, the set index comes from the low bits of the address (bits 11-6 in our example), which need no translation. So the L1 tag depends on the physical address while the set index depends on the virtual one, letting the CPU run both lookups in parallel. Because an L1 way is never larger than the MMU page size, a given physical memory location is guaranteed to map to the same set index. The L2 cache is another story: its ways can be larger than a page, so the tag must be physical and the set index must be physical as well. By the time a request reaches L2, however, L1 has already computed the physical address, so L2 gets along fine.

Finally, the directory also stores each line's state, such as invalid or shared. In the L1 and L2 caches a line's state is one of the four MESI states: Modified, Exclusive, Shared, or Invalid.

Update: Dave brought up direct-mapped caches in the comments; a direct-mapped cache is really just 1-way set associative. It is the opposite of fully associative: lookups are extremely fast, but collision rates are high and hit rates low.

-------------------------------------------------------------------call me the fabulous divider---------------------------------------------------------

Everything below is my own understanding of the article above, plus the homework I did while trying to understand it.

First of all, where does this part of the first figure come from?

All we are given is this: L1 Cache - 32KB, 8-way set associative, 64-byte cache lines; plus the fact that physical addresses are 36 bits.

When I first read the article I was hooked but understood almost none of it, not even the jargon. I had to google cache structure, and via Wikipedia found the first reference below, an article introducing CPU caches. This passage answered my question:

Cache entry structure

Cache row entries usually have the following structure:

tag | data block | flag bits

The data block (cache line) contains the actual data fetched from the main memory. The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below.

The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in the cache. (The number of tag and flag bits is irrelevant to this calculation, although it does affect the physical area of a cache.) An effective memory address is split (MSB to LSB) into the tag, the index and the block offset.

tag | index | block offset

The index describes which cache row (which cache line) the data has been put in. The index length is ⌈log2(r)⌉ bits for r cache rows. The block offset specifies the desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is ⌈log2(b)⌉ bits, where b is the number of bytes per data block. The tag contains the most significant bits of the address, which are checked against the current row (the row has been retrieved by index) to see if it is the one we need or another, irrelevant memory location that happened to have the same index bits as the one we want. The tag length in bits is address_length - index_length - block_offset_length.

Some authors refer to the block offset as simply the "offset" or the "displacement".

The block offset corresponds to "Offset into cache line" in the figure. We know the lines are 64 bytes = 2^6, so the block offset needs 6 bits. Likewise, the index corresponds to "Set Index" in the figure: 32KB / 64 bytes = 512 lines, and each set holds 8 ways (the terminology really is a mess), so 512 lines / 8 ways = 64 sets = 2^6, which means the index also needs 6 bits to address the 64 sets. The figure says addresses are 36 bits, so the tag gets 36 - 6 - 6 = 24 bits.
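The arithmetic is easy to check with a few lines of Java; the constants below describe the example L1 cache, and 0x800010a0 is the article's example address:

public class CacheBitsSketch {
    static final int OFFSET_BITS = 6; // 64-byte lines
    static final int INDEX_BITS  = 6; // 64 sets

    public static void main(String[] args) {
        long addr   = 0x800010a0L; // 36-bit physical address
        long offset = addr & ((1L << OFFSET_BITS) - 1);
        long set    = (addr >> OFFSET_BITS) & ((1L << INDEX_BITS) - 1);
        long tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        System.out.println("offset = " + offset); // 32
        System.out.println("set    = " + set);    // 2
        // Offset plus index bits cover exactly one 4KB page here, so the
        // 24-bit tag equals the physical page number, 524289.
        System.out.println("tag    = " + tag);
    }
}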

Finally, a rant about the GFW: it is disgusting beyond words and left kind-hearted me thoroughly rattled. I started out reading blogs in Xianguo, and when I couldn't stand its beastly layout any longer I moved to Digg, which, predictably, got blocked. At first I was in the dark about why I couldn't log in to Digg (via my Google account) and quietly cursed it as garbage, until a round of googling suggested it had been walled. I use Green VPN's free 200MB plan, a decent service that the famous 池建强 has also recommended, so I'm happy to advertise for them (the real reason being that this blog gets so little traffic it hardly matters, ha). Once the VPN connected, gravity returned to normal and everything flowed again. Damned GFW, fuck you!

EDIT

While washing up just now I realized I'd made a stupid mistake: I forgot to link to the original article. My bad.

Original article:

http://duartes.org/gustavo/blog/post/intel-cpu-caches/

References:

http://en.wikipedia.org/wiki/CPU_cache

http://zh.wikipedia.org/zh-cn/CPU缓存

http://zh.wikipedia.org/wiki/内存管理单元

http://en.wikipedia.org/wiki/Vtable

http://www.csbio.unc.edu/mcmillan/Media/L20Spring2012.pdf



Oracle JRockit The Definitive Guide - Reading Notes (Part 1)

Since starting work I like to think I've read a fair number of books, but only at the level of having read them. There is a huge gap between reading a book and understanding it. I don't know if anyone else is like me: some passages set my heart racing, and after one night's sleep they're gone. After enough of those bloody lessons, I've decided to stop being lazy and take reading notes whenever I can.

Chapter 1: Getting Started

Command-line option

There are three main types of command-line options to JRockit--system properties, standardized options (-X flags), and non-standard ones (-XX flags).

System properties

Arguments starting with -D are interpreted as directives to set a system property.

Standardized options

Configuration settings for the JVM typically start with -X for settings that are commonly supported across vendors.

Non-standard options

Vendor-specific configuration options are usually prefixed with -XX. These options should be treated as potentially unsupported and subject to change without notice. If any JVM setup depends on -XX-prefixed options, those flags should be removed or ported before an application is started on a JVM from a different vendor.

Once the JVM options have been determined, the user application can be started. Typically, moving an existing application to JRockit leads to an increase in runtime performance and a slight increase in memory consumption.

The JVM documentation should always be consulted to determine if non-standard command-line options have the same semantics between different JVMs and JVM versions.

From Chapter 1 I've extracted only these few points:

JRockit command-line options come in three kinds: system properties (starting with `-D`), standardized options (starting with `-X`), and non-standard options (starting with `-XX`).

Note that when using non-standard options, you should check the documentation for your JVM version to make sure they are supported.

Chapter 2: Adaptive Code Generation

The Java Virtual Machine

The JVM is required to turn the bytecodes into native code for the CPU on which the Java application executes. This can be done in one of the following two ways (or a combination of both):

  1. The Java Virtual Machine specification fully describes the JVM as a state machine, so there is no need to actually translate bytecode to native code. The JVM can emulate the entire execution state of the Java program, including emulating each bytecode instruction as a function of the JVM state. This is referred to as bytecode interpretation. The only native code (barring JNI) that executes directly here is the JVM itself.

  2. The Java Virtual Machine compiles the bytecode that is to be executed to native code for a particular platform and then calls the native code. When bytecode programs are compiled to native code, this is typically done one method at a time, just before the method in question is to be executed for the first time. This is known as Just-In-Time compilation (JIT).

Naturally, a native code version of a program executes orders of magnitude faster than an interpreted one. The tradeoff is, as we shall see, bookkeeping and compilation time overhead.

Stack machine

The Java Virtual Machine is a stack machine. All bytecode operations, with few exceptions, are computed on an evaluation stack by popping operands from the stack, executing the operation and pushing the result back to the stack.

In addition to the stack, the bytecode format specifies up to 65,536 registers or local variables.

An operation in bytecode is encoded by just one byte, so Java supports up to 256 opcodes, from which most available values are claimed. Each operation has a unique byte value and a human-readable mnemonic.

Bytecode format

Slot 0 in an instance method is reserved for this, according to the JVM specification, and this particular example is an instance method.
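The example method these notes refer to is not included; from the description (an instance method whose bytecode is four bytes long) it is presumably something like this:

public int add(int a, int b) {
    return a + b;
}

// javap -c output, four bytes of bytecode; local slot 0 holds 'this':
//   0: iload_1    // push local slot 1 (a)
//   1: iload_2    // push local slot 2 (b)
//   2: iadd       // pop two ints, push their sum
//   3: ireturn    // pop and return the top of the stack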

Operations and operands

As we see, Java bytecode is a relatively compact format, the previous method only being four bytes in length (a fraction of the source code mass). Operations are always encoded with one byte for the opcode, followed by an optional number of operands of variable length. Typically, a bytecode instruction complete with operands is just one to three bytes.

Other, more complex constructs, such as table switches, also exist in bytecode, with an entire jump table of offsets following the opcode.

The constant pool

A program requires data as well as code. Data is used for operands. The operand data for a bytecode program can, as we have seen, be kept in the bytecode instruction itself. But this is only true when the data is small enough, or commonly used (such as the constant 0).

Larger chunks of data, such as string constants or large numbers, are stored in a constant pool at the beginning of the .class file. Indexes to the data in the pool are used as operands instead of the actual data itself. If the string aVeryLongFunctionName had to be separately encoded in a compiled method each time it was operated on, bytecode would not be compact at all.
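For instance (a hypothetical snippet; the pool index #2 is made up), a string load compiles to an ldc instruction whose operand is a constant pool index, not the string itself:

class ConstantPoolExample {
    String name() {
        return "aVeryLongFunctionName";
        // compiles to roughly:
        //   ldc #2     // #2 is a constant pool index referring to the string
        //   areturn    // the instructions stay tiny regardless of string length
    }
}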

Code generation strategies

Pure bytecode interpretation

Early JVMs contained only simple bytecode interpreters as a means of executing Java code. To simplify this a little, a bytecode interpreter is just a main function with a large switch construct on the possible opcodes. The function is called with a state representing the contents of the Java evaluation stack and the local variables. Interpreting a bytecode operation uses this state as input and output. All in all, the fundamentals of a working interpreter shouldn't amount to more than a couple of thousand lines of code.
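A toy version of such an interpreter, for an invented instruction set with just three opcodes, shows the shape of the switch construct:

// Toy bytecode interpreter: a loop around a switch, operating on an
// evaluation stack -- the structure described above, minus ~2000 lines.
public class ToyInterpreter {
    static final int PUSH = 0, ADD = 1, RET = 2; // invented opcodes

    static int interpret(int[] code) {
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (true) {
            switch (code[pc++]) {
                case PUSH: stack[sp++] = code[pc++]; break;        // operand follows opcode
                case ADD:  stack[sp - 2] += stack[sp - 1]; sp--; break;
                case RET:  return stack[--sp];
            }
        }
    }

    public static void main(String[] args) {
        // PUSH 20, PUSH 22, ADD, RET -> 42
        System.out.println(interpret(new int[] { PUSH, 20, PUSH, 22, ADD, RET }));
    }
}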

There are several simplicity benefits to using a pure interpreter. The code generator of an interpreting JVM just needs to be recompiled to support a new hardware architecture. No new native compiler needs to be written. Also, a native compiler for just one platform is probably much larger than our simple switch construct.

A pure bytecode interpreter also needs little bookkeeping. A JVM that compiles some or all methods to native code would need to keep track of all compiled code. If a method is changed at runtime, which Java allows, it needs to be scheduled for regeneration, as the old code is obsolete. In a pure interpreter, its new bytecodes are simply interpreted again from the start the next time the method is executed.

It follows that the amount of bookkeeping in a completely interpreted model is minimal. This lends itself well to an adaptive runtime such as a JVM, where things change all the time.

Naturally, there is a significant performance penalty to a purely interpreted language when comparing the execution time of an interpreted method with a native code version of the same code.

Static compilation

Usually, an entire Java program was compiled into native code before execution. This is known as ahead-of-time compilation.

The obvious disadvantage of static compilation for Java is that the benefits of platform independence immediately disappear. The JVM is removed from the equation.

Another disadvantage is that the automatic memory management of Java has to be handled more or less explicitly, leading to limited implementations with scalability issues.

Total JIT compilation

Another way to speed up bytecode execution is to not use an interpreter at all, and JIT compile all Java methods to native code immediately when they are first encountered. The compilation takes place at runtime, inside the JVM, not ahead-of-time.

Total JIT compilation has the advantage that we do not need to maintain an interpreter, but the disadvantage is that compile time becomes a factor in the total runtime. While we definitely see benefits in JIT compiling hot methods, we also unnecessarily spend expensive compile time on cold methods and methods that are run only once. Those methods might as well have been interpreted instead.

The main disadvantage of total JIT compilation is still low code generation speed. In the same way that an interpreted method executes hundreds of times slower than a native one, a native method that has to be generated from Java bytecodes takes hundreds of times longer to get ready for execution than an interpreted method. When using total JIT compilation, it is extremely important to spend clock cycles on optimizing code only where it will pay off in better execution time. The mechanism that detects hot methods has to be very advanced, indeed. Even a quick and dirty JIT compiler is still significantly slower at getting code ready for execution than a pure interpreter. The interpreter never needs to translate bytecodes into anything else.

Another issue that becomes more important with total JIT compilation is the large amount of throwaway code that is produced. If a method is regenerated, for example because assumptions made by the compiler are no longer valid, the old code takes up precious memory. The same is true for a method that has been optimized. Therefore, the JVM requires some kind of "garbage collection" for generated code, or a system with large amounts of JIT compilation would slowly run out of native memory as code buffers grow.

JRockit is an example of a JVM that uses an advanced variant of total JIT compilation as its code generation strategy.

Mixed mode interpretation

The first workable solution that was proposed, that would both increase execution speed and not compromise the dynamic nature of Java, was mixed mode interpretation.

In a JVM using mixed mode interpretation, all methods start out as interpreted when they are first encountered. However, when a method is found to be hot, it is scheduled for JIT compilation and turned into more efficient native code. This adaptive approach is similar to that of keeping different code quality levels in the JIT, described in the previous section.

Detecting hot methods is a fundamental functionality of every modern JVM, regardless of code execution model, and it will be covered to a greater extent later in this chapter. Early mixed mode interpreters typically detected the hotness of a method by counting the number of times it was invoked. If this number was large enough, optimizing JIT compilation would be triggered for the method.

Similar to total JIT compilation, if the process of determining whether a method is hot is good enough, the JVM spends compilation time only on the methods where it makes the most difference. If a method is seldom executed, the JVM would waste no time turning it into native code, but rather keep interpreting it each time it is called.

Sun Microsystems was the first vendor to embrace mixed mode interpretation in the HotSpot compiler, available both in a client version and a server-side version, the latter with more advanced code optimizations. HotSpot, in turn, was based on technology acquired from Longview Technologies LLC (which started out as Animorphic).

In mixed mode, code regeneration is no longer a critical problem. If a method's native code needs to be regenerated, the JVM simply throws the compiled code away and lets the interpreter execute the method the next time it is called. After that, if the method is still hot, it will eventually be recompiled and optimized again.

Note: almost all of the translations quoted in this post come from here. I once overestimated myself and tried to translate the text on my own, but compared with that gentleman's version mine was too painful to look at. Translation really is an art; all I can do is admire with envy.



Hello BTrace

I'd heard long ago that a magical tool called BTrace existed, but I was always too lazy to actually use it. Lately I've been plagued by data issues in the test and staging environments: today someone needs this interface's data logged to verify a feature, tomorrow someone needs another interface's return value and arguments. Worse, these requests always flood in right at quitting time, which I simply cannot bear. I'm fed up with the cycle of adding log statements, redeploying the test environment, restarting services, and tailing log files. Staging is even more tragic: I have to beg an ops guy to overwrite a package for me, and buy him a drink for the trouble. So I decided to try BTrace and free myself from this unbearable routine. My goal is narrow: the input arguments and return values of methods during execution. I haven't touched using BTrace to dump memory or stack information.

Installing BTrace on Linux

First, download a BTrace release from here.

Next, extract it into an appropriate directory: tar -xvf btrace-bin.tar.gz

Then, configure BTRACE_HOME. I use zsh, so I edited ~/.zshrc and appended export BTRACE_HOME=/export/servers/btrace, then added BTRACE_HOME to PATH: export PATH=$PATH:$HADOOP_HOME/bin:$BTRACE_HOME/bin:$SBT_HOME/bin:$MAVEN_HOME/bin:$ANT_HOME/bin

Finally, you may also need to adjust JAVA_HOME in the btrace script, depending on your setup.

That completes the installation. How do you verify it? Type the btrace command in a terminal; output like the following means the installation and configuration succeeded:

➜ bin btrace
Usage: btrace <options> <pid> <btrace source or .class file> <btrace arguments>
where possible options include:
-classpath <path> Specify where to find user class files and annotation processors
-cp <path> Specify where to find user class files and annotation processors
-I <path> Specify where to find include files
-p <port> Specify port to which the btrace agent listens for clients

Before settling on BTrace I did make one other small effort and tried HouseMD, but I gave up during installation, for two reasons: first, it's written in Scala, and dim as I am, preparing the environment felt like too much hassle; second, using it in production carries real setup cost, unlike BTrace, which needs nothing beyond a Java environment.

The official BTrace description reads:

BTrace is a safe, dynamic tracing tool for Java. BTrace works by dynamically (bytecode) instrumenting classes of a running Java program. BTrace inserts tracing actions into the classes of a running Java program and hotswaps the traced program classes.

BTrace calls itself safe, and that safety brings quite a few restrictions:

  1. It cannot create objects
  2. It cannot throw or catch exceptions
  3. It cannot use the synchronized keyword
  4. It cannot write to instance or static variables of the target program
  5. It cannot call instance or static methods of the target program
  6. The script's fields and methods must all be static
  7. The script cannot contain outer, inner, or nested classes
  8. The script cannot contain loops, cannot extend any class or implement any interface, and cannot use assert statements

These restrictions are quoted from here; that article is a good introduction and worth reading.

Now for the main point of this post: time to step into the pits!

Pitfall 1: toString

In safe mode you can only use BTrace's built-in functionality; anything JDK-related is off limits, or compilation fails. BTraceUtils provides many static methods for this, such as str(), which invokes the target object's toString() method. And that's where I stepped into a pit.

First, two examples:

import java.util.Set;
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class TestBtrace {
    
    @OnMethod(
            clazz = "com.xxxx.rpc.impl.ProductRPCImpl", 
            method = "queryProduct", 
            location = @Location(Kind.RETURN)
            )
    public static void queryProduct(@Self Object self, Set<Long> skuIds, @Return Object result){
        
        println(strcat("入参: ", str(skuIds)));
        println(strcat("出参: ", str(result)));
    }
}

The return value result is of type Set<ProductInfo>.

The program printed:

入参: [10270495]
出参: [com.xxx.domain.ProductInfo@40363c[skuId=10270495,name=染整工艺实验教程(附赠光盘1张),category=1713;3282;3709,valueWeight=0.399,imagePath=17220/c1168702-51bd-4645-946b-74f0700b1300.jpg,venderId=0,venderType=0,venderName=,wstate=1,businessCode=bk0193,maxPurchQty=0,wyn=1,length=0,width=0,height=0,brandId=0,extFieldMap={},valuePayFirst=0,skuMark=]]

Because I had overridden toString() on ProductInfo, the object's field values were printed. Conclusion from the first example: the overridden toString() took effect.

Example 2:

import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class TestBtrace {
    
    @OnMethod(
            clazz = "com.xxxx.impl.PromotionProxyImpl", 
            method = "calculate", 
            location = @Location(Kind.RETURN)
            )
    public static void calculate(@Self Object self, @Return Object result){
        println(strcat("出参: ", str(result)));
    }
}

Here result is a custom Result type, which also overrides toString().

出参: com.xxx.domain.Result@18eb9f6

Judging by this output, my overridden toString() was never executed at all.

At this point I was completely confused: why does it work in one case and not the other? A round of googling turned up no satisfying answer.

Getting nowhere, I went all in: I posted a thread on the BTrace site in my broken Chinese English asking the author what on earth was going on!! The original thread is here.

Here is an excerpt of my clumsy English and the author's patient answer.

Question:
Why i want to use unsafe mode is because i found a strange behavior in safe mode:

I want to use business domain in BTrace script, and override the toString() method for str(), i found somtimes it does not use the toString() which i overrided,and print a address of the domain. and in another case, it print the right result of toString() returnd.Is that the right behavior? And somebody tell me BTrace script will always use the orginal toString() because of safe mode, i don't think so.
Answer:
For the second part - toString() is only invoked for instances of the classes loaded by the bootstrap classloader (system classes). For all the other objects an identity string is returned to prevent the execution of unknown code.

Did everyone catch that? toString() is only invoked on instances of classes loaded by the bootstrap classloader (system classes); for everything else an identity string is returned, to prevent the execution of unknown code. I felt the truth surging through me: so that was it. The first example's return value was a Set, naturally loaded by the bootstrap loader, and since the Set's own toString() iterates its elements and calls their toString() internally, the generic element type basked in the glory as well. In example two, Result is a custom class loaded by the application classloader, so it naturally got different treatment. That explains the original mystery perfectly. But I'm no ordinary person; I'm a champion among champions at getting to the bottom of things, so, hmph, I dug out the mighty BTrace source to see for myself:

/**
 * Returns a string representation of the object. In general, the
 * <code>toString</code> method returns a string that
 * "textually represents" this object. The result should
 * be a concise but informative representation that is easy for a
 * person to read. For bootstrap classes, returns the result of
 * calling Object.toString() override. For non-bootstrap classes,
 * default toString() value [className@hashCode] is returned.
 *
 * @param  obj the object whose string representation is returned
 * @return a string representation of the given object.
 */
public static String str(Object obj) {
    if (obj == null) {
        return "null";
    } else if (obj instanceof String)    {
        return (String) obj;
    } else if (obj.getClass().getClassLoader() == null) {
        try {
            return obj.toString();
        } catch (NullPointerException e) {
            // NPE can be thrown from inside the toString() method we have no control over
            return "null";
        }
    } else {
        return identityStr(obj);
    }
}

So the truth was out. But my problem remained unsolved: I still needed to print the field values of business objects!! Fine, let's try unsafe mode. And then I promptly fell into the next pit.

Pitfall 2: unsafe mode

Safe mode has far too many restrictions, which I likewise cannot bear, reasonable as they are. As long as I can ensure safety myself, I prefer unsafe mode: no more timidity, no more restrictions, smooth sailing. And then poor me went down on my knees trying to enable it.

Enabling unsafe mode takes two steps: first, edit $BTRACE_HOME/bin/btrace and change the startup argument -Dcom.sun.btrace.unsafe to true; then add unsafe = true to the @BTrace annotation in the BTrace script.

And then I harvested this marvelous pile of logs:

DEBUG: btrace debug mode is set
DEBUG: btrace unsafe mode is set
DEBUG: assuming default port 2020
DEBUG: assuming default classpath '.'
DEBUG: attaching to 1625
DEBUG: checking port availability: 2020
DEBUG: attached to 1625
DEBUG: loading /export/servers/btrace/build/btrace-agent.jar
DEBUG: agent args: port=2020,debug=true,unsafe=true,systemClassPath=/export/servers/jdk1.6.0_25/lib/tools.jar,probeDescPath=.
DEBUG: loaded /export/servers/btrace/build/btrace-agent.jar
DEBUG: registering shutdown hook
DEBUG: registering signal handler for SIGINT
DEBUG: submitting the BTrace program
DEBUG: opening socket to 2020
DEBUG: sending instrument command
DEBUG: entering into command loop
DEBUG: received com.sun.btrace.comm.ErrorCommand@3c24c4a3
com.sun.btrace.VerifierException: Unsafe mode, requested by the script, not allowed
at com.sun.btrace.runtime.Verifier.reportError(Verifier.java:385)
at com.sun.btrace.runtime.Verifier.reportError(Verifier.java:376)
at com.sun.btrace.runtime.Verifier$1.visit(Verifier.java:141)
at com.sun.btrace.org.objectweb.asm.ClassReader.a(Unknown Source)
at com.sun.btrace.org.objectweb.asm.ClassReader.a(Unknown Source)
at com.sun.btrace.org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.sun.btrace.org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.sun.btrace.runtime.InstrumentUtils.accept(InstrumentUtils.java:66)
at com.sun.btrace.runtime.InstrumentUtils.accept(InstrumentUtils.java:62)
at com.sun.btrace.agent.Client.verify(Client.java:397)
at com.sun.btrace.agent.Client.loadClass(Client.java:224)
at com.sun.btrace.agent.RemoteClient.<init>(RemoteClient.java:59)
at com.sun.btrace.agent.Main.startServer(Main.java:379)
at com.sun.btrace.agent.Main.access$000(Main.java:65)
at com.sun.btrace.agent.Main$3.run(Main.java:166)
at java.lang.Thread.run(Thread.java:662)
DEBUG: received com.sun.btrace.comm.ExitCommand@11e9c82e

Magical, right? The log above clearly says unsafe mode is set, and indeed it is. Yet it still complains that unsafe mode is not allowed. What is this supposed to be...

Another round of searching and, damn it, still nothing; I was close to despair. I burned an entire afternoon plus half an evening on this one problem, until I finally, in a fit of rage, switched off the monitor and left without a backward glance. The next morning I tried again and it just worked. Can you believe it??

Delighted and alarmed in equal measure, I kept wondering what had actually made it work. I tried all sorts of things, in the desperate, try-anything manner of a patient grasping at straws, and with my dumb luck actually found it: after changing the BTrace mode, restart the target application. That looked like the answer, but if true it is intolerable: there would be no way to use this in production. So off I went to the site again to 'have it out' with the author.

Question:
Hi brother, there is a mistake here, i am using BTrace 1.2.4 not the 2. I am so sorry
But it really happened in BTrace 1.2.4, and finally i fount it works after restarting the application.
Then i try to change -Dcom.sun.btrace.unsafe to false, and i can still run the BTrace script in unsafe mode.
So, is that i must restarting the application,if changed the mode?
Answer:
For the first part - once you start the agent it will keep its unsafe flag forever. You could start the application with BTrace Agent and a dummy script to allow unsafe BTrace scripts eg.

Well, the gracious author patiently answered my questions and even thanked me every time, which was quite embarrassing. So this pit got filled in too, though honestly, the real problem was my own lack of skill and patience to read the documentation.

Before wrapping up, let me squeeze in one more low-level problem that bothered me for ages when using BTrace; I hope it hasn't bothered you.

Namely this: what do you do when the script depends on multiple jars?

Depending on a single jar is obviously like this:

btrace -classpath ./a.jar 109776 TestServiceBtrace.java

With multiple jars I was stumped; in the end it was R大 who gave the answer:

-classpath takes a single string that can concatenate multiple paths. On Windows the separator is a semicolon: -classpath ./a.jar;./b.jar. On Linux it is a colon: -classpath ./a.jar:./b.jar
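So for this case, on Linux, the invocation would presumably be:

btrace -classpath ./a.jar:./b.jar 109776 TestServiceBtrace.java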

The original thread is here.

