Compact Strings in Java 9

One of the enhancements in Java 9 is Compact Strings, whose goal is to make the String class and related classes more space efficient while maintaining performance in most scenarios.

Motivation for introducing Compact Strings in Java

Until Java 8, a String was stored internally as a char array encoded in UTF-16, so each character took 2 bytes of space.

Data gathered from many different applications indicates that strings are a major component of heap usage. Moreover, most String objects contain only Latin-1 (also called ISO-8859-1) characters. Latin-1 is an 8-bit character set, so each character needs 1 byte of space, i.e. 1 byte less than UTF-16. If such strings are stored using Latin-1 encoding, the memory used by String objects is reduced substantially. That is the motivation behind compact Strings in Java.

Java 9 compact Strings

From Java 9 onward this space optimization is built into the String class through a feature called compact Strings.

Instead of a char array, from Java 9 onward a String is stored internally as a byte array plus an encoding-flag field.

The new String class stores characters encoded as ISO-8859-1/Latin-1 (1 byte per character) if all the characters of the String can be stored using 1 byte each.

If any character of the String needs 2 bytes (i.e. it lies outside the Latin-1 range, above U+00FF), all the characters of the String are stored as UTF-16 (2 bytes per character).

Whether UTF-16 or Latin-1 encoding is used for a particular String is recorded in the encoding-flag field, which is named coder.
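The rule described above can be sketched in a few lines. This is only an illustration, not the JDK code: the class EncodingChoiceSketch and its methods pickCoder and payloadBytes are hypothetical names, and the two constants simply mirror the coder values quoted later in this post.

```java
final class EncodingChoiceSketch {
    // Hypothetical constants mirroring the coder values quoted in this post.
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Returns LATIN1 only if every char fits in one byte (value <= 0xFF);
    // a single wider char forces UTF16 for the whole string.
    static byte pickCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Size of the character payload: 1 byte per char for LATIN1, 2 for UTF16.
    // Object and array headers are ignored in this estimate.
    static int payloadBytes(String s) {
        return s.length() << pickCoder(s);
    }

    public static void main(String[] args) {
        // "caf\u00E9" ends with e-acute (inside Latin-1); "\u20AC" is the euro sign (outside Latin-1).
        String[] samples = {"hello", "caf\u00E9", "hello \u20AC"};
        for (String s : samples) {
            System.out.println(s.length() + " chars, coder " + pickCoder(s)
                    + ", " + payloadBytes(s) + " payload bytes");
        }
    }
}
```

Note that one non-Latin-1 character is enough to double the payload of the whole String, which is why the saving depends on the kind of text an application handles.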

So in the Java 8 String class the characters were stored using this field-

/** The value is used for character storage. */
private final char value[];

From Java 9 onward this is changed to a byte[]-

@Stable
private final byte[] value;

A flag (a field named coder) that identifies the encoding is also added-

/**
 * The identifier of the encoding used to encode the bytes in
 * {@code value}. The supported values in this implementation are
 *
 * LATIN1
 * UTF16
 *
 * @implNote This field is trusted by the VM, and is a subject to
 * constant folding if String instance is constant. Overwriting this
 * field after construction will cause problems.
 */
private final byte coder;

The coder field can have either of the following two values-

@Native static final byte LATIN1 = 0;
@Native static final byte UTF16  = 1;
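To see how a byte[] plus a coder can represent the same text, here is a simplified model. It is a sketch under stated assumptions, not the real String implementation: the class CompactTextSketch is hypothetical, and it stores UTF-16 characters high byte first, whereas the JDK uses the platform's byte order.

```java
final class CompactTextSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte[] value; // character storage, 1 or 2 bytes per char
    final byte coder;   // tells how the bytes in value are to be read

    CompactTextSketch(String s) {
        // First try to squeeze every char into a single byte.
        byte[] latin1 = new byte[s.length()];
        boolean fits = true;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0xFF) {
                fits = false;
                break;
            }
            latin1[i] = (byte) c;
        }
        if (fits) {
            value = latin1;
            coder = LATIN1;
        } else {
            // Fall back to 2 bytes per char (big-endian here, an assumption of this sketch).
            byte[] utf16 = new byte[s.length() * 2];
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                utf16[2 * i] = (byte) (c >> 8);
                utf16[2 * i + 1] = (byte) c;
            }
            value = utf16;
            coder = UTF16;
        }
    }

    // Because coder is 0 or 1, shifting converts a byte count into a char count.
    int length() {
        return value.length >> coder;
    }

    char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new StringIndexOutOfBoundsException(index);
        }
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int hi = value[2 * index] & 0xFF;
        int lo = value[2 * index + 1] & 0xFF;
        return (char) ((hi << 8) | lo);
    }

    public static void main(String[] args) {
        CompactTextSketch a = new CompactTextSketch("hello");
        CompactTextSketch b = new CompactTextSketch("hi \u20AC");
        System.out.println(a.length() + " chars in " + a.value.length + " bytes, coder " + a.coder);
        System.out.println(b.length() + " chars in " + b.value.length + " bytes, coder " + b.coder);
    }
}
```

Choosing 0 and 1 as the flag values is what makes the length computation a single shift, value.length >> coder, with no branching.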

Changes in String methods for compact Strings

Methods in the String class are also changed to check whether the String is stored as Latin-1 or as UTF-16 and to use the appropriate implementation. For example, here is the substring() method of the String class after the compact Strings change-

public String substring(int beginIndex) {
  if (beginIndex < 0) {
    throw new StringIndexOutOfBoundsException(beginIndex);
  }
  int subLen = length() - beginIndex;
  if (subLen < 0) {
    throw new StringIndexOutOfBoundsException(subLen);
  }
  if (beginIndex == 0) {
    return this;
  }
  return isLatin1() ? StringLatin1.newString(value, beginIndex, subLen)
                    : StringUTF16.newString(value, beginIndex, subLen);
}

private boolean isLatin1() {
  return COMPACT_STRINGS && coder == LATIN1;
}
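The same check-then-delegate pattern can be tried outside the JDK. The following is a minimal sketch, assuming a hypothetical class SubstringDispatchSketch; it uses the standard ISO_8859_1 and UTF_16BE charsets to build and read the byte arrays, which is a convenience of this example and not how the JDK helpers StringLatin1 and StringUTF16 work internally.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SubstringDispatchSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Copies the tail of the storage, scaling the char index to a byte
    // offset according to the coder.
    static byte[] substringBytes(byte[] value, byte coder, int beginIndex) {
        int length = value.length >> coder;
        if (beginIndex < 0 || beginIndex > length) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        return coder == LATIN1
                ? Arrays.copyOfRange(value, beginIndex, value.length)      // 1 byte per char
                : Arrays.copyOfRange(value, beginIndex * 2, value.length); // 2 bytes per char
    }

    // Round trip for demonstration: encode, take the substring on bytes, decode.
    static String substring(String s, int beginIndex) {
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        byte coder = latin1 ? LATIN1 : UTF16;
        byte[] value = s.getBytes(latin1 ? StandardCharsets.ISO_8859_1
                                         : StandardCharsets.UTF_16BE);
        byte[] tail = substringBytes(value, coder, beginIndex);
        return new String(tail, latin1 ? StandardCharsets.ISO_8859_1
                                       : StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // Both calls should match what String.substring(8) returns.
        System.out.println(substring("compact strings", 8));
        System.out.println(substring("compact \u20AC strings", 8));
    }
}
```

For callers nothing changes: indexes are still counted in chars, and only the internal byte offset differs between the two encodings.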

Using -XX:-CompactStrings option

Compact Strings are enabled by default and can be disabled with the -XX:-CompactStrings VM option. You may want to disable the feature if your application mainly uses Strings that need UTF-16, since in that case few Strings can be stored compactly.
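If you want to confirm which setting a running JVM actually uses, one possible way on a HotSpot JVM is to read the flag through the HotSpotDiagnosticMXBean. This is a sketch under that assumption; the class name CompactStringsFlagCheck is hypothetical, and on other JVMs, or on versions older than Java 9, the flag may not be available, in which case the method returns "unknown".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class CompactStringsFlagCheck {
    // Returns "true" or "false" as reported by the JVM, or "unknown" if the
    // HotSpot diagnostic bean or the flag is not available.
    static String compactStringsSetting() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown";
            }
            return bean.getVMOption("CompactStrings").getValue();
        } catch (IllegalArgumentException e) {
            // Thrown when the JVM does not know a flag with this name.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        // Run once as: java CompactStringsFlagCheck
        // and once as: java -XX:-CompactStrings CompactStringsFlagCheck
        System.out.println("CompactStrings = " + compactStringsSetting());
    }
}
```

Before disabling the feature it is a good idea to measure heap usage and throughput of your own application with both settings.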

That's all for the topic Compact Strings in Java 9. If something is missing or you have something to share about the topic please write a comment.

KnpCode

August 10, 2022