Why are most string manipulations in Java based on regexp?


In Java there are a bunch of methods that all have to do with manipulating Strings. The simplest example is the String.split("something") method.


Now the actual definition of many of those methods is that they all take a regular expression as their input parameter(s), which makes them all very powerful building blocks.


Now there are two effects you'll see in many of those methods:


  1. They recompile the expression each time the method is invoked. As such they impose a performance impact.
  2. I've found that in most "real-life" situations these methods are called with "fixed" texts. The most common usage of the split method is even worse: It's usually called with a single char (usually a ' ', a ';' or a '&') to split by.

So it's not only that the default methods are powerful, they also seem overpowered for what they are actually used for. Internally we've developed a "fastSplit" method that splits on fixed strings. I wrote a test at home to see how much faster I could do it if it was known to be a single char. Both are significantly faster than the "standard" split method.
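
The internal "fastSplit" method itself is not shown in the question, so the following is only a minimal sketch of what splitting on a fixed (non-regex) string could look like. The class name `FixedStringSplit` and the method `splitFixed` are hypothetical names chosen for illustration, and the behaviour on trailing separators (empty trailing pieces are kept) is an assumption, not necessarily what the author's version does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the author's internal fastSplit.
final class FixedStringSplit {

    /** Splits input on every occurrence of a literal separator (no regex involved). */
    static String[] splitFixed(String input, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> pieces = new ArrayList<>();
        int start = 0;
        int hit = input.indexOf(separator, start);
        while (hit != -1) {
            pieces.add(input.substring(start, hit));
            start = hit + separator.length();
            hit = input.indexOf(separator, start);
        }
        // Whatever follows the last separator (possibly an empty string).
        pieces.add(input.substring(start));
        return pieces.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", splitFixed("aap,noot,mies", ",")));
    }
}
```

Because the separator is never interpreted as a pattern, characters such as '.' or '|' need no escaping here, which is a difference in behaviour from String.split.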


So I was wondering: why was the Java API chosen the way it is now? What was the good reason to go for this instead of having something like split(char), split(String) and splitRegex(String)?



Update: I slapped together a few calls to see how much time the various ways of splitting a string would take.


Short summary: It makes a big difference!


I did 10000000 iterations for each test case, always using the input


"aap,noot,mies,wim,zus,jet,teun" 

and always using ',' or "," as the split argument.


This is what I got on my Linux system (it's an Atom D510 box, so it's a bit slow):


fastSplit STRING
Test  1 : 11405 milliseconds: Split in several pieces
Test  2 :  3018 milliseconds: Split in 2 pieces
Test  3 :  4396 milliseconds: Split in 3 pieces

homegrown fast splitter based on char
Test  4 :  9076 milliseconds: Split in several pieces
Test  5 :  2024 milliseconds: Split in 2 pieces
Test  6 :  2924 milliseconds: Split in 3 pieces

homegrown splitter based on char that always splits in 2 pieces
Test  7 :  1230 milliseconds: Split in 2 pieces

String.split(regex)
Test  8 : 32913 milliseconds: Split in several pieces
Test  9 : 30072 milliseconds: Split in 2 pieces
Test 10 : 31278 milliseconds: Split in 3 pieces

String.split(regex) using precompiled Pattern
Test 11 : 26138 milliseconds: Split in several pieces 
Test 12 : 23612 milliseconds: Split in 2 pieces
Test 13 : 24654 milliseconds: Split in 3 pieces

StringTokenizer
Test 14 : 27616 milliseconds: Split in several pieces
Test 15 : 28121 milliseconds: Split in 2 pieces
Test 16 : 27739 milliseconds: Split in 3 pieces

As you can see it makes a big difference if you have a lot of "fixed char" splits to do.
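
The test code behind these numbers is not included in the post. A rough harness along the following lines could be used to repeat the comparison; `SplitTiming` and `timeNanos` are hypothetical names, the iteration count in `main` is an assumption, and a naive loop like this ignores JIT warm-up and garbage collection, so it only gives a ballpark figure.

```java
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical timing sketch; not the code that produced the numbers above.
final class SplitTiming {
    static final String INPUT = "aap,noot,mies,wim,zus,jet,teun";
    static final Pattern COMMA = Pattern.compile(",");

    /** Runs the splitter repeatedly on INPUT and returns the elapsed nanoseconds. */
    static long timeNanos(Function<String, String[]> splitter, int iterations) {
        long pieces = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            pieces += splitter.apply(INPUT).length;
        }
        long elapsed = System.nanoTime() - start;
        // Using the result keeps the loop from being trivially optimized away
        // and doubles as a sanity check: INPUT has 7 pieces.
        if (pieces != 7L * iterations) {
            throw new IllegalStateException("unexpected piece count: " + pieces);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int iterations = 1_000_000; // assumption: smaller than the 10000000 used in the post
        System.out.printf("String.split(regex): %d ms%n",
                timeNanos(s -> s.split(","), iterations) / 1_000_000);
        System.out.printf("precompiled Pattern: %d ms%n",
                timeNanos(s -> COMMA.split(s), iterations) / 1_000_000);
    }
}
```

Any other splitter with the same shape (a function from String to String[]) can be passed to `timeNanos` in the same way.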


To give you guys some insight; I'm currently in the Apache logfiles and Hadoop arena with the data of a big website. So to me this stuff really matters :)


Something I haven't factored in here is the garbage collector. As far as I can tell, compiling a regular expression into a Pattern/Matcher/.. will allocate a lot of objects that need to be collected at some point. So perhaps in the long run the difference between these versions is even bigger .... or smaller.


My conclusions so far:


  • Only optimize this if you have a LOT of strings to split.
  • If you use the regex methods, always precompile the pattern if you use it repeatedly.
  • Forget the (obsolete) StringTokenizer.
  • If you want to split on a single char then use a custom method, especially if you only need to split it into a specific number of pieces (like ... 2).

P.S. I'm giving you all my homegrown split by char methods to play with (under the license that everything on this site falls under :) ). I never fully tested them .. yet. Have fun.


private static String[]
        stringSplitChar(final String input,
                        final char separator) {
    int pieces = 0;

    // First we count how many pieces we will need to store ( = separators + 1 ).
    // Start at -1 so that a separator at index 0 is counted as well.
    int position = -1;
    do {
        pieces++;
        position = input.indexOf(separator, position + 1);
    } while (position != -1);

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    return result;
}

private static String[]
        stringSplitChar(final String input,
                        final char separator,
                        final int maxpieces) {
    if (maxpieces <= 0) {
        return stringSplitChar(input, separator);
    }
    int pieces = maxpieces;

    // Then we allocate memory
    final String[] result = new String[pieces];

    // And start cutting and copying the pieces.
    int previousposition = 0;
    int currentposition = input.indexOf(separator);
    int piece = 0;
    final int lastpiece = pieces - 1;
    while (currentposition != -1 && piece < lastpiece) {
        result[piece++] = input.substring(previousposition, currentposition);
        previousposition = currentposition + 1;
        currentposition = input.indexOf(separator, previousposition);
    }
    result[piece] = input.substring(previousposition);

    // All remaining array elements are uninitialized and assumed to be null
    return result;
}

private static String[]
        stringChop(final String input,
                   final char separator) {
    String[] result;
    // Find the separator.
    final int separatorIndex = input.indexOf(separator);
    if (separatorIndex == -1) {
        result = new String[1];
        result[0] = input;
    }
    else {
        result = new String[2];
        result[0] = input.substring(0, separatorIndex);
        result[1] = input.substring(separatorIndex + 1);
    }
    return result;
}

9 answers

#1


12  

Note that the regex need not be recompiled each time. From the Javadoc:


An invocation of this method of the form str.split(regex, n) yields the same result as the expression


Pattern.compile(regex).split(str, n) 

That is, if you are worried about performance, you may precompile the pattern and then reuse it:


Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1); 
String[] tokens2 = p.split(str2); 
...

instead of


String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...

I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.
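
As a self-contained variant of the snippets above, the following sketch keeps one compiled Pattern in a static field and reuses it for every call. The class name `PrecompiledSplit` and the choice of "," as the pattern are assumptions made only for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration of reusing one compiled Pattern.
final class PrecompiledSplit {
    // Compiled once when the class is loaded, then shared by all calls.
    private static final Pattern COMMA = Pattern.compile(",");

    static String[] splitOnComma(String line) {
        return COMMA.split(line);
    }

    /** Same as line.split(",", limit), but without recompiling the regex. */
    static String[] splitOnComma(String line, int limit) {
        return COMMA.split(line, limit);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnComma("aap,noot,mies")));
        System.out.println(Arrays.toString(splitOnComma("aap,noot,mies", 2)));
    }
}
```

Pattern instances are immutable and safe to share between threads, so a static final field is a common place to keep them.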


My feeling (which I can't back with any statistical evidence) is that in most cases String.split() is used in a context where performance is not an issue, e.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense, are rare.


It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.


#2


12  

I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.


Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.


It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no built-in method to do a plain string replace or split. You have to use a regex replace with a Pattern.quote()d string. And you only even get that from Java 5 onwards. Hopeless.
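
To make the trap concrete, here is a small sketch of what happens when a regex-special character reaches replaceAll, and of the quoted form. The class `RegexTrapDemo` and its method names are hypothetical; Pattern.quote and Matcher.quoteReplacement are the standard java.util.regex helpers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the replaceAll trap.
final class RegexTrapDemo {

    /** Naive call: 'find' is compiled as a regex, so "." matches any character. */
    static String naive(String input, String find, String replacement) {
        return input.replaceAll(find, replacement);
    }

    /** Literal replacement through the regex API: both arguments are quoted. */
    static String quoted(String input, String find, String replacement) {
        return input.replaceAll(Pattern.quote(find), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(naive("1.2.3", ".", "-"));   // prints "-----"
        System.out.println(quoted("1.2.3", ".", "-"));  // prints "1-2-3"
    }
}
```

The replacement string needs quoting too, because '$' and '\' have a special meaning there.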


@Tim Pietzcker:


Are there other languages that do the same?


JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:


s.split(findstr).join(replacestr)

Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.


(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)


#3


2  

I imagine a good reason is that they can simply pass the buck on to the regex method, which does all the real heavy lifting for all of the string methods. I'm guessing they thought that if they already had a working solution, it was less efficient, from a development and maintenance standpoint, to reinvent the wheel for each string manipulation method.


#4


2  

Interesting discussion!


Java was not originally intended as a batch programming language. As such, the out-of-the-box APIs are more tuned towards doing one "replace", one "parse", etc., except at application initialization, when the app may be expected to parse a bunch of configuration files.


Hence optimization of these APIs was sacrificed on the altar of simplicity, IMO. But the question brings up an important point. Python's desire to keep the regex distinct from the non-regex in its API stems from the fact that Python can be used as an excellent scripting language as well. In UNIX too, the original versions of fgrep did not support regex.


I was engaged in a project where we had to do some amount of ETL work in Java. At that time, I remember coming up with the kind of optimizations that you have alluded to in your question.


#5


1  

I suspect that things like String#split(String) use regexp under the hood because it involves less extraneous code in the Java Class Library. The state machine resulting from a split on something like ',' or a space is so simple that it is unlikely to be significantly slower to execute than a statically implemented equivalent using a StringCharacterIterator.


Beyond that the statically implemented solution would complicate runtime optimization with the JIT because it would be a different block of code that also requires hot code analysis. Using the existing Pattern algorithms regularly across the library means that they are more likely candidates for JIT compilation.


#6


1  

Very good question..


I suppose when the designers sat down to look at this (and not for very long, it seems), they came at it from a point of view that it should be designed to suit as many different possibilities as possible. Regular Expressions offered that flexibility.


They didn't think in terms of efficiencies. There is the Java Community Process available to raise this.


Have you looked at using the java.util.regex.Pattern class, where you compile the expression once and then use it on different strings?


Pattern exp = Pattern.compile(":");
String[] array = exp.split(sourceString1);
String[] array2 = exp.split(sourceString2);

#7


1  

In looking at the Java String class, the uses of regex seem reasonable, and there are alternatives if regex is not desired:


http://java.sun.com/javase/6/docs/api/java/lang/String.html


boolean matches(String regex) - A regex seems appropriate, otherwise you could just use equals


String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that take CharSequence instead, preventing regex.


String[] split(String regex, int limit) - A powerful but expensive split, you can use StringTokenizer to split by tokens.


These are the only functions I saw that took regex.
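
The two non-regex alternatives mentioned above can be sketched as follows. The class `NonRegexAlternatives` and its method names are hypothetical; String.replace(CharSequence, CharSequence) and java.util.StringTokenizer are the standard library pieces being referred to. Note that StringTokenizer is not a drop-in replacement for split: it silently skips empty tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical illustration of the non-regex alternatives.
final class NonRegexAlternatives {

    /** Plain-text replace of all occurrences; neither argument is a regex. */
    static String literalReplace(String input, String find, String replacement) {
        return input.replace(find, replacement);
    }

    /** Collects the tokens StringTokenizer produces for the given delimiter chars. */
    static List<String> tokenize(String input, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(literalReplace("1.2.3", ".", "-"));    // prints "1-2-3"
        System.out.println(tokenize("a,,b", ","));                // prints "[a, b]"
        System.out.println(Arrays.toString("a,,b".split(",")));   // prints "[a, , b]"
    }
}
```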


Edit: After seeing that StringTokenizer is legacy, I would defer to Péter Török's answer to precompile the regex for split instead of using the tokenizer.


#8


0  

The answer to your question is that the Java core API did it wrong. For day to day work you can consider using Guava libraries' CharMatcher which fills the gap beautifully.


#9


0  

...why was the Java API chosen the way it is now?


Short answer: it wasn't. Nobody ever decided to favor regex methods over non-regex methods in the String API, it just worked out that way.


I always understood that Java's designers deliberately kept the string-manipulation methods to a minimum, in order to avoid API bloat. But when regex support came along in JDK 1.4, of course they had to add some convenience methods to String's API.


So now users are faced with a choice between the immensely powerful and flexible regex methods, and the bone-basic methods that Java always offered.

