Splitting a line into parts by two different delimiters


I have lines with the following structure:

STRING1 space STRING2 space FREETEXT

where both STRING1 and STRING2 could be:

  1. "space* slash space*" (\s*/\s*) delimited words, e.g. word1 / word2 / word3,
  2. or one single word. Regex: \w+

The FREETEXT is any string: (.*)

I know how to match:

* one word, such as `\w+`
* two delimited words: `\w+\s*/\s*\w+`

but I don't know how to match "1 or more" words delimited by \s*/\s*, e.g. something like /(\w+(\s*/\s*)?)/

Maybe a more understandable definition:

line: string space string space freetext;
string: \w+
        ||
        string \s*/\s* \w+
space: \s+
freetext: .*

I need to get all 3 parts, e.g. the following code

use 5.014;
use warnings;
my $slash_string = qr(\w+|\w+\s*/\s*);                     #<- help1 here
while(<DATA>) {
    if( m{^($slash_string)+\s+($slash_string)+\s+(.*)$} ) {  #<- help2 here
        say join ' | ', $1, $2, $3;
    }
}
__DATA__
magnam est dolorem ea est
non / ipsum harum asperiores nesciunt voluptatem
nunt / harum / dicta nisi minus quo similique unde
porro inventore / repudiandae dolorem ipsum
enim  ipsam / aut / numquam illum vero eveniet
natus / voluptas aut / deserunt et nisi sequi est
sed / quam / magni ex / assumenda / et eaque cum et modi

should produce the wanted output

magnam | est | dolorem ea est
non / ipsum | harum | asperiores nesciunt voluptatem
nunt / harum / dicta | nisi | minus quo similique unde
porro | inventore / repudiandae | dolorem ipsum
enim | ipsam / aut / numquam | illum vero eveniet
natus / voluptas | aut / deserunt | et nisi sequi est
sed / quam / magni | ex / assumenda / et | eaque cum et modi

2 solutions

#1


This will do as you ask. I've changed $slash_string to be a word, followed by zero or more occurrences of a slash followed by another word.

I've also taken the + quantifier off your ($slash_string)+ (because we only want one sequence of slash-separated words here) and added the /x modifier so that the patterns can be made more readable by adding insignificant whitespace.

I'm pretty sure the output matches your requirement, but I've only checked it by eye.

use 5.014;
use warnings;

my $slash_string = qr/ \w+ (?: \s* \/ \s* \w+ )* /x;

while ( <DATA> ) { 
    if ( / ^ ($slash_string) \s+ ($slash_string) \s+ (.*) /x ) {
        say join '  ', map "[$_]", $1, $2, $3;
    }
}

__DATA__
magnam est dolorem ea est
non / ipsum harum asperiores nesciunt voluptatem
nunt / harum / dicta nisi minus quo similique unde
porro inventore / repudiandae dolorem ipsum
enim ipsam / aut / numquam illum vero eveniet
natus / voluptas aut / deserunt et nisi sequi est
sed / quam / magni ex / assumenda / et eaque cum et modi

output

[magnam]  [est]  [dolorem ea est]
[non / ipsum]  [harum]  [asperiores nesciunt voluptatem]
[nunt / harum / dicta]  [nisi]  [minus quo similique unde]
[porro]  [inventore / repudiandae]  [dolorem ipsum]
[enim]  [ipsam / aut / numquam]  [illum vero eveniet]
[natus / voluptas]  [aut / deserunt]  [et nisi sequi est]
[sed / quam / magni]  [ex / assumenda / et]  [eaque cum et modi]
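
Rather than checking by eye, the pattern can be checked mechanically against the output the question asks for. The following is only a minimal sketch: the helper name split_line is hypothetical (it is not part of the answer above) and simply wraps the same two patterns, and the test lines are copied from the question's data.

```perl
use strict;
use warnings;

# Same pattern as above: a word, then zero or more "/ word" groups.
my $slash_string = qr/ \w+ (?: \s* \/ \s* \w+ )* /x;

# split_line is a hypothetical helper introduced only for this sketch.
# It returns the three parts of a line, or an empty list if nothing matches.
sub split_line {
    my ($line) = @_;
    return $line =~ / ^ ($slash_string) \s+ ($slash_string) \s+ (.*) /x
        ? ($1, $2, $3)
        : ();
}

# A few input lines from the question, mapped to the output the asker wants.
my %expected = (
    'magnam est dolorem ea est'
        => 'magnam | est | dolorem ea est',
    'porro inventore / repudiandae dolorem ipsum'
        => 'porro | inventore / repudiandae | dolorem ipsum',
    'enim  ipsam / aut / numquam illum vero eveniet'
        => 'enim | ipsam / aut / numquam | illum vero eveniet',
    'sed / quam / magni ex / assumenda / et eaque cum et modi'
        => 'sed / quam / magni | ex / assumenda / et | eaque cum et modi',
);

for my $line (sort keys %expected) {
    my $got = join ' | ', split_line($line);
    print( ($got eq $expected{$line} ? 'ok' : 'NOT ok'), ": $line\n" );
}
```

Each of these lines should be reported as ok. The third one keeps the double space that appears in the question's data, which the \s+ between the two captures absorbs.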

#2


If the number of spaces around the / doesn't matter, the problem can be reduced to splitting at spaces. The logic:

  • replace all \s*/\s* with only the / - e.g. from word1 / word2 / word3 you will get word1/word2/word3
  • split the string at the spaces into 3 parts
  • replace each / back with " / "

code

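use 5.014;       # enables say and the /r substitution flag
use warnings;

# expects the __DATA__ section from the question to be appended to the script
while(<DATA>) {
    chomp;
    s!\s*/\s*!/!g;   # remove all spaces around the /
    my @parts = split /\s+/, $_, 3;   # at most 3 fields, so the free text stays whole
    say join ' | ', map {s!/! / !gr} @parts; # put the spaces back
}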

output

magnam | est | dolorem ea est
non / ipsum | harum | asperiores nesciunt voluptatem
nunt / harum / dicta | nisi | minus quo similique unde
porro | inventore / repudiandae | dolorem ipsum
enim | ipsam / aut / numquam | illum vero eveniet
natus / voluptas | aut / deserunt | et nisi sequi est
sed / quam / magni | ex / assumenda / et | eaque cum et modi
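
One consequence of this approach is worth seeing on a concrete input: because every slash is rewritten as " / " at the end, the original spacing around slashes is not preserved, and slashes inside the free text are rewritten too. The sketch below wraps the same three steps in a sub; the name split_by_space and the second sample line are hypothetical and were made up for illustration.

```perl
use 5.014;
use warnings;

# split_by_space is a hypothetical name for the three steps described above.
sub split_by_space {
    my ($line) = @_;
    chomp $line;
    $line =~ s!\s*/\s*!/!g;               # 1. glue slash-separated words together
    my @parts = split /\s+/, $line, 3;    # 2. at most 3 fields; the 3rd keeps its spaces
    return map { s!/! / !gr } @parts;     # 3. put " / " around every slash again
}

# A line from the question's data.
say join ' | ', split_by_space('natus / voluptas aut / deserunt et nisi sequi est');
# prints: natus / voluptas | aut / deserunt | et nisi sequi est

# A made-up line with tight slashes, including one in the free text.
say join ' | ', split_by_space('a/b c/d see x/y');
# prints: a / b | c / d | see x / y
```

If the free text must be returned exactly as written, the regex-based solution above is the safer choice, since it only captures and never rewrites the line.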