Regular Expression Complement

2024-05-19

Yeah, I’m a little late getting these answers posted. Sorry!

If you missed it, last week’s challenge dealt with deciphering regular expressions and finding subtle bugs within ’em.

As with last week, before getting to the actual answers please indulge me while I pontificate a bit:

Hopefully it’s pretty obvious that regular expressions are a double-edged sword. Sure, deciphering them makes a fun quiz, but imagine running across these monsters in code and trying to figure out what they do… not fun.

Fortunately, nearly every regex implementation has a “verbose” mode that allows you to embed comments inside regular expressions (in most languages this is the x flag). For the sake of those who must read your code, please use the verbose mode!

OK, on to the answers:

1. [A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}

This is a US phone number, including ones that use letters (e.g. 831-555-CODE). Rewritten in verbose mode, it makes a lot more sense:

    [A-PR-Y0-9]{3}   # Area code prefix
    -
    [A-PR-Y0-9]{3}   # 3-digit exchange
    -
    [A-PR-Y0-9]{4}   # 4-digit suffix
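
As a rough illustration, here is how that verbose pattern might be used from Python, where the x flag is spelled re.VERBOSE. The names PHONE and is_phone are made up for this sketch and are not part of the original quiz.

```python
import re

# The phone-number pattern from the quiz, written in verbose mode.
# PHONE and is_phone are hypothetical names used only for this sketch.
PHONE = re.compile(
    r"""
    [A-PR-Y0-9]{3}   # Area code
    -
    [A-PR-Y0-9]{3}   # 3-character exchange
    -
    [A-PR-Y0-9]{4}   # 4-character suffix
    """,
    re.VERBOSE,
)


def is_phone(text: str) -> bool:
    """Return True if the whole string matches the quiz pattern."""
    return PHONE.fullmatch(text) is not None


print(is_phone("831-555-CODE"))
print(is_phone("831.555.1234"))
```

Because whitespace and comments are ignored in verbose mode, the compiled pattern behaves the same as the one-line version.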

birman had a nice roundup of the problems with this pattern:

[It] doesn’t account for a preceding 1, if the area code is in parenthesis, if the digit groups are separated by a dot or space instead of a dash, or the fact that cell phones have Q and Z on them. It also doesn’t make sure the group is isolated, and not part of something like 1234888-234-123456123.

That last point — the isolation error — is a very common error when writing regular expressions.
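
To make the isolation problem concrete, here is a small Python sketch. The lookaround guards in ISOLATED are just one possible fix, and they assume that “isolated” means “not touching another letter or digit on either side”. LOOSE and ISOLATED are names invented for this example.

```python
import re

# The quiz pattern as written: it happily matches inside a longer run.
LOOSE = re.compile(r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}")

# One possible fix (an assumption, not the only option): refuse to match
# when a letter or digit sits immediately before or after the number.
ISOLATED = re.compile(
    r"(?<![A-Za-z0-9])"
    r"[A-PR-Y0-9]{3}-[A-PR-Y0-9]{3}-[A-PR-Y0-9]{4}"
    r"(?![A-Za-z0-9])"
)

noisy = "1234888-234-123456123"

print(LOOSE.search(noisy))      # finds a "phone number" buried in the digits
print(ISOLATED.search(noisy))   # finds nothing
print(ISOLATED.search("Call 831-555-CODE today"))
```

Word-boundary anchors would be another way to get a similar effect; which guard is right depends on what can legitimately surround a number in your input.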

2. &(?!(\w+|#\d+);)

This is not, as most people thought, a mistaken attempt to match HTML entities. It’s actually a pattern that will match ampersands in HTML that are not part of entities (it’s taken from Django’s fix_ampersands template filter).

Here’s the verbose mode:

    &            # Match an ampersand...
    (?!          # ... that is *not* followed by...
      (
        \w+      # ... word characters...
        |        # ... or...
        \#\d+    # ... numeric entity symbols (# escaped, since a bare # starts a comment)...
      )
      ;          # ... and a semi-colon.
    )

The “problem” with this pattern is pretty subtle: it matches HTML entities that are well-formed but still invalid (e.g. &#ggxy;). So as a way of finding unencoded ampersands it’s just fine, but if you wanted to use it as part of an HTML validator, it would be unacceptable.
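
Here is a minimal Python sketch of how such a pattern could be used to encode stray ampersands. It is not Django’s implementation; UNENCODED_AMP and encode_bare_ampersands are hypothetical names. Note that in verbose mode the # has to be escaped, because a bare # would start a comment.

```python
import re

# Hypothetical names; this only illustrates the quiz pattern in use.
UNENCODED_AMP = re.compile(
    r"""
    &            # an ampersand...
    (?!          # ...that is not followed by...
      (?:
        \w+      # ...an entity name...
        |
        \#\d+    # ...or a numeric reference (the # is escaped)...
      )
      ;          # ...and a semicolon
    )
    """,
    re.VERBOSE,
)


def encode_bare_ampersands(html: str) -> str:
    """Replace ampersands that do not start an entity with &amp;."""
    return UNENCODED_AMP.sub("&amp;", html)


print(encode_bare_ampersands("AT&T &amp; Co &#38; more"))
print(encode_bare_ampersands("&#ggxy;"))
print(encode_bare_ampersands("&bogus;"))
```

The ampersand in &#ggxy; is treated as a bare ampersand and gets encoded, while a made-up named entity such as &bogus; is left untouched because \w+ accepts any name.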

3. (-?(?:0|[1-9]\d*))(\.\d+)?([eE][-+]?\d+)?

Most readers got this one; it’s an IEEE floating point number, with optional exponent. In verbose mode:

    (                 # The non-fractional part of the base
      -?              # could be a leading negative sign
      (?:             # Non-capturing group...
        0|[1-9]\d*    # 0, or a nonzero digit followed by any digits
      )
    )
    (\.\d+)?          # Decimal point and fractional part of the base
    (                 # Exponent
      [eE]            # \
      [-+]?           #  > "e", plus or minus, exponent.
      \d+             # /
    )?

Some readers thought the \d in the base part was a bug; it’s not, actually. That expression matches either 0, or a number that starts with 1-9 and is followed by any number of digits.

The actual bug is that this pattern matches non-normalized numbers (e.g. 123.45e3, which should more properly be written 1.2345e5).
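
A short Python sketch shows what the three capture groups pick up, and that the non-normalized spelling is accepted just like the normalized one. NUMBER and split_number are names invented for this example.

```python
import re

# The number pattern from the quiz, one capture group per line.
NUMBER = re.compile(
    r"""
    (-?(?:0|[1-9]\d*))   # integer part, with optional minus sign
    (\.\d+)?             # optional fractional part
    ([eE][-+]?\d+)?      # optional exponent
    """,
    re.VERBOSE,
)


def split_number(text: str):
    """Return the (integer, fraction, exponent) groups, or None if no match."""
    match = NUMBER.fullmatch(text)
    return match.groups() if match else None


print(split_number("123.45e3"))   # accepted, although not normalized
print(split_number("1.2345e5"))   # the normalized spelling of the same value
print(split_number("0123"))       # rejected: leading zero
```

Both spellings parse to the same float in Python, so whether accepting the first one counts as a bug depends on how strict the consumer needs to be.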

4. ([\da-f]{2}:){5}([\da-f]{2})

Nearly everyone got this one; it’s a MAC address:

    ([\da-f]{2}:){5}   # Two hex digits followed by a colon, x5
    ([\da-f]{2})       # Two hex digits to end.

As birman noted, this pattern fails to match a few other forms allowed for MAC addresses; they can be written with hyphens (12-34-56-78-9A-BC), or as dot-separated groups of four hex digits (1234.5678.9ABC).
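
If you did want to accept all three spellings, one way to sketch it in Python is with a separate alternative per form. This is an illustration rather than a vetted validator: MAC and is_mac are hypothetical names, and the re.IGNORECASE flag is an assumption added so that the uppercase examples above match.

```python
import re

# One alternative per spelling. Mixing separators within one address is
# rejected on purpose. Case-insensitivity is an assumption of this sketch.
MAC = re.compile(
    r"""
      (?:[\da-f]{2}:){5}[\da-f]{2}     # 12:34:56:78:9a:bc
    | (?:[\da-f]{2}-){5}[\da-f]{2}     # 12-34-56-78-9a-bc
    | (?:[\da-f]{4}\.){2}[\da-f]{4}    # 1234.5678.9abc
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_mac(text: str) -> bool:
    """Return True if the whole string is a MAC address in one of the forms."""
    return MAC.fullmatch(text) is not None


for candidate in ("12:34:56:78:9a:bc", "12-34-56-78-9A-BC",
                  "1234.5678.9ABC", "12:34-56:78-9a:bc"):
    print(candidate, is_mac(candidate))
```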

5. <[^>]*?>

This one also seemed to be easy for most readers; it matches any SGML tag. In verbose syntax:

    <        # Start the tag
    [^>]*?   # Any non-gt character
    >        # End the tag

The “bug” in this one is a little more abstract: malformed SGML/HTML will severely muck it up. I’ll leave finding such code as an exercise for the reader, though.

Next time

Tune in tomorrow for the next installment of the quiz. This week’s question will be a “things that every web developer should know” quiz; I think it’s a lot of fun.

See you tomorrow!

Translated from: https://www.sitepoint.com/answers-to-episode-2-real-life-regular-expressions/
