May 08, 2015
 

Source: bash – grep valid domain regex – Stack Overflow

A truly complete solution takes more work, but here’s a shot (note that a @ prefix is assumed):

^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$

You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash’s regex-matching operator.
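As a minimal sketch of both usages, the snippet below keeps the regex in a variable and wraps it in two small helpers. The names domain_re, is_valid_domain and filter_valid_domains are made up for illustration and are not part of the original answer.

```shell
#!/usr/bin/env bash
# Same regex as above, stored in a variable (hypothetical name).
domain_re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# Hypothetical helper: succeeds if $1 is an @-prefixed name the regex accepts.
# The variable is left unquoted on the right-hand side so that bash treats
# its value as a regex rather than as a literal string.
is_valid_domain() {
  [[ $1 =~ $domain_re ]]
}

# Hypothetical helper: reads candidates from stdin, one per line,
# and prints only those that grep -E matches against the same regex.
filter_valid_domains() {
  grep -E -- "$domain_re"
}
```

For instance, printf '%s\n' @dom.ext @-bad.ext | filter_valid_domains should print only @dom.ext, since a name may not start with a hyphen.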

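The regex makes the following assumptions, which differ from actual DNS name constraints: they are more permissive in some respects (no length limits) and stricter in others (letters-only TLD, labels must start with a letter, no consecutive hyphens):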

  • Only ASCII (non-foreign) letters are allowed – see below for Internationalized Domain Name (IDN) considerations; also, the ASCII forms of IDNs – e.g., xn--bcher-kva.ch for bücher.ch – are not matched (though it would be easy to fix that).
  • There’s no limit on the number of nested subdomains.
  • There’s no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).
  • The TLD (last component) is composed of letters only and has a length of at least 2.
  • Both subdomain and domain names must start with a letter; the domain name must have a length of at least 2; subdomains are allowed to be single-letter.
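Since the regex itself does not enforce length limits, a separate check can be layered on top. The following is only a sketch: it assumes the commonly cited DNS limits of 63 characters per label and 253 characters for the whole name in its textual form, assumes an ASCII-only name, and uses a hypothetical function name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks length limits that the regex does not enforce.
# Assumes an ASCII name (so character count equals byte count) and the
# @ prefix used throughout this post.
within_dns_length_limits() {
  local name=${1#@}     # drop the assumed @ prefix, if present
  local label
  local -a labels

  # Overall limit for the dotted textual name (assumed: 253).
  (( ${#name} >= 1 && ${#name} <= 253 )) || return 1

  # Split on dots and check each label (assumed limit: 63).
  IFS=. read -r -a labels <<< "$name"
  for label in "${labels[@]}"; do
    (( ${#label} >= 1 && ${#label} <= 63 )) || return 1
  done
  return 0
}
```

A name would then be accepted only if it passes both the regex match and this length check.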

Here’s a quick test:

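# Keep the regex in a variable; it must stay unquoted after =~ to be treated as a regex.
re='^@(([a-zA-Z](-?[a-zA-Z0-9])*)\.)*[a-zA-Z](-?[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'
# The first name contains an empty label and yields NO; the others yield YES.
for d in @subdom..dom.ext @dom.ext @subdom.dom.ext @subsubdom.subdom.dom.ext @subsub-dom.sub-dom.ext; do
  [[ $d =~ $re ]] && echo YES || echo NO
done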
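The first assumption above mentions that ASCII forms of IDNs such as xn--bcher-kva.ch are not matched, because the regex allows only single hyphens between alphanumerics. One possible fix, which is my own guess and not necessarily what the original author had in mind, is to replace -? with -* so that runs of hyphens are accepted inside a label. Note that this variant also accepts double hyphens in labels that are not Punycode, and it still requires a letters-only TLD.

```shell
#!/usr/bin/env bash
# Assumed variant (hypothetical name): -? replaced by -* to allow "--" inside labels.
idn_ascii_re='^@(([a-zA-Z](-*[a-zA-Z0-9])*)\.)*[a-zA-Z](-*[a-zA-Z0-9])+\.[a-zA-Z]{2,}$'

# The first name is a Punycode form; the second ends a label with a hyphen,
# which the variant still rejects.
for d in @xn--bcher-kva.ch @dom-.ext; do
  [[ $d =~ $idn_ascii_re ]] && echo "$d: YES" || echo "$d: NO"
done
```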

Support for Internationalized Domain Names (IDN):

A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:

^@(([[:alpha:]](-?[[:alnum:]])*)\.)*[[:alpha:]](-?[[:alnum:]])+\.[[:alpha:]]{2,}$

Caveats:

  • Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
  • I’m unclear on whether names in right-to-left writing scripts are properly matched.
  • For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
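Because the first caveat depends on the platform and the active locale, it can help to probe what [[:alpha:]] actually matches on a given system. The sketch below uses a hypothetical helper name and an arbitrary set of sample characters; its output will vary between systems and locales, so no expected result is shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the single character in $1 is matched
# by [[:alpha:]] under the current locale.
matches_alpha() {
  local alpha_re='^[[:alpha:]]$'
  [[ $1 =~ $alpha_re ]]
}

# Sample characters from a few scripts; run this under a UTF-8 locale.
for ch in a ü я 日 ع; do
  matches_alpha "$ch" && echo "$ch: YES" || echo "$ch: NO"
done
```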

Tip of the hat to @Alfe for pointing out the problem with IDNs.
