com.boxbe.pub.email
Class EmailAddress

java.lang.Object
  extended by com.boxbe.pub.email.EmailAddress

public class EmailAddress
extends java.lang.Object

EmailAddress.java

A utility class to parse, clean up, and extract email addresses from messages per RFC 2822 syntax. Designed to integrate with JavaMail (this class requires the JavaMail mail.jar on your classpath), but you could easily change the existing methods to not use JavaMail at all. For example, if you're changing the code, compare getInternetAddress and getDomain: the latter doesn't depend on any JavaMail code. This is all a by-product of what this class was written for, so feel free to modify it to suit your needs.

For real-world addresses, this class is roughly 3-4 times slower than parsing with InternetAddress, but it can handle a whole lot more. Because of sensible design tradeoffs made in JavaMail, if InternetAddress has trouble parsing, it might throw an exception, but often it will silently leave the entire original string in the result of ia.getAddress(). This class can be trusted to return only validated results.

This class has been tested on a few thousand real-world addresses, and is live in production environments, but you may want to do some of your own testing to ensure that it works for you. In other words, it's not beta, but it's not guaranteed yet.

Comments/Questions/Corrections welcome: java <at> caseyconnor.org

Started with code by Les Hazlewood: leshazlewood.com.

Modified/added: removed some functions, added support for CFWS token, corrected FWSP token, added some boolean flags, added getInternetAddress and extractHeaderAddresses and other methods, some optimization.

Where Mr. Hazlewood's version was more for ensuring certain forms that were passed in during registrations, etc., this handles more types of verification as well as a few forms of extracting the data in predictable, cleaned-up chunks.

Note: CFWS means the "comment folded whitespace" token from 2822, in other words, whitespace and comment text that is enclosed in ()'s.

Limitations: doesn't support nested CFWS (comments within (other) comments), doesn't support mailbox groups except when flat-extracting addresses from headers or when doing verification, doesn't support any of the obs-* tokens. Also: the getInternetAddress and extractHeaderAddresses methods return InternetAddress objects; if the personal name has any quotes or \'s in it at all, the InternetAddress object will always escape the name entirely and put it in quotes, so multiple-token personal names with those characters somewhere in them will always be munged into one big escaped string. This is not really a big deal at all, but I mention it anyway. (And you could get around it by a simple modification to those methods to not use InternetAddress objects.) See the docs of those methods for more info.

Note: This does not do any header-length-checking. There are no such limitations on the email address grammar in 2822, though email headers in general do have length restrictions. So if the return path is 40000 unfolded characters long, but otherwise valid under 2822, this class will pass it.

Examples of passing (2822-valid) addresses, believe it or not:

bob @example.com
"bob" @ example.com
bob (comment) (other comment) @example.com (personal name)
"<bob \" (here) " < (hi there) "bob(the man)smith" (hi) @ (there) example.com (hello) > (again)

(none of which are permitted by javamail, incidentally)

By using getInternetAddress(), you can retrieve an InternetAddress object that, when toString()'ed, would reveal that the parser had converted the above into:

<bob@example.com>
<bob@example.com>
"personal name" <bob@example.com>
"<bob \" (here)" <"bob(the man)smith"@example.com>

(respectively)

If parsing headers, however, you'll probably be calling extractHeaderAddresses().

A future improvement may be to use this class to extract info from corrupted addresses, but for now, it does not permit them.

Some of the configuration booleans allow a bit of tweaking already. The source code can be compiled with these booleans in various states. They are configured to what is probably the most commonly-useful state.

Version:
1.11
Author:
Les Hazlewood, Casey Connor

Field Summary
static java.util.regex.Pattern ADDR_SPEC_PATTERN
          Java regex pattern for 2822 "addr-spec" token; Not necessarily useful, but available in case.
static java.util.regex.Pattern ADDRESS_PATTERN
          Java regex pattern for 2822 "address" token; Not necessarily useful, but available in case.
static boolean ALLOW_DOMAIN_LITERALS
          This constant changes the behavior of the domain parsing.
static boolean ALLOW_DOT_IN_ATEXT
          This constant allows "." to appear in atext.
static boolean ALLOW_PARENS_IN_LOCALPART
          This constant allows ")" or "(" to appear in quoted versions of the localpart (they are never allowed in unquoted versions).
static boolean ALLOW_QUOTED_IDENTIFIERS
          This constant states that quoted identifiers (using quotes and angle brackets around the raw address) are allowed.
static boolean ALLOW_SQUARE_BRACKETS_IN_ATEXT
          This constant allows "[" or "]" to appear in atext.
static java.util.regex.Pattern COMMENT_PATTERN
          Java regex pattern for 2822 "comment" token; Not necessarily useful, but available in case.
static boolean EXTRACT_CFWS_PERSONAL_NAMES
          This controls the behavior of getInternetAddress and extractHeaderAddresses.
static java.util.regex.Pattern MAILBOX_LIST_PATTERN
          Java regex pattern for 2822 "mailbox-list" token; Not necessarily useful, but available in case.
static java.util.regex.Pattern MAILBOX_PATTERN
          Java regex pattern for 2822 "mailbox" token; Not necessarily useful, but available in case.
 
Constructor Summary
EmailAddress()
           
 
Method Summary
static javax.mail.internet.InternetAddress[] extractHeaderAddresses(java.lang.String header_txt)
          Given a header, like the From:, extract valid 2822 addresses from it and place them in an array.
static java.lang.String[] getAddressParts(java.lang.String email)
          See getInternetAddress; does the same thing but returns the constituent parts of the address in a three-element array (or null if the address is invalid).
static java.lang.String getDomain(java.lang.String email)
          See getInternetAddress; does the same thing but returns the domain part in string form (essentially, the part to the right of the @).
static java.lang.String getFirstComment(java.lang.String text)
          Given a string, extract the first matched comment token as defined in 2822, trimmed; return null on all errors or non-findings
static javax.mail.internet.InternetAddress getInternetAddress(java.lang.String email)
          Given a 2822-valid single address string, give us an InternetAddress object holding that address, otherwise returns null.
static java.lang.String getLocalPart(java.lang.String email)
          See getInternetAddress; does the same thing but returns the local part that would have been returned from getInternetAddress() in String form (essentially, the part to the left of the @).
static java.lang.String getPersonalName(java.lang.String email)
          See getInternetAddress; does the same thing but returns the personal name that would have been returned from getInternetAddress() in String form.
static java.lang.String getReturnPathAddress(java.lang.String email)
          Pull out the cleaned-up return path address.
static java.lang.String getReturnPathBracketContents(java.lang.String email)
          WARNING: You may want to use getReturnPathAddress() instead if you're looking for a clean version of the return path without CFWS, etc.
static boolean isValidAddressList(java.lang.String header_txt)
          Tells us if a header line is a valid 2822 address-list.
static boolean isValidMailbox(java.lang.String email)
          Checks to see if the specified string is a valid email address according to the RFC 2822 specification, which is remarkably squirrely.
static boolean isValidMailboxList(java.lang.String header_txt)
          Tells us if a header line is a valid 2822 mailbox-list.
static boolean isValidReturnPath(java.lang.String email)
          Tells us if the email represents a valid return path header string.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ALLOW_DOMAIN_LITERALS

public static final boolean ALLOW_DOMAIN_LITERALS
This constant changes the behavior of the domain parsing. If true, the parser will allow 2822 domains, which include single-level domains (e.g. bob@localhost) as well as domain literals, e.g.:

someone@[192.168.1.100] or
john.doe@[23:33:A2:22:16:1F] or
me@[my computer]

The RFC says these are valid email addresses, but most people don't like allowing them. If you don't want to allow them, and only want to allow valid domain names (RFC 1035, x.y.z.com, etc), and specifically only those with at least two levels ("example.com"), then change this constant to false.

Its default (compiled) value is false, thus it is not RFC 2822 compliant, but you should set it depending on what you need for your application.
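
To make the two settings concrete, here is a rough sketch of the kind of domain check described above. It is not the code this class uses: DomainPolicySketch, isAcceptableDomain, and the simplified label pattern are all made up for illustration, and the real 2822 domain grammar is looser than this.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of EmailAddress: a simplified domain policy check. */
final class DomainPolicySketch {
    // One DNS-style label: letters, digits, hyphens; no hyphen at either end (assumed simplification).
    private static final String LABEL = "[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?";
    // At least two labels separated by dots, e.g. "example.com".
    private static final Pattern MULTI_LEVEL = Pattern.compile(LABEL + "(?:\\." + LABEL + ")+");
    // A single label, e.g. "localhost".
    private static final Pattern SINGLE_LEVEL = Pattern.compile(LABEL);
    // A bracketed literal, e.g. "[192.168.1.100]"; no brackets or backslashes inside.
    private static final Pattern LITERAL = Pattern.compile("\\[[^\\[\\]\\\\]*\\]");

    /** The flag plays the role of ALLOW_DOMAIN_LITERALS. */
    static boolean isAcceptableDomain(String domain, boolean allowDomainLiterals) {
        if (domain == null) {
            return false;
        }
        if (MULTI_LEVEL.matcher(domain).matches()) {
            return true; // accepted under either setting
        }
        // Single-level domains and bracketed literals are accepted only in the permissive mode.
        return allowDomainLiterals
                && (SINGLE_LEVEL.matcher(domain).matches() || LITERAL.matcher(domain).matches());
    }

    public static void main(String[] args) {
        System.out.println(isAcceptableDomain("example.com", false));    // prints "true"
        System.out.println(isAcceptableDomain("localhost", false));      // prints "false"
        System.out.println(isAcceptableDomain("[my computer]", true));   // prints "true"
    }
}
```

With the flag false, only names with at least two levels get through, which matches the default described above.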

See Also:
Constant Field Values

ALLOW_QUOTED_IDENTIFIERS

public static final boolean ALLOW_QUOTED_IDENTIFIERS
This constant states that quoted identifiers (using quotes and angle brackets around the raw address) are allowed, e.g.:

"John Smith" <john.smith@somewhere.com>

The RFC says this is a valid mailbox. If you don't want to allow this, because for example, you only want users to enter in a raw address (john.smith@somewhere.com - no quotes or angle brackets), then change this constant to false.

Its default (compiled) value is true to remain RFC 2822 compliant, but you should set it depending on what you need for your application.

See Also:
Constant Field Values

ALLOW_DOT_IN_ATEXT

public static final boolean ALLOW_DOT_IN_ATEXT
This constant allows "." to appear in atext.

The addresses:

Kayaks.org <kayaks@kayaks.org>

Bob K. Smith<bobksmith@bob.net>

...are not valid. They should be:

"Kayaks.org" <kayaks@kayaks.org>

"Bob K. Smith" <bobksmith@bob.net>

If this boolean is set to false, the parser will act per 2822 and will require the quotes; if set to true, it will allow the use of "." without quotes. Default (compiled) setting is false.
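
As an illustration only, the difference between the two settings can be sketched with a character class built from the 2822 atext definition. AtextSketch and isAtom are hypothetical names and this is not how the class builds its own patterns; it only shows why Kayaks.org needs quotes under strict 2822.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of EmailAddress: an atom check with an optional "." allowance. */
final class AtextSketch {
    /** Builds a pattern matching one or more atext characters; the flag plays the role of ALLOW_DOT_IN_ATEXT. */
    static Pattern atomPattern(boolean allowDot) {
        // RFC 2822 atext: letters, digits, and ! # $ % & ' * + - / = ? ^ _ ` { | } ~
        String chars = "A-Za-z0-9!#$%&'*+/=?\\^_`{|}~" + (allowDot ? "." : "") + "\\-";
        return Pattern.compile("[" + chars + "]+");
    }

    static boolean isAtom(String token, boolean allowDot) {
        return token != null && atomPattern(allowDot).matcher(token).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAtom("Kayaks.org", false)); // prints "false"
        System.out.println(isAtom("Kayaks.org", true));  // prints "true"
        System.out.println(isAtom("Bob", false));        // prints "true"
    }
}
```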

See Also:
Constant Field Values

EXTRACT_CFWS_PERSONAL_NAMES

public static final boolean EXTRACT_CFWS_PERSONAL_NAMES
This controls the behavior of getInternetAddress and extractHeaderAddresses. If true, it allows the not-totally-kosher-but-happens-in-the-real-world practice of:

<bob@example.com> (Bob Smith)

In this case, "Bob Smith" is not technically the personal name, just a comment. If this is set to true, the methods will convert this into: Bob Smith <bob@example.com>

This also happens somewhat more often and appropriately with

mailer-daemon@blah.com (Mail Delivery System)

If a personal name appears to the left and CFWS appears to the right of an address, the methods will favor the personal name to the left. If the methods need to use the CFWS following the address, they will take the first comment token they find.

e.g.:

"bob smith" <bob@example.com> (Bobby)
will yield personal name "bob smith"
<bob@example.com> (Bobby)
will yield personal name "Bobby"
bob@example.com (Bobby)
will yield personal name "Bobby"
bob@example.com (Bob) (Smith)
will yield personal name "Bob"

Default (compiled) setting is true.
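
The precedence rule above can be sketched roughly as follows. This is not the code in this class: CfwsNameSketch and its methods are hypothetical, the inputs are assumed to be already split into "phrase to the left" and "CFWS to the right", and the comment pattern handles only non-nested comments.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of EmailAddress: picks a personal name per the rule above. */
final class CfwsNameSketch {
    // A non-nested comment: "(", then text with no bare parentheses (backslash pairs allowed), then ")".
    private static final Pattern COMMENT = Pattern.compile("\\(((?:[^()\\\\]|\\\\.)*)\\)");

    /** Returns the trimmed text inside the first comment, or null if there is none. */
    static String firstComment(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = COMMENT.matcher(text);
        return m.find() ? m.group(1).trim() : null;
    }

    /** Favors the phrase to the left of the address; otherwise uses the first trailing comment. */
    static String choosePersonalName(String leftPhrase, String trailingCfws) {
        if (leftPhrase != null && !leftPhrase.trim().isEmpty()) {
            return leftPhrase.trim();
        }
        return firstComment(trailingCfws);
    }

    public static void main(String[] args) {
        System.out.println(choosePersonalName("bob smith", " (Bobby)")); // prints "bob smith"
        System.out.println(choosePersonalName(null, " (Bobby)"));        // prints "Bobby"
        System.out.println(choosePersonalName(null, " (Bob) (Smith)"));  // prints "Bob"
    }
}
```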

See Also:
Constant Field Values

ALLOW_SQUARE_BRACKETS_IN_ATEXT

public static final boolean ALLOW_SQUARE_BRACKETS_IN_ATEXT
This constant allows "[" or "]" to appear in atext. Not very useful, maybe, but there it is.

The address:

[Kayaks] <kayaks@kayaks.org>

...is not valid. It should be:

"[Kayaks]" <kayaks@kayaks.org>

If this boolean is set to false, the parser will act per 2822 and will require the quotes; if set to true, it will allow them to be missing.

One real-world example seen:

Bob Smith [mailto:bsmith@gmail.com]=20

Use at your own risk. There may be some issue with enabling this feature in conjunction with ALLOW_DOMAIN_LITERALS, but I haven't looked into that. If ALLOW_DOMAIN_LITERALS is false, I think this should be pretty safe. Whether or not it's useful is up to you. Default (compiled) setting is false.

See Also:
Constant Field Values

ALLOW_PARENS_IN_LOCALPART

public static final boolean ALLOW_PARENS_IN_LOCALPART
This constant allows ")" or "(" to appear in quoted versions of the localpart (they are never allowed in unquoted versions).

The default (2822) behavior is to allow this, i.e. boolean true.

You can disallow it, but better to leave it true. I left this hanging around (from an earlier incarnation of the code) as a random option you can switch off. No, it's not necessarily useful. Long story.

If false, it will prevent such addresses from being valid, even though they are: "bob(hi)smith"@test.com

Default (compiled) setting is true.

See Also:
Constant Field Values

MAILBOX_PATTERN

public static final java.util.regex.Pattern MAILBOX_PATTERN
Java regex pattern for 2822 "mailbox" token; Not necessarily useful, but available in case.


ADDR_SPEC_PATTERN

public static final java.util.regex.Pattern ADDR_SPEC_PATTERN
Java regex pattern for 2822 "addr-spec" token; Not necessarily useful, but available in case.


MAILBOX_LIST_PATTERN

public static final java.util.regex.Pattern MAILBOX_LIST_PATTERN
Java regex pattern for 2822 "mailbox-list" token; Not necessarily useful, but available in case.


ADDRESS_PATTERN

public static final java.util.regex.Pattern ADDRESS_PATTERN
Java regex pattern for 2822 "address" token; Not necessarily useful, but available in case.


COMMENT_PATTERN

public static final java.util.regex.Pattern COMMENT_PATTERN
Java regex pattern for 2822 "comment" token; Not necessarily useful, but available in case.

Constructor Detail

EmailAddress

public EmailAddress()
Method Detail

isValidMailbox

public static boolean isValidMailbox(java.lang.String email)
Checks to see if the specified string is a valid email address according to the RFC 2822 specification, which is remarkably squirrely. See doc for this class: 2822 not fully implemented, but probably close enough for almost any needs.

If being used on a 2822 header, this method applies to the Sender and Resent-Sender headers only, although you can also use it on the Return-Path if you know it to be non-empty (see doc for isValidReturnPath()!). Folded header lines should work OK, but I haven't tested that.

Parameters:
email - the email address string to test for validity (null and "" OK, will return false for those)
Returns:
true if the given email text is valid according to RFC 2822, false otherwise.

isValidReturnPath

public static boolean isValidReturnPath(java.lang.String email)
Tells us if the email represents a valid return path header string.

NOTE: legit forms like <(comment here)> will return true.

You can check isValidReturnPath(), and if it is true, and if getInternetAddress() returns null, you know you have a DSN, whether it be an empty return path or one with only CFWS inside the brackets (which is legit, as demonstrated above). Note that you can also simply call getReturnPathAddress() to have that operation done for you.

Note that <""> is not a valid return-path.


getReturnPathBracketContents

public static java.lang.String getReturnPathBracketContents(java.lang.String email)
WARNING: You may want to use getReturnPathAddress() instead if you're looking for a clean version of the return path without CFWS, etc. See that documentation first!

Pull whatever's inside the angle brackets out, without alteration or cleaning. This is more secure than a simple substring() since paths like:

<(my > path) >

...are legal return-paths and may throw a simpler parser off. However this method will return all CFWS (comments, whitespace) that may be between the brackets as well. So the example above will return:

(my > path)_
(where the _ is the trailing space from the original string)
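
To show why a plain substring() gets this wrong, here is a rough sketch comparing a naive search for ">" with a scan that skips over comments and quoted strings. It is not the implementation in this class: BracketContentsSketch and both method names are hypothetical, and it assumes non-nested comments and that the first "<" in the string is the opening bracket.

```java
/** Hypothetical helper, not part of EmailAddress: naive vs. comment-aware bracket extraction. */
final class BracketContentsSketch {
    /** Takes everything between the first '<' and the next '>', ignoring comments and quotes. */
    static String naive(String s) {
        int open = s.indexOf('<');
        int close = s.indexOf('>', open + 1);
        return (open < 0 || close < 0) ? null : s.substring(open + 1, close);
    }

    /** Takes everything between the first '<' and the first '>' outside a comment or quoted string. */
    static String commentAware(String s) {
        int open = s.indexOf('<');
        if (open < 0) {
            return null;
        }
        boolean inComment = false;
        boolean inQuote = false;
        for (int i = open + 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if ((inComment || inQuote) && c == '\\') {
                i++; // skip the escaped character
            } else if (inQuote) {
                if (c == '"') {
                    inQuote = false;
                }
            } else if (inComment) {
                if (c == ')') {
                    inComment = false;
                }
            } else if (c == '"') {
                inQuote = true;
            } else if (c == '(') {
                inComment = true;
            } else if (c == '>') {
                return s.substring(open + 1, i);
            }
        }
        return null; // no closing bracket found
    }

    public static void main(String[] args) {
        System.out.println("[" + naive("<(my > path) >") + "]");        // prints "[(my ]"
        System.out.println("[" + commentAware("<(my > path) >") + "]"); // prints "[(my > path) ]"
    }
}
```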


getReturnPathAddress

public static java.lang.String getReturnPathAddress(java.lang.String email)
Pull out the cleaned-up return path address. May return an empty string. Will require two parsings due to an inefficiency.

Returns:
null if there are any syntax issues or other weirdness, otherwise the valid, trimmed return path email address without CFWS, surrounding angle brackets, with quotes stripped where possible, etc. (may return an empty string).

isValidMailboxList

public static boolean isValidMailboxList(java.lang.String header_txt)
Tells us if a header line is valid, i.e. checks for a 2822 mailbox-list (which may contain a single address or several). Applicable to From or Resent-From headers only.

This method seems quick enough so far, but I'm not totally convinced it couldn't be slow given a complicated near-miss string. You may just want to call extractHeaderAddresses() instead, unless you must confirm that the format is perfect. I think that in 99.9999% of real-world cases this method will work fine.

See Also:
isValidAddressList(String)

isValidAddressList

public static boolean isValidAddressList(java.lang.String header_txt)
Tells us if a header line is valid, i.e. checks for a 2822 address-list (which may contain a single address or several). Applicable to To, Cc, Bcc, Reply-To, Resent-To, Resent-Cc, and Resent-Bcc headers only.

This method seems quick enough so far, but I'm not totally convinced it couldn't be slow given a complicated near-miss string. You may just want to call extractHeaderAddresses() instead, unless you must confirm that the format is perfect. I think that in 99.9999% of real-world cases this method will work fine and quickly enough. Let me know what your testing reveals.

See Also:
isValidMailboxList(String)

getInternetAddress

public static javax.mail.internet.InternetAddress getInternetAddress(java.lang.String email)
Given a 2822-valid single address string, give us an InternetAddress object holding that address, otherwise returns null. The email address that comes back from the resulting InternetAddress object's getAddress() call will have comments and unnecessary quotation marks or whitespace removed.

If your String is an email header, you should probably use extractHeaderAddresses instead, since most headers can have multiple addresses in them. (see that method for more info.) This method will indeed fail if you use it on a header line with more than one address.

Exception: You CAN and should use this for the Sender header, and probably you want to use it for the X-Original-To as well.

Another exception: You can use this for the Return-Path, but if you want to know that a Return-Path is valid and you want to extract it, you will have to call both this method and isValidReturnPath; this operation can be done for you by simply calling getReturnPathAddress() instead of this method. In terms of this method's application to the return-path, note that the common valid Return-Path value <> will return null. So will the illegitimate "" or legitimate empty-string, but other illegitimate Return-Paths like

"hi" <bob@smith.com>

will return an address, so the moral is that you may want to check isValidReturnPath() first, if you care. This method is useful if you trust the return path and want to extract a clean address from it without CFWS (getReturnPathBracketContents() will return any CFWS), or if you want to determine if a validated return path actually contains an address in it and isn't just empty or full of CFWS. Except for empty return paths (those lacking an address) the Return-Path specification is a subset of valid 2822 addresses, so this method will work on all non-empty return-paths, failing only on the empty ones.

In general for this method, note: although this method does not use InternetAddress to parse/extract the information, it does ensure that InternetAddress can use the results (i.e. that there are no encoding issues), but note that an InternetAddress object can hold (and use) values for the address which it could not have parsed itself. Thus, it's possible that for InternetAddress addr, which came as the result of this method, the following may throw an exception or may silently fail:
InternetAddress addr2 = InternetAddress.parse(addr.toString());

Again, all other uses of that addr object should work OK. It is recommended that if you are using this class that you never create an InternetAddress object using InternetAddress's own constructors or parsing methods; rather, retrieve them through this class. Perhaps the addr.clone() would work OK, though.

The personal name will include any and all phrase token(s) to the left of the address, if they exist, and the string will be trim()'ed, but note that InternetAddress, when generating the getPersonal() result or the toString() result, if it encounters any quotes or backslashes in the personal name String, will put the entire thing in a big quoted-escaped chunk.

This will do some smart unescaping to prevent that from happening unnecessarily; specifically, if there are unnecessary quotes around a personal name, it will remove them. E.g.

"Bob" <bob@hi.com>
becomes:
Bob <bob@hi.com>

(apologies to bob@hi.com for everything i've done to him)


getAddressParts

public static java.lang.String[] getAddressParts(java.lang.String email)
See getInternetAddress; does the same thing but returns the constituent parts of the address in a three-element array (or null if the address is invalid).

This may be useful because, even with a cleaned-up address extracted by this class, splitting it into its parts is not trivial.

To actually use these values in an email, you should construct an InternetAddress object (or equivalent) which can handle the various quoting, adding of the angle brackets around the address, etc., necessary for presenting the whole address.

To construct the email address, you can safely use:
result[1] + "@" + result[2]

Returns:
a three-element array containing the personal name String, local part String, and the domain part String of the address, in that order, without the @; will return null if the address is invalid; if it is valid this will not return null but the personal name (at index 0) may be null
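
A small sketch of consuming that array, covering the two null cases just described. AddressPartsSketch and its methods are hypothetical, and the arrays below are hand-built stand-ins for whatever getAddressParts would return, so nothing here calls this class.

```java
/** Hypothetical helper, not part of EmailAddress: consumes a {personal, local, domain} array. */
final class AddressPartsSketch {
    /** Joins the local part and domain; returns null when the whole result is null (invalid address). */
    static String bareAddress(String[] parts) {
        if (parts == null) {
            return null;
        }
        return parts[1] + "@" + parts[2];
    }

    /** Returns the personal name, or an empty string when index 0 is null. */
    static String personalOrEmpty(String[] parts) {
        return (parts == null || parts[0] == null) ? "" : parts[0];
    }

    public static void main(String[] args) {
        String[] withName = {"Bob", "bob", "example.com"};   // stand-in for a valid result
        String[] withoutName = {null, "bob", "example.com"}; // valid, but no personal name
        System.out.println(bareAddress(withName));                       // prints "bob@example.com"
        System.out.println(personalOrEmpty(withoutName).isEmpty());      // prints "true"
        System.out.println(bareAddress(null));                           // prints "null"
    }
}
```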

getPersonalName

public static java.lang.String getPersonalName(java.lang.String email)
See getInternetAddress; does the same thing but returns the personal name that would have been returned from getInternetAddress() in String form.


getLocalPart

public static java.lang.String getLocalPart(java.lang.String email)
See getInternetAddress; does the same thing but returns the local part that would have been returned from getInternetAddress() in String form (essentially, the part to the left of the @). This may be useful because a simple search/split on a "@" is not a safe way to do this, given escaped quoted strings, etc.


getDomain

public static java.lang.String getDomain(java.lang.String email)
See getInternetAddress; does the same thing but returns the domain part in string form (essentially, the part to the right of the @). This may be useful because a simple search/split on a "@" is not a safe way to do this, given escaped quoted strings, etc.
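
Here is a rough sketch of why splitting on the first "@" is unsafe. It is not the parsing this class does: AtSplitSketch and unquotedAt are hypothetical, and the input is assumed to be a bare addr-spec with no comments or angle brackets.

```java
/** Hypothetical helper, not part of EmailAddress: finds the '@' that separates local part and domain. */
final class AtSplitSketch {
    /** Returns the index of the first '@' outside a quoted string, or -1 if there is none. */
    static int unquotedAt(String addrSpec) {
        boolean inQuote = false;
        for (int i = 0; i < addrSpec.length(); i++) {
            char c = addrSpec.charAt(i);
            if (inQuote && c == '\\') {
                i++; // skip the escaped character inside the quoted string
            } else if (c == '"') {
                inQuote = !inQuote;
            } else if (c == '@' && !inQuote) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String addr = "\"bob@home\"@example.com";
        System.out.println(addr.indexOf('@'));                    // prints "4": inside the quotes
        int at = unquotedAt(addr);
        System.out.println(at);                                   // prints "10"
        System.out.println(addr.substring(0, at));                // prints "\"bob@home\"" without the backslashes
        System.out.println(addr.substring(at + 1));               // prints "example.com"
    }
}
```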


extractHeaderAddresses

public static javax.mail.internet.InternetAddress[] extractHeaderAddresses(java.lang.String header_txt)
Given a header, like the From:, extract valid 2822 addresses from it and place them in an array. Returns an empty array if none found, will not return null. The addresses that come back from the resulting InternetAddress objects' getAddress calls will have comments and unnecessary quotation marks or whitespace removed. If a bad address is encountered, parsing stops, and the good addresses found up until then (if any) are returned. This is kind of strict and could be improved, but that's the way it is for now. If you need to know if the header is totally valid (not just up to a certain address) then you can use isValidMailboxList() or isValidAddressList() or isValidMailbox(), depending on the header.

This method can handle group addresses, but it does not preserve the group name or the structure of any groups; rather it flattens them all into the same array. You can call this method on the From or any other header that uses the mailbox-list form (which doesn't use groups), or you can call it on the To, Cc, Bcc, or Reply-To or any other header which uses the address-list format which might have groups in there. This method doesn't enforce any group structure syntax either. If you care to test for 2822 validity of a list of addresses (including group format), use the appropriate method. This will dependably extract addresses from a valid list. If the list is invalid, it may extract them anyway, or it may fail somewhere along the line.

You should not use this method on the Return-Path header; instead use getInternetAddress() or getReturnPathAddress() (see that doc for info about Return-Path). However, you could use this on the Sender header if you didn't care to check it for validity, since single mailboxes are valid subsets of valid mailbox-lists and address-lists.
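
The "stop at the first bad address and return what was found so far" behavior can be pictured with the sketch below. It is not how this class tokenizes a header: PrefixExtractionSketch and validPrefix are hypothetical, the candidates are assumed to be already split out, and the validity test is a deliberately crude stand-in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper, not part of EmailAddress: keeps candidates up to the first invalid one. */
final class PrefixExtractionSketch {
    /** Returns the leading run of valid candidates; never null, possibly empty. */
    static List<String> validPrefix(List<String> candidates, Predicate<String> isValid) {
        List<String> good = new ArrayList<>();
        for (String candidate : candidates) {
            if (!isValid.test(candidate)) {
                break; // stop at the first bad address; later good ones are not collected
            }
            good.add(candidate);
        }
        return good;
    }

    public static void main(String[] args) {
        // Crude stand-in for real validation: just looks for an '@'.
        Predicate<String> looksLikeAddress = s -> s.contains("@");
        List<String> found = validPrefix(
                Arrays.asList("a@example.com", "not an address", "b@example.com"),
                looksLikeAddress);
        System.out.println(found); // prints "[a@example.com]"
    }
}
```

Note that b@example.com is dropped even though it is fine on its own, which is the strictness mentioned above.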

Parameters:
header_txt - text from whatever header. I don't think the String needs to be unfolded, but I haven't tested that.

see getInternetAddress() for more info: this extracts the same way

Returns:
zero-length array if errors occur or none are found, otherwise an array of length > 0 with the addresses as InternetAddresses with the personal name and emails set correctly (i.e. doesn't rely on InternetAddress parsing for extraction, but does require that the address be usable by InternetAddress, although re-parsing with InternetAddress may cause exceptions, see getInternetAddress()); will not return null.

getFirstComment

public static java.lang.String getFirstComment(java.lang.String text)
Given a string, extract the first matched comment token as defined in 2822, trimmed; return null on all errors or non-findings

This is probably not super-useful. Included just in case.

Note for future improvement: if COMMENT_PATTERN could handle nested comments, then this should be able to as well, but if this method were to be used to find the CFWS personal name (see boolean option) then such a nested comment would probably not be the one you were looking for?