Exchange 2013/2007 Coexistence: Frozen Transport Queues

We’re in the middle of an on-premises Exchange 2007 to 2013 migration, and we haven’t been having the best time. After a major mail crash in 2011, every time we’ve attempted to move off the recovered server, either to Exchange 2010 or to a freshly built Exchange 2007 setup, we’ve encountered major, migration-stopping issues.

tl;dr

So far, migrating to 2013 hasn’t been nearly as painful: none of the client access issues, no constant database drop-offs, no complete freezing of all inbound and outbound mail on both servers. It hasn’t been fantastic though, and one problem keeps popping up: delivery from Exchange 2007 to 2013 keeps failing, and we don’t yet know why. When it fails, all mail sent to anyone on the 2013 box from anyone not on the 2013 box fails. It’s easy enough to work around: all we need to do is restart the Microsoft Exchange Transport service on the 2013 box, and everything starts working again.

Telnetting to the 2013 box on port 25 yields different results from different computers. Some can connect and have a full SMTP exchange; others receive “421 4.3.2 Service Unavailable” when they try. It doesn’t seem to be based on anything: network segment, connection speed and OS make no difference. Which machines can connect appears to be random, and server load doesn’t seem to have any impact either.
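
The telnet check can also be scripted, which makes it easier to run from several machines and compare results. Here’s a rough sketch, not something from our production setup: Get-SmtpBanner and Test-SmtpBannerHealthy are made-up helper names, and the five-second timeout is an arbitrary assumption.

```powershell
# Hypothetical helper: connect to the SMTP port and return the first line the server sends.
function Get-SmtpBanner {
    param([string]$Server, [int]$Port = 25, [int]$TimeoutMs = 5000)
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $client.ReceiveTimeout = $TimeoutMs   # assumed timeout; tune to taste
        $client.Connect($Server, $Port)
        $reader = New-Object System.IO.StreamReader($client.GetStream())
        return $reader.ReadLine()
    }
    finally {
        $client.Close()
    }
}

# Hypothetical helper: a 2xx reply means the server accepted the connection;
# anything else (such as 421 4.3.2) counts as unhealthy.
function Test-SmtpBannerHealthy {
    param([string]$Banner)
    return [bool]($Banner -match '^2\d\d[ -]')
}
```

Something like Test-SmtpBannerHealthy (Get-SmtpBanner -Server "EXCH13") would then return True or False on each machine. A refused or timed-out connection throws instead of returning a banner, so wrap the call in try/catch if you want to log those too.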

It seems that we’re not the only people having this problem. While Googling around today I found a few other people having similar issues, and it seems to be related to antimalware filtering and/or customized receive connectors. As I mentioned, our Exchange setup is not amazing, and we’ve got a load of legacy stuff that can’t be or hasn’t been removed and so was replicated on the new server.

Since the IT department was moved over as part of the first batch of test users, and most of the people in my local office were also migrated when we expanded the test group, it takes a while for anyone to notice that we’re not getting all our emails. The monitoring solution that’s been implemented by our outsourced helpdesk is currently unable to monitor this specific queue, and they’ve been unable to figure out how to get it monitored so that we’ll receive automated alerts when it starts to build up.

So, I created a PowerShell script that checks whether there’s a queue on that relay and, if there is, how big it is. If the queue is growing, the script works out whether that’s because the relay is overloaded with mail and isn’t sending it fast enough, or because the 2013 box isn’t accepting the mail.

It’s not a solution, but it’s made the problem significantly less disruptive while we try to get everything else working. Today we removed the antimalware scanning and removed/recreated as many of our custom queues as we could. Hopefully I’ll be able to post an update in the next week or so addressing the status of those fixes. Until then, the PowerShell script will keep mail moving.

add-pssnapin Microsoft.Exchange.Management.PowerShell.*
$logfile = "~\Desktop\Transport Failure Log.txt"
$Today = Get-Date
$marker = "-------------
Script started on $Today
-------------
"
write-output "$marker" > $logfile

Function GetQueue(){
	# Find the queue on the 2007 server whose next hop is the Exchange 2013 hub.
	$queue = Get-Queue "Exch07\*" | where {$_.NextHopDomain -eq "hub version 15"}
	return $queue}

Function EmailAlert($count){
	$mailServer = "Mail.Server.com"
	$email = New-Object Net.Mail.SmtpClient($mailServer)
	$message = New-Object Net.Mail.MailMessage

	# The addresses below are placeholders; substitute real ones.
	$message.From = "[email protected]"
	$message.ReplyTo = "whomever"
	$message.To.Add("whomever")
	$message.Subject = "There are $count messages waiting to be delivered."

	$message.Body = "Email Text."

	# Wait two minutes before sending, so the restarted transport service has time to come up.
	Start-Sleep -s 120
	$email.Send($message)
	}

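Function CheckQueues($queue){
	# Nothing to do if the queue is missing or holds 10 messages or fewer.
	if(-not $queue){ return }
	$count = $queue.MessageCount
	$queueID = $queue.Identity.RowID
	if($count -le 10){ return }

	# Remember the message at the head of the queue, then wait a minute and look again.
	$messages = @(Get-Message -Queue "Exch07\$queueID")
	$messageZero = $messages[0].Identity.InternalID

	Start-Sleep -s 60
	$newQueue = GetQueue
	if(-not $newQueue){ return }
	$newCount = $newQueue.MessageCount
	$newQueueID = $newQueue.Identity.RowID

	# Use 'return' here, not 'break': 'break' would end the outer while loop and stop the script.
	# The queue is draining, or it is a different queue: mail is flowing.
	if($newCount -lt $count){ return }
	if($newQueueID -ne $queueID){ return }

	# A different message at the head of the queue means mail is still being delivered.
	$newMessages = @(Get-Message -Queue "Exch07\$queueID")
	$newMessageZero = $newMessages[0].Identity.InternalID
	if($newMessageZero -ne $messageZero){ return }

	# No change yet: check again with the fresh queue object.
	if($newCount -eq $count){
		CheckQueues $newQueue
		return
		}

	# The queue grew and its head message never moved: the 2013 box is not accepting mail.
	$today = Get-Date
	Write-Output "------$today-------- The service was restarted." >> $logfile

	$XportService = Get-Service -Name "MSExchangeTransport" -ComputerName "EXCH13"
	Restart-Service -InputObject $XportService
	$XportService.Refresh()
	$restartCount = 0

	# Keep trying until the service reports Running; Refresh() updates the cached status.
	while($XportService.Status -ne "Running"){
		$restartCount++
		$today = Get-Date
		Write-Output "Exchange Transport failed to restart at $today (attempt $restartCount)." >> $logfile
		Start-Service -InputObject $XportService
		Start-Sleep -s 10
		$XportService.Refresh()
		}
	EmailAlert $newCount
	}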

Function DoWork(){
	# Poll every five minutes.
	Start-Sleep -s 300
	$queue = GetQueue
	CheckQueues $queue}

while($true){DoWork}
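
The decision CheckQueues makes between its two snapshots can be pulled out into a small pure function, which makes the logic easier to follow and to test without an Exchange server. This is only a sketch of that idea: Get-QueueVerdict and its four verdict names are invented here for illustration and are not part of the script above.

```powershell
# Hypothetical helper: compare two snapshots of the same queue taken a minute apart.
# Counts are message counts; head IDs identify the first message in the queue.
function Get-QueueVerdict {
    param(
        [int]$OldCount,
        [int]$NewCount,
        [string]$OldHeadId,
        [string]$NewHeadId
    )
    if ($NewCount -lt $OldCount)   { return 'Draining' }   # queue is shrinking
    if ($NewHeadId -ne $OldHeadId) { return 'Moving' }     # busy, but mail is being delivered
    if ($NewCount -eq $OldCount)   { return 'Unchanged' }  # no new information, check again
    return 'Stalled'                                       # growing and nothing is leaving
}
```

Only 'Stalled' would justify restarting the transport service. A queue that is growing while its head message changes is just a backlog, and restarting wouldn’t help.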

UPDATE: removing the antimalware scanning fixed this.
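
For anyone wanting to try the same thing, these are the two usual ways to turn off the built-in malware filtering on an Exchange 2013 server. Treat this as a sketch rather than a record of exactly what was run here: it assumes the server is named EXCH13 as in the script above, and that the commands are run in the Exchange Management Shell on that server.

```powershell
# Option 1: leave the Malware Agent installed but have it skip scanning.
Set-MalwareFilteringServer -Identity "EXCH13" -BypassFiltering $true

# Option 2: disable the Malware Agent using the script in the Exchange Scripts folder,
# then restart transport so the change takes effect.
& "$env:ExchangeInstallPath\Scripts\Disable-AntimalwareScanning.ps1"
Restart-Service MSExchangeTransport
```

Either way mail is no longer scanned on that server, so make sure something else in the mail path is doing the scanning.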