
Dictation buffer updates

It’s been a little over a year since I started using Dragon Medical in Windows as a dictation buffer for my Mac. Please see my previous few posts on the subject for some background.

Since then, I’ve eliminated pain points and added features which have made the dictation experience smoother and less infuriating. Once again, this is not a turnkey system, in keeping with the DIY nature of such projects, but I hope it helps others who want to do something similar. The code remains up to date on GitHub and I plan on maintaining it for my own use until something better comes along.

So, here’s a change log in approximate chronological order:

Better microphone status

As you may recall, I find the DragonBar distracting in its various forms and keep it hidden most of the time. One thing that does need to be visible, however, is the microphone status. I had originally used Growl to display notifications when the microphone was turned on and off, but I have since set up a combination of Python and AutoHotKey which monitors the Dragon microphone status and displays an overlay window on the Windows side when Word is in front and the microphone is disabled. (While AutoHotKey is a complete mess, it reminds me a lot of OneClick on classic Mac OS; I do wish something like it were still available for persistent script UIs on the Mac.)

Here’s how it looks in practice:

[Image: Dragon Not Listening]

An even more minimal dictation surface

With some more Word macro work, I’m able to use Word’s full-screen mode rather than auto-hiding the ribbon. Now the only vestige of an operating system and full-featured word processor behind my dictation buffer is a vertical scrollbar (which I could disable if I really wanted) and a couple of pixels at the bottom of the screen where the autohidden taskbar lives. To get Word’s full UI back, just press Esc; to return to minimalism, I’ve added “Full Screen” to Word’s Quick Access Toolbar where it’s accessible via Alt+1.

Working with Citrix Viewer

While dictating into native Mac apps is great, most of the work I do these days is in our electronic medical record, which is accessible either via Citrix Viewer or VMware Horizon. One advantage of Citrix Viewer is that it presents the remote application as individually movable windows. There is still an underlying Windows desktop so many of the same issues exist as in similar solutions such as VMware’s Unity (of course Citrix long predates Unity), but overall it is usable. Since getting a larger (2560×1440) monitor at home, when dictating into the EMR I typically configure my desktop with three side-by-side windows: the main EMR window, my current note and the dictation buffer.  Previously I had an older 1680×1050 display and used my iPad for the dictation buffer, but more about that later.

The next problem is getting text from the dictation buffer into Citrix Viewer. Service support would be great, but there isn’t any, so instead I just use the existing clipboard bridging functionality and synthesize Ctrl+C and Ctrl+V.  Pasting into Citrix Viewer is the easy part; copying into the buffer is harder because the Mac clipboard doesn’t immediately update with the Windows clipboard contents.  I just have to poll the clipboard until something happens — ugly but effective.
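As an illustration of that polling step, here is a minimal sketch. It is not the code from the GitHub repository: the function name wait_for_clipboard_change, the CLIP_READ override and the retry defaults are all invented for this example. Only pbpaste, the stock OS X clipboard reader, is real.

```sh
#!/bin/sh
# Sketch only: poll a clipboard reader until its output differs from the
# value seen before the remote copy was triggered.
# CLIP_READ is a hypothetical override hook; it defaults to OS X's pbpaste.
: "${CLIP_READ:=pbpaste}"

# usage: wait_for_clipboard_change OLD [TRIES] [DELAY]
# Prints the new contents and returns 0, or returns 1 if nothing changed.
wait_for_clipboard_change() {
    old=$1
    tries=${2:-50}
    delay=${3:-0.1}   # fractional sleep works on OS X and GNU, not everywhere
    while [ "$tries" -gt 0 ]; do
        new=$($CLIP_READ)
        if [ "$new" != "$old" ]; then
            printf '%s\n' "$new"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}
```

The idea is to capture the clipboard with pbpaste before synthesizing Ctrl+C, then pass that value as OLD. One limitation of comparing contents this way: copying text identical to what is already on the clipboard looks like no change at all, so the loop simply times out.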

[Image: Citrix Viewer]

You may wonder why I go to all this trouble when I’m ultimately dictating into a native Windows app with Windows dictation software. The short answer is that the app and dictation software can’t run on the same computer. As has been the case in several hospitals in which I have worked, our fat-client EMR doesn’t actually get installed directly onto individual client devices either inside the hospital or out, so a local copy of Dragon needs to understand how to get through to the application running on the remote host. Prior versions of Dragon Medical and Citrix would work directly with one another, but this is no longer supported. The currently supported method for dictating into Citrix is vSync, which involves an agent on the Citrix server that talks to the Dragon client. Unfortunately, vSync isn’t supported with non-networked versions of Dragon Medical (such as the one I own). Even with vSync, things aren’t perfect — after our recent Dragon and EMR upgrades, text I’m dictating often fails to display at all until I use the mouse or keyboard to update the screen.

Dictating into Fantastical

My fellow residents and I receive some scheduling information every month in a Word document. It’s nice to look at but not too useful in practice. It’s faster to dictate this information into Fantastical in order to convert it into calendar entries than to try to reformat it manually.

But now that I can easily dictate into Fantastical's natural language parser, I’ve found it useful day-to-day as a simple version of Siri for the Mac. With an AppleScript you can tell Fantastical to “parse sentence”, and it'll open the Fantastical window with your dictation pre-populated.
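A minimal sketch of that call from a shell script follows. The assumptions: the app is scriptable under the name "Fantastical" (some releases are named "Fantastical 2", in which case the name needs adjusting), and the function name fantastical_parse and the OSASCRIPT override are made up for illustration.

```sh
#!/bin/sh
# Sketch only: hand a sentence to Fantastical's "parse sentence" command.
# OSASCRIPT is a hypothetical override hook; the application name is an
# assumption and may need to be "Fantastical 2" on some installs.
: "${OSASCRIPT:=/usr/bin/osascript}"

# usage: fantastical_parse "Lunch with Sam Friday at noon"
fantastical_parse() {
    # Passing the sentence as a script argument (argv) avoids having to
    # escape quotes inside the AppleScript source.
    "$OSASCRIPT" \
        -e 'on run argv' \
        -e 'tell application "Fantastical" to parse sentence (item 1 of argv)' \
        -e 'end run' \
        "$1"
}
```

Passing the dictated text through argv rather than splicing it into the script means stray quotation marks in the dictation can't break the AppleScript.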

Using RDP rather than the VMware Fusion console

While overall I really love VMware Fusion and appreciate its continued development and refinement, there are a few issues which have made it frustrating to use for this project.

First, there’s a substantial delay associated with relaying audio to the virtual machine, which notably slows dictation response. On my MacBook Pro I map my USB headset's dongle directly to the virtual machine to eliminate this delay; on my Mac mini I can't do this because I like to use the headset for Mac stuff such as listening to music at the same time. The current preview version of VMware Fusion claims to have improved audio for conferencing purposes, which I assumed would address this latency, but unfortunately I don’t notice a difference.

Second, the VMware Fusion window can sometimes lag and hang during dictation. (I'm not using Unity mode, as full-screen Word in Unity would consume the entire Mac screen.) I don't understand why it's happening, but I stumbled on a workaround while getting the dictation buffer onto my iPad. I had originally used VMware Fusion's VNC support to do this, but I eventually realized I could use RDP instead. With Jump Desktop on both Mac and iOS, Word became more responsive than it ever was on the console. So I now launch the VM headless and connect to it via RDP; audio, either via the VMware audio driver or USB redirection, remains independent of where the desktop displays. This has further advantages, including letting me configure multiple RDP settings for varying desktop sizes (smaller or larger to fit with host applications, or big for development) rather than having to resize the VMware Fusion window, as well as letting me quit Jump Desktop and pause the VM to save battery power on my MacBook Pro when I'm not dictating. I have not tried using RDP’s own support for audio recording redirection, which Jump Desktop doesn’t support. (My scripts will still work with VMware Fusion directly if Jump Desktop isn’t running, however.)
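One way to script the headless start and the pause/resume cycle is the vmrun tool bundled with VMware Fusion. This is a sketch under assumptions rather than a description of my actual scripts: the VMX path is a hypothetical placeholder, and the vm wrapper function is invented for the example.

```sh
#!/bin/sh
# Sketch only: start, pause and resume a Fusion VM without its console window.
# VMRUN points at the vmrun tool bundled with VMware Fusion; VMX is a
# hypothetical path to the dictation VM's .vmx file.
: "${VMRUN:=/Applications/VMware Fusion.app/Contents/Library/vmrun}"
: "${VMX:=$HOME/Documents/Virtual Machines.localized/Dictation.vmwarevm/Dictation.vmx}"

# usage: vm start|pause|resume
vm() {
    case $1 in
        start)  "$VMRUN" -T fusion start "$VMX" nogui ;;
        pause)  "$VMRUN" -T fusion pause "$VMX" ;;
        resume) "$VMRUN" -T fusion unpause "$VMX" ;;
        *)      echo "usage: vm start|pause|resume" >&2; return 2 ;;
    esac
}
```

With something like this bound to a keyboard shortcut, "vm pause" after quitting Jump Desktop and "vm resume" before reconnecting would cover the battery-saving case.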

Supporting rich text

Until a few weeks ago, input and output with the dictation buffer was limited to plain text. However, VMware Fusion, Jump Desktop and Citrix Viewer all support bridging an RTF clipboard between Windows and the Mac. So, I’ve added bidirectional RTF support to the dictation server, Mac services and command line tools. This involves using the Windows clipboard as there's no way via COM to extract RTF (or anything but text or XML) from Word on Windows. Right now, I only return either RTF or plain text, not both, based on whether you have styled your Word document at all; primarily this is so that all your Mac apps don't end up with unwanted 11 point Calibri text. Figuring out whether a Word document is styled was actually quite difficult (doing something logical like enumerating style runs doesn't work, because each paragraph gets its own run), and I end up doing it by rudimentary parsing of the document's XML looking for character and paragraph styles. pbpaste -Prefer rtf is broken at least in 10.10, so I also implement some direct Mac clipboard setting support for RTF only.

Another video

I’m working on another video to demonstrate these changes; I’ll re-record it when I get a chance.

Plans for the future

In no particular order, some further improvements I’ve been pondering...

  1. I often end up with a number of auto-recovered documents which I wish would go away. I try to close my temporary dictation documents without saving, however I do want to preserve their contents in a crash, so I may consider enumerating these documents on Word launch and deleting those which contain no text — assuming I can do this from the COM interface.
  2. Word’s full-screen mode sometimes doesn’t measure the screen correctly with an auto-hiding taskbar, so you end up with a taskbar-sized strip of desktop background at the bottom of the screen. This appears somewhat related to the way in which I try to launch Word via COM; Windows is far more aggressive about preventing focus stealing than I realized (certainly more than the Mac), to the point that even when acting as a pseudo-user in automating the user interface, it can be difficult to give a newly opened window focus, and the taskbar ends up staying focused. In any case, it should be easy enough to detect when Word’s window is the wrong size instead of just fixing the problem manually (move pointer out of taskbar area if necessary, then press Esc and Alt+1) when it happens.
  3. I’d like to be able to use VMware Horizon; while it’d mean some wasted screen space with non-floating EMR windows, it opens up a unique possibility. VMware Horizon supports multiple protocols: PCoIP and RDP. The PCoIP client is built into Horizon Client on the Mac, but to connect via RDP, Horizon Client launches Microsoft’s now-obsolete Remote Desktop Connection app, if present. RDP supports virtual channels, much like what Citrix uses for vSync, except that since I can run arbitrary Windows apps on the virtual machine rather than just the EMR-in-a-bubble as with Citrix, I could directly manipulate the Windows clipboard on the remote machine. I’d need to write an app to impersonate Remote Desktop Connection which converted its settings files into Jump Desktop ones, a virtual channel plugin for Jump Desktop for Mac and the virtual channel server on Windows. If successful, this would both avoid polluting the Mac clipboard and make the whole process more reliable and controllable than its Citrix equivalent. But my current setup has become a lot more reliable recently, so it may be way too much work for too little benefit.
  4. Speaking of reliability, I want to eliminate my reliance on NatLink. It's big, old and crufty, it sometimes hangs when I try to connect to it (though much less frequently since I don’t do so as often), and it forces my server to be single-threaded. If I could figure out how to interface with the Dragon microphone objects directly from Python or even my own C++ code, I could get rid of it completely. I also suspect some of the random-appearing hangs I'm seeing while dictating are NatLink’s fault, as they don’t happen when using Dragon at work.

On the other hand, my open source Mac apps need to be updated for El Capitan…

soundsource: a few examples

A few weeks ago, I added links to some of my smaller OS X projects to my software page. One of these projects is a command line version of Rogue Amoeba’s now-discontinued SoundSource.

This tool, which I have rather unimaginatively named soundsource, is the basis for a number of scripts I have written. I recently enhanced some of the scripts and figured I might as well post them here as inspiration for others who are drowning in a sea of audio input and output devices connected to their Macs. I run these scripts from FastScripts with corresponding keyboard equivalents:

One script handles the various headphones I use. Macs of the last few years support the same microphones and remotes as Apple’s iOS devices, and the generally decent quality of the microphones on many iOS compatible headsets is even adequate for dictation in a quiet room. I’m dictating this blog post, for example, with a Bose QuietComfort 20i headset. The accuracy isn’t quite that of my usual setup, but it is entirely sufficient for short-term usage, and it sure is nice to only have one thin cable plugged into my Mac.

Sometimes, however, I just have regular headphones plugged into the jack, and in this case there is no corresponding microphone input. Apple’s recent Macs also do a great job of dynamically changing the available audio input and output sources advertised to the OS as you connect and disconnect devices.

And in yet other cases, I use a USB headset. In any event, I want a way to “just start playing (and recording, if possible) through my headphones”. Here it is, using Growl to display the results:

#!/bin/sh

notify() {
	/bin/echo -n "Input: $(/usr/local/bin/soundsource -i)" |
		/usr/local/bin/growlnotify \
			-I /System/Library/PreferencePanes/Sound.prefPane \
			-d net.sabi.soundsource "$1"
}

# succeeds if headphones connected to jack
if /usr/local/bin/soundsource -o 'Headphones'; then
	# succeeds if headphones have integrated microphone
	/usr/local/bin/soundsource -i 'External microphone' || /usr/bin/true
	notify Headphones
else
	/usr/local/bin/soundsource -o 'C-Media USB Headphone Set'
	/usr/local/bin/soundsource -i 'C-Media USB Headphone Set'
	notify 'Plantronics Headset'
fi

The notification looks like this:

[Image: Growl notification]

Note that I take advantage of soundsource exiting with failure if it is unable to switch to the desired audio input or output device.
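The same exit-status convention makes it easy to walk a preference list of devices. Here is a minimal sketch, assuming only what the script above already relies on (soundsource -o NAME fails when the device is unavailable). The function name select_first_output and the SOUNDSOURCE override are invented for this example, and 'Internal Speakers' is just an illustrative device name.

```sh
#!/bin/sh
# Sketch only: try each named output device in order and print the first
# one that soundsource manages to select.
# SOUNDSOURCE is a hypothetical override hook for the tool's location.
: "${SOUNDSOURCE:=/usr/local/bin/soundsource}"

# usage: select_first_output 'Headphones' 'C-Media USB Headphone Set' ...
select_first_output() {
    for device in "$@"; do
        if "$SOUNDSOURCE" -o "$device" >/dev/null 2>&1; then
            printf '%s\n' "$device"
            return 0
        fi
    done
    return 1   # none of the devices could be selected
}
```

For example, select_first_output 'Headphones' 'C-Media USB Headphone Set' 'Internal Speakers' would settle on the first of those three that is currently connected.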

The second script handles changing the output to my AirPort Express, Furrball. Unfortunately, my home Internet connection is currently via my landlord’s somewhat unreliable Wi-Fi, and the AirPort Express drops off the network with depressing regularity. FastScripts does a great job of displaying status when the script fails, but because of the frequency of this failure, I recover from it by power cycling the AirPort Express. Note that, at least in OS X 10.8, switching to a nonfunctional AirPlay device may appear to succeed but immediately switches back to another device; you need to confirm the change.

#!/bin/sh

notify() {
	/bin/echo -n "Input: $(/usr/local/bin/soundsource -i)" |
		/usr/local/bin/growlnotify \
		    -I /System/Library/PreferencePanes/Sound.prefPane \
		    -d net.sabi.soundsource "$1"
}

/usr/local/bin/soundsource -o Furrball || /usr/bin/true
if [ "$(/usr/local/bin/soundsource -o)" = Furrball ]; then
	notify Furrball
else
	notify "Power cycling Furrball..."
	/usr/bin/osascript -e 'tell app "XTension" to turn off "Furrball"'
	/bin/sleep 1
	/usr/bin/osascript -e 'tell app "XTension" to turn on "Furrball"'
	count=0
	while true; do
		notify "Waiting for Furrball ($count)..."
		if /sbin/ping -qot 1 furrball.local; then
			/usr/local/bin/soundsource -o Furrball
			notify Furrball
			exit 0
		fi
		count=$((count+1))
	done
fi

Updating dynamic DNS on FreeBSD with ldns-update(1)

Because of the many UI/feature regressions in AirPort Utility 6 and lack of attention to AirPort Extreme firmware bugs (currently, my family’s one-generation-old AirPort Extreme has issues with dynamic DNS updating and drops SIP traffic), I’m in the process of migrating to an embedded router platform, PC Engines’ apu1c.

This migration has been a trying process, between buggy firmware, discovering too late that OpenBSD doesn’t support 802.11n, and FreeBSD PF bugs/lack of documentation (FreeBSD PF diverged significantly from OpenBSD PF a while back). I just ran into a fun problem where the wireless card comes up with a bizarre PCI ID that doesn’t configure:

none1@pci0:5:0:0:	class=0x020000 card=0x00000000 chip=0xff1c168c rev=0x01 hdr=0x00
    vendor     = 'Atheros Communications Inc.'
    device     = 'AR5008 Wireless Network Adapter'
    class      = network
    subclass   = ethernet

versus the correct information I got after a power cycle:

ath0@pci0:5:0:0:	class=0x028000 card=0x3099168c chip=0x002a168c rev=0x01 hdr=0x00
    vendor     = 'Atheros Communications Inc.'
    device     = 'AR928X Wireless Network Adapter (PCI-Express)'
    class      = network

Hopefully this is not indicative of the wireless card dying.
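If this keeps happening, a boot-time check could at least flag it. The following is a rough sketch based only on the two pciconf listings above: a device with no driver attached shows up under a "noneN@" name, and the low 16 bits of chip= hold the PCI vendor ID (0x168c for Atheros). The function name atheros_unattached is made up for the example.

```sh
#!/bin/sh
# Sketch only: read `pciconf -l` output on stdin and succeed if an Atheros
# device (vendor ID 0x168c, the low 16 bits of chip=) has no driver
# attached, which pciconf shows as a "noneN@" device name.
atheros_unattached() {
    grep -Eq '^none[0-9]+@pci.*chip=0x[0-9a-f]{4}168c'
}

# Example use on FreeBSD:
#   pciconf -l | atheros_unattached && echo "ath card did not attach"
```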

I’m still using ipfw/natd for the moment, though I will try again with PF at some point because I’m unaware of a way to make UPnP and NAT-PMP work otherwise.

As I mentioned above, the AirPort Extreme implementation of RFC 2136 dynamic DNS has problems. In addition to the updating-every-few-seconds bug above, Apple’s NTP servers were returning times far enough off correct that the signature was failing. You can’t even configure the NTP server in AirPort Utility 6, of course, but thankfully I was able to hack AirPort Utility 5.6.1 into working on current OS X versions.

FreeBSD 10 no longer ships BIND and its nsupdate utility. Instead it includes LDNS and Unbound, but not LDNS’s associated “example” utilities, notably the dynamic DNS updater ldns-update. So I installed the LDNS package and promptly discovered several bugs in ldns-update. Thanks to some generous help, I was able to get the FreeBSD-packaged version of ldns-update to work—with one exception: dynamic DNS updates being sent to port 5353 rather than 53.

Until the port change makes it into an LDNS release, here’s a patched amd64 package built on FreeBSD 10.0.

Finally, here’s my /etc/dhclient-exit-hooks that updates the IP address on DHCP address changes:

#!/bin/sh -ef

case .$reason in
    .BOUND | .REBOOT)
	;;
    *)
	exit 0
esac

HOST=hostname_goes_here
ZONE=zone.goes.here
KEY='KEY_GOES_HERE'

update() {
    # ldns-update domain [zone] ip tsig_name tsig_alg tsig_hmac
    # this script assumes the zone is the same as the tsig_name
    /usr/local/bin/ldns-update $HOST.$ZONE $ZONE $1 $ZONE hmac-md5 $KEY
}

update none
update $new_ip_address

Note that the usage message you get from running ldns-update is more useful than the man page: it includes the important none and zone options.
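Since waiting for a real DHCP event is a slow way to test, the hook's logic can be wrapped in a function and dry-run by hand. This is a sketch, not part of the hook itself: the function name ddns_hook and the LDNS_UPDATE override are invented here, and HOST, ZONE and KEY are placeholders.

```sh
#!/bin/sh
# Sketch only: the hook's logic as a function, so it can be dry-run.
# LDNS_UPDATE is a hypothetical override hook; HOST, ZONE and KEY are
# placeholder values.
: "${LDNS_UPDATE:=/usr/local/bin/ldns-update}"
HOST=router
ZONE=example.org
KEY='KEY_GOES_HERE'

# usage: ddns_hook REASON NEW_IP
ddns_hook() {
    case $1 in
        BOUND | REBOOT) ;;      # only act on these dhclient reasons
        *) return 0 ;;
    esac
    # Same two calls as the hook above: first "none", then the new address.
    for addr in none "$2"; do
        "$LDNS_UPDATE" "$HOST.$ZONE" "$ZONE" "$addr" "$ZONE" hmac-md5 "$KEY" || return 1
    done
}
```

Setting LDNS_UPDATE=echo and running ddns_hook BOUND 192.0.2.10 prints the two argument lists that would be passed to ldns-update instead of sending anything; any other reason, such as RENEW, prints nothing.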

Update, 6 December 2014: Fixes for the above issues (including an updated man page) have been integrated into LDNS, though not yet into a release. I rebuilt my FreeBSD package, also incorporating the FreeBSD port changes: here you go.

Creating a dictation buffer

As I mentioned in a post last month, I recently upgraded my Windows dictation setup to Dragon NaturallySpeaking (DNS) 12 and Word 2013.

This upgrade broke the Emacs dictation interface (vr-mode) I had earlier used with DNS 8 and 10. But it also encouraged me to explore new dictation workflows using Natlink directly from my own Python scripts.

Primarily, I have since switched from editing entire documents in Windows, or using the clipboard to transfer text, to using my minimal dictation surface as a buffer while editing documents on the Mac side. I was inspired to do this after I spent a day with a radiologist observing PowerScribe 360 in use.

PowerScribe is a dictation system which uses a dedicated handheld speech controller. Rather than being inserted at the insertion point like typed text, dictated text is buffered and “placed” by buttons on the speech controller or by clicking. You can also choose to discard the dictated text, accompanied by a cute sound effect. Color coding and other affordances distinguish templated from dictated and typed text. (This would be much easier to show than to describe, but I couldn’t find any good examples of the system actually in use on YouTube.)

Thanks to PowerScribe, I realized that it’s actually easier for me to work with shorter fragments of text, a sentence or a paragraph at a time, rather than importing the entire document at once. What I’ve implemented so far is on GitHub; here’s a video showing it in use and explaining some technical details:

There are some disadvantages with this system. If you do want to dictate individual words or something smaller than a sentence into the buffer, you will need to manage the spaces, capitalization and punctuation yourself, since your Word document in the dictation buffer isn’t aware of the surrounding contents. In reality, I seldom find this a problem; saying “no caps” or “lowercase that” from time to time isn’t overly arduous. I could theoretically go even further and implement the Mac side of the solution with an input method rather than services and scripts, which would give me access to the surrounding context, but I think that would be a lot of work for relatively little added benefit.

I’ve still got some more work to do; while writing this post, I realized I need a “discard” command much like the one in PowerScribe. (Done.)

While my setup isn’t yet to the point of being usable “out of the box”, I hope that this brief exploration will help other technically inclined dictation users expand their workflows.

Creating a minimal dictation surface with Word 2013

As I discussed in my previous post, I do my serious dictation in a Windows 7 virtual machine. Having recently upgraded my dictation setup and transferred it to a new Mac, I figured it'd be a good thing to share.

While I don't have any experience with its competitors, VMware Fusion 6 does a good job of making my USB headset plugged into OS X available to Windows for dictation, without interfering with its use in OS X. Dragon NaturallySpeaking calls VMware's audio source “Microphone (Mic-In).” In earlier versions of VMware Fusion, I had mapped my USB headset exclusively to Windows for dictation. This also worked well, but I'm not sure if it was actually necessary.

Most of the time I'm not actually editing documents directly on Windows; the OS simply holds my text on the way to its destination in a Mac application. Dragon NaturallySpeaking (and its Medical derivatives) includes a WordPad knockoff called DragonPad. Its stated purpose is to serve as exactly such a dictation intermediary, but the user interface looks like it was frozen around Windows 2000 and it only supports single undo (not even redo). So, it's a bit of a nonstarter.

My next best bet is Microsoft Word, for which NaturallySpeaking includes a COM-based addin. Previously, NaturallySpeaking 10 limited me to using the 32-bit version of Office 2010; with version 12, I can use 64-bit Office 2013.

Happily, some of the Office 2013 changes made to support smaller-screened touchscreen tablets have helped my use case of a minimal "dictation surface" sitting in the corner of my Mac's screen. Here's how I set things up:

  • Use VMware Fusion's "Single Window" view, rather than its Unity view. The Word view options I describe only work if the Word window is maximized.  In Unity view this means maximizing to the Mac’s screen, whereas in the Single Window view, you can resize the VM screen as needed.  It's nice to be able to see your other Mac apps as you dictate (or not; I use Shroud’s keyboard shortcuts to alternatively hide everything else but the VM window when I really need to focus).
  • Set the Windows taskbar to auto-hide.
  • Change Dragon NaturallySpeaking’s DragonBar mode (in Tools > Options > View) to “Tray Icon Only”. I find the continual display of audio levels distracting; Apple’s dictation display, despite doing something conceptually similar, is less distracting, probably because it doesn’t change color. I have a keystroke, Ctrl-Alt-`, set to toggle the microphone and don't find it a problem that I sometimes get extraneous dictation in my document if I forget that the VM is listening in the background. (Actually, sometimes it’s pretty humorous.)
  • Turn off Word’s Start screen so you get an untitled document when you start Word.
  • Switch Word to draft view, via Alt, W, E or clicking the Draft button in the View ribbon tab. Those of you who have been using Word since 4.0 or earlier may remember this used to be the default view, but it’s been marginalized in favor of more WYSIWYG alternatives in recent Word versions. However, this heritage explains the corresponding dictation command, “normal view”. (If you accidentally say “draft view”, you’ll probably find everything changes into Courier; say “turn off draft mode” to fix that.) The main advantage of draft view is that you can resize the window without changing your font size; it's also more space efficient, as it doesn’t display your margins.
  • Outline view (Alt, W, U, or say “outline view”) is also a good choice for dictation, though Word is no OmniOutliner. Say “new line” rather than “new paragraph” or you’ll dictate a bunch of empty headings. “Tab” and “press shift-tab” will indent and unindent respectively.
  • Set Word's ribbon to auto-hide, via the button between the help and minimize buttons at the top right of the window. This maximizes the window if it isn’t already; it also hides the status bar, window title bar and most other chrome. A long button across the top of the window labeled with an ellipsis will restore the chrome if you need it, as will a press of the Alt key.
  • Experiment with Office themes (Options > General > Personalize your copy of Microsoft Office). The White theme is more trendy but I prefer a bit more separation between content and chrome, so I picked Dark Gray.
  • Consider disabling the cursor animations (smooth movement of the insertion point as you type) if they're as disconcerting to you as they are to me.

Once you have your view set up, you'll find that Word reverts to Print Layout view for new documents. Unfortunately, to solve this problem you must delve into the crufty world of Office automation with Visual Basic for Applications. From the look of its toolbars, the VBA editor last had serious work done in the Office 2003 timeframe; most of it appears unchanged since VBA’s inception.

If it isn’t there already, add the Developer tab to your Ribbon (Options > Customize Ribbon > Main Tabs). Click Developer > Visual Basic, or press Alt, L, V.  Select the Normal project if you haven’t already to put code in the template, and paste in the following (if you’ve already got code in there, I trust you know what to do):

Sub AutoExec()
    ' Wait until a document opens.
    Application.OnTime Now, "AutoNew"
End Sub

Sub AutoNew()
    ' Ensure that the draft font isn't used
    ' (e.g., if you say "draft view" by accident)
    With Dialogs(wdDialogToolsOptionsView)
        .DraftFont = False
        .Execute
    End With
    ' Draft view is wdNormalView.
    If ActiveWindow.View.Type = wdPrintView Then
        ActiveWindow.View.Type = wdNormalView
    End If
    ' If window isn't maximized, ribbon doesn't collapse fully.
    Application.CommandBars("Ribbon").Visible = False
End Sub

Update: I have posted an updated version of the above macros to GitHub.

(There's some incorrect information on the Internet about scripting draft view in Word, for example here. The issue, as above, is that Draft view used to be Normal view, and still is Normal view both from VBA as well as in Dragon voice commands. View.Draft, despite the name, controls the font.)

Save, quit and restart Word; you should find yourself with a minimal dictation surface ready for your use:

[Image: Word 2013 dictation setup]

A related note: I experimented with Windows Live Writer for this post, versus my usual process of copying and pasting into MarsEdit. As long as I turn off the “Blog Theme” button (which causes problems unrelated to dictation), dictation into Windows Live Writer works acceptably. The biggest issue is the markup ending up all one line in WordPress, despite looking fine (seriously — a Microsoft tool that generates tolerable markup!) in Windows Live Writer. Smaller issues include the results box appearing in the top left corner of the screen regardless of my cursor location (normally it appears at the insertion point) and dictation inserting unnecessary newlines, particularly in a bulleted list.
