Document Storage Project
- Part 1: What we’re doing
- Part 2: Setting up a base system
- Part 3: Configuring Apache
- Part 4: Indexing the Storage
This is Part 5: Uploading Scanned Images.
There’s two components to this part: configuring somewhere for the files to be uploaded to and setting up your MFD to upload to them. Most modern MFDs will upload to a CIFS share, which is what we’re going to use here. First thing’s first, we need to install Samba:
apt-get install samba
Now we need to set up Samba. We’ll have user-level security (it’ll be much easier to lock things down if we want to increase security at a later date, and besides share-level security went out with the Ark) and a single share called incoming. We also need a user for the MFD to log into Samba with; we’ll call this user “scanner”. We’ll also have a group called “scanner” so we can be a little more flexible over who can access this share should we wish.
Edit /etc/samba/smb.conf as follows:
...... # "security = user" is always a good idea. This will require a Unix account # in this server for every user accessing the server. See # /usr/share/doc/samba-doc/htmldocs/Samba3-HOWTO/ServerType.html # in the samba-doc package for details. security = user ...... [incoming] path = /home/incoming guest ok = no browseable = no read only = no valid users = @scanner
Now, we need a new user for the MFD. Samba requires that users also have corresponding Unix accounts, so first we create a Unix account, then we set their Samba password. We also need to ensure the permissions on /home/incoming are correct – the folllowing commands deal with this:
useradd scanner smbpasswd scanner chgrp scanner /home/incoming chmod g+rwx /home/incoming
Make sure you choose a password that is not only secure, but possible to type in on your MFD! Check this works by connecting to the following folder in Windows:
You’ll need to use the username/password for the scanner user you set up.
For the final part of this, you need to set up your MFD to scan to this directory.
I’ve chosen an Oki MB451 multifunction unit for a number of reasons:
- It’s cheap.
- It has a double-sided document feeder for scanning. More and more documents are being sent double-sided; it seems like a step back to have a document feeder that can’t deal with this.
- It supports scanning directly to email and CIFS share without requiring extra software on the PC. (This is important; certainly a few years ago a lot of manufacturers claimed their products could do this but it wasn’t apparent until after you’d taken it out of the box that their product didn’t do any of it without additional software on your PC. Certain large photocopier-type units still have this restriction, though sometimes you can buy an optional bolt-on to overcome it. I prefer avoiding the need for extra bolt-ons because they’re usually extortionately priced and often difficult to source).
- It has a nice big display. These units can be a pig to set up at the best of times; a large display often goes some way to alleviate this problem.
- You can set up lots of profiles – preconfigured shortcuts that say “everything scanned under this profile should be stored under this name in this share accessed with this username and password; files should have this format”. Unfortunately you can’t nail a profile to say “everything scanned under this profile is double-sided” but you can’t have everything!
- The printer supports Postscript, which means it’ll be pretty much guaranteed to work under any OS I can throw at it for a long time to come.
I won’t go into detail regarding MFD configuration – there’s simply too many on the market and they all vary. It’s enough to explain that I’ve set up a profile called “Correspondence” and I’ve pointed it at \\(hostname)\incoming.
With the profile I’ve set up, scanned documents will be stored under \\(hostname)\incoming\Correspondence-#####.pdf.
Test this all works by scanning a document and making sure it appears in the /home/incoming directory on your Linux box.
There’s only one thing left to do – tie all this together so incoming documents are automatically OCR’d, made available via Apache and OCR’d so they’re indexable in Sphider….