Help:Digitising texts and images for Wikisource
Much of the material on Wikisource so far has come from online sources such as Project Gutenberg and other digital libraries. In time more material will be digitised specially for Wikisource, and it is hoped that this page will be expanded to reflect best practice for MediaWiki editing as the most appropriate methods are worked out. Digitising is not complicated, but strict attention to detail is needed to get the best results.
Digitising falls into four areas:
- Scanning
- Text and images are scanned in much the same way, but differ in how the results are saved for later use. Flat-bed scanners are usually A4 format and will take a book page of up to quarto size (approx. 10 in x 8 in). Larger pages need an A3 scanner. An alternative is to reduce larger pages to A4 on a photocopier and scan the photocopies.
- Bound books can be difficult to scan because of the binding. A special book scanner, which scans right up to the edge of the page at the hinge side of the binding, is required. These are costly and are mainly used in large libraries. The Plustek OpticBook 3600 [1] is worth looking at as an affordable solution.
- Scanning of images [To be added later]
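The page-size rule of thumb above can be checked with a little arithmetic. The following is only a rough sketch: it assumes the scanner bed is exactly the nominal ISO A4 or A3 size (real beds vary slightly), and the names `fits` and `reduction_percent` are made up for illustration.

```python
import math

# Nominal ISO 216 bed sizes in millimetres (short side, long side).
# Assumption: the scanner bed matches the nominal paper size exactly.
BEDS_MM = {"A4": (210.0, 297.0), "A3": (297.0, 420.0)}
MM_PER_INCH = 25.4


def _page_mm(page_in):
    """Convert a (width, height) page size in inches to sorted millimetres."""
    return sorted(side * MM_PER_INCH for side in page_in)


def fits(page_in, bed="A4"):
    """Return True if the page fits on the named bed in either orientation."""
    short, long_side = _page_mm(page_in)
    bed_short, bed_long = BEDS_MM[bed]
    return short <= bed_short and long_side <= bed_long


def reduction_percent(page_in, bed="A4"):
    """Largest whole-number photocopier percentage that fits the page on the bed."""
    short, long_side = _page_mm(page_in)
    bed_short, bed_long = BEDS_MM[bed]
    scale = min(bed_short / short, bed_long / long_side, 1.0)
    return math.floor(scale * 100)


if __name__ == "__main__":
    print(fits((8, 10)))                 # quarto page on an A4 bed
    print(fits((11, 14)))                # a larger page on an A4 bed
    print(reduction_percent((11, 14)))   # photocopier setting to reach A4
```

Rounding the percentage down rather than to the nearest whole number keeps the reduced copy inside the bed rather than fractionally over it.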
- OCRing
- Scanned texts are converted to machine-readable form by Optical Character Recognition (OCR) software, such as OmniPage or TextBridge. The software developed by Athelstane-Etext [2] for use in conjunction with the Plustek OpticBook 3600 has been devised to produce clean texts after scanning. It runs under DOS on Windows 95 and 98, but there are evidently problems with Windows XP. It produces a set of page images which are then used for OCRing. The images can be called up at will for proof-checking the work.
- It is not necessary to make corrections in the OCR software. Instead, scans of text should always be saved as text only. Saving in Rich Text Format (RTF), or in the format of a favourite word processor, will introduce rogue codes which will need removing later. Once saved, a preliminary edit for typos can be made in a text editor such as NoteTab. The text should be compared carefully with the printed original.
MediaWiki formatting codes for enhancements such as bold and italic can be inserted at this stage.
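Some of the preliminary clean-up can be scripted before the text is proofread by hand. The sketch below is one possible approach, not an established Wikisource tool: the function names are hypothetical, and it assumes the OCR output has already been saved as plain text and read into a string. It normalises line endings, drops stray control and formatting characters, and trims trailing spaces; the two small helpers wrap a phrase in the standard MediaWiki apostrophe markup for italic and bold.

```python
import re
import unicodedata


def clean_ocr_text(raw):
    """Tidy plain-text OCR output before it is pasted into a wiki page."""
    # Use Unix line endings throughout.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control and format characters (Unicode categories Cc and Cf),
    # such as form feeds, byte-order marks and soft hyphens.
    # Newlines and tabs are kept.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )
    # Strip trailing spaces and tabs from each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Trim blank space at both ends and finish with a single newline.
    return text.strip() + "\n"


def italic(phrase):
    """Wrap a phrase in MediaWiki italic markup (two apostrophes each side)."""
    return "''" + phrase + "''"


def bold(phrase):
    """Wrap a phrase in MediaWiki bold markup (three apostrophes each side)."""
    return "'''" + phrase + "'''"


if __name__ == "__main__":
    sample = "\ufeffChapter I \r\nIt was\x0c a dark night.\r\n"
    print(clean_ocr_text(sample), end="")
    print(italic("Emma"), bold("Note"))
```

A script like this only removes mechanical debris. It cannot catch misread letters, so the comparison against the printed original is still needed.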
- Uploading
- With the Wikisource edit page open in the browser, the text can be copied from the text editor, pasted into the edit box and saved.
- Uploading of images [to be added later]
- Final editing
- While still online, final tweaks to the formatting can be made in the browser. Firefox has an extension which allows MediaWiki editing by right-clicking the mouse. Enhancements and a number of diacritics for accents can be added by this means. Additional diacritics can be found on the insert bar at the foot of the Wikisource edit page. See wikification.