Trove's 'book' zone includes books (of course), but also ephemera (like pamphlets and leaflets) and theses. You can access metadata from the book zone through the Trove API. See below for information on running these notebooks in a live computing environment. Or just take them for a spin using Binder.

Harvesting data

Harvesting the text of digitised books (and ephemera)

This notebook harvests metadata and OCRd text from digitised works in Trove's book zone. Results of the harvest are available below. In poking around to try and find a way of automating the download of OCR text from Trove's digitised books, I discovered that there's lots of useful metadata embedded in the page of a digitised work. Most of this metadata isn't available through the Trove API.

Getting the text of Trove books from the Internet Archive

Previously I've harvested the text of books digitised by the National Library of Australia and made available through Trove. But it occurred to me that it might be possible to get the full text of other books in Trove by making use of the links to the Open Library.

Exploring harvested books

Counting words and phrases

This notebook provides a simple example of extracting word and ngram frequencies from the OCRd text of a digitised book using TextBlob and Wordcloud. The text is downloaded from the Cloudstor repository created by the full harvest of Trove digitised books.

In this notebook we use TextBlob to extract nouns, verbs, and sentences from the OCRd text of a 19th century cookery book. We try to clean things up a bit, using regular expressions to discard likely OCR errors. Then we recombine the various parts in random combinations to create delicious recipes for all occasions. Enjoy!

Exploring the Digitised Books Collection from Trove by Adel Rahmani

This notebook explores the 9,738 text files from digitised books available below. 'A particular feature of this book collection is that it is multilingual, therefore I'll be focusing a bit on that, and on the use of the topic model to figure out what the collection is about.'

Data and text

OCRd text from Trove books and ephemera

I've harvested 26,762 files of OCRd text from digitised books and ephemera using the notebook above.

CSV formatted list of books available in digital form

This file provides metadata of 42,174 works in the Trove book zone that are available in digital form.

Government publications in digital form

This dataset combines records from the separate harvests of books and periodicals available from Trove in digital form that have the type 'Government publication'.

OCRd text from the Internet Archive of 'Australian' books listed in Trove

I've harvested 1,513 text files from the Internet Archive of 'Australian' books listed in Trove using the notebook above. Trove's 'Australian content' filter was used to try to limit the results to books published in, or about, Australia. However, this is not always accurate and some of the harvested works don't seem to have an Australian connection.

CSV formatted list of 'Australian' books in Trove with full text versions in the Internet Archive

This file includes metadata of 1,511 'Australian' books listed in Trove that have freely available text versions in the Internet Archive.
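
The word and ngram counting mentioned above is done with TextBlob in the notebook itself. To give a rough idea of what that involves, here is a minimal sketch that swaps in the Python standard library instead, so it runs without any extra packages. The function names, the sample sentence, and the short-token filter used as a stand-in for OCR clean-up are all hypothetical and are not taken from the notebooks.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters (apostrophes allowed inside words)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def drop_likely_ocr_errors(tokens, min_length=2):
    """Discard very short tokens as a crude guess at OCR noise; 'a' and 'i' are kept."""
    return [t for t in tokens if len(t) >= min_length or t in {"a", "i"}]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample standing in for the OCRd text of a digitised book.
sample = "Boil the mutton slowly. Boil the rice, tben add tbe mutton and 1 q salt."

tokens = drop_likely_ocr_errors(tokenize(sample))
word_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
```

A length filter like this only catches stray single characters; misreadings such as 'tben' pass straight through, which is why real OCR clean-up usually needs patterns tuned to the text at hand.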