<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="assets/xml/rss.xsl" media="all"?><rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Etienne’s blog</title><link>http://etienned.github.io/</link><description>Python, technology and maybe more</description><atom:link href="http://etienned.github.io/rss.xml" type="application/rss+xml" rel="self"></atom:link><language>en</language><lastBuildDate>Sun, 29 Mar 2015 02:26:05 GMT</lastBuildDate><generator>http://getnikola.com/</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Extract text from Word files (docx) simply</title><link>http://etienned.github.io/posts/extract-text-from-word-docx-simply/</link><dc:creator>Etienne Desautels</dc:creator><description>&lt;div&gt;&lt;p&gt;If you want to extract the text content of a Word file there are a few &lt;a class="reference external" href="http://stackoverflow.com/questions/42482/best-way-to-extract-text-from-a-word-doc-without-using-com-automation"&gt;solutions&lt;/a&gt; to do this in Python. Unfortunately most of these solutions have dependencies or need to run an external command in a subprocess or are heavy/complex, using an office suite, etc. I find that the best solution among those in the Stackoverflow page is &lt;a class="reference external" href="https://github.com/mikemaccana/python-docx"&gt;python-docx&lt;/a&gt;. But using it bring two dependencies: &lt;em&gt;python-docx&lt;/em&gt; itself and &lt;a class="reference external" href="http://lxml.de/"&gt;lxml&lt;/a&gt;. Installing &lt;em&gt;python-docx&lt;/em&gt; is not a big problem. Unfortunately &lt;em&gt;lxml&lt;/em&gt; is sometimes hard to install or, at the minimum, requires compilation.&lt;/p&gt;
&lt;p&gt;To avoid that, inspired by &lt;em&gt;python-docx&lt;/em&gt;, I created a simple function to extract text from &lt;em&gt;.docx&lt;/em&gt; files that do not require dependencies, using only the standard library. So it’s easy to incorporate it in any Python project.&lt;/p&gt;
&lt;p&gt;Is there any way to improve it?&lt;/p&gt;
&lt;script src="https://gist.github.com/7539105.js"&gt;&lt;/script&gt;&lt;noscript&gt;&lt;pre class="literal-block"&gt;
try:
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile


"""
Module that extract text from MS XML Word document (.docx).
(Inspired by python-docx &amp;lt;https://github.com/mikemaccana/python-docx&amp;gt;)
"""

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.getiterator(PARA):
        texts = [node.text
                 for node in paragraph.getiterator(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

&lt;/pre&gt;
&lt;/noscript&gt;&lt;/div&gt;</description><category>Python</category><guid>http://etienned.github.io/posts/extract-text-from-word-docx-simply/</guid><pubDate>Tue, 26 Nov 2013 10:01:59 GMT</pubDate></item><item><title>Using a Raspberry Pi as an air exchanger controller</title><link>http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/</link><dc:creator>Etienne Desautels</dc:creator><description>&lt;div class="section" id="the-margarita-project"&gt;
&lt;h2&gt;The Margarita project&lt;/h2&gt;
&lt;p&gt;This project’s goal is to build a controller for my house’s air exchanger that will optimize its utilization. It will take into account exterior and interior temperature and humidity to decide what the exchanger should do. In a second time I will add a web/phone interface to access and set the controller. In fact, I have a lot of other ideas and goals, but better to not make too many promises!&lt;/p&gt;
&lt;p&gt;&lt;a href="http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/"&gt;Read more…&lt;/a&gt; (6 min remaining to read)&lt;/p&gt;&lt;/div&gt;</description><category>Margarita</category><category>Python</category><category>Raspberry Pi</category><guid>http://etienned.github.io/posts/raspberry-pi-as-air-exchanger-controller/</guid><pubDate>Sat, 23 Nov 2013 10:30:03 GMT</pubDate></item></channel></rss>