Google Docs to clean html, good for WordPress posts, emails

WARNING! It appears Google Docs has changed and this script no longer works. 

Google Docs is a great platform for writing documents, especially compared with the WordPress editor. It would be good to have a clean way to export a Google Doc to a WordPress post or to generate nice-looking emails. If you copy and paste a Google Doc into a WordPress post, it loses much of its formatting and produces bloated HTML with lots of inline styles and CSS classes that do not work well with WordPress. So here’s a solution that generates clean HTML from a Google Doc and emails it to you, so that you can copy and paste it into a WordPress post or send it to others by email.

For example, here’s a Google Doc:

GoogleDocScreenshot

Once you run the script, it will produce a nice clean email for you:

GoogleDocsEmail

Here’s how to do it:

  1. Open your Google Doc, go to the Tools menu, and select Script Editor. A new window should open with a code editor.
  2. Copy and paste the code from here: GoogleDocs2Html
  3. Then, from the “Select function” menu, choose ConvertGoogleDocToCleanHtml.
  4. Click the play button to run the script.
  5. You will get an email containing the HTML output of the Google Doc, with inline images.
  6. You can forward that email to anyone, or copy and paste it into a WordPress post.

Here’s how the code works:

First it will loop through the elements (paragraph, images, lists) in the body:

function ConvertGoogleDocToCleanHtml() {
  var body = DocumentApp.getActiveDocument().getBody();
  var numChildren = body.getNumChildren();
  var output = [];
  var images = [];
  var listCounters = {};

  // Walk through all the child elements of the body.
  for (var i = 0; i < numChildren; i++) {
    var child = body.getChild(i);
    output.push(processItem(child, listCounters, images));
  }

  var html = output.join('\r');
  emailHtml(html, images);
  //createDocumentForHtml(html, images);
}
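
The loop hands each body element to processItem, which is not reproduced in this post. To give a rough idea of what such a step involves, here is a minimal sketch that uses plain objects in place of real Doc elements. The names renderElements and escapeHtml and the {type, text} shape are assumptions made for illustration, not the script's actual code:

```javascript
// Hypothetical, simplified stand-in for the script's processItem step.
// Elements are plain objects ({type, text}), not real DocumentApp elements.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

function renderElements(elements) {
  var output = [];
  var inList = false;

  for (var i = 0; i < elements.length; i++) {
    var el = elements[i];
    var isListItem = el.type === 'LIST_ITEM';

    // Open or close the surrounding <ul> when list membership changes.
    if (isListItem && !inList) output.push('<ul>');
    if (!isListItem && inList) output.push('</ul>');
    inList = isListItem;

    if (isListItem) {
      output.push('<li>' + escapeHtml(el.text) + '</li>');
    } else if (el.type === 'HEADING') {
      output.push('<h2>' + escapeHtml(el.text) + '</h2>');
    } else {
      output.push('<p>' + escapeHtml(el.text) + '</p>');
    }
  }

  // Close a list that runs to the end of the document.
  if (inList) output.push('</ul>');

  // Same separator the main function uses when joining its output.
  return output.join('\r');
}
```

In the real script the element type would come from the Doc element itself, and images and nested lists need extra handling that this sketch leaves out.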

The processItem function generates the HTML for each Doc element. It is long because it handles paragraphs, text blocks, images, and lists, so it is best to read through the code to see how it works. Once the HTML is generated and the images are collected, the emailHtml function builds an HTML email with inline images and sends it to your Gmail account:

function emailHtml(html, images) {
  var inlineImages = {};
  for (var j=0; j<images.length; j++) {
    inlineImages[images[j].name] = images[j].blob;
  }

  var name = DocumentApp.getActiveDocument().getName()+".html";

  MailApp.sendEmail({
     to: Session.getActiveUser().getEmail(),
     subject: name,
     htmlBody: html,
     inlineImages: inlineImages
   });
}
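
For inline images to show up, MailApp expects the HTML body to reference each key of the inlineImages object with a cid: URL. The sketch below isolates that pairing. The helper names buildInlineImages and inlineImageTag are hypothetical, and the {name, blob} shape is taken from how emailHtml reads the images array:

```javascript
// Hypothetical helpers. `images` has the shape emailHtml reads:
// an array of {name, blob} objects.
function buildInlineImages(images) {
  var inlineImages = {};
  for (var j = 0; j < images.length; j++) {
    inlineImages[images[j].name] = images[j].blob;
  }
  return inlineImages;
}

// The HTML body refers to an inline image by its key, using a cid: URL.
function inlineImageTag(name) {
  return '<img src="cid:' + name + '" />';
}
```

If the name used in the img tag does not match a key in inlineImages, the image will not be displayed inline, so both should be derived from the same value.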

Enjoy!

Remember that the images in the email are inline images. If you copy and paste into WordPress, the images won’t be uploaded to WordPress automatically. You will have to download each image and upload it to WordPress manually. It’s a pain, but that’s a limitation of the WordPress editor.

Special thanks to this GitHub project, that gave me many ideas: https://github.com/mangini/gdocs2md

 

10 thoughts on “Google Docs to clean html, good for WordPress posts, emails”

  1. There was an unidentified function G above the processImage function on line 224 that needed to be removed. Just the letter G was sitting there undefined. As soon as I removed it from the script it worked perfectly.

    Although it doesn’t help with the step of having to save images and re-upload them, it does however email you the web version that you can save images from without having to open up the doc and save it as a html file in Word.

    If you have a WordPress Login you should search the Docs to WordPress Add-On available from within Docs.

  2. This code is (almost) perfect for our weekly email. We use Google Docs so multiple committee members can make changes and review the content before sending it out, but we needed a way to convert to HTML, keep the formatting, and keep the HTML simple (emails don’t like complicated HTML).

    I’ve made a slight change to the code, letting it turn secure (https://) pages into links. I would consider this a requirement given the demand for encrypted browsing nowadays.
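
The commenter's modification is not shown. A change along these lines could look like the following sketch, where linkifyIfUrl is a hypothetical helper name and the exact markup is an assumption:

```javascript
// Hypothetical helper, not the commenter's actual change.
// Wraps text in a link when it starts with http:// or https://.
function linkifyIfUrl(text) {
  var trimmed = text.trim();
  if (/^https?:\/\//.test(trimmed)) {
    return '<a href="' + trimmed + '">' + trimmed + '</a>';
  }
  // Anything else is returned unchanged.
  return text;
}
```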

    However, my only remaining problem is text that links. As an example, once converted it should be:

    Click here to register

    I have found the following method and tried to implement it, but I can’t seem to make it work in any way.

    getLinkUrl()
    Retrieves the link url.
    Return
    String — the link url, or null if the element contains multiple values for this attribute

    Would anyone have any suggestions on how this problem could be solved?

  3. It sounds like people would like to have links converted properly. Me too! So replace your processText() with this processText():

    function processText(item, output) {
      var text = item.getText();
      var indices = item.getTextAttributeIndices();

      if (indices.length <= 1) {
        // The whole text shares a single set of attributes.
        if (item.getLinkUrl()) {
          output.push('<a href="' + item.getLinkUrl() + '">' + text + '</a>');
        } else if (item.isBold()) {
          output.push('<b>' + text + '</b>');
        } else if (item.isItalic()) {
          // Assuming that a whole para fully italic is a quote
          output.push('<blockquote>' + text + '</blockquote>');
        } else if (text.trim().indexOf('http://') == 0) {
          output.push('<a href="' + text + '">' + text + '</a>');
        } else {
          output.push(text);
        }
      } else {
        // The text has several runs with different attributes.
        for (var i = 0; i < indices.length; i++) {
          var partAtts = item.getAttributes(indices[i]);
          var startPos = indices[i];
          var endPos = i + 1 < indices.length ? indices[i + 1] : text.length;
          var partText = text.substring(startPos, endPos);

          Logger.log(partText);

          if (partAtts.ITALIC) {
            output.push('<i>');
            Logger.log('index ' + i + ' startPos ' + startPos + ' endPos ' + endPos + ' ITALIC open');
          }
          if (partAtts.BOLD) {
            output.push('<b>');
            Logger.log('index ' + i + ' startPos ' + startPos + ' endPos ' + endPos + ' BOLD open');
          }
          if (item.getLinkUrl(startPos)) {
            output.push('<a href="' + item.getLinkUrl(startPos) + '">');
            Logger.log('index ' + i + ' startPos ' + startPos + ' endPos ' + endPos + ' A HREF open');
          } else if (partAtts.UNDERLINE) {
            // Links show up as underlined. I’ll assume that if we have a link, we don’t care if it is underlined.
            output.push('<u>');
            Logger.log('index ' + i + ' startPos ' + startPos + ' endPos ' + endPos + ' UNDERLINE open');
          }

          // If someone has written [xxx] and made this whole text some special font, like superscript
          // then treat it as a reference and make it superscript.
          // Unfortunately in Google Docs, there’s no way to detect superscript
          if (partText.indexOf('[') == 0 && partText[partText.length - 1] == ']') {
            output.push('<sup>' + partText + '</sup>');
          } else if (partText.trim().indexOf('http://') == 0) {
            output.push('<a href="' + partText + '">' + partText + '</a>');
          } else {
            output.push(partText);
          }

          // Close the tags in reverse order of opening so they nest properly.
          if (item.getLinkUrl(startPos)) {
            output.push('</a>');
            Logger.log('index ' + i + ' startPos ' + startPos + ' endPos ' + endPos + ' A HREF close');
          } else if (partAtts.UNDERLINE) {
            output.push('</u>');
          }
          if (partAtts.BOLD) {
            output.push('</b>');
          }
          if (partAtts.ITALIC) {
            output.push('</i>');
          }
        }
      }
    }

    It is a bit hacky, but works.
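
One detail worth getting right in this kind of run-by-run conversion is tag nesting: tags opened for a run should be closed in reverse order. The sketch below shows the idea on plain data. The runsToHtml name and the {text, bold, italic, linkUrl} run shape are assumptions standing in for what getTextAttributeIndices(), getAttributes() and getLinkUrl() report on a real Text element:

```javascript
// Hypothetical model: each run is {text, bold, italic, linkUrl},
// standing in for one attribute span of a Google Docs Text element.
function runsToHtml(runs) {
  var output = [];

  for (var i = 0; i < runs.length; i++) {
    var run = runs[i];
    var open = [];
    if (run.italic) open.push('i');
    if (run.bold) open.push('b');

    // Open the formatting tags, then the link as the innermost tag.
    for (var j = 0; j < open.length; j++) output.push('<' + open[j] + '>');
    if (run.linkUrl) output.push('<a href="' + run.linkUrl + '">');

    output.push(run.text);

    // Close in reverse order so the markup stays properly nested.
    if (run.linkUrl) output.push('</a>');
    for (var k = open.length - 1; k >= 0; k--) output.push('</' + open[k] + '>');
  }

  return output.join('');
}
```

For example, three runs with the text "Click ", a bold italic linked "here", and " to register" come out as one sentence with a properly nested link in the middle.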
