Extract text from MS Office

Some useful programs for extracting text:

  • catdoc – extract text from MS Word
  • xls2csv – extract text from MS Excel
  • pdftotext – extract text from PDF
  • ppthtml – extract text from MS PowerPoint (output is HTML)

A simple PHP function can then capture the output, e.g.:

function extractWord($word_file)
{
    if (file_exists($word_file))
    {
        // escapeshellarg() prevents malicious command execution
        exec('/usr/bin/catdoc -w ' . escapeshellarg($word_file), $output);

        // $output is an array corresponding to lines of output
        return join("\n", $output);
    }

    // the file does not exist
    return null;
}
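
The same approach can probably be extended to the other tools in the list. The following is only a rough sketch: the helper names buildExtractCommand() and extractText() are made up for illustration, the tools are assumed to be installed and on the PATH, and the file type is guessed from the extension alone. It also assumes that xls2csv and ppthtml print to standard output when given just a file name, and that pdftotext does so when "-" is passed as the output file.

```php
<?php
// Hypothetical helper: build the shell command that prints the text of
// $file on stdout, or return null if the extension is not supported.
function buildExtractCommand($file)
{
    // Assumption: these tools are installed and reachable via the PATH.
    $commands = array(
        'doc' => 'catdoc -w',   // -w: do not wrap long lines
        'xls' => 'xls2csv',
        'ppt' => 'ppthtml',     // emits HTML, stripped in extractText()
        'pdf' => 'pdftotext',
    );

    $ext = strtolower(pathinfo($file, PATHINFO_EXTENSION));
    if (!isset($commands[$ext])) {
        return null; // unsupported file type
    }

    // escapeshellarg() prevents malicious command execution
    $cmd = $commands[$ext] . ' ' . escapeshellarg($file);
    if ($ext === 'pdf') {
        $cmd .= ' -'; // "-" asks pdftotext to write to stdout
    }
    return $cmd;
}

// Hypothetical wrapper: run the command and return the text, or null
// if the file is missing, unsupported, or the tool reports an error.
function extractText($file)
{
    $cmd = buildExtractCommand($file);
    if ($cmd === null || !file_exists($file)) {
        return null;
    }

    $output = array();
    exec($cmd, $output, $status);
    if ($status !== 0) {
        return null; // the tool failed or is not installed
    }

    $text = join("\n", $output);
    if (strtolower(pathinfo($file, PATHINFO_EXTENSION)) === 'ppt') {
        // drop the HTML markup produced by ppthtml
        $text = html_entity_decode(strip_tags($text));
    }
    return $text;
}
```

A call such as extractText('/tmp/report.pdf') would then return the text of the PDF, or null if something went wrong. Keeping the command construction in its own function makes it easy to check what would be run without actually running it.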
