doc

Convert Word doc or docx files into text files?

谁说胖子不能爱 submitted on 2019-11-27 18:52:42
I need a way to convert .doc or .docx files to .txt without installing anything. I also don't want to open Word manually to do this; it should run automatically. I was thinking that either Perl or VBA could do the trick, but I can't find anything online for either. Any suggestions?

Note that an excellent source of information for Microsoft Office applications is the Object Browser. You can access it via Tools → Macro → Visual Basic Editor. Once you are in the editor, hit F2 to browse the interfaces, methods, and properties provided by Microsoft Office applications.

Ruby crawler template

戏子无情 submitted on 2019-11-27 18:15:09
    require 'restclient'
    require 'open-uri'
    require 'open_uri_redirections'
    require 'nokogiri'
    require 'json'
    require 'yaml'
    require 'fileutils'
    require 'base64'

    MAX_RETRY_TIMES = 5
    ROOT_DIR = '/home/zn/work/small-tools-master/zlk/tu/'
    BASE_URL = 'https://newceshiao.com/mnkc/tiku/?id='
    COOKIE = {:VerificationCodeNum => '1', :QZ_KSUser => 'UserID=15357507&UserName=ppkao1520606811&UserToken=cw05IVsvRbyxuPoQeQIU4%252bZNshdiFE%252fN6LGCVScB%252bnQLBUYAu7SA7A%253d%253d'}
    @cookie = 'VerificationCodeNum=1; PPKAO=PPKAOSTID%3D987%26PPKAOCEID%3D%26PPKAOSJID%3D%26UserName%3D%26EDays%3D'
    @agent = "Mozilla/5.0

Convert PDF to DOC (Python/Bash)

自作多情 submitted on 2019-11-27 12:56:47
I've seen some pages that let the user upload a PDF and return a DOC file, like PdfToWord. Is there any way to convert a PDF file to a DOC/DOCX file using Python or any Unix command? Thanks in advance.

If you have LibreOffice installed:

    lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

    import os
    import subprocess

    for top, dirs, files in os.walk('/my/pdf/folder'):
        for filename in files:
            if filename.endswith('.pdf'):
                abspath = os.path.join(top, filename)
                subprocess.call('lowriter --invisible --convert-to doc "{}"'
                                .format(abspath), shell=True)

This is
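The loop in the answer above can also be written without `shell=True`, so that file names containing quotes or spaces need no escaping. This is only a sketch under assumptions: the helper names `find_pdfs`, `build_command`, and `convert_all` are hypothetical, the case-insensitive extension match is an addition, and the `lowriter` flags are copied unchanged from the answer.

```python
import os
import subprocess


def find_pdfs(root):
    """Yield the absolute path of every .pdf file below root."""
    for top, _dirs, files in os.walk(root):
        for filename in files:
            # Assumption: also match upper-case extensions such as .PDF.
            if filename.lower().endswith(".pdf"):
                yield os.path.abspath(os.path.join(top, filename))


def build_command(pdf_path):
    """Return the lowriter call from the answer as an argument list."""
    return ["lowriter", "--invisible", "--convert-to", "doc", pdf_path]


def convert_all(root):
    """Convert each PDF below root; needs LibreOffice on the PATH."""
    for pdf_path in find_pdfs(root):
        # Passing a list bypasses the shell, so no quoting is required.
        subprocess.call(build_command(pdf_path))
```

Calling `convert_all('/my/pdf/folder')` would then correspond to the original loop.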

Upload DOC or PDF using PHP

為{幸葍}努か submitted on 2019-11-27 11:57:19
I'm able to upload images fine, but when I change the types from image/jpg, image/gif to application/msword and application/pdf, it doesn't work. Here's my code. The exact same code works for images, but for uploading docs and PDFs, it outputs "Invalid File." What's going on here? My file is only approx 30kb and is well under the file size limit here.

    $allowedExts = array("pdf", "doc", "docx");
    $extension = end(explode(".", $_FILES["file"]["name"]));
    if ( ( ($_FILES["file"]["type"] == "application/msword") || ($_FILES["file"]["type"] == "text/pdf") ) && ($_FILES["file"]["size"] < 20000) &&

Working with Word Bookmarks in Java (Part 1): Adding, Deleting, and Reading Bookmarks

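笑着哭i submitted on 2019-11-27 10:14:07
In Word, bookmarks are commonly used to find, locate, and mark specific characters or paragraphs, which is very useful in long documents. The following describes how to add and delete Word bookmarks from a Java program. The examples cover:

1. Adding bookmarks
1.1 Adding a bookmark to a specified paragraph
1.2 Adding a bookmark to a specified string
2. Deleting bookmarks
2.1 Deleting a bookmark
2.2 Deleting bookmark text
3. Reading bookmark text

Tool used: Free Spire.Doc for Java (free edition)

Getting and importing the Jar file:
Method 1: Download the jar package from the official website. After downloading, unzip it and import the Spire.Doc.jar file from the lib folder into the Java program. See the following import result:
Method 2: Install and import it through the Maven repository. See the installation and import instructions.

Java code examples

[Example 1] Add a bookmark to a specified paragraph

    import com.spire.doc.*;
    import com.spire.doc.documents.Paragraph;

    public class AppendBookmark {
        public static void main(String[]args){
            // Load the Word document that needs a bookmark
            Document doc = new Document();
            doc.loadFromFile("sample.docx");
            // Get the paragraph that needs a bookmark
            Paragraph para = doc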

Convert a .doc or .pdf to an image and display a thumbnail in Ruby?

孤街浪徒 submitted on 2019-11-27 09:48:00
Question
Convert a .doc or .pdf to an image and display a thumbnail in Ruby? Does anyone know how to generate document thumbnails in Ruby (or C, Python...)?

Answer 1:
A simple RMagick example to convert a PDF to a PNG would be:

    require 'RMagick'
    pdf = Magick::ImageList.new("doc.pdf")
    thumb = pdf.scale(300, 300)
    thumb.write "doc.png"

Converting an MS Word document won't be as easy. Your best option may be to first convert it to a PDF before generating the thumbnail. Your options for generating the PDF

Upload DOC or PDF using PHP

爷,独闯天下 submitted on 2019-11-27 04:03:24
Question
I'm able to upload images fine, but when I change the types from image/jpg, image/gif to application/msword and application/pdf, it doesn't work. Here's my code. The exact same code works for images, but for uploading docs and PDFs, it outputs "Invalid File." What's going on here? My file is only approx 30kb and is well under the file size limit here.

    $allowedExts = array("pdf", "doc", "docx");
    $extension = end(explode(".", $_FILES["file"]["name"]));
    if ( ( ($_FILES["file"]["type"] ==

How to extract just plain text from .doc & .docx files? [closed]

六眼飞鱼酱① submitted on 2019-11-27 02:50:51
Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions?

If you want the pure plain text (my requirement), then all you need is

    unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

which I found at command line fu. It unzips the docx file, gets the actual document, and then strips all the XML tags. Obviously all formatting is lost.

LibreOffice
One option is libreoffice/openoffice in headless mode (make sure all other instances of libreoffice are
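For readers without `unzip` and `sed` at hand, the same idea (read `word/document.xml` out of the .docx archive and drop every tag) can be sketched with the Python standard library. The function name `docx_to_text` is hypothetical, and the second substitution only approximates sed's `[^[:print:]]` class by removing ASCII control characters. As with the one-liner, paragraphs run together and XML entities such as `&amp;` are left undecoded.

```python
import re
import zipfile


def docx_to_text(path):
    """Return the text of a .docx file with all XML tags removed."""
    # A .docx file is a zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Counterpart of the first sed expression: delete every <...> tag.
    text = re.sub(r"<[^>]+>", "", xml)
    # Approximation of the second expression: delete control characters.
    return re.sub(r"[\x00-\x1f\x7f]+", "", text)
```

A call such as `docx_to_text("some.docx")` returns one string that can then be written to a .txt file.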

Fixing the pyhive client error when connecting to Hive: Could not start SASL

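落爺英雄遲暮 submitted on 2019-11-27 02:26:21
Everything seemed ready: the image already had the following main pip packages installed.

    pyhive configparser pandas hdfs thrift sqlparse sasl thrift-sasl

However, when actually using the pyhive client to connect to the Hive server, the following error was still raised:

    thrift.transport.TTransport.TTransportException: Could not start SASL: b'Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found'

This was a serious problem. According to the CentOS fix found online, installing the following packages resolves it:

    yum install cyrus-sasl-plain cyrus-sasl-devel cyrus-sasl-gssapi

But my image is Ubuntu, because the official TensorFlow image is Ubuntu 18.04, so that route was not an option. I also tried some Ubuntu approaches found online, such as installing sasl2-bin and similar packages, but none of them solved the problem. In the end, I went straight to http://www.linuxfromscratch.org/blfs/view/cvs/postlfs/cyrus-sasl.html and installed Cyrus SASL-2.1.27 from source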

ElasticSearch (8): The Official ES Tuning Guide

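走远了吗. submitted on 2019-11-27 02:18:28
Part 1: Tuning for indexing speed
Part 2: Tuning for search speed
Part 3: Some general recommendations

Original: https://www.elastic.co/guide/en/elasticsearch/reference/current/how-to.html

The defaults that ES ships with give a good out-of-the-box experience. Full-text search, highlighting, aggregations, and document indexing all work without any changes by the user. Once you know more clearly how you want to use ES, you can make many optimizations to improve performance for your use case. The following tells you which settings you should and should not change.

Part 1: Tuning for indexing speed ( https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html)

Use bulk requests
Bulk requests yield much better performance than single-document index requests. To find the optimal size of a bulk request, run a benchmark on a single node with a single shard. First try indexing 100 documents, then 200, then 400, and so on. When the indexing speed starts to plateau, you have reached the optimal bulk request size for your data. In case of a tie, it is better to err in the direction of too few rather than too many documents. Note that bulk requests that are too large may put the cluster under memory pressure, so it is advisable to avoid going beyond a few tens of megabytes per request, even if larger requests seem to perform better.

Send data to ES from multiple workers/threads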
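The batch-size benchmark described above (100 documents, then 200, then 400, until throughput stops improving) can be sketched as follows. Everything here is an assumption made for illustration: `find_bulk_size`, the `index_batch` callable that sends one bulk request, and the 5% `tolerance` used to decide that the speed has plateaued are hypothetical and not part of any Elasticsearch client.

```python
import time


def find_bulk_size(index_batch, docs, start=100, tolerance=0.05,
                   clock=time.perf_counter):
    """Double the batch size until documents/second stops improving.

    index_batch: hypothetical callable that sends one bulk request
                 containing the given list of documents.
    Returns the largest size that still improved throughput by more
    than `tolerance`, or None if docs holds fewer than `start` items.
    """
    best_size, best_rate = None, 0.0
    size = start
    while size <= len(docs):
        started = clock()
        index_batch(docs[:size])
        elapsed = clock() - started
        rate = size / elapsed if elapsed > 0 else float("inf")
        if best_size is not None and rate <= best_rate * (1 + tolerance):
            break  # plateau reached: keep the smaller batch size
        best_size, best_rate = size, rate
        size *= 2
    return best_size
```

Stopping at the smaller size when two sizes perform about equally follows the advice to err toward too few documents rather than too many. A real run would also need to respect the per-request size limit mentioned above.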