text-processing

Optimizing MySQL Import (Converting a Verbose SQL Dump to a Speedy One / use extended-inserts)

这一生的挚爱 submitted on 2019-12-07 14:33:51
Question: We are using mysqldump with the options --complete-insert --skip-extended-insert to create database dumps that are kept in VCS. We use these options (and the VCS) to be able to easily compare different database versions. Importing such a dump now takes quite a while because there is - of course - one insert per database row. Is there an easy way to convert such a verbose dump to one with a single insert per table? Does anyone perhaps already have such a script at hand? Answer 1: I…
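The answer is cut off above, so here is a minimal sketch of the idea in Python, assuming the dump contains one INSERT INTO ... VALUES (...); statement per line (which --complete-insert --skip-extended-insert produces). The script name merge_inserts.py and the batch size are illustrative, and the regex will not cope with a literal ");" inside quoted string values:

    import re
    import sys

    # One statement per line, as produced by
    # mysqldump --complete-insert --skip-extended-insert.
    INSERT_RE = re.compile(r"^(INSERT INTO .+? VALUES )(\(.*\));$")

    def merge_inserts(lines, batch_size=1000):
        """Merge runs of single-row INSERTs that share a prefix into
        extended INSERTs of at most batch_size rows."""
        prefix, rows = None, []
        for line in lines:
            m = INSERT_RE.match(line.rstrip("\n"))
            if m and m.group(1) == prefix and len(rows) < batch_size:
                rows.append(m.group(2))
                continue
            if rows:  # flush the batch collected so far
                yield prefix + ",\n".join(rows) + ";\n"
            if m:
                prefix, rows = m.group(1), [m.group(2)]
            else:
                prefix, rows = None, []
                yield line
        if rows:
            yield prefix + ",\n".join(rows) + ";\n"

    if __name__ == "__main__":
        sys.stdout.writelines(merge_inserts(sys.stdin))

Usage would be something like python merge_inserts.py < verbose.sql > fast.sql; the verbose dump stays in VCS for diffing, and only the import uses the merged form.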

Bash routine to return the page number of a given line number from a text file

青春壹個敷衍的年華 submitted on 2019-12-07 12:31:47
Question: Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'): alpha\n beta\n gamma\n\f one\n two\n three\n four\n five\n\f earth\n wind\n fire\n water\n\f Note that each page has a random number of lines. I need a bash routine that returns the page number of a given line number from a text file containing the page-breaking ASCII control character. After a long time researching a solution I finally came across this piece of code: function get_page_from_line {…
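The quoted bash function is truncated, but the underlying logic is just counting how many form feeds occur before the target line. A minimal Python sketch of that, assuming each \f sits at the end of the last line of a page as in the sample (the filename pages.txt is a placeholder):

    def page_of_line(path, target_line):
        """Return the 1-based page number containing target_line; every
        form feed seen on an earlier line starts a new page."""
        page = 1
        with open(path) as fh:
            for lineno, line in enumerate(fh, start=1):
                if lineno == target_line:
                    return page
                page += line.count("\f")
        raise ValueError("line %d is past the end of the file" % target_line)

    print(page_of_line("pages.txt", 4))  # 2 for the sample file above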

Algorithm for generating a 'top list' using word frequency

倖福魔咒の submitted on 2019-12-07 08:53:39
Question: I have a big collection of human-generated content. I want to find the words or phrases that occur most often. What is an efficient way to do this? Answer 1: Don't reinvent the wheel. Use a full-text search engine such as Lucene. Answer 2: The simple/naive way is to use a hashtable. Walk through the words and increment the count as you go. At the end of the process, sort the key/value pairs by count. Answer 3: The basic idea is simple -- in executable pseudocode: from collections import defaultdict def…
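Answer 3 is cut off at the import; a self-contained version of the same hashtable idea, using Python's collections.Counter so the sorting-by-count step comes for free:

    from collections import Counter
    import re

    def top_words(text, n=10):
        """Return the n most frequent words in text, case-folded."""
        words = re.findall(r"[a-z']+", text.lower())
        return Counter(words).most_common(n)

    print(top_words("the cat sat on the mat and the cat slept", 3))
    # [('the', 3), ('cat', 2), ('sat', 1)]

For phrases rather than single words, the same approach works on n-grams (tuples of consecutive words) instead of individual tokens.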

Exploding UpperCasedCamelCase to Upper Cased Camel Case in PHP

北战南征 submitted on 2019-12-07 06:49:37
Question: Right now, I am implementing this with a split, slice, and implosion: $exploded = implode(' ',array_slice(preg_split('/(?=[A-Z])/','ThisIsATest'),1)); //$exploded = "This Is A Test" Prettier version: $capital_split = preg_split('/(?=[A-Z])/','ThisIsATest'); $blank_first_ignored = array_slice($capital_split,1); $exploded = implode(' ',$blank_first_ignored); However, the problem is with input like 'SometimesPDFFilesHappen', which my implementation would (incorrectly) interpret as…
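One common fix for acronym runs like 'PDF' is to split only at a lowercase-to-uppercase boundary, or before the last capital of an uppercase run that is followed by a lowercase letter. A sketch in Python; the same pattern should also work in PHP's preg_replace, since PCRE supports these lookarounds:

    import re

    # Split at a lower->upper boundary, or before the last capital of an
    # acronym run when it is followed by a lowercase letter.
    def explode_camel(s):
        return re.sub(r"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", s)

    print(explode_camel("ThisIsATest"))             # This Is A Test
    print(explode_camel("SometimesPDFFilesHappen")) # Sometimes PDF Files Happen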

How to remove YAML frontmatter from markdown files?

≯℡__Kan透↙ submitted on 2019-12-07 04:10:13
Question: I have markdown files that contain YAML frontmatter metadata, like this: --- title: Something Somethingelse author: Somebody Sometheson --- But the YAML is of varying widths. Can I use a POSIX command like sed to remove that frontmatter when it's at the beginning of a file? Something that just removes everything between --- and ---, inclusive, but also ignores the rest of the file, in case there are ---s elsewhere. Answer 1: I understand your question to mean that you want to remove the first --…
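The answer above is truncated; as a language-agnostic illustration of the same rule, here is a small Python sketch: strip the block only when the very first line is ---, stop at the first closing ---, and leave any later delimiters alone:

    def strip_frontmatter(text):
        """Remove a leading YAML frontmatter block delimited by ---
        lines; any later --- in the body is left untouched."""
        lines = text.splitlines(keepends=True)
        if not lines or lines[0].strip() != "---":
            return text          # no frontmatter at the top
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                return "".join(lines[i + 1:])
        return text              # unterminated block: leave the file alone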

How to select multiple lines from a file or from pipe in a script?

孤街醉人 submitted on 2019-12-07 01:03:13
Question: I'd like to have a script, called lines.sh, that I can pipe data to in order to select a series of lines. For example, if I had the following file: test.txt a b c d Then I could run: cat test.txt | lines 2,4 and it would output b d I'm using zsh, but would prefer a bash solution if possible. Answer 1: You can use this awk: awk -v s='2,4' 'BEGIN{split(s, a, ","); for (i in a) b[a[i]]} NR in b' file b d Via a separate script lines.sh: #!/bin/bash awk -v s="$1" 'BEGIN{split(s, a, ","); for (i in a) b[a[i…
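The awk script is cut off above; the equivalent logic as a short Python filter (lines.py is an illustrative name) reads the wanted line numbers from the first argument and passes matching stdin lines through:

    #!/usr/bin/env python3
    import sys

    # Parse the requested line numbers, e.g. "2,4" -> {2, 4}.
    wanted = {int(n) for n in sys.argv[1].split(",")}

    for lineno, line in enumerate(sys.stdin, start=1):
        if lineno in wanted:
            sys.stdout.write(line)

Running cat test.txt | python lines.py 2,4 would then print b and d.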

Count the number of unique words and occurrences of each word in a txt file

两盒软妹~` submitted on 2019-12-06 16:29:46
Question: Currently I am trying to create an application that does some text processing: it reads in a text file, then uses a dictionary to build an index of words. Technically it works like this: the program runs, reads a text file, and checks each word to see whether it has already been seen in that file and what its ID as a unique word is. It then prints out the index number and the total number of appearances for each word it meets, continues checking the entire file, and produces something like…
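Since the excerpt ends before showing the desired output, here is a minimal sketch of one way to do it in Python, assuming the goal is a word -> (id, count) index where ids follow order of first appearance (the filename input.txt is a placeholder):

    def index_words(path):
        """Map each distinct word to [id, count]; ids are assigned in
        order of first appearance."""
        index = {}
        with open(path) as fh:
            for line in fh:
                for word in line.lower().split():
                    if word not in index:
                        index[word] = [len(index) + 1, 0]
                    index[word][1] += 1
        return index

    for word, (wid, count) in index_words("input.txt").items():
        print(wid, word, count)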

How to use os.walk to list only text files

≡放荡痞女 submitted on 2019-12-06 15:37:15
This question was similar in addressing hidden filetypes. I am struggling with a similar problem because I need to process only text-containing files in folders that hold many different filetypes: pictures, text, music. I am using os.walk, which lists EVERYTHING, including files without an extension, like Icon files. I am using Linux and would be satisfied to filter for only .txt files. One way is to check the filename extension, and this post explains nicely how that's done. But this still leaves mislabeled files or files without an extension. There are hex values that uniquely identify filetypes…
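A hedged sketch combining both ideas: accept .txt extensions outright, and for everything else apply a cheap content heuristic (no NUL bytes, decodes as UTF-8), which is a rough stand-in for the magic-number checks the question alludes to. For a real filetype database, the python-magic package wraps libmagic:

    import os

    def looks_like_text(path, blocksize=512):
        """Cheap heuristic: treat a file as text if its first block
        contains no NUL bytes and decodes as UTF-8."""
        try:
            with open(path, "rb") as fh:
                block = fh.read(blocksize)
            if b"\x00" in block:
                return False
            block.decode("utf-8")  # raises UnicodeDecodeError on binary data
            return True
        except (OSError, UnicodeDecodeError):
            return False

    for root, _dirs, files in os.walk("."):
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(".txt") or looks_like_text(path):
                print(path)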

How to work with data from NBA.com?

流过昼夜 submitted on 2019-12-06 11:18:16
I found Greg Reda's blog post about scraping HTML from nba.com: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/ I tried to work with the code he wrote there:

    import requests
    import json

    url = 'http://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFr' + \
          'om=&DateTo=&DistanceRange=By+Zone&Division=&GameScope=&GameSegment=&LastN' + \
          'Games=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID' + \
          '=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperien' + \
          'ce=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2014-15&SeasonSegment=…
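The URL in the excerpt is truncated mid-query-string. Rather than guessing the remaining parameters, here is a sketch of the same request with the parameters that are visible above passed as a dict; stats.nba.com is also known to reject clients that don't send a browser-like User-Agent, so one is set explicitly:

    import requests

    url = "http://stats.nba.com/stats/leaguedashteamshotlocations"
    # Only the parameters visible in the excerpt; the full query string
    # contains many more (mostly empty) parameters, and the endpoint may
    # insist on receiving all of them.
    params = {
        "DistanceRange": "By Zone",
        "LeagueID": "00",
        "MeasureType": "Opponent",
        "PerMode": "PerGame",
        "Season": "2014-15",
    }
    headers = {"User-Agent": "Mozilla/5.0"}

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    data = response.json()
    print(data.keys())  # typically 'resource', 'parameters', 'resultSets'

The JSON typically nests the tabular data under a resultSets key, with the column headers listed separately from the row values, which is the structure the blog post walks through.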

Creating a simple searching program

◇◆丶佛笑我妖孽 submitted on 2019-12-06 09:47:58
Decided to delete and ask again; it was just easier! Please don't vote down, as I have taken on board what people have been saying. I have two nested dictionaries: wordFrequency = {'bit':{1:3,2:4,3:19,4:0},'red':{1:0,2:0,3:15,4:0},'dog':{1:3,2:0,3:4,4:5}} search = {1:{'bit':1},2:{'red':1,'dog':1},3:{'bit':2,'red':3}} The first dictionary links words to a file number and the number of times they appear in that file. The second contains searches, linking each word to the number of times it appears in the current search. I want to extract certain values so that for each search I can calculate the scalar…
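The question breaks off at "calculate the scalar", presumably the scalar (dot) product of each search vector with each file's word-count vector. A sketch of that computation using the two dictionaries as given:

    wordFrequency = {'bit': {1: 3, 2: 4, 3: 19, 4: 0},
                     'red': {1: 0, 2: 0, 3: 15, 4: 0},
                     'dog': {1: 3, 2: 0, 3: 4, 4: 5}}
    search = {1: {'bit': 1}, 2: {'red': 1, 'dog': 1}, 3: {'bit': 2, 'red': 3}}

    def scores(terms, freq):
        """Dot product of a search's term weights with each file's counts."""
        files = {f for counts in freq.values() for f in counts}
        return {f: sum(w * freq[word][f] for word, w in terms.items())
                for f in sorted(files)}

    for sid, terms in search.items():
        print(sid, scores(terms, wordFrequency))
    # 1 {1: 3, 2: 4, 3: 19, 4: 0}
    # 2 {1: 3, 2: 0, 3: 19, 4: 5}
    # 3 {1: 6, 2: 8, 3: 83, 4: 0}

Under that assumption, the file with the highest score for a given search is its best match.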