Are you me? Wish I had known the insertion order trick, though it isn't straightforward to implement with the stack I was using at a previous gig (Tabula + naive parsing + Pandas data munging). I can expand on a few challenges I've run into when parsing PDFs:
# Parser drift and maintenance hell
Let's say that you receive 100 invoices a month from a company over the course of 3 months. You look over a handful of examples, pick features that appear to be invariant, and determine your parsing approach. You build your parser. You're associating charges from tables with the sections they're declared in, and possibly doing some kind of classification to make sure everything adds up right. It works for the one or two example pdfs you were building against. It goes live.
You get a call or a bug report: it's not working. You try the new pdf they send you. It looks similar, but won't parse because it is--in fact--subtly different. The phone number on the cover page is formatted slightly differently, but everything else is identical. You change things to account for that. You retest your examples; they break. Ok, two different formats, same month, same supplier. You fix it. Chekhov's Gun has been planted.
A month passes, it breaks. You inspect the offending pdf. Someone racked up enough charges they no longer fit on a page. You alter the parser to check the next page. Sometimes their name appears again, sometimes not, sometimes their next page is 300 pages away. It works again.
A few more months later, a sense of deja-vu starts to set in. Didn't I fix this already? You start tracking three pdfs across three months:
pdf 1 : a -> b -> c (starts with format a, changes to match pdf 2, then changes again)
pdf 2 : b -> b -> c (starts with format b, stays the same, then changes the same way as pdf 1)
pdf 3 : b -> a -> b (starts with format b like pdf 2, switches to pdf 1's original format, then switches back)
What's the common factor between these version changes? The return address determines the version.
PDFs differ slightly from office to office, with templates drifting a little each month in diverging directions. You have to start re-evaluating parsing choices and splitting up parsers. It's hard to justify incurring a linear maintenance cost for each new supplier, even amortized over a sizeable period of time. My arch-nemesis was an intern who got put to work fixing the invoices at one office of one foreign supplier.
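The "return address determines the version" discovery above ends up as a dispatch table. Here's a minimal sketch of that shape; every name, address, and field in it is hypothetical, not from any real system:

```python
# Hypothetical sketch: route each invoice to a parser variant keyed on the
# return address found on the cover page. The parser functions and office
# addresses below are illustrative placeholders.

def parse_format_a(body_text: str) -> dict:
    # Placeholder for the format-a parsing logic.
    return {"format": "a", "total": 0.0}

def parse_format_b(body_text: str) -> dict:
    # Placeholder for the format-b parsing logic.
    return {"format": "b", "total": 0.0}

# One entry per office; each office's template drifts independently.
PARSER_BY_OFFICE = {
    "100 Main St, Springfield": parse_format_a,
    "200 Oak Ave, Shelbyville": parse_format_b,
}

def parse_invoice(cover_text: str, body_text: str) -> dict:
    for office, parser in PARSER_BY_OFFICE.items():
        if office in cover_text:
            return parser(body_text)
    # Unknown office: fail loudly rather than mis-parse silently.
    raise ValueError("no parser registered for this return address")
```

The table grows an entry (and a maintenance burden) per office, which is exactly the linear cost mentioned above.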
# PDFs that aren't standards compliant
In this case, most pdf processing libraries will bail out. Pdf viewers, on the other hand, will silently ignore some corrupted or malformed data. I remember one that was consistently off by a single bit. Something like `\setfont !2` needed '!' swapped out for another syntactically valid character, in a way that left the pdf's byte offsets unchanged.
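A repair pass for that kind of corruption has one hard constraint: the replacement must be the same length, or every byte offset in the file's cross-reference table goes stale. A minimal sketch (the `!2` operand is the example from above, and the `02` replacement is a made-up stand-in for whatever the valid byte was):

```python
# Sketch: patch a known-bad byte sequence in a raw PDF before handing it
# to a parsing library, preserving the file length so byte offsets stay valid.

def patch_bad_bytes(data: bytes, bad: bytes, good: bytes) -> bytes:
    # Same-length replacement is the whole point: offsets must not move.
    assert len(bad) == len(good), "replacement must preserve byte offsets"
    return data.replace(bad, good)
```

Usage would look like `patched = patch_bad_bytes(raw_pdf, b"!2", b"02")`, applied before the parsing library ever sees the file.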
TL;DR: If you can push back, push back. Take your data in any format other than PDF if at all possible.
I'm not an expert in this area, so I'd love it if someone corrects me. My understanding is that full first-order logic (FOL) is generally infeasible; even propositional logic can be computationally difficult [1]. My understanding is that most of the semantic web stuff is done using a description logic of some flavor, named after the properties of the logic. The important thing is that they are generally decidable, so you can use something like MALET or some other solver to infer things from your database or ontology (you give up some expressivity for decidability). Not sure how much is going on with that these days. I played with a petrology ontology in Protégé back in college, but haven't followed the space since. I remember OWL being important, but can't remember why at the moment.
[1] For example, deciding whether a formula is satisfiable. You can certainly do this with truth tables; the catch is that you're looking at 2^n complexity, where n is the number of propositions in your formula.
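The truth-table approach in [1] is short enough to write out; this sketch just enumerates all 2^n assignments, which is exactly why it blows up:

```python
from itertools import product

# Brute-force satisfiability by truth table: try every assignment of
# True/False to the n propositions. Runs in O(2^n) time.
def satisfiable(formula, n: int) -> bool:
    return any(formula(*assignment)
               for assignment in product([False, True], repeat=n))
```

For instance, `p or q` is satisfiable, while the contradiction `p and not p` is not; each extra proposition doubles the number of rows checked.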
As someone who is currently picking up Clojure, that chapter just sold me on the book. I really liked the sign/referent/sense distinction and discussion. I knew it from philosophy/semiotics, but every other place I come across it, the discussion instantly goes down another rabbit trail. I'm sure the balance you strike will vary language to language, but it really was useful seeing it discussed in terms of naming in software.
Haven't explored internet options up there. But the UP of Michigan is beautiful country, and it's been far too long since I've had a pasty. It doesn't have the scale of Washington or the weather of California, but it is so much beautiful, unspoiled wilderness. It's really hard to describe. Honestly, the best show to capture it is Joe Pera Talks with You.
Grew up in the UP. Of course it depends on location, but DSL has been available for about 10 years. It helps to be close to a trunk. The phone company said they will be running fiber to all their existing customers in the next two years. I think there might be subsidies because it is so remote.
Yes, I occasionally spend time at our cabin north of Iron River. Cell coverage is not great except near the "bigger" towns like Houghton (Michigan Tech) and Marquette (Northern Michigan). But I love the UP and its remote northwoods wilderness.
On a side note: I haven't been able to find full episodes anywhere besides Delta's in-flight entertainment. Confusingly, it's not listed under Adult Swim, only in the All Titles section.
On a shin-splints-related side note, something that's not intuitive for non-runners: if possible, take the minute-or-so gait analysis at the shop to find out how your feet roll while running. If they roll too much inward (pronate), the appropriate shoe has more support in that area. If you roll too much outward (supinate), they add it on the outside. If you don't roll too much, shoes for either of these can cause shin splints or stress fractures in the medium to long term (still preferable to inactivity). Most of the time this is the key thing to consider when buying running shoes.
I'd be very skeptical of any analysis that takes one minute at a shoe shop. I'd wager those machines are there to anchor, so to speak, a buying decision.
Too much running before the body gets strong enough to handle it, along with hard surfaces, is more likely the culprit behind shin splints than a poor choice of shoe.
You don't need a machine; you need someone to observe your feet as you run. Then, based on your roll, buy shoes that support your feet. This is not rocket science: you need new running shoes every year or two. Protect your shins and get proper shoes; if it takes a minute to analyze on a treadmill ("machine"), then take the minute. If you have a friend, have them take a minute.
If you're interested at all, here's a great interview he did in the 60s or 70s. Wish something of this level were on public access these days. Not sure how accessible it is, but it really makes me wish a PhD in philosophy made sense (and was affordable) to do.
Taking a wild guess, but I would figure it's about distinguishing statements about truth values from statements about sets. If you think of ⊃ as a function, it takes in two booleans and returns a boolean. ⊂, likewise, takes two sets and returns a boolean. I've never really been a fan of ⊃, and have always personally preferred an arrow.
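The type distinction can be made concrete with two small function signatures; this is just an illustration of the guess above, reading ⊂ non-strictly:

```python
# Material implication ("p ⊃ q" in logic notation):
# two truth values in, one truth value out.
def implies(p: bool, q: bool) -> bool:
    # False only when p is true and q is false.
    return (not p) or q

# Subset ("a ⊂ b", read non-strictly here):
# two *sets* in, one truth value out.
def subset(a: set, b: set) -> bool:
    # Every element of a is also in b.
    return a.issubset(b)
```

Same symbol shape, entirely different argument types, which is presumably why overloading it bothers people.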
I dislike this kind of typesetting, the sans-serif font is nearly unreadable, and the math displays are just slightly smaller than the regular text, with the same font. WHY? I would love to have access to the .tex source to compile it in a saner style.
The drawings are excellent, though. I'm enjoying the ones on Morse theory.
Not a fan either. It commits the same sin as the Computer Modern typically used in TeX, but to a worse degree: far too light for easy reading. IIRC its name is either Iwona or Kurier. For free fonts for typesetting in TeX, I would recommend one of the Times-based fonts.
I personally typeset in EB Garamond with a very new free Unicode math font called Garamond Math, but the setup is quite elaborate.