Discussion:
JMDICT and mySQL
DawnF
2004-09-29 18:27:42 UTC
Hi,

Has anyone tried parsing the JMDICT file and loading it into a set of
tables in MySQL using PHP?

If not, are there any tools worth a look for parsing the XML file into
the database?

thanks,

Dawn F.

www.nihon2go.com
Srin・Tuar
2004-09-29 22:40:47 UTC
Post by DawnF
Hi,
Anyone tried parsing the JMDICT file and uploading them into a set of
tables in mySQL using PHP?
If not, are there some tools that might be worth giving a look to
parse the XML file into the database,
I would suggest using postgres instead; it's better all around.

The question is what are you going to do with it once you've got
it in the database? I find it much more convenient to convert it
to a flatfile for use with grep.

Even better might be asking Jim if he'd be willing to disclose the
original database schema from which jmdict is generated, and maybe
a set of CSV files with the table data...
j***@hotmail.com
2004-09-30 01:38:16 UTC
Post by Srin・Tuar
Even better might be asking Jim if he'd be willing to disclose the
original database schema from which jmdict is generated, and maybe
a set of CSV files with the table data...
Ask and ye shall not receive. Sorry, but having copped grumbles for
years about releasing things in the "non-standard" EDICT format, I
put everything into XML, which dogs bark about as being the most
standard file distribution format in the galaxy. And that, dear
friends, is the only format you are getting from me, apart from the
legacy EDICT one.

I choose not to maintain my files in XML for reasons that are perfectly
valid for me; however, the format I use internally is about as non-standard
as can be. I cannot imagine why anyone else would want to touch it, and
it would take a megabuck or two to induce me to document it and release it
(and endure the whining about it being non-standard.....)
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学
DawnF
2004-09-30 09:02:32 UTC
Post by Srin・Tuar
I would suggest using postgres instead; it's better all around.
- What are the advantages of using postgres then? I've used EDICT in a
MySQL table and it works quite well. Perhaps it's just easier to search
for additional data in the JMDICT file and add it to the relevant EDICT
entries in the database. I know there are some good classes for
handling XML and importing it into MySQL, but it's (more and more) a
jungle out there (even with googling). No matter what, I'm using lots
of joins and sorts, so working on just the flat file would be a bit
awkward for me.

- I do understand there are some good reasons for using XML, but it's
going to take me quite a few holidays and weekends to get into it.

So if anyone is working on JMDICT and has hints or tips, I can't bribe
you, but I'd appreciate it ;-)

cheers,
Dawn F
Post by Srin・Tuar
The question is what are you going to do with it once you've got
it in the database? I find it much more convenient to convert it
to a flatfile for use with grep.
Even better might be asking Jim if he'd be willing to disclose the
original database schema from which jmdict is generated, and maybe
a set of CSV files with the table data...
Srin・Tuar
2004-09-30 23:01:25 UTC
Post by DawnF
Post by Srin・Tuar
I would suggest using postgres instead; it's better all around.
- What are the advantage of using postgres then?
You do use SQL, right? The difference should be scarily obvious.
Post by DawnF
- I do understand there are some good reasons for using XML, but it's
going to take me quite some holidays and weekends to get into it
So if anyone is working on JMDICT and has hints or tips, i can't bribe
you, but i'd appreciate it ;-)
Maybe. A small C or perl program using expat should be more than enough.
Just come up with a properly normalized schema that captures all the
aspects of the jmdict that you care about and it could be maybe 30mins
worth of effort to write the converter.

If you're not familiar with XML parsing, I would recommend picking it up
on general principles.
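
A minimal sketch of the expat route suggested above, written in Python rather than C or perl (an assumption made for illustration; Python's xml.parsers.expat module wraps the same expat library). It keeps only ent_seq, keb, reb and gloss. The function name parse_entries and the sample entry are invented here and are not anyone's actual converter.

```python
import xml.parsers.expat

# Hypothetical one-entry sample using JMdict element names; the sequence
# number is made up for illustration.
SAMPLE = """<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<k_ele><keb>明白</keb></k_ele>
<r_ele><reb>めいはく</reb></r_ele>
<sense><gloss>obvious</gloss><gloss>clear</gloss></sense>
</entry>
</JMdict>"""

FIELDS = {"keb": "kebs", "reb": "rebs", "gloss": "glosses"}


def parse_entries(xml_bytes):
    """Return one dict per <entry>, built from expat callbacks."""
    entries = []
    current = None   # the entry being assembled, or None outside <entry>
    text = []        # character data seen since the last start/end tag

    def start(name, attrs):
        nonlocal current
        if name == "entry":
            current = {"seq": None, "kebs": [], "rebs": [], "glosses": []}
        text.clear()

    def end(name):
        nonlocal current
        if current is None:
            return
        value = "".join(text).strip()
        if name == "ent_seq":
            current["seq"] = int(value)
        elif name in FIELDS:
            current[FIELDS[name]].append(value)
        elif name == "entry":
            entries.append(current)
            current = None
        text.clear()

    def chars(data):
        # expat may deliver one text node in several pieces, so accumulate.
        text.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_bytes, True)
    return entries


entries = parse_entries(SAMPLE.encode("utf-8"))
```

For the full file, the parser object's ParseFile method takes a file opened in binary mode, so the input never has to be read into one string. Each dict could then be turned into INSERT statements for whatever schema is chosen.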
j***@hotmail.com
2004-10-01 00:26:34 UTC
Post by Srin・Tuar
Post by DawnF
So if anyone is working on JMDICT and has hints or tips, i can't bribe
you, but i'd appreciate it ;-)
Maybe. A small C or perl program using expat should be more than enough.
Just come up with a properly normalized schema that captures all the
aspects of the jmdict that you care about and it could be maybe 30mins
worth of effort to write the converter.
Quite likely expat would be able to parse it OK, although I think it
would take some fiddly programming for someone not very familiar
with XML and expat.

My XMLish colleagues swear by XSLT for these things, so I thought I'd
try it to generate flat EDICT-ish extracts. Sadly, like a lot of
XML utilities, libxslt loads the whole file into an inefficient memory
structure. The result was that on my Linux box with 256M of RAM and 512M
of swap, I was out of memory in seconds. (With JMdict being "only" 36M,
this is rather serious inflation. I'm told the problem is just as
severe with Windows-based XML utilities.)

The trick in this case is to do it in a script that extracts one entry at
a time from the JMdict file, then hits it with the XSLT stuff. I gave up
before trying this step.

(Maybe you can see why I don't edit JMdict in native XML.
I can *just* get xmllint to work over JMdict without blowing my memory,
but that's about all.)
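
The "one entry at a time" trick can be sketched roughly as follows. This is only the extraction half, in Python with the standard library's ElementTree.iterparse (an assumption; the post does not say which language the script would use). The XSLT step is left out, and iter_entries and the placeholder sample are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

# Placeholder data in the KEB1/REB1 style; not real dictionary content.
SAMPLE = b"""<JMdict>
<entry><ent_seq>1</ent_seq><k_ele><keb>KEB1</keb></k_ele></entry>
<entry><ent_seq>2</ent_seq><r_ele><reb>REB1</reb></r_ele></entry>
</JMdict>"""


def iter_entries(source):
    """Yield each <entry> as a standalone XML string, then discard it."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # the first event is the root's start tag
    for event, elem in context:
        if event == "end" and elem.tag == "entry":
            yield ET.tostring(elem, encoding="unicode")
            # Drop the finished entry from the tree so memory use stays
            # roughly constant instead of growing with the file.
            root.clear()


chunks = list(iter_entries(io.BytesIO(SAMPLE)))
```

Each yielded string is a small, self-contained document fragment that could be handed to an XSLT processor or any other per-entry tool.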
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学
Josh Thames
2004-09-30 15:01:06 UTC
Post by DawnF
If not, are there some tools that might be worth giving a look to
parse the XML file into the database,
Here's one fairly painless route you could take:

Excel (and I'm sure the (Star|Open)Office equivalent) will let you load
XML files and then save as CSV. As an added bonus you can massage the
data while you're there.

Install phpMyAdmin (phpmyadmin.net) on the server if you haven't done so
already. It has a fairly flexible tool for importing CSV files into the
database.
DawnF
2004-09-30 21:53:54 UTC
Josh,
it's a good idea, but then I'm still stuck with the Excel limit of
about 65,000 rows (65,536 per sheet), so that's not great either...

Dawn F
Post by Josh Thames
Post by DawnF
If not, are there some tools that might be worth giving a look to
parse the XML file into the database,
Excel (and I'm sure the (Star|Open)Office equivalent) will let you load
XML files and then save as CSV. As an added bonus you can massage the
data while you're there.
Install phpmyadmin (phpmyadmin.net) on the server if you haven't done so
already. It has a fairly flexible tool for importing CSV files into the
database.
necoandjeff
2004-09-30 22:56:44 UTC
Post by DawnF
Josh,
it's a good idea, but then i'm still stuck with the Excel limit of
65000 rows, so that's not great either...
Nor is top posting...
RandyF
2004-10-05 14:49:42 UTC
Post by DawnF
Hi,
Anyone tried parsing the JMDICT file and uploading them into a set of
tables in mySQL using PHP?
I did some manipulation and put JMDICT into MS Access (and tossed out
my EDICT version).

EDICT was annoying because multiple entries were generated where
there was only one JMDICT entry. For example, a JMDICT entry that has
two kebs and two rebs will generate up to 4 EDICT entries.
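
To illustrate the fan-out, here is a rough Python sketch that pairs every keb with every reb and emits EDICT-style lines. The edict_lines function and its restrictions argument are hypothetical; the restriction handling only approximates the case where a given reb applies to certain kebs.

```python
from itertools import product


def edict_lines(kebs, rebs, gloss, restrictions=None):
    """Expand one JMDICT-style entry into EDICT-style lines.

    restrictions maps a reb to the kebs it is valid for; a reb that is
    not listed is assumed to apply to every keb.
    """
    restrictions = restrictions or {}
    lines = []
    for keb, reb in product(kebs, rebs):
        allowed = restrictions.get(reb)
        if allowed is not None and keb not in allowed:
            continue
        lines.append(f"{keb} [{reb}] /{gloss}/")
    return lines


# Two kebs and two rebs with no restrictions give the full 2 x 2 = 4 lines.
full = edict_lines(["KEB1", "KEB2"], ["REB1", "REB2"], "DEF")

# If REB2 only applies to KEB2, one pairing drops out.
restricted = edict_lines(["KEB1", "KEB2"], ["REB1", "REB2"], "DEF",
                         restrictions={"REB2": ["KEB2"]})
```

This is why the count is "up to" 4: restrictions between readings and kanji forms remove some of the pairings.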

[Sorry, I never figured out how to post Japanese without getting a
string of ???]
For an entry with 2 kebs and 2 rebs, my file essentially has three
fields: DEF, KEBS, REBS, where KEBS contains KEB1 (9), KEB2 (5) and
REBS contains REB1 (9), REB2 (5).

The number is a score between 0 and 9 based on whether there was an ichi1,
jdd1, or newspaper score. I'll spare you the algorithm. The intent
was to get an idea of the relative popularity of an entry's kebs and
rebs.

The database I use in my vocab flashcard program only uses entries
that have a score. Thus there are about 33,000 entries: more than the
POP list (24k) and less than the full JMDICT (90K)
DawnF
2004-10-06 13:43:06 UTC
Hi Randy,

This could be interesting if there's a way to export MS Access data
to my MySQL format.
I've been trying to parse the XML file, but it always ends with my
system crashing. I think I'd also only use the more common entries,
and leave the rest in a different table for lookup.

So is it possible to import the entire XML file into MS Access as well?

On a different note, I've been looking for antonyms and examples in
the JMDICT file, but can't seem to find them (well, my JWPce also
crashes if I search the XML file!)

Dawn F
Post by RandyF
The number is a score between 0 and 9 based on if there was an ichi1,
jdd1, or newspaper score. I'll spare you the algorithm. The intent
was to get an idea of the relative popularity of an entry's kebs and
rebs.
The database I use in my vocab flashcard program only uses entries
that have a score. Thus there are about 33,000 entries: more than the
POP list (24k) and less than the full JMDICT (90K)
j***@hotmail.com
2004-10-06 22:04:38 UTC
Post by DawnF
On a different note, I've been looking for antonyms and examples in
the JMDICT file, but can't seem to find them (well, my JWPce also
crashes if I search the XML file!)
In the case of examples, I have abandoned any idea of putting
any in the JMdict file. Instead a separate file is used.

For antonyms, they'll be added when and if they are available. I haven't
put any time into searching for them.
--
Jim Breen http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia
ジム・ブリーン@モナシュ大学
RandyF
2004-10-08 15:01:39 UTC
Post by DawnF
This could be interesting if there's a way to export MS Access data
to my MySQL format.
I've been trying to parse the XML file, but it always ends with my
system crashing. I think I'd also only use the more common entries,
and leave the rest in a different table for lookup.
So is it possible to import the entire XML file into MS Access as well?
Hi Dawn.

I'll send you my table in Unicode, tab-delimited. It is now a flat
single table because I resolved doubled and tripled KEBs and REBs by
combining them into one field. I think I accurately preserved most of
the JMDICT data (such as when a given reb only uses a certain keb).

I think to perfectly upload the JMDICT file one would need to import
into multiple tables. There are, for example, a couple of entries that
have 10 KEBs (only 2 or 3 actually have a score, however). So to do it
right, one would need a one-to-many table to handle these.
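
A rough sketch of what the one-to-many layout could look like. It uses Python's built-in sqlite3 purely as a stand-in for MySQL or Access (an assumption for the sake of a runnable example), and the table and column names (entry, keb, reb, form, score) are hypothetical, not taken from any existing database.

```python
import sqlite3

# One row per entry, plus child tables holding any number of kebs and rebs.
SCHEMA = """
CREATE TABLE entry (
    seq        INTEGER PRIMARY KEY,
    definition TEXT NOT NULL
);
CREATE TABLE keb (
    seq   INTEGER NOT NULL REFERENCES entry(seq),
    pos   INTEGER NOT NULL,      -- order of the keb within its entry
    form  TEXT    NOT NULL,
    score INTEGER,               -- NULL when the keb has no score
    PRIMARY KEY (seq, pos)
);
CREATE TABLE reb (
    seq   INTEGER NOT NULL REFERENCES entry(seq),
    pos   INTEGER NOT NULL,
    form  TEXT    NOT NULL,
    score INTEGER,
    PRIMARY KEY (seq, pos)
);
"""


def store(conn, seq, definition, kebs, rebs):
    """Insert one entry plus its (form, score) pairs for kebs and rebs."""
    conn.execute("INSERT INTO entry VALUES (?, ?)", (seq, definition))
    conn.executemany(
        "INSERT INTO keb VALUES (?, ?, ?, ?)",
        [(seq, i, form, score) for i, (form, score) in enumerate(kebs)],
    )
    conn.executemany(
        "INSERT INTO reb VALUES (?, ?, ?, ?)",
        [(seq, i, form, score) for i, (form, score) in enumerate(rebs)],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store(conn, 1, "DEF",
      kebs=[("KEB1", 9), ("KEB2", 5), ("KEB3", None)],
      rebs=[("REB1", 9)])

# Only the kebs that actually carry a score.
scored = conn.execute(
    "SELECT form FROM keb WHERE seq = ? AND score IS NOT NULL ORDER BY pos",
    (1,),
).fetchall()
total_kebs = conn.execute("SELECT COUNT(*) FROM keb WHERE seq = 1").fetchone()[0]
```

With this layout an entry with 10 KEBs is just 10 rows in keb, and a filter on score keeps the common forms without losing the rest.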

It is just easier for my needs to have a single table file.
DawnF
2004-10-09 07:54:10 UTC
Hi Randy,

Yes, please mail it. I'll take a look at it and then we can talk about it.

thanks,
Dawn
